"crash the app" sounds like the app's problem (ie. not handling exceptions properly) as opposed to the design of the API. It doesn't seem that unreasonable to throw an exception if unexpected conditions are hit? Also, more likely than not, there is probably an explicit reason that an exception is thrown here instead of something else.
I think you missed the part where I had to give them hints to solve it. All three initially either couldn't solve it or refused, saying it was not a real problem, on their first try.
You must be on the wrong side of an A/B test or very unlucky.
Because I gave your exact prompt to o3, Gemini, and Claude and they all produced reasonable answers like the one above on the first shot, with no hints, multiple times.
FWIW I just gave a similar question to Claude Sonnet 4 (I asked about something other than pianos, just in case they're doing some sort of constant fine-tuning on user interactions[1] and to make it less likely that the exact same question is somewhere in its training data[2]) and it gave a very reasonable-looking answer. I haven't tried to double-check any of its specific numbers, some of which don't match my immediate prejudices, but it did the right sort of thing and considered more ways for things to end up on the ocean floor than I instantly thought of. No hints needed or given.
[1] I would bet pretty heavily that they aren't, at least not on the sort of timescale that would be relevant here, but better safe than sorry.
[2] I picked something a bit more obscure than pianos.
I’m so tired of hearing this be repeated, like the whole “LLMs are _just_ parrots” thing.
It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data. You can test this out in so many ways, and there are so many examples out there.
______________
Edit for responders, instead of replying to each:
We obviously have to define what we mean by "reasoning" and "solving novel problems". From my point of view, reasoning != general intelligence. I also consider reasoning to be a spectrum. Just because it cannot solve the hardest problem you can think of does not mean it cannot reason at all. Do note, I think LLMs are generally pretty bad at reasoning. But I disagree with the point that LLMs cannot reason at all or never solve any novel problems.
In terms of some backing points/examples:
1) Next token prediction can itself be argued to be a task that requires reasoning
2) You can construct a variety of language translation tasks, with completely made-up languages, that LLMs can complete successfully. There's plenty of research on in-context learning and zero-shot performance. (A tiny sketch of such a test follows these points.)
1) Even though they start to fail at some complexity threshold, it's incredibly impressive that LLMs can solve any of these difficult puzzles at all! GPT3.5 couldn't do that. We're making incremental progress in terms of reasoning. Bigger, smarter models get better at zero-shot tasks, and I think that correlates with reasoning.
2) Regarding point 4 ("Bigger models might do better"): I think this is very dismissive. The paper itself shows a huge variance in the performance of different models. For example, in figure 8, we see Claude 3.7 significantly outperforming DeepSeek and maintaining stable solutions for a much longer sequence length. Figure 5 also shows that better models and more tokens improve performance on "medium" difficulty problems. Just because it cannot solve the "hard" problems does not mean it cannot reason at all, nor does it necessarily mean it will never get there. Many people were saying a few years ago that we'd never be able to solve problems like the medium ones, but now the goalposts have just shifted.
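To make point 2 concrete, here is a minimal sketch of the kind of throwaway test anyone can run. The "Zelvish" vocabulary and expected answer are invented on the spot for illustration, so the mapping itself cannot have been memorized from training data; the script only builds the prompt, and you paste it into whichever model you want to test.

```python
# Sketch of an ad-hoc "made-up language" translation test.
# The vocabulary is invented, so the word mapping cannot be memorized;
# the model has to infer the alignments in-context.

examples = [
    ("blarg veen kotu", "the red house"),
    ("blarg veen miro", "the red tree"),
    ("blarg soka kotu", "the tall house"),
]
query = "blarg soka miro"  # expected: "the tall tree"

prompt = "Translate from Zelvish to English.\n\n"
for zelvish, english in examples:
    prompt += f"Zelvish: {zelvish}\nEnglish: {english}\n\n"
prompt += f"Zelvish: {query}\nEnglish:"

# Paste `prompt` into whatever model you're testing; answering
# "the tall tree" requires inferring word alignments never seen before.
print(prompt)
```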
> It’s patently obvious that LLMs can reason and solve novel problems not in their training data.
Would you care to tell us more ?
« It’s patently obvious » is not really an argument; I could just as well say that everyone knows LLMs can’t reason or think (in the way we living beings do).
I'm working on a new API. I asked the LLM to read the spec and write tests for it. It does. I don't know if that's "reasoning". I know that no tests exist for this API. I know that the internet is not full of training data for this API because it's a new API. It's also not a CRUD API or some other API that's got a common pattern. And yet, with a very short prompt, Gemini Code Assist wrote valid tests for a new feature.
It certainly feels like more than fancy auto-complete. That is not to say I haven't run into issues, but I'm still often shocked at how far it gets. And that's today. I have no idea what to expect in 6 months, 12, 2 years, 4, etc.
> I know that the internet is not full of training data for this API because it's a new API.
1) Are you sure? That's a bold guess. It was also a really stupid assumption made by the HumanEval benchmark authors: that if you "hand write" simple leetcode-style questions, then you can train on all of GitHub. Go ahead, go look at what kinds of questions are in that benchmark...
2) LLMs aren't discrete databases. They are curve-fitting functions. Compression. They work in very, very high dimensions. They can generate new data, but that is limited. People mostly aren't saying that LLMs can't create novel things, but that they can't reason in the way that humans can. Humans can't memorize half of what an LLM can, yet are able to figure out lots of crazy shit.
I just made up this scenario and these words, so I'm sure it wasn't in the training data.
Kwomps can zark but they can't plimf. Ghirns are a lot like Kwomps, but better zarkers. Plyzers have the skills the Ghirns lack.
Quoning, a type of plimfing, was developed in 3985. Zhuning was developed 100 years earlier.
I have an erork that needs to be plimfed. Choose one group and one method to do it.
> Use Plyzers and do a Quoning procedure on your erork.
If that doesn't count as reasoning or generalization, I don't know what does.
It’s just a truth table. I had a hunch that it was a truth table, and when I asked the AI how it figured it out, it confirmed it had built a truth table. Still impressive either way (a minimal sketch of that elimination follows its breakdown below).
* Goal: Pick (Group ∧ Method) such that Group can plimf ∧ Method is a type of plimfing
* Only one group (Plyzers) passes the "can plimf" test
* Only one method (Quoning) is definitely plimfing
Therefore, the only valid (Group ∧ Method) combo is: → (Plyzer ∧ Quoning)
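For what it's worth, the elimination it describes is small enough to write out explicitly. Here's a minimal sketch that encodes only the facts stated (or directly implied) in the puzzle:

```python
# Minimal sketch of the truth-table-style elimination described above,
# encoding only the facts stated or directly implied in the puzzle.

can_plimf = {
    "Kwomps": False,   # "Kwomps can zark but they can't plimf"
    "Ghirns": False,   # "a lot like Kwomps, but better zarkers" -> still no plimfing
    "Plyzers": True,   # "have the skills the Ghirns lack" -> plimfing
}
is_plimfing = {
    "Quoning": True,   # "Quoning, a type of plimfing"
    "Zhuning": False,  # only said to be developed 100 years earlier
}

valid = [(g, m) for g, ok in can_plimf.items() if ok
                for m, plimfs in is_plimfing.items() if plimfs]
print(valid)  # [('Plyzers', 'Quoning')]
```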
If anything, you'd think that the neurosymbolic people would be pleased that LLMs do in fact reason by learning circuits representing boolean logic and truth tables. In a way they were right; it's just that starting with logic and then feeding in knowledge grounded in that logic (like Cyc) seems less scalable than feeding in knowledge and letting the model infer the underlying logic.
Right, that’s my point. LLMs are doing pattern abstraction and in this way can mimic logic. They are not trained explicitly to do just truth tables, even though truth tables are fundamental.
So far they cannot even answer questions that are straight-up fact-checking, search-engine-like queries. Reasoning means they would be able to work through a problem and generate a proof the way a student might.
You're mistaking pattern matching and the modeling of relationships in latent space for genuine reasoning.
I don't know what you're working on, but while I'm not curing cancer, I am solving problems that aren't in the training data and can't be found on Google. Just a few days ago, Gemini 2.5 Pro literally told me it didn’t know what to do and asked me for help. The other models hallucinated incorrect answers. I solved the problem in 15 minutes.
If you're working on yet another CRUD app, and you've never implemented transformers yourself or understood how they work internally, then I understand why LLMs might seem like magic to you.
It's definitely not true in any meaningful sense. There are plenty of us practitioners in software engineering wishing it was true, because if it was, we'd all have genius interns working for us on Mac Studios at home.
It's not true. It's plainly not true. Go have any of these models, paid or local, try to build you novel solutions to hard, existing problems, despite their being, in some cases, trained on literally the entire compendium of open knowledge in not just one but multiple adjacent fields. Not to mention the fact that being able to abstract general knowledge would mean being able to reason.
They. Cannot. Do it.
I have no idea what you people are talking about, because you cannot be working on anything with real substance that hasn't been perfectly line-fit to your abundantly worked-on problems, but no, these models are obviously not reasoning.
I built a digital employee and gave it menial tasks comparable to those of current cloud solutions that also claim to provide paid AI employees, and these things are stupider than fresh college grads.
>> 1) Next token prediction can itself be argued to be a task that requires reasoning
That is wishful thinking popularised by Ilya Sutskever and Greg Brockman of OpenAI to "explain" why LLMs are a different class of system than smaller language models or other predictive models.
I'm sorry to say that (John Mearsheimer voice) this is simply not a serious argument. Take a multivariate regression model that predicts blood pressure from demographic data (age, sex, weight, etc.). You can train a pretty accurate model for that kind of task if you have enough data (a few thousand data points). Does that model need to "reason" about human behaviour in order to be good at predicting BP? Nope. All it needs is a lot of data. That's how statistics works. So why is it any different for a next-token prediction model than for a predictive model of BP? The only answer seems to be "because language is magickal and special", but without any attempt to explain why, in terms of sequence prediction, language is special. Unless the, er, reasoning is that humans can produce language, humans can reason, LLMs can produce language, therefore LLMs can reason; which obviously doesn't follow.
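To make the analogy concrete, here's a minimal sketch on entirely synthetic data (the coefficients, noise level, and feature encoding are invented for illustration): an ordinary least-squares fit "predicts" the synthetic BP well, and nothing in it resembles reasoning about people.

```python
# Minimal sketch of the BP-from-demographics analogy: a plain linear model
# fit to synthetic data predicts well without anything resembling "reasoning".
# The data-generating coefficients below are invented for illustration only.
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(20, 80, n)
weight = rng.normal(80, 15, n)
sex = rng.integers(0, 2, n)            # 0/1 encoding

# Synthetic "ground truth": BP rises with age and weight, plus noise.
bp = 90 + 0.4 * age + 0.2 * weight + 3 * sex + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), age, weight, sex])
coef, *_ = lstsq(X, bp, rcond=None)    # ordinary least squares
pred = X @ coef

print("mean absolute error:", np.mean(np.abs(pred - bp)))  # roughly the noise level
# Accurate prediction here is just statistics; the model "understands" nothing.
```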
But I have to guess here, because neither Sutskever nor Brockman has ever tried to explain why next token prediction needs reasoning (or, more precisely, "understanding", the term they have used).
> That is wishful thinking popularised by Ilya Sutskever
Ilya and Hinton have claimed even crazier things
| to understand next token prediction you must understand the causal reality
This is objectively false; it's been known in physics for centuries that it's wrong. You can probably reason out the weaker case yourself: I'm sure you can make accurate predictions about some things without fully understanding them.
But the stronger version is the entire difficulty of physics and causal modeling. Distinguishing a confounding variable is very, very hard, but you can still make accurate predictions without access to the underlying causal graph.
Hinton and Sutskever are victims of their own success: they can say whatever they like and nobody dares criticise them, or tell them how they're wrong.
I recently watched a video of Sutskever speaking to some students; I'm not sure where, and I can't dig out the link now. To summarise, he told them that the human brain is a biological computer. He repeated this a couple of times, then said that this is why we can create a digital computer that can do everything a brain can.
This is the computational theory of mind, reduced to a pin-point with all context removed. Two seconds of thought suffice to show how that doesn't work: if a digital computer can do everything the brain can do, because the brain is a biological computer, then how come the brain can't do everything a digital computer can do? Is it possible that two machines can be both computers, and still not equivalent in every sense of the term? Nooooo!!! Biological computers!! AGI!!
Those guys really need to stop and think about what they're talking about before someone notices what they're saying and the entire field becomes a laughing stock.
> Two seconds of thought suffice to show how that doesn't work: if a digital computer can do everything the brain can do, because the brain is a biological computer, then how come the brain can't do everything a digital computer can do? Is it possible that two machines can be both computers, and still not equivalent in every sense of the term? Nooooo!!! Biological computers!! AGI!!
Another two seconds of thought would suffice to answer that: because you can freely change neither the hardware nor the software of the brain, like you can with computers.
Obviously, Angry Birds on the phone can't do everything digital computers can do, but that doesn't mean a smartphone isn't a digital computer.
Another 2 seconds of thought might have told you only a magic genie can "freely" change hardware and software capability.
Humans have to work within whatever constraints accompany being physical things with physical bodies trying to invent software and hardware in the physical world.
I'm fine with calling the brain a computer. A computer is a very vague term. But yes, I agree that the conclusion does not necessarily follow. It's possible, just not necessarily so.
> Take a multivariate regression model that predicts blood pressure from demographic data (age, sex, weight, etc.). You can train a pretty accurate model for that kind of task if you have enough data (a few thousand data points). Does that model need to "reason" about human behaviour in order to be good at predicting BP? Nope. All it needs is a lot of data. That's how statistics works. So why is it any different for a next-token prediction model than for a predictive model of BP?
For one, because the goal function for the latter is "predict output that makes sense to humans", in the fully broad, fully general sense of that statement.
It's not just one thing, like parse grocery lists, XOR write simple code, XOR write a story, XOR infer sentiment, XOR be a lossy cache for Wikipedia. It's all of them, separate or together, plus much more, plus correctly handling humor, sarcasm, surface-level errors (e.g. typos, naming), implied rules, shorthands, deep errors (think of a user being confused and using terminology wrong; LLMs can handle that fine), and an uncountable number of other things (because language is special, see below). It's quite obvious this is a different class of thing than a narrowly specialized model like a BP predictor.
And yes, language is special. Despite Chomsky's protestations to the contrary, it's not really formally structured; all the grammar and syntax and vocabulary are merely classifications of high-level patterns that tend to occur (though the invention of print and public education definitely strengthened them). Any experience with learning a language, or actually talking to other people, makes it obvious that grammar and vocabulary are neither necessary nor sufficient for communication. At the same time, though, once established, the particular choices become another dimension that packs meaning (as becomes apparent when, e.g., pondering why some books or articles seem better than others).
Ultimately, language is not a set of easy patterns you can learn (or code symbolically!) - it's a dance people do when communicating, whose structure is fluid and bound by the reasoning capabilities of humans. Being able to reason this way is required to communicate with real humans in real, generic scenarios. Now, this isn't proof that LLMs can do it, but the degree to which they excel at this is at least a strong suggestion that they qualitatively could.
I've done this exercise dozens of times because people keep saying it, but I can't find an example where this is true. I wish it were. I'd be solving world problems with novel solutions right now.
People make a common mistake by conflating "solving problems with novel surface features" with "reasoning outside training data." This is exactly the kind of binary thinking I mentioned earlier.
"Solving novel problems" does not mean "solving world problems that even humans are unable to solve", it simply means solving problems that are "novel" compared to what's in the training data.
Can you reason? Yes? Then why haven't you cured cancer? Let's not have double standards.
I think that "solving world problems with novel solutions" is a strawman for an ability to reason well. We cannot solve world problems with reasoning, because pure reasoning has no relation to reality. We lack data and models about the world to confirm and deny our hypotheses about the world. That is why the empirical sciences do experiments instead of sit in an armchair and mull all day.
They can't create anything novel and it's patently obvious if you understand how they're implemented.
But I'm just some anonymous guy on HN, so maybe this time I will just cite the opinion of the DeepMind CEO, who said in a recent interview with The Verge (available on YouTube) that LLMs based on transformers can't create anything truly novel.
Since when is reasoning synonymous with invention? All humans with a functioning brain can reason, but only a tiny fraction have or will ever invent anything.
Read what the OP said: "It’s patently obvious to me that LLMs can ... solve novel problems" - this is what I was replying to. I see everyone here is smarter than the researchers at DeepMind, without any proof or credentials to back their claims.
"I don't think today's systems can invent, you know, do true invention, true creativity, hypothesize new scientific theories. They're extremely useful, they're impressive, but they have holes."
Demis Hassabis On The Future of Work in the Age of AI (@ 2:30 mark)
He doesn't say "that LLMs based on transformers can't create anything truly novel". Maybe he thinks that, maybe not, but what he says is that "today's systems" can't do that. He doesn't make any general statement about what transformer-based LLMs can or can't do; he's saying: we've interacted with these specific systems we have right now and they aren't creating genuinely novel things. That's a very different claim, with very different implications.
Again, for all I know maybe he does believe that transformer-based LLMs as such can't be truly creative. Maybe it's true, whether he believes it or not. But that interview doesn't say it.
"... so there is likely to be equivalent information processing complexity to be found in them"
This sounds like a really wild take. Just because something has been evolving for millions of years doesn't necessarily mean it's evolving information processing capabilities. It's patently obvious to me that the information processing capabilities of animals (e.g. just vision alone) are far beyond those of plants.
AFAIK there are a handful of companies that gobble up the whole yearly allowance of H1B visas between them. The usual suspects are BigTech and large consulting groups. The latter act as intermediaries: they sell worker hours at a higher rate, skimming the difference between their prices and the employee's salary. If they were somehow barred from the H1B program, the H1B visa holders would presumably find better-paying jobs elsewhere.
H1B rules around changing jobs mean that even if the employee joins at a market-level salary when they come to the US, they tend to stay at the same company much longer and can be exploited. The new company has to go through a lengthy paperwork process to allow the visa holder to switch jobs. Also, since the tech world tends to use things like stock options / RSUs / monetary bonuses for large parts of the compensation package, and those do not count towards "salary", you may have a situation where an H1B holder on paper seems to be paid fairly but in practice gets only about 40-50% of what their peers get.
If they were allowed to change jobs freely, they would be able to negotiate their compensation fairly. Companies would be less incentivized to hire H1Bs to save money and would also consider local talent for the same positions. Everybody would win: H1B visa holders and their families, and American workers too. The only losers would be consulting firms (not a huge loss, to be honest; most of their employees are overseas anyway, so they can absorb the cost) and BigTech (they have enough money anyway).
There are other problems for H1B holders, like the fact that getting a green card is something their employer, not them, has to initiate - another area for abuse. And then some nationalities have to wait much longer to go through this process than others (essentially, the US immigration service says that the country has too many people from India and Pakistan already, thank you very much), and there are other issues I don't recall.
I'd say most foreign devs in the US are actually on L-1 visas, which are worse: an L-1 prohibits the dev from changing jobs unless they get a new visa.
It's a subsidy for big corporations so they can get cheap talent whilst removing incentives for domestic workers to learn the trade or upskill. You also get more people competing for resources, which means higher prices. Quality of life goes down whilst corporations get richer.
I'm going to preface this by saying that I support broad liberalization of border controls; immigrants are the backbone of the USA, the engine on which we run, and we should encourage immigration and make it easy for immigrants to settle here. We have the space and resources; anyone who tells you otherwise is lying to you for political gain.
So, that said: H-1B shouldn't exist for software. The point of it is to fill jobs that cannot be filled by an American for some reason; a condition that doesn't exist in software development. Hire immigrants as software engineers, fine. But find a way to do it that isn't bullshit.
It would be really interesting to see how this played out. The entire way you build circuits changes. E.g., current adder designs use extra transistors to save carry propagation latency, but for optical, that might make the latency worse...
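As a toy illustration of that tradeoff (a bit-level sketch, not a circuit design; gate counts and delays are only gestured at in the comments):

```python
def ripple_carry_add(a_bits, b_bits):
    """Add two LSB-first bit lists; the carry ripples through one bit at a
    time, so worst-case latency grows linearly with the word width."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))   # next carry waits on the previous one
    return out + [carry]

# A carry-lookahead adder spends extra gates on per-bit generate (a & b) and
# propagate (a | b) terms so that c_{i+1} = g_i | (p_i & c_i) can be unrolled
# into flat expressions evaluated in parallel: more transistors, shorter
# critical path. Whether that trade still pays off when the "gates" are
# optical is exactly the open question above.

print(ripple_carry_add([1, 1, 1, 1], [1, 0, 0, 0]))  # 15 + 1 -> [0, 0, 0, 0, 1] (16)
```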
My recommendation would be to use them as a tool to build applications. There's much more potential there, and it will be easier to get started as an engineer.
If you want to switch fields and work on LLM internals/fundamentals in a meaningful way, you'd probably want to become a research scientist at one of the big companies. This is pretty tough because that's almost always gated by a PhD requirement.
I’m going out on a limb to guess that his commendable efforts at writing (and keeping a digital copy of) a hilarious satirical mathematical takedown were aimed squarely at going viral after the fact, exactly as has been done, rather than reaching the eyeballs of a petty corporate bureaucrat with a drawer full of coupons.
That being said, hopefully he brightened someone’s day over there, because surely the Complaints Department for a shitty breakfast brand consists of a thankless saga of reading illiterate screeds full of abuse and trivial outrage. Who needs Facebook?
Is "Hoping to restore your faith in us" really part of their form letter? It fits right in with the melodrama of the complaint, but I guess it's plausible that's what they always say.
The clothes last longer, some fabrics don't like machine washing, and using a bag for more than a couple of things in the washer and dryer makes it even more of a chore.
I've hand-washed before, and I still "hand wash" pillows in a bathtub. If I had a lot of vintage shirts or something that I really liked I'd probably hand wash them in the kitchen sink.
There is definitely some "more research is needed" here. Some fabrics (like silk) are clearly marked "hand wash", but I wonder how much of that is backed by actual research.
Washing has several components, like detergent, temperature, agitation and time, and you have to balance them for good cleaning and minimal damage. Hand washing is usually weak on all counts, so maybe just using less aggressive wash cycles in a machine could work as well. In marketing, they tend to promote more aggressive cleaning just to be sure your clothes come out clean (it would be bad publicity if they didn't) at the cost of increased wear.
In addition, there's the question of how much cleaning affects how long your clothes last. I remember seeing an article saying that what damages clothes the most is wearing them, followed by washing, followed by drying. If true, there is not much you can do besides the most obvious and unsatisfying solution of not wearing the clothes you want to preserve.
Without better knowledge, I try to stick with the instructions on the label.