You're assuming we're saying LLMs can't reason. That's not what we're saying. They can execute reasoning-like processes when they've seen similar patterns, but this breaks down when truly novel reasoning is required. Most people do the same thing. Some people can come up with novel solutions to new problems, but LLMs will choke. Here's an example:
Prompt: "Let's try a reasoning test. Estimate how many pianos there are at the bottom of the sea."
I tried this on three advanced AIs* and they all choked on it without further hints from me. Claude then said:
Roughly 3 million shipwrecks on ocean floors globally
Maybe 1 in 1000 ships historically carried a piano (passenger ships, luxury vessels)
So ~3,000 ships with pianos sunk
Average maybe 0.5 pianos per ship (not all passenger areas had them)
Estimate: ~1,500 pianos
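For anyone who wants to poke at the inputs, the chain above is just three multiplied guesses. A minimal Python sketch follows; the variable names are mine, and the numbers are only the ones quoted in the reply, not verified data.

```python
# Inputs are the guesses quoted in Claude's reply above, not measured data.
shipwrecks = 3_000_000        # "roughly 3 million shipwrecks"
ships_per_piano_ship = 1000   # "maybe 1 in 1000 ships" carried a piano
pianos_per_ship = 0.5         # "average maybe 0.5 pianos per ship"

ships_with_pianos = shipwrecks / ships_per_piano_ship
estimate = ships_with_pianos * pianos_per_ship

print(f"ships with pianos: {ships_with_pianos:,.0f}")
print(f"estimated pianos:  {estimate:,.0f}")
```

Changing any one input by a factor of ten moves the answer by the same factor, which is why the different models' totals can disagree so much while using the same method.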
GPT4o isn't considered an "advanced" LLM at this point. It doesn't use reasoning.
I gave your prompt to o3 pro, and this is what I got without any hints:
Historic shipwrecks (1850 → 1970)
• ~20 000 deep-water wrecks recorded since the age of steam and steel
• 10 % were passenger or mail ships likely to carry a cabin-class or saloon piano
• 1 piano per such vessel
20 000 × 10 % × 1 ≈ 2 000
Modern container losses (1970 → today)
• ~1 500 shipping containers lost at sea each year
• 1 in 2 000 containers carries a piano or electric piano
• Each piano container holds ≈ 5 units
• 50-year window
1 500 × 50 / 2 000 × 5 ≈ 190
Coastal disasters (hurricanes, tsunamis, floods)
• Major coastal disasters each decade destroy ~50 000 houses
• 1 house in 50 owns a piano
• 25 % of those pianos are swept far enough offshore to sink and remain
(50 000 / 50) × 25 % × 5 decades ≈ 1 250
Add a little margin for isolated one-offs (yachts, barges, deliberate dumping): ≈ 300
Best guess range: 3 000 – 5 000 pianos are probably resting on the seafloor worldwide.
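The breakdown above is easy to check by summing the four components. A rough Python sketch follows, using only the figures o3 pro gave; the function name and dictionary keys are made up for illustration.

```python
def o3_piano_estimate():
    """Recompute the four components from the o3 pro answer quoted above.

    Every number is a guess taken from that answer, not independent data.
    """
    components = {
        # 20 000 wrecks, 10 % passenger/mail ships, 1 piano each
        "historic_shipwrecks": 20_000 / 10 * 1,
        # 1 500 containers/year over 50 years, 1 in 2 000 holds pianos, 5 per container
        "container_losses": 1_500 * 50 / 2_000 * 5,
        # 50 000 houses/decade, 1 in 50 has a piano, 25 % sink offshore, 5 decades
        "coastal_disasters": (50_000 / 50) * 0.25 * 5,
        # flat margin for yachts, barges, deliberate dumping
        "one_offs": 300,
    }
    return components, sum(components.values())


components, total = o3_piano_estimate()
for name, value in components.items():
    print(f"{name:>20}: {value:,.1f}")
print(f"{'total':>20}: {total:,.1f}")
```

The components sum to roughly 3,700, which sits inside the 3 000 – 5 000 range it reported, so the arithmetic at least is internally consistent.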
What does "choked on it" mean for you? Gemini 2.5 Pro gives this, even estimating what share of those 3M ships sank after pianos became a common item. I'm not pasting the full reasoning here since it's rather long.
Combining our estimates:
From Shipwrecks: 12,500
From Dumping: 1,000
From Catastrophes: 500
Total Estimated Pianos at the Bottom of the Sea ≈ 14,000
Also I have to point out that 4o isn't a reasoning model and neither is Sonnet 4, unless thinking mode was enabled.
I think you missed the part where I had to give them hints to solve it. On their first try, all three either couldn't do it or refused, saying it was not a real problem.
You must be on the wrong side of an A/B test or very unlucky.
Because I gave your exact prompt to o3, Gemini, and Claude and they all produced reasonable answers like above on the first shot, with no hints, multiple times.
FWIW I just gave a similar question to Claude Sonnet 4 (I asked about something other than pianos, just in case they're doing some sort of constant fine-tuning on user interactions[1] and to make it less likely that the exact same question is somewhere in its training data[2]) and it gave a very reasonable-looking answer. I haven't tried to double-check any of its specific numbers, some of which don't match my immediate prejudices, but it did the right sort of thing and considered more ways for things to end up on the ocean floor than I instantly thought of. No hints needed or given.
[1] I would bet pretty heavily that they aren't, at least not on the sort of timescale that would be relevant here, but better safe than sorry.
[2] I picked something a bit more obscure than pianos.
*Claude Sonnet 4, Google Gemini 2.5 and GPT 4o