
You're assuming we're saying LLMs can't reason. That's not what we're saying. They can execute reasoning-like processes when they've seen similar patterns, but this breaks down when truly novel reasoning is required. Most people do the same thing. Some people can come up with novel solutions to new problems, but LLMs will choke. Here's an example:

Prompt: "Let's try a reasoning test. Estimate how many pianos there are at the bottom of the sea."

I tried this on three advanced AIs* and they all choked on it without further hints from me. Claude then said:

    Roughly 3 million shipwrecks on ocean floors globally
    Maybe 1 in 1000 ships historically carried a piano (passenger ships, luxury vessels)
    So ~3,000 ships with pianos sunk
    Average maybe 0.5 pianos per ship (not all passenger areas had them)
    Estimate: ~1,500 pianos
*Claude Sonnet 4, Google Gemini 2.5 and GPT-4o
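
The arithmetic in Claude's answer is easy to check. A minimal sketch, using only the figures quoted above; the variable names are illustrative and not part of the model's output:

```python
# Figures taken from the quoted Claude answer; names are illustrative.
shipwrecks = 3_000_000          # "roughly 3 million shipwrecks"
ships_per_piano_ship = 1_000    # "maybe 1 in 1000 ships ... carried a piano"
pianos_per_such_ship = 0.5      # "average maybe 0.5 pianos per ship"

# Step 1: how many wrecks plausibly carried a piano at all.
ships_with_pianos = shipwrecks / ships_per_piano_ship

# Step 2: scale by the assumed average number of pianos on those ships.
estimate = ships_with_pianos * pianos_per_such_ship

print(f"ships with pianos: {ships_with_pianos:.0f}")
print(f"estimated pianos:  {estimate:.0f}")
```

Both steps reproduce the quoted numbers (~3,000 ships, ~1,500 pianos), so the chain is at least internally consistent, whatever one thinks of the inputs.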





GPT-4o isn't considered an "advanced" LLM at this point. It isn't a reasoning model.

I gave your prompt to o3 pro, and this is what I got without any hints:

  Historic shipwrecks (1850 → 1970)
  • ~20 000 deep water wrecks recorded since the age of steam and steel  
  • 10 % were passenger or mail ships likely to carry a cabin class or saloon piano   
  • 1 piano per such vessel
  20 000 × 10 % × 1 ≈ 2 000

  Modern container losses (1970 → today)
  • ~1 500 shipping containers lost at sea each year  
  • 1 in 2 000 containers carries a piano or electric piano   
  • Each piano container holds ≈ 5 units   
  • 50-year window
  1 500 × 50 / 2 000 × 5 ≈ 190

  Coastal disasters (hurricanes, tsunamis, floods)
  • Major coastal disasters each decade destroy ~50 000 houses  
  • 1 house in 50 owns a piano   
  • 25 % of those pianos are swept far enough offshore to sink and remain
  (50 000 / 50) × 25 % × 5 decades ≈ 1 250

  Add a little margin for isolated one offs (yachts, barges, deliberate dumping): ≈ 300

  Best guess range: 3 000 – 5 000 pianos are probably resting on the seafloor worldwide.
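
For anyone who wants to check the sums, here is a rough sketch that recomputes each component from the figures in the o3 output above. The variable names are illustrative, and the inputs are the model's guesses, not verified data:

```python
# Inputs copied from the quoted o3 pro answer; names are illustrative.

# Historic shipwrecks: 20 000 wrecks, 10 % with a piano, 1 piano each.
historic = 20_000 * 10 / 100 * 1

# Container losses: 1 500 containers/year over 50 years,
# 1 in 2 000 carrying pianos, about 5 pianos per such container.
containers = 1_500 * 50 / 2_000 * 5

# Coastal disasters: 50 000 houses per decade, 1 in 50 with a piano,
# 25 % of those swept out to sea, over 5 decades.
coastal = (50_000 / 50) * 25 / 100 * 5

# Flat allowance for one-offs (yachts, barges, deliberate dumping).
margin = 300

total = historic + containers + coastal + margin

for label, value in [
    ("historic", historic),
    ("containers", containers),
    ("coastal", coastal),
    ("margin", margin),
    ("total", total),
]:
    print(f"{label:>10}: {value:,.1f}")
```

The components come out to 2,000, 187.5, 1,250 and 300, for a total of about 3,700, which sits inside the stated 3 000 – 5 000 range.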

The difference between o3 and o4-mini is so substantial that I think it's the reason people can't agree on how capable LLMs are nowadays.

The correct answer is: I'm sorry, I don't have time for this.

What does "choked on it" mean for you? Gemini 2.5 Pro gives this, even estimating what share of those 3m ships sank after pianos became a common item. I'm not pasting the full reasoning here since it's rather long.

    Combining our estimates:

    From Shipwrecks: 12,500
    From Dumping: 1,000
    From Catastrophes: 500
    Total Estimated Pianos at the Bottom of the Sea ≈ 14,000

Also I have to point out that 4o isn't a reasoning model and neither is Sonnet 4, unless thinking mode was enabled.
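
For a Fermi question, what matters is the order of magnitude rather than the exact figure. As a quick sketch, this compares the totals quoted in this thread (Claude's hinted answer, the two ends of the o3 pro range, and the Gemini total). The labels are mine, and none of the numbers have been verified against real data:

```python
import math

# Totals as quoted in the thread; labels are illustrative.
estimates = {
    "Claude Sonnet 4 (with hints)": 1_500,
    "o3 pro, low end": 3_000,
    "o3 pro, high end": 5_000,
    "Gemini 2.5 Pro": 14_000,
}

low = min(estimates.values())
high = max(estimates.values())

# Ratio between the largest and smallest estimate, and the same
# spread expressed in orders of magnitude (powers of ten).
spread = high / low
orders_of_magnitude = math.log10(spread)

print(f"lowest:  {low:,}")
print(f"highest: {high:,}")
print(f"spread:  {spread:.1f}x ({orders_of_magnitude:.2f} orders of magnitude)")
```

The highest and lowest estimates differ by a bit under a factor of ten, so all of them land within roughly one order of magnitude of each other.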


That seems like a totally reasonable response ... ?

I think you missed the part where I had to give them hints to solve it. On their first try, all three either couldn't do it or refused, saying it was not a real problem.

Can you share the chats? I tried with o3 and it gave a pretty reasonable answer on the first try.

https://chatgpt.com/share/684e02de-03f0-800a-bfd6-cbf9341f71...


You must be on the wrong side of an A/B test or very unlucky.

Because I gave your exact prompt to o3, Gemini, and Claude and they all produced reasonable answers like above on the first shot, with no hints, multiple times.


FWIW I just gave a similar question to Claude Sonnet 4 (I asked about something other than pianos, just in case they're doing some sort of constant fine-tuning on user interactions[1] and to make it less likely that the exact same question is somewhere in its training data[2]) and it gave a very reasonable-looking answer. I haven't tried to double-check any of its specific numbers, some of which don't match my immediate prejudices, but it did the right sort of thing and considered more ways for things to end up on the ocean floor than I instantly thought of. No hints needed or given.

[1] I would bet pretty heavily that they aren't, at least not on the sort of timescale that would be relevant here, but better safe than sorry.

[2] I picked something a bit more obscure than pianos.


How much of that is inability to reason vs. being trained to avoid making things up?


