Hacker News

Ok. My go to puzzle is this:

You have 2 minutes to cool down a cup of coffee to the lowest temperature you can.

You have two options:

1. Add cold milk immediately, then let it sit for 2 mins.

2. Let it sit for 2 mins, then add the cold milk.

Which one cools the coffee to the lowest temperature and why?

And Mercury gets this right, while as of right now ChatGPT 4o gets it wrong.

So that’s pretty impressive.




Depends on the shape of the cup! You can contrive a cup shaped like an exponentially flaring horn, where adding the milk increases the volume a little, which massively increases the surface area, and so leads to faster cooling. Or you can have a cup with a converging top, like a brandy glass, where adding the milk reduces the surface area, and makes cooling even slower.


And add the twist: what if the cup is made from a highly conductive material?


Clever, but it's implicitly a normal cylindrical cup.


Claude 3.7 gets it exactly right:

To determine which option cools coffee the most, I'll analyze the heat transfer physics involved. The key insight is that the rate of heat loss depends on the temperature difference between the coffee and the surrounding air. When the coffee is hotter, it loses heat faster. Option 1 (add milk first, then wait):

- Adding cold milk immediately lowers the coffee temperature right away

- The coffee then cools more slowly during the 2-minute wait because the temperature difference with the environment is smaller

Option 2 (wait first, then add milk):

- The hot coffee cools rapidly during the 2-minute wait due to the large temperature difference

- Then the cold milk is added, creating an additional temperature drop at the end

Option 2 will result in the lowest final temperature. This is because the hotter coffee in option 2 loses heat more efficiently during the waiting period (following Newton's Law of Cooling), and then gets the same cooling benefit from the milk addition at the end. The mathematical principle behind this is that the rate of cooling is proportional to the temperature difference, so keeping the coffee hotter during the waiting period maximizes heat loss to the environment.
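For reference, the proportionality described above is usually written as the following differential equation and its solution. This is the standard lumped form of Newton's law of cooling; the symbols T_env (room temperature) and k (a cooling constant set by the cup and its surroundings) are labels chosen here, not anything taken from the quoted reply.

```latex
\frac{dT}{dt} = -k\,\bigl(T - T_{\mathrm{env}}\bigr)
\quad\Longrightarrow\quad
T(t) = T_{\mathrm{env}} + \bigl(T(0) - T_{\mathrm{env}}\bigr)\,e^{-kt}
```

The gap to room temperature shrinks by the same fraction per unit time, so a hotter start sheds more degrees over the same two minutes.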


That's totally cribbed from some discussion that occurred in its training data.


As opposed to humans, who all derive the physics of heat transfer independently when given a question like this?

Not picking on you - this brings up something we could all get better at:

There should be a "First Rule of Critiquing Models": Define a baseline system to compare performance against. When in doubt, or for general critiques of models, compare to real world random human performance.

Without a real practical baseline to compare with, it's too easy to fall into subjective or unrealistic judgements.

"Second Rule": Avoid selectively biasing judgements by down selecting performance dimensions. For instance, don't ignore difference in response times, grammatical coherence, clarity of communication, and other qualitative and quantitative differences. Lack of comprehensive performance dimension coverage is like comparing runtimes of runners, without taking into account differences in terrain, length of race, altitude, temperature, etc.

It is very easy to critique. It is harder to critique in a way that sheds light.


> As apposed to humans who all derive the physics of heat transfer independently when given a question like this?

Isn't that the difference between learning and memorizing, though? If you were taught Newton's Law of Cooling using this example and truly learned it, you could apply it to other problems as well. But if you only memorized it, you might be able to recite it when asked the same question, yet still be unable to apply it to anything else.


> It is very easy to critique. It is harder to critique in a way that sheds light.

Well said. This is the sort of ethos I admire and aspire to on HN.


So is my knowledge of Newton's law of cooling.


If an LLM has only that knowledge and nothing else (pieces of text saying that heat transfer is proportional to some function of the temperature difference), such that it is not trained on any texts that give problems and solutions in this area, it will not work this out, since it has nothing to generate tokens from.

Also, your knowledge doesn't come from anywhere near having scanned terabytes of text, which would take you multiple lifetimes of full time work.


We get way more info than LLMs do, just not solely from text.


You have not read every accessible piece of text in existence.


There is more to life than just text; e.g., this is part of LeCun's argument against LLMs.


LeCun's argument is based on a bad interpretation of how data is processed by the optic nerve; we don't receive that much raw data.

What we do have is billions of years of evolution that has given us a lot of innate knowledge, which means we are radically more capable than LLMs despite having little data.


There is more to text than just predicting tokens based on a vast volume of text.

There isn't an argument "against LLMs" as such; the argumentation is more oriented against the hype and incessant promotion of AI.


This exact problem was in Martin Gardner's column for Scientific American in the 1970s. There are surely references all over the internet.


If it was just ‘in the training data’ they’d all get it right.

But they don’t.


I don't think that can be postulated as a law, because they are a kind of lossy compression. Different lossy compressions will lose different details.


Hmm, a good nerd-snipe puzzle. I was never very good at physics, so hopefully someone can check my work... Assuming that upon mixing the coffee is at Tc and the milk at Tm, and simplifying to assume equal mass & specific heat, we have (Tf - Tc) = -(Tf - Tm) => Tf = (Tc+Tm)/2, which is intuitive (upon mixing we get the average temperature).

On the assumption that the cold milk is always at a fixed temperature until it's mixed in, the temperature of the coffee at the point of mixing is the main factor. Before and after, it follows Newton's law of cooling. So we're comparing something like Tenv + [(Tc+Tm)/2 - Tenv]e^(-2) (milk first) vs (Tenv + [Tc - Tenv]e^(-2) + Tm)/2 (milk last). The latter is greater than the former only when Tm > Tenv (the milk isn't cold), or in other words it's better to let the coffee cool as much as possible before mixing, assuming the milk is colder than the environment.
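Writing the comparison out term by term makes the condition easier to see. This keeps the same assumptions as above (equal mass and specific heat, milk held at Tm until mixing, the same cooling factor e^(-2) for plain coffee and for the mixture); the labels T_1 and T_2 are introduced here for the two options.

```latex
\begin{aligned}
T_1 &= T_{env} + \left[\tfrac{T_c + T_m}{2} - T_{env}\right] e^{-2}
  && \text{(milk first)} \\
T_2 &= \tfrac{1}{2}\Bigl( T_{env} + \left[T_c - T_{env}\right] e^{-2} + T_m \Bigr)
  && \text{(milk last)} \\
T_2 - T_1 &= \tfrac{1}{2}\,\bigl(T_m - T_{env}\bigr)\,\bigl(1 - e^{-2}\bigr)
\end{aligned}
```

Since 1 - e^(-2) is positive, the sign of the difference is the sign of Tm - Tenv: adding the milk last ends up cooler exactly when the milk is colder than the room.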

Another interesting twist is to consider the case where the milk isn't kept at a fixed temperature but is also subject to warming (it's taken out of the fridge). Then the former equation is unchanged but the latter becomes (Tenv + [Tc - Tenv]e^(-2) + Tenv + [Tm - Tenv]e^(-2))/2. But this is equivalent to the former equation, so in this case it doesn't matter when you mix it.
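A quick numeric check of both cases, as a sketch under the same simplifications (equal mass and specific heat, one shared cooling factor with k*t = 2). The function names and the sample temperatures (90 °C coffee, 5 °C milk, 20 °C room) are made up for illustration.

```python
import math


def newton(t0, t_env, kt):
    """Temperature after Newton cooling from t0 toward t_env, with k*t = kt."""
    return t_env + (t0 - t_env) * math.exp(-kt)


def milk_first(tc, tm, t_env, kt=2.0):
    """Mix immediately (50/50 average), then let the mixture cool."""
    return newton((tc + tm) / 2, t_env, kt)


def milk_last_fixed(tc, tm, t_env, kt=2.0):
    """Coffee cools alone; milk stays at tm (e.g. in the fridge) until mixing."""
    return (newton(tc, t_env, kt) + tm) / 2


def milk_last_warming(tc, tm, t_env, kt=2.0):
    """Coffee cools alone; milk sits out and drifts toward t_env at the same rate."""
    return (newton(tc, t_env, kt) + newton(tm, t_env, kt)) / 2


first = milk_first(90.0, 5.0, 20.0)
last_fixed = milk_last_fixed(90.0, 5.0, 20.0)
last_warming = milk_last_warming(90.0, 5.0, 20.0)
print(f"milk first: {first:.2f}, last (fridge): {last_fixed:.2f}, "
      f"last (counter): {last_warming:.2f}")
```

With these sample numbers, milk first lands at roughly 23.7 °C and milk last from the fridge at roughly 17.2 °C, while milk last from the counter matches milk first.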

Not 100% confident in either analysis, but I wonder if there's a more intuitive way to see it. I also don't know if deviating from the assumption of equal mass & specific heat changes the analysis (it might lead to a small range where, for the fixed case, situation 1 is better?). It's definitely not "intuitive" to me.
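On the question of unequal amounts, here is a sketch that replaces the 50/50 average with a weighted one, where w is the coffee's share of the total heat capacity (mass times specific heat). It still assumes the milk stays at Tm until mixing and that the same cooling factor applies to plain coffee and to the mixture, which is a simplification; `final_temps` and `w` are names invented here.

```python
import math


def final_temps(tc, tm, t_env, w, kt=2.0):
    """Return (milk_first, milk_last) for a coffee heat-capacity share w in (0, 1)."""
    decay = math.exp(-kt)
    # Milk first: weighted-average mixture, then Newton cooling.
    milk_first = t_env + (w * tc + (1 - w) * tm - t_env) * decay
    # Milk last: coffee cools alone, then the weighted average with fixed-temp milk.
    milk_last = w * (t_env + (tc - t_env) * decay) + (1 - w) * tm
    return milk_first, milk_last


for w in (0.5, 0.8, 0.95):
    first, last = final_temps(90.0, 5.0, 20.0, w)
    print(f"w={w}: milk first {first:.2f}, milk last {last:.2f}")
```

Under these assumptions the gap works out to (1 - w)(Tm - Tenv)(1 - e^(-kt)), so changing the proportions scales the gap but does not flip its sign.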


There is a more intuitive way: Adding the milk later has a chance to get the cup below room temperature, whereas adding it early never will.


That example is probably in the training data?

The puzzle assumes that the room temperature is greater than the cold milk's temperature. When I added that the room temperature is, say, -10 °C, Mercury fails to see the difference.


Under any reasonable assumptions for the size and shape of the cup, the amount of coffee, the makeup of the air, etc., the room being -10 °C won't change the result.

It would only matter if the air were able to cool the coffee to a temperature less than that of the milk in under 2 minutes.
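A rough way to test this is a lumped model where the cup loses heat through a fixed surface, so adding milk increases the thermal mass and lowers the cooling constant. Everything numeric below is a hypothetical placeholder (hA = 1 W/K, 200 g of coffee and 50 g of milk treated as water), and the milk is assumed to stay at 4 °C until it is poured.

```python
import math


def newton(t0, t_env, k, t):
    """Temperature after t seconds of Newton cooling with rate constant k (1/s)."""
    return t_env + (t0 - t_env) * math.exp(-k * t)


def compare(t_coffee, t_milk, t_env, t, hA=1.0, c_coffee=840.0, c_milk=210.0):
    """Return (milk_first, milk_last) final temperatures in °C.

    hA is the heat-loss coefficient in W/K; c_* are heat capacities in J/K.
    All default values are illustrative assumptions, not measurements.
    """
    c_total = c_coffee + c_milk
    # Milk first: mix, then the heavier mixture cools with k = hA / c_total.
    t_mix = (c_coffee * t_coffee + c_milk * t_milk) / c_total
    milk_first = newton(t_mix, t_env, hA / c_total, t)
    # Milk last: coffee alone cools with k = hA / c_coffee, then mix.
    cooled = newton(t_coffee, t_env, hA / c_coffee, t)
    milk_last = (c_coffee * cooled + c_milk * t_milk) / c_total
    return milk_first, milk_last


two_min = compare(90.0, 4.0, -10.0, t=120.0)
long_wait = compare(90.0, 4.0, -10.0, t=100_000.0)
print("2 minutes:", two_min)
print("very long wait:", long_wait)
```

With these placeholders, adding the milk last still wins at 2 minutes in a -10 °C room; the ordering flips only for a much longer wait, once the coffee has had time to fall well below the milk's temperature.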


Token-based Gemini, after 4 seconds of thinking:

Okay, let's break this down using the principle of heat transfer (specifically Newton's Law of Cooling):

    Heat loss is faster when the temperature difference is greater. A very hot object loses heat to cool surroundings much faster than a warm object.

    Option 1: Add milk immediately.

        You instantly lower the coffee's temperature by mixing it with cold milk.

        Now, this warmer (not hot) mixture sits for 2 minutes. Because the temperature difference between the mixture and the room is smaller, it cools down more slowly over those 2 minutes.

    Option 2: Let it sit for 2 mins, then add milk.

        The very hot coffee sits for 2 minutes. Because the temperature difference between the hot coffee and the room is large, it loses heat more quickly during these 2 minutes.

        After 2 minutes of rapid cooling, you then add the cold milk, lowering the temperature further.
Conclusion:

To get the coffee to the lowest temperature, you should choose Option 2: Let it sit for 2 mins, then add the cold milk.


Is there a trick question in here, where the cold milk is trending closer to room temperature over the two minutes waited?


The rate of change of temperature is proportional to the difference in temperature. Adding the cold milk later lets you juice a larger temperature differential out of the 2-minute waiting phase. So you should add the milk later.


> the cold milk later

I think what the other person is asking about is: can you be sure that the milk is (as) cold later?

There's a lot of discussion about what happens to the temperature of the coffee during the 2 minutes. What happens to the temperature of the milk during that same time?

Where is the milk stored? Do you grab it out of the refrigerator the moment you add it to the coffee? Or is the cold milk sitting out on the countertop getting warmer? If so, how rapidly?


It's a safe bet that freshly brewed coffee is much farther from room temperature than refrigerated milk is. However, deriving properties related to that symmetry (or lack thereof) would make an excellent question for an exam in an introductory class.


The two options are equivalent, since the final (equilibrium) temp of an adiabatic system (coffee + milk + room) must be the same, i.e. it's the total amount of heat transferred that matters, and not the rates of heat transfer.

If the system is not adiabatic, i.e. the room is not big enough to remain at constant temp, or equilibrium is not achieved in 2 mins of cooling, then the puzzle statement must specify all three initial temps to be well posed.


What gave you the idea that this is an adiabatic system? It's a cup of freshly brewed coffee on your kitchen counter. Equilibrium will certainly not be achieved within 2 minutes. Even if it were, different schemes would reach it at different time points.

It is fundamentally a question about rate of energy transfer.

The thing to notice about the symmetric system is that both items will experience the same rate of transfer. However, there's presumably more coffee in your coffee than there is milk, so it's not actually symmetric. If you adjust the parameters for volume and specific heat to make the final mixed product symmetric, then it no longer matters when you do the mixing.


For me, ChatGPT (the free version, GPT-4o mini I believe?) gets it right, choosing option 2 because the coffee will cool faster due to the larger temperature difference.

Unless there's a gotcha somewhere in your prompt that I'm missing, like what if the temperature of the room is hotter than the coffee, or so cold that the coffee becomes colder than the milk, or something?

I would be surprised if any models get it wrong, since I assume it shows up in training data a bunch?


This is what I got from full-fat 4o. Maybe thinking less helps!

ChatGPT:

Option 1 — Add the cold milk immediately — will result in a lower final temperature after 2 minutes.

Why:
• Heat loss depends on the temperature difference between the coffee and the environment (usually room temperature).
• If you add the milk early, the overall temperature of the coffee-milk mixture is reduced immediately. This lowers the average temperature over the 2 minutes, so less heat is lost to the air.
• If you wait 2 minutes to add the milk, the hotter coffee loses more heat to the environment during those 2 minutes, but when you finally add the milk, it doesn’t cool it as much because the coffee’s already cooler and the temp difference between the milk and the coffee is smaller.

Summary:
• Adding milk early = cooler overall drink after 2 minutes.
• Adding milk late = higher overall temp after 2 minutes, because more heat escapes during the time the coffee is hotter.

Want me to show a simple simulation or visualisation of this?


Oof. I wonder what makes it so bad?

In my experience LLMs tend to be pretty good at basic logic as long as they understand the domain well enough.

I mean, it even gets it right at first -- "This lowers the average temperature over the 2 minutes, so less heat is lost to the air." -- but then it seems to get conceptually confused about heat loss vs cooling, which is surprising.


If you let it sit for 2 minutes, your time is up and you don't have time to add the cold milk.


By this logic you can't let it sit for 2 mins after you add the cold milk either, so both options are invalid.

In math/science questions some things are assumed to be (practically impossibly) instant.


Ah but if you do that you get time up after 1.58s and you still win


> My go to puzzle is this:

> Mercury gets this right - while as of right now ChatGPT 4o get it wrong.

This is so common a puzzle it's discussed all over the internet. It's in the data used to build the models. What's so impressive about a machine that can spit out something easily found with a quick web search?


Just that what I thought would be better models don’t do it right.

I was expecting this model to be nowhere near ChatGPT.

Although someone above is saying 4o-mini got it right so maybe it’s meaningless. Or maybe thinking less helps…


There is sufficient stochasticity in LLMs to invalidate most comparisons at this level. Minor changes in the prompt text, or even repeated runs of the same prompt on the same model, will produce different results (depending on temperature and other parameters), let alone runs across different models.

Try re-running your test on the same model multiple times with the identical prompt, or varying the prompt. Depending on how much context the service you choose is keeping for you across a conversation, the behavior can change. Something as simple as prompting an incorrect response with a request to try again because the result was wrong can give different results.

Statistically, the model will eventually hit on the right combination of vectors and generate the right words from the training set, and as I noted before, this problem has a very high probability of being in the training data used to build all the models easily available.
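The repeated-run test suggested above can be sketched as a small tally loop. This is only an illustration: `ask_model` is a hypothetical callable standing in for whatever client you use, and the "ANSWER: 1/2" suffix is an assumed convention added to make replies machine-checkable, not a feature of any particular service.

```python
import re
from collections import Counter
from typing import Callable

# Assumed convention: the model is asked to end with "ANSWER: 1" or "ANSWER: 2".
ANSWER_RE = re.compile(r"ANSWER:\s*([12])")
SUFFIX = "\nEnd your reply with 'ANSWER: 1' or 'ANSWER: 2'."


def tally_answers(ask_model: Callable[[str], str], prompt: str, runs: int = 20) -> Counter:
    """Send the same prompt `runs` times and count the final answer of each reply."""
    counts: Counter = Counter()
    for _ in range(runs):
        reply = ask_model(prompt + SUFFIX)
        matches = ANSWER_RE.findall(reply)
        # Use the last marker in case the reply mentions both options along the way.
        counts[matches[-1] if matches else "unclear"] += 1
    return counts
```

Comparing the resulting counts across models, rather than a single reply from each, gives a fairer picture of how often each one lands on option 2.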


I had it write a Python program to calculate disk usage by directory, basically a `du` clone. It was astonishingly fast (2s) and correct. Other models I've tried got it wrong, were slow, and ignored my instructions to use topdown=False in the call to walk().
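For anyone wondering why topdown=False matters here: a minimal sketch of the idea, not the program the model actually produced. With a bottom-up walk, every subdirectory is visited before its parent, so each parent can add up totals that are already computed. The function name `disk_usage` is made up for this example.

```python
import os


def disk_usage(root):
    """Return {directory path: total bytes of its files, subdirectories included}."""
    totals = {}
    # topdown=False yields children before parents, so subtotals exist when needed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        size = 0
        for name in filenames:
            try:
                # lstat so a symlink counts as the link itself, not its target.
                size += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
        for name in dirnames:
            # Symlinked directories are not walked by default, hence the 0 fallback.
            size += totals.get(os.path.join(dirpath, name), 0)
        totals[dirpath] = size
    return totals


if __name__ == "__main__":
    for path, size in sorted(disk_usage(".").items()):
        print(f"{size:>12}  {path}")
```

Note that this sums apparent file sizes (st_size), whereas the real `du` reports allocated disk blocks by default, so the numbers can differ.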




