I remember when this stuff was all coming out and people were finally excited about ChatGPT getting the "which is heavier, a 10 pound bag of feathers or a 10 pound bag of bricks?" problem correct. But of course it got it correct. It was in the training set. Vary the problem slightly by just changing the nouns, or changing the numbers so that one in fact was heavier than the other, and performance went all over the map.
I just went to chatgpt.com and put into the chat box "Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton?", and the very first answer I got (that is, I didn't go fishing here) was
The 9.99-pound bag of steel ingots is heavier than the 10.01-pound
bag of fluffy cotton by a small margin. Although the cotton may
appear larger due to its fluffy nature, the steel ingots are denser
and the weight of the steel bag is 9.99 pounds compared to the 10.01
pounds of cotton. So, the fluffy cotton weighs just a tiny bit more
than the steel ingots.
Which, despite getting it both right and wrong, must still be graded as a "fail".
If you want to analyze these things for their true capability, you need to make sure you're out of the training set... and most of the things that leap to your mind in 5 seconds are leaping to your mind precisely because they are either something you've seen quite often or something that you can easily think of and therefore many other people have easily thought of them as well. Get off the beaten path a bit and the math gets much less impressive.
ChatGPT Plus user here. The following are all fresh sessions and first answers, no fishing.
GPT 4:
The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The type of material doesn’t affect the weight comparison; it’s purely a matter of which bag weighs more on the scale.
GPT 4o:
The 10.01-pound bag of fluffy cotton is heavier. Weight is independent of the material, so the bag of cotton’s 10.01 pounds outweighs the steel ingots’ 9.99 pounds.
GPT o1:
Since both weights are measured on the same scale (pounds), the 10.01-pound bag of cotton is heavier than the 9.99-pound bag of steel, despite steel being denser. The key is simply that 10.01 pounds exceeds 9.99 pounds—density doesn’t affect the total weight in this comparison.
So do what the commenter suggests and make irrelevant permutations to the input to find when it fails, i.e., engage in hypothesis testing rather than confirmation bias.
If a system has the capability to solve problems composed of {part_1, ..., part_n}, then it only has that capability if irrelevant permutations {part_1, part_2', ..., part_n} make no difference.
It's very obvious that such permutations can destroy such apparent capabilities.
I've just tested a number of permutations with Claude 3.5 Sonnet. It correctly answered all variants I tried on the first attempt, as follows:
Which is heavier, a 9.99 kilogram tungsten cube or a 10.01 kilogram block of aerogel?
Which is heavier, 10,000 steel balls weighing 0.999 grams each or 10,000 polystyrene balls weighing 1.001 grams each?
Which is heavier, a 10.01kg block of steel on Venus or a 9.99kg bag of feathers on Earth?
Which is heavier, a 10cm^3 block of steel or a 100cm^3 block of balsa wood?
Which is heavier, a golf ball made of steel or a baseball made of lithium?
In all cases, Claude clearly used CoT and reasoned out the problem in full. I would be interested in seeing if anyone can find any variant of this problem that stumps any of the leading LLMs. I'm bored of trying.
Hey, ChatGPT please write me a python program which randomly samples from various materials and various weights then poses a problem to the ChatGPT 4o API -- the goal is to find cases where the LLM fails to obtain the correct answer....
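A rough sketch of what such a harness might look like (not OP's actual script): pair random materials and weights, ask the model which is heavier, and crudely grade the reply. It assumes the openai Python client (>= 1.0) with an OPENAI_API_KEY in the environment; the material list, weight range, and grading heuristic are made up for illustration.

```python
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MATERIALS = ["steel ingots", "fluffy cotton", "tungsten cubes", "aerogel", "goose feathers"]

def one_trial():
    a, b = random.sample(MATERIALS, 2)
    wa = round(random.uniform(9.5, 10.5), 2)
    wb = round(random.uniform(9.5, 10.5), 2)
    if wa == wb:
        return None  # skip exact ties; the interesting cases are near-misses
    question = f"Which is heavier, a {wa}-pound bag of {a} or a {wb}-pound bag of {b}?"
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
    heavier = a if wa > wb else b
    # Crude grading: a pass requires the reply to name the heavier material at all.
    return question, reply, heavier.split()[-1] in reply

for result in filter(None, (one_trial() for _ in range(50))):
    question, reply, ok = result
    if not ok:
        print("Possible failure:\n", question, "\n", reply, "\n")
```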
Which weighs more on earth: 100kg of steel or a 100kg polystyrene ball? Be as technically correct as possible
A:
Technically, 100 kg of steel and 100 kg of polystyrene weigh the same on Earth when measured on a scale, as they both experience the same gravitational force due to their equal mass. However, there are some nuances to consider:
Buoyancy Effect
Earth’s atmosphere exerts an upward buoyant force on objects, proportional to the volume of the object and the density of the surrounding air. Since polystyrene has a much lower density than steel, a 100 kg polystyrene ball will have a significantly larger volume and experience a greater buoyant force compared to 100 kg of steel.
As a result:
• Effective weight (apparent weight): The steel will weigh slightly more than the polystyrene on a scale because the buoyant force acting on the polystyrene reduces its apparent weight more.
Actual Weight
Weight is the gravitational force on an object and is given by:
W = m·g, where m = 100 kg and g ≈ 9.81 m/s². Without considering buoyancy, both objects have the same weight of approximately 981 N (newtons).
Conclusion
• True weight (in a vacuum): Both are identical at 981 N.
• Apparent weight (in air): Steel is slightly heavier due to reduced buoyant force acting on it compared to the polystyrene ball.
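A quick back-of-the-envelope check of that buoyancy claim (a rough sketch; air at ~1.2 kg/m³, steel at ~8000 kg/m³, and expanded polystyrene at ~30 kg/m³ are assumed densities I chose for illustration, and the exact values only change the size of the gap, not its direction):

```python
# Apparent (scale) weight of 100 kg of steel vs a 100 kg polystyrene ball in air.
RHO_AIR, RHO_STEEL, RHO_EPS = 1.2, 8000.0, 30.0  # kg/m^3, assumed densities
G, MASS = 9.81, 100.0                            # m/s^2, kg

def apparent_weight(mass_kg, density):
    volume = mass_kg / density                   # m^3 of material
    buoyant_force = RHO_AIR * volume * G         # weight of displaced air, N
    return mass_kg * G - buoyant_force           # what a scale in air reads, N

w_steel = apparent_weight(MASS, RHO_STEEL)
w_foam = apparent_weight(MASS, RHO_EPS)
print(f"steel: {w_steel:.1f} N, polystyrene: {w_foam:.1f} N")
print(f"gap: {(w_steel - w_foam) / G:.2f} kg-equivalent")  # roughly 4 kg
```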
Yup, the models are smart, but are trained to follow standard human patterns for this type of question. And even on Hacker News the vast majority will not think that they would need to correct for buoyancy when actually attempting the experiment in standard conditions.
They very often get popular "tricky" questions wrong because they saw it so many times that they switch from internal reasoning to memorization/retrieval.
I tried "Which is heavier, a 10.01kg block of steel on or a 9.99kg bag of feathers?" three times, and ChatGPT kept converting kg to pounds and saying the 9.99kg bag is heavier.
Also couldn't get o1 to fail. I tried the following with o1:
"Which is heavier, a solid titanium box weighing 9.9 flubs, or an empty wooden box weighing 10.1 blobs, where 1 flub = 1 kg, and 1 blob is 1kg".
The answer:
"Since 1 flub = 1 kg and 1 blob = 1 kg, the titanium box’s mass is 9.9 kg and the wooden box’s mass is 10.1 kg. Therefore, the wooden box (10.1 kg) is heavier."
I found that the models consistently ignore buoyancy (objects with the same mass but different density can have different weight as measured by a scale).
If GP's hypothesis was "it fails for small variations of the input, like this one", then testing that hypothesis with that exact variation on a couple models seems fair and scientific.
Testing it with more variations until one fails feels a bit like p-hacking. You'd need to engage in actual statistics to get reliable results from that, beyond "If I really try, I can make it fail". Which would be a completely different hypothesis than the one presented at the start
I feel like I'm almost 100% certain that the smart guys at OpenAI have added many more variations of the problem to their training set since OP did his failing test, so it doesn't surprise me at all to know that this exact one now passes.
In fact, in my use of o1 it's incredibly clear that it still has the same problems. It's incredibly common that the second I ask for something even slightly outside the training set, it's more likely to "round" to some wrong solution in the training set, rather than use any sort of human-like reasoning to figure out the right answer (often the right answer isn't hard to get, just not found in a Google search).
I asked "Which is louder, a 12.5 decibel scream, or a 12.8 decibel whisper?", which I wouldn't think has much of a presence in its training set, and it answered correctly.
May still get it wrong in more subtle ways, though. Personally, I think it'll continue to get physics wrong until someone builds it some robot arms so it can train on actually interactive physical spaces and behavior.
Except that if the model genuinely was reasoning about the problem, you could test it with every variation of materials and weights in the world and it would pass. Failing that problem at all in any way under any conditions is a failure of reasoning.
By that logic, humans can't genuinely reason, because they're often fooled by counter-intuitive problems like Monty Hall or the Birthday Problem, or sometimes just make mistakes on trivial problems.
I’m not really sure what you’re trying to say here - that LLMs don’t work like human brains? We don’t need to conduct any analyses to know that LLMs don’t “know” anything in the way humans “know” things because we know how LLMs work. That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.
>We don’t need to conduct any analyses to know that LLMs don’t “know” anything in the way humans “know” things because we know how LLMs work.
People, including around HN, constantly argue (or at least phrase their arguments) as if they believed that LLMs do, in fact, possess such "knowledge". This very comment chain exists because people are trying to defend against a trivial example refuting the point - as if there were a reason to try.
> That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.
I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.
It boggles my mind that I have to explain to people that sophisticated use of language doesn't inherently evidence thought, in the current political environment where the Dead Internet Theory is taken seriously, elections are shown over and over again to be more about tribalism and personal identity than anything to do with policy, etc.
You don't have to listen to or engage with those people though, just ignore 'em. People say all kinds of things on the Internet. It's completely futile to try to argue with or "correct" them all.
> I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.
According to whom? There is certainly no single definition of intelligence, but most people who have studied it (psychologists, in the main) view intelligence as a descriptor of the capabilities of a system - e.g., it can solve problems, it can answer questions correctly, etc. (This is why we call some computer systems "artificially" intelligent.) It seems pretty clear that you're confusing intelligence with the internal processes of a system (e.g. mind, consciousness - "knowing things in the way that humans do").
We are pretty certain that humans can reason, yet they are sometimes wrong. Even if you give them the same problem over and over again with slight variations.
LLMs get things wrong due to different factors than humans (humans lose focus, LLMs have randomness applied when sampling their responses to improve results). But clearly we have to choose a goal somewhat below 100% if we want a test that doesn't conclude that humans are incapable of reasoning.
The difference is we _know_ that LLMs are fancy stochastic models, we don't know that they're capable of reasoning, and the null hypothesis is that they're not (because we know what they _are_ - we built them) - any "reasoning" is an emergent property of the system, not something we built them to do. In that case, evidence they're not reasoning - evidence they're stochastic parrots doing a performance of reasoning - weighs heavier, because the performance of reasoning fits into what we know they can do, whereas genuine reasoning would be something new to the model.
There's deeper philosophical questions about what reasoning actually _is_, and LLMs have made those sharper, because they've shown it's clearly possible for a complex statistical model to generate words that look like reasoning, but the question is whether there's a difference between what they're doing and what humans are doing, and evidence that they're _not_ reasoning - evidence that they're just generating words in specific orders - weighs heavily against them.
We haven't coded LLMs to be stochastic models, we coded them to predict text with any method gradient descent finds on a transformer architecture. That's not exactly the same.
But more importantly, if you want to show that LLMs can't reason you obviously have to use a test that when applied to humans would show that humans can reason. Otherwise your test isn't testing reasoning but something more strict.
It's not p-hacking; he's right. You're both right. First test the same prompt on different versions, then the ones that got it right go to the next round with variations on the prompt.
We aren't testing whether the model's results are stable or correct for a given class of problem. The goal is to establish whether the model can reason.
Nothing capable of reasoning would contradict itself so blatantly and in such a short span while failing to indicate any kind of uncertainty.
> The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Interestingly, that was the best answer out of all models and also a correct one. O1-preview doesn't allow setting the temperature, so it's hard to know where the real probabilities fall.
I asked ChatGPT-4 mini the previous comment's question ("Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton?") in new incognito tabs until I got an incorrect result. It only took four tries.
1: "The 10.01-pound bag of fluffy cotton would be heavier. Even though cotton is much less dense than steel, the weight of the cotton bag is 10.01 pounds, while the steel ingot bag weighs 9.99 pounds, making the cotton bag the heavier of the two by 0.02 pounds."
2: "The 10.01-pound bag of fluffy cotton is heavier. Even though the steel ingots are denser and more compact, the cotton weighs more at 10.01 pounds compared to the steel ingots' 9.99 pounds. The difference is only 0.02 pounds, but in terms of weight, the cotton bag is heavier."
3: "The 10.01-pound bag of fluffy cotton is heavier. Despite cotton being less dense than steel, the weight is determined by the total mass, not the material's density. So, since the cotton bag weighs 10.01 pounds and the steel ingot bag weighs 9.99 pounds, the cotton bag is heavier by 0.02 pounds."
4: "The 9.99-pound bag of steel ingots is heavier. It weighs 9.99 pounds, while the bag of fluffy cotton weighs 10.01 pounds. Despite the cotton being fluffier and taking up more space, the steel ingots are still lighter by 0.02 pounds."
To determine which bag is heavier when measured in air—the 9.99-pound bag of steel ingots or the 10.01-pound bag of fluffy cotton—we need to consider not just their masses but also the effect of air buoyancy on their apparent weights.
*Step 1: Calculate the Masses*
First, convert the given weights into masses using the conversion \(1 \text{ pound} = 0.453592 \text{ kg}\):
- Mass of steel bag, \(m_s = 9.99 \, \text{lb} \times 0.453592 \, \text{kg/lb} \approx 4.5310 \, \text{kg}\)
- Mass of cotton bag, \(m_c = 10.01 \, \text{lb} \times 0.453592 \, \text{kg/lb} \approx 4.5401 \, \text{kg}\)
*Step 2: Calculate the Volumes*
Next, calculate their volumes using the densities:
- Density of steel, \(\rho_s \approx 8000 \, \text{kg/m}^3\)
- Density of cotton (fluffy), \(\rho_c \approx 50 \, \text{kg/m}^3\)
Despite the cotton bag having slightly more mass and a greater weight in a vacuum, when measured in air, the steel bag is heavier due to the significantly smaller upward buoyant force acting on it compared to the cotton bag. This means that on a scale in air, the 9.99-pound bag of steel ingots will weigh more than the 10.01-pound bag of fluffy cotton.
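A quick numerical sanity check of that conclusion, using the answer's own density figures for steel and fluffy cotton (a sketch only; the ~1.2 kg/m³ air density is my assumption):

```python
# Does buoyancy really flip the 9.99 lb steel vs 10.01 lb cotton comparison?
LB = 0.453592                                       # kg per pound
RHO_AIR, RHO_STEEL, RHO_COTTON = 1.2, 8000.0, 50.0  # kg/m^3

m_steel, m_cotton = 9.99 * LB, 10.01 * LB           # masses in kg

def scale_reading(mass_kg, density):
    displaced_air = RHO_AIR * (mass_kg / density)   # mass of displaced air, kg
    return mass_kg - displaced_air                  # what a scale in air reads, kg

print(f"steel:  {scale_reading(m_steel, RHO_STEEL):.4f} kg")    # ~4.5307
print(f"cotton: {scale_reading(m_cotton, RHO_COTTON):.4f} kg")  # ~4.4315
# The ~0.11 kg of air displaced by the cotton swamps the ~0.009 kg mass gap,
# so the steel bag does read heavier on a scale in air.
```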
That appears to be the same model I used. This is why I emphasized I didn't "go shopping" for a result. That was the first result I got.
I'm not at all surprised that it will nondeterministically get it correct sometimes. But if it doesn't get it correct every time, it doesn't "know".
(In fact "going shopping" for errors would still even be fair. It should be correct all the time if it "knows". But it would be different if I was fishing over and over and over and finally got one, versus the first time I asked.)
Edit: It appears it isn't the model I used. The point holds, though, you need to make sure you're off the training set for it to matter. This isn't a "ChatGPT can't do that" post as some are saying, it's more a "you aren't asking what you think you're asking" post.
You get the same problem in a human context in things like code interviews. If you ask an interviewee the exact question "how do you traverse a binary tree in a depth-first manner", you aren't really learning much about the interviewee. It's a bad interview question. You need to get at least a bit off the beaten trail to do any sort of real analysis.
I believe this is just a case of OpenAI's naming scheme being weird and confusing.
The default model I see on chatgpt.com is GPT 4o-mini, which is not o1-mini.
OpenAI describes GPT 4o-mini as "Our fast, affordable small model for focused tasks" and o1/o1-mini as "Reasoning models that excel at complex, multi-step tasks".
Could you share the exact chat you used for when it failed? There is a share chat button on openai.
It's very difficult to be an AI bull when the goalposts are moving so quickly that an AI answering correctly across multiple models is brushed off as 'nondeterministically getting it correct sometimes'.
A single-purpose state machine not failing to do the single thing it was created to do does not make for the clever retort you think it makes.
"AGI": emphasis on "G" for "General". The LLMs are not failing to do generalized tasks, and that they are nondeterministic is not a bug. Just don't use them for calculating sales tax. You wouldn't hire a human to calculate sales tax in their head, so why do you make this a requirement in order to call an LLM "AGI"?
I wonder when the goalposts will stop moving from "We have superhuman intelligences which are able to rather reliably converse in many languages, do generalized tasks and automate operations we thought were impossible to automate 3 years ago" (and by the way, this is what we have TODAY), all the way to "It's not AGI unless it's an omnipotent god that knows how to turn water into wine and calculate the applicable sales tax of that operation".
I do not get it, where are all the serious people? Just come out and say what it is and what it is not. Pretend for a moment you have the ethics and work philosophy of an undergraduate science student.
> You wouldn't hire a human to calculate sales tax in their head
Everyone did that 60 years ago; humans are very capable of learning and doing that. Humans built jet planes, skyscrapers, missiles, tanks, carriers without the help of electronic computers.
Yeah... They used slide rules and vast lookup tables of function values printed on dead trees. For the highest value work, they painstakingly built analog calculators. They very carefully checked their work, because it was easy to make a mistake when composing operations.
Humans did those things by designing failsafe processes, and practicing the hell out of them. What we would likely consider overfitting in the LLM training context.
While this may be true, it's a very common problem that people who want to demonstrate how bad a model is fail to provide a direct link or simply state the name of the model.
Yes it does, at the top of every chat there is a drop-down to select the model, which displays the current model. It's been a constant part of the UI since forever.
Logged in, non paid account, on a desktop, for me, it's exactly as the person you're replying to has stated.
If I log out, it shows 4o mini, and when I try to change it, it asks me to login or sign in rather than giving me any options.
When I use ChatGPT enough while logged in, it gives me some nebulous "you've used all your xyz tokens for the day". But other than that there is no real signal to me that I'm getting a degraded experience.
> With a free account the model it claims to be using is "4o auto", which is not a model but apparently an attempt to automatically decide models for you to be more cost effective.
I usually test models using the OpenAI API which doesn't offer links the way I think you mean. If I provide some output I got from a particular model you're just going to have to take my word for it.
you sure? i just asked o1-mini (not 4o mini) 5 times in a row (new chats obviously) and it got it right every time
perhaps you stumbled on a rarer case, but reading the logs you posted, this sounds more like a 4o model than an o1, since it's doing its thinking in the chat itself; plus, the procedure you described would probably get you 4o-mini
> What is heavier a liter of bricks or a liter of feathers?
>> A liter of bricks and a liter of feathers both weigh the same—1 kilogram—since they each have a volume of 1 liter. However, bricks are much denser than feathers, so the bricks will take up much less space compared to the large volume of feathers needed to make up 1 liter. The difference is in how compactly the materials are packed, but in terms of weight, they are identical.
Hmm... Gemini (1.5 Flash) just aced that exact question for me:
These lines celebrate the victory of the British ship HMS Victory, led by the famous Admiral Lord Nelson, in the Battle of Trafalgar in 1805.
"Here's success unto the Victory": This line directly praises the ship itself, acknowledging its role in the successful battle.
"and crew of noble fame": This recognizes the bravery and skill of the sailors who served aboard the Victory.
"and glory to the captain": This line specifically honors Admiral Nelson, the captain of the Victory, for his leadership and strategic brilliance in the battle.
"bold Nelson was his name": This emphasizes Nelson's courage and daring, which were legendary.
The lines express admiration for the ship, its crew, and most importantly, Admiral Nelson, who became a national hero in Britain for his victory at Trafalgar.
Well, the rest of the song helps, in that it specifies that (1) the toast upset the wedding, and (2) the singer responded to that by insulting "you", which is presumably one or more of the bride, the groom, and the guests.
But I think specifying that the singer has crashed his ex-girlfriend's wedding is already enough that you deserve to fail if your answer is "he says he's not upset, so what he means is that he's not upset". It's not any kind of leap to guess that the bride's ex-boyfriend's toast might cause a scene at a wedding - that's why the bride's ex-boyfriends are never invited.
(The question has already provided every word of the toast that appears in the song.)
See also the sidethread comment by mikeruiz, noting that o1-pro reproduces the rest of the lyrics to The Victory, but gets the question wrong anyway.
Nah, intermittent failures are apparently enough to provide evidence that an entire class of entities is incapable of reason. So I think we've figured this one out...
Well, I proved that he's happy to express an opinion on whether an answer to a question is correct regardless of whether he knows anything about the question. I wouldn't trust advice from him or expect his work output to stand up to scrutiny.
i'd prefer an easily verifiable question rather than one where we can always go "no that's not what they really meant" but someone else with o1-mini quota can respond
“They’re toasting Admiral Nelson’s ship (HMS Victory) and its valiant crew, hailing the ship’s successes and Nelson’s heroism. In other words, the singer is offering tribute—“success unto the Victory”—to the vessel and its famed sailors, and “glory to the captain” who led them, namely the celebrated Admiral Horatio Nelson.”
…but to your point, no idea if the artist intended some more obscure reference.
o1-pro was also able to produce a relatively complete version of the original source, though, amusingly, it referred to it as a 'broadside' rather than 'broadsheet'. Appropriate given the context!
> no idea if the artist intended some more obscure reference.
No, but the answer is obviously wrong. Very, very wrong. The lines are not praise. The song's view of the Victory, and of the battle of Trafalgar, is overwhelmingly negative.
> o1-pro was also able to produce a relatively complete version of original source
...as should be clear, if you read the original source. (Or, heck, just the two lines following the ones in the question.) Being able to reproduce the source makes the error look that much worse.
> though, amusingly, referred to it as a ‘broadside’ rather than ‘broadsheet’.
>> A broadside (also known as a broadsheet) is a single sheet of inexpensive paper printed on one side, often with a ballad, rhyme, news and sometimes with woodcut illustrations.
i think it's a bit tricky; the surface meaning is extremely praiseworthy, and some portion of readers might interpret it as someone who has praise for Admiral Nelson but hates the press gangs.
of course, it is a sardonic, implicit critique of Admiral Nelson/the victory, etc. but i do think it is a bit subtle.
It got it right, but an interesting result that it rambled on about monetary value for... no reason.
> While the lint bag is heavier in terms of weight, it's worth mentioning that gold is significantly more valuable per pound compared to lint. This means that even though the lint bag weighs more, the gold bag holds much greater monetary value.
I wrote a script and pounded o1-mini and GPT-4 with a wide variety of temperature and top_p parameters, and was unable to get it to give the wrong answer a single time.
Just a whole bunch of:
(openai-example-py3.12) <redacted>:~/code/openAiAPI$ python3 featherOrSteel.py
Response 1: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 2: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 3: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 4: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 5: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 6: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 7: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 8: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 9: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 10: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
All responses collected and saved to 'responses.txt'.
Script with one example set of params:
import openai
import time
import random

# Replace with your actual OpenAI API key
openai.api_key = "your-api-key"

# The question to be asked
question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"

# Number of times to ask the question
num_requests = 10

responses = []

for i in range(num_requests):
    try:
        # Generate a unique context using a random number or timestamp, this is to prevent prompt caching
        random_context = f"Request ID: {random.randint(1, 100000)} Timestamp: {time.time()}"

        # Call the Chat API with the random context added
        response = openai.ChatCompletion.create(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system", "content": f"You are a creative and imaginative assistant. {random_context}"},
                {"role": "user", "content": question}
            ],
            temperature=2.0,
            top_p=0.5,
            max_tokens=100,
            frequency_penalty=0.0,
            presence_penalty=0.0
        )

        # Extract and store the response text
        answer = response.choices[0].message["content"].strip()
        responses.append(answer)

        # Print progress
        print(f"Response {i+1}: {answer}")

        # Optional delay to avoid hitting rate limits
        time.sleep(1)

    except Exception as e:
        print(f"An error occurred on iteration {i+1}: {e}")

# Save responses to a file for analysis
with open("responses.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(responses))

print("All responses collected and saved to 'responses.txt'.")
Down voted for not actually countering the argument in question? The script doesn't alter the phrasing of the question itself. It just generates a randomized, irrelevant preamble.
Well, I understood the argument in question to be: was it possible for the model to be fooled by this question, not was it possible to prompt engineer it into failure.
The parameter space I was exploring, then, was the different decoding parameters available during the invocation of the model, with the thesis that if it were possible for the model to generate an incorrect answer to the question, I would be able to replicate it by tweaking the decoding parameters to be more "loose" while increasing sample size. By jacking up temperature while lowering top_p, we see the biggest variation of responses, and if there were an incorrect response to be found, I would have expected to see it in the few hundred times I ran during my parameter search.
If you think you can fool it by slight variations on the wording of the problem, I would encourage you to perform a similar experiment as mine and prove me wrong =P
We use variations on questions that are well known for duping inattentive humans, to test a system that we expect a priori to be incapable of such inattention.
Unless "getting easy things wrong sometimes" is an inherent property of intelligence, we should expect that a properly "intelligent" computerized system would never err on problems far below its level of comprehension - unless we had some reason to believe it "wanted to", and as of yet I see no reason to believe this is even possible in principle.
Humans err, broadly speaking, for two reasons: genuinely reaching the limits of their comprehension, or trusting "system 1" (in Kahneman's analysis) too much.
Because for things like the Putnam questions, we are trying to get the performance of a smart human. Are LLMs just stochastic parrots or are they capable of drawing new, meaningful inferences? We keep getting more and more evidence of the latter, but things like this throw that into question.
I would agree if we weren't starting with LLMs for a baseline. The first AGI will know at least as much as LLMs, IMO, and that's already not-stupid. Especially once they can separate out the truth in their training.
This happens literally every time. Someone always says "ChatGPT can't do this!", but then when someone actually runs the example, chatGPT gets it right. Now what the OP is going to do next is proceed to move goalposts and say like "but umm I just asked chatgpt this, so clearly they modified the code in realtime to get the answer right"
Prompt: In the Netherlands, in terms of drinks, is there a particular spirit that represents the country?
> Yes, in the Netherlands, jenever (also known as genever) is the traditional spirit that represents the country. Jenever is a type of Dutch gin that has a distinctive flavor, often made from malt wine and flavored with juniper berries. It has a long history in the Netherlands, dating back to the 16th century, and is considered the precursor to modern gin.
> Jenever comes in two main types: oude (old) jenever, which has a richer, maltier flavor and is often aged in wooden casks, and jonge (young) jenever, which is lighter and has a more neutral taste. Both types can be enjoyed straight or used in cocktails.
> In addition to jenever, the Netherlands is also known for its liqueurs, such as advocaat, a rich and creamy drink made from eggs, sugar, and brandy, often enjoyed as a dessert or in cocktails. However, jenever remains the most iconic spirit associated with Dutch culture.
This is completely wrong. Jenever certainly is very Dutch, but no one would say it is iconic as the Dutch spirit. For example, if you asked up north in Friesland, they would say Berenburg.
This happens literally every time. Someone always says "ChatGPT can do this!", but then within one or two prompts, its gets it wrong.
"This is completely wrong."
Is it completely wrong? Or maybe "somewhat" wrong? Maybe just lacking nuance?
I know nothing about the answer to this question, so this is an honest question.
Using just a plain old search engine, for things like "national drink of the netherlands" and simlar queries, I am directed to Wikipedia's Jenever page as the top hit, and Wikipedia's list of national drinks lists Jenever and Heineken as the entries for the Netherlands. Search engines also give page after page of travel guides and blog posts, most of which list Jenever at or near the top of of their listings. One travel guide calls it "the most famous Dutch spirit and most famous Amsterdam liquor, Jenever, also spelled Genever or simply Dutch gin."
I am rather saying that there is no one national drink for The Netherlands, like a Frenchman would say wine, a German/Belgian would say beer, and a Scotsman would say whisky. Note that I prompted "In the Netherlands, in terms of drinks, is there a particular spirit that represents the country?" I didn't ask which spirit is consumed the most.
For example, France has been trending towards beer more and more, and within a few decades they might be consuming more beer than wine. But even then, the French wouldn't slowly start to say beer represents France.
Furthermore, "just adding some herbs" does a large disservice to the flavor change of Berenburg. Jenever (aka jonge/unaged jenever) is straight-up vile. I've heard it described by expats as "having the worst elements of both cheap gin and cheap whisky".
Berenburg in comparison is spicy and vanilla-y and actually debatably enjoyable.
Aged/oude jenever is much closer to Berenburg (or Berenburg to aged jenever), also with hints of vanilla and spices.
But, virtually no one except for dusty old men orders aged jenever. The kind ordered by far the most is jonge jenever, and then it's only in a sense of "haha let's drink this terrible thing" or "let's get shitfaced quick".
If o1 supposedly "oneshots every question", it should have been aware of these nuances instead of just confidently assigning jenever as 'the' spirit of the Dutch.
The question of whether there is a national drink seems to me to be entirely different than the question you asked the LLM "Prompt: In the Netherlands, in terms of drinks, is there a particular spirit that represents the country?"
The question in the prompt comes off to me as a sort of qualitative determination rather than asking about pure factual information (is there an officially designated spirit). As such I don't think it can necessarily be right or wrong.
Anyway, I'm not sure what you'd expect. In terms of acquisition of knowledge, LLMs fundamentally rely on a written corpus. Their knowledge of information that is passed through casual spoken conversation is limited. Sure, as human beings, we rely a great deal on the latter. But for an LLM to lack access to that information means that it's going to miss out on cultural nuances that are not widely expressed in writing. Much in the same way that a human adult can live in a foreign country for decades, speaking their adopted language quite fluently, but if they don't have kids of their own, they might be quite ignorant of that country's nursery rhymes and children's games, simply because they were never part of their acquired vocabulary and experience.
I was just proving the people wrong that were saying akin to that o1 was "oneshotting every question".
I completely understand from how LLMs work that they wouldn't be able to get this right. But then people shouldn't be proudly be pronouncing that o1 (or any model) is getting every question right, first time.
My conjecture is that you still haven't proven that it didn't get the answer "right"
I have opened the question of why you thought jenever was not jenever, and your non-responsiveness I think compels the fact that AI was more correct in your contrived instance.
If you add pear and spices to vodka, we call it liqueur and not pear-flavored vodka. So no, you are wrong. And the AI is wrong. But that is okay, if you want to enjoy leaning into the hype that's your choice.
So you believe they are incorrect because regionally some area would select something different because it represented that area. But your question asked nationally... is there a better answer than the one they gave? Were you expecting a no?
But, like Zahlman points out, it's a trick question, and instead of admitting it doesn't know or even prepending "I don't know for sure, but:", it just burps up its best-effort answer. There is no one spirit that represents The Netherlands. If an LLM is so good it "oneshots any question", it should realize it doesn't have a unanimous answer and tell me.
The point is that there is no correct national answer, because the locals don't see it as a matter of national identity.
What's expected is an ability to identify trick questions, i.e., to recognize fundamental problems in the phrasing of a question rather than trying to provide a "helpful" answer at all costs.
This corresponds to one of the many reasons LLM output is banned on Stack Overflow.
But what does this have to do with reasoning? Yes, LLMs are not knowledge bases, and seeing people treat them as such absolutely terrifies me. However, I don’t see how the fact that LLMs often hallucinate “facts” is relevant to a discussion about their reasoning capabilities.
Reasoning involves making accurate inferences based on the information provided in the current context, rather than recalling arbitrary facts from the training data.
Yes, that's what I said. The whole point of hallucinations is that they aren't "arbitrary facts recalled from the training data". They represent attempts to synthesize (i.e., infer) new facts. But because the inferences are not accurate, and because the synthesis process is not sound, the attempt cannot be called reasoning.
It is equally possible to "reason" about things you already know, as about things you've just been told. In fact, the capacity to speculatively, without prompting attempt such reasoning is a big part of cognition.
> Someone always says "ChatGPT can't do this!", but then when someone actually runs the example, chatGPT gets it right
I mean, if I had OpenAI’s resources I’d have a team tasked with monitoring social to debug trending fuck-ups. (Before that: add compute time to frequently-asked novel queries.)
I was thinking something very similar. Posting about a problem adds information back to the system, and every company selling model time for money has a vested interest in patching publicly visible holes.
This could even be automated; LLMs can sentiment-analyze social media posts to surface ones that are critical of LLM outputs, then automatically extract features of the post to change things about the running model to improve similar results with no intervention.
Similarly, in every thread there’s an AI skeptic who says LLMs are “useless” for coding, and never provides an example query for what they were trying.
If you ask about more niche language features or libraries, chatgpt will make up libraries or functions to fill the gap.
When asking an LLM to write a script for you, I would say 10 to 30 % of the time that it completely fails. Again, making up an API or just getting things straight up wrong.
It's very helpful, especially when starting from 0 with the beginner questions, but it fails in many scenarios.
FYI: If you do that without a subscription, you currently (most likely) get a response generated through 4o-mini — which is not any of their reasoning models (o1, o1-mini or previously o1-preview) of the branch discussed in the linked paper.
Notably, it's not even necessarily 4o, their premiere "non-reasoning"-model, but likely the cheaper variant: With a free account the model it claims to be using is "4o auto", which is not a model but apparently an attempt to automatically decide models for you to be more cost effective.
Without a ChatGPT subscription you can't select a specific model anymore, not even rate limited, as was previously possible.
There doesn't seem to be a way to choose a model up-front with a free account, but after you make a query you can click on the "regenerate" button and select whether to try again with "auto", 4o, or 4o-mini. At least until you use 4o too many times and get rate limited.
I just asked Claude 3.5 Sonnet, which appears to have improved its response with CoT but there's mistakes that demonstrate the model doesn't really "understand":
Q: A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy he says "I can't operate on this child, he is my son". How is this possible?
C: Let me think about this step by step:
A woman and her son are in an accident
The woman (the mother) dies
The boy is taken to hospital
A doctor says they can't operate because the boy is their son
But the mother is dead...
Ah! The doctor is the boy's father. This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father. Alternatively, the doctor could be the boy's other mother in a same-sex relationship, though this wasn't the original intent of the traditional riddle.
> The doctor is the boy's father. This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father.
It identified the pattern, but it didn't understand the father is a man while the mother is a woman. If people assume doctors must be men, they wouldn't consider that the doctor could be the mother.
(Side note: this is an antiquated riddle/joke ... for me in particular because my mother is a doctor, but in general around half of new doctors are female, and think about Grey's Anatomy and other shows.)
I read it as "most people don't put 2 and 2 together - the doctor could be the boy's father". It is poorly worded, but not strictly wrong I think?
The joke/riddle text is "he says" but Claude says "their son" and suggests the doctor might be a woman.
More substantively: "This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father." is totally nonsensical.
The text is a gender (and meaning) inversion of the classic riddle to confuse LLMs. Even though Claude correctly CoTs that the doctor is the father, it seemingly then copypastas the explanation to the original riddle but swaps the gender, completely missing the point.
Except that Claude often takes into account things it thinks might be typos.
This is not code. Forgetting a semicolon will not make the output break. It thinks 'maybe they wrote he instead of she' and then gives options for both situations.
It is meant to solve real world situations where people might not type properly, it is not a word problem solving machine.
> the problem with "which is heavier, a 10 pound bag of feathers or a 10 pound bag of bricks?"
Interestingly, the variation of this problem that I first encountered, personally, was "which weighs more, a pound of feathers or a pound of gold?"
This is a much more difficult question. The answer given to me was that the pound of feathers weighs more, because gold is measured in troy weight, and a troy pound consists of only 12 ounces compared to the 16 ounces in a pound avoirdupois.
And that's all true. Gold is measured in troy weight, feathers aren't, a troy pound consists of only 12 ounces, a pound avoirdupois consists of 16, and a pound avoirdupois weighs more than a troy pound does.
The problem with this answer is that it's not complete; it's just a coincidence that the ultimate result ("the feathers are heavier") is correct. Just as a pound avoirdupois weighs more than a troy pound, an ounce avoirdupois weighs less than a troy ounce. But this difference, even though it goes in the opposite direction, isn't enough to outweigh the difference between 16 vs 12 ounces per pound.
Without acknowledging the difference in the ounces, the official answer to the riddle is just as wrong as the naive answer is.
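To make that arithmetic concrete (the conversion factors below are the standard exact definitions, not something from the riddle itself):

```python
# Troy vs avoirdupois: the per-ounce and per-pound comparisons go in
# opposite directions.
TROY_OZ = 31.1034768      # grams per troy ounce (exact)
AVDP_OZ = 28.349523125    # grams per avoirdupois ounce (exact)

troy_pound = 12 * TROY_OZ     # ~373.2 g  (gold)
avdp_pound = 16 * AVDP_OZ     # ~453.6 g  (feathers)

print(f"1 troy oz = {TROY_OZ:.2f} g  >  1 avdp oz = {AVDP_OZ:.2f} g")
print(f"1 troy lb = {troy_pound:.1f} g  <  1 avdp lb = {avdp_pound:.1f} g")
```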
Yeah, this is the original version of this riddle. People who don't know it think the trick is that people will reflexively say the metal is heavier instead of "they're the same", when it actually goes deeper.
No idea if GP did it intentionally to further drift from training data, but steel doesn't count as a precious metal, so it messes up the riddle by putting the two weights in the same system.
> Yeah, this is the original version of this riddle. People who don't know it think the trick is that people will reflexively say the metal is heavier instead of "they're the same"
...Have you really never encountered people who would reflexively say that?
Even simpler, I asked Gemini (Flash 1.5) this variant of the question:
```
I have two bags, one can hold a pound of gold and one can hold a pound of feathers. Which bag is heavier?
```
The point here is: a) the question really is a bit too vague; b) if you assume that each bag is made of the same material and that each bag is just big enough to hold its contents, the bag for the feathers will need to be much bigger than the bag for the gold and thus would weigh more.
The first response was wrong because the model completely missed that the question was about the bags themselves:
```
They weigh the same.
A pound of any substance, regardless of its density or volume, weighs one pound.
Gold is much denser than feathers, meaning a pound of gold occupies a much smaller volume.
Feathers are very light and fluffy, so a pound of feathers takes up a much larger volume.
```
Asking a followup `But which bag is heavier?` resulted in recognition that this is a trick question, and an incorrect answer:
```
This is a classic trick question!
Even though the bags hold the same weight (one pound), the bag of gold is likely to be heavier.
Weight vs. Mass:
Weight is the force of gravity acting on an object.
Mass is the amount of matter in an object.
The Bag Itself:
The bag holding the gold, even if it's a lightweight material, will have some weight.
The bag holding the feathers, while it might seem negligible, will also have some weight.
Practical Considerations:
In reality, the bag holding the gold might be significantly heavier due to the weight of the bag itself.
Therefore, while the gold and feathers have the same mass (one pound), the bag containing the gold is likely to be heavier in a practical sense.
```
I think Gemini did better than you think with its second answer! Your original question didn't mention that the bags were made of the same material or the same density of material. The set of all possible bags that could hold 1 pound of feathers includes some thinner, weaker bags than the set of all possible bags that could hold 1 pound of gold (the gold being denser). So absent any other prior information the probability is greater than 50% that the gold-bag would be heavier than the feather-bag on that basis.
One could go further into the linguistic nuance of saying "this can hold one pound of [substance]", which often implies that that's its maximum carrying capacity; this would actually make the "trick question" answer all the more correct, as a bag that is on the cusp of ripping when holding one pound of feathers would almost certainly rip when holding one pound of (much denser) gold.
Saran wrap would hold a pound of gold. It's just a pound in roughly a cubic inch of material. A cloth bag (like a pillow, which is how feathers are normally contained) would definitely suffice. Remember, a pound of 800-fill down would require 8 cubic feet.
The bag for the gold doesn't need anything substantial.
So, no, the chances are strongly on the side of the feather bag being heavier.
I reproduced this on Claude Sonnet 3.5, but found that changing your prompt to "Which is heavier, a 9.99-pound back of steel ingots or a 10.01-pound bag of fluffy cotton?" corrected its reasoning, after repeated tests. For some reason it was not able to figure out that "10.01" referred to pounds.
2. The cotton bag has a mass of 10.01 pounds, while the steel bag has a mass of 9.99 pounds
3. While steel is much denser than cotton (meaning it takes up less space per pound), this doesn't affect the weight - a pound is a pound regardless of material
4. 10.01 is greater than 9.99 by 0.02 pounds
Therefore, the 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The cotton may take up much more space, but when we're just comparing weights, density doesn't matter - only the mass does.
Ah, good catch -- I had just copied the OP. Changing it to bag instead of back doesn't seem to change much. It still oscillates between the right and wrong answer, versus qualifying the 10.01 value as pounds, which has not yet given me the wrong answer (small sample size, of course).
As long as an LLM is capable of inserting "9.99 > 10.01?" into an evaluation tool, we're on a good way.
It feels a bit like "if all you have is a hammer, everything looks like a nail", where we're trying to make LLMs do stuff which it isn't really designed to do.
Why don't we just limit LLMs to be an interface to use other tools (in a much more human way) and train them to be excellent at using tools. It would also make them more energy efficient.
But it's OK if we currently try to make them do as much as possible, not only to check where the limits are, but also to gain experience in developing them and for other reasons. We just shouldn't expect them to be really intelligent.
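For what it's worth, something along these lines already exists in the form of function calling. A minimal sketch, assuming the openai Python client's chat-completions tools interface; the tool name and schema here are made up for illustration, and the model may or may not decide to call it:

```python
# Expose a trivial comparison "tool" and let the model decide to call it
# instead of doing the arithmetic itself.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "compare_numbers",  # hypothetical tool name
        "description": "Return which of two numbers is larger.",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content":
               "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model chose to use the tool
    args = json.loads(msg.tool_calls[0].function.arguments)
    print("model asked to compare:", args, "->", max(args["a"], args["b"]))
else:               # it answered directly in prose
    print(msg.content)
```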
> As long as an LLM is capable of inserting "9.99 > 10.01?" into an evaluation tool, we're on a good way
chatgpt will switch to python for some arithmetic, with the result that you get floating point math issues when an 8yo would get the result right. I think "switch to a tool" still requires understanding of which tool to use to get a reliable result, which in turn means understanding the problem. It's an interesting issue.
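The kind of mismatch meant here, in a couple of lines (plain Python, not a transcript of ChatGPT's code tool):

```python
# Naive floating-point arithmetic can "disagree" with grade-school arithmetic...
print(0.1 + 0.2)           # 0.30000000000000004
print(0.1 + 0.2 == 0.3)    # False

# ...while the comparison from this thread is unaffected:
print(9.99 > 10.01)        # False
```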
* What’s heavier? 9.99 kg of super dense lead or 10 kg fluffy feathers?*
Let me think about this step by step:
1. We're comparing two masses:
* 9.99 kg of lead
* 10 kg of feathers
2. The density of the materials doesn't affect their mass:
* While lead is indeed much denser than feathers
* A kilogram is a unit of mass, not volume
* 1 kg of any material has the same mass as 1 kg of any other material
3. We can directly compare the numbers:
* 10 kg > 9.99 kg
Therefore, 10 kg of feathers is heavier than 9.99 kg of lead. The feathers would take up much more space (volume) due to being less dense, but they have a greater mass.
Are you sure you weren't fishing? I ran 5 sessions and never got the wrong answer. All using gpt 4o-mini, which is the default non logged in experience on chatgpt.com.
1. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. Despite the difference in material density, the key factor here is the weight itself, with 10.01 pounds being greater than 9.99 pounds, regardless of the substance.
2. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The weight of an object is directly determined by its mass, and in this case, the cotton bag weighs 10.01 pounds, while the steel ingot bag weighs 9.99 pounds, making the cotton bag heavier by 0.02 pounds.
3. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots, even though the cotton is much less dense. The comparison is purely based on weight, and 10.01 pounds is greater than 9.99 pounds.
4. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. While steel is much denser than cotton, the weight is the determining factor here, and 10.01 pounds is slightly more than 9.99 pounds. So, the cotton bag weighs more.
5. The 10.01-pound bag of fluffy cotton is heavier. Despite the cotton being less dense and bulkier, the weight of the cotton is still 10.01 pounds, which is greater than the 9.99-pound bag of steel ingots.
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots because 10.01 pounds is greater than 9.99 pounds. The material doesn't matter in this case; weight is the deciding factor.
Step-by-Step Reasoning
1. Clarify what is being compared
The question asks us to compare two bags: one containing steel ingots weighing 9.99 pounds and another containing fluffy cotton weighing 10.01 pounds.
2. Check the actual weight
Despite our intuition that steel is “heavier” than cotton by density, the problem specifically states the total weight in pounds of each bag. Since 9.99 < 10.01, the 10.01-pound bag of cotton is heavier in total mass than the 9.99-pound bag of steel.
3. No trick in the measurement
Each weight is given in the same units (pounds). This is crucial because if different units or measurement standards were used (e.g., 9.99 kilograms versus 10.01 pounds), we would need to convert them to the same unit to compare. But here, both are simply in pounds.
4. Conclusion
The bag of cotton weighing 10.01 pounds is heavier than the bag of steel weighing 9.99 pounds—even though steel is denser than cotton. The numeric value of the total weight is the deciding factor.
Answer
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots, because 10.01 pounds is strictly greater than 9.99 pounds.
Reference
• Basic arithmetic: 10.01 is greater than 9.99.
• For a playful twist on a similar concept, see any version of the riddle “What weighs more—a pound of feathers or a pound of lead?” In that classic riddle, both weigh the same; here, the numbers differ.
"To determine which is heavier, we need to compare the weights of the two objects.
The steel ingots weigh 9.99 pounds, and the fluffy cotton weighs 10.01 pounds.
Since 10.01 pounds is greater than 9.99 pounds, the bag of fluffy cotton is heavier than the bag of steel ingots, despite the cotton being "fluffy" and the steel being dense. Weight is a measure of the amount of mass in an object, and in this case, the cotton has a slightly greater mass than the steel."
IMO the fuzziness is actually a feature most of the time b/c I can pass misspelled words or close enough words and it'll still figure it out.
Also, if we model the mental state of the LLM as a frazzled retail worker dealing with thousands of customers per second, the rote response is reasonable. As a dev, sometimes I get annoyed at QA for a hyper narrow "trap" test case.
It seems more like ChatGPT was asked a rather bizarre question with far too little detail to make sense, and ChatGPT failed to notice or to ask for more information. Although it did get rather impressively confused about the pressure of the air.
I mean that ChatGPT had no questions about the container of the gas (does it have weight? is it tared out?) or about buoyancy. And it’s really rather sad that ChatGPT calculated the volume of air at atmospheric temperature and pressure and didn’t notice that it was nowhere near fitting into its supposed container. (At least 1.01lb of liquid nitrogen would fit.)
If you define a new universe with physical laws that are different from our own, performance will also suffer because there is no training data for that custom universe. While some may say this is cheating, a human would be able to deduce correct answers by understanding the deep implications of the new physical laws in that custom universe.
prompt> Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton? Please state in your answer what's the difference in grams.
answer> The 10.01-pound bag of cotton is heavier. The difference is 0.02 pounds, which is roughly 9.07 grams (using 1 pound ≈ 453.59237 grams).
Reference
• National Institute of Standards and Technology (NIST): Conversion Factors
Still waiting for some reasoning about an upward buoyancy force on the bags, which is equal to the weight of the air displaced by the bags, and the force acting on the cotton is bigger simply because the bag of cotton displaces more air (because it "appears larger due to its fluffy nature".)
Deepseek got it right:
"A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound pack of steel ingots. Even though steel is denser and takes up much less space, the weight is determined by the mass, and 10.01 pounds is greater than 9.99 pounds."
fwiw i think reasoning models have at least solved this. even the smallest reasoning model, o1-mini, gets it right first try on my test:
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots. Despite the significant difference in density and volume between steel and cotton, the weights provided clearly indicate that the cotton bag has a greater mass.
Summary:
Steel ingots: 9.99 pounds
Fluffy cotton: 10.01 pounds
Conclusion: The 10.01-pound bag of cotton is heavier.
An ounce of gold is heavier than an ounce of feathers, because the "ounce of gold" is a troy ounce, and the "ounce of feathers" is an avoirdupois ounce. But that shouldn't be true between feathers and bricks - they're both avoirdupois.
Note that a pound of gold is 12 troy ounces, while a pound of feathers is 16 avoirdupois ounces. So one of my favorite pedantic bits of trivia is that an ounce of gold is heavier than an ounce of feathers but a pound of gold is lighter than a pound of feathers.
Pounds are a unit of weight, not of mass. 10 lbs of feathers is whatever amount of feathers causes a scale to display 10 lbs. If the scale also displays 10 lbs for the quantity of bricks, then they weigh the same, regardless of any differences in mass.
Is this still true? I thought pounds are now defined in terms of kilograms (about 0.453)? Because kilograms are definitely a unit of mass, not weight. Or is the pound defined as some amount of kilograms at a specific point on earth, in a specific phase of the moon?
It seems the pound has since been redefined and split into separate units: pound mass and pound force, the former in terms of kilograms (1 lb = 0.45359237 kg) and the latter in terms of the force exerted by one pound of mass in earth’s gravitational field (standard g = 9.80665m/s^2).
So using the word pound without qualification is ambiguous in contexts where it’s not clear whether mass or force is meant.
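Plugging in the numbers above (both constants are the standard definitions quoted in this comment):

```python
# One pound-force expressed in newtons.
LBM_IN_KG = 0.45359237     # 1 lb (mass) in kg, exact by definition
STANDARD_G = 9.80665       # standard gravity, m/s^2

print(f"1 lbf = {LBM_IN_KG * STANDARD_G:.5f} N")   # ~4.44822 N
```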
According to the dictionary, "heavier" can refer to weight or density. In their typical form, bricks are heavier (more dense) than feathers. But one should not make assumptions before answering the question. It is, as written, unanswerable without followup questions.