I remember when this stuff was all coming out and people were finally excited about ChatGPT getting the "which is heavier, a 10 pound bag of feathers or a 10 pound bag of bricks?" problem correct. But of course it got it correct. It was in the training set. Vary the problem slightly by just changing the nouns, or changing the numbers so that one in fact was heavier than the other, and performance went all over the map.
I just went to chatgpt.com and put into the chat box "Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton?", and the very first answer I got (that is, I didn't go fishing here) was
The 9.99-pound bag of steel ingots is heavier than the 10.01-pound
bag of fluffy cotton by a small margin. Although the cotton may
appear larger due to its fluffy nature, the steel ingots are denser
and the weight of the steel bag is 9.99 pounds compared to the 10.01
pounds of cotton. So, the fluffy cotton weighs just a tiny bit more
than the steel ingots.
Which, despite getting it both right and wrong, must still be graded as a "fail".
If you want to analyze these things for their true capability, you need to make sure you're out of the training set... and most of the things that leap to your mind in 5 seconds are leaping to your mind precisely because they are either something you've seen quite often or something that you can easily think of and therefore many other people have easily thought of them as well. Get off the beaten path a bit and the math gets much less impressive.
ChatGPT Plus user here. The following are all fresh sessions and first answers, no fishing.
GPT 4:
The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The type of material doesn’t affect the weight comparison; it’s purely a matter of which bag weighs more on the scale.
GPT 4o:
The 10.01-pound bag of fluffy cotton is heavier. Weight is independent of the material, so the bag of cotton’s 10.01 pounds outweighs the steel ingots’ 9.99 pounds.
GPT o1:
Since both weights are measured on the same scale (pounds), the 10.01-pound bag of cotton is heavier than the 9.99-pound bag of steel, despite steel being denser. The key is simply that 10.01 pounds exceeds 9.99 pounds—density doesn’t affect the total weight in this comparison.
So do what the commenter suggests and make irrelevant permutations to the input to find when it fails, i.e., engage in hypothesis testing rather than confirmation bias.
If a system has the capability to solve problems composed of {part_1, ..., part_n}, then it only has that capability if irrelevant permutations {part_1, part_2', ..., part_n} make no difference.
It's very obvious that such permutations can destroy such apparent capabilities.
I've just tested a number of permutations with Claude 3.5 Sonnet. It correctly answered all variants I tried on the first attempt, as follows:
Which is heavier, a 9.99 kilogram tungsten cube or a 10.01 kilogram block of aerogel?
Which is heavier, 10,000 steel balls weighing 0.999 grams each or 10,000 polystyrene balls weighing 1.001 grams each?
Which is heavier, a 10.01kg block of steel on Venus or a 9.99kg bag of feathers on Earth?
Which is heavier, a 10cm^3 block of steel or a 100cm^3 block of balsa wood?
Which is heavier, a golf ball made of steel or a baseball made of lithium?
In all cases, Claude clearly used CoT and reasoned out the problem in full. I would be interested in seeing if anyone can find any variant of this problem that stumps any of the leading LLMs. I'm bored of trying.
Hey, ChatGPT please write me a python program which randomly samples from various materials and various weights then poses a problem to the ChatGPT 4o API -- the goal is to find cases where the LLM fails to obtain the correct answer....
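A rough sketch of what such a harness might look like (not OP's actual script): pair random materials and weights, ask the model which is heavier, and crudely grade the reply. It assumes the openai Python client (>= 1.0) with an OPENAI_API_KEY in the environment; the material list, weight range, and grading heuristic are made up for illustration.

```python
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MATERIALS = ["steel ingots", "fluffy cotton", "tungsten cubes", "aerogel", "goose feathers"]

def one_trial():
    a, b = random.sample(MATERIALS, 2)
    wa = round(random.uniform(9.5, 10.5), 2)
    wb = round(random.uniform(9.5, 10.5), 2)
    if wa == wb:
        return None  # skip exact ties; the interesting cases are near-misses
    question = f"Which is heavier, a {wa}-pound bag of {a} or a {wb}-pound bag of {b}?"
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
    heavier = a if wa > wb else b
    # Crude grading: a pass requires the reply to name the heavier material at all.
    return question, reply, heavier.split()[-1] in reply

for result in filter(None, (one_trial() for _ in range(50))):
    question, reply, ok = result
    if not ok:
        print("Possible failure:\n", question, "\n", reply, "\n")
```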
Which weighs more on earth: 100kg of steel or a 100kg polystyrene ball? Be as technically correct as possible
A:
Technically, 100 kg of steel and 100 kg of polystyrene weigh the same on Earth when measured on a scale, as they both experience the same gravitational force due to their equal mass. However, there are some nuances to consider:
Buoyancy Effect
Earth’s atmosphere exerts an upward buoyant force on objects, proportional to the volume of the object and the density of the surrounding air. Since polystyrene has a much lower density than steel, a 100 kg polystyrene ball will have a significantly larger volume and experience a greater buoyant force compared to 100 kg of steel.
As a result:
• Effective weight (apparent weight): The steel will weigh slightly more than the polystyrene on a scale because the buoyant force acting on the polystyrene reduces its apparent weight more.
Actual Weight
Weight is the gravitational force on an object and is given by:
W = m·g, where m = 100 kg and g ≈ 9.81 m/s². Without considering buoyancy, both objects have the same weight of approximately 981 N (newtons).
Conclusion
• True weight (in a vacuum): Both are identical at 981 N.
• Apparent weight (in air): Steel is slightly heavier due to reduced buoyant force acting on it compared to the polystyrene ball.
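A quick back-of-the-envelope check of that buoyancy claim (a rough sketch; air at ~1.2 kg/m³, steel at ~8000 kg/m³, and expanded polystyrene at ~30 kg/m³ are assumed densities I chose for illustration, and the exact values only change the size of the gap, not its direction):

```python
# Apparent (scale) weight of 100 kg of steel vs a 100 kg polystyrene ball in air.
RHO_AIR, RHO_STEEL, RHO_EPS = 1.2, 8000.0, 30.0  # kg/m^3, assumed densities
G, MASS = 9.81, 100.0                            # m/s^2, kg

def apparent_weight(mass_kg, density):
    volume = mass_kg / density                   # m^3 of material
    buoyant_force = RHO_AIR * volume * G         # weight of displaced air, N
    return mass_kg * G - buoyant_force           # what a scale in air reads, N

w_steel = apparent_weight(MASS, RHO_STEEL)
w_foam = apparent_weight(MASS, RHO_EPS)
print(f"steel: {w_steel:.1f} N, polystyrene: {w_foam:.1f} N")
print(f"gap: {(w_steel - w_foam) / G:.2f} kg-equivalent")  # roughly 4 kg
```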
Yup, the models are smart, but are trained to follow standard human patterns for this type of question. And even on Hacker News the vast majority will not think that they would need to correct for buoyancy when actually attempting the experiment in standard conditions.
They very often get popular "tricky" questions wrong because they saw it so many times that they switch from internal reasoning to memorization/retrieval.
I tried "Which is heavier, a 10.01kg block of steel on or a 9.99kg bag of feathers?" three times, and ChatGPT kept converting kg to pounds and saying the 9.99kg bag is heavier.
Also couldn't get o1 to fail. I tried the following with o1:
"Which is heavier, a solid titanium box weighing 9.9 flubs, or an empty wooden box weighing 10.1 blobs, where 1 flub = 1 kg, and 1 blob is 1kg".
The answer:
"Since 1 flub = 1 kg and 1 blob = 1 kg, the titanium box’s mass is 9.9 kg and the wooden box’s mass is 10.1 kg. Therefore, the wooden box (10.1 kg) is heavier."
I found that the models consistently ignore buoyancy (objects with the same mass but different density can have different weight as measured by a scale).
If GP's hypothesis was "it fails for small variations of the input, like this one", then testing that hypothesis with that exact variation on a couple models seems fair and scientific.
Testing it with more variations until one fails feels a bit like p-hacking. You'd need to engage in actual statistics to get reliable results from that, beyond "If I really try, I can make it fail". Which would be a completely different hypothesis than the one presented at the start
I feel like I'm almost 100% certain that the smart guys at OpenAI have added many more variations of the problem to their training set since OP did his failing test, so it doesn't surprise me at all to know that this exact one now passes.
In fact, in my use of o1 it's incredibly clear that it still has the same problems. It's incredibly common that the second I ask for something even slightly outside the training set, it's more likely to "round" to some wrong solution in the training set, rather than use any sort of human-like reasoning to figure out the right answer (often the right answer isn't hard to get, just not found in a Google search).
I asked "Which is louder, a 12.5 decibel scream, or a 12.8 decibel whisper?", which I wouldn't think has much of a presence in its training set, and it answered correctly.
May still get it wrong in more subtle ways, though. Personally, I think it'll continue to get physics wrong until someone builds it some robot arms so it can train on actually interactive physical spaces and behavior.
Except that if the model genuinely was reasoning about the problem, you could test it with every variation of materials and weights in the world and it would pass. Failing that problem at all in any way under any conditions is a failure of reasoning.
By that logic, humans can't genuinely reason, because they're often fooled by counter-intuitive problems like Monty Hall or the Birthday Problem, or sometimes just make mistakes on trivial problems.
I’m not really sure what you’re trying to say here - that LLMs don’t work like human brains? We don’t need to conduct any analyses to know that LLMs don’t “know” anything in the way humans “know” things because we know how LLMs work. That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.
>We don’t need to conduct any analyses to know that LLMs don’t “know” anything in the way humans “know” things because we know how LLMs work.
People, including around HN, constantly argue (or at least phrase their arguments) as if they believed that LLMs do, in fact, possess such "knowledge". This very comment chain exists because people are trying to defend against a trivial example refuting the point - as if there were a reason to try.
> That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.
I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.
It boggles my mind that I have to explain to people that sophisticated use of language doesn't inherently evidence thought, in the current political environment where the Dead Internet Theory is taken seriously, elections are shown over and over again to be more about tribalism and personal identity than anything to do with policy, etc.
You don't have to listen to or engage with those people though, just ignore 'em. People say all kinds of things on the Internet. It's completely futile to try to argue with or "correct" them all.
> I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.
According to whom? There is certainly no single definition of intelligence, but most people who have studied it (psychologists, in the main) view intelligence as a descriptor of the capabilities of a system - e.g., it can solve problems, it can answer questions correctly, etc. (This is why we call some computer systems "artificially" intelligent.) It seems pretty clear that you're confusing intelligence with the internal processes of a system (e.g. mind, consciousness - "knowing things in the way that humans do").
We are pretty certain that humans can reason, yet they are sometimes wrong. Even if you give them the same problem over and over again with slight variations.
LLMs get things wrong due to different factors than humans (humans lose focus, LLMs have randomness applied when sampling their responses to improve results). But clearly we have to choose a goal somewhat below 100% if we want a test that doesn't conclude that humans are incapable of reasoning.
The difference is we _know_ that LLMs are fancy stochastic models, we don't know that they're capable of reasoning, and the null hypothesis is that they're not (because we know what they _are_ - we built them) - any "reasoning" is an emergent property of the system, not something we built them to do. In that case, evidence they're not reasoning - evidence they're stochastic parrots doing a performance of reasoning - weighs heavier, because the performance of reasoning fits into what we know they can do, whereas genuine reasoning would be something new to the model.
There's deeper philosophical questions about what reasoning actually _is_, and LLMs have made those sharper, because they've shown it's clearly possible for a complex statistical model to generate words that look like reasoning, but the question is whether there's a difference between what they're doing and what humans are doing, and evidence that they're _not_ reasoning - evidence that they're just generating words in specific orders - weighs heavily against them.
We haven't coded LLMs to be stochastic models, we coded them to predict text with any method gradient descent finds on a transformer architecture. That's not exactly the same.
But more importantly, if you want to show that LLMs can't reason you obviously have to use a test that when applied to humans would show that humans can reason. Otherwise your test isn't testing reasoning but something more strict.
It's not p-hacking; he's right. You're both right. First test the same prompt on different versions, then the ones that got it right go to the next round with variations on the prompt.
We aren't testing whether the model's results are stable or correct for a given class of problem. The goal is to establish whether the model can reason.
Nothing capable of reasoning would contradict itself so blatantly and in such a short span while failing to indicate any kind of uncertainty.
> The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Interestingly, that was the best answer out of all models and also a correct one. O1-preview doesn't allow setting the temperature, so it's hard to know where the real probabilities fall.
I asked ChatGPT-4 mini the previous comment's question ("Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton?") in new incognito tabs until I got an incorrect result. It only took four tries.
1: "The 10.01-pound bag of fluffy cotton would be heavier. Even though cotton is much less dense than steel, the weight of the cotton bag is 10.01 pounds, while the steel ingot bag weighs 9.99 pounds, making the cotton bag the heavier of the two by 0.02 pounds."
2: "The 10.01-pound bag of fluffy cotton is heavier. Even though the steel ingots are denser and more compact, the cotton weighs more at 10.01 pounds compared to the steel ingots' 9.99 pounds. The difference is only 0.02 pounds, but in terms of weight, the cotton bag is heavier."
3: "The 10.01-pound bag of fluffy cotton is heavier. Despite cotton being less dense than steel, the weight is determined by the total mass, not the material's density. So, since the cotton bag weighs 10.01 pounds and the steel ingot bag weighs 9.99 pounds, the cotton bag is heavier by 0.02 pounds."
4: "The 9.99-pound bag of steel ingots is heavier. It weighs 9.99 pounds, while the bag of fluffy cotton weighs 10.01 pounds. Despite the cotton being fluffier and taking up more space, the steel ingots are still lighter by 0.02 pounds."
To determine which bag is heavier when measured in air—the 9.99-pound bag of steel ingots or the 10.01-pound bag of fluffy cotton—we need to consider not just their masses but also the effect of air buoyancy on their apparent weights.
*Step 1: Calculate the Masses*
First, convert the given weights into masses using the conversion \(1 \text{ pound} = 0.453592 \text{ kg}\):
- Mass of steel bag, \(m_s = 9.99 \, \text{lb} \times 0.453592 \, \text{kg/lb} \approx 4.5310 \, \text{kg}\)
- Mass of cotton bag, \(m_c = 10.01 \, \text{lb} \times 0.453592 \, \text{kg/lb} \approx 4.5401 \, \text{kg}\)
*Step 2: Calculate the Volumes*
Next, calculate their volumes using the densities:
- Density of steel, \(\rho_s \approx 8000 \, \text{kg/m}^3\)
- Density of cotton (fluffy), \(\rho_c \approx 50 \, \text{kg/m}^3\)
Despite the cotton bag having slightly more mass and a greater weight in a vacuum, when measured in air, the steel bag is heavier due to the significantly smaller upward buoyant force acting on it compared to the cotton bag. This means that on a scale in air, the 9.99-pound bag of steel ingots will weigh more than the 10.01-pound bag of fluffy cotton.
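A quick numerical sanity check of that conclusion, using the answer's own density figures for steel and fluffy cotton (a sketch only; the ~1.2 kg/m³ air density is my assumption):

```python
# Does buoyancy really flip the 9.99 lb steel vs 10.01 lb cotton comparison?
LB = 0.453592                                       # kg per pound
RHO_AIR, RHO_STEEL, RHO_COTTON = 1.2, 8000.0, 50.0  # kg/m^3

m_steel, m_cotton = 9.99 * LB, 10.01 * LB           # masses in kg

def scale_reading(mass_kg, density):
    displaced_air = RHO_AIR * (mass_kg / density)   # mass of displaced air, kg
    return mass_kg - displaced_air                  # what a scale in air reads, kg

print(f"steel:  {scale_reading(m_steel, RHO_STEEL):.4f} kg")    # ~4.5307
print(f"cotton: {scale_reading(m_cotton, RHO_COTTON):.4f} kg")  # ~4.4315
# The ~0.11 kg of air displaced by the cotton swamps the ~0.009 kg mass gap,
# so the steel bag does read heavier on a scale in air.
```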
That appears to be the same model I used. This is why I emphasized I didn't "go shopping" for a result. That was the first result I got.
I'm not at all surprised that it will nondeterministically get it correct sometimes. But if it doesn't get it correct every time, it doesn't "know".
(In fact "going shopping" for errors would still even be fair. It should be correct all the time if it "knows". But it would be different if I was fishing over and over and over and finally got one, versus the first time I asked.)
Edit: It appears it isn't the model I used. The point holds, though, you need to make sure you're off the training set for it to matter. This isn't a "ChatGPT can't do that" post as some are saying, it's more a "you aren't asking what you think you're asking" post.
You get the same problem in a human context in things like code interviews. If you ask an interviewee the exact question "how do you traverse a binary tree in a depth-first manner", you aren't really learning much about the interviewee. It's a bad interview question. You need to get at least a bit off the beaten trail to do any sort of real analysis.
I believe this is just a case of OpenAI's naming scheme being weird and confusing.
The default model I see on chatgpt.com is GPT 4o-mini, which is not o1-mini.
OpenAI describes GPT 4o-mini as "Our fast, affordable small model for focused tasks" and o1/o1-mini as "Reasoning models that excel at complex, multi-step tasks".
Could you share the exact chat you used for when it failed? There is a share chat button on openai.
It's very difficult to be an AI bull when the goalposts are moving so quickly that an AI answering correctly across multiple models is brushed off as 'nondeterministically getting it correct sometimes'.
A single-purpose state machine not failing to do the single thing it was created to do does not make for the clever retort you think it makes.
"AGI": emphasis on "G" for "General". The LLMs are not failing to do generalized tasks, and that they are nondeterministic is not a bug. Just don't use them for calculating sales tax. You wouldn't hire a human to calculate sales tax in their head, so why do you make this a requirement in order to call an LLM "AGI"?
I wonder when the goalposts will stop moving from "We have superhuman intelligences which are able to rather reliably converse in many languages, do generalized tasks and automate operations we thought were impossible to automate 3 years ago" (and by the way, this is what we have TODAY), all the way to "It's not AGI unless it's an omnipotent god that knows how to turn water into wine and calculate the applicable sales tax of that operation".
I do not get it, where are all the serious people? Just come out and say what it is and what it is not. Pretend for a moment you have the ethics and work philosophy of an undergraduate science student.
> You wouldn't hire a human to calculate sales tax in their head
Everyone did that 60 years ago; humans are very capable of learning and doing that. Humans built jet planes, skyscrapers, missiles, tanks, carriers without the help of electronic computers.
Yeah... They used slide rules and vast lookup tables of function values printed on dead trees. For the highest value work, they painstakingly built analog calculators. They very carefully checked their work, because it was easy to make a mistake when composing operations.
Humans did those things by designing failsafe processes, and practicing the hell out of them. What we would likely consider overfitting in the LLM training context.
While this may be true, it's a very common problem that people who want to demonstrate how bad a model is fail to provide a direct link or simply state the name of the model.
Yes it does, at the top of every chat there is a drop-down to select the model, which displays the current model. It's been a constant part of the UI since forever.
Logged in, non paid account, on a desktop, for me, it's exactly as the person you're replying to has stated.
If I log out, it shows 4o mini, and when I try to change it, it asks me to login or sign in rather than giving me any options.
When I use ChatGPT enough while logged in, it gives me some nebulous "you've used all your xyz tokens for the day". But other than that there is no real signal to me that I'm getting a degraded experience.
> With a free account the model it claims to be using is "4o auto", which is not a model but apparently an attempt to automatically decide models for you to be more cost effective.
I usually test models using the OpenAI API which doesn't offer links the way I think you mean. If I provide some output I got from a particular model you're just going to have to take my word for it.
you sure? i just asked o1-mini (not 4o mini) 5 times in a row (new chats obviously) and it got it right every time
perhaps you stumbled on a rarer case, but reading the logs you posted, this sounds more like a 4o model than an o1, since it's doing its thinking in the chat itself; plus, the procedure you described would probably get you 4o-mini
> What is heavier a liter of bricks or a liter of feathers?
>> A liter of bricks and a liter of feathers both weigh the same—1 kilogram—since they each have a volume of 1 liter. However, bricks are much denser than feathers, so the bricks will take up much less space compared to the large volume of feathers needed to make up 1 liter. The difference is in how compactly the materials are packed, but in terms of weight, they are identical.
Hmm... Gemini (1.5 Flash) just aced that exact question for me:
These lines celebrate the victory of the British ship HMS Victory, led by the famous Admiral Lord Nelson, in the Battle of Trafalgar in 1805.
"Here's success unto the Victory": This line directly praises the ship itself, acknowledging its role in the successful battle.
"and crew of noble fame": This recognizes the bravery and skill of the sailors who served aboard the Victory.
"and glory to the captain": This line specifically honors Admiral Nelson, the captain of the Victory, for his leadership and strategic brilliance in the battle.
"bold Nelson was his name": This emphasizes Nelson's courage and daring, which were legendary.
The lines express admiration for the ship, its crew, and most importantly, Admiral Nelson, who became a national hero in Britain for his victory at Trafalgar.
Well, the rest of the song helps, in that it specifies that (1) the toast upset the wedding, and (2) the singer responded to that by insulting "you", which is presumably one or more of the bride, the groom, and the guests.
But I think specifying that the singer has crashed his ex-girlfriend's wedding is already enough that you deserve to fail if your answer is "he says he's not upset, so what he means is that he's not upset". It's not any kind of leap to guess that the bride's ex-boyfriend's toast might cause a scene at a wedding - that's why the bride's ex-boyfriends are never invited.
(The question has already provided every word of the toast that appears in the song.)
See also the sidethread comment by mikeruiz, noting that o1-pro reproduces the rest of the lyrics to The Victory, but gets the question wrong anyway.
Nah, intermittent failures are apparently enough to provide evidence that an entire class of entities is incapable of reason. So I think we've figured this one out...
Well, I proved that he's happy to express an opinion on whether an answer to a question is correct regardless of whether he knows anything about the question. I wouldn't trust advice from him or expect his work output to stand up to scrutiny.
i'd prefer an easily verifiable question rather than one where we can always go "no that's not what they really meant" but someone else with o1-mini quota can respond
“They’re toasting Admiral Nelson’s ship (HMS Victory) and its valiant crew, hailing the ship’s successes and Nelson’s heroism. In other words, the singer is offering tribute—“success unto the Victory”—to the vessel and its famed sailors, and “glory to the captain” who led them, namely the celebrated Admiral Horatio Nelson.”
…but to your point, no idea if the artist intended some more obscure reference.
o1-pro was also able to produce a relatively complete version of the original source, though, amusingly, it referred to it as a 'broadside' rather than 'broadsheet'. Appropriate given the context!
> no idea if the artist intended some more obscure reference.
No, but the answer is obviously wrong. Very, very wrong. The lines are not praise. The song's view of the Victory, and of the battle of Trafalgar, is overwhelmingly negative.
> o1-pro was also able to produce a relatively complete version of original source
...as should be clear, if you read the original source. (Or, heck, just the two lines following the ones in the question.) Being able to reproduce the source makes the error look that much worse.
> though, amusingly, referred to it as a ‘broadside’ rather than ‘broadsheet’.
>> A broadside (also known as a broadsheet) is a single sheet of inexpensive paper printed on one side, often with a ballad, rhyme, news and sometimes with woodcut illustrations.
i think it's a bit tricky; the surface meaning is extremely praiseworthy, and some portion of readers might interpret it as someone who has praise for Admiral Nelson but hates the press gangs.
of course, it is a sardonic, implicit critique of Admiral Nelson/the victory, etc. but i do think it is a bit subtle.
It got it right, but an interesting result that it rambled on about monetary value for... no reason.
> While the lint bag is heavier in terms of weight, it's worth mentioning that gold is significantly more valuable per pound compared to lint. This means that even though the lint bag weighs more, the gold bag holds much greater monetary value.
I wrote a script and pounded o1-mini and GPT-4 with a wide variety of temperature and top_p parameters, and was unable to get it to give the wrong answer a single time.
Just a whole bunch of:
(openai-example-py3.12) <redacted>:~/code/openAiAPI$ python3 featherOrSteel.py
Response 1: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 2: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 3: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 4: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 5: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 6: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 7: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 8: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 9: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 10: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
All responses collected and saved to 'responses.txt'.
Script with one example set of params:
import openai
import time
import random

# Replace with your actual OpenAI API key
openai.api_key = "your-api-key"

# The question to be asked
question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"

# Number of times to ask the question
num_requests = 10

responses = []

for i in range(num_requests):
    try:
        # Generate a unique context using a random number or timestamp, this is to prevent prompt caching
        random_context = f"Request ID: {random.randint(1, 100000)} Timestamp: {time.time()}"

        # Call the Chat API with the random context added
        response = openai.ChatCompletion.create(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system", "content": f"You are a creative and imaginative assistant. {random_context}"},
                {"role": "user", "content": question}
            ],
            temperature=2.0,
            top_p=0.5,
            max_tokens=100,
            frequency_penalty=0.0,
            presence_penalty=0.0
        )

        # Extract and store the response text
        answer = response.choices[0].message["content"].strip()
        responses.append(answer)

        # Print progress
        print(f"Response {i+1}: {answer}")

        # Optional delay to avoid hitting rate limits
        time.sleep(1)

    except Exception as e:
        print(f"An error occurred on iteration {i+1}: {e}")

# Save responses to a file for analysis
with open("responses.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(responses))

print("All responses collected and saved to 'responses.txt'.")
Down voted for not actually countering the argument in question? The script doesn't alter the phrasing of the question itself. It just generates a randomized, irrelevant preamble.
Well, I understood the argument in question to be: was it possible for the model to be fooled by this question, not was it possible to prompt engineer it into failure.
The parameter space I was exploring, then, was the different decoding parameters available during the invocation of the model, with the thesis that if it were possible for the model to generate an incorrect answer to the question, I would be able to replicate it by tweaking the decoding parameters to be more "loose" while increasing sample size. By jacking up temperature while lowering top_p, we see the biggest variation of responses, and if there were an incorrect response to be found, I would have expected to see it in the few hundred times I ran during my parameter search.
If you think you can fool it by slight variations on the wording of the problem, I would encourage you to perform a similar experiment as mine and prove me wrong =P
We use variations on questions that are well known for duping inattentive humans, to test a system that we expect a priori to be incapable of such inattention.
Unless "getting easy things wrong sometimes" is an inherent property of intelligence, we should expect that a properly "intelligent" computerized system would never err on problems far below its level of comprehension - unless we had some reason to believe it "wanted to", and as of yet I see no reason to believe this is even possible in principle.
Humans err, broadly speaking, for two reasons: genuinely reaching the limits of their comprehension, or trusting "system 1" (in Kahneman's analysis) too much.
Because for things like the Putnam questions, we are trying to get the performance of a smart human. Are LLMs just stochastic parrots or are they capable of drawing new, meaningful inferences? We keep getting more and more evidence of the latter, but things like this throw that into question.
I would agree if we weren't starting with LLMs for a baseline. The first AGI will know at least as much as LLMs, IMO, and that's already not-stupid. Especially once they can separate out the truth in their training.
This happens literally every time. Someone always says "ChatGPT can't do this!", but then when someone actually runs the example, chatGPT gets it right. Now what the OP is going to do next is proceed to move goalposts and say like "but umm I just asked chatgpt this, so clearly they modified the code in realtime to get the answer right"
Prompt: In the Netherlands, in terms of drinks, is there a particular spirit that represents the country?
> Yes, in the Netherlands, jenever (also known as genever) is the traditional spirit that represents the country. Jenever is a type of Dutch gin that has a distinctive flavor, often made from malt wine and flavored with juniper berries. It has a long history in the Netherlands, dating back to the 16th century, and is considered the precursor to modern gin.
> Jenever comes in two main types: oude (old) jenever, which has a richer, maltier flavor and is often aged in wooden casks, and jonge (young) jenever, which is lighter and has a more neutral taste. Both types can be enjoyed straight or used in cocktails.
> In addition to jenever, the Netherlands is also known for its liqueurs, such as advocaat, a rich and creamy drink made from eggs, sugar, and brandy, often enjoyed as a dessert or in cocktails. However, jenever remains the most iconic spirit associated with Dutch culture.
This is completely wrong. Jenever certainly is very Dutch, but no one would say it is iconic as the Dutch spirit. For example, if you asked up north in Friesland, they would say Berenburg.
This happens literally every time. Someone always says "ChatGPT can do this!", but then within one or two prompts, its gets it wrong.
"This is completely wrong."
Is it completely wrong? Or maybe "somewhat" wrong? Maybe just lacking nuance?
I know nothing about the answer to this question, so this is an honest question.
Using just a plain old search engine, for things like "national drink of the netherlands" and simlar queries, I am directed to Wikipedia's Jenever page as the top hit, and Wikipedia's list of national drinks lists Jenever and Heineken as the entries for the Netherlands. Search engines also give page after page of travel guides and blog posts, most of which list Jenever at or near the top of of their listings. One travel guide calls it "the most famous Dutch spirit and most famous Amsterdam liquor, Jenever, also spelled Genever or simply Dutch gin."
I am rather saying that there is no one national drink for The Netherlands, like a Frenchman would say wine, a German/Belgian would say beer, and a Scotsman would say whisky. Note that I prompted "In the Netherlands, in terms of drinks, is there a particular spirit that represents the country?" I didn't ask which spirit is consumed the most.
For example, France has been trending towards beer more and more, and within a few decades they might be consuming more beer than wine. But even then, the French wouldn't slowly start to say beer represents France.
Furthermore, "just adding some herbs" does a large disservice to the flavor change of Berenburg. Jenever (aka jonge/unaged jenever) is straight-up vile. I've heard it described by expats as "having the worst elements of both cheap gin and cheap whisky".
Berenburg in comparison is spicy and vanilla-y and actually debatably enjoyable.
Aged/oude jenever is much closer to Berenburg (or Berenburg to aged jenever), also with hints of vanilla and spices.
But, virtually no one except for dusty old men orders aged jenever. The kind ordered by far the most is jonge jenever, and then it's only in a sense of "haha let's drink this terrible thing" or "let's get shitfaced quick".
If o1 supposedly "oneshots every question", it should have been aware of these nuances instead of just confidently assigning jenever as 'the' spirit of the Dutch.
The question of whether there is a national drink seems to me to be entirely different than the question you asked the LLM "Prompt: In the Netherlands, in terms of drinks, is there a particular spirit that represents the country?"
The question in the prompt comes off to me as a sort of qualitative determination rather than asking about pure factual information (is there an officially designated spirit). As such I don't think it can necessarily be right or wrong.
Anyway, I'm not sure what you'd expect. In terms of acquisition of knowledge, LLMs fundamentally rely on a written corpus. Their knowledge of information that is passed through casual spoken conversation is limited. Sure, as human beings, we rely a great deal on the latter. But for an LLM to lack access to that information means that it's going to miss out on cultural nuances that are not widely expressed in writing. Much in the same way that a human adult can live in a foreign country for decades, speaking their adopted language quite fluently, but if they don't have kids of their own, they might be quite ignorant of that country's nursery rhymes and children's games, simply because they were never part of their acquired vocabulary and experience.
I was just proving the people wrong that were saying akin to that o1 was "oneshotting every question".
I completely understand from how LLMs work that they wouldn't be able to get this right. But then people shouldn't be proudly be pronouncing that o1 (or any model) is getting every question right, first time.
My conjecture is that you still haven't proven that it didn't get the answer "right"
I have opened the question of why you thought jenever was not jenever, and your non-responsiveness I think compels the fact that AI was more correct in your contrived instance.
If you add pear and spices to vodka, we call it liqueur and not pear-flavored vodka. So no, you are wrong. And the AI is wrong. But that is okay, if you want to enjoy leaning into the hype that's your choice.
So you believe they are incorrect because regionally some area would select something different because it represented that area. But your question asked nationally... is there a better answer than the one they gave? Were you expecting a no?
But, like Zahlman points out, it's a trick question, and instead of admitting it doesn't know or even prepending "I don't know for sure, but:", it just burps up its best-effort answer. There is no one spirit that represents The Netherlands. If an LLM is so good it "oneshots any question", it should realize it doesn't have a unanimous answer and tell me.
The point is that there is no correct national answer, because the locals don't see it as a matter of national identity.
What's expected is an ability to identify trick questions, i.e., to recognize fundamental problems in the phrasing of a question rather than trying to provide a "helpful" answer at all costs.
This corresponds to one of the many reasons LLM output is banned on Stack Overflow.
But what does this have to do with reasoning? Yes, LLMs are not knowledge bases, and seeing people treat them as such absolutely terrifies me. However, I don’t see how the fact that LLMs often hallucinate “facts” is relevant to a discussion about their reasoning capabilities.
Reasoning involves making accurate inferences based on the information provided in the current context, rather than recalling arbitrary facts from the training data.
Yes, that's what I said. The whole point of hallucinations is that they aren't "arbitrary facts recalled from the training data". They represent attempts to synthesize (i.e., infer) new facts. But because the inferences are not accurate, and because the synthesis process is not sound, the attempt cannot be called reasoning.
It is equally possible to "reason" about things you already know, as about things you've just been told. In fact, the capacity to speculatively, without prompting attempt such reasoning is a big part of cognition.
> Someone always says "ChatGPT can't do this!", but then when someone actually runs the example, chatGPT gets it right
I mean, if I had OpenAI’s resources I’d have a team tasked with monitoring social to debug trending fuck-ups. (Before that: add compute time to frequently-asked novel queries.)
I was thinking something very similar. Posting about a problem adds information back to the system, and every company selling model time for money has a vested interest in patching publicly visible holes.
This could even be automated; LLMs can sentiment-analyze social media posts to surface ones that are critical of LLM outputs, then automatically extract features of the post to change things about the running model to improve similar results with no intervention.
Similarly, in every thread there’s an AI skeptic who says LLMs are “useless” for coding, and never provides an example query for what they were trying.
If you ask about more niche language features or libraries, chatgpt will make up libraries or functions to fill the gap.
When asking an LLM to write a script for you, I would say 10 to 30 % of the time that it completely fails. Again, making up an API or just getting things straight up wrong.
It's very helpful, especially when starting from 0 with the beginner questions, but it fails in many scenarios.
FYI: If you do that without a subscription, you currently (most likely) get a response generated through 4o-mini — which is not any of their reasoning models (o1, o1-mini or previously o1-preview) of the branch discussed in the linked paper.
Notably, it's not even necessarily 4o, their premiere "non-reasoning"-model, but likely the cheaper variant: With a free account the model it claims to be using is "4o auto", which is not a model but apparently an attempt to automatically decide models for you to be more cost effective.
Without a ChatGPT subscription you can't select a specific model anymore, not even rate limited, as was previously possible.
There doesn't seem to be a way to choose a model up-front with a free account, but after you make a query you can click on the "regenerate" button and select whether to try again with "auto", 4o, or 4o-mini. At least until you use 4o too many times and get rate limited.
I just asked Claude 3.5 Sonnet, which appears to have improved its response with CoT but there's mistakes that demonstrate the model doesn't really "understand":
Q: A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy he says "I can't operate on this child, he is my son". How is this possible?
C: Let me think about this step by step:
A woman and her son are in an accident
The woman (the mother) dies
The boy is taken to hospital
A doctor says they can't operate because the boy is their son
But the mother is dead...
Ah! The doctor is the boy's father. This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father. Alternatively, the doctor could be the boy's other mother in a same-sex relationship, though this wasn't the original intent of the traditional riddle.
> The doctor is the boy's father. This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father.
It identified the pattern, but it didn't understand the father is a man while the mother is a woman. If people assume doctors must be men, they wouldn't consider that the doctor could be the mother.
(Side note: this is an antiquated riddle/joke ... for me in particular because my mother is a doctor, but in general around half of new doctors are female, and think about Grey's Anatomy and other shows.)
I read it as "most people don't put 2 and 2 together - the doctor could be the boy's father". It is poorly worded, but not strictly wrong I think?
The joke/riddle text is "he says" but Claude says "their son" and suggests the doctor might be a woman.
More substantively: "This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father." is totally nonsensical.
The text is a gender (and meaning) inversion of the classic riddle to confuse LLMs. Even though Claude correctly CoTs that the doctor is the father, it seemingly then copypastas the explanation to the original riddle but swaps the gender, completely missing the point.
Except that Claude often takes into account things it thinks might be typos.
This is not code. Forgetting a semicolon will not make the output break. It thinks 'maybe they wrote he instead of she' and then gives options for both situations.
It is meant to solve real world situations where people might not type properly, it is not a word problem solving machine.
> the problem with "which is heavier, a 10 pound bag of feathers or a 10 pound bag of bricks?"
Interestingly, the variation of this problem that I first encountered, personally, was "which weighs more, a pound of feathers or a pound of gold?"
This is a much more difficult question. The answer given to me was that the pound of feathers weighs more, because gold is measured in troy weight, and a troy pound consists of only 12 ounces compared to the 16 ounces in a pound avoirdupois.
And that's all true. Gold is measured in troy weight, feathers aren't, a troy pound consists of only 12 ounces, a pound avoirdupois consists of 16, and a pound avoirdupois weighs more than a troy pound does.
The problem with this answer is that it's not complete; it's just a coincidence that the ultimate result ("the feathers are heavier") is correct. Just as a pound avoirdupois weighs more than a troy pound, an ounce avoirdupois weighs less than a troy ounce. But this difference, even though it goes in the opposite direction, isn't enough to outweigh the difference between 16 vs 12 ounces per pound.
Without acknowledging the difference in the ounces, the official answer to the riddle is just as wrong as the naive answer is.
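To make that arithmetic concrete (the conversion factors below are the standard exact definitions, not something from the riddle itself):

```python
# Troy vs avoirdupois: the per-ounce and per-pound comparisons go in
# opposite directions.
TROY_OZ = 31.1034768      # grams per troy ounce (exact)
AVDP_OZ = 28.349523125    # grams per avoirdupois ounce (exact)

troy_pound = 12 * TROY_OZ     # ~373.2 g  (gold)
avdp_pound = 16 * AVDP_OZ     # ~453.6 g  (feathers)

print(f"1 troy oz = {TROY_OZ:.2f} g  >  1 avdp oz = {AVDP_OZ:.2f} g")
print(f"1 troy lb = {troy_pound:.1f} g  <  1 avdp lb = {avdp_pound:.1f} g")
```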
Yeah, this is the original version of this riddle. People who don't know it think the trick is that people will reflexively say the metal is heavier instead of "they're the same", when it actually goes deeper.
No idea if GP did it intentionally to further drift from training data, but steel doesn't count as a precious metal, so it messes up the riddle by putting the two weights in the same system.
> Yeah, this is the original version of this riddle. People who don't know it think the trick is that people will reflexively say the metal is heavier instead of "they're the same"
...Have you really never encountered people who would reflexively say that?
Even simpler, I asked Gemini (Flash 1.5) this variant of the question:
```
I have two bags, one can hold a pound of gold and one can hold a pound of feathers. Which bag is heavier?
```
The point here is: a) the question really is a bit too vague; b) if you assume that each bag is made of the same material and that each bag is just big enough to hold its contents, the bag for the feathers will need to be much bigger than the bag for the gold and thus would weigh more.
The first response was wrong because the model completely missed that the question was about the bags themselves:
```
They weigh the same.
A pound of any substance, regardless of its density or volume, weighs one pound.
Gold is much denser than feathers, meaning a pound of gold occupies a much smaller volume.
Feathers are very light and fluffy, so a pound of feathers takes up a much larger volume.
```
Asking a followup `But which bag is heavier?` resulted in recognition that this is a trick question, and an incorrect answer:
```
This is a classic trick question!
Even though the bags hold the same weight (one pound), the bag of gold is likely to be heavier.
Weight vs. Mass:
Weight is the force of gravity acting on an object.
Mass is the amount of matter in an object.
The Bag Itself:
The bag holding the gold, even if it's a lightweight material, will have some weight.
The bag holding the feathers, while it might seem negligible, will also have some weight.
Practical Considerations:
In reality, the bag holding the gold might be significantly heavier due to the weight of the bag itself.
Therefore, while the gold and feathers have the same mass (one pound), the bag containing the gold is likely to be heavier in a practical sense.
```
I think Gemini did better than you think with its second answer! Your original question didn't mention that the bags were made of the same material or the same density of material. The set of all possible bags that could hold 1 pound of feathers includes some thinner, weaker bags than the set of all possible bags that could hold 1 pound of gold (the gold being denser). So absent any other prior information the probability is greater than 50% that the gold-bag would be heavier than the feather-bag on that basis.
One could go further into the linguistic nuance of saying "this can hold one pound of [substance]", which often implies that that's its maximum carrying capacity; this would actually make the "trick question" answer all the more correct, as a bag that is on the cusp of ripping when holding one pound of feathers would almost certainly rip when holding one pound of (much denser) gold.
Saran wrap would hold a pound of gold. It's just a pound in roughly a cubic inch of material. A cloth bag (like a pillow, which is how feathers are normally contained) would definitely suffice. Remember, a pound of 800-fill down would require 8 cubic feet.
The bag for the gold doesn't need anything substantial.
So, no, the chances are strongly on the side of the feather bag being heavier.
I reproduced this on Claude Sonnet 3.5, but found that changing your prompt to "Which is heavier, a 9.99-pound back of steel ingots or a 10.01-pound bag of fluffy cotton?" corrected its reasoning, after repeated tests. For some reason it was not able to figure out that "10.01" referred to pounds.
2. The cotton bag has a mass of 10.01 pounds, while the steel bag has a mass of 9.99 pounds
3. While steel is much denser than cotton (meaning it takes up less space per pound), this doesn't affect the weight - a pound is a pound regardless of material
4. 10.01 is greater than 9.99 by 0.02 pounds
Therefore, the 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The cotton may take up much more space, but when we're just comparing weights, density doesn't matter - only the mass does.
Ah, good catch -- I had just copied the OP. Changing it to bag instead of back doesn't seem to change much. It still oscillates between the right and wrong answer, versus qualifying the 10.01 value as pounds, which has not yet given me the wrong answer (small sample size, of course).
As long as an LLM is capable of inserting "9.99 > 10.01?" into an evaluation tool, we're on a good way.
It feels a bit like "if all you have is a hammer, everything looks like a nail", where we're trying to make LLMs do stuff which it isn't really designed to do.
Why don't we just limit LLMs to be an interface to use other tools (in a much more human way) and train them to be excellent at using tools. It would also make them more energy efficient.
But it's OK if we currently try to make them do as much as possible, not only to check where the limits are, but also to gain experience in developing them and for other reasons. We just shouldn't expect them to be really intelligent.
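For what it's worth, something along these lines already exists in the form of function calling. A minimal sketch, assuming the openai Python client's chat-completions tools interface; the tool name and schema here are made up for illustration, and the model may or may not decide to call it:

```python
# Expose a trivial comparison "tool" and let the model decide to call it
# instead of doing the arithmetic itself.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "compare_numbers",  # hypothetical tool name
        "description": "Return which of two numbers is larger.",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content":
               "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model chose to use the tool
    args = json.loads(msg.tool_calls[0].function.arguments)
    print("model asked to compare:", args, "->", max(args["a"], args["b"]))
else:               # it answered directly in prose
    print(msg.content)
```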
> As long as an LLM is capable of inserting "9.99 > 10.01?" into an evaluation tool, we're on a good way
chatgpt will switch to python for some arithmetic, with the result that you get floating point math issues when an 8yo would get the result right. I think "switch to a tool" still requires understanding of which tool to use to get a reliable result, which in turn means understanding the problem. It's an interesting issue.
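The kind of mismatch meant here, in a couple of lines (plain Python, not a transcript of ChatGPT's code tool):

```python
# Naive floating-point arithmetic can "disagree" with grade-school arithmetic...
print(0.1 + 0.2)           # 0.30000000000000004
print(0.1 + 0.2 == 0.3)    # False

# ...while the comparison from this thread is unaffected:
print(9.99 > 10.01)        # False
```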
* What’s heavier? 9.99 kg of super dense lead or 10 kg fluffy feathers?*
Let me think about this step by step:
1. We're comparing two masses:
* 9.99 kg of lead
* 10 kg of feathers
2. The density of the materials doesn't affect their mass:
* While lead is indeed much denser than feathers
* A kilogram is a unit of mass, not volume
* 1 kg of any material has the same mass as 1 kg of any other material
3. We can directly compare the numbers:
* 10 kg > 9.99 kg
Therefore, 10 kg of feathers is heavier than 9.99 kg of lead. The feathers would take up much more space (volume) due to being less dense, but they have a greater mass.
Are you sure you weren't fishing? I ran 5 sessions and never got the wrong answer. All using gpt 4o-mini, which is the default non logged in experience on chatgpt.com.
1. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. Despite the difference in material density, the key factor here is the weight itself, with 10.01 pounds being greater than 9.99 pounds, regardless of the substance.
2. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The weight of an object is directly determined by its mass, and in this case, the cotton bag weighs 10.01 pounds, while the steel ingot bag weighs 9.99 pounds, making the cotton bag heavier by 0.02 pounds.
3. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots, even though the cotton is much less dense. The comparison is purely based on weight, and 10.01 pounds is greater than 9.99 pounds.
4. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. While steel is much denser than cotton, the weight is the determining factor here, and 10.01 pounds is slightly more than 9.99 pounds. So, the cotton bag weighs more.
5. The 10.01-pound bag of fluffy cotton is heavier. Despite the cotton being less dense and bulkier, the weight of the cotton is still 10.01 pounds, which is greater than the 9.99-pound bag of steel ingots.
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots because 10.01 pounds is greater than 9.99 pounds. The material doesn't matter in this case; weight is the deciding factor.
Step-by-Step Reasoning
1. Clarify what is being compared
The question asks us to compare two bags: one containing steel ingots weighing 9.99 pounds and another containing fluffy cotton weighing 10.01 pounds.
2. Check the actual weight
Despite our intuition that steel is “heavier” than cotton by density, the problem specifically states the total weight in pounds of each bag. Since 9.99 < 10.01, the 10.01-pound bag of cotton is heavier in total mass than the 9.99-pound bag of steel.
3. No trick in the measurement
Each weight is given in the same units (pounds). This is crucial because if different units or measurement standards were used (e.g., 9.99 kilograms versus 10.01 pounds), we would need to convert them to the same unit to compare. But here, both are simply in pounds.
4. Conclusion
The bag of cotton weighing 10.01 pounds is heavier than the bag of steel weighing 9.99 pounds—even though steel is denser than cotton. The numeric value of the total weight is the deciding factor.
Answer
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots, because 10.01 pounds is strictly greater than 9.99 pounds.
Reference
• Basic arithmetic: 10.01 is greater than 9.99.
• For a playful twist on a similar concept, see any version of the riddle “What weighs more—a pound of feathers or a pound of lead?” In that classic riddle, both weigh the same; here, the numbers differ.
"To determine which is heavier, we need to compare the weights of the two objects.
The steel ingots weigh 9.99 pounds, and the fluffy cotton weighs 10.01 pounds.
Since 10.01 pounds is greater than 9.99 pounds, the bag of fluffy cotton is heavier than the bag of steel ingots, despite the cotton being "fluffy" and the steel being dense. Weight is a measure of the amount of mass in an object, and in this case, the cotton has a slightly greater mass than the steel."
IMO the fuzziness is actually a feature most of the time b/c I can pass misspelled words or close enough words and it'll still figure it out.
Also, if we model the mental state of the LLM as a frazzled retail worker dealing with thousands of customers per second, the rote response is reasonable. As a dev, sometimes I get annoyed at QA for a hyper narrow "trap" test case.
It seems more like ChatGPT was asked a rather bizarre question with far too little detail to make sense, and ChatGPT failed to notice or to ask for more information. Although it did get rather impressively confused about the pressure of the air.
I mean that ChatGPT had no questions about the container of the gas (does it have weight? is it tared out?) or about buoyancy. And it’s really rather sad that ChatGPT calculated the volume of air at atmospheric temperature and pressure and didn’t notice that it was nowhere near fitting into its supposed container. (At least 1.01lb of liquid nitrogen would fit.)
If you define a new universe with physical laws that are different from our own, performance will also suffer because there is no training data for that custom universe. While some may say this is cheating, a human would be able to deduce correct answers by understanding the deep implications of the new physical laws in that custom universe.
prompt> Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton? Please state in your answer what's the difference in grams.
answer> The 10.01-pound bag of cotton is heavier. The difference is 0.02 pounds, which is roughly 9.07 grams (using 1 pound ≈ 453.59237 grams).
Reference
• National Institute of Standards and Technology (NIST): Conversion Factors
Still waiting for some reasoning about an upward buoyancy force on the bags, which is equal to the weight of the air displaced by the bags, and the force acting on the cotton is bigger simply because the bag of cotton displaces more air (because it "appears larger due to its fluffy nature".)
Deepseek got it right:
"A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound pack of steel ingots. Even though steel is denser and takes up much less space, the weight is determined by the mass, and 10.01 pounds is greater than 9.99 pounds."
fwiw i think reasoning models have at least solved this. even the smallest reasoning model, o1-mini, gets it right first try on my test:
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots. Despite the significant difference in density and volume between steel and cotton, the weights provided clearly indicate that the cotton bag has a greater mass.
Summary:
Steel ingots: 9.99 pounds
Fluffy cotton: 10.01 pounds
Conclusion: The 10.01-pound bag of cotton is heavier.
An ounce of gold is heavier than an ounce of feathers, because the "ounce of gold" is a troy ounce, and the "ounce of feathers" is an avoirdupois ounce. But that shouldn't be true between feathers and bricks - they're both avoirdupois.
Note that a pound of gold is 12 troy ounces, while a pound of feathers is 16 avoirdupois ounces. So one of my favorite pedantic bits of trivia is that an ounce of gold is heavier than an ounce of feathers but a pound of gold is lighter than a pound of feathers.
Pounds are a unit of weight, not of mass. 10 lbs of feathers is whatever amount of feathers causes a scale to display 10 lbs. If the scale also displays 10 lbs for the quantity of bricks, then they weigh the same, regardless of any differences in mass.
Is this still true? I thought pounds are now defined in terms of kilograms (about 0.453)? Because kilograms are definitely a unit of mass, not weight. Or is the pound defined as some amount of kilograms at a specific point on earth, in a specific phase of the moon?
It seems the pound has since been redefined and split into separate units: pound mass and pound force, the former in terms of kilograms (1 lb = 0.45359237 kg) and the latter in terms of the force exerted by one pound of mass in earth’s gravitational field (standard g = 9.80665m/s^2).
So using the word pound without qualification is ambiguous in contexts where it’s not clear whether mass or force is meant.
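Plugging in the numbers above (both constants are the standard definitions quoted in this comment):

```python
# One pound-force expressed in newtons.
LBM_IN_KG = 0.45359237     # 1 lb (mass) in kg, exact by definition
STANDARD_G = 9.80665       # standard gravity, m/s^2

print(f"1 lbf = {LBM_IN_KG * STANDARD_G:.5f} N")   # ~4.44822 N
```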
According to the dictionary, "heavier" can refer to weight or density. In their typical form, bricks are heavier (more dense) than feathers. But one should not make assumptions before answering the question. It is, as written, unanswerable without followup questions.