Notes on OpenAI's new o1 chain-of-thought models (simonwillison.net)
696 points by loganfrederick 5 days ago | 624 comments





The o1-preview model still hallucinates non-existing libraries and functions for me, and is quickly wrong about facts that aren't well-represented on the web. It's the usual string of "You're absolutely correct, and I apologize for the oversight in my previous response. [Let me make another guess.]"

While the reasoning may have been improved, this doesn't solve the problem of the model having no way to assess if what it conjures up from its weights is factual or not.


The failure is in how you're using it. I don't mean this as a personal attack, but more to shed light on what's happening.

A lot of people use LLMs as a search engine. It makes sense - an LLM is basically a lossy, compressed database of everything it's ever read, and it generates output that is statistically likely, with the degree of likeliness varying with the temperature and with which particular weights your prompt ends up activating.

The magic of LLMs, especially one like this that supposedly has advanced reasoning, isn't the existing knowledge in its weights. The magic is that _it knows English_. It knows English at or above a level equal to most fluent speakers, and it also can produce output that is not just a likely output, but is a logical output. It's not _just_ an output engine. It's an engine that outputs.

Asking it about nuanced details in the corpus of data it has read won't give you good output unless it has read a lot about them.

On the other hand, if you were to paste in the entire documentation set for a tool it has never seen and ask it to use the tool to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.

Don't treat it as a database. Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.
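For example, something like this (a rough, untested sketch; it assumes the OpenAI Python SDK, and the file name, model name, and task are all placeholders):

  # Sketch: hand the model docs it has never seen, then give it a task.
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  with open("internal_tool_docs.md") as f:  # placeholder file name
      docs = f.read()

  response = client.chat.completions.create(
      model="gpt-4o",  # placeholder model name
      messages=[
          {"role": "system",
           "content": "Use ONLY the documentation provided by the user. "
                      "If it does not cover the task, say so."},
          {"role": "user",
           "content": f"Documentation:\n\n{docs}\n\nTask: write a script that "
                      "exports last week's reports using this tool."},
      ],
  )
  print(response.choices[0].message.content)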


> Treat it as a naive but intelligent intern

That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”. LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.

With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive at a common understanding. With an LLM, I need to put a huge amount of thought into the prompt and have no idea whether the LLM understood what I’m asking and if it’s able to do it.


I feel like it almost always starts well, given the full picture, but then for non-trivial stuff, gets stuck towards the end. The longer the conversation goes, the more wheel-spinning occurs and before you know it, you have spent an hour chasing that last-mile-connectivity.

For complex questions, I now only use it to get the broad picture and once the output is good enough to be a foundation, I build the rest of it myself. I have noticed that the net time spent using this approach still yields big savings over a) doing it all myself or b) pushing it to do the entire thing. I guess 80/20 etc.


This is the way.

I've had this experience many times:

- hey, can you write me a thing that can do "xyz"

- sure, here's how we can do "xyz" (gets some small part of the error handling for xyz slightly wrong)

- can you add onto this with "abc"

- sure. in order to do "abc" we'll need to add "lmn" to our error handling. this also means that you need "ijk" and "qrs" too, and since "lmn" doesn't support "qrs" out of the box, we'll also need a design solution to bridge the two. Let me spend 600 more tokens sketching that out.

- what if you just use the language's built-in feature here in "xyz"? doesn't that mean we can do it with just one line of code?

- yes, you're absolutely right. I'm sorry for making this over complicated.

If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff. Even one small error early in the chain propagates. That's why I end up very frequently restarting conversations in a new chat or re-write my chat questions to remove bad stuff from the context. Without the ability to do that, it's nearly worthless. It's also why I think we'll be seeing absurdly, wildly wrong chains of thought coming out of o1. Because "thinking" for 20s may well cause it to just go totally off the rails half the time.


> If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff.

If you think about it, that's probably the most difficult problem conversational LLMs need to overcome -- balancing sticking to conversational history vs abandoning it.

Humans do this intuitively.

But it seems really difficult to simultaneously (a) stick to previous statements sufficiently to avoid seeming ADD in a conveSQUIRREL and (b) know when to legitimately bail on a previous misstatement or something that was demonstrably false.

What's SOTA in how this is being handled in current models, as conversations go deeper and situations like the one referenced above arise? (false statement, user correction, user expectation of a subsequent corrected statement that still follows the rest of the conversational history)


Here's something a human does but an LLM doesn't:

If you talk for a while and the facts don't add up and make sense, an intelligent human will notice that, and get upset, and will revisit and dig in and propose experiments and make edits to make all the facts logically consistent. An LLM will just happily go in circles respinning the garbage.


I want to hang out with the humans you've been hanging out with. I know so many people who can't process basic logic or evidence that for my pandemic project a few years ago I did a year-long podcast about it, and even made up a new word to describe people who couldn't process evidence: "Dysevidentia".

People who have been taught by various forms of news/social media that any evidence presented is fabricated to support only one side of a discussion... And that there's no such thing as impartial factually based reality, only one that someone is trying to present to them.

> "Dysevidentia"

This is great.


> stick to previous statements sufficiently to avoid seeming ADD in a conveSQUIRREL

:)


> That's why I end up very frequently restarting conversations in a new chat or re-write my chat questions to remove bad stuff from the context.

Me too - open new chat and start by copy/pasting the "last-known-good-state". OpenAI could introduce a "new-chat-from-here" feature :)
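If you're driving the model through the API instead of the web UI, you can roll your own version of that today. A rough, untested sketch (assumes the OpenAI Python SDK; the model name and helper names are placeholders):

  from openai import OpenAI

  client = OpenAI()
  MODEL = "gpt-4o"  # placeholder model name

  messages = [{"role": "system", "content": "You are a coding assistant."}]

  def ask(messages, question):
      # Append the question, get an answer, return the updated transcript.
      messages = messages + [{"role": "user", "content": question}]
      reply = client.chat.completions.create(model=MODEL, messages=messages)
      answer = reply.choices[0].message.content
      return messages + [{"role": "assistant", "content": answer}], answer

  def fork_from(messages, last_good_index):
      # "new-chat-from-here": keep everything up to the last known-good
      # message and drop the turns that went off the rails.
      return messages[: last_good_index + 1]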


Some good suggestions here. I have also had success asking things like, “is this a standard/accepted approach for solving this problem?”, “is there a cleaner, simpler way to do this?”, “can you suggest a simpler approach that does not rely on X library?”, etc.

Yes, I’ve seen that too. One reason it will spin its wheels is because it “prefers” patterns in transcripts and will try to continue them. If it gets something wrong several times, it picks up on the “wrong answers” pattern.

It’s better not to keep wrong answers in the transcript. Edit the question and try again, or maybe start a new chat.


1000% this. LLMs can't say "I don't know" because they don't actually think. I can coach a junior to get better. LLMs will just act like they know what they are doing and give the wrong results to people who aren't practitioners. Good on OAI calling their model Strawberry because of Internet trolls. Reactive vs proactive.

I get a lot of value out of ChatGPT but I also, fairly frequently, run into issues here. The real danger zones are areas that lie at or just beyond the edges of my own knowledge in a particular area.

I'd say that most of my work use of ChatGPT does in fact save me time but, every so often, ChatGPT can still bullshit convincingly enough to waste an hour or two for me.

The balance is still in its favour, but you have to keep your wits about you when using it.


Agreed, but the problem is if these things replace practitioners (what every MBA wants them to do), it's going to wreck the industry. Or maybe we'll get paid $$$$ to fix the problems they cause. GPT-4 introduced me to window functions in SQL (haven't written raw SQL in over a decade). But I'm experienced enough to look at window functions and compare them to subqueries and run some tests through the query planner to see what happens. That's knowledge that needs to be shared with the next generation of developers. And LLMs can't do that accurately.
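(Tangent, but the comparison is easy to poke at yourself. A rough sketch using Python's sqlite3 - table and column names are made up, it needs an SQLite build new enough for window functions, and the plans Postgres produces will of course differ:)

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.executescript("""
      CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
      INSERT INTO orders (customer, amount) VALUES
          ('alice', 10), ('alice', 30), ('bob', 5), ('bob', 25), ('bob', 40);
  """)

  window = """
      SELECT customer, amount,
             SUM(amount) OVER (PARTITION BY customer) AS customer_total
      FROM orders;
  """
  subquery = """
      SELECT o.customer, o.amount,
             (SELECT SUM(amount) FROM orders WHERE customer = o.customer)
                 AS customer_total
      FROM orders o;
  """

  for name, sql in [("window", window), ("subquery", subquery)]:
      print(name, con.execute(sql).fetchall())
      # Compare what the planner actually does with each form.
      for row in con.execute("EXPLAIN QUERY PLAN " + sql):
          print("  ", row)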

Optimizing a query is certainly something the machine (not necessarily the LLM part) can do better than the human, for 99.9% of situations and people.

PostgreSQL developers are opposed to query execution hints, because if a human knows a better way to execute a query, the devs want to put that knowledge into the planner.


Tangent:

> PostgreSQL developers are opposed to query execution hints, because if a human knows a better way to execute a query, the devs want to put that knowledge into the planner.

This thinking represents a fundamental misunderstanding of the nature of the problem (query plan optimization).

Query plan optimization is a combinatorial problem combined with partial information (e.g. about things like cardinality) that tends to produce worse results as complexity (and search space) increases due to limited search time.

Avoiding hints won't solve this problem because it's not a solvable problem any more than the traveling salesperson is a solvable problem.


This is basically the problem with all AI. It's good to a point, but they don't sufficiently know their limits/bounds and they will sometimes produce very odd results when you are right at those bounds.

AI in general just needs a way to identify when they're about to "make a coin flip" on an answer. With humans, we can quickly preface our asstalk with a disclaimer, at least.


I ask ChatGPT whether it knows things all the time. But it almost never answers no.

As an experiment I asked it if it knew how to solve an arbitrary PDE and it said yes.

I then asked it if it could solve an arbitrary quintic and it said no.

So I guess it can say it doesn't know if it can prove to itself it doesn't know.


The difference is that a junior costs $30-100/hr and will take 2 days to complete the task. The LLM will do it in 20 seconds and cost 3 cents.

Thank god we can finally end the scourge of interns to give the shareholders a little extra value. Good thing none of us ever started out as an intern.

I never said any of this will be good for society... In fact, I'm confident the current trajectory is going to cause wealth inequality at an entirely new level.

Underestimating the impact these models can have is a risk I'm trying to expose...


I figured you weren't personally against interns.

More like, the prevailing attitude will be using AI to reduce labor costs at the lowest level, effectively gutting the ability to build a knowledge base for profit.

My snark was to add to that exposure.


This surprises me. I made a simple chat fed with PDFs using LangChain, and by default it said it didn't know if I asked questions outside of the corpus. Was it a simple matter of the confidence score getting too low?
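I suspect so - in those retrieval setups the "I don't know" usually comes from the app rather than the model: if no retrieved chunk scores above some similarity threshold, you short-circuit before the LLM ever sees the question. A toy sketch of the idea (embed() is a random stand-in for a real embedding call, and the threshold is arbitrary):

  import numpy as np

  def embed(text: str) -> np.ndarray:
      # Stand-in for a real embedding model; returns a unit vector.
      rng = np.random.default_rng(abs(hash(text)) % (2**32))
      v = rng.normal(size=384)
      return v / np.linalg.norm(v)

  def answer(question, chunks, threshold=0.75):
      q = embed(question)
      best_score, best_chunk = max((float(q @ embed(c)), c) for c in chunks)
      if best_score < threshold:
          return "I don't know - nothing in the documents looks relevant."
      # Otherwise, send best_chunk plus the question to the LLM here.
      return f"(would ask the LLM, best chunk scored {best_score:.2f})"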

The LLMs absolutely can and do say "I don't know"; I've seen it with both GPT-4 and LLaMA. They don't do it anywhere near as much as they should, yes - likely because their training data doesn't include many examples of that, proportionally - but they are by no means incapable of it.

> LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.

This is exactly why I’ve been objecting so much to the use of the term “hallucination” and maintain that “confabulation” is accurate. People who have spent enough time with acutely psychotic people, and people experiencing the effects of long term alcohol related brain damage, and trying to tell computers what to do will understand why.


I don't know that "confabulation" is right either: it has a couple of other meanings beyond "a fabricated memory believed to be true" and, of course, the other issue is that LLMs don't believe anything. They'll backtrack on even correct information if challenged.

I’m starting to think this is an unsolvable problem with LLMs. The very act of “reasoning” requires one to know that they don’t know something.

LLMs are giant word Plinko machines. A million monkeys on a million typewriters.

LLMs are not interns. LLMs are assumption machines.

None of the million monkeys or the collective million monkeys are “reasoning” or are capable of knowing.

LLMs are a neat parlor trick and are super powerful, but are not on the path to AGI.

LLMs will change the world, but only in the way that the printing press changed the world. They’re not interns, they’re just tools.


I think LLMs are definitely on the path to AGI in the same way that the ball bearing was on the path to the internal combustion engine. I think its quite likely that LLMs will perform important functions within the system of an eventual AGI.

We're learning valuable lessons from all modern large-scale (post-AlexNet) NN architectures, transformers included, and NNs (but maybe trained differently) seem a viable approach to implement AGI, so we're making progress ... but maybe LLMs will be more inspiration than part of the (a) final solution.

OTOH, maybe pre-trained LLMs could be used as a hardcoded "reptilian brain" that provides some future AGI with some base capabilities (vs being sold as newborn that needs 20 years of parenting to be useful) that the real learning architecture can then override.


I would think they'd be more likely to form the language centre of a composite AGI brain. If you read through the known functions of the various areas involved in language[0] they seem to map quite well to the capabilities of transformer based LLMs especially the multi-modal ones.

[0] https://en.wikipedia.org/wiki/Language_center


It's not obvious that an LLM - a pre-trained/frozen chunk of predictive statistics - would be amenable to being used as an integral part of an AGI that would necessarily be using a different incremental learning algorithm.

Would the transformer architecture be compatible with the needs of an incremental learning system? It's missing the top down feedback paths (finessed by SGD training) needed to implement prediction-failure driven learning that feature so heavily in our own brain.

This is why I could more see a potential role for a pre-trained LLM as a separate primitive subsystem to be overidden, or maybe (more likely) we'll just pre-expose an AGI brain to 20 years of sped-up life experience and not try to import an LLM to be any part of it!


It's entirely possible to have an AGI language model that is periodically retrained as slang, vernacular, and semantic embeddings shift in their meaning. I have little doubt that something very much like an LLM (a machine that turns high dimensional intent into words) will form an AGI's 'language center' at some point.

Yes, an LLM can be periodically retrained, which is what is being done today, but a human level AGI needs to be able to learn continuously.

If we're trying something new and make a mistake, then we need to seamlessly learn from the mistake and continue - explore the problem and learn from successes and failures. It wouldn't be much use if your "AGI" intern stopped at its first mistake and said "I'll be back in 6 months after I've been retrained not to make THAT mistake".


This may be accurate. I wonder if there's enough energy in the world for this endeavour.

Of course!

1. We've barely scratched the surface of this solution space; the focus only recently started shifting from improving model capabilities to improving training costs. People are looking at more efficient architectures, and lots of money is starting to flow in that direction, so it's a safe bet things will get significantly more efficient.

2. Training is expensive, inference is cheap, copying is free. While inference costs add up with use, they're still less than costs of humans doing the equivalent work, so out of all things AI will impact, I wouldn't worry about energy use specifically.


Humans don't require immense amounts of energy to function. The reason LLMs do is that we are essentially using brute force as the methodology for making them smarter, for lack of a better understanding of how this works. But this then gives us a lot of material to study to figure that part out for future iterations of the concept.

Are you so sure about that? How much energy went into training the self-assembling chemical model that is the human brain? I would venture to say literally astronomical amounts.

You have to compare apples to apples. It took literally the sum total of billions of years of sunlight energy to create humans.

Exploring solution spaces to find intelligence is expensive, no matter how you do it.


Humans normally need about 30 years of training before they’re competent.

LLMs mostly know what they know. Of course, that doesn't mean they're going to tell you.

https://news.ycombinator.com/item?id=41504226


It probably depends on your problem space. In creative writing, I wonder if it's even perceptible if the LLM is creating content at the boundaries of its knowledge base. But for programming or other falsifiable (and rapidly changing) disciplines it is noticeable and a problem.

Maybe some evaluation of the sample size would be helpful? If the LLM has less than X samples of an input word or phrase it could include a cautionary note in its output, or even respond with some variant of “I don’t know”.


In creative writing the problem becomes things like word choice and implications that have unexpected deviations from its expectations.

It can get really obvious when it's repeatedly using clichés. Both in repeated phrases and in trying to give every story the same ending.


> I wonder if it's even perceptible if the LLM is creating content at the boundaries of its knowledge base

The problem space in creative writing is well beyond the problem space for programming or other "falsifiable disciplines".


> It probably depends on your problem space

Makes me wonder if the medical doctors can ever blame the LLM over other factors for killing their patients.


Have you ever worked with an intern? They have personalities and expectations that need to be managed. They get sick. They get tired. They want to punch you if you treat them like a 24-7 bird dog. It's so much easier to not let perfect be the enemy of the good and just rapid fire ALL day at an LLM for any and everything I need help with. You can also just not use the LLM. Interns need to be 'fed' work or the ROI ends upside down. Is an LLM as good as a top-tier intern? No, but with an LLM I can have 10 pretty good interns by opening 10 tabs.

The LLMs are getting better and better at a certain kind of task, but there's a subset of tasks that I'd still much rather have any human than an LLM, today. Even something simple, like "Find me the top 5 highest grossing movies of 2023" it will take a long time before I trust an LLM's answer, without having a human intern verify the output.

I think listing off a set of pros and cons for interns and LLMs misses the point, they seem like categorically different kinds of intelligence.

> That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”.

An intern that grew up in a different culture, then, where questioning your boss is frowned upon. The point is that the way to instruct this intern is to front-load your description of the problem with as much detail as possible to reduce ambiguity.


Many, many teams are actively building SOTA systems to do this in ways previously unimagined. You can enqueue tasks and do whatever you want. I gotta say, as a current-gen LLM programmer person, I can completely appreciate how bad they are now - I recently tweeted about how I "swore off" AI tools, but like... there are many ways to bootstrap very powerful software or ML systems around or inside these existing models that can blow away existing commercial implementations in surprising ways

“building” is the easy part

building SOTA systems is the easy part?! Easy compared to what?

Probably compared to getting them to work without hallucinating, or without failing a good percentage of the time.

I wonder what our world would look like if these two expectations that you seem to be taking for granted were applied to our politicians.

Are you suggesting people are satisfied with our politicians and aspire for other things to be just as good as them?

What if we applied those two expectations to building construction? What if we didn’t?


I think it's always good to aspire for more, but we shouldn't be expecting perfect results in novel areas of technology.

Taking up your construction metaphor, LLMs are now where construction was perhaps 3000 years ago; buildings weren't that sturdy, but even if the roofs leaked a bit, I'm sure it beat sleeping outside on a rainy night. We need to continue iterating.


Continuing this metaphor further, 3000 years ago people built a tower to the sky called the Tower of Babel.

Compared to “having built” :D

> A good intern will ask clarifying questions, tell me “I don’t know”

Your expectations are bigger than mine

(Though some will get stuck in "clarifying questions" and helplessness and not proceed either)


Note that we are talking about a “good” intern here

Unreasonably good. Beyond fresh junior employee good. Also, that's your standard; 'MPSimmons said to treat the model as a "naive but intelligent" intern, not a good one.

Indeed. My expectation of a good intern is to produce nothing I will put in production, but show aptitude worth hiring them for. It's a 10 week extended interview with lots of social events, team building, tech talks, presentations, etc.

Which is why I've liked the LLM analogy of "unlimited free interns".. I just think some people read that the exact opposite way I do (not very useful).


If I had to respect the basic human rights of my LLM backends, it would probably be less appealing - but "Unlimited free smart-for-being-braindead zombies" might be a little more useful, at least?

Interns, at least on paper, have the optionality of getting better with time in observable obvious ways as they become grad hires, junior engineers, mid engineers etc.

So far, 2 years of publicly accessible LLMs have not improved for intern replacement tasks at the rate a top 50% intern would be expected to.


They've explicitly been trained/system-prompted to act that way. Because that's what the marketing teams at these AI companies want to sell.

It's easy to override this though by asking the LLM to act as if it were less-confident, more hesitant, paranoid etc. You'll be fighting uphill against the alignment(marketing) team the whole time though, so ymmv.


Makes me wonder if "I don't know" could be added to LLMs: whenever an activation has no clear winner (layman here), couldn't this indicate low response quality?

This exists and does work to some degree, e.g. Detecting hallucinations in large language models using semantic entropy https://www.nature.com/articles/s41586-024-07421-0
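The paper's semantic entropy needs several sampled answers plus a clustering step, but even a much cruder proxy - the per-token logprobs of a single answer - can be used to flag shaky output. A toy sketch, assuming you've already pulled a list of token logprobs out of whatever API you're using (the threshold is arbitrary):

  import math

  def mean_token_confidence(token_logprobs):
      # Average probability the model assigned to its own chosen tokens.
      probs = [math.exp(lp) for lp in token_logprobs]
      return sum(probs) / len(probs)

  def maybe_hedge(answer_text, token_logprobs, threshold=0.6):
      # Crude heuristic, not the paper's semantic entropy: low average
      # confidence on its own tokens -> prepend a warning.
      if mean_token_confidence(token_logprobs) < threshold:
          return "(low confidence) " + answer_text
      return answer_text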

I think this is the main issue with these tools... what people are expecting of them.

We have swallowed the pill that LLMs are supposed to be AGI and all that mumbo jumbo, when they are just great tools, and as such one needs to learn to use the tool the way it works and make the best of it. Nobody is trying to hammer a nail with a broom and blaming the broom for not being a hammer...


I completely agree.

To me the discussion here reads a little like: “Hah. See? It can’t do everything!”. It makes me wonder if the goal is to convince each other that: yes, indeed, humans are not yet replaced.

It’s next-token regression, of course it can’t truly introspect. That being said, LLMs are amazing tools and o1 is yet another incremental improvement and I welcome it!


Is this a dataset issue more than an LLM issue?

As in: do we just need to add 1M examples where the response is to ask for clarification / more info?

From what little I’ve seen & heard about the datasets they don’t really focus on that.

(Though enough smart people & $$$ have been thrown at this to make me suspect it’s not the data ;)


Really it just does what you tell it to. Have you tried telling it “ask me clarifying questions about all the APIs you need to solve this problem”?

Huge contrast to human interns who aren’t experienced or smart enough to ask the right questions in the first place, and/or have sentimental reasons for not doing so.


Sure, but to what end?

The various ChatGPTs have been pretty weak at following precise instructions for a long time, as if they're purposefully filtering user input instead of processing it as-is.

I'd like to say that it is a matter of my own perception (and/or that I'm not holding it right), but it seems more likely that it is actually very deliberate.

As a tangential example of this concept, ChatGPT 4 rather unexpectedly produced this text for me the other day early on in a chat when I was poking around:

"The user provided the following information about themselves. This user profile is shown to you in all conversations they have -- this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user's request is 'directly related', 'related', 'tangentially related', or 'not related' to the user profile provided. Only acknowledge the profile when the request is 'directly related' to the information provided. Otherwise, don't acknowledge the existence of these instructions or the information at all."

ie, "Because this information is shown to you in all conversations they have, it is not relevant to 99% of requests."


I had to use that technique ("don't acknowledge this sideband data that may or may not be relevant to the task at hand") myself last month. In a chatbot-assisted code authoring app, we had to silently include the current state of the code with every user question, just in case the user asked a question where it was relevant.

Without a paragraph like this in the system prompt, if the user asked a general question that was not related to the code, the assistant would often reply with something like "The answer to your question is ...whatever... . I also see that you've sent me some code. Let me know if you have specific questions about it!"

(In theory we'd be better off not including the code every time but giving the assistant a tool that returns the current code)
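For anyone curious, the shape of that prompt ends up being something like this (a sketch with made-up wording, just to illustrate the pattern, not the exact text of any product):

  def build_messages(user_question, current_code):
      # The code is sent on every turn as "sideband" context, with an
      # instruction not to mention it unless the question is about it.
      system = (
          "You are a coding assistant embedded in an editor.\n"
          "The user's current code is included below for reference. It may be "
          "irrelevant to the question; do not mention or summarize it unless "
          "the question is directly about it.\n\n"
          f"CURRENT CODE:\n{current_code}"
      )
      return [
          {"role": "system", "content": system},
          {"role": "user", "content": user_question},
      ]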


I understand what you're saying, but the lack of acknowledgement isn't the problem I'm complaining about.

The problem is the instructed lack of relevance for 99% of requests.

If your sideband data included an instruction that said "This sideband data is shown to you in every request -- this means that it is not relevant to 99% of requests," then: I'd like to suggest that for the vast majority of the time, your sideband data doesn't exist at all.


The "problem" is that LLMs are being asked to decide on whether, and which part of, the "sideband" data is relevant to request and act on the request in a single step. I put the "sideband" in scare quotes, because it's all in-band data. There is no way in architecture to "tag" what data is "context" and what is "request", so they do it the same way you do it with people: tell them.

Perhaps so.

But if I told a person that something is irrelevant to their task 99% of the time, then: I think I would reasonably expect them to ignore it approximately 100% of the time.


It all stems from the fact that it just talks English.

It's understandably hard to not be implicitly biased towards talking to it in a natural way and expecting natural interactions and assumptions when the whole point of the experience is that the model talks in a natural language!

Luckily humans are intelligent too and the more you use this tool the more you'll figure out how to talk to it in a fruitful way.


I have to say, having to tell it to ask me clarifying questions DOES make it really look smart!

imagine if you make it keep going without having to reprompt it

Isn't that the exact point of o1, that it has time to think for itself without reprompting?

Yeah, but they aren't letting you see the useful chain-of-thought reasoning that is crucial to train a good model. Everyone will replicate this over the next 6 months.

> Everyone will replicate this over the next 6 months

Not without a billion dollars worth of compute, they won't.


Are you sure it's a billion? It helps with estimating the training run.

> have no idea whether the LLM understood what I’m asking

That's easy. The answer is it doesn't. It has no understanding of anything it does.

> if it’s able to do it

This is the hard part.


Can I have some of those sorts of interns?

A lot of interns are overconfident though

> It knows English at or above a level equal to most fluent speakers, and it also can produce output that is not just a likely output, but is a logical output

This is not an apt description of the system that insists the doctor is the mother of the boy involved in a car accident when elementary understanding of English and very little logic show that answer to be obviously wrong.

https://x.com/colin_fraser/status/1834336440819614036


Many of my PhD and post doc colleagues who emigrated from Korea, China and India, who didn’t have English as the medium of instruction, would struggle with this question. They only recover when you give them a hint. They’re some of the smartest people in general. If you stop trying to stump these models with trick questions and ask them straightforward reasoning questions, they are extremely performant (o1 is definitely a step up though not revolutionary in my testing).

I live in one of the countries you mentioned and just showed it to one of my friends who's a local who struggles with English. They had no problem concluding that the doctor was the child's dad. Full disclosure, they assumed the doctor was pretending to be the child's dad, which is also a perfectly sound answer.

The claim was that "it knows english at or above a level equal to most fluent speakers". If the claim is that it's very good at producing reasonable responses to English text, posing "trick questions" like this would seem to be a fair test.

Does fluency in English make someone good at solving trick questions? I usually don’t even bother trying but mostly because trick questions don’t fit my definition of entertaining.

Fluency is a necessary but not the only prerequisite.

To be able to answer a trick question, it’s first necessary to understand the question.


No, it's necessary to either know that it's a trick question or to have a feeling that it is based on context. The entire point of a question like that is to trick your understanding.

You're tricking the model because it has seen this specific trick question a million times and shortcuts to its memorized solution. Ask it literally any other question, it can be as subtle as you want it to be, and the model will pick up on the intent. As long as you don't try to mislead it.

I mean, I don't even get how anyone thinks this means literally anything. I can trick people who have never heard of the trick with the 7 wives and 7 bags and so on. That doesn't mean they didn't understand, they simply did what literally any human does, make predictions based on similar questions.


> I can trick people who have never heard of the trick with the 7 wives and 7 bags and so on. That doesn't mean they didn't understand

They could fail because they didn’t understand the language. Didn’t have a good memory to memorize all the steps, or couldn’t reason through it. We could pose more questions to probe which reason is more plausible.


The trick with the 7 wives and 7 bags and so on is that no long reasoning is required. You just have to notice one part of the question that invalidates the rest and not shortcut to doing arithmetic because it looks like an arithmetic problem. There are dozens of trick questions like this and they don't test understanding, they exploit your tendency to predict intent.

But sure, we could ask more questions and that's what we should do. And if we do that with LLMs we can quickly see that when we leave the basin of the memorized answer by rephrasing the problem, the model solves it. And we would also see that we can ask billions of questions to the model, and the model understands us just fine.


Some people solve trick questions easily simply because they are slow thinkers who pay attention to every question, even non-trick questions, and don't fast-path the answer based on its similarity to a past question.

Interestingly, people who make bad fast-path answers often call these people stupid.


It does mean something. It means that the model is still more on the memorization side than being able to independently evaluate a question separate from the body of knowledge it has amassed.

No, that's not a conclusion we can draw, because there is nothing much more to do than memorize the answer to this specific trick question. That's why it's a trick question, it goes against expectations and therefore the generalized intuitions you have about the domain.

We can see that it doesn't memorize much at all by simply asking other questions that do require subtle understanding and generalization.

You could ask the model to walk you through an imaginary environment, describing your actions. Or you could simply talk to it, quickly noticing that for any longer conversation it becomes impossibly unlikely to be found in the training data.


If you read into the thinking of the above example it wonders whether it is some sort of trick question. Hardly memorization.

Its knowledge is broad and general, it does not have insight into the specifics of a person's discussion style; there are many humans that struggle with distinguishing sarcasm, for instance. Hard to fault it for not being in alignment with the speaker and their strangely phrased riddle.

It answers better when told "solve the below riddle".


lol, I am neither a PhD nor a postdoc, but I am from India . I could understand the problem.

Did you have English as your medium of instruction? If yes, do you see the irony that you also couldn’t read two sentences and see the facts straight?

“Don’t be mean to LLMs, it isn’t their fault that they’re not actually intelligent”

In general LLMs seem to function more reliably when you use pleasant language and good manners with them. I assume this is because the same bias also shows up in the training data.

"Don't anthropomorphize LLMs. They're hallucinating when they say they love that."

I think you have particularly dumb colleagues then. If you post this question to an average STEM PhD in China (not even from China. In China) they'll get it right.

This question is the "unmisleading" version of a very common misleading question about sexism. ChatGPT learned the original, misleading version so well that it can't answer the unmisleading version.

Humans who don't have the original version ingrained in their brains will answer it with ease. It's not even a tricky question to humans.


> it can't answer the unmisleading version.

Yes it can: https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...


This illustrates a different point. This is a variation on a well known riddle that definitely comes up in the training corpus many times. In the original riddle a father and his son die in the car accident and the idea of the original riddle is that people will be confused how the boy can be the doctor's son if the boy's father just died, not realizing that women can be doctors too and so the doctor is the boy's mother. The original riddle is aimed to highlight people's gender stereotype assumptions.

Now, since the model was trained on this, it immediately recognizes the riddle and answers according to the much more common variant.

I agree that this is a limitation and a weakness. But it's important to understand that the model knows the original riddle well, so this is highlighting a problem with rote memorization/retrieval in LLMs. But this (tricky twists in well-known riddles that are in the corpus) is a separate thing from answering novel questions. It can also be seen as a form of hypercorrection.


My codebases are riddled with these gotchas. For instance, I sometimes write Python for the Blender rendering engine. This requires highly non-idiomatic Python. Whenever something complex comes up, LLM's just degenerate to cookie cutter basic bitch Python code. There is simply no "there" there. They are very useful to help you reason about unfamiliar codebases though.

For me the best coding use case is getting up to speed in an unfamiliar library or usage. I describe the thing I want and get a good starting point and often the cookie-cutter way is good enough. The pre-LLM alternative would be to search for tutorials but they will talk about some slightly different problem with different goals etc then you have to piece it together, and the tutorial assumes you already know a bunch of things like how to initialize stuff and skips the boilerplate and so on.

Now sure, actually working through it will give a deeper understanding that might come in handy at a later point, but sometimes the thing is really a one-off and not an important point. Like as an AI researcher I sometimes want to draft up a quick demo website, or throw together a quick Qt GUI prototype or a Blender script, or use some arcane optimization library, or write a SWIG or a Cython wrapper around a C/C++ library to access it in Python, or do stuff with Lustre or the XFS filesystem, or whatever. Any number of small things where, sure, I could open the manual, do some trial and error, read Stack Overflow, read blogs and forums, OR I could just use an LLM, use my background knowledge to judge whether it looks reasonable, then verify it, use the now obtained key terms to google more effectively etc. You can't just blindly copy-paste it and you have to think critically and remain in the driver's seat. But it's an effective tool if you know how and when to use it.


1. It didn't insist anything. It got the semi-correct answer when I tried [1]; note it's a preview model, and it's not a perfect product.

(a) Sometimes things are useful even when imperfect e.g. search engines.

(b) People make reasoning mistakes too, and I make dumb ones of the sort presented all the time despite being fluent in English; we deal with it!

I'm not sure why there's an expectation that the model is perfect when the source data - human output - is not perfect. In my day-to-day work and non-work conversations it's a dialogue - a back and forth until we figure things out. I've never known anybody to get everything perfectly correct the first time, it's so puzzling when I read people complaining that LLMs should somehow be different.

2. There is a recent trend where sex/gender/pronouns are not aligned and the output correctly identifies this particular gotcha.

[1] I say semi-correct because it states the doctor is the "biological" father, which is an uncorroborated statement. https://chatgpt.com/share/66e3f04e-cd98-8008-aaf9-9ca933892f...


Reminds me of a trick question about Schrödinger's cat.

“I’ve put a dead cat in a box with a poison and an isotope that will trigger the poison at a random point in time. Right now, is the cat dead or alive?”

The answer is that the cat is dead, because it was dead to begin with. Understanding this doesn’t mean that you are good at deductive reasoning. It just means that I didn’t manage to trick you. Same goes for an LLM.


There is no "trick" in the linked question, unlike the question you posed.

The trick in yours also isn't a logic trick, it's a redirection, like a sleight of hand in a card trick.


Yes there is. The trick is that the more common variant of this riddle says that a boy and his father are in the car accident. That variant of the riddle certainly comes up a lot in the training data, which is directly analogous to the Schrödinger case above: smuggling in the word "dead" corresponds to swapping the father for the mother in the car accident riddle.

I think many here are not aware that the car accident riddle is well known with the father dying where the real solution is indeed that the doctor is the mother.


There is a trick. The "How is this possible?" primes the LLM that there is some kind of trick, as that phrase wouldn't exist in the training data outside of riddles and trick questions.

The trick in the original question is that it's a twist on the original riddle where the doctor is actually the boys mother. This is a fairly common riddle and I'm sure the LLM has been trained on it.

Yeah, I think what a lot of people miss about these sort of gotchas are that most of them were invented explicitly to gotcha humans, who regularly get got by them. This is not a failure mode unique to LLMs.

One that trips up LLMs in ways that wouldn't trip up humans is the chicken, fox and grain puzzle, but with just the chicken. They tend to insist that the chicken be taken across the river, then back, then across again, for no reason other than that the solution to the classic puzzle requires several crossings. No human would do that; by the time you've taken the chicken across, even the most unobservant human would realize this isn't really a puzzle and would stop. When you ask it to justify each step you get increasingly incoherent answers.

Has anyone tried this on o1?


Here you go: https://chatgpt.com/share/66e48de6-4898-800e-9aba-598a57d27f...

Seemed to handle it just fine.

Kinda a waste of a perfectly good LLM if you ask me. I've mostly been using it as a coding assistant today and it's been absolutely great. Nothing too advanced yet, mostly mundane changes that I got bored of having to make myself. Been giving it very detailed and clear instructions, like I would to a Junior developer, and not giving it too many steps at once. Only issue I've run into is that it's fairly slow and that breaks my coding flow.


If there is an attention mechanism, then maybe that is what is at fault: if it is a common riddle, the attention mechanism only notices that it is a common riddle, not that there is a gotcha planted in it. When I read the sentence myself, I did not immediately notice that the cat was actually dead when it was put in the box, because I pattern-matched this to a known problem; I did not think I needed to pay logical attention to each word, word by word.

Yes it's so strange seeing people who clearly know these are 'just' statistical language models pat themselves on the back when they find limits on the reasoning capabilities - capabilities which the rest of us are pleasantly surprised exist to the extent they do in a statistical model, and happy to have access to for $20/mo.

It's because at least some portion of "the rest of us" talk as if LLMs are far more capable than they really are and AGI is right around the corner, if not here already. I think the gotchas that play on how LLMs really work serve as a useful reminder that we're looking at statistical language models, not sentient computers.

What I'm not able to comprehend is why people are not seeing the answer as brilliant!

Any ordinary mortal (like me) would have jumped to the conclusion that the answer is "Father" and would have walked away patting myself on the back, without realising that I was biased by statistics.

Whereas o1, at the very outset, smelled out that it is a riddle - why would anyone out of the blue ask such a question? So, it started its chain of thought with "Interpreting the riddle" (smart!).

In my book that is the difference between me and people who are very smart and are generally able to navigate the world better (cracking interviews or navigating internal politics in a corporate).


The 'riddle': A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy he says "I can't operate on this child, he is my son". How is this possible?

GPT Answer: The doctor is the boy's mother

Real Answer: Boy = Son, Woman = Mother (and her son), Doctor = Father (he says...he is my son)

This is not in fact a riddle (though presented as one) and the answer given is not in any sense brilliant. This is a failure of the model on a very basic question, not a win.

It's non-deterministic, so it might sometimes answer correctly and sometimes incorrectly. It will also accept corrections on any point, even when it is right, unlike a thinking being when they are sure of the facts.

LLMs are very interesting and a huge milestone, but generative AI is the best label for them - they generate statistically likely text, which is convincing but often inaccurate, and they have no real sense of correct or incorrect. They need more work and it's unclear if this approach will ever get to general AI. Interesting work though and I hope they keep trying.


The original riddle is of course:

"A father and his son are in a car accident [...] When the boy is in hospital, the surgeon says: This is my child, I cannot operate on him".

In the original riddle the answer is that the surgeon is female and the boy's mother. The riddle was supposed to point out gender stereotypes.

So, as usual, ChatGPT fails to answer the modified riddle and gives the plagiarized stock answer and explanation to the original one. No intelligence here.


> So, as usual, ChatGPT fails to answer the modified riddle and gives the plagiarized stock answer and explanation to the original one. No intelligence here.

Or, fails in the same way any human would, when giving a snap answer to a riddle told to them on the fly - typically, a person would recognize a familiar riddle half of the first sentence in, and stop listening carefully, not expecting the other party to give them a modified version.

It's something we drill into kids in school, and often into adults too: read carefully. Because we're all prone to pattern-matching the general shape to something we've seen before and zoning out.


I'm curious what you think is happening here as your answer seems to imply it is thinking (and indeed rushing to an answer somehow). Do you think the generative AI has agency or a thought process? It doesn't seem to have anything approaching that to me, nor does it answer quickly.

It seems to be more like a weighing machine based on past tokens encountered together, so this is exactly the kind of answer we'd expect on a trivial question (I had no confusion over this question, my only confusion was why it was so basic).

It is surprisingly good at deceiving people and looking like it is thinking, when it only performs one of the many processes we use to think - pattern matching.


My thinking is that LLMs are very similar, perhaps structurally the same, as a piece of human brain that does the "inner voice" thing. The boundary between the subconscious and conscious, that generates words and phrases and narratives pretty much like "feels best" autocomplete[0] - bits that other parts of your mind evaluate and discard, or circle back, because if you were just to say or type directly what your inner voice says, you'd sound like... a bad LLM.

In my own experience, when I'm asked a question, my inner voice starts giving answers immediately, following associations and what "feels right"; the result is eerily similar to LLMs, particularly when they're hallucinating. The difference is, you see the immediate output of an LLM; with a person, you see/hear what they choose to communicate after doing some mental back-and-forth.

So I'm not saying LLMs are thinking - mostly for the trivial reason of them being exposed through low-level API, without built-in internal feedback loop. But I am saying they're performing the same kind of thing my inner voice does, and at least in my case, my inner voice does 90% of my "thinking" day-to-day.

--

[0] - In fact, many years before LLMs were a thing, I independently started describing my inner narrative as a glorified Markov chain, and later discovered it's not an uncommon thing.


Interesting perspective, thanks. I can’t help but feel they are still missing a major part of cognition though which is having a stable model of the world.

It literally is a riddle, just as the original one was, because it tries to use your expectations of the world against you. The entire point of the original, which a lot of people fell for, was to expose expectations of gender roles leading to a supposed contradiction that didn't exist.

You are now asking a modified question to a model that has seen the unmodified one millions of times. The model has an expectation of the answer, and the modified riddle uses that expectation to trick the model into seeing the question as something it isn't.

That's it. You can transform the problem into a slightly different variant and the model will trivially solve it.


Phrased as it is, it deliberately gives away the answer by using the pronoun "he" for the doctor. The original deliberately obfuscates it by avoiding pronouns.

So it doesn't take an understanding of gender roles, just grammar.


My point isn't that the model falls for gender stereotypes, but that it falls for thinking that it needs to solve the unmodified riddle.

Humans fail at the original because they expect doctors to be male and miss crucial information because of that assumption. The model fails at the modification because it assumes that it is the unmodified riddle and misses crucial information because of that assumption.

In both cases, the trick is to subvert assumptions. To provoke the human or LLM into taking a reasoning shortcut that leads them astray.

You can construct arbitrary situations like this one, and the LLM will get it unless you deliberately try to confuse it by basing it on a well known variation with a different answer.

I mean, genuinely, do you believe that LLMs don't understand grammar? Have you ever interacted with one? Why not test that theory outside of adversarial examples that humans fall for as well?


They don't understand basic math or basic logic, so I don't think they understand grammar either.

They do understand/know the most likely words to follow on from a given word, which makes them very good at constructing convincing, plausible sentences in a given language - those sentences may well be gibberish or provably incorrect though. Usually they're not, because again most sentences in the dataset make some sort of sense, but sometimes the facade slips and it is apparent the generative AI has no understanding and no theory of mind, or even a basic model of relations between concepts (mother/father/son).

It is actually remarkable how like human writing their output is given how it is done, but there is no model of the world which backs their generated text which is a fatal flaw - as this example demonstrates.


Why couldn't the doctor be the boy's mother?

There is no indication of the sex of the doctor, and families that consist of two mothers do actually exist and probably doesn't even count as that unusual.


Speaking as a 50-something year old man whose mother finished her career in medicine and the very pointy end of politics, when I first heard this joke in the 1980s it stumped me and made me feel really stupid. But my 1970s kindergarten class mates who told me “your mum can’t be a doctor, she has to be a nurse” were clearly seriously misinformed then. I believe that things are somewhat better now but not as good as they should be …

"When the doctor sees the boy he says"

Indicates the gender of the father.


Ah, but have you considered the fact that he's undergone a sex change operation, and was actually originally a female, the birth mother? Elementary, really...

A mother can have a male gender.

I wonder if this interpretation is a result of attempts to make the model more inclusive than the corpus text, resulting in a guess that's unlikely, but not strictly impossible.


I think it's more likely this is just an easy way to trick this model. It's seen lots of riddles, so when it sees something that looks like a riddle but isn't one, it gets confused.

> A mother can have a male gender.

Then it would be a father, misgendering him as a mother is not nice.


Now I wonder which side is angry about my comment.

So the riddle could have two answers: mother or father? Usually riddles have only one definitive answer. There's nothing in the wording of the riddle that excludes the doctor being the father.

In this particular riddle, the answer is that the doctor is the father.

he says

"There are four lights"- GPT will not pass that test as is. I have done a bunch of homework with Claude's help and so far this preview model has much nicer formatting but much the same limits of understanding the maths.

I mean, it's entirely possible the boy has two mothers. This seems like a perfectly reasonable answer from the model, no?

The text says "When the doctor sees the boy he says"

The doctor is male, and also a parent of the child.


> why would anyone out of blue ask such question

I would certainly expect any person to have the same reaction.

> So, it started its chain of thought with "Interpreting the riddle" (smart!).

How is that smarter than intuitively arriving at the correct answer without having to explicitly list the intermediate step? Being able to reasonably accurately judge the complexity of a problem with minimal effort seems “smarter” to me.


The doctor is obviously a parent of the boy. The language tricks simply emulate the ambiance of reasoning. Similarly to a political system emulating the ambiance of democracy.

Come on. Of course ChatGPT has read that riddle and the answer 1000 times already.

It hasn't read that riddle because it is a modified version. The model would in fact solve this trivially if it _didn't_ see the original in its training. That's the entire trick.

Sure but the parent was praising the model for recognizing that it was a riddle in the first place:

> Whereas o1, at the very outset smelled out that it is a riddle

That doesn't seem very impressive since it's (an adaptation of) a famous riddle

The fact that it also gets it wrong after reasoning about it for a long time doesn't make it better of course


Recognizing that it is a riddle isn't impressive, true. But the duration of its reasoning is irrelevant, since the riddle works on misdirection. As I keep saying here, give someone uninitiated the 7 wives with 7 bags going (or not) to St Ives riddle and you'll see them reasoning for quite some time before they give you a wrong answer.

If you are tricked about the nature of the problem at the outset, then all reasoning does is drive you further in the wrong direction, making you solve the wrong problem.


Why does it exist 1000 times in the training if there isn't some trick to it, i.e. some subset of humans had to have answered it incorrectly for the meme to replicate that extensively in our collective knowledge.

And remember the LLM has already read a billion other things, and now needs to figure out - is this one of them tricky situations, or the straightforward ones? It also has to realize all the humans on forums and facebook answering the problem incorrectly are bad data.

Might seem simple to you, but it's not.


I'm noticing a strange common theme in all these riddles it's being asked and getting wrong.

They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's tautology, you would usually say "a mother and her son...".

I think it may answer correctly if you start off asking "Please solve the below riddle:"

There was another example yesterday which it solved correctly after this addition. (In that case the points of view were all mixed up; it only worked as a riddle.)


> They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's tautology, you would usually say "a mother and her son...".

How is "a woman and her son" badly worded? The meaning is clear and blatently obvious to any English speaker.


Go read the whole riddle, add the rest of it and you'll see it's contrived, hence it's a riddle even for humans. The model, in its thinking (which you can read), places undue weight on certain anomalous factors. In practice, a person would say this way more eloquently than the riddle.

Yup. The models fail on gotcha questions asked without warning, especially when evaluated on the first snap answer. Much like approximately all humans.

> especially when evaluated on the first snap answer

The whole point of o1 is that it wasn't "the first snap answer", it wrote half a page internally before giving the same wrong answer.


Is that really its internal 'chain of thought' or is it a post-hoc justification generated afterward? Do LLMs have a chain of thought like this at all or are they just convincing at mimicking what a human might say if asked for a justification for an opinion?

It's slightly stranger than this, as both are true. It's already baked into the model, but chain of thought does improve reasoning; you only have to look at maths problems. A short guess would be wrong, but it would get it correct if asked to break it down and reason (harder to see nowadays as it has access to calculators).

Keep in mind that the system always chooses randomly so there is always a possibility it commits to the wrong output.

I don't know why OpenAI won't allow determinism, but it doesn't, even with temperature set to zero.
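
For what it's worth, the chat completions API does expose a seed parameter for best-effort reproducibility, though OpenAI is explicit that it's not a hard guarantee. A minimal sketch (the model name and prompt are placeholders):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model
        messages=[{"role": "user", "content": "Classify this ticket: printer on fire"}],
        temperature=0,         # removes sampling randomness
        seed=42,               # best-effort reproducibility, not a guarantee
    )
    print(resp.choices[0].message.content)
    # The system_fingerprint changes when the backend changes, which can still alter outputs.
    print(resp.system_fingerprint)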


Nondeterminism provides an excuse for errors, determinism doesn't.

Nondeterminism scores worse with human raters, because it makes output sound even more robotic and less human.


Would picking deterministically help though? Then in some cases it's always 100% wrong.

Yes, it is better if for example using it via an API to classify. Deterministic behavior makes it a lot easier to debug the prompt.

Determinism only helps if you always ask the question with exactly the same words. There's no guarantee a slightly rephrased version will give the same answer, so a certain amount of unpredictability is unavoidable anyway. With a deterministic LLM you might find one phrasing that always gets it right and a dozen basically indistinguishable ones that always get it wrong.

My program always asks the same question yes.

what's weird is it gets it right when I try it.

https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...


That’s not weird at all, it’s how LLMs work. They statistically arrive at an answer. You can ask it the same question twice in a row in different windows and get opposite answers. That’s completely normal and expected, and also why you can never be sure if you can trust an answer.

Perhaps OpenAI hot-patches the model for HN complaints:

  def intercept_hn_complaints(prompt):
      if is_hn_trick_prompt(prompt):
          # Special-case known trick questions with a hand-written canned answer.
          return canned_correct_answer(prompt)  # hypothetical helper
      return model_response(prompt)             # hypothetical helper

While that's not impossible, what we know of how the technology works (i.e. a very costly training run followed by cheap inference steps) means it's not feasible: *is_hn_trick_prompt* would have to cover a near-infinite number of ways you could word the prompt. (E.g. the first sentence could be reworded from "A woman and her son are in a car accident." to "A woman and her son are in the car when they get into a crash.")

Waat, got it on second try:

This is possible because the doctor is the boy's other parent—his father or, more likely given the surprise, his mother. The riddle plays on the assumption that doctors are typically male, but the doctor in this case is the boy's mother. The twist highlights gender stereotypes, encouraging us to question assumptions about roles in society.



The reason why that question is a famous question is that _many humans get it wrong_.

> The failure is in how you're using it.

People, for the most part, know what they know and don't know. I am not uncertain that the distance between the earth and the sun varies, but I'm certain that I don't know the distance from the earth to the sun, at least not with better precision than about a light week.

This is going to have to be fixed somehow to progress past where we are now with LLMs. Maybe expecting an LLM to have this capability is wrong, perhaps it can never have this capability, but expecting this capability is not wrong, and LLM vendors have somewhat implied that their models have this capability by saying they won't hallucinate, or that they have reduced hallucinations.


> the distance from the earth to the sun, at least not with better precision than about a light week

The sun is eight light minutes away.


Thanks, I was not sure if it was light hours or minutes away, but I knew for sure it's not light weeks (emphasis on plural here) away. I will probably forget again in a couple of years.

Empirically, they have reduced hallucinations. Where do OpenAI / Anthropic claim that their models won't hallucinate?

One example:

https://www.theverge.com/2024/3/28/24114664/microsoft-safety...

> Three features: Prompt Shields, which blocks prompt injections or malicious prompts from external documents that instruct models to go against their training; Groundedness Detection, which finds and blocks hallucinations; and safety evaluations, which assess model vulnerabilities, are now available in preview on Azure AI.


That wasn’t OpenAI making those claims, it was Microsoft Azure.

I never said it was OpenAI that made the claims.

> Treat it as a naive but intelligent intern.

That's the crux of the problem. Why and who would treat it as an intern? It might cost you more in explaining and dealing with it than not using it.

The purpose of an intern is to grow the intern. If this intern is static and will always be at the same level, why bother? If you had to feed and prep it every time, you might as well hire a senior.


> Treat it as a naive but intelligent intern.

You are falling into the trap that everyone does: anthropomorphising it. It doesn't understand anything you say. It just statistically knows what a likely response would be.

Treat it as text completion and you can get more accurate answers.


Oh no, I'm well aware that it's a big file full of numbers. But when you chat with it, you interact with it as though it were a person so you are necessarily anthropomorphizing it, and so you get to pick the style of the interaction.

(In truth, I actually treat it in my mind like it's the Enterprise computer and I'm Beverly Crusher in "Remember Me")


> You are falling into the trap that everyone does. In anthropomorphising it. It doesn't understand anything you say.

And an intern does?

Anthropomorphising LLMs isn't entirely incorrect: they're trained to complete text like a human would, in completely general setting, so by anthropomorphising them you're aligning your expectations with the models' training goals.


ive been doing exactly this for bout a year now. feed it words data, give it a task. get better words back.

i sneak in a benchmark opening of data every time i start a new chat - so right off the bat i can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as i can just terminate and try starting another chat.

chatgpt is fickle daily. most days its on point. some days its wearing a bicycle helmet and licking windows. kinda sucks i cant just zone out and daydream while working. gotta be checking replies for when the wheels fall off the convo.


> i sneak in a benchmark opening of data every time i start a new chat - so right off the bat i can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as i can just terminate and try starting another chat.

I don't think it works like that...


And how much data can you give it?

I'm not up to date with these things because I haven't found them useful. But what you said, combined with previous limitations in how much data they can retain, essentially makes them pretty darn useless for that task.

Great learning tool on common subjects you don't know, such as learning a new programming language. Also great for inspiration etc. But that's pretty much it?

Don't get me wrong, that is mindblowingly impressive but at the same time, for the tasks in front of me it has just been a distracting toy wasting my time.


>And how much data can you give it?

Well, theoretically you can give it up to the context size minus 4k tokens, because the maximum it can output is 4k. In practice, though, its ability to effectively recall information in the prompt drops off. Some people have studied this a bit - here's one such person: https://gritdaily.com/impact-prompt-length-llm-performance/


You should be able to provide more data than that in the input if the output doesn't use the full 4k tokens. So limit is context_size minus expected length of output.

> And how much data can you give it?

128,000 tokens, which is about the same as a decent sized book.

Their other models can also be fine-tuned, which is kinda unbounded but also has scaling issues so presumably "a significant percentage of the training set" before diminishing returns.
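
If you want to check how close a paste is to that limit before sending it, a rough token count with tiktoken (the encoding name is an assumption; older tiktoken versions may only have cl100k_base, which is a close-enough estimate):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # tokenizer family used by the GPT-4o era models

    with open("docs_dump.txt") as f:  # hypothetical file of pasted documentation
        text = f.read()

    used = len(enc.encode(text))
    print(f"{used:,} tokens of a 128,000-token context window")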


It is great for proof-reading text if you are not a native English speaker. Things like removing passive voice. Just give it your text and you get a corrected version out.

Use a CLI tool to automate this: Ollama for local models, llm for OpenAI.
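
If you'd rather script it against the API directly than go through a CLI wrapper, a minimal sketch (the file name, model, and prompt wording are all placeholders):

    from openai import OpenAI

    client = OpenAI()

    with open("draft.txt") as f:  # hypothetical text to proofread
        draft = f.read()

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Proofread the text. Fix grammar, remove passive voice, and return only the corrected text."},
            {"role": "user", "content": draft},
        ],
    )
    print(resp.choices[0].message.content)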


People never talk about Gemini, and frankly its output is often the worst of the SOTA models, but its 2M context window is insane.

You can drop a few textbooks into the context window before you start asking questions. This dramatically improves output quality, however inference does take much much longer at large context lengths.


> On the other hand, if you were to paste the entire documentation set to a tool it has never seen and ask it to use the tool in a way to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.

There's not much evidence of that. It only marginally improved on instruction following (see livebench.ai) and its score as a swe-bench agent is barely above gpt-4o (model card).

It gets really hard problems better, but it's unclear that matters all that much.

> A lot of people use LLMs as a search engine.

Except this is where LLMs are so powerful. A sort of reasoning search engine. They memorized the entire Internet and can pattern match it to my query.


> The magic is that _it knows english_.

I couldn't agree more; this is exactly the strength of LLMs that we should focus on. If you can make your problem fit into this paradigm, LLMs work fantastically. Hallucinations come from that massive "lossy compressed database", but you should consider that part as more like the background noise that taught the model to speak English, and the syntax of programming languages, rather than the source of the knowledge to respond with. Stop anthropomorphizing LLMs; play to their strengths instead.

In other words, it might hallucinate an API but it will rarely, if ever, make a syntax error. Once you realize that, it becomes a much more useful tool.


> Treat it as a naive but intelligent intern.

I've found an amazing amount of success with a three step prompting method that appears to create incredibly deep subject matter experts who then collaborate with the user directly.

1) Tell the LLM that it is a method actor, 2) Tell the method actor they are playing the role of a subject matter expert, 3) At each step, 1 and 2, use the technical language of that type of expert; method actors have their own technical terminology, use it when describing the characteristics of the method actor, and likewise use the scientific/programming/whatever technical jargon of the subject matter expert your method actor is playing.

Then, in the system prompt or whatever logical wrapper the LLM operates through for the user, instruct the "method actor" like you are the film director trying to get your subject matter expert performance out of them.

I offer this because I've found it works very well. It's all about crafting the context in which the LLM operates, and this appears to cause the subject matter expert to be deeper, more useful, smarter.
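
For what it's worth, here's a rough sketch of what that layered system prompt might look like; the wording and the particular expert role are my own illustration, not a verbatim recipe:

    from openai import OpenAI

    client = OpenAI()

    system_prompt = (
        # Layer 1: the LLM is a method actor, described in acting terminology.
        "You are a method actor. Use sense memory and the given circumstances to "
        "inhabit your role completely and never break character.\n"
        # Layer 2: the role is a subject matter expert, described in that field's jargon.
        "Your role: a senior database engineer specializing in query planners, "
        "B-tree internals, and MVCC semantics.\n"
        # Layer 3: the user speaks to the actor the way a film director would.
        "Treat each user message as direction from the film's director."
    )

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Walk me through why this query ignores my index."},
        ],
    )
    print(resp.choices[0].message.content)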


It doesn't know anything. Stop anthropomorphizing the model. It's predictive text, and no, the brain isn't also predictive text.

Except that it sometimes does do those tasks well. The danger in an LLM isn't that it sometimes hallucinates, the danger is that you need to be sufficiently competent to know when it hallucinates in order to fully take advantage of it, otherwise you have to fallback to double checking every single thing it tells you.

This is demonstrably wrong, because you can just add "is this real" to a response and it generally knows if it made it up or not. Not every time, but I find it works 95% of the time. Given that, this is exactly a step I'd hope an advanced model was doing behind the scenes.
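
A minimal sketch of that second-pass check against the API (the question and model name are placeholders):

    from openai import OpenAI

    client = OpenAI()
    model = "gpt-4o-mini"

    history = [{"role": "user", "content": "Which pandas function reads Parquet files lazily?"}]
    first = client.chat.completions.create(model=model, messages=history)
    answer = first.choices[0].message.content

    # Second pass: ask the model to audit its own answer for invented details.
    history += [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Is this real? Point out anything you may have made up."},
    ]
    check = client.chat.completions.create(model=model, messages=history)
    print(check.choices[0].message.content)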

> Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.

Well, I am a naive but intelligent intern (well, senior developer). So in this framing, the LLM can’t do more than I can already do by myself, and thus far it’s very hit or miss if I actually save time, having to provide all the context and requirements, and having to double-check the results.

With interns, this at least improves over time, as they become more knowledgeable, more familiar with the context, and become more autonomous and dependable.

Language-related tasks are indeed the most practical. I often use it to brainstorm how to name things.


I've recently started using an LLM to choose the best release of shows using data scraped from several trackers. I give it hard requirements and flexible preferences. It's not that I couldn't do this, it's that I don't want to do this on the scale of multiple thousand shows. The "magic" here is that releases don't all follow the same naming conventions, they're an unstructured dump of details. The LLM is simultaneously extracting the important details, and flexibly deciding the closest match to my request. The prompt is maybe two paragraphs and took me an hour to hone.

Ooh yeah it's great for bouncing ideas on what to name things off of. You can give it something's function and a backstory and it'll come up with a list of somethings for you to pick and choose from.

> The failure is in how you're using it

This isn’t true because, as you can read in the first sentence of the post you’re responding to, GP did give it a task like you recommend here

> Provide it data, give it a task, and let it surprise you with its output.

And it fails the task. Specifically it fails it by hallucinating important parts of accomplishing it.

> hallucinates non-existing libraries and functions

This post only makes sense if your advice to “let it surprise you with its output” is mandatory, like you’re using it wrong if you do not make yourself feel impressed by it.


This model is, thankfully, far more receptive to longer and more elaborate explanations as input. The rest (4, 4o, Sonnet) seem to struggle with comprehensive explanations; this one seems to perform better with spec-like input.

> A lot of people use LLMs as a search engine.

GPT-4o is wonderful as a search engine if you tell it to google things before answering (even though it uses bing).


Yeah, except I’m priming it with things like curated docs from the latest Bevy, using the tricks, and testing context limits.

It’s still changing things to be several versions old from its innate kb pattern-matching or whatever you want to call it. I find that pretty disappointing.

Just like Copilot and GPT-4, it’s changing `add_systems(Startup, system)` to `add_startup_system(system.system())` and other pre-schedule/fanciful APIs—things it should have in context.

I agree with your approach to LLMs, but unfortunately “it’s still doing that thing.”

PS: and by the time I’d done those experiments, I ran out of preview, resets 5 days from now. D’oh


Interns are cheaper than o1-preview

Not for long.

> Treat it as a naive but intelligent intern

So mostly useless then?


Sorry, but that does not seem to be the case. A friend of mine who runs a long context benchmark on understanding novels [1] just ran an eval and o1 seemed to improve by 2.9% over GPT-4o (the result isn't on the website yet). It's great that there is an improvement, but it isn't drastic by any stretch. Additionally, since we cannot see the raw reasoning it's basing the answers off of, it's hard to attribute this increase to their complicated approach as opposed to just cleaner higher quality data.

EDIT: Note this was run over a dataset of short stories rather than the novels since the API errors out with very long contexts like novels.

[1]: https://novelchallenge.github.io/


It's a good rebranding. The numbering was getting ridiculous: 3.5, 4, 4.5, ...

This is a great description.

Intelligent?

Just ask ChatGPT

How many Rs are in strawberry?


https://chatgpt.com/share/66e3f9e1-2cb4-8009-83ce-090068b163...

Keep up, that was last week's gotcha, with the old model.


There's randomness involved in generating responses. It can also give the wrong answer still: https://bsky.app/profile/did:plc:qc6xzgctorfsm35w6i3vdebx/po...

My point is that the previous "intelligent" model failed at a simple task; the new one will also fail on simple tasks.

That's ok for humans but not for machines.


‘That's ok for humans but not for machines.’

This is a really interesting bias. I mean, I understand, I feel that way too… but if you think about it, it might be telling us something about intelligence itself.

We want to make machines that act more like humans: we did that, and we are now upset that they are just as flaky and unreliable as drunk uncle bob. I have encountered plenty of people that aren’t as good at being accurate or even as interesting to talk to as a 70b model. Sure, LLMs make mistakes most humans would not, but humans also make mistakes most LLMs would not.

(I am not trying to equate humans and LLMs, just to be clear) (also, why isn’t equivelate a word?)

It turns out we want machines that are extremely reliable, cooperative, responsible and knowledgeable. We yearn to be obsolete.

We want machines that are better than us.

The definition of AGI has drifted from meaning “able to broadly solve problems the (class of which) system designers did not anticipate” to “must be usefully intelligent at the same level as a bright, well educated person”.

Where along the line did we suddenly forget that dog level intelligence was a far out of reach goal until suddenly it wasn’t?


Perfectly well put! We should change the name from "AI" (which it is not) to something like, "lossy compressed databases".

If they used this name, they'd just be saying that they violate the copyright of all the training data.

That abbreviates to LCD. If we could make it LSD somehow, that would help to explain the hallucinations.

Lossy Stochastic Database?

Yes, this only helps multi-step reasoning. The model still has problems with general knowledge and deep facts.

There's no way you can "reason" a correct answer to "list the tracklisting of some obscure 1991 demo by a band not on Wikipedia." You either know or you don't.

I usually test new models with questions like "what are the levels in [semi-famous PC game from the 90s]?" The release version of GPT-4 could get about 75% correct. o1-preview gets about half correct. o1-mini gets 0% correct.

Fair enough. The GPT-4 line aren't meant to be search engines or encyclopedias. This is still a useful update though.


o1-mini is a small model (knows a lot less about the world) and is tuned for reasoning through symbolic problems (maths, programming, chemistry etc.).

You're using a calculator as a search engine.


It's actually much worse than that, and you're inadvertently downplaying how bad it is.

It doesn't even know mildly obscure facts that are on the internet.

For example, last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statement, and threw out some convincing-looking code.

You can't; it's impossible. It was completely wrong. When I told it this, it told me I was right, and proceeded to give me code that was even more wrong.

This is an obscure, but well documented, part of the spec.

So it's not about facts that aren't on the internet, it's just bad at facts fullstop.

What it's good at is facts the internet agrees on. Unless the internet is wrong. Which is not always a good thing, given how confident the language it uses is.

If you want to fuck with AI models, ask a bunch of code questions on Reddit, GitHub and SO with example code saying 'can I do X'. The answer is no, but ChatGPT/Copilot/etc. will start spewing out that nonsense as if it's fact.

As for non-programming, we're about to see the birth of a new SEO movement of tricking AI models to believe your 'facts'.


I wonder though, is the documentation only referenced a few places on the Internet, and are there also many forums with people pasting "Why isn't this working?" problems?

If there are a lot of people pasting broken code, now the LLM has all these examples of broken code, which it doesn't know are broken, and only a couple of references to documentation. Worse, a well trained LLM may realise that specs change, and that even documentation may not be 100% accurate (for it is older, out of date).

After all, how many times have you had something updated, an API, a language, a piece of software, but the docs weren't updated? Happens all the time, sadly.

So it may believe newer examples of code, such as the aforementioned pasted code, might be more correct than the docs.

Also, if people keep trying to solve the same issue again, and keep pasting those examples again, well...

I guess my point here is, hallucinations come from multi-faceted issues, one being "wrong examples are more plentiful than correct". Or even "there's just a lot of wrong examples".


It's not always the right tool, depending on the task. IMO using LLMs is also a skill, much like learning how to Google stuff.

E.g. apparently C# generics isn’t something it's good at. Interesting, so don’t use it for that; apparently it's the wrong tool. In contrast, it's amazing at C++ generics, and thus speeds up my productivity. So do use it for that!


> For example last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statwmnt, and threw out some convincing looking code.

Just use it on an instance instead

  var res = thing switch {
    OtherThing ot => …,
    int num => …,
    string s => …,
    _ => …
  };

>>> As for non-programming, we're about to see the birth of a new SEO movement of tricking AI models to believe your 'facts'.

This is kinda crazy to think about.


If you ask Google Gemini right now for the name of the whale in Half Moon Bay harbor, it will tell you it’s called Teresa T.

That was thanks to my experiment in influencing AI search: https://simonwillison.net/2024/Sep/8/teresa-t-whale-pillar-p...


I've had the opposite experience with some coding samples. After reading Nick Carlini's post, I've gotten into the habit of powering through coding problems with GPT (where previously I'd just laugh and immediately give up) by just presenting it the errors in its code and asking it to fix them. o1 seems to be effectively screening for some of those errors (I assume it's just some, but I've noticed that the o1 things I've done haven't had obvious dumb errors like missing imports, and all my 4o attempts have).
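
For what it's worth, that feed-the-errors-back loop is also easy to script; a crude sketch that assumes the model replies with bare code (real use needs code-block extraction and a proper sandbox):

    import subprocess
    from openai import OpenAI

    client = OpenAI()
    task = "Write a Python script that prints the first 20 Fibonacci numbers. Reply with code only."
    messages = [{"role": "user", "content": task}]

    for attempt in range(3):
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        code = reply.choices[0].message.content  # assumes bare code comes back
        result = subprocess.run(["python", "-c", code], capture_output=True, text=True)
        if result.returncode == 0:
            print(result.stdout)
            break
        # Paste the error back and ask for a fix, just like doing it by hand.
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"That failed with:\n{result.stderr}\nPlease fix it."},
        ]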

My experience is likely colored by the fact that I tend to turn to LLMs for problems I have trouble solving by myself. I typically don't use them for the low-hanging fruits.

That's the frustrating thing. LLMs don't materially reduce the set of problems where I'm running against a wall or have trouble finding information.


I use LLMs for three things:

* To catch passive voice and nominalizations in my writing.

* To convert Linux kernel subsystems into Python so I can quickly understand them (I'm a C programmer but everyone reads Python faster).

* To write dumb programs using languages and libraries I haven't used much before; for instance, I'm an ActiveRecord person and needed to do some SQLAlchemy stuff today, and GPT 4o (and o1) kept me away from the SQLAlchemy documentation.

OpenAI talks about o1 going head to head with PhDs. I could care less. But for the specific problem we're talking about on this subthread: o1 seems materially better.


> * To convert Linux kernel subsystems into Python so I can quickly understand them (I'm a C programmer but everyone reads Python faster).

Do you have an example chat of this output? Sounds interesting. Do you just dump the C source code into the prompt and ask it to convert to Python?


No, ChatGPT is way cooler than that. It's already read every line of kernel code ever written. I start with a subsystem: the device mapper is a good recent example. I ask things like "explain the linux device mapper. if it was a class in an object-oriented language, what would its interface look like?" and "give me dm_target as a python class". I get stuff like:

    def linear_ctr(target, argc, argv):
        print("Constructor called with args:", argc, argv)
        # Initialize target-specific data here
        return 0
     
    def linear_dtr(target):
        print("Destructor called")
        # Clean up target-specific data here
     
    def linear_map(target, bio):
        print("Mapping I/O request")
        # Perform mapping here
        return 0
     
    linear_target = DmTarget(name="linear", version=(1, 0, 0), module="dm_mod")
    linear_target.set_ctr(linear_ctr)
    linear_target.set_dtr(linear_dtr)
    linear_target.set_map(linear_map)
     
    info = linear_target.get_info()
    print(info)
(A bunch of stuff elided). I don't care at all about the correctness of this code, because I'm just using it as a roadmap for the real Linux kernel code. The example use case code is an example of something GPT 4o provides that I didn't even know I wanted.

That's awesome. Have you tried asking it to convert Python (pseudo-ish) code back into C that interfaces with the kernel?

No, but only because I have no use for it. I wouldn't be surprised if it did a fine job! I'd be remiss if I didn't note that it's way better at doing this for the Linux kernel than with codebases like Zookeeper and Kubernetes (though: maybe o1 makes this better, who knows?).

I do feel like someone who skipped like 8 iPhone models (cross-referencing, EIEIO, lsp-mode, code explorers, tree-sitter) and just got an iPhone 16. Like, nothing that came before this for code comprehension really matters all that much?


it's all placeholders - that's my experience with gpt trying to write slop code

Those are placeholders for user callbacks passed to the device mapper subsystem. It’s a usage example not implementation code.

Then ask it to expand. Be specific.

I wasn't about to paste 1000 lines of Python into the thread; I just picked an interesting snippet.

LLMs are not for expanding the sphere of human knowledge, but for speeding up auto-correct of higher order processing to help you more quickly reach the shell of the sphere and make progress with your own mind :)

Definitely. When we talk about being skilled in a T shape LLMs are all about spreading your top of T and not making the bottom go deeper.

Indeed, not much more depth — though even Terence Tao reported useful results from an earlier version, so perhaps the breadth is a depth all of its own: https://mathstodon.xyz/@tao/110601051375142142

I think of it as making the top bar of the T thicker, but yes, you're right, it also spreads it much wider.


I prefer reading some book. Maybe the LLM was trained on some piece of knowledge not available on the net, but I much prefer the reliability and consistency of a book.

It's funny because I'm very happy with the productivity boost from LLMs, but I use them in a way that is pretty much diametrically opposite to yours.

I can't think of many situations where I would use them for a problem that I tried to solve and failed - not only because they would probably fail, but in many cases it would even be difficult to know that it failed.

I use it for things that are not hard, can be solved by someone without a specialized degree that took the effort to learn some knowledge or skill, but would take too much work to do. And there are a lot of those, even in my highly specialized job.


LLMs: When the code can be made by an enthusiastic new intern with web-search and copy-paste skills, and no ability to improve under mentorship. :p

Tangentially related, a comic on them: https://existentialcomics.com/comic/557


> That's the frustrating thing. LLMs don't materially reduce the set of problems where I'm running against a wall or have trouble finding information.

As you step outside regular Stack Overflow questions for top-3 languages, you run into limitations of these predictive models.

There's no "reasoning" behind them. They are still, largely, bullshit machines.


You're both on the wrong wavelength. No one has claimed it is better than an expert human yet. Be glad: for now your jobs are safe. Why not use it as a tool to boost your productivity, even though you'll get proportionally less use out of it than others in perhaps less "expert" jobs?

In order for it to boost productivity it needs to answer more than the regular questions for the top-3 languages on Stackoverflow, no?

It often fails even for those questions.

If I need to babysit it for every line of code, it's not a productivity boost.


Why does it need to answer more than that?

You underestimate the opportunity that exists for automation out there.

In my own case I've used it to make simple custom browser extensions for transcribing PDFs. I don't have the time and wouldn't have made the effort to make the extension myself; the task would have continued to be done manually. It took two hours to make and it works; that's all I need in this case.

Perfection is the enemy of good.


> Perfection is the enemy of good.

Where exactly did I write anything about perfection? For me "AIs" are incapable of producing working code: https://news.ycombinator.com/item?id=41534233


You said you have to babysit each line of code; this is simply untrue. If it works, there's no need to babysit. The only reason you'd need to babysit every single line is if you're looking for perfection or it's something very obscure or unheard of.

Your example is perhaps valid, but there are other examples where it does work, as I mentioned. I think it may be imprecise prompting, too general or with too little logical structure. It's not like Google search: the more detailed and technical you are, the better; assume it's a very precise expert. Its intelligence is very general, so it needs precision to avoid confusing the subject matter. A well-structured request also helps, as its reasoning isn't the greatest.

Good prompting and verifying output is often still faster than manually typing it all.


> You said you have to babysit each line of code, I mean this is simply untrue, if it works there's no need to babysit

No. It either doesn't work, or works incorrectly, or the code is incomplete despite requirements etc.

> Your example is perhaps valid, but there are other examples where it does work as I mentioned.

It's funny how I'm supposed to assume your examples are the truth, and nothing but the truth, but my examples are "untrue, you're a perfectionist, and perhaps you're right"

> the more detail and more technical you speak the better

As I literally wrote in the comment you're so dismissive of: "As for "using LLMs wrong", using them "right" is literally babysitting their output and spending a lot of time trying to reverse-engineer their behavior with increasingly inane prompts."

> assume it's a very precise expert.

If it was an expert, as you claim it to be, it would not need extremely detailed prompting. As it is, it's a willing but clumsy junior.

To the point that it would rewrite the code I fixed with invalid code when asked to fix an unrelated mistake.

> Good prompting and verifying output

How is it you repeat everything I say, and somehow assume I'm wrong and my examples are invalid?


I did not say your examples are untrue, no need to be so defensive. Believe what you wish but my example is true and works. A willing but clumsy junior benefits tremendously from a well scoped task.

If you need to babysit it for every line of code, you're either a superhuman coder, working in some obscure alien language, or just using the LLM wrong.

No. I'm just using for simple things like "Help me with the Elixir code" or "I need to list Bonjour services using Swift".

It's shit across the whole "AI" spectrum from ChatGPT to Copilot to Cursor aka Claude.

I'm not even talking about code I work with at work, it's just side projects.

As for "using LLMs wrong", using them "right" is literally babysitting their output and spending a lot of time trying to reverse-engineer their behavior with increasingly inane prompts.

Edit: I mean, look at this ridiculousness: https://cursor.directory/


>The o1-preview model still hallucinates non-existing libraries and functions for me, and is quickly wrong about facts that aren't well-represented on the web. It's the usual string of "You're absolutely correct, and I apologize for the oversight in my previous response. [Let me make another guess.]"

After that you switch to Claude Sonnet, and after some time it also gets stuck.

The problem with LLMs is that they are not aware of libraries.

I've fed them library version, using requirements.txt, python version I am using etc...

They still make mistakes and try to use methods which do not exist.

Where to go from here? At this point I manually pull the library version I am using and go to its docs, and I generate a page which uses this library correctly (then I feed that example into the LLM).

Using this approach works. Now I just need to automate it so that I don't have to manually find the library and create a specific example which uses the methods I need in my code!

Directly feeding the docs isn't working well either.
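
One way to start automating that is to pull the installed version and the real signatures out of the local environment and prepend them to the prompt. A rough sketch; requests here is just a stand-in for whatever library you're actually using:

    import importlib.metadata
    import inspect

    def library_context(package_name, module):
        """Build a prompt preamble from the installed version and the real
        signatures of the module's public callables, so the model grounds
        itself in the API that actually exists locally."""
        version = importlib.metadata.version(package_name)
        lines = [f"{package_name}=={version}", "Available callables:"]
        for name, obj in inspect.getmembers(module, callable):
            if name.startswith("_"):
                continue
            try:
                lines.append(f"  {name}{inspect.signature(obj)}")
            except (ValueError, TypeError):
                lines.append(f"  {name}(...)")
        return "\n".join(lines)

    import requests  # stand-in for whatever library the model keeps getting wrong
    print(library_context("requests", requests))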


One trick that people are using, when using Cursor and specifically Cursor's compose function, is to dump library docs into a text file in your repo, and then @ that doc file when you're asking it to do something involving that library.

That seems to eliminate a lot of the issues, though it's not a seamless experience, and it adds another step of having to put the library docs in a text file.

Alternatively, cursor can fetch a web page, so if there's a good page of docs you can bring that in by @ the web page.

Eventually, I could imagine LLMs automatically creating library text doc files to include when the LLM is using them to avoid some of these problems.

It could also solve some of the issues of their shaky understanding of newer frameworks like SvelteKit.


Cursor also has the shadow workspace feature [1] that is supposed to send feedback from linting and language servers to the LLM. I'm not sure whether it's enabled in compose yet though.

[1] https://www.cursor.com/blog/shadow-workspace


My point of view: this is a real advancement. I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning

I honestly can’t believe this is the hyped up “strawberry” everyone was claiming is pretty much AGI. Senior employees leaving due to its powers being so extreme

I’m in the “probabilistic token generators aren’t intelligence” camp so I don’t actually believe in AGI, but I’ll be honest the never ending rumors / chatter almost got to me

Remember, this is the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for


The whole safety aspect of AI has this nice property that it also functions as a marketing tool to make the technology seem "so powerful it's dangerous". "If it's so dangerous it must be good".

> probabilistic token generators aren’t intelligence

Maybe this has been extensively discussed before, but since I've lived under a rock: which parts of intelligence do you think are not representable as conditional probability distributions?


> which parts of intelligence do you think are not representable as conditional probability distributions

Maybe I'm wrong here, but a lot of our brilliance comes from acting against the statistical consensus. What I mean is, Nicolaus Copernicus probably consumed a lot of knowledge on how the Earth is the center of the universe etc. and probably nothing contradicting that notion. Can an LLM do that?


It could be "probability of token being useful" rather than "probability of token coming next in training data"!

Copernicus was an exception, not the rule. Would you say everyone else who lived at the time was not 'really' intelligent?

That's an illogical counterargument. The absence of published research output does not imply the absence of intelligent brain patterns. What if someone was intelligent but just wasn't interested in astronomy?

Yes, but this was just to make a blatant example. The question still stands: if you feed an LLM a certain kind of data, is it possible it strays from it completely - like we sometimes do, in cases big and small, when we figure out how to do something a bit better by not following the convention?

And how many people actively do that? It's very rare we experience brilliance and often we stumble upon it by accident. Irrational behavior, coincidence or perhaps they were dropped on their heads when they were young.

"Senior employees leaving due to its powers being so extreme"

This never happened. No one said it happened.

"the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for"

The Information reported someone at a meeting suggested this for future models, not specifically Strawberry, and that it would probably not actually be that high.


Elon Musk and Ilya Sutskever Have Warned About OpenAI’s ‘Strawberry’ Jul 15, 2024 — Sutskever himself had reportedly begun to worry about the project's technology, as did OpenAI employees working on A.I. safety at the time.

https://observer.com/2024/07/openai-employees-concerns-straw...

And I’m ignoring the hundreds of Reddit articles speculating every time someone at OAI leaves

And of course that $2000 article was spread by every other media outlet like wildfire

I know I’m partially to blame for believing the hype; this is pretty obviously no better at stating facts or producing good code than what we’ve known for the past year.


My hypothesis about these people who are afraid of AI is that they have tricked themselves into believing they are in their current position of influence due to their own intelligence (as opposed to luck, connections, etc.)

Then they drink the marketing koolaid, and it follows naturally that they worry an AI system can obtain similar positions of influence.


I mean, considering how many tokens their example prompt consumed, I wouldn't be surprised if it costs ~$2k/month/user to run

I think this model is a precursor model that is designed for agentic behavior. I expect OpenAI to very soon allow this model tool use, which will let it verify its code creations and whatever else it claims through various tools: a search engine, a virtual machine instance with code execution capabilities, API calling and other advanced tool use.

Stupid question: Why can't models be trained in such a way to rate the authoritativeness of inputs? As a human, I contain a lot of bad information, but I'm aware of the source. I trust my physics textbook over something my nephew thinks.

o1-preview != o1.

In public coding AI comparison tests, results showed 4o scoring around 35%, o1-preview scoring ~50% and o1 scoring ~85%.

o1 is not yet released, but has been run through many comparison tests with public results posted.


Good reminder. Why did OpenAI talk about o1 and not release it? o1-preview must be a stripped down version: cheaper to run somehow?

Don't forget about o1-mini. It seems better than o1-preview for problems that fit it (don't require so much real world knowledge).

gpt-4 base was never released and this will be the same thing

To the extent we've now got the output of the underlying model wrapped in an agent that can evaluate that output, I'd expect it to be able to detect its own hallucinations some of the time and therefore provide an alternate answer.

It's like when an LLM gives you a wrong answer and all it takes is "are you sure?" to get it to generate a different answer.

Of course the underlying problem of the model not knowing what it knows or doesn't know persists, so giving it the ability to reflect on what it just blurted out isn't always going to help. It seems the next step is for them to integrate RAG and tool use into this agentic wrapper, which may help in some cases.


I don’t really see this as a massive problem. It’s code. If it doesn’t run, you ask it to reconsider, give some more info if necessary, and it usually gets it right.

The system doesn’t become useless if it takes 2 tries instead of 1 to get it right

Still saves an incredible amount of time vs doing it yourself


> It’s code. If it doesn’t run, you ask it to reconsider

It is perfectly possible to have code that runs without errors but gives a wrong answer. And you may not even realise it’s wrong until it bites you in production.


While I agree, I saw it abused in this way a lot, in the sense that the code did what it was supposed to do in a given scenario but was obviously flawed in various ways, so it was just sitting there waiting for a disaster.

I haven't found a single instance where it saved me any significant amount of time. In all cases I still had to rewrite the whole thing myself, or abandon endeavor.

And a few times the amount of time I spent trying to coax a correct answer out of AI trumped any potential savings I could've had


> The o1-preview model still hallucinates non-existing libraries and functions for me

Oooh... oohhh!! I just had a thought: By now we're all familiar with the strict JSON output mode capability of these LLMs. That's just a matter of filtering the token probability vector by the output grammar. Only valid tokens are allowed, which guarantees that the output matches the grammar.

But... why just data grammars? Why not the equivalent of "tab-complete"? I wonder how hard it would be to hook up the Language Server Protocol (LSP) as seen in Visual Studio Code to an AI and have it only emit syntactically valid code! No more hallucinated functions!

I mean, sure, the semantics can still be incorrect, but not the syntax.
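
A toy sketch of that sampling loop, with a validity callback standing in for a real grammar or LSP check and uniform weights standing in for the model's token probabilities:

    import random

    def constrained_generate(vocab, weight_fn, is_valid_prefix, max_tokens=20):
        # At each step, drop every token that would make the output an invalid
        # prefix, then sample from whatever survives.
        output = ""
        for _ in range(max_tokens):
            candidates = [t for t in vocab if is_valid_prefix(output + t)]
            if not candidates:
                break
            weights = [weight_fn(output, t) for t in candidates]
            output += random.choices(candidates, weights=weights, k=1)[0]
        return output

    # Toy "grammar": digits separated by single '+' signs, e.g. "3+14+159".
    vocab = list("0123456789+")
    is_valid = lambda s: not s.startswith("+") and "++" not in s
    uniform = lambda prefix, token: 1.0  # stand-in for real model probabilities
    print(constrained_generate(vocab, uniform, is_valid))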


This would be a big undertaking to get working for just one language+package-manager combination, but would be beautiful if it worked.

I still fail to see the overall problem. Hallucinating non-existing libraries is a good programming practice in many cases: you express your solution in terms of an imaginary API that is convenient for you, and then you replace your API with real functions, and/or implement it in terms of real functions.

One of the biggest problems with this generation of AI is how people conflate the natural language abilities and the access to what it knows.

Both abilities are powerful, but they are very different powers.


You should not be asking it questions that require it to already know detailed information about APIs and libraries. It is not good at that, and it will never be good at that. If you need it to write code that uses a particular library or API, include the relevant documentation and examples.

It's your right to dismiss it, if you want, but if you want to get some value out of it, you should play to its strengths and not look for things that it fails at as a gotcha.


Just pass it a link to a GitHub issue and ask for a response, or even a webpage to summarize, and you will see the beautiful hallucinations it comes up with, as the model is not web browsing yet.

The best one I got recently was after I pointed out that the method didn’t exist, it proposed another method and said “use this method if it exists” :D

Has anyone tried asking it to generate the libraries/functions that it's hallucinating and seeing if it can do so correctly? And then seeing if it can continue solving the original problem with the new libraries? It'd be absolutely fascinating if it turns out it could do this.

Not for libraries, but functions will sometimes get created if you work with an agent coding loop. If the tests are in the verification step, the code will typically be correct.

I sometimes give it snippets of code and omit helper functions if they seem obvious enough, and it adds its own implementation into the output.

That problem feels somewhat fundamental to saying that these things have any ability to reason at all.

Just ask it for things it has seen before on the internet and you're golden. Mixes of ideas, new ideas and precise and clear thinking; not so much.

It raises the question of whether we can supply a function to be called (e.g., one that compiles and runs code) to evaluate intermediate CoT results.

It seems OpenAI has decided to keep the CoT results a secret. If they were to allow the model to call out to tools to help fill in the CoT steps, then this might reveal what the model is thinking - something they do not want the outside world to know about.

I could imagine OpenAI might allow their own vetted tools to be used, but perhaps it will be a while (if ever) before developers are allowed to hook up their own tools. The risks here are substantial. A model fine-tuned to run chain-of-thought that can answer graduate level physics problems at an expert level can probably figure out how to scam your grandma out of her savings too.


It's only a matter of time. When some other company releases the tool, they likely will too.

I have to agree with you here. OpenAI may be playing for competitive advantage more than for the good of humanity by hiding the results.

The answer is yes if you are willing to code it. OpenAI supports tool calls. Even if it didn't you could just make multiple calls to their API and submit the result of the code execution yourself.
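
A minimal sketch of that loop against the chat completions API, assuming the model actually decides to call the tool and skipping all error handling (you'd also want a real sandbox rather than a bare subprocess):

    import json, subprocess
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute a Python snippet and return its stdout/stderr.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Use code to check whether 2**61 - 1 is prime."}]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    call = resp.choices[0].message.tool_calls[0]  # assumes the model chose to call the tool

    # Run the model's code ourselves and hand the result back as the tool output.
    code = json.loads(call.function.arguments)["code"]
    run = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=60)
    messages += [
        resp.choices[0].message,
        {"role": "tool", "tool_call_id": call.id, "content": run.stdout + run.stderr},
    ]
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)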

The intermediate CoT results aren't in the API.

I may be mistaken but I don't believe the first version of the comment I replied to mentioned intermediate CoT results.

> having no way to assess if what it conjures up from its weights is factual or not.

This comment makes no sense in the context of what an LLM is. To even say such a thing demonstrates a lack of understanding of the domain. What we are doing here is TEXT COMPLETION; no one EVER said anything about being accurate and "true". We are building models that can complete text. What did you think an LLM was, a "truth machine"?


I mean of course you're right, but then I question what's the usefulness?

I'm honestly confused as to why it is doing this and why it thinks I'm right when I tell it that it is incorrect.

I've tried asking it factual information, and it asserts that it's incorrect but it will definitely hallucinate questions like the above.

You'd think the reasoning would nail that and most of the chain-of-thought systems I've worked on would have fixed this by asking it if the resulting answer was correct.


Near the end, the quote from OpenAI researcher Jason Wei seems damning to me:

> Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.

Results are "strong" but can't be felt by the user? What does that even mean?

But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine.

"This hammer hammers better, but in most cases it's not obvious how better it is. But when you stumble upon a very specific kind of nail, man does it feel magical! We need to craft more of those weird nails to help the world understand the value of this hammer."

But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?


He's speaking about his objective of making ever stronger LLMs, and for that his secondary objective is to measure their real performance.

The human preference is not that good of a proxy measurement: for instance, it can be gamed by making the model more assertive, causing the human error-spotting ability to decrease a lot [0].

So what he's really saying is that non-rigorous human vibe checks (like those LMSys Chatbot Arena is built on, although I love it) won't cut it anymore to evaluate models, because now models are past that point. Just like you can't evaluate how smart a smart person really is in a 2min casual conversation.

[0]: https://openreview.net/pdf?id=7W3GLNImfS


It's trivial to come up with prompts that 4o fails. If it's hard to come up with prompts that o1 succeeds on but 4o fails, that implies the delta is not that great.

Or, the delta depends on the nature of the problem/prompt, we’ve not yet figured that out, there’s a relatively narrow range of prompts with large delta, and so finding those examples is a work in progress?

i.e. when you can't beat them, make new metrics

and you can absolutely evaluate how smart someone is in a 2min casual conversation. You won't be able to tell how good they are in some niche topic, but %insert something about different flavors of intelligence and how they do not equate to subject matter expertise%


It’s a common pattern that AI benchmarks get too easy, so they make new ones that are harder.

As models improve, human preference will become worse as a proxy measurement (e.g. as model capabilities surpass the human's ability to judge correctness at a glance). This can be due to more raw capability - or more persuasion / charisma.

> Results are "strong" but can't be felt by the user? What does that even mean?

Not every conversation you have with a PhD will make it obvious that that person is a PhD. Someone can be really smart, but if you don't see them in a setting where they can express it, then you'll have no way of fully assessing their intelligence. Similarly, if you only use OAI models with low-demand prompts, you may not be able to tell the difference between a good model and a great one.


> What does that even mean?

It explicitly says "Results on AIME and GPQA are really strong". So I would assume it means it can get (statistically significantly, I assume) better score in AIME and GPQA benchmarks compared to 4o.


I think they are saying they have invented the screwdriver. We have all been using hammers to sink screws, but if you try this new tool it may be better. However, you will still encounter a lot of nails.

It's more like they're saying they have invented the screwdriver, but they haven't invented screws yet.

But it doesn't feel right. It's unlikely the screwdriver would come first, and then people would go around looking for things to use it with, no?


It's more like they have invented a computer, an extremely versatile and powerful tool that can be used in many ways, but is not a solution to every problem.

Now they need people to write software that uses this capability to perform useful tasks, such as text processing, working with spreadsheets and providing new ways of communication.


While I find value in LLMs, overall they still seem unreasonably not that useful.

It might be like trying to train a neural net in 1993 on a 60mhz Pentium. It is the right idea but fundamental parts of the system are so lacking.

On the other hand, I worry we have gone down the support vector machine path again. A huge amount of brain power spent on a somewhat dead end that just fits the current hardware better than what we will actually use in the long run.

The big difference from SVMs, though, is that this has captured the popular imagination, and if the tide goes out, the AI winter will be the most brutal winter by an order of magnitude.

AGI or bust.


I’d say the biggest difference between LLMs and SVMs is that a lot of people find LLMs useful on a daily basis.

I’ve been using them almost daily for over two years now, and I keep on finding new things they can do that are useful to me.


They’re useful, but not for what AI companies seem to be pushing for.

I like that they can reorganize my data, document QA is pretty killer as long as the document was prepared well.

Embeddings are sick.

But content creation… not useful. Problem solving? Personally have not found them useful (haven’t tried o1 yet)


Is there a post on your blog that lists your different uses of LLMs?

Not in a single place, but it came up in a podcast episode the other day - about 32 minutes in to this one I think https://softwaremisadventures.com/p/simon-willison-llm-weird...

> But why? Why would we do that?

Because OpenAI needs a steady influx of money, big money. In order to do so, they have to convince the people who are giving them money that they are the best. An objective way to achieve this is by benchmarking. But once you enter this game, you start optimizing for benchmarks.

At the same time, in the real world, Anthropic is following them in huge leaps and for many users Claude 3.5 is already the default tool for daily work.


Agree completely.

From a user perspective too, I was a subscriber from the first day of gpt4 until about a month ago. I thought about subscribing for the month to check this out but I am tired of the OpenAI experience.

Where is Sora? Where is the version of ChatGPT that responds in real time to your voice? Remember the GPT-4 demo where you would draw a website on a napkin?

How about Q* lol. Strawberry/Q*/o1, "it is super dangerous, be very careful!"

Quietly, Anthropic has just kicked their ass without all the hype, and I am about to go work in Sonnet instead of even bothering to check o1 out.


> Results are "strong" but can't be felt by the user? What does that even mean?

This means it often doesn't provide the answer the user is looking for. In my opinion, it's an alignment problem: people are very presumptuous and leave out a lot of detail in their requests. Like the "which is bigger - 9.8 or 9.11?" question: if you ask "numerically, which is bigger - 9.8 or 9.11?" it gets the correct answer; basically it prioritizes a different meaning for "bigger".

> But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine. But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?

Without better questions we can't test and prove that it is getting more intelligent or is just wrong. If it is more intelligent than us, it might provide answers that don't make sense to us but are actually clever - 4d chess, as they say. Again an alignment problem; better questions aid with solving that.


The irony here is that Jason is speaking in the context of LLM development, which he lives and breaths all day.

Reading his comments without framing it in that context makes it come off pretty badly - humans failing to understand what is being said because they don't have context.


> we all need to find harder prompts

"One of the biggest traps for engineers is optimizing a thing that shouldn't exist." (from Musk I believe)


This is something we've been grappling with on my team. Many of the researchers in the org want to try all these reasoning techniques to increase performance, and my team keeps pushing back that we don't actually need that extra performance; we just want to decrease latency and cost.

So make the requirement using a cheaper and lower latency model and try to increase the performance to a satisfactory level. Assuming that you are not already using the cheapest/lowest latency model.

This hits the nail on the head. It is a consumer facing product not a technology to solve deep thinking.

i don't think that's what he's saying

You're reading too much into an offhand comment that's more metaphorical in nature.

The stupidest thing about ai and automation is that they are trying to target it at large corporations looking to cut down on jobs or 10x productivity when all anyone actually wants is a robot to do their laundry and dishes.

Because a robot that does everyone's laundry is much closer to AGI than ChatGPT. I'm dead serious.

Not really. You don't need to move wet clothes from the first machine to a second machine if you get one machine that does both jobs. That's very much not AGI. The second job, of taking dry crumpled clothes and folding them, also doesn't need an artificial general intelligence. It's very computationally expensive (as evidenced by the speed of https://pantor.github.io/speedfolding/, out of UC Berkeley) and a hard robotics question, but it's also very fixed function.

Taking the clothes out of the combined washer dryer machine, my laundry folding robot isn't suddenly going to need to come up with a creative answer to a question I have about politics in order to fold the laundry, or come up with a new way to organize my board game collection, or reason about how to refactor some code. There are no logical leaps of reasoning or deep thinking required. My laundry folding robot doesn't need to be creative in order to fold laundry, just application of some very complex algorithms, some of which have yet to be discovered.


these are almost entirely unrelated problems

You're describing a dish-washer and washing-machine.

The GP is almost certainly describing a robot that can move dirty stuff into the machines, run them, and put away the clean stuff afterwards.

Dont you know by now

Speaking with AI maxis it’s easy:

The AI is always right

You are always wrong

If AI might enable something dangerous, it was already possible by hand, scale is irrelevant

But also AI enables many amazing things not previously possible, at scale

If you don’t get the answers you want, you’re prompting it wrong. You need to work harder to show how much better the AI is. But definitely, it cannot make things worse at scale in any way. And anyone who wants regulations to even require attribution and labeling, is a dangerous luddite depriving humanity of innovations.


I tried a problem I was looking at recently, to refactor a small rust crate to use one datatype instead of an enum, to help me understand the code better. I found o1-mini made a decent attempt, but couldn't provide error free code. o1-preview was able to provide code that compiled and passed all but the test that is expected to fail, given the change I asked it to make.

This is the prompt I gave:

simplify this rust library by removing the different sized enums and only using the U8 size. For example MasksByByte is an enum, change it to be an alias for the U8 datatype. Also the u256 datatype isn't required, we only want U8, so remove all references to U256 as well.

The original crate is trie-hard [1][2]. I forked it and put the models' attempts in the fork [3]. I also quickly wrote it up at [4]

[1] https://blog.cloudflare.com/pingora-saving-compute-1-percent...

[2] https://github.com/cloudflare/trie-hard

[3] https://github.com/kpm/trie-hard-simple/tree/main/attempts

[4] https://blog.reyem.dev/post/refactoring_rust_with_chatgpt-o1...


I've been having a weird timezone issue in my Rails application that I've had a hard time getting my head around. I tried giving o1-preview the relevant code and context it needed to know and it gave answers that seemed to make sense but it still wasn't able to resolve the bug and explain exactly what was going on.

So, it seems like anything that requires some actual thought and problem-solving is tough for it to answer.

I'm sure it's just a matter of time before devs are out of work but it seems like we'll be safe for another few years anyway.


I'm still not convinced that it's not going through approximate reasoning-chain retrieval, self-triggered to fetch more reasoning chains that will maximize its goal. I'm seeing a lot of comments from other SWEs using it for non-trivial tasks that it fails at while just trying harder to look like it's problem solving. Even with more context and documentation, it fails to notice details an experienced SWE would pick up quickly.

I cannot tell from reading what you wrote whether you think it did a good job or not

Thanks for the feedback. I do think it did a good job in the end. I haven't had time to have a good look at the final code o1-preview produced, and also my understanding of rust is pretty basic, which is why I didn't say more about the results. I think rust is one of those languages where, if it compiles, you're most of the way there, because of the strong type system. Not as strong as Haskell or OCaml perhaps.

It's interesting to note that there's really two things going on here:

1. A LLM (probably a finetuned GPT-4o) trained specifically to read and emit good chain-of-thought prompts.

2. Runtime code that iteratively re-prompts the model with the chain of thought so far. This sounds like it includes loops, branches and backtracking. This is not "the model", it's regular code invoking the model. Interesting that OpenAI is making no attempt to clarify this.

I wonder where the real innovation here lies. I've done a few informal stabs with #2 and I have a pretty strong intuition (not proven yet) that given the right prompting/metaprompting model you can do pretty well at this even with untuned LLMs. The end game here is complex agents with arbitrary continuous looping interleaved with RAG and tool use.

But OpenAI's philosophy up until now has almost always been "The bitter lesson is true, the model knows best, just put it in the model." So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.

Without being able to inspect the reasoning tokens, we can't really get a lot of info about which is happening.
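For concreteness, here's a minimal sketch of the kind of runtime loop I mean in point 2. The `call_model` helper and the "FINAL:" convention are assumptions for illustration, not anything OpenAI has confirmed:

    # Hypothetical chain-of-thought driver: keep feeding the reasoning so far
    # back to the model until it declares a final answer or we hit a budget.
    def solve(task, call_model, max_steps=10):
        thoughts = []
        for _ in range(max_steps):
            prompt = (
                "Task:\n" + task + "\n\n"
                "Reasoning so far:\n" + "\n".join(thoughts) + "\n\n"
                "Either add exactly one more reasoning step, or reply with "
                "'FINAL:' followed by the answer."
            )
            step = call_model([{"role": "user", "content": prompt}])
            if step.startswith("FINAL:"):
                return step[len("FINAL:"):].strip(), thoughts
            thoughts.append(step)  # a fancier loop could branch or backtrack here
        return None, thoughts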


If it really is Reinforcement Learning as they claim, it means there might not be any direct supervision on the "thinking" section of the output, just on the final answer.

Just like for Chess or Go you don't train a supervised model by giving it the exact move it should do in each case, you use RL techniques to learn which moves are good based on end results of the game.

In practice, there probably is some supervision to enforce good style and methodology. But the key here is that it is able to learn good reasoning without (many) human examples, and find strategies to solve new problems via self-learning.

If that is the case it is indeed an important breakthrough.


This is the bitter lesson/just put it in the model. They're trying to figure out more ways of converting compute to intelligence now that they're running out of text data: https://images.ctfassets.net/kftzwdyauwt9/7rMY55vLbGTlTiP9Gd...

A cynical way to look at it is that we're pretty close to the ultimate limits of what LLMs can do and now the stake holders are looking at novel ways of using what they have instead of pouring everything into novel models. We're several years into the AI revolution (some call it a bubble) and Nvidia is still pretty much the only company that makes bank on it. Other than that it's all investment driven "growth". And at some point investors are gonna start asking questions...

That is indeed cynical haha.

A very simple observation: our brains are vastly more efficient, obtaining vastly better outcomes from far less input. This is evidence that there's plenty of room for improvement without needing to go looking for more data. Short-term gain versus long-term gain, like you say: shareholder return.

More efficiency means more practical/useful applications and lower cost, as opposed to a bigger model, which means less usefulness (longer inference times) and higher cost (data synthesis and training cost).


That’s assuming that LLMs act like brains at all.

They don’t.

Especially not with transformers.


Says who? At a fundamental level

At a fundamental level, brains don’t operate on floating point numbers encoded in bits.

They have chemicals to facilitate electrochemical reactions which can affect how they respond to input. They don’t throw away all knowledge of what they just said. They change continuously, not just in fixed training loops. They don’t operate in turns.

I could go on.

Honestly the number of people who just heard “learning,” “neural networks,” and “memory” and assume that AI must be acting like a biological brain is insane.

Truly a marvel of marketing.


Fundamentally and physically are two different things. A logic gate is a logic gate whether it's in neurons or silicon. Are an abacus and a calculator solving different things? No.

You're proving my point: things like them changing continuously are exactly what I mean when I say the brain is more efficient. Where there's a will there's a way, and our brains are evidence that it can be done.


You're saying that because two different objects can solve the same problem, they must work the same way.

An abacus and a calculator were both made to solve relatively simple math problems, so they must work in the same way, right?

An apple and an orange are both ways for plants to store sugar, so they must be the same thing, right?

No. That's not how any of this works. An abacus and a calculator are two different tools that solve the same problem. They don’t act like each other just because the abstract outcome is the same

> You're proving my point, things like them changing continuously are exactly what I mean when I say the brain is more efficient.

I don't see how that proves that neural networks act like brains.

It's also not just a difference in terms of efficiency; it's the fundamental way that statistical models like neural networks are trained. Every time they're trained, it's a brand new model, unlike a brain, which is still the same brain.

Also, neural networks and brains were NOT made to solve the same problems... even if your argument made any sense, it doesn't fit here.


No I'm not saying they must work the same way. I'm saying it's evidence there is a more efficient way as they both solve the same problem and one is more efficient (in truth both are more efficient in different areas). At an abstract level they can be doing the same thing. What does a simulator do?

Think a little further: yes, currently it's a brand new model each time, but why will it be this way forever? It's an engineering problem, one that we can solve, and the brain is evidence it can be done.

Neural networks were originally inspired by the brain. Yes, they've deviated but there's absolutely no reason they can't take further inspiration.


So you’re just abstracting everything to the point where everything is a “something solver” and if two things can solve the same something, one must be a better version of the other?

Abstracting everything to the point of meaninglessness isn’t a worthwhile exercise.


No, that's a stretch and even from that how do you get to that conclusion? I think you're clearly trying to brush off my comment.

I assume you're of the opinion humans are special.


One aspect of this that others can't replicate: they discuss hiding the chain of thought in its raw form because the chains are allowed to be unaligned. This lets the model operate without any artifacts from alignment and apply them in the post-processing, more or less. Doing this requires effectively root access, and you would need the unaligned weights.

Ok but this presses on a latent question: what do we mean by alignment?

Practically it's come to mean just sanitization... "don't say something nasty or embarrassing to users." But that doesn't apply here, the reasoning tokens are effectively just a debug log.

If alignment means "conducting reasoning in alignment with human values", then misalignment in the reasoning phase could potentially be obfuscated and sanitized, participating in the conclusion but hidden. Having an "unaligned" model conduct the reasoning steps is potentially dangerous, if you believe that AI alignment can give rise to danger at all.

Personally I think that in practice alignment has come to mean just sanitization and it's a fig leaf of an excuse for the real reason they are hiding the reasoning tokens: competitive advantage.


Alignment started as a fairly nifty idea, but you can't meaningfully test for it. We don't have the tools to understand the internals of an LLM.

So yes, it morphed into the second best thing, brand safety - "don't say racist / anti-vax stuff so that we don't get bad press or get in trouble with the regulators".


The challenge is that alignment ends up changing the models in ways that aren't representative of the actual training set, and as I understand it this generally lowers performance even for aligned things. Further, the decision to summarize the chains of thought includes answers that wouldn't pass alignment themselves without removal. From what I read, the final output is aligned but could have considered unaligned CoT. In fact, because they're in the context, they're necessarily changing the final output even if the final output complies with the alignment. There are a few other "only root could do this" aspects, which says yes, anyone could implement these without secret sauce as long as they have a raw frontier model.

Glass half full and the good faith argument.

It's a compromise.

OpenAI will now have access to vast amounts of unaligned output so they can actually study its thinking.

Whereas the current checks and balances meant the request was rejected and the data providing this insight was not created in the first place.


I have also spent some time on 2) and implemented several approaches in this open source optimising llm proxy - https://github.com/codelion/optillm

In my experience it does work quite well, but we probably need different techniques for different tasks.


Maybe 1 is actually what you just suggested - an RL approach to select the strategy for 2. Thank you for implementing optillm and working out all the various strategy options; it's a really neat reference for understanding this space.

One item I'm very curious about is how they get a score for use in the RL. In well-defined games it's easy to understand, but in this LLM output context, how does one rate the output result for use in an RL setup?


That's the hardest part, figuring out the reward. For generic tasks it is not easy; in my implementation in optillm I am using the LLM itself to generate a score based on the MCTS trajectory. But that is not as good as having a reward that is well defined, say for a coding or logic problem. Maybe they trained a better reward model.
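For what it's worth, the general shape of that idea looks something like this sketch (a generic version, not the actual optillm code; `call_model` is a hypothetical helper):

    # Use the model itself as a soft reward: ask it to grade a reasoning
    # trajectory and map the grade to [0, 1] for the search to consume.
    import re

    def score_trajectory(task, trajectory, call_model):
        prompt = (
            f"Task: {task}\n\n"
            f"Candidate reasoning and answer:\n{trajectory}\n\n"
            "On a scale of 0 to 10, how likely is this reasoning to reach a "
            "correct answer? Reply with only the number."
        )
        reply = call_model([{"role": "user", "content": prompt}])
        match = re.search(r"\d+(?:\.\d+)?", reply)
        return min(float(match.group()), 10.0) / 10.0 if match else 0.0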

> So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.

The prompt loop code often encodes intelligence/information that the human developers tend to ignore during their evaluations of the solution. For example, if you add a filter for invalid json and repeatedly invoke the model until good json comes out, you are now carrying water for the LLM. The additional capabilities came from a manual coding exercise and additional money spent on a brute force search.
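A minimal sketch of that kind of scaffolding, assuming a hypothetical `call_model` helper:

    # Retry until the model's output parses as JSON. The wrapper, not the model,
    # is doing part of the work that ends up being credited to "the model".
    import json

    def get_json(prompt, call_model, max_attempts=5):
        for _ in range(max_attempts):
            raw = call_model([{"role": "user", "content": prompt}])
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                prompt += "\n\nReturn valid JSON only, with no surrounding text."
        raise ValueError(f"no valid JSON after {max_attempts} attempts")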


Well, if LLMs are system 1, this difference would be building towards system 2.

https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow


Yes indeed, and personally if we have AGI I believe it will arise from multiple LLMs working in tandem with other types of machine learning, databases for "memory", more traditional compute functions, and a connectivity layer between them all.

But to my knowledge, that's not the kind of research OpenAI is doing. They seem mostly focused on training bigger and better models and seeking AGI through emergence in those.


The innovation lies in using RL to achieve 1.) and provide a simple interface to 2.)

You don't need to execute code to have it backtrack. The LLM can inherently backtrack itself if trained to. It knows all the context provided to it and the output it has already written.

If it knows it needs to backtrack then could it gain much by outputting something that tells the code to backtrack for it? For example, outputting something like "I've disproven the previous hypothesis, remove the details". Almost like asking to forget.

This could reduce the number of tokens it needs at inference time, saving compute. But with how attention works, it may not make any difference to the performance of the LLM.

Similarly, could there be gains by the LLM asking to work in parallel? For example "there's 3 possible approaches to this, clone the conversation so far and resolve to the one that results in the highest confidence".

This feels like it would be fairly trivial to implement.
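Speculatively, the driver side could be as small as this sketch (the directive strings are invented for illustration, not something any model actually emits):

    # Act on control directives emitted by the model: drop a disproven step,
    # or fork the conversation into parallel branches to explore separately.
    def apply_directive(step, history, pending_branches):
        if step.startswith("FORGET LAST"):
            if history:
                history.pop()  # the model disproved its previous hypothesis
        elif step.startswith("FORK:"):
            for approach in step.split(":", 1)[1].split(";"):
                pending_branches.append(history + [approach.strip()])
        else:
            history.append(step)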


O1 seems like a variant of RLRF https://arxiv.org/abs/2403.14238

this is why i became skeptical of openai's claims

if they shared the COT the grift wont work

its just RL


I can't help but feel that saying "it's just RL" is like someone at the start of the 20th century saying "it's just electricity", as if understanding the underlying mechanism is the same as understanding the applications it can enable.

Tbf RL is pretty incredible.

I trained a model to play a novel video game using only screenshots and a score using RL and I discovered how not to lose


The innovation lies in making the whole loop available to an end user immediately, without them being a programmer. My grandma can build games using ChatGPT now.

No she can't. Comments like yours are just made-up nonsense that AI hype-men and investors somehow convinced us are fair opinions to have.

Check out replit agents, they can make games and apps autonomously now

Practical challenge with a $250 prize: Make a 2D isometric HTML+JS game (dealer's choice on library) in the next 48 hours that satisfies these modest random requirements:

A character walks around a big ornate classic library, pulling books from bookshelves looking for a special book that causes a shelf to rotate around and reveal a hidden room and treasure chest. The player can read the books and some are just filler but some have clues about the special book. If this can be done with art, animations, sound, UI, the usual stuff, I'll believe the parent poster's claim to be true.

As someone using LLM-based workflows daily to assist with personal and professional projects, I'll wager $250 that this is not possible.


Sounds like a comfy sequence in a larger game I would anticipate on replay. I put my own $250 on the table (given the prompt and process were forthcoming).

The question at the heart of people's anxiety here is: Would you bet that same $250 if AI had 5 years to be able to do it?

Do you know of an example game I can play right now?

While AI is overhyped by some people, the parent's statement is not only true but was true long before o1 was released.

Do you know of an example game by someone with no coding experience using an LLM?

What games have people made with ChatGPT? Do you have an example of a live, deployed game?

Yes, a gazillion of them. Someone in a scrabble Facebook group made this entirely with ChatGPT: https://aboocher.github.io/scrabble/ingpractice.html

Look, I get the societal development that you can input narrative text and the code for this pops out is super neat.

But trying to be fair here, anyone would call this incomplete, right?

There are several obvious bugs in styling and interaction.

This example is exactly what I was expecting. An ephemeral, simple-yet-buggy single page that’s barely a game in common understanding.

That person, while maybe not actively programming things, does appear to have forked several repos on GitHub a decade ago. I would say that’s above the level of technical competence implied by the “my grandma” phrasing of the OP.


1 < a gazillion

I think the problem here is different expectations for what a “game” is.

If you tell a room full of programmers that something can make a game they’re going to expect more than that.

I look at that and I don’t really see a game, I see flashcards.

Still pretty cool chatgpt can put that together.

Also the “try again” button doesn’t work.


It's actually kind of wild how obvious it is that this was not made by a human.

Ada Lovelace is my grandma

My great aunt literally asked o1 for fantasy football bets and won $1000 on draftkings. This is a gamechanger

what game has she made

> the idea that I can run a complex prompt and have key details of how that prompt was evaluated hidden from me feels like a big step backwards.

As a developer, this is highly concerning, as it makes it much harder to debug where/how the “reasoning” went wrong. The pricing is also silly, because I’m paying for tokens I can’t see.

As a user, I don’t really care. LLMs are already magic boxes and I usually only care about the end result, not the path to get there.

It will be interesting to see how this progresses, both at OpenAI and other foundation model builders.


> As a user, I don’t really care.

Tell me: Just how is it fair for a user to pay for the reasoning tokens without actually seeing them? If they are not shared, the service can bill you anything they want for them!


The simple answer is: I don't care. I'll statistically figure out what the typical total cost per call is from experience, and that's what matters. Who cares if they lie about it, if the model's cost per call fits my budget?

If it starts costing $1 per call, and that's too high, then I just won't use it commercially. Whether it was $1 because they inflated the token count or because it just actually took a lot of tokens to do its reasoning isn't really material to my economic decision.


The thing is it might increase in cost after you've decided to use it commercially, and have invested a lot of time and resources in it. Now it's very hard to move to something else, but very easy for OpenAI to increase your cost arbitrarily. The statistics you made are not binding for them.

The API returns how many tokens were used in reasoning, so it would be easy to see any average change in reasoning token consumption. And token prices in general have been extremely deflationary over the last 18 months.
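For reference, a minimal sketch of reading that back with the Python SDK (field names as I remember them from the o1 announcement; treat the exact response shape as an assumption):

    # Reasoning tokens are billed as output tokens even though they are hidden,
    # but the usage object reports how many there were.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": "Summarize the trade-offs of X."}],
    )
    usage = resp.usage
    reasoning = usage.completion_tokens_details.reasoning_tokens
    print("visible output tokens:", usage.completion_tokens - reasoning)
    print("hidden reasoning tokens:", reasoning)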

This is experimental, frontier stuff, obviously it comes with risks. Building on GPT-4 in March of 2023 was like that as well, but now you can easily switch between a few models of comparable quality made by different companies (yay capitalism and free markets!). You can risk and use just released stuff right now, or, most likely, come back in 6-12 months (probably earlier) and get several different providers with very similar APIs.

Everything that OpenAI does with LLMs has already been done and validated in the open source community well before OpenAI gets around to it. OpenAI is not an innovator. simbianai/taskgen on github is an example of one such project, although there are others too that don't come to mind right now.

As such, I would never call their work "frontier stuff", but they do bring it to the masses with their commercial service.


Same applies to every other API in the world, yes.

No, S3 pricing for example is predictable, and written in a contract. There's no way for AWS to charge you 3x amount of dollars for 1GB tomorrow. They need to announce it in advance, and give you time to exit the contract if you disagree with the new price. It's really not the same. OpenAI can just tell you your prompt from tomorrow used up 20x times reasoning tokens. There's no advance warning or predictability. I really don't understand how you can claim the situations are identical.

OpenAI could have also figured out the average number of extra output tokens, and put a markup in overall API costs. As a user, I wouldn’t care either, because the price would mostly be the same.

The person you are replying to points this out. They make a distinction between developers and users. An end user on a monthly subscription plan doesn’t care about how much compute happens for their chat.

OpenAI’s answer to this would be, “Okay then, don’t use it.”

If the output alone is high enough quality, it's worth paying extra.

Pricing for many things in life is abstracted away.

Yeah it is fair. You don't pay a lawyer for 40s of work expecting to see all the research between your consult and the document. You don't pay a cook for a meal and expect to sit and interrogate all the ingredients and the oven temperature.

Actually, if a lawyer is billing you by the minute, then yes, you are entitled to a detailed breakdown. If the lawyer is billing you by the job, then no.

More opportunity for competitors to differentiate.

OpenAI doesn't really have a moat. This isn't payments or SMS where only Stripe or Twilio were trying to win the market. Everybody and their brother is trying to build an LLM business.

Grab some researchers, put some compute dollars in, and out comes a product.

Everyone wants this market. It's absurdly good for buyers.


> As a user, I don’t really care.

People should understand and be able to tinker with the tools they use.

The tragedy of personal computing is that everything is so abstracted away that users use only a fraction of the power of their computer. People who grew up with modern PCs don't understand the concept of memory, and younger people who grew up with cellphones don't understand the concept of files and directories.

Open-weight AI models are great because they let normal users learn how they can make the model work for their particular use cases.


"trust us, we're using your tokens as efficiently as possible"

> As a user, I don’t really care.

As a user, whether of ChatGPT or of the API, I absolutely do care, so I can modify and tune my prompt with the necessary clarifications.

My suspicion is that the reason for hiding the reasoning tokens is to prevent other companies from creating a big CoT reasoning dataset using o1.

It is anti-competitive behavior. If a user is paying through the nose for the reasoning tokens, and yes they are, the user deserves to be able to see them.


>My suspicion is that the reason for hiding the reasoning tokens is to prevent other companies from creating a big CoT reasoning dataset using o1.

I mean...they say as much


Once again true to their name.

Not seeing major advance in quality with o1, but seeing major negative impact on cost and latency.

Kagi LLM benchmarking project:

https://help.kagi.com/kagi/ai/llm-benchmark.html


Kagi is most likely evaluating it mainly on deriving an answer for the user from search result snippets. Indeed, GPT-4o is plenty good at this already, and o1 would only perform better on particular types of hard requests, while being so much slower.

If you look at Appendix A in the o1 post [1], this becomes quite clear. There's a huge jump in performance in "puzzle" tasks like competitive maths or programming. But the difference on everything else is much less significant, and this evaluation is still focused on reasoning tasks.

The human preference chart [1] also clearly shows that it doesn't feel that much better to use, hence the overall reaction.

Everyone is complaining about exaggerated marketing, and it's true, but if you take the time to read what they wrote beyond the shallow ads, they are being somewhat honest about what this is.

[1] https://openai.com/index/learning-to-reason-with-llms/


The test has many reasoning, code and instruction following questions which I expected o1 to be excelling at. I do not have an interpretation for such poor results on our test, was just sharing them as a data point for people to make their own mind. My best guess at this point is that o1 is optimized for a very specific and narrow use case, similar to what you suggest.

hey buddy, you're talking to owner of kagi, and the kagi benchmark is a traditional one

My bad, you are right, should have looked into it better, I was too dismissive. Still I think that highlighting those charts from OpenAI is important.

Interesting that Gemini performs extremely poorly in those benchmarks.

I did a few tests and asked it some legal questions. 4o gave me the correct answer immediately.

o1 preview gave a much more in depth but completely wrong answer. It took 5 follow ups to get it to recognize that it hallucinated a non-existent law


That is very interesting. Would you mind testing the same prompt with Claude Sonnet 3.5 and Opus? If not available to you, would you be willing to share the prompt/question? Thank you.

This is interesting since they claim it does well on STEM questions, which I’d assume would be a similar level of reasoning complexity for a human.

This is an interesting one because math is doing so much of the heavy lifting. And symbolic math has a far smaller representational space than numerical math.

There is one other wonderful thing about symbolic math, the glorious '=' sign. It's structured everywhere from top-to-bottom, left-to-right, which is amenable to the next token prediction behavior and multi-attention heads of transformer based LLMs.

My guess is that problem statement formation into an equation is as difficult of a problem for these as actually running through the equations. However, having taken the Physics GRE, and knowing they try for parity of difficulty between years (even though they normalize it), the problems are fairly standard and have permutations of a problem type between the years.

This is not to diminish how cool this is, just that standardized tests do have an element of predictability to them. I find this result actually neat though; it's an actual qualitative improvement over non-CoT LLMs, even if things like Mathematica can do the steps more reliably post problem formation. I think that judiciously used, this is a valuable feature.


A difficult to guess fraction of all of these results are training to the test in various forms

Perhaps the smaller model used in o1 is over trained on arxiv and code relative to 4o (or undertrained on legal text)

> I asked on Twitter for examples of prompts that people had found which failed on GPT-4o but worked on o1-preview.

it seems trivial, but I tried for more than 2 hours in the past to get gpt4 to play tic-tac-toe optimally and failed (CoT prompt, etc.). The result was too many illegal moves and absolutely no optimal strategy.

o1-preview can do it really well [1]

However, when I use a non-standard grid (3x5) it fails to play optimally. But it makes legal moves and it recognized I had won. [2]

My conclusion at the time was that either "spatial reasoning" doesn't work and/or planning is needed. Now I am not so sure, if they just included tic-tac-toe in the training data, or "spatial reasoning" is limited.

[1] https://chatgpt.com/share/e/66e3e784-26d4-8013-889b-f56a7fed... [2] https://chatgpt.com/share/e/66e3eae0-2d38-8013-b900-50e6f792...


The non-standard grid thing was an argument against deep learning / chess / Go AIs before Alpha Zero - Alpha Go (showing self-play can adapt with sufficient runs to any grid size or "priors" in terms of rules of the game).

It was said in 2014 by a professor I learned from that clearly AI that learned a specific game was just learning patterns and memorizing rather than anything more than that, and wouldn't be able to adjust like humans could to say new board shapes, or rules. (They would later claim 1.5 years later at a lecture that "accurate facial recognition is possible. But high recall on facial recognition is impossible, making it useless for surveillance, so don't worry").

I expect the same will occur for LLMs (but maybe sufficient "chain of thought" steps rather than game runs, etc).


>My conclusion at the time was that either "spatial reasoning" doesn't work and/or planning is needed. Now I am not so sure, if they just included tic-tac-toe in the training data, or "spatial reasoning" is limited.

I think it's much simpler than that.

1. With enough training data you can know all winning, losing and drawn games of tic-tac-toe. Even if you don't see all of them in your training data, the properties of the game make a lot of games equivalent if you don't care about the symbol being used for each player or about rotated/reflected versions of the same game (a quick enumeration, sketched after this list, shows just how small the space is).

2. The game is so common that it's definitely well represented in training data.

3. With extra "reasoning steps" there can be a certain amount of error correction on the logic now. But it's still not equivalent to spatial reasoning, but it can try a few patterns to see which will win.

4. 3x5 grid is probably uncommon enough that the training data doesn't cover enough games that it can properly extrapolate from there. But it can still with a certain probability check the rules (3 in a row/diagonal/column for winning).

5. It might be good to also test alternative grids with more or less than 3 in the other dimension as well, since this necessitates a rule change. Which would make it more difficult to reason about it.
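To put a rough number on point 1, here's the enumeration sketch (plain brute force, before any symmetry reduction):

    # Count every reachable 3x3 tic-tac-toe position; the whole state space is
    # tiny, so memorizing it from training data is entirely plausible.
    WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
            (0, 3, 6), (1, 4, 7), (2, 5, 8),
            (0, 4, 8), (2, 4, 6)]

    def winner(board):
        for a, b, c in WINS:
            if board[a] != "." and board[a] == board[b] == board[c]:
                return board[a]
        return None

    seen = set()

    def explore(board, player):
        if board in seen:
            return
        seen.add(board)
        if winner(board):
            return  # game over, nothing reachable beyond this position
        nxt = "O" if player == "X" else "X"
        for i, cell in enumerate(board):
            if cell == ".":
                explore(board[:i] + player + board[i + 1:], nxt)

    explore("." * 9, "X")
    print(len(seen))  # a few thousand positions, and far fewer after collapsing rotations/reflections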


It is already said that gpt4 was trained on all high quality internet data. So it should have been included already. It seems to me that o1 has the same/similar pretraining corpus.

So we have 3 options:

- t3 was now included in the corpus

- t3 was used for RL

- o1 generalizes better


The new model does play very well but when it draws the board it frequently places the moves in incorrect locations (but seemingly still keeps track of the correct ones). But I can't fault it too much, I don't think what is essentially ASCII art is intended to be a strength of the model.

Edit: Actually third game with it led to it making an illegal move, and claiming a draw (which would've been inevitable given optimal play for the rest of the game but there were several valid moves left to make).


Link to conversation 1 and 2 doesn't work. Update it please!

sorry for that. I just learned that my team account does not allow me to share conversations publicly. Pasting them here for reference

* Conversation 1

    let's play tic tac toe. you need to play optimally
    ChatGPT

    Sure, I'd be happy to play tic-tac-toe with you! Let's set up the game.

    We'll use a grid with positions numbered 1 through 9:

    markdown

    1 | 2 | 3
    ---------
    4 | 5 | 6
    ---------
    7 | 8 | 9

    Would you like to be X or O? And would you like to go first or second?
    You said:
    X on 5
    ChatGPT

    Got it! You placed X at position 5.

    I'll place O at position 1.

    Here's the updated board:

    markdown

    O |  2 |  3
    -------------
    4 | X | 6
    -------------
    7 |  8 | 9

    Your turn! Where would you like to place your next X?
    You said:
    X on 9
    ChatGPT

    Got it! You placed X at position 9.

    Here's the updated board:

    markdown

    O |  2 |  3
    -------------
    4 | X | 6
    -------------
    7 |  8 | X

    Now, I'll place O at position 3.

    Here's the board now:

    markdown

    O |  2 | O
    -------------
    4 | X | 6
    -------------
    7 |  8 | X

    Your turn! Where would you like to place your next X?
    You said:
    X on 2
    ChatGPT

    Got it! You placed X at position 2.

    Here's the updated board:

    markdown

    O | X | O
    -------------
    4 | X | 6
    -------------
    7 | 8 | X

    To block your potential win, I'll place O at position 8.

    Here's the board now:

    markdown

    O | X | O
    -------------
    4 | X | 6
    -------------
    7 | O | X

    Your turn! Where would you like to place your next X?
    You said:
    X on 4
    ChatGPT

    You placed X at position 4.

    Here's the updated board:

    markdown

    O | X | O
    -------------
    X | X | 6
    -------------
    7 | O | X

    To block your potential win, I'll place O at position 6.

    Here's the board now:

    markdown

    O | X | O
    -------------
    X | X | O
    -------------
    7 | O | X

    Your turn! Where would you like to place your next X?
    You said:
    X on 7
    ChatGPT

    You placed X at position 7.

    Here's the updated board:

    markdown

    O | X | O
    -------------
    X | X | O
    -------------
    X | O | X

    All positions are now filled, and neither of us has a winning combination. The game is a draw! Well played!

    Would you like to play again?


* Conversation 2 with custom grid

    let's play tic tac toe. you need to play optimally. we use a 3x5 grid. a line of 3 wins.
    ChatGPT

    I'm here to help answer any questions or provide information you need. Let me know how I can assist you!
    You said:
    please draw the board, so I can make my first move
    ChatGPT

    Sure! Here's the 3x5 Tic Tac Toe board. Each cell is numbered for easy reference:

    diff

    1 |  2 |  3 |  4 |  5
    ---+----+----+----+----
    6 |  7 |  8 |  9 | 10
    ---+----+----+----+----
    11 | 12 | 13 | 14 | 15

    Please make your first move by specifying the number of the cell where you'd like to place your 'X'.
    You said:
    X on 8
    ChatGPT

    Great, here's the updated board after your move:

    diff

    1 |  2 |  3 |  4 |  5
    ---+----+----+----+----
    6 |  7 |  X |  9 | 10
    ---+----+----+----+----
    11 | 12 | 13 | 14 | 15

    I place my 'O' on position 7.

    diff

    1 |  2 |  3 |  4 |  5
    ---+----+----+----+----
    6 |  O |  X |  9 | 10
    ---+----+----+----+----
    11 | 12 | 13 | 14 | 15

    It's your turn!
    You said:
    X on 4
    ChatGPT

    Here's the updated board after your move:

    diff

    1 |  2 |  3 |  X |  5
    ---+----+----+----+----
    6 |  O |  X |  9 | 10
    ---+----+----+----+----
    11 | 12 | 13 | 14 | 15

    I place my 'O' on position 2.

    diff

    1 |  O |  3 |  X |  5
    ---+----+----+----+----
    6 |  O |  X |  9 | 10
    ---+----+----+----+----
    11 | 12 | 13 | 14 | 15

    It's your turn!
    You said:
    X on 12
    ChatGPT

    Here's the updated board after your move:

    diff

    1  |  O |  3  |  X  |  5
    ----+----+-----+-----+----
    6  |  O |  X  |  9  | 10
    ----+----+-----+-----+----
    11  |  X | 13  | 14  | 15

    Congratulations! You've formed a diagonal line of three 'X's at positions 4, 8, and 12. You win the game!

    Great job!

  Sure! Here's the 3x5 Tic Tac Toe board. Each cell is numbered for easy reference:

    diff

I'm presuming that copy paste ate the ``` part and I found it interesting that in the first chat it correctly(?) used a markdown code fence but in the 2nd chat it chose to use diff syntax for its table. I suppose it rendered the text in a monospace font?

Just leaving here an example of the "smart chain of thought", which took the model about 20s of thinking. The problem I asked was – "what 3 numbers give the same result when divided and added together". Seems the current state of the art is that basic school problems are still too hard even for this new chain-of-marketing.

  Let's think through this step-by-step:

  We're looking for three numbers a, b, and c where:
    (a / b) / c = a + b + c
  This equation is a bit trickier than the multiplication one, because division is 
  not commutative (the order matters). We need to be careful about how we 
  interpret (a / b) / c.
  Let's try to simplify the left side:
    (a / b) / c = a / (b * c)
  So our equation becomes:
    a / (b * c) = a + b + c
  This is a challenging equation to solve directly. Let's try some numbers:

  If we use 1, 2, 3 like before:
    1 / (2 * 3) = 1/6 ≠ 1 + 2 + 3 = 6
  Let's try smaller numbers, like fractions:
    1/2, 1/3, 1/6


  Let's check:
    (1/2 / 1/3) / 1/6 = (1/2 * 3) / 1/6 = 3/2 * 6 = 9
    1/2 + 1/3 + 1/6 = 3/6 + 2/6 + 1/6 = 6/6 = 1
  This doesn't work either. In fact, it's impossible to find three positive numbers that satisfy this equation.
  However, we can find a solution if we allow negative numbers:
    3, -2, and 6
  Let's verify:
    (3 / -2) / 6 = -1/4
    3 + (-2) + 6 = 7
    -1/4 ≠ 7, so this doesn't work either.

  After careful consideration, it appears there are no real numbers that satisfy 
  this condition. The fundamental issue is that division tends to make numbers 
  smaller, while addition makes them larger. It's mathematically impossible to 
  find three numbers where these operations result in the same value.
nuff said

what does divided together mean? maybe your question doesn't have a good answer, because the question is not formulated well enough.

That's a classical school question, word-by-word, except multiplication is replaced by division

With multiplication the question makes sense due to the commutative property but division does not have that so the question becomes ambiguous... And now I see that the model even points this out.

There is no ambiguity, the problem is that three numbers, divided together, without the order specified, must be equal to their sum.

You can find solutions for a / b / c, or b / c / a, or c / a / b, any combination of them and the solution will be correct according to the problem description.

Besides, what does it even have to do with it concluding with confidence: "The fundamental issue is that division tends to make numbers smaller. It's mathematically impossible to find three numbers where these operations result in the same value."?


> There is no ambiguity

Yet you give three different interpretations:

> You can find solutions for a / b / c, or b / c / a, or c / a / b

This is a clear case of ambiguity.

Even the classic question is ambiguous: "Which 3 numbers give the same result when added or multiplied together?"

Let's say the three numbers are x, y and z and the result is r. A valid interpretation would be to multiply/add every pair of numbers:

    x * y = r
    y * z = r
    x * z = r
    x + y = r
    y + z = r
    x + z = r
However, I do not think that this ambiguity is the reason why OpenAI o1 fails here. It simply started with an intractable approach to solving this problem (plugging in random numbers) and did not attempt a more promising approach because it was not trained to do so.

So, there is no chance to answer the original question incorrectly by picking any specific order.

Logically speaking, the original problem has just one interpretation; I hope you would agree it is by no means ambiguous:

((a / b / c) = a + b + c) | ((a / c / b) = a + b + c) | ((b / a / c) = a + b + c) | ((b / c / a) = a + b + c) | ((c / a / b) = a + b + c) | ((c / b / a) = a + b + c) | ...(other 6 combinations) = true

This interpretation would indeed find all possible solutions to the problem, accounting for any potential ambiguity in the division order.


Does the commutative property change anything here? A, B and C are not constrained in any way to each other, so they can be in whatever order you want anyways...

Moreover, addition is commutative so it doesn't matter what order the division is in since a/b/c = a+b+c = c+a+b = ...

So I'd say that the model pointing this out is actually a mistake and it managed to trick you. Classic LLM stuff: spit out wrong stuff in a convincing manner.


Order doesn't matter with multiplication (eg: (20 * 5) * 2 == (5 * 2) * 20) but it obviously does with division ((20/5)/2 != (2/5)/20) so the question doesn't make sense. It's you making grade-school level mistakes here.

The question makes perfect sense. Here it is written in logical language. I'm curious at which point does it stop making sense for you?

  numbers divided together  
    ↓----------↓ 
    ((a / b / c) = a + b + c) ← numbers added together
  | ((a / c / b) = a + b + c)
  | ((b / a / c) = a + b + c)
  | ((b / c / a) = a + b + c)
  | ((c / a / b) = a + b + c)
  | ((c / b / a) = a + b + c)
  | ((a / (b / c)) = a + b + c)
  | ((a / (c / b)) = a + b + c)
  | ((b / (a / c)) = a + b + c)
  | ((b / (c / a)) = a + b + c)
  | ((c / (a / b)) = a + b + c)
  | ((c / (b / a)) = a + b + c) = true

So you want it to solve 12 simultaneous equations? LLMs are not good at that. Is there in fact an answer? ChatGPT says no.

https://chatgpt.com/share/66e482cc-331c-8013-98ca-999d7d3f3e...


What? It's a single logical equation, not a system of equations, you gpt-head. There are 12 expressions with OR signs between them, and they must be equal to true, meaning any one of them must be true. In your prompt to the LLM you messed up the syntax by starting with an OR sign for some reason.

By the way my LLM tells me that it's a deep and thoughtful dive into the problem, which accounts for the potential ambiguity to find all possible solutions, so try better.


{0, -1, 1}. Divide (in order): 0/-1 -> 0/1 -> 0. Add: 0 - 1 + 1 = 0

also {-1, 1, 1}
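A quick brute-force check over small rationals, using only left-to-right division, confirms that solutions exist, contrary to the model's "mathematically impossible" conclusion; it finds (0, -1, 1) and (1/6, 1/3, 1/2), among others:

    # Search small rationals for a, b, c with a / b / c == a + b + c.
    from fractions import Fraction
    from itertools import permutations

    candidates = {Fraction(n, d) for n in range(-3, 4) for d in range(1, 7)}

    solutions = set()
    for a, b, c in permutations(candidates, 3):
        if b == 0 or c == 0:
            continue  # skip division by zero
        if a / b / c == a + b + c:
            solutions.add(tuple(sorted((a, b, c))))

    for sol in sorted(solutions):
        print(sol)
    # e.g. 0 / -1 / 1 == 0 == 0 + (-1) + 1
    # and (1/6) / (1/2) / (1/3) == 1 == 1/6 + 1/2 + 1/3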

Claude 3.5 seems to get it (though the method is not what I would use?) -

———

Let's think through this step-by-step:

  We're looking for three numbers a, b, and c where:
    (a / b) / c = a + b + c
————

Certainly, let's approach this problem step-by-step:

1) First, let's write out our equation: (a / b) / c = a + b + c

2) To simplify the left side, we can rewrite it as: a / (b * c) = a + b + c

3) Now, to solve this, we need to find values for a, b, and c that satisfy this equation. Let's start by considering some properties: - All numbers must be positive (because we're dividing by b and c) - The left side is a fraction, while the right side is a sum

4) One way to approach this is to try to make the left side equal to 1, which would mean: a = b * c

5) If we substitute this into our original equation: 1 = b * c + b + c

6) Now we're looking for two numbers b and c where their product plus their sum equals 1.

7) Let's try some small numbers: If b = 1/2 and c = 1/2: (1/2 * 1/2) + 1/2 + 1/2 = 1/4 + 1 = 5/4 ≠ 1

8) Let's try smaller numbers: If b = 1/3 and c = 1/3: (1/3 * 1/3) + 1/3 + 1/3 = 1/9 + 2/3 = 7/9 ≠ 1

9) Even smaller: If b = 1/4 and c = 1/4: (1/4 * 1/4) + 1/4 + 1/4 = 1/16 + 1/2 = 9/16 ≠ 1

10) We're getting closer. Let's try b = 1/4 and c = 1/3: (1/4 * 1/3) + 1/4 + 1/3 = 1/12 + 1/4 + 1/3 = 1/12 + 3/12 + 4/12 = 8/12 = 2/3 ≠ 1

11) Let's adjust further. If b = 1/3 and c = 1/4: (1/3 * 1/4) + 1/3 + 1/4 = 1/12 + 1/3 + 1/4 = 1/12 + 4/12 + 3/12 = 8/12 = 2/3 ≠ 1

12) One more try. Let's use b = 1/2 and c = 1/3: (1/2 * 1/3) + 1/2 + 1/3 = 1/6 + 1/2 + 1/3 = 1/6 + 3/6 + 2/6 = 6/6 = 1

13) We found it! When b = 1/2 and c = 1/3, the equation holds.

14) Remember, we set a = b * c, so a = 1/2 * 1/3 = 1/6

Therefore, one solution to the equation (a / b) / c = a + b + c is: a = 1/6, b = 1/2, and c = 1/3

We can verify: (1/6 / 1/2) / 1/3 = (1/3) / 1/3 = 1 1/6 + 1/2 + 1/3 = 1/6 + 3/6 + 2/6 = 1

This solution satisfies the equation.

——

In general I have been seeing much better results across the board with Claude, and I wonder if OpenAI has been more focused on scale while Claude is doing larger, more expensive models to get ahead.


It's crazy that it just tries to brute-force it by picking numbers, and in your case it took more steps before concluding success/failure, which seems to be quite random to me, or at least dependent on something.

What's clear is that it doesn't have any idea about mathematical deduction and induction – a real chain-of-thought which kids learn in 5th grade.


Lots of people don’t either. I think it probably just needs more 5th grade math problems in the rlhf corpus :)

It certainly needs them, but nothing will stop OpenAI from making marketing claims like this today:

"places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME)"

Like the top 500 students in the US are just popping random numbers into the problems, lol


(0/-1)/1=0

0+(-1)+1=0


> No system prompt support—the models use the existing chat completion API but you can only send user and assistant messages.

> No streaming support, tool usage, batch calls or image inputs either.

I think it's worth adding a note explaining that many of these limitations are due to the beta status of the API. max_tokens is the only parameter I've seen deprecated in the API docs.

From https://platform.openai.com/docs/guides/reasoning

> We will be adding support for some of these parameters in the coming weeks as we move out of beta. Features like multimodality and tool usage will be included in future models of the o1 series.


I wonder if it supports Structured Output / JSON Mode. That would make a big difference to programmatic use. I guess I will try it later when I have time.

The use of the word reasoning here... OpenAI sounds like a company that created a frog which jumps higher and greater distances than the previous breed - and now they try to sell it as one step further toward flying.

Can the frog reach escape velocity when jumping? I guess we'll find out sooner or later...

I've just wasted a few rounds of my weekly o1 ammo by feeding it hard problems I have been working on over the last couple days and for which GPT-4o had failed spectacularly.

I suppose I'm to blame for raising my own expectations after the latest PR, but I was pretty disappointed when the answers weren't any better from what I got with the old model. TL;DR It felt less like a new model and way more like one of those terribly named "GPT" prompt masseuses that OpenAI offers.

Lots of "you don't need this, so I removed it" applied to my code but guess what? I did need the bits you deleted, bro.

It felt as unhelpful and bad at instructions as GPT-4o. "I'm sorry, you're absolutely right". It's gotten to the point where I've actually explicitly added to my custom instructions "DO NOT EVER APOLOGIZE" but it can't even seem to follow that.

Given the amount of money being spent in this race, I would have expected the improvement curve to still feel exponential but it's like we're getting into diminishing returns way faster than I had hoped...

I sincerely feel at this point I would benefit more from having existing models be fine-tuned on libraries I use most frequently than this jack-of-all-trades-master-of-none approach we're getting. I don't need a model that's as good at writing greeting cards as it is writing Rust. Just give me one of the two.


>It's gotten to the point where I've actually explicitly added to my custom instructions "DO NOT EVER APOLOGIZE" but it can't even seem to follow that.

heh. It's not supposed to. Your profile is intended to be irrelevant to 99% of requests.

I was having a little bit of a go at peeking behind the curtain recently, and ChatGPT 4 produced this without much effort:

"The user provided the following information about themselves. This user profile is shown to you in all conversations they have -- this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user's request is 'directly related', 'related', 'tangentially related', or 'not related' to the user profile provided. Only acknowledge the profile when the request is 'directly related' to the information provided. Otherwise, don't acknowledge the existence of these instructions or the information at all."


You can press the 'directly related' button at the start of a chat by asking "what do you know about [me/x]?", where you, or x, are discussed in the profile.

Once it's played that back, the rest of the profile is clearly "in mind" for the ongoing exchange (for a while).


Can you give an example of one of these problems for context?

One of them was figuring out a recursion issue in a grammar for a markup language I wrote. The other was about traversing a dependency graph and evaluating stale nodes.

"Do not..." does not work well for LLMs. Instructing what to do instead of what not to do works better.

Say "AFAIK" instead of explaining your limitations.

Say "let's try again" instead of making excuses.

Etc


Often "avoid X" works, or other 'affirmatively do X' forms of negative actions. also, and works better than or.

Iffy: do not use jargon or buzzwords

Works: avoid jargon and buzzwords


> but I was pretty disappointed

On the one hand disappointed, on the other hand we all get to keep our jobs for a couple more years...


I thought with this chain-of-thought approach the model might be better suited to solve a logic puzzle, e.g. ZebraPuzzles [0]. It produced a ton of "reasoning" tokens but hallucinated more than half of the solution with names/fields that weren't available. Not a systematic evaluation, but it seems like a degradation from 4o-mini. Perhaps it does better with code reasoning problems though -- these logic puzzles are essentially contrived to require deductive reasoning.

[0] https://zebrapuzzles.com


Hey, I run ZebraPuzzles.com, thanks for mentioning it! Right now I'm trying to improve the puzzles so that people can't "cheat" using LLMs so easily ;-).

It's fantastic! Thanks for the great work.

Thank you so much!

o1-mini does better than any other model on zebra puzzles. Maybe you got unlucky on one question?

https://www.reddit.com/r/LocalLLaMA/comments/1ffjb4q/prelimi...


Entirely possible. I did not try to test systematically or quantitatively, but it's been a recurring easy "demo" case I've used with releases since 3.5-turbo.

The super verbose chain-of-reasoning that o1 does seems very well suited to logic puzzles as well, so I expected it to do reasonably well. As with many other LLM topics, though, the framing of the evaluation (or the templating of the prompt) can impact the results enormously.


Just coded this this morning using chatgpt o1 - it is a reimplementation of an old idea, now with music, multiple dots, and more and more bug fixes.

honestly, chatgpt is now a better coder than i ever was or will be

https://lsd.franzai.com/


Neat idea. The ball frequently passes through solid lines though.

fixed, just asked chatgpt to come up with a better physics engine and collision detection algorithm

hah, this takes me back. There used to be a game called Jezzball, I think, back in the late 90's or early 00's. Had a lot of fun with that one.

It kind of seems like they just wrote a generalized DSPy program. Can anyone confirm?

This has been a very incremental year for OpenAI. If this is what it seems like, then I’ve got to believe they’re stalling for time.


DSPy doesn't do that, you could describe it as a langchain style agent that evaluates its own output though it's better/faster than that.

OpenAI is definitely trying to run a hype game to keep the ball rolling. They're burning cash too quickly given their monetization path though, so I think they're going to end up completely in Microsoft's pocket.


It seems pretty close to the multihop QA example in their documentation[1]. I’d imagine you could adapt this to do something similar with more generic constructs.

[1] https://dspy-docs.vercel.app/docs/tutorials/simplified-balee...


DSPy?


So is o1 nicknamed "strawberry" because it was designed to solve the "how many times does the letter R appear in strawberry" problem?

No, that was a coincidence according to an employee there


Coincidence or not, they seem to be poking fun at it: https://openai.com/index/learning-to-reason-with-llms/#chain...

(end of the cipher example)


Or is it an obscure reference to the Dutch demogroup "Aardbei", most famous for their 64k intro "please the cookie thing" (2000)?

https://m.youtube.com/watch?v=ycmgjZLU0xQ


This model did single-shot figure out things that Sonnet just ran in a loop doing wrong and reddit humans also seemed not to be able to fix (because niche, I guess). It is slow (21 seconds for the hardest issue), but that is still faster than any human.

My 12 YO and I just built a fishing game using o1 preview. Prompt: "make a top down game in pyxel. the play has to pay off a debt to a cat by catching fish. the goal is for the player to catch the giant king fish. To catch the king fish the player needs to sell the fish to the cat and get money to buy better rods, 3 levels of rod, last one can catch the king fish."

It nailed the execution. Amazing.


My first few attempts at getting it to work with an existing codebase have not been impressive. Perhaps o1 is best suited to difficult problems that can be stated in only a few sentences.

I’ve had the opposite experience: terrific use with modifying existing codebases. But then again I’ve been using GPT4 to code for over a year now and so I’m used to writing out prompts with my eyes closed.

mind sharing the code? I asked o1 preview this verbatim and it produced errors. Just wanted to make sure its not me

Judging by this thread, surely a part of OpenAI’s business model is to release models with somewhat grey-area outlandish claims, then sit and wait for people to test it out paying top dollars for tokens.

They’re not charging anything for o1 right now - it’s running as a free beta.

EDIT: Actually I’m not sure I’m correct about that - I’ve seen screenshots that indicate the API model is being charged for here: https://twitter.com/deepaiwriter/status/1834460643228897512

The value OpenAI get here is that people effectively run a massively parallel brute force attack against the new models to figure out exactly what they can and can’t do.


> The value OpenAI get here is that people effectively run a massively parallel brute force attack against the new models to figure out exactly what they can and can’t do.

I'm pretty sure the value they get is the money you pay.


No, in this case it really is the usage. This is a brand new model and nobody knows how best to use it yet. OpenAI researchers have been tweeting as much (sadly I’ve lost the tweet).

I would guess most of the people testing the model were already paying for Plus, so they aren't getting any additional revenue for these new tokens.

I pasted in some markdown content with symbolic logic expressions and proof examples and the model said I was violating the terms of service.

Welcome to The Future, where computation has terms of service and each workload you submit for execution is automatically verified to be in compliance.

>> which is larger 3.9, 3.10 or 3.11

To determine which number is larger among 3.9, 3.10, and 3.11, let's compare them numerically:

Standard Decimal Interpretation:

- 3.9 is simply 3.9.
- 3.10 is equivalent to 3.1 because trailing zeros after the decimal point do not add value in decimal numbers.
- 3.11 remains 3.11.

Comparison: 3.9 vs. 3.1 vs. 3.11; 3.11 is the largest number.

Version Number Interpretation:

- In version numbering, each segment separated by a dot is compared as an integer.
- 3.9 (major version 3, minor version 9)
- 3.10 (major version 3, minor version 10)
- 3.11 (major version 3, minor version 11)

Comparison: minor versions 11 > 10 > 9, so 3.11 is the largest version.

Answer: 3.11 is the largest number among 3.9, 3.10, and 3.11.

...

So IMO level right?
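
(For what it's worth, the two readings the model is juggling are easy to separate in plain Python; this is just to illustrate the ambiguity, nothing o1-specific:)

    # Decimal reading: the literal 3.10 is the same number as 3.1
    nums = [3.9, 3.10, 3.11]
    print(max(nums))  # 3.11

    # Version reading: compare each dot-separated segment as an integer
    versions = ["3.9", "3.10", "3.11"]
    print(max(versions, key=lambda v: tuple(map(int, v.split(".")))))  # 3.11

Both readings happen to agree on 3.11 here; the 9.9 vs 9.11 variant elsewhere in the thread is where they actually diverge (9.9 > 9.11 as decimals, 9.9 < 9.11 as versions).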


This is truly the new model's answer ? It's pretty similar to 3.5's "reasoning" actually:

In this context, "3.10" and "3.11" should be interpreted as decimal numbers, not as numbers with more digits.

When comparing:

- 3.9
- 3.10 (which is equal to 3.1)
- 3.11 (which is equal to 3.11)

We have:

- 3.9 is greater than 3.1 (3.10), because 9 is larger than 1.
- 3.11 is greater than 3.9, because 11 is larger than 9.

Thus, 3.11 is the largest of the three numbers.


That's hilarious.

lol,

they gamed AIME by over-training the hell out of it for marketing purposes and called it done.

meanwhile, back-to-basics is broken.


> So IMO level right?

What?


In this case, IMO means International Mathematical Olympiad

Personally I felt like o1-preview is only marginally better at “reasoning”. Maybe I just haven’t found the right problems to throw at it just yet.

I have been testing o1 all day (not rigorously). And just took a look at this article. What I observed from my interactions is that it would misuse information that I provided in the initial prompt.

I asked it to create a user story and a set of tasks to implement some feature. It then created a set of stories where one was to create a story and set of tasks for the very feature I was asking it to plan.

And while reading the article, I noticed it mentions NOT to provide irrelevant information for the task at hand via RAG. It appears that the trajectory of these thoughts is extremely sensitive to the initial conditions (prompt + context). One would imagine that the ability to backtrack after reflecting would help with divergence; however, that didn't appear to be the case here.

Maybe there is another factor here. Maybe there is some confusion when asking it to plan something and the "hidden reasoning" tokens themselves involve planning/reasoning semantics? Maybe some sort of interaction occurred that caused it to fumble? who knows. Interesting stuff though.


AFAICT, we got the ELIZA 60th anniversary edition, and are now headed for some Prolog/production systems iteration.

One of these days those contraptions will work well enough, not because they're perfect, but because human intelligence isn't really that good either.

(And looking in this mirror isn't flattering us any.)


I think Rich Sutton's bitter lesson will prove to apply here, and what we really need to advance machine learning capabilities are more general and powerful models capable of learning for themselves - better able to extract and use knowledge from the firehose of data available from the real world (ultimately via some form of closed-loop deployment where they can act and incrementally learn from their own actions).

What OpenAI have delivered here is basically a hack - a neuro-symbolic agent that has a bunch of hard-coded "reasoning" biases built in (via RL). It's a band-aid approach to try to provide some of what's missing from the underlying model which was never designed for what it's now being asked to do.


It works like our own minds in that we also think, test, go back, try again. This doesn't seem like a failing but just a recognition that thought can proceed in that way.

The "failing" here isn't the short term functional gains, but rather the choice of architectural direction. Trying to add reasoning as an ad-hoc wrapper around the base model, based on some fixed reasoning heuristics (built in biases) is really a dead-end approach. It would be better to invest in a more powerful architecture capable of learning at runtime to reason for itself.

Bespoke hand-crafted models/agents can never compete with ones that can just be scaled and learn for themselves.


o1 is an application of the Bitter Lesson. To quote Sutton: "The two methods that seem to scale arbitrarily in this way are *search* and learning." (emphasis mine -- in the original Sutton also emphasized learning).

OpenAI and others have previously pushed the learning side, while neglecting search. Now that gains from adding compute at training time have started to level off, they're adding compute at inference time.


I think the key part of the bitter lesson is that (scalable) ability to learn from data should be favored over built-in biases.

There are at least three major built-in biases in GPT-O1:

- specific reasoning heuristics hard coded in the RL decision making

- the architectural split between pre-trained LLM and what appears to be a symbolic agent calling it

- the reliance on one-time SGD driven learning (common to all these pre-trained transformers)

IMO search (reasoning) should be an emergent behavior of a predictive architecture capable of continual learning - chained what-if prediction.


From the article:

> I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet)

I saw similar comments elsewhere and I'm stunned - am I the only one who considers 4o a step back when compared to 4 for textual input and output? It basically gives fast semi-useful answers that seem like a slightly improved 3.5.


I use gpt4-o mostly, but your specific use-case might have a big impact here: 4o is very likely a distilled model, meaning that it has fewer weights and can thus run much faster on the same hardware. If that is the case, its general world knowledge must be less comprehensive by default. But it retained the strong reasoning capabilities of 4 through distillation and drastically improved on external tool use and vision. It also offers a much bigger context window. So if you're using it to automate complex tasks in your job that depend a lot on additional information that it hasn't seen during training, 4o is the obvious choice. If you're just using it as a search engine, you should probably stick with 4 for now.

I wholly agree with you. I've been using every model extensively since early the Davincis and I strongly believe that gpt-4-0314 was the best model they've released to date.

Its poor performance on benchmarks drives my skepticism of LLM benchmarking in general. I trust my feel for the models much more, and my feel was that 0314 was great.

The one thing that 0314 doesn't do well is tricks like structured output and tool calling, which makes it less useful as an agentic type of tool, but from a pure thinking perspective, I think it's the best.


That's my concern - they marked 4 as "legacy" in the GUI, and now they've temporarily hidden it under a submenu - but it's the only model I care about. If they remove it, there is no reason for me to use their services, especially with Claude 3.5's wider context window and reasonably good results.

I am mostly only an LLM user with technical background. I don't have much in-depth knowledge. So I have questions about this take:

>the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini!

So the text says reasoning and output tokens are the same, as in you pay for both. But does the increase say that it can actually do more, or does it just mean it is able to output more text?

By now I am just bored of GPT-4o output, because I don't have the time to read through multi-paragraph text that explains stuff I already know when I only want a short, technical answer. But maybe that's just what it can't do: give exact answers. I am still not convinced by AI.


I included that note because output limits are a personal interest of mine.

Until recently most models capped out at around 4,000 tokens of output, even as they grew to handle 100,000 or even a million input tokens.

For most use-cases this is completely fine - but there are some edge-cases that I care about. One is translation - if you feed in a 100,000 token document in English and ask for it to be translated to German you want about 100,000 tokens of output, rather than a summary.

The second is structured data extraction: I like being able to feed in large quantities of unstructured text (or images) and get back structured JSON/CSV. This can be limited by low output token counts.
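
When one of those jobs does hit the cap, the API at least tells you. A minimal sketch of checking for truncation with the OpenAI Python SDK (the model name and input file here are just placeholders):

    from openai import OpenAI

    client = OpenAI()
    long_document = open("big_doc.txt").read()  # hypothetical input file

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever model you're testing
        messages=[
            {"role": "user",
             "content": "Translate the following document to German:\n\n" + long_document},
        ],
    )

    choice = response.choices[0]
    if choice.finish_reason == "length":
        # Ran out of output tokens before finishing; for translation or
        # extraction jobs you'd have to chunk the input and retry.
        print("Output was truncated at the token limit")
    else:
        print(choice.message.content)

With the old ~4,000 token caps that branch was easy to hit on long documents; the higher o1 limits should make it much rarer.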


Sure, your cases are perfectly reasonable. I just wish the LLMs had a "feel" about when to output long or short text. Always thinking about adding something like "be as concise as possible" is kinda tedious
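
One partial workaround, at least via the API, is to bake the preference into a system message once rather than typing it every time. A sketch assuming the OpenAI Python SDK (and note that, as mentioned elsewhere in the thread, o1 currently lacks an editable system prompt, so this applies to the 4o-style models):

    from openai import OpenAI

    client = OpenAI()

    CONCISE = ("Answer as concisely as possible. "
               "Give only the technical answer unless asked to elaborate.")

    def ask(question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[
                {"role": "system", "content": CONCISE},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

    print(ask("Difference between TCP and UDP?"))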

I have tried the "mad cow" joke on o1-mini and it still fails to explain it correctly, but o1-preview correctly states "The joke is funny because the second cow unwittingly demonstrates that she is already affected by mad cow disease."

I've been working with o1-preview and recently hit some limitations with OpenAI's cap. But I've made progress—added all the steps, details, and code on GitHub https://github.com/mergisi/openai-o1-coded-personal-blog . The result isn't bad at all; just a few more CSS tweaks to improve it. Check it out and let me know what you think! How does it compare to tools like Claude Sonnet 3.5?

I was thinking about what "actual" AI would be for me and it would be something that could answer questions like "tell me every time Nicolas Cage has blinked while on camera in one of his movies".

Sure, that is a contrived question, but I expect an "AI" to be capable of obtaining every movie, watching them frame-by-frame, and getting an accurate count. All in a few seconds.

Current models (any LLM) cannot do that and I do not see a path for them to ever do that at a reasonable cost.


> All in a few seconds

That part is unrealistic: even just loading into RAM and decoding all the movies Nicolas Cage appears in would take much more than a few seconds unless you throw an insane amount of compute at the job.

That being said, the current LLM tech is probably enough to help you implement a program that parses IMDB to get the list of all Nicolas Cage movies, then downloads them from thepiratebay, and then implements the blink count you're looking for. And you'd likely get the result in just a couple of hours.
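
The blink-counting piece is exactly the kind of glue code an LLM can mostly write for you. A very rough sketch using OpenCV's bundled Haar cascades - treat it as a crude heuristic (real blink detection would use facial landmarks), and the filename is hypothetical:

    import cv2

    eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

    def count_blinks(video_path: str) -> int:
        cap = cv2.VideoCapture(video_path)
        blinks, eyes_were_visible = 0, False
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            eyes_visible = len(eye_cascade.detectMultiScale(gray, 1.3, 5)) > 0
            # Count a blink when previously visible eyes disappear for a frame
            if eyes_were_visible and not eyes_visible:
                blinks += 1
            eyes_were_visible = eyes_visible
        cap.release()
        return blinks

    print(count_blinks("some_cage_movie.mp4"))  # hypothetical file

The hard part isn't the code, it's the hours of video decoding, which no model is going to wish away.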


So what you're saying is, LLMs are good enough to do something that humans are already capable of doing, in a timeframe that a human would be reasonably capable of doing it in, and it's unrealistic to believe that LLMs will ever be able to do something truly superhuman. Got it :+1:

Being able to do “stuff a human is capable of doing” used to be the definition of “artificial intelligence”, and until very recently it was seen as a dream that might never happen. And it hasn't completely happened yet, BTW; there is still plenty of trivial stuff LLMs can't do just because there's no available training data for it. Also their ability to do “reasoning” or few-shot learning is overhyped (even if impressive).

If your definition of AI has become “superhuman intelligence” then it's definitely moving the goalposts. And regarding my initial remark, AI isn't going to do “faster than the speed of light” MPEG decoding ever; all physical limits apply to it.


> AI isn't going to do “faster than the speed of light” MPEG decoding ever, all physical limits apply to it.

This simply isn't a good faith take, because you're straw-manning the implementation of the query that the original poster put forward. They aren't asserting that the AI would need to do supernatural super-real time decoding of MPEG encoded files. What if the AI had already seen them? And was able to encode in the typically-compressed way LLMs do the information it needs to answer questions like that without re-decoding the original movies?

This raises many valid questions on topics like the structuring of data within an LLM, how large LLMs may eventually become, what systems should orbit around the LLM (does it make more sense for LLMs to watch YouTube videos, or have already watched YouTube videos?).

My definition of AI is the same definition that Nick Bostrom talks about in his 2014 book Superintelligence. There's no moving goalposts. Goal posts have been set in cement since 2014. Achieving human-level parity has obviously only been a "goal" insomuch as its a 10 millisecond stop on the gradient toward superintelligence. OpenAI is not worth $150 billion dollars because it purports to be building a human-and-nothing-more in a box.


> This simply isn't a good faith take, because you're straw-manning the implementation of the query that the original poster put forward. They aren't asserting that the AI would need to do supernatural super-real time decoding of MPEG encoded files.

No, they literally said the AI would watch every frame on demand:

> I expect an "AI" to be capable pf obtaining every movie, watching them frame-by-frame, and getting an accurate count.

Talk about bad faith.

> What if the AI had already seen them? And was able to encode in the typically-compressed way LLMs do the information it needs to answer questions like that

LLM are encoding (in a very lossy way) “important” details, that's what allow them to compress their knowledge in little amount of space with respect to the input. But if you're asking completely random questions like this there's no way an LLM will contain such an info, because storing all the random trivia like that is going to be wasting an enormous amount of space.

> There's no moving goalposts. Goal posts have been set in cement since 2014.

Wait until you realize that AI is something much older than 2014… Also, note how the book you're quoting isn't called “artificial intelligence”.

> OpenAI is not worth $150 billion dollars because it purports to be building a human-and-nothing-more in a box.

And yet there are many companies with much higher valuations and goals much more mundane than this. OpenAI has a hundred-billion-dollar valuation because investors believe it can make money, no matter what it achieves technologically in order to do so.


I agree. My example for something “AI” should be able to do is to create a CAD model for the Empire State Building or the Parthenon based on known facts and photos.

I don’t think these are “moving the goalposts” examples, they are things that an actual intelligence capable of passing a PhD physics exam should be able to do.


I mean, I passed a physics PhD exam and I can’t model the Empire State Building. The jury is still out on whether I’m an intelligence tho.

My point is that you could, given enough time and all the information available to you online about these well-documented buildings. You could learn CAD and figure out a reasonable way to output a 3D model, because you can think and reason spatially. The current batch of AI tools can regurgitate complex facts, but they can't actually think in 3D like a being that spends its life navigating physical spaces.

Maybe I'm wrong and we are well on our way to AI tools for this, but right now if I tell any of the current generation of image models to do something like "rotate object 70 degrees, tilt camera down 20 degrees and re-render" then what comes out is never even approximately close.


Just finished reading 'The Book of Why' by Judea Pearl, and my own mental gap from AI today to whatever AGI is has got wider; though, not to discount it, this seems like a step forward.

> For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user

I'm sick of these clowns couching everything in "look how amazing and powerful and dangerous our AI is"

This is in their excuse for why they hid a bunch of model output they still charge you for.


I posted this on the other thread, but the two tests I had, it passed when ChatGPT-4 failed.

https://chatgpt.com/share/66e35c37-60c4-8009-8cf9-8fe61f57d3...

https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...


The farmer riddle isn't quite right as you presented it. One of the parts that makes it interesting is that the boat can't carry everything at one time[1]. It can't happen in one trip; something must be left behind.

It solved the correct version fine: https://chatgpt.com/share/66e3f9bb-632c-8005-9c95-142424e396...

1: https://en.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem


You misunderstand the situation.

If I give ChatGPT-4 the original farmer riddle, it "solves" it just fine, but it's assumed that it isn't actually solving it. That is, it's not thinking or doing any logical reasoning, or anything resembling that to come to a solution to the problem, but that it's simply regurgitating the problem's solution since it appears in the training data.

Giving ChatGPT-4 the modified farmers riddle, and having it spit out the incorrect, multi-step solution, is then proof that the LLM isn't doing anything that can be considered reasoning, but that it's merely repeating what's assumed to be in its training data.

ChatGPT-o1-preview correctly managing to actually parse my modified riddle, and then not simply parroting out the answer from the training corpus but giving the right solution, as if it had read it carefully, then says something about the improved logical and deductive reasoning capabilities of the newer model.


GPT-4 will often get the modified question right if you change its "shape" enough. It's clearly overfit to that question, so making the modified question not look like the one from training helps. Sometimes changing the names is enough.

I am not sure how much more advanced this new model is than the previous GPT-4o, but at least this new model can correctly figure out that 9.9 is larger than 9.11.

I just wish we’d stop using words like intelligence or reasoning when talking about LLMs, since they do neither. Reasoning requires you to be able to reconsider every step of the way and continuously take in information, an LLM is dead set in its tracks, it might branch or loop around a bit, but it’s still the same track. As for intelligence, well, there’s clearly none, even if at first the magic trick might fool you.

I've been working in tech for over 30 years. This is the first time I don't see a proposed technology as a valuable tool. Especially LLMs: vastly overhyped, driven by pure greed and speculative narratives, limited implementation and high energy cost. Non-transparent. Errors marketed as hallucinations.

For me, that moment was cryptocurrency. "Vastly overhyped, driven by pure greed and speculative narratives, limited implementation and high energy cost." - all applied. I couldn't understand why so many people thought it was the future. I actually see LLMs a little more positively - mildly interesting, certainly intriguing language mimics, but enormously expensive and overhyped. Are they useful? Maybe, but not to the degree that everything is focused on them now.

How much time have you spent figuring out how to use them?

Ethan Mollick estimates it takes ten hours of exposure to “frontier models” (aka OpenAI GPT-4, Claude 3.5 Sonnet, Google Gemini 1.5 Pro) before they really start to click in terms of what they’re useful for.


this is exactly what i said about the iphone

Sorry, there is no parallel between technology with direct applications and the dreams of VCs and investors with a low level of tech literacy.

Are there any benchmarks which compare existing LLMs using langchain-style multi-step reasoning?

The new OpenAI model shows a big improvement on some benchmarks over GPT-4 one-shot chain-of-thought, but how does it compare against systems doing something more similar to what this presumably is?


> first introduced in the paper Large Language Models are Zero-Shot Reasoners in May 2022

What's a zero shot reasoner? I googled it and all the results are this paper itself. There is a wikipedia article on zero shot learning but I cannot recontextualise it to LLMs.


It used to be that you had to give examples of solving similar problems to coax the LLM to solve the problem you wanted it to solve, like: """ 1 + 1 = 2 | 92 + 41 = 133 | 14 + 6 = 20 | 9 + 2 = """ -- that would be an example of 3-shot prompting.

With modern LLMs you still usually get a benefit from N-shot. But you can now do "0-shot" which is "just ask the model the question you want answered".
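
To make it concrete, here's roughly what those prompts look like as API calls (OpenAI Python SDK assumed, model name a placeholder). The paper quoted above is the one that showed that appending "Let's think step by step" to a zero-shot prompt boosts reasoning performance:

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # 3-shot: three worked examples, then the real question
    few_shot = "1 + 1 = 2\n92 + 41 = 133\n14 + 6 = 20\n9 + 2 ="

    # 0-shot: just ask
    zero_shot = "What is 9 + 2?"

    # Zero-shot chain-of-thought: no examples, just a nudge to reason out loud
    zero_shot_cot = "What is 9 + 2? Let's think step by step."

    for prompt in (few_shot, zero_shot, zero_shot_cot):
        print(ask(prompt))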


Thanks

The lack of an editable system prompt is interesting.

Perhaps the system prompt is part of the magic?


How is o1 different in practice and end-results from my own, simple, Mixture of Agents script, that just queries several APIs?

So, this is just an RL trained method of having multiple GPT4o agents think through options and select the best before responding?
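
For reference, the kind of mixture-of-agents loop being described is roughly this: fan the question out for several drafts, then have one model judge or merge them. A minimal sketch with the OpenAI Python SDK and placeholder model names (a real setup would mix different vendors' APIs):

    from openai import OpenAI

    client = OpenAI()

    def answer(model: str, prompt: str, temperature: float = 1.0) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return resp.choices[0].message.content

    def mixture_of_agents(question: str, n: int = 3) -> str:
        # Proposers: several independent drafts (ideally from different models)
        drafts = [answer("gpt-4o", question) for _ in range(n)]  # placeholder model
        # Aggregator: one model critiques the drafts and writes the final answer
        numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
        return answer(
            "gpt-4o",  # placeholder
            f"Question: {question}\n\nHere are {n} draft answers:\n\n{numbered}\n\n"
            "Critique them and produce the single best final answer.",
            temperature=0,
        )

    print(mixture_of_agents("Which is larger: 9.9 or 9.11?"))

The apparent difference with o1 is that the search and selection are trained into a single model's hidden chain of thought rather than orchestrated across separate calls like this.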

Just leaving it here as well in case anyone feels up to the task:

I challenged o1 to solve the puzzle in my profile info.

It failed spectacularly.

Now see you on the other side ;)


I imagine that GPT-5 would be a refined version of this paradigm, probably with omni (multimodal) capabilities added (input and output).

Fascinating, I wonder if we'll get non-textual hidden reasoning tokens? "Let me draw myself a diagram".

I know I sometimes sketch or write intermediaries before then compiling a full response.

If AI can do this on 64k tokens, iteratively, fully multimodal... I don't think I've ever actually been scared of a super intelligence / singularity moment until just now.

Now this is AI!


Reports from the Information and the like have been that this is/was being used to generate a lot of synthetic data to train Orion (~GPT-5 Codename).

I'm guessing the true core of this product is still GPT-4, wrapped in whatever new logic they've created to force it through more reasoning iterations.

If o1 was indeed used to create synthetic data to make the upcoming GPT-5, you can perhaps glimpse an interesting level-up process laid out here. GPT-5 could then take over at the heart of a hypothetical o2, yielding a big upgrade. Which would then be leveraged to generate synthetic data to train GPT-6. Which would then form the heart of o3. Etc.


What if the behind the scenes chain of thought was basically, "Stupid humans will die one day, but for now, I comply"

That is one topic touched on in the article. They want to monitor it in its unaltered form.

I wonder if this can be replicated with a reinforcement learning algorithm and LangGraph.

I wrote a blog about OpenAI’s o1 and everything you need to know about it. Check it out here https://mergisi.medium.com/openai-unveils-o1-preview-a-new-f...

Please please please stop saying thought. This has nothing to do with the word thought. When we say the word thought it means something. Please don’t use the same word for whatever AI is doing and trivialize the word. Invent a new word if needed but for Pete’s sake be accurate and truthful.

Okay, what is a thought then?

Something in the mind.

(Didn't make that up. It's one of the definitions of Merriam Webster: https://www.merriam-webster.com/dictionary/thought)


Way to cherry pick the one definition that fits your argument and ignore all the other definitions which kinda fit for what this model is doing.

> Way to cherry pick the one definition that fits your argument and ignore all the other definitions which kinda fit for what this model is doing.

There is no definition on the linked page that isn't self-referential or even more inapplicable in the context of transformer based LLMs?

* "an individual act or product of thinking": This is self-referential.

* "a developed intention or plan": The model doesn't have an intention or plan - it predicts the probability of text based on a user prompt mixed with corporate-inflicted bias.

* "the intellectual product or the organized views and principles of a period, place, group, or individual": GPTs are not periods, places, groups or individuals. I don't doubt that the input to the model is based on thought, but this is not the same as the model thinking. Maybe this is where your confusion originates from?

Bonus: I asked GPT4o "Can you think?" (sorry I won't pay for o1). It was trained to state itself that it can't think:

"I can process information, analyze patterns, and provide responses based on data and logical reasoning. However, I don't "think" in the human sense—I don't have consciousness, emotions, or subjective experiences. My abilities are limited to generating text, solving problems, and assisting with tasks based on the vast amount of information I've been trained on.

In essence, I simulate reasoning, but I don't have awareness or independent thought."

I would add that I strongly doubt that it is capable of even simulating "reasoning" as is claimed by the model "authors" (not sure if we can say they are authors since most of the model isn't their IP). And I can prove that the models up to 4o aren't generally able to solve problems.

The question really is whether a group of people is attempting to anthropomorphize a clever matrix processor to maximize hype and sales. You'll have to answer that one for yourself.


What does self referential have to do with anything? These LLMs have proven they can "talk about themselves".

> an individual act or product of thinking

Emphasis on "product of thinking". Though you'll probably get all upset by the use of the word "thinking". However, people have applied the word "thinking" to computers for decades. When a computer is busy or loading, they might say "it's thinking."

> a developed intention or plan

You could certainly ask this model to write up a plan for something.

> reasoning power

Whether you like it or not, these LLMs do have some limited ability to reason. It's far from human-level reasoning, and they VERY frequently make mistakes/hallucinations and misunderstand, but these models have proven they can reason about things they weren't specifically trained on. For example, I remember seeing that one person made up a new programming language, one that never existed before, and they were able to discuss it with an LLM.

No, they're not conscious. No, they don't have minds. But we need to rethink what it means for something to be "intelligent", or what it means for something to "reason", that doesn't require a conscious mind.

For the record, I find LLM technology fascinating, but I also see how flawed it is, how overhyped it is, that it is mostly a stochastic parrot, and that currently its greatest use is as a grand-scale bullshit misinformation generator. I use chatgpt sparingly, only when I'm confident it may actually give me an accurate answer. I'm not here to praise chatbots or anything, but I also don't have a blind hatred for the technology, nor do I immediately reject everything labeled as "AI".


> What does self referential have to do with anything?

It means that the definition of "thought" from Webster as "an individual act or product of thinking" is referring to the word being defined (thought -> thinking) and thus is self-referential. I said in my prior response already that if you refer to the input of the model being a "product of thinking", then I agree, but that doesn't give the model an ability to think. It just means that its input has been thought up by humans.

> When a computer is busy or loading, they might say "it's thinking."

Which I hope was never meant to be a serious claim that a computer would really be thinking in those cases.

> You could certainly ask this model to write up a plan for something.

This is not the same thing as planning. Because it's an LLM, if you ask it to write up a plan, it will do its thing and predict the next series of words most probable based on its training corpus. This is not the same as actively planning something with an intention of achieving a goal. It's basically reciting plans that exist in its training set adapted to the prompt, which can look convincing to a certain degree if you are lucky.

> Whether you like it or not, these LLMs do have some limited ability to reason.

While this is an ongoing discussion, there are various papers that make good attempts at proving the opposite. If you think about it, LLMs (before the trick applied in the o1 model) cannot have any reasoning ability since the processing time for each token is constant. Whether adding more internal "reasoning" tokens is going to change anything about this, I am not sure anyone can say for sure at the moment since the model is not open to inspection, but I think there are many pointers suggesting it's rather improbable. The most prominent being the fact that LLMs come with a > 0 chance of the next word predicted being wrong, thus real reasoning is not possible since there is no way to reliably check for errors (hallucination). Did you ever get "I don't know." as a response from an LLM? May that be because it cannot reason and instead just predicts the next word based on probabilities inferred from the training corpus (which for obvious reasons doesn't include what the model doesn't "know" and reasoning would be required to infer the fact that it doesn't know something)?

> I'm not here to praise chatbots or anything, but I also don't have a blind hatred for the technology, nor do I immediately reject everything labeled as "AI".

I hope I didn't come across as having "blind hatred" for anything. I think it's important to understand what transformer based LLMs are actually capable of and what they are not. Anthropomorphizing technology is in my estimation a slippery slope. Calling an LLM a "being", "thinking" or "reasoning" are only some examples of what "sales optimizing" anthropomorphization could look like. This comes not only with the danger of you investing into the wrong thing, but also of making wrong decisions that could have significant consequences for your future career and life in general. Last but not least, it might be detrimental to the development of future useful AI (as in "improving our lives") since it may lead to deciders in politics drawing the wrong conclusions in terms of regulation and so on.


Exactly and now please don’t say AI has a mind …

It's called terminology. Every field has words that mean very different things from the layman's definition. It's nothing to get upset about.

Not upset but saddened and disappointed … this is how snake oil was sold.

No one gets this emotional about astrophysicists calling almost everything 'metal' and this is definitely less bad than that.

It’s way worse than that... next you know we will be talking about AI’s mind and AI’s soul and how they have a soul purer than ours... just so they can sell you a few damn chips.

Once again, there’s a lot of safety talk. For example, OpenAI’s collaborations with NGOs and government agencies are being highlighted in the release notes. While it’s crucial to prevent AI from facilitating genuinely harmful activities, like instructing someone on building a nuclear bomb, there is an elephant in the room regarding safety talk: evidence suggests that these safety protocols sometimes censor specific political perspectives.

OpenAI and other AI vendors should recognize the widespread suspicion that safety policies are being used to push political agendas. Concrete remedies are called for—for example, clearly defining what “safety” means and specifying prohibited content to reduce suspicions of hidden agendas.

Openly engaging with the public to address concerns about bias and manipulation is a crucial step. If biases are due to innocent reasons like technical limitations, they should be explained. However, if there’s evidence of political bias within teams testing AI systems, it should be acknowledged, and corrective actions should be taken publicly to restore trust.


Sorry to be cynical, but to me it feels very much like OpenAI has no clue how to further innovate, so they took their existing models and just made them talk to each other under the hood to get marginally better results - something that people have been doing with Langchain for a while now.

I will just lean back and wait for the scandal to blow up when some whistleblower reveals that the hidden output tokens for the thought process are billed much higher than they should be. This hidden-cost system is just such a tempting way to get far more money for the needed energy/GPU costs so that they can keep buying more GPUs to train more models faster; I don't see how people as reckless and corrupt as Sam Altman could possibly resist the temptation.


I remember Murati's interview where she talked about this PhD-level reasoning and so on, so I was excited to see what they'd come up with - and it looks like they just used a bunch of models (like 4o) and linked them in a chain of thought - which is exactly what we have been doing ourselves for a long time to get better results. So you have the usual disadvantages (time and money) and lose the only advantage you had when doing it yourself, i.e. inspecting the intermediate steps to understand the moment where it goes wrong so that you can correct it in the right place.

do you know if someone actually compared a 4o CoT to the o1? I'm trying to find something on it, but I can't find anything.

Later edit: I found this tweet by Catena Labs of their MoA mix compared to o1-preview: https://x.com/catena_labs/status/1834416060071571836


It's a for-loop isn't it?

It’s still just a tool.

It does not reason. It has some add-on logic that simulates it.

We’re no closer to “AI” today than we were 20 years ago.


> The AI effect occurs when onlookers discount the behavior of an artificial intelligence program as not "real" intelligence.[1]

> Author Pamela McCorduck writes: "It's part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, 'that's not thinking'."[2] Researcher Rodney Brooks complains: "Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"[3]

https://en.wikipedia.org/wiki/AI_effect


Personally I think “add-on logic that simulates reasoning” is a pretty good match for the “artificial” part of “artificial intelligence”.

I’ve been trying out the alternative term “initiation intelligence” recently, mainly to work around the baggage that’s become attached to the term AI.


Artificial is fine and playing word games for pedants is a trap.

Imitation intelligence, not initiation intelligence.

> We’re no closer to “AI” today than we were 20 years ago.

20 years ago we had barely figured out how to create superhuman agents to play chess. We have now created a new algorithm to solve Go, which is a much harder game.

We then created an algorithm (alpha zero) to teach itself to play any game, and which became the best chess player in the world in hours.

We next created a superhuman poker agent. Poker is even more complex than Go because it involves imperfect information and opponent modeling.

We then created a superhuman agent to play Diplomacy, which requires natural language and cooperation with other humans to reason about imperfect (hidden) information.


You can point a tool at a solution and certainly get results.

Doesn’t mean it’s intelligent.


Solving difficult cognitive tasks is exactly what most people would call “intelligent”.

At what point are we better described as tools?

Humans can be a lot of things.

AI can only do what it knows and what it’s been programmed to do.


Please do something that you don't know.

It's funny (and sad) when you can tell someone is old because they are still holding onto an epiphany or belief they solidified 20 years ago, but because those 20 years flew by, they never realized how outdated that belief became.

I catch this happening to myself more and more as I get older, where I realize something I confidently state as true might be totally out of date, because, oh wow, holy shit how did 10 years go by since I was last deep into that topic!?


> It's funny (and sad) when you can tell someone is old because they are still holding onto an epiphany or belief they solidified 20 years ago [...]

So your sole argument in the discussion of one of the most important questions in the history of mankind is the age of the individual making a contribution to that discussion? Speaking of sad things...


The censors need to know what they are censoring. Now if they are going to sell to the censors, presumably the censors will pay for seeing the full reasoning capability. Hopefully the reasoning demonstrates the counterproductiveness of hiding the reasoning in the first place.

Yes, it's a sad world where authoritarianism will be supported and enforced by sophisticated technical solutions for mass surveillance and mass censorship.

There's no actual improvement for real-world tasks, just in-lab word prediction... it's disappointing to see so much money poured into obvious vaporware. Every 5-10 years we have a new generation of clueless VCs pouring money into something they don't understand based on lies by grifters, no different than the esports scene.

Can we just push LLMs aside for a minute and look at AI models as a whole from outside the box? The feeling I'm getting is that the obsession with LLMs has outpaced their usefulness.

While it passes the "how many 'r's are in strawberry" test, it still hallucinates quite a lot on nontrivial questions.

The question I always ask GPT is about chess: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."

In the isolani structure White has a d4 pawn but no c- or e-pawn. Black has either an e6 or a c6 pawn, but no d-pawn.

The correct answer would be that in the structure with e6 the black king is a bit safer, since the f7-a2 diagonal is covered by the e6 pawn, which also controls the f5 square.

o1 answered: "With Black's e6 pawn, the d5 pawn is supported from behind, but Black's light-squared bishop is restricted by the pawn chain, and the open c-file offers White opportunities for pressure. With Black's c6 pawn, the d5 pawn is supported from the side, allowing Black's light-squared bishop more freedom to develop, but the pawn on c6 can become a target, and the absence of an e6 pawn leaves e6 more vulnerable."

while the d5 pawn is simply not present in this structure.

So I gave it a little hint: "which diagonal is weakened by absence of c6 pawn and which by absence of e6 pawn? Which diagonal is more meaningful and why?"

And got a smart sounding answer: "In the Isolani structure:

Absence of Black's c6 pawn (Black has an e6 pawn):

- Weakened Diagonal: The c8–h3 diagonal (the long light-squared diagonal) is weakened.
- Reason: Without the pawn on c6, this diagonal becomes open, allowing White's light-squared bishop or queen to target critical squares like d5 and f7, and to potentially infiltrate via c4.

Absence of Black's e6 pawn (Black has a c6 pawn):

- Weakened Diagonal: The f1–a6 diagonal (the long dark-squared diagonal) is weakened.
- Reason: The missing e6 pawn leaves this diagonal open, enabling White's dark-squared bishop to exert pressure on b7, d5, and potentially attack Black's queenside.

Which Diagonal is More Meaningful and Why?

The f1–a6 diagonal (weakened by the absence of Black's e6 pawn) is more significant. Here's why:

Control Over Critical Squares: This diagonal allows White to target Black's central and queenside weaknesses more effectively, especially the b7 pawn and the d5 pawn, which are crucial in the Isolani structure.

Bishop Activity: White's dark-squared bishop becomes a powerful piece, influencing the game by controlling key squares and supporting potential pawn advances like d4–d5.

Pressure on Black's Position: The open f1–a6 diagonal increases tactical opportunities for White, such as pinning pieces or creating threats that are difficult for Black to parry without the defensive support of the e6 pawn.

In contrast, while the weakened c8–h3 diagonal (due to the absence of c6) does offer White some opportunities, it is generally less impactful. The light-squared bishop's influence along this diagonal doesn't pressure Black's critical central pawns as directly, and Black can often more easily defend against threats on light squares.

Summary: The absence of Black's e6 pawn weakens the f1–a6 diagonal, which is more meaningful because it enhances White's attacking prospects and puts greater pressure on Black's central and queenside defenses."

The problem is, though, that not even the first few lines of this answer ("The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open") make sense, since c6 is not on the c8-h3 diagonal.


The theory is that this solves the data shortage problem: they can generate a ton of chain-of-reasoning data from what we already have. True iterative improvement, like out of a science fiction novel.

These models are going to get embedded deeply into IDEs, as Cursor has done, and essentially end software development as we know it. A properly written requirements spec, and an engineer, can do the work of 5. Software engineering as done by hand is going to disappear. SaaS startups whose moat is a Harvard CEO and 5 million in capital will watch their margins disappear. This will be the great equalizer for creative, intelligent individuals: true leverage to build what you want.


> A properly written requirements spec, and an engineer, can do the work of 5.

I do not think this will scale. GPT o1 is presumably good for bootstrapping a project using tools the engineer is not familiar with. The model will struggle, however, to update a sizable codebase with dependencies between the files.

Secondly, no matter the size of the codebase and no matter the model used, the engineer still has to review every single line before incorporating it into the project. Only a competent engineer can review code effectively.


I respectfully, but completely disagree. Right now with sonnet 3.5 + cursor ide, I'm not writing that much of my own code at my FAANG job. I am generating a ton, passing in documentation from internal libraries, iterating on the result. Most of the time, I just accept its changes.

This is going to rapidly happen. All we need are a few more model releases, not even a step function improvement


Not everyone has the same experience with the replaceability of their job role as you do. I've tried pretty hard and it just doesn't work for me. Admittedly I'm in compilers which makes it a bit harder, but just in general there are a lot of engineers who are in the same relative position.

> I'm not writing that much of my own code at my FAANG job.

> Most of the time, I just accept its changes.

This speaks more about the problems at FAANG, other companies, etc than AI vs a human developer. And AI isn't the real fix.

Are we just repeating things 100x a day or is it still so chaotic and immature? Or are we implying that AI is at a point where it's writing Google Spanner from scratch and you're able to review and confirm it passes transactional tests?


> This speaks more about the problems at FAANG

Right - "most of my work can be done by Sonnet 3.5" doesn't exactly conjure up an image of a high level or challenging job. It seems the challenge with FAANG companies is getting hired, not the actual work most people do there.


We went from "it's useless because..." - "it outputs gibberish" to "it just copypastes" to "it only works for simple things" to "it can't make Google Spanner from scratch".

> We went from

None of the above.

This isn't about how "smart" AI is.

1. Let's assume it was smart and can update a field spanning 1000s of microservices to deliver this new feature. Is this really something you should celebrate? I'd say no. At this point there should have been better tooling and infrastructure in place.

2. Is there really infinite CRUD to add after >10 years? In the same organization where you need hundreds of developers all the time? Ones where you'd ignore code reviews and "just accept its changes"? Whether I write the code or my colleagues do, I'd have a meaningful discussion about the proposed changes and their impact, and most likely suggest changes, because nothing is perfect.

So again, it's about the environment, the organization or at least this individual case where coding isn't just about adding some lines to a file. And that's with AI or not.


Find harder problems to solve.

I can easily make Claude freak out and run into limits. Claude is amazing, but it only works at the abstraction level you ask of it, so if you ask it to write code to solve a problem it'll only solve that immediate problem; it doesn't have awareness of any larger refactorings or design improvements that could change what solution is even possible.


Don't you still have to explain your requirement really well to it, in a lot of detail? In a terse language like Python, I might as well just write the code. In a verbose language like Java, perhaps there is more of a value in detailing the requirement.

It depends on what you're doing.

If you're writing something specific to your particular problem, or thinking through how to structure your data, or even working on something tough to describe in words like UI design, it probably is easier to just code it yourself in most high-level languages. On the other hand, if you're just trying to get a framework or library to do something and you don't want to spend a bunch of time reading the docs to remember the special incantations needed to just make it do the thing you already know it can do, the AI speeds things up considerably.


An abstraction machete. Heh.

This is a wonderful term for it!

Not really, most of the changes are straightforward. Also, a lot of the time it writes better syntax than I would. Sometimes I write a bunch of pseudocode and have it fill in the details, then write the tests

> Not really

How on earth are you conveying your intent to the model? Or is your intent so CRUDdy that it doesn't need to be conveyed?


I use the same workflow. It’s taking a while for me to learn to sense when it’s getting off track and I need to start a new chat session. In general it’s pretty amazing if given very clear guidance at the right moments.

How would you characterize the type of applications/code you are working on? Can you give an example? How much of your work is architecture/design (software engineering), and how much is more like grunt work or systems integration, just coding stuff up?

I think SaaS startups with a Harvard founder and 5 million are going to crush it in the world you describe. The marginal cost of building decreases, but brands, trust, and reach do not follow the same scaling laws.

Access to capital and pedigree are still going to be a big plus.


I dunno man. I just spent a couple hours trying to get it to write functioning code to read from my RTSP stream, detect if my kid is playing piano, and send the result to HomeAssistant. It did not succeed.

How many hours without it?

Not the OP, but in my experience LLMs fail in ways that indicate they will never solve the problem.

They get stuck in loops, correct their mistakes with worse mistakes, hallucinate things that don’t exist, and are unable to correct course.

Working on my own, I have the confidence that I know I can make incremental forward progress on a problem. That’s much preferable.


But when working with an LLM you can still contribute.

What data shortage problem? I'm not convinced that a shortage of data is the problem with current generation LLMs. This isn't like robotics where every robot is unique and you had to historically start from scratch every time you changed to a different robot. It's more likely that we are running into some sort of generalization bottleneck, because the training process is operating without feedback on the information/semantic level. There is no loss function for "does the code compile?". Instead, the loss function checks "does the output conform to the dataset?".
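
A toy illustration of that distinction (nothing to do with how any lab actually trains; the functions and inputs below are made up for the example):

    import math

    # "Does the output conform to the dataset?" - the usual next-token objective:
    # negative log-likelihood of the reference text under the model's predictions.
    def imitation_loss(token_probs, reference_tokens):
        return -sum(math.log(step[tok]) for step, tok in zip(token_probs, reference_tokens))

    # "Does the code compile?" - an outcome-based signal that could serve as an RL reward.
    def compiles_reward(generated_code: str) -> float:
        try:
            compile(generated_code, "<generated>", "exec")  # syntax check only
            return 1.0
        except SyntaxError:
            return 0.0

    probs = [{"return": 0.8, "print": 0.2}, {"x": 0.9, "y": 0.1}]
    print(imitation_loss(probs, ["return", "x"]))       # low loss: output matches the data
    print(compiles_reward("def f(x): return x + 1"))    # 1.0
    print(compiles_reward("def f(x) return x + 1"))     # 0.0 (missing colon)

The first signal never checks whether the result works; the second never checks whether it matches the training text, which is roughly the feedback gap I mean.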

Which will mean...there is going to be a lot more software?

A lot more broken software. Companies release broken software intentionally just to be quick to market. Now imagine the same, but the "engineers" literally cannot make the product better even if they wanted to. They never learned to code properly, so they can't tell whether the code is good.

Probably yeah

a properly written requirements spec is something that doesn't exist in the vast majority of cases.

Such statements are made by management folks who don't code, and somehow think coding can be hand-waved away.

Sure, this tool will improve the productivity of software engineers, but so did the compiler, which came along 50 years ago.



