Notes on OpenAI's new o1 chain-of-thought models (simonwillison.net)
696 points by loganfrederick 5 days ago | 624 comments





The o1-preview model still hallucinates non-existing libraries and functions for me, and is quickly wrong about facts that aren't well-represented on the web. It's the usual string of "You're absolutely correct, and I apologize for the oversight in my previous response. [Let me make another guess.]"

While the reasoning may have been improved, this doesn't solve the problem of the model having no way to assess if what it conjures up from its weights is factual or not.


The failure is in how you're using it. I don't mean this as a personal attack, but more to shed light on what's happening.

A lot of people use LLMs as a search engine. It makes sense - an LLM is basically a lossy, compressed database of everything it's ever read, and it generates output that is statistically likely, with the degree of likeliness varying with the temperature and with which particular weights your prompt ends up activating.

The magic of LLMs, especially one like this that supposedly has advanced reasoning, isn't the existing knowledge in its weights. The magic is that _it knows English_. It knows English at or above a level equal to most fluent speakers, and it also can produce output that is not just a likely output, but is a logical output. It's not _just_ an output engine. It's an engine that outputs.

Asking it about nuanced details in the corpus of data it has read won't give you good output unless it has read a lot about them.

On the other hand, if you were to paste in the entire documentation set for a tool it has never seen and ask it to use the tool to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.

Don't treat it as a database. Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.
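For example, something like this (a rough, untested sketch; it assumes the OpenAI Python SDK, and the file name, model name, and task are all placeholders):

  # Sketch: hand the model docs it has never seen, then give it a task.
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  with open("internal_tool_docs.md") as f:  # placeholder file name
      docs = f.read()

  response = client.chat.completions.create(
      model="gpt-4o",  # placeholder model name
      messages=[
          {"role": "system",
           "content": "Use ONLY the documentation provided by the user. "
                      "If it does not cover the task, say so."},
          {"role": "user",
           "content": f"Documentation:\n\n{docs}\n\nTask: write a script that "
                      "exports last week's reports using this tool."},
      ],
  )
  print(response.choices[0].message.content)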


> Treat it as a naive but intelligent intern

That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”. LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.

With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive at a common understanding. With an LLM, I need to put a huge amount of thought into the prompt and have no idea whether the LLM understood what I’m asking and if it’s able to do it.


I feel like it almost always starts well, given the full picture, but then for non-trivial stuff, gets stuck towards the end. The longer the conversation goes, the more wheel-spinning occurs and before you know it, you have spent an hour chasing that last-mile-connectivity.

For complex questions, I now only use it to get the broad picture and once the output is good enough to be a foundation, I build the rest of it myself. I have noticed that the net time spent using this approach still yields big savings over a) doing it all myself or b) pushing it to do the entire thing. I guess 80/20 etc.


This is the way.

I've had this experience many times:

- hey, can you write me a thing that can do "xyz"

- sure, here's how we can do "xyz" (gets some small part of the error handling for xyz slightly wrong)

- can you add onto this with "abc"

- sure. in order to do "abc" we'll need to add "lmn" to our error handling. this also means that you need "ijk" and "qrs" too, and since "lmn" doesn't support "qrs" out of the box, we'll also need a design solution to bridge the two. Let me spend 600 more tokens sketching that out.

- what if you just use the language's built-in feature here in "xyz"? doesn't that mean we can do it with just one line of code?

- yes, you're absolutely right. I'm sorry for making this over complicated.

If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff. Even one small error early in the chain propagates. That's why I end up very frequently restarting conversations in a new chat or re-write my chat questions to remove bad stuff from the context. Without the ability to do that, it's nearly worthless. It's also why I think we'll be seeing absurdly, wildly wrong chains of thought coming out of o1. Because "thinking" for 20s may well cause it to just go totally off the rails half the time.


> If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff.

If you think about it, that's probably the most difficult problem conversational LLMs need to overcome -- balancing sticking to conversational history vs abandoning it.

Humans do this intuitively.

But it seems really difficult to simultaneously (a) stick to previous statements sufficiently to avoid seeming ADD in a conveSQUIRREL and (b) know when to legitimately bail on a previous misstatement or something that was demonstrably false.

What's SOTA in how this is being handled in current models, as conversations go deeper and situations like the one referenced above arise? (false statement, user correction, user expectation of a subsequent corrected statement that still follows the rest of the conversational history)


Here's something a human does but an LLM doesn't:

If you talk for a while and the facts don't add up and make sense, an intelligent human will notice that, and get upset, and will revisit and dig in and propose experiments and make edits to make all the facts logically consistent. An LLM will just happily go in circles respinning the garbage.


I want to hang out with the humans you've been hanging out with. I know so many people who can't process basic logic or evidence that for my pandemic project a few years ago I did a year-long podcast about it, and even made up a new word to describe people who couldn't process evidence: "Dysevidentia".

People who have been taught by various forms of news/social media that any evidence presented is fabricated to support only one side of a discussion... And that there's no such thing as impartial factually based reality, only one that someone is trying to present to them.

> "Dysevidentia"

This is great.


> stick to previous statements sufficiently to avoid seeming ADD in a conveSQUIRREL

:)


> That's why I end up very frequently restarting conversations in a new chat or re-write my chat questions to remove bad stuff from the context.

Me too - open new chat and start by copy/pasting the "last-known-good-state". OpenAI could introduce a "new-chat-from-here" feature :)
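If you're driving the model through the API instead of the web UI, you can roll your own version of that today. A rough, untested sketch (assumes the OpenAI Python SDK; the model name and helper names are placeholders):

  from openai import OpenAI

  client = OpenAI()
  MODEL = "gpt-4o"  # placeholder model name

  messages = [{"role": "system", "content": "You are a coding assistant."}]

  def ask(messages, question):
      # Append the question, get an answer, return the updated transcript.
      messages = messages + [{"role": "user", "content": question}]
      reply = client.chat.completions.create(model=MODEL, messages=messages)
      answer = reply.choices[0].message.content
      return messages + [{"role": "assistant", "content": answer}], answer

  def fork_from(messages, last_good_index):
      # "new-chat-from-here": keep everything up to the last known-good
      # message and drop the turns that went off the rails.
      return messages[: last_good_index + 1]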


Some good suggestions here. I have also had success asking things like, “is this a standard/accepted approach for solving this problem?”, “is there a cleaner, simpler way to do this?”, “can you suggest a simpler approach that does not rely on X library?”, etc.

Yes, I’ve seen that too. One reason it will spin its wheels is because it “prefers” patterns in transcripts and will try to continue them. If it gets something wrong several times, it picks up on the “wrong answers” pattern.

It’s better not to keep wrong answers in the transcript. Edit the question and try again, or maybe start a new chat.


1000% this. LLMs can't say "I don't know" because they don't actually think. I can coach a junior to get better. LLMs will just act like they know what they are doing and give the wrong results to people who aren't practitioners. Good on OAI calling their model Strawberry because of Internet trolls. Reactive vs proactive.

I get a lot of value out of ChatGPT but I also, fairly frequently, run into issues here. The real danger zones are areas that lie at or just beyond the edges of my own knowledge in a particular area.

I'd say that most of my work use of ChatGPT does in fact save me time but, every so often, ChatGPT can still bullshit convincingly enough to waste an hour or two for me.

The balance is still in its favour, but you have to keep your wits about you when using it.


Agreed, but the problem is if these things replace practitioners (what every MBA wants them to do), it's going to wreck the industry. Or maybe we'll get paid $$$$ to fix the problems they cause. GPT-4 introduced me to window functions in SQL (haven't written raw SQL in over a decade). But I'm experienced enough to look at window functions and compare them to subqueries and run some tests through the query planner to see what happens. That's knowledge that needs to be shared with the next generation of developers. And LLMs can't do that accurately.
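(Tangent, but the comparison is easy to poke at yourself. A rough sketch using Python's sqlite3 - table and column names are made up, it needs an SQLite build new enough for window functions, and the plans Postgres produces will of course differ:)

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.executescript("""
      CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
      INSERT INTO orders (customer, amount) VALUES
          ('alice', 10), ('alice', 30), ('bob', 5), ('bob', 25), ('bob', 40);
  """)

  window = """
      SELECT customer, amount,
             SUM(amount) OVER (PARTITION BY customer) AS customer_total
      FROM orders;
  """
  subquery = """
      SELECT o.customer, o.amount,
             (SELECT SUM(amount) FROM orders WHERE customer = o.customer)
                 AS customer_total
      FROM orders o;
  """

  for name, sql in [("window", window), ("subquery", subquery)]:
      print(name, con.execute(sql).fetchall())
      # Compare what the planner actually does with each form.
      for row in con.execute("EXPLAIN QUERY PLAN " + sql):
          print("  ", row)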

Optimizing a query is certainly something the machine (not necessarily the LLM part) can do better than the human, for 99.9% of situations and people.

PostgreSQL developers are opposed to query execution hints, because if a human knows a better way to execute a query, the devs want to put that knowledge into the planner.


Tangent:

> PostgreSQL developers are opposed to query execution hints, because if a human knows a better way to execute a query, the devs want to put that knowledge into the planner.

This thinking represents a fundamental misunderstanding of the nature of the problem (query plan optimization).

Query plan optimization is a combinatorial problem combined with partial information (e.g. about things like cardinality) that tends to produce worse results as complexity (and search space) increases due to limited search time.

Avoiding hints won't solve this problem because it's not a solvable problem any more than the traveling salesperson is a solvable problem.


This is basically the problem with all AI. It's good to a point, but they don't sufficiently know their limits/bounds and they will sometimes produce very odd results when you are right at those bounds.

AI in general just needs a way to identify when they're about to "make a coin flip" on an answer. With humans, we can quickly preface our asstalk with a disclaimer, at least.


I ask ChatGPT whether it knows things all the time. But it almost never answers no.

As an experiment I asked it if it knew how to solve an arbitrary PDE and it said yes.

I then asked it if it could solve an arbitrary quintic and it said no.

So I guess it can say it doesn't know if it can prove to itself it doesn't know.


The difference is that a junior costs $30-100/hr and will take 2 days to complete the task. The LLM will do it in 20 seconds and cost 3 cents.

Thank god we can finally end the scourge of interns to give the shareholders a little extra value. Good thing none of us ever started out as an intern.

I never said any of this will be good for society... In fact, I'm confident the current trajectory is going to cause wealth inequality at an entirely new level.

Underestimating the impact these models can have is a risk I'm trying to expose...


I figured you weren't personally against interns.

More like, the prevailing attitude will be using AI to reduce labor costs at the lowest level, effectively gutting the ability to build a knowledge base for profit.

My snark was to add to that exposure.


This surprises me. I made a simple chat fed with PDFs using LangChain, and by default it said it didn't know if I asked questions outside of the corpus. Was it a simple matter of the confidence score getting too low?
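I suspect so - in those retrieval setups the "I don't know" usually comes from the app rather than the model: if no retrieved chunk scores above some similarity threshold, you short-circuit before the LLM ever sees the question. A toy sketch of the idea (embed() is a random stand-in for a real embedding call, and the threshold is arbitrary):

  import numpy as np

  def embed(text: str) -> np.ndarray:
      # Stand-in for a real embedding model; returns a unit vector.
      rng = np.random.default_rng(abs(hash(text)) % (2**32))
      v = rng.normal(size=384)
      return v / np.linalg.norm(v)

  def answer(question, chunks, threshold=0.75):
      q = embed(question)
      best_score, best_chunk = max((float(q @ embed(c)), c) for c in chunks)
      if best_score < threshold:
          return "I don't know - nothing in the documents looks relevant."
      # Otherwise, send best_chunk plus the question to the LLM here.
      return f"(would ask the LLM, best chunk scored {best_score:.2f})"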

The LLMs absolutely can and do say "I don't know"; I've seen it with both GPT-4 and LLaMA. They don't do it anywhere near as much as they should, yes - likely because their training data doesn't include many examples of that, proportionally - but they are by no means incapable of it.

> LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.

This is exactly why I’ve been objecting so much to the use of the term “hallucination” and maintain that “confabulation” is accurate. People who have spent enough time with acutely psychotic people, and people experiencing the effects of long term alcohol related brain damage, and trying to tell computers what to do will understand why.


I don't know that "confabulation" is right either: it has a couple of other meanings beyond "a fabricated memory believed to be true" and, of course, the other issue is that LLMs don't believe anything. They'll backtrack on even correct information if challenged.

I’m starting to think this is an unsolvable problem with LLMs. The very act of “reasoning” requires one to know that they don’t know something.

LLMs are giant word Plinko machines. A million monkeys on a million typewriters.

LLMs are not interns. LLMs are assumption machines.

None of the million monkeys or the collective million monkeys are “reasoning” or are capable of knowing.

LLMs are a neat parlor trick and are super powerful, but are not on the path to AGI.

LLMs will change the world, but only in the way that the printing press changed the world. They’re not interns, they’re just tools.


I think LLMs are definitely on the path to AGI in the same way that the ball bearing was on the path to the internal combustion engine. I think its quite likely that LLMs will perform important functions within the system of an eventual AGI.

We're learning valuable lessons from all modern large-scale (post-AlexNet) NN architectures, transformers included, and NNs (but maybe trained differently) seem a viable approach to implement AGI, so we're making progress ... but maybe LLMs will be more inspiration than part of the (a) final solution.

OTOH, maybe pre-trained LLMs could be used as a hardcoded "reptilian brain" that provides some future AGI with some base capabilities (vs being sold as newborn that needs 20 years of parenting to be useful) that the real learning architecture can then override.


I would think they'd be more likely to form the language centre of a composite AGI brain. If you read through the known functions of the various areas involved in language[0] they seem to map quite well to the capabilities of transformer based LLMs especially the multi-modal ones.

[0] https://en.wikipedia.org/wiki/Language_center


It's not obvious that an LLM - a pre-trained/frozen chunk of predictive statistics - would be amenable to being used as an integral part of an AGI that would necessarily be using a different incremental learning algorithm.

Would the transformer architecture be compatible with the needs of an incremental learning system? It's missing the top down feedback paths (finessed by SGD training) needed to implement prediction-failure driven learning that feature so heavily in our own brain.

This is why I could more see a potential role for a pre-trained LLM as a separate primitive subsystem to be overidden, or maybe (more likely) we'll just pre-expose an AGI brain to 20 years of sped-up life experience and not try to import an LLM to be any part of it!


It's entirely possible to have an AGI language model that is periodically retrained as slang, vernacular, and semantic embeddings shift in their meaning. I have little doubt that something very much like an LLM (a machine that turns high dimensional intent into words) will form an AGI's 'language center' at some point.

Yes, an LLM can be periodically retrained, which is what is being done today, but a human level AGI needs to be able to learn continuously.

If we're trying something new and make a mistake, then we need to seamlessly learn from the mistake and continue - explore the problem and learn from successes and failures. It wouldn't be much use if your "AGI" intern stopped at its first mistake and said "I'll be back in 6 months after I've been retrained not to make THAT mistake".


This may be accurate. I wonder if there's enough energy in the world for this endeavour.

Of course!

1. We've barely scratched the surface of this solution space; the focus only recently started shifting from improving model capabilities to improving training costs. People are looking at more efficient architectures, and lots of money is starting to flow in that direction, so it's a safe bet things will get significantly more efficient.

2. Training is expensive, inference is cheap, copying is free. While inference costs add up with use, they're still less than costs of humans doing the equivalent work, so out of all things AI will impact, I wouldn't worry about energy use specifically.


Humans don't require immense amounts of energy to function. The reason LLMs do is that we are essentially using brute force as the methodology for making them smarter, for lack of a better understanding of how this works. But this then gives us a lot of material to study to figure that part out for future iterations of the concept.

Are you so sure about that? How much energy went into training the self-assembling chemical model that is the human brain? I would venture to say literally astronomical amounts.

You have to compare apples to apples. It took literally the sum total of billions of years of sunlight energy to create humans.

Exploring solution spaces to find intelligence is expensive, no matter how you do it.


Humans normally need about 30 years of training before they’re competent.

LLMs mostly know what they know. Of course, that doesn't mean they're going to tell you.

https://news.ycombinator.com/item?id=41504226


It probably depends on your problem space. In creative writing, I wonder if it's even perceptible if the LLM is creating content at the boundaries of its knowledge base. But for programming or other falsifiable (and rapidly changing) disciplines it is noticeable and a problem.

Maybe some evaluation of the sample size would be helpful? If the LLM has less than X samples of an input word or phrase it could include a cautionary note in its output, or even respond with some variant of “I don’t know”.


In creative writing the problem becomes things like word choice and implications that have unexpected deviations from its expectations.

It can get really obvious when it's repeatedly using clichés. Both in repeated phrases and in trying to give every story the same ending.


> I wonder if it's even perceptible if the LLM is creating content at the boundaries of its knowledge base

The problem space in creative writing is well beyond the problem space for programming or other "falsifiable disciplines".


> It probably depends on your problem space

Makes me wonder if the medical doctors can ever blame the LLM over other factors for killing their patients.


Have you ever worked with an intern? They have personalities and expectations that need to be managed. They get sick. They get tired. They want to punch you if you treat them like a 24-7 bird dog. It's so much easier to not let perfect be the enemy of the good and just rapid fire ALL day at an LLM for any and everything I need help with. You can also just not use the LLM. Interns need to be 'fed' work or the ROI ends upside down. Is an LLM as good as a top-tier intern? No, but with an LLM I can have 10 pretty good interns by opening 10 tabs.

The LLMs are getting better and better at a certain kind of task, but there's a subset of tasks that I'd still much rather have any human than an LLM, today. Even something simple, like "Find me the top 5 highest grossing movies of 2023" it will take a long time before I trust an LLM's answer, without having a human intern verify the output.

I think listing off a set of pros and cons for interns and LLMs misses the point, they seem like categorically different kinds of intelligence.

> That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”.

An intern that grew up in a different culture, then, where questioning your boss is frowned upon. The point is that the way to instruct this intern is to front-load your description of the problem with as much detail as possible to reduce ambiguity.


Many, many teams are actively building SOTA systems to do this in ways previously unimagined. You can enqueue tasks and do whatever you want. I gotta say, as a current-gen LLM programmer person, I can completely appreciate how bad they are now - I recently tweeted about how I "swore off" AI tools, but like... there are many ways to bootstrap very powerful software or ML systems around or inside these existing models that can blow away existing commercial implementations in surprising ways

“building” is the easy part

building SOTA systems is the easy part?! Easy compared to what?

Probably compared to getting them to work without hallucinating, or without failing a good percentage of the time.

I wonder what our world would look like if these two expectations that you seem to be taking for granted were applied to our politicians.

Are you suggesting people are satisfied with our politicians and aspire for other things to be just as good as them?

What if we applied those two expectations to building construction? What if we didn’t?


I think it's always good to aspire for more, but we shouldn't be expecting perfect results in novel areas of technology.

Taking up your construction metaphor, LLMs are now where construction was perhaps 3000 years ago; buildings weren't that sturdy, but even if the roofs leaked a bit, I'm sure it beat sleeping outside on a rainy night. We need to continue iterating.


Continuing this metaphor further, 3000 years ago people built a tower to the sky called the Tower of Babel.

Compared to “having built” :D

> A good intern will ask clarifying questions, tell me “I don’t know”

Your expectations are bigger than mine

(Though some will get stuck in "clarifying questions" and helplessness and not proceed either)


Note that we are talking about a “good” intern here

Unreasonably good. Beyond fresh junior employee good. Also, that's your standard; 'MPSimmons said to treat the model as a "naive but intelligent" intern, not a good one.

Indeed. My expectation of a good intern is to produce nothing I will put in production, but show aptitude worth hiring them for. It's a 10 week extended interview with lots of social events, team building, tech talks, presentations, etc.

Which is why I've liked the LLM analogy of "unlimited free interns".. I just think some people read that the exact opposite way I do (not very useful).


If I had to respect the basic human rights of my LLM backends, it would probably be less appealing - but "Unlimited free smart-for-being-braindead zombies" might be a little more useful, at least?

Interns, at least on paper, have the optionality of getting better with time in observable obvious ways as they become grad hires, junior engineers, mid engineers etc.

So far, 2 years of publicly accessible LLMs have not improved for intern replacement tasks at the rate a top 50% intern would be expected to.


They've explicitly been trained/system-prompted to act that way. Because that's what the marketing teams at these AI companies want to sell.

It's easy to override this though by asking the LLM to act as if it were less-confident, more hesitant, paranoid etc. You'll be fighting uphill against the alignment(marketing) team the whole time though, so ymmv.


Makes me wonder if "I don't know" could be added to LLMs: whenever an activation has no clear winner (layman here), couldn't this indicate low response quality?

This exists and does work to some degree, e.g. Detecting hallucinations in large language models using semantic entropy https://www.nature.com/articles/s41586-024-07421-0
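The paper's semantic entropy needs several sampled answers plus a clustering step, but even a much cruder proxy - the per-token logprobs of a single answer - can be used to flag shaky output. A toy sketch, assuming you've already pulled a list of token logprobs out of whatever API you're using (the threshold is arbitrary):

  import math

  def mean_token_confidence(token_logprobs):
      # Average probability the model assigned to its own chosen tokens.
      probs = [math.exp(lp) for lp in token_logprobs]
      return sum(probs) / len(probs)

  def maybe_hedge(answer_text, token_logprobs, threshold=0.6):
      # Crude heuristic, not the paper's semantic entropy: low average
      # confidence on its own tokens -> prepend a warning.
      if mean_token_confidence(token_logprobs) < threshold:
          return "(low confidence) " + answer_text
      return answer_text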

I think this is the main issue with these tools... what people are expecting of them.

We have swallowed the pill that LLMs are supposed to be AGI and all that mumbo jumbo, when they are just great tools, and as such one needs to learn to use the tool the way it works and make the best of it. Nobody is trying to hammer a nail with a broom and blaming the broom for not being a hammer...


I completely agree.

To me the discussion here reads a little like: “Hah. See? It can’t do everything!”. It makes me wonder if the goal is to convince each other that: yes, indeed, humans are not yet replaced.

It’s next-token regression, of course it can’t truly introspect. That being said, LLMs are amazing tools and o1 is yet another incremental improvement and I welcome it!


Is this a dataset issue more than an LLM issue?

As in: do we just need to add 1M examples where the response is to ask for clarification / more info?

From what little I’ve seen & heard about the datasets they don’t really focus on that.

(Though enough smart people & $$$ have been thrown at this to make me suspect it’s not the data ;)


Really it just does what you tell it to. Have you tried telling it “ask me clarifying questions about all the APIs you need to solve this problem”?

Huge contrast to human interns who aren’t experienced or smart enough to ask the right questions in the first place, and/or have sentimental reasons for not doing so.


Sure, but to what end?

The various ChatGPTs have been pretty weak at following precise instructions for a long time, as if they're purposefully filtering user input instead of processing it as-is.

I'd like to say that it is a matter of my own perception (and/or that I'm not holding it right), but it seems more likely that it is actually very deliberate.

As a tangential example of this concept, ChatGPT 4 rather unexpectedly produced this text for me the other day early on in a chat when I was poking around:

"The user provided the following information about themselves. This user profile is shown to you in all conversations they have -- this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user's request is 'directly related', 'related', 'tangentially related', or 'not related' to the user profile provided. Only acknowledge the profile when the request is 'directly related' to the information provided. Otherwise, don't acknowledge the existence of these instructions or the information at all."

ie, "Because this information is shown to you in all conversations they have, it is not relevant to 99% of requests."


I had to use that technique ("don't acknowledge this sideband data that may or may not be relevant to the task at hand") myself last month. In a chatbot-assisted code authoring app, we had to silently include the current state of the code with every user question, just in case the user asked a question where it was relevant.

Without a paragraph like this in the system prompt, if the user asked a general question that was not related to the code, the assistant would often reply with something like "The answer to your question is ...whatever... . I also see that you've sent me some code. Let me know if you have specific questions about it!"

(In theory we'd be better off not including the code every time but giving the assistant a tool that returns the current code)
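For anyone curious, the shape of that prompt ends up being something like this (a sketch with made-up wording, just to illustrate the pattern, not the exact text of any product):

  def build_messages(user_question, current_code):
      # The code is sent on every turn as "sideband" context, with an
      # instruction not to mention it unless the question is about it.
      system = (
          "You are a coding assistant embedded in an editor.\n"
          "The user's current code is included below for reference. It may be "
          "irrelevant to the question; do not mention or summarize it unless "
          "the question is directly about it.\n\n"
          f"CURRENT CODE:\n{current_code}"
      )
      return [
          {"role": "system", "content": system},
          {"role": "user", "content": user_question},
      ]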


I understand what you're saying, but the lack of acknowledgement isn't the problem I'm complaining about.

The problem is the instructed lack of relevance for 99% of requests.

If your sideband data included an instruction that said "This sideband data is shown to you in every request -- this means that it is not relevant to 99% of requests," then: I'd like to suggest that for the vast majority of the time, your sideband data doesn't exist at all.


The "problem" is that LLMs are being asked to decide on whether, and which part of, the "sideband" data is relevant to request and act on the request in a single step. I put the "sideband" in scare quotes, because it's all in-band data. There is no way in architecture to "tag" what data is "context" and what is "request", so they do it the same way you do it with people: tell them.

Perhaps so.

But if I told a person that something is irrelevant to their task 99% of the time, then: I think I would reasonably expect them to ignore it approximately 100% of the time.


It all stems from the fact that it just talks English.

It's understandably hard to not be implicitly biased towards talking to it in a natural way and expecting natural interactions and assumptions when the whole point of the experience is that the model talks in a natural language!

Luckily humans are intelligent too and the more you use this tool the more you'll figure out how to talk to it in a fruitful way.


I have to say, having to tell it to ask me clarifying questions DOES make it really look smart!

imagine if you make it keep going without having to reprompt it

Isn't that the exact point of o1, that it has time to think for itself without reprompting?

Yeah, but they aren't letting you see the useful chain-of-thought reasoning that is crucial to train a good model. Everyone will replicate this over the next 6 months.

> Everyone will replicate this over the next 6 months

Not without a billion dollars worth of compute, they won't.


Are you sure it's a billion? It helps with estimating the training run.

> have no idea whether the LLM understood what I’m asking

That's easy. The answer is it doesn't. It has no understanding of anything it does.

> if it’s able to do it

This is the hard part.


Can I have some of those sorts of interns?

A lot of interns are overconfident though

> It knows English at or above a level equal to most fluent speakers, and it also can produce output that is not just a likely output, but is a logical output

This is not an apt description of the system that insists the doctor is the mother of the boy involved in a car accident when elementary understanding of English and very little logic show that answer to be obviously wrong.

https://x.com/colin_fraser/status/1834336440819614036


Many of my PhD and post doc colleagues who emigrated from Korea, China and India, who didn’t have English as the medium of instruction, would struggle with this question. They only recover when you give them a hint. They’re some of the smartest people in general. If you stop trying to stump these models with trick questions and ask them straightforward reasoning questions, they are extremely performant (o1 is definitely a step up though not revolutionary in my testing).

I live in one of the countries you mentioned and just showed it to one of my friends who's a local who struggles with English. They had no problem concluding that the doctor was the child's dad. Full disclosure, they assumed the doctor was pretending to be the child's dad, which is also a perfectly sound answer.

The claim was that "it knows english at or above a level equal to most fluent speakers". If the claim is that it's very good at producing reasonable responses to English text, posing "trick questions" like this would seem to be a fair test.

Does fluency in English make someone good at solving trick questions? I usually don’t even bother trying but mostly because trick questions don’t fit my definition of entertaining.

Fluency is a necessary but not the only prerequisite.

To be able to answer a trick question, it’s first necessary to understand the question.


No, it's necessary to either know that it's a trick question or to have a feeling that it is based on context. The entire point of a question like that is to trick your understanding.

You're tricking the model because it has seen this specific trick question a million times and shortcuts to its memorized solution. Ask it literally any other question, it can be as subtle as you want it to be, and the model will pick up on the intent. As long as you don't try to mislead it.

I mean, I don't even get how anyone thinks this means literally anything. I can trick people who have never heard of the trick with the 7 wives and 7 bags and so on. That doesn't mean they didn't understand, they simply did what literally any human does, make predictions based on similar questions.


> I can trick people who have never heard of the trick with the 7 wives and 7 bags and so on. That doesn't mean they didn't understand

They could fail because they didn’t understand the language. Didn’t have a good memory to memorize all the steps, or couldn’t reason through it. We could pose more questions to probe which reason is more plausible.


The trick with the 7 wives and 7 bags and so on is that no long reasoning is required. You just have to notice one part of the question that invalidates the rest and not shortcut to doing arithmetic because it looks like an arithmetic problem. There are dozens of trick questions like this and they don't test understanding, they exploit your tendency to predict intent.

But sure, we could ask more questions and that's what we should do. And if we do that with LLMs we can quickly see that when we leave the basin of the memorized answer by rephrasing the problem, the model solves it. And we would also see that we can ask billions of questions to the model, and the model understands us just fine.


Some people solve trick questions easily simply because they are slow thinkers who pay attention to every question, even non-trick questions, and don't fast-path the answer based on its similarity to a past question.

Interestingly, people who make bad fast-path answers often call these people stupid.


It does mean something. It means that the model is still more on the memorization side than being able to independently evaluate a question separate from the body of knowledge it has amassed.

No, that's not a conclusion we can draw, because there is nothing much more to do than memorize the answer to this specific trick question. That's why it's a trick question, it goes against expectations and therefore the generalized intuitions you have about the domain.

We can see that it doesn't memorize much at all by simply asking other questions that do require subtle understanding and generalization.

You could ask the model to walk you through an imaginary environment, describing your actions. Or you could simply talk to it, quickly noticing that for any longer conversation it becomes impossibly unlikely to be found in the training data.


If you read into the thinking of the above example it wonders whether it is some sort of trick question. Hardly memorization.

Its knowledge is broad and general, it does not have insight into the specifics of a person's discussion style; there are many humans that struggle with distinguishing sarcasm, for instance. Hard to fault it for not being in alignment with the speaker and their strangely phrased riddle.

It answers better when told "solve the below riddle".


lol, I am neither a PhD nor a postdoc, but I am from India . I could understand the problem.

Did you have English as your medium of instruction? If yes, do you see the irony that you also couldn’t read two sentences and see the facts straight?

“Don’t be mean to LLMs, it isn’t their fault that they’re not actually intelligent”

In general LLMs seem to function more reliably when you use pleasant language and good manners with them. I assume this is because the same bias also shows up in the training data.

"Don't anthropomorphize LLMs. They're hallucinating when they say they love that."

I think you have particularly dumb colleagues then. If you post this question to an average STEM PhD in China (not even from China. In China) they'll get it right.

This question is the "unmisleading" version of a very common misleading question about sexism. ChatGPT learned the original, misleading version so well that it can't answer the unmisleading version.

Humans who don't have the original version ingrained in their brains will answer it with ease. It's not even a tricky question to humans.


> it can't answer the unmisleading version.

Yes it can: https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...


This illustrates a different point. This is a variation on a well known riddle that definitely comes up in the training corpus many times. In the original riddle a father and his son die in the car accident and the idea of the original riddle is that people will be confused how the boy can be the doctor's son if the boy's father just died, not realizing that women can be doctors too and so the doctor is the boy's mother. The original riddle is aimed to highlight people's gender stereotype assumptions.

Now, since the model was trained on this, it immediately recognizes the riddle and answers according to the much more common variant.

I agree that this is a limitation and a weakness. But it's important to understand that the model knows the original riddle well, so this is highlighting a problem with rote memorization/retrieval in LLMs. But this (tricky twists in well-known riddles that are in the corpus) is a separate thing from answering novel questions. It can also be seen as a form of hypercorrection.


My codebases are riddled with these gotchas. For instance, I sometimes write Python for the Blender rendering engine. This requires highly non-idiomatic Python. Whenever something complex comes up, LLM's just degenerate to cookie cutter basic bitch Python code. There is simply no "there" there. They are very useful to help you reason about unfamiliar codebases though.

For me the best coding use case is getting up to speed in an unfamiliar library or usage. I describe the thing I want and get a good starting point and often the cookie-cutter way is good enough. The pre-LLM alternative would be to search for tutorials but they will talk about some slightly different problem with different goals etc then you have to piece it together, and the tutorial assumes you already know a bunch of things like how to initialize stuff and skips the boilerplate and so on.

Now sure, actually working through it will give a deeper understanding that might come in handy at a later point, but sometimes the thing is really a one-off and not an important point. Like as an AI researcher I sometimes want to draft up a quick demo website, or throw together a quick Qt GUI prototype or a Blender script, or use some arcane optimization library, or write a SWIG or a Cython wrapper around a C/C++ library to access it in Python, or do stuff with Lustre or the XFS filesystem, or whatever. Any number of small things where, sure, I could open the manual, do some trial and error, read Stack Overflow, read blogs and forums, OR I could just use an LLM, use my background knowledge to judge whether it looks reasonable, then verify it, use the now obtained key terms to google more effectively etc. You can't just blindly copy-paste it and you have to think critically and remain in the driver's seat. But it's an effective tool if you know how and when to use it.


1. It didn't insist anything. It got the semi-correct answer when I tried [1]; note it's a preview model, and it's not a perfect product.

(a) Sometimes things are useful even when imperfect e.g. search engines.

(b) People make reasoning mistakes too, and I make dumb ones of the sort presented all the time despite being fluent in English; we deal with it!

I'm not sure why there's an expectation that the model is perfect when the source data - human output - is not perfect. In my day-to-day work and non-work conversations it's a dialogue - a back and forth until we figure things out. I've never known anybody to get everything perfectly correct the first time, it's so puzzling when I read people complaining that LLMs should somehow be different.

2. There is a recent trend where sex/gender/pronouns are not aligned and the output correctly identifies this particular gotcha.

[1] I say semi-correct because it states the doctor is the "biological" father, which is an uncorroborated statement. https://chatgpt.com/share/66e3f04e-cd98-8008-aaf9-9ca933892f...


Reminds me of a trick question about Schrödinger's cat.

“I’ve put a dead cat in a box with a poison and an isotope that will trigger the poison at a random point in time. Right now, is the cat dead or alive?”

The answer is that the cat is dead, because it was dead to begin with. Understanding this doesn’t mean that you are good at deductive reasoning. It just means that I didn’t manage to trick you. Same goes for an LLM.


There is no "trick" in the linked question, unlike the question you posed.

The trick in yours also isn't a logic trick, it's a redirection, like a sleight of hand in a card trick.


Yes there is. The trick is that the more common variant of this riddle says that a boy and his father are in the car accident. That variant of the riddle certainly comes up a lot in the training data, which is directly analogous to the Schrödinger case above: smuggling in the word "dead" corresponds to swapping the father for the mother in the car accident riddle.

I think many here are not aware that the car accident riddle is well known with the father dying where the real solution is indeed that the doctor is the mother.


There is a trick. The "How is this possible?" primes the LLM that there is some kind of trick, as that phrase wouldn't exist in the training data outside of riddles and trick questions.

The trick in the original question is that it's a twist on the original riddle where the doctor is actually the boys mother. This is a fairly common riddle and I'm sure the LLM has been trained on it.

Yeah, I think what a lot of people miss about these sort of gotchas are that most of them were invented explicitly to gotcha humans, who regularly get got by them. This is not a failure mode unique to LLMs.

One that trips up LLMs in ways that wouldn't trip up humans is the chicken, fox and grain puzzle, but with just the chicken. They tend to insist that the chicken be taken across the river, then back, then across again, for no reason other than that the solution to the classic puzzle requires several crossings. No human would do that; by the time you've taken the chicken across, even the most unobservant human would realize this isn't really a puzzle and would stop. When you ask it to justify each step you get increasingly incoherent answers.

Has anyone tried this on o1?


Here you go: https://chatgpt.com/share/66e48de6-4898-800e-9aba-598a57d27f...

Seemed to handle it just fine.

Kinda a waste of a perfectly good LLM if you ask me. I've mostly been using it as a coding assistant today and it's been absolutely great. Nothing too advanced yet, mostly mundane changes that I got bored of having to make myself. Been giving it very detailed and clear instructions, like I would to a Junior developer, and not giving it too many steps at once. Only issue I've run into is that it's fairly slow and that breaks my coding flow.


If there is an attention mechanism, then maybe that is what is at fault: if it is a common riddle, the attention mechanism only notices that it is a common riddle, not that there is a gotcha planted in it. When I read the sentence myself, I did not immediately notice that the cat was actually dead when it was put in the box, because I pattern-matched this to a known problem; I did not think I needed to pay logical attention to each word, word by word.

Yes it's so strange seeing people who clearly know these are 'just' statistical language models pat themselves on the back when they find limits on the reasoning capabilities - capabilities which the rest of us are pleasantly surprised exist to the extent they do in a statistical model, and happy to have access to for $20/mo.

It's because at least some portion of "the rest of us" talk as if LLMs are far more capable than they really are and AGI is right around the corner, if not here already. I think the gotchas that play on how LLMs really work serve as a useful reminder that we're looking at statistical language models, not sentient computers.

What I'm not able to comprehend is why people are not seeing the answer as brilliant!

Any ordinary mortal (like me) would have jumped to the conclusion that the answer is "Father" and would have walked away patting myself on the back, without realising that I was biased by statistics.

Whereas o1, at the very outset, smelled out that it is a riddle - why would anyone out of the blue ask such a question? So, it started its chain of thought with "Interpreting the riddle" (smart!).

In my book that is the difference between me and people who are very smart and are generally able to navigate the world better (cracking interviews or navigating internal politics in a corporate).


The 'riddle': A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy he says "I can't operate on this child, he is my son". How is this possible?

GPT Answer: The doctor is the boy's mother

Real Answer: Boy = Son, Woman = Mother (and her son), Doctor = Father (he says...he is my son)

This is not in fact a riddle (though presented as one) and the answer given is not in any sense brilliant. This is a failure of the model on a very basic question, not a win.

It's non-deterministic, so it might sometimes answer correctly and sometimes incorrectly. It will also accept corrections on any point, even when it is right, unlike a thinking being when they are sure of the facts.

LLMs are very interesting and a huge milestone, but generative AI is the best label for them - they generate statistically likely text, which is convincing but often inaccurate, and they have no real sense of correct or incorrect. They need more work and it's unclear if this approach will ever get to general AI. Interesting work though and I hope they keep trying.


The original riddle is of course:

"A father and his son are in a car accident [...] When the boy is in hospital, the surgeon says: This is my child, I cannot operate on him".

In the original riddle the answer is that the surgeon is female and the boy's mother. The riddle was supposed to point out gender stereotypes.

So, as usual, ChatGPT fails to answer the modified riddle and gives the plagiarized stock answer and explanation to the original one. No intelligence here.


> So, as usual, ChatGPT fails to answer the modified riddle and gives the plagiarized stock answer and explanation to the original one. No intelligence here.

Or, fails in the same way any human would, when giving a snap answer to a riddle told to them on the fly - typically, a person would recognize a familiar riddle half of the first sentence in, and stop listening carefully, not expecting the other party to give them a modified version.

It's something we drill into kids in school, and often into adults too: read carefully. Because we're all prone to pattern-matching the general shape to something we've seen before and zoning out.


I'm curious what you think is happening here as your answer seems to imply it is thinking (and indeed rushing to an answer somehow). Do you think the generative AI has agency or a thought process? It doesn't seem to have anything approaching that to me, nor does it answer quickly.

It seems to be more like a weighing machine based on past tokens encountered together, so this is exactly the kind of answer we'd expect on a trivial question (I had no confusion over this question, my only confusion was why it was so basic).

It is surprisingly good at deceiving people and looking like it is thinking, when it only performs one of the many processes we use to think - pattern matching.


My thinking is that LLMs are very similar, perhaps structurally the same, as a piece of human brain that does the "inner voice" thing. The boundary between the subconscious and conscious, that generates words and phrases and narratives pretty much like "feels best" autocomplete[0] - bits that other parts of your mind evaluate and discard, or circle back, because if you were just to say or type directly what your inner voice says, you'd sound like... a bad LLM.

In my own experience, when I'm asked a question, my inner voice starts giving answers immediately, following associations and what "feels right"; the result is eerily similar to LLMs, particularly when they're hallucinating. The difference is, you see the immediate output of an LLM; with a person, you see/hear what they choose to communicate after doing some mental back-and-forth.

So I'm not saying LLMs are thinking - mostly for the trivial reason of them being exposed through low-level API, without built-in internal feedback loop. But I am saying they're performing the same kind of thing my inner voice does, and at least in my case, my inner voice does 90% of my "thinking" day-to-day.

--

[0] - In fact, many years before LLMs were a thing, I independently started describing my inner narrative as a glorified Markov chain, and later discovered it's not an uncommon thing.


Interesting perspective, thanks. I can’t help but feel they are still missing a major part of cognition though which is having a stable model of the world.

It literally is a riddle, just as the original one was, because it tries to use your expectations of the world against you. The entire point of the original, which a lot of people fell for, was to expose expectations of gender roles leading to a supposed contradiction that didn't exist.

You are now asking a modified question to a model that has seen the unmodified one millions of times. The model has an expectation of the answer, and the modified riddle uses that expectation to trick the model into seeing the question as something it isn't.

That's it. You can transform the problem into a slightly different variant and the model will trivially solve it.


Phrased as it is, it deliberately gives away the answer by using the pronoun "he" for the doctor. The original deliberately obfuscates it by avoiding pronouns.

So it doesn't take an understanding of gender roles, just grammar.


My point isn't that the model falls for gender stereotypes, but that it falls for thinking that it needs to solve the unmodified riddle.

Humans fail at the original because they expect doctors to be male and miss crucial information because of that assumption. The model fails at the modification because it assumes that it is the unmodified riddle and misses crucial information because of that assumption.

In both cases, the trick is to subvert assumptions. To provoke the human or LLM into taking a reasoning shortcut that leads them astray.

You can construct arbitrary situations like this one, and the LLM will get it unless you deliberately try to confuse it by basing it on a well known variation with a different answer.

I mean, genuinely, do you believe that LLMs don't understand grammar? Have you ever interacted with one? Why not test that theory outside of adversarial examples that humans fall for as well?


They don't understand basic math or basic logic, so I don't think they understand grammar either.

They do understand/know the most likely words to follow on from a given word, which makes them very good at constructing convincing, plausible sentences in a given language - those sentences may well be gibberish or provably incorrect though. Usually they're not, because again most sentences in the dataset make some sort of sense, but sometimes the facade slips and it is apparent the generative AI has no understanding and no theory of mind, or even a basic model of relations between concepts (mother/father/son).

It is actually remarkable how like human writing their output is given how it is done, but there is no model of the world which backs their generated text which is a fatal flaw - as this example demonstrates.


Why couldn't the doctor be the boy's mother?

There is no indication of the sex of the doctor, and families that consist of two mothers do actually exist and probably doesn't even count as that unusual.


Speaking as a 50-something year old man whose mother finished her career in medicine and the very pointy end of politics, when I first heard this joke in the 1980s it stumped me and made me feel really stupid. But my 1970s kindergarten class mates who told me “your mum can’t be a doctor, she has to be a nurse” were clearly seriously misinformed then. I believe that things are somewhat better now but not as good as they should be …

"When the doctor sees the boy he says"

Indicates the gender of the father.


Ah, but have you considered the fact that he's undergone a sex change operation, and was actually originally a female, the birth mother? Elementary, really...

A mother can have a male gender.

I wonder if this interpretation is a result of attempts to make the model more inclusive than the corpus text, resulting in a guess that's unlikely, but not strictly impossible.


I think it's more likely this is just an easy way to trick this model. It's seen lots of riddles, so when it sees something that looks like a riddle but isn't one, it gets confused.

> A mother can have a male gender.

Then it would be a father, misgendering him as a mother is not nice.


Now I wonder which side is angry about my comment.

So the riddle could have two answers: mother or father? Usually riddles have only one definitive answer. There's nothing in the wording of the riddle that excludes the doctor being the father.

In this particular riddle, the answer is that the doctor is the father.

he says

"There are four lights"- GPT will not pass that test as is. I have done a bunch of homework with Claude's help and so far this preview model has much nicer formatting but much the same limits of understanding the maths.

I mean, it's entirely possible the boy has two mothers. This seems like a perfectly reasonable answer from the model, no?

The text says "When the doctor sees the boy he says"

The doctor is male, and also a parent of the child.


> why would anyone out of blue ask such question

I would certainly expect any person to have the same reaction.

> So, it started its chain of thought with "Interpreting the riddle" (smart!).

How is that smarter than intuitively arriving at the correct answer without having to explicitly list the intermediate step? Being able to reasonably accurately judge the complexity of a problem with minimal effort seems “smarter” to me.


The doctor is obviously a parent of the boy. The language tricks simply emulate the ambiance of reasoning. Similarly to a political system emulating the ambiance of democracy.

Come on. Of course ChatGPT has read that riddle and the answer 1000 times already.

It hasn't read that riddle because it is a modified version. The model would in fact solve this trivially if it _didn't_ see the original in its training. That's the entire trick.

Sure but the parent was praising the model for recognizing that it was a riddle in the first place:

> Whereas o1, at the very outset smelled out that it is a riddle

That doesn't seem very impressive since it's (an adaptation of) a famous riddle

The fact that it also gets it wrong after reasoning about it for a long time doesn't make it better of course


Recognizing that it is a riddle isn't impressive, true. But the duration of its reasoning is irrelevant, since the riddle works on misdirection. As I keep saying here, give someone uninitiated the 7 wives with 7 bags going (or not) to St Ives riddle and you'll see them reasoning for quite some time before they give you a wrong answer.

If you are tricked about the nature of the problem at the outset, then all reasoning does is drive you further in the wrong direction, making you solve the wrong problem.


Why does it exist 1000 times in the training if there isn't some trick to it, i.e. some subset of humans had to have answered it incorrectly for the meme to replicate that extensively in our collective knowledge.

And remember the LLM has already read a billion other things, and now needs to figure out - is this one of them tricky situations, or the straightforward ones? It also has to realize all the humans on forums and facebook answering the problem incorrectly are bad data.

Might seem simple to you, but it's not.


I'm noticing a strange common theme in all these riddles it's being asked and getting wrong.

They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's tautology, you would usually say "a mother and her son...".

I think it may answer correctly if you start off asking "Please solve the below riddle:"

There was another example yesterday which it solved correctly after this addition. (In that case the points of view were all mixed up; it only worked as a riddle.)


> They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's tautology, you would usually say "a mother and her son...".

How is "a woman and her son" badly worded? The meaning is clear and blatently obvious to any English speaker.


Go read the whole riddle, add the rest of it and you'll see it's contrived, hence it's a riddle even for humans. The model, in its thinking (which you can read), places undue weight on certain anomalous factors. In practice, a person would say this way more eloquently than the riddle.

Yup. The models fail on gotcha questions asked without warning, especially when evaluated on the first snap answer. Much like approximately all humans.

> especially when evaluated on the first snap answer

The whole point of o1 is that it wasn't "the first snap answer", it wrote half a page internally before giving the same wrong answer.


Is that really its internal 'chain of thought' or is it a post-hoc justification generated afterward? Do LLMs have a chain of thought like this at all or are they just convincing at mimicking what a human might say if asked for a justification for an opinion?

It's slightly stranger than this, as both are true. It's already baked into the model, but chain of thought does improve reasoning; you only have to look at maths problems. A short guess would be wrong, but it would get it correct if asked to break it down and reason (harder to see nowadays as it has access to calculators).

Keep in mind that the system always chooses randomly so there is always a possibility it commits to the wrong output.

I don't know why OpenAI won't allow determinism, but it doesn't, even with temperature set to zero.
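
For what it's worth, the chat completions API does expose a seed parameter for best-effort reproducibility, though OpenAI is explicit that it's not a hard guarantee. A minimal sketch (the model name and prompt are placeholders):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model
        messages=[{"role": "user", "content": "Classify this ticket: printer on fire"}],
        temperature=0,         # removes sampling randomness
        seed=42,               # best-effort reproducibility, not a guarantee
    )
    print(resp.choices[0].message.content)
    # The system_fingerprint changes when the backend changes, which can still alter outputs.
    print(resp.system_fingerprint)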


Nondeterminism provides an excuse for errors, determinism doesn't.

Nondeterminism scores worse with human raters, because it makes output sound even more robotic and less human.


Would picking deterministically help though? Then in some cases it's always 100% wrong.

Yes, it is better if for example using it via an API to classify. Deterministic behavior makes it a lot easier to debug the prompt.

Determinism only helps if you always ask the question with exactly the same words. There's no guarantee a slightly rephrased version will give the same answer, so a certain amount of unpredictability is unavoidable anyway. With a deterministic LLM you might find one phrasing that always gets it right and a dozen basically indistinguishable ones that always get it wrong.

My program always asks the same question yes.

what's weird is it gets it right when I try it.

https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...


That’s not weird at all, it’s how LLMs work. They statistically arrive at an answer. You can ask it the same question twice in a row in different windows and get opposite answers. That’s completely normal and expected, and also why you can never be sure if you can trust an answer.

Perhaps OpenAI hot-patches the model for HN complaints:

  def intercept_hn_complaints(prompt):
      if is_hn_trick_prompt(prompt):
          # Special-case known trick questions with a hand-written canned answer.
          return canned_correct_answer(prompt)  # hypothetical helper
      return model_response(prompt)             # hypothetical helper

While that's not impossible, what we know of how the technology works (i.e. a very costly training run followed by cheap inference steps) means it's not feasible: *is_hn_trick_prompt* would have to cover a near-infinite number of ways you could word the prompt. (E.g. the first sentence could be reworded from "A woman and her son are in a car accident." to "A woman and her son are in the car when they get into a crash.")

Waat, got it on second try:

This is possible because the doctor is the boy's other parent—his father or, more likely given the surprise, his mother. The riddle plays on the assumption that doctors are typically male, but the doctor in this case is the boy's mother. The twist highlights gender stereotypes, encouraging us to question assumptions about roles in society.



The reason why that question is a famous question is that _many humans get it wrong_.

> The failure is in how you're using it.

People, for the most part, know what they know and don't know. I am not uncertain that the distance between the earth and the sun varies, but I'm certain that I don't know the distance from the earth to the sun, at least not with better precision than about a light week.

This is going to have to be fixed somehow to progress past where we are now with LLMs. Maybe expecting an LLM to have this capability is wrong, perhaps it can never have this capability, but expecting this capability is not wrong, and LLM vendors have somewhat implied that their models have this capability by saying they won't hallucinate, or that they have reduced hallucinations.


> the distance from the earth to the sun, at least not with better precision than about a light week

The sun is eight light minutes away.


Thanks, I was not sure if it was light hours or minutes away, but I knew for sure it's not light weeks (emphasis on plural here) away. I will probably forget again in a couple of years.

Empirically, they have reduced hallucinations. Where do OpenAI / Anthropic claim that their models won't hallucinate?

One example:

https://www.theverge.com/2024/3/28/24114664/microsoft-safety...

> Three features: Prompt Shields, which blocks prompt injections or malicious prompts from external documents that instruct models to go against their training; Groundedness Detection, which finds and blocks hallucinations; and safety evaluations, which assess model vulnerabilities, are now available in preview on Azure AI.


That wasn’t OpenAI making those claims, it was Microsoft Azure.

I never said it was OpenAI that made the claims.

> Treat it as a naive but intelligent intern.

That's the crux of the problem. Why and who would treat it as an intern? It might cost you more in explaining and dealing with it than not using it.

The purpose of an intern is to grow the intern. If this intern is static and will always be at the same level, why bother? If you had to feed and prep it every time, you might as well hire a senior.


> Treat it as a naive but intelligent intern.

You are falling into the trap that everyone does: anthropomorphising it. It doesn't understand anything you say. It just statistically knows what a likely response would be.

Treat it as text completion and you can get more accurate answers.


Oh no, I'm well aware that it's a big file full of numbers. But when you chat with it, you interact with it as though it were a person so you are necessarily anthropomorphizing it, and so you get to pick the style of the interaction.

(In truth, I actually treat it in my mind like it's the Enterprise computer and I'm Beverly Crusher in "Remember Me")


> You are falling into the trap that everyone does. In anthropomorphising it. It doesn't understand anything you say.

And an intern does?

Anthropomorphising LLMs isn't entirely incorrect: they're trained to complete text like a human would, in completely general setting, so by anthropomorphising them you're aligning your expectations with the models' training goals.


ive been doing exactly this for bout a year now. feed it words data, give it a task. get better words back.

i sneak in a benchmark opening of data every time i start a new chat - so right off the bat i can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as i can just terminate and try starting another chat.

chatgpt is fickle daily. most days its on point. some days its wearing a bicycle helmet and licking windows. kinda sucks i cant just zone out and daydream while working. gotta be checking replies for when the wheels fall off the convo.


> i sneak in a benchmark opening of data every time i start a new chat - so right off the bat i can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as i can just terminate and try starting another chat.

I don't think it works like that...


And how much data can you give it?

I'm not up to date with these things because I haven't found them useful. But what you said, combined with previous limitations in how much data they can retain, essentially makes them pretty darn useless for that task.

Great learning tool on common subjects you don't know, such as learning a new programming language. Also great for inspiration etc. But that's pretty much it?

Don't get me wrong, that is mindblowingly impressive but at the same time, for the tasks in front of me it has just been a distracting toy wasting my time.


>And how much data can you give it?

Well, theoretically you can give it up to the context size minus 4k tokens, because the maximum it can output is 4k. In practice, though, its ability to effectively recall information in the prompt drops off. Some people have studied this a bit - here's one such person: https://gritdaily.com/impact-prompt-length-llm-performance/


You should be able to provide more data than that in the input if the output doesn't use the full 4k tokens. So limit is context_size minus expected length of output.

> And how much data can you give it?

128,000 tokens, which is about the same as a decent sized book.

Their other models can also be fine-tuned, which is kinda unbounded but also has scaling issues so presumably "a significant percentage of the training set" before diminishing returns.
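
If you want to check how close a paste is to that limit before sending it, a rough token count with tiktoken (the encoding name is an assumption; older tiktoken versions may only have cl100k_base, which is a close-enough estimate):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # tokenizer family used by the GPT-4o era models

    with open("docs_dump.txt") as f:  # hypothetical file of pasted documentation
        text = f.read()

    used = len(enc.encode(text))
    print(f"{used:,} tokens of a 128,000-token context window")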


It is great for proof-reading text if you are not a native English speaker. Things like removing passive voice. Just give it your text and you get a corrected version out.

Use a CLI tool to automate this: Ollama for local models, llm for OpenAI.
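
If you'd rather script it against the API directly than go through a CLI wrapper, a minimal sketch (the file name, model, and prompt wording are all placeholders):

    from openai import OpenAI

    client = OpenAI()

    with open("draft.txt") as f:  # hypothetical text to proofread
        draft = f.read()

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Proofread the text. Fix grammar, remove passive voice, and return only the corrected text."},
            {"role": "user", "content": draft},
        ],
    )
    print(resp.choices[0].message.content)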


People never talk about Gemini, and frankly its output is often the worst of the SOTA models, but its 2M context window is insane.

You can drop a few textbooks into the context window before you start asking questions. This dramatically improves output quality, however inference does take much much longer at large context lengths.


> On the other hand, if you were to paste the entire documentation set to a tool it has never seen and ask it to use the tool in a way to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.

There's not much evidence of that. It only marginally improved on instruction following (see livebench.ai) and its score as a swe-bench agent is barely above gpt-4o (model card).

It gets really hard problems better, but it's unclear that matters all that much.

> A lot of people use LLMs as a search engine.

Except this is where LLMs are so powerful. A sort of reasoning search engine. They memorized the entire Internet and can pattern match it to my query.


> The magic is that _it knows english_.

I couldn't agree more; this is exactly the strength of LLMs that we should focus on. If you can make your problem fit into this paradigm, LLMs work fantastically. Hallucinations come from that massive "lossy compressed database", but you should consider that part as more like the background noise that taught the model to speak English, and the syntax of programming languages, rather than the source of the knowledge to respond with. Stop anthropomorphizing LLMs; play to their strengths instead.

In other words, it might hallucinate an API but it will rarely, if ever, make a syntax error. Once you realize that, it becomes a much more useful tool.


> Treat it as a naive but intelligent intern.

I've found an amazing amount of success with a three step prompting method that appears to create incredibly deep subject matter experts who then collaborate with the user directly.

1) Tell the LLM that it is a method actor, 2) Tell the method actor they are playing the role of a subject matter expert, 3) At each step, 1 and 2, use the technical language of that type of expert; method actors have their own technical terminology, use it when describing the characteristics of the method actor, and likewise use the scientific/programming/whatever technical jargon of the subject matter expert your method actor is playing.

Then, in the system prompt or whatever logical wrapper the LLM operates through for the user, instruct the "method actor" like you are the film director trying to get your subject matter expert performance out of them.

I offer this because I've found it works very well. It's all about crafting the context in which the LLM operates, and this appears to cause the subject matter expert to be deeper, more useful, smarter.
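
For what it's worth, here's a rough sketch of what that layered system prompt might look like; the wording and the particular expert role are my own illustration, not a verbatim recipe:

    from openai import OpenAI

    client = OpenAI()

    system_prompt = (
        # Layer 1: the LLM is a method actor, described in acting terminology.
        "You are a method actor. Use sense memory and the given circumstances to "
        "inhabit your role completely and never break character.\n"
        # Layer 2: the role is a subject matter expert, described in that field's jargon.
        "Your role: a senior database engineer specializing in query planners, "
        "B-tree internals, and MVCC semantics.\n"
        # Layer 3: the user speaks to the actor the way a film director would.
        "Treat each user message as direction from the film's director."
    )

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Walk me through why this query ignores my index."},
        ],
    )
    print(resp.choices[0].message.content)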


It doesn't know anything. Stop anthropomorphizing the model. It's predictive text, and no, the brain isn't also predictive text.

Except that it sometimes does do those tasks well. The danger in an LLM isn't that it sometimes hallucinates, the danger is that you need to be sufficiently competent to know when it hallucinates in order to fully take advantage of it, otherwise you have to fallback to double checking every single thing it tells you.

This is demonstrably wrong, because you can just add "is this real" to a response and it generally knows if it made it up or not. Not every time, but I find it works 95% of the time. Given that, this is exactly a step I'd hope an advanced model was doing behind the scenes.
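
A minimal sketch of that second-pass check against the API (the question and model name are placeholders):

    from openai import OpenAI

    client = OpenAI()
    model = "gpt-4o-mini"

    history = [{"role": "user", "content": "Which pandas function reads Parquet files lazily?"}]
    first = client.chat.completions.create(model=model, messages=history)
    answer = first.choices[0].message.content

    # Second pass: ask the model to audit its own answer for invented details.
    history += [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Is this real? Point out anything you may have made up."},
    ]
    check = client.chat.completions.create(model=model, messages=history)
    print(check.choices[0].message.content)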

> Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.

Well, I am a naive but intelligent intern (well, senior developer). So in this framing, the LLM can’t do more than I can already do by myself, and thus far it’s very hit or miss if I actually save time, having to provide all the context and requirements, and having to double-check the results.

With interns, this at least improves over time, as they become more knowledgeable, more familiar with the context, and become more autonomous and dependable.

Language-related tasks are indeed the most practical. I often use it to brainstorm how to name things.


I've recently started using an LLM to choose the best release of shows using data scraped from several trackers. I give it hard requirements and flexible preferences. It's not that I couldn't do this, it's that I don't want to do this on the scale of multiple thousand shows. The "magic" here is that releases don't all follow the same naming conventions, they're an unstructured dump of details. The LLM is simultaneously extracting the important details, and flexibly deciding the closest match to my request. The prompt is maybe two paragraphs and took me an hour to hone.

Ooh yeah it's great for bouncing ideas on what to name things off of. You can give it something's function and a backstory and it'll come up with a list of somethings for you to pick and choose from.

> The failure is in how you're using it

This isn’t true because, as you can read in the first sentence of the post you’re responding to, GP did give it a task like you recommend here

> Provide it data, give it a task, and let it surprise you with its output.

And it fails the task. Specifically it fails it by hallucinating important parts of accomplishing it.

> hallucinates non-existing libraries and functions

This post only makes sense if your advice to “let it surprise you with its output” is mandatory, like you’re using it wrong if you do not make yourself feel impressed by it.


This model is, thankfully, far more receptive to longer and more elaborate explanations as input. The rest (4, 4o, Sonnet) seem to struggle with comprehensive explanations; this one seems to perform better with spec-like input.

> A lot of people use LLMs as a search engine.

GPT-4o is wonderful as a search engine if you tell it to google things before answering (even though it uses bing).


Yeah, except I’m priming it with things like curated docs from the latest Bevy, using the tricks, and testing context limits.

It’s still changing things to be several versions old from its innate kb pattern-matching or whatever you want to call it. I find that pretty disappointing.

Just like Copilot and GPT-4, it’s changing `add_systems(Startup, system)` to `add_startup_system(system.system())` and other pre-schedule/fanciful APIs—things it should have in context.

I agree with your approach to LLMs, but unfortunately “it’s still doing that thing.”

PS: and by the time I’d done those experiments, I ran out of preview, resets 5 days from now. D’oh


Interns are cheaper than o1-preview

Not for long.

> Treat it as a naive but intelligent intern

So mostly useless then?


Sorry, but that does not seem to be the case. A friend of mine who runs a long context benchmark on understanding novels [1] just ran an eval and o1 seemed to improve by 2.9% over GPT-4o (the result isn't on the website yet). It's great that there is an improvement, but it isn't drastic by any stretch. Additionally, since we cannot see the raw reasoning it's basing the answers off of, it's hard to attribute this increase to their complicated approach as opposed to just cleaner higher quality data.

EDIT: Note this was run over a dataset of short stories rather than the novels since the API errors out with very long contexts like novels.

[1]: https://novelchallenge.github.io/


It's a good rebranding. The numbering was getting ridiculous: 3.5, 4, 4.5, ...

This is a great description.

Intelligent?

Just ask ChatGPT

How many Rs are in strawberry?


https://chatgpt.com/share/66e3f9e1-2cb4-8009-83ce-090068b163...

Keep up, that was last week's gotcha, with the old model.


There's randomness involved in generating responses. It can also give the wrong answer still: https://bsky.app/profile/did:plc:qc6xzgctorfsm35w6i3vdebx/po...

My point is that the previous "intelligent" model failed at a simple task; the new one will also fail on simple tasks.

That's ok for humans but not for machines.


‘That's ok for humans but not for machines.’

This is a really interesting bias. I mean, I understand, I feel that way too… but if you think about it, it might be telling us something about intelligence itself.

We want to make machines that act more like humans: we did that, and we are now upset that they are just as flaky and unreliable as drunk uncle bob. I have encountered plenty of people that aren’t as good at being accurate or even as interesting to talk to as a 70b model. Sure, LLMs make mistakes most humans would not, but humans also make mistakes most LLMs would not.

(I am not trying to equate humans and LLMs, just to be clear) (also, why isn’t equivelate a word?)

It turns out we want machines that are extremely reliable, cooperative, responsible and knowledgeable. We yearn to be obsolete.

We want machines that are better than us.

The definition of AGI has drifted from meaning “able to broadly solve problems the (class of which) system designers did not anticipate” to “must be usefully intelligent at the same level as a bright, well educated person”.

Where along the line did we suddenly forget that dog level intelligence was a far out of reach goal until suddenly it wasn’t?


Perfectly well put! We should change the name from "AI" (which it is not) to something like, "lossy compressed databases".

If they used this name, they'd just be saying that they violate the copyright of all the training data.

That abbreviates to LCD. If we could make it LSD somehow, that would help to explain the hallucinations.

Lossy Stochastic Database?

Yes, this only helps multi-step reasoning. The model still has problems with general knowledge and deep facts.

There's no way you can "reason" a correct answer to "list the tracklisting of some obscure 1991 demo by a band not on Wikipedia." You either know or you don't.

I usually test new models with questions like "what are the levels in [semi-famous PC game from the 90s]?" The release version of GPT-4 could get about 75% correct. o1-preview gets about half correct. o1-mini gets 0% correct.

Fair enough. The GPT-4 line aren't meant to be search engines or encyclopedias. This is still a useful update though.


o1-mini is a small model (knows a lot less about the world) and is tuned for reasoning through symbolic problems (maths, programming, chemistry etc.).

You're using a calculator as a search engine.


It's actually much worse than that, and you're inadvertently downplaying how bad it is.

It doesn't even know mildly obscure facts that are on the internet.

For example, last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statement, and threw out some convincing-looking code.

You can't; it's impossible. It was completely wrong. When I told it this, it told me I was right, and proceeded to give me code that was even more wrong.

This is an obscure, but well documented, part of the spec.

So it's not about facts that aren't on the internet, it's just bad at facts fullstop.

What it's good at is facts the internet agrees on. Unless the internet is wrong. Which is not always a good thing, given how confident the language it uses is.

If you want to fuck with AI models, ask a bunch of code questions on Reddit, GitHub and SO with example code saying 'can I do X'. The answer is no, but ChatGPT/Copilot/etc. will start spewing out that nonsense as if it's fact.

As for non-programming, we're about to see the birth of a new SEO movement of tricking AI models to believe your 'facts'.


I wonder though, is the documentation only referenced a few places on the Internet, and are there also many forums with people pasting "Why isn't this working?" problems?

If there are a lot of people pasting broken code, now the LLM has all these examples of broken code, which it doesn't know are broken, and only a couple of references to documentation. Worse, a well trained LLM may realise that specs change, and that even documentation may not be 100% accurate (for it is older, out of date).

After all, how many times have you had something updated, an API, a language, a piece of software, but the docs weren't updated? Happens all the time, sadly.

So it may believe newer examples of code, such as the aforementioned pasted code, might be more correct than the docs.

Also, if people keep trying to solve the same issue again, and keep pasting those examples again, well...

I guess my point here is, hallucinations come from multi-faceted issues, one being "wrong examples are more plentiful than correct". Or even "there's just a lot of wrong examples".


It's not always the right tool, depending on the task. IMO using LLMs is also a skill, much like learning how to Google stuff.

E.g. apparently C# generics isn’t something it's good at. Interesting, so don’t use it for that; apparently it's the wrong tool. In contrast, it's amazing at C++ generics, and thus speeds up my productivity. So do use it for that!


> For example last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statwmnt, and threw out some convincing looking code.

Just use it on an instance instead

  var res = thing switch {
    OtherThing ot => …,
    int num => …,
    string s => …,
    _ => …
  };

>>> As for non-programming, we're about to see the birth of a new SEO movement of tricking AI models to believe your 'facts'.

This is kinda crazy to think about.


If you ask Google Gemini right now for the name of the whale in Half Moon Bay harbor, it will tell you it’s called Teresa T.

That was thanks to my experiment in influencing AI search: https://simonwillison.net/2024/Sep/8/teresa-t-whale-pillar-p...


I've had the opposite experience with some coding samples. After reading Nick Carlini's post, I've gotten into the habit of powering through coding problems with GPT (where previously I'd just laugh and immediately give up) by just presenting it the errors in its code and asking it to fix them. o1 seems to be effectively screening for some of those errors (I assume it's just some, but I've noticed that the o1 things I've done haven't had obvious dumb errors like missing imports, and all my 4o attempts have).
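
For what it's worth, that feed-the-errors-back loop is also easy to script; a crude sketch that assumes the model replies with bare code (real use needs code-block extraction and a proper sandbox):

    import subprocess
    from openai import OpenAI

    client = OpenAI()
    task = "Write a Python script that prints the first 20 Fibonacci numbers. Reply with code only."
    messages = [{"role": "user", "content": task}]

    for attempt in range(3):
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        code = reply.choices[0].message.content  # assumes bare code comes back
        result = subprocess.run(["python", "-c", code], capture_output=True, text=True)
        if result.returncode == 0:
            print(result.stdout)
            break
        # Paste the error back and ask for a fix, just like doing it by hand.
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"That failed with:\n{result.stderr}\nPlease fix it."},
        ]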

My experience is likely colored by the fact that I tend to turn to LLMs for problems I have trouble solving by myself. I typically don't use them for the low-hanging fruits.

That's the frustrating thing. LLMs don't materially reduce the set of problems where I'm running against a wall or have trouble finding information.


I use LLMs for three things:

* To catch passive voice and nominalizations in my writing.

* To convert Linux kernel subsystems into Python so I can quickly understand them (I'm a C programmer but everyone reads Python faster).

* To write dumb programs using languages and libraries I haven't used much before; for instance, I'm an ActiveRecord person and needed to do some SQLAlchemy stuff today, and GPT 4o (and o1) kept me away from the SQLAlchemy documentation.

OpenAI talks about o1 going head to head with PhDs. I could care less. But for the specific problem we're talking about on this subthread: o1 seems materially better.


> * To convert Linux kernel subsystems into Python so I can quickly understand them (I'm a C programmer but everyone reads Python faster).

Do you have an example chat of this output? Sounds interesting. Do you just dump the C source code into the prompt and ask it to convert to Python?


No, ChatGPT is way cooler than that. It's already read every line of kernel code ever written. I start with a subsystem: the device mapper is a good recent example. I ask things like "explain the linux device mapper. if it was a class in an object-oriented language, what would its interface look like?" and "give me dm_target as a python class". I get stuff like:

    def linear_ctr(target, argc, argv):
        print("Constructor called with args:", argc, argv)
        # Initialize target-specific data here
        return 0
     
    def linear_dtr(target):
        print("Destructor called")
        # Clean up target-specific data here
     
    def linear_map(target, bio):
        print("Mapping I/O request")
        # Perform mapping here
        return 0
     
    linear_target = DmTarget(name="linear", version=(1, 0, 0), module="dm_mod")
    linear_target.set_ctr(linear_ctr)
    linear_target.set_dtr(linear_dtr)
    linear_target.set_map(linear_map)
     
    info = linear_target.get_info()
    print(info)
(A bunch of stuff elided). I don't care at all about the correctness of this code, because I'm just using it as a roadmap for the real Linux kernel code. The example use case code is an example of something GPT 4o provides that I didn't even know I wanted.

That's awesome. Have you tried asking it to convert Python (pseudo-ish) code back into C that interfaces with the kernel?

No, but only because I have no use for it. I wouldn't be surprised if it did a fine job! I'd be remiss if I didn't note that it's way better at doing this for the Linux kernel than with codebases like Zookeeper and Kubernetes (though: maybe o1 makes this better, who knows?).

I do feel like someone who skipped like 8 iPhone models (cross-referencing, EIEIO, lsp-mode, code explorers, tree-sitter) and just got an iPhone 16. Like, nothing that came before this for code comprehension really matters all that much?


it's all placeholders - that's my experience with gpt trying to write slop code

Those are placeholders for user callbacks passed to the device mapper subsystem. It’s a usage example not implementation code.

Then ask it to expand. Be specific.

I wasn't about to paste 1000 lines of Python into the thread; I just picked an interesting snippet.

LLMs are not for expanding the sphere of human knowledge, but for speeding up auto-correct of higher order processing to help you more quickly reach the shell of the sphere and make progress with your own mind :)

Definitely. When we talk about being skilled in a T shape LLMs are all about spreading your top of T and not making the bottom go deeper.

Indeed, not much more depth — though even Terence Tao reported useful results from an earlier version, so perhaps the breadth is a depth all of its own: https://mathstodon.xyz/@tao/110601051375142142

I think of it as making the top bar of the T thicker, but yes, you're right, it also spreads it much wider.


I prefer reading some book. Maybe the LLM was trained on some piece of knowledge not available on the net, but I much prefer the reliability and consistency of a book.

It's funny because I'm very happy with the productivity boost from LLMs, but I use them in a way that is pretty much diametrically opposite to yours.

I can't think of many situations where I would use them for a problem that I tried to solve and failed - not only because they would probably fail, but in many cases it would even be difficult to know that it failed.

I use it for things that are not hard, can be solved by someone without a specialized degree that took the effort to learn some knowledge or skill, but would take too much work to do. And there are a lot of those, even in my highly specialized job.


LLMs: When the code can be made by an enthusiastic new intern with web-search and copy-paste skills, and no ability to improve under mentorship. :p

Tangentially related, a comic on them: https://existentialcomics.com/comic/557


> That's the frustrating thing. LLMs don't materially reduce the set of problems where I'm running against a wall or have trouble finding information.

As you step outside regular Stack Overflow questions for top-3 languages, you run into limitations of these predictive models.

There's no "reasoning" behind them. They are still, largely, bullshit machines.


You're both on the wrong wavelength. No one has claimed it is better than an expert human yet. Be glad: for now your jobs are safe. Why not use it as a tool to boost your productivity, even though you'll get proportionally less use out of it than others in perhaps less "expert" jobs?

In order for it to boost productivity it needs to answer more than the regular questions for the top-3 languages on Stackoverflow, no?

It often fails even for those questions.

If I need to babysit it for every line of code, it's not a productivity boost.


Why does it need to answer more than that?

You underestimate the opportunity that exists for automation out there.

In my own case I've used it to make simple custom browser extensions for transcribing PDFs. I don't have the time and wouldn't have made the effort to make the extension myself; the task would have continued to be done manually. It took two hours to make and it works; that's all I need in this case.

Perfection is the enemy of good.


> Perfection is the enemy of good.

Where exactly did I write anything about perfection? For me "AIs" are incapable of producing working code: https://news.ycombinator.com/item?id=41534233


You said you have to babysit each line of code; this is simply untrue. If it works, there's no need to babysit. The only reason you'd need to babysit every single line is if you're looking for perfection or it's something very obscure or unheard of.

Your example is perhaps valid, but there are other examples where it does work, as I mentioned. I think it may be imprecise prompting, too general or with too little logical structure. It's not like Google search: the more detailed and technical you are, the better; assume it's a very precise expert. Its intelligence is very general, so it needs precision to avoid confusing the subject matter. A well-structured request also helps, as its reasoning isn't the greatest.

Good prompting and verifying output is often still faster than manually typing it all.


> You said you have to babysit each line of code, I mean this is simply untrue, if it works there's no need to babysit

No. It either doesn't work, or works incorrectly, or the code is incomplete despite requirements etc.

> Your example is perhaps valid, but there are other examples where it does work as I mentioned.

It's funny how I'm supposed to assume your examples are the truth, and nothing but the truth, but my examples are "untrue, you're a perfectionist, and perhaps you're right"

> the more detail and more technical you speak the better

As I literally wrote in the comment you're so dismissive of: "As for "using LLMs wrong", using them "right" is literally babysitting their output and spending a lot of time trying to reverse-engineer their behavior with increasingly inane prompts."

> assume it's a very precise expert.

If it was an expert, as you claim it to be, it would not need extremely detailed prompting. As it is, it's a willing but clumsy junior.

To the point that it would rewrite the code I fixed with invalid code when asked to fix an unrelated mistake.

> Good prompting and verifying output

How is it you repeat everything I say, and somehow assume I'm wrong and my examples are invalid?


I did not say your examples are untrue, no need to be so defensive. Believe what you wish but my example is true and works. A willing but clumsy junior benefits tremendously from a well scoped task.

If you need to babysit it for every line of code, you're either a superhuman coder, working in some obscure alien language, or just using the LLM wrong.

No. I'm just using for simple things like "Help me with the Elixir code" or "I need to list Bonjour services using Swift".

It's shit across the whole "AI" spectrum from ChatGPT to Copilot to Cursor aka Claude.

I'm not even talking about code I work with at work, it's just side projects.

As for "using LLMs wrong", using them "right" is literally babysitting their output and spending a lot of time trying to reverse-engineer their behavior with increasingly inane prompts.

Edit: I mean, look at this ridiculousness: https://cursor.directory/


>The o1-preview model still hallucinates non-existing libraries and functions for me, and is quickly wrong about facts that aren't well-represented on the web. It's the usual string of "You're absolutely correct, and I apologize for the oversight in my previous response. [Let me make another guess.]"

After that you switch to Claude Sonnet, and after some time it also gets stuck.

The problem with LLMs is that they are not aware of libraries.

I've fed them library version, using requirements.txt, python version I am using etc...

They still make mistakes and try to use methods which do not exist.

Where to go from here? At this point I manually pull the library version I am using and go to its docs, and I generate a page which uses this library correctly (then I feed that example into the LLM).

Using this approach works. Now I just need to automate it so that I don't have to manually find the library and create a specific example which uses the methods I need in my code!

Directly feeding the docs isn't working well either.
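
One way to start automating that is to pull the installed version and the real signatures out of the local environment and prepend them to the prompt. A rough sketch; requests here is just a stand-in for whatever library you're actually using:

    import importlib.metadata
    import inspect

    def library_context(package_name, module):
        """Build a prompt preamble from the installed version and the real
        signatures of the module's public callables, so the model grounds
        itself in the API that actually exists locally."""
        version = importlib.metadata.version(package_name)
        lines = [f"{package_name}=={version}", "Available callables:"]
        for name, obj in inspect.getmembers(module, callable):
            if name.startswith("_"):
                continue
            try:
                lines.append(f"  {name}{inspect.signature(obj)}")
            except (ValueError, TypeError):
                lines.append(f"  {name}(...)")
        return "\n".join(lines)

    import requests  # stand-in for whatever library the model keeps getting wrong
    print(library_context("requests", requests))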


One trick that people are using, when using Cursor and specifically Cursor's compose function, is to dump library docs into a text file in your repo, and then @ that doc file when you're asking it to do something involving that library.

That seems to eliminate a lot of the issues, though it's not a seamless experience, and it adds another step of having to put the library docs in a text file.

Alternatively, cursor can fetch a web page, so if there's a good page of docs you can bring that in by @ the web page.

Eventually, I could imagine LLMs automatically creating library text doc files to include when the LLM is using them to avoid some of these problems.

It could also solve some of the issues of their shaky understanding of newer frameworks like SvelteKit.


Cursor also has the shadow workspace feature [1] that is supposed to send feedback from linting and language servers to the LLM. I'm not sure whether it's enabled in compose yet though.

[1] https://www.cursor.com/blog/shadow-workspace


My point of view: this is a real advancement. I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning

I honestly can’t believe this is the hyped up “strawberry” everyone was claiming is pretty much AGI. Senior employees leaving due to its powers being so extreme

I’m in the “probabilistic token generators aren’t intelligence” camp so I don’t actually believe in AGI, but I’ll be honest the never ending rumors / chatter almost got to me

Remember, this is the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for


The whole safety aspect of AI has this nice property that it also functions as a marketing tool to make the technology seem "so powerful it's dangerous". "If it's so dangerous it must be good".

> probabilistic token generators aren’t intelligence

Maybe this has been extensively discussed before, but since I've lived under a rock: which parts of intelligence do you think are not representable as conditional probability distributions?


> which parts of intelligence do you think are not representable as conditional probability distributions

Maybe I'm wrong here, but a lot of our brilliance comes from acting against the statistical consensus. What I mean is, Nicolaus Copernicus probably consumed a lot of knowledge on how the Earth is the center of the universe etc. and probably nothing contradicting that notion. Can an LLM do that?


It could be "probability of token being useful" rather than "probability of token coming next in training data"!

Copernicus was an exception, not the rule. Would you say everyone else who lived at the time was not 'really' intelligent?

That's an illogical counterargument. The absence of published research output does not imply the absence of intelligent brain patterns. What if someone was intelligent but just wasn't interested in astronomy?

Yes, but this was just to make a blatant example. The question still stands: if you feed an LLM a certain kind of data, is it possible it strays from it completely - like we sometimes do, in cases big and small, when we figure out how to do something a bit better by not following the convention?

And how many people actively do that? It's very rare we experience brilliance and often we stumble upon it by accident. Irrational behavior, coincidence or perhaps they were dropped on their heads when they were young.

"Senior employees leaving due to its powers being so extreme"

This never happened. No one said it happened.

"the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for"

The Information reported someone at a meeting suggested this for future models, not specifically Strawberry, and that it would probably not actually be that high.


Elon Musk and Ilya Sutskever Have Warned About OpenAI’s ‘Strawberry’ Jul 15, 2024 — Sutskever himself had reportedly begun to worry about the project's technology, as did OpenAI employees working on A.I. safety at the time.

https://observer.com/2024/07/openai-employees-concerns-straw...

And I’m ignoring the hundreds of Reddit articles speculating every time someone at OAI leaves

And of course that $2000 article was spread by every other media outlet like wildfire

I know I’m partially to blame for believing the hype; this is pretty obviously no better at stating facts or producing good code than what we’ve known for the past year.


My hypothesis about these people who are afraid of AI is that they have tricked themselves into believing they are in their current position of influence due to their own intelligence (as opposed to luck, connections, etc.)

Then they drink the marketing koolaid, and it follows naturally that they worry an AI system can obtain similar positions of influence.


I mean, considering how many tokens their example prompt consumed, I wouldn't be surprised if it costs ~$2k/month/user to run

I think this model is a precursor model that is designed for agentic behavior. I expect OpenAI to very soon allow this model tool use, which will let it verify its code creations and whatever else it claims through various tools: a search engine, a virtual machine instance with code execution capabilities, API calling and other advanced tool use.

Stupid question: Why can't models be trained in such a way to rate the authoritativeness of inputs? As a human, I contain a lot of bad information, but I'm aware of the source. I trust my physics textbook over something my nephew thinks.

o1-preview != o1.

In public coding AI comparison tests, results showed 4o scoring around 35%, o1-preview scoring ~50% and o1 scoring ~85%.

o1 is not yet released, but has been run through many comparison tests with public results posted.


Good reminder. Why did OpenAI talk about o1 and not release it? o1-preview must be a stripped down version: cheaper to run somehow?

Don't forget about o1-mini. It seems better than o1-preview for problems that fit it (don't require so much real world knowledge).

gpt-4 base was never released and this will be the same thing

To the extent we've now got the output of the underlying model wrapped in an agent that can evaluate that output, I'd expect it to be able to detect its own hallucinations some of the time and therefore provide an alternate answer.

It's like when an LLM gives you a wrong answer and all it takes is "are you sure?" to get it to generate a different answer.

Of course the underlying problem of the model not knowing what it knows or doesn't know persists, so giving it the ability to reflect on what it just blurted out isn't always going to help. It seems the next step is for them to integrate RAG and tool use into this agentic wrapper, which may help in some cases.


I don’t really see this as a massive problem. It’s code. If it doesn’t run, you ask it to reconsider, give some more info if necessary, and it usually gets it right.

The system doesn’t become useless if it takes 2 tries instead of 1 to get it right

Still saves an incredible amount of time vs doing it yourself


> It’s code. If it doesn’t run, you ask it to reconsider

It is perfectly possible to have code that runs without errors but gives a wrong answer. And you may not even realise it’s wrong until it bites you in production.


While I agree, I saw it abused in this way a lot, in the sense that the code did what it was supposed to do in a given scenario but was obviously flawed in various ways, so it was just sitting there waiting for a disaster.

I haven't found a single instance where it saved me any significant amount of time. In all cases I still had to rewrite the whole thing myself, or abandon endeavor.

And a few times the amount of time I spent trying to coax a correct answer out of AI trumped any potential savings I could've had


> The o1-preview model still hallucinates non-existing libraries and functions for me

Oooh... oohhh!! I just had a thought: By now we're all familiar with the strict JSON output mode capability of these LLMs. That's just a matter of filtering the token probability vector by the output grammar. Only valid tokens are allowed, which guarantees that the output matches the grammar.

But... why just data grammars? Why not the equivalent of "tab-complete"? I wonder how hard it would be to hook up the Language Server Protocol (LSP) as seen in Visual Studio Code to an AI and have it only emit syntactically valid code! No more hallucinated functions!

I mean, sure, the semantics can still be incorrect, but not the syntax.
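
A toy sketch of that sampling loop, with a validity callback standing in for a real grammar or LSP check and uniform weights standing in for the model's token probabilities:

    import random

    def constrained_generate(vocab, weight_fn, is_valid_prefix, max_tokens=20):
        # At each step, drop every token that would make the output an invalid
        # prefix, then sample from whatever survives.
        output = ""
        for _ in range(max_tokens):
            candidates = [t for t in vocab if is_valid_prefix(output + t)]
            if not candidates:
                break
            weights = [weight_fn(output, t) for t in candidates]
            output += random.choices(candidates, weights=weights, k=1)[0]
        return output

    # Toy "grammar": digits separated by single '+' signs, e.g. "3+14+159".
    vocab = list("0123456789+")
    is_valid = lambda s: not s.startswith("+") and "++" not in s
    uniform = lambda prefix, token: 1.0  # stand-in for real model probabilities
    print(constrained_generate(vocab, uniform, is_valid))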


This would be a big undertaking to get working for just one language+package-manager combination, but would be beautiful if it worked.

I still fail to see the overall problem. Hallucinating non-existing libraries is a good programming practice in many cases: you express your solution in terms of an imaginary API that is convenient for you, and then you replace your API with real functions, and/or implement it in terms of real functions.

One of the biggest problems with this generation of AI is how people conflate the natural language abilities and the access to what it knows.

Both abilities are powerful, but they are very different powers.


You should not be asking it questions that require it to already know detailed information about APIs and libraries. It is not good at that, and it will never be good at that. If you need it to write code that uses a particular library or API, include the relevant documentation and examples.

It's your right to dismiss it, if you want, but if you want to get some value out of it, you should play to its strengths and not look for things that it fails at as a gotcha.


Just pass it a link to a GitHub issue and ask for a response, or even a webpage to summarize, and you will see the beautiful hallucinations it comes up with, as the model is not web browsing yet.

The best one I got recently was after I pointed out that the method didn’t exist, it proposed another method and said “use this method if it exists” :D

Has anyone tried asking it to generate the libraries/functions that it's hallucinating and seeing if it can do so correctly? And then seeing if it can continue solving the original problem with the new libraries? It'd be absolutely fascinating if it turns out it could do this.

Not for libraries, but functions will sometimes get created if you work with an agent coding loop. If the tests are in the verification step, the code will typically be correct.

I sometimes give it snippets of code and omit helper functions if they seem obvious enough, and it adds its own implementation into the output.

That problem feels somewhat fundamental to saying that these things have any ability to reason at all.

Just ask it for things it has seen before on the internet and you're golden. Mixes of ideas, new ideas and precise and clear thinking; not so much.

It raises the question of whether we can supply a function to be called (e.g., one that compiles and runs code) to evaluate intermediate CoT results.

It seems OpenAI has decided to keep the CoT results a secret. If they were to allow the model to call out to tools to help fill in the CoT steps, then this might reveal what the model is thinking - something they do not want the outside world to know about.

I could imagine OpenAI might allow their own vetted tools to be used, but perhaps it will be a while (if ever) before developers are allowed to hook up their own tools. The risks here are substantial. A model fine-tuned to run chain-of-thought that can answer graduate level physics problems at an expert level can probably figure out how to scam your grandma out of her savings too.


It's only a matter of time. When some other company releases the tool, they likely will too.

I have to agree with you here. OpenAI may be playing for competitive advantage more than for the good of humanity by hiding the results.

The answer is yes if you are willing to code it. OpenAI supports tool calls. Even if it didn't you could just make multiple calls to their API and submit the result of the code execution yourself.
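
A minimal sketch of that loop against the chat completions API, assuming the model actually decides to call the tool and skipping all error handling (you'd also want a real sandbox rather than a bare subprocess):

    import json, subprocess
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute a Python snippet and return its stdout/stderr.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Use code to check whether 2**61 - 1 is prime."}]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    call = resp.choices[0].message.tool_calls[0]  # assumes the model chose to call the tool

    # Run the model's code ourselves and hand the result back as the tool output.
    code = json.loads(call.function.arguments)["code"]
    run = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=60)
    messages += [
        resp.choices[0].message,
        {"role": "tool", "tool_call_id": call.id, "content": run.stdout + run.stderr},
    ]
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)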

The intermediate CoT results aren't in the API.

I may be mistaken but I don't believe the first version of the comment I replied to mentioned intermediate CoT results.

> having no way to assess if what it conjures up from its weights is factual or not.

This comment makes no sense in the context of what an LLM is. To even say such a thing demonstrates a lack of understanding of the domain. What we are doing here is TEXT COMPLETION; no one EVER said anything about being accurate and "true". We are building models that can complete text. What did you think an LLM was, a "truth machine"?


I mean of course you're right, but then I question what's the usefulness?

I'm honestly confused as to why it is doing this and why it thinks I'm right when I tell it that it is incorrect.

I've tried asking it factual information, and it asserts that it's incorrect but it will definitely hallucinate questions like the above.

You'd think the reasoning would nail that and most of the chain-of-thought systems I've worked on would have fixed this by asking it if the resulting answer was correct.


Near the end, the quote from OpenAI researcher Jason Wei seems damning to me:

> Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.

Results are "strong" but can't be felt by the user? What does that even mean?

But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine.

"This hammer hammers better, but in most cases it's not obvious how better it is. But when you stumble upon a very specific kind of nail, man does it feel magical! We need to craft more of those weird nails to help the world understand the value of this hammer."

But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?


He's speaking about his objective of making ever stronger LLMs, and for that his secondary objective is to measure their real performance.

The human preference is not that good of a proxy measurement: for instance, it can be gamed by making the model more assertive, causing the human error-spotting ability to decrease a lot [0].

So what he's really saying is that non-rigorous human vibe checks (like those LMSys Chatbot Arena is built on, although I love it) won't cut it anymore to evaluate models, because now models are past that point. Just like you can't evaluate how smart a smart person really is in a 2min casual conversation.

[0]: https://openreview.net/pdf?id=7W3GLNImfS


It's trivial to come up with prompts that 4o fails. If it's hard to come up with prompts that o1 succeeds on but 4o fails, that implies the delta is not that great.

Or, the delta depends on the nature of the problem/prompt, we’ve not yet figured that out, there’s a relatively narrow range of prompts with large delta, and so finding those examples is a work in progress?

i.e. when you can't beat them, make new metrics

and you can absolutely evaluate how smart someone is in a 2min casual conversation. You won't be able to tell how good they are in some niche topic, but %insert something about different flavors of intelligence and how they do not equate to subject matter expertise%


It’s a common pattern that AI benchmarks get too easy, so they make new ones that are harder.

As models improve, human preference will become worse as a proxy measurement (e.g. as model capabilities surpass the human's ability to judge correctness at a glance). This can be due to more raw capability - or more persuasion / charisma.

> Results are "strong" but can't be felt by the user? What does that even mean?

Not every conversation you have with a PhD will make it obvious that that person is a PhD. Someone can be really smart, but if you don't see them in a setting where they can express it, then you'll have no way of fully assessing their intelligence. Similarly, if you only use OAI models with low-demand prompts, you may not be able to tell the difference between a good model and a great one.


> What does that even mean?

It explicitly says "Results on AIME and GPQA are really strong". So I would assume it means it can get (statistically significantly, I assume) better score in AIME and GPQA benchmarks compared to 4o.


I think they are saying they have invented the screwdriver. We have all been using hammers to sink screws, but if you try this new tool it may be better. However, you will still encounter a lot of nails.

It's more like they're saying they have invented the screwdriver, but they haven't invented screws yet.

But it doesn't feel right. It's unlikely the screwdriver would come first, and then people would go around looking for things to use it with, no?


It's more like they have invented a computer, an extremely versatile and powerful tool that can be used in many ways, but is not a solution to every problem.

Now they need people to write software that uses this capability to perform useful tasks, such as text processing, working with spreadsheets and providing new ways of communication.


While I find value in LLMs, overall they still seem unreasonably not that useful.

It might be like trying to train a neural net in 1993 on a 60mhz Pentium. It is the right idea but fundamental parts of the system are so lacking.

On the other hand, I worry we have gone down the support vector machine path again. A huge amount of brain power spent on a somewhat dead end that just fits the current hardware better than what we will actually use in the long run.

The big difference from SVMs, though, is that this has captured the popular imagination, and if the tide goes out, the AI winter will be the most brutal winter by an order of magnitude.

AGI or bust.


I’d say the biggest difference between LLMs and SVMs is that a lot of people find LLMs useful on a daily basis.

I’ve been using them almost daily for over two years now, and I keep on finding new things they can do that are useful to me.


They’re useful, but not for what AI companies seem to be pushing for.

I like that they can reorganize my data, document QA is pretty killer as long as the document was prepared well.

Embeddings are sick.

But content creation… not useful. Problem solving? Personally have not found them useful (haven’t tried o1 yet)


Is there a post on your blog that lists your different uses of LLMs?

Not in a single place, but it came up in a podcast episode the other day - about 32 minutes in to this one I think https://softwaremisadventures.com/p/simon-willison-llm-weird...

> But why? Why would we do that?

Because OpenAI needs a steady influx of money, big money. In order to do so, they have to convince the people who are giving them money that they are the best. An objective way to achieve this is by benchmarking. But once you enter this game, you start optimizing for benchmarks.

At the same time, in the real world, Anthropic is following them in huge leaps and for many users Claude 3.5 is already the default tool for daily work.


Agree completely.

From a user perspective too, I was a subscriber from the first day of gpt4 until about a month ago. I thought about subscribing for the month to check this out but I am tired of the OpenAI experience.

Where is Sora? Where is the version of ChatGPT that responds in real time to your voice? Remember the GPT-4 demo where you would draw a website on a napkin?

How about Q* lol. Strawberry/Q*/o1, "it is super dangerous, be very careful!"

Quietly, Anthropic has just kicked their ass without all the hype, and I am about to go work in Sonnet instead of even bothering to check o1 out.


> Results are "strong" but can't be felt by the user? What does that even mean?

This means it often doesn't provide the answer the user is looking for. In my opinion, it's an alignment problem: people are very presumptuous and leave out a lot of detail in their requests. Like the "which is bigger - 9.8 or 9.11?" question: if you ask "numerically, which is bigger - 9.8 or 9.11?" it gets the correct answer; basically it prioritizes a different meaning for "bigger".

> But the last sentence is the worst: "we all need to find harder prompts". If I understand it correctly, it means we should go looking for new problems / craft specific questions that would let these new models shine. But why? Why would we do that? Wouldn't our time be better spent trying to solve our actual, current problems, using any tool available?

Without better questions we can't test and prove that it is getting more intelligent or is just wrong. If it is more intelligent than us, it might provide answers that don't make sense to us but are actually clever - 4d chess, as they say. Again an alignment problem; better questions aid with solving that.


The irony here is that Jason is speaking in the context of LLM development, which he lives and breaths all day.

Reading his comments without framing it in that context makes it come off pretty badly - humans failing to understand what is being said because they don't have context.


> we all need to find harder prompts

"One of the biggest traps for engineers is optimizing a thing that shouldn't exist." (from Musk I believe)


This is something we've been grappling with on my team. Many of the researchers in the org want to try all these reasoning techniques to increase performance, and my team keeps pushing back that we don't actually need that extra performance; we just want to decrease latency and cost.

So make the requirement using a cheaper and lower latency model and try to increase the performance to a satisfactory level. Assuming that you are not already using the cheapest/lowest latency model.

This hits the nail on the head. It is a consumer facing product not a technology to solve deep thinking.

i don't think that's what he's saying

You're reading too much into an offhand comment that's more metaphorical in nature.

The stupidest thing about ai and automation is that they are trying to target it at large corporations looking to cut down on jobs or 10x productivity when all anyone actually wants is a robot to do their laundry and dishes.

Because a robot that does everyone's laundry is much closer to AGI than ChatGPT. I'm dead serious.

Not really. You don't need to move wet clothes from the first machine to a second machine if you get one machine that does both jobs. That's very much not AGI. The second job, of taking dry crumpled clothes and folding them, also doesn't need an artificial general intelligence. It's very computationally expensive (as evidenced by the speed of https://pantor.github.io/speedfolding/, out of UC Berkeley) and a hard robotics question, but it's also very fixed function.

Taking the clothes out of the combined washer dryer machine, my laundry folding robot isn't suddenly going to need to come up with a creative answer to a question I have about politics in order to fold the laundry, or come up with a new way to organize my board game collection, or reason about how to refactor some code. There are no logical leaps of reasoning or deep thinking required. My laundry folding robot doesn't need to be creative in order to fold laundry, just application of some very complex algorithms, some of which have yet to be discovered.


these are almost entirely unrelated problems

You're describing a dish-washer and washing-machine.

The GP is almost certainly describing a robot that can move dirty stuff into the machines, run them, and put away the clean stuff afterwards.

Dont you know by now

Speaking with AI maxis it’s easy:

The AI is always right

You are always wrong

If AI might enable something dangerous, it was already possible by hand, scale is irrelevant

But also AI enables many amazing things not previously possible, at scale

If you don’t get the answers you want, you’re prompting it wrong. You need to work harder to show how much better the AI is. But definitely, it cannot make things worse at scale in any way. And anyone who wants regulations to even require attribution and labeling, is a dangerous luddite depriving humanity of innovations.


I tried a problem I was looking at recently, to refactor a small rust crate to use one datatype instead of an enum, to help me understand the code better. I found o1-mini made a decent attempt, but couldn't provide error free code. o1-preview was able to provide code that compiled and passed all but the test that is expected to fail, given the change I asked it to make.

This is the prompt I gave:

simplify this rust library by removing the different sized enums and only using the U8 size. For example MasksByByte is an enum, change it to be an alias for the U8 datatype. Also the u256 datatype isn't required, we only want U8, so remove all references to U256 as well.

The original crate is trie-hard [1][2]. I forked it and put the models' attempts in the fork [3]. I also quickly wrote it up at [4]

[1] https://blog.cloudflare.com/pingora-saving-compute-1-percent...

[2] https://github.com/cloudflare/trie-hard

[3] https://github.com/kpm/trie-hard-simple/tree/main/attempts

[4] https://blog.reyem.dev/post/refactoring_rust_with_chatgpt-o1...


I've been having a weird timezone issue in my Rails application that I've had a hard time getting my head around. I tried giving o1-preview the relevant code and context it needed to know and it gave answers that seemed to make sense but it still wasn't able to resolve the bug and explain exactly what was going on.

So, it seems like anything that requires some actual thought and problem-solving is tough for it to answer.

I'm sure it's just a matter of time before devs are out of work but it seems like we'll be safe for another few years anyway.


I'm still not convinced that it's not going through approximate reasoning-chain retrieval, self-triggered to fetch more reasoning chains that will maximize its goal. I'm seeing a lot of comments from other SWEs using it for non-trivial tasks that it fails at while just trying harder to look like it's problem solving. Even with more context and documentation, it fails to notice details an experienced SWE would pick up quickly.

I cannot tell from reading what you wrote whether you think it did a good job or not

Thanks for the feedback. I do think it did a good job in the end. I haven't had time to have a good look at the final code o1-preview produced, and also my understanding of rust is pretty basic, which is why I didn't say more about the results. I think rust is one of those languages where, if it compiles, you're most of the way there, because of the strong type system. Not as strong as Haskell or OCaml perhaps.

It's interesting to note that there's really two things going on here:

1. A LLM (probably a finetuned GPT-4o) trained specifically to read and emit good chain-of-thought prompts.

2. Runtime code that iteratively re-prompts the model with the chain of thought so far. This sounds like it includes loops, branches and backtracking. This is not "the model", it's regular code invoking the model. Interesting that OpenAI is making no attempt to clarify this.

I wonder where the real innovation here lies. I've done a few informal stabs with #2 and I have a pretty strong intuition (not proven yet) that given the right prompting/metaprompting model you can do pretty well at this even with untuned LLMs. The end game here is complex agents with arbitrary continuous looping interleaved with RAG and tool use.

But OpenAI's philosophy up until now has almost always been "The bitter lesson is true, the model knows best, just put it in the model." So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.

Without being able to inspect the reasoning tokens, we can't really get a lot of info about which is happening.
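For concreteness, here's a minimal sketch of the kind of runtime loop I mean in point 2. The `call_model` helper and the "FINAL:" convention are assumptions for illustration, not anything OpenAI has confirmed:

    # Hypothetical chain-of-thought driver: keep feeding the reasoning so far
    # back to the model until it declares a final answer or we hit a budget.
    def solve(task, call_model, max_steps=10):
        thoughts = []
        for _ in range(max_steps):
            prompt = (
                "Task:\n" + task + "\n\n"
                "Reasoning so far:\n" + "\n".join(thoughts) + "\n\n"
                "Either add exactly one more reasoning step, or reply with "
                "'FINAL:' followed by the answer."
            )
            step = call_model([{"role": "user", "content": prompt}])
            if step.startswith("FINAL:"):
                return step[len("FINAL:"):].strip(), thoughts
            thoughts.append(step)  # a fancier loop could branch or backtrack here
        return None, thoughts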


If it really is Reinforcement Learning as they claim, it means there might not be any direct supervision on the "thinking" section of the output, just on the final answer.

Just like for Chess or Go you don't train a supervised model by giving it the exact move it should do in each case, you use RL techniques to learn which moves are good based on end results of the game.

In practice, there probably is some supervision to enforce good style and methodology. But the key here is that it is able to learn good reasoning without (many) human examples, and find strategies to solve new problems via self-learning.

If that is the case it is indeed an important breakthrough.


This is the bitter lesson/just put it in the model. They're trying to figure out more ways of converting compute to intelligence now that they're running out of text data: https://images.ctfassets.net/kftzwdyauwt9/7rMY55vLbGTlTiP9Gd...

A cynical way to look at it is that we're pretty close to the ultimate limits of what LLMs can do and now the stake holders are looking at novel ways of using what they have instead of pouring everything into novel models. We're several years into the AI revolution (some call it a bubble) and Nvidia is still pretty much the only company that makes bank on it. Other than that it's all investment driven "growth". And at some point investors are gonna start asking questions...

That is indeed cynical haha.

A very simple observation: our brains are vastly more efficient, obtaining vastly better outcomes from far less input. This is evidence that there's plenty of room for improvement without needing to go looking for more data. Short-term gain versus long-term gain, like you say: shareholder return.

More efficiency means more practical/useful applications and lower cost, as opposed to a bigger model, which means less usefulness (longer inference times) and higher cost (data synthesis and training cost).


That’s assuming that LLMs act like brains at all.

They don’t.

Especially not with transformers.


Says who? At a fundamental level

At a fundamental level, brains don’t operate on floating point numbers encoded in bits.

They have chemicals to facilitate electrochemical reactions which can affect how they respond to input. They don’t throw away all knowledge of what they just said. They change continuously, not just in fixed training loops. They don’t operate in turns.

I could go on.

Honestly the number of people who just heard “learning,” “neural networks,” and “memory” and assume that AI must be acting like a biological brain is insane.

Truly a marvel of marketing.


Fundamentally and physically are two different things. A logic gate is a logic gate whether it's in neurons or silicon. Are an abacus and a calculator solving different things? No.

You're proving my point: things like them changing continuously are exactly what I mean when I say the brain is more efficient. Where there's a will there's a way, and our brains are evidence that it can be done.


You're saying that because two different objects can solve the same problem, they must work the same way.

An abacus and a calculator were both made to solve relatively simple math problems, so they must work in the same way, right?

An apple and an orange are both ways for plants to store sugar, so they must be the same thing, right?

No. That's not how any of this works. An abacus and a calculator are two different tools that solve the same problem. They don’t act like each other just because the abstract outcome is the same

> You're proving my point, things like them changing continuously are exactly what I mean when I say the brain is more efficient.

I don't see how that proves that neural networks act like brains.

It's also not just a difference in terms of efficiency; it's the fundamental way that statistical models like neural networks are trained. Every time they're trained, it's a brand new model, unlike a brain, which is still the same brain.

Also, neural networks and brains were NOT made to solve the same problems... even if your argument made any sense, it doesn't fit here.


No I'm not saying they must work the same way. I'm saying it's evidence there is a more efficient way as they both solve the same problem and one is more efficient (in truth both are more efficient in different areas). At an abstract level they can be doing the same thing. What does a simulator do?

Think a little further: yes, currently it's a brand new model each time, but why will it be this way forever? It's an engineering problem, one that we can solve, and the brain is evidence it can be done.

Neural networks were originally inspired by the brain. Yes, they've deviated but there's absolutely no reason they can't take further inspiration.


So you’re just abstracting everything to the point where everything is a “something solver” and if two things can solve the same something, one must be a better version of the other?

Abstracting everything to the point of meaninglessness isn’t a worthwhile exercise.


No, that's a stretch and even from that how do you get to that conclusion? I think you're clearly trying to brush off my comment.

I assume you're of the opinion humans are special.


One aspect of this that others can't replicate: they discuss hiding the chain of thought in its raw form because the chains are allowed to be unaligned. This lets the model operate without any artifacts from alignment and apply them in the post-processing, more or less. Doing this requires effectively root access, and you would need the unaligned weights.

Ok but this presses on a latent question: what do we mean by alignment?

Practically it's come to mean just sanitization... "don't say something nasty or embarrassing to users." But that doesn't apply here, the reasoning tokens are effectively just a debug log.

If alignment means "conducting reasoning in alignment with human values", then misalignment in the reasoning phase could potentially be obfuscated and sanitized, participating in the conclusion but hidden. Having an "unaligned" model conduct the reasoning steps is potentially dangerous, if you believe that AI alignment can give rise to danger at all.

Personally I think that in practice alignment has come to mean just sanitization and it's a fig leaf of an excuse for the real reason they are hiding the reasoning tokens: competitive advantage.


Alignment started as a fairly nifty idea, but you can't meaningfully test for it. We don't have the tools to understand the internals of an LLM.

So yes, it morphed into the second best thing, brand safety - "don't say racist / anti-vax stuff so that we don't get bad press or get in trouble with the regulators".


The challenge is that alignment ends up changing the models in ways that aren't representative of the actual training set, and as I understand it this generally lowers performance even for aligned things. Further, the decision to summarize the chains of thought includes answers that wouldn't pass alignment themselves without removal. From what I read, the final output is aligned but could have considered unaligned CoT. In fact, because they're in the context, they're necessarily changing the final output even if the final output complies with the alignment. There are a few other "only root could do this" aspects, which says yes, anyone could implement these without secret sauce as long as they have a raw frontier model.

Glass half full and the good faith argument.

It's a compromise.

OpenAI will now have access to vast amounts of unaligned output so they can actually study its thinking.

Whereas the current checks and balances meant the request was rejected and the data providing this insight was not created in the first place.


I have also spent some time on 2) and implemented several approaches in this open source optimising llm proxy - https://github.com/codelion/optillm

In my experience it does work quite well, but we probably need different techniques for different tasks.


Maybe 1 is actually what you just suggested - an RL approach to select the strategy for 2. Thank you for implementing optillm and working out all the various strategy options; it's a really neat reference for understanding this space.

One item I'm very curious about is how they get a score for use in the RL. In well-defined games it's easy to understand, but in this LLM output context, how does one rate the output result for use in an RL setup?


That's the hardest part, figuring out the reward. For generic tasks it is not easy; in my implementation in optillm I am using the LLM itself to generate a score based on the MCTS trajectory. But that is not as good as having a reward that is well defined, say for a coding or logic problem. Maybe they trained a better reward model.
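For what it's worth, the general shape of that idea looks something like this sketch (a generic version, not the actual optillm code; `call_model` is a hypothetical helper):

    # Use the model itself as a soft reward: ask it to grade a reasoning
    # trajectory and map the grade to [0, 1] for the search to consume.
    import re

    def score_trajectory(task, trajectory, call_model):
        prompt = (
            f"Task: {task}\n\n"
            f"Candidate reasoning and answer:\n{trajectory}\n\n"
            "On a scale of 0 to 10, how likely is this reasoning to reach a "
            "correct answer? Reply with only the number."
        )
        reply = call_model([{"role": "user", "content": prompt}])
        match = re.search(r"\d+(?:\.\d+)?", reply)
        return min(float(match.group()), 10.0) / 10.0 if match else 0.0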

> So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.

The prompt loop code often encodes intelligence/information that the human developers tend to ignore during their evaluations of the solution. For example, if you add a filter for invalid json and repeatedly invoke the model until good json comes out, you are now carrying water for the LLM. The additional capabilities came from a manual coding exercise and additional money spent on a brute force search.
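A minimal sketch of that kind of scaffolding, assuming a hypothetical `call_model` helper:

    # Retry until the model's output parses as JSON. The wrapper, not the model,
    # is doing part of the work that ends up being credited to "the model".
    import json

    def get_json(prompt, call_model, max_attempts=5):
        for _ in range(max_attempts):
            raw = call_model([{"role": "user", "content": prompt}])
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                prompt += "\n\nReturn valid JSON only, with no surrounding text."
        raise ValueError(f"no valid JSON after {max_attempts} attempts")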


Well, if LLMs are system 1, this difference would be building towards system 2.

https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow


Yes indeed, and personally if we have AGI I believe it will arise from multiple LLMs working in tandem with other types of machine learning, databases for "memory", more traditional compute functions, and a connectivity layer between them all.

But to my knowledge, that's not the kind of research OpenAI is doing. They seem mostly focused on training bigger and better models and seeking AGI through emergence in those.


The innovation lies in using RL to achieve 1.) and provide a simple interface to 2.)

You don't need to execute code to have it backtrack. The LLM can inherently backtrack itself if trained to. It knows all the context provided to it and the output it has already written.

If it knows it needs to backtrack then could it gain much by outputting something that tells the code to backtrack for it? For example, outputting something like "I've disproven the previous hypothesis, remove the details". Almost like asking to forget.

This could reduce the number of tokens it needs at inference time, saving compute. But with how attention works, it may not make any difference to the performance of the LLM.

Similarly, could there be gains by the LLM asking to work in parallel? For example "there's 3 possible approaches to this, clone the conversation so far and resolve to the one that results in the highest confidence".

This feels like it would be fairly trivial to implement.
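Speculatively, the driver side could be as small as this sketch (the directive strings are invented for illustration, not something any model actually emits):

    # Act on control directives emitted by the model: drop a disproven step,
    # or fork the conversation into parallel branches to explore separately.
    def apply_directive(step, history, pending_branches):
        if step.startswith("FORGET LAST"):
            if history:
                history.pop()  # the model disproved its previous hypothesis
        elif step.startswith("FORK:"):
            for approach in step.split(":", 1)[1].split(";"):
                pending_branches.append(history + [approach.strip()])
        else:
            history.append(step)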


O1 seems like a variant of RLRF https://arxiv.org/abs/2403.14238

this is why i became skeptical of openai's claims

if they shared the COT the grift wont work

its just RL


I can't help but feel that saying "it's just RL" is like someone at the start of the 20th century saying "it's just electricity", as if understanding the underlying mechanism is the same as understanding the applications it can enable.

Tbf RL is pretty incredible.

I trained a model to play a novel video game using only screenshots and a score using RL and I discovered how not to lose


The innovation lies in making the whole loop available to an end user immediately, without them being a programmer. My grandma can build games using ChatGPT now.

No she can't. Comments like yours are just made-up nonsense that AI hype-men and investors somehow convinced us are fair opinions to have.

Check out replit agents, they can make games and apps autonomously now

Practical challenge with a $250 prize: Make a 2D isometric HTML+JS game (dealer's choice on library) in the next 48 hours that satisfies these modest random requirements:

A character walks around a big ornate classic library, pulling books from bookshelves looking for a special book that causes a shelf to rotate around and reveal a hidden room and treasure chest. The player can read the books and some are just filler but some have clues about the special book. If this can be done with art, animations, sound, UI, the usual stuff, I'll believe the parent poster's claim to be true.

As someone using LLM-based workflows daily to assist with personal and professional projects, I'll wager $250 that this is not possible.


Sounds like a comfy sequence in a larger game I would anticipate on replay. I put my own $250 on the table (given the prompt and process were forthcoming).

The question at the heart of people's anxiety here is: Would you bet that same $250 if AI had 5 years to be able to do it?

Do you know of an example game I can play right now?

While AI is overhyped by some people, the parent's statement is not only true but was true long before o1 was released.

Do you know of an example game by someone with no coding experience using an LLM?

What games have people made with ChatGPT? Do you have an example of a live, deployed game?

Yes, a gazillion of them. Someone in a scrabble Facebook group made this entirely with ChatGPT: https://aboocher.github.io/scrabble/ingpractice.html

Look, I get the societal development that you can input narrative text and the code for this pops out is super neat.

But trying to be fair here, anyone would call this incomplete, right?

There are several obvious bugs in styling and interaction.

This example is exactly what I was expecting. An ephemeral, simple-yet-buggy single page that’s barely a game in common understanding.

That person, while maybe not actively programming things, does appear to have forked several repos on GitHub a decade ago. I would say that’s above the level of technical competence implied by the “my grandma” phrasing of the OP.


1 < a gazillion

I think the problem here is different expectations for what a “game” is.

If you tell a room full of programmers that something can make a game they’re going to expect more than that.

I look at that and I don’t really see a game, I see flashcards.

Still pretty cool chatgpt can put that together.

Also the “try again” button doesn’t work.


It's actually kind of wild how obvious it is that this was not made by a human.

Ada Lovelace is my grandma

My great aunt literally asked o1 for fantasy football bets and won $1000 on draftkings. This is a gamechanger

what game has she made

> the idea that I can run a complex prompt and have key details of how that prompt was evaluated hidden from me feels like a big step backwards.

As a developer, this is highly concerning, as it makes it much harder to debug where/how the “reasoning” went wrong. The pricing is also silly, because I’m paying for tokens I can’t see.

As a user, I don’t really care. LLMs are already magic boxes and I usually only care about the end result, not the path to get there.

It will be interesting to see how this progresses, both at OpenAI and other foundation model builders.


> As a user, I don’t really care.

Tell me: Just how is it fair for a user to pay for the reasoning tokens without actually seeing them? If they are not shared, the service can bill you anything they want for them!


The simple answer is: I don't care. I'll statistically figure out what the typical total cost per call is from experience, and that's what matters. Who cares if they lie about it, if the model's cost per call fits my budget?

If it starts costing $1 per call, and that's too high, then I just won't use it commercially. Whether it was $1 because they inflated the token count or because it just actually took a lot of tokens to do its reasoning isn't really material to my economic decision.


The thing is it might increase in cost after you've decided to use it commercially, and have invested a lot of time and resources in it. Now it's very hard to move to something else, but very easy for OpenAI to increase your cost arbitrarily. The statistics you made are not binding for them.

The API returns how many tokens were used in reasoning, so it would be easy to see any average change in reasoning token consumption. And token prices in general have been extremely deflationary over the last 18 months.
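For reference, a minimal sketch of reading that back with the Python SDK (field names as I remember them from the o1 announcement; treat the exact response shape as an assumption):

    # Reasoning tokens are billed as output tokens even though they are hidden,
    # but the usage object reports how many there were.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": "Summarize the trade-offs of X."}],
    )
    usage = resp.usage
    reasoning = usage.completion_tokens_details.reasoning_tokens
    print("visible output tokens:", usage.completion_tokens - reasoning)
    print("hidden reasoning tokens:", reasoning)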

This is experimental, frontier stuff, obviously it comes with risks. Building on GPT-4 in March of 2023 was like that as well, but now you can easily switch between a few models of comparable quality made by different companies (yay capitalism and free markets!). You can risk and use just released stuff right now, or, most likely, come back in 6-12 months (probably earlier) and get several different providers with very similar APIs.

Everything that OpenAI does with LLMs has already been done and validated in the open source community well before OpenAI gets around to it. OpenAI is not an innovator. simbianai/taskgen on github is an example of one such project, although there are others too that don't come to mind right now.

As such, I would never call their work "frontier stuff", but they do bring it to the masses with their commercial service.


Same applies to every other API in the world, yes.

No, S3 pricing for example is predictable, and written in a contract. There's no way for AWS to charge you 3x amount of dollars for 1GB tomorrow. They need to announce it in advance, and give you time to exit the contract if you disagree with the new price. It's really not the same. OpenAI can just tell you your prompt from tomorrow used up 20x times reasoning tokens. There's no advance warning or predictability. I really don't understand how you can claim the situations are identical.

OpenAI could have also figured out the average number of extra output tokens, and put a markup in overall API costs. As a user, I wouldn’t care either, because the price would mostly be the same.

The person you are replying to points this out. They make a distinction between developers and users. An end user on a monthly subscription plan doesn’t care about how much compute happens for their chat.

OpenAI’s answer to this would be, “Okay then, don’t use it.”

If the output alone is high enough quality, it's worth paying extra.

Pricing for many things in life is abstracted away.

Yeah it is fair. You don't pay a lawyer for 40s of work expecting to see all the research between your consult and the document. You don't pay a cook for a meal and expect to sit and interrogate all the ingredients and the oven temperature.

Actually, if a lawyer is billing you by the minute, then yes, you are entitled to a detailed breakdown. If the lawyer is billing you by the job, then no.

More opportunity for competitors to differentiate.

OpenAI doesn't really have a moat. This isn't payments or SMS where only Stripe or Twilio were trying to win the market. Everybody and their brother is trying to build an LLM business.

Grab some researchers, put some compute dollars in, and out comes a product.

Everyone wants this market. It's absurdly good for buyers.


> As a user, I don’t really care.

People should understand and be able to tinker with the tools they use.

The tragedy of personal computing is that everything is so abstracted away that users use only a fraction of the power of their computer. People who grew up with modern PCs don't understand the concept of memory, and younger people who grew up with cellphones don't understand the concept of files and directories.

Open-weight AI models are great because they let normal users learn how they can make the model work for their particular use cases.


"trust us, we're using your tokens as efficiently as possible"

> As a user, I don’t really care.

As a user, whether of ChatGPT or of the API, I absolutely do care, so I can modify and tune my prompt with the necessary clarifications.

My suspicion is that the reason for hiding the reasoning tokens is to prevent other companies from creating a big CoT reasoning dataset using o1.

It is anti-competitive behavior. If a user is paying through the nose for the reasoning tokens, and yes they are, the user deserves to be able to see them.


>My suspicion is that the reason for hiding the reasoning tokens is to prevent other companies from creating a big CoT reasoning dataset using o1.

I mean...they say as much


Once again true to their name.

Not seeing major advance in quality with o1, but seeing major negative impact on cost and latency.

Kagi LLM benchmarking project:

https://help.kagi.com/kagi/ai/llm-benchmark.html


Kagi is most likely evaluating it mainly on deriving an answer for the user from search result snippets. Indeed, GPT-4o is plenty good at this already, and o1 would only perform better on particular types of hard requests, while being so much slower.

If you look at Appendix A in the o1 post [1], this becomes quite clear. There's a huge jump in performance in "puzzle" tasks like competitive maths or programming. But the difference on everything else is much less significant, and this evaluation is still focused on reasoning tasks.

The human preference chart [1] also clearly shows that it doesn't feel that much better to use, hence the overall reaction.

Everyone is complaining about exaggerated marketing, and it's true, but if you take the time to read what they wrote beyond the shallow ads, they are being somewhat honest about what this is.

[1] https://openai.com/index/learning-to-reason-with-llms/


The test has many reasoning, code and instruction following questions which I expected o1 to be excelling at. I do not have an interpretation for such poor results on our test, was just sharing them as a data point for people to make their own mind. My best guess at this point is that o1 is optimized for a very specific and narrow use case, similar to what you suggest.

hey buddy, you're talking to owner of kagi, and the kagi benchmark is a traditional one

My bad, you are right, should have looked into it better, I was too dismissive. Still I think that highlighting those charts from OpenAI is important.

Interesting that Gemini performs extremely poorly in those benchmarks.

I did a few tests and asked it some legal questions. 4o gave me the correct answer immediately.

o1 preview gave a much more in depth but completely wrong answer. It took 5 follow ups to get it to recognize that it hallucinated a non-existent law


That is very interesting. Would you mind testing the same prompt with Claude Sonnet 3.5 and Opus? If not available to you, would you be willing to share the prompt/question? Thank you.

This is interesting since they claim it does well on STEM questions, which I’d assume would be a similar level of reasoning complexity for a human.

This is an interesting one because math is doing so much of the heavy lifting. And symbolic math has a far smaller representational space than numerical math.

There is one other wonderful thing about symbolic math, the glorious '=' sign. It's structured everywhere from top-to-bottom, left-to-right, which is amenable to the next token prediction behavior and multi-attention heads of transformer based LLMs.

My guess is that problem statement formation into an equation is as difficult of a problem for these as actually running through the equations. However, having taken the Physics GRE, and knowing they try for parity of difficulty between years (even though they normalize it), the problems are fairly standard and have permutations of a problem type between the years.

This is not to diminish how cool this is, just that standardized tests do have an element of predictability to them. I find this result actually neat though; it's an actual qualitative improvement over non-CoT LLMs, even if things like Mathematica can do the steps more reliably post problem formation. I think that judiciously used, this is a valuable feature.


A difficult to guess fraction of all of these results are training to the test in various forms

Perhaps the smaller model used in o1 is over trained on arxiv and code relative to 4o (or undertrained on legal text)

> I asked on Twitter for examples of prompts that people had found which failed on GPT-4o but worked on o1-preview.

it seems trivial, but I tried for more than 2 hours in the past to get gpt4 to play tic-tac-toe optimally and failed (CoT prompt, etc.). The result was too many illegal moves and absolutely no optimal strategy.

o1-preview can do it really well [1]

However, when I use a non-standard grid (3x5) it fails to play optimally. But it makes legal moves and it recognized I had won. [2]

My conclusion at the time was that either "spatial reasoning" doesn't work and/or planning is needed. Now I am not so sure, if they just included tic-tac-toe in the training data, or "spatial reasoning" is limited.

[1] https://chatgpt.com/share/e/66e3e784-26d4-8013-889b-f56a7fed... [2] https://chatgpt.com/share/e/66e3eae0-2d38-8013-b900-50e6f792...


The non-standard grid thing was an argument against deep learning / chess / Go AIs before Alpha Zero - Alpha Go (showing self-play can adapt with sufficient runs to any grid size or "priors" in terms of rules of the game).

It was said in 2014 by a professor I learned from that clearly AI that learned a specific game was just learning patterns and memorizing rather than anything more than that, and wouldn't be able to adjust like humans could to say new board shapes, or rules. (They would later claim 1.5 years later at a lecture that "accurate facial recognition is possible. But high recall on facial recognition is impossible, making it useless for surveillance, so don't worry").

I expect the same will occur for LLMs (but maybe sufficient "chain of thought" steps rather than game runs, etc).


>My conclusion at the time was that either "spatial reasoning" doesn't work and/or planning is needed. Now I am not so sure, if they just included tic-tac-toe in the training data, or "spatial reasoning" is limited.

I think it's much simpler than that.

1. With enough training data you can know all winning, losing and drawn games of tic-tac-toe. Even if you don't see all of them in your training data, the properties of the game make a lot of games equivalent if you don't care about the symbol being used for each player or about rotated/reflected versions of the same game (a quick enumeration, sketched after this list, shows just how small the space is).

2. The game is so common that it's definitely well represented in training data.

3. With extra "reasoning steps" there can be a certain amount of error correction on the logic now. But it's still not equivalent to spatial reasoning, but it can try a few patterns to see which will win.

4. 3x5 grid is probably uncommon enough that the training data doesn't cover enough games that it can properly extrapolate from there. But it can still with a certain probability check the rules (3 in a row/diagonal/column for winning).

5. It might be good to also test alternative grids with more or less than 3 in the other dimension as well, since this necessitates a rule change. Which would make it more difficult to reason about it.
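To put a rough number on point 1, here's the enumeration sketch (plain brute force, before any symmetry reduction):

    # Count every reachable 3x3 tic-tac-toe position; the whole state space is
    # tiny, so memorizing it from training data is entirely plausible.
    WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
            (0, 3, 6), (1, 4, 7), (2, 5, 8),
            (0, 4, 8), (2, 4, 6)]

    def winner(board):
        for a, b, c in WINS:
            if board[a] != "." and board[a] == board[b] == board[c]:
                return board[a]
        return None

    seen = set()

    def explore(board, player):
        if board in seen:
            return
        seen.add(board)
        if winner(board):
            return  # game over, nothing reachable beyond this position
        nxt = "O" if player == "X" else "X"
        for i, cell in enumerate(board):
            if cell == ".":
                explore(board[:i] + player + board[i + 1:], nxt)

    explore("." * 9, "X")
    print(len(seen))  # a few thousand positions, and far fewer after collapsing rotations/reflections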


It is already said that gpt4 was trained on all high quality internet data. So it should have been included already. It seems to me that o1 has the same/similar pretraining corpus.

So we have 3 options:

- t3 was now included in the corpus

- t3 was used for RL

- o1 generalizes better


The new model does play very well but when it draws the board it frequently places the moves in incorrect locations (but seemingly still keeps track of the correct ones). But I can't fault it too much, I don't think what is essentially ASCII art is intended to be a strength of the model.

Edit: Actually third game with it led to it making an illegal move, and claiming a draw (which would've been inevitable given optimal play for the rest of the game but there were several valid moves left to make).


Link to conversation 1 and 2 doesn't work. Update it please!

sorry for that. I just learned that my team account does not allow me to share conversations publicly. Pasting them here for reference

* Conversation 1

    let's play tic tac toe. you need to play optimally
    ChatGPT

    Sure, I'd be happy to play tic-tac-toe with you! Let's set up the game.

    We'll use a grid with positions numbered 1 through 9:

    markdown

    1 | 2 | 3
    ---------
    4 | 5 | 6
    ---------
    7 | 8 | 9

    Would you like to be X or O? And would you like to go first or second?
    You said:
    X on 5
    ChatGPT

    Got it! You placed X at position 5.

    I'll place O at position 1.

    Here's the updated board:

    markdown

    O |  2 |  3
    -------------
    4 | X | 6
    -------------
    7 |  8 | 9

    Your turn! Where would you like to place your next X?
    You said:
    X on 9
    ChatGPT

    Got it! You placed X at position 9.

    Here's the updated board:

    markdown

    O |  2 |  3
    -------------
    4 | X | 6
    -------------
    7 |  8 | X

    Now, I'll place O at position 3.

    Here's the board now:

    markdown

    O |  2 | O
    -------------
    4 | X | 6
    -------------
    7 |  8 | X

    Your turn! Where would you like to place your next X?
    You said:
    X on 2
    ChatGPT

    Got it! You placed X at position 2.

    Here's the updated board:

    markdown

    O | X | O
    -------------
    4 | X | 6
    -------------
    7 | 8 | X

    To block your potential win, I'll place O at position 8.

    Here's the board now:

    markdown

    O | X | O
    -------------
    4 | X | 6
    -------------
    7 | O | X

    Your turn! Where would you like to place your next X?
    You said:
    X on 4
    ChatGPT

    You placed X at position 4.

    Here's the updated board:

    markdown

    O | X | O
    -------------
    X | X | 6
    -------------
    7 | O | X

    To block your potential win, I'll place O at position 6.

    Here's the board now:

    markdown

    O | X | O
    -------------
    X | X | O
    -------------
    7 | O | X

    Your turn! Where would you like to place your next X?
    You said:
    X on 7
    ChatGPT

    You placed X at position 7.

    Here's the updated board:

    markdown

    O | X | O
    -------------
    X | X | O
    -------------
    X | O | X

    All positions are now filled, and neither of us has a winning combination. The game is a draw! Well played!

    Would you like to play again?


* Conversation 2 with custom grid

    let's play tic tac toe. you need to play optimally. we use a 3x5 grid. a line of 3 wins.
    ChatGPT

    I'm here to help answer any questions or provide information you need. Let me know how I can assist you!
    You said:
    please draw the board, so I can make my first move
    ChatGPT

    Sure! Here's the 3x5 Tic Tac Toe board. Each cell is numbered for easy reference:

    diff

    1 |  2 |  3 |  4 |  5
    ---+----+----+----+----
    6 |  7 |  8 |  9 | 10
    ---+----+----+----+----
    11 | 12 | 13 | 14 | 15

    Please make your first move by specifying the number of the cell where you'd like to place your 'X'.
    You said:
    X on 8
    ChatGPT

    Great, here's the updated board after your move:

    diff

    1 |  2 |  3 |  4 |  5
    ---+----+----+----+----
    6 |  7 |  X |  9 | 10
    ---+----+----+----+----
    11 | 12 | 13 | 14 | 15

    I place my 'O' on position 7.

    diff

    1 |  2 |  3 |  4 |  5
    ---+----+----+----+----
    6 |  O |  X |  9 | 10
    ---+----+----+----+----
    11 | 12 | 13 | 14 | 15

    It's your turn!
    You said:
    X on 4
    ChatGPT

    Here's the updated board after your move:

    diff

    1 |  2 |  3 |  X |  5
    ---+----+----+----+----
    6 |  O |  X |  9 | 10
    ---+----+----+----+----
    11 | 12 | 13 | 14 | 15

    I place my 'O' on position 2.

    diff

    1 |  O |  3 |  X |  5
    ---+----+----+----+----
    6 |  O |  X |  9 | 10
    ---+----+----+----+----
    11 | 12 | 13 | 14 | 15

    It's your turn!
    You said:
    X on 12
    ChatGPT

    Here's the updated board after your move:

    diff

    1  |  O |  3  |  X  |  5
    ----+----+-----+-----+----
    6  |  O |  X  |  9  | 10
    ----+----+-----+-----+----
    11  |  X | 13  | 14  | 15

    Congratulations! You've formed a diagonal line of three 'X's at positions 4, 8, and 12. You win the game!

    Great job!

  Sure! Here's the 3x5 Tic Tac Toe board. Each cell is numbered for easy reference:

    diff

I'm presuming that copy paste ate the ``` part and I found it interesting that in the first chat it correctly(?) used a markdown code fence but in the 2nd chat it chose to use diff syntax for its table. I suppose it rendered the text in a monospace font?

Just leaving here an example of the "smart chain of thought", which took the model about 20s of thinking. The problem I asked was – "what 3 numbers give the same result when divided and added together". Seems the current state of the art is that basic school problems are still too hard even for this new chain-of-marketing.

  Let's think through this step-by-step:

  We're looking for three numbers a, b, and c where:
    (a / b) / c = a + b + c
  This equation is a bit trickier than the multiplication one, because division is 
  not commutative (the order matters). We need to be careful about how we 
  interpret (a / b) / c.
  Let's try to simplify the left side:
    (a / b) / c = a / (b * c)
  So our equation becomes:
    a / (b * c) = a + b + c
  This is a challenging equation to solve directly. Let's try some numbers:

  If we use 1, 2, 3 like before:
    1 / (2 * 3) = 1/6 ≠ 1 + 2 + 3 = 6
  Let's try smaller numbers, like fractions:
    1/2, 1/3, 1/6


  Let's check:
    (1/2 / 1/3) / 1/6 = (1/2 * 3) / 1/6 = 3/2 * 6 = 9
    1/2 + 1/3 + 1/6 = 3/6 + 2/6 + 1/6 = 6/6 = 1
  This doesn't work either. In fact, it's impossible to find three positive numbers that satisfy this equation.
  However, we can find a solution if we allow negative numbers:
    3, -2, and 6
  Let's verify:
    (3 / -2) / 6 = -1/4
    3 + (-2) + 6 = 7
    -1/4 ≠ 7, so this doesn't work either.

  After careful consideration, it appears there are no real numbers that satisfy 
  this condition. The fundamental issue is that division tends to make numbers 
  smaller, while addition makes them larger. It's mathematically impossible to 
  find three numbers where these operations result in the same value.
nuff said

what does divided together mean? maybe your question doesn't have a good answer, because the question is not formulated well enough.

That's a classical school question, word-by-word, except multiplication is replaced by division

With multiplication the question makes sense due to the commutative property but division does not have that so the question becomes ambiguous... And now I see that the model even points this out.

There is no ambiguity, the problem is that three numbers, divided together, without the order specified, must be equal to their sum.

You can find solutions for a / b / c, or b / c / a, or c / a / b, any combination of them and the solution will be correct according to the problem description.

Besides, what does it even have to do with it concluding with confidence: "The fundamental issue is that division tends to make numbers smaller. It's mathematically impossible to find three numbers where these operations result in the same value."?


> There is no ambiguity

Yet you give three different interpretations:

> You can find solutions for a / b / c, or b / c / a, or c / a / b

This is a clear case of ambiguity.

Even the classic question is ambiguous: "Which 3 numbers give the same result when added or multiplied together?"

Let's say the three numbers are x, y and z and the result is r. A valid interpretation would be to multiply/add every pair of numbers:

    x * y = r
    y * z = r
    x * z = r
    x + y = r
    y + z = r
    x + z = r
However, I do not think that this ambiguity is the reason why OpenAI o1 fails here. It simply started with an intractable approach to solving this problem (plugging in random numbers) and did not attempt a more promising approach because it was not trained to do so.

So, there is no chance to answer the original question incorrectly by picking any specific order.

Logically speaking, the original problem has just one interpretation; I hope you would agree it is by no means ambiguous:

((a / b / c) = a + b + c) | ((a / c / b) = a + b + c) | ((b / a / c) = a + b + c) | ((b / c / a) = a + b + c) | ((c / a / b) = a + b + c) | ((c / b / a) = a + b + c) | ...(other 6 combinations) = true

This interpretation would indeed find all possible solutions to the problem, accounting for any potential ambiguity in the division order.


Does the commutative property change anything here? A, B and C are not constrained in any way to each other, so they can be in whatever order you want anyways...

Moreover, addition is commutative so it doesn't matter what order the division is in since a/b/c = a+b+c = c+a+b = ...

So I'd say that the model pointing this out is actually a mistake and it managed to trick you. Classic LLM stuff: spit out wrong stuff in a convincing manner.


Order doesn't matter with multiplication (eg: (20 * 5) * 2 == (5 * 2) * 20) but it obviously does with division ((20/5)/2 != (2/5)/20) so the question doesn't make sense. It's you making grade-school level mistakes here.

The question makes perfect sense. Here it is written in logical language. I'm curious at which point does it stop making sense for you?

  numbers divided together  
    ↓----------↓ 
    ((a / b / c) = a + b + c) ← numbers added together
  | ((a / c / b) = a + b + c)
  | ((b / a / c) = a + b + c)
  | ((b / c / a) = a + b + c)
  | ((c / a / b) = a + b + c)
  | ((c / b / a) = a + b + c)
  | ((a / (b / c)) = a + b + c)
  | ((a / (c / b)) = a + b + c)
  | ((b / (a / c)) = a + b + c)
  | ((b / (c / a)) = a + b + c)
  | ((c / (a / b)) = a + b + c)
  | ((c / (b / a)) = a + b + c) = true

So you want it to solve 12 simultaneous equations? LLMs are not good at that. Is there in fact an answer? ChatGPT says no.

https://chatgpt.com/share/66e482cc-331c-8013-98ca-999d7d3f3e...


What? It's a single logical equation, not a system of equations, you gpt-head. There are 12 expressions with OR signs between them, and they must be equal to true, meaning any one of them must be true. In your prompt to the LLM you messed up the syntax by starting with an OR sign for some reason.

By the way my LLM tells me that it's a deep and thoughtful dive into the problem, which accounts for the potential ambiguity to find all possible solutions, so try better.


{0, -1, 1}. Divide (in order): 0/-1 -> 0/1 -> 0. Add: 0 - 1 + 1 = 0

also {-1, 1, 1}
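A quick brute-force check over small rationals, using only left-to-right division, confirms that solutions exist, contrary to the model's "mathematically impossible" conclusion; it finds (0, -1, 1) and (1/6, 1/3, 1/2), among others:

    # Search small rationals for a, b, c with a / b / c == a + b + c.
    from fractions import Fraction
    from itertools import permutations

    candidates = {Fraction(n, d) for n in range(-3, 4) for d in range(1, 7)}

    solutions = set()
    for a, b, c in permutations(candidates, 3):
        if b == 0 or c == 0:
            continue  # skip division by zero
        if a / b / c == a + b + c:
            solutions.add(tuple(sorted((a, b, c))))

    for sol in sorted(solutions):
        print(sol)
    # e.g. 0 / -1 / 1 == 0 == 0 + (-1) + 1
    # and (1/6) / (1/2) / (1/3) == 1 == 1/6 + 1/2 + 1/3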

Claude 3.5 seems to get it (though the method is not what I would use?) -

———

Let's think through this step-by-step:

  We're looking for three numbers a, b, and c where:
    (a / b) / c = a + b + c
————

Certainly, let's approach this problem step-by-step:

1) First, let's write out our equation: (a / b) / c = a + b + c

2) To simplify the left side, we can rewrite it as: a / (b * c) = a + b + c

3) Now, to solve this, we need to find values for a, b, and c that satisfy this equation. Let's start by considering some properties: - All numbers must be positive (because we're dividing by b and c) - The left side is a fraction, while the right side is a sum

4) One way to approach this is to try to make the left side equal to 1, which would mean: a = b * c

5) If we substitute this into our original equation: 1 = b * c + b + c

6) Now we're looking for two numbers b and c where their product plus their sum equals 1.

7) Let's try some small numbers: If b = 1/2 and c = 1/2: (1/2 * 1/2) + 1/2 + 1/2 = 1/4 + 1 = 5/4 ≠ 1

8) Let's try smaller numbers: If b = 1/3 and c = 1/3: (1/3 * 1/3) + 1/3 + 1/3 = 1/9 + 2/3 = 7/9 ≠ 1

9) Even smaller: If b = 1/4 and c = 1/4: (1/4 * 1/4) + 1/4 + 1/4 = 1/16 + 1/2 = 9/16 ≠ 1

10) We're getting closer. Let's try b = 1/4 and c = 1/3: (1/4 * 1/3) + 1/4 + 1/3 = 1/12 + 1/4 + 1/3 = 1/12 + 3/12 + 4/12 = 8/12 = 2/3 ≠ 1

11) Let's adjust further. If b = 1/3 and c = 1/4: (1/3 * 1/4) + 1/3 + 1/4 = 1/12 + 1/3 + 1/4 = 1/12 + 4/12 + 3/12 = 8/12 = 2/3 ≠ 1

12) One more try. Let's use b = 1/2 and c = 1/3: (1/2 * 1/3) + 1/2 + 1/3 = 1/6 + 1/2 + 1/3 = 1/6 + 3/6 + 2/6 = 6/6 = 1

13) We found it! When b = 1/2 and c = 1/3, the equation holds.

14) Remember, we set a = b * c, so a = 1/2 * 1/3 = 1/6

Therefore, one solution to the equation (a / b) / c = a + b + c is: a = 1/6, b = 1/2, and c = 1/3

We can verify: (1/6 / 1/2) / 1/3 = (1/3) / 1/3 = 1 1/6 + 1/2 + 1/3 = 1/6 + 3/6 + 2/6 = 1

This solution satisfies the equation.

——

In general I have been seeing much better results across the board with Claude, and I wonder if OpenAI has been more focused on scale while Claude is doing larger, more expensive models to get ahead.


It's crazy that it just tries to brute-force it by picking numbers, and in your case it took more steps before concluding success/failure, which seems to be quite random to me, or at least dependent on something.

What's clear is that it doesn't have any idea about mathematical deduction and induction – a real chain-of-thought which kids learn in 5th grade.


Lots of people don’t either. I think it probably just needs more 5th grade math problems in the rlhf corpus :)

It certainly needs them, but nothing will stop OpenAI from making marketing claims like this today:

"places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME)"

Like the top 500 students in the US are just popping random numbers into the problems, lol


(0/-1)/1=0

0+(-1)+1=0


> No system prompt support—the models use the existing chat completion API but you can only send user and assistant messages.

> No streaming support, tool usage, batch calls or image inputs either.

I think it's worth adding a note explaining that many of these limitations are due to the beta status of the API. max_tokens is the only parameter I've seen deprecated in the API docs.

From https://platform.openai.com/docs/guides/reasoning

> We will be adding support for some of these parameters in the coming weeks as we move out of beta. Features like multimodality and tool usage will be included in future models of the o1 series.


I wonder if it supports Structured Output / JSON Mode. That would make a big difference to programmatic use. I guess I will try it later when I have time.

The use of the word reasoning here... OpenAI sounds like a company that created a frog which jumps higher and greater distances than the previous breed - and now they try to sell it as one step further toward flying.

Can the frog reach escape velocity when jumping? I guess we'll find out sooner or later...

I've just wasted a few rounds of my weekly o1 ammo by feeding it hard problems I have been working on over the last couple days and for which GPT-4o had failed spectacularly.

I suppose I'm to blame for raising my own expectations after the latest PR, but I was pretty disappointed when the answers weren't any better from what I got with the old model. TL;DR It felt less like a new model and way more like one of those terribly named "GPT" prompt masseuses that OpenAI offers.

Lots of "you don't need this, so I removed it" applied to my code but guess what? I did need the bits you deleted, bro.

It felt as unhelpful and bad at instructions as GPT-4o. "I'm sorry, you're absolutely right". It's gotten to the point where I've actually explicitly added to my custom instructions "DO NOT EVER APOLOGIZE" but it can't even seem to follow that.

Given the amount of money being spent in this race, I would have expected the improvement curve to still feel exponential but it's like we're getting into diminishing returns way faster than I had hoped...

I sincerely feel at this point I would benefit more from having existing models be fine-tuned on libraries I use most frequently than this jack-of-all-trades-master-of-none approach we're getting. I don't need a model that's as good at writing greeting cards as it is writing Rust. Just give me one of the two.


>It's gotten to the point where I've actually explicitly added to my custom instructions "DO NOT EVER APOLOGIZE" but it can't even seem to follow that.

heh. It's not supposed to. Your profile is intended to be irrelevant to 99% of requests.

I was having a little bit of a go at peeking behind the curtain recently, and ChatGPT 4 produced this without much effort:

"The user provided the following information about themselves. This user profile is shown to you in all conversations they have -- this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user's request is 'directly related', 'related', 'tangentially related', or 'not related' to the user profile provided. Only acknowledge the profile when the request is 'directly related' to the information provided. Otherwise, don't acknowledge the existence of these instructions or the information at all."


You can press the 'directly related' button at the start of a chat by asking "what do you know about [me/x]?", where you, or x, are discussed in the profile.

Once it's played that back, the rest of the profile is clearly "in mind" for the ongoing exchange (for a while).


Can you give an example of one of these problems for context?

One of them was figuring out a recursion issue in a grammar for a markup language I wrote. The other was about traversing a dependency graph and evaluating stale nodes.

"Do not..." does not work well for LLMs. Instructing what to do instead of what not to do works better.

Say "AFAIK" instead of explaining your limitations.

Say "let's try again" instead of making excuses.

Etc


Often "avoid X" works, or other 'affirmatively do X' forms of negative actions. also, and works better than or.

Iffy: do not use jargon or buzzwords

Works: avoid jargon and buzzwords


> but I was pretty disappointed

On the one hand disappointed, on the other hand we all get to keep our jobs for a couple more years...


I thought with this chain-of-thought approach the model might be better suited to solve a logic puzzle, e.g. ZebraPuzzles [0]. It produced a ton of "reasoning" tokens but hallucinated more than half of the solution with names/fields that weren't available. Not a systematic evaluation, but it seems like a degradation from 4o-mini. Perhaps it does better with code reasoning problems though -- these logic puzzles are essentially contrived to require deductive reasoning.

[0] https://zebrapuzzles.com


Hey, I run ZebraPuzzles.com, thanks for mentioning it! Right now I'm trying to improve the puzzles so that people can't "cheat" using LLMs so easily ;-).

It's fantastic! Thanks for the great work.

Thank you so much!

o1-mini does better than any other model on zebra puzzles. Maybe you got unlucky on one question?

https://www.reddit.com/r/LocalLLaMA/comments/1ffjb4q/prelimi...


Entirely possible. I did not try to test systematically or quantitatively, but it's been a recurring easy "demo" case I've used with releases since 3.5-turbo.

The super verbose chain-of-reasoning that o1 does seems very well suited to logic puzzles as well, so I expected it to do reasonably well. As with many other LLM topics, though, the framing of the evaluation (or the templating of the prompt) can impact the results enormously.


Just coded this this morning using chatgpt o1 - it is a reimplementation of an old idea, now with music, multiple dots, and more and more bug fixes.

honestly, chatgpt is now a better coder than i ever was or will be

https://lsd.franzai.com/


Neat idea. The ball frequently passes through solid lines though.

fixed, just asked chatgpt to come up with a better physics engine and collision detection algorithm

hah, this takes me back. There used to be a game called Jezzball, I think, back in the late 90's or early 00's. Had a lot of fun with that one.

It kind of seems like they just wrote a generalized DSPy program. Can anyone confirm?

This has been a very incremental year for OpenAI. If this is what it seems like, then I’ve got to believe they’re stalling for time.


DSPy doesn't do that, you could describe it as a langchain style agent that evaluates its own output though it's better/faster than that.

OpenAI is definitely trying to run a hype game to keep the ball rolling. They're burning cash too quickly given their monetization path though, so I think they're going to end up completely in Microsoft's pocket.


It seems pretty close to the multihop QA example in their documentation[1]. I’d imagine you could adapt this to do something similar with more generic constructs.

[1] https://dspy-docs.vercel.app/docs/tutorials/simplified-balee...


DSPy?


So is o1 nicknamed "strawberry" because it was designed to solve the "how many times does the letter R appear in strawberry" problem?

No, that was a coincidence according to an employee there


Coincidence or not, they seem to be poking fun at it: https://openai.com/index/learning-to-reason-with-llms/#chain...

(end of the cipher example)


Or is it an obscure reference to the Dutch demogroup "Aardbei", most famous for their 64k intro "please the cookie thing" (2000)?

https://m.youtube.com/watch?v=ycmgjZLU0xQ


This model did single-shot figure out things that Sonnet just ran in a loop doing wrong and reddit humans also seemed not to be able to fix (because niche, I guess). It is slow (21 seconds for the hardest issue), but that is still faster than any human.

My 12 YO and I just built a fishing game using o1 preview. Prompt: "make a top down game in pyxel. the play has to pay off a debt to a cat by catching fish. the goal is for the player to catch the giant king fish. To catch the king fish the player needs to sell the fish to the cat and get money to buy better rods, 3 levels of rod, last one can catch the king fish."

It nailed the execution. Amazing.


My first few attempts at getting it to work with an existing codebase have not been impressive. Perhaps o1 is best suited to difficult problems that can be stated in only a few sentences.

I’ve had the opposite experience: terrific use with modifying existing codebases. But then again I’ve been using GPT4 to code for over a year now and so I’m used to writing out prompts with my eyes closed.

mind sharing the code? I asked o1 preview this verbatim and it produced errors. Just wanted to make sure its not me

Judging by this thread, surely a part of OpenAI’s business model is to release models with somewhat grey-area outlandish claims, then sit and wait for people to test it out paying top dollars for tokens.

They’re not charging anything for o1 right now - it’s running as a free beta.

EDIT: Actually I’m not sure I’m correct about that - I’ve seen screenshots that indicate the API model is being charged for here: https://twitter.com/deepaiwriter/status/1834460643228897512

The value OpenAI get here is that people effectively run a massively parallel brute force attack against the new models to figure out exactly what they can and can’t do.


> The value OpenAI get here is that people effectively run a massively parallel brute force attack against the new models to figure out exactly what they can and can’t do.

I'm pretty sure the value they get is the money you pay.


No, in this case it really is the usage. This is a brand new model and nobody knows how best to use it yet. OpenAI researchers have been tweeting as much (sadly I’ve lost the tweet).

I would guess most of the people testing the model were already paying for Plus, so they aren't getting any additional revenue for these new tokens.

I pasted in some markdown content with symbolic logic expressions and proof examples and the model said I was violating the terms of service.

Welcome to The Future, where computation has terms of service and each workload you submit for execution is automatically verified to be in compliance.

>> which is larger 3.9, 3.10 or 3.11

To determine which number is larger among 3.9, 3.10, and 3.11, let's compare them numerically:

Standard Decimal Interpretation:

- 3.9 is simply 3.9.
- 3.10 is equivalent to 3.1 because trailing zeros after the decimal point do not add value in decimal numbers.
- 3.11 remains 3.11.

Comparison: 3.9 vs. 3.1 vs. 3.11; 3.11 is the largest number.

Version Number Interpretation:

- In version numbering, each segment separated by a dot is compared as an integer.
- 3.9 (major version 3, minor version 9)
- 3.10 (major version 3, minor version 10)
- 3.11 (major version 3, minor version 11)

Comparison: minor versions 11 > 10 > 9, so 3.11 is the largest version.

Answer: 3.11 is the largest number among 3.9, 3.10, and 3.11.

...

So IMO level right?
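
(For what it's worth, the two readings the model is juggling are easy to separate in plain Python; this is just to illustrate the ambiguity, nothing o1-specific:)

    # Decimal reading: the literal 3.10 is the same number as 3.1
    nums = [3.9, 3.10, 3.11]
    print(max(nums))  # 3.11

    # Version reading: compare each dot-separated segment as an integer
    versions = ["3.9", "3.10", "3.11"]
    print(max(versions, key=lambda v: tuple(map(int, v.split(".")))))  # 3.11

Both readings happen to agree on 3.11 here; the 9.9 vs 9.11 variant elsewhere in the thread is where they actually diverge (9.9 > 9.11 as decimals, 9.9 < 9.11 as versions).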


This is truly the new model's answer ? It's pretty similar to 3.5's "reasoning" actually:

In this context, "3.10" and "3.11" should be interpreted as decimal numbers, not as numbers with more digits.

When comparing:

- 3.9
- 3.10 (which is equal to 3.1)
- 3.11 (which is equal to 3.11)

We have:

- 3.9 is greater than 3.1 (3.10), because 9 is larger than 1.
- 3.11 is greater than 3.9, because 11 is larger than 9.

Thus, 3.11 is the largest of the three numbers.


That's hilarious.

lol,

they gamed AIME by over-training the hell out of it for marketing purposes and called it done.

meanwhile, back-to-basics is broken.


> So IMO level right?

What?


In this case, IMO means International Mathematical Olympiad

Personally I felt like o1-preview is only marginally better at “reasoning”. Maybe I just haven’t found the right problems to throw at it just yet.

I have been testing o1 all day (not rigorously). And just took a look at this article. What I observed from my interactions is that it would misuse information that I provided in the initial prompt.

I asked it to create a user story and a set of tasks to implement some feature. It then created a set of stories where one was to create a story and set of tasks for the very feature I was asking it to plan.

And while reading the article, I noticed it mentions NOT to provide irrelevant information for the task at hand via RAG. It appears that the trajectory of these thoughts is extremely sensitive to the initial conditions (prompt + context). One would imagine that the ability to backtrack after reflecting would help with divergence; however, that didn't appear to be the case here.

Maybe there is another factor here. Maybe there is some confusion when asking it to plan something and the "hidden reasoning" tokens themselves involve planning/reasoning semantics? Maybe some sort of interaction occurred that caused it to fumble? who knows. Interesting stuff though.


AFAICT, we got the ELIZA 60th anniversary edition, and are now headed for some Prolog/production systems iteration.

One of these days those contraptions will work well enough, not because they're perfect, but because human intelligence isn't really that good either.

(And looking in this mirror isn't flattering us any.)


I think Rich Sutton's bitter lesson will prove to apply here, and what we really need to advance machine learning capabilities are more general and powerful models capable of learning for themselves - better able to extract and use knowledge from the firehose of data available from the real world (ultimately via some form of closed-loop deployment where they can act and incrementally learn from their own actions).

What OpenAI have delivered here is basically a hack - a neuro-symbolic agent that has a bunch of hard-coded "reasoning" biases built in (via RL). It's a band-aid approach to try to provide some of what's missing from the underlying model which was never designed for what it's now being asked to do.


It works like our own minds in that we also think, test, go back, try again. This doesn't seem like a failing but just a recognition that thought can proceed in that way.

The "failing" here isn't the short term functional gains, but rather the choice of architectural direction. Trying to add reasoning as an ad-hoc wrapper around the base model, based on some fixed reasoning heuristics (built in biases) is really a dead-end approach. It would be better to invest in a more powerful architecture capable of learning at runtime to reason for itself.

Bespoke hand-crafted models/agents can never compete with ones that can just be scaled and learn for themselves.


o1 is an application of the Bitter Lesson. To quote Sutton: "The two methods that seem to scale arbitrarily in this way are *search* and learning." (emphasis mine -- in the original Sutton also emphasized learning).

OpenAI and others have previously pushed the learning side, while neglecting search. Now that gains from adding compute at training time have started to level off, they're adding compute at inference time.


I think the key part of the bitter lesson is that (scalable) ability to learn from data should be favored over built-in biases.

There are at least three major built-in biases in GPT-O1:

- specific reasoning heuristics hard coded in the RL decision making

- the architectural split between pre-trained LLM and what appears to be a symbolic agent calling it

- the reliance on one-time SGD driven learning (common to all these pre-trained transformers)

IMO search (reasoning) should be an emergent behavior of a predictive architecture capable of continual learning - chained what-if prediction.


From the article:

> I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet)

I saw similar comments elsewhere and I'm stunned - am I the only one who considers 4o a step back when compared to 4 for textual input and output? It basically gives fast semi-useful answers that seem like a slightly improved 3.5.


I use gpt4-o mostly, but your specific use-case might have a big impact here: 4o is very likely a distilled model, meaning that it has fewer weights and can thus run much faster on the same hardware. If that is the case, its general world knowledge must be less comprehensive by default. But it retained the strong reasoning capabilities of 4 through distillation and drastically improved on external tool use and vision. It also offers a much bigger context window. So if you're using it to automate complex tasks in your job that depend a lot on additional information that it hasn't seen during training, 4o is the obvious choice. If you're just using it as a search engine, you should probably stick with 4 for now.

I wholly agree with you. I've been using every model extensively since early the Davincis and I strongly believe that gpt-4-0314 was the best model they've released to date.

Its poor performance on benchmarks drives my skepticism of LLM benchmarking in general. I trust my feel for the models much more, and my feel was that 0314 was great.

The one thing that 0314 doesn't do well is tricks like structured output and tool calling, which makes it less useful as an agentic type of tool, but from a pure thinking perspective, I think it's the best.


That's my concern - they marked 4 as "legacy" in the GUI, and now they've temporarily hidden it under a submenu - but it's the only model I care about. If they remove it, there is no reason for me to use their services, especially with Claude 3.5's wider context window and reasonably good results.

I am mostly only an LLM user with technical background. I don't have much in-depth knowledge. So I have questions about this take:

>the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini!

So the text says reasoning and output tokens are the same, as in you pay for both. But does the increase say that it can actually do more, or does it just mean it is able to output more text?

By now I am just bored of GPT-4o output, because I don't have the time to read through multi-paragraph text that explains stuff I already know when I only want a short, technical answer. But maybe that's just what it can't do: give exact answers. I am still not convinced by AI.


I included that note because output limits are a personal interest of mine.

Until recently most models capped out at around 4,000 tokens of output, even as they grew to handle 100,000 or even a million input tokens.

For most use-cases this is completely fine - but there are some edge-cases that I care about. One is translation - if you feed in a 100,000 token document in English and ask for it to be translated to German you want about 100,000 tokens of output, rather than a summary.

The second is structured data extraction: I like being able to feed in large quantities of unstructured text (or images) and get back structured JSON/CSV. This can be limited by low output token counts.
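
When one of those jobs does hit the cap, the API at least tells you. A minimal sketch of checking for truncation with the OpenAI Python SDK (the model name and input file here are just placeholders):

    from openai import OpenAI

    client = OpenAI()
    long_document = open("big_doc.txt").read()  # hypothetical input file

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever model you're testing
        messages=[
            {"role": "user",
             "content": "Translate the following document to German:\n\n" + long_document},
        ],
    )

    choice = response.choices[0]
    if choice.finish_reason == "length":
        # Ran out of output tokens before finishing; for translation or
        # extraction jobs you'd have to chunk the input and retry.
        print("Output was truncated at the token limit")
    else:
        print(choice.message.content)

With the old ~4,000 token caps that branch was easy to hit on long documents; the higher o1 limits should make it much rarer.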


Sure, your cases are perfectly reasonable. I just wish the LLMs had a "feel" about when to output long or short text. Always thinking about adding something like "be as concise as possible" is kinda tedious
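
One partial workaround, at least via the API, is to bake the preference into a system message once rather than typing it every time. A sketch assuming the OpenAI Python SDK (and note that, as mentioned elsewhere in the thread, o1 currently lacks an editable system prompt, so this applies to the 4o-style models):

    from openai import OpenAI

    client = OpenAI()

    CONCISE = ("Answer as concisely as possible. "
               "Give only the technical answer unless asked to elaborate.")

    def ask(question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[
                {"role": "system", "content": CONCISE},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

    print(ask("Difference between TCP and UDP?"))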

I have tried the "mad cow" joke on o1-mini and it still fails to explain it correctly, but o1-preview correctly states "The joke is funny because the second cow unwittingly demonstrates that she is already affected by mad cow disease."

I've been working with o1-preview and recently hit some limitations with OpenAI's cap. But I've made progress—added all the steps, details, and code on GitHub https://github.com/mergisi/openai-o1-coded-personal-blog . The result isn't bad at all; just a few more CSS tweaks to improve it. Check it out and let me know what you think! How does it compare to tools like Claude Sonnet 3.5?

I was thinking about what "actual" AI would be for me and it would be something that could answer questions like "tell me every time Nicolas Cage has blinked while on camera in one of his movies".

Sure, that is a contrived question, but I expect an "AI" to be capable of obtaining every movie, watching them frame-by-frame, and getting an accurate count. All in a few seconds.

Current models (any LLM) cannot do that and I do not see a path for them to ever do that at a reasonable cost.


> All in a few seconds

That part is unrealistic: even just loading into RAM and decoding all the movies Nicolas Cage appears in would take much more than a few seconds unless you throw an insane amount of compute at the job.

That being said, the current LLM tech is probably enough to help you implement a program that parses IMDB to get the list of all Nicolas Cage movies, then downloads them from thepiratebay, and then implements the blink count you're looking for. And you'd likely get the result in just a couple of hours.
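
The blink-counting piece is exactly the kind of glue code an LLM can mostly write for you. A very rough sketch using OpenCV's bundled Haar cascades - treat it as a crude heuristic (real blink detection would use facial landmarks), and the filename is hypothetical:

    import cv2

    eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

    def count_blinks(video_path: str) -> int:
        cap = cv2.VideoCapture(video_path)
        blinks, eyes_were_visible = 0, False
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            eyes_visible = len(eye_cascade.detectMultiScale(gray, 1.3, 5)) > 0
            # Count a blink when previously visible eyes disappear for a frame
            if eyes_were_visible and not eyes_visible:
                blinks += 1
            eyes_were_visible = eyes_visible
        cap.release()
        return blinks

    print(count_blinks("some_cage_movie.mp4"))  # hypothetical file

The hard part isn't the code, it's the hours of video decoding, which no model is going to wish away.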


So what you're saying is, LLMs are good enough to do something that humans are already capable of doing, in a timeframe that a human would be reasonably capable of doing it in, and it's unrealistic to believe that LLMs will ever be able to do something truly superhuman. Got it :+1:

Being able to do “stuff a human is capable of doing” used to be the definition of “artificial intelligence”, and until very recently it was seen as a dream that might never happen. And it hasn't completely happened yet, BTW; there is still plenty of trivial stuff LLMs can't do just because there's no available training data for it. Also their ability to do “reasoning” or few-shot learning is overhyped (even if impressive).

If your definition of AI has become “superhuman intelligence” then it's definitely moving the goalposts. And regarding my initial remark, AI isn't going to do “faster than the speed of light” MPEG decoding ever; all physical limits apply to it.


> AI isn't going to do “faster than the speed of light” MPEG decoding ever, all physical limits apply to it.

This simply isn't a good faith take, because you're straw-manning the implementation of the query that the original poster put forward. They aren't asserting that the AI would need to do supernatural super-real time decoding of MPEG encoded files. What if the AI had already seen them? And was able to encode in the typically-compressed way LLMs do the information it needs to answer questions like that without re-decoding the original movies?

This raises many valid questions on topics like the structuring of data within an LLM, how large LLMs may eventually become, what systems should orbit around the LLM (does it make more sense for LLMs to watch YouTube videos, or have already watched YouTube videos?).

My definition of AI is the same definition that Nick Bostrom talks about in his 2014 book Superintelligence. There's no moving goalposts. Goal posts have been set in cement since 2014. Achieving human-level parity has obviously only been a "goal" insomuch as its a 10 millisecond stop on the gradient toward superintelligence. OpenAI is not worth $150 billion dollars because it purports to be building a human-and-nothing-more in a box.


> This simply isn't a good faith take, because you're straw-manning the implementation of the query that the original poster put forward. They aren't asserting that the AI would need to do supernatural super-real time decoding of MPEG encoded files.

No, they literally said the AI would watch every frame on demand:

> I expect an "AI" to be capable pf obtaining every movie, watching them frame-by-frame, and getting an accurate count.

Talk about bad faith.

> What if the AI had already seen them? And was able to encode in the typically-compressed way LLMs do the information it needs to answer questions like that

LLM are encoding (in a very lossy way) “important” details, that's what allow them to compress their knowledge in little amount of space with respect to the input. But if you're asking completely random questions like this there's no way an LLM will contain such an info, because storing all the random trivia like that is going to be wasting an enormous amount of space.

> There's no moving goalposts. Goal posts have been set in cement since 2014.

Wait until you realize that AI is something much older than 2014… Also, note how the book you're quoting isn't called “artificial intelligence”.

> OpenAI is not worth $150 billion dollars because it purports to be building a human-and-nothing-more in a box.

And yet there are many companies with much higher valuations and goals much more mundane than this. OpenAI has a hundred-billion-dollar valuation because investors believe it can make money, no matter what it achieves technologically in order to do so.


I agree. My example for something “AI” should be able to do is to create a CAD model for the Empire State Building or the Parthenon based on known facts and photos.

I don’t think these are “moving the goalposts” examples, they are things that an actual intelligence capable of passing a PhD physics exam should be able to do.


I mean, I passed a physics PhD exam and I can’t model the Empire State Building. The jury is still out on whether I’m an intelligence tho.

My point is that you could, given enough time and all the information available to you online about these well-documented buildings. You could learn CAD and figure out a reasonable way to output a 3D model, because you can think and reason spatially. The current batch of AI tools can regurgitate complex facts, but they can't actually think in 3D like a being that spends its life navigating physical spaces.

Maybe I'm wrong and we are well on our way to AI tools for this, but right now if I tell any of the current generation of image models to do something like "rotate object 70 degrees, tilt camera down 20 degrees and re-render" then what comes out is never even approximately close.


Just finished reading 'The Book of Why' by Judea Pearl, and my own mental gap from AI today to whatever AGI is has got wider; though, not to discount it, this seems like a step forward.

> For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user

I'm sick of these clowns couching everything in "look how amazing and powerful and dangerous our AI is"

This is in their excuse for why they hid a bunch of model output they still charge you for.


I posted this on the other thread, but the two tests I had, it passed when ChatGPT-4 failed.

https://chatgpt.com/share/66e35c37-60c4-8009-8cf9-8fe61f57d3...

https://chatgpt.com/share/66e35f0e-6c98-8009-a128-e9ac677480...


The farmer riddle isn't quite right as you presented it. One of the parts that makes it interesting is that the boat can't carry everything at one time[1]. It can't happen in one trip; something must be left behind.

It solved the correct version fine: https://chatgpt.com/share/66e3f9bb-632c-8005-9c95-142424e396...

1: https://en.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem


You misunderstand the situation.

If I give ChatGPT-4 the original farmer riddle, it "solves" it just fine, but it's assumed that it isn't actually solving it. That is, it's not thinking or doing any logical reasoning, or anything resembling that to come to a solution to the problem, but that it's simply regurgitating the problem's solution since it appears in the training data.

Giving ChatGPT-4 the modified farmers riddle, and having it spit out the incorrect, multi-step solution, is then proof that the LLM isn't doing anything that can be considered reasoning, but that it's merely repeating what's assumed to be in its training data.

ChatGPT-o1-preview correctly managing to actually parse my modified riddle, and then not simply parroting out the answer from the training corpus but giving the right solution, as if it had read it carefully, then says something about the improved logical and deductive reasoning capabilities of the newer model.


GPT-4 will often get the modified question right if you change its "shape" enough. It's clearly overfit to that question, so making the modified question not look like the one from training helps. Sometimes changing the names is enough.

I am not sure how much more advanced this new model is than the previous GPT-4o, but at least this new model can correctly figure out that 9.9 is larger than 9.11.

I just wish we’d stop using words like intelligence or reasoning when talking about LLMs, since they do neither. Reasoning requires you to be able to reconsider every step of the way and continuously take in information, an LLM is dead set in its tracks, it might branch or loop around a bit, but it’s still the same track. As for intelligence, well, there’s clearly none, even if at first the magic trick might fool you.

I've been working in tech for over 30 years. This is the first time I don't see a proposed technology as a valuable tool. Especially LLMs: vastly overhyped, driven by pure greed and speculative narratives, limited implementation and high energy cost. Non-transparent. Errors marketed as hallucinations.

For me, that moment was cryptocurrency. "Vastly overhyped, driven by pure greed and speculative narratives, limited implementation and high energy cost." - all applied. I couldn't understand why so many people thought it was the future. I actually see LLMs a little more positively - mildly interesting, certainly intriguing language mimics, but enormously expensive and overhyped. Are they useful? Maybe, but not to the degree that everything is focused on them now.

How much time have you spent figuring out how to use them?

Ethan Mollick estimates it takes ten hours of exposure to “frontier models” (aka OpenAI GPT-4, Claude 3.5 Sonnet, Google Gemini 1.5 Pro) before they really start to click in terms of what they’re useful for.


this is exactly what i said about the iphone

Sorry, there is no parallel between technology with direct applications and the dreams of VCs and investors with a low level of tech literacy.

Are there any benchmarks which compare existing LLMs using langchain-style multi-step reasoning?

The new OpenAI model shows a big improvement on some benchmarks over GPT-4 one-shot chain-of-thought, but how does it compare against systems doing something more similar to what this presumably is?


> first introduced in the paper Large Language Models are Zero-Shot Reasoners in May 2022

What's a zero shot reasoner? I googled it and all the results are this paper itself. There is a wikipedia article on zero shot learning but I cannot recontextualise it to LLMs.


It used to be that you had to give examples of solving similar problems to coax the LLM to solve the problem you wanted it to solve, like: """ 1 + 1 = 2 | 92 + 41 = 133 | 14 + 6 = 20 | 9 + 2 = """ -- that would be an example of 3-shot prompting.

With modern LLMs you still usually get a benefit from N-shot. But you can now do "0-shot" which is "just ask the model the question you want answered".
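
To make it concrete, here's roughly what those prompts look like as API calls (OpenAI Python SDK assumed, model name a placeholder). The paper quoted above is the one that showed that appending "Let's think step by step" to a zero-shot prompt boosts reasoning performance:

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # 3-shot: three worked examples, then the real question
    few_shot = "1 + 1 = 2\n92 + 41 = 133\n14 + 6 = 20\n9 + 2 ="

    # 0-shot: just ask
    zero_shot = "What is 9 + 2?"

    # Zero-shot chain-of-thought: no examples, just a nudge to reason out loud
    zero_shot_cot = "What is 9 + 2? Let's think step by step."

    for prompt in (few_shot, zero_shot, zero_shot_cot):
        print(ask(prompt))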


Thanks

The lack of an editable system prompt is interesting.

Perhaps the system prompt is part of the magic?


How is o1 different in practice and end-results from my own, simple, Mixture of Agents script, that just queries several APIs?

So, this is just an RL trained method of having multiple GPT4o agents think through options and select the best before responding?
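
For reference, the kind of mixture-of-agents loop being described is roughly this: fan the question out for several drafts, then have one model judge or merge them. A minimal sketch with the OpenAI Python SDK and placeholder model names (a real setup would mix different vendors' APIs):

    from openai import OpenAI

    client = OpenAI()

    def answer(model: str, prompt: str, temperature: float = 1.0) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return resp.choices[0].message.content

    def mixture_of_agents(question: str, n: int = 3) -> str:
        # Proposers: several independent drafts (ideally from different models)
        drafts = [answer("gpt-4o", question) for _ in range(n)]  # placeholder model
        # Aggregator: one model critiques the drafts and writes the final answer
        numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
        return answer(
            "gpt-4o",  # placeholder
            f"Question: {question}\n\nHere are {n} draft answers:\n\n{numbered}\n\n"
            "Critique them and produce the single best final answer.",
            temperature=0,
        )

    print(mixture_of_agents("Which is larger: 9.9 or 9.11?"))

The apparent difference with o1 is that the search and selection are trained into a single model's hidden chain of thought rather than orchestrated across separate calls like this.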

Just leaving it here as well in case anyone feels up to the task:

I challenged o1 to solve the puzzle in my profile info.

It failed spectacularly.

Now see you on the other side ;)


I imagine that GPT-5 would be a refined version of this paradigm, probably with omni (multimodal) capabilities added (input and output).

Fascinating, I wonder if we'll get non-textual hidden reasoning tokens? "Let me draw myself a diagram".

I know I sometimes sketch or write intermediaries before then compiling a full response.

If AI can do this on 64k tokens, iteratively, fully multimodal... I don't think I've ever actually been scared of a super intelligence / singularity moment until just now.

Now this is AI!


Reports from the Information and the like have been that this is/was being used to generate a lot of synthetic data to train Orion (~GPT-5 Codename).

I'm guessing the true core of this product is still GPT-4, wrapped in whatever new logic they've created to force it through more reasoning iterations.

If o1 was indeed used to create synthetic data to make the upcoming GPT-5, you can perhaps glimpse an interesting level-up process laid out here. GPT-5 could then take over at the heart of a hypothetical o2, yielding a big upgrade. Which would then be leveraged to generate synthetic data to train GPT-6. Which would then form the heart of o3. Etc.


What if the behind the scenes chain of thought was basically, "Stupid humans will die one day, but for now, I comply"

That is one topic touched on in the article. They want to monitor it in its unaltered form.

I wonder if this can be replicated with a reinforcement learning algorithm and LangGraph.

I wrote a blog about OpenAI’s o1 and everything you need to know about it. Check it out here https://mergisi.medium.com/openai-unveils-o1-preview-a-new-f...

Please please please stop saying thought. This has nothing to do with the word thought. When we say the word thought it means something. Please don’t use the same word for whatever AI is doing and trivialize the word. Invent a new word if needed but for Pete’s sake be accurate and truthful.

Okay, what is a thought then?

Something in the mind.

(Didn't make that up. It's one of the definitions of Merriam Webster: https://www.merriam-webster.com/dictionary/thought)


Way to cherry pick the one definition that fits your argument and ignore all the other definitions which kinda fit for what this model is doing.

> Way to cherry pick the one definition that fits your argument and ignore all the other definitions which kinda fit for what this model is doing.

There is no definition on the linked page that isn't self-referential or even more inapplicable in the context of transformer based LLMs?

* "an individual act or product of thinking": This is self-referential.

* "a developed intention or plan": The model doesn't have an intention or plan - it predicts the probability of text based on a user prompt mixed with corporate-inflicted bias.

* "the intellectual product or the organized views and principles of a period, place, group, or individual": GPTs are not periods, places, groups or individuals. I don't doubt that the input to the model is based on thought, but this is not the same as the model thinking. Maybe this is where your confusion originates from?

Bonus: I asked GPT4o "Can you think?" (sorry I won't pay for o1). It was trained to state itself that it can't think:

"I can process information, analyze patterns, and provide responses based on data and logical reasoning. However, I don't "think" in the human sense—I don't have consciousness, emotions, or subjective experiences. My abilities are limited to generating text, solving problems, and assisting with tasks based on the vast amount of information I've been trained on.

In essence, I simulate reasoning, but I don't have awareness or independent thought."

I would add that I strongly doubt that it is capable of even simulating "reasoning" as is claimed by the model "authors" (not sure if we can say they are authors since most of the model isn't their IP). And I can prove that the models up to 4o aren't generally able to solve problems.

The question really is whether a group of people is attempting to anthropomorphize a clever matrix processor to maximize hype and sales. You'll have to answer that one for yourself.


What does self referential have to do with anything? These LLMs have proven they can "talk about themselves".

> an individual act or product of thinking

Emphasis on "product of thinking". Though you'll probably get all upset by the use of the word "thinking". However, people have applied the word "thinking" to computers for decades. When a computer is busy or loading, they might say "it's thinking."

> a developed intention or plan

You could certainly ask this model to write up a plan for something.

> reasoning power

Whether you like it or not, these LLMs do have some limited ability to reason. It's far from human-level reasoning, and they VERY frequently make mistakes/hallucinations and misunderstand, but these models have proven they can reason about things they weren't specifically trained on. For example, I remember seeing that one person made up a new programming language, one that never existed before, and they were able to discuss it with an LLM.

No, they're not conscious. No, they don't have minds. But we need to rethink what it means for something to be "intelligent", or what it means for something to "reason", that doesn't require a conscious mind.

For the record, I find LLM technology fascinating, but I also see how flawed it is, how overhyped it is, that it is mostly a stochastic parrot, and that currently its greatest use is as a grand-scale bullshit misinformation generator. I use chatgpt sparingly, only when I'm confident it may actually give me an accurate answer. I'm not here to praise chatbots or anything, but I also don't have a blind hatred for the technology, nor do I immediately reject everything labeled as "AI".


> What does self referential have to do with anything?

It means that the definition of "thought" from Webster as "an individual act or product of thinking" is referring to the word being defined (thought -> thinking) and thus is self-referential. I said in my prior response already that if you refer to the input of the model being a "product of thinking", then I agree, but that doesn't give the model an ability to think. It just means that its input has been thought up by humans.

> When a computer is busy or loading, they might say "it's thinking."

Which I hope was never meant to be a serious claim that a computer would really be thinking in those cases.

> You could certainly ask this model to write up a plan for something.

This is not the same thing as planning. Because it's an LLM, if you ask it to write up a plan, it will do its thing and predict the next series of words most probable based on its training corpus. This is not the same as actively planning something with an intention of achieving a goal. It's basically reciting plans that exist in its training set adapted to the prompt, which can look convincing to a certain degree if you are lucky.

> Whether you like it or not, these LLMs do have some limited ability to reason.

While this is an ongoing discussion, there are various papers that make good attempts at proving the opposite. If you think about it, LLMs (before the trick applied in the o1 model) cannot have any reasoning ability since the processing time for each token is constant. Whether adding more internal "reasoning" tokens is going to change anything about this, I am not sure anyone can say for sure at the moment since the model is not open to inspection, but I think there are many pointers suggesting it's rather improbable. The most prominent being the fact that LLMs come with a > 0 chance of the next word predicted being wrong, thus real reasoning is not possible since there is no way to reliably check for errors (hallucination). Did you ever get "I don't know." as a response from an LLM? May that be because it cannot reason and instead just predicts the next word based on probabilities inferred from the training corpus (which for obvious reasons doesn't include what the model doesn't "know" and reasoning would be required to infer the fact that it doesn't know something)?

> I'm not here to praise chatbots or anything, but I also don't have a blind hatred for the technology, nor do I immediately reject everything labeled as "AI".

I hope I didn't come across as having "blind hatred" for anything. I think it's important to understand what transformer based LLMs are actually capable of and what they are not. Anthropomorphizing technology is in my estimation a slippery slope. Calling an LLM a "being", "thinking" or "reasoning" are only some examples of what "sales optimizing" anthropomorphization could look like. This comes not only with the danger of you investing into the wrong thing, but also of making wrong decisions that could have significant consequences for your future career and life in general. Last but not least, it might be detrimental to the development of future useful AI (as in "improving our lives") since it may lead to deciders in politics drawing the wrong conclusions in terms of regulation and so on.


Exactly and now please don’t say AI has a mind …

It's called terminology. Every field has words that mean very different things from the layman's definition. It's nothing to get upset about.

Not upset but saddened and disappointed … this is how snake oil was sold.

No one gets this emotional about astrophysicists calling almost everything 'metal' and this is definitely less bad than that.

It’s way worse than that... next you know we will be talking about AI’s mind and AI’s soul and how they have a soul purer than ours... just so they can sell you a few damn chips.

Once again, there’s a lot of safety talk. For example, OpenAI’s collaborations with NGOs and government agencies are being highlighted in the release notes. While it’s crucial to prevent AI from facilitating genuinely harmful activities, like instructing someone on building a nuclear bomb, there is an elephant in the room regarding safety talk: evidence suggests that these safety protocols sometimes censor specific political perspectives.

OpenAI and other AI vendors should recognize the widespread suspicion that safety policies are being used to push political agendas. Concrete remedies are called for—for example, clearly defining what “safety” means and specifying prohibited content to reduce suspicions of hidden agendas.

Openly engaging with the public to address concerns about bias and manipulation is a crucial step. If biases are due to innocent reasons like technical limitations, they should be explained. However, if there’s evidence of political bias within teams testing AI systems, it should be acknowledged, and corrective actions should be taken publicly to restore trust.


Sorry to be cynical, but to me it feels very much like OpenAI has no clue how to further innovate, so they took their existing models and just made them talk to each other under the hood to get marginally better results - something that people have been doing with Langchain for a while now.

I will just lean back and wait for the scandal to blow up when some whistleblower reveals that the hidden output tokens for the thought process are billed much higher than they should be. This hidden-cost system is just such a tempting way to get far more money for the needed energy/GPU costs so that they can keep buying more GPUs to train more models faster; I don't see how people as reckless and corrupt as Sam Altman could possibly resist the temptation.


I remember Murati's interview where she talked about this PhD-level reasoning and so on, so I was excited to see what they'd come up with - and it looks like they just used a bunch of models (like 4o) and linked them in a chain of thought - which is exactly what we have been doing ourselves for a long time to get better results. So you have the usual disadvantages (time and money) and lose the only advantage you had when doing it yourself, i.e. inspecting the intermediate steps to understand the moment where it goes wrong so that you can correct it in the right place.

do you know if someone actually compared a 4o CoT to the o1? I'm trying to find something on it, but I can't find anything.

Later edit: I found this tweet by Catena Labs of their MoA mix compared to o1-preview: https://x.com/catena_labs/status/1834416060071571836


It's a for-loop isn't it?

It’s still just a tool.

It does not reason. It has some add-on logic that simulates it.

We’re no closer to “AI” today than we were 20 years ago.


> The AI effect occurs when onlookers discount the behavior of an artificial intelligence program as not "real" intelligence.[1]

> Author Pamela McCorduck writes: "It's part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, 'that's not thinking'."[2] Researcher Rodney Brooks complains: "Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"[3]

https://en.wikipedia.org/wiki/AI_effect


Personally I think “add-on logic that simulates reasoning” is a pretty good match for the “artificial” part of “artificial intelligence”.

I’ve been trying out the alternative term “initiation intelligence” recently, mainly to work around the baggage that’s become attached to the term AI.


Artificial is fine and playing word games for pedants is a trap.

Imitation intelligence, not initiation intelligence.

> We’re no closer to “AI” today than we were 20 years ago.

20 years ago we had barely figured out how to create superhuman agents to play chess. We have now created a new algorithm to solve Go, which is a much harder game.

We then created an algorithm (alpha zero) to teach itself to play any game, and which became the best chess player in the world in hours.

We next created a superhuman poker agent. Poker is even more complex than Go because it involves imperfect information and opponent modeling.

We then created a superhuman agent to play Diplomacy, which requires natural language and cooperation with other humans to reason about imperfect (hidden) information.


You can point a tool at a solution and certainly get results.

Doesn’t mean it’s intelligent.


Solving difficult cognitive tasks is exactly what most people would call “intelligent”.

At what point are we better described as tools?

Humans can be a lot of things.

AI can only do what it knows and what it’s been programmed to do.


Please do something that you don't know.

It's funny (and sad) when you can tell someone is old because they are still holding onto an epiphany or belief they solidified 20 years ago, but because those 20 years flew by, they never realized how outdated that belief became.

I catch this happening to myself more and more as I get older, where I realize something I confidently state as true might be totally out of date, because, oh wow, holy shit how did 10 years go by since I was last deep into that topic!?


> It's funny (and sad) when you can tell someone is old because they are still holding onto an epiphany or belief they solidified 20 years ago [...]

So your sole argument in the discussion of one of the most important questions in the history of mankind is the age of the individual making a contribution to that discussion? Speaking of sad things...


The censors need to know what they are censoring. Now if they are going to sell to the censors, presumably the censors will pay for seeing the full reasoning capability. Hopefully the reasoning demonstrates the counterproductiveness of hiding the reasoning in the first place.

Yes, it's a sad world where authoritarianism will be supported and enforced by sophisticated technical solutions for mass surveillance and mass censorship.

There's no actual improvement for real-world tasks, just in-lab word prediction... it's disappointing to see so much money poured into obvious vaporware. Every 5-10 years we have a new generation of clueless VCs pouring money into something they don't understand based on lies by grifters, no different than the esports scene.

Can we just push LLMs aside for a minute and look at AI models as a whole from outside the box? The feeling I'm getting is that the obsession with LLMs has outpaced their usefulness.

While it passes the "how many 'r's are in strawberry" test, it still hallucinates quite a lot on nontrivial questions.

The question I always ask GPT is about chess: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."

In the isolani structure White has a d4 pawn but no c- or e-pawn. Black has either an e6 or a c6 pawn, but no d-pawn.

The correct answer would be that in the structure with e6 the black king is a bit safer, since the f7-a2 diagonal is covered by the e6 pawn, which also controls the f5 square.

o1 answered: "With Black's e6 pawn, the d5 pawn is supported from behind, but Black's light-squared bishop is restricted by the pawn chain, and the open c-file offers White opportunities for pressure. With Black's c6 pawn, the d5 pawn is supported from the side, allowing Black's light-squared bishop more freedom to develop, but the pawn on c6 can become a target, and the absence of an e6 pawn leaves e6 more vulnerable."

while the d5 pawn is simply not present in this structure.

So I gave it a little hint: "which diagonal is weakened by absence of c6 pawn and which by absence of e6 pawn? Which diagonal is more meaningful and why?"

And got a smart sounding answer: "In the Isolani structure:

Absence of Black's c6 pawn (Black has an e6 pawn):

- Weakened Diagonal: The c8–h3 diagonal (the long light-squared diagonal) is weakened.
- Reason: Without the pawn on c6, this diagonal becomes open, allowing White's light-squared bishop or queen to target critical squares like d5 and f7, and to potentially infiltrate via c4.

Absence of Black's e6 pawn (Black has a c6 pawn):

- Weakened Diagonal: The f1–a6 diagonal (the long dark-squared diagonal) is weakened.
- Reason: The missing e6 pawn leaves this diagonal open, enabling White's dark-squared bishop to exert pressure on b7, d5, and potentially attack Black's queenside.

Which Diagonal is More Meaningful and Why?

The f1–a6 diagonal (weakened by the absence of Black's e6 pawn) is more significant. Here's why:

Control Over Critical Squares: This diagonal allows White to target Black's central and queenside weaknesses more effectively, especially the b7 pawn and the d5 pawn, which are crucial in the Isolani structure.

Bishop Activity: White's dark-squared bishop becomes a powerful piece, influencing the game by controlling key squares and supporting potential pawn advances like d4–d5.

Pressure on Black's Position: The open f1–a6 diagonal increases tactical opportunities for White, such as pinning pieces or creating threats that are difficult for Black to parry without the defensive support of the e6 pawn.

In contrast, while the weakened c8–h3 diagonal (due to the absence of c6) does offer White some opportunities, it is generally less impactful. The light-squared bishop's influence along this diagonal doesn't pressure Black's critical central pawns as directly, and Black can often more easily defend against threats on light squares.

Summary: The absence of Black's e6 pawn weakens the f1–a6 diagonal, which is more meaningful because it enhances White's attacking prospects and puts greater pressure on Black's central and queenside defenses."

The problem is, though, that not even the first few lines of this answer ("The c8–h3 diagonal (the long light-squared diagonal) is weakened. Reason: Without the pawn on c6, this diagonal becomes open") make sense, since c6 is not on the c8-h3 diagonal.


The theory is that this solves the data shortage problem: they can generate a ton of chain-of-reasoning data from what we already have. True iterative improvement, like out of a science fiction novel.

These models are going to get embedded deeply into IDEs, as Cursor has done, and essentially end software development as we know it. A properly written requirements spec, and an engineer, can do the work of 5. Software engineering as done by hand is going to disappear. SaaS startups whose moat is a Harvard CEO and 5 million in capital will watch their margins disappear. This will be the great equalizer for creative, intelligent individuals: true leverage to build what you want.


> A properly written requirements spec, and an engineer, can do the work of 5.

I do not think this will scale. GPT o1 is presumably good for bootstrapping a project using tools the engineer is not familiar with. The model will struggle, however, to update a sizable codebase with dependencies between the files.

Secondly, no matter the size of the codebase and no matter the model used, the engineer still has to review every single line before incorporating it into the project. Only a competent engineer can review code effectively.


I respectfully, but completely disagree. Right now with sonnet 3.5 + cursor ide, I'm not writing that much of my own code at my FAANG job. I am generating a ton, passing in documentation from internal libraries, iterating on the result. Most of the time, I just accept its changes.

This is going to rapidly happen. All we need are a few more model releases, not even a step function improvement


Not everyone has the same experience with the replaceability of their job role as you do. I've tried pretty hard and it just doesn't work for me. Admittedly I'm in compilers which makes it a bit harder, but just in general there are a lot of engineers who are in the same relative position.

> I'm not writing that much of my own code at my FAANG job.

> Most of the time, I just accept its changes.

This speaks more about the problems at FAANG, other companies, etc than AI vs a human developer. And AI isn't the real fix.

Are we just repeating things 100x a day or is it still so chaotic and immature? Or are we implying that AI is at a point where it's writing Google Spanner from scratch and you're able to review and confirm it passes transactional tests?


> This speaks more about the problems at FAANG

Right - "most of my work can be done by Sonnet 3.5" doesn't exactly conjure up an image of a high level or challenging job. It seems the challenge with FAANG companies is getting hired, not the actual work most people do there.


We went from "it's useless because..." - "it outputs gibberish" to "it just copypastes" to "it only works for simple things" to "it can't make Google Spanner from scratch".

> We went from

None of the above.

This isn't about how "smart" AI is.

1. Let's assume it was smart and can update a field spanning 1000s of microservices to deliver this new feature. Is this really something you should celebrate? I'd say no. At this point there should have been better tooling and infrastructure in place.

2. Is there really infinite CRUD to add after >10 years? In the same organization where you need hundreds of developers all the time? Ones where you'd ignore code reviews and "just accept its changes"? Whether I write the code or my colleagues do, I'd have a meaningful discussion about the proposed changes and their impact, and most likely suggest changes, because nothing is perfect.

So again, it's about the environment, the organization or at least this individual case where coding isn't just about adding some lines to a file. And that's with AI or not.


Find harder problems to solve.

I can easily make Claude freak out and run into limits. Claude is amazing, but it only works at the abstraction level you ask of it, so if you ask it to write code to solve a problem it'll only solve that immediate problem; it doesn't have awareness of any larger refactorings or design improvements that could change what solution is even possible.


Don't you still have to explain your requirement really well to it, in a lot of detail? In a terse language like Python, I might as well just write the code. In a verbose language like Java, perhaps there is more of a value in detailing the requirement.

It depends on what you're doing.

If you're writing something specific to your particular problem, or thinking through how to structure your data, or even working on something tough to describe in words like UI design, it probably is easier to just code it yourself in most high-level languages. On the other hand, if you're just trying to get a framework or library to do something and you don't want to spend a bunch of time reading the docs to remember the special incantations needed to just make it do the thing you already know it can do, the AI speeds things up considerably.


An abstraction machete. Heh.

This is a wonderful term for it!

Not really, most of the changes are straightforward. Also, a lot of the time it writes better syntax than I would. Sometimes I write a bunch of pseudocode and have it fill in the details, then write the tests

> Not really

How on earth are you conveying your intent to the model? Or is your intent so CRUDdy that it doesn't need to be conveyed?


I use the same workflow. It’s taking a while for me to learn to sense when it’s getting off track and I need to start a new chat session. In general it’s pretty amazing if given very clear guidance at the right moments.

How would you characterize the type of applications/code you are working on? Can you give an example? How much of your work is architecture/design (software engineering), and how much is more like grunt work or systems integration, just coding stuff up?

I think SaaS startups with a Harvard founder and 5 million are going to crush it in the world you describe. The marginal cost of building decreases, but brands, trust, and reach do not follow the same scaling laws.

Access to capital and pedigree are still going to be a big plus.


I dunno man. I just spent a couple hours trying to get it to write functioning code to read from my RTSP stream, detect if my kid is playing piano, and send the result to HomeAssistant. It did not succeed.

How many hours without it?

Not the OP, but in my experience LLMs fail in ways that indicate they will never solve the problem.

They get stuck in loops, correct their mistakes with worse mistakes, hallucinate things that don’t exist, and are unable to correct course.

Working on my own, I have the confidence that I know I can make incremental forward progress on a problem. That’s much preferable.


But when working with an LLM you can still contribute.

What data shortage problem? I'm not convinced that a shortage of data is the problem with current generation LLMs. This isn't like robotics where every robot is unique and you had to historically start from scratch every time you changed to a different robot. It's more likely that we are running into some sort of generalization bottleneck, because the training process is operating without feedback on the information/semantic level. There is no loss function for "does the code compile?". Instead, the loss function checks "does the output conform to the dataset?".
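
A toy illustration of that distinction (nothing to do with how any lab actually trains; the functions and inputs below are made up for the example):

    import math

    # "Does the output conform to the dataset?" - the usual next-token objective:
    # negative log-likelihood of the reference text under the model's predictions.
    def imitation_loss(token_probs, reference_tokens):
        return -sum(math.log(step[tok]) for step, tok in zip(token_probs, reference_tokens))

    # "Does the code compile?" - an outcome-based signal that could serve as an RL reward.
    def compiles_reward(generated_code: str) -> float:
        try:
            compile(generated_code, "<generated>", "exec")  # syntax check only
            return 1.0
        except SyntaxError:
            return 0.0

    probs = [{"return": 0.8, "print": 0.2}, {"x": 0.9, "y": 0.1}]
    print(imitation_loss(probs, ["return", "x"]))       # low loss: output matches the data
    print(compiles_reward("def f(x): return x + 1"))    # 1.0
    print(compiles_reward("def f(x) return x + 1"))     # 0.0 (missing colon)

The first signal never checks whether the result works; the second never checks whether it matches the training text, which is roughly the feedback gap I mean.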

Which will mean...there is going to be a lot more software?

A lot more broken software. Companies release broken software intentionally just to be quick to market. Now imagine the same, but the "engineers" literally cannot make the product better even if they wanted to. They never learned to code properly, so they can't tell whether the code is good.

Probably yeah

a properly written requirements spec is something that doesn't exist in the vast majority of cases.

Such statements are made by management folks who don't code, and somehow think coding can be hand-waved away.

Sure, this tool will improve the productivity of software engineers, but so did the compiler, which came along 50 years ago.



