
> Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.

"Maximal test-time compute" is the maximum budget of time/compute spent doing the "chain of thought" "reasoning". So that's the setting these results are based on.

The caveat is that their graphs show that each linear increase in test-time performance requires an exponential increase in (wall-clock) time and compute.

So there is a potentially interesting play here. They can honestly boast these amazing results (it's the same model, after all), yet the actual product may be given an order of magnitude less "test time" and not be as good.
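To make that concrete, a toy sketch in Python. The constants are invented for illustration; only the functional form (accuracy growing with the log of compute, i.e. exponentially more compute per fixed gain) is what their plots suggest:

    import math

    A0 = 0.40     # invented baseline accuracy
    SLOPE = 0.10  # invented accuracy gain per 10x more test-time compute

    def accuracy(compute: float) -> float:
        """Toy model: accuracy = A0 + SLOPE * log10(compute)."""
        return A0 + SLOPE * math.log10(compute)

    for compute in (1, 10, 100, 1000):
        print(f"compute {compute:>5}x -> accuracy ~{accuracy(compute):.0%}")
    # Each fixed +10% of accuracy costs another 10x of compute.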

I interpreted it to suggest that the product might include a user-facing “maximum test time” knob.

Generating problem sets for kids? You might only need or want a basic level of introspection, even though you like the flavor of this model’s personality over that of its predecessors.

Problem worth thinking long, hard, and expensively about? Turn that knob up to 11, and you’ll get a better-quality answer with no human-in-the-loop coaching or trial-and-error involved. You’ll just get your answer in timeframes closer to human ones, consuming more (metered) tokens along the way.
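Something like this, maybe (all names here are hypothetical; I'm not aware of any provider exposing exactly this API):

    from dataclasses import dataclass

    @dataclass
    class ReasoningConfig:
        # Hypothetical user-facing knob: 1 = quick answer, 11 = think
        # long, hard, and expensively.
        effort: int = 1

        @property
        def thinking_token_budget(self) -> int:
            # One plausible policy: each +1 of effort doubles the hidden
            # chain-of-thought budget, which is metered and billed.
            return 512 * 2 ** (self.effort - 1)

    for effort in (1, 5, 11):
        cfg = ReasoningConfig(effort=effort)
        print(f"effort {effort:>2}: up to {cfg.thinking_token_budget:,} thinking tokens")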


Yeah, I think this is the goal. Remember, there are some problems that only need to be solved correctly once! Imagine something like a Millennium Prize problem: you'd be willing to wait a pretty long time for a proof of the Riemann Hypothesis!

This power-law behavior of test-time improvement seems to be pretty ubiquitous now. In "More Agents Is All You Need" [1], they start to see it as a function of ensemble size. It also shows up in "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" [2].

I sorta wish everyone would plot accuracy on a logit y-axis rather than a linear 0-100% one (including the OpenAI post), to help show the power-law behavior. This is especially important when talking about incremental gains in the ~90->95% and 95->99% ranges. When the values are between 20% and 80% (like in the OpenAI post), logit and linear look pretty similar, so you can still "see" the inference power law either way.

[1] https://arxiv.org/abs/2402.05120
[2] https://arxiv.org/abs/2407.21787
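For anyone curious what that looks like, here's a small matplotlib sketch. The data is synthetic (generated to follow logit(acc) = 0.5*ln(samples) - 1, purely for illustration); the point is that the logit axis turns the power law into a straight line and keeps 95% visually distinct from 99%:

    import numpy as np
    import matplotlib.pyplot as plt

    samples = np.logspace(0, 4, 20)  # 1 .. 10,000 inference samples
    # Synthetic accuracies following logit(acc) = 0.5*ln(samples) - 1
    acc = 1 / (1 + np.exp(-(0.5 * np.log(samples) - 1.0)))

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    for ax, yscale in zip(axes, ("linear", "logit")):
        ax.plot(samples, acc, marker="o")
        ax.set_xscale("log")
        ax.set_yscale(yscale)  # logit axis: straight line, 95% != 99%
        ax.set_xlabel("inference samples")
        ax.set_ylabel("accuracy")
        ax.set_title(f"{yscale} y-axis")
    plt.tight_layout()
    plt.show()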


Surprising that at run time it needs an exponential increase in thinking to achieve a linear increase in output quality. I suppose it's due to diminishing returns from adding more and more thought.

The exponential increase is presumably because of the branching factor of the tree of thoughts. Think of a binary tree whose number of leaf nodes doubles (= exponential growth) at each level.

It's not too surprising that the corresponding increase in quality is only linear: how much difference in quality would you expect between the best, say, 10-word answer to a question and the best 11-word answer? A toy sketch of the arithmetic is below.
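(The branching factor of 2 and depth-as-quality proxy are assumptions for illustration, not anything from a paper:)

    BRANCHING = 2  # assumed branching factor of the thought tree

    # Each extra level of "thinking depth" (a crude proxy for one more
    # linear step of answer quality) doubles the paths to evaluate.
    for depth in range(1, 11):
        paths = BRANCHING ** depth
        print(f"depth {depth:>2} -> {paths:>5} candidate reasoning paths")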

It'll be interesting to see what they charge for this. An exponential increase in thinking time means an exponential increase in FLOPs, and therefore dollars.
