
> Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.

"Maximal test-time compute" is the maximum budget of time/compute spent doing the "chain of thought" "reasoning". So that's the setting these results are based on.

The caveat is that their graphs show that each linear increase in test-time performance requires an exponential increase in (wall-clock) time and compute.

So there is a potentially interesting play here. They can honestly boast these amazing results (it's the same model, after all), yet the actual product may be given an order of magnitude less "test time" and not be as good.
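To make that concrete, a toy sketch in Python. The constants are invented for illustration; only the functional form (accuracy growing with the log of compute, i.e. exponentially more compute per fixed gain) is what their plots suggest:

    import math

    A0 = 0.40     # invented baseline accuracy
    SLOPE = 0.10  # invented accuracy gain per 10x more test-time compute

    def accuracy(compute: float) -> float:
        """Toy model: accuracy = A0 + SLOPE * log10(compute)."""
        return A0 + SLOPE * math.log10(compute)

    for compute in (1, 10, 100, 1000):
        print(f"compute {compute:>5}x -> accuracy ~{accuracy(compute):.0%}")
    # Each fixed +10% of accuracy costs another 10x of compute.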

I interpreted it to suggest that the product might include a user-facing “maximum test time” knob.

Generating problem sets for kids? You might only need or want a basic level of introspection, even though you like the flavor of this model’s personality over that of its predecessors.

Problem worth thinking long, hard, and expensively about? Turn that knob up to 11, and you’ll get a better-quality answer with no human-in-the-loop coaching or trial-and-error involved. You’ll just get your answer in timeframes closer to human ones, consuming more (metered) tokens along the way.
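Something like this, maybe (all names here are hypothetical; I'm not aware of any provider exposing exactly this API):

    from dataclasses import dataclass

    @dataclass
    class ReasoningConfig:
        # Hypothetical user-facing knob: 1 = quick answer, 11 = think
        # long, hard, and expensively.
        effort: int = 1

        @property
        def thinking_token_budget(self) -> int:
            # One plausible policy: each +1 of effort doubles the hidden
            # chain-of-thought budget, which is metered and billed.
            return 512 * 2 ** (self.effort - 1)

    for effort in (1, 5, 11):
        cfg = ReasoningConfig(effort=effort)
        print(f"effort {effort:>2}: up to {cfg.thinking_token_budget:,} thinking tokens")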


Yeah, I think this is the goal. Remember, there are some problems that only need to be solved correctly once! Imagine something like a Millennium Prize problem: you'd be willing to wait a pretty long time for a proof of the Riemann Hypothesis!

This power-law behavior of test-time improvement seems to be pretty ubiquitous now. In "More Agents Is All You Need" [1], they start to see it as a function of ensemble size. It also shows up in "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" [2].

I sorta wish everyone would plot accuracy on a logit y-axis rather than a linear 0-100% one (including the OpenAI post), to help show the power-law behavior. This is especially important when talking about incremental gains in the ~90->95% and 95->99% ranges. When the values are between 20% and 80% (like in the OpenAI post), logit and linear look pretty similar, so you can still "see" the inference power law either way.

[1] https://arxiv.org/abs/2402.05120
[2] https://arxiv.org/abs/2407.21787
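For anyone curious what that looks like, here's a small matplotlib sketch. The data is synthetic (generated to follow logit(acc) = 0.5*ln(samples) - 1, purely for illustration); the point is that the logit axis turns the power law into a straight line and keeps 95% visually distinct from 99%:

    import numpy as np
    import matplotlib.pyplot as plt

    samples = np.logspace(0, 4, 20)  # 1 .. 10,000 inference samples
    # Synthetic accuracies following logit(acc) = 0.5*ln(samples) - 1
    acc = 1 / (1 + np.exp(-(0.5 * np.log(samples) - 1.0)))

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    for ax, yscale in zip(axes, ("linear", "logit")):
        ax.plot(samples, acc, marker="o")
        ax.set_xscale("log")
        ax.set_yscale(yscale)  # logit axis: straight line, 95% != 99%
        ax.set_xlabel("inference samples")
        ax.set_ylabel("accuracy")
        ax.set_title(f"{yscale} y-axis")
    plt.tight_layout()
    plt.show()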


Surprising that at run time it needs an exponential increase in thinking to achieve a linear increase in output quality. I suppose it's due to diminishing returns from adding more and more thought.

The exponential increase is presumably because of the branching factor of the tree of thoughts. Think of a binary tree whose number of leaf nodes doubles (= exponential growth) at each level.

It's not too surprising that the corresponding increase in quality is only linear: how much difference in quality would you expect between the best, say, 10-word answer to a question and the best 11-word answer? A toy sketch of the arithmetic is below.
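(The branching factor of 2 and depth-as-quality proxy are assumptions for illustration, not anything from a paper:)

    BRANCHING = 2  # assumed branching factor of the thought tree

    # Each extra level of "thinking depth" (a crude proxy for one more
    # linear step of answer quality) doubles the paths to evaluate.
    for depth in range(1, 11):
        paths = BRANCHING ** depth
        print(f"depth {depth:>2} -> {paths:>5} candidate reasoning paths")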

It'll be interesting to see what they charge for this. An exponential increase in thinking time means an exponential increase in FLOPs, and therefore dollars.
