The o3 High (tuned) model scored 88% at what looks like $6,000/task haha
I think soon we'll be pricing any kind of task by its compute cost. So basically, human = $50/task, AI = $6,000/task, use human. If AI beats human on price, use AI? Ofc that's assuming both get 100% scores on the task.
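Rough sketch of that decision rule, for when the scores aren't both 100%. Big assumption on my part: a failed attempt just gets retried until it succeeds, so the expected cost of one finished task is cost per attempt divided by success rate. The function names are made up; the $50, $6,000 and 88% are the numbers from this thread.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected cost of one successful task, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def cheaper_worker(human_cost: float, human_rate: float,
                   ai_cost: float, ai_rate: float) -> str:
    """Return 'human' or 'ai', whichever has the lower expected cost per success."""
    human = cost_per_success(human_cost, human_rate)
    ai = cost_per_success(ai_cost, ai_rate)
    return "human" if human <= ai else "ai"


if __name__ == "__main__":
    # Thread numbers: human $50/task at 100%, AI $6,000/task at 88%.
    print(cheaper_worker(50, 1.0, 6000, 0.88))
```

With the thread's numbers the human wins easily; the answer only flips once the AI's cost per attempt drops below roughly the human's cost times the AI's success rate.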
Isn't that generally what ... all jobs are? Automation cost vs. long-term human cost... it's why Amazon did the weird "our stores are AI driven" thing, when in reality it was cheaper to hire a bunch of guys in a sweatshop to look at the cameras and write things down lol.
The thing is, given what we've seen from distillation and tech, even if it's $6,000/task... that will come down drastically over time through optimization and just... faster, more efficient processing hardware and software.
That's the elephant in the room with the reasoning/CoT approach: it shifts what was previously a scaling of training costs into a scaling of both training and inference costs. The promise of doing expensive training once and then running the model cheaply forever falls apart once you're burning tens, hundreds or thousands of dollars' worth of compute every time you run a query.
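To make the amortization point concrete, a minimal sketch. All the dollar figures here are hypothetical placeholders I picked for illustration, not real numbers for any model: the one-time training cost gets spread over every query, but the inference cost is paid in full each time.

```python
def cost_per_query(training_cost: float, inference_cost: float,
                   num_queries: int) -> float:
    """Average cost per query: amortized training plus per-query inference."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return training_cost / num_queries + inference_cost


if __name__ == "__main__":
    training = 100_000_000  # hypothetical one-time training bill, in dollars
    for queries in (1_000, 1_000_000, 1_000_000_000):
        cheap = cost_per_query(training, 0.01, queries)      # hypothetical cheap inference
        pricey = cost_per_query(training, 1_000.0, queries)  # hypothetical heavy reasoning
        print(f"{queries:>13,} queries: ${cheap:,.2f} vs ${pricey:,.2f} per query")
```

The training term shrinks toward zero as the query count grows, so with cheap inference the average keeps falling. With expensive inference it bottoms out at the per-query price no matter how much you use the model.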
They're gonna figure it out. Something is being missed somewhere, as human brains can do all this computation on 20 watts. Maybe it will be a hardware shift or maybe just a software one, but I strongly suspect that modern transformers are grossly inefficient.
Yeah, but next year they'll come out with a faster GPU, and the year after that another still faster one, and so on. Compute costs are a temporary problem.
The issue is not just scaling compute, but scaling it at a rate that keeps up with the growing complexity of the problems that are not yet solved. If the compute needed grows as O(n) in problem complexity, then what you say probably stands. If it grows as, e.g., O(n^8) or exponentially, then there is no hope of getting good enough scaling just by increasing compute at a normal rate. AI technology would still be improving, but slowing toward a halt, practically stagnating.
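To put rough numbers on that, a back-of-the-envelope sketch. The yearly doubling of available compute is purely my assumption for illustration, and years_to_scale is a made-up helper. If the compute needed grows as n^k, then handling problems s times larger takes s^k times the compute, i.e. k * log(s) / log(yearly speedup) years of hardware progress.

```python
import math


def years_to_scale(size_factor: float, exponent: float,
                   yearly_speedup: float = 2.0) -> float:
    """Years of hardware progress needed to handle problems `size_factor`
    times larger, if required compute grows as size ** exponent and
    available compute multiplies by `yearly_speedup` each year (assumed)."""
    if size_factor <= 0 or yearly_speedup <= 1:
        raise ValueError("need size_factor > 0 and yearly_speedup > 1")
    return exponent * math.log(size_factor) / math.log(yearly_speedup)


if __name__ == "__main__":
    for k in (1, 2, 8):
        years = years_to_scale(10, k)
        print(f"O(n^{k}): {years:.1f} years to handle 10x larger problems")
```

For the exponential case it is worse: if compute grows as 2^n, each yearly doubling buys exactly one more unit of n, so you never get a multiplicative jump in problem size at all.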
o3 will be interesting if it indeed offers a novel technology for problem solving, something that is able to learn efficiently from a few novel examples and adapt. That's what intelligence actually is. Maybe this is the case. If, on the other hand, it is a smart way to pair CoT with an evaluation loop (as the author hints is a possibility), then it is probable that, while this _can_ handle a class of problems that current LLMs cannot, it is not really this kind of learning, meaning that it will not be able to scale to more complex, real-world tasks whose problem space is too large and thus less amenable to such a technique. It is still interesting, because having a good enough evaluator may be a very important step, but it would mean that we are not there yet.
From reading the blog post and Twitter, and looking at the cost of other models, I think it's evident that it IS actually cost per task, see this tweet: https://files.catbox.moe/z1n8dc.jpg
And o1 costs $15/$60 per 1M input/output tokens, so the estimated costs on the graph would match a single task, not the whole benchmark.
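Quick sanity check on what that implies. Assumption flagged loudly: this prices o3 tokens at o1's $60 per 1M output tokens, which nobody here has confirmed, and it ignores input tokens entirely. The helper name is made up.

```python
def tokens_for_budget(budget_usd: float, price_per_million_usd: float) -> float:
    """Number of tokens a budget buys at a given price per 1M tokens."""
    if price_per_million_usd <= 0:
        raise ValueError("price must be positive")
    return budget_usd / price_per_million_usd * 1_000_000


if __name__ == "__main__":
    # $6,000 per task at o1's output price (assumed to apply to o3).
    tokens = tokens_for_budget(6000, 60)
    print(f"{tokens:,.0f} output tokens per task")  # 100,000,000
```

So at o1 rates, $6,000 would be on the order of 100 million output tokens for one task, which only makes sense if a lot of samples or long reasoning chains are being generated per task.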
This makes me speculate whether the solution consists of a "solver" trying semi-random or more targeted things and a "checker" checking them. Usually checking a solution is cognitively (and computationally) easier than coming up with it. Otherwise I cannot think what sort of compute would burn $6,000 per task, unless you are going through a lot of loops and have somehow solved the part of the problem that figures out whether a solution is correct, while coming up with the actual correct solution is not yet solved to the same degree. Or maybe I am just naive and these prices are like breakfast for companies like that.
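A toy version of the solver/checker loop I have in mind. To be clear, this is pure speculation about the general shape and says nothing about how o3 actually works; the factoring problem and all the names are just stand-ins I chose because checking a factor is one modulo operation while finding one takes many guesses.

```python
import random


def check(n: int, candidate: int) -> bool:
    """Cheap checker: is `candidate` a nontrivial factor of n?"""
    return 1 < candidate < n and n % candidate == 0


def solve(n: int, max_attempts: int = 10_000, seed: int = 0):
    """Expensive solver: propose random candidates until the checker accepts.

    Returns (factor, attempts_used), or (None, max_attempts) on failure.
    """
    rng = random.Random(seed)
    for attempt in range(1, max_attempts + 1):
        candidate = rng.randrange(2, n)  # semi-random proposal
        if check(n, candidate):
            return candidate, attempt
    return None, max_attempts


if __name__ == "__main__":
    factor, attempts = solve(91)  # 91 = 7 * 13
    print(f"found factor {factor} after {attempts} attempts")
```

The cost of the whole thing is dominated by how many proposals the solver burns through, which is how a single task could rack up a huge compute bill even when each individual check is cheap.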