
The o3 high (tuned) model scored 88% at what looks like $6,000/task haha

I think soon we'll be pricing any kind of task by its compute cost. So basically: human = $50/task, AI = $6,000/task, use the human. If the AI beats the human, use the AI? Of course that's assuming both get 100% scores on the task
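A minimal sketch of that decision rule (the $50, $6,000, and 88% figures are from this thread; everything else is made up), extended to account for accuracy since neither side actually scores 100%:

    def expected_cost_per_success(cost_per_attempt, success_rate):
        # With success probability p you expect 1/p attempts,
        # so the expected cost per *successful* task is cost / p.
        return cost_per_attempt / success_rate

    human = expected_cost_per_success(50.0, 1.00)    # $50/task, assumed perfect
    ai = expected_cost_per_success(6000.0, 0.88)     # $6,000/task at 88%
    print("use", "human" if human <= ai else "AI")   # -> use human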




Isn't that generally what... all jobs are? Automation cost vs. long-term human cost... it's why Amazon did the weird "our stores are AI driven" thing, when in reality it was cheaper to hire a bunch of guys in a sweatshop to look at the cameras and write things down lol.

The thing is, given what we've seen from distillation and the tech, even if it's $6,000/task... that will come down drastically over time through optimization and just... faster, more efficient hardware and software.


I remember hearing about Tesla trying to automate all of production, but some things just couldn't be automated, like the wiring, which humans still had to do.


Compute costs for AI at roughly the same capability level have been halving every ~7 months.

That makes something like this competitive in ~3 years
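A quick back-of-the-envelope (assuming the ~7-month halving actually holds, and using the $6,000 and $50 figures from upthread):

    import math

    ai_cost, human_cost, halving_months = 6000.0, 50.0, 7.0

    # Months until the AI cost falls below the human cost,
    # if it keeps halving every `halving_months`:
    months = halving_months * math.log2(ai_cost / human_cost)
    print(f"~{months:.0f} months")   # -> ~48 months at $50/task

So closer to four years against a $50/task human, fewer if the human cost being matched is higher.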


And human costs have been increasing a few percent per year for a few centuries!


That's the elephant in the room with the reasoning/CoT approach: it shifts what was previously a scaling of training costs into a scaling of both training and inference costs. The promise of doing the expensive training once and then running the model cheaply forever falls apart once you're burning tens, hundreds, or thousands of dollars' worth of compute every time you run a query.
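A toy amortization model of that shift (every number here is invented for illustration):

    def cost_per_query(training_cost, n_queries, inference_cost):
        # Training amortizes over query volume; inference never does.
        return training_cost / n_queries + inference_cost

    # Classic LLM: huge one-off training, cheap queries.
    print(cost_per_query(100e6, 1e9, 0.01))   # -> ~$0.11 per query
    # Reasoning model: same training, expensive test-time compute.
    print(cost_per_query(100e6, 1e9, 50.0))   # -> ~$50.10, inference dominates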


They're gonna figure it out. Something is being missed somewhere, as human brains can do all this computation on 20 watts. Maybe it will be a hardware shift or maybe just a software one, but I strongly suspect that modern transformers are grossly inefficient.


Yeah, but next year they'll come out with a faster GPU, and the year after that another still faster one, and so on. Compute costs are a temporary problem.


The issue is not just scaling compute, but scaling it at a rate that meets the increase in complexity of the problems that are not currently solved. If that is O(n), then what you say probably stands. If it is, e.g., O(n^8) or exponential, then there is no hope of getting good enough scaling just by increasing compute at a normal rate. AI technology would still be improving, but improving to a halt, practically stagnating.
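To make the difference concrete, a toy calculation (the doubling period and exponents are arbitrary choices, not measurements):

    # Suppose available compute doubles every ~2 years.
    # How much "harder" a problem can we afford after a decade?
    budget_growth = 2 ** (10 / 2)             # 32x more compute in 10 years

    linear = budget_growth ** (1 / 1)         # cost ~ O(n):   32x harder problems
    degree_eight = budget_growth ** (1 / 8)   # cost ~ O(n^8): ~1.5x harder
    print(linear, degree_eight)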

o3 will be interesting if it indeed offers a novel technology for problem solving, something that is able to learn efficiently from a few novel examples and adapt. That's what intelligence actually is. Maybe this is the case. If, on the other hand, it is a smart way to pair CoT with an evaluation loop (as the author hints is a possibility), then it is probable that, while this _can_ handle a class of problems that current LLMs cannot, it is not really that kind of learning, meaning it will not scale to more complex, real-world tasks whose problem space is too large and thus less amenable to such a technique. It is still interesting, because having a good enough evaluator may be a very important step, but it would mean that we are not there yet.

We will learn soon enough I suppose.


It's not $6,000/task (i.e., per question). $6,000 is roughly the retail cost of evaluating the entire benchmark on high efficiency (about 400 questions).


From reading the blog post and Twitter, and from the cost of other models, I think it's evident that it IS actually the cost per task; see this tweet: https://files.catbox.moe/z1n8dc.jpg

And o1 costs $15/$60 per 1M tokens in/out, so the estimated costs on the graph would match a single task, not the whole benchmark.
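A rough sanity check of that (only the $15/$60-per-1M prices come from the comment above; the token counts are pure guesses):

    in_price, out_price = 15 / 1e6, 60 / 1e6    # o1 $/token, in and out

    # Hypothetical single task: short prompt, enormous chain of thought.
    prompt_tokens, reasoning_tokens = 5_000, 50_000_000
    cost = prompt_tokens * in_price + reasoning_tokens * out_price
    print(f"${cost:,.0f}")   # -> ~$3,000; thousands per task is plausible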


The blog clarifies that it's $17-20 per task. Maybe it runs into thousands for tasks it can't solve?


That cost is for o3 low; o3 high goes into the thousands per task.


This makes me wonder and speculate: does the solution consist of a "solver" trying semi-random or more targeted things and a "checker" verifying them? Usually checking a solution is cognitively (and computationally) easier than coming up with one. Otherwise I can't think what sort of compute would burn $6,000 per task, unless you are going through a lot of loops and have somehow solved the part of the problem that can figure out whether a solution is correct, while coming up with the actual correct solution is not yet solved to the same degree. Or maybe I am just naive and these prices are just breakfast money for companies like that.
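A minimal sketch of the kind of generate-and-check loop being speculated about here (the propose/verify stand-ins are toys, not anything confirmed about o3):

    import random

    def solve_by_sampling(task, propose, verify, budget=100_000):
        # Burn compute drawing candidates until one passes the checker.
        for _ in range(budget):
            candidate = propose(task)       # expensive: e.g. one full CoT rollout
            if verify(task, candidate):     # cheap(er): just check the answer
                return candidate
        return None                         # budget exhausted, task unsolved

    # Toy stand-ins: guess a hidden number; verification is trivial.
    secret = 4217
    print(solve_by_sampling(secret,
                            propose=lambda t: random.randrange(10_000),
                            verify=lambda t, c: c == t))

Note that the unsolved tasks are the expensive ones: the loop runs to its full budget before giving up, which would square with a cheap per-task average but thousands of dollars burned on the hard ones.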


What if we use those humans to generate energy for the tasks?


Well they got 75.7% at $17/task. Did you see that?


Time and availability would also be factors.


Compute can get optimized and cheap quickly.


Is it? Moore's law is dead dead; I don't think this is a given.



