The o3 High (tuned) model scored 88% at what looks like $6,000/task haha I think ...

cchance · 2024-12-20T18:31:21 1734719481

Isn't that generally what ... all jobs are? Automation cost vs. long-term human cost... it's why Amazon did the weird "our stores are AI driven" thing, when in reality it was cheaper to hire a bunch of guys in a sweatshop to look at the cameras and write things down lol.

The thing is, given what we've seen from distillation and tech, even if it's $6,000/task... that will come down drastically over time through optimization and just... faster, more efficient processing hardware and software.

cryptoegorophy · 2024-12-20T18:44:43 1734720283

I remember hearing about Tesla trying to automate all of production, but some things just couldn’t be automated, like the wiring, which humans still had to do.

Benjaminsen · 2024-12-20T18:43:39 1734720219

Compute costs for AI with roughly the same capabilities have been halving every ~7 months.

That makes something like this competitive in ~3 years.
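A quick check of the ~3 year estimate: if cost per task halves every 7 months, it falls by a factor of 2^(36/7) ≈ 35 over 36 months, so ~$6,000/task would land somewhere around $170/task. The sketch below only encodes that assumption (a constant 7-month halving time, starting from the $6,000 figure quoted upthread); it is arithmetic, not a forecast.

```python
def projected_cost(initial_cost: float, months: float, halving_months: float = 7.0) -> float:
    """Cost after `months`, assuming it halves every `halving_months` months."""
    return initial_cost * 0.5 ** (months / halving_months)


if __name__ == "__main__":
    start = 6000.0  # $/task figure quoted upthread (assumption: taken at face value)
    for years in (1, 2, 3):
        print(f"after {years} year(s): ${projected_cost(start, 12 * years):,.0f}/task")
```

Whether ~$170/task counts as "competitive" depends on what a human costs per task, which the halving rate alone doesn't tell you.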

seizethecheese · 2024-12-20T22:59:19 1734735559

And human costs have been increasing a few percent per year for a few centuries!
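Both trends push the crossover earlier. A minimal sketch, assuming AI cost per task halves every 7 months and human cost per task grows at a constant annual rate g; the starting costs are hypothetical inputs, not figures from the thread. Setting c_ai · 2^(-t/7) = c_h · (1+g)^(t/12) and taking logs gives t = ln(c_ai / c_h) / (ln 2 / 7 + ln(1+g) / 12) months.

```python
import math


def crossover_months(ai_cost: float, human_cost: float,
                     halving_months: float = 7.0, human_growth: float = 0.03) -> float:
    """Months until AI cost/task drops to human cost/task.

    Assumes AI cost decays as 2**(-t / halving_months) and human cost grows
    as (1 + human_growth)**(t / 12), with t measured in months.
    """
    if ai_cost <= human_cost:
        return 0.0  # already competitive
    ai_decay = math.log(2) / halving_months        # monthly log-decline of AI cost
    human_rise = math.log1p(human_growth) / 12.0   # monthly log-growth of human cost
    return math.log(ai_cost / human_cost) / (ai_decay + human_rise)


if __name__ == "__main__":
    # Hypothetical inputs: $6,000/task for the model vs. $100/task for a human.
    months = crossover_months(6000.0, 100.0)
    print(f"~{months / 12:.1f} years")
```

A few percent a year of wage growth is small next to a 7-month halving, so the AI-side trend does most of the work here.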

jsheard · 2024-12-20T18:32:37 1734719557

That's the elephant in the room with the reasoning/CoT approach: it shifts what was previously a scaling of training costs into a scaling of both training and inference costs. The promise of doing expensive training once and then running the model cheaply forever falls apart once you're burning tens, hundreds or thousands of dollars' worth of compute every time you run a query.
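As a formula, per-query cost is roughly training_cost / num_queries + inference_cost. A minimal sketch with made-up numbers (none of them are OpenAI figures): the first term amortizes toward zero as usage grows, the second never does, so once inference costs dollars per query it dominates no matter how many queries you serve.

```python
def cost_per_query(training_cost: float, inference_cost: float, num_queries: float) -> float:
    """Amortized training cost plus the per-query inference cost."""
    return training_cost / num_queries + inference_cost


if __name__ == "__main__":
    training = 100e6  # hypothetical one-off training spend, $
    for inference in (0.01, 20.0, 3000.0):  # hypothetical $/query
        for n in (1e6, 1e9):
            per_query = cost_per_query(training, inference, n)
            print(f"inference ${inference:>7}: {n:.0e} queries -> ${per_query:,.2f}/query")
```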

Workaccount2 · 2024-12-20T21:02:24 1734728544

They're gonna figure it out. Something is being missed somewhere, as human brains can do all this computation on 20 watts. Maybe it will be a hardware shift or maybe just a software one, but I strongly suspect that modern transformers are grossly inefficient.

Legend2440 · 2024-12-20T18:40:09 1734720009

Yeah, but next year they'll come out with a faster GPU, and the year after that another still faster one, and so on. Compute costs are a temporary problem.

freehorse · 2024-12-20T19:21:48 1734722508

The issue is not just scaling compute, but scaling it at a rate that keeps up with the increase in complexity of the problems that are not currently solved. If that is O(n), then what you say probably holds. If it is e.g. O(n^8) or exponential, then there is no hope of getting good enough scaling just by increasing compute at a normal rate. AI technology will then still be improving, but slowing to a halt, practically stagnating.
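To make the O(n) vs. O(n^8) vs. exponential point concrete: with a compute budget B, the largest solvable instance is n = B for linear cost, n = B^(1/8) for n^8 cost, and n = log2(B) for 2^n cost. Multiplying compute by 1000 therefore multiplies n by 1000 in the first case, by 1000^(1/8) ≈ 2.37 in the second, and only adds log2(1000) ≈ 10 in the third. The sketch assumes cost is exactly those functions of n, ignoring constant factors.

```python
import math


def max_size_poly(budget: float, degree: float) -> float:
    """Largest n with n**degree <= budget (constant factors ignored)."""
    return budget ** (1.0 / degree)


def max_size_exp(budget: float) -> float:
    """Largest n with 2**n <= budget."""
    return math.log2(budget)


if __name__ == "__main__":
    for budget in (1e12, 1e15):  # arbitrary compute units; the second is 1000x the first
        print(f"B={budget:.0e}: O(n) -> {max_size_poly(budget, 1):.3g}, "
              f"O(n^8) -> {max_size_poly(budget, 8):.3g}, "
              f"O(2^n) -> {max_size_exp(budget):.3g}")
```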

o3 will be interesting if it indeed offers a novel technology for problem solving, something that is able to learn efficiently from a few novel examples and adapt. That's what intelligence actually is. Maybe this is the case. If, on the other hand, it is a smart way to pair CoT with an evaluation loop (as the author hints is a possibility), then it is probable that, while this _can_ handle a class of problems that current LLMs cannot, it is not really this kind of learning, meaning that it will not be able to scale to more complex, real-world tasks whose problem space is too large and thus less amenable to such a technique. It is still interesting, because having a good enough evaluator may be a very important step, but it would mean that we are not there yet.

We will learn soon enough I suppose.

og_kalu · 2024-12-20T19:00:07 1734721207

It's not $6,000/task (i.e. per question). $6,000 is about the retail cost of evaluating the entire benchmark in high-efficiency mode (about 400 questions).

Tiberium · 2024-12-20T19:13:54 1734722034

From reading the blog post and Twitter, and from the costs of other models, I think it's evident that it IS actually cost per task; see this tweet: https://files.catbox.moe/z1n8dc.jpg

And o1 cost $15/$60 for 1M in/out, so the estimated costs on the graph would match for a single task, not the whole benchmark.
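The back-of-envelope behind this, assuming (hypothetically) that o3 is billed at o1's $15/$60 per 1M input/output tokens: $6,000 buys about 100M output tokens, while $17 buys about 283k. Spread over the ~400 questions mentioned above, $6,000 for the whole benchmark would be only ~$15 per question, so which reading is right comes down to how many tokens a single task plausibly consumes.

```python
IN_PRICE_PER_M = 15.0   # o1 list price, $ per 1M input tokens (assumed to apply to o3)
OUT_PRICE_PER_M = 60.0  # o1 list price, $ per 1M output tokens (assumed to apply to o3)


def api_cost(input_tokens: float, output_tokens: float) -> float:
    """Dollar cost of one request at o1 list prices."""
    return input_tokens / 1e6 * IN_PRICE_PER_M + output_tokens / 1e6 * OUT_PRICE_PER_M


def output_tokens_for(budget: float) -> float:
    """Output tokens a dollar budget buys, ignoring input tokens."""
    return budget / OUT_PRICE_PER_M * 1e6


if __name__ == "__main__":
    for budget in (17.0, 6000.0):
        print(f"${budget:,.0f} ~ {output_tokens_for(budget):,.0f} output tokens")
    print(f"$6,000 over 400 questions ~ ${6000 / 400:.2f}/question")
```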

slibhb · 2024-12-20T19:18:52 1734722332

The blog clarifies that it's $17-20 per task. Maybe it runs into thousands for tasks it can't solve?

Tiberium · 2024-12-20T20:50:57 1734727857

That cost is for o3 low; o3 high goes into thousands per task.

freehorse · 2024-12-20T18:53:44 1734720824

This makes me wonder whether the solution consists of a "solver" trying semi-random or more targeted things and a "checker" checking them. Usually checking a solution is cognitively (and computationally) easier than coming up with it. Otherwise I cannot think what sort of compute would burn $6,000 per task, unless you are going through a lot of loops and have somehow solved the part of the problem that figures out whether a solution is correct, while coming up with the actual correct solution is not yet solved to the same degree. Or maybe I am just naive and these prices are like breakfast for companies like that.
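A toy version of the solver/checker idea, purely to illustrate the speculation (not a claim about how o3 works): the "solver" enumerates small programs built from a few hypothetical grid primitives, the "checker" runs each candidate on the demonstration pairs, and the first one that fits is applied to the test input. Each check is one cheap pass; the cost comes from how many candidates have to be tried, which is where a budget could balloon.

```python
from typing import Callable, Iterator, List, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]


# Hypothetical primitive transforms a solver might compose.
def flip_h(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_v(g: Grid) -> Grid:
    return g[::-1]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


def rot90(g: Grid) -> Grid:
    return flip_h(transpose(g))  # clockwise rotation


PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose, "rot90": rot90}


def candidates() -> Iterator[Tuple[Tuple[str, ...], Program]]:
    """'Solver': propose programs of length 1, then length 2."""
    for name, f in PRIMITIVES.items():
        yield (name,), f
    for n1, f1 in PRIMITIVES.items():
        for n2, f2 in PRIMITIVES.items():
            yield (n1, n2), (lambda g, f1=f1, f2=f2: f2(f1(g)))


def check(program: Program, examples: List[Tuple[Grid, Grid]]) -> bool:
    """'Checker': cheap verification against the demonstration pairs."""
    return all(program(x) == y for x, y in examples)


def solve(examples: List[Tuple[Grid, Grid]], test_input: Grid):
    """Return (program name, predicted output, number of candidates tried)."""
    attempts = 0
    for name, program in candidates():
        attempts += 1
        if check(program, examples):
            return name, program(test_input), attempts
    return None, None, attempts


if __name__ == "__main__":
    examples = [([[1, 2], [3, 4]], [[4, 3], [2, 1]])]  # a 180-degree rotation
    print(solve(examples, [[5, 6], [7, 8]]))
```

With only one demonstration pair the checker can accept a program that merely happens to fit, which is the crux of the comment: the whole approach stands or falls on how reliable the checker is.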

seydor · 2024-12-20T20:58:40 1734728320

What if we use those humans to generate energy for the tasks?

gbnwl · 2024-12-20T19:27:56 1734722876

Well they got 75.7% at $17/task. Did you see that?

redeux · 2024-12-20T18:32:44 1734719564

Time and availability would also be factors.

dyauspitr · 2024-12-20T18:31:44 1734719504

Compute can get optimized and cheap quickly.

karmasimida · 2024-12-20T19:36:07 1734723367

Is it? Moore’s law is dead, dead. I don’t think this is a given.