It would be refreshing, just now and then, to see a passionate AI bro who isn’t selling AI.
You know, I get it, earn those clicks. Spin that hype. Pump that valuation.
Now, go watch people on YouTube like Armin Ronacher (just search, you’ll find him), actually streaming their entire coding practice.
This is what expert LLM usage actually looks like.
People with six terminals running Claude are a lovely bedtime story, but please, if you’re doing it, do me a favour and do some live streams of your awesomeness.
I’d really love to see it.
…but so far, the live coding sessions showing people doing this “everyday 50x engineer” practice don’t seem to exist, and that makes me a bit skeptical.
Yes, but the point of this article is surely that, if it's working on average, there would be obvious signs of it working by now.
Even if there are statistical outliers (i.e. 10x productivity using the tools), if on average it does nothing to the productivity of developers, something isn't working as promised.
We need long-running averages, and 2023-2025 is still too early to determine it's not effective. The barriers to entry in 2023 and 2024 were, I'd argue, too high for inexperienced developers to start churning out software. For seasoned developers, the skepticism was too strong and company adoption wasn't there yet (and still isn't).
We both read the article; you know as well as I do that the advice in it is to build simple, reliable systems that focus on actual problems, not imagined ones.
…but it does not say how to do that, and it offers no meaningful value to someone trying to pick the “right” point in the entire solution space: sufficiently complex and scalable to meet the requirements, but not too scalable or too complex.
There’s just some vague hand-waving about over-engineering things at Big Corp, where, ironically, scale is an issue that mandates a certain degree of complexity in many cases.
Here’s something that works better than meaningless generic advice: specific, detailed examples.
You will note the total lack of them in this article, and others like it.
Real articles with real advice are a mix of practical examples that illustrate the generic advice they’re giving.
You know why?
…because you can argue with a specific example. Generic advice with no examples is not falsifiable.
You can agree with the examples, or disagree with them; you can argue that examples support or do not support the generic advice. People can take the specific examples and adapt them as appropriate.
…but, generic advice on its own is just an opinion.
I can arbitrarily assert “100% code coverage is meaningless; there are hot paths that need heavy testing and irrelevant paths that do not require code coverage. 100% code coverage is a fool’s game that masks a lack of deeper understanding of what you should be testing”; it may sound reasonable, it may not. That’s your opinion vs. mine.
…but with some specific examples of where it is true, and perhaps, not true, you could specifically respond to it, and challenge it with counter examples.
(And indeed, you’ll see that specific examples turn up here in this comment thread as arguments against it; notably not picked up to be addressed by the OP in their Hacker News feedback section.)
I feel oddly skeptical about this article; I can't specifically argue the numbers, since I have no idea, but... there are some decent open source models; they're not state of the art, but if inference is this cheap then why aren't there multiple API providers offering models at dirt cheap prices?
The only cheap-ass providers I've seen only run tiny models. Where's my cheap deepseek-R1?
Surely if it's this cheap, and we're talking massive margins according to this, I should be able to get a cheap hosted 600B-param model, or run my own.
Am I missing something?
It seems that reality (ie. the absence of people actually doing things this cheap) is the biggest critic of this set of calculations.
> but if inference is this cheap then why aren't there multiple API providers offering models at dirt cheap prices
There are multiple API providers offering models at dirt cheap prices, enough so that there is at least one well-known API provider that is an aggregator of other API providers and offers lots of models at $0.
> The only cheap-ass providers I've seen only run tiny models. Where's my cheap deepseek-R1?
At 4-bit quant, R1 takes 300+ gigs just for weights. You can certainly run smaller models into which R1 has been distilled on a modest laptop, but I don't see how you can run R1 itself on anything that wouldn't be considered extreme for a laptop in at least one dimension.
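To make that size concrete, here's a back-of-envelope sketch (assuming the commonly cited ~671B total parameter count; KV cache and runtime overhead are ignored, and they only make things worse):

```python
# Rough memory needed just to hold DeepSeek-R1's weights at various precisions.
# 671B parameters is the commonly cited total size; this is an estimate only.

def weight_gib(params: float, bits_per_param: int) -> float:
    """GiB required for the raw weights at the given precision."""
    return params * bits_per_param / 8 / 2**30

PARAMS = 671e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_gib(PARAMS, bits):,.0f} GiB of weights")

# 16-bit: ~1,250 GiB
#  8-bit:   ~625 GiB
#  4-bit:   ~312 GiB  (consistent with "300+ gigs just for weights")
```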
There are 7 providers on that page with a higher output token price than $3.08. There is even one with a higher input token price than that. So that "all" is not true either.
> I should be able to get a cheap / run my own 600B param model.
if the margins on hosted inference are 80%, then you need > 20% utilization of whatever you build for yourself for this to be less costly to you (on margin).
i self-host open weight models (please: deepseek et al aren't open _source_) on whatever $300 GPU i bought a few years ago, but if it outputs 2 tokens/sec then i'm waiting 10 minutes for most results. if i want results in 10s instead of 10m, i'll be paying $30000 instead. if i'm prompting it 100 times during the day, then it's idle 99% of the time.
coordinating a group buy for that $30000 GPU and sharing that across 100 people probably makes more sense than either arrangement in the previous paragraph. for now, that's a big component of what model providers, uh, provide.
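for what it's worth, here's a tiny sketch of that break-even argument in code. the $3/MTok price and 80% margin are illustrative assumptions, not figures from the article:

```python
# If the provider's margin is 80%, their underlying cost is ~20% of the price
# you pay. Assume (optimistically) your own hardware is as efficient as theirs
# at full utilization; at lower utilization your effective cost scales up.

API_PRICE_PER_MTOK = 3.00   # assumed price you pay the provider, $ per 1M tokens
PROVIDER_MARGIN = 0.80      # assumed margin on hosted inference
provider_cost = API_PRICE_PER_MTOK * (1 - PROVIDER_MARGIN)  # ~$0.60/MTok

for utilization in (1.00, 0.20, 0.05, 0.01):
    self_host_cost = provider_cost / utilization
    verdict = "cheaper" if self_host_cost < API_PRICE_PER_MTOK else "not cheaper"
    print(f"{utilization:4.0%} utilized: ~${self_host_cost:7.2f}/MTok "
          f"-> self-hosting is {verdict} than the API")

# Break-even sits right at 20% utilization; a box idle 99% of the time loses badly.
```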
I also have no idea on the numbers. But I do know that these same companies are pouring many billions of dollars into training models, paying very expensive staff, and building out infrastructure. These costs would need to be factored in to come up with the actual profit margins.
There's zero basis for assuming any of that. The most likely situation is a power law curve where the vast majority of users don't use it much at all and the top 10% of users account for 90% of the usage.
It is very likely that you are in the top 10% of users.
True. The article also has zero basis for its estimate of the average usage from each tier's user base.
I somewhat doubt my usage is so close to the edge of the curve, since I don't even pay for any plan. It could be that I'm very frugal with money and heavy on consumption while most are more balanced, but 1M tokens per day in any case sounds slim for any user who pays for the service.
Another giant problem with this article is that we have no idea what optimizations they use on their end. These large AI companies use some wildly complex optimizations.
What I'm trying to say is that hosting your own model is in an entirely different league than the pros.
Even if, accounting for error, the article implies a higher cost, I would argue it comes back to profit simply because of how advanced inference optimization has become.
If actual model intelligence is not a moat (which is looking likely), the real secret sauce of profitable AI companies is advanced optimization across the entire stack.
OpenAI is NEVER going to release their specialized kernels, routing algos, quantizations, or model compilation methods. These are all really hard and really specific.
> I'm here to provide helpful, respectful, and appropriate content for all users. If you have any other requests or need assistance with a different type of story or topic, feel free to ask!
How can a benchmark be secret if you post it to an API to test a model on it?
"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"
:P
If you want to do it properly you have to avoid any 3rd-party hosted model when you test your benchmark, which means you can't have GPT-5, Claude, etc. on it; and none of the benchmarks want to be 'that guy' who doesn't have all the best models on it.
How do you propose that would work? A pipeline that goes through query-response pairs to deduce response quality and then uses the low-quality responses for further training? Wouldn't you need a model that's already smart enough to tell that previous model's responses weren't smart enough? Sounds like a chicken and egg problem.
It just means that once you send your test questions to a model API, that company now has your test. So 'private' benchmarks take it on faith that the companies won't look at those requests and tune their models or prompts to beat them.
They have quite large amounts of money. I don't think they need to be very cost-efficient. And they also have very smart people, so likely they can figure out a somewhat cost-efficient way. The stakes are high, for them.
Depends. Something like arc-agi might be easy as it follows a defined format. I would also guess that the usage pattern for someone running a benchmark will be quite distinct from that of a normal user, unless they take specific measures to try to blend in.
> Is it part of the multi-modal system without it being able to differentiate that text from the prompt?
Yes.
The point the parent is making is that if your model is trained to understand the content of an image, then that's what it does.
> And even if they can't, they should at least improve the pipeline so that any OCR feature should not automatically inject its result in the prompt, and tell user about it to ask for confirmation.
That's not what is happening.
The model is taking <image binary> as an input. There is no OCR. It is understanding the image, decoding the text in it and acting on it in a single step.
There is no place in the 1-step pipeline to prevent this.
...and sure, you can try to avoid it in a procedural way (e.g. OCR the image and reject it before it hits the model if it has text in it), but then you're playing the prompt injection game... put the words in a QR code. Put them in French. Make it a sign. Dial the contrast up or down. Put it on a t-shirt.
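For illustration, a minimal sketch of that procedural filter, using pytesseract and Pillow as one possible OCR stack (my choice, not something from the thread). It makes the weakness obvious: it only blocks what the OCR can read, so QR codes, odd fonts, other languages, and low-contrast text go straight through:

```python
# Pre-filter: OCR the image and refuse to forward it to the multimodal model
# if any text is detected. This sits in front of the model; it does not change
# what the model itself would do with text it can read.

from PIL import Image
import pytesseract

def should_block(path: str, max_chars: int = 0) -> bool:
    """Return True if OCR finds text and the image should not reach the model."""
    text = pytesseract.image_to_string(Image.open(path)).strip()
    return len(text) > max_chars

if __name__ == "__main__":
    # Placeholder filenames for illustration only.
    for candidate in ("vacation_photo.jpg", "sign_with_instructions.png"):
        verdict = "blocked" if should_block(candidate) else "forwarded to model"
        print(f"{candidate}: {verdict}")
```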
And honestly, I'm not surprised. When I read their long research PDFs, often ending with a question mark about emergent behaviors, I knew they don't know what they're playing with, with no more control than any neuroscience researcher.
This is too far from the hacking spirit for me; sorry to bother.
Wanting things to be true does not make them true.
“Get a promotion this year, be a manager next year, manage the division in three years” is not a plan you can execute.
This is just the old self-affirmation stuff you hear all the time: you won't succeed if you only want it a bit. You won't succeed if you want it and do nothing. You will succeed if you go all in, 100%.
It is BS.
You won't succeed if you go all in, statistically speaking.
You might get a different outcome, but you won't hit your goal.
It is provably false that everyone who goes all in succeeds; not everyone gets to be an astronaut, no matter how hard they work.
The reality is that some people will put a little effort in and succeed, and some people will put a lot in and succeed. Other people will fail.
Your goals are not indicators of future success.
Only actual things that have actually happened are strong signals for future events.
The advice to have goals is helpful, but the much, much more important thing is to measure what actually happens and create realistic goals based on actual reality.
Try things. Measure things. Adopt things that work. Consciously record what you do, how it goes, how long it takes and use that to estimate achievable goals, instead of guessing randomly.
I mean, the parent even pointed out that it works for vibe coding and stuff you don't care about; ...but the 'You can't' refers to this question by the OP:
> I really need to approve every single edit and keep an eye on it at ALL TIMES, otherwise it goes haywire very very fast! How are people using auto-edits and these kind of higher-level abstraction?
No one I've spoken to is just sitting back writing tickets while agents do all the work. If it was that easy to be that successful, everyone would be doing it. Everyone would be talking about it.
To be absolutely clear, I'm not saying that you can't use agents to modify existing code. You can. I do; lots of people do. ...but that's using it like you see in all the demos and videos; at a code level, in an editor, while editing and working on the code yourself.
I'm specifically addressing the OP's question:
Can you use unsupervised agents, where you don't interact at a 'code' level, only at a high level abstraction level?
...and, I don't think you can. I don't believe anyone is doing this. I don't believe I've seen any real stories of people doing this successfully.
> Can you use unsupervised agents, where you don't interact at a 'code' level, only at a high level abstraction level?
My view, after having gone all-in with Claude Code (almost only Opus) for the last four weeks, is “no”. You really can’t. The review process needs to be diligent and all-encompassing and is, quite frankly, exhausting.
One improvement I have made to my process for this is to spin up a new Claude Code instance (or clear context) and ask for a code review based on the diff of all changes. My prompt for this is carefully structured. Some issues it identifies can be fixed with the agent, but others need my involvement. It doesn’t eliminate the need to review everything, but it does help focus some of my efforts.
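As a rough illustration of that workflow (not the commenter’s actual prompt, which isn’t shared), one way to assemble the review request is to dump the full diff into a structured template and hand it to a fresh session. The base branch name and prompt wording here are assumptions:

```python
# Build a review prompt from the full diff of the current work, then paste it
# into a brand-new Claude Code session (or one with cleared context).

import subprocess

BASE_BRANCH = "main"  # placeholder: whatever branch the work started from

REVIEW_TEMPLATE = """You are reviewing a diff you have never seen before.
1. List correctness bugs, ordered by severity.
2. List places where the change breaks existing conventions in the repo.
3. List anything you would not merge without a human looking at it first.
Do not praise the code; report problems only.

<diff>
{diff}
</diff>
"""

def build_review_prompt() -> str:
    diff = subprocess.run(
        ["git", "diff", f"{BASE_BRANCH}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return REVIEW_TEMPLATE.format(diff=diff)

if __name__ == "__main__":
    print(build_review_prompt())
```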