
It would be refreshing, just now and then, to see a passionate AI bro who isn’t selling AI.

You know, I get it: earn those clicks. Spin that hype. Pump that valuation.

Now go watch people on YouTube like Armin Ronacher (just search, you’ll find him) actually streaming their entire coding practice.

This is what expert LLM usage actually looks like.

People with six terminals running Claude are a lovely bedtime story, but please, if you’re doing it, do me a favour and do some live streams of your awesomeness.

I’d really love to see it.

…but so far, the live coding sessions showing people doing this “everyday 50x engineer” practice don’t seem to exist, and that makes me a bit skeptical.


> people are going to get a variety of results.

Yes, but the point of this article is surely that if it were working on average, there would be obvious signs of it working by now.

Even if there are statistical outliers (i.e. 10x productivity using the tools), if on average it does nothing for the productivity of developers, something isn't working as promised.


We need long-running averages, and 2023-2025 is still too early to determine that it's not effective. The barriers to entry in 2023 and 2024, I'd argue, were too high for inexperienced developers to start churning out software. For seasoned developers, there was skepticism, and the company adoption wasn't there yet (and still isn't).

Yes, but this is meaningless advice.

The best solution is the simplest.

The quickest? No, the simplest; sometimes that takes longer.

So definitely not a complex solution? No, sometimes complexity is required; it's the simplest solution possible given your constraints.

Soo… basically, the advice is “pick the right solution”.

Sometimes that will be quick. Sometimes slow. Sometimes complex. Sometimes config. Sometimes distributed.

It depends.

But the correct solution will be the simplest one.

It's just: “solve your problems using good solutions, not bad ones”.

…and that is indeed both good, and totally useless, advice.


The article responds to this.

Really?

We both read the article; you know as well as I do that the advice in it is to build simple, reliable systems that focus on actual problems, not imagined ones.

…but it does not say how to do that, and it offers no meaningful help to someone trying to pick, out of the entire solution space, the “right” thing that is sufficiently complex and scalable to meet the requirements, but not too scalable or too complex.

There’s just some vague hand-waving about over-engineering things at Big Corp, where, ironically, scale is an issue that mandates a certain degree of complexity in many cases.

Here’s something that works better than meaningless generic advice: specific, detailed examples.

You will note the total lack of them in this article, and others like it.

Real articles with real advice are a mix of practical examples that illustrate the generic advice they’re giving.

You know why?

…because you can argue with a specific example. Generic advice with no examples is not falsifiable.

You can agree with the examples, or disagree with them; you can argue that examples support or do not support the generic advice. People can take the specific examples and adapt them as appropriate.

…but, generic advice on its own is just an opinion.

I can arbitrarily assert “100% code coverage is meaningless; there are hot paths that need heavy testing and irrelevant paths that do not require coverage at all. 100% code coverage is a fool's game that masks a lack of a deeper understanding of what you should be testing”; it may sound reasonable, it may not. That’s your opinion vs. mine.

…but with some specific examples of where it is true, and perhaps where it is not, you could respond to it specifically and challenge it with counterexamples.

(And indeed, you’ll see specific examples turn up here in this comment thread as arguments against it; notably, they are not picked up or addressed by the OP in their Hacker News feedback section.)


Huh.

I feel oddly skeptical about this article; I can't specifically argue the numbers, since I have no idea, but... there are some decent open source models; they're not state of the art, but if inference is this cheap then why aren't there multiple API providers offering models at dirt cheap prices?

The only cheap-ass providers I've seen only run tiny models. Where's my cheap DeepSeek-R1?

Surely if it's this cheap, and we're talking massive margins according to this, I should be able to get cheap access to, or run my own, 600B-param model.

Am I missing something?

It seems that reality (i.e. the absence of people actually doing things this cheap) is the biggest critic of this set of calculations.


> but if inference is this cheap then why aren't there multiple API providers offering models at dirt cheap prices

There are multiple API providers offering models at dirt-cheap prices, enough so that there is at least one well-known API provider, an aggregator of other API providers, that offers lots of models at $0.

> The only cheap-ass providers I've seen only run tiny models. Where's my cheap deepseek-R1?

https://openrouter.ai/deepseek/deepseek-r1-0528:free


How is this possible? I imagine someone is finding some value in the prompts themselves, but this can't possibly be paying for itself.

Inference is just that cheap, plus they hope that you'll start using the models they charge for as you get more used to having AI in your workflow.

You can also run DeepSeek for free on a modestly sized laptop.

At 4-bit quant, R1 takes 300+ gigs just for weights. You can certainly run smaller models into which R1 has been distilled on a modest laptop, but I don't see how you can run R1 itself on anything that wouldn't be considered extreme for a laptop in at least one dimension.
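For scale, here is the back-of-the-envelope arithmetic behind that 300+ GB figure, using R1's published 671B total parameter count (everything else is just unit conversion):

    # Back-of-the-envelope weight size for DeepSeek-R1 at 4-bit quantization.
    # KV cache, activations and runtime overhead come on top of this.
    params = 671e9        # total parameters (MoE; ~37B active per token)
    bits_per_param = 4

    weight_bytes = params * bits_per_param / 8
    print(f"{weight_bytes / 1e9:.0f} GB")     # ~336 GB
    print(f"{weight_bytes / 2**30:.0f} GiB")  # ~312 GiB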

You're probably thinking of what Ollama labels "deepseek", which is not in fact DeepSeek, but other models with some DeepSeek distilled into them.

> why aren't there multiple API providers offering models at dirt cheap prices?

There are. Basically every provider's R1 prices are cheaper than estimated by this article.

https://artificialanalysis.ai/models/deepseek-r1/providers


The cheapest provider in your link charges 460x more for input tokens than the article estimates.

> The cheapest provider in your link charges 460x more for input tokens than the article estimates.

The article estimates $0.003 per million input tokens; the cheapest on the list is $0.46 per million. The ratio is 120×, not 460×.

OTOH, all of the providers are far below the estimated $3.08 cost per million output tokens


There are 7 providers on that page with an output token price higher than $3.08. There is even one with an input token price higher than that. So that "all" is not true either.

> I should be able to get a cheap / run my own 600B param model.

If the margins on hosted inference are 80%, then you need > 20% utilization of whatever you build for yourself for this to be less costly to you (on margin).

I self-host open-weight models (please: DeepSeek et al. aren't open _source_) on whatever $300 GPU I bought a few years ago, but if it outputs 2 tokens/sec then I'm waiting 10 minutes for most results. If I want results in 10s instead of 10m, I'll be paying $30,000 instead. If I'm prompting it 100 times during the day, then it's idle 99% of the time.

Coordinating a group buy for that $30,000 GPU and sharing it across 100 people probably makes more sense than either arrangement in the previous paragraph. For now, that's a big component of what model providers, uh, provide.
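The arithmetic behind that 80%-margin / 20%-utilization break-even, with purely illustrative numbers and the big simplifying assumption that your cost per token at full utilization roughly matches the provider's underlying cost:

    # Toy break-even calculation; all numbers are illustrative, not measurements.
    provider_margin = 0.80
    cost_full_util = 1.0                                      # normalized: your cost per 1M tokens if the box never idles
    provider_price = cost_full_util / (1 - provider_margin)   # = 5.0x cost

    for utilization in (0.05, 0.20, 0.50):
        my_cost = cost_full_util / utilization                # idle time inflates your effective cost
        print(f"{utilization:.0%} utilization: self-host {my_cost:.1f}x vs provider {provider_price:.1f}x")

    # Break-even falls at utilization = 1 - margin = 20%.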


I also have no idea on the numbers. But I do know that these same companies are pouring many billions of dollars into training models, paying very expensive staff, and building out infrastructure. These costs would need to be factored in to come up with the actual profit margins.

There are; I screenshotted DeepInfra in the article, but there are a lot more: https://openrouter.ai/deepseek/deepseek-r1-0528

Is that a quantized model or the full R1?

IMO the article is totally off the mark, since it assumes users on average do not go over 1M tokens per day.

AFAIK OpenAI doesn't enforce a daily quota even on the $20 plans unless the platform is under pressure.

Since I often consume 20M tokens per day, one can assume many would use far more than the 1M tokens assumed in the article's calculations.


There's zero basis for assuming any of that. The most likely situation is a power law curve where the vast majority of users don't use it much at all and the top 10% of users account for 90% of the usage.

It is very likely that you are in the top 10% of users.
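A toy simulation shows why one heavy user says little about the average. The Pareto shape parameter below is arbitrary, chosen only to produce a heavily skewed distribution, not fitted to any real usage data:

    # Hypothetical illustration of a power-law usage curve; no real data involved.
    import numpy as np

    rng = np.random.default_rng(0)
    # Per-user daily token usage: heavy-tailed, with an arbitrary scale factor.
    daily_tokens = (1 + rng.pareto(a=1.05, size=100_000)) * 10_000

    top_10_percent = np.sort(daily_tokens)[-10_000:]
    print(f"median user: {np.median(daily_tokens):,.0f} tokens/day")
    print(f"mean user:   {daily_tokens.mean():,.0f} tokens/day")
    print(f"top 10% share of all tokens: {top_10_percent.sum() / daily_tokens.sum():.0%}")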


True. The article also has zero basis for its estimate of the average usage across each tier's user base.

I somewhat doubt my usage is so close to the edge of the curve, since I don't even pay for any plan. It could be that I'm very frugal with money and fat on consumption while most are more balanced, but 1M tokens per day in any case sounds slim for any user who pays for the service.


Meanwhile, I don’t use ChatGPT at all on a median day. I use it in occasional bursts when researching something.

https://openrouter.ai/deepseek/deepseek-chat-v3.1

They are dirt cheap. Same model architecture for the comparison: $0.30/M $1.00/M. Or even $0.20-$0.80 from another provider.


Another giant problem with this article is that we have no idea what optimizations are used on their end. These large AI companies use some wildly complex optimizations.

What I'm trying to say is that hosting your own model is in an entirely different league from what the pros do.

Even if accounting for the error implies a higher cost than the article estimates, I would argue it swings right back to profit simply because of how advanced inference optimization has become.

If actual model intelligence is not a moat (which is looking likely), the real secret sauce of profitable AI companies is advanced optimization across the entire stack.

OpenAI is NEVER going to release their specialized kernels, routing algos, quantizations, or model compilation methods. These are all really hard and really specific.


I would not be surprised if the operating costs are modest.

But these companies also have very expensive R&D and large upfront costs.


https://lambda.chat

Deepseek R1 for free.


* distilled R1 for free

> I'm here to provide helpful, respectful, and appropriate content for all users. If you have any other requests or need assistance with a different type of story or topic, feel free to ask!

How can a benchmark be secret if you post it to an API to test a model on it?

"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"

:P

If you want to do it properly you have to avoid any third-party hosted model when you test your benchmark, which means you can't have GPT-5, Claude, etc. on it; and none of the benchmarks want to be 'that guy' who doesn't have all the best models on it.

So no.

They're not secret.


How do you propose that would work? A pipeline that goes through query-response pairs to deduce response quality and then uses the low-quality responses for further training? Wouldn't you need a model that's already smart enough to tell that the previous model's responses weren't smart enough? Sounds like a chicken-and-egg problem.


It just means that once you send your test questions to a model API, that company now has your test. So 'private' benchmarks take it on faith that the companies won't look at those requests and tune their models or prompts to beat them.


Sounds a bit presumptuous to me. Sure, they have your needle, but they also need a cost-efficient way to find it in their haystack.


They have quite large amounts of money. I don't think they need to be very cost-efficient. And they also have very smart people, so likely they can figure out a somewhat cost-efficient way. The stakes are high, for them.
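For what it's worth, finding the needle doesn't have to be expensive. A purely illustrative sketch (no claim that any provider actually does this): fingerprint every logged prompt and match it against benchmark items obtained after the fact.

    # Illustrative only: cheap matching of logged prompts against known benchmark items.
    import hashlib
    import re

    def fingerprint(text: str) -> str:
        """Normalize whitespace/case and hash, so trivial reformatting still matches."""
        normalized = re.sub(r"\s+", " ", text.strip().lower())
        return hashlib.sha256(normalized.encode()).hexdigest()

    # Pretend these came from request logs and from a later-obtained benchmark.
    logged_prompts = ["What is the capital   of Australia?", "Write a haiku about rust."]
    benchmark_items = ["what is the capital of australia?"]

    benchmark_hashes = {fingerprint(q) for q in benchmark_items}
    hits = [p for p in logged_prompts if fingerprint(p) in benchmark_hashes]
    print(hits)   # ['What is the capital   of Australia?']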


Security through obscurity is not security.

Your api key is linked to your credit card, which is linked to your identity.

…but hey, you're right.

Let's just trust them not to be cheating. Cool.


Would the model owners be able to identify the benchmarking session among many other similar requests?


Depends. Something like ARC-AGI might be easy, as it follows a defined format. I would also guess that the usage pattern for someone running a benchmark will be quite distinct from that of a normal user, unless they take specific measures to try to blend in.


> Is it part of the multi-modal system without it being able to differenciate that text from the prompt?

Yes.

The point the parent is making is that if your model is trained to understand the content of an image, then that's what it does.

> And even if they can't, they should at least improve the pipeline so that any OCR feature should not automatically inject its result in the prompt, and tell user about it to ask for confirmation.

That's not what is happening.

The model is taking <image binary> as an input. There is no OCR. It is understanding the image, decoding the text in it and acting on it in a single step.

There is no place in the 1-step pipeline to prevent this.

...and sure, you can try to avoid it in a procedural way (e.g. OCR the image and reject it before it hits the model if it has text in it), but then you're playing the prompt injection game... put the words in a QR code. Put them in French. Make it a sign. Dial the contrast up or down. Put it on a t-shirt.

It's very difficult to solve this.
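To make the procedural version concrete, here is a minimal sketch of that kind of pre-filter (assuming the pytesseract and Pillow packages); every bypass listed above defeats it:

    # Naive pre-filter: reject images that appear to contain text before they
    # reach the model. Easily defeated (QR codes, stylized fonts, low contrast,
    # other languages), which is exactly the point.
    from PIL import Image
    import pytesseract

    def contains_text(image_path: str, min_chars: int = 5) -> bool:
        """Return True if OCR finds more than a trivial amount of text."""
        extracted = pytesseract.image_to_string(Image.open(image_path))
        return len(extracted.strip()) >= min_chars

    def safe_to_send(image_path: str) -> bool:
        # Reject anything that looks like it carries instructions.
        return not contains_text(image_path)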

> It's hard to believe they can't prevent this.

Believe it.


Now that makes more sense.

And after all, I'm not surprised. When I read their long research PDFs, often ending with a question mark about emergent behaviors, I knew they don't know what they're playing with, with no more control than any neuroscience researcher has.

This is too far from the hacker spirit for me; sorry to bother.


Playing with things you barely understand sounds like perfect "hacker spirit" to me!


You're confusing hacker and lamer.

https://en.wikipedia.org/wiki/Lamer


This isn't realistic.

Wanting things to be true does not make them true.

“Get a promotion this year, be a manager next year, manage the division in three years” is not a plan you can execute.

This is just the old self-affirmation stuff you hear all the time: you won't succeed if you only want it a bit. You won't succeed if you want it and do nothing. You will succeed if you go all in, 100%.

It is BS.

You won't succeed if you go all in, statistically.

You might get a different outcome, but you won't hit your goal.

It is provably false that everyone who goes all in succeeds; not everyone gets to be an astronaut, no matter how hard they work.

The reality is that some people will put a little effort in and succeed, and some people will put a lot in and succeed. Other people will fail.

Your goals are not indicators of future success.

Only actual things that have actually happened are strong signals for future events.

The advice to have goals is helpful, but the much, much more important thing to do is to measure what actually happens and set goals realistically, based on actual reality.

Try things. Measure things. Adopt things that work. Consciously record what you do, how it goes, how long it takes and use that to estimate achievable goals, instead of guessing randomly.


Are you talking about the same thing as the OP?

I mean, the parent even pointed out that it works for vibe coding and stuff you don't care about; ...but the 'You can't' refers to this question by the OP:

> I really need to approve every single edit and keep an eye on it at ALL TIMES, otherwise it goes haywire very very fast! How are people using auto-edits and these kind of higher-level abstraction?

No one I've spoken to is just sitting back writing tickets while agents do all the work. If it was that easy to be that successful, everyone would be doing it. Everyone would be talking about it.

To be absolutely clear, I'm not saying that you can't use agents to modify existing code. You can. I do; lots of people do. ...but that's using them like you see in all the demos and videos: at the code level, in an editor, while editing and working on the code yourself.

I'm specifically addressing the OPs question:

Can you use unsupervised agents, where you don't interact at a 'code' level, only at a high level abstraction level?

...and, I don't think you can. I don't believe anyone is doing this. I don't believe I've seen any real stories of people doing this successfully.


> Can you use unsupervised agents, where you don't interact at a 'code' level, only at a high level abstraction level?

My view, after having gone all-in with Claude Code (almost only Opus) for the last four weeks, is “no”. You really can’t. The review process needs to be diligent and all-encompassing and is, quite frankly, exhausting.

One improvement I have made to my process for this is to spin up a new Claude Code instance (or clear context) and ask for a code review based on the diff of all changes. My prompt for this is carefully structured. Some issues it identifies can be fixed with the agent, but others need my involvement. It doesn’t eliminate the need to review everything, but it does help focus some of my efforts.
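The mechanical part of that can be scripted. A rough sketch of such a setup; the prompt wording and the claude -p print-mode invocation are my assumptions about this kind of workflow, not the parent's actual process:

    # Sketch: feed the full diff of current changes to a fresh agent instance
    # for a structured review. Prompt text and CLI invocation are assumptions.
    import subprocess

    REVIEW_PROMPT = """You are reviewing a diff with fresh context.
    List: (1) correctness bugs, (2) security issues, (3) places where the change
    does more than the stated task, (4) anything you cannot judge from the diff alone.

    DIFF:
    {diff}
    """

    def review_current_changes() -> str:
        diff = subprocess.run(["git", "diff", "HEAD"], capture_output=True, text=True).stdout
        prompt = REVIEW_PROMPT.format(diff=diff)
        # Non-interactive invocation; swap in whatever agent/CLI you actually use.
        result = subprocess.run(["claude", "-p", prompt], capture_output=True, text=True)
        return result.stdout

    if __name__ == "__main__":
        print(review_current_changes())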


Is a whole IDE really the solution though?

There are already plugins to use Claude Code in other IDEs.

This “I'll write a whole IDE because you get the best UX” thing seems like it's a bit of a fallacy.

There are lots of ways you could do that.

A standalone application is just convenient for your business/startup/cross-sell/whatever.


? Are you complaining about MCP or boost?

It’s an optional component.

What do you want the OP to do?

MCP may not be strictly necessary but it’s straight in line with the intent of the library.

Are you going to take shots at llama.cpp for having an http server and a template library next?

Come on. This uses Conan, it has a decent CMake file. The code is OK.

This is pretty good work. Don't be a dick. (Yeah, I'll eat the downvotes; it deserves to be said.)

