
I think the idea that SOTA models can run on limited hardware makes people think that Nvidia sales will take a hit.

But if you think about it for two more seconds, you realize that if SOTA was trained on mid-level hardware, top-of-the-line hardware could still put you ahead. And DeepSeek is also open source, so it won't take long to see what this architecture could do on high-end cards.

There's no reason to believe that performance will continue to scale with compute, though; that's why there's a rout. More simply: if you assume the maximum performance achievable with the current LLM/transformer architecture is, say, twice as good as what humanity is capable of now, then DeepSeek's result would mean you can reach 50%+ of that ceiling with orders of magnitude less compute. There's just no way to justify the amount of money being spent on Nvidia cards if that's true, hence the selloff.
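
The arithmetic behind that, as a rough sketch; the 2x ceiling and the compute ratios are assumptions I'm making for illustration, not measured numbers:

    # Back-of-the-envelope sketch of the argument above; every number
    # here is an assumption, not a measurement.
    ceiling = 2.0           # assume the architecture tops out at ~2x today's capability
    frontier_compute = 1.0  # normalized compute budget of a frontier lab
    cheap_compute = 0.1     # assumed DeepSeek-style budget, an order of magnitude less

    first_half = 0.5 * ceiling          # capability reached cheaply
    second_half = ceiling - first_half  # capability still to be bought
    cost_per_unit_first = cheap_compute / first_half
    cost_per_unit_second = (frontier_compute - cheap_compute) / second_half

    print(f"second half of the capability costs ~{cost_per_unit_second / cost_per_unit_first:.0f}x "
          "more compute per unit than the first half")

Under those assumptions the marginal capability per GPU dollar craters, which is the whole point.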

Wait, no: there is actually PLENTY of evidence that performance continues to scale with more compute. The entire point of the o3 announcement, and the benchmark results from throwing a million bucks of test-time compute at ARC-AGI, is that the ceiling is really, really high. We have three verified scaling laws: pre-training corpus size, parameter count, and test-time compute. More efficiency is fantastic progress, but we will always be able to get more intelligence by spending more. Scale is all you need. DeepSeek did not disprove that.
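
For the pre-training side, the usual reference point is the Chinchilla-style fit; here's a minimal sketch using the approximate published constants (test-time compute is a separate axis and isn't captured by this formula):

    # Sketch: a Chinchilla-style scaling law, loss as a function of
    # parameter count N and training tokens D. The constants are the
    # approximate fits reported by Hoffmann et al. (2022); treat the
    # whole thing as illustrative, not anyone's actual production curve.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    # Loss keeps falling as you add parameters and data -- no hard ceiling
    # in the fit, just diminishing returns toward the irreducible term E.
    for n, d in [(7e9, 1.4e12), (70e9, 1.4e12), (70e9, 14e12), (700e9, 14e12)]:
        print(f"N={n:.0e}, D={d:.0e} -> predicted loss {loss(n, d):.3f}")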

There's evidence that performance increases with compute, but not that it scales with compute, e.g. linearly or exponentially. The SOTA models are already seeing diminishing returns with respect to parameter size, training time, and general engineering effort. It's a fact that doubling, say, parameter size does not double benchmark performance.
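
To put rough numbers on that last point (same kind of power-law fit as cited above, purely illustrative):

    # Rough numbers for the diminishing-returns point: in a power-law
    # term like A / N**alpha (alpha ~ 0.34 in the Chinchilla-style fits),
    # doubling the parameter count N shrinks that term by only
    # 2**alpha ~ 1.27x, i.e. about 21% -- and that's before counting
    # the irreducible-loss term, which doubling N doesn't touch at all.
    alpha = 0.34
    reduction = 1 - 1 / 2**alpha
    print(f"doubling N cuts the parameter term by ~{reduction:.0%}, not 50%+")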

I would love to see evidence to the contrary; my assertion comes from watching Claude, Gemini, and o1.

If anything, I feel performance is more a function of data quality than anything else.


The biggest recent increase in model performance came from training models to do chain-of-thought properly; that is why DeepSeek is as good as it is. This requires a lot more tokens for the model to reason, though, which means it needs a lot more compute to do its thing even if it doesn't have a massive increase in parameter size.
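
A rough way to see the compute implication, using the standard ~2·N FLOPs-per-token estimate for a dense transformer forward pass; the token counts are made up, and 37B is roughly the active-parameter count DeepSeek reports for V3/R1:

    # Sketch: inference cost grows with the number of tokens generated,
    # not just with parameter count. A dense transformer forward pass
    # costs roughly 2 * N FLOPs per generated token (standard
    # back-of-envelope estimate); the token counts below are made up.
    def inference_flops(active_params, tokens_out):
        return 2 * active_params * tokens_out

    N = 37e9                             # e.g. an MoE's active parameters per token
    direct = inference_flops(N, 500)     # short, direct answer
    reasoned = inference_flops(N, 8000)  # long chain-of-thought trace
    print(f"the chain-of-thought answer costs ~{reasoned / direct:.0f}x more compute")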

> DeepSeek is also open source so it won't take long to see what this architecture could do on high end cards

As far as I can see, the training code isn't open source. It's open weights.
