
You're begging the question by assuming "the point" must be: "which thing is wasting the most power right now?" It's not. People can talk about at least two things at once. The article is critiquing deep learning's diminishing returns in terms of cost-per-output: on a unit level, it compares badly to most other methods. That's a fair critique when planning which technique to use in the future.

Articles are often talking about something that won't end up mattering. But if you can't actually address the point the article is making on its own terms, you're going to struggle to refute it.




But we aren't seeing diminishing returns in terms of cost-per-output. We're seeing enormous advances, both in the quality of networks and in what we understand about how to train them, with a large but not unanticipated growth in compute, and that growth is still smaller than the growth of compute in other areas which aren't producing scientific advances.

When I refute a paper I start with the weakest points; if I convince my audience to ignore the paper on those terms, I don't spend time refuting the main point (perhaps this is not a great technique, but it is fairly efficient). I didn't even really address their main arguments because, after reading the paper, I was just aghast at some of the things they say.

Take this example: """The first part is true of all statistical models: To improve performance by a factor of k, at least k**2 more data points must be used to train the model."""

I guess that's true of nearly all modern statistical models, at least small ones with simple model functions and a fairly small number of data points. But I don't think that most advanced deep learning experts think in those terms; modern DL models do not behave anything like classical statistical models. I think those experts see increasing the data as providing an opportunity for overparameterized systems to generalize in ways we don't understand and that don't follow normal statistical rules. Modern DL systems are more like complex systems with emergent properties than like statistical models as understood by modern statisticians.
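For what it's worth, the classical version of that rule is just the familiar 1/sqrt(n) scaling of estimator error. Here's a toy sketch (plain numpy, a made-up mean-estimation example, nothing DL-specific) showing that cutting the error by ~10x costs roughly 100x the data:

  import numpy as np

  # Classical k**2 rule: the error of a sample mean shrinks like 1/sqrt(n),
  # so improving by a factor of k takes roughly k**2 more data points.
  rng = np.random.default_rng(0)

  def mean_abs_error(n_samples, n_trials=2000):
      # Average absolute error of the sample mean over repeated experiments.
      draws = rng.normal(0.0, 1.0, size=(n_trials, n_samples))
      return np.abs(draws.mean(axis=1)).mean()

  print(mean_abs_error(100))     # ~0.08
  print(mean_abs_error(10_000))  # ~0.008, i.e. 100x the data for ~10x less error

That's where the quoted claim comes from; whether giant overparameterized models actually follow it is exactly the point in dispute.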

Here's another example: they sort of ignore the fact that we only need to train a small number of really big models, and then all other models are fine-tuned from those. To get a world-class tardigrade detector, I took mobilenet, gave it a few hundred extra examples, trained for less than an hour on my home GPU (a 3080 Ti), and then re-used that model for millions of predictions on my microscope. I didn't have to retrain any model from scratch or use absurd amounts of extra data. I took advantage of all the work the original model training did to discover basis functions that can compactly encode the difference between tardigrades, algae, and dirt. I see a direct linear increase in my model's performance as I move to linearly larger models, and I only need a linear number of additional images to train more classes. We can reasonably expect this to be true of a wide range of models.
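For anyone curious what that workflow looks like, here's a rough sketch with torchvision (the folder path, class count, and hyperparameters are made up for illustration, not my exact setup):

  import torch
  import torch.nn as nn
  from torchvision import datasets, models, transforms

  # Transfer learning: keep the pretrained MobileNet backbone, swap in a
  # 3-class head (tardigrade / algae / dirt), and fine-tune briefly on a
  # few hundred labelled crops.
  tfm = transforms.Compose([
      transforms.Resize((224, 224)),
      transforms.ToTensor(),
      transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
  ])
  data = datasets.ImageFolder("microscope_crops/", transform=tfm)  # hypothetical folder
  loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

  model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
  for p in model.features.parameters():
      p.requires_grad = False                             # freeze the backbone
  model.classifier[1] = nn.Linear(model.last_channel, 3)  # new 3-class head

  opt = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
  loss_fn = nn.CrossEntropyLoss()
  for epoch in range(5):                                  # minutes-to-an-hour on a home GPU
      for x, y in loader:
          opt.zero_grad()
          loss_fn(model(x), y).backward()
          opt.step()

All of the expensive representation learning is already baked into the pretrained weights; the fine-tune only touches the tiny classifier head.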

Similarly, for people doing molecular dynamics (the big CPU waster before DL), many parts of MD can now be approximated with deep learning models that are cheaper to run than the full simulation, using models that were trained just once.
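Stripped of any real MD package, the shape of that trick is: fit a network once to the expensive term, then call the cheap network inside the simulation loop. A toy sketch, with a Lennard-Jones pair potential standing in for the costly calculation:

  import torch
  import torch.nn as nn

  # Toy surrogate: learn a pair potential once, then reuse the cheap network
  # in place of the expensive calculation inside the simulation loop.
  def lennard_jones(r):                     # stand-in for the costly term
      return 4.0 * (r**-12 - r**-6)

  r = torch.linspace(0.9, 3.0, 2000).unsqueeze(1)
  e = lennard_jones(r)

  net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                      nn.Linear(64, 64), nn.Tanh(),
                      nn.Linear(64, 1))
  opt = torch.optim.Adam(net.parameters(), lr=1e-3)
  for step in range(3000):                  # "train once"
      opt.zero_grad()
      nn.functional.mse_loss(net(r), e).backward()
      opt.step()

  # Later, inside the simulation, the surrogate replaces the expensive call:
  with torch.no_grad():
      print(net(torch.tensor([[1.5]])), lennard_jones(torch.tensor([[1.5]])))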

What about AlphaFold? Even if it cost DeepMind $100M in training time (and probably more in salaries), it has already generated results that simply couldn't be produced before their technology existed. What they demonstrated, quite convincingly, is that algorithmic improvements could extract far more information from fairly cheap, low-quality sequence data, compared to expensive structural data. So instead of extremely expensive MD sims or whatever to predict structure, you just run this model. My friends in pharma research (I work in pharma) are delighted with the results.

In short, I think the author's economic model is naive, I think his understanding of improvement in DL is naive, and he undercounts the value of having a small number of huge models which are trained on O(n log n), not n**2, data sets.

And I think that in the next decade it's likely either Google, Meta, or Microsoft will be actively training multi-modal models that basically include the sum of all publicly available, unencumbered data, producing networks that can move smoothly between video, audio, and text, do logical reasoning, and do everything else required to produce a virtual human being that could fool even experts, probably even exceeding human performance in science and mathematics in an impactful way. So what if they spend $100B to get there? That's just two years of NIH's budget.


So, this feels like a convincing refutation of the article! You should have said this first! I'm naive in all aspects of the details here, but I'm glad to see that you had something more up your sleeve than "there are bigger energy eaters than these." Thanks!


Sometimes it's conversation that brings out the deeper thoughts that may not yet be well formed or put into words. OP may not have been able to clearly articulate their ideas without being prompted into it by you.



