Deep Learning’s Diminishing Returns (2021) (ieee.org)
76 points by jrepinc on June 16, 2023 | 34 comments



The paper has a fundamental flaw, which becomes clear if you reverse the reasoning and apply it to the past rather than trying to predict the future.

If you used today's level of computation with 2012-era models, you wouldn't get the error rates of today's models; you would get much higher, much worse error rates.

The single biggest bottleneck in deep learning is not computation, it's the ingenuity used to devise new structures and new optimizations that allow for scaling.

For a given structure, with given techniques, you can throw more computation at a problem to decrease error rate, but the gains scale poorly with cost. Cost improvements only come with new techniques and structures.


I read this paper as commenting on the second-order derivative of benchmark progress with respect to DL techniques. The trends presented start with 2012-era models with 2012 data, and end with 2020 results for 2020-era models. Thus the extrapolation on training costs accounts for the current pace+style of progress and innovation in all relevant subareas of ML. To me it seems that it’s saying “if research continues in the same trends as it has from 2012 to 2020, here is where we will end up in 2035.”

In other words, in order for us to buck the trend, we would need to start innovating in ways that are unlike the ways that got us from AlexNet to here.


And don't forget data!


I don't find the author's points to be very convincing. There's a bit of a self-contradiction in the arguments being made -- namely that if you want better generalization, you need more parameters (which equates to higher energy consumption in training).

However, the obvious retort here is that if you train a model that is good at generalizing, then you don't need to train more models! A show of hands, who has used GPT or an open LLM vs who has trained one would yield a vast disparity. If you don't need generalization, you don't need huge models. Small models are efficient over narrow domains and don't require vast compute/energy resources.

Secondarily, it's a self-solving issue. Energy isn't cheap, and GPUs aren't cheap. If you're going to burn tens of thousands of dollars in energy costs, you should probably have a decent reason to do it. But those reasons are quickly diminishing as these things have already been _done_.

Third, overparameterized models are becoming less of an issue during inference with efficient quantization techniques. Distillation, though harder, is another option. Again, you can do these things once, after training.
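For example, post-training dynamic quantization in PyTorch is nearly a one-liner; a minimal sketch (the toy model below is just a placeholder for any trained float32 model):

    import torch

    # placeholder: stand-in for any trained float32 model
    model = torch.nn.Sequential(
        torch.nn.Linear(128, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 10),
    )

    # convert the Linear layers to int8 weights, once, after training
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
    print(quantized)  # inference now uses int8 weight matmuls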


> A show of hands, who has used GPT or an open LLM vs who has trained one would yield a vast disparity.

Believe me, that has almost nothing to do with "only needing one model" and everything to do with the compute required being absurd.

> Small models are efficient over narrow domains and don't require vast compute/energy resources.

The "best" small models are still typically born from the "best" overall models (however large) using a student/teacher paradigm. This is active research. As such, it is typically encouraged to go for "the big one", as you can distill it to far better small models than if you were to train the small models from scratch first.

> If you're going to burn tens of thousands of dollars in energy costs, you should probably have a decent reason to do it. But those reasons are quickly diminishing as these things have already been _done_.

Again, this is _not_ the reason researchers don't train their own GPT-4s. They _absolutely_ would if they could raise that much money. The notion that researchers have everything they ever could have wanted from OpenAI's API-based walled garden approach is patently absurd. If they would release the weights, then finetuning and such could occur and you might have a point.

To be clear, I'm not defending the article's main point. You just happened to have used some strange arguments against it (imo).

The best argument is a sibling comment which points out that new architectures will be discovered that are more parameter efficient and run better on the hardware available. The burgeoning field of architecture search may mean we even have the ability to get the models themselves to help speed up this process of finding better ML architectures. It's always difficult to know when we have hit a wall of sorts however, and it may be that the transformer (and similar approaches) can only be refined so much. Time will tell.


Once you have an NN model and you know you want to keep the model weight set, moving to an optical neural network can massively drop the energy use. It's not necessarily easy, and certain architectures may not be as amenable, but it can certainly be a path to reducing energy use.


I’m not aware of anyone having done this for a production ready scenario, I thought it was all still highly experimental and very early days?


None of these articles admit to a basic truth: deep learning is only a small slice of total computation, and even if it grew 2 orders of magnitude, would not be the largest consumer by area.

If DL costs are unsustainable, it means those other things (SAP, web hosting, and all the ridiculous stuff people waste cycles on) are even more unsustainable, and should be addressed first.


You're presupposing that "the point" must be: "which thing is wasting the most power right now?" It's not. People can talk about at least two things at once. This is critiquing the diminishing returns in terms of cost-per-output. It compares badly, on a unit level, to most other methods. That's a fair critique when planning which technique to use in the future.

Articles are often talking about something that won't end up mattering. But if you can't actually address the point the article is making on its own terms, you're going to struggle to refute it.


But we aren't seeing diminishing returns in terms of cost-per-output. We're seeing enormous advances, both in the quality of networks, and what we understand about how to train them, with a large but not unanticipated growth in compute which is still smaller than the growth of compute in other areas which aren't producing scientific advances.

When I refute a paper I start with the weakest points; if I convince my audience to ignore the paper on those terms, I don't spend time refuting the main point (perhaps this is not a great technique, but it is fairly efficient). I didn't even really address their main arguments because after reading the paper, I was just aghast at some of the things they say.

Let's give an example like this: """The first part is true of all statistical models: To improve performance by a factor of k, at least k**2 more data points must be used to train the model."""
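For context, that k**2 figure is just the classical estimator rate: statistical error typically shrinks like 1/sqrt(n), so cutting error by a factor of k takes roughly k**2 times the data. A toy numerical check (my own illustration, not from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (1_000, 4_000, 16_000):
        # mean absolute error of the sample mean, averaged over many trials
        err = np.mean([abs(rng.standard_normal(n).mean()) for _ in range(2_000)])
        print(n, round(err, 5))
    # each 4x increase in n roughly halves the error: k=2 needs k**2 = 4x data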

I guess that's true of nearly all modern statistical models, at least small ones with simple model functions and a fairly small number of data points. But I don't think that most advanced deep learning experts think in those terms; modern DL models do not behave anything like classical statistical models. I think those experts see increasing the data as providing an opportunity for overparameterized systems to generalize in ways we don't understand and that don't follow normal statistical rules. Modern DL systems are more like complex systems with emergent properties than statistical models as understood by modern statisticians.

Here's another example: they sort of ignore the fact that we only need to train a small number of really big models and then all other models are fine-tuned from them. To get a world-class tardigrade detector, I took MobileNet, gave it a few hundred extra examples, trained for less than an hour on my home GPU (a 3080 Ti), and then re-used that model for millions of predictions on my microscope. I didn't have to retrain any model from scratch or use absurd amounts of extra data. I took advantage of all the work the original model training did to discover basis functions that can compactly encode the difference between tardigrade, algae, and dirt. I see a direct linear increase in my model's performance as I move to linearly larger models, and I need to add a linear number of images to train more classes. We can reasonably expect this to be true of a wide range of models.
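In torchvision terms, that workflow is roughly the standard transfer-learning recipe; a sketch (the MobileNet variant, class count, and hyperparameters below are illustrative, not my exact settings):

    import torch
    from torchvision import models

    # start from an ImageNet-pretrained backbone and freeze it
    model = models.mobilenet_v3_small(weights="IMAGENET1K_V1")
    for p in model.parameters():
        p.requires_grad = False

    # swap the classifier head for our three classes: tardigrade / algae / dirt
    in_features = model.classifier[-1].in_features
    model.classifier[-1] = torch.nn.Linear(in_features, 3)

    # only the new head trains, which is why a few hundred labeled images
    # and under an hour on a single consumer GPU are enough
    optimizer = torch.optim.Adam(model.classifier[-1].parameters(), lr=1e-3)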

Similarly, for people doing molecular dynamics- the big CPU waster before DL- many parts of MD can now be approximated with DLs that are cheaper to run than MD, using models that were just trained once.

What about AlphaFold? Even if it cost DeepMind $100M in training time (and probably more in salaries), it has already generated some results that simply couldn't be produced without their technology- it didn't even exist! What they demonstrated, quite convincingly, is that algorithmic improvements could extract far more information from fairly cheap and low quality sequence data, compared to expensive structural data. So instead of extremely expensive MD sims or whatever to predict structure, you just run this model. My friends in pharma research (I work in pharma) are delighted with the results.

In short, I think the author's economic model is naive, I think his understanding of improvement in DL is naive, and he undercounts the value of having a small number of huge models which are trained on O(n log n), not n**2, data sets.

And I think that in the next decade it's likely either Google, Meta, or Microsoft will be actively training multi-modal models that basically include the sum of all publicly available, unencumbered data to produce networks that can move smoothly between video, audio, text, do logical reasoning, everything required to produce a virtual human being that could fool even experts, and probably even exceed human performance in science and mathematics in an impactful way. So what if they spend $100B to get there. That's just two years of NIH's budget.


So - this feels like a convincing refutation of the article! You should have said this first! I'm naive in all aspects of the details here but I'm glad to see that you had something more in your coat than "there are bigger energy eaters than these." Thanks!


Sometimes it's conversation that brings out the deeper thoughts that may not be well formed or put into words. The OP may not have been able to clearly articulate their ideas without being prompted by you.


>If DL costs are unsustainable, it means those other things (SAP, web hosting, and all the ridiculous stuff people waste cycles on) are even more unsustainable, and should be addressed first.

That fundamentally misunderstands the point. If it costs me $XXX in compute to do a job that only actually costs $X, then it's not financially viable to use the model to replace the job. It makes no difference what percentage of global computation that makes up.


The reason the global percentage matters is that if you are truly trying to address environmental impacts, you wouldn't talk about DL, it's currently too small (and will be for some time) to make a difference.

Just like not optimizing the 1% bottleneck in your code when there's a 50% bottleneck- focus on the big players where the cheap, easy wins are first.
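To put rough, made-up numbers on that analogy:

    total runtime = 100 units
    50% bottleneck sped up 2x:  100 - 25  = 75.0 units  (25% saved)
     1% bottleneck sped up 10x: 100 - 0.9 = 99.1 units  (<1% saved)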


Yes, your point makes sense from the perspective of trying to reduce global carbon usage... but that doesn't seem to be the perspective the article is written from, and the title is "Deep Learning's Diminishing Returns" not "Deep Learning Is Not Environmentally Friendly" so I think the point that DL has diminishing returns stands.


The paper advances several lines of argument. The environmental one is used as a negative consequence of its purported scaling laws for training.

The paper explicitly says: """Important work by scholars at the University of Massachusetts Amherst allows us to understand the economic cost and carbon emissions implied by this computational burden. The answers are grim: Training such a model would cost US $100 billion and would produce as much carbon emissions as New York City does in a month. And if we estimate the computational burden of a 1 percent error rate, the results are considerably worse."""

(that paper was debunked by people who build and operate extremely large scale cloud machine learning systems).

Remember, Google already built and runs its TPU fleet continuously and the machines are almost always busy at at least 50% of their capacity, meaning the money is already being spent and the carbon (which is much smaller than their estimates) being spewed.

As for the rest of the paper, their arguments about the need to increase parameters to get better results are fairly simplistic, and filled with misleading information/untrue statements.

Basically, everything they claim in the article has been shown to be empirically untrue. And the community is already grappling with the "too many parameters" approach. The authors would have served the community far better by writing a less critical paper that focused on the opportunities to identify good approaches to parameter and compute reduction that don't affect performance. Looking at the main author's area of research, it looks like he is more of an economist/public policy wonk than a DL expert, which I think negatively affects the quality of the paper.


AI people sure don't like criticism. I have noted that.


The amount of computation in the world devoted to deep learning, in aggregate, might be a small slice of total computation. But in terms of costs borne by individual organizations who train such models, deep learning could be extremely substantial. Particularly if it takes millions in additional investment to achieve only marginal reductions in error.


How is that different from any other capital-intensive industry? Basically my point is that there is nothing specific to DL in these articles/complaints; people are just jumping on the bandwagon.


For me, the takeaway is that training state of the art models may become cost prohibitive even for large companies. There may be a practical limit to how complex models can become. This is at least plausibly interesting to know.


The article is forward-looking, while SAP etc. is not likely to increase its processing needs by orders of magnitude from here.


I remember people saying the same thing about crypto. Then it started consuming more electricity than Ireland: https://www.theguardian.com/technology/2017/nov/27/bitcoin-m...


Discussed at the time:

Deep Learning's Diminishing Returns - https://news.ycombinator.com/item?id=28646256 - Sept 2021 (84 comments)


»For example, when the cutting-edge image-recognition system Noisy Student converts the pixel values of an image into probabilities for what the object in that image is, it does so using a network with 480 million parameters. The training to ascertain the values of such a large number of parameters is even more remarkable because it was done with only 1.2 million labeled images—which may understandably confuse those of us who remember from high school algebra that we are supposed to have more equations than unknowns.

Each picture has more than one pixel, so for the network to "remember" the images, we need more parameters than there are pictures.
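Rough numbers, assuming something like 224x224 RGB inputs (my assumption; Noisy Student's actual input resolution may differ):

    1.2e6 images * 224 * 224 * 3 values/image ~ 1.8e11 input values ("equations")
    4.8e8 parameters                                                 ("unknowns")

So the data side still outnumbers the parameters by a factor of a few hundred.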


Things like this make me really appreciate the energy efficiency of a human brain. A typical human learns these kinds of tasks with less energy than it takes to boil a tea kettle.


Empirically, all reasoning based on extrapolating "resource depletion" or "increased energy usage" has been wrong.


Climate change is real and we are running out of CO_2 budget.


A disappointing analysis. It discusses cost in isolation from revenue, and raises environmental impacts without mentioning that the major cloud computing providers where the largest models are trained and run are carbon neutral.

Better results for general purpose tasks are likely to find willing buyers at prices orders of magnitude higher than even the largest language models of today. In most cases, the best alternative is human labor, and for more latency-sensitive cases, there is no alternative to a language model available at any price.


"Carbon neutrality" is shameless greenwashing while carbon credits in large part remain a scam.

https://en.wikipedia.org/wiki/Carbon_offsets_and_credits#Con...


Regarding carbon neutrality: yeah, by using their enormous capital/pricing power to buy up all the green power, relegating the rest of the country to either greenwashed grey power or outright grey power. So the learning is green; it's just the rest that is polluting...


Carbon neutral is the indulgence of our age. And about as effective.


One month of NYC carbon is no big deal. Once a model is trained, using it is much, much cheaper.


The model has to be trained repeatedly to update the information.


Interesting to see how many people were insisting that deep learning had hit a wall and the whole thing was a fad that would end soon, right before it exploded.



