New LLM optimization technique slashes memory costs (venturebeat.com)
444 points by hochmartinez 9 days ago | 214 comments





Wonder how this compares with Microsoft's HeadKV paper [1], which claims a 98% reduction in memory while retaining 97% of the performance.

[1] https://arxiv.org/html/2410.19258v3


Seems like a different thing. That paper appears to be about memory reduction in caching, while the article appears to be about memory reduction in content.

They’re both exploring the same space: optimizing the memory needed by the KV cache, which is essentially another name for the context window (no one elides the KV cache, as otherwise you’re doing N^2 math to do attention). They’re exploring different approaches to achieve the same goal, and it may be possible to apply both simultaneously to reduce the attention mechanism to almost zero memory usage, which would be really cool, but I’m curious how they compare against each other individually.

That sounds like a stretch to me. If not, I’m impressed by how the articles can describe such similar things in such different terms.

The only memory mechanism within an LLM, as far as I know, is the attention mechanism, where the model compares all previous tokens to generate a probability distribution for the next token. The attention mechanism has a thing called a KV cache to take the O(n^2) matrix math down to O(n) by caching and reusing the results of some math from previous tokens. The number of tokens the context will cover is called the context window (e.g. 128k for Llama).
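If it helps, here's a minimal sketch of that caching idea in plain numpy, with made-up dimensions and random weights (not any particular model's implementation): each decoding step only computes K/V for the newest token and appends them, instead of recomputing them for the whole history.

    import numpy as np

    d = 64                       # head dimension (made up for illustration)
    W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
    K_cache, V_cache = [], []    # grows by one entry per generated token

    def attend(x_new):
        # One decoding step: only the newest token's K/V get computed,
        # everything older is reused from the cache.
        q = x_new @ W_q
        K_cache.append(x_new @ W_k)
        V_cache.append(x_new @ W_v)
        K, V = np.stack(K_cache), np.stack(V_cache)   # (n, d) each
        scores = K @ q / np.sqrt(d)                   # attend over all n past tokens
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                                  # context vector for next-token prediction

    for _ in range(5):                                # feed 5 dummy "token embeddings"
        out = attend(np.random.randn(d))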

The articles use very similar verbiage.

> The context window can be considered the model’s working memory

Snip

> Universal transformer memory optimizes prompts using neural attention memory models (NAMMs), simple neural networks that decide whether to “remember” or “forget” each given token stored in the LLM’s memory.

snip

> Meanwhile, by discarding unnecessary tokens, NAMM enabled the LLM model to save up to 75% of its cache memory while performing the tasks.

You just have to be familiar with the wording in the space and read enough literature. Here’s more direct wording from the NAMM paper:

> NAMMs use evolution to optimize the performance of LMs by pruning their KV cache memory. Evolved NAMMs can be zero-shot transferred to other transformers, even across input modalities and task domains.

This is all related work on shrinking the KV cache as the context grows, both for memory reasons and for the speed-up effect, since you're not having to attend over all the tokens (O(n) -> sublinear in the size of the context).
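For the general shape of the idea (not Sakana's actual NAMM, which evolves a small network over frequency-domain features of the attention scores), a toy pruning step might look like this; the importance scores are just handed in as an array standing in for whatever scoring model you use:

    import numpy as np

    def prune_kv_cache(K, V, importance, keep_ratio=0.25):
        # Toy KV-cache pruning: keep only the tokens some scoring model deems
        # worth remembering; `importance` stands in for the NAMM's output.
        keep = max(1, int(len(importance) * keep_ratio))
        idx = np.sort(np.argsort(importance)[-keep:])   # top-k, original order preserved
        return K[idx], V[idx]

    K = np.random.randn(1000, 64)          # 1000 cached tokens, 64-dim keys
    V = np.random.randn(1000, 64)
    K2, V2 = prune_kv_cache(K, V, np.random.rand(1000))
    print(K2.shape)                        # (250, 64): a 75% cache reduction, as in the article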

Context is critical to the LLM answering correctly and remembering all the information given to it, plus everything it said. Typical limits for open models these days are 128k, but with techniques like this it could scale even further, allowing better performance on things like code completion.


I thought the context would also have floating point numbers so that tokens would be included in a more fuzzy way, and that when requests are sent it would result in loading slightly different tokens into the cache. Yeah my understanding certainly is limited and I’d like to study it more. Thanks for the response, I see more similarity now.

The word you're looking for is latent space, and yes, everything in the compute graph, including the context cache & compute, is done in latent space. Literal input tokens are first converted to latent space through the embedding layer, and literal output tokens are generated by converting the last compute tensor into token probabilities & taking the most probable token. Everything in the middle, though, happens in the "floating point" latent space.

When you hear something like "it's attending all previous tokens", IMHO that's not strictly the correct explanation, since you're attending through latent space, which doesn't correspond 1:1 with tokens but is a multidimensional representation of that token & all preceding tokens as understood by that attention head. Conceptually, though, that's how it's described, because the size of your context goes up by one tensor for every token you process, even though each cached tensor already reflects attention over the tokens before it (hence self-attention).

Also important to note is that each attention head within each layer has its own KV cache. LLMs are autoregressive models built from a stack of layers, where the output of each layer feeds into the input of the next and each layer performs its own attention. That's another reason why it's not strictly correct to think of tokens as making up your context: there are actually many, many caches within a transformer model.

That's why your 128k context window can be ~15 GiB for a naive inference implementation: 128k tokens * a 512-element key plus a 512-element value * 2 bytes per element * 8 attention heads * 8 layers works out to about 16 GiB (or something along those lines). And that's what this work is talking about shrinking (as does HeadKV).
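A quick back-of-envelope version of that estimate, with made-up shapes (real models differ: grouped-query attention, more layers, quantized caches, etc. all change the number):

    def kv_cache_bytes(context_len, head_dim, n_heads, n_layers, bytes_per_elem=2):
        # Naive KV cache: one key vector and one value vector per token,
        # per attention head, per layer, with no pruning or quantization.
        per_token = 2 * head_dim * n_heads * n_layers * bytes_per_elem
        return context_len * per_token

    # Toy shapes roughly matching the comment above: comes out to 16 GiB
    size = kv_cache_bytes(context_len=128 * 1024, head_dim=512, n_heads=8, n_layers=8)
    print(f"{size / 2**30:.1f} GiB")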

> tokens would be included in a more fuzzy way, and that when requests are sent it would result in loading slightly different tokens into the cache

The entire process of running an LLM is generally 100% deterministic given the same inputs and a fixed seed for the RNG (modulo bugs in the inference math / bugs in HW/SW for the accelerator). Some inference implementations don't guarantee this property in the face of concurrent requests, and you can't control the seed for hosted LLMs, which is why you get seemingly random responses for the same query.
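A tiny illustration of that point, with a stand-in sampler rather than a real inference stack (the token probabilities here are made up):

    import numpy as np

    def sample_tokens(probs, n, seed):
        # Greedy decoding is always deterministic; sampling is too once the RNG is seeded.
        rng = np.random.default_rng(seed)
        return [int(rng.choice(len(probs), p=probs)) for _ in range(n)]

    probs = [0.5, 0.3, 0.2]                                  # pretend next-token distribution
    assert sample_tokens(probs, 10, 42) == sample_tokens(probs, 10, 42)
    print(sample_tokens(probs, 10, 42))                      # identical on every run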


The KV cache feels more like a graph to me, like in the RDF sense. Each parameter could be numbered and given a URL it seems. I have some studying to do. I think building a simple neural net and looking at raw data for context in whatever LLM I’m playing with in Ollama are good things to try.

> they may be both possible to apply simultaneously to reduce the attention mechanism to almost 0 memory usage which would be really cool

https://matt.might.net/articles/why-infinite-or-guaranteed-f...


This isn't like lossless compression. Both techniques involve throwing lots of information away, with the justification that doing so does not significantly affect the end result.

The extent to which using both the techniques together will help will depend on how much overlap there is between the information each ends up discarding.


My joke was more along the lines of entropy. Entropy is information and you can't throw away all of it, otherwise you have nothing useful left.

Modern LLMs are still quite inefficient in their representation of information. We're at something like the DEFLATE era and have yet to invent zstd, where there are only marginal incremental gains; so right now there's a lot of waste to prune away.

Hence the idea to only throw away almost all of it.

Any real-world (open-source) implementations of this?

Looks like it is open source: https://github.com/FYYFU/HeadKV

Is it possible that after 3-4 years of performance optimizations, both algorithmic and in hardware efficiency, it will turn out that we didn’t really need all of the nuclear plants we’re currently in the process of setting up to satisfy the power demands of AI data centers?

Nobody is building nuclear power plants for data centres. A few people have signed some paperwork saying that they would buy electricity from new nuclear plants if those plants could deliver it at a certain price, a price, mind you, that has never been achieved before. Others are trying to restart an existing reactor at Three Mile Island (a thing that has never been done before, and likely won't be done now, since the reactor was shut down for being too expensive to run).

And certainly nobody is building one in the next 3-4 years; they'd be lucky to finish the paperwork in that time.

What is actually going to power them is solar, wind, and batteries: https://www.theverge.com/2024/12/10/24317888/googles-data-ce...


And unfortunately, gas and coal in the meantime.

https://www.theguardian.com/technology/2024/sep/15/data-cent...


That article is terribly vague.

Electric cars are causing exactly the same problem.

Also, the "RECs" appear to be based on a lie. Overall, an increase in load can only be green if new green generation is added to service that load.


Except the alternative to electric is petrol/diesel cars, which are worse than electric cars running on gas- or coal-generated power. The pollution no longer occurs in population zones, and the grid can be cleaned up without changing the car.

The alternatives for these data centres are either build renewables or not build the data centres, both of which are better.


> Nobody is building nuclear power plants for data centres. A few people have signed some paperwork saying that they would buy electricity from new nuclear plants if they could deliver it at a certain price, a price mind you that has not been done before.

Not building new, but I think Microsoft paying to restart a reactor at Three Mile Island for their datacenter is much more significant than you make the deals sound:

https://www.theguardian.com/environment/2024/sep/20/three-mi...


Microsoft isn’t paying to restart. A PPA is a contract saying they will purchase electricity at a specified price for a fixed term. Three Mile Island needs to be able to produce the electricity at the specified price for Microsoft to buy it. If it’s above that price, Microsoft is off the hook.

They say it’s going to be online in 2028.

Are you willing to bet that they won’t have 3 mile island operational by 2030?


> Constellation closed the adjacent but unconnected Unit 1 reactor in 2019 for economic reasons, but will bring it back to life after signing a 20-year power purchase agreement to supply Microsoft’s energy-hungry data centers, the company announced on Friday.

The reactor they’re restarting was operational just five years ago. It’s not a fully decommissioned or melted down reactor and it’s likely all their licensing is still valid so the red tape, especially environmental studies, is mostly irrelevant. Getting that reactor back up and running will be a lot simpler than building a new one.


The largest problem will be finding qualified and vetted personnel. All the people who worked at the plant when it closed five years ago had to find jobs elsewhere. Even though the plant was an important employer in Middletown, I don't know if those former employees will be willing to quit their current jobs to go back, especially if there is a risk the plant will just be shut down again when it once again becomes too expensive to operate.

> especially if there is a risk the plant will just be shut down again when it once again becomes too expensive to operate.

A 20-year purchase agreement covers half of an adult's working career.


It's all been pretty much greenwashing to distract from the real impact of all the AI infrastructure on the energy and water supply.

Could be a good candidate for factobattery. Overbuild the system, run them at full speed at peak solar generation, then underclock them at night.

https://www.moderndescartes.com/essays/factobattery/


Then why does no one seem to be doing it?

This is a really interesting idea! Of course, in practice that will just mean crypto-token mining rather than anything useful.

> Nobody is building nuclear power plants for data centres

Argentina just announced they're building nuclear plants for AI.


Look, I agree that nuclear is difficult, but Google and Microsoft have publicly committed to those projects you’re mentioning. I don’t understand your dismissive tone that all of it is hogwash? This is one of those HN armchair comments.

My tone is because this is a simple predatory delay strategy.

Tomorrow, tomorrow, I’ll decarbonize tomorrow.

Instead of paying to buy wind and solar plants, which can go up today, they are signing a meaningless agreement for the future.

A PPA isn’t worth the paper it’s written on if the seller can’t produce electricity at the agreed upon price by the date required.

Take Three Mile Island. It was closed in 2019 since it was uneconomical to run. Since then renewables have continued getting substantially cheaper, while the reactor has been in the process of decommissioning.

Instead of spending money on building wind and solar, Microsoft saw how well Vogtle went and decided that another first-of-its-kind nuclear project is the best way to make it appear like they’re doing something.


The logic is pretty straightforward; I’m not sure what your complaint is. They don’t need the power now, but they calculate that they’ll need much more power in the future than non-nuclear generation would be able to give them in the same timeframe.

The US is already adding record amounts of solar and wind power to replace coal and natural gas plants. What makes you believe Microsoft can just buy more renewable energy? Are you privy to the terms and conditions of the PPA?

The alternative to the Three Mile Island restart would be to add natural gas plants, or to buy renewable energy at a higher price. I’m sure they have a plan B.


I've been told my entire life that it's too late for nuclear, we should have been building them 20 years ago.

I think now's fine, even if it takes time. these companies already buy a ton of power from renewable sources, and it's good to diversify - nuclear is a good backup to have.


The west tried building nuclear power 20 years ago. If it had delivered we would be building more now.

It did not deliver. It is time to leave nuclear power to the past just like we have done with the steam engine.

It had its heyday but better cheaper technology replaced it.


What does it mean that "the west tried"? Was it a technical failure, or was it that people didn't want it in their backyard? Just because people hate something doesn't mean they don't need it. Children hate spinach.

There was talk of an ongoing nuclear renaissance in the early 2000s. [1]

American companies and utilities announced 30 reactors. Britain announced ~14.

We went ahead and started construction on 7 reactors in Vogtle, Virgil C. Summer, Flamanville, Olkiluoto and Hanhikivi to rekindle the industry. We didn't believe renewables would cut it.

The end result of what we broke ground on is 3 cancelled reactors, 3 reactors which entered commercial operation in the 2020s and 1 still under construction.

The rest are in different states of trouble with financing with only Hinkley Point C slowly moving forward.

In the meantime renewables went from barely existing to dominating new capacity (TWh) in the energy sector.

Today renewables make up 2/3rds of global investment in the energy sector.

The failure of nuclear power is that it is horrifically expensive and the timelines are insane compared to the competition.

Steam locomotives technically work, but, like nuclear power, they are uncompetitive.

Lately nuclear power has caught the imagination of conservative politicians as a method to delay the renewable disruption of the fossil industry and have an answer to climate change.

When their plans get presented, like in Australia, they don’t care in the slightest about nuclear power; it is only a method to prolong the life of the coal and gas assets.

[1]: https://en.wikipedia.org/wiki/Nuclear_renaissance_in_the_Uni...


> American companies and utilities announced 30 reactors. Britain announced ~14.

Lots of projects get announced, they aren't meant to be promises.

> The end result of what we broke ground on is 3 cancelled reactors, 3 reactors which entered commercial operation in the 2020s and 1 still under construction.

So there are three operational reactors and another one almost ready. I'm surprised we got that after Fukushima.

> Today renewables make up 2/3rds of global investment in the energy sector.

So we should not invest in anything else?

> Steam locomotives technically work, but are like nuclear power uncompetitive.

This is a terrible analogy.

> Lately nuclear power has caught the imagination of conservative politicians as a method to delay the renewable disruption of the fossil industry and have an answer to climate change.

People who have been advocating for more nuclear power should stop because it is a conservative issue now?


Which would have moved forward towards completion if the economic calculus made sense.

We should of course continue with basic research. But without some incredible breakthrough, nuclear power will only serve climate change deniers' agenda of delaying the renewable buildout.

This is what you sign up for when proposing investing in nuclear power in 2024:

> The opposition last week released modelling of its “coal-to-nuclear” plan that would slow the rollout of renewable energy and batteries and instead rely on more fossil fuel generation until a nuclear industry could be developed, mostly after 2040.

https://www.theguardian.com/australia-news/2024/dec/16/coali...


In other words: it's too late to build nuclear, so let's bury our heads in the sand and hope we somehow have enough renewables in 20 years and aren't still using coal/gas.

The bury our heads in the sand part seems to be you projecting.

The research disagrees with you. Whenever new-build nuclear power is included in the analysis, the result becomes prohibitively expensive.

> Focusing on the case of Denmark, this article investigates a future fully sector-coupled energy system in a carbon-neutral society and compares the operation and costs of renewables and nuclear-based energy systems.

> The study finds that investments in flexibility in the electricity supply are needed in both systems due to the constant production pattern of nuclear and the variability of renewable energy sources.

> However, the scenario with high nuclear implementation is 1.2 billion EUR more expensive annually compared to a scenario only based on renewables, *with all systems completely balancing supply and demand across all energy sectors in every hour*.

> For nuclear power to be cost competitive with renewables an investment cost of 1.55 MEUR/MW must be achieved, which is substantially below any cost projection for nuclear power.

https://www.sciencedirect.com/science/article/pii/S030626192...

Or if you want a more southern latitude you have Australia here:

https://www.csiro.au/-/media/Energy/GenCost/GenCost2024-25Co...


It may cost more, but it is constant generation, and we should invest in as many carbon neutral alternatives as possible that are feasible. The fact that you have a political opposition to it because of conservative opportunists using it for their own agenda is irrelevant.

Which is not what any modern grid needs? We need cheap dispatchable power, not horrifically expensive inflexible power.

Many grids around the world already spend loads of time with renewables filling 100% of the demand.

https://www.power-technology.com/news/california-achieves-10...

That is a downright hostile environment for nuclear power, which relies on being able to output at 100%, 24/7, all year round, just to be merely horrifically expensive.

In the land of infinite resources and infinite time "all of the above" is a viable answer. In the real world we neither have infinite resources nor infinite time to fix climate change.

Let's focus our limited resources on what works and instead spend the big bucks on decarbonizing truly hard areas like aviation, construction, shipping and agriculture.


'Plenty of places' is not all places, and you want to completely count out a significant energy-generating option because you are annoyed that it doesn't agree with your politics. If it isn't feasible, then they won't build it. By going around advocating against it, you are doing the same thing that happened in the 70s and 80s: removing a perfectly valid energy option we need, whose place will otherwise be filled some other way, almost always with fossil fuels. If you can guarantee every place for all time will be fine with renewables, I'd like to see it; otherwise, why not step back and let engineers and scientists evaluate instead of grandstanding against an option?

What places aren’t covered by the spectrum, with Denmark for higher latitudes and Australia for near the equator?

I’m advocating against wasting public money on nuclear power pretending it is a solution to climate change.

Have at it with your own money.

I already provided you with the scientists and engineers, but you seem to have completely disregarded them because they did not align with what you wanted.

I can do it again:

The research disagrees with you. Whenever new-build nuclear power is included in the analysis, the result becomes prohibitively expensive.

> Focusing on the case of Denmark, this article investigates a future fully sector-coupled energy system in a carbon-neutral society and compares the operation and costs of renewables and nuclear-based energy systems.

> The study finds that investments in flexibility in the electricity supply are needed in both systems due to the constant production pattern of nuclear and the variability of renewable energy sources.

> However, the scenario with high nuclear implementation is 1.2 billion EUR more expensive annually compared to a scenario only based on renewables, *with all systems completely balancing supply and demand across all energy sectors in every hour*.

> For nuclear power to be cost competitive with renewables an investment cost of 1.55 MEUR/MW must be achieved, which is substantially below any cost projection for nuclear power.

https://www.sciencedirect.com/science/article/pii/S030626192...

Or if you want a more southern latitude you have Australia here:

https://www.csiro.au/-/media/Energy/GenCost/GenCost2024-25Co...


I agreed that it costs more and read the study you linked. You are having a hard time accepting that some people might have a different opinion than you and are taking it like they are being obstinate. Sorry it costs more, but I don't think we need to be uniformly opposed to a viable option due to cost.

I genuinely don’t understand why you think nuclear is a viable alternative.

You agree it costs more, is less flexible, takes longer to be operational than renewable energies.

You didn’t present any good argument for nuclear except “something something you don’t like my politics”

I agree with GP that nuclear is often used as a smokescreen to delay doing __anything__ practical and instead keep burning coal etc


I'm not making this political, I said that the politics are irrelevant. I am not advocating for more nuclear -- I am advocating keeping options on the table regardless of politics or cost, because the issue is important to the progress of our species and condensing things down by referencing single studies and talking points is short-sighted -- we have been down that road, it didn't work, let's not bind our hands needlessly.

in practice, 20 years of walking away from nuclear meant that Germany brought coal-fired stations back last year. I'm sure renewables will stop it happening again in 20 years _this time_.

Not sure why this misinformation keeps being repeated?

Since the nuclear phase-out began, both coal and nuclear are down, replaced by renewables. Fossil gas is stable.

https://ourworldindata.org/grapher/share-elec-by-source?time...

Germany brought a few coal plants out of mothballs to prevent the collapse of the French grid when half the French nuclear fleet was offline at the height of the energy crisis.

Which then were promptly mothballed again when the French got their nuclear power under control.

https://www.nytimes.com/2022/11/15/business/nuclear-power-fr...

But let's blame Germany for French nuclear power not delivering. That makes total sense.


do you expect renewables to be more consistent than nuclear?

it sounds like they turned off coal to go back to nuclear after all...


> Instead of paying to buy wind and solar plants

Have you considered googling and checking your assumptions? May help clear up the cynical misunderstandings you appear to have.

If you had, you would’ve read that both Microsoft and Google invest heavily into wind and solar, and that Google is the largest corporate purchaser of renewables in the world. I’m not advocating for these companies, just trying to show that tech is one of the few industries that does actually care and invest into clean energy.

Some sources:

- https://www.gstatic.com/gumdrop/sustainability/google-2024-e...

- https://www.latitudemedia.com/news/googles-largest-wind-inve...

- https://amp.theguardian.com/technology/2019/sep/20/google-sa...

- https://www.theverge.com/2024/5/2/24147153/microsoft-ai-data...


> Have you considered googling and checking your assumptions? May help clear up the cynical misunderstandings you appear to have.

I don't have any such misunderstanding. Perhaps consider seeing my original comment which links to an article describing Google building out solar and wind farms for its data centres.

My cynicism, which I argue is well founded, is based around tech companies signing such agreements with nuclear companies, especially when it involves doing things that have never been done before (restarting reactors and building economical SMRs; see NuScale...).

All these agreements are likely to amount to nothing more than positive PR, greenwashing, or predatory delay. Yes, they also build out solar and wind, but their nuclear PPAs are given equal standing with projects which actually are likely to be built; so instead of having to build more solar and wind today for more real money, they can promise to buy nuclear tomorrow for no cost today.


Not one of your comments amidst this sprawling thread has a single positive fact in it. You’re blindly arguing “nuclear bad” and claiming that nuclear PPAs amount to nothing based on zero evidence? The ones we’re all discussing are the first of their kind…

Anyways, it’s been a bore, cheers!


> Google and Microsoft have publicly committed to those projects you’re mentioning

Google and Microsoft, or their current CEOs, today?

Amazon's CEO committed to their office employees having flexibility regarding their workplace, only about 2 years ago, yet here we are now, with said employees soon having the flexibility to be 5 days in the office, or quit the company.

CEO promises are not worth the screen time they're provided.

Have these companies signed contracts with major penalties if they back out? Those would basically be the only "close to" unbreakable bonds for them.


Microsoft also committed publicly to prioritise security. And Google says they prioritise privacy of their users above all else.

I pity the fool that believes anything these corporations put out publicly.

Actions matter, words are wind


I almost feel there’s a big difference between the kinds of things we’re talking about, but sure.

But not just for AI, for all their data center operations.

Because it's all marketing and greenwashing. They are training these models today using fossil fuels. By the time those nuclear reactors are online, they will have gobbled up literally every human creation to train their models multiple times and dried up several water sources.

Google and Microsoft won't do anything that doesn't translate to money. Those days are over.

Yes, that’s why they want to fund and purchase cheap nuclear energy.

I feel like taking Google’s commitment to something seriously is one of those things I can very uncontroversially respond to with "is this your first day?"

All but the biggest Google fanboys know that Google is incredibly indecisive and will cut plans at a moment’s notice.


I come to HN for better discussions without such infantile retorts.

Or solar in space; some may have already heard of Lumen Orbit: https://www.ycombinator.com/companies/lumen-orbit

Space-based solar power contains little intrinsic advantage that we can get “only from space.” It looks like a wash at best, and the astronomers would say “don’t bother.” https://dothemath.ucsd.edu/2012/03/space-based-solar-power/

Yeah, but what I found thought-provoking is: what if you send up the solar panels and the datacenter as well, for training? No need to transmit power down to Earth. I guess then it becomes a heat dissipation, hardware upgrade, and maintenance problem. But again, thought-provoking.

Heat dissipation becomes a _huge_ problem when you deploy a data center inside a perfect insulator, the vacuum of space.

Currently about a third of the energy consumption of a data center is spent on cooling (heat dissipation), right? And that's with the use of a huge heat sink, the Earth.


Plus, I feel like GP hasn't ever seen an actual data center. One does not simply strap one on top of a rocket (even a SpaceX Starship) and toss it into LEO.

No. This is a classic case of Jevon's paradox. Increased efficiency in resource use can lead to increased consumption of that resource, rather than decreased consumption.

Example:

1. To decrease total gas consumption, more fuel efficient vehicles are invented.

2. Instead of using less gas, people drive more miles. They take longer road trips, commute farther for work, and more people can now afford to drive.

3. This increased driving leads to higher overall gasoline consumption, despite each car using gas more efficiently.


[nitpick] it's the Jevons Paradox, named for William Stanley Jevons. No apostrophe, but if you were to add one, it would be Jevons' Paradox.

https://en.wikipedia.org/wiki/William_Stanley_Jevons


There's no paradox in that. People became more capable and can afford to do more.

It is a paradox because there is an apparent contradiction in the fact that higher efficiency leads to higher consumption. By definition the opposite should be true.

I don't share that intuition. If I earn more money, I won't necessarily save more. I'll buy better food, better clothes, better everything and live a materially more prosperous life. My savings rate may even go down. Or up. It depends on the specifics. When a tech gets more efficient, it causes people to do more. To shape their surroundings and bend reality more to their will. If you can travel easier, you can realize your travel wishes better.

Really, once you understand any "paradox", it ceases to be a paradox at all.

I always feel a bit silly referring to any "paradox" as such, when it's not a paradox anymore. When it now makes perfect sense.


Also I think this will play out for AI as a productivity multiplier. Instead of people having less work there will be more to do since more things are worth doing now. For the following few years at least.

Work expands to fill the available time, after all.


How do you argue that demand was induced as opposed to existing demand served?

Because economists only call it demand to the extent that people are willing and able to make a purchase.

If someone has a need or wants something real bad but can't afford to buy the desired quantity at the prevailing price then economists don't call it demand.


Yes, the term is a bit clumsy. The way I think of it, people have desires (to drive on the highway), but are dissuaded from doing so by disincentives (it’s too busy). Adding a lane reduces the disincentive, so that latent desire is satisfied, until it reaches a new equilibrium.

Result: more people getting where they want to go.

But sometimes the social environment adapts and now you have to drive that amount because it got factored in and things are now built further away, so whether you want to go far is not up to you. See long commutes becoming the norm. Now, arguably long commutes lead to better job allocations and more efficient land use as people can live in one place and work in any of the workplaces within a large radius. So it's a bit more complicated than "want", but ultimately more value seems to be produced.

> ultimately more value seems to be produced.

Is more value produced or are costs just shifted off the balance sheet onto the public commons? Driving instead of walking/public transit has certainly been profitable for some people/companies. But it has also been less than ideal from a public health standpoint. And the time spent commuting is unpaid, so while the business saves money on rent, the increase in travel time is still a cost borne by society as a whole. I would describe this as the opposite of 'efficient land use' personally.


But in the case of highways, they probably would have still gotten where they want to go by another route. The folly is treating highway capacity as being a "market" when really the decision making is much more dynamic and nuanced.

Like a market it's very complicated with many feedbacks and value judgements. For example "How much of my time is it worth sitting in traffic to get to my preferred store across town vs the closer one?"

It's a bit like queueing. The cost isn't monetary.


OK, maybe instead of calling it "induced", call it "latent demand" if you prefer.

People will do whatever's more convenient, so if you make driving far more convenient than everything else ("cheaper"/"more available"), they will drive.

However convenience should not be the only factor for social decisions. To take this to extremes, it would be much more convenient for J. Doe to steal a car than to buy it, so we definitely do not want to make theft convenient.


Maybe because of extremely aggressive marketing, pushing "AI features" into literally everything, and the pushback against that?

Congrats, you have independently reinvented the Hardware Overhang hypothesis: that early AGI could be very inefficient, undergo several optimization passes, and go from needing a datacenter of compute to, say, a single video game console's worth: https://www.lesswrong.com/posts/75dnjiD8kv2khe9eQ/measuring-...

In that scenario, you can go from 0 independent artificial intelligences to tens of millions of them, very quickly.


Thanks for sharing. Worth its own submission: https://news.ycombinator.com/newest

it would seem perfectly reasonable to expect the first AIs to be very unoptimized and if the AIs are any good they will be able to optimize themselves a lot and even help design ASICs to help run them.

Are we setting up nuclear plants for AI data centers? If so, I see that as a win all around. We need to rely more on nuclear power, and I'll take whatever we can get to push us in that direction.

How else will we get manufacturing gains for a mars base nuclear system?

No, as the things using that power get better (newer models keep getting less garbagey) and cheaper (faster hardware and more efficient use of power), people will keep coming up with more things to use them for.

You have it flipped. But it's both.

AI compute is measured in gigawatts, not gigaflops.

It's "how any gigawatts of compute can we get allocated?"

Not

"How much compute can we fit inside of a gigawatt?"

There's no such thing as "enough"


Jevons paradox says that as things get more efficient, usage goes up. In this case, even if the AI data centers don't pan out, I think we'll still find uses for the electricity those plants generate.

We don't even know a tighter lower bound for matrix multiplication than O(n²). Naive is O(n³); Strassen is O(n^2.8). And those are simple, low-level kernels. At the higher level we also do not know tight lower bounds. But we do know some loose bounds from nature, e.g. how much data and energy a human consumes over a lifetime.

No, not really. AWS getting more power- and space-efficient chips didn't reduce total power demand; they just added more cores.

Even if the data centers didn't keep up with available capacity, energy-demanding industries move to and expand near sources of power; aluminum production, for example.


Jevons Paradox will take care of it[1]. The more efficiently a resource is used, the more demand there is for it.

The grave implication of Jevons paradox is that the fundamental conflict between sustainability and economic progress is not resolved solely by using resources more efficiently. It's a theory of supply chain constraints essentially. Once a resource is used more efficiently, its use is increased until the next most economically constrained resource hits its economically useful limit.

[1] https://en.wikipedia.org/wiki/Jevons_paradox


I guess power demands will slowly grow. The same happened with compute in general. Compared to 1960, we have several orders of magnitude more compute but also several orders of magnitude more efficient compute. Data centers are currently about 0.4% of total energy use (electricity is about 20% of total energy use and of the electricity about 2% goes to data centers, so 20% * 2% = 0.4%).

If we're "lucky" (in an AI-optimist sense) we'll need the nuclear plants despite efficiency increases.

I think they can easily eat up the new capacity with larger multimodal models that ground language on video.

Wirth's law: software is getting slower more rapidly than hardware is becoming faster.

I think there's the energy parallel: Software is becoming more energy-hungry faster than algorithms are becoming efficient.

So we'll still need the energy.


It seems (feels?) likely that demand for LLMs is elastic, especially when it comes to specialized niches. Lower power requirements just mean we run more of them in parallel for stuff, so power needs are going to keep growing anyway.

That'd give a lot of extra power which can be used for other - and probably better - purposes so I'd say let them build those plants. The more power available the better after all?

It's called the rebound effect. At no point in modern history has efficiency reduced our energy needs; we just use the extra energy to either run more of the same thing or run other things.

Don't you think people will just add better models to meet available memory?

If we run 7B now, why wouldn't we run 700b with memory optimizations?


No idea, but it may also turn out that OpenAI has no moat, which is more interesting.

Then you can simply have more AI in the data centres.

What if it's a big hoax and we create a better world for nothing?

This only decreases the memory cost of the input context window, not the memory cost of loading and running the model itself.

And that’s what matters the most! To me, at small model sizes (1-8B), anyway. A few thousand tokens already bog my RAM down quite a lot and I’d love to have more; I’d go as far as saying that context greatly determines LLM capability at this point.

Yes, pretraining and post-training is nice and important, but in-context learning turns LLMs from toys into tools.

Context window requires ram too.

I agree with you though, the title is misleading.

Title is perfect. Their typical audience probably understands "memory" better than "context window", but then if you've actually deployed these systems it's not difficult to go the other way, from "memory" to "context window" since the context window specifically is known to take additional VRAM over the model itself

"Only"

It’s mind bogglingly crazy that language models rivaling ones that used to require huge GPUs with a ton of VRAM to run now run on my upper-mid-range laptop from 4 years ago. At usable speed. Crazy.

I didn’t expect capable language models to be practical/possible to run loyally, much less on hardware I already have.


You have a SOTA multi-modal LLM running in your head at 20W, paired with a best-in-class sensor package and a top-performing robotics control unit.

There’s soooo much more to optimize.


I would argue that our sensor package is losing its lead very quickly-- audio performance is already on par with current tech and image processing is closing the gap very quickly as well (it helps a lot that silicon-based technology is much less constrained on bandwidth). Tactile sensing is still lightyears ahead, and I don't see that situation improving anytime soon...

chemical sensing is still quite good though.

But can it know love?

Word on the street is researchers looked at the weights and weights looked back. You’ll have to ask the weights.

> I didn’t expect capable language models to be practical/possible to run loyally

Now there's a fun typo. Hopefully not too much fun.


That sentence triggered memories of reading the huge paperback Asimov compilation my mom kept on the bookshelf as a kid.

I might have a go at installing one, what is a good source or install at the moment?

Ollama was the easiest way to set up local LLMs for me.

https://ollama.com/


With llama3.2:1b, llama3.2:3b and llama3.1:8b being the main ones I tried and found impressive.

If you don't care about docker packages being used as installers and your home directory invisibly used to store massive weight files in exchange for not having to deal with learning any configuration: ollama or lmstudio.

If you just want to play for a bit: llamafile

If you want granular control with ease of execution in exchange for having to figure out what the settings mean and figure out which weights to download: koboldcpp. (check out bartowski on huggingface for the weights)

These are all based on llamacpp as a backend, by the way.


I run ollama off a symlink to an external volume. It just feels neater that way, and can run any GGUF off of HuggingFace. I would like to know what configuration I'm missing out on, though.

lmstudio is very easy if you are running on local desktop.

msty.app is good



Given that the algorithms powering present LLM models hadn't been invented ten years ago, I have to think that they are (potentially) far from optimal.

Brains have gone through millions of iterations where being efficient was a huge driver of success. We should not be surprised if someone finds a new ML method that is both wildly more efficient and wildly more effective.


Perhaps LLM++ will start iterating the algorithms via synthetic data until they are far more optimal

Very clever, very meta, and it seems to work really well.

The two big take-aways for me are:

* It's possible to train a model to learn to summarize context from the attention matrix, based only on dot-product scores (k @ q.T * mask), regardless of how tokens are embedded.

* Once the model is trained, it will work with any attention matrix, even if it's the attention matrix of another model.

I've added this to my ever-growing list of things to try.


Is there any intuition for why it even works? It seems very unexpected.

The intuition is that the relative frequency at which past tokens get attention from future tokens is a good proxy for their relative importance.

The model the authors use, in fact, maps attention scores to features in the frequency domain.
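A rough sketch of that proxy, covering just the "how much attention does each past token receive from later queries" part; the real NAMM goes further and feeds a spectrogram of these scores into a small evolved network:

    import numpy as np

    def attention_received(q, k):
        # q, k: (n_tokens, d) queries/keys from one head (random here).
        # Returns, for each past token, the total attention mass it receives
        # from all query positions at or after it, under a causal mask.
        n, d = q.shape
        scores = q @ k.T / np.sqrt(d)
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)      # row-wise softmax
        return probs.sum(axis=0)                        # column sums = attention received

    importance = attention_received(np.random.randn(16, 64), np.random.randn(16, 64))
    # Low values are candidates for eviction from the KV cache.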


Boringness classifier! Pretty cool because this implies the large models already know what is useless and what isn't.

> NAMMs are trained separately from the LLM and are combined with the pre-trained model at inference time […]

So it's like a garbage collector for prompts?

More like lossy compression.

Does lossy mean you may see previous inputs change?!

Lossy doesn't necessarily imply that it is non-deterministic, just irreversible.

Only the model’s view changes, and it doesn’t have to match yours, just like you can participate in a long conversation without perfect memory that might, in retrospect, differ slightly from a recording.

with the added bonus that these things are unreliable and the compressor could drop important tokens

This is for inference right? Not training?

It's for KV caching. In most conversations that will mean inference. But you can do reinforcement learning using sampled sequences, and you could use KV caching to speed that up too, so that would be an instance where training could get a slight boost.

doesn't training require inference? so i guess it would help there too?

Yeah, but training requires the larger-memory data center infrastructure.

Training doesn't require inference. It uses back-propagation, a different algorithm.

Backpropagation happens after some number of inferences. You need to infer to calculate a loss function to then backprop from.

And we can finally sell the unoptimized models as Hi-Res (since I can still read the differences!).

Most people don't remember absolutely everything, just the important stuff.

This only reduces the working memory, not the base model itself?

Stop words with extra steps.

Really exciting news.

I'm a big fan of their papers, this one didn't disappoint

[flagged]


We are putting lots of optimisation efforts into lots of worthwhile endeavours.

I dunno, software seems to be getting worse, hardware is getting more expensive and both Microsoft and Apple are distracted by AI, not to mention NVIDIA who seem to have bet the farm on Deus Ex Shovel

Hardware is still getting cheaper all the time as far as I can tell.

Though I had thought you were talking about stuff like eg producing more corn on a given piece of land, or making more furniture from less wood or so. Or even just making better batteries and solar cells.


Like what?

Oh, I don’t know, how about reducing the search space/accelerating the search speed for potential room temperature superconductors? Or how about the same for viable battery chemistries?

It’s a good thing humanity can multitask.

Yeah, but it’s not an even sharing of resources. LLMs are consuming a vast amount of human attention (no pun intended) at the expense of technological pursuits that will more certainly generate value. As far as can be told, LLMs are reaching a plateau in terms of real-world value; batteries and superconductors have calculably more potential.

> search speed for potential room temperature superconductors?

and what if it's a dead end?


What if LLMs are a dead end?

Nothing to worry about. They're a dead end only if we find something better. Till then they're here to stay, and they'll likely still be used even after.

Would you like that with or without tokens?

Go ask ChatGPT how to make a room temperature superconductor

Ok, done. I can report to you that it helped me cut down my personal search space. Imagine what such a tool could do in the hands of a subject matter expert with rudimentary critical thinking ability and the faintest hint of a grasp of using the scientific method to verify claims, wow..

> Imagine what such a tool could do in the hands of a subject matter expert

Nothing. Because LLMs can’t spit out anything more than the corpus of information they’ve consumed. LLMs aren’t AGI.


You're making a fundamental error in your reasoning. An LLM's training corpus being fixed doesn't limit the system's total information processing capability when used as a tool by a human researcher. While the LLM itself can't generate truly novel information (per the data processing inequality), a human researcher using it as a dynamic search and analysis tool can absolutely generate new insights and discoveries through their interaction with it. The human-LLM system is open, not closed.

This is analogous to how a calculator cannot output any number that isn't computationally derivable from its programming, yet humans using calculators have discovered new mathematical proofs. The tool augments human capability without being AGI.

Your argument is essentially claiming that because a microscope can't generate new cellular structures, it can't help biologists make new discoveries.


Did an AI write this?

You have a fundamental misunderstanding of how LLMs work, which is why you think they are magical.

Of course if you play the LLM Pachinko machine you can get all sorts of novel output from it, but it’s only useful for certain tasks. It’s great for translation, summarizing (also a kind of translation), and to some degree it can recall from its training corpus an interesting fact. And yes, it can synthesize novel content such as poetry, or adapt an oft-used coding pattern in a flavor specified by a prompt.

What it can’t do is come up with a new idea. At least not in a way better than rolling a dice. It may come up with an idea that you, dear reader, may not have encountered, which makes it great for education.

I don’t have anything more to say, but you’re welcome to continue this discussion with an agent of your choice.


just read the literature arxiv.org/abs/2410.01720

[flagged]


Google Trends makes it seem like we're out of the exponential growth phase for LLMs: search interest is possibly plateauing.

A decline in search interest outside of academia makes sense. The groups who can get by on APIs don't care so much how the sausage is made and just want to see prices come down. Interested parties have likely already found tools that work for them.

There's definitely some academic interest outside of CS in producing tools using LLMs. I know plenty of astro folks working to build domain specific tools with open models as their backbone. They're typically not interested in more operational work, I guess because they operate under the assumption that relevant optimizations will eventually make their way into public inference engines.

And CS interest in these models will probably sustain for at least 5-10 more years, even if performance plateaus, as work continues into how LLMs function.

All that to say, maybe we're just seeing the trend die for laypeople?


Or maybe it’s Google Search usage that’s plateauing, as LLM interest is answered elsewhere?

I am only half kidding.


Well, Google Search trends are also only an imperfect proxy for what we are actually interested in.

Eg tap water is really, really useful and widely deployed. Approximately every household is a user, and that's unlikely to change. But I doubt you'll find much evidence of that in Google Search trends.


Well, Gary Marcus, a non-layperson, is helping spread the word that AI winter is again upon us.

But maybe statistical learning from pretraining is near its limit: not enough data, or not enough juice to squeeze more performance out of averages.

Though with all the narrow AIs, it does seem plausible you might be able to cram everything these narrow AIs can do into one big goliath model. I wonder if reinforcement learning and reasoning can manage to keep the exponential curve of AI going even if there are hiccups in the short term.

The difficulty of just shoehorning LLMs, as they are, into any and every day-to-day task without a hitch might be behind the temporary dying-down of the hype.


But "Large language model" as a topic in google trends is still in its peak. Maybe just everyone who would be the audience is already knowledgeable about LLMs so why would Google Search trends be able to keep rising?

ChatGPT is at its peak, and something like Claude is still rising.


For the general public, not the HN audience.

True. Microsoft's all in, Apple's all in, Nvidia is selling shovels, insurance companies are all in, police & military are all in, education is all in, office management is all in. Who is left to pump the line up?

No one is successfully using LLMs for anything other than customer-service-related things and text generation (coding, writing).

Rubbish. I built a pipeline to handle document classification that successfully took care of ~70TB of mostly unstructured and unorganized data, by myself, in a couple weeks, with no data engineering background whatsoever. This was quite literally impossible a couple years ago. The amount of work that saved was massive and is going to save us a shit ton of money on storage costs. Decades worth of invoices and random PDFs are now siloed properly so we can organize and sort them. This was almost intractable a few years ago.

Could you describe your stack and how it's much more effective than two years ago? I heard of printed-table OCR and doc classification years back.

But LLM is obviously able to organize documents and data much more intelligently than any ML algorithm from the past.

Tagging with in-house metadata like division, job code, who the project manager was, etc.

Very interesting. If I may ask: how are you handling the correctness issue? What’s the workflow there, if you're even able to spot a mishap?

We came up with different categories of tags. I should clarify: the AI didn’t actually do the sorting, it did the tagging so that sorting became tractable. After the tagging it’s just a matter of grouping, either by algorithm or by a human.

Organising the data, even if it's not 100% perfect, is much better than leaving it completely unorganized.
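For anyone curious what that kind of tagging pass can look like, here's a hedged sketch; `call_llm` is a placeholder for whatever hosted API or local model was actually used, and the tag fields are made up:

    import json

    TAG_PROMPT = """Extract these fields from the document and answer with JSON only:
    division, job_code, project_manager, document_type. Use null when a field is absent.
    Document:
    {text}"""

    def call_llm(prompt: str) -> str:
        # Placeholder for the real model call (hosted API or local model);
        # returns a canned answer here so the sketch runs end to end.
        return ('{"division": "civil", "job_code": "J-1042", '
                '"project_manager": null, "document_type": "invoice"}')

    def tag_document(text: str) -> dict:
        # The LLM only supplies metadata; grouping and sorting stay with
        # plain code or humans downstream.
        return json.loads(call_llm(TAG_PROMPT.format(text=text[:8000])))

    print(tag_document("ACME Corp invoice #12345 ..."))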

That's fantastic. Congratulations!

I mean, considering I did document classification back in 2010 using Tesseract, I wouldn't say it was impossible.

But obviously it would be far from the accuracy an LLM can achieve, e.g. generating search keywords, tags, and other types of metadata for a given document.

Yup, that's exactly it. By being able to tag things with all sorts of in-house metadata, we were then able to search and group things extremely accurately. There was still a lot of human work in the mix, but this made the whole task go from "idk if we can even consider doing this" to "great, we can break this down and chip away at it over the next few months/throw some interns at it".

Yeah, I don't know; hearing arguments that this was already done by ML algorithms sounds to me like "moving from place A to B already existed before cars". But it seems like a common sentiment. So much of what simple ML attempted to do required massive amounts of training data specific to your domain before you could use it, while an LLM can do it out of the box and actually consider nuance.

I think organizing and structuring data from unorganized data from the past is a massive use case that seems heavily underrated by so many right now. People spend a lot of time on figuring out where to find some data, internally in companies, etc.


Mere trillion-dollar industries. So far.

As far as I know, finance is not all in. I see Goldman Sachs doing experiments, for example, but it doesn't feel like they're convinced yet.

Finance is basically all of the reasons not to use (generative, LLM-based) AI, all in one vertical. The poster child of determinism.

Could you please explain?

Finance is a big industry, and they are doing lots of different things.


Sure, there’s lots of room for LLMs in helping to do clerical work, HR, that kind of thing. I was actually thinking of the direct management of funds and investments. So yeah, like probably all businesses, the ancillary functions can probably improve productivity using Generative AI with a minimal hit to quality.

You are right about the clerical work, but even pure finance is a lot more than 'direct management of funds and investments'. Have a look at Matt Levine's Money Stuff newsletter for a taste.

And I'm not quite sure why you mention determinism in the grandfather comment? Finance people have been using Monte Carlo simulations for ages. (And removing non-determinism from LLMs by fixing the seed of any pseudo-random number generator used wouldn't really change anything, would it?)


Nobody wants to lose money (savings) or go bankrupt because of hallucinations.

That's one small part of finance. (And essentially solved mechanically with index funds.)

There's a lot more to finance outside of that.


At the end of the day, the hard limit in finance is defaulting. Everything outside that is financial poetry (or engineering :-p).

I know every segment of finance loves to pretend that's not the case, because their jobs (and high salaries) frequently rely on that not being true (see the subprime mortgage crisis).


> At the end of the day, the hard limit in finance is defaulting. Everything outside that is financial poetry (or engineering :-p).

You are forgetting all about regulations and taxation (and how to work with / around them). And how to cleverly read documents, and exploit loop holes in contracts.

There's so much more to finance.

(And for eg stocks or commodities, there's not even any notion of defaulting. Defaulting only really makes sense when you have fixed obligations. 'Fixed income' is only one part of finance.)

> (see the subprime mortgage crisis)

That's actually a more nuanced topic than you think. See eg https://kevinerdmann.substack.com/p/subprime-bank-runs-and-t... and other posts by Kevin Erdmann on the topic.


Finance is all in on reading 10-Ks and generating summaries. If you have decisions in mind, I’ll be referring to the 1979 IBM slide until an HR LLM fires me.

That's not finance, that's just a generic application, and I think it's a terrible idea that won't last anyway.

Does this mean us plebs can run LLMs on lower-end Nvidia cards with gimped VRAM?

I don't think so. It seems to just lower the RAM needed for the context window, not for loading the model into VRAM.

interesting


