Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Speeding up LLM inference 2x times (possibly) (asciinema.org)
419 points by kolinko 8 months ago | hide | past | favorite | 114 comments
Here's a project I've been working on for the last few months.

It's a new (I think) algorithm, that allows to adjust smoothly - and in real time - how many calculations you'd like to do during inference of an LLM model.

It seems that it's possible to do just 20-25% of weight multiplications instead of all of them, and still get good inference results.

I implemented it to run on M1/M2/M3 GPU. The mmul approximation itself can be pushed to run 2x fast before the quality of output collapses.

The inference speed is just a bit faster than Llama.cpp's, because the rest of implementation could be better, but with a better development I think it can be a new method to speed up inference - in addition to quantization.

You could call it ad-hoc model distillation :)

You can change the speed / accuracy of a model at will, in real time.

Oh, and as a side effect, the data format allows to also choose how much of the model you want to load into the memory. You can decide to skip say 10-20-40% of the least important weights.

It's implemented for Mistral, it was also tested slightly on Mixtral and Llama. It's for FP16 for now, but Q8 is in the works.

The algorithm is described here, and the implementation is open source.

https://kolinko.github.io/effort/

I know these are bold claims, but I hope they survive the scrutiny :)




Essentially the algorithm is about pruning parameters on the fly and making the weight matrix sparse by setting least significant weights to zero, where "least significant" is defined in terms of the rank of the absolute value of the pruned weight within its group?

A search for model pruning turns out many results, including https://arxiv.org/abs/2305.11627 which discusses "magnitude pruning" as a baseline, and refers to https://arxiv.org/pdf/2301.00774.pdf, which asserts in the introduction:

> First, as shown in Figure 1, SparseGPT can induce uniform layerwise sparsity of up to 60% in e.g. the 175-billion-parameter variant of the OPT family (Zhang et al., 2022), with minor accuracy loss. By contrast, the only known one-shot baseline which easily extends to this scale, Magnitude Pruning (Hagiwara, 1994; Han et al., 2015), preserves accuracy only until 10% sparsity, and completely collapses beyond 30% sparsity.

I don't like how these papers boast their own methods by poorly implementing a baseline and describing their own methods using lots of mathematical jargon - the OP's blog post is a breeze in making the methods accessible to people with very little background.


Thanks! The last full month was all about making sure all the research is as replicable and as trustworthy as I can do it. The original implementation was extremely inefficient, and even once I had the Metal/GPU mmul op working fast, I spent much time to bring the rest of the implementation as close to the Llama.cpp as possible to allow for easier benchmarking.

As for the approaches - it seems the papers you mention do this statically, and din't produce an algorithm for actually speeding up computations with their 20-50% results - which was a large part of the difficulty. I'll probably take some time off one day and do a thorough review of the literature finally.

Ultimately I want to add a citations page with these and some other papers people have posted in the comments soon. I expect sooner or later someone will find this algorithm written up by someone :)

While development asked gpt-4 and googled trying to find information about this method - everything seemed either static, or focused on removing full dimensions/layers ad hoc with eventual retraining. Didn't find anything matching this idea exactly.


Thank you. If you want to contribute novel research, a thorough literature search of prior work is essential. And hopefully a comparison to previous approaches.

If you didn't find these works in your literature search, I encourage you to improve this skill, because it's important and valuable.


True, it will be easier next time because new llms allow to better searching through papers.

Having said that - the tradeoff with getting into a new field is that there is 50 years of history to dig through, and by the time you know exactly what you’re looking for, you don’t need to find it really. Although a better math framework or another approach would help a bit for sure.

I think in the future it will be way easier with better llm searches. Gpt right now fell short in finding some stuff, but it was already very useful in giving me an overview of the field.


One more opinion here: if you're looking around and nobody else has something obviously working, it's ok not to do a thorough search through hundreds of papers hoping to find someone else who came up with the idea. Get it working first, show it off, and if someone comes claiming it was their idea, respectfully acknowledge the prior work. But there's a gigantic gap between "wrote it in a paper" and "actually got it working".

Also, awesome job! Thanks for writing up some seriously cool engineering.


>But there's a gigantic gap between "wrote it in a paper" and "actually got it working".

The modern patent system in a nutshell. It's now suits registering patents rather than inventors in their workshops.


I think of the issues in looking up related work is the vast terminology used. There are many similar works in terms of ideas and implementation, yet they still are not reconciled.


I found gpt to be awesome in this though - it's good at mapping some fields into others.

While working, I kept asking it about certain ideas, and if the idea was known to him, he said "oh, that's this and it's called this and this".

As a fun sidenote, I noticed there are three levels of GPT's answers: - "This has a name and it's...." - "This will be difficult / very hard to do..." (it will never say impossible) - "This is an intriguing idea"

The last one is what one should aim for - I noticed that when it sees an idea that is absolutely new to it, but it doesn't see any specific reason it should not be possible, it will label it as "intriguing".

The topmost compliment I got, when asked about opinion it said more or less "it shows a deep and nuanced understanding of XX, you should publish" (and then began listing steps necessary - testing, finding collabolators etc).

All in all GPT was super useful for getting into a new field - I didn't want to reinvent the wheel, but I didn't want to spend months catching up with the 50 years history of this field. I kept repeatedly asking him for stuff like "I need to do this and this", and it kept pointing me in the right directions and providing nomenclature.


Have a look at https://arxiv.org/pdf/2306.11695.pdf which also uses the norm of inputs based on calibration


> I don't like how these papers boast their own methods by poorly implementing a baseline and describing their own methods using lots of mathematical jargon

This is a dogwhistle for bad papers - the denser and more incomprehensible they are, the more you can be sure the authors are hiding bad science in there.


I love this line in the gpu implementation section.

"Readers fresh to GPU programming may ask now - how does it work?

Readers experienced with GPU programming may ask - how the hell does it work?"


Haha thanks! :) As far as I understand I had to implement the memory reads and some other things the opposite way to what is considered a proper approach.

Would love to have that code reviewed by someone who actually knows stuff about Metal - this is my first gpu programming attempt


“““ So instead, let's flip the matrix, sort the elements row-wise, and revisit the multiplications from that direction.

This is called a Compressed Sparse Row (CSR) format by the smart people. To do the multiplication now, we take, say, the 1 from the vector, multiply it by 256, and add it into the output vector at the 3rd row. And so on.

Now, let's see what happens if we truncate the last column - the one with the lowest values. ”””

How does csr works with reduced numbers multiplication?


Can you rephrase the question? Not sure I get it.


I could not read from the text how multiplication with CSR format works in the context of optimization.

The key missing piece for me is that how to utilize CSR format to find the numbers to do multiplication, in other words, how does CSR format helps with picking the numbers to multiply with the vector.


Ah, I get the question now.

If I understand correctly CSR, it stores indexes and values as a list. I store them also as a list - that's why the comparison is there. The difference is that with CSR you store say 15% of the values from a given row. I store all of them, but use only the first X% of them. The X% varies and depends on the input vector.

They are stored sorted, from the one of a highest absolute value to the lowest absolute value.

It's after midnight so my explanations may not be too good now, but I hope the pseudocode on the page and the examples explain it slightly better.

I'll be fixing grammar / typos, and asking ChatGPT to rewrite the page text for me tomorrow to make it more readable :)


This seems similar to semi-structured (aka 2:4) sparsity, may be worth explicitly comparing. As far as I can tell by skimming, your technique:

- is optimized for apple silicon - ~2x speed at 75% sparsity - dynamic, depends on input, applied at runtime - can choose amount of sparsity

And 2:4 semi-structured sparsity:

- is optimized for GPUs with sparse tensor cores (nvidia ampere and beyond) - ~2x speed at 50% sparsity - static, applied to the model at rest - probably worse results than your technique at 50% sparsity

The interesting comparison I'd want to see is semi-structured sparsity results (50% sparsity, 2x speedup) vs your results at 75% sparsity (2x speedup).


Thanks for checking it out!

I also can’t wait to have more tests done.

Also, I chose Apple Silicon because it was easier for me to develop on. It is possible that the algorithm would also have good performance on other architectures.


Having used CSR it's not surprising, and some newer formats might have more mechanical sympathy like block ELL, since they avoid uncoalesced reads / gathers, tho the code is trickier.


Oh, nice to finally bump into someone who has experience with CSR!

bucketMul has few uncoalesced reads, and it uses a different data structure than the regular CSR - it's decribed here: https://kolinko.github.io/effort/bucketmul.html It splits each Matrix row into 16 parts, and chooses which ones are necessary to read. The writes are fully linear.

Not sure if I speak sense though, it's getting a bit late today, and it's been a long day ;)


Yeah, nice, you could also consider a low rank factorization as well.


I looked into it at the beginning, but as far as I understand, the modern models like Mistral are difficult to do LoRA on - you can use it to finetune, but the model itself doesn't lend itself to such an operation.

I'm still quite new to the field, so I'd appreciate some more insights into this, and a correction.


Nice idea and write-up! I'm also working in the field of sparsity for neural network inference. Two things come to mind that you should be aware of:

- Compared with a dense Matrix-Vector Multiply implementation, your algorithm adds algorithmic complexity, but reduces memory traffic. Because Matrix-Vector is usually memory-bound, your throughput increases when you can reduce memory accesses. For batch sizes > 1, the speedup would likely disappear very quickly because you are no longer bottle-necked by memory accesses. (Assuming that's a scenario you are interested in)

- As a comparison, I would also like to see another model with an architecture that is 2x faster (so if you apply your method to a 13B parameters LLM at 50% sparsity, how does its performance fare against a 7B parameter LLM? How does it compare to the same LLM quantized to half the baseline bitwidth?). If you could show that your sparse model produces higher fidelity output in the same amount of time (compared to an established inference framework), I'm sure that that would make for an interesting publication.

- Because you are omitting multications, your approximation error will likely be biased to have always a smaller absolute value than the true result. If you can add a correction term to compensate for that systematic error, this likely gives you slightly better performance.


> Compared with a dense Matrix-Vector Multiply implementation, your algorithm adds algorithmic complexity, but reduces memory traffic. Because Matrix-Vector is usually memory-bound, your throughput increases when you can reduce memory accesses. For batch sizes > 1, the speedup would likely disappear very quickly because you are no longer bottle-necked by memory accesses. (Assuming that's a scenario you are interested in)

It doesn't really increase an algorithmic complexity. It is O(effrt * inDim * outDim) for multiplications, O(inDim) for calculating dispatch and O(~inDim * log inDim) for finding the cutoff point.

O-metric is not great for GPU work, but in this case it roughly holds.

The main issue is the architectual limitations of GPUs - the algorithm needs more register / threadgroup / cache memory than the traditional ones, and that would be the main bottleneck then. And the fact that it's not straightforward to parrarelize the work, since every multiplication uses different buckets (just like with MoE models).

As for a bigger architecture - I did a lot of testing on Mixtral, which is essentially a 13B model, and my feeling is that it holds way better there. In terms of inference speed per effort it holds as well, and in terms of quality per effort it holds up readable results until 12-16%, not 20-25%. The tests I did were limited, and I broke the Mixtral implementation when I implemented Mistral, so I don't have hard data on this - but will definitely fix it soon.

My intuition also is that the bigger the model the more you can shave off from effort.

- Ommitting multiplications - that was my guess also, but unintuitively it's not! I have some charts, but didn't manage to prepare them for a release yet.

Because the values in the matrix are evenly distributed in positive and in negative numbers, after a certain threshold there is not much drift in the resulting values.


That looks like fantastic stuff. I just want to point out the 15ms delay looks similar to 60Hz vsync (16.7ms), if you are updating the screen once per token, maybe that’s causing a sync somehow?


Nah, that's not it, I measure the CPU & GPU work separately, and 15ms happens between the kernel invocations. It also happens when I don't print out text.

Thanks for the idea though! I treat it as the first community contribution :D


Thank you for this really cool and open contribution!

I will be watching llama.cpp closely for them to implement this!

I’ve been looking for ways to speed up CPU inference and I really like this idea of “effort”


Hahah, thanks! It was a marathon to get develop this, and I'm glad it reached the front page.

The name was proposed by chatgpt :) It claims it doesn't recognise this approach - so there is a chance it's really a new thing.

I want to reach out to llama.cpp and the others - I hope it gets implemented. I considered just writing a patch to llama, but c++ and the scale of that project was beyond me.

As for CPU inference - it should speed it up just as well. But thanks to the fact that it can load up a fraction of weights (e.g. just 70%, skipping the least important ones), it should be possible now to run models on less VRAM than before (still, Q8 needs to implemented though).

Funnily - when I tried comparing benchmarks to llama.cpp, I couldn't find speeds for 7B/FP16 on MB Air 16GB, because it's impossible to run with regular methods. It is possible with Effort.

Ditto, I was running full resolution, but cropped, Mixtral on my 96GB M2, even though it usually takes 114GB ram. I just loaded 75% of weights, and it was working smoothly. (before I messed something up with implementation and it now produces crap output - needs a fix)


I would imagine the importance of weights depends on the prompt. How do you decide which weights are important?


Yeah, that is the point more or less - it dynamically chise the weights layer per layer depending on the internal state.

A bit technical explaination here. https://kolinko.github.io/effort/equations.html


> It is possible with Effort.

"All things are possible with enough effort." -- Dad.


Hahaha :)


Implementing this approach could significantly enhance the adoption of LLMs within mobile phone libraries and other compact devices. I highly recommend opening an improvement issue for llama.cpp.


Nice writeup! Very curious about the performance per VRAM with this compared to just quantization. Any plans to implement a cross platform version?


Per VRAM - not much netter, because it still uses all the weights, just not all the time.

I mean - it can also load less weights, but quality seems to degrade quick after offloading more than 20-30% weights.

In other words - this algorithm decouples inference time from VRAM use.

Having said that, I’m curious as well if using effort you can get better results on Q8 cropped to 75% than on Q6.

But it’s still probably a few weeks to get the implementation polished enough to be well tested.


> Having said that, I’m curious as well if using effort you can get better results on Q8 cropped to 75% than on Q6.

This is what I wanted to ask. This seems like the same kind of optimization as quantization, sacrificing a bit of quality for performance by discarding some data. So then the question is, which is better, and how do they combine?

You could even potentially get different results at different points in the scale. Maybe Q8 cropped to 75% isn't better than Q6 but Q4 cropped to 75% is better than Q3, or vice versa.


Below Q8 I think it will combine poorly - bucketMul needs to keep a few bits for an index. It can be offset by the fact that the numbers from each bucket are from a limited range. My intuition is that with Q8 this trades off roughly evenly - meaning bucketed Q8 may store as much info as regular Q8, but it will be more difficult with lower quantizations, and cross the boundary of impossible after Q5. Don't have math to back it up, but perhaps some information theoretitian could pick up the calculations.

You could say that these are divergent paths in the future developments (if the results hold for Q8) - perhaps you can crop the Q8 models to Q6 sizes and inference speeds of Q2/Q4. On the other hand, the wildly optimistic scenario is that the bucketMul speeds will overcome even Q1, with dynamic computation pruning - token by token and layer by layer, by having a separate small network that chooses effort levels based on the input data. (someone already proposed it in the thread).

For now, the most important thing is fixing the bugs, doing more serious tests for FP16, and showing how the charts look for Q8. Especially the Q8 is the biggest unknown, although the initial results are hopeful.


I know this doesn't retrain, but I wonder if approaches like this plus quantization could get back any "lost" quality with some post training.

It's great to see, and to know in one's mind how much likely performance and cost will be better in the future.

I know it's fun to work on, but I also want to say THANK YOU for developing open source.


At first glance, that sounds like it could work. From what I've read, there seems to be two main ways to regain some quality with quantization: post-training which happens after, and quantization-aware training which quantizes during training but leaves the activations and gradients full precision



Similar theme, but they skip whole neurons, in my case it’s down to a level of single weights.

From my experiment, skipping whole neurons (so whole rows/columns of matrixes) didn’t allow for such good results. In my case 30-50% neurons whole are skipped with 15% effort, but the rest is used partially still.

There are a few papers on a similar theme that a friend sent me today morning - I plan to add them in the citations part


awesome, looking forward to seeing how your results perform. I tested powerinfer on smaller models and didn't see large performance gains.


Could you break down a bit more about why you can skip so many calculations ?


It’s explained in detail here:

https://kolinko.github.io/effort/equations.html

Long story short - I kind of sort them and pick only the top % that would give a highest result.

One part is choosing them though - I think this was done before in some papers. But the second part was an implementation of multiplication that is efficient on both gpus and cpus when choosing weights almost at will.

All explained on the site, but I just got feedback that it may not be easy enough to read, so I’ll push it through gpt for grmamar fixes soon :) It’s also a bit complicated as an algorithm.


Hmmm. Very interesting. I wonder if you could speed up your clever approximation still further with this approach https://arxiv.org/pdf/2205.09120.pdf


Nah, sadly this most likely will not work with Q1/Q1.5 implementations - initially I was playing with monte carlo approximations (before I arrived at bucketMul), and the convergence was very slow for binary/ternary networks.

Or, in simpler terms - if you have just ones and zeroes, and minus ones, you can remove zeroes from calculations, but that's it. No good method to figure out which ones are more important than the other ones.

Also, there are no bits left to store positional information when bucketing.

There are some paths that could be explored in this fashion, but it would require a redesign of the algorithm from the ground up.


Late response: thank you. Saved me some time thinking further on this.


A bit like calculating the fastest route for a car, you probably don’t need to calculate the distances for the opposite side of earth if you will not drive there. Then multiply that by a billion, but the optimization still holds.


Nice analogy, thanks! Yes - you still need a map, because one day you need this road, another day you need info about another road, but you're not using info about all the roads all of the time.


I'm wondering if you could sort the two inputs, add the indicies for the multiplications together, then take the largest of that.

In your 0.1 example, 1000 gets index 2, and 0.1 index 0, combines to 2. This will tie with the 1*8, but I think it would even out with larger vector lengths.

Edit: I could be wrong but I think you can precompute the indices for the weights in advance without a prompt, then you won't need to perform those sorts at runtime.


That was one of the ideas I was also considering originally! Don't remember why is skipped it though - may have been because I tried the published one and it worked well enough during the first tries.

As for indices for weight - not sure if I get what you mean, but the weights are sorted as a part of precomputations. Sorting is not done in runtime, because that would kill any efficiency - the moment you need to read all the weights, you've lost, because it's the memory reads, not computations that matter. If you can skip one memory read by doing 5-10 multiplications, it's a good tradeoff.


I think this is the overall idea behind the MoE LLM models right? MoE just expands upon this idea by learning which sets of weights to use


I’d say that this expands on MoE really - MoE chooses dynamically which groups of weights may be needed, but it’s whole matrixes. Here it’s ingle weights.

Also, this works on top of MoE beautifully - most of the development and testing was done on Mixtral and it was getting (anecdotally) even better results - getting down to 15-18% effort before seriously losing quality.

I decided to release the Mistral version, but Mixtral was fully operational a few commits back :)

Also, the cool thing - because you can load only the top say 70% weights, I was running Mixtral full precision on my MB 96G - there were no bemchmarks for this in other impls because others need to load full model into the memory.

The real question is Q8 performance - I didn’t implement it fully so far.


Could you add a git tag for the last commit where Mixtral was working?


I think it was somewhere around that tag:

https://github.com/kolinko/effort/releases/tag/5.0-last-mixt...

Cannot rerun easily any more, because the underlying model/weight names changed in the meantime. It doesn't help that Mixtral's published .safetensor files seem messed up, and I needed to hack a conversion from pytorch - it added an extra layer of confusion into the project.


It also feels similar to mixture of depths (https://arxiv.org/abs/2404.02258).

Being able to apply this post-training is pretty cool though, makes it easier to use across a wider range of setups.


Yes! I like that, and I saw that paper last weekend iirc. I think these MoD/MoE and other similar methods are highly compatible, and in a similar style.

I was originally afraid that this method wouldn't be compatible with MoE and the other methods, but fortunately, at least for Mixtral, there seems to be an amazing synergy.

By the way, other tasks have higher priority now, byt there is an interesting observation about MoE. In MoE you get two experts chosen, and each expert has a different weight attached to it - e.g. expert 1 has 75% weight, and expert 2 has 25% weight. Perhaps this could allow to scale the effort to give 75% effort to one expert, and 25% to the other. There are some issues there due to non-linearity of the layers, but perhaps there is something to it.


Could you explain the pseudo code in your equations page? Is the second approxMul call a typo (also the capitalized V)?

def precompute(W): W = W.T probes = get_probes(W) W_idx, W_val = sortMatrixRows(W)

def approxMul(v, W_idx, W_val, probes): cutoff_chart = v * probes cutoff = topK(cutoff_chart, effort) approxMul(V, W_idx, W_val, cutoff)


oh, thanks, I fixed it. No idea what I meant there originally.

There are still a few typos on the page, I'll be fixing them tomorrow - it's midnight now, and my mental batteries are slowly drying out :)


Did you deploy your changes? I just refreshed and that is still there on the page[0].

[0] https://kolinko.github.io/effort/equations.html


I've been trying to think about how you'd amp up the batch size here. it's a bit tricky since the memory access would be way higher, but I think you can actually still save on compute by chunking things up in a clever way to utilize the tensorcores


What would definitely help would be a better understanding a GPU handles myVal - does it go to the cache, threadgroup memory or where. Or perhaps decompilation of the source kernel to see what's really going on there. If we knew, we could calculate the optimal parameters given an architecture.

There is also another approach I tried - I've kept it in the wildMul.swift/metal . More by the book. Splitting work so that one thread keeps track of only one or two output dimensions when going through the buckets. Didn't manage to get it working with a decent speed though.

The idea would be to make sure that one simdgroup reads 4/8 buckets in a batch (so still ~8-16 consecutive bytes which is considered minimum for an optimal read on Apple Silicon) and splits each single bucket among 4-8 threads. One thread might then need to remember only 2-4 dims in myVal, and this might stay in internal GPU registers.

Not sure if I'm explaining this all correctly though.

Having said this - the priority now will be to remove the overhead to get the whole inference time closer in speed to just the multiplication speeds, but above all - fixing the bugs causing suboptimal results at 100% (and breaking Mixtral), and doing Q8. The algorithm needs to work in Q8 to be really usable.


Looks like the models are missing on Huggingface, I've logged an issue: https://github.com/kolinko/effort/issues/3


Ah yes, forgot to make the repo public. Thanks a ton for pointing it out and writing the comment - I'd miss it if you didn't point it out on HN.


we need this into llama.cpp it seems somewhat stable down to 40% effort


Yes! I hope, if it’s proven, that it will be implemented into the main inference engines.

40% effort is only a bit faster than a full base multiplication, but I hope both the speed and the accuracy could be improved further.


I was wondering if something similar can be done for CNN too.


A friend who first heard of the method immediately suggested this may work, perhaps even better in Diffusion models.

It is really a drop-in replacement for the regular matrix multiplication. The data structure is a bit more painful to work with (you have three datasets, not just one to represent a weight matrix), but it shouldn't be too difficult for devs of the existing inference engines to implement it for a test.

Half of my challenge was that I wasn't knowledgable enough to just patch llama.cpp or MLX and use their engines with bucketMul. That's why I opted for making my own - still not sure if it was a good choice to build everything from the ground up though, although I'm proud of the name :)

Finally - the basic math behind approximation suggest that this should work with all the models - cosine similarity score is 0.99 until the magical 25% mark for most of the matrixes I tried. It can vary within a model though - e.g. in Llama, on the first layer, wq/wv/wk could be easily approximated with 5% effort, whereas some deeper layers at 25% had just 0.90 cos score - still seemed enough to not lose coherency within the model.



Okay now add a small model that decides how much effort is needed in each inference step and we are good to go


Yes! That would be awesome. Especially since there are ~32*6 independent effort settings for every single token.

I tested the most basic implementation, with a flat effort setting for all the muls, but I bet the results could be pushed even further with such an approach. Or even with just doing some ML to figure out which layer/matrix needs more and which less effort.


If this is within reach of doing it sounds like the joke might be worth exploring?


Great work! One thing: it seems the hugging face link doesn't work... I get a 404


It should work now :)


how does it compare to 8-bit/4-bit quantization in terms of speed/accuracy?


hard to say for now, I’m curious as well, but I used simpler tests so far because of the implementation issues - most test suites are geared towards testing models and not model implementation.

I didn’t want to wait any longer with the release, but better tests will be coming soon I hope. Anecdotally, I think 30% effort should be comparable to Q8z

More importantly, this algorithm should work on top of Q8. The quality is not yet certain though - I could use help with the implementation.


If it sounds too good to be true, it probably is not.

If removing weights improves some metrics, that may be a clue that the model is not optimal in some sense.


The algorithm still uses all the weights, just not all the time - just skips the weights when they are not important given an input vector.

Also, approximation methods, as a field, are not new and they have shown their use.

Having said all that, extraordinary claims require extraordinary evidence - that’s why I hedge the communication messages. It’s „probably” until we get serious tests going on


The metric that's improved is computation speed, and it's achieved by essentially changing the computation (by not performing some computation that likely doesn't have large impact on the results).

Give that it's a different computation, you could argue that Mistral+effort is a new model with the improved metric of quality per amount of computation performed.

Otherwise - given that for every different input there's a seperate set of weights in the model that are excluded - I don't think you could conclude from this (if it holds up etc etc) that the base model is not optimal.

In a similar sense, quantization improved the "quality per model size" metric, but I don't think people are arguing that Mistral is less optimal than quantised Mistral (unless you're speaking about literally that metric). On the other hand, if you're targeting that metric specifically, then it would make perfect sense to say that quantised Mistral is more optimisal for it.

I guess it comes down to optimality being dependent on the metric you're looking at, and there being many things you might want to optimise for.

To note again, if this technique holds up, it's better than model distillation (just get rid of some of the weights) because for some inputs those weights could matter and this technique should (iiuc) account for that somewhat. To me, this is what it sounds like you're referring to when saying:

> If removing weights improves some metrics, that may be a clue that the model is not optimal in some sense


Yeah, more tests are needed. I got some feedback on using KL instead of the token similarity - initial tests seem to show that it is workable (compared to Q8), but not awesomely amazing - will be working on that next week and publishing.

As for treating effort+Mistral as a separate model - I wouldn't do that comparison. The model stays the same, all the weights from it are still being used, just not all of the time - we don't really lose information from the source model.


Wait, does the second "it" refer to the true part? Because traditionally, it refers to the "too good" expression. So you'd say, "it _is_ too good to be true".


Name idea: lobotomize


YES! That was an idea I had a year ago actually! I wanted to lobotomize models, cutting out weights and leaving only functionality that we want. Unlike distillation it would be a much more brutal process.

The idea went nowwhere for the time being, but if I ever get back to it, the project surely will be called Lobotomizer.


do you have a simple python impl? :)


It originally started as a fork to Recmo’s cria pure numpy llama impl :)

https://github.com/recmo/cria

Took a whole night to compute a few tokens, but I used it to do the first tests.

Also, my friend pasted the paper to claude and it produced a working basic impl instantly :D

But in all seriousness - I think MLX implementation would be doable, or a wrapper to the Metal gou functionality


Will share python implementation soon as a kind of executable pseudo code which then can be ported to any platform.

This project is kind of like ultimate nerdsnipe as math is quite simple, you don’t need PhD to understand it and actually implementing things would teach you linear algebra faster vs just mindlessly doing exercises sets.


Haha yes :) Publish it, Kacper!

The project is a nerdsnipe for math geeks, because there are multiple small things that beg to be proven / described by math there. For example - what's the tradeoff between the number of bits we loose when embedding position vs the bits of information that we gain by knowing which bucket a weight belongs to?

In other words - is it possible that when storing weights in the bucketed form we can actually end up having a higher precision than using a regular form? For Q8 we get just 4 bits to store the weight (and 1 bit for sign, and 3 bits for location), but these 4 bits need to express numbers from a smaller range than before.


not sure why I got downvoted for asking this. seems like a reasonable thing in the ML space

anyway here's a numerical simulation written in PyTorch for those who want to consider this algo on their projects before fully integrating: https://gist.github.com/bwasti/78d7058aad7f42dc893d906c877b7...

happy hacking


If you got downvoted, it was briefly. That was a good question. PyTorch/numpy is awesome for initial testing.


My takeaway is that this proves what was said on the recent latent.space podcast with David Luan from Adept.

https://www.latent.space/p/adept

"I think is going to commoditize a lot of the regular LLMs and soon regular multimodal models."

In other words, if you train your own models, you will not get to take advantage of breakthroughs like this that start with open models (like Mistral).

All the advantages are going towards the open models and this is an existential risk for OpenAI and other closed model companies.


In this case though, the algorithm should be just as useful to closed models as to open models. There is nothing there that is optimised specifically for Mistral - aside from the hard-coded dimensions in multiple places in the code :D

Having said that, it's awesome to have open source models out there, and I hope they will ultimately win in the end.


That seems fundamentally flawed to me. Closed providers can copy techniques from open models but not vice versa.

To me that reads as closed will always be a step ahead not an existential risk to OpenAI.


Maybe, but theres nothing that stops OpenAI from stealing these tricks.


Exactly, that's the problem with the current state of things with open models, the players that keep their secret sauce keep an edge over the people doing things in open while benefiting from all of their work without contributing back.


That was the claim with a lot of the software in the past, but open source won in many places in the end.


At the end of the day, if their profit margins aren’t good, it doesn’t matter whether their competition is open source or not (which is often where OSS wins). I think we are seeing that AI isn’t the slam dunk for increasing productivity or we would see companies like UIPath being profitable. I don’t think we’ve seen anyone net a profit on AI software and the only company that has been investing since at least 2017, Apple, gets zero credit for their contributions and commercialization of the tech. I think about how Amazon abandoned the AI-powered, checkout-free tech because the margin of error stayed stubbornly high for too long. The clock is ticking on the industry and some players, like Apple, already have found it isn’t profitable (well to their standards of 60% return on investment).


Depends on a field.

The project from the thread would take me an impossible amount of time without GPT. Even the page itself would take me twice as long to generate - the charts were done by pasting source data to GPT and GPT writing plotlib code for me to chart them, and the equations were originally written by GPT as well, because I wasn't familiar with MathJAX.

Ditto with the code - a bunch of it was written by GPT originally, since this is my first Swift/Metal project. I kept telling it what I want to do in Python, and it kept rewriting it in Swift/Metal until I learned the latter.

The name "effort" was also invented by GPT. Originally, internally, I was using "quant" but that would be confused with quantization. I considered "perc" from percentage - but that's ugly. I described the project to GPT, and it suggested "effort" as a metric.

As for self-checkout - in Poland we have Żabka Nano which is still going on, and seems more solid than Amazon, but of course the time will tell :)


> but open source won in many places in the end.

The biggest companies in the world are running closed-source software for profit that uses open source foundation while barely contributing back, so it's really not the counter-argument you think it is. And that's no wonder we're seeing open source companies going for source-available licenses now (Redis, HashiCorp) or other kinds of restrictions (RedHat), because they were helpless regarding the parasitic behavior of the big bad wolfs.


In these fast moving early times of LLMs, they can maintain this advantage with simple things like proprietary datasets and greater compute.

The difference in quality between the best model that can be made with such proprietary utilities and without is likely to decrease over time as open datasets of greater quality are published and the field matures.

The difference in quality and number of competitors ultimately pays the bills and the harsher the competition is, the less money there will be, for each individual company, to maintain their possibly dwindling proprietary edge.

The greater access to compute is an edge companies will likely hold for a while. It will be interesting to see how much open models will be able to catch up and how great of an edge proprietary models will maintain.


It's extremely unlikely anyone will take any of this.

Quick take:

- it was 15x slower than llama.cpp when I used Apple's new proprietary ML framework on my MacBook

- So I made it possible to skip arbitrary amounts of work.

- I identified an arbitrary tradeoff that seems arbitrarily good to me.

- I've confirmed this by making GPT-4 write some prompt with questions. Then I had the normal version answer, and the "skip arbitrary work" version answer, and it LGTM.

- So I threw it up on GitHub, then on HN with a title "LLM inference 2x faster (possibly)", and people missed: [on my laptop] [in the ML framework I'm forcing myself to use] [based on an eval I made up] [based on an an eval where I am the evaluator]

This *really* shouldn't have the title it does, very misleading.

Author, please feel free to correct me, I'm sorry for not taking the time to find a gentler way to communicate this. I hope you kick ass. You did put possibly in a parenthetical, but its carrying the weight of the world here, people just see LLM 2x faster. That's why everyone is spinning off into grand speculation land, which I also see you valiantly commenting to dissuade


The whole inference is slow, but it's matrix multiplications that count. They work reliably on all the Macbooks that I tested - at 50% effort it's the same speed as the state of the art matrix multiplications, at 25% they are twice as fast.

The apple's MPS matrix multiplications from Apple are comparable in speed to the speed of Llama.cpp and the other models. When I was doing tests, I was comparing the Llama.cpp benchmarks ( https://github.com/ggerganov/llama.cpp/discussions/4167 ) to Apple's MPS - they match very closely. And then I was comparing Apple's MPS to my results.

Even if the end-results would show that the models somehow break (which they might on Q8), there is no other implemented method right now that would give you such speedups with matrixes of 25% sparsity. The usual methods break even with full matrix multiplications around 15% mark, and show speed improvements under 10% (as far as I know, but I'm new to the field, so I wait to be corrected).

As for the other metrics - I hope to get help from the community to get the implementation done properly. So far it's been 3 months of work 12 hours a day - even during Easter - to get this version going. It is as far as I can push it without the community support, which I'm happy I received over the last hour.

Also, I'm not sure what you'd expect really. A full production ready system on the day one? From a solo developer? Seriously? :)

Let's get the flame war going! :D


Nah it's good work, you'll be able to flame me in a couple weeks...months?..too when I ship my yawn-worthy yet another llama.cpp / OpenAI wrapper. :p

I'd love this knob, particularly in llama.cpp, inference is a bit too slow on Android, 6 tkn/s for 3B. just can't stand it when people don't actually read anything but the title, and go crazy overboard, like, how are we in a thread where people are like "oh this confirm local models will definitely win like I heard on a podcast" and "big bad OpenAI will steal this".


Hahah thanks, although I was hoping for a flame to get adrenaline flowing to push through the night :D

I also hope there will be an extra knob - or more like knobs, because effort can be regulated smoothly layer by layer, token by token, matrix by matrix. Think more like an equalizer, not a volume control :)

The biggest question right now is how (if) it will perform with Q8 and with smaller models. The risk is that the quality dropoff will show up closer to 40-60% at Q8, negating the performance gains.


I think you are commenting on the "stealing" reply and not my original comment. And, I think you are making my point stronger.

OpenAI could easily (and will) put out a blog post or tweet saying "next models do inference 2.5x faster!" Koliko did that, or maybe he didn't and someone else put words in his mouth. I don't really care: I can validate and test your comments here (and they are great!) and I can try his experiments myself.

I cannot do that against "GPT-5-2.5x faster (c) 2024"-42B (because it isn't released yet publicly). Putting a paper and some vague ideas on Arvix isn't really doing much these days except adding to the confusion. Truly open work like koliko is doing is really exciting and feels like it can only be done against truly open models like Mistral.

Oh wait, Mistral isn't fully open either (ducks...).


There used to be a class of software called freeware - you could download and use it without restriction, you just couldn't resell it, or have the source to modify it. Llama and similar models are like freeware - an inscrutable binary blob crafted to work with other software, except instead of a VM or native OS environment, you have llama.cpp or similar software that runs the AI model.

Mistral is open source, in that you can do anything the Apache license allows you to do, even package it into your own product and resell or modify it. We're missing the dataset details, the source to the software that produces the model, similar to not having access to an operating system and special compiler software. That's not a huge deal, because people don't have the resources to make use of those large datasets or the Mistral training software, which is likely highly tailored to their own training and development pipeline, and wouldn't do much good for anyone without at least a pod of A100's of their own.

Weights available and other terms are being thrown around, and Meta and the like are calling their stuff "open" but that use of the term bears little resemblance to the use of the word by the open source community.

The public Mistral models have open source licenses. The model can be used like open source software. The terms are permissive and free, requiring only attribution. Meta's license scheme is novel and not open, with arbitrary lawyerese and absolutely, 100% will bite someone in the ass when the threshold between "annoying to sue" and "profitable to sue" gets exceeded by someone using Llama in a way that's technically incorrect. Right now, Meta wants the goodwill more than they want a couple million dollars chasing a couple dozen startups.

If the model doesn't have an open source license, it's not open. It might be freeware. Llama is freeware. You can, technically, do whatever you want to it, but try to not attract too much notice or be too successful with it.

Mistral, by using Apache licensing, couldn't go after you even if they wanted to, unless you do something deliberately stupid.


> If the model doesn't have an open source license, it's not open.

Actually, OSS comes with tons of strings attached that make the term open dubious. And there are many ways they could come after you legally. Apache, GPL, etc all have terms and conditions, you have to contribute back X Y and Z, agree to our manifesto, and so on.

The only truly free license is MIT. Go build a billion dollar business, change it however you want with no strings attached and you can express the license terms in a single paragraph.


    Apache is stricter because it requires you to list all the modifications you make to the original software. A copy of the license, with copyright and attribution notices, must be included in the source code and any modifications. A specific section on contributions outlines the conditions for accepting external contributions to the project. The MIT license requires a copyright notice, but it does not have explicit requirements for the license or attribution notices.
source: https://www.mend.io/blog/top-10-apache-license-questions-ans...

Apache 2.0 is almost as permissive as MIT, and better suited for some cases. I love the MIT license, as it allows the most freedom for users and creators, across the board. Apache 2.0 is second best, but might better in a formalized organization that wants more structure and formalized documentation requirements.


Here to answer questions!

A friend of mine has published the original link on HN before me, so I hope a double post won't be an issue :)


We've merged them. (other one was https://news.ycombinator.com/item?id=40067489)


Thx! Although I'm happy you took your time and now I can brag until the end of time that I got 2 out of 5 top slots for a moment :D


Please, please, please call the final product Halfwit.

Seriously though, this is a very interesting biologically inspired idea, since not all neuronal pathways fire all the time.

It seems to follow that, if you can predict which weights you won't need, then you should be able to compress the model architecture permanently, at least for certain use cases.


Haha halfwit! I’m waiting for such a fork.

As for predicting the weights - not necessarily so. It seems most weights are being used, just not all the time. Kind of like that saying that humans are using just 5% of their brain - perhaps they are, but it’s various parts of the 5%.

Interestingly, Effort works just as well on MoE, if not better. I did most of the development on Mixtral and I think it go even to 15-20% effort before losing quality, but there is some sort of a bug right now that prevents the inference on Mixtral.

It’s on a todo to fix, but I didn’t want to delay the release because of it.


Effortless would be another great name (since you are literally reducing effort to get speed). OK, maybe not "great", but "an option if you're going for puns".


Halfweight. Half the weights, half the wait, half the wit.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: