Llama.rs – Rust port of llama.cpp for fast LLaMA inference on CPU (github.com/setzer22)
202 points by rrampage on March 15, 2023 | 24 comments



I've counted three different Rust LLaMA implementations on the r/rust subreddit this week:

https://github.com/Noeda/rllama/ (pure Rust+OpenCL)

https://github.com/setzer22/llama-rs/ (ggml based)

https://github.com/philpax/ggllama (also ggml based)

There's also a GitHub issue on setzer's repo discussing collaborating a bit across these separate efforts: https://github.com/setzer22/llama-rs/issues/4


Do you know if any of them support GPTQ [1], either end-to-end or just by importing weights that were previously quantized with GPTQ? Apparently GPTQ provides a significant quality boost “for free”.

I haven’t had time to look into this in detail, but apparently llama.cpp doesn’t support it yet [2], though it will soon. And the original GPTQ implementation only works with CUDA.

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa/

[2] https://github.com/ggerganov/llama.cpp/issues/9


As far as I know, the two ggml ones are basically just llama.cpp ports that include the ggml source code, so if the support is not in llama.cpp, I don't think it's in these implementations either. Although maybe that also means they'll gain that ability as soon as llama.cpp does.

I'm the author of the first one, rllama, and it has no quantization whatsoever. I don't think any of these are improvements over llama.cpp for end users at this time, unless you really, really, really want your software to be in Rust in particular.



Two of those links don’t work.


Anyone know if these LLaMA models can have a large pile of context fed in? E.g. to have the "AI" act like ChatGPT with a specific knowledge base you feed in?

I.e. imagine you feed in the last year of your chat logs, and then ask the assistant queries about them. Compound that with your wiki, itinerary, etc. Is this possible with LLaMA? Where might it fail in doing this?

(and yes, I know this is basically autocomplete on steroids. I'm still curious hah)


You can feed in a lot of context, but the memory requirements go up and up. It was crashing on me in some experiments, but I saw a pull request that lets you set the context size on the command line. I was using the 30B model, and with 500 words of context it was using over 100GB of RAM.
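
To give a rough feel for the part of that scaling that comes from the attention cache, here's a back-of-the-envelope sketch in Rust. It assumes LLaMA-30B's roughly 60 layers and 6656-wide hidden state and an f32 K/V cache; the model weights and ggml's own eval buffers come on top of this, so real usage is higher.

    // Per-token K/V cache cost for a decoder-only transformer:
    // two tensors (K and V) per layer, each n_embd floats per token.
    fn kv_cache_bytes(n_layer: usize, n_embd: usize, n_ctx: usize, bytes_per_elem: usize) -> usize {
        2 * n_layer * n_embd * n_ctx * bytes_per_elem
    }

    fn main() {
        // Assumed LLaMA-30B-ish shapes, f32 cache, 2048-token context.
        let bytes = kv_cache_bytes(60, 6656, 2048, 4);
        println!("~{:.1} GiB for the K/V cache alone", bytes as f64 / (1 << 30) as f64);
    }

Doubling the context size doubles that cache, which is why the context length limit matters so much for RAM.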

You probably would be better off fine-tuning the LLaMA model like they did with Alpaca though.


I cracked up while reading the README. To the author: may you always get as much joy as you're giving with these projects. Nice work!


I feel like https://github.com/ggerganov/llama.cpp/issues/171 is a better approach here?

With how fast llama.cpp is changing, this seems like a lot of churn for no reason.


Oh hey, that's my issue.

It seems like most of the work would simply be moving the inference stuff (feeding tokens to the model and sampling) outside of main(). Most of the other functionality, such as model weight loading, is already handled in its own functions.
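
To make that concrete, here's a sketch of the sort of API that falls out of such a refactor; the names and shapes are hypothetical, not the actual llama.cpp or llama-rs interfaces.

    // Hypothetical library surface once inference is pulled out of main():
    // the caller owns the session and receives text through a callback
    // instead of it being printed from inside main().
    pub struct Model;            // weights, vocab, hyperparameters
    pub struct InferenceSession; // K/V cache, RNG state, tokens so far

    pub struct InferenceParams {
        pub n_predict: usize,
        pub top_k: usize,
        pub top_p: f32,
        pub temperature: f32,
    }

    impl Model {
        pub fn start_session(&self) -> InferenceSession {
            InferenceSession
        }
    }

    impl InferenceSession {
        /// Feed a prompt, then sample up to n_predict tokens, handing each
        /// decoded piece of text to `callback` as it is produced.
        pub fn infer(&mut self, model: &Model, prompt: &str, params: &InferenceParams,
                     mut callback: impl FnMut(&str)) {
            // tokenize(prompt) -> evaluate -> sample -> repeat
            let _ = (model, prompt, params);
            callback("...");
        }
    }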


I am working on https://github.com/ggerganov/llama.cpp/issues/77 to "librarify" the code a bit. I eventually did want to Rewrite It In Rust™, but OP just beat me to it.


Great job porting the C++ code! Seems like the reasoning was to provide the code as a library to embed in an HTTP server; can't wait to see that happen and try it out.

Looking at how the inference runs, this shouldn't be a big problem, right? https://github.com/setzer22/llama-rs/blob/main/llama-rs/src/...


It shouldn't be too much effort to extend/rewrite that function to use WebSockets (I wouldn't use plain HTTP for something like this). All the important functions used by that function (llama_eval, sample_top_p_top_k, etc.) seem to be public anyway.
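
For reference, the loop a handler would be wrapping is small. This is a sketch against stand-ins for those public functions; the real llama-rs signatures may differ.

    // Stand-ins for the llama-rs items named above; the signatures are
    // assumptions for illustration, not the library's actual API.
    type TokenId = i32;
    struct Model;
    fn llama_eval(_model: &Model, _tokens: &[TokenId]) -> Vec<f32> { vec![] }
    fn sample_top_p_top_k(_logits: &[f32], _top_k: usize, _top_p: f32, _temp: f32) -> TokenId { 0 }

    /// Core generation loop a WebSocket handler would wrap:
    /// evaluate, sample, stream the token out, repeat.
    fn generate(model: &Model, prompt: &[TokenId], n_predict: usize, mut on_token: impl FnMut(TokenId)) {
        let mut tokens = prompt.to_vec();
        for _ in 0..n_predict {
            let logits = llama_eval(model, &tokens);               // forward pass over the context
            let next = sample_top_p_top_k(&logits, 40, 0.95, 0.8); // pick the next token
            on_token(next);                                        // e.g. send it over the socket
            tokens.push(next);
        }
    }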


Can someone a lot smarter than me give a basic explanation as to why something like this can run at a respectable speed on the CPU whereas Stable Diffusion is next to useless on them? (That is to say, 10-100x slower, whereas I have not seen GPU based LLaMA go 10-100x faster than the demo here.) I had assumed there were similar algorithms at play.


Stable Diffusion runs pretty fast on Apple Silicon. Not sure if that uses the GPU though.

I think one reason in this particular case may be the 4-bit quantization.


Quantization is the answer here. Running the large models on a CPU at 16 bits (which is effectively 32, because most CPUs don't support FP16 natively) would be really slow.
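
For intuition, here's a toy version of 4-bit block quantization in Rust, loosely in the spirit of ggml's Q4_0 format but not its exact layout: weights are grouped into blocks of 32, each stored as one f32 scale plus small integers.

    // Simplified 4-bit block quantization: one f32 scale per block of 32
    // weights, each weight mapped to a signed value in [-7, 7].
    const BLOCK: usize = 32;

    fn quantize_block(weights: &[f32; BLOCK]) -> (f32, [i8; BLOCK]) {
        // Scale so the largest magnitude maps onto the 4-bit range.
        let max = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
        let scale = if max == 0.0 { 1.0 } else { max / 7.0 };
        let mut q = [0i8; BLOCK];
        for (i, w) in weights.iter().enumerate() {
            q[i] = (w / scale).round().clamp(-7.0, 7.0) as i8;
        }
        (scale, q)
    }

    fn dequantize_block(scale: f32, q: &[i8; BLOCK]) -> [f32; BLOCK] {
        let mut out = [0.0f32; BLOCK];
        for (i, v) in q.iter().enumerate() {
            out[i] = *v as f32 * scale;
        }
        out
    }

In the real format the 4-bit values are packed two per byte, so a block of 32 f32 weights (128 bytes) ends up at around 20 bytes, and the matrix multiplies work on the quantized blocks directly, which is where the speed and memory win comes from.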


ggml should be ported as well to make it really count; use Rust's multithreading for fun.
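
If anyone wants a feel for what that might look like, here is a toy row-parallel matrix-vector multiply with scoped threads; a real port would also need blocking, SIMD and the quantized formats, not just thread spawning.

    use std::thread;

    // Toy row-parallel matrix-vector multiply: each thread handles a
    // contiguous chunk of the output rows.
    fn matvec_parallel(matrix: &[Vec<f32>], x: &[f32], n_threads: usize) -> Vec<f32> {
        let mut y = vec![0.0f32; matrix.len()];
        let chunk = ((matrix.len() + n_threads - 1) / n_threads).max(1);
        thread::scope(|s| {
            for (rows, out) in matrix.chunks(chunk).zip(y.chunks_mut(chunk)) {
                s.spawn(move || {
                    for (row, yi) in rows.iter().zip(out.iter_mut()) {
                        *yi = row.iter().zip(x).map(|(a, b)| a * b).sum();
                    }
                });
            }
        });
        y
    }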


Funny that he had a hard time converting llama.cpp to expose a web server… I was just asking GPT-4 to write one for me… will hopefully have a PR ready soon.


Could anyone more knowledgeable in this space explain what is meant by inference?

From what I know, LLaMA is built in Python (and, I assume, PyTorch). Does this Rust port make use of a Python process, or is the LLaMA algorithm fully written in Rust?


It uses "ggml", a tensor library written in C to do the math. The models weights are converted to a format that can be used by GGML. The higher level structure of the model is created in Rust be leveraging ggml data structures and the weights are passed into the C library.

PyTorch is only needed for reading the original weights while converting them.
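
So, loosely, it's a thin FFI layer plus Rust code describing the model graph. A heavily simplified illustration of that pattern (the extern declarations below are deliberately fake "ggml-like" names, not ggml's real API or signatures):

    use std::os::raw::c_int;

    // Opaque handles for objects that live entirely on the C side.
    #[repr(C)]
    struct GgmlContext { _private: [u8; 0] }
    #[repr(C)]
    struct GgmlTensor { _private: [u8; 0] }

    extern "C" {
        // Illustrative stand-ins for the kind of functions a C tensor
        // library exposes; not ggml's actual API.
        fn ggml_like_new_tensor_2d(ctx: *mut GgmlContext, ty: c_int, ne0: c_int, ne1: c_int) -> *mut GgmlTensor;
        fn ggml_like_mul_mat(ctx: *mut GgmlContext, a: *mut GgmlTensor, b: *mut GgmlTensor) -> *mut GgmlTensor;
    }

    /// Thin wrapper: the Rust code describes the model graph (embeddings,
    /// attention, feed-forward) in terms of these handles, while all the
    /// heavy math happens inside the C library.
    struct Tensor(*mut GgmlTensor);

    fn linear(ctx: *mut GgmlContext, weight: &Tensor, input: &Tensor) -> Tensor {
        // Equivalent of "y = W @ x", executed by the C backend.
        unsafe { Tensor(ggml_like_mul_mat(ctx, weight.0, input.0)) }
    }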


From the readme, to preempt the moaning: "I just like collecting imaginary internet points, in the form of little stars, that people seem to give to me whenever I embark on pointless quests for rewriting X thing, but in Rust."

OK? Just don't. Let us have this. :)


I like the tag, because I wouldn't read the story otherwise. Any time someone ports something to Rust, the story is usually interesting. Sometimes good interesting, sometimes bad interesting. I don't enjoy reading about ports to Go. It is almost always uneventful. Some more performance perhaps, didn't take long. Wasn't too hard. Stubbed my toe on the error checking, etc. With Rust, even if the port itself was easy and the performance gains were minimal, there is usually some bit about a weird error Rust found in the old code base, or how the borrow checker ate their baby and everyone panicked, but then everything was fine.

It isn't Rust-specific; I'd similarly like to know if someone rewrote something in Haskell, or Austral, or Lisp. Not because of the languages, but because they make good stories.


Sometimes I program something just out of fun. I designed a minimal binary and textual serializing format because why not? I didn't even complete the software because it got too tedious and the fun disappeared. I didn't publish anything. These are bytes sitting in my laptop and on some backup git repositories on a github clone.

And yet, sometimes I reopen the repository and reread my own code like it's a book. I am weird.


I think it's fun



