
Is there some single page that keeps a running status of the various LLMs and the software to make them runnable on consumer hardware?



Hi! Funnily enough, I couldn't find much on it either, so that's exactly what I've been working on for the past few months, just in case this kind of question got asked.

I recently opened a GitHub repository which includes information on both AI model series[0] and the frontends you can use to run them[1]. I wrote a Reddit post beforehand that's messier, but a lot more technical[2].

I try to keep them as up-to-date as possible, but I might've missed something or my info may not be completely accurate. It's mostly to help get people's feet wet.

[0] - https://github.com/Crataco/ai-guide/blob/main/guide/models.m...

[1] - https://github.com/Crataco/ai-guide/blob/main/guide/frontend...

[2] - https://old.reddit.com/user/Crataco/comments/zuowi9/opensour...


consumer hardware is a bit vague as a limitation, which I guess is partly why people aren't tracking very closely what runs on what

these could be useful:

https://nixified.ai

https://github.com/Crataco/ai-guide/blob/main/guide/models.m... -> https://old.reddit.com/user/Crataco/comments/zuowi9/opensour...

https://github.com/cocktailpeanut/dalai

the 4-bit quantized version of LLaMA 13B runs on my laptop without a dedicated GPU, and I guess the same would apply to quantized Vicuna 13B, but I haven't tried that yet (converted as in this link, but for 13B instead of 7B: https://github.com/ggerganov/llama.cpp#usage )
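
for anyone wondering what "4-bit quantized" buys you: each small block of weights gets stored as one float scale plus 4-bit integers, which works out to a bit over 0.6 bytes per weight instead of 2 for fp16, so a 13B model ends up around 8GB and fits in 16GB of RAM. a rough numpy sketch of the idea (in the spirit of q4_0, not llama.cpp's actual on-disk format; the function names are made up):

    # rough sketch of blockwise 4-bit quantization (illustrative only,
    # not llama.cpp's actual q4_0 layout; function names are made up)
    import numpy as np

    def quantize_blocks(weights, block_size=32):
        # assumes the number of weights is a multiple of block_size
        blocks = weights.astype(np.float32).reshape(-1, block_size)
        scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # one scale per block
        scale[scale == 0] = 1.0                                  # avoid dividing by zero
        q = np.clip(np.round(blocks / scale), -7, 7).astype(np.int8)
        return scale, q  # a real format packs two 4-bit values into each byte

    def dequantize_blocks(scale, q):
        return (q.astype(np.float32) * scale).reshape(-1)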

the GPT4All LoRA also works - perhaps the most compelling results I've gotten locally so far. I have to try quantized Vicuna to see how that one goes, but processing the files to get a 4-bit quantized version will take many hours, so I'm a bit hesitant

PS: converting 13B LLaMA took my laptop's i7 around 20 hours and required a large swap file on top of its 16GB of RAM

feel free to reply if you're trying any of these things this week (later I might lose track)


Vicuna's GitHub says that applying the delta takes 60GB of CPU RAM? Is that what you meant by large swap file?

On that note, why is any RAM needed? Can't the files be loaded and diffed chunk by chunk?
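
Something like the following sketch seems like it should work in principle, assuming the base weights and the delta are saved as identically-sharded PyTorch files (made-up paths, and not FastChat's actual apply_delta script):

    # hypothetical shard-by-shard delta merge; peak RAM is roughly one shard
    # of the base model plus one shard of the delta, not two full models
    import glob
    import torch

    for base_path in sorted(glob.glob("llama-13b/pytorch_model-*.bin")):
        delta_path = base_path.replace("llama-13b", "vicuna-13b-delta")
        base = torch.load(base_path, map_location="cpu")
        delta = torch.load(delta_path, map_location="cpu")
        merged = {name: base[name] + delta[name] for name in base}
        torch.save(merged, base_path.replace("llama-13b", "vicuna-13b"))
        del base, delta, merged  # free the shard before loading the next one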

Edit: The docs for running Koala (a similar model) locally say this (about converting LLaMA to Koala):

>To facilitate training very large language models that do not fit into the main memory of a single machine, EasyLM adopts a streaming format of model checkpoint. The streaming checkpointing format is implemented in checkpoint.py. During checkpointing, the StreamingCheckpointer simply flattens a nested state dictionary into a single-level dictionary and streams the key, value pairs to a file one by one using messagepack. Because it streams the tensors one by one, the checkpointer only needs to gather one tensor from the distributed accelerators to the main memory at a time, hence saving a lot of memory.

https://github.com/young-geng/EasyLM/blob/main/docs/checkpoi...

https://github.com/young-geng/EasyLM/blob/main/docs/koala.md

Presumably the same technique can be used with Vicuna.
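
For a concrete picture of the streaming trick the quote describes (not EasyLM's actual checkpoint.py; the helpers here are made up): flatten the nested state dict and write one (key, tensor) pair at a time with messagepack, so only a single tensor ever needs to sit in host memory during save or load.

    # minimal sketch of a streaming checkpoint: one tensor in memory at a time
    import msgpack
    import numpy as np

    def flatten(tree, prefix=""):
        # turn a nested state dict into (name, array) pairs, one by one
        for key, value in tree.items():
            name = f"{prefix}/{key}" if prefix else key
            if isinstance(value, dict):
                yield from flatten(value, name)
            else:
                yield name, np.asarray(value)

    def save_streaming(state_dict, path):
        packer = msgpack.Packer()
        with open(path, "wb") as f:
            for name, tensor in flatten(state_dict):
                record = (name, str(tensor.dtype), list(tensor.shape), tensor.tobytes())
                f.write(packer.pack(record))

    def load_streaming(path):
        with open(path, "rb") as f:
            for name, dtype, shape, raw in msgpack.Unpacker(f):
                yield name, np.frombuffer(raw, dtype=dtype).reshape(shape)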


btw I got 4-bit quantized Vicuna working on my 16GB laptop and the results seem very good, perhaps the best I've gotten running locally so far


Did you have to diff LLaMA? Did you use EasyLM?


I found it ready-made for download here: https://huggingface.co/eachadea/ggml-vicuna-13b-4bit


Not a single page, but almost all large language models with open weights are published on this website: https://huggingface.co/models



