purplecats's comments

nice touch with the images, especially since they are even named contextually.

Navigation links don't work, but this is pretty cool. I wonder whether, theoretically, we'd even be able to tell it's generated dynamically, given the caching, and assuming LLMs/art generators keep getting faster.


Interesting perspective. I suppose complex minifiers would also be an attack vector, since their obfuscation makes it harder to even eyeball obvious deviations.


Disk Inventory X is the closest I've found to something that pretty. I wish it were still maintained.


Do you have a demo to try without building it first?



Looks great. Had an issue getting it to work, though: iPhone, with microphone and speech recognition enabled. It says it's adding to the list, but the list isn't changing. Any ideas?


We have released some fixes; give it a try. If it's still not working, please share a video link if possible.


TaskPaper is my go-to for this.


only if they've read and internalized this paper


I'd love to have this on https://excalidraw.com/


GPT with real-time context of my entire financial history, so I could discuss financial decisions with it.


I've never been a huge Tesla fan, but this video was certainly entertaining and has me considering the Cybertruck now, especially for my family.

However, it's very biased. I'd love suggestions for a less biased yet thorough review/comparison.


When is the new Mistral coming out, and at what size?


I'm hoping they make it 13B, which is the size I can run locally in 4-bit and still get reasonable performance.


What kind of system do you need for that?


If your GPU has ~16GB of VRAM, you can run a 13B model in Q4_K_M GGUF format and it'll be fast. Maybe even with ~12GB.

It's also possible to run on CPU from system RAM, to split the workload across GPU and CPU, or even from a memory-mapped file on disk. Some people have posted benchmarks online [1] and naturally, the faster your RAM and CPU the better.

My personal experience is that running from CPU/system RAM is painfully slow. But that's partly because I only experimented with models that were too big to fit on my GPU, so part of the slowness is due to their size.

[1] https://www.reddit.com/r/LocalLLaMA/comments/14ilo0t/extensi...
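
A minimal sketch of that GPU/CPU split using the llama-cpp-python bindings (the model path, layer count, and context size are placeholders to adjust for your hardware; the llama.cpp CLI exposes equivalent flags):

  from llama_cpp import Llama

  # Hypothetical path to a 13B model quantized as Q4_K_M
  llm = Llama(
      model_path="models/llama-2-13b.Q4_K_M.gguf",
      n_gpu_layers=35,  # layers offloaded to VRAM; lower this if you run out of memory
      n_ctx=3072,       # context window; larger contexts need more memory
      use_mmap=True,    # memory-map the file instead of loading it all into RAM up front
  )

  out = llm("Q: Why is the sky blue? A:", max_tokens=128)
  print(out["choices"][0]["text"])

Setting n_gpu_layers=0 keeps everything on the CPU, and -1 offloads every layer if it fits.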


I can fit 13B Q4_K_M models on a 12GB RTX 3060. It OOMs when the context window goes above 3k. I get 25 tok/s.


I get 10 tokens/second on a 4-bit 13B model with 8GB of VRAM, offloading as much as possible to the GPU. At this speed I can't read the LLM output as fast as it's generated, so I consider it sufficient.


Which video card?


RTX 3070 Max-Q (laptop)


Mine is a laptop with an i7-11800H CPU + RTX 3070 Max-Q 8GB VRAM + 64GB RAM (though you can probably get away with 16GB RAM). I bought this system for work and casual gaming, and was happy to find that the GPU also lets me run LLMs locally at good performance. The laptop cost me ~$1,600, which was a bargain considering how much value I get out of it. If you are not on a budget, I highly recommend getting one of the high-end laptops with an RTX 4090 and 16GB VRAM.

With my system, llama.cpp can run Mistral 7B 8-bit quantized, offloading 32 of its 35 layers to the GPU, at about 25-30 tokens/second, or 6-bit quantized with all layers on the GPU at ~35 tokens/second.

I've tested a few 13B 4-bit models such as CodeLlama and got about 10-15 tokens/second by offloading 37 layers to the GPU.
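
If anyone wants to reproduce throughput numbers like these, here's a rough sketch with the llama-cpp-python bindings (the model path is a placeholder and n_gpu_layers mirrors the partial offload above; llama.cpp itself prints similar timings when a run finishes):

  import time
  from llama_cpp import Llama

  # Placeholder path to a 13B 4-bit GGUF model
  llm = Llama(model_path="models/codellama-13b.Q4_K_M.gguf", n_gpu_layers=37)

  start = time.time()
  out = llm("Write a Python function that reverses a string.", max_tokens=256)
  elapsed = time.time() - start

  n_tokens = out["usage"]["completion_tokens"]
  print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")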


I have a Lenovo Legion with a 3070 8GB and was wondering whether I should use that instead of my MacBook M1 Pro.


The main focus of llama.cpp has been Apple silicon, so I suspect M1 would be more efficient. The author recently published some benchmarks: https://github.com/ggerganov/llama.cpp/discussions/4167


On my M1 Max Mac with 32GB of RAM, Vicuna 13B (GGUF) at 4-bit consumes around 8GB of RAM in Oobabooga.

I tried turning on mlock and upping the thread count to 6, but it's still rather slow at around 3 tokens/sec.


A CPU would work fine for the 7B model, and if you have 32GB of RAM and a CPU with a lot of cores, you can run a 13B model as well, though it will be quite slow. If you don't care about speed, it's definitely one of the cheapest ways to run LLMs.
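
For reference, a CPU-only run with the llama-cpp-python bindings just means not offloading any layers (a sketch; the model path and thread count are placeholders):

  from llama_cpp import Llama

  llm = Llama(
      model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path to a 7B GGUF
      n_gpu_layers=0,   # keep every layer on the CPU
      n_threads=8,      # roughly match your physical core count
      use_mlock=True,   # lock the model in RAM so it isn't paged out to disk
  )

  out = llm("Q: Summarize llama.cpp in one sentence. A:", max_tokens=64)
  print(out["choices"][0]["text"])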


Q5_K_M on Mistral 7B has good accuracy and performs decently on a CPU, too.

