
It’s by far the easiest format to consume and run, no? Spans devices easily. No separate weights or extra stuff. For “production” maybe not, but to get it into the hands of the masses this seems perfect.



Yeah, it's by far the least trouble. Pretty much any other backend, even a "pytorch free" backend like MLC, is an utter nightmare to install, and that's if it uses a standardized quantization.

However, the llama.cpp server is... very buggy. The OpenAI endpoint doesn't work. It hangs and crashes constantly. I don't see how anyone could use it for batched production as of last November/December.
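(For reference, this is the shape of OpenAI-style call I mean; a sketch that assumes the llama.cpp server is listening on localhost:8080 and exposes an OpenAI-compatible chat completions route, with the model name and prompt as placeholders:)

    import requests

    # Minimal OpenAI-style chat request against a local llama.cpp server (assumed port 8080).
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local-model",  # placeholder name
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])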

The reason I don't use llama.cpp personally is no flash attention (yet) and no 8-bit KV cache, so it's not too great at long (32K+) contexts. But this is a niche, and it's being addressed.
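To put numbers on why that matters, here's a back-of-the-envelope KV cache estimate (assuming a Llama-2-7B-shaped model with no grouped-query attention; purely illustrative, real models vary):

    # Rough KV cache size for a Llama-2-7B-like model: 32 layers, 32 KV heads, head_dim 128.
    n_layers, n_kv_heads, head_dim = 32, 32, 128
    ctx = 32 * 1024  # 32K tokens

    def kv_cache_gib(bytes_per_elem: int) -> float:
        # 2x for keys and values
        return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

    print(f"fp16 KV cache:  {kv_cache_gib(2):.0f} GiB")  # ~16 GiB
    print(f"8-bit KV cache: {kv_cache_gib(1):.0f} GiB")  # ~8 GiB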


What do you need 32K contexts for? Use case examples, please.

From [0], @sdo72 writes:

> "...32k tokens, 3/4 of 32k is 24k words; each page averages 500 (0.5k) words, so that's basically 24k / 0.5k ≈ 48 pages..."

https://news.ycombinator.com/item?id=35841460

EDIT: I may be ignorant: does the 32k refer to its output context, its single-conversation attention span, or how much it can ingest in a prompt?


Analyzing huge documents and info dumps.

Co-writing a long story with the model so it references past plot points.

The story writing in particular is just something you can't possibly do well with RAG. You'd be surprised how well LLMs can "understand" a mega context and grasp events, implications and themes from it.


Feeding it my source code to use as context for how to refactor things. The bigger the context window, the more source code it can "read".


Thanks. Curious, as I haven't been in a forum where I've seen this asked:

Have you found a particular way to feed it in? Do you give it instructions for what to look for, and what kind of phrasing do you use to direct it?

I'm about to start a GPT co-piloted effort, and I've only done art so far, so I'm curious if there are good pointers on coding with GPT.


continue.dev works great for me. It supports VS Code and JetBrains IDEs, there are shortcuts to give it code snippets as context and do in-place editing, and it works with all kinds of LLM sources, both GPT and local stuff.

Still haven't found anything that can read a whole project's source code in a single click, though.


32K tokens = "context size" = sum of input tokens + max output tokens
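So the prompt and the reply share one budget. A minimal sketch of the bookkeeping (numbers and names are just illustrative; token counts come from whatever tokenizer you're using):

    # The prompt and the generated reply share a single context window.
    CONTEXT_SIZE = 32 * 1024   # e.g. a 32K-token model
    MAX_OUTPUT = 1024          # tokens reserved for the model's answer

    def max_prompt_tokens() -> int:
        return CONTEXT_SIZE - MAX_OUTPUT       # room left for the input

    def fits(prompt_tokens: int) -> bool:
        return prompt_tokens + MAX_OUTPUT <= CONTEXT_SIZE

    print(max_prompt_tokens())  # 31744 tokens of headroom for the prompt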


Thank you. So you want the most efficient small set of input tokens for the maximum output tokens, or at least the best answer with the smallest number of tokens used, but enough headroom that it won't lose context and start down the hallucination path?


Going to be extremely opinionated and straightforward here in service of being concise; please excuse me if it sounds rough or wrong, and feel free to follow up:

You're "not even wrong", in that you don't really need to worry about ratio of input to output, or worry about inducing hallucinations.

I feel like things went generally off-track once people in the ecosystem turned RAG into these weird multi-stage diagrams, when really it's just "hey, the model doesn't know everything, we should probably give it web pages / documents with info in it".
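Stripped of the diagrams, it's roughly this (a toy sketch: the retrieval here is naive keyword overlap rather than embeddings, and ask_llm is a stand-in for whatever model call you're making):

    # Toy retrieval-augmented generation: find relevant docs, paste them into the prompt.
    def score(question: str, doc: str) -> int:
        # Naive keyword overlap; real systems usually use embedding similarity.
        return len(set(question.lower().split()) & set(doc.lower().split()))

    def build_prompt(question: str, docs: list[str], k: int = 3) -> str:
        top = sorted(docs, key=lambda d: score(question, d), reverse=True)[:k]
        context = "\n\n".join(top)
        return f"Use the following documents to answer.\n\n{context}\n\nQuestion: {question}"

    # prompt = build_prompt("when was llama.cpp released?", my_docs)
    # answer = ask_llm(prompt)  # stand-in for your model call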

I think virtually all people hacking on this stuff daily would quietly admit that the large context sizes don't seem to be transformative. Like, I thought it meant I could throw a whole textbook in and get incredibly rich detailed answers to questions. But it doesn't. It still sort of talks the way it talks, but obviously now it has a lot more information to work with.

Thinking out loud: maybe the way I think about it is that the base weights are lossy and unreliable, but if the information is in the context, that is "lossless". The only time I see it get things wrong is when the information itself is formatted weirdly.

All that to say: in practice, I don't see much gain in question-answering* when I provide > 4K tokens.

But the large context sizes are still nice because: A) I don't need to worry as much about losing previous messages / pushing out history when I add documents, compared to when it was just 4K for ChatGPT. B) It's really nice for stuff like information extraction; for example, I can give it a USMLE PDF and have it extract Q+A without having to batch it into something like 30 separate queries and reassemble. C) There are some obvious cases where the long context length helps, e.g. if you know for sure a 100-page document has some very specific info in it, you're looking for a specific answer, and you just don't wanna look it up again: perfect!
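(On point B, the small-context workaround looks roughly like this; a sketch where count_tokens and extract_qa are placeholders for your tokenizer and model call:)

    # With a small context window, a long document has to be split into chunks,
    # queried one chunk at a time, and the results reassembled afterwards.
    def chunk_by_tokens(text: str, max_tokens: int, count_tokens) -> list[str]:
        words, chunks, current = text.split(), [], []
        for w in words:
            current.append(w)
            if count_tokens(" ".join(current)) >= max_tokens:
                chunks.append(" ".join(current))
                current = []
        if current:
            chunks.append(" ".join(current))
        return chunks

    # qa_pairs = []
    # for chunk in chunk_by_tokens(pdf_text, 3500, count_tokens):
    #     qa_pairs.extend(extract_qa(chunk))  # one model call per chunk, then reassemble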

* I've been working on "Siri/Google Assistant but cross-platform and on LLMs": RAG + local + on all platforms + sync engine, for about a [REDACTED]. It can nail ~every question at a high level, modulo my MD friend needing to use GPT-4 for that to happen. The failures I see are things like asking "what's the Lakers' next game", where my web page => text algorithm can't do much with tables, so the page is formatted in a way that causes it to error.


Have you tried ollama? It’s another llama.cpp server implementation that is becoming popular.


The authors don't seem to care about the principle of least privilege: https://github.com/ollama/ollama/issues/851#issuecomment-177...

It makes me wonder what other security issues they might not care about.


This is a Mac problem, not an ollama problem. It also sounds like it's solved by using Homebrew (or Linux).


(AFAIK) it does not support batching and some other server-focused features.

I was talking more about a high-throughput server, which ollama and llama.cpp in general aren't really suited for.


Just to throw this out there: I haven't had any issues with exllama personally, and it's a lot faster, last I checked.


Yeah, this is what I use as well. The arbitrary quantization is just great, as is everything else.

It's not easy to install though, and it's big dGPU only.



