It’s by far the easiest format to consume and run, no? Spans devices easily. No separate weight files or extra stuff. For “production” maybe not, but to get it into the hands of the masses this seems perfect.
Yeah, it's by far the least trouble. Pretty much any other backend, even a "PyTorch-free" backend like MLC, is an utter nightmare to install, and that's if it uses a standardized quantization.
However, the llama.cpp server is... very buggy. The OpenAI endpoint doesn't work. It hangs and crashes constantly. I don't see how anyone could use it for batched production as of last November/December.
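For reference, the endpoint in question is the OpenAI-compatible /v1/chat/completions route the llama.cpp server exposes. A minimal Python sketch of calling it, assuming a server already running locally on port 8080 (the port, model name, and prompt are my assumptions for illustration):

    import requests

    # Assumes a local llama.cpp server on port 8080; port, model name, and
    # payload values are assumptions for illustration.
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local-model",  # llama.cpp largely ignores this; kept for OpenAI-shape compatibility
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Summarize the plot so far in two sentences."},
            ],
            "temperature": 0.7,
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])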
The reason I don't personally use llama.cpp is that it has no flash attention (yet) and no 8-bit KV cache, so it's not great at long (32K+) contexts. But this is a niche, and it's being addressed.
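To make the KV cache point concrete, a back-of-the-envelope estimate in Python. The model dimensions below are rough numbers for a 13B-class Llama model without GQA (40 layers, hidden size 5120); treat them as assumptions, not measurements:

    # Rough KV-cache memory estimate; model dims are assumptions for a 13B-class model.
    n_layers = 40
    hidden_size = 5120            # n_heads * head_dim, no GQA assumed
    context_len = 32 * 1024

    def kv_cache_bytes(bytes_per_elem):
        # K and V each store hidden_size values per layer, per token
        return 2 * n_layers * hidden_size * context_len * bytes_per_elem

    print(f"fp16 KV cache: {kv_cache_bytes(2) / 2**30:.1f} GiB")  # ~25 GiB
    print(f"int8 KV cache: {kv_cache_bytes(1) / 2**30:.1f} GiB")  # ~12.5 GiB

At fp16 that's on the order of 25 GiB just for the cache at 32K tokens, which is why an 8-bit (or smaller) KV cache matters for long-context use.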
Co-writing a long story with the model so it references past plot points.
The story writing in particular is just something you can't possibly do well with RAG. You'd be surprised how well LLMs can "understand" a mega-context and grasp events, implications, and themes from it.
Thanks. Curious, as I haven't been in a forum where I've seen this asked:
Have you found a particular way to feed it in - do you give it instructions for what to look for, and what kind of phrasing do you use to direct it?
I'm about to start on a GPT co-piloted effort, and I've only done art so far - so I'm curious if there are good pointers on coding with GPT?
continue.dev works great for me. It supports VS Code and the JetBrains IDEs, there are shortcuts to give it code snippets as context, and it does in-place editing. It works with all kinds of LLM sources, both GPT and local stuff.
Still haven't found anything that can read a whole project's source code in a single click, though.
Thank you - so you want the most efficient small input for the maximum output, or at least the best answer with the smallest number of tokens used, but with enough headroom that it won't lose context and start down the hallucination path?
Going to be extremely opinionated and straightforward here in service of being concise; please excuse me if it sounds rough or wrong, and feel free to follow up:
You're "not even wrong", in that you don't really need to worry about ratio of input to output, or worry about inducing hallucinations.
I feel like things went generally off-track once people in the ecosystem turned RAG into these weird multi-stage diagrams when really it's just "hey, the model doesn't know everything, we should probably give it web pages / documents with info in it"
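And it really can be that literal. A minimal sketch, where search() and ask_llm() are hypothetical stand-ins for whatever retriever and model call you actually use:

    # Minimal "RAG": retrieve a few documents, paste them into the prompt, ask.
    # search() and ask_llm() are hypothetical stand-ins, not a real library API.

    def answer(question: str) -> str:
        docs = search(question, top_k=3)          # e.g. web pages, PDF chunks, notes
        context = "\n\n---\n\n".join(d.text for d in docs)
        prompt = (
            "Answer the question using only the context below. "
            "If the context doesn't contain the answer, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return ask_llm(prompt)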
I think virtually all people hacking on this stuff daily would quietly admit that the large context sizes don't seem to be transformative. Like, I thought it meant I could throw a whole textbook in and get incredibly rich detailed answers to questions. But it doesn't. It still sort of talks the way it talks, but obviously now it has a lot more information to work with.
Thinking out loud: maybe the way I think about it is that the base weights are lossy and unreliable, but if the information is in the context, that is "lossless". The only time I see it get things wrong is when the information itself is formatted weirdly.
All that to say, in practice, I don't see much gain in question-answering* when I provide more than ~4K tokens.
But the large context sizes are still nice because:

A) I don't need to worry as much about losing previous messages / pushing out history when I add documents, compared to when it was just 4K for ChatGPT.

B) It's really nice for stuff like information extraction, e.g. I can give it a USMLE PDF and have it extract Q+A without having to batch it into like 30 separate queries and reassemble (sketched below).

C) There are some obvious cases where the long context length helps, e.g. if you know for sure a 100-page document has some very specific info in it, you're looking for a specific answer, and you just don't want to look it up again - perfect!
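For B, the practical difference is just whether you have to chunk. A hedged sketch, again with a hypothetical ask_llm() and a made-up list of PDF pages:

    # Extraction with vs. without a long context window.
    # ask_llm() and pdf_pages are hypothetical stand-ins.

    EXTRACT_PROMPT = "Extract every question and its answer as 'Q: ... / A: ...' lines.\n\n{text}"

    # Long-context model: one call over the whole document.
    qa_long_context = ask_llm(EXTRACT_PROMPT.format(text="\n".join(pdf_pages)))

    # 4K-context model: chunk into batches, query each, reassemble by hand.
    qa_chunks = []
    for i in range(0, len(pdf_pages), 5):                  # ~5 pages per call
        chunk = "\n".join(pdf_pages[i:i + 5])
        qa_chunks.append(ask_llm(EXTRACT_PROMPT.format(text=chunk)))
    qa_short_context = "\n".join(qa_chunks)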
* I've been working on "Siri/Google Assistant, but cross-platform and on LLMs" - RAG + local + on all platforms + a sync engine - for about a [REDACTED]. It can nail ~every question at a high level, modulo my MD friend needing to use GPT-4 for that to happen. The failures I see are things like asking "what's the Lakers' next game": my web page => text algo can't do much with tables, so the content is formatted in a way that causes it to error.
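That table failure mode is easy to reproduce: naive HTML-to-text flattens cells into a word soup the model can't reliably reassemble into rows. A sketch of the difference using BeautifulSoup (the schedule table is made up for illustration):

    from bs4 import BeautifulSoup

    # Toy schedule table; contents are made up for illustration.
    html = """
    <table>
      <tr><th>Date</th><th>Opponent</th><th>Time</th></tr>
      <tr><td>Mar 3</td><td>Nuggets</td><td>7:30 PM</td></tr>
      <tr><td>Mar 5</td><td>Suns</td><td>8:00 PM</td></tr>
    </table>
    """
    soup = BeautifulSoup(html, "html.parser")

    # Naive extraction: row structure is lost.
    print(soup.get_text(separator=" ", strip=True))
    # -> Date Opponent Time Mar 3 Nuggets 7:30 PM Mar 5 Suns 8:00 PM

    # Row-aware extraction keeps the structure the model needs.
    for row in soup.find_all("tr"):
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        print(" | ".join(cells))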