How long can open-source LLMs truly promise on context length? (lmsys.org)
189 points by dacheng2 on June 29, 2023 | 62 comments



I think this is more a matter of smaller models with larger context windows not being a magic bullet, and I'm not sure anyone really paying attention thought they would be.

Falcon 40B, after extensive use, is the only model that has crossed the "I'd build a product on this" threshold in open source. I expect my tune won't change there until we see larger open-source models trained on high-quality data. These are the models that will benefit from the extended context lengths.

But longer context without a model capable of actually using that context to do things is, yeah, unsurprisingly, not useful.

There's also not really an incentive to currently make that happen in Open Source. If you can train a big model, with a unique high quality dataset you've assembled, using some novel techniques on some performant serving infrastructure, you basically have a blank check right now.

Edit: On my last point, I'm going to say I was being very jaded. I think the community is doing a lot of awesome things and I know of a few efforts personally to train these larger models for open source consumption. I don't mean to be so overly negative here.


It's pretty amazing to me that there are any open source models that have crossed that threshold - especially given that prior to the Llama release in March there were hardly any open source models that felt like they could catch up with GPT-3 at all.

If you're looking for an incentive to make this happen in open source, how about the team behind MPT-30B just getting acquired for $1.3 billion.

There are a LOT of well-funded groups trying to make an impact in open source models right now.


I'm highly skeptical that people are just going to leave the secret sauce out in the open for anyone to directly compete with them, feels a bit too idealistic to me.

If you look at Mosaic, their value, their secret sauce, is the infrastructure. The model, to say it cynically, was marketing. Now it's true it can be marketing and help push open source forward at the same time.

More of a separate point, but I also think models are going to be commoditized fairly quickly, on a timescale of a couple of years. At that point the value will be in the infrastructure to train or serve a model, or in the thing built on top of the model. Right now models are still the value because they're so new and we haven't finished (in fact we're really only starting) the race to the bottom.


Sure, OpenAI and Anthropic clearly have some secret sauce that they're holding back... but the rate at which the open source research community has been catching up is quite frankly astonishing.

I would be very surprised if the closed source research teams can maintain their advantage over the next 12-24 months - especially given the number of enormous GPU clusters that are becoming available now.


I don't want to diminish the efforts of the community, I agree the work is enormous and the strides are pretty crazy. I honestly hope I'm wrong. I was talking with a friend and discussing that this is one of the coolest collaborative periods I've seen in tech so far. Hopefully it plays out your way and not mine :)


I’ve been testing the MPT models and they’re better than all the other open source stuff I’ve tried. I’m using my task-based, one-shot prompts that are about 1-2K tokens. Seems pretty on par with GPT-3.5. Marketing or whatever, first time I’ve been at all impressed.


What is secret about infrastructure?


Falcon was trained by a research think tank, and this is going to happen more and more as AI compute gets cheaper and the frameworks get better. We aren't going to be stuck with overpriced A100s running early PyTorch eager mode code for long.

High quality "open" data is a tricky problem, but the community has already come up with some impressive datasets for finetunes, like the roleplaying collection used for Chronos and Pygmalion. I think this is going to get better too, especially if cheap models can "distill" lower quality human datasets.


> We aren't going to be stuck with overpriced A100s running early PyTorch eager mode code for long.

I agree and am eagerly awaiting this on the hardware side and trying to participate in this where I can on the software side.


> Falcon was trained by a research think tank

A research think tank subsidized by UAE’s oil wealth. Somebody get Masayoshi on the line, the Saudis must be hungry for deals for their vision fund.


Schmidhuber, the father of deep learning, is already on it: https://cemse.kaust.edu.sa/people/person/jurgen-schmidhuber .


"The father of deep learning", hah. Sounds like something he would say.


Unnecessary ad hominem. Every research tank is subsidized by someone.


Very true, MIT should disclose their military affiliations too when publishing. Otherwise, they are just dishonest.


They do. The grants themselves have requirements for acknowledging the funder in publications; most researchers also mention the funding sources for their work and projects on their website and in their CVs.

People who get those grants are proud of the fact and like the money.


The phi-1 paper suggests small models trained on higher quality data hold promise as well. If confirmed, that would be great news for the potential accessibility of training open source models.

https://arxiv.org/abs/2306.11644


Doesn't this imply some kind of scaling law? An LLM of complexity Y will only benefit from a context length of at most Z? That feels truthy to me.


It depends on the context (hah).

For, say, continuing a story, even a small LLM will benefit from a mega context even if it struggles to retrieve specific information... as long as perplexity isn't hit too much, which depends on the method.

But this is more of an issue for summarization or information retrieval.


Really? I found falcon 40b underwhelming compared to llama 65b


Try falcon-40b-instruct; it's optimized for better interaction with instructions and performs A LOT better.


When it rains, it pours.

We have the SuperHOT LoRA and the associated paper, the Salesforce model, MPT, and this. Any other long-context news I am missing?



Yup, someone supposedly improved superhot scaling already.

Here you go, https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkawar...


Turns out that dynamic NTK-Aware scaling further improves on that, with perplexity performance equal to or better than non-scaled at all context lengths: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamic...
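
For anyone who wants to see the knobs concretely, here's a rough sketch of the idea (my own paraphrase, assuming standard RoPE; not the exact code from either post):

    import torch

    def rope_angles(dim, max_pos, base=10000.0, linear_scale=1.0, ntk_alpha=1.0):
        # Plain RoPE when linear_scale == 1 and ntk_alpha == 1.
        # SuperHOT-style interpolation: divide positions by linear_scale
        # (a 2K-trained model reads 8K positions with linear_scale=4).
        # NTK-aware scaling: stretch the base instead, so the high-frequency
        # components barely move while the low ones get interpolated.
        base = base * ntk_alpha ** (dim / (dim - 2))
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        pos = torch.arange(max_pos).float() / linear_scale
        return torch.outer(pos, inv_freq)   # (max_pos, dim/2) rotation angles

    plain  = rope_angles(128, 2048)
    linear = rope_angles(128, 8192, linear_scale=4.0)   # SuperHOT/kaiokendev style
    ntk    = rope_angles(128, 8192, ntk_alpha=4.0)      # NTK-aware style

The "dynamic" variant, as I understand it, just picks the scaling factor from the current sequence length instead of fixing it up front.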


It just keeps getting better!

This is so exciting because context length has been a problem with these models for so long, and it's awesome to see open source finally cracking that egg.

I need another GPU.

edit:

I'm actually a little bit skeptical of this. Yes, it's dynamically scaling, which is great when you have a model that's not fine-tuned, but I think it's not going to work out too well when you try to fine-tune a model on a target that's moving like that. I'd rather have one that stays static, so that perplexity is always increasing up to the max, rather than doing what those graphs show, where it gets worse over time.

That said I don't really know what I'm talking about so maybe it'll be better regardless.


Yeah, there's a research paper idea sitting there for someone who wants to run the numbers on some more ablation tests and see if there are any unwanted side effects. Though if it gets the claimed performance on non-finetuned data, you may not need to fine tune it in the first place.

There's a symbiotic relationship between the open-source community and academic research here: the broader community can explore a ton of different improvements at a much more rapid pace than either academia (slower because it has additional objectives) or closed-source labs (where the lack of sharing means a lot of low-hanging fruit gets overlooked). Academic research can build the bigger, more experimental projects that create new capabilities. It can also do the testing that the broader dev community doesn't have the time and resources for, giving us a better idea of which parts actually work best and why. Even a paper that tells us something everybody already knows can be very valuable, because it verifies the common assumptions empirically and gives us numbers for things like how well they actually work.

I expect to see a lot more discoveries that bounce back and forth between the audiences, because both groups benefit in different ways.


Ah yeah, I confused RoPE with SuperHOT.


There is already another breakthrough on top of RoPE that promises even better scaling to larger contexts (based on NTK-awareness) - see my submission here:

https://news.ycombinator.com/item?id=36533816

The lmsys team is already looking into it, expect further progress soon.

It's really amazing to follow this progress in realtime, even if the pace can sometimes feel exhausting!


Why was the title editorialized?


I’m not sure I even understand what the title is saying.


Isn't condensing rotary embeddings just a clever hack to circumvent the original training limitations? Is it really a sustainable solution for extending context length in language models?


Yes and no.

Yes, it actually works. That's what matters at the end of the day.

But no, eventually you're going to have to fine-tune on something that has a larger context, or train a brand-new model with the position embeddings semi-randomized so that it learns to generalize in the first place instead of needing hacks like this to function.

But training a model costs millions and millions of dollars, so we're going to have to wait until some very generous group decides to do all that training and then release that model open source.

Or releases the model file for a fee; I'd pay like $200 for a good one.


I am not sure how to parse this. Can you clarify?


LOL, it was released like a few days ago, let's give it some time maybe, but yes, it's great to point out that these things should be tested better.


Why don't they evaluate GPT-4, like they evaluated GPT-3.5? And why not try longer context windows?


On one hand, I confess that I'm not familiar with exactly how the tech works, but this absolutely sounds like cope. I've literally never ever ever ever seen open source get beat by "proprietary" in this regard. We shall see.


In what regard?


Is the tl;dr: without sufficient context length, the conversation gets confusing, because the AI can't keep track of the full context of the conversation?

It's like being in conversation with an Alzheimer's patient, who keeps forgetting the last sentences?

So, open source models don't do well, while closed source models do?

And, this post documents a way to better measure that important metric?


LLMs have no short-term memory. Their internal state never changes.

When you have a conversation with one, you are actually feeding it the entire conversation every time you reply, usually with specific formatting the model is trained to "recognize" as an ongoing conversation, like it's continuing a long story. There are some other formatting details, like the initial "instruction" always being preserved at the top as the old conversation gets truncated, and there is some caching done so it doesn't have to encode the entire conversation every time.

The context size is how much conversation you can feed them at once.

The popular open models were essentially limited to ~2K tokens until recently.
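
Very roughly, the loop a chat UI runs on top of that looks something like this (the "generate" function is a stand-in for whatever model call you use, and the token count is just a crude estimate):

    SYSTEM = "You are a helpful assistant."
    CONTEXT_BUDGET = 2048                       # tokens; ~2K for most older open models
    count_tokens = lambda s: len(s) // 4        # crude chars-per-token guess, fine for a sketch

    history = []                                # list of (role, text) turns

    def build_prompt():
        turns = [f"{role}: {text}" for role, text in history]
        # Drop the oldest turns first, but always keep the instruction at the top.
        while turns and count_tokens("\n".join([SYSTEM] + turns)) > CONTEXT_BUDGET:
            turns.pop(0)
        return "\n".join([SYSTEM] + turns + ["assistant:"])

    def chat(user_message, generate):
        history.append(("user", user_message))
        reply = generate(build_prompt())        # the whole conversation is re-fed every turn
        history.append(("assistant", reply))
        return reply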


We already know how to do "short term memory". We can mathematically condense/average embedding representations of text and use these as input (i.e. combine a page of text into the representation space of one or several tokens, textual inversion style). Once embeddings get good enough, this kind of memory will be ubiquitous. Not sure why we are collectively having amnesia about sentence transformers and related ideas.
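
The "condense a page into a vector" half is already trivial with off-the-shelf sentence encoders, e.g. (model name is just an example):

    from sentence_transformers import SentenceTransformer
    import numpy as np

    enc = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = ["first chunk of the conversation ...",
              "second chunk ...",
              "third chunk ..."]
    memory_vec = np.mean(enc.encode(chunks), axis=0)   # one vector standing in for many tokens

The harder half is training the projection that lets the LLM read that vector as if it were a few soft tokens, which is where the textual-inversion-style training comes in.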

It's worth noting that "soft prompts" were a thing that the coomer crowd was already using with openassistant. Not sure why the mainstream NLP crowd is being so damn slow in implementing stuff like this.

NLP is shockingly undertooled, and I wrote a whole gist complaining about this that got to the front page of HN a while ago: https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...


Soft prompts and textual inversion, as I understand them, required considerable hardware resources and were effectively made obsolete by low-precision LoRA training.

But the text token merging people do for negative prompts is fast iirc.

Not that you are wrong. LLM prompting does seem quite primitive compared to the zoo of encoding/prompting methods we have for SD... and some of it is directly applicable to LLMs. I never even thought about that before.


All of it is directly applicable to LLMs.

Textual Inversion/Soft Prompts are not made "obsolete" at all by LoRAs. Textual Inversion is far faster to train, extremely effective on newer models like SDXL which have better CLIP encoding, and has the neat property of not messing with the general weights of the rest of the model, which is sometimes desirable.


Soft prompts for language models are still viable, but the most accessible version of open source soft prompt training was a TPU-based implementation that depended on features of Google Colab that were no longer available.

You can modify it to run locally on a GPU, though it is a bit slow to train. It provides different advantages when compared to a LoRA.
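
For anyone curious, the core of soft-prompt training is small enough to sketch (frozen model, gradients only flow into a handful of "virtual token" embeddings; gpt2 here purely as a small stand-in):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    for p in model.parameters():
        p.requires_grad_(False)                          # the model itself stays frozen

    n_virtual = 8
    hidden = model.get_input_embeddings().weight.shape[1]
    soft_prompt = torch.nn.Parameter(torch.randn(1, n_virtual, hidden) * 0.02)
    opt = torch.optim.Adam([soft_prompt], lr=1e-3)

    batch = tok("example text for the task you want to teach", return_tensors="pt")
    tok_embeds = model.get_input_embeddings()(batch.input_ids)
    inputs = torch.cat([soft_prompt, tok_embeds], dim=1)
    labels = torch.cat([torch.full((1, n_virtual), -100),  # -100 = ignore the virtual tokens
                        batch.input_ids], dim=1)
    loss = model(inputs_embeds=inputs, labels=labels).loss  # one training step
    loss.backward()
    opt.step()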


I occasionally remark or ask: how is it that LLM APIs are always either text -> text, or at best (for smaller models) text -> embeddings, with the idea being that you stuff the latter in some database for future lookups - when the most interesting thing to expose should be embeddings -> text?

As in, I want to be able to feed the LLM arbitrary embedding vectors - whether they come from averaging embeddings of different prompts, some general optimization pattern (textual inversion style, indeed), or just iteratively shifting the input vector in some direction, walking down the latent space and generating output every step, just to see what happens.

I ask that, and I get blank stares, or non-answers like "oh, it's always been available, you can e.g. put text in, get embeddings out". Yeah, but I want to do the reverse. It's like I was making so big a category error, asking such a nonsensical question, that no one can even parse it. But am I?


What exactly do you mean by "embeddings" there?

The way a transformer-based LLM works is that right after tokenization you get a "vector of embedding vectors" i.e. an embedding for each token, and that gets processed by the actual transformer layers. If instead of a list of tokens you want to pass in a list of embedding vectors, that is trivial (assuming you're running a LLM with your own code locally), that would be a <5 line function.
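
(Concretely, with HF transformers it's basically this, gpt2 used only because it's small:)

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("an apple and an orange and a", return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)        # (1, seq_len, hidden)
    # ...tweak `embeds` here: average it with another prompt's embeddings, nudge it, etc...
    logits = model(inputs_embeds=embeds).logits       # same forward pass, no token ids needed
    print(tok.decode(logits[0, -1].argmax().item()))  # most likely next token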

However, I'm not really sure when/why I would want to do that, as "averaging embeddings of different prompts, some general optimization pattern (textual inversion style, indeed), or just iteratively shifting the input vector in some direction, walking down the latent space and generating output every step" are NOT really enabled by that - since that inherently must be a vector of token-embedding vectors combined with positional embeddings, you can't just average different prompts, and there isn't a clear way how to tweak the individual token embeddings that would map to tweaking the whole prompt in some meaningful direction.

In essence, your question seems to make sense if the text was converted to a single "embedding vector", but it's not, at every layer of a transformer model the state is a list of separate embeddings distinctly separated at a token level.


> If instead of a list of tokens you want to pass in a list of embedding vectors, that is trivial (assuming you're running a LLM with your own code locally), that would be a <5 line function.

That sounds like what I wanted. I haven't got to the point of running open-source LLMs locally yet, so I didn't realize it's as easy as you say - rather, since I haven't seen it being mentioned as something you do with LLM, I assumed it must be nontrivial.

> However, I'm not really sure when/why I would want to do that

High-level dream: I want to explore the latent space of an LLM. Get a feel for how it's organized, and what of it is spurious, and what structures seem to be common among LLMs. Find answers to questions that will come to me when I "go there" and "look around".

It might be a futile pursuit - but I feel it's not the idea that's the problem, but rather that high-dimensional spaces might be too big to explore without mathematical understanding that's way beyond mine.

> as "[the specific ideas I listed]" are NOT really enabled by that - since that inherently must be a vector of token-embedding vectors combined with positional embeddings, you can't just average different prompts,

In general, or at all?

Am I wrong in suspecting that averaging the embeddings of "orange" and "apple" and "banana" may give us a new embedding that, fed to the LLM, will make it output something meaningful (vs. random)? From what I read today, elsewhere in the thread, it seems that this works as expected for Stable Diffusion family of models.

And sure, if I take two random multi-token prompts, I expect the result will make LLM output some noise in the shape of word salad. But what if it's multiple closely-related prompts? I have my suspicions, but I want to find out for sure.

> and there isn't a clear way how to tweak the individual token embeddings that would map to tweaking the whole prompt in some meaningful direction.

Shifting prompts in some meaningful direction may not be trivial (or useful), but it's not the only interesting thing to do. What if I push a single token in the prompt a little bit in some direction? What if I push all of them? How things change with subsequent pushes?

> In essence, your question seems to make sense if the text was converted to a single "embedding vector", but it's not, at every layer of a transformer model the state is a list of separate embeddings distinctly separated at a token level.

I'm definitely not assuming the text can be converted to a single "embedding vector". My questions boil down to, where is debug-level access to the LLM? Why is everyone focused on stringing tokens, instead of looking at what's lurking between the tokens and their sequences, in areas of the latent space you can't easily get to without embeddings that don't map to any possible prompt.

IDK, maybe my mental model of transformer models is completely broken, or maybe all those questions have been asked, and didn't lead to anything of interest, and I just can't seem to find the right papers covering this.


> LLM's have no short term memory. Their internal state never changes.

Many have interpreted recent advances in LLM as a sign that strong AI is just around the corner. Yet, their static nature is an obvious blocker to ever achieving that – humans, even animals, learn continuously, unlike LLMs where training and runtime are two completely separate processes.

Which makes me wonder – how difficult would it be to have something like an LLM which continuously learnt just like humans do? Could it be done by modifying existing LLM architectures such as Transformers, or would it require a completely different architecture?


It's actually really easy. Recurrent neural networks have been shown to be totally capable of handling language modeling. For example, RWKV has pretty good recall, on par with human short-term memory (in my opinion). The issue is training. If you train RWKV like an RNN it's really expensive and time-consuming. If you train it like a GPT... then it's tractable.

The math works the same either way. Keep in mind... LLMs learn orders of magnitude faster than humans due to the parallel nature of transformers.


I don't think that's it though – in recurrent neural networks, there is still a distinction between internal state (dynamic) and weights (static).

Whereas, in biological brains, the weights are updated continuously.


There is also a distinction in the brain between long-term and short-term memories. It is not so beyond belief that the brain stores short-term memories into long-term memories in a separate process (Perhaps sleep, whose lack thereof we know is linked with memory recall issues).

Recurrent neural networks 'learn' continuously by changing their weights (in effect) based upon the 'previous state'. There are papers showing that attention mechanisms in transformers basically provide a 'weight' update function so that models can 'learn' to accomplish a task based on a few examples. In other words, transformer networks 'learn' to train themselves based on the examples given. It is not so beyond belief that recurrent neural networks do the same thing. They learn to set up their internal states such that future tasks can be affected by previous input and patterns. In fact... playing around with models like RWKV, you soon learn that this is a fundamental part of the model. If you start talking to it in a certain way, it will start echoing that back. Clearly it's 'learning'.


> Whereas, in biological brains, the weights are updated continuously.

My personal impression is that many "weights" are updated during sleep. For example, when training juggling, I will make no progress at all for hours of training. But later, after a night of sleep, I will have large and instant progress.


If I learn something new in the morning, very often I still remember it in the afternoon, even though I haven't been to sleep yet.

Neuroscientists/psychologists/etc believe [0] humans have four tiers of memory: sensory memory (stores what you are experiencing right now, lasts for less than a second); working memory (lasts up to 30 seconds); intermediate-term memory (lasts 2-3 hours); long-term memory (anything from 30 minutes ago until the end of your life).

We don't need to sleep to form new long-term memories – if at dinner time you can still remember what you ate for breakfast (I usually can if I think about it), that's your long-term memory at work. What we need sleep for, is pruning our long-term memory – each night the brain basically runs a compression process, deciding which long-term memories to keep and which to throw away (forget), and how much detail to keep for each memory.

Regarding your juggling example – most neuroscientists believe that the brain stores different types of memories differently. How to perform a task is a procedural memory, and new or improved motor skills such as juggling are a particular form of procedural memory. How the brain processes them is likely quite different from how it processes episodic memories (events of your life) or semantic memories (facts, general knowledge, etc). Sleep may play a somewhat different role for each different memory type, so what's true for learning juggling may not be true for learning facts.

[0] https://en.wikipedia.org/wiki/Intermediate-term_memory


If performance and computing cost weren't an issue (which they are!), then you could just have a page of code that continuously runs the LLM in a loop, outputting both the actual output and also some "short-term memory state" that gets fed into the next iteration, and that IMHO might get you at least halfway there.


In humans, sleep is also required for learning, so it's not fully continuous. An AI that occasionally retrains using the new knowledge would still be very interesting.


From what I understand, there are two different processes – a continuous process which runs while we are awake, building new connections between neurons (acquiring new "weights"); and a pruning process which runs while we are sleep, sorting through those new connections and deciding which to keep and which to discard. That's very different from an AI that occasionally retrains.


But what if you could run a newly acquired neuron "weight" through another LLM that, based on its provided context (guidelines), would be able to determine the importance of this new neuron? That loop could create more accurate and higher-quality data that could be used to re-train (fine-tune) a new version of the LLM.


Not quite the same though - if you teach me how to operate a drill press, I'll be able to do it today, I don't have to spend 8 hours in bed before the knowledge is available to me.


Why truncate the conversation? Why not have the LLM generate a summary, and use that instead of a truncated conversation?

That seems closer to what we do, we recall the important bits without every single word said.


This is not a bad idea. Some roleplaying UIs kinda do this.

But it's also several passes through the LLM instead of one, and is prone to distortion.


Wow, somehow I have never heard about this before. In one sense, this makes LLMs even more impressive - the time it takes to consume a large amount of input seems very small.


There's a hack here though, right? Run the LLM once to say "summarise this conversation in 500 words", then tack the new exchange on the end and run it again?
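
A rough sketch of that hack (the "generate" function again being whatever model call you have, and the threshold arbitrary):

    transcript = ""

    def chat_turn(user_message, generate, budget_chars=6000):
        global transcript
        if len(transcript) > budget_chars:
            # one extra pass to compress the history before it would overflow the context
            transcript = generate("Summarise this conversation in 500 words:\n" + transcript)
        transcript += f"\nUser: {user_message}\nAssistant:"
        reply = generate(transcript)
        transcript += " " + reply
        return reply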


Yes, the entire conversation is fed in every time. But also the web interfaces, for example ChatGPT or Bing, do this automatically as an implementation detail, so the comment above you is exactly right.





