RWKV: Reinventing RNNs for the Transformer Era (arxiv.org)
358 points by ianbutler on May 23, 2023 | 171 comments



One thing I'm keen to understand is: how well does attention hold across huge context sizes, with respect to the usual transformer models, and also these proposed RNN models?

All these 2k/4k/8k context sizes that we've had recently are able to map pretty well to what a human could reasonably remember. What I mean is, you could ask a human to read some text with 8k tokens, and for the most part they could answer questions about the text coherently.

But what about 32k contexts, or beyond? At some point, as the number of tokens increases, the ability of a human to give a highly precise and detailed answer decreases. They must start to generalize. A human could not read Infinite Jest in one pass and then answer details about every single sentence. But could a transformer, or an RNN? As the context grows, is it harder to keep a high granularity of detail? Or am I wrong in trying to think of these models the way I think about the human mind, and they are actually able to handle this problem just fine?

I'm aware that we can cheat a bit, by adding a lookup step into an embedding database, to provide "infinite" context with "infinite" precision. But to me, that is analogous to a human looking up information in a library in order to answer a question. I'm interested in the inherent, emergent memory that these models have available to them in just one forward pass.


The word "attention" has been stretched pretty far to explain what is happening inside a transformer.

What's actually happening is that every token embedding interacts with every other token embedding before it and as the product of this interaction (dot product + softmax) it takes a fraction of every other token embedding and adds it to itself. Technically, it's different transforms/functions of the embedding.

You can view it as every token embedding mixing information from other embeddings into itself. This is done ~100 times in parallel ("attention heads") and ~100 times in sequence (layers); those are roughly the figures for the GPT-3 175B model.
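To make the "mixing" concrete, here's a toy numpy sketch of a single causal attention head (a simplification I'm assuming for illustration: real transformers apply learned query/key/value projections per head, which are omitted here):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def toy_causal_attention(X):
        """X: (seq_len, d) token embeddings; returns the mixed embeddings."""
        n, d = X.shape
        scores = X @ X.T / np.sqrt(d)            # every token interacts with every other token
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores[mask] = -1e9                      # causal mask: only tokens before it count
        fractions = softmax(scores)              # each row sums to 1
        return fractions @ X                     # mix a fraction of every earlier embedding into each token

    X = np.random.randn(8, 16)                   # 8 tokens, 16-dim embeddings
    print(toy_causal_attention(X).shape)         # (8, 16)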


I'm pretty sure that in GPT-3.5+ models, this concept of attention no longer holds true. In [1] the author suggests they are using "intra-tensor sparsity" from a 2021 paper from Google [2].

Details aside, the math suggests they _must_ be using some sparse attention method: the memory used by attention is O(l^2 * d) (here the O notation is a bit of overkill: it's exactly that number in 8-bit quantization, or twice that in 16-bit floats). With l=32k, and d probably on the order of a few k (even the "old" BERT model had it at 768, and it has only gone up since), that would be on the order of a few TB. For (a piece of) _a single_ layer, and they probably have dozens, if not hundreds, of them. The largest commercially available GPUs have 80GB of memory; there's no way they are really using that many GPUs for each single layer (even leaving aside the fact that runtime would probably be absolutely horrible).
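As a back-of-the-envelope check of that estimate (taking the comment's own formula at face value, with l = 32k and d = 2048 at 8-bit, i.e. one byte per entry):

    l, d = 32_768, 2_048
    bytes_naive = l * l * d          # l^2 * d entries, one byte each at 8-bit
    print(bytes_naive / 1e12)        # ~2.2, i.e. "a few TB" per (piece of a) layer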

Implicitly or explicitly, the transformer must learn to have both a "summarized" version of attention (that tells it where to focus) and more "detailed" ones. This doesn't need to be specifically coded: with sparse attention, the model might, for example, end up with different levels of attention at different layers somehow, but they have probably coded something a bit smarter than that.

[1] https://kir-gadjello.github.io/posts/gpt4-some-technical-hyp... [2] https://arxiv.org/abs/2111.12763


> every token embedding interacts with every other token embedding before it

> it takes a fraction of every other token embedding and adds it to itself.

> every token embedding mixing information from other embeddings into itself

Noting the use of the word every. Phrased this way, calling it "attention" hardly makes sense, as attention is typically focused on something specific at any given time - not always on the entire thing.


While all other tokens are considered, the attention mechanism is putting an individual weight on each one, in a way "paying more attention" to some than others.


Yup, if you follow this definition of attention, it makes sense.

The mixing step is computed with the trained weights, meaning the model does learn on its own when/what to emphasise in this "mixing" process.

Hypothetically speaking, if a token does not matter, its mixing weights could end up being literally 0 (or something close to it).


Softmax takes a vector of arbitrary values and converts it into probabilities such that all the elements of the vector add to 1. Transformers use this to decide on the fractions.

This could be seen as "soft attention", where in theory there would be a few winners for every "attention head".

It's also possible that the only purpose the softmax actually serves (or at least a major component of it) is normalization. Without it, the variance in the internal network dynamics between training samples would be fairly large (some training samples may have tokens that interact heavily, while others don't), making optimization problematic.


Most LLM architectures use “soft attention” where some fractional amount of attention is put on every token. “Hard attention” is the term for what you describe.


"...every token embedding interacts with every other token embedding before it"

And, in the case of BERT, every token embedding after it too.


Agreed. IMO, a part of me even argues we should stop calling it attention (but what to call it instead is a mess).

But since this was derived from Apple's attention-free transformer paper, the name is gonna stick, due to a lack of a better alternative.


Naming is a perpetual problem. My issue is with "hallucination", which everyone takes to be the "problem" with GPT-style networks making things up. Never mind that transformers are just trying to predict the next likely token, NOT the truth, and that they're trained on the internet. As everyone knows, the internet is not known for correctness and truth. If you want any neural network to figure out the truth independently, it'll obviously need the ability to go out into the real world, and for most things it even needs to be allowed to experiment.

Hallucination used to mean the following. A basic neural network is:

f(x) = y = repeat(nonlinearity(ax[0] + bx[1] + ...))

And then you adjust a, b, c, ... until y is reasonable, according to the cost function. But look! The very same backpropagation can adjust x[0], x[1] ... with the same cost function and only a small change in the code.

This allows you to reverse the question neural networks answer. Which can be an incredibly powerful way to answer questions.

And that used to be called hallucination in Neural networks. Instead of "change these network weights to transform x into y, keeping x constant" you ask "change x to transform x into y, keeping the network weights constant".
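A minimal PyTorch sketch of that reversal (a toy network made up for illustration; the point is only which tensor receives the gradient updates):

    import torch

    torch.manual_seed(0)
    W = torch.randn(4, 8)                      # frozen "trained" weights
    x = torch.zeros(8, requires_grad=True)     # the input we adjust instead
    target = torch.ones(4)                     # the y we want the network to produce

    opt = torch.optim.SGD([x], lr=0.1)
    for _ in range(200):
        y = torch.tanh(W @ x)                  # same forward pass as ordinary training
        loss = ((y - target) ** 2).mean()      # same cost function
        opt.zero_grad()
        loss.backward()                        # same backpropagation, but the gradient lands on x
        opt.step()                             # W never changes; only x does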

Now it's impossible to find half the papers on this topic. AARGH!


> My issue is with "hallucination", which everyone takes to be the "problem" with GPT style networks making things up.

It's most famous as a problem with ChatGPT specifically, which is presented as a chat interface where an agent answers your question. In that context, it makes sense to think of confident and detailed wrong answers as hallucinations.

You could say that it's still an LLM underneath, but then you'd be talking about a layer of abstraction beneath what most people interact with. Given that they tap into the mental model of "chat with an agent" heavily with their interface and interactions, having small disclaimers and saying "I'm just an LLM" from time to time aren't sufficient to counter people's intuitions and expectations.


I believe the naming is perfect: the word attention clearly matches the intention of the model structure. It's still clearly lacking in efficiency, but we had to have ChatGPT / ChatGPT-4 working and in use to make the further research directions clear (increasing context length and decreasing hallucinations).


Haha, yeah - naming things is hard.

Every time someone comes up and says "this is not attention, because it does X and not Y",

my response is: OK, what would you call it then? Because no one (including me) seems to be able to find a better term that fits its use case.


> But what about 32k contexts, or beyond? At some point, as token size increases, the ability of a human to give a highly precise and detailed answer decreases

War and Peace is over 580,000 words long. Chapter one is ~2,020 words, which encoded for GPT-3 is ~2,956 tokens (lots of longer, older words and proper nouns, e.g. "scarlet-liveried footman" is six tokens), so we might expect the entire book to be ~750,000 tokens long.

Many people could reason about the book in its entirety. They would not have an encyclopedic recall of random dates and side characters or be able to quote any passage, but they could perform deep analysis on it.


Similarly, consider a series like A Song of Ice and Fire. A human reader is still consciously aware of (and waiting for) the answers to questions raised in the very first book. That is millions of tokens ago, and that's assuming our brains turn off when not reading the books.

I think this highlights a hurdle on the path to more human-like AGI. We keep track of so much stuff for very long periods of time, albeit perhaps with some loss of fidelity.

My guess is that there will need to be an advancement or two before we can get an AI to read all of the ASOIAF books so far and ask "What really happened at the Tower of Joy?"


> Similarly, consider a series like A Song of Ice and Fire. A human reader is still consciously aware of (and waiting for) the answers to questions raised in the very first book.

Some of them, some of the time. This is best compared with ChatGPT having those books in its training dataset.

The context window is more like short-term memory. GPT-4 can fit[0] ~1.5 chapters of Game of Thrones; GPT-4-32k almost six. Making space for prompt, questions and replies, say one chapter for GPT-4, and five chapters for GPT-4-32k.

Can you imagine having a whole chapter in your working memory at once? Being simultaneously aware of every word, every space, every comma, every turn of phrase, every character and every plot line mentioned in it - and then being able to take it all into account when answering questions? Humans can do it for a paragraph, a stanza, maybe half a page. Not a whole chapter in a novel. Definitely not five. Not simultaneously at every level.

I feel that, in this sense, LLMs have already surpassed our low-level capacity - though the comparison is a bit flawed, since our short-term memory also keeps track of sights, sounds, smells, time, etc., and emotions. My point here isn't really to compare who has more space for short-term recall - it's to point out that answering questions about immediately read text is another narrow, focused task which machines can now do better than us.

----

[0] - 298000 words in the book (via [1]), over 72 chapters (via [2]), gives us 4139 words per chapter. Multiplying by 4/3, we get 5519 tokens per chapter. GPT-4-8k can fit 1.45x that; GPT-4-32k can fit 5.8x that.

[1] - https://blog.fostergrant.co.uk/2017/08/03/word-counts-popula...

[2] - https://awoiaf.westeros.org/index.php/Chapters_Table_of_cont...


Just thinking about this, I realized that as a musician I do it all the time. I can recall lyrics, chords, instrumental parts and phrasing to hundreds if not thousands of pieces of music and "play them back" in my head. Unlike a training set, though, I can usually do that after listening to a piece only a few times, and also recall what I thought of each part of each piece, and how I preferred to treat each note or phrase each time I played it, which gives me more of a catalog of possible phrasings the next time I perform it. This is much easier for me than remembering exact words I've read in prose. I suspect the relationships between all those different dimensions is what makes the memory more durable. I must also be creating intermediary dimensions and vectors to do that processing, because one side effect of it is that I associate colors with pitches.


If we are trying to at least match human level then all we have to do is summarize and store information for retrieval in the context window. Emphasis on summarize.

We take out key points explicitly so they're not summarized, and for the rest (the less important parts) we summarize it and save it.

That would very likely fit and it would probably yield equal to or better recall and understanding than humans.


Some loss to fidelity? Our memories are hugely lossy, reconstructed at recall based on a bunch of concepts. It's great, but it's also very lossy.


You got me.


ASOIAF spoiler below!

> At the Tower of Joy, Ned Stark defeated three members of the Kingsguard and discovered his dying sister, Lyanna, who made him promise to protect her son, Jon Snow, whose true parentage remained a closely guarded secret.

Seems like ChatGPT-3 already knows, unless there's a deeper secret that I'm not deep enough into ASOIAF fandom to know.


But this is because ASOIAF was in the training dataset. ChatGPT wouldn't be able to say anything about these books if they weren't in its dataset, and you wouldn't have enough tokens to present the whole series to ChatGPT.


Thinking of it as "the training dataset" vs "the context window" is the wrong way of looking at it.

There's a bunch of prior art for adaptation techniques for getting new data into a trained model (fine-tuning, RLHF, etc). There's no real reason to think there won't be more techniques that turn what we think of now as the context window into something that alters the weights in the model and is serialized back to disk.


It's a reasonable way to look at it, given that's how pretty much all 'deployed' versions of LLMs work?


Exactly.

But also, it's not just ASOIAF that's in the training set, but presumably lots of discussion about it and all the interesting events in the books.


"The Magical Number Seven, Plus or Minus Two" [1] applies to many things. In this case, a book, could reasonably be reasoned about almost no matter the length as you argue, given the scope of a book is usually limited to a few topics (the French invasion of Russia). Similarly three to seven chapters could be retold, but not every 361 chapters. A "580,000" word long book about 58,000 things would be unfeasible for a human, but probably feasible for an LLM with a 580k context.

That in essence, I believe, is the difference. An LLM (while still predicting the next word, given it's context), seem to care less about the number of subjects in a given context, than humans do.

[1]: https://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus...


>Many people could reason about the book in its entirety

That's analogous to having War and Peace in the training set. When actually reading War and Peace, nobody's working memory includes a precise recall of everything they've read thus far, which is more analogous to an LLM's context size.


Realistically, you could not quote a passage from the chapter directly, and you don't really need to anyway.

Summarizing chapters with the very same LLM and including that as context may very well get you far.


These things are very different from the human mind. With your mind, you read the words one by one, building up intuition as you go. As part of that, there's a kind of continuous synthesis of information. The synthesis is highly temporal, because as you go you are training, and the hardware is changing (your emotions, dictated by your stomach or the sound you can hear or...) underneath you.

These things are very different. The mental model I have is that they put the entire book into a big hypercube and then *flash*, instantly, out comes the next token. And likewise, when they encounter text that is 8001 tokens long, token #1 is completely erased, gone, like it never existed.


> These things are very different from the human mind. With your mind, you read the words one by one, building up intuition as you go.

You're confusing your perception of how your mind works with how your mind actually works.

Simply put, we have no idea how the human mind works. For all we know, its underlying principles could be very similar to LLMs, or they could be something nobody has thought of yet. But under no circumstances is human intuition about the human mind a reliable indicator for what is actually going on.

This is like asking ChatGPT why it came up with a specific response. It will give you an answer, of course, but that answer is generated the same way all its answers are – ChatGPT doesn't suddenly turn on a magic "introspection mode" that allows it to examine its internals. And there's no reason to assume that humans have such an introspection mode either. In fact, many results from experimental psychology seem to indicate that humans don't understand their own mental operations at all.


While it's true that we may not be able to observe the elemental building blocks of our own minds, metacognition is a real capability that people have and use.


Is there any hard evidence that "metacognition" reflects actual cognitive processes, rather than being something that the mind pulls out of its ass?


To the contrary, there is very good evidence that we pull out arbitrary explanations whenever needed. See Gazzaniga's split-brain experiments, where people gave all sorts of reasons when asked why they did something, even though they could not possibly have known the real reason.


The fact that in extreme circumstances we pull explanations out of our asses (or even that we do so in non-extreme circumstances) doesn't mean we are incapable of genuine metacognition. No one is rational most of the time, but mathematicians can be rational for brief periods on limited subjects.


I think you're confusing cognitive processes with neurological processes. Of course we can reflect on actual cognitive processes such as forgetting or learning, and of course psychologists can gather evidence on these cognitive processes.

What we cannot do is make definitive claims about neurological processes and structures based on what we know about our cognitive processes.


> Of course we can reflect on actual cognitive processes such as forgetting or learning

Is there any actual evidence that our so-called reflections on our own thinking are anything more than hallucinations?

"It's obvious" doesn't count as evidence.


As I said, psychologists have been collecting actual evidence on cognitive processes such as forgetting and learning for a long time.

These studies confirm our perception that we can forget stuff.


"We can forget stuff" and "I believe I just forgot that" are very different things.

One is a general statement of fact, the other implies introspection of one's own individual cognitive processes. I have yet to see evidence that the latter is actually possible.


They are different things, but there is an entire scientific field that connects the two.

The phenomenon of people forgetting stuff is not subjective. It's not one person in isolation thinking they just forgot something. Forgetting is a phenomenon on which there is a lot of social feedback and repeatable experiments.

You seem to be denying the possibility of ever connecting science back to individual perception. Denying this makes any and all science completely meaningless though.

It's not limited to observations about ourselves. You could ask the question "but aren't you hallucinating?" about absolutely everything required to verify the outcome of a physics experiment for instance.


"Simply put, we have no idea how the human mind works." ~40,000 neuroscientists beg to differ. I'd argue that, more or less, we have a good, general, working theory about how the mind works.


"What I cannot create, I do not understand." - Richard Feynman

LLMs notwithstanding, I have yet to see a replication of the human mind created by humans. Therefore, I would take any claim that neuroscience 'understands' how the human mind works (a claim that I've never heard from an actual neuroscientist) with a huge amount of salt.


Your original comment was that we have "no idea" how the brain works. This is absurd. We understand quite a lot about how the brain works. We know its functional elements (neurons, possibly glial cells) its basic mechanism of information processing (firing rates, spike timing) and we have a very general model of how those things are combined to actually process information (mutually inhibiting cortical columns, a laminar structure which represents (roughly) sequential stages of processing) etc etc etc. This hardly constitutes "no idea," and, indeed, the fact that artificial neural networks recapitulate in general terms the ability of neurons to process information, suggests that by the Feynman criteria (he said a lot of dumb stuff, in my opinion, by the way) we are beyond the stage of "[having] no idea."

Not that it matters, but for various reasons I have spent quite a lot of time around neuroscientists and I think most of them would agree that "no idea" is a pretty ridiculous way to characterize the state of the field.


Not sure if it answers your question but the paper notes something similar in its discussion of limitations:

> the linear attention of RWKV leads to significant efficiency gains but still, it may also limit the model’s performance on tasks that require recalling minutiae information over very long contexts. This is due to the funneling of information through a single vector representation over many time steps, compared with the full information maintained by the quadratic attention of standard Transformers. In other words, the model’s recurrent architecture inherently limits its ability to “look back” at previous tokens, as opposed to traditional self-attention mechanisms. While learned time decay helps prevent the loss of information, it is mechanistically limited compared to full self-attention.

> Another limitation of this work is the increased importance of prompt engineering in comparison to standard Transformer models. The linear attention mechanism used in RWKV limits the information from the prompt that will be carried over to the model’s continuation. As a result, carefully designed prompts may be even more crucial for the model to perform well on tasks


Prompt design is definitely a huge one when shifting to RWKV.

Sadly, too many folks copy and paste what works for OpenAI and move on when it fails.


I'll give a very simplistic answer to your question and skirt some detail. Essentially, Transformers can 'look at' everything within the context with perfect fidelity (only one 'hop' between tokens to 'get the info'), while RNNs struggle with this because the information gets 'muddied' with all of the tokens in between. Like telephone between you and one friend (transformer) vs you and a sea of people (RNN).

Hopefully this gets the point across without the mathematics and a precise description.


Yea, the analogy of "my eyes can see the entire document"

vs

"I need to memorise everything as it's spoken out, and then answer from memory"

is a good approximation of the difference.

The kicker, though, as people have pointed out, is that there are individuals on earth who can memorise things to great lengths. So the same will apply here (we need more training).

It's also why the community recommends asking the question first, then putting your context document after it, in a document Q&A task.


In the shower today I was thinking about Jungian archetypes as a visual programming language for the subconscious.

For example, if someone says, "I should be a better mother", and hold this impression in their mind over time, the associations their brain makes with that image and various behaviors can be used as a guiding force during the physical reorganization which occurs when one creates a new habit, or adjusts a behavior or muscle memory.

And these associations depend upon observations made by the individual, which they have either been taught or innately know can be categorized as mother-like. The brain is typically trained to make this association in the case of the mother by mirroring and attaching to the first female who exhibits nursing behavior. These archetypes are a visual representation of deeper mental models which we are primed to carry on between generations.

And then I wondered how this might apply to neural models, if some system of archetypes naturally emerges, if they're useful for programming or memorization/compression, or how we might explore imbuing existing models with an archetypical system.


To answer your question: https://news.ycombinator.com/item?id=35904773

> we loaded the entire text of The Great Gatsby into Claude-Instant (72K tokens) and modified one line to say Mr. Carraway was “a software engineer that works on machine learning tooling at Anthropic.” When we asked the model to spot what was different, it responded with the correct answer in 22 seconds.


One of the biggest issues with incredibly large context sizes is the lack of training data / datasets meant specifically to use such large context sizes.

And this applies to all models. So when a specific model does badly even at 32k or 50k, it's hard to say whether it's an architecture design issue or a dataset issue.


Yes; and the number of training examples that are really long (relative to the other training examples) becomes small really fast, so it's also a problem of the long tail in a sense.


It feels like this is where training on code is going to go from important to critical. Most human texts won't require you to look back 100,000 tokens to understand what they mean, but if you dump 1,000 source files one after the other, it will.


I think what you're saying is warranted, but you should be very careful putting machine learning architectures in the context of the human mind. They are inherently different, although their output may be similar at some points. What you're saying addressed this directly.

Machine learning is optimizing mathematical weights and biases, and the human mind is messy wetware. One of the first things you will read in any machine learning textbook is to throw away the notion of AI trying to simulate the human mind.


I know a guy who memorized "Fahrenheit 451" back in the '60s, and went on to memorize a dozen other books - I think it was a requirement of the anti-establishment cult he was in at the time. Anyhow, his recall is still pretty fantastic.


Sounds like your friend was sort of the Charles Babbage of Large Language Models.


Okay ... the answer really is "much better than any other type of RNN". And so do transformers: LSTM and GRU networks effectively have a context size of 1.

Note that what people claim here is not true: just because you go past the context window does NOT mean everything is forgotten; it means that you're back to RNN levels of performance.

In transformers, just like in RNNs, processing the nth token influences the nth token AND the (n+1)th token. So the influence of the nth token "slowly dies down". It influences the (n+1)th token a lot, the (n+2)th token a bit less and so on. The special thing about attention is that it doesn't start dying until you go past the context size.

Or with a bit of math notation. If a neural network is a function f, then:

RNN: f(x[n]) = f( f(x[n-1]), x[n] )

Transformer: f(x[n]) = f(sum(i=0...context size, f(x[n-i])), x[n])

The difference between LSTMs and GRUs is in the structure of f. And "sum" is, like a lot of things in here, uh, basically accurate but missing a lot of detail.


RNNs have a theoretically infinite context size, but in practice it’s limited by vanishing gradients due to too much recursion. That’s why the recursive units are usually LSTMs or GRUs that have explicit functionality to cut off the recursion, using a learnable threshold that’s a function of both the current and recursive inputs.

In your example, x[n-1] = f(x[n-2], x[n-1]). So really, your expression should be

RNN: x[n] = f( … f( f( f( f(k0, x[0]), x[1]), x[2]), x[3]), …, x[n-1])

where k0 is a learnable parameter.
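In code, that unrolled expression is just a fold over the sequence, starting from the learnable initial state (a toy sketch, not any particular RNN cell):

    def rnn(f, k0, xs):
        """h = f(...f(f(k0, x[0]), x[1])..., x[n-1])"""
        h = k0
        for x in xs:
            h = f(h, x)    # everything remembered about earlier tokens lives in h
        return h

    # With f = lambda h, x: 0.9 * h + x, the contribution of old inputs decays geometrically.
    print(rnn(lambda h, x: 0.9 * h + x, 0.0, [1, 2, 3]))   # ~5.61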


> One thing I'm keen to understand is: how well does attention hold across huge context sizes, with respect to the usual transformer models, and also these proposed RNN models?

We won't know before we've tried it. Reasoning by analogy with humans is not useful. In a year we'll have tried lots of things and can give a much better answer.


The primary benefit of transformers is that they generally can reference things within the context, but RNNs have trouble with that. The trade off is that transformers have memory and compute requirements quadratic in the context length.


Taking into account the limits of how much can be stored in a hidden state:

My theory is that the model will start generalising the stored information once it starts going past its limits (like real-life humans).

So if we take books as an example, we probably do not remember every single word, but we might vaguely remember which section or book events occur in.

Combine this with the agent model, and we may have an alternative to embeddings: we can ask which pages the model recalls as relevant to the question, bring those pages up again, and get the answers.

(Once again like how a human might answer in real life)


People condense information from more than several thousand tokens ago into higher level chunks. So the whole first chapter of Infinite Jest becomes “child tennis prodigy is weird and annoying”.

It’s not clear how to learn this. Transformers work so well because there are only a few layers between input tokens and output. Adding a summarization step is a whole new research project.


Hey side question here, are there any papers or anything really I can read on the part about cheating by using a lookup that you mentioned?


I wrote a home-brew neural network around 2006, just to see what would happen. I'd read no papers on it, and just kind of made it up as I went along. The result was basically a cube of "neurons" which had stronger and weaker trigger points to their neighbors and would propagate "spark" to one or more neighbors based on the strength and direction of spark they got from their other neighbors. Each of the connections and weights had a "certainty" attached to it which would go up when the output was closer to the ideal and would go down when the output was garbage. When those certainty values hit zero, the weights for that neuron would be re-randomized. I quickly realized that the only thing this could really do was try to transform one 10x10 pixel bitmap into a different one. But it was fascinating that it actually seemed to "learn" patterns as they got baked in.

What would be called "attention" now, though, basically didn't exist in that system. Once you started training it on something new it lost everything.

For anyone wondering, it was written in Actionscript 3, and ridiculously, each neuron was bound to a display class that displayed as a semitransparent cube that lit up as the inputs propagated through them. A thoroughly ridiculous side project.

But other than scaling that from 1000 neurons to billions, I'm curious what has changed about the concepts of pathing or tolerance to make these models better? Maybe my concept of the principle behind modern LLMs is too archaic or rooted in a cartoon understanding of our own wetware that I tried to reproduce.

[edit: I'm describing an ancient home project... for anyone downvoting this, I'm more than receptive to hearing your reasons why it's stupid. I'm the first to admit it seems stupid!]


A large percent of the RWKV community ain’t experts. And are here doing weird, dumb or crazy homebrew experiments

So keep doing weird experiments


> But other than scaling that from 1000 neurons to billions, I'm curious what has changed about the concepts of pathing or tolerance to make these models better? Maybe my concept of the principle behind modern LLMs is too archaic or rooted in a cartoon understanding of our own wetware that I tried to reproduce.

Realistically... the problem is a data problem. The math is mostly there. Transformers et al. would have been figured out in no time had we had the sort of data tools we have today. I'm talking about the ease with which you can take gigabytes of data, throw it in S3, and analyze it in minutes.

That, combined with cheap and accessible compute (cloud) and the maturation of CUDA, meant it was all a matter of time before this took place.


> Transformers et al would have been figured out in no time had we had the sort of data tools we have today.

I disagree. Even things that seem obvious in retrospect, take some time to be figured out.

Resnets, batch norm, dropout, etc are examples of this.

And I don't think transformers are obvious.


What determined whether its output was "ideal" for the certainty measure to go off of? Backprop?


I'd need to go back and look at the code, but it was something primitive but similar to backprop. There was an evaluation routine that compared what lit up on the back of the cube to the desired output. The farther off it was, the more the certainty got docked from any of the [connected] neurons one layer up from the bad pixel, and that dragged down the certainties on the next layer and so on. It didn't have the ability to back check which neurons were involved in which specific output node. But I guess it was a scatter shot attempt at that.


Twitter thread announcing the paper with some additional color commentary: https://twitter.com/AiEleuther/status/1660811179239849986


Layperson summary

Good

- substantially cheaper and faster to run / train

- scalable to ridiculous context size

Bad

- you will need to change how you prompt this model sadly (it works differently)


See also, ChatRWKV, an open source LLM powered by RWKV instead of transformers.

https://github.com/BlinkDL/ChatRWKV


Hi Everyone,

I'm a regular involved with the RWKV community.

AMA on RWKV - I will do my best to answer questions here for the next hour (one at a time).

PS: you can find our discord here : https://discord.gg/qt9egFA7ve


Do you think there could be a kind of universality principle at work here, where once you make a good enough architecture then the details don't matter so much compared to the model size and training flops and dataset size? In other words, maybe it wasn't a coincidence that your architecture worked about as well as the transformer architecture?


There is a reasonable argument for that (I've heard the idea go around among multiple AI engineers that once you go past a certain scale, the architecture does not matter much for its evals).

One of the biggest issues with testing all of this is that it takes a crap ton of GPUs to prove out the alternatives to transformers beyond 1B params.

For example, I'm waiting for someone to do a 1B-14B text-based diffusion network.

Finally, if this is truly the case (and all that really matters is size + dataset),

We really should use an architecture that is cheaper to train and run. And that’s what RWKV represents here

You can even run the 7B quantized model reasonably on most laptops (try the rwkv-cpp / rwkv-cpp-node project)


The paper says it's comparable to transformers right now but that means that it might be better later. Do you guys have concrete plans to make it better? Are they secret? Also, what's the deal with that foundation? Is it a cult or like the new OpenAI that will turn closed or maybe it's to reap the value of random contributors to the project?


Completely the opposite.

- it is NOT backed directly or owned by any VC funded company

- it is 100% OSS driven by the community (Apache 2 license)

- it's currently the top OSS chat model that can be used commercially on the chatbot arena scoreboard

- IMO it is undertrained, so expanding the training data alone will make it much better (however for the sake of this paper, we wanted to focus on architecture not training data, so we compared similarly trained models)

And yes we do have multiple experiments and plans to make it better. It’s a list, and we will not know which is final until we try. Individual members can go to great lengths on what they are working on

For better or worse, being truly OSS means our initiatives are more disorganized than those of a centrally planned org.


> it’s currently the top OSS chat model that can be used commercially on the chatbot arena score board

To be fair, that filters out the majority of models on the scoreboard.


Don't we all wish this wasn't the case?

Where we have more OSS models to choose from, without weird rule-lawyering gotchas, or needing to be from a research institute / need a license to download the weights.


Can beginners without money or PhDs contribute? If so, what would be the best way to start?


There are lots of really low hanging fruits

- integrating this with AI platform X/Y/Z

- setting up evals

- improving the code quality

- making a how-to guide (it's stuck on my todo list)

- helping with dataset

- doing silly experiments on how the architecture work (and if the changes give good result)

- etc etc

One of the community goals is to make this a model for EVERYONE on earth; that means we need quality datasets for all the non-English languages.

So even on that level there are things to do.

(Find something that interests you in the community.)


I would guess, go to their Discord. Also, they posted their GitHub, so you could fix a bug there.


(Note: my comments do not represent those of my collaborators.) I remember talking to BlinkDL about this; I think the plan is just to build an ecosystem and provide more diversity in the DL space. There are plans to make an RWKV-5; they are discussed in the open in the RWKV5 channel. From an engineering standpoint I don't really see the "reaping" of value from random contributors to the project. Most of us, I believe... are hackers and tinkerers that just want to learn, contribute, and be a part of something that can change the current


Currently, what I'm seeing with RWKV is that attention fades quickly. The model will start to produce output, but very quickly (a few dozen tokens in), its own output tokens are suddenly taking 'precedence' over the input question and it starts to simply repeat itself.

For example, I'm currently attempting to use RWKV for named entity extraction. I ask it to analyze a piece of text and provide output in JSON format. It starts off great. However, eventually, it seems like the beginning of the JSON list 'overtakes' the question I asked, and it starts to just produce random data that would seem plausible based on the set of things in the list. I realize this is perhaps due to the precision losses of the RNN as the recurrent state decays.

However, I feel there ought to be some way we can prevent that. Any thoughts?


Rearrange the query.

Ask the question / explain the task first. Then give it the data you want to extract from.

Also, you may want to give a one-shot example for best results (the instruct training of the Raven model is very limited).


Yeah... so I did that, which is how I got it to begin correctly. This is what I mean, though.

I'll say "get a list of Blah from the following document in Json format like this:

Example"

Then I feed the document and add a spot for the answer.

The model begins correctly. But usually in the middle of the JSON list generation, it will veer off and start hallucinating, as if it forgot the document and the task. I'm happy to share specifics and datasets, but this is a cross-cutting problem.

RWKV is able to answer my questions when I ask for simple yes/no answers or classification. It's the listing that throws it for a loop. Transformers do not have the same problem. Both LLaMA and GPT are able to maintain focus.

Also, do you know where I'd find information on how the current weights were trained?


Hmm, we might need to look into the instruct training data, which is mostly based on a filtered gpt4all mixed with others.

(You are using Raven, right? That's the instruct-trained variant.)

Btw, ping the Discord if you're looking into fine-tuning for your use case.


Yeah I'm using raven. Raven does work better. And I'm on the discord.

Unfortunately I really would like machine readable responses and raven is a bit too verbose.

Looking at fine-tuning right now.


Why would asking the question first improve quality? Is it because the model will be better aware of what info it can and can't throw away at each step? This seems like the opposite of transformers.


RWKV does not work like transformers. The "transformer" part here is the training step. RWKV is an RNN with fixed-size state, so old information slightly decays each time it reads a new token. Hence the freshest memory is of the most recent tokens.


Are there any ways to train it to maintain attention on the original prompt no matter the distance from it, and selectively pay attention to its own output where relevant?


Instruction training. This is a WIP


Are there currently any plans to create a RWKV 30B or 65B? That seems to be the size at which the LLaMA transformer models become genuinely competitive with GPT3.5 for many tasks.


TLDR: please donate A100s to make this happen

Most of the focus is in the 1-14B range, due to constraints on dataset sizes (the Chinchilla law) and the GPUs available.

Community demand is also mostly in this range, as there is a strong desire to optimise for and run on local GPUs, so more focus goes there.

Not representing Blink directly here - but if anyone wants to see a 30B / 65B model, reach out to contribute the GPUs required to make it happen.

The code is already there; we just need someone to run it.

PS: I too am personally interested in how it will perform at ~60B, which I believe will be the optimal model size for higher-level thought (this number is based on intuition, not research).


https://twitter.com/boborado/status/1659608452849897472

You might find that thread interesting; they're taking submissions for a potential partnership with LambdaLabs, a cloud compute company that has a few hundred H100s lying around. They have an open form, their cofounder is currently doing the rounds having meetings, and this may be a good candidate.

I'm not associated with them at all, just interested in the space and things going on.


Weirdly, their form requires a company rep (which RWKV does not have, as it's not a company) - let's see how it goes...


Are there any estimates anywhere of how many A100s would be needed to e.g. train a 30B model in 6 months?


That's a loaded question without first deciding on a dataset size.


Would it be possible to just use the exact same dataset as LLaMA? (There's an open source project currently training a transformer on exactly that).


You mean RedPajama? I believe that has already started for 1-14B (need to double-check).


Yep that's the one. Curious roughly how many A100s it'd take to train a 65B RWKV on that.


Really bad napkin math, as no one has attempted 65B (so +/- 50%):

8 x 8 x 8 A100s (512 GPUs) should be able to do 100k++ tokens/s at that size.

With a dataset of 1.2 trillion tokens, that's 12 million seconds, or about 140 days.

(PS: this is why everyone is training <60B; the cost is crazy. Even if my math estimate is wrong by 300%, it's still a crazy number.)


Thank you! 8 x 8 x 8 is 512 A100s; that is indeed pretty expensive.


can you elaborate on the chinchilla law / dataset problem a bit? (perhaps by editing your previous comment?)

what datasets are available to the community, how big are these, do they need to be updated from time to time, where are they stored, what are the usual cost ranges involved, ...? :o

thank you!


The Chinchilla law is a rule of thumb that you should have 11++ training tokens for every param.

If not, you are getting diminishing benefits for each param you add.

In extreme cases, your model can even perform worse with more params due to the lack of training data.

More complicated: the quality of the data matters as well.

So there are 2 major directions: build efficient models with a good dataset and an optimal param count for the task,

or go big on everything (aka OpenAI), which requires monster GPU time for every reply token.

There are obviously options in between as well. Hence why the question is so loaded.

Ballpark: if you're not setting aside 100k for GPUs alone to train a 60B model from scratch, you're probably not ready to train one.
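Applying the 11x rule of thumb quoted above gives a feel for why the cost climbs so fast (just arithmetic, not a statement about any particular training run):

    for params_b in (1, 14, 30, 60):
        tokens_b = params_b * 11    # minimum training tokens (billions) under the 11++x rule of thumb
        print(f"{params_b}B params -> at least {tokens_b}B tokens")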


30B would be interesting because that's the practical ceiling for local GPUs assuming 4-bit quantization.

Is there some kind of dedicated fund for training hardware? Donating an A100 sounds unlikely, but surely they could be crowdfunded?


Weirdly enough, organisations are more willing to lend GPUs than give money.

If you want to help fund RWKV, the ko-fi link is - https://ko-fi.com/rwkv_lm

IMO: this needs way more funding, just to sustain Blink leading this project, let alone GPUs for training.

(Also - current tests show this model doing really badly when 4-bit quantized, but alright at Q5 and Q8.)


Is there any potential improvements over transformers for interpretablity or alignment?


For anything past 8k context size, we are talking about over a 10x reduction in GPU time for inferencing tokens, and for training too.

Aka it's cheaper and faster.

Alignment is, frankly, IMO purely a dataset design and training issue, and has nothing to do with the model.


Is everyone still carefully not mentioning that it looks like it’s pronounced “Roku”? ;)


As a (mostly) layperson, this seems like it could be a very significant paper. What are the odds we see the next few years of machine learning models based on RWKV like we have seen with transformers since the attention is all you need paper?


Pretty high; we are going to be deploying something similar to Sono's GPT-based LLM and leverage RWKV for that. Still playing around with it, though!


This model seems to be good when the input context is large, in comparison to OpenAI; I would just wait to see if someone productizes it. Most likely it won't be just one paper, but many new and old ideas combined.


Being an RNN, there is another trick: caching a long prompt. Because RNNs only look back one step while transformers see the whole sequence, you can load your long context only once and reuse it many times.
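A rough sketch of the trick, with a dummy stand-in model (the forward(token, state) interface here is hypothetical, not any specific library's API):

    import copy

    class DummyRNN:
        """Stand-in for an RNN language model with a fixed-size state."""
        def forward(self, token, state):
            state = 0.0 if state is None else state
            state = 0.9 * state + token          # the state absorbs each token it reads
            return state, state                  # (logits stand-in, new state)

    model = DummyRNN()
    long_prompt = [1, 2, 3, 4, 5]                # imagine thousands of tokens here

    state = None
    for tok in long_prompt:                      # pay the long-prompt cost only once
        _, state = model.forward(tok, state)

    cached = copy.deepcopy(state)                # the whole prompt, compressed into one fixed-size state

    for question in ([7, 8], [9]):               # every follow-up query reuses the cached state
        s = copy.deepcopy(cached)
        for tok in question:
            out, s = model.forward(tok, s)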


Yup. This is commonly done in the community for the chat models as well (due to the huge amount of reuse for each reply)


> We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs.

Just to be clear to everyone: this is “use attention to train parameters, use recurrence for inference”

It’s a very cool idea and I hope we get more interesting approaches to inference, but attention is here to stay for training.


As one of the authors, I'd like to clarify: the equations of the RWKV model enable computational parallelization, provided that the sequence is predetermined. This parallelization occurs during both the training and inference stages, specifically during the prompt reading process (consider it an "encoding"), right before the generation (or decoding phase).


How can something recurrent be parallelized?

> the equations of the RWKV model enable computational parallelization, provided that the sequence is predetermined.

And sure, this is the core concept of self attention, no?


EleutherAI is stunning. The only other papers I've seen with such a diversity of institutions are in high-energy particle physics.


Awesome to see this published.

Work on transformer alternatives, especially parallelizable ones like this, is incredibly important - it would suck if we got sucked into a local optimum in architecture without actually looking at nearby viable alternatives.


Yup. I’m all here for infinite scaling of context size


Jesus H. Christ, this is the first time I've seen a collaboration this big on an ML paper. How does a team like that even come together? This isn't the LHC.


RWKV has been running in public (relatively obscure compared to other models) for the past 2 years.

It is mostly led by a single person (Blink). This community consists mostly of people outside the academia / big-VC tech scene.

When EleutherAI offered to help us with writing the paper, various key folks banded together for it, as RWKV seems to be a very strong alternative to transformers.

This does not mean everyone in the Discord was credited.

The requirement was a significant contribution to the paper, typically several paragraphs' worth of drafting and revisions.

Just doing a one-line grammar fix or a single benchmark is not enough.


AKA: no one here is incentivised to fight for bigger ownership; no promotion, KPI, or pay was on the line.


If you think this is wild, see the PaLM 2 paper with 2.5 pages of 2 column attributions.

https://arxiv.org/pdf/2305.10403.pdf


No, the Bloom paper is wild: 2.5 pages of author names written with no spacing, I counted 472 authors after deduplication. PaLM 2 has only 181 authors. ;)

https://arxiv.org/pdf/2211.05100.pdf


Page 28.

That is wild, almost like film credits


Have you seen the author list of DeepSpeech2? https://arxiv.org/abs/1512.02595


They're a group of volunteers who came together over Discord.


I think it's just, everyone in their discord channel.


What channel? I have an application for sequence models I think might be novel, and I'd like to be able to get credit and help research it if possible. Probably somebody has already done it, but I cannot search well enough to find related literature.


Write it and submit it yourself. Sole authorship is given more weight these days for whatever reason. In a multi author publication, even if you are first author or listed as equal contribution, if there is a more famous person on the paper everyone will assume they did it.


Write it and submit it to who? I'm not familiar enough with the field to find any prior work or related work.


find the most closely related paper that you know of, even if it's not very close, and submit your idea to the same journal that published that other idea



Our discord servers (a primary one, a spin-off for RWKV, another spinoff for BioML, etc) have tens of thousands of people between them :) So not quite everyone. But this was a community effort with a public call for contributions


Are you the Stella from the Acknowledgements section? Are you in one of those spinoffs like BioML, or you aren't on the author list because you have some conflict with your work where you can't legally do something because of intellectual property laws or license things or NDA? Also when you talk about 'our discord servers' which ones do you mean? Is it ones run by Eleuther?


Yes I am (apparently) in the acknowledgments. I was not an author on the paper because I didn’t have time to contribute too much. I also try to err on the side of not being added to papers, as my position (I run EleutherAI) tends to encourage people to be overgenerous with offers. I anticipate having more time this coming month and being on the version that’s submitted for peer review, but we’ll see.

BlinkDL has been working on this project for two-ish years, originally in the EleutherAI discord and then created his own to house the project.

I wasn’t thinking too hard about my exact wording, but yes I was thinking of EleutherAI and its various spin-off servers. EleutherAI doesn’t /run/ any of the other servers, but we all have a close collaborative relationship. I’m sure there’s a lot of duplication of membership (e.g., I’m in all of them) but quickly adding up the membership of each server comes out to around 70,000. EleutherAI and LAION are the largest at 25k each, with the others typically having around 5k each. I would expect at least 30k of those users to be unique though.


The model is created by one person, he does not have enough time to write the paper


It's Type 10 of https://xkcd.com/2456/.


Hahaha. True. And even then it’s an undersell (the limited scope of the paper drops lots of the things being done)


Alas, it doesn't appear to work well for longer contexts:

https://twitter.com/arankomatsuzaki/status/16390003799784038...

Has anyone here experimented with this recently to confirm?


It has already been confirmed that with the right dataset we can scale it effectively from 2k to 4k, and 4k to 8k, via fine-tuning (you don't even need to train a new foundation model).

We believe this can be done for 16k to way beyond 100k.

Research into how RWKV handles the hidden state shows that it is barely used (IMO: <5%??), meaning lots of headroom for scaling context size.

(This is actively being experimented on - we don't really know the limit yet.)


Thank you. This is great to hear!

I'm going to take a closer look :-)


One of the authors here! I think someone in our Discord did experiments to prove that it does work for longer contexts; the pace of this work moves really fast, so this might have been an earlier model in the series. RWKV needs to be trained on longer context lengths in order to obtain that skill - context tuning, if you will. IIRC there will be a follow-up paper for it.


Thank you for taking the time to comment here!

That sounds promising. Maybe scale (of model and training samples) is all you need.

And RNNs are obviously so much more efficient at inference.

I'm going to take a closer look :-)


> "Our experiments reveal that RWKV performs on par with similarly sized Transformers"


This sounds pretty bad, right? Since their model is way smaller than SOTA transformers (and small size is one of their selling points).


The paper is meant to compare architecture vs architecture with similar model size and dataset - to inform decisions in future architecture designs.

Its main benefit, with the above staying closely the same, is that it has significantly lower running and training costs without a performance penalty.

If you want to compare any <20B model's evals with GPT-3.5 / 4 / 100B models, that's another paper altogether.


The paper lists the first author's institutional affiliation as "RWKV Foundation". However, I cannot find anything about this supposed "foundation" online, and as far as I can tell, the term RWKV originates in this very paper.

What's going on here?


It does not exist (yet, maybe).

Blink is an individual and does not represent a company (aka not google, not eleuther, not <insert VC company>, etc)

So he had to fill something up i guess haha

The idea of a foundation has been tossed around. So it may eventually happen

Also, the RWKV project predates this paper by over 2 years - nothing is truly new or revealed today in this paper.

You're simply finding out about it now via a formal paper consolidating various things that were learnt in this project.


> Blink is an individual and does not represent a company (aka not google, not eleuther, not <insert VC company>, etc) So he had to fill something up i guess haha

I saw that no one in the author list is listed as an independent researcher or with no affiliation or whatever. That seems hard to believe if it's a lot of random contributors on a Discord channel. Did they all have to make up some kind of 'DBA' (doing business as) name?


I would not really call it random. While it was open for feedback/contributions, there was a strong requirement of substantial contribution to the paper itself to qualify for authorship.

So unfortunately that does limit it, in part, to folks who are more familiar with writing such papers and have the resources for doing benchmarks, charting, etc.

And less so folks who, for example, tinkered, contributed to the dataset, ported the implementation to X lang, did LoRA, or ran various other experiments that have nothing to do with the core architecture.


Sorry, I didn't mean to say "random" in the sense of stochastic, more like "eclectic". Usually when I see such papers with an eclectic author mix, at least a few are listed as independent researchers or unaffiliated (even though they might have a PhD or be applying to grad schools or whatever). I was wondering whether maybe I'm out of date and everyone in those circumstances now just has a 'doing research as' corporate alias.


Ahh. As someone who has been listed as "independent researcher" before, I get what you mean - oh well, it is what it is.


Dumb arxiv question (sorry); is it possible to see what journal a paper was actually submitted to, to help find the reviewed version when it comes out?


It's not required for arxiv authors to submit papers to journals. As far as I can tell "Language Models are Unsupervised Multitask Learners", (the GPT-2 paper) which has 9,786 citations, was never published in a journal.


The paper is going to be submitted to EMNLP next month. An early version is being released now to garner feedback and improve the paper before submission.


Thanks!


Sometimes that is included in the journal LaTeX template, but otherwise authors (in my field at least) typically won't say where a paper is submitted before it's accepted for publication there, or at the very least before it's made it past the editorial rejection hurdle and been sent out for review.

Probably your best bet for getting a heads-up is to set a GScholar alert on the first author?


This went straight to arXiv (it's not even the final copy yet).


that's what arxiv used to be, but that's not what it is anymore



I wish this was written with more care. None of the symbols are defined.

Worst of all, they use "channel dimension" in a sequence model. What even is a channel in a sequence of tokens? This happens as soon as you have a single person with a CNN background on the team, and it makes zero sense. What if you actually have channels in your data? What then?


If you have more specific feedback, like a specific diagram or page and how it can be made better, I will gladly forward that info to improve the paper draft.

Because channel mixing is a core component of this architecture, and the keyword "channel" is all over the place, I have no idea what it is you are critiquing specifically (I could not find a mention of "channel dimension" in the paper).


They said the paper is still a work in progress and they will improve it.

https://twitter.com/AiEleuther/status/1660811180901019648


My goal this year: to understand what this is about :-)


If you are familiar with how a transformer network works,

there is "RWKV in 150 lines" to help you understand all the nitty-gritty:

https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_li...


Almost familiar: one more Karpathy lecture to go. Thanks!


Here is a summary of all the comments by a Transformer; I wonder how an RNN would do:

RWKV is a new language model architecture that is comparable to transformers in terms of performance. RWKV is more efficient than transformers, which makes it possible to train larger models on smaller datasets. The RWKV community is open source and welcomes contributions from anyone. There are plans to create larger versions of RWKV, but this will require more computational resources. Here are some additional details about the chinchilla law and the dataset problem:

The chinchilla law states that the amount of data required to train a language model grows exponentially with the model size. This means that it is very expensive to train large language models, even with the latest hardware. The RWKV community is working on developing new methods for training large language models more efficiently. There are a number of datasets available to the RWKV community, including:

The Pile: A massive dataset of text and code. The Chinchilla: A smaller dataset of text and code that is designed for training RWKV models. The Red Pajamas: A dataset of text and code that is being used to train a 65B RWKV model. These datasets are stored in a variety of locations, including:

The RWKV GitHub repository The Chinchilla website The Red Pajamas website The RWKV community is constantly updating the datasets and adding new ones. If you are interested in contributing, please visit the RWKV GitHub repository.


> The chinchilla law states that the amount of data required to train a language model grows exponentially with the model size. This means that it is very expensive to train large language models, even with the latest hardware. The RWKV community is working on developing new methods for training large language models more efficiently. There are a number of datasets available to the RWKV community, including:

What? I though the chinchilla-optimal regime was something like "20 tokens per weight". That's not remotely exponential, even by the word's colloquial use.


It seems that the user is posting ChatGPT-generated text and not a real summary. It's complete nonsense, with about half of the sentences containing a factual error.



