How the RWKV language model works (johanwind.github.io)
71 points by EvgeniyZh on July 4, 2023 | 6 comments



Linear attention makes the model forget details and leaves it less expressive than alternatives using proper quadratic attention. And it only matches the best transformers on benchmarks when the parameter count is about the same anyway, whereas one would expect a drastic reduction in parameters from using linear attention.
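
For intuition, here's a toy NumPy sketch (generic kernelized linear attention, not RWKV's exact formulation; the feature map and names are made up) of where the "forgetting" comes from: the quadratic version keeps every past key and value around, while the linear version squashes the whole history into one fixed-size state.

  import numpy as np

  def quadratic_attention(q, k, v):
      # Full causal attention: every query re-scores all past keys,
      # O(T^2 * d) compute, but nothing about the past is thrown away.
      scores = q @ k.T / np.sqrt(q.shape[-1])             # (T, T)
      mask = np.tril(np.ones(scores.shape, dtype=bool))
      scores = np.where(mask, scores, -np.inf)            # causal mask
      weights = np.exp(scores - scores.max(-1, keepdims=True))
      weights = weights / weights.sum(-1, keepdims=True)
      return weights @ v                                  # (T, d)

  def linear_attention(q, k, v):
      # Kernelized linear attention: the history is folded into a
      # fixed-size (d, d) state, O(T * d^2) total. Once two different
      # histories map to the same state, the model can't tell them apart.
      feat = lambda x: np.maximum(x, 0.0) + 1e-6          # toy feature map
      T, d = q.shape
      S = np.zeros((d, d))                                # sum of feat(k_i) v_i^T
      z = np.zeros(d)                                     # sum of feat(k_i)
      out = np.zeros((T, d))
      for t in range(T):
          S += np.outer(feat(k[t]), v[t])
          z += feat(k[t])
          out[t] = feat(q[t]) @ S / (feat(q[t]) @ z)
      return out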


That could still be a good trade-off. It's fine for the feedforward blocks to be a bit slower (due to a higher model dimension) if the 'attention' blocks are much faster (due to better complexity).
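
To put very rough, made-up numbers on that trade-off (constants, heads and projection matrices ignored; the formulas below are back-of-the-envelope assumptions, not measurements):

  # Cost of one layer over a sequence of length T with model dimension d.
  def layer_cost(T, d, quadratic=True):
      attention = T * T * d if quadratic else T * d   # score/mix cost only
      feedforward = 8 * T * d * d                     # 4x hidden expansion
      return attention + feedforward

  # At long context the attention term dominates the quadratic model, so a
  # linear-attention model can spend a larger d on the feedforward blocks
  # and still come out ahead per layer:
  print(layer_cost(T=32_768, d=2048, quadratic=True))    # ~3.3e12
  print(layer_cost(T=32_768, d=2560, quadratic=False))   # ~1.7e12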


The ~150-line implementation[0] mentioned in the post looks great.

Anyone know of a similarly small implementation for transformers?

[0]: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_li...


Karpathy's nanoGPT is a good one.


I think the RWKV team should make a statement by training a really large model. Their 14B finetunes are fine but far from impressive. If they believe this is the future, they should have at it. They probably have the funding to do so (I think?).

Anyway, exciting to see what comes next.


RWKV - Receptance Weighted Key Value
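
i.e. receptance, (decay) weight, key and value, which enter the per-channel recurrence the post walks through. A naive NumPy sketch of just the wkv part (the real ~150-line code also applies sigmoid(r) gating and keeps the running sums in a numerically stable form):

  import numpy as np

  def wkv(w, u, k, v):
      # w: per-channel decay rate (> 0), u: per-channel "bonus" for the
      # current token; k, v: (T, C) key and value sequences.
      T, C = k.shape
      num = np.zeros(C)        # running sum of exp(-(t-1-i)w + k_i) * v_i
      den = np.zeros(C)        # running sum of exp(-(t-1-i)w + k_i)
      out = np.zeros((T, C))
      for t in range(T):
          # current token gets the extra weight exp(u + k_t)
          out[t] = (num + np.exp(u + k[t]) * v[t]) / (den + np.exp(u + k[t]))
          # decay the state and fold in the current token for future steps
          num = np.exp(-w) * num + np.exp(k[t]) * v[t]
          den = np.exp(-w) * den + np.exp(k[t])
      return out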



