How the RWKV language model works (johanwind.github.io)
71 points by EvgeniyZh on July 4, 2023 | 6 comments



Linear attention makes the model forget details and leaves it less expressive than alternatives using proper quadratic attention. And it only matches the best transformers on benchmarks when the parameter count is about the same anyway, whereas one would expect a drastic reduction in parameters from using linear attention.
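
For intuition, here's a toy NumPy sketch (generic kernelized linear attention, not RWKV's exact formulation; the feature map and names are made up) of where the "forgetting" comes from: the quadratic version keeps every past key and value around, while the linear version squashes the whole history into one fixed-size state.

  import numpy as np

  def quadratic_attention(q, k, v):
      # Full causal attention: every query re-scores all past keys,
      # O(T^2 * d) compute, but nothing about the past is thrown away.
      scores = q @ k.T / np.sqrt(q.shape[-1])             # (T, T)
      mask = np.tril(np.ones(scores.shape, dtype=bool))
      scores = np.where(mask, scores, -np.inf)            # causal mask
      weights = np.exp(scores - scores.max(-1, keepdims=True))
      weights = weights / weights.sum(-1, keepdims=True)
      return weights @ v                                  # (T, d)

  def linear_attention(q, k, v):
      # Kernelized linear attention: the history is folded into a
      # fixed-size (d, d) state, O(T * d^2) total. Once two different
      # histories map to the same state, the model can't tell them apart.
      feat = lambda x: np.maximum(x, 0.0) + 1e-6          # toy feature map
      T, d = q.shape
      S = np.zeros((d, d))                                # sum of feat(k_i) v_i^T
      z = np.zeros(d)                                     # sum of feat(k_i)
      out = np.zeros((T, d))
      for t in range(T):
          S += np.outer(feat(k[t]), v[t])
          z += feat(k[t])
          out[t] = feat(q[t]) @ S / (feat(q[t]) @ z)
      return out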


That could still be a good trade-off. It's fine for the feedforward blocks to be a bit slower (due to a higher model dimension) if the 'attention' blocks are much faster (due to better complexity).
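
To put very rough, made-up numbers on that trade-off (constants, heads and projection matrices ignored; the formulas below are back-of-the-envelope assumptions, not measurements):

  # Cost of one layer over a sequence of length T with model dimension d.
  def layer_cost(T, d, quadratic=True):
      attention = T * T * d if quadratic else T * d   # score/mix cost only
      feedforward = 8 * T * d * d                     # 4x hidden expansion
      return attention + feedforward

  # At long context the attention term dominates the quadratic model, so a
  # linear-attention model can spend a larger d on the feedforward blocks
  # and still come out ahead per layer:
  print(layer_cost(T=32_768, d=2048, quadratic=True))    # ~3.3e12
  print(layer_cost(T=32_768, d=2560, quadratic=False))   # ~1.7e12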


The ~150-line implementation[0] mentioned in the post looks great.

Anyone know of a similarly small implementation for transformers?

[0]: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_li...


Karpathy's nanoGPT is a good one.


I think the RWKV team should make a statement by training a really large model. Their 14B finetunes are fine but far from impressive. If they believe this is the future, they should have at it. They probably have the funding to do so (I think?).

Anyway, exciting to see what comes next.


RWKV - Receptance Weighted Key Value
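
i.e. receptance, (decay) weight, key and value, which enter the per-channel recurrence the post walks through. A naive NumPy sketch of just the wkv part (the real ~150-line code also applies sigmoid(r) gating and keeps the running sums in a numerically stable form):

  import numpy as np

  def wkv(w, u, k, v):
      # w: per-channel decay rate (> 0), u: per-channel "bonus" for the
      # current token; k, v: (T, C) key and value sequences.
      T, C = k.shape
      num = np.zeros(C)        # running sum of exp(-(t-1-i)w + k_i) * v_i
      den = np.zeros(C)        # running sum of exp(-(t-1-i)w + k_i)
      out = np.zeros((T, C))
      for t in range(T):
          # current token gets the extra weight exp(u + k_t)
          out[t] = (num + np.exp(u + k[t]) * v[t]) / (den + np.exp(u + k[t]))
          # decay the state and fold in the current token for future steps
          num = np.exp(-w) * num + np.exp(k[t]) * v[t]
          den = np.exp(-w) * den + np.exp(k[t])
      return out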



