
Are there currently any plans to create an RWKV 30B or 65B? That seems to be the size at which the LLaMA transformer models become genuinely competitive with GPT-3.5 for many tasks.



TLDR: please donate A100s to make this happen

Most of the focus is in the 1-14B range, due to constraints on dataset size (the Chinchilla law) and the GPUs available.

Community demand is also mostly in this range, as there is a strong desire to optimise for and run on local GPUs.

Not representing Blink directly here, but if anyone wants to see a 30B / 65B model, reach out to contribute the GPUs required to make it happen.

The code is already there; someone just needs to run it.

PS: I too am personally interested in how it will perform at ~60B, which I believe will be the optimal model size for higher-level reasoning (this number is based on intuition, not research).


https://twitter.com/boborado/status/1659608452849897472

You might find that thread interesting: they're taking submissions for potential partnerships with LambdaLabs, a cloud compute company that has a few hundred H100s lying around. They have an open form, their cofounder is currently doing the rounds of meetings, and this may be a good candidate.

I'm not associated with them at all, just interested in the space and things going on.


Weirdly, their form requires a company rep (which RWKV does not have, as it's not a company). Let's see how it goes ...


Are there any estimates anywhere of how many A100s would be needed to e.g. train a 30B model in 6 months?


That's a loaded question without first deciding the dataset size.


Would it be possible to just use the exact same dataset as LLaMA? (There's an open source project currently training a transformer on exactly that).


You mean RedPajama? I believe that has already started for 1-14B (need to double check).


Yep that's the one. Curious roughly how many A100s it'd take to train a 65B RWKV on that.


Really rough napkin math, as no one has attempted 65B (so +/- 50%):

8 x 8 x 8 A100s should be able to do 100k+ tokens/s at that size.

With a dataset of 1.2 trillion tokens, that's 12 million seconds, or roughly 140 days.

(PS: this is why everyone is training <60B; the cost is crazy. Even if my estimate is off by 300%, it's still a crazy number.)
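A quick sanity check of that napkin math, as a rough sketch: the 512-GPU count, 100k tokens/s cluster throughput, and 1.2T-token dataset are the hand-wavy figures from this thread, not measured numbers.

    # Back-of-envelope check of the 65B training estimate above.
    # All inputs are the thread's rough figures, not benchmarks.
    gpus = 8 * 8 * 8            # 512 A100s
    tokens_per_sec = 100_000    # assumed cluster-wide throughput
    dataset_tokens = 1.2e12     # ~1.2T tokens (RedPajama-scale)

    seconds = dataset_tokens / tokens_per_sec
    days = seconds / 86_400
    print(f"{gpus} A100s, {seconds:.1e} s ~= {days:.0f} days")  # ~1.2e7 s, ~139 days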


Thank you! 8 x 8 x 8 is 512 A100s; that is indeed pretty expensive.


Can you elaborate on the Chinchilla law / dataset problem a bit? (Perhaps by editing your previous comment?)

What datasets are available to the community, how big are they, do they need to be updated from time to time, where are they stored, what are the usual cost ranges involved, ...? :o

thank you!


The Chinchilla law is a rule of thumb that you should have 11+ training tokens for every parameter.

If not, you are getting diminishing returns for each parameter you add.

In extreme cases your model can even perform worse with more parameters, due to lack of training data.

It's more complicated than that: the quality of the data matters as well.

So there are two major directions. One: build efficient models with a good dataset and an optimal parameter count for the task.

Two: go big on everything (aka OpenAI), which requires monster GPU time for every reply token.

There are obviously in-between approaches as well, hence why the question is so loaded.

Ballpark: if you're not setting aside $100k for GPUs alone to train a 60B model from scratch, you're probably not ready to train one.
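To make the rule of thumb concrete, here's a minimal sketch using the ~11 tokens-per-parameter figure above (the Chinchilla paper itself suggests closer to ~20 tokens per parameter, so treat these as lower bounds):

    # Minimum dataset size implied by the "11+ tokens per parameter" rule of thumb.
    TOKENS_PER_PARAM = 11

    for params in (1e9, 14e9, 30e9, 65e9):
        min_tokens = params * TOKENS_PER_PARAM
        print(f"{params / 1e9:>4.0f}B params -> at least {min_tokens / 1e12:.2f}T tokens")

    # 30B needs ~0.33T tokens and 65B ~0.72T, so a 1.2T-token dataset like
    # RedPajama comfortably covers both sizes under this rule.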


30B would be interesting because that's the practical ceiling for local GPUs assuming 4-bit quantization.

Is there some kind of dedicated fund for training hardware? Donating an A100 sounds unlikely, but surely they could be crowdfunded?
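For context on the 4-bit ceiling mentioned above, a rough weight-memory sketch (it assumes a 24 GB consumer card such as a 3090/4090, and ignores activation/state overhead, which adds a few more GB):

    # Weight memory for a 30B model at different quantisation levels.
    params = 30e9
    for name, bits in (("fp16", 16), ("int8", 8), ("q5", 5), ("q4", 4)):
        gb = params * bits / 8 / 1e9
        print(f"{name}: ~{gb:.0f} GB of weights")

    # q4 comes to ~15 GB of weights, which (plus runtime overhead) is roughly
    # what fits on a single 24 GB consumer GPU -- hence 30B as the practical
    # local ceiling.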


Weirdly enough, organisations are more willing to donate GPU time than money.

If you want to help fund RWKV, the ko-fi link is - https://ko-fi.com/rwkv_lm

IMO this needs way more funding just to sustain Blink leading this project, let alone the GPUs for training.

(Also: current tests show this model doing really badly when 4-bit quantized, but alright at Q5 and Q8.)



