
That’s a loaded question without first deciding the dataset size.



Would it be possible to just use the exact same dataset as LLaMA? (There's an open source project currently training a transformer on exactly that).


You mean RedPajama? I believe training has already started for 1-14B models (need to double check).


Yep that's the one. Curious roughly how many A100s it'd take to train a 65B RWKV on that.


Really bad napkin math, since no one has attempted 65B yet (so +/- 50%):

8 x 8 x 8 A100s should be able to do 100k+ tokens/s at that size.

With a dataset of 1.2 trillion tokens, that's 12 million seconds, or roughly 140 days.

(PS: this is why everyone is training <60B models; the cost is crazy. Even if my estimate is off by 300%, it's still a crazy number.)
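
For reference, here's that napkin math as a tiny Python sketch; the cluster size and throughput are the same guesses as above, not measured numbers:

    # Napkin math from above; all inputs are rough guesses, not benchmarks.
    gpus = 8 * 8 * 8              # 512 A100s (guessed cluster size)
    tokens_per_sec = 100_000      # guessed aggregate throughput at 65B params
    dataset_tokens = 1.2e12       # RedPajama-scale dataset

    seconds = dataset_tokens / tokens_per_sec    # ~12 million seconds
    days = seconds / 86_400                      # ~139 days

    print(f"~{seconds:.1e} s, or ~{days:.0f} days, on {gpus} A100s")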


Thank you! 8 x 8 x 8 is 512 A100s; that is indeed pretty expensive.


Can you elaborate on the Chinchilla law / dataset problem a bit? (Perhaps by editing your previous comment?)

What datasets are available to the community, how big are they, do they need to be updated from time to time, where are they stored, what are the usual cost ranges involved, ...? :o

thank you!


The Chinchilla law is a rule of thumb that you should have 11+ training tokens for every param.

If not, you get diminishing returns for each param you add.

In extreme cases your model can even perform worse with more params, due to lack of training data.

More complicated: the quality of the data matters as well

So there are 2 major directions: build efficient models with a good dataset and an optimal param count for the task,

or go big on everything (aka OpenAI), which requires monster GPU time for every reply token.

There are obviously in-between approaches as well, hence why the question is so loaded.

Ballpark: if you're not setting aside $100k for GPUs alone to train a 60B model from scratch, you're probably not ready to train one.
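
To put rough numbers on the rule of thumb above, a minimal Python sketch (11 tokens/param is the figure used in this thread; the Chinchilla paper itself lands closer to ~20):

    # Back-of-envelope Chinchilla check; illustrative numbers only.
    params = 60e9                  # 60B-param model
    tokens_per_param = 11          # lower bound quoted in this thread

    min_tokens = params * tokens_per_param       # ~6.6e11 tokens
    print(f"Need at least ~{min_tokens/1e12:.2f}T training tokens")
    # A 1.2T-token dataset like RedPajama clears that bar.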



