
That’s a loaded question without first deciding the dataset size.



Would it be possible to just use the exact same dataset as LLaMA? (There's an open source project currently training a transformer on exactly that).


You mean RedPajama? I believe training has already started for 1-14B models (need to double check).


Yep that's the one. Curious roughly how many A100s it'd take to train a 65B RWKV on that.


Really bad napkin math, since no one has attempted 65B yet (so +/- 50%):

8 x 8 x 8 A100s should be able to do 100k+ tokens/s at that size.

With a dataset of 1.2 trillion tokens, that's 12 million seconds, or roughly 140 days.

(PS: this is why everyone is training <60B models; the cost is crazy. Even if my estimate is off by 300%, it's still a crazy number.)
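
For reference, here's that napkin math as a tiny Python sketch; the cluster size and throughput are the same guesses as above, not measured numbers:

    # Napkin math from above; all inputs are rough guesses, not benchmarks.
    gpus = 8 * 8 * 8              # 512 A100s (guessed cluster size)
    tokens_per_sec = 100_000      # guessed aggregate throughput at 65B params
    dataset_tokens = 1.2e12       # RedPajama-scale dataset

    seconds = dataset_tokens / tokens_per_sec    # ~12 million seconds
    days = seconds / 86_400                      # ~139 days

    print(f"~{seconds:.1e} s, or ~{days:.0f} days, on {gpus} A100s")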


Thank you! 8 x 8 x 8 is 512 A100s; that is indeed pretty expensive.


Can you elaborate on the Chinchilla law / dataset problem a bit? (Perhaps by editing your previous comment?)

What datasets are available to the community, how big are they, do they need to be updated from time to time, where are they stored, what are the usual cost ranges involved, ...? :o

thank you!


The Chinchilla law is a rule of thumb that you should have 11+ training tokens for every param.

If not, you get diminishing returns for each param you add.

In extreme cases your model can even perform worse with more params, due to lack of training data.

More complicated: the quality of the data matters as well

So there are 2 major directions: build efficient models with a good dataset and an optimal param count for the task,

or go big on everything (aka OpenAI), which requires monster GPU time for every reply token.

There are obviously in-between approaches as well, hence why the question is so loaded.

Ballpark: if you're not setting aside $100k for GPUs alone to train a 60B model from scratch, you're probably not ready to train one.
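
To put rough numbers on the rule of thumb above, a minimal Python sketch (11 tokens/param is the figure used in this thread; the Chinchilla paper itself lands closer to ~20):

    # Back-of-envelope Chinchilla check; illustrative numbers only.
    params = 60e9                  # 60B-param model
    tokens_per_param = 11          # lower bound quoted in this thread

    min_tokens = params * tokens_per_param       # ~6.6e11 tokens
    print(f"Need at least ~{min_tokens/1e12:.2f}T training tokens")
    # A 1.2T-token dataset like RedPajama clears that bar.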



