It’s really interesting work. I’m glad you’ve kept at it. I’d like to ask you about two issues.
I keep seeing papers like “Repeat After Me” claiming serious weaknesses of state space models vs transformers. What are the current weaknesses of RWKV vs transformers? Have you mitigated them? If so, how?
The other issue is that file sharing being illegal, Wikipedia requiring derivatives to be copyleft, and so on means I can’t legally train models on most data. Pre-1920’s works in Project Gutenberg are fully in the public domain. Both the model and the training data would be 100% legal for reproducible research. Would your team be willing to train a 3B-7B model on only Gutenberg and release it to the public domain?
(Note: The Stack without GitHub Issues can be used for permissive code. However, there could be contamination issues like incorrect licenses, PII, etc. So, maybe at least one 100%-legal model, and maybe a second with Gutenberg and The Stack for coding research.)
> The other issue is that file sharing being illegal, Wikipedia requiring derivatives to be copyleft, and so on means I can’t legally train models on most data.
That really depends on whether LLM pretraining ends up being held to be an infringing use. (Of course, it will take a while for the cases to work through the courts and for a body of jurisprudence to develop on this subject.)
There are two legal issues: sharing copyrighted data and training on it. It’s the latter that’s ambiguous; my problem is the former.
Making copies of and sharing copyrighted works without the authors’ permission is already illegal, as proven in countless file-sharing cases. AI trainers do this with data sets like Common Crawl, The Pile, and RefinedWeb. Merely sharing those sets is illegal for most of the content they contain.
I have ideas for how to deal with that in countries with TDM (text and data mining) exceptions, such as Singapore. For now, the only things we can share with others for model training are (a) public domain works and (b) content licensed for permissive use and sharing. Gutenberg entries before a certain year should be pretty risk-free.
Example use of Gutenberg:
https://www.tensorflow.org/datasets/catalog/pg19
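In case it helps, here’s a minimal sketch of pulling PG19 through TensorFlow Datasets to feed a training pipeline. The field names (book_title, publication_date, book_text) are what the catalog page above lists; treat them as assumptions and check against the current schema.

    import tensorflow_datasets as tfds

    # Load the PG19 training split (Project Gutenberg books, pre-1919 publications).
    ds = tfds.load("pg19", split="train")

    # Peek at one book; string tensors are decoded to Python strings.
    # Field names assumed from the TFDS catalog page linked above.
    for example in ds.take(1):
        title = example["book_title"].numpy().decode("utf-8")
        date = example["publication_date"].numpy().decode("utf-8")
        text = example["book_text"].numpy().decode("utf-8")
        print(title, date, len(text))

From there you could tokenize book_text and pack it into fixed-length sequences like any other pretraining corpus.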