There’s two legal issues: sharing copyrighted data; training on it. It’s the lat...

There’s two legal issues: sharing copyrighted data; training on it. It’s the latter that’s ambiguous. My problem is the former.

Making copies of and sharing copyrighted works without the authors’ permission is already illegal as proven in countless, file-sharing cases. The AI trainers do this with data sets like Common Crawl, The Pile, and RefinedWeb. Just sharing them is illegal for most of the content in them.

I got ideas for how to deal with that in countries with TDM exceptions, like Singapore. For now, the only things we can share with others for model training are (a) public domain works and (b) content licensed for permissive use and sharing. Gutenberg entries before a certain year should be pretty risk-free.