> not entitled to easy and cheap access to data they don't own
This is not copyright as we know it. Copyright protects against copying, not against accessing data. You can still compile statistics from data you don't own. The models are like a compressed version of the originals, so compressed that you can't retrieve more than a few snippets of the original text. Newer models train on filtered synthetic text, which is one step removed from the protected expression in the copyrighted works. Should abstractions be protected by copyright?
However, to get to that compressed state, the original data would have to be processed as a whole in some way, which requires a copy of the material to be available. If that copy was obtained illegally, what are the implications?