Data, implies factual information. You can not copyright factual information.
The fact that I use the word "appalling" to describe the practice of doing this results in some vector relationship between the words. Thats the data, the fact, not the writing itself.
There are going to be a bunch of interesting court cases where the court is going to have to backtrack on copyrighting facts. Or were going to have to get some real odd legal interpretations of how LLM's work (and buy into them). Or we're going to have to change the law (giving everyone else first mover advantage).
Base on how things have been working I am betting that it's the last one, because it pulls up the ladder.
The argument that is going to be made, that your copy right work stands. That the model doesn't care about your document it cares that "the" was used N number of times and its relationships to other words. That information isnt your work, and it is factual. That "data" only has value is when it's weighted against all the "data" put into the system, again not your work at all. (We would say thats information derived, but it will be argued that it is transformed).
Data, implies factual information. You can not copyright factual information.
The fact that I use the word "appalling" to describe the practice of doing this results in some vector relationship between the words. Thats the data, the fact, not the writing itself.
There are going to be a bunch of interesting court cases where the court is going to have to backtrack on copyrighting facts. Or were going to have to get some real odd legal interpretations of how LLM's work (and buy into them). Or we're going to have to change the law (giving everyone else first mover advantage).
Base on how things have been working I am betting that it's the last one, because it pulls up the ladder.