
If I understand correctly your claim was that "the value lies in gathering [a database] of the training data"; that the curation of the training data is what gives the trainer an intellectual property claim on the otherwise mechanical process of creating a model, right? Not that the model itself was a database.

For them to make the argument in court that database rights over the database of training data mean they have rights over the model too, they'd need to argue that the model is a derivative work of the training data. And then it'd mean their model is also a derived work from all the billions of works they scraped to get that data set. It would destroy the business model of the OpenAIs of the world; there is no chance they'd try to argue this in court.




> For them to make the argument in court that database rights over the database of training data mean they have rights over the model too, they'd need to argue that the model is a derivative work of the training data. And then it'd mean their model is also a derived work from all the billions of works they scraped to get that data set. It would destroy the business model of the OpenAIs of the world; there is no chance they'd try to argue this in court.

This doesn't follow at all.

They can argue that they used that work under fair use and/or that their work was transformative. This is a fairly clear extension of the arguments used by search engines, namely that indexing and displaying summaries is not copyright infringement, and courts have accepted these arguments in most circumstances.


If the uncreative and automated work of training the model is transformative enough to impact the rights of the original content creators, it would also be transformative enough to impact the rights of the database curator.

The fair use case is much harder to make here than for search engines, since the model will be directly competing with the content creators. And again, how could e.g. OpenAI claim that their use of the original content to train the model, and their subsequent use of the model and its outputs, is fair use, while simultaneously claiming that the model could not be used without infringing their DB rights? You can argue fair use for both or neither; trying to argue it for just one of the two is incoherent.

And everyone building models needs free access to the training data way more than they need copyright as a means to protect the model.


I don't necessarily disagree, but it's very unclear what a court would find.

I suggest https://arxiv.org/abs/2303.15715 for a complete overview.


Agreed! It being unclear was in fact my first message in this discussion :) Thanks for the link, I'll definitely need to read it.



