It's not at all clear whether weights are copyrightable.

Radim · on March 29, 2023

There's some irony in BigCos using everyone's actual IP freely to train their models, no qualms whatsoever.

And then people being scared to even download said models because of "OMG IP!"

The asymmetry of power (and dare I say, domestication) is astounding.

FooBarWidget · on March 29, 2023

I'm pretty sure they are. If not copyrightable, then at least the database law should apply. One can easily make the case in front of a judge that the situation is similar to databases: the value of weights lies in the amount of work needed to gather the training data, thus weights should be considered a sort of crystallization of a database.

jsnell · on March 29, 2023

But the entire business model of the companies making the models seems to be including copyrighted data in the training set under the guise of fair use. If the weights are considered to be a derived work of the training data as a whole, it seems the weights would also have to be a derived work of the individual items in the training data. So I doubt any of them will be making that argument.

(Except maybe companies that have access to vast amounts of training data with an explicit license, e.g. because the content is created by their users rather than just scraped from the web?)

FooBarWidget · on March 29, 2023

That doesn't matter to database laws. Databases are protected under the premise that collecting the data takes work. How that data is licensed is orthogonal to database law.

jsnell · on March 29, 2023

If I understand correctly your claim was that "the value lies in gathering [a database] of the training data"; that the curation of the training data is what gives the trainer an intellectual property claim on the otherwise mechanical process of creating a model, right? Not that the model itself was a database.

For them to make the argument in court that database rights over the database of training data mean they have rights over the model too, they'd need to argue that the model is a derivative work of the training data. And then it'd mean their model is also a derived work from all the billions of works they scraped to get that data set. It would destroy the business model of the OpenAIs of the world; there is no chance they try to argue this in court.

nl · on March 30, 2023

> For them to make the argument in court that database rights over the database of training data mean they have rights over the model too, they'd need to argue that the model is a derivative work of the training data. And then it'd mean their model is also a derived work from all the billions of works they scraped to get that data set. It would destroy the business model of the OpenAIs of the world; there is no chance they try to argue this in court.

This doesn't follow at all.

They can argue they used that work under fair use and/or that their work was transformative. This is a fairly clear extension of the arguments used by search engines that indexing and displaying summaries is not a copyright violation, and these arguments have been accepted by courts in most circumstances.

jsnell · on March 30, 2023

If the uncreative and automated work of training the model is transformative enough to impact the rights of the original content creators, it would also be transformative enough to impact the rights of the database curator.

The fair use case is much harder to make here than for search engines, since the model will be directly competing with the content creators. And again, how could e.g. OpenAI claim that their use of the original content to train the model, and the subsequent use of the model and its outputs, is fair use, while simultaneously claiming that the model could not be used without infringing their DB rights? You can argue fair use for both or neither; trying to argue it for just one of the two is incoherent.

And everyone building models needs free access to the training data way more than they need copyright as a means to protect the model.

nl · on March 30, 2023

I don't necessarily disagree, but it's very unclear what a court would find.

I suggest https://arxiv.org/abs/2303.15715 for a complete overview.

jsnell · on March 31, 2023

Agreed! It being unclear was in fact my first message in this discussion :) Thanks for the link, I'll definitely need to read it.

muyuu · on March 29, 2023

yea I do wonder about this, but even Meta are acting as if their releasing it means in effect that the cat is out of the bag

at this point, their not even complaining about it must mean that they accept the data is public now