Model weights could be treated the same way phone books, encyclopedias, and other collections of data are treated. The copyright is over the collection itself, even if the individual items are not copyrightable.
> Encyclopedias are copyrightable. Phone books are not.
It depends on the jurisdiction. The US Supreme Court ruled that phone books are not copyrightable in the 1991 case Feist Publications, Inc., v. Rural Telephone Service Co. However, that is not the law in the UK, which generally follows the 1900 House of Lords decision Walter v Lane that found that mere "sweat of the brow" is enough to establish copyright – that case upheld a publisher's copyright on a book of speeches by politicians, purely on the grounds of the human effort involved in transcribing them.
Furthermore, under its 1996 Database Directive, the EU introduced the sui generis database right, which is a legally distinct form of intellectual property from copyright, but with many of the same features, protecting mere aggregations of information, including phone directories. The UK has retained this after Brexit. However, EU directives give member states discretion over the precise legal mechanism of their implementation, and the UK used that discretion to make database rights a subset of copyright – so, while in EU law they are a technically distinct type of IP from copyright, under UK law they are an application of copyright. EU law only requires database rights to have a term of 15 years.
Do not be surprised if in the next couple of years the EU comes out with an "AI Model Weights Directive" establishing a "sui generis AI model weights right". And I'm sure the US Congress will be interested in following suit. I expect OpenAI / Meta / Google / Microsoft / etc. will be lobbying for them to do so.
Encyclopedias may be collections of facts, but the writing is generally creative. Phone books are literally just facts. AI models are literally just facts.
What if I train an AI model on exactly one copyrighted work and all it does is spit that work back out?
E.g., if I upload Marvels_Avengers.mkv.onnx and it reliably reproduces the original (after all, it's just a fact that the first byte of the original file is 0xF0, etc.).
A work that is “substantially similar” to a copyrighted work infringes that work, under US law, no matter how it was produced. (Note: Some exceptions apply and you have to read a lot of cases to get an idea of what courts find “substantially similar”.)
Are they, or are they collections of probabilities? If they are probabilities, and those probabilities change from model to model, that seems like they might be copyrightable.
If Google, OpenAI, Facebook, and Anthropic each train a model from scratch on an identical training corpus, they would wind up with four different models that had four differing sets of weights, because they digest and process the same input corpus differently.
That indicates to me that they are not a collection of facts.
The AI training algorithms are deterministic given the same dataset, same model architecture, and same set of hyperparameters. The main reasons the models would not be identical are differing random seeds and precision issues. The differences would not be due to any creative decisions.
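To make the seed point concrete, here is a toy sketch in plain Python (nothing like a real training pipeline; the `train` function, the dataset, and the hyperparameter values are all made up for illustration). It fits a line with SGD where the only source of randomness is the seed, then compares runs. It does not model the precision side of the argument, which shows up on real hardware rather than in a single-threaded toy.

```python
import random

def train(seed, data, epochs=5, lr=0.01):
    """Fit y ~ w*x + b with plain SGD. All randomness comes from `seed`.

    Hypothetical toy trainer, for illustration only.
    """
    rng = random.Random(seed)
    # Random initialisation of the two "weights".
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # random order in which examples are visited
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

# Same "corpus" for every run.
data = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]

run_a = train(seed=0, data=data)
run_b = train(seed=0, data=data)  # same seed as run_a
run_c = train(seed=1, data=data)  # different seed

same_seed_identical = run_a == run_b
different_seed_identical = run_a == run_c

print("same seed, identical weights:     ", same_seed_identical)
print("different seed, identical weights:", different_seed_identical)
```

Same data, same "architecture", same hyperparameters: the only thing that moves the resulting weights here is the seed, and nobody chose the seed for creative reasons.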
Who gives a damn about copyright when this is clearly profiting off of someone else's work without compensation? Sometimes the law is inadequate and that's ok—the law just needs to change.