That seems to fall apart quickly. Even if training could be considered fair use, surely just distributing raw masses of copyrighted works can't be fair use under any reasonable definition. Otherwise, why did TBP, KAT, and MegaUpload shut down if you could defeat copyright with sheer numbers?
Indeed. Also, in the US, whether or not something is fair use involves a four-factor test[1], and two of the factors are the amount and substantiality of what's taken and the effect on any market. In this case, the amount is "everything" and the effect on the market is potentially very large for authors/publishers.
>two of the factors are the amount and substantiality of what's taken and the effect on any market
books.google.com has been allowed to copy all the books they can lay their hands on, so long as they don't regurgitate them in full, so it's not really the taking, but any subsequent reproductions. And the effect on the market is insubstantial if the alternative wasn't going to be the equivalent sales.
One thing that we did when distributing certain copyright-protected textual material was to scramble it at the paragraph level.
If you take every paragraph in the Harry Potter saga and sort the paragraphs in alphabetical order, it's just as good for training short-context-window models, but not a "harm to the market" leading to a lost sale for anyone who wants to read the books.
The resulting model doesn't have access to the information about what follows what, so it can recreate paragraphs but can't recreate their proper order for a chapter or book. Well, it can try to guess...
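For anyone curious what that looks like in practice, here is a minimal sketch of the paragraph-level scramble. It is only an illustration, not the exact tooling we used: the function name `scramble_paragraphs` is made up for this example, and it assumes paragraphs are separated by blank lines and that a case-insensitive sort is an acceptable reading of "alphabetical order".

```python
import re


def scramble_paragraphs(text: str) -> str:
    """Return the text with its paragraphs sorted alphabetically.

    Assumption (not from the original post): paragraphs are separated
    by one or more blank lines.
    """
    # Split on blank lines and drop empty fragments.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Case-insensitive sort; this discards the original paragraph order.
    paragraphs.sort(key=str.casefold)
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "Zebra.\n\napple.\n\nMango."
    print(scramble_paragraphs(sample))
```

Each paragraph survives intact, so anything that only needs paragraph-sized context still sees the same text, but the ordering information is gone from the distributed file.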
Totally get that, but the law doesn't care about that as I understand it. For the four-factor test, what matters is whether the use is going to affect the market for the original work (not the technicalities of how the model works). If people generate pseudo-Harry Potter via a model that was trained on Harry Potter, then the court may well decide that the market for real Harry Potter is affected. That doesn't seem an unreasonable conclusion to me.
I'm pretty sure that's what the lawyers will argue in the Silverman case for example. It's going to be interesting to see how the courts decide.
Megaupload et al. went up against the entertainment industry at a time when that industry had the money to pay the lawyers to convince the judges what the law means.
In the present moment, on the other hand, it is the entities in the AI industry (e.g. MS) that have the money and can hire the lawyers to convince the judges. Realistically speaking, it's very likely that things will swing the way of the AI companies. That will benefit these guys too, albeit indirectly, even though by themselves they're too small to push their agenda. They're just bit players.