That seems to fall apart quickly. Even if training could be considered fair use,...

seanhunter · 2024-03-07T18:49:53 1709837393

Indeed. Also in the US, whether or not something is fair use involves a four-factor test[1], and two of the factors are the amount and substantiality of what's taken and the effect on any market. In this case, the amount is "everything" and the effect on the market is potentially very large for authors/publishers.

[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/

fsckboy · 2024-03-07T19:05:16 1709838316

>two of the factors are the amount and substantiality of what's taken and the effect on any market

books.google.com has been allowed to copy all the books they can lay their hands on, so long as they don't regurgitate them in full, so it's not really the taking, but any subsequent reproductions. And the effect on the market is insubstantial if the alternative wasn't going to be the equivalent sales.

ascorbic · 2024-03-07T22:06:29 1709849189

You can download the whole dataset, so they're certainly able to regurgitate them in full.

PeterisP · 2024-03-07T19:55:04 1709841304

One thing that we did when distributing certain copyright-protected textual material was to scramble it at the paragraph level.

If you take every paragraph in the Harry Potter saga and sort the paragraphs in alphabetical order, it's just as good for training short-context-window models, but not a "harm to the market" leading to a lost sale for anyone who wants to read the books.
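A minimal sketch of that kind of paragraph-level scrambling, assuming paragraphs are separated by blank lines and that a plain case-sensitive string sort is acceptable. The function name `scramble_paragraphs` is hypothetical and only for illustration; it is not the actual tooling used.

```python
import re


def scramble_paragraphs(text: str) -> str:
    """Return the paragraphs of `text` sorted alphabetically.

    Assumption: paragraphs are delimited by one or more blank lines.
    Sorting discards the original order, so the narrative sequence
    cannot be read back from the output.
    """
    # Split on blank lines and drop empty fragments.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Plain string sort (by code point, so uppercase sorts before lowercase).
    return "\n\n".join(sorted(paragraphs))


if __name__ == "__main__":
    sample = "Zebra first.\n\nApple second.\n\nMango third."
    print(scramble_paragraphs(sample))
```

Each paragraph stays intact, so local context within a paragraph survives, while the ordering between paragraphs is lost.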

seanhunter · 2024-03-08T10:09:08 1709892548

It absolutely could be a harm to the market if people use the resulting model to generate "Harry Potter" books instead of buying the real ones.

PeterisP · 2024-03-08T14:37:07 1709908627

The resulting model doesn't have access to the information about what follows what, so it can recreate paragraphs but can't recreate their proper order for a chapter or book. Well, it can try to guess.

seanhunter · 2024-03-08T15:20:38 1709911238

Totally get that, but the law doesn't care about that as I understand it. For the four-factor test it matters whether the use is going to affect the market for the original work (not the technicalities of how the model works). If people generate pseudo-Harry Potter via a model that was trained on Harry Potter then the court may well decide that the market for real Harry Potter is affected. That doesn't seem an unreasonable conclusion to me.

I'm pretty sure that's what the lawyers will argue in the Silverman case for example. It's going to be interesting to see how the courts decide.

justinclift · 2024-03-07T19:05:32 1709838332

Since when has TBP shut down?

gosub100 · 2024-03-07T19:50:03 1709841003

I think they are referring to the many times the domain name has been seized, and shut down temporarily.

fennecfoxy · 2024-03-08T11:18:44 1709896724

https://www.youtube.com/watch?v=eTOKXCEwo_8 for those that haven't seen it.

RecycledEle · 2024-03-07T19:28:49 1709839729

Some of the founders were convicted of crimes but the database and code are out there.

YeGoblynQueenne · 2024-03-07T21:03:53 1709845433

Megaupload et al. went against the entertainment industry at a time when that industry had the money to pay the lawyers to convince the judges what the law means.

In the present moment, on the other hand, it is the entities in the AI industry (e.g. MS) that have the money and can hire the lawyers to convince the judges. Realistically speaking, it's very likely that things will swing the way of the AI companies. That will benefit these guys, albeit indirectly, even though by themselves they're too small to push their agenda; they're just bit players.