
That seems to fall apart quickly. Even if training could be considered fair use, surely just distributing the raw masses of copyrighted works can't be fair use under any reasonable definition. Otherwise, why did TBP, KAT, and MegaUpload shut down if you could defeat copyright with sheer numbers?



Indeed. Also, in the US, whether or not something is fair use involves a four-factor test[1], and two of the factors are the amount and substantiality of what's taken and the effect on the market for the original. In this case, the amount is "everything" and the effect on the market is potentially very large for authors/publishers.

[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/


>two of the factors are the amount and substantiality of what's taken and the effect on any market

books.google.com has been allowed to copy all the books they can lay their hands on, so long as they don't regurgitate them in full, so it's not really the taking, but any subsequent reproductions. And the effect on the market is insubstantial if the alternative wasn't going to be the equivalent sales.


You can download the whole dataset, so they're certainly able to regurgitate them in full.


One thing that we did with distributing certain copyright-protected textual material was to scramble them at the paragraph level.

If you take every paragraph in the Harry Potter saga and sort the paragraphs in alphabetical order, it's just as good for training short-context-window models, but not a "harm to the market" leading to a lost sale for anyone who wants to read the books.
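
For anyone curious, here is a rough sketch of what that kind of paragraph-level scramble could look like. This is an assumption about the approach, not the actual code we used: the function name `scramble_paragraphs` and the blank-line paragraph delimiter are made up for illustration.

```python
import re


def scramble_paragraphs(text: str) -> str:
    """Return the paragraphs of `text` sorted alphabetically.

    Assumes (hypothetically) that paragraphs are separated by blank lines.
    The original ordering is discarded, so the narrative cannot be read
    straight through, but each paragraph is kept intact.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "\n\n".join(sorted(paragraphs))


sample = (
    "The owl arrived at dawn.\n\n"
    "A letter fell on the mat.\n\n"
    "Nobody noticed the cat."
)
scrambled = scramble_paragraphs(sample)
print(scrambled)
```

Each paragraph still carries its local context, which is all a short-context-window model sees anyway, while the chapter-level sequence is gone.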


It absolutely could be a harm to the market if people use the resulting model to generate "Harry Potter" books instead of buying the real ones.


The resulting model doesn't have access to the information about what follows what, so it can recreate paragraphs but can't recreate their proper order for a chapter or book. Well, it can try to guess.


Totally get that, but the law doesn't care about that as I understand it. For the four-factor test, what matters is whether the use is going to affect the market for the original work (not the technicalities of how the model works). If people generate pseudo-Harry Potter via a model that was trained on Harry Potter, then the court may well decide that the market for real Harry Potter is affected. That doesn't seem an unreasonable conclusion to me.

I'm pretty sure that's what the lawyers will argue in the Silverman case for example. It's going to be interesting to see how the courts decide.


Since when has TBP shut down?


I think they are referring to the many times the domain name has been seized and the site temporarily shut down.


https://www.youtube.com/watch?v=eTOKXCEwo_8 for those that haven't seen it.


Some of the founders were convicted of crimes but the database and code are out there.


Megaupload et al. went against the entertainment industry at a time when that industry had the money to pay the lawyers to convince the judges what the law means.

In the present moment, on the other hand, it is the entities in the AI industry (e.g. MS) that have the money and can hire the lawyers to convince the judges. Realistically speaking, it's very likely that things will swing the way of the AI companies, which will benefit these guys, albeit indirectly. By themselves they're too small to push their agenda; they're just bit players.



