
I don't see why that would require them to pay royalties.

The training data is also publicly available: https://pile.eleuther.ai/




From The Pile: An 800GB Dataset of Diverse Text for Language Modeling:

https://arxiv.org/pdf/2101.00027.pdf

7.1 Legality of Content

While the machine learning community has begun to discuss the issue of the legality of training models on copyright data, there is little acknowledgment of the fact that the processing and distribution of data owned by others may also be a violation of copyright law. As a step in that direction, we discuss the reasons we believe that our use of copyright data is in compliance with US copyright law.

Under Sony Corp. of America v. Universal City Studios, Inc. (1984) (and affirmed in subsequent rulings such as Righthaven LLC v. Hoehn (2013); Google (2015)), non-commercial, not-for-profit use of copyright media is preemptively fair use. Additionally, our use is transformative, in the sense that the original form of the data is ineffective for our purposes and our form of the data is ineffective for the purposes of the original documents. Although we use the full text of copyright works, this is not necessarily disqualifying when the full work is necessary (Kelly v. Arriba Soft Corp., 2003). In our case, the long-term dependencies in natural language require that the full text be used in order to produce the best results (Dai et al., 2019; Rae et al., 2019; Henighan et al., 2020; Liu et al., 2018).

Copyright law varies by country, and there may be additional restrictions on some of these works in particular jurisdictions. To enable easier compliance with local laws, the Pile reproduction code is available and can be used to exclude certain components of the Pile which are inappropriate for the user. Unfortunately, we do not have the metadata necessary to determine exactly which texts are copyrighted, and so this can only be undertaken at the component level. Thus, this should be taken to be a heuristic rather than a precise determination.

1984. Sony Corp. of America v. Universal City Studios, Inc.
2003. Kelly v. Arriba Soft Corp.
2013. Righthaven LLC v. Hoehn.



