
my personal perspectives

1) straight up copying. download a bunch of copyrighted stuff -> making a copy. no way out of this one.

2) a derivative work can be/is being generated here. very grey area — what counts as a “derivative” work? read about the robin thicke “blurred lines” court case for a rollercoaster of a time about derivative musical works.

3) making the model do so? do you mean getting an output and the user copying the result? that’s copying the derivative work, which depends on whatever copyright agreement happens once a derivative-work claim is sorted out.

that’s based on my 5 years of music copyright experience, although it was about ten years ago now, so there might be some stuff i’ve got wrong there.




You can ensure a model trains on transformative rather than derivative synthetic texts, for example by asking for summaries, turning the source into QA pairs, or doing contrastive synthesis across multiple copyrighted works. That way the resulting model can never regurgitate the originals, because it has never seen them. This approach takes only the abstract ideas from copyrighted sources, leaving their specific expression protected.
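
(As a rough illustration of what that could look like, here is a minimal Python sketch of such a pipeline. The generate() helper is a hypothetical stand-in for whatever LLM API you'd actually call, and the prompt wording and function names are illustrative assumptions, not anything specified above.)

    # Minimal sketch of a "transformative-only" synthetic-data pipeline.
    # generate() is a hypothetical placeholder for an LLM call; prompts and
    # function names are illustrative, not any real library's API.

    def generate(prompt: str) -> str:
        """Placeholder for whatever LLM backend you use."""
        raise NotImplementedError

    def summarize(source: str) -> str:
        # Abstract summary: ideas only, no quotes or close paraphrase.
        return generate(
            "Summarize the main ideas of the following text in your own "
            "words, without quoting it:\n\n" + source)

    def to_qa_pairs(source: str, n: int = 5) -> str:
        # Question/answer pairs about the concepts, not the expression.
        return generate(
            f"Write {n} question-and-answer pairs covering the key concepts "
            "in the following text, without quoting it verbatim:\n\n" + source)

    def contrastive_synthesis(a: str, b: str) -> str:
        # Compare two works so no single work's expression carries over.
        return generate(
            "Compare and contrast the themes and arguments of these two "
            "texts in your own words:\n\nTEXT A:\n" + a + "\n\nTEXT B:\n" + b)

    def build_synthetic_corpus(works: list[str]) -> list[str]:
        corpus = []
        for w in works:
            corpus.append(summarize(w))
            corpus.append(to_qa_pairs(w))
        for a, b in zip(works, works[1:]):
            corpus.append(contrastive_synthesis(a, b))
        return corpus  # train on this instead of the original texts

The point of that structure is that the original texts never enter the training set directly; only abstracted restatements of their ideas do.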

If abstract ideas were protectable, what would stop an LLM from learning not from the original source but from social commentary and follow-up works? We can't ask people not to reproduce ideas they read about. But on the other hand, protecting abstractions would kneecap creativity in both humans and AI.


That's an interesting argument, which makes the case for "it's what you make it do, not what it can do, which constitutes a violation" a little stronger IMO.


1) It's definitely copying, but that doesn't necessarily mean the end product is itself a copyright violation. (And that remains true even where some of the steps to make it were themselves violations).

2) Agreed! Where this becomes interesting with LLMs is that, as with people, they can have the capacity to produce a derivative work even without having seen the original.

For example, an LLM that had "read" enough reviews of Harry Potter might be able to produce a reasonable stab at the book (at least enough for the law to consider it a derivative) without ever having consumed the work itself or its direct derivatives.

3) It's more of a tool-use and intent argument. One might argue that an LLM is a machine, not a set of content/data, and that the liability for what it does sits firmly with the user/operator, not those who made it. If I use a typewriter to copy Harry Potter, or a weapon to hurt or kill someone, neither the machine nor its maker carries any liability.



