
Literally no rights agreement covers LLMs. They cover reproduction of the work, but LLMs don't obviously do this: the model transiently running an algorithm over the text is superficially no different from the use of any other classifier or scoring system, like those already used by law firms looking to sue people for sharing torrents.



> They cover reproduction of the work, but LLMs don't obviously do this

LLMs are much smaller than their training sets; there is no space to memorize the training data. They might memorize small snippets, but never full books. They are the worst infringement tools ever made - why replicate Harry Potter via an LLM, which is slow, expensive, and lossy, when you could download the book so much more easily?

A second argument is that using the LLM blends a new intent into the process: that of the prompter. This can render the outputs transformative. And most LLM interactions are one-time use, like a scratch pad rather than a finished work.


The lossy compression argument is interesting.

How many bits of entropy in Harry Potter?

How many bits of entropy in a lossy-compressed abridgement that is nevertheless enough, when reconstituted, to constitute a copyright infringement of Harry Potter?
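
A rough back-of-envelope, with every figure assumed rather than measured (book length, Shannon's ~1 bit/char estimate for English, a hypothetical 10% abridgement, a 7B-parameter model):

    # all figures below are assumptions, not measurements
    words = 77_000                    # approx. length of HP book 1
    chars = words * 5.8               # avg English word incl. trailing space
    book_bits = chars * 1.0           # ~1 bit of entropy per char of English
    print(f"full text:   ~{book_bits / 8 / 1024:.0f} KiB of entropy")
    print(f"abridgement: ~{0.1 * book_bits / 8 / 1024:.1f} KiB of entropy")

    params = 7e9                      # e.g. a 7B-parameter model
    model_bits = params * 16          # fp16 weights
    print(f"model:       ~{model_bits / 8 / 2**30:.0f} GiB of weights")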

The latter is absolutely small enough to fit in an LLM, although how close it would get to the original work is debatable. The question is whether copyright is violated:

1) inherently by the model operator, during the training.

2) by the model/model owner, as part of the generation.

3) by the user, in making the model do so and then reproducing the result.


my personal perspectives:

1) straight up copying. download a bunch of copyrighted stuff -> making a copy. no way out of this one.

2) a derivative work can be/is being generated here. very grey area - what counts as a “derivative” work? read about the robin thicke blurred lines court case for a rollercoaster of a time about derivative musical works.

3) making the model do so? do you mean getting an output and the user copying the result? that’s copying the derivative work, which depends on whatever copyright agreement happens once a derivative work claim is sorted out.

that’s based on my 5 years of music copyright experience, although that was about ten years ago now, so there might be some stuff i’ve got wrong.


You can ensure a model trains on transformative rather than derivative synthetic texts: for example, by asking for summaries, turning the source into QA pairs, or doing contrastive synthesis across multiple copyrighted works. The resulting model can never regurgitate the original training set because it has never seen it. This approach takes only the abstract ideas from copyrighted sources, protecting their specific expression.
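
A minimal sketch of what that pipeline could look like, assuming the openai Python client; the model name, the prompt wording, and the to_qa_pairs helper are all hypothetical:

    # sketch: distill each source passage into QA pairs before training,
    # so the trained model never sees the original expression.
    from openai import OpenAI

    client = OpenAI()

    def to_qa_pairs(passage: str) -> str:
        """Ask an LLM for QA pairs covering only the ideas in a passage."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "Write question-answer pairs that capture only "
                            "the ideas in the text, never quoting it verbatim."},
                {"role": "user", "content": passage},
            ],
        )
        return resp.choices[0].message.content

    # then train only on to_qa_pairs(chunk) outputs, never the raw chunks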

If abstract ideas were protectable, what would stop an LLM from learning not from the original source but from social commentary and follow-up works? We can't ask people not to reproduce ideas they read about. On the other hand, protecting abstractions would kneecap creativity in both humans and AI.


That's an interesting argument, which makes the case for "it's what you make it do, not what it can do, which constitutes a violation" a little stronger IMO.


1) It's definitely copying, but that doesn't necessarily mean the end product is itself a copyright violation. (And that remains true even where some of the steps to make it were themselves violations).

2) Agreed! Where this becomes interesting with LLMs is that, as with people, they can have the capacity to produce a derivative work even without having seen the original.

For example, an LLM that had "read" enough reviews of Harry Potter might be able to produce a reasonable stab at the book (at least enough so for the law to consider it a derivative) without ever having consumed the work itself or direct derivatives.

3) It's more of a tool-use and intent argument. One might argue that an LLM is a machine, not a set of content/data, and that liability for what it does sits firmly with the user/operator, not those who made it. If I use a typewriter to copy Harry Potter, or a weapon to hurt or kill someone, in neither case does the machine or its maker bear any liability.


do those classifiers read copyrighted material? i thought they simply joined the swarm and seeded (reproduction with permission)

youtube etc. classifiers definitely do read others' material, though.



