Counterpoint (not just to be annoying — I think you pose a very interesting unanswered question):
If I read your hobby website about photography and use it to take 1% better pictures, do I owe you 1% of what my clients pay me?
I think that probably most people would say no, assuming you could even determine that 1% in a way that both parties agreed was fair. I think generally, we have an understanding that some stuff is put out into the world for other humans to learn from and use to make themselves better, and that they don’t owe the original authors anything other than the price of admission.
I guess it comes down to this: do we think that training a model is:
- like storing and later reproducing a version of some collected data, or
- like learning from collected data, and synthesizing new info?
Is there even a meaningful distinction, for a computer?
(Is there even a meaningful distinction for a human…?)
This is a very thought-provoking point, and it genuinely pushed me to think more deeply about it. The purpose of my website is threefold: to document my own knowledge, maybe some vanity, and the urge to give something back so that "someone" can make a better living, or similar.
Things get interesting at corporate scale. There are fat VC funds, executives, boards of directors and what not, all making far more money and living far more comfortably than an individual trying to get better at their craft to put food on the table. And on top of that, you don't give me access to the product that was refined on my input.
It is like someone learning photography from my website, later taking a real masterpiece of a shot, but then asking me for money each time I want to view the photo in their studio.
Yes, it is interesting. To me, the important thing is that our labour is exploited in many more (and far more malicious) ways than making an LLM 0.000001% better, maybe (or maybe it makes it worse!). Therefore, the problem isn't the AI; it is this giant financial machine which sucks value out of all who actually produce it, no matter what tools it uses to do so.
I doubt the number of content creators will increase or even stay constant if they know that only AI models will continue "reading" them.
> do I owe you 1% of what my clients pay me?
I would still derive some immaterial gain or satisfaction from you reading my website specifically and using what you learnt to improve yourself. As I expect most people would, so it's still a give and take relationship. LLMs sever that link.
It is doubtful many people will be as willing to continue "putting stuff out into the world" if they know that they are only contributing to some sort of (arguably semi-dystopian) hive-mind.
IMHO whether what they are doing or not is justifiable from a legalistic perspective is tangential and not that relevant if we're talking about free/non-commercial content.
Do they though? I mean, do you personally have a link to the people that are consuming the content you post publicly?
I find all the vitriol around LLMs being trained on public data to be a bit weird. If you don't want that data being used then don't publish it for the world to see? Why get mad when you are the one freely publishing the data in the first place? That's like posting your content on a bulleting board in the dorm common room and telling the trust-fund kids they can't read it because they are rich and you don't want them learning anything from you that might make them richer. Maybe a bad analogy, but I feel like it's a fair approximation of the vitriol I see.
It should be treated as learning. If it truly stores and reproduces a photo (to some high accuracy), then there are already laws in place that handle this. Your client using the output may infringe on the photographer's rights, which may fall back on you depending on your contract.
If I watch a YouTube video, my browser is also in a way scraping YouTube and storing a (temporary) copy of the video. Does it make sense to protect the owner's rights at this point? Absolutely not. Instead, we wait to see whether I share that downloaded video or content from it again, or somehow reuse it in my own products. Only then does the law step in.
The distinction is the scale at which OpenAI can make a profit off of your work. Now this might sound trivial, but the scale of possible fraud has been the biggest argument against online elections.
I used to go to the library, find books with the relevant chapters related to what I wanted to learn, and the librarian would photocopy all the pages I wanted to take home. So I guess technically you could copy all the books for your own use.
It's just impractical to photocopy every page of every book in a library.
When I used libraries, you could photocopy a percentage of a book (15% maybe?), although I doubt it was enforced. One could make many trips, but it is impractical, as you say.
Not really sure that this analogy applies, because I could definitely photocopy as many books from the library as I physically can. No one is going to stop me.
Well, it's not so much about the physical act of doing it; it's about trying to convince the world it's for your own private use and not for commercial gain.
Otherwise, intellectual property laws can perhaps apply.
It'd be a hard push to claim it's fair use; it's wholesale copying of others' works.
Because authors and publishers wouldn't be very excited about that and would lobby governments to limit that (and I 100% believe they would be right to do that).
I think we need to put this argument in terms of consent and actual harms caused. Human artists are generally down for other human artists to learn from their art and use their stuff as a reference for the purpose of learning, because the next artist generally will have their own style from their own quirks in muscle memory, skill, experience, etc. That contributes meaningfully to Art and keeps the field alive by allowing new artists to enter the field.
AI training is basically only extractive and has the potential to severely disrupt the very field that made the AI systems possible at all. It's a much more mechanical process than the human interaction of studying a master. It doesn't develop any human skills.
Even if the processes were the same (and I don't think they are, as someone who has actually done computational psychology research), I would still think the AI companies are doing something they know is harmful to actual creative people that generate real value.
What if very rich people came to your small free-entry photo studio to look at your pictures, and, perhaps because they have very fast jets, also went to every other photo studio in the world to look at every other photographer's pictures? Knowing this, would you still let them in for free?
I believe not. Most people would make a distinction between “normal” and “rich”: they would give normal people free access, but the rich should pay for it.
It’s like a billionaire asking for a free hot dog. It’s like “come on, you can easily pay $100, which could even sponsor it for the next 100 people”.
Here it’s not the AI itself that’s exploiting you. It’s the rich people that make the AI that get even richer - partly thanks to your free work.
I don't think we really even need to dive that deep into the philosophical aspect of this. I think that it's fine to simply treat humans and machines differently, the same way we decided that animals cannot hold copyright for a work.
The reason copyright law exists in the first place is due to the difference of scale between copying books by hand and using a machine to do it, so I think "it's different because a machine is doing it" is a completely rational stance to take.
I think there is a clear distinction that can be made. With humans, you can't determine if or how that information will be utilized. With any machine, you can. It's practically a copy. If it's only storing derivative information, if there is fuzziness, that's intentional.
Far in the future, if ever, when we have biological-grade artificial beings that you can't program, control, and limit in the classical software-development sense, this could be rethought.
I know very few altruistic humans. Whenever someone puts up some content online, I believe there is always some motive from the author to benefit themselves, even if it's subconscious. Perhaps through ad revenue or exposure from their blog/OSS project, or just the dopamine of fake internet points from answering questions on forums. A human may particularly like your content and keep coming back to it, or spread it with attribution.
Even if training a model turns out to be similar to human learning, I don't think it necessarily follows that it should be treated the same, legally or morally. There's nothing wrong with human laws or morals that enshrine human behavior, like the human way of learning, as special and distinct from machine learning.
A better analogy would be: I read your hobby website and start a photography section on my Q&A website based on what I've learned from your site. That leads to a 1% increase in my revenue.