I don’t actually think this is complicated. Reading a comment is not the same thing as scraping the internet, and you obviously know that.
A few factors that come to mind would be:
- scale
- informed consent, of which there was none in this case
- how you are going to use that data. For example, using everybody else's work so the world's richest company can make more money from it while giving nothing back in return is a bullshit move.
I think it's even simpler than that: incentives. The entire premise of copyright law (and all IP law) is to protect the incentive to create new stuff, which is often a very risky and highly time- or capital-intensive endeavor.
So here's the question:
Does a person reading a comment destroy the incentive for the author to post it? No. In fact, it is the only thing that produces the incentive for someone to post. People post here when they want that thing to be read by someone else.
Does a model sucking up all the artistic output of the last 400 years and using that to produce an image generator model destroy the incentive of producing and sharing said artistic output? Yes. At least, that is the goal of such a model -- to become so good it is competitive with human artists.
Of course you have plenty of people positioned to benefit from this incentive-destruction claiming it does no such thing. I personally tend to put more credence in the words of people who have historically actually been incentivized by said incentives (i.e. artists), who generally seem to perceive this as destructive to their desire to create and share their work.
> Does a model sucking up all the artistic output of the last 400 years and using that to produce an image generator model destroy the incentive of producing and sharing said artistic output?
Copyright, at least in the US, cares about the effect of the use on the market for that specific work. It's individual ownership, not collective. And while model regurgitation happens, it's less common than you think.
The real harm of AI to artists is market replacement. That is, with everyone using image generators to pop out images like candy, human artists don't have a market to sell into. This isn't even just a matter of "oh boo hoo, I can't compete with Mr. Diffusion". Generative AI is very good at creating spam, which has turned every art market and social media platform into a bunch of warring spambots whose output is statistically indistinguishable from a human's.
The problem is, no IP law in the world is going to recognize this as a problem, because IP is a fundamentally capitalist concept. Asserting that the market for new artistic works and notoriety for those works should be the collective property of artists and artists alone is not a workable legal proposal, even if it's a valid moral principle. And conversely the history of copyright has seen it be completely subverted to the point where it only serves the interests of the publishers in the middle, not the creators of the work in question. Hell, the publishers are licking their chops as to how many artists they can fire and replace with AI, as if all their whinging about Napster and KaZaA 24 years ago was just a puff piece.
> Copyright, at least in the US, cares about the effect of the use on the market for that specific work.
Not quite. The historical implementation of copyright has mostly protected individual pieces of work. Not only does IP law broadly protect much more than individual pieces of work, but the philosophical basis of IP law in general is to protect incentives. Now that the technological landscape has shifted, the case law will almost certainly shift as well because it’s clearly undesirable to live in a world where no one is willing to dedicate themselves to becoming an excellent artist/writer/musician/etc.
IP law is a natural extension of property rights, which in turn is predicated on a utilitarian need to protect certain incentives.
It isn’t clear to me that these models destroy the incentive to create. I mean, ChatGPT can generate comments in my style all day, and yet I’m still incentivized to comment.
I fancy myself a photographer. I still want to take photos even if DALL-E 4 will generate better ones.
What even is the point of creating art? I think there are two purposes: personal expression and enjoyment for others.
People will continue to express themselves even if a bot can produce better art.
And if a bot can produce enjoyment for others en masse, then that seems like a huge win for everybody.
> I think there are two purposes: personal expression and enjoyment for others.
This is exactly what non-artists assume artists do art for.
The reality is that most professional visual artists work in publishing, marketing, entertainment and the like. It’s a regular job. The incentive is money. Similarly for theatre, music, video, dance, etc etc. Artists can’t feed their families off exposure and expressing themselves. Their work has value and taking that work to create free derivative works without compensating them is theft.
Selling art as a way of making a living is actually a pretty recent thing. Until not too long ago, patronage was pretty much the only way for artists to survive. Maybe we will go back to that model?
Artists have been selling art for as long as we've been selling anything. Sure, some successful artists, during a few periods in history, managed to secure patronage, but those have always been the minority. Most artists have sold or traded their wares directly, and your attempts to derail this discussion with inaccurate histories are neither helpful nor appropriate.
Scale: Many companies (e.g. Google, Bing) have been scraping at scale for decades without issue. Why does scale become an issue when an LLM is thrown into the mix?
Informed consent: I’m not sure I fully understand this point, but I’d say most people posting content on the public internet are generally aware that people and bots might view it. I guess you think it’s different when the data is used for an LLM? But why?
Data usage: Same question as above.
I just don’t see how ingestion into an LLM is fundamentally different than the existing scraping processes that the internet is built on.
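For what it's worth, the web's existing consent mechanism doesn't even distinguish between the two cases unless publishers opt out crawler by crawler. A robots.txt sketch of what that opt-out looks like (the user-agent tokens below are the publicly documented ones for Google search, OpenAI's crawler, and Google's AI-training control, but check each vendor's docs before relying on this):

    # Allow search crawling, opt out of AI-training crawlers
    User-agent: Googlebot
    Allow: /

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

Whether any given scraper honors that file is, of course, a separate question.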
There's a big difference between scraping a website so you can direct curious people to it (Googlebot) and scraping a website so you can set up a new website that conveys the same information, but earns you money and doesn't even credit the sources used (which these LLM services often do).
There is a whole genre of copyright infringement where someone will scrape a website and create a per-pixel copy of it but loaded up with ads, and blackhat SEOed to show up above the original website on searches. That's bad, and to the extent that LLMs are doing similar things, they are bad too.
Imagine I scrape your elaborate GameFAQs walkthrough of A Link to the Past. I could 1) use what I learn to direct curious people to its URL, or 2) remove your name from it, cut it into pieces, and rehost the content on my own page, mashed up with other walkthroughs of the same game. Then I sell this service as a revolutionary breakthrough that will free people from relying on carefully poring through GameFAQs walkthroughs ever again.
People will get mad about the second one, and to the extent what LLMs do is like that, will get mad at LLMs.
> There's a big difference between scraping a website so you can direct curious people to it (Googlebot) and scraping a website so you can set up a new website that conveys the same information, but earns you money and doesn't even credit the sources used (which these LLM services often do).
"Crediting the sources used" is not really a principle in copyright law. (Funny enough, online fanartists seem determined to convince everyone it is as a way of shaming people into doing it.)
Whether or not a use is transformative is what offers protection, though, and it's what both of those cases rely on.
Yes, credit doesn't matter for copyright, but I'm more talking about why people are mad about some uses of scraping and not others.
Legality aside, there is something very strange about a device that both 1) relies on your content to exist and could not work without it and 2) is attempting to replace it with its own proprietary chat interface. Googlebot mostly doesn't act like it's going to replace the internet, but Gemini and ChatGPT etc all are.
They're announcing "hi we are going to scrape all your data, put it into a pot, sell that pot back to you, and by the way, we are pushing this as a replacement for search, so from now on your only audience will be scrapers; all the human eyeballs will be on our website, which as we said before, relies on your work to exist."
> remove your name from it, cut it into pieces, and rehost the content on my own page, mashed up with other walkthroughs of the same game.
This would very likely be legal, as walkthroughs are largely non-copyrightable factual information. The few creative aspects that are copyrightable, such as organization, would presumably be lost if it were cut into pieces.
Of course, if some LLM did it automatically, no part of the output would be copyrightable, so someone could come along and copy the content verbatim from your subscription site and host it for free, freeing everyone from ever visiting your site as well.
But then if I write a Pulitzer-prize article called "No snark intended: How the web became such a toxic place", where your comment, and all your other comments for good measure, figure prominently while I ridicule you and this habit of dumbing down complex problems to reduce them to little witty bites, maybe you'd feel I stole something.
Not something big, not something you can enforce, but you'd feel very annoyed that I'm making good money on something you wrote while you get nothing. I think?
I think scale is what changes the nature of the thing. At the point where you're having a machine consume billions of documents, I don't think you could reasonably call that reading anymore. What you are doing, in my eyes, is indexing, and the legal basis for that depends heavily on what you do with it.
Showing your copy to a human would be a reproduction of the work, but serving that page to a human as a cache is usually okay.
If you compile all that information in a database and use it to answer search queries that's also okay, and nothing forbids you from using machine learning on that data to better answer those search queries.
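To make that concrete, here is a toy sketch of the index-plus-cache split I have in mind. It's illustrative only, not any real search engine's design:

    # Toy inverted index (Python). Illustrative only.
    from collections import defaultdict

    # The "cache": verbatim copies of the scraped pages live here.
    cache = {
        "example.com/a": "the model was trained on scraped articles",
        "example.com/b": "scraped data feeds the search index",
    }

    # The "index": just terms mapped back to locations, no full text.
    index = defaultdict(set)
    for url, text in cache.items():
        for word in text.lower().split():
            index[word].add(url)

    def search(query):
        """Return URLs containing every query term."""
        terms = query.lower().split()
        hits = set.intersection(*(index[t] for t in terms)) if terms else set()
        return sorted(hits)

    print(search("scraped"))  # ['example.com/a', 'example.com/b']

The index itself is just terms pointing back at locations; it's the cache of verbatim text that makes the question of how you use it matter.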
Both caching and search indexing are actually being challenged right now, but for the time being they're fine.
But that database is a derivative work, in that it contains copyrighted material and so how you use it matters if you want to avoid infringement — for example a Google employee SSHing to a server to read NYT articles isn't kosher.
What isn't clear is whether the model is a derivative work. Does it contain the information, or is it new information created from the training data? Sure, if you're clever you could probably encode information in the weights and use it as a fancy zip file, but that's a matter of intent. If you use Rewind or Windows Recall and it captures a screenshot of a NYT article and then displays it back to you later, is that a reproduction? Surely not. And that's an autonomous system that stores copyrighted data and regurgitates it verbatim.
So if it's impractical to actually use it for piracy and it very obviously isn't anyone's intent for it to be used as such then I think it's hard to argue it shouldn't be allowed, even on data that was acquired through back channels.
But copyright is more political than logical so who knows what the legal landscape will be in 5 years, especially when AI companies have every incentive to use their lawyers to pull the ladder up behind them.
Data gets either stolen or freed depending on whether the guy who copied it is someone you dislike or like. Personally, I think that Apple is giving the data more exposure which, as I've been informed many times here, is much more valuable than paying for the data.
The irony of "do it for the exposure" is that everyone who actually wants to pay you in exposure isn't actually going to do that, either because they aren't popular enough to measurably expose you, or because they're so popular that they don't want to share the limelight.
AI is a unique third case in which we have billions of creators and no idea who contributed what parts of the model or any specific outputs. So we can't pay in exposure, aside from a brutally long list of unwilling data subjects that will never be read by anyone. Some of the training data is being regurgitated unmodified and needs to be attributed in full, some of it is just informing a general understanding of grammar and is probably being used under fair use, and yet more might not even wind up having any appreciable effect on the model weights.
None of this matters because nobody actually agreed to be paid in exposure, nor was it ever in any AI company's intent - including Apple - to pay in exposure. Data is free purely because it would be extraordinarily inconvenient if anyone in this space had to pay.
And, for the record, this applies far more widely than just image or text generators. Apple is almost surely not the worst offender in the space. For example: all that facial recognition tech your local law enforcement uses? That was trained on your Facebook photos.
No snark intended; I’m seriously asking. If the answer is “no” then where do you draw the line?