I personally disagree but you make fair points. Scale: Many companies (e.g. Goog...

roywiggins · on June 10, 2024

There's a big difference between scraping a website so you can direct curious people to it (Googlebot) and scraping a website so you can set up a new website that conveys the same information, but earns you money and doesn't even credit the sources used (which these LLM services often do).

There is a whole genre of copyright infringement where someone will scrape a website and create a per-pixel copy of it but loaded up with ads, and blackhat SEOed to show up above the original website on searches. That's bad, and to the extent that LLMs are doing similar things, they are bad too.

Imagine I scrape your elaborate GameFAQs walkthrough of A Link to the Past. I could 1) use what I learn to direct curious people to its URL, or 2) remove your name from it, cut it into pieces, and rehost the content on my own page, mashed up with other walkthroughs of the same game. Then I sell this service as a revolutionary breakthrough that will free people from relying on carefully poring through GameFAQs walkthroughs ever again.

People will get mad about the second one, and to the extent that what LLMs do is like that, they will get mad at LLMs.

astrange · on June 10, 2024

> There's a big difference between scraping a website so you can direct curious people to it (Googlebot) and scraping a website so you can set up a new website that conveys the same information, but earns you money and doesn't even credit the sources used (which these LLM services often do).

"Crediting the sources used" is not really a principle in copyright law. (Funny enough, online fanartists seem determined to convince everyone it is as a way of shaming people into doing it.)

A use being transformative is protective, though, and that is what both of those cases rely on.

roywiggins · on June 11, 2024

Yes, credit doesn't matter for copyright, but I'm more talking about why people are mad about some uses of scraping and not others.

Legality aside, there is something very strange about a device that both 1) relies on your content to exist and could not work without it and 2) is attempting to replace it with its own proprietary chat interface. Googlebot mostly doesn't act like it's going to replace the internet, but Gemini, ChatGPT, etc. all do.

They're announcing "hi we are going to scrape all your data, put it into a pot, sell that pot back to you, and by the way, we are pushing this as a replacement for search, so from now on your only audience will be scrapers; all the human eyeballs will be on our website, which as we said before, relies on your work to exist."

Aloisius · on June 11, 2024

> remove your name from it, cut it into pieces, and rehost the content on my own page, mashed up with other walkthroughs of the same game.

This would very likely be legal, as walkthroughs are largely non-copyrightable factual information. The few creative aspects that are copyrightable, such as organization, would presumably be lost if it were cut into pieces.

Of course, if some LLM did it automatically, no part of it would be copyrightable, so someone could come along and copy the content verbatim from your subscription site and host it for free, freeing everyone from ever visiting your site as well.