
I personally disagree but you make fair points.

Scale: Many companies (e.g. Google, Bing) have been scraping at scale for decades without issue. Why does scale become an issue when an LLM is thrown into the mix?

Informed consent: I’m not sure I fully understand this point, but I’d say most people posting content on the public internet are generally aware that people and bots might view it. I guess you think it’s different when the data is used for an LLM? But why?

Data usage: Same question as above.

I just don’t see how ingestion into an LLM is fundamentally different than the existing scraping processes that the internet is built on.



There's a big difference between scraping a website so you can direct curious people to it (Googlebot) and scraping a website so you can set up a new website that conveys the same information, but earns you money and doesn't even credit the sources used (which these LLM services often do).

There is a whole genre of copyright infringement where someone will scrape a website and create a per-pixel copy of it but loaded up with ads, and blackhat SEOed to show up above the original website on searches. That's bad, and to the extent that LLMs are doing similar things, they are bad too.

Imagine I scrape your elaborate GameFAQs walkthrough of A Link to the Past. I could 1) use what I learn to direct curious people to its URL, or 2) remove your name from it, cut it into pieces, and rehost the content on my own page, mashed up with other walkthroughs of the same game. Then I sell this service as a revolutionary breakthrough that will free people from relying on carefully poring through GameFAQs walkthroughs ever again.

People will get mad about the second one, and to the extent that what LLMs do is like that, people will get mad at LLMs too.


> There's a big difference between scraping a website so you can direct curious people to it (Googlebot) and scraping a website so you can set up a new website that conveys the same information, but earns you money and doesn't even credit the sources used (which these LLM services often do).

"Crediting the sources used" is not really a principle in copyright law. (Funny enough, online fanartists seem determined to convince everyone it is as a way of shaming people into doing it.)

Whether a use is transformative does matter, though: a transformative use is protected, and that is what both of those cases rely on.


Yes, credit doesn't matter for copyright, but I'm more talking about why people are mad about some uses of scraping and not others.

Legality aside, there is something very strange about a device that both 1) relies on your content to exist and could not work without it and 2) is attempting to replace it with its own proprietary chat interface. Googlebot mostly doesn't act like it's going to replace the internet, but Gemini and ChatGPT etc all are.

They're announcing "hi we are going to scrape all your data, put it into a pot, sell that pot back to you, and by the way, we are pushing this as a replacement for search, so from now on your only audience will be scrapers; all the human eyeballs will be on our website, which as we said before, relies on your work to exist."


> remove your name from it, cut it into pieces, and rehost the content on my own page, mashed up with other walkthroughs of the same game.

This would very likely be legal, as walkthroughs are largely non-copyrightable factual information. The few creative aspects that are copyrightable, such as organization, would presumably be lost if it was cut into pieces.

Of course, if some LLM did it automatically, no part of it would be copyrightable, so someone could come along and copy the content verbatim from your subscription site and host it for free, freeing everyone from ever visiting your site as well.



