It's exactly DRM, management of legal access to digital content. The "media" part has been optional for decades.

The comment they replied to didn't suggest sane defaults; it suggested DRM. Here's the quote. No default works that way (a default you can't override isn't a default):

> At a _bare_ minimum, that means obeying robot.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that.




I'll also add something that I expect to be somewhat controversial, given earlier conversations on HN[0]: I see contexts in which it would be perfectly valid to use this and ignore robots.txt.

If I were directing some LLM agent to specifically access a site on my behalf and get a usable digest of that information to answer questions, or whatever, then that use of the headless browser is not a spider; it's a user agent. Just an unusual one.

The amount of traffic generated is consistent with browsing, not scraping. So no, I don't think building in a mandatory robots.txt respecter is a reasonable ask. Someone who wants to deploy the tool at scale while ignoring robots.txt will just disable that check, and making it mandatory causes problems for legitimate use cases where the headless browser is not a robot in any reasonable or normal interpretation of the term. A rough sketch of the alternative, a default that can be overridden, is below.
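To make the "sane default, not DRM" distinction concrete, here is a minimal sketch (mine, not from anyone in the thread) using Python's standard urllib.robotparser: robots.txt is honored by default, but the person running the tool can explicitly opt out when acting as their own user agent. The function and parameter names are hypothetical, chosen for illustration.

  # Sketch: robots.txt compliance as an overridable default, not a mandate.
  from urllib import robotparser
  from urllib.parse import urlsplit

  def allowed_to_fetch(url: str, user_agent: str = "my-agent",
                       respect_robots_txt: bool = True) -> bool:
      """Return True if fetching `url` is permitted under robots.txt,
      or unconditionally True when the caller overrides the default."""
      if not respect_robots_txt:
          # Explicit, deliberate override: acting as a user agent, not a spider.
          return True
      parts = urlsplit(url)
      rp = robotparser.RobotFileParser()
      rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
      rp.read()
      return rp.can_fetch(user_agent, url)

  # Default behavior obeys robots.txt; the override is a conscious choice.
  print(allowed_to_fetch("https://example.com/some/page"))
  print(allowed_to_fetch("https://example.com/some/page", respect_robots_txt=False))

The point of the sketch is only that the override exists and is the user's decision, which is exactly what the quoted comment wanted to forbid.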

[0]: I don't entirely understand why this is controversial, but it was.



