
No. The key difference is that when a user asks about a specific page and Perplexity fetches that page, the fetch is initiated by a human, so it is not acting as a crawler. It doesn't matter how many times this happens or what they do with the result. If they aren't recursively fetching pages, then they aren't a crawler and robots.txt does not apply to them. robots.txt is not a generic access control mechanism; it is designed solely for automated clients.
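For what it's worth, robots.txt is only a voluntary convention that automated clients are expected to check themselves before crawling. A minimal sketch of how a well-behaved crawler typically consults it, using Python's stdlib (the bot name and URLs here are just placeholders):

    import urllib.robotparser

    # Fetch and parse the site's robots.txt once.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # An automated crawler asks before every fetch it schedules itself;
    # a human requesting one specific page is outside this protocol.
    if rp.can_fetch("ExampleBot", "https://example.com/some/page"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")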


Many people don't want their data used for free/any training. AI developers have been so repeatedly unethical that the well-earned Bayesian prior is a high probability that you cannot trust AI developers not to cross the training/inference streams.


> Many people don't want their data used for free/any training.

That is true. But robots.txt is not designed to give them the ability to prevent this.


It is in the name: rules for the robots. Any scraping, AI or not, whether mass recursive or a single page, should abide by the rules.
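For illustration, the rules themselves are just per-user-agent directives in a plain text file. A hypothetical robots.txt that opts out of one bot entirely while restricting everyone else to the public parts of a site might look like:

    User-agent: ExampleBot
    Disallow: /

    User-agent: *
    Disallow: /private/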


I would only agree with this if we knew for sure that these on-demand human-initiated crawls didn't result in the crawled page being added to an overall index and scheduled for future automated crawls.

Otherwise it's just adding an unwilling website to a crawl index, and showing the result of the first crawl as a byproduct of that action.


> It doesn’t matter how many times this happens or what they do with the result.

That's where you lost me, as this is key to GP's point above and it takes more than a mere out-of-left-field declaration that "it doesn't matter" to settle the question of whether it matters.

I think they raised an important point about using cached data to support functions beyond the scope of simple at-request page retrieval.



