Has anyone bothered to properly quantify the worst-case load (i.e., requests per second) that these scraping tools actually generate? I recall a post on HN a few weeks/months ago about something similar, but it was very light on figures.
It seems to me that roughly half of the discourse around AI providers involves the idea that a machine reading webpages on a regular schedule is tantamount to a DDoS attack. The other half concerns IP and capitalism - which seem like far stronger arguments.
If someone requesting your site map once per day is crippling operations, the simplest fix is to make the service not run like shit. There is a point where your web server becomes so fast that you stop caring about locking everyone into a draconian content prison. If you can serve an average page in 200µs and your competition takes 200ms, you have roughly 1000x their capacity to absorb an aggressive scraper (or an actual DDoS attack) in terms of CPU time.
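A quick back-of-envelope sketch of that capacity math, assuming the hypothetical 200µs/200ms figures above and a single core handling requests serially:

    # Back-of-envelope throughput comparison (Python). The latency figures
    # are the hypothetical ones from above, not measurements of any real site.
    fast_latency = 200e-6  # 200 µs per page
    slow_latency = 200e-3  # 200 ms per page

    fast_rps = 1 / fast_latency  # ~5,000 requests/sec per core
    slow_rps = 1 / slow_latency  # ~5 requests/sec per core

    print(f"fast server: {fast_rps:,.0f} req/s per core")
    print(f"slow server: {slow_rps:,.0f} req/s per core")
    print(f"headroom:    {fast_rps / slow_rps:,.0f}x")

Same scraper traffic, three orders of magnitude more CPU headroom before either box falls over.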
I mean, it did happen. Don't you remember back in March, when SourceHut had outages because scrapers were spamming their most expensive endpoints?
Don't you remember the reason Anubis even came to be?
It really wasn't that long ago, so I find all of the snarky comments going "erm, actually, I've yet to see any good actors get harmed by scraping ever, we're just reclaiming power from today's modern ad-ridden hellscape" pretty dishonest.