So, here's a story I heard recently.

The person involved wanted to create a local archive of records. An index of the material could be obtained, but rapid sequential requests triggered an IP block that prevented further access.

Modestly restructuring the requests, issuing them in random order with a significant base delay (several minutes) plus random jitter between requests, eventually succeeded in retrieving the material.
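A rough sketch of that approach in Python, with the archive URL and record IDs as placeholders rather than anything from the actual story:

    # Hypothetical sketch: shuffle the index, then fetch each record with a
    # long randomized pause between requests. BASE_URL and record_ids are
    # placeholders, not details from the story above.
    import random
    import time
    import urllib.request

    BASE_URL = "https://example.org/records/"   # placeholder archive
    record_ids = list(range(1, 1001))           # placeholder index of material

    random.shuffle(record_ids)                  # random order, not sequential

    for rid in record_ids:
        with urllib.request.urlopen(BASE_URL + str(rid)) as resp:
            data = resp.read()
        with open(f"record_{rid}.html", "wb") as f:
            f.write(data)
        # several-minute base delay plus random jitter between requests
        time.sleep(180 + random.uniform(0, 120))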

If that had failed, a distributed set of requests could have been attempted.

When I've faced traffic heavy enough to degrade service, I've found tools that let me aggregate requests by shared attributes, including requests coming from a defined network space (CIDR or ASN), quite useful. Reading such patterns just from eyeball scans of logs is pretty bloody difficult, and tools to assist in this are ... poorly developed.
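For the CIDR side of that, a crude sketch in Python (assuming a common-log-format access log at a made-up path; ASN grouping would need an external IP-to-ASN database and is left out) could be as simple as counting hits per /24:

    # Count requests per /24 (or /64 for IPv6) rather than per individual IP.
    import ipaddress
    from collections import Counter

    counts = Counter()
    with open("access.log") as log:             # assumed log location/format
        for line in log:
            ip_str = line.split(" ", 1)[0]      # first field in common log format
            try:
                ip = ipaddress.ip_address(ip_str)
            except ValueError:
                continue                        # skip malformed lines
            # collapse each address into its enclosing network
            prefix = 24 if ip.version == 4 else 64
            net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
            counts[str(net)] += 1

    for net, n in counts.most_common(20):
        print(f"{n:8d}  {net}")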




>Reading such patterns just from eyeball scans of logs is pretty bloody difficult, and tools to assist in this are ... poorly developed.

There is enterprise software out there designed for use cases like this, but it's typically very expensive. There are also other issues, like the storage requirements of fully logging request headers and bodies if you really want to see the big picture.

Simple IP rate limiting will stop the majority of would-be scrapers/scanners in their tracks, though, especially if there's so much material that it could take days or weeks to finish a scrape once you're forced to add a random delay of three or more minutes per request.
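At three minutes per request that's roughly 480 requests a day, so even a 10,000-page archive stretches to about three weeks. A minimal sliding-window sketch of per-IP limiting, with the thresholds and the way the client IP is obtained purely as assumptions (real deployments usually do this at the proxy layer, e.g. nginx's limit_req), might look like:

    # Sliding-window rate limiter keyed by client IP. Thresholds are made up.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 30                       # allowed requests per IP per window
    _hits = defaultdict(deque)              # ip -> timestamps of recent requests

    def allow_request(client_ip: str) -> bool:
        now = time.monotonic()
        window = _hits[client_ip]
        # drop timestamps that have aged out of the window
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_REQUESTS:
            return False                    # over the limit: reject or delay
        window.append(now)
        return True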



