I’ve done my fair share of scraping, and I’ve learned that at large scale there are many cross-cutting, repetitive concerns: caching, fetching HTML (preferably in parallel), throttling, retries, navigation, emitting the output as a dataset…
My library, Skyscraper [0], attempts to help with these. It’s written in Clojure (based on Enlive or Reaver, both counterparts to Beautiful Soup), but the principles should be readily transferable everywhere.
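To give a flavor of the approach, here’s a minimal sketch of a Skyscraper-style scrape: processors describe how to turn one fetched page into data and/or further pages to visit, and the library handles caching, parallelism, and retries around them. The selectors, cache keys, and namespace below are illustrative and from memory, so they may not match the current API exactly:

```clojure
(ns example.scraper
  (:require [skyscraper.core :refer [defprocessor scrape]]
            [net.cgrand.enlive-html :as enlive]))

;; A processor turns one fetched page into a seq of context maps.
;; Maps with a :processor key are further pages to visit; the rest
;; become output rows. Option names here are illustrative.
(defprocessor :index
  :cache-template "example/index"          ; cache key, so reruns skip the fetch
  :process-fn (fn [res context]
                (for [a (enlive/select res [:article :a])]
                  {:title     (enlive/text a)
                   :url       (get-in a [:attrs :href])
                   :processor :article}))) ; navigation: hand each link onward

(defprocessor :article
  :cache-template "example/article/:title"
  :process-fn (fn [res context]
                [{:body (enlive/text
                         (first (enlive/select res [:div.body])))}]))

;; Kick off the scrape from a seed context; the result is a lazy seq
;; of maps, ready to emit as a dataset (CSV, EDN, a database, ...).
(comment
  (doall (scrape [{:url "https://example.com/", :processor :index}])))
```

The seed context plus per-page `:processor` keys is what makes navigation declarative: the scrape is a traversal of a tree of pages, and the cross-cutting concerns attach to the traversal rather than to each site-specific extractor.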
In developing this, what were some of the sites you used to test it, what data (and in what format) did you want to extract, and which of those sites was the most challenging?
My most extensive use of Skyscraper to date has been to produce a structured dataset of proceedings, including individual voting results, of Central European parliaments (~500K total pages scraped, ~100M entries). I’ll do a full writeup at some point.
[0]: https://github.com/nathell/skyscraper