
I’ve done a fair share of scraping, and I learned that on a large scale, there are a lot of cross-cutting repetitive concerns. Things like caching, fetching HTML (preferably in parallel), throttling, retries, navigation, emitting the output as a dataset…

My library, Skyscraper [0], attempts to help with these. It’s written in Clojure (based on Enlive or Reaver, both counterparts to Beautiful Soup), but the principles should be readily transferable everywhere.

[0]: https://github.com/nathell/skyscraper
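To make the cross-cutting concerns above concrete, here is a rough Python sketch of how caching, throttling, retries, and parallel fetching might be layered around a plain fetch function. This is not Skyscraper's API or design. Every name in it (`Scraper`, `fetch`, `min_interval`, `max_attempts`, `backoff`, `get_many`) is hypothetical and chosen only for illustration, and the cache is a simple in-memory dict.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Scraper:
    """Hypothetical wrapper adding caching, throttling and retries to `fetch`.

    `fetch` is any callable taking a URL and returning the page body;
    it is injected so the wrapper stays independent of the HTTP library.
    """

    def __init__(self, fetch, min_interval=0.0, max_attempts=3, backoff=0.0):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.fetch = fetch
        self.min_interval = min_interval  # minimum seconds between requests
        self.max_attempts = max_attempts  # total tries per URL
        self.backoff = backoff            # base delay, doubled after each failure
        self.cache = {}
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def _throttle(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so that other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.min_interval
        if slot > now:
            time.sleep(slot - now)

    def get(self, url):
        # Serve from the cache when possible.
        with self._lock:
            if url in self.cache:
                return self.cache[url]
        last_error = None
        for attempt in range(self.max_attempts):
            self._throttle()
            try:
                body = self.fetch(url)
            except Exception as exc:  # a real scraper would catch narrower errors
                last_error = exc
                time.sleep(self.backoff * 2 ** attempt)
                continue
            with self._lock:
                self.cache[url] = body
            return body
        raise last_error

    def get_many(self, urls, workers=4):
        # Fetch in parallel; results come back in the same order as `urls`.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(self.get, urls))
```

In use, you would pass in whatever function actually downloads a page, then call `get_many` with the list of URLs. Two threads asking for the same uncached URL at the same moment may both fetch it; the sketch accepts that for the sake of brevity.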




In developing this, what were some of the sites you used to test it, what data and output format were you after, and which of those sites was the most challenging?


Thanks for the interest!

My most extensive use of Skyscraper to date has been to produce a structured dataset of proceedings, including individual voting results, of Central European parliaments (~500K total pages scraped, ~100M entries). I’ll do a full writeup at some point.



