I’ve done my fair share of scraping, and I’ve learned that at a large scale there are a lot of cross-cutting, repetitive concerns: caching, fetching HTML (preferably in parallel), throttling, retries, navigation, emitting the output as a dataset…
My library, Skyscraper [0], attempts to help with these. It’s written in Clojure (based on Enlive or Reaver, both counterparts to Beautiful Soup), but the principles should be readily transferable everywhere.
In developing this, what were some of the sites you used to test it, what data (and in what format) did you want to extract, and which of those sites was the most challenging?
My most extensive use of Skyscraper to date has been to produce a structured dataset of proceedings, including individual voting results, of Central European parliaments (~500K total pages scraped, ~100M entries). I’ll do a full writeup at some point.
Shameless request for scraping enthusiasts:
https://www.pdap.io, an open-source Police Data Accessibility project started on HN and Reddit. Our goal is to scrape and collate all county-level public records, giving us a dataset to enable "Policing the Police".
It seems like the primary call to action on your site is donating, when I don't even really know what I'm working with or looking at. I think you need a big, clear button pointing people to the data and how to get it.
This is more of a beginner's guide than a master class. This method will not extract most content on modern websites because of the way JavaScript behaves on them. It is also vertically, not horizontally, scalable. There are many other reasons this is only step one of web scraping.
It's part of a series of blog posts that talk explicitly about crawling. There are indeed other links that do a better job of explaining advanced extraction techniques.
I worked on a large web scraper for several years, and JavaScript almost never needs to be executed. The only times I've had to were to extract obfuscated links revealed by some bit-twiddling code specific to each request, and that was achievable by forking out to Deno.
I think JavaScript comes up because Cloudflare uses some kind of JavaScript challenge as part of its DDoS protection. There are Python libraries that know how to deal with it, or you can use some kind of headless browser.
https://github.com/VeNoMouS/cloudscraper
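For what it's worth, the basic usage is roughly this (a minimal sketch; the target URL is a placeholder):

    import cloudscraper

    # create_scraper() returns a requests.Session-like object that tries to
    # solve Cloudflare's JavaScript challenge before returning the response
    scraper = cloudscraper.create_scraper()
    html = scraper.get("https://example.com").text
    print(html[:200])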
This is something I find a lot of web scraping tools miss. Are there any you'd recommend that specifically deal with things like async JavaScript content loading, or loading content based on what you click on a page (e.g., in Single Page Apps)?
JavaScript content loading is easier in most cases. Just look at your browser's network inspector and grab the URL.
Usually the response is JSON and you can ignore the original page. You might have to authenticate or grab session cookies first, but that's still easier than working with the HTML.
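Roughly, the pattern looks like this (a hedged sketch; the endpoint, parameters, and field names are made up for illustration):

    import requests

    session = requests.Session()
    # Load a normal page first so the site sets any session cookies it needs
    session.get("https://example.com/listings")

    # Then hit the JSON endpoint spotted in the browser's network inspector
    resp = session.get("https://example.com/api/listings", params={"page": 1})
    resp.raise_for_status()
    for item in resp.json().get("results", []):
        print(item.get("title"))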
Thanks! If there are any third-party managed tools to do this, that would be awesome to know about (i.e., where they somehow run common JS functions/site interactions to test for additional content).
Is there a reason, other than the BeautifulSoup library, that Python is considered by many to be the ideal language for web scraping? I would think that JavaScript would be a far better choice since it could natively parse scripts on the page and libraries for querying and parsing the DOM have existed for a long time in JavaScript and are well known (to the point of being boring -- eg: jQuery).
You don't really get any benefit from writing it in javascript, other than the normal benefits you get from writing anything in javascript. (I say this having very little experience with server-side javascript, so take it with a grain of salt)
DOM emulation and selectors are pretty much equivalent between Node.js and Python; you can use CSS or XPath selectors on HTML/XML content in either. Either way you need to emulate something like a DOM, as neither language/execution environment has a "native" DOM.
You don't want to execute random JavaScript code from the web inside your scraper, and just being able to parse the scripts doesn't do you much good. So you're not getting the main advantage I think you're suggesting: being able to actually run the page's JavaScript.
Generally if you want to interact with javascript you need to do it in another process (I guess a sufficiently advanced sandbox could work too, an interpreter in your interpreter, but so far that doesn't exist). If you're already going to be running that javascript in a different process for security reasons that different process might as well just be a "remote controlled" web browser.
Historically that was done using selenium, which has good python bindings.
Nowadays it's being done more with Playwright, which started out as a Node.js library but is moving towards Python too.
Ultimately I think the reason is that there's no real advantage to using JavaScript, and Python is a nicer language with a healthier ecosystem, but your mileage may vary.
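For reference, driving a remote-controlled browser from Python is only a few lines with Playwright (a rough sketch; the URL and selector are placeholders):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # goto() waits for the page load event, after which the rendered DOM can be queried
        page.goto("https://example.com")
        titles = page.locator("h2").all_text_contents()
        print(titles)
        browser.close()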
Actually, one big advantage I see is the ability to quickly come up with the needed functions and code in the browser DevTools, then use the exact same code in a Node script.
Personally I use this method with Puppeteer for advanced pages such as single-page apps (SPAs) and other pages that depend on JavaScript, CSS, or other features of the page. Another example of an advanced page would be a site where you have to physically scroll and wait for content to load from a web service. In these cases a headless browser with JavaScript makes the most sense to me.
I've found that where it gets tricky with JavaScript is that a single missing `async`/`await` can introduce bugs that take extra time to solve.
For simple pages I do like Python and that you don't need `async/await`.
Selenium and playwright both allow you to inject javascript directly into the page, which can be nice.
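For example, with Selenium the injection is just execute_script (a minimal sketch, assuming a local Chrome/chromedriver setup; the URL and snippet are arbitrary):

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.com")
    # Run arbitrary JavaScript in the page and get the result back in Python
    links = driver.execute_script(
        "return Array.from(document.querySelectorAll('a')).map(a => a.href);"
    )
    print(links)
    driver.quit()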
I see your point though. Also when I do playwright scripting I normally use async/await, so I guess the grass is always greener ;p
In Python I find a missing async/await becomes apparent very early on and doesn't really take extra time to solve. Maybe Python just has better tracebacks?
If performance, and especially concurrency, matters, then we should include Go-based libraries in the discussion as well.
Colly [1] is a batteries-included scraping library, but it could be a bit intimidating for someone new to HTML scraping.
rod [2] is a browser automation library based on the DevTools Protocol. It adheres to the Go-ish way of doing things, so it's very intuitive, but it comes with the overhead of running a Chromium browser, even if it's headless.
BeautifulSoup is great if you don't care about performance at all, because it is painfully slooooooww.
lxml doesn't work as well with broken HTML, but it is one or two orders of magnitude faster for parsing, and the same goes for querying with XPath.
Apart from that, there is also Scrapy, which is used a lot, but it is also very slow; it is just easy to scale horizontally.
There are a lot of cases where scraping doesn't use HTML parsing at all: when you are scraping pages whose structure changes a lot, it can be better to go with full-text search, and there, the faster the better. In that area Python is far from the best, except when .split() and .join() are enough. Even re.match is slow, because the algorithm it uses is slow.
And to finish, Requests is also super slow; if you want something fast you have to use pycurl.
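For a sense of what the lxml + XPath route looks like, a small self-contained sketch:

    import lxml.html

    html = "<html><body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
    tree = lxml.html.fromstring(html)
    # Parsing and XPath evaluation happen in libxml2 (C), which is where the speed comes from
    hrefs = tree.xpath("//a/@href")
    print(hrefs)  # ['/a', '/b']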
Does Scrapy's slow speed actually matter much? Your main bottleneck is always going to be network calls and rate limiting. I don't know how much optimization can help there.
Libxml is pretty slow (lxml uses it). Selectolax is 5 times faster for simple CSS queries. It is basically a thin wrapper for a well optimized HTML parser written in C.
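A quick sketch of what the selectolax interface looks like (the markup here is just an example):

    from selectolax.parser import HTMLParser

    html = "<html><body><h1 class='title'>Hello</h1></body></html>"
    tree = HTMLParser(html)
    # css_first returns the first matching node, or None if nothing matched
    node = tree.css_first("h1.title")
    if node is not None:
        print(node.text())  # 'Hello'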
I was unaware, I always use bs4 with lxml for parsing xml just because I like the interface. For what I'm doing, the bottleneck is the remote system/network, so it doesn't really matter. But now I'm curious about which parts are slower and why. Maybe I'll run some experiments later.
Executing JavaScript and being able to render an HTML page are completely different things. To render an HTML page you need a way to create a DOM, download all the resources, and so on. Node gives you no advantage here, as you have to use another library for that anyway.
True - that's why you run scrapers using Playwright or Selenium - both of which can easily be scripted from either JavaScript or Python, while executing website code in a sandboxed browser instance.
That's when you break out PySelenium (if you want to stick with Python). Many languages work with Selenium drivers, so I don't think there's much point in debating which language is best for scraping. Probably one that supports threads; it depends on the scale, of course, and how much performance you want.
While BeautifulSoup is great, lxml + xpath really is the way to go. XPath is a W3C standard and works cross language and even in the browser.
If you need a quick way to scrape JavaScript-generated content, you can just open your browser console and use `document.evaluate` with an XPath query.
No mention of Javascript. All the pages that I would consider scraping are constructed, at least in part, client side by Javascript. If that Javascript is not executed then there is nothing interesting to scrape.
That's because javascript isn't relevant. The _only_ way the browser can interact with the server is via http requests. That's the level the scraper operates at - imitating the http requests the browser does.
In particular, it doesn't matter why the browser made those HTTP requests. It could be because the user submitted a form, or clicked a link, or JavaScript did some AJAX request, or it's done by a web worker or browser plugin, or, god help us, by a call into some ActiveX component. Provided the scraper emulates the HTTP request perfectly, there is no way the server can tell whether the request came from the component it expects or from a scraper.
It is both a benefit and a curse. It's a benefit because all the complexity of JavaScript libraries, DOMs and whatnot goes away. For example, back in the day I scraped satellite imagery from maps.google.com. Maps is a giant, horridly complex JavaScript application; you really want to avoid understanding how it does what it does. The HTTP requests it makes, on the other hand, are pretty simple.
However, Google didn't want you scraping it, so they included authentication. Authentication always boils down to taking some data they sent in a previous request, mangling it with JavaScript, then sending it back as a cookie or a hidden field in a form. You have to replicate that mangling perfectly, which involves reading and understanding the minified JavaScript. That's the curse. Such reverse engineering can take a while, but it's mercifully rare.
The payoff is speed and reduced fragility. The speed arises because most of the crap a browser downloads is only useful to human eyes, and the scraper doesn't have to download it. Fragility is reduced because GUIs, even web GUIs and especially JavaScript-laden SPAs, often want mouse clicks and keystrokes in a certain order, and while particular parts of the screen have focus. For some reason web designers love tweaking their UIs, which breaks that order. The data they send back with their forms and AJAX requests is far more stable.
I've been using Airflow to coordinate scrapers that hit a number of various sites as part of a global market awareness system I've been building over the last year or so.
I have given up on BeautifulSoup and Scrapy, since so many modern websites use obfuscated JS to hide the underlying data they are serving up. I feel like it's better to just act like a user and slowly walk through whatever site actions need to be done to get to the data you want to ingest.
Needless to say, as many have touched on in this post's comments, scraping reliably, and selectively retrying based on the many tens if not hundreds of different potential errors that can occur (either server-side / API limitations, or client-side based on how your browser behaves, i.e. shit crashing, etc.), is really almost an optimization problem of its own.
Definitely a boon to have scraping as an option, but as always, licensing of the data, especially if you want to resell it, becomes a major concern that you should be thinking about up front, even if you kinda just want to hack things together in the beginning.
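For context, the skeleton of such a setup in Airflow 2.x looks roughly like this (a sketch with made-up DAG and task names; the real fetch and parse logic is elided):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def fetch_pages():
        # Hit the target sites and persist the raw responses somewhere durable
        pass

    def parse_pages():
        # Read the saved responses and extract structured records
        pass

    with DAG(
        dag_id="market_awareness_scraper",  # hypothetical name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        fetch = PythonOperator(task_id="fetch_pages", python_callable=fetch_pages)
        parse = PythonOperator(task_id="parse_pages", python_callable=parse_pages)
        fetch >> parse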
Does Airflow support streaming the outputs to downstream tasks? I tried to do something like this with Prefect but with Prefect you have to wait for the upstream task to finish before a downstream task can begin working.
Hit me up at my email in my profile if you want to chat about this stuff, I have a lot of thoughts on this but it's probably off topic for this post and I usually am just hacking stuff together to get my systems up and running.
What you're talking about is very sensible and I was equally surprised that Airflow didn't support long running tasks but you can layer over the workflow orchestration system a kind of ad-hoc higher order system that enables what you speak of. It kind of feels ugly but can get a lot done.
There are definitely ways to accomplish what you are describing using a combination of DockerOperators plus ephemeral WebSocket servers running within containers as semi-long-running tasks, and basically just having a dumb/heavy Redis container that persists, to handle streaming between the coordination architecture and these data-flow jobs.
"Work in progress" lol!
EDIT: updating Airflow from 1.10.10 to 2.1.2 recently was a huge pain in the ass for what it's worth, good luck to all our fellow protagonists that are dealing with multi tens of thousands of task DAG setups... big ooooff
Yeah your understanding is correct but if you relax the idempotency constraint you can achieve a lot more with just a little bit extra logic in your interface layer with other internal services or potentially other mechanisms to ensure consistency. YMMV
I have been experimenting with web crawling with a lot of technologies (Python-based and others).
What most (uninitiated) developers do not realize is that web crawling is not for mere mortals.
1. We are at the mercy of the webpage authors.
HTML is a great language for encoding information, but most developers (usually webpage authors) see it as a tool for presentation only. Information can go anywhere in the document, and it is prone to change.
2. The internet society frowns on web crawling
Look into any site's T&Cs and you might come across a clause that prevents you from crawling. The specific word may not appear in the legalese, but it implies that any kind of crawling is denied. There is a good reason for this: it is mostly done to encourage fair use of the service.
3. Nobody designs services to be crawlable.
Most big-name companies do have some alternative. Facebook, for example, had "graph" search (now obsolete). It allowed end users to extract data using simple queries, like "list friends of X who live in city Y and who are not your friends". But the "graph" feature came a long time after Facebook's launch, not at the beginning.
Usually, in the beginning stages of any service, we are at the mercy of #1 and #2.
For #1, no one ever designs pages to have information always at a standard location. It changes.
4. The tech isn't ripe yet.
This is my personal view. I have been experimenting with Puppeteer and Selenium in a corporate environment, and I wasn't that happy with the "net" developer experience. I found things like taking a screenshot or generating a PDF buggy. For example, to get the latter I had to run my browser in non-headless mode; in headless mode my laptop's system policy disabled some extensions important for the webpage to load correctly.
Yep, I wanted to crawl a job search website so that I could search for jobs at work, without going to the job site (don't blame me, my job back then sucked). It was impossible to find information because all the tags were generated in some sort of framework that obfuscated everything.
Yeah architecturally Chromium Headless is an "embedder" so it doesn't automatically get all the front-end goodies full-fledged Chrome supports unless someone puts in the work to plumb through the code.
I really wouldn't recommend building a web scraper from scratch. You'll soon have to think about caching/rate-limiting/retries.
Personally, I use Scrapy and it works fine. As a best practice, I wouldn't use the Pipeline concept Scrapy provides: don't do data transformation inside Scrapy. Simply save the responses and perform the validation and transformations outside of Scrapy. The Pipeline concept is flawed because you cannot create DAGs with it, only serially linked pipelines.
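A minimal spider along those lines might look like this (URLs and selectors are hypothetical; it just dumps raw bodies for later processing):

    import scrapy

    class ListingsSpider(scrapy.Spider):
        name = "listings"
        start_urls = ["https://example.com/listings"]

        def parse(self, response):
            # Keep extraction to a minimum here: yield the raw page and do the
            # real validation/transformation in a separate step outside Scrapy
            yield {"url": response.url, "body": response.text}

            # Follow pagination links
            for href in response.css("a.next::attr(href)").getall():
                yield response.follow(href, callback=self.parse)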
Are there any good alternatives to Puppeteer/Playwright for other languages besides JavaScript? The full browser "emulation" is necessary for most sites nowadays.
To be fair, I've only used Puppeteer so far and I assumed that Playwright was mostly the same thing. Python support for Puppeteer was very buggy. Thanks for the pointer!
I’ve found imaging the page and doing OCR on the image is quite good for text extraction. Many pages on the Internet render with JavaScript, which means BS may not see the text in the DOM.
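A rough version of that pipeline, assuming Playwright for rendering and Tesseract (via pytesseract) for the OCR step:

    import pytesseract
    from PIL import Image
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")
        # Render the page, including JavaScript-generated content, to an image
        page.screenshot(path="page.png", full_page=True)
        browser.close()

    # Requires the Tesseract binary to be installed on the system
    text = pytesseract.image_to_string(Image.open("page.png"))
    print(text)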
I learned scraping with python and beautiful soup. My biggest challenge was that on certain sites, the html I would get from requests was different than what would be seen in chrome.
I tried using selenium to get around this but was never successful. The issue has really handicapped my ability to scrape.
User-agent spoofing. Also, Chrome adjusts/fixes some HTML, so sometimes copying CSS or XPath selectors directly will not work and requires modification. It's good to work in a Jupyter notebook locally to test and optimize scrapers.
The article touches on that a bit in the "Avoid being blocked" section. Sometimes the user agent isn't enough; I've run into other headers that can trigger a block or a change in behavior. The last one I ran into was gating on Accept-Language, not to control the language, but to serve up a honeypot-type page to automated crawlers.
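In practice that means sending a plausible set of browser headers, not just the User-Agent (a hedged sketch; the header values are illustrative and should be copied from a real browser session):

    import requests

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    resp = requests.get("https://example.com", headers=headers)
    print(resp.status_code)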
Kotlin + Jsoup is a very solid scraping combo. Type safety and non-nullability are nice, and there are lots of HTTP clients to choose from. The only downside is the lack of proper XPath; Jsoup's selector syntax is similar but not identical.
Hmmm, I would use Docker Selenium and have Python connect to the container with a remote web driver. You can make pretty robust scrapers that way. I didn't know people still used Beautiful Soup.
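That setup is roughly this (a sketch assuming a standalone Selenium container listening on port 4444):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Point the client at the Selenium container instead of a local browser
    driver = webdriver.Remote(
        command_executor="http://localhost:4444/wd/hub",
        options=options,
    )
    driver.get("https://example.com")
    print(driver.title)
    driver.quit()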
[0]: https://github.com/nathell/skyscraper