Sometimes it's also helpful to use Beautiful Soup to isolate the elements you want, feed the text of those elements into StringIO, and hand that to read_html.
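A minimal sketch of that combination; the URL and table selector are made up:

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/stats", timeout=30).text
soup = BeautifulSoup(html, "lxml")
table = soup.select_one("table#results")            # isolate just the element you want
df = pd.read_html(StringIO(str(table)))[0]          # read_html returns a list of DataFrames
print(df.head())
```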
I've been involved in many web scraping jobs over the past 25 years or so. The most recent one, which was a long time ago at this point, used scrapy. I went with XML tools for navigating the DOM.
It's worked unbelievably well. It's been running for roughly 5 years at this point. I send a command at a random time between 11pm and 4am to wake up an ec2 instance. It checks its tags to see if it should execute the script. If so, it does so. When it's done with its scraping for the day, it turns itself off.
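Roughly what the on-instance side could look like with boto3. This is a sketch, not the original script: the "run-scraper" tag, scrape.py, IMDSv1 metadata access, and an instance role allowing ec2:DescribeTags / ec2:StopInstances are all assumptions.

```python
import subprocess

import boto3
import requests

# Ask the instance metadata service who we are (IMDSv1 style).
instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).text

ec2 = boto3.client("ec2")
tags = {t["Key"]: t["Value"]
        for t in ec2.describe_tags(
            Filters=[{"Name": "resource-id", "Values": [instance_id]}]
        )["Tags"]}

if tags.get("run-scraper") == "true":                  # should this box do the work today?
    subprocess.run(["python", "scrape.py"], check=True)

ec2.stop_instances(InstanceIds=[instance_id])          # turn itself off when done
```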
This is a tiny snapshot of why it's been so difficult for me to go from python2 to python3. I'm strongly in the camp of "if it ain't broke, don't fix it".
I certainly can keep using it. There have been so many efforts to get people to update Python 2 code to Python 3 code that it's on my backlog to do it. Will I get to it this year? Probably not.
Why use autoscaling and not just launch the instance directly from Lambda? The run time is short, so there's no danger of two instances running in parallel.
the tl;dr for all web scraping is to just use scrapy (and scrapyd) - otherwise you end up just writing a poorer implementation of what has already been built
My only recent change is that we no longer use Items and ItemLoaders from scrapy - we've replaced it with a custom pipeline of Pydantic schemas and objects
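Not their actual code, just the general shape of that idea: a Scrapy pipeline that validates scraped dicts against a hypothetical Pydantic schema, enabled via ITEM_PIPELINES in settings.py like any other pipeline.

```python
from pydantic import BaseModel, ValidationError
from scrapy.exceptions import DropItem

class Product(BaseModel):        # hypothetical schema
    name: str
    price: float
    url: str

class ValidateItemPipeline:
    def process_item(self, item, spider):
        try:
            return Product(**dict(item)).dict()    # .dict() is pydantic v1 style
        except ValidationError as err:
            raise DropItem(f"invalid item: {err}")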
One tip I would pass on when trying to scrape data from a website: start by using wget in mirror mode to download the useful pages. It's much faster to iterate on scraping the data once you have it locally. You're also less likely to accidentally kill the site or attract the attention of the host.
I did web scraping professionally for two years, on the order of 10M pages per day. The performance with a browser is abysmal and requires tonnes of memory, so it's not financially viable. We used browsers for some jobs, but rendered content isn't really a problem: you can simulate the API calls (common) and read the JSON, or regex the script and try to do something with that.
I'd say 99% of the time you can get by without a browser.
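A minimal sketch of the "regex the script" approach: many pages embed their state as a JSON blob in a script tag. The URL and variable name here are hypothetical, and the pattern is deliberately simple (and brittle).

```python
import json
import re

import requests

html = requests.get("https://example.com/listing", timeout=30).text
match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", html, re.S)
if match:
    data = json.loads(match.group(1))
    print(sorted(data.keys()))
```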
Rendering the page in Puppeteer / Selenium and then scraping it from there sounds a lot easier than somehow trying to replicate that in your scraper?
If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do, since it's applying a security feature (signatures) in a way that prevents it from providing any security.
If they're generated server-side like you would expect, and sent to the client, you'd get them the same way you get anything else, by asking for them.
I'm not sure what your point is. Of course you can replicate every request in your scraper / with curl if you want to, provided you know all the input variables.
Doing that for web scraping, where everything is changing all the time and you have more than one target website, is just not feasible if you have to reverse engineer some custom JS for every site. Using some kind of headless browser for modern websites will be way easier and more reliable.
As someone who has done a good bit of scraping, how a website is designed dictates how I scrape.
If it's a static website that has consistently structured HTML and is easy to enumerate through all the webpages I'm looking for, then simple python requests code will work.
The less clear case is when to use a headless browser vs reverse engineering JS/server-side APIs. Typically, I will do a 10-minute dive into the client-side JS and monitor AJAX requests to see if it would be super easy to hit some API that returns JSON to get my data. If reverse engineering seems too hairy, then I will just use a headless browser.
I have a really strong preference for hitting JSON APIs directly because, well, you get JSON! Also, you usually get more data than you even knew existed (see the sketch below).
Then again, if I was creating a spider to recursively crawl a non-static website, then I think Headless is the path of least resistance. But usually, I'm trying to get data in the HTML, and not the whole document.
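For reference, the "hit the JSON API directly" path can be as small as this; the endpoint and parameters are invented for illustration.

```python
import requests

resp = requests.get(
    "https://example.com/api/v2/search",        # hypothetical endpoint found via the network tab
    params={"q": "widgets", "page": 1},
    headers={"User-Agent": "Mozilla/5.0"},      # some endpoints check this
    timeout=30,
)
resp.raise_for_status()
for row in resp.json().get("results", []):
    print(row.get("id"), row.get("title"))
```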
>If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do
what??
Page loads -> Javascript sends request to backend -> it returns data -> javascript does stuff with it and renders it.
My last contract job was to build a 100% perfect website mirroring program for a group of lawyers who were interested in building class action lawsuits against some of the more heinous scammers out there.
I ended up building like 8 versions of it, literally using every PHP and Python library and resource I could find.
I tried httrack, php-ultimate-web-scraper (from github), headless chromium, headless selenium, and a few others.
By far the biggest problem was dealing with JS links... you wouldn't think from the start that it would be such a big deal, and yet... it was.
Selenium with Python turned out to be the winning combination, and of course, it was the last one I tried. Also, this is an ideal project to implement recursion, although you have to be careful about exit conditions.
One thing that was VERY important for performance was not visiting any page more than once because, obviously, certain links in headers and footers are duplicated sometimes 100s of times.
JS links often made it very difficult to discover the linked page, and certain library calls that were supposed to get this info for you often didn't work.
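The dedup-and-recurse part can be sketched roughly like this (not the actual project code; it ignores JS-only links entirely, which were the hard part, assumes chromedriver is installed, and uses example.com as a stand-in for the target site):

```python
from urllib.parse import urldefrag

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
visited = set()

def crawl(url, depth=0, max_depth=5):                # depth cap = explicit exit condition
    url = urldefrag(url)[0]                          # drop #fragments so duplicates collapse
    if url in visited or depth > max_depth:
        return
    visited.add(url)
    driver.get(url)
    # ... save driver.page_source to disk here ...
    # Collect hrefs before recursing, otherwise the elements go stale
    # as soon as the driver navigates away.
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "a[href]")]
    for href in links:
        if href and href.startswith("https://example.com"):   # stay on-site
            crawl(href, depth + 1)

crawl("https://example.com/")
driver.quit()
```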
It was a super fun project, and in the end considering I only worked for 2 months, I shipped some decent code that was getting like 98.6% of the pages perfectly.
The final presentation was interesting... for some reason my client had gotten it into his head that I wasn't a very good programmer or something, and we ran through his list of sample sites expecting my program to error out or incorrectly mirror them. But it handled all 10 of the sites just about perfectly, and he was rather flabbergasted: he told me it would have taken him a week of hand-clicking through a site to mirror it, but instead the program did them all in under an hour.
I had to solve nearly the exact same problem for the same reasons. I too ended up with Selenium.
My favorite part was having a nice working system, then throwing it in the cloud and finding out a shocking number of sites tell you to go away if you come at them from a cloud-based IP.
Shouldn't be surprising, but it was still annoying.
There are a number of so-called “residential VPN” services with clients that also serve as the firm’s p2p VPN / proxy edge. Some can be subscribed to commercially to resolve precisely the above issue.
Preferably, only give money to one that tells their users this is how it works.
True. We went down that path for a while, but for our purposes, I was never happy with the ones I could find, since they were super vague about how they got the IPs in the first place. Some of it felt like protecting commercial secrets, which is fine, but some of it felt like protecting questionable business practices.
For our purposes, and the websites we needed to track, a traditional VPN was good enough.
Ha! I am currently building something very similar at work, and JS links are driving me up a wall. It's interesting because you would think it's super simple, but I still haven't found a good solution. Luckily my boss is understanding.
I think this article does an OK job covering how to scrape websites rendered serverside, but I strongly discourage people from scraping SPAs using a headless browser unless they absolutely have to. The article's author touches on this briefly, but you're far better off using the network tab in your browser's debug tools to see what AJAX requests are being made and figuring out how those APIs work. This approach results in far less server load for the target website as you don't need to request a bunch of other resources, reduces the overall bandwidth costs, and greatly speeds up the runtime of your script since you don't need to spend time running javascript in the headless browser. That can be especially slow if your script has to click/interact with elements on the page to get the results you need.
Other than that, I'd strongly caution anyone looking into making parallel requests. Always keep in mind the sysadmins and engineers behind the site you are targeting. It can be tempting to value your own time by making a ton of parallel requests to reduce the overall runtime of your script, but you can potentially cause massive server load for the site you're targeting. If that isn't enough to give you pause, keep in mind that the site owner is more likely to make the site hostile to scrapers if there are too many bad actors hitting it heavily.
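A polite baseline, with made-up values: one shared session and a fixed delay rather than a parallel fan-out.

```python
import time

import requests

session = requests.Session()
session.headers["User-Agent"] = "my-scraper/0.1 (contact: you@example.com)"   # hypothetical

urls = [f"https://example.com/page/{i}" for i in range(1, 51)]
for url in urls:
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    # ... parse resp.text here ...
    time.sleep(1.0)          # ~1 request/second is negligible load for most sites
```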
There was (is?) a DARPA project called "Memex" that was built to crawl the hidden web and has many tools, like crawling with authentication, automatic registration, machine learning to detect search forms, auto-detecting pagination, etc. https://github.com/darpa-i2o/memex-program-index
I don't! As far as I know, scraping data behind a login is illegal in the United States. You can look into the Facebook v. Power Ventures case for the background on that. This page https://www.rcfp.org/scraping-not-violation-cfaa/ seems to have a decent overview of scraping laws in general. It's definitely a legal gray area so I'd suggest doing your research! This doesn't constitute legal advice and all that; I'm not a lawyer, just a guy who does some scraping here and there :)
It’s fun to combine jupyter notebooks and py scraping. If you are working 15 pages/screens deep, you can “stay at the coal face” and not have to rerun the whole script after making a change to the latest step.
`ipython -i <script>` also works similarly for debugging, by having the powerful interpreter open after running the script, without the jupyter overhead.
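Either way, the "fetch once, parse many times" workflow mostly comes down to caching pages on disk so re-running the parsing step costs nothing. A tiny sketch (cache directory and naming scheme are arbitrary):

```python
import hashlib
import pathlib

import requests

CACHE = pathlib.Path("cache")
CACHE.mkdir(exist_ok=True)

def get(url):
    """Return the page for `url`, fetching it only the first time."""
    path = CACHE / (hashlib.sha1(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text()
    html = requests.get(url, timeout=30).text
    path.write_text(html)
    return html
```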
I wanted to do some larger distributed scraping jobs recently, and although it was easy to get everything running on one machine (with different tools, including Scrapy), I was surprised how hard it was to do at scale. The open source options I could find were hard or impossible to get working, overly complex, badly documented, etc.
The services I found are reasonably priced for small jobs, but at scale they quickly become vastly more expensive than setting this up yourself. Especially when you need to run these jobs every month or so. Even if you have to write some code to make the open source solutions actually work.
It gets more complicated when you need to leverage real browser engines (e.g. Chrome). I've got jobs spread across ~20 machines / 140 concurrent browser instances; it's non-trivial.
One thing I notice with all blog articles, and HN comments, on scraping is that they always omit the actual use case, i.e., the specific website that someone is trying to scrape. Any examples tend to be so trivial as to be practically meaningless. They do not prove anything.
If authors did name websites they wanted to scrape, or show tests on actual websites, then we might see others come forward with different solutions. Some of them might beat the ones being put forward by the pre-packaged software libraries/frameworks and commercial scraping services built on them, e.g., less brittle, faster, less code, easier to repair.
I do not have any particular use cases that I cannot get my own solutions to work with. Thus I have no motivation to try these Python solutions. That is why I would be curious to have an example use case that someone thought could only be handled by some Python framework.
In my career I found several reasons not to use regular expressions for parsing an HTML response, but the largest was the fact that it may work for 'properly formed' documents, yet you would be surprised how lax all browsers are about requiring the document to be well-formed. Your regex, unless it specifically handles them, will not be able to cope with sites like this (and there are a lot, at least from my career experience). You may be able to work 'edge cases' into your regex, but good luck finding anyone but the expression's author who fully understands it and can confidently change it as time goes on. It is also a PITA to debug when groupings etc. aren't working (and there will be a LOT of these cases with HTML/XML documents).
It is honestly almost never worth it unless you have constraints on what packages you can use and you MUST use regular expressions. Just do your future-self a favor and use BeautifulSoup or some other package designed to parse the tree-like structure of these documents.
One way it can be used appropriately is just finding a pattern in the document- without caring where it is w.r.t. the rest of the document. But even then, do you really want to match: <!-- <div> --> ?
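A tiny illustration of that pitfall, naive regex vs. a real parser:

```python
import re

from bs4 import BeautifulSoup

html = "<!-- <div class='price'>$1</div> --> <div class='price'>$2</div>"

print(re.findall(r"<div class='price'>(.*?)</div>", html))      # ['$1', '$2'] -- oops
soup = BeautifulSoup(html, "html.parser")
print([d.text for d in soup.find_all("div", class_="price")])   # ['$2']
```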
For all the things jQuery got wrong, it got one thing right: arguably the most intuitive way to target a set of data in a document is by having a concise DSL that works on a parsed representation of the document.
I'd love to see more innovation/developer-UX research on the interactions between regexes, document parse trees, and NLP. For instance, "match every verb phrase where the verb has similar meaning to 'call' within the context of a specific CSS selector, and be able to capture any data along that path in capturing groups, and do something with it" right now takes significant amounts of coding.
https://spacy.io/usage/rule-based-matching does a lot, but (a) it's not particularly concise, (b) there's not a standardized syntax for e.g. replacement strings once you detect something, and (c) there's no real facilities to bake in a knowledge of hierarchy within a larger markup-language document.
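For the token-pattern half of that wish list, a hedged spaCy sketch (v3 API, en_core_web_sm installed). It approximates "similar meaning to 'call'" with a lemma list rather than real semantic similarity, and it knows nothing about markup hierarchy, which is exactly the missing piece.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("CALL_PHRASE", [[
    {"LEMMA": {"IN": ["call", "phone", "contact"]}},
    {"POS": {"IN": ["DET", "PRON", "ADJ", "NOUN", "PROPN"]}, "OP": "+"},
]], greedy="LONGEST")

doc = nlp("Please call our support line or contact the sales team.")
for _, start, end in matcher(doc):
    print(doc[start:end].text)
```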
I think scraping is just inherently brittle whether you go by the DOM traversal or by regex. AI may have the best potential. Regex can be slightly more brittle like you point out with commented html or myriad other problems, but it can also be less brittle than DOM if you craft more lenient patterns. The main problem I found was regex's not being performant due to recursiveness and stack overflows (Google's RE2 lib addresses this). My favorite performance trick is to use negated character classes rather than a dot, /<foo[^>]*>/
I have been developing scrapers and crawlers and writing[1] about them for many years and have used many Python-based libs so far, including Selenium. I have written such scrapers for individuals and startups for several purposes. The biggest issue I faced was rendering of dynamic sites and blocking of IPs due to the absence of proxies, which are not cheap at all, especially for individuals.
Services like Scrapingbee and ScraperAPI are serving quite good for such problems. I personally liked ScraperAPI for rendering dynamic websites due to the better response time.
Shameless plug: in case anyone is interested, a long time ago I wrote about it on my blog, which you can read here[2]. Now you do not need to set up a remote Chrome instance or anything. All that is required is to hit an API endpoint to fetch content from dynamic, JS-rendered websites.
Been scraping for a long time. If handling JS isn't a requirement, XPath is 100% the way to go. It's a standard query language, very powerful, and there are great browser extensions for helping you write queries.
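A quick lxml + XPath sketch; the URL and queries are illustrative:

```python
import requests
from lxml import html

tree = html.fromstring(requests.get("https://example.com/products", timeout=30).content)
names = tree.xpath("//div[@class='product']//h2/text()")
prices = tree.xpath("//div[@class='product']//span[@class='price']/text()")
for name, price in zip(names, prices):
    print(name.strip(), price.strip())
```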
Fetching HTML and then parsing it and navigating the parsed result (or using regexps) is what used to work 20 years ago.
These days, with all these reactive javascript frameworks you better skip to item number 5: headless browsing.
Also mind that Facebook, Instagram, ... will have anti-scraping measures in place. It's a race ;)
I've found that a lot of the time that's not needed. Often you'll find the data as a JSON blob in the page and can just read it directly from there. Or find that there's an API endpoint that the javascript reads.
This. Even relatively simple websites are much harder to parse today. I did a minor side project for a customer scraping some info and anti-scraping measures were in full force. It feels like an all out war.
Usually starts from simple to difficult. User agent stuff, IP address detection, aggressive rate limiting, captcha checking, browser fingerprinting, etc.
It can be overcome and, admittedly, I am new to this, so for me that means way more time spent trying to make it work. The odd one that I got stuck on for a while was a list where each individual record held the pertinent details, but the list had, seemingly at random, items that looked like records on the surface but were not (so those had to be identified and ignored). Small things like that.
Still, I would love to learn more about your approach if you would be willing to share.
That said, there are quite a few services which battle these systems for you nowadays (such as scraperapi - not affiliated, not a user). They are not always successful, but they have an advantage of maaany residential proxies (no doubt totally ethically obtained /s, but that's another story).
Maybe it will turn out to be that way, but this is far from reality at the moment. There are not many sites that cannot be scraped statically and there definitely are very few sites/apps that are webasm.
It'll change, but who knows how much. At least currently, most scraping professionals are not even using headless browsers as their targets are statically rendered.
I'm not 100% sure what you mean by a 'webasm' site (web assembly powered?), but the article describes scraping via headless browsers, which actually render the page and allow you to select elements that are client-rendered.
I have been web scraping for almost 4 years now. That is my entire niche.
The problem with web scraping is that you really don't know where the ethical point of scraping ends. These days I will reverse engineer a website to minimize the request load and only target specific API endpoints. But then again, I am breaching some security measures they have while doing that.
Is there a SOTA library for common web scraping issues at scale (especially distributed over a cluster of nodes) for captcha detection, IP rotation, rate throttling, queue management, etc.?
There is no "state of the art library" to build your own google. But "Rate throttling/limiting" can be done with Redis, rotating ip is still rate-limiting with Redis, Captcha Detection - You have to pay $$ I think.
Personally I have not needed BeautifulSoup a single time when web scraping. People say it is better for unclean HTML, which I cannot confirm, because I never needed it and was always able to get my result using lxml + etree with XPath and CSS selectors. Once I also used Scrapy, but still not BeautifulSoup. I am glad there is a guide that starts with lxml, instead of immediately jumping to BeautifulSoup.
lxml is terrific. I agree, never really understood what beautifulsoup added that lxml couldn't just handle on its own. On the JVM, jsoup is excellent but doesn't support XPath.
We created a fun side project to grab the index page of every domain - we downloaded a list of approx 200m domains. However, we ran into problems when our provider complained. It was something to do with the DNS side of things and we were told to run our own DNS server. If there is anyone on here with experience of crawling across this number of domain names it would be great to talk!
The biggest struggle I had while building web scrapers was scaling Selenium. If you need to launch Selenium hundreds of thousands of times per month, you need a lot of compute power, which is really expensive on EC2.
A couple years ago, I discovered browserless.io which does this job for you and it's amazing. I really don't know how they made this but it just scales without any limit.
I recently undertook my first scraping project, and after trying a number of things landed upon Scrapy.
It’s been a blessing. Not only can it handle difficult sites, but it’s super quick to write another spider for the easy sites that provide the JSON blob in a handy single API call.
Only problem I had was getting around cloudflare, tried a few things like puppeteer but no luck.
For data extraction I highly recommend weboob. Despite the unfortunate name, it does some really cool stuff. Writing modules is quite straightforward and the structure they've chosen makes a lot of sense.
I do wish there was a Go version of it, mostly because I much prefer working with Go, but also because single binary is extremely useful.
I really appreciate the tips in the comments here.
As a beginner it makes a lot of sense to iterate on a local copy with jupyter rather than fetching resources over and over until you get it right. I wish more tutorials focused on this workflow.
I've always had pretty bad experiences with web scraping; it's such a pain in the ass and frequently breaks. I'm not sure if I'm doing it wrong or if that's how it's supposed to be.
It can definitely depend on what you're scraping, but in the last few years or so the only project I had trouble with was one where they changed the units for the unpublished API (the real UI made two requests which mattered, one to grab the units, and I missed that in my initial inspection -- it bit me awhile later when they changed the default behavior for both locations).
A few tips:
As much as possible, try to find the original source for the data. E.g., are there any hidden APIs, or is the data maybe just sitting around in a script being used to populate the HTML? Selenium is great when you need it, but in my experience UI details change much more frequently than the raw data.
When choosing data selectors you'll get a feel for those which might not be robust. E.g., the nth item in a list is prone to breakage as minor UI tweaks are made.
If robustness is important, consider selecting the same data multiple ways and validating your assumptions about the page. E.g., you might want the data with a particular ID, combination of classes, preceding title, or which is the only text element formatted like a version number. When all of those methods agree you're much more likely to have found the right thing, and if they don't then you still have options for graceful degradation; use a majority vote to guess at a value, use the last known value, record N/A or some indication that we're not sure right now, etc. Critically though, your monitoring can instantly report that something is amiss so that you can inspect the problem in more detail while the service still operates in a hopefully acceptable degraded state.
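A sketch of that cross-checking idea; every selector here is hypothetical.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup

def extract_version(html):
    soup = BeautifulSoup(html, "lxml")
    candidates = []
    if (el := soup.select_one("#current-version")):           # by id
        candidates.append(el.text.strip())
    if (el := soup.select_one("span.version.release")):       # by class combination
        candidates.append(el.text.strip())
    candidates += [s for s in soup.stripped_strings            # by format
                   if re.fullmatch(r"\d+\.\d+\.\d+", s)]
    if not candidates:
        return None            # caller falls back to the last known value / reports N/A
    value, votes = Counter(candidates).most_common(1)[0]
    return value if votes > 1 else candidates[0]
```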
It’s heavily dependent on the site you’re scraping. If they put in active counter measures, have a complex structure, or update their templates frequently, it’s going to be an uphill battle.
Most SPA pages will honour direct uri requests and route you properly in javascript. You just need to have your scraping pipeline use phantomJS or selenium to wait until the page stops loading then scrape the html.
Although it might just be easier to scrape their API endpoints directly instead of mucking with HTML if it's a dynamic page. The data is structured that way, and easier to query.
Usually the approach is to use a headless browser. The headless browser instance runs purely in memory without a GUI then renders the website you're interested in. Then, it comes down to regular DOM parsing. A common library that I enjoy is Selenium with Python.
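A hedged sketch of that flow: headless Chrome, wait for a client-rendered element, then hand the HTML off to your usual parser. The URL and selector are made up.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

opts = Options()
opts.add_argument("--headless")                  # no GUI, runs purely in memory
driver = webdriver.Chrome(options=opts)

driver.get("https://example.com/app#/listings")
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".listing"))
)
html = driver.page_source                        # now parse with lxml / BeautifulSoup
driver.quit()
```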
Aside from the Beautiful Soup library, is there something about Python that makes it a better choice for web scraping than languages such as Java, JavaScript, Go, Perl or even C#?
Don't know about large scales, but just today I threw together a script using selenium, imported pandas to mangle the scraped data and quickly exported to json. For quick and dirty, possibly one-off jobs like that, Python is a great choice.
I find javascript (node) to be best suited to web scraping personally. Using the same language to scrape/process as you use to develop those interfaces seems most natural.
Especially with stuff like Puppeteer which allows you to execute JS in context of the browser (which admittedly can lead to weird bugs as the functions are serialized and lose context)
I'm using this exact strategy to scrape content directly from DOM using APIs like document.querySelectorAll. You can use the same code in both headless browser clients like Puppeteer or Playwright and DOM clients like cheerio or jsdom (assuming you have a wrapper over document API). Depending on the way a web page was fetched (opened in a browser tab or fetched via nodejs http/https requests), ExtractHtmlContentPlugin, ExtractUrlsPlugin use different DOM wrappers (native, cheerio, jsdom) to scrape the content.
I'd say that really depends on your scale and what you're doing with the content you scrape.
In my experience with large scale scraping you're much better off using something like Java where you can more easily have a thread pool with thousands of threads (or better yet, Kotlin coroutines) handling the crawling itself and a NUM_CORES-sized thread pool handling CPU-bound tasks like parsing.
Could you give a ballpark figure for what you mean by large scale scraping? I've only worked on a couple projects, one was a broad (100K to 500K domains) and shallow (root + 1 level of page depth, also with a low cap on the number of children pages). The other just a single domain but scraping around 50K pages from it.
Does anyone know how could I script Save Page WE extension in Firefox? It does a really nice job of saving the page as it looks, including dynamic content.
That's what I meant by narrow...a known set of sites and data you want to extract.
I imagine, for example, building on the SERP example might hit a wall if you added logged-in vs. logged-out SERPs, iterating over carousel data, reading advertisement data, etc.
I am often contacted by people who ask me to scrape dynamic/JS-rendered websites. You might be surprised to know that many such dynamic websites actually depend on some API endpoint accessed via AJAX-like functionality, which you can hit directly to get the required data. I often faced situations where the data was not fetched from some external source at all but was already available, either in a data attribute or in some JSON-like structure, hence no need to use Selenium with a headless browser.
https://pandas.pydata.org/pandas-docs/stable/reference/api/p...