For websites that use React, my favorite trick is loading a copy of React Developer Tools inside a headless Chrome instance.
From there, you just find the component you want to copy data from and you copy the state or props. Very little string parsing or data formatting required, no malformed data, etc. There's a library floating around on GitHub somewhere that makes loading a simplified version of React Developer Tools inside Puppeteer just a script you eval with a jQuery-like API for selecting React components, but I can't remember the name right now.
Someone could probably do this without needing a headless web browser (via jsdom)
Doesn't most/all react data come from xhr? Can't you just figure out how the xhr works, and simply parse that?
I did this with an investment website, where I was able to retrieve all data using simple python. It _should_ be more robust than parsing react components/html.
For applications though, it's definitely easier to just make an HTTP request if you can. However, you're more likely to run into issues like APIs blocking datacenter IPs, rate limiting, etc. than when it appears you're just loading the website like a human.
I'd add Postman into that workflow, especially if there are headers you need to know about which are non-obvious from the XHR URL.
From the network tab of your browser's debugger, copy the network request as cURL, paste the cURL into Postman's import, and then click the "code" button to translate it into Python (or whatever else) code.
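Whatever language you pick, the generated code usually boils down to something like this (a hedged Node sketch; the URL, header names and token values are placeholders for whatever your own capture shows):

(async () => {
  // Replay the captured XHR with the same non-obvious headers the browser sent.
  // Everything below is illustrative; paste in the values from your own capture.
  const res = await fetch("https://example.com/api/portfolio?accountId=123", {
    headers: {
      "Accept": "application/json",
      "X-Requested-With": "XMLHttpRequest",
      "Authorization": "Bearer <token copied from the captured request>",
      "Cookie": "session=<copied from the browser devtools>",
    },
  });
  console.log(await res.json());
})();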
I don't precisely mean React Developer Tools, because the UI is unnecessary for this use case, but the library provides similar functionality in that you can access the state/props from the component instance.
// resq is the stringified source of the library
// page is a Puppeteer page
// this line injects resq into the page
await page.evaluate(resq);
// This finds a React component with a prop "country" set to "us"
const usProps = await page.evaluate(
`window["resq"].resq$("*", document.querySelector("#__next")).byProps({country: "us"}).props`
);
// This finds a React component with a prop "expandRowByClick" set to true and reads its dataSource prop
const news = await page.evaluate(
`window["resq"].resq$("*", document.querySelector("#__next")).byProps({expandRowByClick: true}).props.dataSource`
);
Tools like JSDom are pretty nice for this, but I've found that most web scraping involves a lot of low level manipulation of strings and lists – stripping/formatting/concatenating/ranges/etc, and I find JS to have much worse ergonomics for this than languages like Python and Ruby. I actually find the ergonomics of this most comparable to Swift, the difference being that with Swift you get a ton of safety and speed for that trade-off.
If your whole stack is JS and you need a little bit of web scraping, this makes sense. If you're starting a new scraping project from scratch, I think you'll get far further, faster, with Python or Ruby.
Actually, if you're scraping at any scale above a hobby project, most of your web scraping hours would now be spent on avoiding bot detection, reverse engineering APIs and trying to make HTTP requests work where it seems only a browser can help. The time spent "working with strings" is not even noticeable to me.
I scrape for a living and I work with JS, because currently, it has the better tools.
I'm currently working to turn my hobby scraper into something profitable. "Working with strings" is already the least of my concerns. I've spent most of the time finding an architecture / file structure that allows me to
- easily handle markup changes on source-pages and
- quickly integrate new sources
I feared it would be impossible to handle unexpected structural changes from a multitude of sources. Turns out that rarely happens. Like, once every x years per source's page-type.
>Tools like JSDom are pretty nice for this, but I've found that most web scraping involves a lot of low level manipulation of strings and lists – stripping/formatting/concatenating/ranges/etc, and I find JS to have much worse ergonomics for this than languages like Python and Ruby. I actually find the ergonomics of this most comparable to Swift, the difference being that with Swift you get a ton of safety and speed for that trade-off.
I think from ES6 onwards this is handled pretty well.
It has gotten better, but things like slice operators are still missing (and they help a lot), the Set/Map types aren't that great to use and aren't used much in practice, and there are still lots of sharp edges for newcomers even with simple things like iteration. That's also not mentioning things like the itertools/collections modules in Python, which provide some rich types that come in handy.
Do demonstrate; those are pretty basic operations, and all scripting languages handle them equally well. I don't see the benefit of a typed language here. Most parsing for me has been a bunch of XPaths, then (named) regexes on the resulting text nodes. I've never needed anything more than those two.
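In JS that pattern is a few lines you can run straight from the browser console (the selector and regex below are made up for illustration):

// Grab the text nodes with XPath, then pull structure out with named groups.
const result = document.evaluate(
  "//div[@class='offer']//span[@class='price']",
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
);
const prices = [];
for (let i = 0; i < result.snapshotLength; i++) {
  const text = result.snapshotItem(i).textContent;
  // Named capture groups (ES2018) extract the structured bits from the text node
  const m = text.match(/(?<amount>\d+(?:\.\d+)?)\s*(?<currency>[A-Z]{3})/);
  if (m) prices.push({ amount: Number(m.groups.amount), currency: m.groups.currency });
}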
I agree (Ruby + Selenium is a great combination), but for situations where I was forced to use Puppeteer, switching to the context of the browser and executing native Javascript is quite easy, since you can write the code first in the browser console.
I would recommend lxml over Beautiful Soup: it's got a much bigger API (which means more options for parsing awkward bits), it copes with badly formatted markup well, and it's considerably faster.
If you're using JavaScript for scraping, you should go straight to the logical conclusion and run your scraper inside a real browser (potentially headless) - using Puppeteer or Selenium or Playwright.
My current favourite stack for this is Selenium + Python - it lets me write most of my scraper in JavaScript that I run inside of the browser, but having Python to control it means I can really easily write the results to a SQLite database while the scraper is running.
IMO, for most data-gathering needs, running a browser (even a headless one) would be overkill. A browser is better suited for complex interactions, when you need to fully pretend to be a user.
Or just for testing purposes so your environments match.
I've used the Selenium API running in Firefox in the past to scrape customers' data out of proprietary .NET WebForms systems that required a login and didn't offer any option to export the data.
Crawling the list pages and then each edit page in turn allowed for dumping the name and value from each input field to the log as key:value pairs for processing offline.
Navigating paging was probably the biggest challenge.
I have done the same, to "export" 10s of thousands of pages from a client's Sitecore website where they were in a very adversarial relationship with the incumbent Sitecore dev/hosts.
I totally don't recommend doing this. But it worked for this case.
"I hate to advocate drugs, alcohol, violence, or insanity to anyone, but they've always worked for me." -- Hunter S Thompson
Good approach, but advanced Selenium detection goes beyond heuristics. Selenium injects JavaScript into the page to function, and the presence of this is how Selenium is detected.
Interesting. I've worked on both sides of scraping and protecting content, but hadn't really considered checking for JavaScript frameworks as a trigger. I'm assuming this is something you could configure in an F5 that also injects its own JavaScript?
Randomising field names, seeding hidden bogus data and messing with element order was more what I would look at once a persistent scraper was using enough IPs to get around rate limits.
I believe you could recompile Selenium with different names to get around it, but since Puppeteer uses CDP baked into the browser, no injection is necessary, bypassing a lot of this.
I agree with both you and the post you're replying to.
One comment though: once you've passed that "I can't do this without a real browser" line in the sand a few times, you end up with a collection of snippets and skills that moves that line much closer. Sure, I'll load the page and watch in the browser tools to see what's in the HTML and what's coming back from XHR calls, but when I've got a directory full of previously used example code to fire up that uses Python/Selenium and deals with the "boilerplate" parts, it's a much easier decision to jump that way than the first time I stared at the BeautifulSoup documentation.
(When the only tool you have is a nailgun, every problem looks like a messiah...)
Most sites these days are single-page apps. Unless cheerio and PhantomJS work well with those (I haven't tried), I don't see any other option. The benefit of a browser is that it handles multi-processing much better than you would yourself. I only need to add some custom code to block non-JS requests to improve performance a bit.
Like, if you do ad-hoc web scraping, it's fine to spend time looking for the most efficient way, but if your web scraping framework is part of a data pipeline that scrapes all sorts of websites, then a browser is the most development-time-saving route.
I do that for background scraping (via a userscript that parses the data out of the pages I visit and stores the info to a database in the background).
So for example, if I buy some electronics module on AliExpress, my scraper automatically saves all of the product description and images to the database right from the browser as I'm making the order.
These details usually contain vital info to use the module, so it's important to me to have an easily searchable reference for all this information. I really don't trust myself to collect all the necessary info manually.
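Stripped way down, the userscript is something like this (the selectors, fields and the local collector endpoint are simplified placeholders, not my actual setup):

// ==UserScript==
// @name         save-product-page
// @match        https://*.aliexpress.com/item/*
// @grant        none
// ==/UserScript==
(function () {
  // Grab whatever is cheap and reliable from the rendered page
  const record = {
    url: location.href,
    title: document.title,
    images: [...document.querySelectorAll("img")].map((img) => img.src),
    savedAt: new Date().toISOString(),
  };
  // Hypothetical local collector that writes to the database; it has to allow
  // CORS for this cross-origin fetch (or use GM_xmlhttpRequest instead).
  fetch("http://localhost:3000/collect", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(record),
  });
})();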
I've used puppeteer + better-sqlite3 in node for similar jobs in the past... Great combo, but I tend to use it only if/when node-fetch + cheerio aren't feasible.
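A minimal sketch of that combo (URL, selector and table schema are placeholders):

const puppeteer = require("puppeteer");
const Database = require("better-sqlite3");

(async () => {
  // better-sqlite3 is synchronous, which keeps the write path simple
  const db = new Database("scrape.db");
  db.exec("CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)");
  const insert = db.prepare("INSERT INTO items (title, url) VALUES (?, ?)");

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com/listing", { waitUntil: "networkidle2" });

  // Extract data in the page context, then write it out in Node
  const items = await page.$$eval("h2 a", (els) =>
    els.map((el) => ({ title: el.textContent.trim(), url: el.href }))
  );
  for (const item of items) insert.run(item.title, item.url);

  await browser.close();
})();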
+1 to using cheerio.js. When I need to write a web scraper, I've used Node's `request` library to get the HTML text and cheerio to extract links and resources for the next stage.
I've also used cheerio when I want to save a functioning local cache of a webpage since I can have it transform all the various multi-server references for <img>, <a>, <script>, etc on the page to locally valid URLs and then fetch those URLs.
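That rewrite looks roughly like this (the flat "assets/" naming scheme is a made-up placeholder, and actually downloading the referenced files is left to the caller):

const cheerio = require("cheerio");

// Returns rewritten HTML plus the list of remote URLs that still need fetching.
function localizeAssets(html, baseUrl) {
  const $ = cheerio.load(html);
  const toFetch = [];
  const rewrite = (el, attr) => {
    const value = $(el).attr(attr);
    if (!value || value.startsWith("data:")) return;
    const absolute = new URL(value, baseUrl).toString();
    // Hypothetical naming scheme: one flat directory keyed by the encoded URL
    const local = "assets/" + encodeURIComponent(absolute);
    toFetch.push({ remote: absolute, local });
    $(el).attr(attr, local);
  };
  $("img[src], script[src]").each((_, el) => rewrite(el, "src"));
  $("link[href], a[href]").each((_, el) => rewrite(el, "href"));
  return { html: $.html(), toFetch };
}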
The article didn't touch on this very well, but the reason to upgrade from cheerio to jsdom is if you want to run scripts. E.g., for client-rendered apps, or apps that pull their data from XHR. Since jsdom implements the script element, and the XHR API, and a bunch of other APIs that pages might use, it can get a lot further in the page lifecycle than just "parse the bytes from the server into an initial DOM tree".
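A minimal sketch of using jsdom that way (placeholder URL and selector; the fixed delay is a crude stand-in for waiting on a specific element to appear):

const { JSDOM } = require("jsdom");

(async () => {
  // runScripts + resources let the page's own JS and XHR calls actually run
  const dom = await JSDOM.fromURL("https://example.com/app", {
    runScripts: "dangerously",
    resources: "usable",
    pretendToBeVisual: true,
  });
  // Give client-side rendering some time to settle (a real scraper would poll
  // for a specific element instead of sleeping)
  await new Promise((r) => setTimeout(r, 3000));
  const titles = [...dom.window.document.querySelectorAll("h2")].map(
    (el) => el.textContent.trim()
  );
  console.log(titles);
  dom.window.close();
})();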
Self-plug warning, but FWIW: if you're using cheerio _just_ for the selector syntax, a related tool is Stew [1], a dependency-free [2] node module that lets you extract content from web pages (DOM trees) using CSS selectors, like:
var links = stew.select(dom,'a[href]');
extended with support for embedded regular expressions (for tags, classes, IDs, attributes or attribute values). E.g.:
var metadata = stew.select(dom,'head meta[name=/^dc\.|:/i]');
[2] there's an optional peer-dependency-ish relationship to htmlparser or htmlparser2 or similar to generate a DOM tree from raw HTML but anything that creates a basic DOM tree (`{type:, name:, children:[] }`) will suffice
If I recall correctly, what was really helpful about it was that I could write whatever code I needed to query and parse the DOM in the browser console, and then copy and paste it into a script with almost no changes.
It made it really simple to go from a proof of concept to a pipeline for scraping material and feeding it into a database.
This article is woefully incomplete and only covers a very specific limited use case for web scraping.
It doesn't mention puppeteer or why you may need to use something like that. It doesn't mention cookies or sessions or anything like that. And it doesn't mention using proxies or any web scraping countermeasures. It's very easy to make crawling difficult, and only very basic sites are easy to crawl with the methods described in the article.
For more generic web indexing you need to use a browser. You don't index pages served by a server anymore; you index pages rendered by JavaScript apps in the browser. So as part of the "fetch" stage I usually leave parsing of the title and other page metadata to a JavaScript script running inside the browser (using https://www.browserless.io/), and then as part of the "parse" phase I use cheerio to extract links and such. It is very tempting to do everything in the browser, but architecturally it does not belong there. So you need to find the balance that works best for you.
Thanks for the mention! I'm the founder of browserless.io, and agree with pretty much everything you're saying.
Our infrastructure actually does this procedure for some of our scraping needs: we scrape puppeteer's GH documentation page to build out our debugger's autocomplete tool. To do this, we "goto" the page, extract the page's content, and then hand it off to nodejs libraries for parsing. This has two benefits: it cuts down the time you have the browser open and running, and lets you "offload" some of that work to your back-end with more sophisticated libraries. You get the best of both worlds with this approach, and it's one we generally recommend to folks everywhere. Also a great way that we "dogfood" our own product as well :)
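The shape of that flow, roughly (placeholder URL and selectors, and launching Chrome locally here rather than connecting to a remote browser endpoint):

const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

(async () => {
  // Render in the browser, but keep the session as short as possible
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com/docs", { waitUntil: "networkidle2" });
  const html = await page.content();
  await browser.close();

  // Hand the rendered markup off to cheerio on the back-end for parsing
  const $ = cheerio.load(html);
  const methods = $("h3 code").map((_, el) => $(el).text()).get();
  console.log(methods);
})();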
Yes: often the case is that JS does some kind of data-fetching, API calls, or whatever else to render a full page (single-page apps for instance). With Github being mostly just HTML markup and not needing a JS runtime we could have definitely gone that route. The rationale was that we had a desire to use our product ourselves, to gain better insight into what our users do, and become more empathetic to their cause.
In short: we wanted to dogfood the product at the cost of some time and machine resources
Maintainer of jsdom here. jsdom will run the JavaScript on a page, so it can get you pretty far in this regard without a proper browser. It has some definite limitations, most notably that it doesn't do any layout or handling of client-side redirects, but it allows scraping of most single-page client-side-rendered apps.
Not necessarily. It is true that most websites today are JavaScript heavy. However, they are server-side rendered more often than not. Mostly for performance reasons. Also, not all search engines are as good as Google at indexing dynamic JS websites, so it's better to serve pre-rendered HTML for that reason as well.
Hmm, I think I'd still choose Scrapy over JS in this case. While it can be a bit convoluted, for real production stuff I don't know any better choices.
I have myself deployed a Scrapy web scraper as AWS Lambda function and it has worked quite nicely. Every day for the last year now I guess, it has been scraping some websites to make my life a little easier.
Hey everyone, maintainer of the Apify SDK here. As far as we know, it is the most comprehensive open-source scraping library for JavaScript (Node.js).
It gives you tools to work with both HTTP requests and headless browsers, storages to save data without having to fiddle with databases, and automatic scaling based on available system resources. We use it every day in our web scraping business, but 90% of the features are available for free in the library itself.
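A minimal example of the kind of crawler it gives you (placeholder URL and selector; this follows the v1-era CheerioCrawler style, so check the docs for the current API):

const Apify = require("apify");

Apify.main(async () => {
  // The request queue and dataset storages come with the SDK
  const requestQueue = await Apify.openRequestQueue();
  await requestQueue.addRequest({ url: "https://example.com" });

  const crawler = new Apify.CheerioCrawler({
    requestQueue,
    handlePageFunction: async ({ request, $ }) => {
      // $ is a cheerio instance loaded with the fetched page
      await Apify.pushData({
        url: request.url,
        title: $("title").text().trim(),
      });
    },
  });

  await crawler.run();
});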
Brute or Generic Scraping - you need to be able to scrape any site and get the data into your organization to serve to your customers, so you probably don't care about manipulating things at a string level, but you do care about having something that can handle a JS-based site. Here you do not make money from the individual scrapes but from being able to have everything for everyone, and thus you cannot afford to spend much extra development effort per site, because scraping any one site in itself probably isn't worth much money to you.
Bespoke scraping - here you care about being able to extract data at a very atomic level, and you need string manipulation and everything else. You probably make money on each individual site scraped, because the sites have been strategically chosen to enhance a product. For example, you have a product serving the legal needs of everyone in the EU but want to expand into all EEA / EFTA countries; each legal info site you adapt your scraper for is worth lots of money, and you put developer effort into getting things at a granular data level matching your data model of legal information.
I have not tried the Headless Chrome Crawler personally, but try the Apify SDK out https://github.com/apify/apify-js if the Headless Chrome crawler does not scale well enough. We use it to scrape billions of pages every month.
Engineers often love to say you can't do this because regular expressions parse regular languages, and HTML is context-sensitive, not regular, and therefore it's impossible to parse.
What they often miss is that the language actually being scraped may only be regular. If you want to parse a page to see if it has the word Banana on it, then your language may be defined as .*Banana.*, and that's regular; it doesn't matter that it's HTML. This even applies to questions like "does this contain <element> in the <head>?", or "is there a table in the body?"
HTML is not regular, but you're not implementing a browser, you're implementing the language of what you're scraping, and that may well be regular.
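Concretely, with html being the raw response text, the check is a one-liner:

// "Does the page mention Banana?" is a regular-language question, HTML or not.
const hasBanana = /Banana/.test(html);
// "Is there a <table> tag anywhere?" is still regular, though it gets fragile
// the moment you start caring about nesting or attributes.
const hasTable = /<table[\s>]/i.test(html);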
This works as long as you're really sure that the language you'll want to parse tomorrow will be regular also. It doesn't take much to accidentally add a new requirement that isn't, and once you've committed to regexps you may be tempted to break out the non-regular extensions that most regexp engines support, and that way lies madness.
Starting with a real HTML parser is a good way to future-proof your code for when someone asks you to add just one more thing.
That's true, although I've also seen scraping fail because it was being too precise – looking for something at a particular point in the DOM tree because the parser encourages things like XPaths or CSS selectors, where a regex would have been less brittle _for that use-case_.
For me this just highlights why it's important that engineers understand, at some basic level, what these different things all mean, and what limitations your solutions may have, or even which limitations you may actually want.
I assume this is a counterpoint to my Banana example? It still depends on your language. Maybe this is ok! I wasn't clear on whether I meant it being in the human readable page, or the raw text of the page, but maybe either is sufficient for this contrived hypothetical.
There are definitely cases like this where you have to be careful, but my point still stands that it's important to understand the language you are parsing, and the fact that it might be a regular language. Hell, it could even be Turing complete and then you're out of luck!
In my experience, until you've made the mistakes that not properly parsing HTML leads to, you mostly jump to naive regex/substring solutions too quickly, where you should learn/use well-tested HTML parsing libraries instead. Those more advanced techniques aren't always required, but they're worth knowing, and once you know them it's smarter to "over-solve" the problem sometimes than to "cowboy it" with a regex just because it looks like it'll do the job.
Overwhelmingly (in my experience), you're not even really parsing HTML with regex. Rather, you're just treating it as a text document and using certain tags or code snippets as boundary points for finding the data that you want. It's certainly way faster, though prone to its own issues that don't come up as often with something like a DOM library or headless browser.
Many HTML documents will have the same data included multiple times, so a lot of the limitations can be avoided by targeting the places that appear the most consistently. Most of the reason why a web scraper would break would be because only one place was being targeted for data, and often very loosely. That place would get changed. Suddenly, you wind up with either a lot of wrong data or none at all.
A great web-scraping architecture is the pipeline model, similar to 3D rendering pipelines. |Stage 1|: Render the HTML, |Stage 2|: Save the HTML to disk, |Stage 3|: Parse and translate the HTML to whatever output you need: JSON, CSV, etc.
It's great if each of these processes can be invoked separately, so that after the HTML is saved, you don't need to redownload it, unless the source has changed.
By dividing scraping into rendering, caching and parsing, you save yourself a lot of web requests. This also helps prevent the website from triggering IP blocking, DDoS protection and rate limiting.
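A bare-bones sketch of that split in Node (paths, URL and selector are placeholders; puppeteer does the rendering, cheerio the parsing):

const fs = require("fs");
const path = require("path");
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

const CACHE_DIR = "cache";

// Stage 1 + 2: render and save, skipping the network if the file already exists
async function fetchHtml(url) {
  const file = path.join(CACHE_DIR, encodeURIComponent(url) + ".html");
  if (fs.existsSync(file)) return fs.readFileSync(file, "utf8");
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });
  const html = await page.content();
  await browser.close();
  fs.mkdirSync(CACHE_DIR, { recursive: true });
  fs.writeFileSync(file, html);
  return html;
}

// Stage 3: parse the cached HTML into structured output, re-runnable at will
function parse(html) {
  const $ = cheerio.load(html);
  return $("article h2").map((_, el) => ({ title: $(el).text().trim() })).get();
}

fetchHtml("https://example.com/news").then((html) =>
  console.log(JSON.stringify(parse(html), null, 2))
);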
Let me also add some more specific complexity: the depth of the pages (of a site), the length of parameters in dynamically generated links (which can potentially be infinite if there is a circular, perpetually "adding" mechanism in the website's code), upper/lowercase characters in links (irrelevant for the protocol & domain but relevant for the rest, like the path and parameters), etc.
I just started with this theme and I'm having a lot of unexpected "fun" :)
Any tool you know of that deals with these? I've been using the offline version of Apify; the multithreading, queues and workers seem to be good, but it does not seem to do rate limiting.
I'm using Python and selectorlib (https://selectorlib.com/) for my workflow, since most of the webpages I crawl can be broken down to:
- get to the webpage (selenium)
- do some clicks to expand certain information (selenium)
- save the html (selenium)
- and parse (selectorlib)
For me, almost everything can be done with CSS selectors or XPath. Selectorlib lets you write just a tree of CSS selectors, where the selectors in the children only apply to the currently selected elements.
The nice thing is the magical browser tool of the same name, which makes the first iteration much easier.
However, the browser tool's output and the Python code don't always match, which causes some headaches.
Overall, it cut out like 90% of the code and moved it into configuration.
A lot of what selectorlib does (getting text or attributes) is achievable with XPath 1.0, which is built into browsers and testing tools. What I do in my scraping framework is take a dict of name -> xpath and return a JSON object. This way the framework knows exactly what needs to be extracted and stops loading the page as soon as all the information is collected.
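For example, a rough in-browser version of that name -> xpath idea (the field names and expressions are made up; XPathResult.STRING_TYPE does the text extraction):

// Evaluate each XPath against the document and return one flat JSON object.
function extract(xpaths) {
  const out = {};
  for (const [name, xpath] of Object.entries(xpaths)) {
    out[name] = document
      .evaluate(xpath, document, null, XPathResult.STRING_TYPE, null)
      .stringValue.trim();
  }
  return out;
}

extract({
  title: "//h1",
  price: "//span[@class='price']",
  sku: "//*[@data-sku]/@data-sku",
});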
Done a fair bit of scraping in my time, mostly with PHP/curl and PHP's DOMDocument if necessary.
I'd say to anyone learning how to code that it's a good exercise. I think a scraper for most sites can be built in an hour or two, depending on navigation and how the data is sent to the client.
I've definitely noticed a trend towards XHR and JSON responses, typically keyed by a numeric ID. That's probably the easiest type of site to scrape: you don't need to crawl navigation, you simply iterate over a number range, and the scraped data is already pretty much structured.
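A sketch of that pattern with Node 18+'s built-in fetch (the endpoint is hypothetical, and a real crawler would want politer pacing and retries):

// Walk a numeric ID range against a JSON endpoint and collect the responses.
async function crawlRange(first, last) {
  const rows = [];
  for (let id = first; id <= last; id++) {
    const res = await fetch(`https://example.com/api/items/${id}`);
    if (res.status === 404) continue; // gaps in the ID space are normal
    rows.push(await res.json());
    await new Promise((r) => setTimeout(r, 500)); // crude rate limiting
  }
  return rows;
}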
Agreed. Though often I find sites and pages that need Chrome's flavor of JS. It's becoming increasingly inevitable one will need Chrome/ium to reliably get the rendered markup.
I've never really scraped anything where the valued data is in JS or dependent on a browser. Sometimes the browser uses JS to fetch the data, but generally the call is easily found out in your browser console. The patterns are generally obvious.
I've used artoo.js (https://github.com/medialab/artoo) for my in browser one off web scrapes for a while and find it pretty useful. Ripping tables -> csv is pretty straightforward and it handles pagination pretty well too.
A lot of the comments there are about Puppeteer or Selenium. For home needs I found a Chrome extension much more useful. Puppeteer, and the Chrome debug protocol in general, is more complex and restricted compared to a Chrome extension.
I was surprised that this article only mentions DOM parsing as a tool. These days I find it better to use something like a headless browser to do scraping from websites.
This is all about static HTML page scraping, which is really basic. Many sites are SPAs or have lots of JavaScript code these days, so your best bet is using Puppeteer etc., though it's much slower.
As mentioned here, Apify is to JavaScript what Scrapy is to Python.