+1 to using cheerio.js. When I need to write a web scraper, I've used the `request` library from npm to fetch the HTML text and cheerio to extract links and resources for the next stage.
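A minimal sketch of that two-stage flow (the URL and selector are placeholders):

    const request = require('request');
    const cheerio = require('cheerio');

    request('https://example.com/', (err, res, body) => {
      if (err) throw err;
      const $ = cheerio.load(body);
      // Collect every href for the next stage of the scraper.
      const links = [];
      $('a[href]').each((i, el) => {
        links.push($(el).attr('href'));
      });
      console.log(links);
    });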
I've also used cheerio when I want to save a functioning local cache of a webpage, since I can have it transform all the various multi-server references for <img>, <a>, <script>, etc. on the page into locally valid URLs and then fetch those URLs.
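Roughly, that rewriting step can look like the sketch below; the `cache/` naming scheme is just an illustrative assumption:

    const cheerio = require('cheerio');

    function localizePage(html, baseUrl) {
      const $ = cheerio.load(html);
      const toFetch = [];
      // Tag -> attribute pairs that may reference another server.
      const refs = { img: 'src', script: 'src', link: 'href', a: 'href' };
      for (const [tag, attr] of Object.entries(refs)) {
        $(tag + '[' + attr + ']').each((i, el) => {
          const remote = new URL($(el).attr(attr), baseUrl).href;
          // Assumed naming scheme: flatten each URL into a local filename.
          const local = 'cache/' + encodeURIComponent(remote);
          toFetch.push({ remote, local });
          $(el).attr(attr, local);
        });
      }
      // Fetch everything in toFetch afterwards, then save $.html().
      return { html: $.html(), toFetch };
    }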
The article didn't touch on this very well, but the reason to upgrade from cheerio to jsdom is if you want to run scripts. E.g., for client-rendered apps, or apps that pull their data from XHR. Since jsdom implements the script element, and the XHR API, and a bunch of other APIs that pages might use, it can get a lot further in the page lifecycle than just "parse the bytes from the server into an initial DOM tree".
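For illustration, here's a sketch of pointing jsdom at a URL with script execution enabled; the fixed two-second wait and the `#content` selector are crude placeholders, not jsdom recommendations:

    const { JSDOM } = require('jsdom');

    JSDOM.fromURL('https://example.com/app', {
      runScripts: 'dangerously', // execute the page's <script> elements
      resources: 'usable',       // actually fetch external scripts
      pretendToBeVisual: true,   // provide requestAnimationFrame, etc.
    }).then((dom) => {
      // Give the page's own scripts (and any XHR) a moment to render.
      setTimeout(() => {
        const el = dom.window.document.querySelector('#content');
        console.log(el && el.textContent);
      }, 2000);
    });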
Self-plug warning, but FWIW, if you're using cheerio _just_ for the selector syntax, a related tool is Stew [1], a dependency-free [2] Node module that lets you extract content from web pages (DOM trees) using CSS selectors, like:
    var links = stew.select(dom,'a[href]');
extended with support for embedded regular expressions (for tags, classes, IDs, attributes or attribute values). E.g.:
    var metadata = stew.select(dom,'head meta[name=/^dc\.|:/i]');
[2] There's an optional peer-dependency-ish relationship with htmlparser, htmlparser2, or similar to generate a DOM tree from raw HTML, but anything that creates a basic DOM tree (`{type:, name:, children:[]}`) will suffice.
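For example, a tree built with htmlparser2 fits that shape; a sketch (the require path for stew is an assumption, check the project's README [1]):

    var htmlparser2 = require('htmlparser2');
    var stew = require('stew'); // package/require name assumed

    // parseDocument yields nodes with type/name/children, the basic
    // tree shape described in footnote [2].
    var doc = htmlparser2.parseDocument('<p><a href="/x">x</a></p>');
    var links = stew.select(doc.children, 'a[href]');
    console.log(links.map(function (el) { return el.attribs.href; }));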
If I recall correctly, what was really helpful about it was that I could write whatever code I needed to query and parse the DOM in the browser console and then copy and paste it into a script with almost no changes.
It made it really simple to go from a proof of concept to a pipeline for scraping material and feeding it into a database.
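As a sketch of that workflow, the jQuery-style query below can be prototyped in a browser console (on a page with jQuery loaded) and then pasted into a cheerio script nearly verbatim; the URL and selector are made up:

    const request = require('request');
    const cheerio = require('cheerio');

    request('https://example.com/list', (err, res, body) => {
      if (err) throw err;
      const $ = cheerio.load(body);
      // This line is what gets prototyped in the browser console first.
      const titles = $('h2.title').map((i, el) => $(el).text()).get();
      console.log(titles); // e.g. feed these rows into a database insert
    });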