Personally, I find JavaScript (Node.js) best suited to web scraping. Using the...

isbvhodnvemrwvn · on Feb 10, 2021

Especially with tools like Puppeteer, which lets you execute JS in the context of the browser (though admittedly this can lead to weird bugs, since the functions are serialized and lose their surrounding context).
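
To make the serialization pitfall concrete, here is a rough sketch that needs no browser. Puppeteer's page.evaluate turns the function into a string and re-creates it inside the page, so closure variables from the Node side are gone. The simulateEvaluate helper below is a hypothetical stand-in that only mimics that round trip with toString and new Function; it is not Puppeteer's actual implementation.

```javascript
// Hypothetical stand-in for page.evaluate (an assumption for illustration):
// stringify the function, rebuild it in a fresh scope, and pass arguments
// through JSON, roughly as happens across the Node/browser boundary.
function simulateEvaluate(fn, ...args) {
  const source = fn.toString();
  // Functions built with `new Function` only see the global scope,
  // so the rebuilt function has no access to the original closure.
  const rebuilt = new Function(`return (${source})`)();
  const clonedArgs = JSON.parse(JSON.stringify(args));
  return rebuilt(...clonedArgs);
}

function demo() {
  const localSelector = 'h1'; // exists only in this closure

  let closureError = null;
  try {
    // Relies on the closure: breaks once the function is serialized.
    simulateEvaluate(() => localSelector.toUpperCase());
  } catch (err) {
    closureError = err.name;
  }

  // Passing the value as an explicit argument survives the round trip.
  const argumentResult = simulateEvaluate(
    (sel) => sel.toUpperCase(),
    localSelector
  );

  return { closureError, argumentResult };
}
```

With real Puppeteer the usual workaround has the same shape: pass the values the function needs as extra arguments to page.evaluate instead of capturing them from the outer scope.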

a1sabau · on Feb 11, 2021

I'm using this exact strategy to scrape content directly from the DOM with APIs like document.querySelectorAll. The same code can run in headless browser clients like Puppeteer or Playwright and in DOM clients like cheerio or jsdom (assuming you have a wrapper over the document API). Depending on how a web page was fetched (opened in a browser tab or retrieved via Node.js http/https requests), ExtractHtmlContentPlugin and ExtractUrlsPlugin use different DOM wrappers (native, cheerio, jsdom) to scrape the content [1].

[1] https://github.com/get-set-fetch/scraper
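
As a rough illustration of the "same code, different DOM wrapper" idea, the sketch below writes the extraction logic against a small subset of the document API (querySelectorAll, textContent, getAttribute) and then feeds it a stub. The names extractLinks, makeStubDocument, and makeStubElement are hypothetical and are not taken from get-set-fetch; the stub only stands in for a real DOM so the example runs without dependencies.

```javascript
// Extraction logic that depends only on the document API subset it uses.
function extractLinks(doc) {
  return Array.from(doc.querySelectorAll('a[href]'), (a) => ({
    text: a.textContent.trim(),
    href: a.getAttribute('href'),
  }));
}

// Hypothetical stub element exposing the two members extractLinks reads.
function makeStubElement(textContent, attributes) {
  return {
    textContent,
    getAttribute: (name) => (name in attributes ? attributes[name] : null),
  };
}

// Hypothetical stub document: returns canned elements per selector.
// A real wrapper (native, jsdom, or one built over cheerio) would
// actually evaluate the selector against parsed HTML.
function makeStubDocument(elementsBySelector) {
  return {
    querySelectorAll: (selector) => elementsBySelector[selector] || [],
  };
}

const stubDoc = makeStubDocument({
  'a[href]': [
    makeStubElement(' Docs ', { href: '/docs' }),
    makeStubElement('Blog', { href: '/blog' }),
  ],
});

const links = extractLinks(stubDoc);
```

In a browser tab the same function could be called with the native document, and with jsdom it could be called with new JSDOM(html).window.document. Cheerio does not expose the document API directly, so it would need a thin adapter of the kind the comment mentions.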