
Personally, I find JavaScript (Node.js) best suited to web scraping. Using the same language to scrape and process pages as you use to build those interfaces feels most natural.



Especially with tools like Puppeteer, which let you execute JS in the context of the browser (admittedly, this can lead to weird bugs, because the functions are serialized and lose their surrounding context).
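A minimal sketch of why serialized functions lose context, runnable in plain Node without Puppeteer. It only simulates the mechanism: `runSerialized` and `demo` are hypothetical helpers made up for this illustration, and the round trip through `toString()` and `new Function` stands in for sending the function source to the browser.

```javascript
// Simulates shipping a function to another JS context: only its source text
// travels, so variables captured from the enclosing scope are left behind.
// runSerialized is a hypothetical helper, not part of Puppeteer.
function runSerialized(fn, ...args) {
  const source = fn.toString();
  // new Function evaluates the source in the global scope, not the caller's.
  const revived = new Function(`return (${source});`)();
  return revived(...args);
}

function demo() {
  const selector = 'h1.title'; // local variable, invisible after serialization

  // Relies on the closure: fails once the function has been serialized.
  const usesClosure = () => `querying ${selector}`;
  // Receives everything it needs as an argument: survives serialization.
  const usesArgument = (sel) => `querying ${sel}`;

  let closureError = null;
  try {
    runSerialized(usesClosure);
  } catch (err) {
    closureError = err.name;
  }

  return {
    closureError,
    withArgument: runSerialized(usesArgument, selector),
  };
}

const result = demo();
console.log(result);
```

With Puppeteer itself, the usual way around this is the same as in the sketch: pass the values in as extra arguments, e.g. `page.evaluate((sel) => document.querySelector(sel).textContent, selector)`, rather than referencing outer variables.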


I'm using this exact strategy to scrape content directly from the DOM using APIs like document.querySelectorAll. You can use the same code in both headless browser clients like Puppeteer or Playwright and DOM clients like cheerio or jsdom (assuming you have a wrapper over the document API). Depending on how a web page was fetched (opened in a browser tab or fetched via Node.js http/https requests), ExtractHtmlContentPlugin and ExtractUrlsPlugin use different DOM wrappers (native, cheerio, jsdom) to scrape the content.

[1] https://github.com/get-set-fetch/scraper
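To illustrate the wrapper idea, here is a rough sketch of extraction code written once and reused across DOM backends. This is not the get-set-fetch implementation: the wrapper interface (`querySelectorAll` returning nodes with `text` and `attr`), and the names `extractLinks`, `wrapDocument`, `wrapCheerio` and `fakeDocument`, are all assumptions made up for the example. The cheerio wrapper is shown for shape only and is not exercised here; the demo runs against a stand-in document object so it needs no dependencies.

```javascript
// Extraction logic written once against a small wrapper interface.
// The interface (querySelectorAll returning nodes with text and attr)
// is hypothetical, chosen for this sketch.
function extractLinks(dom) {
  return dom
    .querySelectorAll('a')
    .map((node) => ({ text: node.text.trim(), href: node.attr('href') }))
    .filter((link) => Boolean(link.href));
}

// Wrapper over a browser-style document (native DOM or a jsdom window.document).
function wrapDocument(document) {
  return {
    querySelectorAll: (selector) =>
      Array.from(document.querySelectorAll(selector)).map((el) => ({
        text: el.textContent,
        attr: (name) => el.getAttribute(name),
      })),
  };
}

// Wrapper over a cheerio instance, i.e. $ = cheerio.load(html).
function wrapCheerio($) {
  return {
    querySelectorAll: (selector) =>
      $(selector).toArray().map((el) => ({
        text: $(el).text(),
        attr: (name) => $(el).attr(name),
      })),
  };
}

// Stand-in for a real document so the sketch runs without a browser.
const fakeDocument = {
  querySelectorAll: (selector) =>
    selector === 'a'
      ? [
          { textContent: ' Home ', getAttribute: (n) => (n === 'href' ? '/' : null) },
          { textContent: 'No link', getAttribute: () => null },
          { textContent: 'Docs', getAttribute: (n) => (n === 'href' ? '/docs' : null) },
        ]
      : [],
};

const links = extractLinks(wrapDocument(fakeDocument));
console.log(links);
```

The point of the design is that `extractLinks` never touches a backend-specific API, so switching between a browser tab and a plain HTTP fetch only changes which wrapper gets passed in.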



