
I worked on a large web scraper for several years, and JavaScript almost never needs to be executed. The only times I've had to run it were to extract obfuscated links that are revealed by some bit-twiddling code, specific to each request, and this was achievable by forking out to Deno.



I think JavaScript comes up because Cloudflare uses some kind of JavaScript challenge as part of its DDoS protection. There are Python libraries that know how to deal with it, or you can use some level of headless browser. https://github.com/VeNoMouS/cloudscraper


This is highly domain (and sometimes User-Agent) dependent and in my experience JS is required more and more.

e.g. good luck trying to get much out of youtube.com (or any other video site) without executing JS.


YouTube has the "var ytInitialData" and "var ytInitialPlayerResponse" variables hardcoded in the HTML. No need to run JS!
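
For anyone who wants to try this, here is a rough Python sketch of pulling such a variable out of the page source without a JS engine. The helper name extract_js_var and the sample HTML are made up for illustration, and it assumes the value is plain JSON assigned directly after "var <name> =", which may not hold for every page or survive layout changes.

```python
import json
import re


def extract_js_var(html: str, name: str) -> dict:
    """Return the JSON value assigned to `var <name> = ...` in the HTML.

    Hypothetical helper: assumes the assigned value is valid JSON.
    """
    match = re.search(r"var\s+" + re.escape(name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"{name} not found in page")
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (the trailing ";" and </script>).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


# Made-up stand-in for a downloaded page; a real page is far larger.
sample_html = (
    "<html><script>"
    'var ytInitialData = {"contents": {"title": "demo; with };"}};'
    "</script></html>"
)

data = extract_js_var(sample_html, "ytInitialData")
print(data["contents"]["title"])
```

Using the JSON decoder rather than a regex up to the first "};" matters because that sequence can appear inside string values, as it does in the sample above.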



