Hacker News new | past | comments | ask | show | jobs | submit login

An open-source browser built from scratch is bold. What inspired the development of Lightpanda?





Thanks! The three of us worked together at our former company - ecomm saas start up where we spent a ton of $ on scraping infrastructure spinning up headless Chrome instances.

It started out as more of an R&D thesis - is it possible to strip out graphical rendering from Chrome headless? Turns out no - so we tried to build it from scratch. And the beta results validated the thesis.

I wrote a whole thing about it here if you're interested in delving deeper https://substack.thewebscraping.club/p/rethinking-the-web-br...


Not sure what category of ecomm sites you were scraping but I scrape >10million ecomm URLs daily and, honestly, in my experience the compute is not a major issue (8 times out of 10 you can either use API endpoints and/or session stuffing to avoid needing a browser for every request; and in the 2 out of 10 sites where you really need a browser for all requests it's usually to circumvent aggressive anti-bot which means you're very likely going to need full chrome or FF anyway - and you can parallelise quite effectively across tabs).

One niche where I could definitely see a use for this though is scraping terribly coded sites that need some JS execution to safely get the data you want (e.g. they do some bonkers client side calculations that you don't want to reverse engineer). It would be nice to not pay the perf tax of chrome in these cases.

Having said all of that, I have to say from a geek perspective it's super neat what you guys are hacking on! Zig+V8+CDP bindings is very cool.


> not pay the perf tax

I've typically used pyminiracer in such cases and provided some dummy window objects and whatnot as necessary for the script to succeed.


fully agree here, using a browser for everything is the dumb way. You just usually use it to circumvent the blocking and then reuse the cookies to call the endpoints directly.

It might works if you need to handle a few websites. But this retro engineering approach is not maintainable if you want to handle hundreds or thousands of websites.

Scrapping modern web pages is hard without full support for JS frameworks and dynamic loading. But a full browser, even headless, has huge ressource consumption. This has a huge cost when scraping at scale.



Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: