Show HN: Lightpanda, an open-source headless browser in Zig

fbouvier · 2025-01-24T13:43:32 1737726212

Author here. The browser is made from scratch (not based on Chromium/Webkit), in Zig, using v8 as a JS engine.

Our idea is to build a lightweight browser optimized for AI use cases like LLM training and agent workflows. And more generally any type of web automation.

It's a work in progress, there are hundreds of Web APIs, and for now we just support some of them (DOM, XHR, Fetch). So expect most websites to fail or crash. The plan is to increase coverage over time.

Happy to answer any questions.

JoelEinbinder · 2025-01-24T16:29:58 1737736198

When I've talked to people running this kind of ai scraping/agent workflow, the costs of the AI parts dwarf that of the web browser parts. This causes computational cost of the browser to become irrelevant. I'm curious what situation you got yourself in where optimizing the browser results in meaningful savings. I'd also like to be in that place!

I think your ram usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than chrome on a minimal website. But it should even out or get worse as the websites get richer. The nature of web scraping is that the worst sites take up the vast majority of your cpu cycles. I don't think lowering the ram usage of the browser process will have much real world impact.

fbouvier · 2025-01-24T17:10:27 1737738627

The cost of the browser part is still a problem. In our previous startup, we were scraping >20 millions of webpages per day, with thousands of instances of Chrome headless in parallel.

Regarding the RAM usage, it's still ~10x better than Chrome :) It seems to be coming mostly from v8, I guess that we could do better with a lightweight JS engine alternative.

radium3d · 2025-01-25T01:52:44 1737769964

As a web developer and server manager AI trainers scraping websites with no throttle is the problem. lol

cush · 2025-01-24T20:14:18 1737749658

> there are hundreds of Web APIs, and for now we just support some of them (DOM, XHR, Fetch)

> it's still ~10x better than Chrome

Do you expect it to stay that way once you've reached parity?

fbouvier · 2025-01-25T15:03:37 1737817417

I don't expect it to change a lot. All the main components are there, it's mainly a question of coverage now.

nwienert · 2025-01-24T21:07:14 1737752834

Playwright can run webkit very easily and it's dramatically less resource-intensive than Chrome.

fbouvier · 2025-01-24T23:16:42 1737760602

Yes but WebKit is not a browser per se, it's a rendering engine.

It's less resource-intensive than Chrome, but here we are talking orders of magnitude between Lightpanda and Chrome. If you are ~10x faster while using ~10x less RAM you are using ~100x less resources.

bdhcuidbebe · 2025-01-25T01:13:44 1737767624

How well does it compare to specialized headless scraper browsers, like camoufox (firefox based) or secret agent (chrome based)?

Either should reduce your ram usage compared to stock chrome by a lot.

whatevaa · 2025-01-25T10:31:35 1737801095

Careful, as you implement misssing features your RAM usage might grow too. Happened to many projects, lean at the beggining, get's just as slow when dealing with real world mesiness.

msoad · 2025-01-25T12:19:50 1737807590

Does it work nicely on Linux? I'm very curious about this

niutech · 2025-01-29T11:14:59 1738149299

How about using QuickJS instead of full-blown V8? For example, Elinks has support for SpiderMonkey, QuickJS, MuJS: https://github.com/rkd77/elinks/blob/master/doc/ecmascript.t... and takes a few MB of RAM.

Tostino · 2025-01-24T17:38:15 1737740295

You may reduce ram, but also performance. A good JIT costs ram.

fbouvier · 2025-01-24T17:57:27 1737741447

Yes, that's true. It's a balance to find between RAM and speed.

I was thinking more on use cases that require to disable JIT anyway (WASM, iOS integration, security).

Tostino · 2025-01-24T18:02:01 1737741721

Yeah, could be nice to allow the user to select the type of ECMAScript engine that fits their use-case / performance requirements (balancing the resources available).

cxr · 2025-01-25T04:30:47 1737779447

If your target is consistent enough (perhaps even stationary), then at some point "JIT" means wasting CPU cycles.

refulgentis · 2025-01-24T17:01:44 1737738104

Generally, for consumer use cases, it's best to A) do it locally, preserving some of the original web contract B) run JS to get actual content C) post-process to reduce inference cost D) get latency as low as possible

Then, as the article points out, the Big Guns making the LLMs are a big use case for this because they get a 10x speedup and can begin contemplating running JS.

It sounds like the people you've talked to are in a messy middle: no incentive to improve efficiency of loading pages, simply because there's something else in the system that has a fixed cost to it.

I'm not sure why that would rule out improving anything else, it doesn't seem they should be stuck doing nothing other than flailing around for cheaper LLM inference.

> I think your ram usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than chrome on a minimal website.

I'm a bit lost, the ram usage benchmark says its ~10x less, and you feel its deceptive because you'd expect ram usage to be less? Steelmanning: 10% of Chrome's usage is still too high?

JoelEinbinder · 2025-01-24T17:09:47 1737738587

The benchmark shows lower ram usage on a very simple demo website. I expect that if the benchmark ran on a random set of real websites, ram usage would not be meaningfully lower than Chrome. Happy to be impressed and wrong if it remains lower.

fbouvier · 2025-01-24T17:15:43 1737738943

I believe it will be still significantly lower as we skip the graphical rendering.

But to validate that we need to increase our Web APIs coverage.

szundi · 2025-01-25T11:42:09 1737805329

Then came deepseek

danielsht · 2025-01-24T21:24:33 1737753873

Very impressive! At Airtop.ai we looked into lightweight browsers like this one since we run a huge fleet of cloud browsers but found that anything other than a non-headless Chromium based browser would trigger bot detection pretty quickly. Even spoofing user agents triggers bot detection because fingerprinting tools like FingerprintJS will use things like JS features, canvas fingerprinting, WebGL fingerprinting, font enumeration, etc.

Can you share if you've looked into how your browser fares against bot detection tools like these?

fbouvier · 2025-01-24T21:46:45 1737755205

Thanks! No we haven't worked on bot detection.

bityard · 2025-01-24T17:48:50 1737740930

Please put a priority on making it hard to abuse the web with your tool.

At a _bare_ minimum, that means obeying robot.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that. It goes without saying that you should not allow users to make hundreds or thousands of "blind" parallel requests as these tend to have the effect of DoSing sites that are being hosted on modest hardware. You should also be measuring response times and throttling your requests accordingly. If a website issues a response code or other signal that you are hitting it too fast or too often, slow down.

I say this because since around the start of the new year, AI bots have been ravaging what's left of the open web and causing REAL stress and problems for admins of small and mid-sized websites and their human visitors: https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-sit...

hombre_fatal · 2025-01-25T15:33:23 1737819203

This is HN virtue signaling. Some fringe tool that ~nobody uses is held to a different, weird standard and must be the one to kneecap itself with a pointless gesture and a fake ethical burden.

The comparison to DRM makes sense. Gimping software to disempower the end user based on the desires of content publishers. There's even probably a valid syllogism that could make you bite the bullet on browsers forcing you to render ads.

gkbrk · 2025-01-24T18:49:25 1737744565

Please don't.

Software I installed on my computer needs to the what I want as the user. I don't want every random thing I install to come with DRM.

The project looks useful, and if it ends up getting popular I imagine someone would make a DRM-free version anyway.

tossandthrow · 2025-01-24T18:53:29 1737744809

Where do you read DRM?

Parent commenter merely and humbly asks the author of the library to make sure that it has sane defaults and support for ethical crawling.

I find it disturbing that you would recommend against that.

gkbrk · 2025-01-24T18:58:11 1737745091

Here's what the parent comment wrote.

> And there should not be an option to override that.

This is not just a sane default. This is software telling you what you are allowed to do based on what the rights owner wants, literally DRM.

This is exactly like Android not allowing screenshots to be taken in certain apps because the rights owner didn't allow it.

blacksmith_tb · 2025-01-24T19:33:09 1737747189

Not sure what "digital rights" that "manages"? I don't see it as an unreasonable suggestion that the tool shouldn't be set up out of the box to DoS sites it's scraping, that doesn't prevent anyone who is technical enough to know what they're doing to fork it and remove whatever limits are there by default? I can't see it as a "my computer should do what I want!" issue, if you don't like how this package works, change it or use another?

benatkin · 2025-01-24T22:45:11 1737758711

Digital Restrictions Management, then. Have it your way.

concerndc1tizen · 2025-01-25T12:09:39 1737806979

There are so many combative people on HackerNews lately who insist to misinterpret everything.

I really wonder if it's bots or just assholes.

cies · 2025-01-25T13:15:40 1737810940

Indeed DRM is a very different thing from adhering to standards like `robots.txt` as a default out of the box (there could still be a documented option to ignore it).

concerndc1tizen · 2025-01-25T13:17:20 1737811040

- That's just like, your opinion, man

He was using DRM as a metaphor for restricted software. And advocating that software should do whatever the user wants. If the user is ignorant about the harm the software does, then adding robots.txt support is win-win for all. But if the user doesn't want it, then it's political, in the same way that DRM is political and anti-user.

JambalayaJimbo · 2025-01-24T20:56:37 1737752197

This is software telling you what you are allowed to do based on what the software developer wants* (assuming the developers cares of course...). Which is how all software works. I would not want my users of my software doing anything malicious with it, so I would not give them the option.

If I create an open-source messaging app I am also not going to give users the option of clicking a button to spam recipients with dick pics. Even if it was dead-simple for a determined user to add code for this dick pic button themselves.

concerndc1tizen · 2025-01-25T12:12:05 1737807125

> I find it disturbing

Oh no, someone on the internet found something offensive!

tossandthrow · 2025-01-25T17:15:28 1737825328

Disturbing, not offensive - it is literally right there in the quote you have been so nice to pass along.

GuB-42 · 2025-01-25T16:54:19 1737824059

Who told you about DRM? It is an open source tool.

Simply requiring a code change and a rebuild is enough of a barrier to prevent rude behavior from most people. You won't stop competent malicious actors but you can at least encourage good behavior. If popular, someone will make a fork but having the original refuse to do stuff that are deemed abusive sends a message.

It is like for the Flipper Zero. The original version does not let you access frequency bands that are illegal in some countries, and anything involving jamming is highly frowned upon. Of course, there are forks that let you do these things, but the simple fact that you need to go out of your way to find these should tell you it is not a good idea.

bityard · 2025-01-24T19:24:43 1737746683

I feel like you may have a misunderstanding of what DRM is. Talking about DRM outside the context of media distribution doesn't really make any sense.

Yes, someone can fork this and modify it however they want. They can already do the same with curl, Firefox, Chromium, etc. The point is that this is project is deliberately advertising itself as an AI-friendly web scraper. If successful, lots of people who don't know any better are going to download it and deploy it without a full understanding (and possibly caring) of the consequences on the open web. And as I already point out, this is not hypothetical, it is already happening. Right now. As we speak.

Do you want cloudflare everywhere? This is how you get cloudflare everywhere.

My plea for the dev is that they choose to take the high road and put web-server-friendly SANE DEFAULTS in place to curb the bulk of abusive web scraping behavior to lessen the number of gray hairs it causes web admins like myself. That is all.

randunel · 2025-01-24T21:28:43 1737754123

It's exactly DRM, management of legal access to digital content. The "media" part has been optional for decades.

The comment they replied to didn't suggest sane defaults, but DRM. Here's the quote, no defaults work that way (inability to override):

> At a _bare_ minimum, that means obeying robot.txt and NOT crawling a site that doesn't want to be crawled. And there should not be an option to override that.

samatman · 2025-01-24T22:49:01 1737758941

I'll also add something that I expect to be somewhat controversial, given earlier conversations on HN[0]: I see contexts in which it would be perfectly valid to use this and ignore robots.txt.

If I were directing some LLM agent to specifically access a site on my behalf, and get a usable digest of that information to answer questions, or whever, that use of the headless browser is not a spider, it's a user agent. Just an unusual one.

The amount of traffic generated is consistent with browsing, not scraping. So no, I don't think building in a mandatory robots.txt respecter is a reasonable ask. Someone who wants to deploy it at scale while ignoring robots.txt is just going to disable that, and it causes problems for legitimate use cases where the headless browser is not a robot in any reasonable or normal interpretation of the term.

[0]: I don't entirely understand why this is controversial, but it was.

benatkin · 2025-01-24T22:49:26 1737758966

> Talking about DRM outside the context of media distribution doesn't really make any sense.

It’s a cultural thing, and it makes a lot of sense. This fits with DRM culture that has walled gardens in iOS and Android.

calvinmorrison · 2025-01-25T15:55:01 1737820501

i still wont forgive libtorrent for not implementing sequential access.

and also, xpdf for implementing the "you cant select text" feature

MichaelMoser123 · 2025-01-25T04:48:18 1737780498

That would make it impossible to use this as a testing tool. How should automatic testing of web applications work, if you obey all of these rules? There is also the problem of load testing. This kind of stuff is by its nature of dual use, a load test is also a kind of DDOS attack.

benatkin · 2025-01-24T22:50:58 1737759058

Make it faster and furiouser.

There are so many variables involved that it’s hard to predict what it will mean for the open web to have a faster alternative to headless Chrome. At least it isn’t controlled by Google directly or indirectly (Mozilla’s funding source) or Apple.

mpalmer · 2025-01-25T19:52:04 1737834724

If it's already a problem, nothing this developer does will improve it, including crippling their software and removing arguably legitimate use cases.

buzzerbetrayed · 2025-01-25T17:17:51 1737825471

Nerfing developer tools to save the "open web" is such a fucking backward argument.

holoduke · 2025-01-25T06:18:28 1737785908

In 10 lines of code I could create a proxy tool that removes all your suggested guidelines so the scraper still operates. In other words. Not really helping.

cchance · 2025-01-25T02:33:29 1737772409

Its literally open source, any effort put into hamstringing it would just be forked and removed lol

xena · 2025-01-25T04:13:47 1737778427

Any barrier to abuse makes abuse harder.

cchance · 2025-01-26T01:40:38 1737855638

Not really lol, literally if you add a robots.txt check someone can just create a fork repo with a git action that escapes that routine every time the original is pushed... Adding options for filter and respecting things is great even as defaults, but trying to force "good behavior" tends to just lead to people setting up a workaround that everyone eventually uses because why use the hamstrung version instead of the open version and make your own choices.

internet_points · 2025-01-25T13:37:47 1737812267

Yes! Having long time ago done some minor web scraping, I did not put any work at all into following robots.txt, simply because it seemed like a hassle and I thought "meh it's not that much traffic is it and boss wants this done yesterday". But if the tool defaulted to following robots.txt I certainly wouldn't have minded, it would have caused me to get less noise and my tool to behave better.

Also, throttling requests and following robots.txt actually makes it less likely that your scraper will be blocked, so even for those who don't care about the ethics, it's a good thing to have ethical defaults.

andrethegiant · 2025-01-25T18:10:10 1737828610

This is why I’m making crawlspace.dev, a crawling PaaS that respects robots.txt, implements proper caching, etc by default.

bsnnkv · 2025-01-25T17:33:47 1737826427

Looking at the responses here, I'm glad I just chose to paywall to protect against LLM training data collection crawling abuse.[1]

[1]: https://lgug2z.com/articles/in-the-age-of-ai-crawlers-i-have...

sesm · 2025-01-24T16:30:06 1737736206

Great job! And good luck on your journey!

One question: which JS engines did you consider and why you chose V8 in the end?

fbouvier · 2025-01-24T16:41:52 1737736912

We have also considered JavaScriptCore (used by Bun) and QuickJS. We did choose v8 because it's state of the art, quite well documented and easy to embed.

The code is made to support others JS engine in the future. We do want to add a lightweight alternative like QuickJS or Kiesel https://kiesel.dev/

ksec · 2025-01-25T03:34:38 1737776078

Thank You I was thinking of JSC and Bun as well. Was half expecting JSC since that combination seems to work well.

keepamovin · 2025-01-25T02:07:53 1737770873

If you support Page.startScreencast or even just capture screenshot we could experiment with using this as a backend for BrowserBox, when lightpanda matures. Cool stuff!

https://github.com/BrowserBox/BrowserBox/

returnofzyx · 2025-01-25T09:13:41 1737796421

Hi. Can I embed this as library? Is there C API exposed? I can't seem to find any documentation. I'd prefer this to a CDP server.

fbouvier · 2025-01-25T10:01:36 1737799296

Not now but we might do it in the future. It's easy to export a Zig project as a C ABI library.

returnofzyx · 2025-01-25T13:50:39 1737813039

Oh please do. I'm sure there are many people like me who want this.

niutech · 2025-01-29T11:08:16 1738148896

Congratulations! But does it support Google Account login? And ReCAPTCHA?

afk1914 · 2025-01-24T18:24:55 1737743095

I am curious how Lightpanda compares to chrome-headless-shell ({headless: 'shell'} in Puppeteer) in benchmarks.

fbouvier · 2025-01-24T18:45:45 1737744345

We did not run benchmarks with chrome-headless-shell (aka the old headless mode) but I guess that performance wise it's on the same scale as the new headless mode.

toobulkeh · 2025-01-24T16:24:35 1737735875

I’d love to see better optimized web socket support and “save” features that cache LLM queries to optimize fallback

dtj1123 · 2025-01-24T16:30:04 1737736204

Very nice. Does this / will this support the puppeteer-extra stealth plugin?

katiehallett · 2025-01-24T16:43:42 1737737022

Thanks! Right now no, but since we use the CDP (playwright, puppeteer), I guess it would be possible to support it

867-5309 · 2025-01-25T01:06:48 1737767208

does this work with selenium/chromedriver?

fbouvier · 2025-01-25T01:08:02 1737767282

For now we just support CDP. But Selenium is definitely in our roadmap.

xena · 2025-01-24T18:37:20 1737743840

How do I make sure that people can't use lightpanda to bypass bot protection tools?

dolmen · 2025-01-27T11:47:59 1737978479

One of Lightpanda's goals is to ease building bots.

frankgrecojr · 2025-01-24T20:35:19 1737750919

The hello world example does not work. In fact, no website I've tried works. It's usually always panics. For the example in the readme, the errors are:

```

./lightpanda-aarch64-macos --host 127.0.0.1 --port 9222

info(websocket): starting blocking worker to listen on 127.0.0.1:9222

info(server): accepting new conn...

info(server): client connected

info(browser): GET https://wikipedia.com/ 200

info(browser): fetch https://wikipedia.com/portal/wikipedia.org/assets/js/index-2...: http.Status.ok

info(browser): eval script portal/wikipedia.org/assets/js/index-24c3e2ca18.js: ReferenceError: location is not defined

info(browser): fetch https://wikipedia.com/portal/wikipedia.org/assets/js/gt-ie9-...: http.Status.ok

error(events): event handler error: error.JSExecCallback

info(events): event handler error try catch: TypeError: Cannot read properties of undefined (reading 'length')

info(server): close cmd, closing conn...

info(server): accepting new conn...

thread 5274880 panic: attempt to use null value

zsh: abort ./lightpanda-aarch64-macos --host 127.0.0.1 --port 9222

```

lbotos · 2025-01-25T01:04:43 1737767083

Not OP -- do you have some kind of proxy or firewall?

Looks like you couldn't download https://wikipedia.com/portal/wikipedia.org/assets/js/gt-ie9-... for some reason.

In my contributions to joplin s3 backend "Cannot read properties of undefined (reading 'length')" was usually when you were trying to access an object that wasn't instantiated. (Can't figure out length of <undefined>)

So for some reason it seems you can't execute JS?

krichprollsch · 2025-01-25T09:49:46 1737798586

Lightpanda co-author here.

Thanks for opening the issue in the repo. To be clear here, the crash seems relative with a socket disconnection issue in our CDP server.

> info(events): event handler error try catch: TypeError: Cannot read properties of undefined (reading 'length')

This message relates to the execution of gt-ie9-ce3fe8e88d.js. It's not the origin of the crash.

I have to dig in, but it could be due to a missing web API.

zelcon · 2025-01-25T03:40:01 1737776401

That's Zig for you. A ``modern'' systems programming language with no borrow checker or even RAII.

hansvm · 2025-01-25T05:49:43 1737784183

Those statements are mostly true and also worth talking about, but they're not pertinent to that error (remotely provided JS not behaving correctly), or the eventual crash (which you'd cause exactly the same way for the same reason in Rust with a .unwrap() call).

IshKebab · 2025-01-25T10:14:07 1737800047

Not exactly the same. `.unwrap()` will never lead to UB, but this can in Zig in release mode.

Also `unwrap()`s are a lot more obvious than just a ?. Dangerous operations should require more ceremony than safe ones. Surprising to see Zig make such a mistake.

hansvm · 2025-01-25T22:29:14 1737844154

> UB vs "safe" panic

Yes, it's not exactly the same if you compile in ReleaseFast instead of ReleaseSafe. Both are bad though, and I'd tend to blame the coding pattern for the observed behavior rather than quibble about which unacceptable scenario is worse.

I see people adopting forced null unwrapping for dumb reasons all the time. For the remaining reasons, do you have a sense of what the language deficiencies are which make that feature helpful? I see it for somewhat sane reasons when crossing language boundaries. Does anything else stand out?

> ceremony

Yes. Thankfully ".?" is greppable, but I wouldn't mind more ceremony there, and less for `try` coding patterns.

jbggs · 2025-01-25T06:48:16 1737787696

you shouldn't be unwrapping, error cases should be properly handled. users shouldn't see null dereference errors without any context, even in cli tools...

hansvm · 2025-01-25T22:19:04 1737843544

That too, as a general coding pattern. I was commenting on the criticism of Zig as a sub-par system's language though, contrasting with a language most people with that opinion seem to like.

igorguerrero · 2025-01-25T06:33:57 1737786837

You could build the same thing in Rust and have the same exact issue.

audunw · 2025-01-25T12:25:20 1737807920

If that kind of stuff is always preferable, the nobody would use C over C++, yet to this day many projects still do. Borrow checking isn’t free. It’s a trade-off.

I mean, you could say Rust isn’t a modern language because it doesn’t use garbage collection. But it’s a nonsensical statement. Different languages serve different purposes.

Besides, Zig is focusing a lot more on heavily integrating testing, debug modes, fuzzing, etc. in the compiler itself, which when put together will catch almost all of the bugs a borrow checker catches, but also a whole ton of other classes of bugs that Rust doesn’t have compile time checks for.

I would probably still pick Rust in cases where it’s absolutely critical to avoid bugs that compromise security.

But this project isn’t that kind of project. I’d imagine that the super fast compile times and rapid iteration that Zig provides is much more useful here.

steeve · 2025-01-25T11:12:04 1737803524

That has absolutely nothing to do with RAII or safety…

psanchez · 2025-01-25T07:14:56 1737789296

I think this is a really cool project. Scrapping aside, I would definitely use this with playwright for end2end tests if it had 100% compatibility with chrome and ran with a fraction of the time/memory.

At my company we have a small project where we are running the equivalent of 6.5 hours of end2end tests daily using playwright. Running the tests in parallel takes around half an hour. Your project is still in very early stages, but assuming 10x speed, that would mean we could pass all our tests in roughtly 3 min (best case scenario).

That being said, I would make use of your browser, but would likely not make use of your business offering (our tests require internal VPN, have some custom solution for reporting, would be a lot of work to change for little savings; we run all tests currently in spot/preemptible instances which are already 80% cheaper).

Business-wise I found very little info on your website. "4x the efficiency at half the cost" is a good catch phrase, but compared to what? I mean, you can have servers in Hetzner or in AWS and one is already a fraction of the cost of the other. How convenient is to launch things on your remote platform vs launch them locally or setting it up? does it provide any advantages in the case of web scrapping compared to other solutions? how parallelizable is it? Do you have any paying customers already?

Supercool tech project. Best of luck!

fbouvier · 2025-01-25T09:57:16 1737799036

Thank you! Happy if you use it for your e2e tests in your servers, it's an open-source project!

Of course it's quite easy to spin a local instance of a headless browser for occasional use. But having a production platform is another story (monitoring, maintenance, security and isolation, scalability), so there are business use cases for a managed version.

weinzierl · 2025-01-24T16:38:50 1737736730

If I don't need JavaScript or any interactivity, just modern HTML + modern CSS, is there any modern lightweight renderer to png or svg?

Something in the spirit of wkhtmltoimage or WeasyPrint that does not require a full blown browser but more modern with support of recent HTML and CSS?

In a sense this is Lightpanda's complement to a "full panda". Just the fully rendered DOM to pixels.

nicoburns · 2025-01-24T19:31:45 1737747105

We're working on this here: https://github.com/DioxusLabs/blitz See the "screenshot" example for rendering to png. There's no SVG backend currently, but one could be added.

(proper announcement of project coming soon)

dang · 2025-01-24T22:17:07 1737757027

(This was on the frontpage as https://news.ycombinator.com/item?id=42812859 but someone pointed out to me that it had been a Show HN a few weeks ago: https://news.ycombinator.com/item?id=42430629, so I've made a fresh copy of that submission and moved the comments hither. I hope that's ok with everyone!)

cropcirclbureau · 2025-01-24T16:18:33 1737735513

Pretty cool. Do you have a list of features you plan to support and plan to cut? Also, how much does this differ from the DOM impls that test frameworks use? I recall Jest or someone sporting such a feature.

fbouvier · 2025-01-24T16:46:49 1737737209

The most important "feature" is to increase our Web APIs coverage :)

But of course we plan to add others features, including

- tight integration with LLM

- embed mode (as a C library and as a WASM module) so you can add a real browser to your project the same way you add libcurl

andrethegiant · 2025-01-24T18:02:48 1737741768

Could it potentially fit in a Cloudflare worker? Workers are also V8 and can run wasm, but are constrained to 128MB RAM and 10MB zipped bundle size

fbouvier · 2025-01-24T18:21:33 1737742893

WASM support is not there yet but it's on the roadmap and we had it in our mind since the beginning of the project, and have made our dev choices accordingly.

So yes it could be used in a serverless platform like Cloudflare workers. Our startup time is a huge advantage here (20ms vs 600ms for Chrome headless in our local tests).

Regarding v8 in Cloudflare workers I think we can not used directly, ie. we still need to embed a JS engine in the wasm module.

gwittel · 2025-01-24T20:28:02 1737750482

Interesting. Looks really neat! How do you deal with anti bot stuff like Fingerprintjs, Cloudflare turnstile, etc? Maybe you’re new enough to not get flagged but I find this (and CDP) a challenge at times with these anti-bot systems.

zlagen · 2025-01-24T22:25:51 1737757551

what do you think would be the use cases for this project? being lightweight is awesome but usually you need a real browser for most use cases. Testing sites and scraping for example. It may work for some scraping use cases but I think that if the site uses any kind of bot blocking this is not going to cut it.

fbouvier · 2025-01-24T23:07:37 1737760057

There are a lot of uses cases:

- LLM training (RAG, fine tuning)

- AI agents

- scraping

- SERP

- testing

- any kind of web automation basically

Bot protection of course might be a problem but it depends also on the volume of requests, IP, and other parameters.

AI agents will do more and more actions on behalf of humans in the future and I believe the bot protection mechanism will evolve to include them as legit.

zlagen · 2025-01-25T02:42:47 1737772967

thanks, it doesn't seem like it's the direction it's going at the moment. If you look at the robots.txt of many websites, they are actually banning AI bots from crawling the site. To me it seems more likely that each site will have its own AI agent to perform operations but controlled by the site.

m3kw9 · 2025-01-24T16:19:27 1737735567

How does this work because the browser needs to render a page and the vision model needs to know where a button is, so it still needs to see an image. How does headless make it easier?

katiehallett · 2025-01-24T16:29:05 1737736145

Headless mode skips the visual rendering meant for humans, but the DOM structure and layout still exist, allowing the model to parse elements programmatically (e.g. button locations). Instead of 'seeing' an image, the model interacts with the page's underlying structure, which is faster and more efficient. Our browser removes the rendering engine as well, so it won't handle 100% of automation use cases, but it's also what allows us to be faster and lighter than Chrome in headless mode.

10000truths · 2025-01-24T20:28:48 1737750528

The issue is that DOM structure does not correspond one-to-one with perceived structure. I could render things in the DOM that aren't visible to people (e.g. a transparent 5px x 5px button), or render things to people that aren't visible in the DOM (e.g. Facebook's DOM obfuscation shenanigans to evade ad-blocking, or rendering custom text to a WebGL canvas). Sure, most websites don't go that far, but most websites also aren't valuable targets for automated crawling/scraping. These kinds of disparities will be exploited to detect and block automated agents if browser automation becomes sufficiently popular, and then we're back to needing to render the whole browser and operate on the rendered image to keep ahead of the arms race.

emporas · 2025-01-25T09:48:56 1737798536

Servers operate on top of tcp/ip not to serve information, rather to serve information plus something else, usually ads. This is usually implemented with websites and captchas n stuff.

That's a problem of misaligned economic incentives. If there is a blockchain which enables micro-transactions of 0.000001 cent per request, and in the order of a million tps or a billion tps, then servers have no reason not to accept money in exchange for information, instead of using ads to extract some eyeball attention.

There is no reason that i cannot invoke a command line program: `$fetch_social_media_posts -n 1000` and get the last thousand posts right there in the console, as long as i provide some valid transactions to the server.

Websites and ads are the wrong solution to the problem of gaining something while serving information, and headless browsers and scraping are the wrong solution to the first wrong solution and the problems it creates.

aniviacat · 2025-01-25T15:37:46 1737819466

No need for blockchain, microtransaction functionality should be integrated into our existing payment methods.

emporas · 2025-01-25T19:15:12 1737832512

Existing payment methods, paypal, google pay etc, have been absolutely crucial for internet payments, but the micro in the word never ends.

If there are internet payments with a minimum payment of 1 cent, then we need payments of 0.1 cents. If that's achieved, then we need 0.01 cents minimum transaction. The micro in the transaction always needs to be smaller (and faster).

Free competition (or perfect competition) over a well defined landscape, internet protocols that is, has proven to always deliver better quality goods and lower price. Money derived from governments is far, far from free competition, let alone well defined internet protocols, and there is a point in which existing payment methods get stuck and cannot deliver smaller transactions.

I don't personally know where and when that point is, but if i have to guess, existing payment methods have reached that minimum point for at least a decade. In other words, their transactions minimums have to be high enough for them to make a profit. Yes, they can implement microtransactions, but they will not be profitable.

wiradikusuma · 2025-01-24T17:16:00 1737738960

But what if the human programmer needs to visually verify that their code works by eyeballing which element got selected, etc?

fbouvier · 2025-01-24T17:26:47 1737739607

You're right, the debugging part is a good use case for graphical rendering in a headless environment.

I see it as a build time/runtime question. At build (dev) time I want to have a graphical response (debugging, computer vision, etc.). And then, when the script is ready, I can use Lightpanda at runtime as a lightweight alternative.

codetrotter · 2025-01-25T00:15:24 1737764124

I was doing a personal side project for a while where I was trying to make my own little Wayback Machine-alike. Mine was very rudimentary, built on top of Firefox and WebDriver plus Squid proxy.

For debugging purposes you could have your headless browser function as a HTTP Proxy Server, maybe? And in your headless browser you could capture a static snapshot of the DOM after your JavaScript runtime has executed the scripts for the page. Similar to how the archive.today guy serves static snapshots of websites. And then developers using your headless browser could point their Firefox or Chrome browser to the HTTP Proxy server hosted by your headless browser program, in order to get a static snapshot view of what the DOM is like after your headless browser has executed JavaScript from the page. And then Firefox or Chrome will render static HTML view of what the page looked like to your headless browser, that the developer can inspect to make decisions about further interactions with the page. As a tool for debugging.

chrisweekly · 2025-01-24T17:29:47 1737739787

If you want a human to eyeball it, you don't use a "headless" browser.

dolmen · 2025-01-24T17:29:51 1737739791

The human programmer can save the DOM as HTML in a file and open it in a headfull browser.

But the use case for Lightpanda is for machine agents, not humans.

cratermoon · 2025-01-24T18:13:20 1737742400

So is this the scraper we need to block? https://news.ycombinator.com/item?id=42750420

fbouvier · 2025-01-24T18:41:30 1737744090

I fully understand your concern and agree that scrapers shouldn't be hurting web servers.

I don't think they are using our browser :)

But in my opinion, blocking a browser as such is not the right solution. In this case, it's the user who should be blocked, not the browser.

jjcoffman · 2025-01-24T18:48:24 1737744504

If your browser doesn't play nicely and obey robots.txt when its headless I don't think it's that crazy to block the browser and not the user.

fbouvier · 2025-01-24T19:16:29 1737746189

Every tool can be used in a good or bad way, Chrome, Firefox, cURL, etc. It's not the browser who doesn't play nicely, it's the user.

It's the user's responsibility to behave well, like in life :)

hansvm · 2025-01-25T06:00:41 1737784841

The first thing that came to mind when I saw this project wasn't scraping (where I'd typically either want a less detectible browser or a more performant option), but as a browser engine that's actually sane to link against if I wanted to, e.g., write a modern TUI browser.

Banning the root library (even if you could with UA spoofing and whatnot) is right up there with banning Chrome to keep out low-wage scraping centers and their armies of employees. It's not even a little effective also risks significant collateral damage.

slt2021 · 2025-01-24T20:00:17 1737748817

it is trivial to spoof user-agent, if you want to stop a motivated scraper, you need a different solution that exploits the fact that robots use headless browser

sangnoir · 2025-01-24T20:12:13 1737749533

> it is trivial to spoof user-agent

It's also trivial to detect spoofed user agents via fingerprinting. The best defense against scrapers is done in layers, with user-agent name block as the bare minimum.

Kathc · 2025-01-24T16:43:22 1737737002

An open-source browser built from scratch is bold. What inspired the development of Lightpanda?

katiehallett · 2025-01-24T16:49:32 1737737372

Thanks! The three of us worked together at our former company - ecomm saas start up where we spent a ton of $ on scraping infrastructure spinning up headless Chrome instances.

It started out as more of an R&D thesis - is it possible to strip out graphical rendering from Chrome headless? Turns out no - so we tried to build it from scratch. And the beta results validated the thesis.

I wrote a whole thing about it here if you're interested in delving deeper https://substack.thewebscraping.club/p/rethinking-the-web-br...

corford · 2025-01-24T22:49:48 1737758988

Not sure what category of ecomm sites you were scraping but I scrape >10million ecomm URLs daily and, honestly, in my experience the compute is not a major issue (8 times out of 10 you can either use API endpoints and/or session stuffing to avoid needing a browser for every request; and in the 2 out of 10 sites where you really need a browser for all requests it's usually to circumvent aggressive anti-bot which means you're very likely going to need full chrome or FF anyway - and you can parallelise quite effectively across tabs).

One niche where I could definitely see a use for this though is scraping terribly coded sites that need some JS execution to safely get the data you want (e.g. they do some bonkers client side calculations that you don't want to reverse engineer). It would be nice to not pay the perf tax of chrome in these cases.

Having said all of that, I have to say from a geek perspective it's super neat what you guys are hacking on! Zig+V8+CDP bindings is very cool.

hansvm · 2025-01-25T05:54:16 1737784456

> not pay the perf tax

I've typically used pyminiracer in such cases and provided some dummy window objects and whatnot as necessary for the script to succeed.

zlagen · 2025-01-25T02:45:36 1737773136

fully agree here, using a browser for everything is the dumb way. You just usually use it to circumvent the blocking and then reuse the cookies to call the endpoints directly.

fbouvier · 2025-01-25T15:12:10 1737817930

It might works if you need to handle a few websites. But this retro engineering approach is not maintainable if you want to handle hundreds or thousands of websites.

dolmen · 2025-01-24T17:25:33 1737739533

Scrapping modern web pages is hard without full support for JS frameworks and dynamic loading. But a full browser, even headless, has huge ressource consumption. This has a huge cost when scraping at scale.

zelcon · 2025-01-25T03:38:02 1737776282

Why didn't you just fork Chromium and strip out the renderer? This is guaranteed to bitrot when the web standards change unless you keep up with it forever and have perpetual funding. Yes, modifying Chromium is hard, but this seems harder.

fbouvier · 2025-01-25T09:36:16 1737797776

It was my first idea. Forking Chromium has obvious advantages (compatibility). But it's not architectured for that. The renderer is everywhere. I'm not saying it's impossible, just that it did look more difficult to me than starting over.

And starting from scratch has other benefits. We own the codebase and thus it's easier for us to add new features like LLM integrations. Plus reducing binary size and startup time, mandatory for embedding it (as a WASM module or as C lib).

oever · 2025-01-25T19:39:21 1737833961

The Chromium/Webkit renderer used to have multiple rendering backends. You might use or add a no-op backend.

cxr · 2025-01-25T05:09:43 1737781783

> modifying Chromium is hard, but this seems harder

Prove it.

tetris11 · 2025-01-25T04:27:19 1737779239

Why do anything: because it shows what's possible, and makes the next effort that much more easier.

I call this process of frontier effort and discovery: "science"

zelcon · 2025-01-25T04:35:43 1737779743

Redoing what others have already done is not what I think of when I hear "frontier effort"

the__alchemist · 2025-01-25T02:28:45 1737772125

I have a meta question from browsing the repo: Why do C, C++, and Zig code bases, by convention, include a license at the top of every module" IMO it makes more sense to insetead include of an overview of the module's purpose, and how it fits in with the rest of the program, and one license at the top-level, as the project already has.

AndyKelley · 2025-01-25T02:37:29 1737772649

100% of my projects, including the Zig compiler itself, have only the license file at the root of the project tree, except of course for files that were copy pasted from other projects.

evanjrowley · 2025-01-25T16:34:09 1737822849

I'm interested to see if this could be made to work as a drop-in replacement for the headless Chromium that Hoarder uses to archive web content. I don't have a problem with the current Hoarder solution, but it would be nice to use something that requires less RAM.

surfmike · 2025-01-24T20:18:27 1737749907

Another browser in this space is https://ultralig.ht/, it's geared for in-game UI but I wonder how easy it would be to retool it for a similar use case.

kavalg · 2025-01-24T17:48:47 1737740927

Why AGPL? I am not blaming you. I am just curious about the reasoning behind your choice.

fbouvier · 2025-01-24T18:10:56 1737742256

We had some discussions about it. It seems to us that AGPL will ensure that a company running our browser in a cloud managed offer will have to keep its modifications open for the community.

We might be wrong, maybe AGPL will damage the project more than eg. Apache2. In that case we will reconsider our choice. It's always easier this way :)

Our underlying library https://github.com/lightpanda-io/zig-js-runtime is licensed with Apache2.

optixyt · 2025-01-25T10:11:02 1737799862

The second social media botters find this.

randomMatrix101 · 2025-01-25T08:57:15 1737795435

Very cool project, congrats guys!

stuckkeys · 2025-01-25T04:49:34 1737780574

How does it do against captchas?

monkmartinez · 2025-01-24T17:29:13 1737739753

This is pretty neat, but I have to ask; Why does everyone want to build and/or use a headless browser?

When I use pyautogui and my desktop chrome app I never have problems with captchas or trigger bot detectors. When I use a "headless" playwright, selenium, or puppeteer, I almost always run into problems. My conclusion is that "headless" scrapping creates more problems than it solves. Why don't we use the chrome, firefox, safari, or edge that we are using on a day to day basis?

fbouvier · 2025-01-24T17:43:11 1737740591

I guess it depends on the scale of your requests.

When you want to browse a few websites from time to time, a local headful browser might be a solution. But when you have thousands or millions of webpages, you need a server environment and a headless browser.

fbouvier · 2025-01-24T17:51:39 1737741099

In the past I've run hundreds of headful instances of Chrome in a server environment using Xvfb. It was not a pleasant experience :)