Server-side sandboxing: Containers and seccomp (figma.com)
226 points by emilsjolander 11 months ago | 44 comments



If you are looking to self-host a scalable backend that runs arbitrary code in Python/TypeScript/Bash/Go, with optional sandboxing via nsjail as Figma does: nsjail is what we use as the isolation layer at https://windmill.dev (open-source alternative to Retool/Airplane).

(Our python nsjail config for instance: https://github.com/windmill-labs/windmill/blob/main/backend/...)


nsjail author here (the original one, as the tool is also maintained by others), good job!

Irrelevant nit: .proto files are protobuf definition files (like this one: https://github.com/google/nsjail/blob/master/config.proto). A text representation of specific protobuf contents typically gets one of these extensions (as per man clang-format): .textpb, .pb.txt or .textproto. I use .config for the examples distributed with nsjail, but that's licentia poetica :)


The wonders of HN strike again. Thank you for this amazing piece of technology that is nsjail. Nsjail is core to our security; our multi-tenant setup would be so slow without it, and I think we're one of the applications that leverages it in a way that showcases nsjail to its full extent (as in, we beat container/Firecracker cold starts by a fair margin while keeping most of their benefits). That's one of the reasons we're an order of magnitude more efficient than Airplane, which uses Fargate under the hood. I would love to chat if you have time; my email is in my profile.


Oddly enough the canonical extension seems to now be none of those but .txtpb:

https://protobuf.dev/reference/protobuf/textformat-spec/#tex...


Why not just use json?


I may be mistaken, but does JSON offer the ability to define a schema with default values? Utilizing a single .proto file, I can tackle both the issues of default values and configuration structure, eliminating the need to manually check for missing mandatory sections.

However, I presume there are now JSON extensions that provide similar functionality?
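
For what it's worth, plain JSON itself has no notion of defaults or required fields, so with JSON you usually end up with a small layer on top. (JSON Schema does have a `default` keyword, but as far as I know it is only an annotation and validators are not required to fill it in.) A minimal sketch of such a layer in Python, assuming a hypothetical jail config whose field names (`exec_path`, `hostname`, `time_limit`) are made up for illustration and are not nsjail's actual schema:

```python
import json

# Hypothetical schema: field name -> (required?, default value).
SCHEMA = {
    "exec_path": (True, None),
    "hostname": (False, "jail"),
    "time_limit": (False, 600),
}


def load_config(text):
    """Parse JSON, reject unknown or missing fields, then fill in defaults."""
    raw = json.loads(text)
    unknown = set(raw) - set(SCHEMA)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    config = {}
    for name, (required, default) in SCHEMA.items():
        if name in raw:
            config[name] = raw[name]
        elif required:
            raise ValueError(f"missing required field: {name}")
        else:
            config[name] = default
    return config
```

With a .proto file, the structure and the defaults live in one definition and the generated parser does this work; here it is hand-written and has to be kept in sync manually.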


Running arbitrary user code inside a jail that doesn't isolate networking might not be enough isolation. Also, bind-mounting paths from the host into the jail's mount namespace increases the attack surface. Great for some use cases, but multi-tenant workloads might need a tighter setup? I'm definitely going to give Windmill a try. It looks really cool!


Wow, this nsjail setup is now part of your opensource version? Last I tried Windmill there was no isolation mechanism for scripts on the free version.


It's pretty easy to apply seccomp to a process using systemd by adding SystemCallFilter= in its unit file. There's a reasonable set of permitted syscalls for general system processes, aptly called `@system-service`, but you can tweak that to suit your needs [1]. I generally use this, among other settings, to further lock down system services [2].

[1] https://www.freedesktop.org/software/systemd/man/latest/syst...

[2] https://www.redhat.com/sysadmin/mastering-systemd
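
To make the above concrete, here is a rough sketch in Python that renders a `[Service]` drop-in using `SystemCallFilter=`. The helper name `hardening_dropin` is hypothetical, and the extra directives (`SystemCallErrorNumber=`, `NoNewPrivileges=`) are just one plausible choice of "other settings", not necessarily the ones the parent uses:

```python
def hardening_dropin(extra_denied=()):
    """Render a [Service] drop-in that applies a seccomp allow-list via systemd."""
    lines = [
        "[Service]",
        # Start from systemd's predefined allow-list for ordinary services.
        "SystemCallFilter=@system-service",
    ]
    # A leading "~" inverts the list: these calls or groups are removed
    # from the allowed set established by the line above.
    if extra_denied:
        lines.append("SystemCallFilter=~" + " ".join(extra_denied))
    lines += [
        # Return EPERM for a filtered call instead of killing the process.
        "SystemCallErrorNumber=EPERM",
        "NoNewPrivileges=yes",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(hardening_dropin(["@privileged", "@resources"]), end="")
```

The printed text would go into a drop-in file for the unit you want to lock down; `systemd-analyze security` can then be used to see how the unit scores.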


Yep, can recommend systemd in this case, really easy to apply basic hardening to services that just works.


I haven't used seccomp, but have recently been playing around with the Linux pledge port[1]. It has a very friendly UI, but I still struggled with allowing some complex apps to run at all, because of the sheer number of syscalls and devices they required. Digging through a mountain of strace output is tedious...

Can someone with experience with both comment on how (the Linux port of) pledge compares to seccomp? Can it be considered a replacement at this point?

It seems like it could handle the last scenario described in the article fine, since it allows setting granular rwcx permissions on individual paths.

[1]: https://justine.lol/pledge/


Pledge works well if the software developers implement it on their own application.

It also works well if the software developers document what syscalls they rely on and what permissions they need.

When it comes to retrofitting something like pledge (or seccomp) into an existing application when you've not developed it and/or can't easily tell what syscalls are being called then it's always a nightmare.

It doesn't really matter if it's pledge or seccomp at that point (although undoubtedly seccomp is far harder to make use of), if you're doing this kind of security by retroactive whitelist, you're going to have trouble making it work. It's going to take time and effort to implement.


> When it comes to retrofitting something like pledge (or seccomp) into an existing application when you've not developed it and/or can't easily tell what syscalls are being called then it's always a nightmare.

Quite the contrary. If the software in question has been written in a remotely sane way, adding some basic pledge restrictions is a matter of adding one line: pledge("stdio rpath whatever you need", NULL) - it usually goes somewhere in main, after setup() but before while(!quit).

You can usually figure out the permission set within a few attempts, even without a very good understanding of the internals, as most (sane) programs will do only a couple of things: an httpd needs to accept connections, read static files, write logs, etc; a window manager needs to talk to X11, open font files, etc; of course there are also complex beasts like Chrome but that one has been done as well.

The *real* challenge is breaking up a complex program (e.g. a streaming music player) into separate processes that are concerned with just one or two things, e.g. separate process to make requests over the network, a separate one to decode media, another to maintain an on-disk cache, and so on. Placing restrictions on these subprocesses is the easy part; figuring out where to draw these lines is what's hard.

https://man.openbsd.org/pledge.2


I think realistically most of the things people seriously care about pledging are complex applications and server software, so yes, the point is that you run into problems trying to pledge things. Obviously you can come up with the minimum set of things you can pledge pretty quickly to get things working, but that doesn't guarantee they will keep working. And you can also be very broad, but there's a point at which that doesn't gain you much in terms of security (or as much as you may have set out to gain).

This reminds me of a situation where I tried to use firejail to isolate this proprietary piece of software. I ran firejail in the "auto-generate-something-sensible" mode and then tried running the software with that profile. It would just randomly break at that point. I never quite figured this out due to lack of time. I was expecting to be able to roll with an auto-generated profile at first and tighten it later; the actual end result was that there was no profile at all.

The other issue you run into is getting things which work sometimes and then stop working randomly. Especially when it's a large graphical application. It will do something strange when you click a specific button and crash. Now you are annoyed, probably not in the mood to debug this, so maybe you make a note for later. Now you have to recreate the issue under strace, figure out what you need to pledge now, and repeat.

Yes, if you're trying to pledge ls, it's pretty easy. If you're trying to pledge anything non-trivial (i.e. anything which would _really_ benefit from these security restrictions) you end up iterating a lot.

It's not very fun.


> This reminds me of a situation where I tried to use firejail to isolate this proprietary piece of software [...].

This is exactly the point where the experience between pledge and e.g. firejail drastically diverges. The entire reason why pledge is so nice, is because it assumes source access. You can use execpromises to "jail" something you don't control, but the set of promises is always going to be unnecessarily broad, as even the sanest software out there often needs a tiny backflip before it can enter the main loop. Source access also means you have the means to investigate what exactly went wrong, or to actually fix the stupidity (rather than broadening the privileges).

The number of things you can do to a proprietary blob to contain it is pretty limited - by definition! I think using a container/VM to completely isolate it would be a better call.

> It will do something strange when you click a specific button and crash. Now you are annoyed, probably not in the mood to debug this, so maybe you make a note for later.

When you do pledge("... error", ...), the app will get errors from disallowed syscalls, rather than a SIGABRT, which is useful when you're not sure. Mis-handling an error can still blow things up, but that's a sign that maybe the overall code quality is not great. In any case, yes you do basically need to be running the application under ktrace (/strace) for as long as you're not certain.

Actually, what I think would be great is a "tracing mode" for pledge, where the kernel reports violations (PID + syscall with params + suggested promise), maybe even takes a core snapshot at each violation, but otherwise doesn't hurt the application.

> If you're trying to pledge anything non-trivial (i.e. anything which would _really_ benefit from these security restrictions) you end up iterating a lot.

Indeed, but that's a curse of complex software, not a shortcoming of pledge itself. OpenBSD introduced pledge and contained almost the entire base system within a single release cycle.

I personally think that as an industry, software is going through a crisis of explosive complexity. I see efforts like pledge as a mirror, through which that complexity stares back at us, in all of its ugliness. Blaming the mirror does not address the issue.


>not a shortcoming of pledge itself

I was never trying to imply that pledge had shortcomings.

I don't think we're disagreeing on anything here anyway.

The point I was making is that if you've got a large and complicated piece of software, which you didn't write yourself, which wasn't written with the intention of someone implementing a syscall filter for it, you will have a bad time. It's not quite as bad as if you have the code but it's always going to be pretty bad regardless.

I think pledge is great, and the rollout was really good (I use OpenBSD for my home router and for some other infrastructure). The OpenBSD developers were in the beneficial position that they are already familiar with the source code for their base userland, they already regularly audit and maintain security improvements for it. Also noteworthy is the fact that most of the OpenBSD base is (intentionally) not formed of extremely complex software.


There's no tracing tool to build policy with pledge? Seems like an obvious area to add functionality if it doesn't exist.

Commercial tools have had it for a long time.. even automatic profiling. Either explicitly profile during a test stage, which is best, or profile-on-first-observation.

In the full automatic mode, which is not optimal but is least effort, any operation performed in the first XX minutes/hours/days are considered 'allowed behavior' and anything after that is denied. Then it will either enforce or 'wait-to-enforce' where enforcement mode only turns on if there are no policy violations in the next XX configurable units of time.
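
A toy model of that full automatic mode, in Python, in case it helps: everything seen during the learning window is recorded and allowed, and anything new after the window is denied. The class name and the use of plain seconds are assumptions for illustration; no particular commercial tool is being reproduced here, and the 'wait-to-enforce' variant is left out.

```python
class LearnThenEnforce:
    """Allow everything seen during a learning window, deny anything new afterwards."""

    def __init__(self, learn_seconds):
        self.learn_seconds = learn_seconds
        self.start = None      # time of the first observed operation
        self.allowed = set()   # operations recorded while learning

    def check(self, operation, now):
        """Return True if `operation` is permitted at time `now` (in seconds)."""
        if self.start is None:
            self.start = now
        if now - self.start < self.learn_seconds:
            # Still profiling: record the operation and let it through.
            self.allowed.add(operation)
            return True
        # Enforcing: only operations observed during the window are allowed.
        return operation in self.allowed
```

Note that the policy is only as good as the coverage of the learning window: any legitimate but rare operation that did not occur while learning gets denied later.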


1. You really need to understand the application more than that. Does ls need network sockets? Sure does, if you have yp enabled. But this won't appear in your trace unless you trace in such an environment. (Although pledge on openbsd transparently handles this case for you.)

2. Just because a program makes a system call doesn't mean it should. Or should at that moment. A lot of late initialization can be done earlier for tighter policies. Auto traced policies tend to be extremely broad, permitting too much stuff.


And it’s not just your application, too: it’s what you depend on. Surely your command line application that works with a handful of files is fine with read/write? Your libc might be using something else, like the iovec-based readv/writev or pread/pwrite, under the hood.
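
A small Python sketch of that point: the same logical "read the first N bytes" can reach the kernel through several different calls, and a filter that only permits one of them will break the others. The function name is made up, and the exact syscall names you would see in a trace depend on platform and libc (for example, pread typically shows up as pread64 on x86-64 Linux).

```python
import os


def read_three_ways(path, size):
    """Read the first `size` bytes of `path` through three different OS-level calls."""
    fd = os.open(path, os.O_RDONLY)
    try:
        plain = os.read(fd, size)           # read(2): uses and advances the file offset
        positioned = os.pread(fd, size, 0)  # pread(2): explicit offset, file offset untouched
        os.lseek(fd, 0, os.SEEK_SET)        # lseek(2): yet another call, needed to rewind
        buf = bytearray(size)
        n = os.readv(fd, [buf])             # readv(2): vectored read into a list of buffers
        vectored = bytes(buf[:n])
    finally:
        os.close(fd)
    return plain, positioned, vectored
```

All three results are the same bytes, but an allow-list built from a trace of only the first variant would not cover the other two.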


We too ended up adding pledge and unveil to Nanos.

Seccomp and seccomp-bpf are indeed way too limiting. They weren't really designed for end app developers, who are, imo, the ones that should be dictating the policy. The lack of pointer dereferencing in particular makes it really difficult for application-level developers to write policies.

The promises arg in pledge, https://man.openbsd.org/pledge.2 , does a decent job of grouping related calls together but I think there is a ton of room to make all of this a lot better than it is today.


>Digging through a mountain of strace output is tedious

Did you consider logging the syscalls invoked during normal usage with 'strace --output=/some/file -f ...'? This + grep + uniq should make it really simple.
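
If grep + uniq gets unwieldy, the same idea is a few lines of Python. This is only a sketch and assumes strace's default line format (no timestamp options), with an optional PID prefix as produced by -f; the function name is made up.

```python
import re
from collections import Counter

# Matches the syscall name at the start of an strace line, with an optional
# PID prefix ("1234 " in -o files, "[pid 1234] " on stderr).
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(lines):
    """Count syscall names in strace output.

    Signal lines ("--- SIGCHLD ... ---"), exit lines ("+++ exited ... +++")
    and "<... name resumed>" continuations do not match and are skipped, so
    an interrupted call is counted once, at its "<unfinished ...>" line.
    """
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the strace output file gives the set of syscalls to start a policy from. strace's own -c flag prints a per-syscall summary as well, if you only need the counts.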


Despite the availability of linux pledge, and frequent comments mentioning its existence, I'm not aware of many people using it.


I think it would acquire more usage if it was part of mainline Linux distros. As far as I can tell people must feel like this is some kind of optional, nonstandard thing.

It works well with OpenBSD because it's standard, and most if not all OpenBSD packages make use of it.

Though maybe I'm misunderstanding how Linux pledge works. I'm only familiar with the openbsd usage of it


So what's the difference between nsjail[1] and bubblewrap[2]?

[1] https://github.com/google/nsjail [2] https://github.com/containers/bubblewrap


bubblewrap aims to be reasonably secure by default but leaves sleeping soundly at night as an exercise to the reader. It's not exhaustive. It's more of a blast radius/convenience tool. Conversely nsjail aspires to facilitate sleeping soundly out of the box, with security as the primary motivating factor.


I don't have extensive experience with nsjail, but from reading the docs it seems to me like nsjail covers namespaces, cgroups and virtual networking, while bwrap only covers namespaces. On the other hand, bwrap is deliberately kept simple because it is SUID.


Are operating systems failing at their jobs if one can't run independent workloads on them anymore?

It seems like something is broken, and we are all patching things up piecemeal.


It’s actually not that hard to make OSes that run completely independent workloads. The problem is that this is not useful.


Yes and yes.


We had seccomp containers at Dropbox, and I remember Max Serrano helping me set that up with ReactServer :) Talented engineer, though I do remember that the jails were kind of a maintenance nightmare for the security team.


"seccomp containers" sounds weird... like what is a container in Linux anyway :D


Good intro. I'd be curious how they do the syscall tracing, eg, strace logs as part of CI?

Funnily enough, we've gone the reverse path for LLM AI-generated code sandboxing for louie.ai / Graphistry. We started with container isolation with careful network, volume, compute etc. enablement first, and are only now adding nsjail to the runners within the container as an extra defense layer.

The negative space is interesting too. We initially explored alternatives like wasm (too slow and underpowered for our generated python GPU analytics workloads) and firecracker vm (too unwieldy and unportable for our small team). As we do more k8s and enable more interactive data viz customization + web-scale static serving, would love to revisit both.

On which note, we have a bit of budget for someone to help harden the nsjail layer, if of interest!


I have yet to find a firecracker-style thing for k8s that is simple to deploy. Firekube seemed interesting, but is archived...

Liquid Metal from Weaveworks seems interesting but I don't even know where I would start.


Virtink[1] has been reasonably stable as long as you're okay with Cloud Hypervisor instead.

[1] https://github.com/smartxworks/virtink


Kata just released a new version; it is the only thing that I've found easy to set up with k8s... though my experience running Docker-in-Kata hasn't been very good.


We were looking at Kata as well, especially as an 'easier' firecracker, though I forgot why we didn't go further, and I've been curious why they seem to get so little attention in practice?

I believe part is portability, as they may require nested virtualization features to be available, and maybe QEMU overhead. Maybe also something about use in China vs elsewhere?

They (and QEMU) have been around a long time and some major companies are supporting it...


Seccomp BPF is great. There were some recent issues due to io_uring and extensible syscalls, but I believe those issues are avoidable for now.

I believe the next generation looks something like landlock (https://docs.kernel.org/userspace-api/landlock.html).


I love the ideas behind Landlock, but I don't fully see the struggle currently, setting aside the issues with the io_uring API. Seccomp combined with AppArmor or SELinux is nowadays enough even for nested rootless containers, even ones nested inside standard runc containers. Both AppArmor and seccomp profiles are stackable. If you don't need to generate unique profiles for each container, you should be fine...


+1


seccomp is heaps better than SELinux, but still too complicated to use in everyday production unless you're truly on the "refine and secure" path or dealing with high-stakes sandboxing.


Those are two very different things. Everyone uses seccomp when they don't have an AppArmor-only profile, and sometimes they even use both.


Nice, love seccomp though as with all things security it can be very fiddly.


Is there an isolation method close in functionality to nsjail but for .NET code? I know I can protect my AppDomain, but how do I protect the system/network from rogue .NET code?


Check out systemd-nspawn. Built in and works great!



