
Thanks for writing the article and explaining some background here.

Are you familiar with exokernels? They were an attempt to remove abstractions from kernel land and leave that to applications.

See https://www.classes.cs.uchicago.edu/archive/2019/winter/3310...

That way innovation can be much faster, because applications can generally move quicker than kernels.

Btw, I'm not a fan of object orientation; and I don't think our hardware design should be infected by that fad. But I think your criticism of the badly fitting abstraction of flat address spaces is still valid. I am just not sure that 'fast object service' is necessarily the remedy.




I'm not a fan of any architectural radicalism, and tend to think that there are things best done in hardware, things best done in the kernel, and things best done in libraries and applications :-)

That is not to say that the boundaries should be cast in stone; they should obviously be flexible enough that you do not need a complete multi-user management system in a single-service jail or container, nor a full-blown journaled COW storage-manager on a small embedded system.

In other words: I am firmly for the "Software Tools" paradigm.


From the paper:

> The defining tragedy of the operating systems community has been the definition of an operating system as software that both multiplexes and abstracts physical resources. The view that the OS should abstract the hardware is based on the assumption that it is possible both to define abstractions that are appropriate for all areas and to implement them to perform efficiently in all situations. We believe that the fallacy of this quixotic goal is self-evident, and that the operating system problems of the last two decades (poor performance, poor reliability, poor adaptability, and inflexibility) can be traced back to it. The solution we propose is simple: complete elimination of operating system abstractions by lowering the operating system interface to the hardware level.

Basically, they say to let libraries do the abstraction.

The source code of your applications will still mostly look the same as before. It's just that the libraries will do more of the work, and the kernel will do less.


Yes, and I don't (quite) buy that argument, but I understand where it comes from.

The problem starts when you, quite sensibly, implement something like SHA256 in hardware. It is a perfect example of something hardware does better than software.

But Dennis, Ken and Brian didn't think about cryptographic hash-algorithms when they created UNIX, and because UNIX no longer has a recognized architectural authority, nobody provides a timely architecture for such new features; instead we end up with all sorts of hackery, some in kernels, some in libraries and some in applications.

SHA256 should be a standard library API, and if the CPU has a HW implementation, the platform's library should spot that and use it; no need to get the kernel involved, it's just a fancy XOR on purely userland data.
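This is in fact how libraries like OpenSSL already handle it: detect the CPU feature once, then dispatch, all in userland. A minimal sketch of the idea (hypothetical function names, and the two hash bodies are deliberately stubbed; only the detection and dispatch are the point):

```c
/* Pure-userland dispatch between a portable SHA-256 and a hardware one
 * (x86 SHA-NI). The sha256_* bodies are stubs standing in for a real
 * implementation; the kernel is never involved at any point. */
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

typedef void (*sha256_fn)(const uint8_t *msg, size_t len, uint8_t out[32]);

static void sha256_portable(const uint8_t *msg, size_t len, uint8_t out[32]) {
    (void)msg; (void)len;                 /* stub: real compression rounds go here */
    for (int i = 0; i < 32; i++) out[i] = 0;
}

static void sha256_shani(const uint8_t *msg, size_t len, uint8_t out[32]) {
    sha256_portable(msg, len, out);       /* stub: SHA-NI intrinsics go here */
}

static int cpu_has_sha_ni(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned a, b, c, d;
    if (__get_cpuid_count(7, 0, &a, &b, &c, &d))
        return (b >> 29) & 1;  /* CPUID.(EAX=7,ECX=0):EBX bit 29 = SHA extensions */
#endif
    return 0;
}

/* Chosen once at library startup. */
static sha256_fn pick_sha256(void) {
    return cpu_has_sha_ni() ? sha256_shani : sha256_portable;
}
```
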

But SHA256 being a good example does not mean that we should throw out the baby with the bath-water.

Things like file-systems are incredibly ill-suited for userland implementations.

What they don't say in the article is that they will need monolithic "libraries" for things like filesystems, and that to implement things like locking and atomicity, these libraries will have to coordinate amongst the processes which use the filesystem, and must do so without the control and power available to the kernel.

There are ways to do that, see for instance Mach or the original MINIX. It transpires there are disadvantages.

And that's what I mean by "architectural radicalism": Try to use the right tool for the job, and sometimes the kernel is the right tool (filesystems) and sometimes it is not (SHA256).


Which of the disadvantages of microkernel userland filesystems do you think are most important and essential to the concept, and which do you think are a matter of bad implementations? I thought L4 and QNX had pretty reasonable filesystem stories, and even on poor old Linux I've been using FUSE with things like NTFS for years without much trouble. Is it just a matter of the cost of context switching between userland processes when you don't have enough cores?

If it's a question of performance, with enough cores and shared memory that's accessible for atomic operations, I'd think talking to a userland filesystem would just* be a matter of pushing requests onto a lock-free request queue in shared memory from your application process and reading the responses from a lock-free response queue. Of course each application needs its own shared-memory area for talking to the filesystem to get fault isolation.
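The queue half of that is standard single-producer/single-consumer fare. A minimal sketch with C11 atomics (hypothetical types; in a real system the `ring` struct would live in the per-application shared-memory area, with the application pushing and the filesystem process popping):

```c
/* Lock-free SPSC ring: one producer (application) pushes requests, one
 * consumer (userland filesystem) pops them. No kernel crossing per op. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 64  /* must be a power of two */

struct request { uint64_t op; uint64_t arg; };

struct ring {
    _Atomic uint32_t head;  /* advanced only by the consumer */
    _Atomic uint32_t tail;  /* advanced only by the producer */
    struct request slot[RING_SLOTS];
};

/* Producer side: returns false if the ring is full. */
static bool ring_push(struct ring *r, struct request req) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS)
        return false;                      /* full */
    r->slot[tail % RING_SLOTS] = req;
    /* Release: slot contents become visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_pop(struct ring *r, struct request *out) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                      /* empty */
    *out = r->slot[head % RING_SLOTS];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

The response queue is the same structure with the roles reversed.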

Even if it's a matter of IPC message-passing cost on a single core, I think L4 has shown how to make that cheap enough that we should regard putting the filesystem in the kernel as a dubious optimization, and one that's second-best to granting the application mappings on an NVDIMM or something.

Perhaps this is stating the obvious, but I don't think you can get much fault isolation with a pure library filesystem; if all the processes participating in the filesystem are faulty then there's no way to protect the filesystem from fatal corruption from faults. You might be able to reduce the presumed-correct filesystem core process to something like a simplified Kafka: a process that grants other processes read-only access to an append-only log and accepts properly delimited and identified blocks of data from them to append to it.
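To make that "simplified Kafka" core concrete, here is a hypothetical in-memory stand-in: the trusted process does nothing but append delimited records, and clients walk the log themselves from what would, in a real system, be a read-only mapping:

```c
/* The only mutation the trusted core performs is an append; readers
 * never need the core's help. In-memory toy; a real core would back
 * this with a file or NVDIMM region mapped read-only into clients. */
#include <stdint.h>
#include <string.h>

#define LOG_CAP 4096

struct log {
    uint8_t  buf[LOG_CAP];
    uint32_t used;
};

/* Append one delimited record; returns its offset, or -1 if it won't fit. */
static int32_t log_append(struct log *lg, const void *data, uint32_t len) {
    if (lg->used + 4 + len > LOG_CAP)
        return -1;
    uint32_t off = lg->used;
    memcpy(lg->buf + lg->used, &len, 4);        /* length prefix = delimiter */
    memcpy(lg->buf + lg->used + 4, data, len);
    lg->used += 4 + len;
    return (int32_t)off;
}

/* Reader side: fetch the record at `off`; returns the next offset, or 0
 * when past the end of the log. */
static uint32_t log_read(const struct log *lg, uint32_t off,
                         const uint8_t **data, uint32_t *len) {
    if (off >= lg->used)
        return 0;
    memcpy(len, lg->buf + off, 4);
    *data = lg->buf + off + 4;
    return off + 4 + *len;
}
```
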

If we're interested in efficiency and simplicity of mechanism, though, a library filesystem is likely faster and might be simpler than a conventional monolithic filesystem server, particularly a single-threaded one, because you can rely on blocking I/O. And the library might be able to wrap updates to the persistent store in lock-free transactions to reduce the frequency of filesystem corruption.

The Xerox Alto famously used a single-tasking library filesystem similar to MS-DOS, but each sector was sufficiently self-describing that filesystem corruption was usually minor and easy to recover from. The filesystem directory could be reconstructed from the data blocks when required. Neither the Alto nor MS-DOS had to worry about locking, though!

KeyKOS, as you know, took a lot of the ideas from the CAP machine and similar capability machines (and languages like Smalltalk), and implemented them on IBM 370 hardware using its regular MMU, with L4-like lightweight IPCs through the kernel for capability invocations. It went to the opposite extreme from having a library filesystem: each directory and each file was a "domain" of its own, which is to say a single-threaded process. Persistence was handled by a systemwide copy-on-write snapshot of the whole system state, plus a journal-sync call their database used to provide durable transactions. EUMEL and L3 took similar approaches; L4 instead takes persistence and even virtual memory out of the kernel.

I wrote some somewhat sketchy notes on how Flash performance suggests rearchitecting things the other day at https://news.ycombinator.com/item?id=31902551; I know you have a very substantial amount of experience with this as a result of Varnish and your involvement with Fastly. What do you think?

______

* "Just" may be a loaded term here.


Sorry, overlooked your question.

First, I have not been actively involved in Fastly, apart from telling Artur to "go for it!" :-)

With respect to Flash technology I have noted elsewhere in this discussion that today our SSD devices effectively contain a filesystem in order to pretend they are disks, and that stacking two filesystems on top of each other is ... suboptimal.

But as I also just noted, flash isn't just flash; some properties are very hard to generalize, so I tend to think that we will have to let the people who decide what to solder onto the PCB provide at least the wear-levelling.

If I were to design an OS today, I think I would stick firmly with the good ol' UNIX name-hierarchy model, but I would probably split the filesystem layer horizontally into a common and uniform "naming layer" serviced by per-mount "object stores".

If you look at FreeBSD, you will see that UFS/FFS is sorta-split that way, but I would move the cut slightly and think in terms of other primitives which take SSD and networks better into account, but see also: Plan9.

The service I would want from a SSD device is simply:

A) Write object, tell me it's name when written.

B) Read object named $bla

C) Forget object named $bla

Then I'll build my filesystem on top of that.
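That three-call service is small enough to write down. A hypothetical sketch of the interface plus a toy in-memory model (the content-hash naming via FNV-1a is one possible choice; a real device might hand out opaque names of its own):

```c
/* Object-store service: write an object and learn its name, read a
 * named object, forget a named object. A linked list stands in for
 * the flash translation layer in this toy model. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t objname_t;

struct object { objname_t name; void *data; size_t len; struct object *next; };
static struct object *store;

static objname_t fnv1a64(const uint8_t *p, size_t n) {
    uint64_t h = 0xcbf29ce484222325u;
    while (n--) { h ^= *p++; h *= 0x100000001b3u; }
    return h;
}

/* A) Write object, get told its name when written. */
static objname_t obj_write(const void *data, size_t len) {
    struct object *o = malloc(sizeof *o);
    o->name = fnv1a64(data, len);
    o->data = malloc(len);
    memcpy(o->data, data, len);
    o->len = len;
    o->next = store;
    store = o;
    return o->name;
}

/* B) Read object named $bla; returns its length, or 0 if unknown. */
static size_t obj_read(objname_t name, void *out, size_t cap) {
    for (struct object *o = store; o; o = o->next)
        if (o->name == name && o->len <= cap) {
            memcpy(out, o->data, o->len);
            return o->len;
        }
    return 0;
}

/* C) Forget object named $bla. */
static void obj_forget(objname_t name) {
    for (struct object **pp = &store; *pp; pp = &(*pp)->next)
        if ((*pp)->name == name) {
            struct object *o = *pp;
            *pp = o->next;
            free(o->data); free(o);
            return;
        }
}
```

The naming layer then only has to map path names to `objname_t`s.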

(The NVMe crew seems to be moving in the right direction, but it is my impression that some patents prevent them from DTRT, just like Sun's "Prestoserve" patent held up development.)


Thanks! Interesting!


Keep in mind that (conventional) micro-kernels are not the same as exokernels.

FUSE is fun, I've written my own filesystems with it, but it's basically a micro-kernel idea, not an exokernel one. (L4 is also great! But I don't think it qualifies as an exokernel?)

https://pdos.csail.mit.edu/6.828/2019/lec/faq-exokernel.txt explains a lot about exokernels that's not mentioned in the papers.

Exokernels never caught on, at least not under that name. The closest equivalents in widespread use today are actually hypervisors for running virtual machines. (Especially if you are running a so-called 'unikernel' on them.)

About filesystems: if you just want the kinds of abstractions that conventional filesystems already give you, you won't get much out of using an exokernel. (As you mention, perhaps you can get a bit of extra performance?) From the FAQ I linked above:

> Q: In what kind of applications is an exokernel operating system preferable? There are naturally tradeoffs with the extra flexibility provided e.g. it is easier to make a mistake in user code.

> A: An exokernel is most attractive to an application that needs to do something that is possible with an exokernel, but not possible with other kernels. The main area in which the 1995 exokernel paper increased flexibility was virtual memory. It turns out there are a bunch of neat techniques applications can use if they have low-level access to virtual memory mappings; the Appel and Li paper (citation [5]) discusses some of them. Examples include distributed shared memory and certain garbage collection tricks. Many operating systems in 1995 didn't give enough low-level access to virtual memory mappings to implement such techniques, but the exokernel did. The exokernel authors wrote a later paper (in SOSP 1997) that describes some examples in much more depth, including a web server that uses a customized file system layout to provide very high performance.

The HN submission we are nominally discussing here is also about memory, so that might be applicable.

An example I could envision for filesystems: direct low-level hardware access to an SSD's internals for a database. Databases don't really care about files, and might also want to deal with an SSD's peculiar write process in a way that's different from the abstractions typical file systems give you.

> Perhaps this is stating the obvious, but I don't think you can get much fault isolation with a pure library filesystem; if all the processes participating in the filesystem are faulty then there's no way to protect the filesystem from fatal corruption from faults. You might be able to reduce the presumed-correct filesystem core process to something like a simplified Kafka: a process that grants other processes read-only access to an append-only log and accepts properly delimited and identified blocks of data from them to append to it.

That might be possible, but wouldn't really be faster than letting a kernel handle it, I'd guess? (But it would perhaps be more flexible to develop, since it's all userland.) You can also take inspiration from how eBPF allows you to upload user-level logic into the Linux kernel and run it securely. Instead of uploading it into the kernel, you could also upload it into your filesystem service, I guess?

Some of the original exokernel papers had some more interesting ideas sketched out.

> I know you have a very substantial amount of experience with this as a result of Varnish and your involvement with Fastly. What do you think?

I'm afraid you are mixing me up with someone else?


I agree that L4 is not an exokernel, though it does go a little further in the exokernel direction than conventional microkernels. I agree that FUSE is microkernelish rather than exokernelish, though there's nothing in the exokernel concept as I understand it that excludes the possibility of having servers for things like some or all of your filesystem functionality.

Databases are indeed an application that commonly suffers from having to run on top of a filesystem.

> That might be possible, but wouldn't really be faster than letting a kernel handle it, I'd guess?

I think reading files by invoking library calls that follow pointers around a memory-mapped filesystem might well be faster than reading files by repeatedly context-switching back and forth into even a supervisor-mode kernel, much less IPC rendezvous via a kernel with a filesystem server. This is particularly true in the brave new SSD world where context switch time is comparable to block device I/O latency, rather than being orders of magnitude smaller.
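To illustrate the point: once the filesystem image is mapped, a read is nothing but pointer-following, with zero kernel crossings per operation. A toy sketch with an invented on-"disk" layout (one header pointing at one inode pointing at the data; a real layout would of course be richer):

```c
/* Library-filesystem read over a memory-mapped image: after the single
 * mmap() at open time, every read is plain offset arithmetic. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct fs_header { uint32_t inode_off; };          /* offset of the inode */
struct fs_inode  { uint32_t data_off, size; };     /* offset + length of data */

static size_t fs_read(const uint8_t *img, void *out, size_t cap) {
    const struct fs_header *h = (const struct fs_header *)img;
    const struct fs_inode  *i = (const struct fs_inode *)(img + h->inode_off);
    size_t n = i->size < cap ? i->size : cap;
    memcpy(out, img + i->data_off, n);             /* follow the pointers */
    return n;
}
```
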

Writes to Kafka are very cheap and support extreme fan-in because the Kafka design pushes almost all the work out to the clients; the Kafka server does very little more than appending chunks of bytes, containing potentially many separate operations, to a log. It seems very plausible to me that this could be faster than handling a series of individual filesystem operations (whether in a kernel or in a microkernel-style server), at least for some applications; particularly with orders of magnitude lower penalties for nonlocality of reference than for traditional filesystems, and for applications where many writes are never read.

Running logic in the kernel or in a server using a restrictive interpreter is indeed an interesting architectural possibility, but from a certain point of view it's the opposite extreme from the Kafka approach.

> > I know you have a very substantial amount of experience with this as a result of Varnish and your involvement with Fastly. What do you think?

> I'm afraid you are mixing me up with someone else?

I hope this isn't rude, but I wrote that in response to phk's comment, so I was addressing him in it, not you, eru, although I did enjoy your comment very much as well.


> Running logic in the kernel or in a server using a restrictive interpreter is indeed an interesting architectural possibility, but from a certain point of view it's the opposite extreme from the Kafka approach.

In general, a restricted language: whether you interpret or compile that language, you still have similar security guarantees.

> I hope this isn't rude, but I wrote that in response to phk's comment, so I was addressing him in it, not you, eru, although I did enjoy your comment very much as well.

Oh, that's fine. I was just confused because that came in a reply to my comment.


>I'm afraid you are mixing me up with someone else?

To clear this up – they were addressing phk, their parent comment.


There has been a slow trend towards hardware virtualization and moving drivers to userspace. The issue is that the required hardware support (SR-IOV, for example) is often locked behind premium SKUs and is trickling into consumer products very slowly. As such, OSes will be very slow to embrace it fully.



