
When a computer's permanent storage is all right there in the processors' memory map, there is no need for disk controllers or filesystems. It's all just RAM.

And errors are forever.

There were once LISP, Smalltalk, and Forth environments where you saved the entire state of the system, rather than using files. This was not a good thing. Only one person could really work on something. If you messed up the image, undoing the mess was hard. Revision control? What's that?

Progress has been made by learning how to minimize and discard state. What makes the web backend application industry go is that almost all the state is in the database. Database systems are well debugged and reliable today. This allows for mediocre quality in web apps.

Containers are another example of discarding state. Don't sysadmin, just flush and reload. So is the transition from object-oriented to functional programming.

No, making memory persistent is not the answer.

We do need to be thinking about new OS designs, because the hardware world is changing.

- Lots of RAM. Swapping out to disk, even SSD, is so last-cen. The moment that happens performance degrades so badly you may as well restart.

- Lots of CPUs. There are now 128-core CPUs. NVidia just announced a 144-core ARM CPU. The OS and hardware need to be really serious about interprocess communication. It's an afterthought in Unix/Linux. Needs to be at least as good as QNX. Probably better, and with hardware support.

- Different kinds of compute units. GPUs need to be first-class players in the OS. There are going to be more special purpose devices, for machine learning.

- SSD is fast enough that buffering SSD in RAM is mostly unnecessary.




> SSD is fast enough that buffering SSD in RAM is mostly unnecessary.

A very strong claim, which I'd contest depending on what you're caching. If it's application code, then probably; but if you have lots of data which you're expecting to re-read, such as a database with many gigs, I really disagree with that. If an SSD had, say, 1 gig/sec of bandwidth for reading, that's about 22 times slower than my arthritic 7-year-old dual-channel RAM bandwidth[1]. A more modern, higher-spec server will have more channels (and hopefully more modern, faster RAM).

Can you elaborate a bit on your thoughts?

[1] And the path to RAM may not have as many page faults -> kernel switches as reading from an SSD, but I may be wrong and it may not matter


https://www.club386.com/silicon-motion-pcie-5-0-nvme-ssds-pr... says 14 GB/s.

And of course the sum of bandwidths within a set of NVDIMMs can easily be much higher than that.


> Lots of RAM. Swapping out to disk, even SSD, is so last-cen. The moment that happens performance degrades so badly you may as well restart.

RAM is the new disk. These days we try to make code live out of CPU cache as much as possible, scheduling the RAM access the way we used to schedule disk access, which is effective -- plus ça change, plus c'est la même chose (the more things change, the more they stay the same).

> SSD is fast enough that buffering SSD in RAM is mostly unnecessary.

This is not even approximately true in practice for many applications. Striping an array of NVMe storage devices is still about an order of magnitude less bandwidth than RAM. Storage I/O is also high latency and highly concurrent; if your storage I/O is zero-copy then it implies locking up considerable amounts of RAM just to schedule the I/O transactions safely, never mind caching. Modern database kernels require clever I/O schedulers and large amounts of RAM to keep throughput from degrading to the throughput of even very fast storage hardware. And this assumes your storage code is aware of the peculiarities of SSD (inconvenient mix of block sizes, no overwrite, garbage collection, etc).
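To put rough numbers on that order-of-magnitude claim (illustrative peak figures, not benchmarks; actual platforms vary widely):

    # Illustrative peak-bandwidth figures only; real systems vary.
    nvme_per_device = 7.0                 # GB/s, a fast PCIe 4.0 x4 NVMe drive
    nvme_stripe = 4 * nvme_per_device     # 4-way stripe: ~28 GB/s

    ddr4_3200_per_channel = 25.6          # GB/s theoretical per channel
    ram = 8 * ddr4_3200_per_channel       # 8-channel server: ~205 GB/s

    print(ram / nvme_stripe)              # ~7x, before any I/O-scheduling overhead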

That said, you obliquely raise a valid point. Buffering storage in a cache usually doesn't work well once the storage:cache ratios become high enough, and the 1000:1 ratios seen in modern servers are well past that threshold. In these cases, you still need a lot of RAM but you end up redeploying it to other parts of the kernel architecture where it can be more effective than a buffer cache for storage.


> If you messed up the image, undoing the mess was hard. Revision control? What's that?

VisualAge for Smalltalk provided built-in version control.


More importantly, so does Squeak.


And NewSpeak, which I mentioned in the talk.

I didn't submit this, but I did the original talk.

BTW the slideshow is just eye candy and you can't get anything very useful from it alone... but there's a script on my blog.


Multiuser systems with orthogonal persistence, like the single-user Smalltalk and image-based Lisps you mention, include EUMEL, L3, KeyKOS, OS/400 (iSeries), and I think Multics. (The Forths I know of were never image-based, so I can't comment on them.) It is definitely not the case that "only one person could really work on something [at a time]" on EUMEL, L3, KeyKOS, and OS/400. Maybe only one person could hack on the TCB at a time, but EUMEL, L3, KeyKOS, and OS/400 had extremely tiny TCBs, for that reason among others.

Having a single-level store doesn't save you from minimizing and discarding state. It just (mostly) decouples discarding state from power failures, making it intentional rather than accidental. In today's battery-powered environments, this is probably a better idea than ever. KeyKOS and EUMEL still had named files, containing sequences of bytes, in directories. They just weren't coupled to the disk device drivers. OS/400 has files, too. Smalltalk and Lisp environments were more radical in this sense.

Also, Smalltalk has had version control within its images for decades.

Stephen White and Pavel Curtis's MOO also had transparent persistence, didn't have files, and supported massively multiuser access; possibly a MOO server has supported more concurrent users than any AS/400, and certainly more than any EUMEL, L3, or KeyKOS system ever did. (LambdaMOO has 48 users online right now according to http://mudstats.com/Browse but had many more at its peak.) But, while MOO was programmable, it never supported high-performance or high-reliability computing as those other systems did. You could think of it as being written in a groupware scripting DSL, like a centralized Lotus Notes. The most complex MOO software I know of was a Gopher client, a "Gopher slate" that multiple people could use simultaneously. It used cooperative multitasking with timeouts that would abort long-running event handlers without rolling back their incomplete effects.

I agree, though, that a single-level store on its own doesn't solve the problems OSes are facing. But it might be useful. Everything that accesses persistent data today is built, many software layers deep, on the assumption that anything persistent is necessarily so slow that millions of nanoseconds of processing on top of it don't matter. That greatly reduces the benefits of the enormous improvement SSDs represent over spinning rust.

I'd say that buffering SSD in RAM goes beyond "mostly unnecessary": it needs to be done in hardware (as with NVDIMMs) to not hurt performance in many cases. FRAM is even more extreme on this axis.

Another interesting question to me is migration. If I'm editing a Jupyter notebook on my laptop, it'd be nice to move it onto my cellphone and keep editing it when I travel, and it'd be nice to move the computations onto heavy servers when I'm connected to the internet. Lacking this, we seem to be moving to warehouse-scale computing: everybody runs all their notebooks on Google App Engine so they can use TPUs and access them from anywhere.

Related to migration is UI remoting. Why can't I use two cellphones together to access the same app, or two cellphones and a mouse as I/O devices to the same music synthesizer (running on my laptop)? Accelerometers and multitouch could be useful for a lot of things. Maybe ROS (the Willow Garage one, not the BBS software) has the right approach here?

Also, why is everything so goddamn slow and unreliable? Ivan Sutherland had a VR wireframe cube running on a head-mounted display more than 50 years ago. Why are these supercomputers in our pockets still too slow to run VR with low enough latency that people don't throw up? Why does it take multiple minutes to start up again if its battery runs down and I plug it in? Why does the alarm app on my cellphone refuse to launch when its Flash fills up? Why does my laptop become unresponsive and require rebooting when a web page uses too much RAM?

shakes fist at cloud

Anyway maybe I'll have a running prototype of a capability-based single-level-store operating system using optimistically-synchronized software transactional memory to guarantee real-time responsivity without an MMU or lots of concurrency bugs, I don't know, sometime next week. Or maybe not.


> but EUMEL, L3, KeyKOS, and OS/400 had extremely tiny TCBs

L3's TCB is indeed quite small, and I don't know enough about EUMEL and KeyKOS to comment – but is OS/400's TCB really "extremely tiny"? On the contrary, my impression is that it is quite expansive. When, in the mid-1990s, as part of the AS/400's CISC-to-RISC transition, IBM rewrote OS/400's "Licensed Internal Code" layer from PL/MP (an IBM-internal-use-only PL/I dialect) into C++, the result was over 1 million lines of code [0]; the original PL/MP-based version had a roughly similar line count. A long way off "extremely tiny".

Part of the TCB for OS/400 (and contemporary IBM i) is the compiler backend – the compiler frontends compile to a common intermediate language (MI or TIMI), and the backend compiles that intermediate language into machine code for the underlying machine architecture (IMPI for the original CISC AS/400s, Power from the mid-1990s onwards). This is because the traditional OS/400 architecture has all processes executing in a single address space, and security is enforced at the intermediate-language level, not by the hardware, making that compiler backend security-critical. It is hard to have an "extremely tiny" TCB when the compiler backend has to belong to it. Another factor making the TCB big is that IBM went for a "feature-rich"/"CISC-like" intermediate language rather than a more spartan one which would have pushed more functionality above the TCB.

[0] Berg, W., Cline, M., & Girou, M. (1995). Lessons learned from the OS/400 OO project. Communications of the ACM, 38(10), 54–64. doi:10.1145/226239.226253


I had the idea that the IMPI-code-emitting backend was entirely responsible for isolation between tasks, and that it was quite small and simple. It sounds like neither of those is correct?

Similar sorts of virtual machines implemented as interpreters are commonly a few hundred to a few thousand machine instructions.


It isn't small and simple, because keeping the underlying VM small and simple was never one of IBM's design goals. Instead, IBM made it feature-rich. Part of that was the influence of CISC design philosophy – if your hardware ISAs are rich, it is natural that your virtualised software ISA becomes rich as well, even richer (since richness is cheaper in software than in hardware). Some would also suggest IBM made it complicated to try to prevent anyone from cloning it – if that was their plan, it was pretty successful: nobody has ever tried to clone the whole thing, although there have been partial emulations of certain limited subsets, such as the RPG compiler and the facilities commonly used by RPG programs.

Have a look at the MI instruction set [0]: https://www.ibm.com/docs/en/i/7.2?topic=interface-machine-in...

You will see instructions like MATUP (Materialize User Profile): https://www.ibm.com/docs/en/i/7.2?topic=instructions-materia...

So stuff like user accounts is directly implemented in the VM.

You'd think in a capability-based operating system, you'd make the underlying VM agnostic as to stuff like user accounts – instead you'd have some kind of "security server" running as a process on top of the VM which generates "user ID capabilities" (it being the only process with the capability to generate those capabilities), and the VM itself wouldn't treat those capabilities specially. But no, that is not how IBM designed AS/400 (or its System/38 predecessor).
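As a toy illustration of that split (purely hypothetical Python, nothing to do with how OS/400 actually works): the security server is just an ordinary process that happens to be the only holder of the authority to mint user-identity capabilities, and the VM merely stores and passes opaque tokens around. In a real capability OS the unforgeability would come from kernel-protected references rather than cryptography; the HMAC here just stands in for that.

    import hmac, hashlib, os

    # Toy "security server": the only process able to mint user-identity capabilities.
    # The VM/kernel underneath would pass these tokens around without interpreting them.
    class SecurityServer:
        def __init__(self):
            self._key = os.urandom(32)          # authority nobody else holds

        def mint_user_cap(self, username):
            tag = hmac.new(self._key, username.encode(), hashlib.sha256).hexdigest()
            return (username, tag)              # opaque to every other process

        def verify(self, cap):
            username, tag = cap
            good = hmac.new(self._key, username.encode(), hashlib.sha256).hexdigest()
            return hmac.compare_digest(tag, good)

    server = SecurityServer()
    alice = server.mint_user_cap("alice")
    print(server.verify(alice))                 # True
    print(server.verify(("alice", "forged")))   # False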

[0] Technically that's not the current MI, that's "old MI" (OMI). The "new MI" (NMI, aka W-code), introduced with "ILE" (Integrated Language Environment) in the mid-1990s is undocumented (although possibly IBM will share the documentation for it if you pay them $$$$ and sign an NDA). OMI is translated to NMI, and then NMI is in turn translated to POWER machine code; the current IBM compilers target NMI directly, whereas legacy IBM (or third-party) compilers output OMI. However, some of the NMI compilers, such as the ILE C compiler, support embedding OMI instructions; NMI supports most OMI instructions, although effectively it models them as built-in functions.


Thank you very much for the corrections!


> It just (mostly) decouples discarding state from power failures, making it intentional rather than accidental.

Just wondering, but how is an OS supposed to achieve this without fsync()ing all writes to memory and being slow as dog$h!t? Having a "single level" of storage means it becomes harder to tell what is purely ephemeral and can thus be disregarded altogether if RAM or disk caches are lost in a failure. You might think you can do this via new "volatile segments" APIs, but then you're just reintroducing the volatile storage you were trying to get rid of.


> Just wondering, but how is an OS supposed to achieve this without fsync()ing all writes to memory and being slow as dog$h!t?

Separate from the persistent storage thing, I've wanted traditional file systems that work like this:

* Unit files. The unit of file consistency is the file. The default case. Open, create, close, no other open can read it before close, on close it replaces the old file for all later opens. On crashes, you always get a consistent file, the old version if the close hadn't committed yet.

* Log files. The unit of file consistency is one write. Append only. On crashes, you get a consistent file up to the end of a previous write. No trailing off into junk.

* Temp files. There is no file consistency guaranteed. Random access. On crashes, deleted.

* Managed files. A more complex API is available. Random access. On a write, you get two completion events. The first one is "buffer has been copied". The second is "data is safely in permanent storage". This is for databases, which really care about when the data is safely stored. The second full completion event might also be offered for other file types, primarily log files.

That covers the standard use cases and isn't too hard to do.
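A rough sketch of what those four contracts might look like as separate, purpose-built interfaces (hypothetical names in Python, not any existing API):

    # Hypothetical per-type file interfaces; each consistency contract gets its own
    # narrow API instead of overloading a generic open()/write() with flags.

    class UnitFile:
        """Whole-file consistency: readers see the old version until close() publishes the new one."""
        def create(self, path): ...     # new version is invisible until close()
        def write(self, data): ...
        def close(self): ...            # atomic publish; on crash, the old version survives

    class LogFile:
        """Per-write consistency: append-only, never trails off into junk after a crash."""
        def append(self, record): ...   # each append is the unit of consistency

    class TempFile:
        """No consistency: random access, gone after a crash."""
        def pread(self, offset, length): ...
        def pwrite(self, offset, data): ...

    class ManagedFile:
        """Random access with two completion events per write, for databases:
        on_buffered() fires when the kernel has copied the data,
        on_durable() fires when it is safely on permanent storage."""
        def pwrite(self, offset, data, on_buffered, on_durable): ...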


As additional support for this thesis, the interface to the logical (sometimes physical) file system inside many database kernels is almost exactly this. All four file types have a concrete use case that requires their existence.

One of the things that makes them easier to use in a database kernel context is that even though they are all built on top of the same low-level storage primitives, each file type usually has its own purpose-built API instead of a generic overloaded file API, which would get messy very quickly. A file's type never changes and the set of types is enumerable, so this is clean.

The only thing keeping this from being a user space thing is that there is no practical way of making this POSIX compliant. POSIX has all but permanently frozen what is feasible in a file system. Many database kernels either work off raw block devices or an emulated raw block device on top of a POSIX file system, on which they build their own file system, but it is never exposed to the world, which eliminates any requirement for compliance with external standards. Imagine how much easier it would be to build things like databases if this were a first-class API.


Not being POSIX compliant is actually needed for a modern OS. With POSIX you'll never get safe concurrency, so you can hardly make use of all the CPU cores.

It must be a microkernel, supporting only safe languages. Certainly not Rust.

Genode is close, Pony was close, Fuchsia is getting there, Singularity was there already. Concurrent Pascal also.


Hmm, are you saying POSIX compliance makes scalability impossible? Because it makes it possible to run software written in unsafe languages? I'm pretty sure Fuchsia will always be able to run software written in C.


Not impossible, but hard. But more importantly, not safe.

Fuchsia supports C++, sure. Fuchsia is not a safe OS, just (like Rust) a way to better safety. Dart for Flutter is their safe language there.


Sounds reasonable. I didn't know about Dart for Flutter!


I agree that this would be useful. Have you ever written a prototype of it?


> The unit of file consistency is the file. The default case. Open, create, close, no other open can read it before close

"$PROGRAM cannot open document $FILENAME because it's already open in $PROGRAM. Try closing any open applications and open $FILENAME again." Who thinks this is actually sane behavior? Unix does it right.


> "$PROGRAM cannot open document $FILENAME because it's already open in $PROGRAM. Try closing any open applications and open $FILENAME again." Who thinks this is actually sane behavior? Unix does it right.

Wouldn't a better approach be: "$FILENAME is already open in $PROGRAM. Would you like to: (1) Switch to $PROGRAM and continue editing it there; (2) Ask $PROGRAM to close it; (3) Edit it anyway (DANGEROUS!); (4) Abort". (For non-interactive processes, you wouldn't want to ask, you'd probably just want to always abort–which causes problems for some situations like "I want to install an upgrade to this shared library but $PROGRAM is executing using it", but there are ways to solve those problems.)

Kind of like swap files in vim – although vim doesn't have the "switch to $PROGRAM" option, because the underlying operating systems don't have good support for it – but that's an application-level feature which only works in one app. It would be better if it was an OS-level feature which worked across all apps.


Instead of giving you a mutual exclusion error, a newly created file written this way might simply be invisible until it's complete, at which point it atomically replaces the old version of the file (which might still be accessible through mechanisms like VMS version numbers, WAFL snapshots, Linux reflinks, or simply having opened it before the replacement). This is probably what a transactional filesystem would give you: the creation, filling, and naming of the file would normally be part of the same transaction.

But, with optimistic synchronization, transactions that were reading an old version of the file might require redoing from the start when a new version was written.
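On a POSIX filesystem today you can approximate that "invisible until complete, then atomically replace" behavior with the usual write-temp-then-rename idiom; a minimal sketch (error handling omitted):

    import os, tempfile

    def publish_atomically(path, data):
        # Build the new version in an invisible sibling file, make it durable,
        # then atomically swap it in. Readers see either the old file or the new one.
        d = os.path.dirname(path) or "."
        fd, tmp = tempfile.mkstemp(dir=d)
        try:
            os.write(fd, data)
            os.fsync(fd)                # contents are durable before they become visible
        finally:
            os.close(fd)
        os.replace(tmp, path)           # atomic rename; existing opens keep the old file
        dfd = os.open(d, os.O_DIRECTORY)
        try:
            os.fsync(dfd)               # make the rename itself durable
        finally:
            os.close(dfd)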


Both KeyKOS and EUMEL used periodic globally consistent checkpoints supplemented by an explicit fsync() call for use on things that required durability like database transaction journals. So when the system crashed you would lose your last few minutes of work; EUMEL ran on floppies, so the interval could be as long as 15 minutes, but I think even KeyKOS only took a checkpoint every minute or so, perhaps because it would pause the whole system for close to a second on the IBM 370s it ran on. (Many years later, Norm Hardy showed me a SPARC workstation with I think 32 or 64 MiB of RAM; he said it was the biggest machine KeyKOS had ever run on, twice the size of the mainframe that was the previous record holder. Physically it was small enough to hold in one hand.)

I assume L3 used the same approach as EUMEL. I don't know enough about OS/400 or Multics to comment intelligently, but at least OS/400 is evidently reasonably performant and reliable, so I'd be surprised if it isn't a similar strategy.

There's a latency/bandwidth tradeoff: if you take checkpoints less often, more of the data you would have checkpointed gets overwritten, so your checkpoint bandwidth gets smaller. NetApp's WAFL (not a single-level store, but a filesystem using a similar globally-consistent-snapshot strategy) originally (01996) took a checkpoint ("snapshot") once a second and stored its journals in battery-backed RAM, but with modern SSDs a snapshot every 10 milliseconds or so seems feasible, depending on write bandwidth.

Most of these systems did have "volatile segments" actually, but not for the reason you suggest, which is performance; rather, it was for device drivers. You can't swap out your device drivers (in most cases) and they don't have state that would be useful across power failures anyway. So they would be, as EUMEL called it, "resident": exempted from both the snapshot mechanism and from paging.


> So when the system crashed you would lose your last few minutes of work

So, pretty much how disk storage is managed today? But then if you're going to checkpoint application programs from time to time, it might be better to design them so that they explicitly serialize their program state to a succinct representation that they can reload from, and only have OS-managed checkpoint and restore as a last resort for programs that aren't designed to do this... Which is again quite distinct from having everything on a "single level".

> but with modern SSDs a snapshot every 10 milliseconds or so seems feasible, depending on write bandwidth.

Not so. Even swapping to SSD is generally avoided because every single write creates wear and tear that impacts the lifetime of the device.


Well, I don't think it's "pretty much how disk storage is managed today," among other things because you don't have a multi-minute reboot process, and you usually lose some workspace context in a power failure.

The bigger advantage of carefully designed formats for persistent data is usually not succinctness but schema upgrade.

It's true that too much write bandwidth will destroy your SSDs prematurely, and that might indeed be a drawback of a single-level store, because you might end up spending that bandwidth on things that aren't worth it. This is especially true now that DWPD has collapsed from 10 or more down to, say, 0.2. But they're fast enough that you can do submillisecond durable commits, potentially hundreds of them per second. In the limit where each snapshot writes a single 4 KiB page, 100 snapshots per second is 0.4 MB/s.

If you have a 3-year-life 1 TiB SSD with 0.2 DWPD, you can write about 220 TiB to it before it fails. So if you have 4 GiB of RAM, in the worst case where every checkpoint dirties every page, you can only take 56,000 checkpoints before killing the drive: about nine minutes if you're doing it every 10 ms. But actually the drive probably isn't that fast. If you limit yourself to an Ultra DMA 2 speed of 33 MB/s on average, while still permitting bursts of the SSD's maximum speed, that's 2.8 months, but a maximum checkpoint latency of more than 2 minutes. You have to scale back to 7.7 MB/s to guarantee a disk life of a year, at which point your worst-case checkpoint time rises to over 9 minutes.
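For anyone who wants to check the arithmetic, here is the same back-of-the-envelope calculation as a few lines of Python (same assumed figures as above: 1 TiB drive, 0.2 DWPD, 3-year life, 4 GiB of RAM, worst case of every page dirty per checkpoint):

    TiB, GiB, MB = 1024**4, 1024**3, 1e6

    endurance = 0.2 * 1 * TiB * 365 * 3      # ~220 TiB of total writes before wear-out
    per_checkpoint = 4 * GiB                 # worst case: all of RAM dirty
    n = endurance / per_checkpoint           # ~56,000 checkpoints

    print(n * 0.010 / 60)                    # ~9 minutes of drive life at one checkpoint per 10 ms

    for rate in (33 * MB, 7.7 * MB):         # average write-rate caps
        print(endurance / rate / 86400)      # drive life in days: ~85 and ~362
        print(per_checkpoint / rate / 60)    # worst-case checkpoint latency in minutes: ~2.2 and ~9.3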


GemStone/S is a multi-user Smalltalk (with database characteristics). It has revision control and is quite fun to work in.


I should have mentioned it.


> And errors are forever.

Have you tried turning it on and off again? Oh, you can't.

> - Lots of CPUs. There are now 128-core CPUs. NVidia just announced a 144-core ARM CPU. The OS and hardware need to be really serious about interprocess communication. It's an afterthought in Unix/Linux. Needs to be at least as good as QNX. Probably better, and with hardware support.

Ah yes, Minecraft, the biggest traitor of all. A world bigger than Earth, but even getting to 250 players on a single server requires absurd levels of optimization.


> Have you tried turning it on and off again? Oh, you can't.

Actually you can. That was one of their biggest advantages, to save and restore running images, and debug into running images.

But diffing a serialized binary image (what changed?) was hard. You had to diff the commands leading to your image, a bit like in Docker.


> GPUs need to be first-class players in the OS

I've been starting to wonder, taking a crazy-stupid extreme, whether a GPU unit should be in charge of booting. Perhaps you boot into a graphical interpreter (like a boot into BASIC of old) which can then implement a menu (or have "autoexec.bas") to select a heavyweight OS to boot into.


Your Raspberry Pi already does :) https://thekandyancode.wordpress.com/2013/09/21/how-the-rasp...

More seriously, there's probably something to this from a motherboard design perspective, there's enough general purpose compute on a modern graphics card to do the initial bootloader stuff. But graphics cards are not standardized as much as CPUs are, so it's unclear if it would work as we've currently designed systems.


Great link for this, thanks. I had suspected some different ordering might already be called for given the need for e.g. RAM training.

That [version of?] pi booting is much more filename-oriented than all of the partitioning I have seen documented for a Rockchip [ http://www.loper-os.org/?p=2295 ].



