Debugging operating systems with time-traveling virtual machines (2005) [pdf]

fatcunt · 2024-08-18T21:07:37 1724015257

Microsoft has, or had, a similar technology they use internally, called TKO: https://www.microsoft.com/security/blog/2020/05/04/mitigatin...

It's written in Rust and is based around a version of Bochs modified for deterministic execution. It's got time-travel debugging (with WinDbg), which works by replaying forward from the nearest snapshot to the point at which the user is asking to move backwards to.

The primary author of this software wanted to open source it, but the higher-ups at MSFT refused. He's been working on similar projects in a personal capacity though, e.g. https://gamozolabs.github.io/fuzzing/2020/12/06/fuzzos.html

ashkankiani · 2024-08-19T11:08:40 1724065720

I wrote a similar snapshot system for our deterministic trading engine. It's hard to imagine a system that doesn't do that unless you actually enforce every single event in your log to be reversible/non-destructive/non-aliasing. Even then, you still want snapshots to jump to a point in time. The only annoying thing is the case where you want to step back one, meaning a naive implementation would jump back to the last snapshot and play forward.

An 80% solution is to keep the last N states in memory. Snapshots compress well within a small time frame, so whenever we "paused" the playback, we could stash deltas from the pause point to reconstruct stuff (I sadly never got around to implementing this part before I left since it wasn't high enough priority).

mark_undoio · 2024-08-19T17:19:25 1724087965

> Even then, you still want snapshots to jump to a point in time. The only annoying thing is the case where you want to step back one, meaning a naive implementation would jump back to the last snapshot and play forward.

At Undo.io we use similar techniques to time travel debug C/C++ (and other languages).

> An 80% solution is to keep the last N states in memory.

We use a slightly more specialised version of what you describe - we maintain a set of snapshots at wide spacings throughout history (for big time jumps the user might want to do) and ones just a small gap before the "current time" the user is viewing, so small steps backwards can be fast too.

It's been a complex process figuring out an algorithm that balances these concerns without using up unreasonable amounts of RAM.

The other way to tackle slowness - though this is for larger jumps in history - is to search the timeline in parallel. Since you've got deterministic re-execution you can play several chunks of history at once while looking for a key event. It can't help for small jumps in time but if you are looking far into the past it can be a significant speed-up.

ttecho2021 · 2024-08-18T23:15:44 1724022944

Wow I stumbled on Gamozolabs' youtube when looking into determinism and hypervisors. Didn't know he also worked on TKO for MSFT. Glad he is making a similar project!

Cyph0n · 2024-08-18T22:42:01 1724020921

Watched some of his streams a while back. One of the most talented/productive devs I have ever seen tbh.

grepfru_it · 2024-08-18T22:43:15 1724020995

The trick at Microsoft is to start working on your project in your spare time. Then incorporate it into your project at MSFT. You get the clout associated with having an open source project but then you also get to use it internally as a sanctioned tool

DaiPlusPlus · 2024-08-19T05:11:49 1724044309

> The trick at Microsoft is to start working on your project in your spare time. Then incorporate it into your project at MSFT.

(Ex-msftie here)

Microsoft is great for being amongst the tiny number of software companies that not only allow their FTEs to have their own software projects without attempting to claim any kind of ownership (compare with Apple where I’m told having even a public GitHub account without manager-approval is grounds for termination…) - but MS even actively encourages it too (though I cynically note that was when the Windows Phone App Store’s app-shortage was thought-of as a serious problem)

BUUUUT - they do require you to strictly avoid cross-over with their own products: you can’t just re-use your own code in a company project, or even reimplement it from-memory. Yes, there are ways of doing this properly (involving weeks of meetings with LCA arranging a permanent irrevocable - and possibly exclusive - rights transfer), wasting time better spent on other things) - so your project had better be something special and you’d need to give-up hope of getting rich from it (spare a very modest bonus the year you do the LCA paperwork if your skiplevel sees any value in it).

pjmlp · 2024-08-19T06:48:20 1724050100

Yeah, and that is how stuff like C++/CX gets killed, replaced by a side project that was never at the same tooling level in Visual Studio, even though there was a CppCon talk promising otherwise, and currently is in maintenance state because "it achived its goals", who cares about those paying customers that had to rewrite their code to a lesser capable tool, now abandonend.

roca · 2024-08-18T21:27:12 1724016432

I wonder if this inspired the VMWare VM record-and-replay functionality that came out in 2008. They discontinued it in 2011, but it's important to me because we used it at Mozilla to great effect and that made it easier for me to get Mozilla to support the development of rr, which started in 2011.

gwd · 2024-08-19T13:43:29 1724075009

Yes, at some point when I was a student I went out to VMWare and gave a presentation to a team of their engineers.

The thing about execution replay is that, in the general case, it's incredibly fragile. I happened to chat to some people from VMWare in 2008 and they said that there was a team of 10 people whose only job it was to fix execution replay when it broke. The main thing I learned from my PhD on execution replay was how to debug insane bugs.

Presumably rr, focusing on processes, offers a more constrained environment that's easier to log & replay.

mark_undoio · 2024-08-19T17:25:29 1724088329

> Presumably rr, focusing on processes, offers a more constrained environment that's easier to log & replay.

As someone who's worked elsewhere on time travel debug, I'm really curious on @roca's take on this - because I'd have expected a full-VM solution to be easier to make reliable.

Hardware-level behaviour sounds harder but it's well-constrained. The behaviour an OS can rely on from the hardware is large but well-documented and slow to change.

In contrast, the process boundary is really ill-defined and permeable. Also, when you need things to be precisely the same, you notice bits of kernel behaviour leaking into the user space ABI in unexpected ways.

Veserv · 2024-08-20T17:05:13 1724173513

If you are doing a single core virtual machine, the hypervisor is properly implemented, and you can change the hypervisor then it is fairly straightforward. You just funnel everything through a single guest memory write function and then instrument that.

Unfortunately, virtual machines are basically only ever used in multicore configurations which are fundamentally unconstrained multithreading which makes record-replay based (versus full trace based) implementations impossible (or at least very challenging) unless you serialize.

The device driver interface is most likely going to be trapping accesses and communicating back via asynchronous shared memory writes if not pass-through or paravirtualized. I suspect most hypervisor implementations will not even allow you to intercept/trace those operations externally, and if they do it is likely going to be hard to sequence them relative to VM execution/instruction stream as they are rooted in a shared memory interface.

If they are paravirtualized, or just in general, the hypercall interface is also usually very ad-hoc and ill-defined and also likely not possible to intercept or trace. The scope is probably going to be more like trying to represent the effects of every random vendor-specific ioctl.

roca · 2024-08-20T21:19:59 1724188799

The assumption here is that you're modifying the hypervisor itself. No-one expects to be able to bolt a record-replay tool onto a third-party hypervisor.

Veserv · 2024-08-20T22:39:33 1724193573

Well, that is what you did with rr (bolted it onto a third-party unmodified OS) so I thought I would cover all the bases.

Once you are in modification land it really depends on the kind of hypervisor and its implementation. Like doing it for a OS, a well-designed, consistent, and tight interface makes it pretty easy. But, the unconstrained multithreading for multicore VMs makes it problematic for record-replay solutions if you do not want to serialize.

mark_undoio · 2024-08-21T10:36:02 1724236562

Yes, this is true. Genuine parallelism will absolutely be a problem there - similarly to how it is for user-level record/replay tech. And serialising is going to have more impact when you're recording a whole OS vs just one application.

Microsoft's TTD (a user-level tech) does allow genuine parallelism, I believe. But my understanding is that it has slower (though still very impressive) single-threaded performance as a result, so there's a tradeoff. But perhaps you could do something more like that.

We believe it's possible to handle parallelism in record/replay tech while still having good single-threaded performance - but it implies a much more complex implementation to do so safely.

Veserv · 2024-08-21T17:04:05 1724259845

The literature on Microsoft TTD [1][2] indicates it is a instruction emulator full-trace approach. It is a little unclear if it is a basic emulator or a instrumented JIT approach, but the reported ~10x overhead is more consistent with a basic emulator. Though maybe their recording implementation is just inefficient causing such excess overhead.

[1] https://msrc.microsoft.com/blog/2019/05/time-travel-debuggin...

[2] https://www.usenix.org/legacy/events/vee06/full_papers/p154-...

roca · 2024-08-21T11:43:32 1724240612

The Linux kernel supports pretty powerful APIs for inspecting and managing processes, like ptrace and seccomp and /proc, from another userspace process. There is no comparable set of APIs for hypervisors, unless Xen has something maybe?

And we don't handle unrestricted multithreading in rr. If we wanted to it would complicate things a lot and I don't know if existing kernel APIs would be adequate. We're skating along the edge of feasibility as it is.

roca · 2024-08-20T16:02:44 1724169764

My instinct is the same as yours, but I'm also well aware of the "grass is greener on the other side of the fence" cognitive bias :-). So without actually having written a record-and-replay hypervisor, I hesitate to judge.

roca · 2024-08-20T16:07:31 1724170051

The interface between userspace processes and the kernel and CPU is pretty complicated. It's hard for me to tell whether it's more or less complicated than the interface between guests and the hypervisor (given most record and replay scenarios don't require virtualizing exotic hardware).

It probably is easier to debug rr replay bugs than VM-level replay bugs, partly because rr is lighter-weight. rr replay bugs are still insane, but it's all relative :-).

gwd · 2024-08-21T13:15:34 1724246134

My first prototype of execution replay I implemented was for Linux processes (in like, 2000 or something), which I did by modifying Linux. The primary thing I did there was to instrument copy_to_user().

The final version of execution replay that I did was for Xen, which had the asynchronous device drivers. In that case, the goal of the drivers was to have the device do DMA directly into the guest's memory space, and thus had to be modeled as an external processor. If you were content to modify your virtual devices to always doing copying, it would be significantly easier.

But yeah, in both cases you just have so much that the thing underneath might be doing, it's hard to tell which is going to be more fragile. :-)

The final version also did multi-processor VMs, using the paging hardware to track sharing across virtual cpus [1]. The key take-away from that was, "It's slow, but often not as slow as you might think."

[1] https://www.semanticscholar.org/paper/Execution-replay-of-mu...

soamv · 2024-08-19T05:41:10 1724046070

It might have, I remember attending a talk by Peter Chen when I was at VMware around that time, and I know there was some kind of collaboration. (I wasn't involved in record-replay at VMware, but I was interested in it due to some ancient work I did with userspace record-replay debugging, https://lizard.sf.net).

rr is fantastic work, mad props! And the multiprocess stuff in pernosco looks super neat.

purple-dragon · 2024-08-19T13:11:33 1724073093

Yes. There’s a whole family of papers by Chen, his students, and other VMware folks in this vein.

icholy · 2024-08-18T21:29:41 1724016581

I don't get how rr isn't more popular.

Veserv · 2024-08-18T20:41:35 1724013695

A history of other time traveling debugging papers and products (including this one):

https://jakob.engbloms.se/archives/1554

https://jakob.engbloms.se/archives/1564

userbinator · 2024-08-18T21:42:28 1724017348

Debuggers have had history tracing functionality for a long time, but being extremely slow and consuming a lot of storage meant it was rarely used except for very specific cases. Now that CPUs are faster and the average machine has a lot more RAM, it becomes more feasible to do this.

drewg123 · 2024-08-18T20:14:40 1724012080

This needs a [2005] qualifier

Intralexical · 2024-08-18T18:29:32 1724005772

[flagged]

ForOldHack · 2024-08-19T00:19:09 1724026749

Related work: IBM first mainframes were so unreliable, that they put a feature in that you could restore the machine to the instruction that crashed, and continue processing, and you could back out of the last few instructions. State of the art for the late 1950s.

It was not virtual, that would become standard through on less than a decade.

Optimizing çopilers came with source tapes, and a huge expensive.