A bug fix in the 8086 microprocessor, revealed in the die's silicon (righto.com)
365 points by _Microft on Nov 26, 2022 | 97 comments



Once again, absolutely amazing. Those are more details of a really interesting internal CPU bug than I could have ever hoped for.

Ken, do you think it might at some point be feasible for a hobbyist (even if just a very advanced one like you) to do some sort of precise x-ray imaging that would obviate the need to destructively dismantle the chip? For a chip of that vintage, I mean.

Obviously that's not an issue for the 8086 or 6502, since there are more than plenty around. But if, for example, an engineering sample ever turned up, it would be incredibly interesting to know what might have changed. If it's the only one, though, dissecting it could go very wrong, and you'd lose both the chip and the insight it could have given.[1]

Also, in terms of footnotes, I always meant to ask: I think they make sense as footnotes, but unlike footnotes in a book or paper (or in this short comment), I cannot just let my eyes jump down and back up, which interrupts the flow a little. I've seen at least one website put footnotes on the side, i.e. in the margin next to the text they apply to, maybe with a little JS or CSS to fully unveil them. Would that work?

[1] Case in point, from that very errata, I don't know how rare 8086s with (C)1978 are, but it's conceivable they could be rare enough that dissolving them to compare the bugfix area isn't desirable.


>some sort of precise x-ray imaging that would obviate the need to destructively dismantle the chip

I don't know about a hobbyist, but are you talking about something along the lines of "Ptychographic X-ray Laminography"? [0] [1]

[0] https://spectrum.ieee.org/xray-tech-lays-chip-secrets-bare

[1] https://www.nature.com/articles/s41928-019-0309-z


That X-ray technique is very cool. One problem, though, is that it only shows the metal layers. The doping of the silicon is very important, but doesn't show up on X-rays.


I haven't ever looked into it myself, but what you just pasted seems like a somewhat promising answer indeed. Except for the synchrotron part I guess? Maybe?


>Except for the synchrotron part I guess? Maybe?

They were using 6.2 keV x-rays from the Swiss Light Source, a multi-billion-Euro scientific installation. 6.2 keV isn't especially energetic by x-ray standards (a tungsten-target x-ray tube will do ten times that), so either they needed a monochromatic beam or the high flux you can only get from a building-sized synchrotron. The fact that the paper says they poured 76 million Grays of ionizing radiation into a volume of 90,000 cubic micrometers over the course of 60 hours suggests the latter. (A fatal dose of whole-body radiation to a human is about 5 Grays. This is not a tomography technique that will ever be applied to a living subject, though there are no doubt some interesting things being done right now to frozen bacteria or viruses.)


No question, but:

"Though the group tested the technique on a chip made using a 16-nanometer process technology, it will be able to comfortably handle those made using the new 7-nanometer process technology, where the minimum distance between metal lines is around 35 to 40 nanometers."

The feature size that needs to be resolved for imaging CPUs of 6502 or even 8086 vintage is far coarser than that. The latter was apparently made on a 3 µm process, so barring any differences in how process nodes have been named since, that's roughly two orders of magnitude larger than those 35 to 40 nanometer metal lines.

Plus, as Ken's article says, the 8086 has only one metal layer instead of a dozen.

So my (still ignorant) hope is that you maybe don't need a synchrotron for that.


When I was at C-Cube Microsystems in the mid 90's, during the bring-up of a new chip they would test fixes with a FIB (Focused Ion Beam). Basically a direct edit of the silicon.


Intel is still using FIB for silicon debugging.


The obvious workaround for this problem is to disable interrupts while you're changing the Stack Segment register, and then turn interrupts back on when you're done. This is the standard way to prevent interrupts from happening at a "bad time". The problem is that the 8086 (like most microprocessors) has a non-maskable interrupt (NMI), an interrupt for very important things that can't be disabled.
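
In 8086 assembly the workaround is just a couple of instructions; a minimal sketch (the register choices are arbitrary):

    cli             ; mask ordinary (INTR) interrupts
    mov ss, ax      ; ax holds the new stack segment
    mov sp, bx      ; bx holds the new stack pointer
    sti             ; re-enable interrupts

But, as noted above, CLI does nothing about NMI, so a window remains on the original silicon.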

Although it's unclear whether any of the very first revisions of the 8088 (not 8086) with this bug ended up in IBM PCs, since those predate the PC's introduction by a few years, the original PC and its successors have the ability to disable NMI in the external logic via an I/O port.


This was in the early IBM PCs. I know, because I remember I had to replace my 8088 when I got the 8087 floating point coprocessor. I don't recall exactly why this caused it to hit the interrupt bug, but it did.


Maybe because IBM connected the 8087's INT output to the 8086's NMI, so the 8087 became a source of NMIs (and with really trivial stuff like underflow/overflow, forced rounding etc., to boot).

Even if you could disable the NMI with external circuitry, that workaround quickly becomes untenable, especially when it's not fully synchronous.


That sounds right. It's been a while.


We used to disable interrupts for some processes at work (hpux/pa-risc). It makes them quasi real time, though if something goes wrong you have to reboot to get that cpu back…

psrset was the command; there is very little info on it since HP gave up on HP-UX. I thought it was a good solution for some problems.

They even sent a bunch of us to a red hat kernel internals class to see if Linux had something comparable.

https://www.unix.com/man-page/hpux/1m/psrset/


I'd never heard of processor sets (probably because I've never used HPUX in anger), but they sound like a great feature, especially in the days where one CPU had only one execution engine on it. Modern Linux has quite a lot of options for hard real time on SMP systems, some of them Free and some of them not. Of all places, Xilinx has quite a good overview: https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/188424...


For those who are curious, it's the highest bit in port 70h (this port is also used as an index register when accessing the CMOS/RTC): https://wiki.osdev.org/CMOS#Non-Maskable_Interrupts


Port 70h is where it went in the AT (which is what most people are probably referring to when they say "PC compatible" these days.) On the PC and XT, it's in port A0:

https://wiki.osdev.org/NMI

There's also another set of NMI control bits in port 61h on an AT. It's worth noting that the port 70h and 61h controls are still there in the 2022 Intel 600-series chipsets, almost 40 years later; look at pages "959" and "960" of this: https://cdrdv2.intel.com/v1/dl/getContent/710279?fileName=71...
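
For completeness, gating NMI from software looks roughly like this (a sketch based on the osdev pages above; worth double-checking the bit sense before relying on it):

    ; PC/XT: NMI mask register at port A0h (write 80h to enable, 00h to disable)
    mov al, 00h
    out 0a0h, al    ; NMI off
    mov al, 80h
    out 0a0h, al    ; NMI back on

    ; AT and later: writing port 70h with bit 7 set disables NMI; the low
    ; 7 bits still select the CMOS/RTC index register
    mov al, 80h
    out 70h, al     ; NMI off (CMOS index 0 selected as a side effect)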


I sincerely wish I were this person. Reading this makes me feel I’ve fundamentally failed in my life decisions.


Whatever you decide to do next, please be gentle with yourself. If you are in a position to even appreciate the post, you have probably done well!

I get it. And I think I might know why: we are all smart here. Smart enough to do outstandingly well at high school. As time goes on, through university and career, the selection bias reveals itself. We are now among millions of peers, and statistically some of them will be well ahead by some measure. In short: the feeling is to be expected, and it is by design of modern life and high interconnectedness.


this is a really nice reply.

sometimes in the too often contrarian comment sections, it’s jarring to see someone’s comment recognize the human being on the other end. it’s inspiring.

i’m not the op, but your comment was kinda rad. thanks.


I'm not sure if I should be complimented or horrified by this.


why not onboard him into the initiative?


I am always awestruck by Ken's posts, and felt severely underqualified to comment on anything, even as a regular if basic 'Elektor' reader.

But the previous post, about the bootstrap drivers, gave me a chance. I had an hour to spare, and the size of the circuit was small enough, so I just stared at it on my cell phone, scrolling back and forth, back and forth. It took a while, but slowly I started picking up some basic understanding. Big shoutout to Ken, BTW; he did everything to make it easy for mortals to follow along.

I am still clearly at the beginner level, and surely can't redo that post on my own, but there is a path forward to learn this.

If I were to redo that post, I'd advise having pen and paper ready, and printing out the circuit a few times to doodle on. And have lots of time to focus.


I actually had some eye-openers about how a CPU works in more detail after reading this article. Is there any good material for understanding CPU design for beginners?


Check out Ben Eater's series on building a breadboard computer: https://youtube.com/playlist?list=PLowKtXNTBypGqImE405J2565d...

In the series he builds all the things more or less from the ground up. And if that isn't enough, in another video series he builds a breadboard VGA graphics card.


Hardware is fun! It's never too late :)


I'm sure you could figure this out if you wanted to.

https://www.reddit.com/r/devops/comments/8wemc2/feeling_unsu...


It's never too late.



I love these so much.

kens: were you able to identify the original bug? I understand the errata well enough, it implies that the processor state is indeterminate between the segment assignment and subsequent update to SP. But this isn't a 286/386 with segment selectors; on the 8086, SS and SP are just plain registers. At least conceptually, whenever you access memory the processor does an implicit shift/add with the relevant segment; there's no intermediate state.

Unless there is? Was there a special optimization that pre-baked the segment offset into SP, maybe? Would love to hear about anything you saw.


I believe this was a design bug: the hardware followed the design but nobody realized that the design had a problem. Specifically, it takes two instructions to move the stack pointer to a different segment (updating the SS and the SP). If you get an interrupt between these two instructions, everything is deterministic and "works". The problem is that the combination of the old SP and the new SS points to an unexpected place in memory, so your stack frame is going to clobber something. The hardware is doing the "right thing" but the behavior is unusable.
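
Concretely, the problem sequence is just two back-to-back instructions (a sketch; the register choices are arbitrary):

    mov ss, ax      ; new stack segment takes effect immediately
                    ; <-- an interrupt accepted here pushes FLAGS, CS and IP
                    ;     at new_SS:old_SP, i.e. somewhere unintended
    mov sp, bx      ; new stack pointer

The mask fix described in the article makes the CPU hold off interrupts for the one instruction following the segment register load, closing that window.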


Huh... well that's disappointing. That doesn't sound like a hardware bug to me at all. There's nothing unhandleable about this circumstance that I see: any combination of SS/SP points to a "valid" address; code just has to be sure the combination always points to enough memory to store an interrupt frame.

In the overwhelmingly common situation where your stack started with an SP of 0, you just assign SP back to 0 before setting SS and nothing can go wrong (you are, after all, about to move off this stack and don't care about its contents). Even the weirdest setups can be handled with a quick thunk through an SP that overlaps in the same 64k region as the target.

It's a clumsy interface but hardly something that Intel has traditionally shied away from (I mean, good grief, fast forward a half decade and they'll have inflicted protected mode transitions on us). I'm just surprised they felt this was worth a hardware patch.

I guess I was hoping for something juicier.


The problem is that in general "code" can't do that because the code that's switching stack doesn't necessarily conceptually control the stack it's switching to/from. One SS:SP might point to a userspace process, for instance, and the other SS:SP be for the kernel. Both stacks are "live" in that they still have data that will later be context switched back to. You might be able to come up with a convoluted workaround, but since "switch between stacks" was a design goal (and probably assumed by various OS designs they were hoping to get ports for), the race that meant you can't do it safely was definitely a design bug.


> In the overwhelmingly common situation where your stack started with an SP of 0, you just assign SP back to 0 before setting SS and nothing can go wrong (you are, after all, about to move off this stack and don't care about its contents).

Consider the case when the MSDOS command interpreter is handing off execution to a user program. It can't just set SP to 0 first (well, really 0xFFFF since the stack grows down), because it's somewhere several frames deep in its own execution stack, an execution stack which will hopefully resume when the user program exits.

Programs using the 'tiny' memory model, like all COM files, have SS == CS == DS so new_ss:old_sp could easily clobber the code or data from the loaded file.


Right, so you just thunk to a different region at the bottom of your own stack in that case. I'm not saying this is easy or simple, I'm saying it's possible to do it reliably entirely in software at the OS level without the need for patching a hardware mask.

Edit: just because all the folks jumping in to downvote and tell me I'm wrong are getting under my skin, this (untested, obviously) sequence should work to safely set SP to zero in an interrupt-safe way. All it requires is that you have enough space under your current stack to align SP down to a 16-byte boundary and fit 16 bytes plus one interrupt frame below that.

Again, you wouldn't use this in production OS code; you'd just make sure there was a frame of space at the top of your stack or be otherwise able to recover from an interrupt. But you absolutely can do this.

     mov ax, sp    ; Align SP down to a segment (16-byte paragraph) boundary
     and ax, 0fff0h
     mov sp, ax
    again:         ; Iteratively:
     mov ax, ss    ; + decrement the segment (safe)
     dec ax
     mov ss, ax
     mov ax, sp    ; + increment the stack pointer by the same amount (safe)
     add ax, 16    ;   (MOV doesn't touch flags, so ZF from this ADD survives)
     mov sp, ax
     jnz again     ; + until SP is zero (and points to the same spot!)


User software is allowed to switch their stack, too, and user software must be able to be interrupted. Sure, it may be possible to work around this in software with some careful elaborate protocol that involves making sure some value of SP is usable with any value of SS, and prescribing that anything at all adhere to that (because any code at all can be interrupted by an NMI), but that not only means the use of this protocol for all stack layouts, but also implies that every single “MOV SS” or “POP SS” becomes part of some elaborate sequence of instructions (something like “push some other register on the stack, save current SP, in that register, change SP to the magic value, change SS… extra fun if you wanted POP SS, restore SP from the other register, restore the other register”).

That's a lot to put up with, and in my mind qualifies as a design bug with no palatable software workaround. Not great for wanting folks to use your brand new CPU.

The protected mode transitions that you mentioned are much more palatable because they only happen in very bespoke parts of the OS kernel, usually only in bootstrap in any non-ancient OS.


> Sure, it may be possible to work around this in software with some careful elaborate protocol

Uh, yes, yes it is. That's the point. Go check the 80286 PRM for examples where the same hardware vendor inflicted far worse protocols on their users only a few years after the hack we're looking at.

The subject under discussion isn't "whether or not the 8086 implemented a good stack switching interface". Clearly it did not. It's "whether the 8086 stack switching interface as designed was so irreparably broken that it required a hardware patch[1] to fix". I think the case there is pretty weak, honestly, given the state of the industry at the time. And I found it surprising that Intel would have bothered.

[1] And increase interrupt latency!


I’ve literally got the 80286 PRM in my room with me, the actual book, not a printout. What do you think in there is a “far worse protocol” than this abomination?

And then the 286 came along at a time where the x86 architecture was already enormously successful, part of the ubiquitous IBM PC. The 8086 on the other hand was a much simpler, completely new CPU competing with many others. There aren’t many warts in the 8086 itself, whose data sheet is a few pages instead of a book (segmentation was a plus at the time because it allowed a large address space and some easy multitasking without relocation in such a simple CPU), and still it actually only won out against other CPUs for marginal reasons. If 68k had been ready for the PC, that would likely have been it.

Intel had no capital at the time to inflict such insanity on people. And why is patching the mask in a later revision of the chip such a big deal anyway?[1]

Just look at the solution you proposed. Always gradually adjusting SS and SP in a loop until SP reaches 0, just because an NMI may hit. That’s untenable.

[1] Interrupt latency is only slightly increased on MOV/POP SS. That shouldn't happen often enough to be a bigger concern than the much more common CLI/STI blocks with many instructions in them. And the 8086 wasn't exactly low latency anyway.


Why is this an argument? Is it really so upsetting to you that I thought this was a weird thing to fix at the mask level? It... was. Chips in 1978 had bugs like this everywhere, and we all just dealt with it. You mentioned the 68k, for example, while forgetting the interrupt state bug that forced the introduction of the 68010!

Motorola, when faced with a critical bug that would affect OS implementors in a way they could technically work around (c.f. Apollo, who used two chips) but didn't want to, didn't even bother to fix the thing and made them buy the upgraded version (obviously in some sense the '010 was the equivalent "fix it in the mask" thing). They were selling chips that couldn't restart a hardware fault well into the late 80's.

> Just look at the solution you proposed.

That wasn't the solution I proposed. I proposed just having the stack reserve an interrupt frame at the top and setting SP to zero before changing SS. The code I showed was just proof that even putative users[1] who couldn't change the OS would be able to work around it.

[1] There were none. This was several years before MS-DOS; no one was shipping 8086 hardware in any significant quantities, certainly not to a market with binary software compatibility concerns.


> Chips in 1978 had bugs like this everywhere, and we all just dealt with it.

Like what?

> Motorola, when faced with a critical bug [...] (c.f. Apollo, who used two chips)

I am well aware of the Apollo two chip hack, but I was not aware that this was considered a "critical bug". I thought the original 68000 just did not support virtual memory with paging, and the 68010 introduced that. Given that you needed an external MMU if you wanted to do any kind of virtual memory, which most clients did not care about, and that the 8086 and many if not most contemporaries in that class did not support any kind of virtual memory either, that seemed like a reasonable limitation to me.

The kicker is, I actually own a Siemens PC-D, a marvelous machine that has an 80186 (the differences to an 8086 don't matter here), and an actual external MMU made up of discrete 74 series logic chips. It runs SINIX, and has full memory protection through that MMU.

And yet it does not support paging either, because by the time the external circuitry recognized a fault, the CPU had already moved on further. The only thing SINIX did was kill the faulting process.

> while forgetting the interrupt state bug that forced the introduction of the 68010

Are we actually talking about the same thing then? The Apollo hack I know was because after a bus error, not an interrupt, you could not restart the faulting instruction, presumably because by the time the fault was recognized, the CPU had already continued executing an indeterminate amount. This does not matter for regular interrupts, since they are not synchronized to running code and do not restart instructions. It would have been a pretty useless chip otherwise, and yet it thrived even after its descendant was introduced.

The point is, it seemed to me Apollo really wanted to use the 68000 for something it did not support (yet), and other users of the 68000 were not affected (most did not even have the MMU). It was a marvelous, unique hack, that I've actually never seen fully documented or confirmed anywhere (please let me know if you have). The PC would have been amongst the ones not affected, and the chip even spun off variants long after the 68010 was released, not having paged virtual memory just wasn't a dealbreaker for most.

> That wasn't the solution I proposed. I proposed just having the stack reserve an interrupt frame at the top and setting SP to zero before changing SS. The code I showed was just proof that even putative users[1] who couldn't change the OS would be able to work around it.

Yes, you proposed that every single piece of code switching a stack, not just OS code, do some careful dance whenever it switches the stack, on top of making sure that every stack in existence has the proper amount of unused space at its base to support that.

And then all interrupting code would have to immediately switch to its own stack (which, while required today, wasn't necessary back then, mainly because all code ran with all privileges and plenty of stack), while taking care that all registers can be properly restored on the way back, of course. Or only the NMI handler would, but then you would additionally have to save and restore the interrupt flag around your stack switching.

But okay, let's agree to disagree then. You say there is a reasonable software workaround, I say there is not, for what to me was a pretty clear design bug. Albeit one that was also easily fixed. In my mind, Intel did the sane thing back then by simply patching it up. Can't say that it hurt them.


This has gone completely off the rails, and I simply don't understand why you're being so combative given you seem to have a genuine interest in the same stuff I do. But in the interests of education:

> I thought the original 68000 just did not support virtual memory with paging, and the 68010 introduced that

The 68000 "did not support virtual memory" because instructions that caused a synchronous[1] bus exception could not be restarted. It was a bug with the state pushed on the stack (though I forget the details). In fact it was a bug[2] almost exactly like the one that produced the stack mess we're talking about. And Motorola never bothered to fix it. Intel could likewise have announced "you can only assign to SS when SP is zero" and refused, with little change in device success. And I find it surprising that they didn't.

[1] Because technically the bus isn't synchronous on a 68k, thus the obvious-in-hindsight design thinko. See [2].

[2] A design error following from the failure of the engineers to anticipate the interaction between two obvious-seeming features (processor use of the stack for interrupts, and stack segments in the former, hardware bus errors being used to report synchronous instruction results in the latter).


I'm not sure I'm the one being combative, to be honest. After all, Intel at the time seemed to think this was a design bug with consequences significant enough that they decided to fix it in some unused part of the mask. And I just agree with that.

I've been privy to such discussions myself, albeit in modern times. I'm familiar with silicon engineers and kernel engineers coming together to discuss a CPU bug or design mishap, and coming up with all sorts of crazy workarounds. Until it's eventually decided that this is just too much to inflict, and it's fixed.

Since you are interested in the 68000 thing, I think I do know the details and can explain, because I have before looked at the difference between the 68000 and the 68010 that makes paging possible. Let's look at the 68010 manual here: http://www.bitsavers.org/components/motorola/68000/MC68010_6...

(Notice how it already says "VIRTUAL MEMORY MICROPROCESSORS" on the front while the 68000 manual in the same directory does not, though I suppose it could be a good marketing move if there was an actual unintended bug, and Motorola decided to advertise the fix as a new CPU?)

Page 5-12: "Bus error exceptions occur when external logic terminates a bus cycle with a bus error signal. Whether the processor was doing instruction or exception processing, that processing is terminated, and the processor immediately begins exception processing."

It specifies external logic, which would be the MMU in the interesting cases. Everyone not using an MMU did not care, and bus errors were still valuable with or without MMU to signal illegal accesses and crash the program that caused it. The full problem is elaborated on the next page: "The value of the saved program counter does not necessarily point to the instruction that was executing when the bus error occurred, but may be advanced by up to five words. This is due to the prefetch mechanism on the MC68010 that always fetches a new instruction word as each previously fetched instruction word is used (see 7.1.2 Instruction Prefetch)."

The 68010 fixed it by adding enough (opaque, undocumented) internal state for the CPU on the stack to effectively undo what it did so far and restart the instruction, that state is shown (without detail, and subject to change between revisions) on the diagram.

The 68000 manual simply says: "Although this information is not sufficient in general to effect full recovery from the bus error, it does allow software diagnosis" (which should include a multitasking OS killing the task and proceeding), and none of the internal state exists on the stack.

But you may still well be right that Motorola did intend for bus errors to be restartable and failed. They would not say so explicitly in the manual of course.

But from what I can see this still only affected systems with an MMU, and then only if they wanted to do full virtual memory including paging. SINIX on the PC-D did not, they were content with swapping and merely detecting illegal accesses, like other UNIXes at the time. The 8086 problem affected everyone, including in application code.

If you know more than what I've cobbled together and can elaborate, I'd be extremely interested in it, because the Apollo DN100 hack always fascinated me to no end (as should be obvious).


I might add, it's obvious that you are competent by the way. I think we just disagree on the severity here.


> That doesn't sound like a hardware bug to me at all.

It isn't. It's a rare case of a software bug being worked around in hardware. That alone makes it fascinating.

The natural way of handling it in software would be to disable interrupts temporarily while changing the stack segment and pointer. You could just say that software developers should do that, but there's a performance cost, and more importantly: The rare random failures from the software already out there that didn't do this, would give your precious new CPU a reputation for being unstable. Better to just make it work.


As others have already noted: yes, software could disable interrupts temporarily while changing the stack segment and pointer, but on these processors, there's a second kind of interrupt ("non-maskable interrupt" aka NMI) which cannot be disabled.

(Some machines using these processors have hardware to externally gate the NMI input to the processor, but disabling the NMI through these methods defeats the whole purpose of the NMI, which is being able to interrupt the processor even when it has disabled interrupts.)


I saw those other 8087+NMI comments, but the 8087 is from 1980, and I thought this fix was older than that.


The 8087 just happened to be one particular source of NMIs because IBM decided to connect it to the 8086’s NMI line. NMIs can be used for many things. Without this fix, they’re breaking stuff, at least without another hideous hardware workaround.


> I understand the errata well enough, it implies that the processor state is indeterminate between the segment assignment and subsequent update to SP.

The root cause was effectively a bug in the programming model, not in the implementation -- there was no way to atomically relocate the stack (by updating SS and SP). The processor state wasn't indeterminate between the two instructions that updated SS and SP, but the behavior resulting from an interrupt at that point would almost certainly be unintended, since the CPU would push state onto a stack in the wrong location, potentially overwriting other stack frames, and the architecture as originally designed provided no way to prevent this.


It's not that the state is indeterminate; it's that a programmer would have some serious trouble ensuring that it was possible to dump registers at that moment.


No, that would just be a software bug (and indeed, interrupt handlers absolutely can't just arbitrarily trust segment state on 8086 code where app code messes with it, for exactly that reason). The errata is a hardware error: the specified behavior of the CPU after a segment assignment is clear per the docs, but the behavior of the actual CPU is apparently different.


The CPU itself pushes flags and return address on the stack during an interrupt. If it's an NMI, you can't prevent an interrupt between mov/pop ss and mov/pop sp, which most systems will always have a need to do eventually.

Therefore, the CPU has consistent state, but an extra fix was still necessary just to prevent interrupts after a mov/pop ss.


The 8086 didn't have a separate stack for interrupts.

So, as I understand it, the problem isn't that code running in the interrupt handler might push data to a bad place; the problem is that the processor itself would do so when pushing the return address before branching to the interrupt handler.


> the problem is that the processor itself would do so when pushing the return address before branching to the interrupt handler.

I want to emphasize this point: when handling an interrupt, the x86 processor itself pushes the return address and the flags into the stack. Other processor architectures store the return address and the processor state word into dedicated registers instead, and it's code running in the interrupt handler which saves the state in the stack; that would easily allow fixing the problem in software, instead of requiring it to be fixed in hardware (a simple software fix would be to use a dedicated save area or interrupt stack, separate from the normal stack; IIRC, it's that approach which later x86 processors followed, with a separate interrupt stack).
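
To illustrate, a minimal sketch of a real-mode 8086 handler; the stack layout is what the CPU's own pushes produce:

    ; On interrupt entry the 8086 has already pushed, on the current stack:
    ;   SS:SP+4 -> saved FLAGS
    ;   SS:SP+2 -> saved CS
    ;   SS:SP   -> saved IP
    ; so a handler only needs to save whatever it clobbers and IRET:
    push ax
    ; ... handler body ...
    pop ax
    iret            ; pops the IP, CS and FLAGS that the CPU pushed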


> with a separate interrupt stack

Some detail here: https://www.kernel.org/doc/html/latest/x86/kernel-stacks.htm...


> While reverse-engineering the 8086 from die photos, a particular circuit caught my eye because its physical layout on the die didn't match the surrounding circuitry.

Is there software that builds/reverses circuitry from die photos, or is this all manual work?


Check out the Visual 6502 project for some gruesome details: http://www.visual6502.org/

The slides are a great summary: http://www.visual6502.org/docs/6502_in_action_14_web.pdf


Commercial reverse-engineering places probably have software. But in my case it's all manual.


A Twitter thread by the author can be found here:

https://twitter.com/kenshirriff/status/1596622754593259526



Usually if I post a Twitter thread, people on HN want a blog post instead. I'm not used to people wanting the Twitter thread :)


Blogs are too readable.

I prefer the challenge of mentally stitching together individual sentences interspersed by ads.


Well I've got great news for you (1/4)


Twitter threads force the author to be more concise and split the thoughts into smaller pieces. And if there are pictures attached to the tweets, that makes them even better.

It's a very good starting point, getting a summary of the topic, before diving into a longer blog post knowing what to expect and with some knowledge you probably didn't have before.

The way I'm learning best is having some skeleton that can be fleshed out. A concise Twitter thread builds such a skeleton if I didn't have it already.


Forcing conciseness on nuanced topics usually does more harm than good. And the format of Twitter is abhorrent.

Like, even your comment would need to be split into 2 twits

Then again I read pretty fast so someone writing slightly longer sentence than absolutely necessary doesn't really bother me in the first place


> Forcing conciseness on nuanced topics usually does more harm than good.

Following that argument you can't ever build up proper knowledge unless you get a deep introduction into a topic.

> And the format of Twitter is abhorrent.

I disagree but you need to write to its format. Just writing a blog post and then splitting it up for cramming as much as possible into a tweet, possibly even breaking within a sentence, yes, that is abhorrent.

> Like, even your comment would need to be split into 2 twits

3 even if you preserve the paragraphs I used.

> Then again I read pretty fast so someone writing slightly longer sentence than absolutely necessary doesn't really bother me in the first place

It's not about reading slowly; it's about getting small pieces of information. Tweets usually don't have the fluff that many blog posts do. Or even articles that take up one third of the whole piece to describe a person in a coffee shop, drinking a specific brand of coffee, to illustrate some superficial point that has a vague connection to the actual topic of the article.


For Odin's sake, PLEASE don't add the mental complexity of just trying to read a Twitter thread on top of the mental complexity of grokking your posts. This is what blogs are for.


The Twitter thread is better. You get the full context of other peoples' reactions, comments, references, etc. This is lost with blogs.

It's the same reason why many people go to the Hacker News comments before clicking the article. :)


>The Twitter thread is better.

let's recognize subjectivity a bit here. it's different, you may like it more -- that's great , but prove 'better' a bit better.

i've said this dozens of times, and i'm sure people are getting tired of the sentiment, but Twitter is a worse reading experience than all but the very worst blog formats for long-format content.

ads, forced long-form formatting that breaks flow constantly, absolutely unrelated commentary half of the time (from people that have absolutely no clout or experience with whatever subject), and the forced MeTooism/Whataboutism inherent with social media platforms and their trolling audience.

It gets a lot of things right with communication, and many people love reading long-form things on Twitter -- but really i'm just asking you to convince me here ; there is very little 'better' from what I see : help me to realize the offerings.

tl;dr : old man says 'I just don't get IT.' w.r.t. long-form tweeting.


It should be very obvious that when someone says "X is better" they mean that — in their estimation and according to their beliefs, X is better. Spelling it all out isn't always necessary.


I don't think that's obvious. You could well mean that "X is better" in general, and consequently even advocate that "so Y, which is not better, is a waste of resources and should not be done".

It's hard for an outside observer to judge which one you mean, and just in case you meant the advocating one, people might feel the need to discuss it to prevent damage.

It's easy for you to clarify with "specifically for me" or something similar.

And there even is something that I think is generally better, namely that I've seen this blog (and others) attract long form comments from people who were also there, and elaborate some more historical details in them. On Twitter it's mostly condensed to the bare minimum, which you can see on this very Twitter thread! So losing the blog would be bad.


In subjective media, sure. But in tech-focused social media, not really. There are mostly objective things (like too little contrast, or a site working slowly) that can be said about readability, for example.


> absolutely unrelated commentary half of the time (from people that have absolutely no clout or experience with whatever subject)

Yeah, you can see that by just looking at the comments for the first tweet there. A few are maybe interesting nuggets of information (but a comment on the blog or on HN might have elaborated more); the rest is... exactly what you say.


I like having the twitter/mastodon thread easily linked, but 10/10 times would rather see the article first.


I won't quite say I don't care about other people's reactions (seeing as I'm here on HN, caring about other people's reactions), but Twitter makes it annoying enough to follow the original multi-tweet material such that I greatly prefer the blog post.


So this was all to avoid having to re-lay out everything and cut new rubylith, right? And at this point it was all still done by hand?

I suppose you’d have to re-test everything with either a new layout or this fix, so no real cost to save there?


> And at this point it was all still done by hand?

Sort of. From what I've gathered, at this point Intel was still doing the actual layout with rubylith, yes, but in small modular sections. The sheets were then digitized and stitched into the final whole in software; there wasn't literally a giant 8086 rubylith sheet pieced together by hand, unlike just a few years before with the 8080. But the logic and circuits etc. were on paper, and there was no computer model capable of going from that to layout. The computerized mask was little more than a digital image. So a hardware patch it would have to be, unless you wanted to redo a lot.

Soon, designers would start creating and editing such masks directly in CAD software. But those were just giant (for the time) images, really, with specialized editors. Significant software introspection and abstraction for handling them came later. I don't think the modern approach of full synthesis from a specification really came into use until the late '80s.


The 286 was an RTL model hand-converted, module by module, to a transistor/gate-level schematic. AFAIR the 386 was the first Intel CPU where they fully used synthesis (work out of UC Berkeley by https://vcresearch.berkeley.edu/faculty/alberto-sangiovanni-..., a co-founder of a little company called Cadence) instead of manual routing. Everything went through logic optimizers (multi-level logic synthesis) and would most likely be unrecognizable.

I found a paper on this 'Coping with the Complexity of Microprocessor Design at Intel – A CAD History' https://www.researchgate.net/profile/Avinoam-Kolodny/publica...

>In the 80286 design, the blocks of RTL were manually translated into the schematic design of gates and transistors which were manually entered in the schematic capture system which generated netlists of the design.

>Albert proposed to support the research at U.C. Berkeley, introduce the use of multi-level logic synthesis and automatic layout for the control logic of the 386, and to set up an internal group to implement the plan, albeit Alberto pointed out that multi-level synthesis had not been released even internally to other research groups in U.C. Berkeley.

>Only the I/O ring, the data and address path, the microcode array and three large PLAs were not taken through the synthesis tool chain on the 386. While there were many early skeptics, the results spoke for themselves. With layout of standard cell blocks automatically generated, the layout and circuit designers could myopically focus on the highly optimized blocks like the datapath and I/O ring where their creativity could yield much greater impact

>486 design:

    A fully automated translation from RTL to layout (we called it RLS: RTL to Layout Synthesis)
    No manual schematic design (direct synthesis of gate-level netlists from RTL, without graphical schematics of the circuits)
    Multi-level logic synthesis for the control functions
    Automated gate sizing and optimization
    Inclusion of parasitic elements estimation
    Full chip layout and floor planning tools


Thanks. Fascinating!


I remember the Pentium was marketed as "the first computer entirely designed on CAD" well into the '90s. But I'm not sure how real that marketing message was.


That's a good advertisement for 486s because they were baller enough to pull that off.


Yes. A hardware patch.


And even today this kind of manual hand-patching is still done for minor bug fixes (instead of 'recompiling' the layout from the RTL code).

One motivator is that if the fixes are simple enough, the modifications can be done such that they only affect the top-most metal layer, so there is no need to remanufacture the masks for the other layers, saving time and $$$.


Ah yes, the infamous mov ss / pop ss that caused many a debugger to fail and be detected.


Fantastic work by Ken yet again.

It's striking how many of the early microprocessors shipped with bugs. The 6502 and 8086 both did, and it was an even bigger problem for processors like the Z8000 and 32016, where the bugs did much to hinder their adoption.

And the problem of bugs was a motivation for both the Berkeley RISC team and the Acorn ARM team when choosing RISC.


All the current processors are shipping with known bugs, including all Intel and AMD CPUs and all CPUs with ARM cores, and even every microcontroller with which I have ever worked, regardless of manufacturer.

The many bugs (usually from 20 to 100) of each CPU model are enumerated in errata documents with euphemistic names like Intel's "Specification Update" and AMD's "Revision Guide". Some processor vendors have the stupid policy of providing the errata list only under NDA.

Many of the known bugs have workarounds provided by microcode updates, a few are scheduled to be fixed in a later CPU revision, some affect only privileged code and it is expected that the operating system kernels will include workarounds that are described in the errata document, and many bugs have the resolution "Won't fix", because it is considered that they either affect things that are not essential for the correct execution of a program, e.g. the values of the performance counters or the speed of execution, or because it is considered that those bugs happen only in very peculiar circumstances that are unlikely to occur in any normal program.

I recommend reading some of the "Specification Update" documents from Intel to understand the difficulties of designing a bug-free CPU, even if much of the important information about the specific circumstances that cause the bugs is usually omitted.


Fascinating!

Are there any good documentary-style videos that dive into some of these details of chips and microprocessor architectures?


I believe this one-instruction disabling of interrupts is called the 'interrupt shadow'.
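
Right, and the effect on fixed parts is that the one instruction immediately following a load of SS runs with interrupts (per the article, even NMI) held off, so the idiom becomes (sketch):

    mov ss, ax      ; starts the interrupt shadow
    mov sp, bx      ; runs before any interrupt can be accepted
    ; putting anything other than the SP load here would squander the shadow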


I started reading this article and immediately thought "That's something Ken Shirriff could have written", but I didn't recognize the domain name.

Lo and behold... looks like Ken just changed his domain name.

Amazing work, as always, Ken.


It's the same domain name that I've had since the 1990s.


Maybe the parent is from another dimension. What would be your second choice for a domain?


Fantastic read. I’ll note for others this blog has an RSS feed (to which I’m now subscribed!).


Just brainstorming a software workaround. It seems like you could abuse the division error interrupt, and manipulate the stack pointer from inside its ISR? I assume the division error interrupt can't be interrupted since it is interrupt 0?


There's no stopping an NMI (non-maskable interrupt), even if you're in a division error interrupt.


That's some funny priority inversion! Reading up on it, the division error interrupt will be processed by the hardware first (pushed to the stack), but then be interrupted by NMI before the ISR runs.


“Interrupt 0” wasn’t its priority, but its index. IIRC, the 8086 doesn’t have the concept of interrupt priority. Upon encountering a #DE, the processor would look up interrupt 0 in the IVT (located at 0x00000 through 0x00003 in “real mode”).
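
For context, each real-mode vector is just a far pointer in a table at the bottom of memory, so installing one is a couple of plain memory writes. A sketch in MASM-style syntax (my_de_handler is a hypothetical label):

    ; vector n occupies 4 bytes at 0000:n*4 -- handler offset first, then segment
    xor ax, ax
    mov es, ax                                    ; ES = 0000h, the IVT's segment
    mov word ptr es:[0*4], offset my_de_handler   ; IP of the divide-error handler
    mov word ptr es:[0*4+2], cs                   ; CS of the handler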


I'm more surprised that it already had a microcode equivalent, even if it was essentially hardcoded.


I have a textbook on processor design from 1979, and microcode was the go-to technique described in it.


Wikipedia (https://en.wikipedia.org/wiki/Microcode#History) dates it back to 1951 and Maurice Wilkes.



