Hacker News new | past | comments | ask | show | jobs | submit login
My, what strange NOPs you have! (msdn.com)
69 points by barrkel on Jan 12, 2011 | hide | past | favorite | 29 comments



I can't help but think about the x64 NOP story: http://www.pagetable.com/?p=6


Early developers used to put NOPs into code, so they would have a place to insert hex patches later on. At a minimum they could plug in a branch instruction, bounce down to the end of the code segment, put a longer sequence of code there, then branch back. Old school stuff.


Modern compilers may actually do this as well, though for slightly different reasons -- GCC (at the right optimization level) likes to put branch targets and functions at aligned addresses, so you often end up with little pads of NOP instructions scattered throughout your binaries (run 'objdump -d' on something built with 'gcc -O2' to see firsthand, if you're interested).
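Those alignment pads are easy to spot programmatically. Here's a toy Python sketch that scans raw code bytes for runs of the standard multi-byte NOP encodings (the encodings themselves are the documented x86 forms; the scanner is illustrative and not tied to any real object-file format):

```python
# Recognizers for the common multi-byte NOP encodings emitted as
# alignment padding, longest first so matching is greedy.
NOP_ENCODINGS = [
    bytes([0x0F, 0x1F, 0x44, 0x00, 0x00]),  # 5-byte: nopl 0x0(%rax,%rax,1)
    bytes([0x0F, 0x1F, 0x40, 0x00]),        # 4-byte: nopl 0x0(%rax)
    bytes([0x0F, 0x1F, 0x00]),              # 3-byte: nopl (%rax)
    bytes([0x66, 0x90]),                    # 2-byte: xchg %ax,%ax
    bytes([0x90]),                          # 1-byte: nop
]

def find_nop_pads(code: bytes):
    """Return (offset, length) runs of consecutive NOP padding."""
    pads, i = [], 0
    while i < len(code):
        start = i
        while i < len(code):
            for enc in NOP_ENCODINGS:
                if code.startswith(enc, i):
                    i += len(enc)
                    break
            else:
                break                       # byte isn't any NOP form
        if i > start:
            pads.append((start, i - start))
        else:
            i += 1                          # skip the non-NOP byte
    return pads
```

Running it over, say, a `ret` followed by a 4-byte NOP and a plain 0x90 reports one 5-byte pad starting at offset 1.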

I actually have a mostly-written LD_PRELOAD library lying around that exploits this for purposes more like the ones you describe, though -- hot-patching code in memory at program load to dispatch system calls via little dynamically-generated trampolines so you can insert calls to arbitrary tracing functions. Perhaps I'll polish it up a bit and toss it on github...



Nowadays MSVC supports the /hotpatch command line option, which inserts a two-byte filler like "mov edi, edi" or "lea edi, [edi]" at the start of each function for the same purpose.
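The mechanism can be sketched at the byte level. This toy Python model assumes the usual layout: five padding bytes before the function (the 0xCC filler here is illustrative) followed by the two-byte "mov edi, edi" (8B FF). Patching writes a long jmp into the padding and turns the filler into a short jmp back onto it -- this is a schematic sketch, not real Windows patching code:

```python
def hotpatch(image: bytearray, func_off: int, target: int) -> None:
    """Redirect the function at func_off to target (toy model)."""
    pad = func_off - 5                        # the 5 padding bytes
    # jmp rel32 is relative to the end of its own 5-byte encoding,
    # which is exactly func_off.
    rel32 = (target - func_off) & 0xFFFFFFFF
    image[pad] = 0xE9                         # jmp rel32 into the padding
    image[pad + 1:pad + 5] = rel32.to_bytes(4, "little")
    # Overwrite "mov edi, edi" with "jmp -7": a 2-byte hop back
    # onto the long jmp we just wrote.
    image[func_off:func_off + 2] = b"\xEB\xF9"

# Hypothetical image: 5 padding bytes, then "mov edi, edi", then "ret".
image = bytearray(b"\xCC" * 5 + b"\x8B\xFF" + b"\xC3")
hotpatch(image, func_off=5, target=0x100)
```

The key property is that the two-byte filler can be replaced atomically even while other threads are executing, which is why the compiler reserves it up front.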


See also: the Detours library from Microsoft Research (http://research.microsoft.com/en-us/projects/detours/). It doesn't require you to recompile your binaries to add NOPs.


Out of interest, why does this need a NOP? Why not replace an existing instruction with a branch, bounce down to the end of the code segment, put a longer sequence of code there, put the instruction that you replaced there, and then jump back?

Is it that the patch could be conditional and not incur as much of a performance penalty if the condition isn't met and the patch doesn't run? Would this have made a significant difference to performance?


Depending on the instruction, it might not be long enough to hold a branch. Then you'll have to replace two instructions, and if the program is executing (a live/hot patch) you must be extra careful that no thread has executed the first instruction but not yet the second.
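The length problem is easy to quantify. Assuming a 5-byte near jmp at the patch site (the common x86 case), this sketch counts how many of the following instructions the overwrite spills into -- every one of them has to be relocated to the trampoline:

```python
def instructions_clobbered(lengths, patch_len=5):
    """lengths: sizes of the consecutive instructions at the patch site.
    Returns how many of them a patch_len-byte overwrite clobbers."""
    covered, count = 0, 0
    for n in lengths:
        if covered >= patch_len:
            break
        covered += n
        count += 1
    if covered < patch_len:
        raise ValueError("not enough code to hold the patch")
    return count

# A 2-byte instruction followed by a 3-byte one: the 5-byte jmp
# overwrites both, so both must be copied out and re-executed --
# and a thread sitting between them mid-patch would crash.
```

That partially-overwritten window is exactly the hazard the parent comment describes, and why a dedicated NOP (or "mov edi, edi") pad makes live patching so much simpler.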


Some of these MSDN blogs almost feel like a non-Apple folklore.org.

I like it.


I discovered Raymond Chen's blog some weeks ago and started reading every post I could; it's fascinating and very well written.

All the problems that Windows developers had to face for compatibility almost make me sympathize with them. Some context is given in Joel Spolsky's "How Microsoft Lost the API War" (http://www.joelonsoftware.com/articles/APIWar.html), where he talks about "The Raymond Chen Camp".


In addition to a regular NOP, the Motorola 6809 (and possibly other 6800-series processors) has a BRN instruction: branch never. It's the opposite of BRA (branch always): it takes an 8-bit offset and ignores it, effectively making it a two-byte NOP.


On PIC microcontrollers, it's common to use a "goto $+1" instruction as a two-cycle NOP to save an instruction word (every word counts when you only have 256 words of ROM and 32 bytes of RAM). It just branches to the next instruction.
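The trade-off is simple arithmetic: under the assumed costs (nop = 1 cycle in 1 word, "goto $+1" = 2 cycles in 1 word, typical of baseline PICs), you can fill any cycle count in roughly half the program words. A toy sketch of a delay-sequence generator:

```python
def delay_sequence(cycles):
    """Emit a minimal-word instruction list burning exactly `cycles`
    cycles, preferring the 2-cycle/1-word "goto $+1" over nop."""
    seq = []
    while cycles >= 2:
        seq.append("goto $+1")   # 2 cycles, 1 word
        cycles -= 2
    if cycles:
        seq.append("nop")        # 1 cycle, 1 word
    return seq
```

A 5-cycle delay comes out as two gotos plus one nop -- three words instead of the five that nops alone would cost.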

See http://www.piclist.com/techref/piclist/codegen/delay.htm


I remember that from the 6811s we programmed in school. BRN also takes more cycles to complete than a standard NOP (or 2 NOPs) since it performs some branching calculations before it ends up doing nothing.


Honestly I found this part the most interesting:

> Late in the product cycle (after Final Beta), upper management reversed their earlier decision and decided not to support the B1 chip after all.

After all that work -- on compilers, processes, tools, etc. -- it was all scrapped anyway. I wonder if there is a lesson to be learned there. (The usual startup mantras don't really seem to apply!)


If it was the right thing to do, then it was the right thing to do despite any previous work. See: http://en.wikipedia.org/wiki/Sunk_cost


Sure, decisions were made on the battlefield that, at the time and under the circumstances, were optimal (or at least local maxima). This kind of glib remark doesn't really prevent anyone else from making the same mistakes.

I don't know what the correct approach should have been, but it seems that changes to support other architectures should have been made independently from other fixes and tagged such that they could have been backed out rather than contaminating your source and pushing it to gold. (Remember this is before the days of Windows Update!)


> This kind of glib remark doesn't really prevent anyone else from making the same mistakes.

Where's the mistake? Wouldn't backing out the workarounds introduce new potential for error due to other changes made after the workarounds were put in? Why do the workarounds need to be backed out? They don't break anything. If it ain't broke...

My point is that it wasn't really a mistake; it was a change in circumstance.


Fascinating read.

Does anyone have any examples or explanations for how these CPU bugs are caused? I probably have enough knowledge of CPU architecture from school to follow an explanation, but not enough to make my own guess.


Circuit-level (layout) problems I'd guess make up a decent portion. I heard (from a guy who worked on it) about a bug in a prototype version of a processor from a major company that didn't cause any correctness problems but was a major performance problem: one of its four cache ways simply didn't have its power rails connected.

Some CPU verification code I wrote on an internship a couple years ago discovered a few bugs in a certain fairly widely-used processor, though I'm pretty sure they were all logic-level problems (i.e. RTL bugs, not circuit level ones)...

- The L1 D-cache tracked clean/dirty status at half-cache-line granularity, and if you did a store (with just the right timing) to one half of a cache line you had just explicitly cleaned with a cache-clean instruction, the dirty bit wouldn't get set on that half line, so as soon as the cache got flushed the data written by the store was lost.

- The prefetcher would shut down sometimes as a power saving technique, but if you laid out the right sequence of cache operations and branches in the last 32 bytes of a 4KB page, sometimes concurrent TLB misses would cause it to not get re-enabled, meaning the processor would lock up, stop fetching instructions and just sit there dead in the water until an interrupt came in (assuming interrupts were enabled).

There were a couple more, but I thought those were the more interesting ones. Granted, these weren't bugs that were likely to be encountered in normal usage for various reasons (in addition to being extremely difficult to reproduce sometimes -- i.e. on one in particular you could run the exact same sequence of instructions from system power-on and sometimes it happened, sometimes it didn't), but bugs nonetheless.


I didn’t work on the 386, and bugs can be caused by anything, but based on bugs that I’ve seen on processors that I’ve worked on, my guess would be that there was some forwarding logic designed to speed up consecutive 16-bit operations and consecutive 32-bit operations, along with some logic to detect when to apply the forwarding logic.

If the detection logic is wrong, you could easily end up forwarding 16 good bits + 16 bits of random garbage into one input of a 32-bit operation. That would explain Raymond’s "if all the stars line up exactly right" line, since the hole in the forwarding logic must have been really small (or it would have been caught in testing).
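The failure mode being speculated about can be modeled in a few lines. This is purely a toy illustration of the guess above (the latch, the detection flag, and all values are hypothetical, not anything known about the actual 386):

```python
def forward(producer_result, latch_upper_garbage, consumer_is_32bit,
            detection_says_forward_16):
    """Toy model: a forwarding latch that only captures the low 16 bits
    of a result.  If the bypass-detection logic wrongly applies the
    16-bit path to a 32-bit consumer, the consumer sees 16 good bits
    plus whatever stale bits were left in the latch's upper half."""
    if consumer_is_32bit and detection_says_forward_16:
        # The buggy case: 16 good bits + 16 bits of stale garbage.
        return (latch_upper_garbage << 16) | (producer_result & 0xFFFF)
    # The correct case: the full result comes through.
    return producer_result & 0xFFFFFFFF
```

When the detection flag is wrongly set, 0x12345678 arrives as 0xXXXX5678 -- corruption only under one narrow timing/operand pattern, which fits the "all the stars line up" description.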


If it's repeatable, then it's a bug in the logic or microcode of a chip. Some of these could be really extreme corner cases, especially when people are doing things with mixing 16-bit and 32-bit instructions.

It's amazing how much hardware out there is buggy and fixed by operating systems and drivers.


The article also explains why NOP is 0x90 on the 8086. On the Z80, NOP is 0x00 which I always thought was more logical (since you can zero out memory or code).


One could make a case against using 0x00 as NOP to prevent a rogue program from NOPping its way across zero-filled pages and into other parts of memory.


If that was the goal, wouldn't you want to reserve 0x00 as a HALT or some sort of interrupt-firing instruction?


If 0x00 isn't a valid opcode, as soon as one is hit, the CPU will fire an invalid instruction interrupt (if it supports invalid instruction interrupts), allowing the operating system (if any) to do something about it.

However, since the original comment was about the Z80 which according to http://www.z80.info/decoding.htm treats invalid instructions as a NOP, the point is moot.
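The two design choices play out like this in a toy fetch loop (NOP semantics sled straight across a zero-filled page; trap semantics fault on the first zero byte):

```python
def run(memory, pc, zero_is_nop):
    """Step a pretend CPU through memory starting at pc."""
    while pc < len(memory):
        op = memory[pc]
        if op == 0x00:
            if not zero_is_nop:
                return ("trap", pc)     # invalid-opcode fault: OS can act
            pc += 1                     # NOP: keep sledding
        else:
            return ("exec", pc)         # reached some real code
    return ("ran off end", pc)

# A zero-filled page, then a real opcode (0x76 happens to be HALT on the Z80).
mem = bytes(16) + b"\x76"
```

With zero-as-NOP the runaway program sleds all the way to offset 16 and executes whatever sits there; with zero-as-trap it faults immediately at offset 0.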

By the way, z80.info is the type of site that made me fall in love with the web -- it's full of information compiled by people doing it not for AdSense impressions, but just for the love of sharing knowledge. I miss the days when all of my Google searches would take me to these sites instead of the contentless content farms of today.


Fascinating read, even though I never had the balls to program in ASM.

One of the commenters said that he would buy a book of this story and more, and someone replied that there is in fact already a book by him:

"Old New Thing, The: Practical Development Throughout the Evolution of Windows" by Raymond Chen

http://www.informit.com/store/product.aspx?isbn=0321440307


There were 8 NOP-equivalents on the x86: XCHG AX,AX was the 1-byte 0x90, while XCHG BX,BX etc. were the 2-byte 0x87 0xXX, where the second byte selected the general-purpose registers.

Very strange for a frequency-encoded instruction set to let such short opcode sequences go to waste.
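For reference, the two encodings can be sketched like this -- a minimal model of the 8086's 16-bit register-register XCHG forms (register operands only; memory operands and prefixes are out of scope):

```python
# Registers in their x86 encoding order (index = 3-bit register number).
REGS16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

def encode_xchg(r1, r2):
    """Encode XCHG r16, r16 the way an 8086 assembler would (sketch)."""
    a, b = REGS16.index(r1), REGS16.index(r2)
    if r1 == "ax":
        return bytes([0x90 + b])        # 1-byte form: xchg ax, r16
    if r2 == "ax":
        return bytes([0x90 + a])
    modrm = 0xC0 | (a << 3) | b         # mod=11: register-register
    return bytes([0x87, modrm])         # 2-byte form: 0x87 /r
```

So XCHG AX,AX encodes as the single byte 0x90, while the self-exchange NOPs for the other registers -- XCHG BX,BX and friends -- each need the two-byte 0x87 form.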


Remember that the bits that make up the instructions are not really numbers - they are switches: each bit turns on a section of the CPU, that section checks the next bit and turns on the next part, and so on down to the component that carries out the instruction.

So by having a pattern to your instructions you can optimize the layout of the CPU. XCHG AX,AX does exactly that, with no visible effect, but the action is carried out.

That was then.

These days CPUs decode machine language into internal micro-ops, so you can pick almost any number for an opcode. The decoding is still switches driven by the bits, but you don't see that.


In fact, once the 486 hit, 0x90 was an explicit NOP (1 cycle instead of the 3 that an XCHG usually takes, won't stall the pipeline even if other stuff touches AX, etc.).

Similarly, on x64 0x90 is a true no-op, while stuff like XCHG EBX, EBX clears the upper 32 bits of the register.
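That zero-extension rule is easy to model. A toy sketch of the x86-64 behavior described above (writing any 32-bit register clears the upper 32 bits of the full 64-bit register, while 0x90 is special-cased and touches nothing):

```python
def write_reg32(reg64_value, new_low32):
    """Model an x86-64 write to a 32-bit register: the upper 32 bits
    of the containing 64-bit register are zeroed, not preserved."""
    return new_low32 & 0xFFFFFFFF

rbx = 0xDEADBEEF_00001234
# "xchg ebx, ebx" writes EBX with its own value -- but the write
# still zero-extends, clobbering the upper half of RBX.
rbx_after_xchg = write_reg32(rbx, rbx & 0xFFFFFFFF)
```

So XCHG EBX,EBX is visibly not a no-op in 64-bit mode, which is why 0x90 had to stop meaning "xchg eax, eax" there.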



