X86 assembly doesn’t have to be scary

tvmalsv · on June 18, 2018

When I was about 14 I was enthralled with programming my new Commodore 64, first in BASIC, then 6510 assembly. I had the opportunity to accompany my mother to a one-day class on programming. Being just an intro on the subject, I was well ahead of what they would be discussing, but thought it would interesting to talk to some adults that were also into programming.

I was talking to a couple of guys about what I had been doing on my C=64, and when I mentioned the assembly stuff I was writing, one of them said, "How can you possibly write anything with only three registers?!" (just the accumulator and x/y registers). I was wondering what the big deal was since that was the only architecture I had known at that point. Every game and utility I had were only using three registers, so it was already proven to me that three were "enough".

It's funny how you can just adapt and work with whatever is available, and that becomes your norm. Especially when you don't even realize there are other options out there.

Those were the days!

userbinator · on June 19, 2018

It's funny how you can just adapt and work with whatever is available, and that becomes your norm. Especially when you don't even realize there are other options out there.

Another example of this, also quite relevant to the article, is the 64K address space of a 16-bit segment; to anyone who is used to HLLs, 64-bit address spaces, and gigabytes of RAM, it seems impossibly small. Even more so when you take a modern C++ compiler with all its default settings and produce a 72KB(!) "Hello World" binary.

One of the things you quickly realise when you work with Asm is that it's not impossibly small; everything is os just horribly bloated --- piles of abstractions upon abstractions, using gargantuan statically-linked library functions of which a tiny fraction of the bytes is actually executed, etc. Try writing a nontrivial utility in 16-bit realmode Asm, where a "Hello world" is around a dozen bytes, and you'll find that 64KB is actually quite a lot already. In particular, look at the 256b and below DOS categories in the demoscene:

http://www.pouet.net/prodlist.php?type%5B%5D=32b&type%5B%5D=...

Those effects were accomplished using fewer bytes of machine instructions than the text of this post, which is already over 1KB. Really makes you think.

barrkel · on June 19, 2018

Hello world in real mode is leveraging a bunch of code and data from the BIOS, so it's not quite as small as you're suggesting, but yes, bloat is the cost in general purpose abstractions.

Smart linking is related to GC. Not all the conditional and indirect edges in the control flow graph can be skipped by the linker's graph traversal, so you usually end up with lots and lots of unused code. And then there's all the sledgehammers being used on peanuts, just because they're there, like xpath to retrieve config values.

messe · on June 19, 2018

Not important, I just thought it'd be fun to write but:

> Hello world in real mode is leveraging a bunch of code and data from the BIOS

Not necessarily, you can avoid the BIOS calls by just copying the string directly to video memory:

            mov ax, 0xB800
            mov es, ax
            xor di, di
            mov ah, 0x07
            mov si, hello
    ;; Load character, if zero, jump to end otherwise store
    ;; character and color in video memory
    @@:     lodsb
            test al, al
            jz @f
            stosw
            jmp @b
    ;; Halt and wait for interrupt
    @@:     hlt
            jmp @b

    hello:  db 'Hello, World!', 00h

That's a total of 37 byes including the string.

gecko · on June 19, 2018

I might be missing something, but you're still relying on the BIOS to have the graphical representation of those letters. Right?

messe · on June 19, 2018

As far as I remember, the default VGA font is stored on the graphics card ROM and copied into the video memory by that same onboard ROM, not the motherboard BIOS, so no I don't believe so. I may be mistaken.

dvfjsdhgfv · on June 19, 2018

And that's why I love D. I can get fairly small binaries and still use powerful language constructs so that I can focus on the logic and not on the language.

weinzierl · on June 18, 2018

I also started as a programmer on a C64. BASIC didn‘t get you very far, so assembly it was. The machine code monitor, all the three letter mnemonics and ??? when it couldn‘t disassemble are fond memories.

When I switched to the PC I had great hopes but outright hated it after a short while. All this restrictions on the registers were confusing. The segment modell even more. And good graphics required programming the VGA cards bitplane modes [1] which still makes me dizzy when I think about it. Wished I had an Amiga.

Programming the 6510 and VIC and SID was so smooth in comparison.

[1] In these modes the bits making up a single pixel where not in the same place. To manipulate a single pixel you had to manipulate several bits in different places of the RAM without affecting the adjacent bits in the respective words. Crazy stuff.

Narishma · on June 18, 2018

Didn't the Amiga's graphics system also use bitplanes?

Jupe · on June 18, 2018

Yep, sure did.

user5454 · on June 19, 2018

But with AGA chipset the Amiga got chunky pixels (e.g. write palette index to the framebuffer as in VGA).

gpderetta · on June 19, 2018

That was supposed to be a feature of AAA. I do not think AGA supported chunky pixels.

Jupe · on June 20, 2018

Yep, but for complexity, nothing matched HAM (hold and modify) mode, except maybe the aftermarket DCTV, the details of which escapes me, but I recall it having separate chroma and luminense.

Edit: found some info on DCTV: https://retrocomputing.stackexchange.com/questions/2201/how-...

ghaff · on June 18, 2018

The segment model was the thing I really hated. It didn't help that the DOS program that I sold for a number of years sort of ran out of in-memory space for data and I ended up doing a bunch of whacky things with segments rather than rewrite it from scratch. Which I eventually started to do but, at that point, it didn't make sense to put the work in.

AnIdiotOnTheNet · on June 18, 2018

Zero-page instructions also helped alleviate the problem of not having many registers.

gruez · on June 18, 2018

>"How can you possibly write anything with only three registers?!"

AFAIK you can even go down to 1 register, which is how stack machines work. You might even say it’s 0 registers because it’s not something you can directly access.

MaxBarraclough · on June 18, 2018

A one-register machine is not the equivalent of a pure stack machine.

Wikipedia has a section on accumulator machines: https://en.wikipedia.org/wiki/Accumulator_(computing)#Accumu...

Assuming Wikipedia's article on the PDP-8 is correct, the PDP-8 had just two public registers: the program counter, and the general-purpose 'accumulator' register.

poizan42 · on June 18, 2018

There are many simple microcontroller architectures with only one register, usually called A (accumulator) or W (work register). You just need to constantly load memory to and from the register, nothing weird in that. Sometimes they will call their memory locations registers, in which case they have lots of registers - the terminology gets quite unclear when everything is on the same chip anyways.

A true stack machine does not have any registers, so that would be 0.

blattimwind · on June 18, 2018

RPN calculators are simple stack machines.

tvmalsv · on June 19, 2018

I was going to say "Man, that'd be terrible to work with!" And immediately realized I'm no different than those 8088 guys in that class :) You work with what you have.

mmjaa · on June 19, 2018

The answer is, of course, that the 0-page is a full set of 255 registers.. you just have to know how to use the X/Y/Accumulator to access them properly ... ;)

prirun · on June 20, 2018

I worked at Prime Computer (a minicomputer company) in Detroit when I was 19. Ford was converting their car design program, PDGS (Product Design Graphic System) to Prime. It ran on 16-bit 32K CDC computers. The program was huge and used overlays to swap different features in and out from disk as the user clicked on menu options on a vector display. For the young tikes here, a vector display is like a traffic control display: it draws perfectly straight lines, perfect circles, by moving an electron beam over a continuous phosphor-coated screen. No pixels.

By contrast, the Prime minis had a Multics architecture with virtual memory (no overlay swapping!), something like 8MB of memory, a 32MB disk drive (16MB fixed, 16MB removable cartridge) and could support 2 vector displays on 1 computer. Happy days!

So yeah, for us oldsters (I'm 58), modern software often seems very bloated.

bcheung · on June 18, 2018

Saturn assembly programming (HP48 calculator) even had a concept of a "nibble". You had to chose which portion of the data you wanted to access. The CPU only supported working with 4 bits at a time.

zokier · on June 18, 2018

If your intention is to avoid scaring newbies off, I'm not sure if 16bit real mode, PC boot process, BIOS services and all that arcana that follows is the best place to begin.

msla · on June 18, 2018

I agree with you to an extent: The IBM PC was not a very elegant design, and other computers certainly had less complexity to them. But there is a point I want to make which defends this choice somewhat:

Newbies like me are more frustrated by thinking there's no path from introductory material to something useful or realistic. Something that's more immediately friendly would be a simplified virtual machine with no or trivial peripheral hardware, like Redcode in Core War:

http://vyznev.net/corewar/guide.html

The death of education is "So What?": "OK, I've learned to play Core War and I know enough to write a warrior that can occasionally beat other warriors written by beginners. So far, so good... so what? I want to program computers in assembly, not play games using assembly programs, so where's the path from where I am now to where I want to be?"

In this case, the pathway from VM opcodes to native opcodes is hardware interface and, while the IBM PC has some funky hardware, it's heartening to know that your code could, in principle and barring emulation errors, run unmodified on real hardware and do something. Not something useful, but you can get to useful. You can ramp up to it, now that you have the pathway in front of you. It might be a long pathway, but it's there.

zokier · on June 18, 2018

Well, I had in mind something more useful and realistic, not less; namely teaching assembly in normal Linux environment. Sure, there are all sorts of complexities there too, but in general I feel like they are also more worthwhile.

msla · on June 18, 2018

> Well, I had in mind something more useful and realistic, not less; namely teaching assembly in normal Linux environment.

This might sound odd, but as someone who's done some assembly programming under Linux, it usually isn't different enough from C to be worth the effort. Even if you eschew libc, the kernel APIs are still fairly high-level and, more to the point, there are no new concepts relative to C: You have pointers and pointer arithmetic, you have fixed-size buffers, you have ints and doubles, and the rest is just syntax.

The exception is doing something like a really tight high-performance kernel using SIMD opcodes, which necessarily involves learning a lot about the specific SIMD hardware and data organization to optimize cache use and other details C can't express.

e12e · on June 18, 2018

Indeed. Maybe something along the lines of (or building up to) :

"Sshtalk: An SSH-based chat made in assembler" : https://news.ycombinator.com/item?id=15829206

The library comes with a lot, but network stuff and Unicode support to name a few:

https://2ton.com.au/HeavyThing/

jonhendry18 · on June 19, 2018

It seems like a better introduction would have the aim of helping the newbie learn assembly that would be useful in debugging or working around bugs in code that you don't have source to.

JustSomeNobody · on June 18, 2018

Why not? If they're determined enough, they'll get it. I learned a lot back in the day by spending many hours with the book PC Intern and some other book on x86 Assembler.

Granted, I had a little experience with assembler on a C=64, but, I learned _that_ in a similar way.

xboxnolifes · on June 18, 2018

> If they're determined enough, they'll get it.

That's the exact opposite mindset of someone trying to "not scare newbies away."

krylon · on June 18, 2018

On the other hand, I think it is fair to assume that people who are interested in assembly programming at all bring a certain degree of determination to the table.

JustSomeNobody · on June 18, 2018

I think the determination of newbies is being underestimated. A lot of us "old timers" didn't have the internet. Just a few books and a whole lot of "Imma figure this crap out!".

pjmlp · on June 18, 2018

That arcana felt liberating coming from a Timex 2068.

Different times, I know.

linker3000 · on June 18, 2018

Different times - in X86 Assembly:

https://github.com/linker3000/Historic-code-PC-Pascal-and-AS...

ChrisSD · on June 18, 2018

I wonder if the history of x86 is holding us back in a big way. It started out being close to the metal but now it's an abstraction that can mislead you if you think processors are literally working the way x86 assembly describes.

And surely the whole spectre issue could be lessened if we could be less reliant on CPUs having to guess what to keep in cache, which code paths are most likely, etc?

blattimwind · on June 18, 2018

> I wonder if the history of x86 is holding us back in a big way. It started out being close to the metal but now it's an abstraction that can mislead you if you think processors are literally working the way x86 assembly describes.

But when you compare high-end cores, you always see the same picture, regardless of ISA. Large surface area (~1/3 of the core) used for insn fetch/decode/schedule; they look all the same, whether it's POWERn or the latest Sandy-Bridge/Haswell rehash... people claim that other ISAs would be a lot easier/faster/more efficient to decode, but that doesn't seem to be true in practice. It seems to me that any difference that might be there gets dwarfed by the sheer complexity of OoOE and speculative execution.

What we do know is outside some niches (like DSPs, where VLIW reigns king) static techniques essentially do not work for application code, because it's impossible to predict statically.

thechao · on June 18, 2018

Even 15 decoders are a pinprick on the core’s area—at least when I saw die area for LRB. Fetch & schedule we’re a bit larger. Most of the core area was register files & floating-vector logic.

brandmeyer · on June 19, 2018

In fairness, Larrabee was originally an in-order design.

gpderetta · on June 19, 2018

It was also a simpler core, which means that for a complex OoO monster, the decoder would be an even smaller area. The OoO instruction scheduler would be a significant chunk, but that has nothing to do with being x86.

bogomipz · on June 18, 2018

>."..at least when I saw die area for LRB"

What is LRB?

woadwarrior01 · on June 18, 2018

Probably Larrabee, Intel’s cancelled GPGPU with an x86 ISA.

ptx · on June 19, 2018

What if the parts that are impossible to handle statically were instead handled in software by a JIT compiler that could speculate and deoptimize when necessary? Could that be a more efficient model for executing high-level languages like Java etc.?

acqq · on June 19, 2018

> Could that be a more efficient model

No, because what the CPUs have is the advantage of being able to have specialized circuits with the new answers in each clock, every instruction in every pipelne stage and so fast adaptation to the use in run time.

bogomipz · on June 19, 2018

>"What we do know is outside some niches (like DSPs, where VLIW reigns king)"

Interesting, why is VLIW is so predominant in DPS chips?

blattimwind · on June 19, 2018

Because typical DSP code does calculations using predictable memory access patterns and usually contains very little logic. This makes it possible to largely statically schedule the code. It's fairly similar in that regard to e.g. GPU shaders (and indeed GPUs have used VLIW-like designs in the past) and scientific data crunching code.

bogomipz · on June 20, 2018

>"typical DSP code does calculations using predictable memory access patterns and usually contains very little logic."

Thanks, that's what I was missing - the access patterns of signal processing is what makes VLIW a good fit.

Is there any other use case where VLIW architectures are prevalent or are DSPs the primary use case?

gpderetta · on June 20, 2018

also DSP code has a decent amount of instruction level parallelism.

gaius · on June 19, 2018

Because VLIW is a great idea but we only know how to compile DSP code effectively for it, that being a subset of general-purpose code

ahartmetz · on June 18, 2018

Speculation has nothing to do with x86 in particular, although it has something to do with its general approach to execution (as opposed to e.g. the Mill CPU).

Speculation is part of out of order execution, which is logic to execute code passages when their inputs are ready, as opposed to strictly in program order, while maintaining the program visible effects of executing in program order. It works the same in ARM, MIPS etc. Not doing OOO execution roughly halves CPU throughput on your average code. AFAIK OOO without speculation is possible, but loses maybe half of its efficacy.

Actual experts may want to correct my numbers.

digi_owl · on June 18, 2018

These days there are an absurd number of abstraction layers in any computing stack.

Fire up anything electron based, and you are looking at scripts being JITed inside a "VM", sitting on top of an OS that abstract away the hardware, sitting on top of hardware that is pretending to be something from the 80s/90s.

Koshkin · on June 18, 2018

> electron based

Electrons are arguably at the very base of the whole pyramid (I know, I know :). But then again, there's these pieces of Si pretending to be transistors, etc...

krylon · on June 18, 2018

Well, Intel has tried repeatedly to replace x86 (AXP32, i860/i960, Itanium), but in a way, they were a victim of their own success.

jecel · on June 18, 2018

Actually, the iAPX432 came first in 1976. When it became obvious that this project was going to take longer than expected (it was only launched in 1981) Intel did a quick extension to the 8080 to keep customers from going to the competition while they waited.

krylon · on June 19, 2018

Thank you for pointing that out!

ksk · on June 18, 2018

>It started out being close to the metal but now it's an abstraction that can mislead you if you think processors are literally working the way x86 assembly describes.

Well, the fact that we were able to abstract it out into an ISA, and still make progress would indicate that it isn't holding us back!

>And surely the whole spectre issue could be lessened if we could be less reliant on CPUs having to guess what to keep in cache, which code paths are most likely, etc?

Why would consumers purchase CPUs that performed worse? The only reason to use branch prediction is because it works and is a huge benefit.

inferiorhuman · on June 19, 2018

> Well, the fact that we were able to abstract it out into an ISA, and still make progress would indicate that it isn't holding us back!

Just because something works doesn't mean it works well. Imagine if x86-64 ditched the variety of non-32/64-bit modes. Would x86 be able to better compete with ARM in power consumption?

BeeOnRope · on June 19, 2018

Not really, those use a microscopic amount of area on a modern die and shouldn't have much power impact at all when they aren't being used.

You could probably change the ISA in various ways to save power, or make other uarch or arch changes - but cutting out say the 16-bit modes isn't really one of them.

inferiorhuman · on June 19, 2018

> You could probably change the ISA in various ways to save power, or make other uarch or arch changes - but cutting out say the 16-bit modes isn't really one of them.

I've got no idea (obviously) but it seems that maintaining compatibility with legacy hardware shouldn't be all that cheap. If it is, kudos to Intel.

BeeOnRope · on June 19, 2018

There was some point when it wasn't necessarily cheap: when chips had transistor counts in the hundreds of thousands or single digit millions. But current chips generally have billions of transistors (or at least high triple digit millions). There are just a huge number of gates to go around, and since the complexity of 16-bit support remains largely constant it becomes an ever-shrinking piece of the die.

These backwards compatibility modes don't need to be super fast (after all, current chips are perhaps 1000x the speed of the old 16-bit chips), so they don't need to get a ton of gates thrown at them: they just need to work.

If there were no backwards compatibility concerns, I'm sure Intel and AMD could like to get rid of this stuff: not so much because it uses a lot of space, but just because the design and validation effort to make sure these old ISAs keep working is probably considerable.

It's the same kind of situation with all the old effectively obsolete instructions: supporting them takes minimal space, especially when they are allowed to be slow. The encoding space they take up is more of a problem - but that can't be changed now...

gpderetta · on June 20, 2018

What I heard is that, by far, the biggest cost is validating the final hardware. Intel and AMD have very large experience and test tools so the incremental cost for a new CPU is not huge, but it is a big barrier for a new competitor in the x86 space (but then again, patents are probably an even bigger barrier).

Newer/Simpler architectures should be easier to validate.

ksk · on June 19, 2018

> Imagine if x86-64 ditched the variety of non-32/64-bit modes. Would x86 be able to better compete with ARM in power consumption?

In what way? Do you mean instruction decoding? There is very little separation between x86 and ARM on that. x86 is optimized like crazy for the common instructions.

https://en.wikipedia.org/wiki/CPU_cache#Micro-operation_(%CE...

lmm · on June 19, 2018

Branch prediction is great except in the rare cases where it isn't. I wouldn't want to see it removed but I would like to see better documentation and a more transparent interface onto it, so that those rare programs that do need precise control over it (e.g. security code) can have it.

ksk · on June 19, 2018

>Branch prediction is great except in the rare cases where it isn't.

Sure, is there any CPU feature that this does not apply to?

>I wouldn't want to see it removed but I would like to see better documentation and a more transparent interface onto it, so that those rare programs that do need precise control over it (e.g. security code) can have it.

Its hard to parse what your exact request is here. Branch predictors on various Intel/AMD/etc CPUs work slightly differently from each other, and so even if you had access to the design documents, you would have to write programs specific to each CPU instead of targeting x86/x86-64. Not to mention that microcode updates could alter the implementation. Or do you want to just disable it? But then there are other places where bugs can hide. DMA Controller bugs, side channel attacks on caches, or basic DRAM attacks like Row Hammer,etc. Letting external programmers access CPU internals is only going to open another HUGE can of worms.

lmm · on June 19, 2018

> Branch predictors on various Intel/AMD/etc CPUs work slightly differently from each other, and so even if you had access to the design documents, you would have to write programs specific to each CPU instead of targeting x86/x86-64. Not to mention that microcode updates could alter the implementation.

Sure, though hopefully that's something a compiler could do. What I'd like is to be able to write a compiler that could understand the difference between as-fast-as-possible code (most code) and constant-time code, and build the latter correctly.

> But then there are other places where bugs can hide. DMA Controller bugs, side channel attacks on caches, or basic DRAM attacks like Row Hammer,etc. Letting external programmers access CPU internals is only going to open another HUGE can of worms.

There are other holes, sure. That's not a reason not to fix the biggest hole. We'll fix the smaller holes in turn.

ksk · on June 19, 2018

>What I'd like is to be able to write a compiler that could understand the difference between as-fast-as-possible code (most code) and constant-time code, and build the latter correctly.

If you want your compiled code to rely on the internal design of the CPU then its going to be tightly coupled to one particular CPU. CPU Upgrade -> code no longer works. Oh you ran the software on your laptop which has Core i5 instead of Core i7 -> Code no longer works. Oh your S/W Vendor went out of business and you purchased a new machine with a different CPU -> No working code for you.

>There are other holes, sure. That's not a reason not to fix the biggest hole.

OK, please propose a solution then. Or are we merely arguing over hypotheticals and lamenting the fact that stuff isn't secure? If so, yeah, stuff isn't as secure as any of us would like it to be! That doesn't get us anywhere though.

lmm · on June 20, 2018

> If you want your compiled code to rely on the internal design of the CPU then its going to be tightly coupled to one particular CPU. CPU Upgrade -> code no longer works. Oh you ran the software on your laptop which has Core i5 instead of Core i7 -> Code no longer works. Oh your S/W Vendor went out of business and you purchased a new machine with a different CPU -> No working code for you.

Indeed. Better that than CPU Upgrade -> silently leak your private keys, which is what we currently get.

> OK, please propose a solution then.

I've been proposing a solution this whole thread: publish enough information about exact CPU behaviour to allow a (specialised) compiler to produce reliably constant-time code, even if only for a single CPU model at a time.

ksk · on June 20, 2018

That is not a solution though. Its your personal wish list. Neither Intel nor AMD is going to publish their CPU design. No consumer is going to purchase/use software that arbitrarily breaks. Its as unrealistic as someone saying don't release any software with bugs. A solution recognizes the reality of the current situation. You'll have to figure out how to get all parties on-board your agenda. I don't expect you to come up with one right this moment, but I did want to challenge your position.

jeffwass · on June 18, 2018

As someone who hasn’t written assembly in 20+ yrs (some 68k back then, and bits of z80), I’m curious about some examples of how the assembly abstraction diverges from the underlying processor.

MechanicalTwerk · on June 18, 2018

https://news.ycombinator.com/item?id=9264195

xenadu02 · on June 19, 2018

Nope.

x86 decode to micro-ops is a trivial amount of die. It is constant (not completely true but true enough for all useful purposes) while transistor count has gone through many doublings. In some cases x86 is a slight win, as it can function as a form of instruction compression, slightly reducing i-cache pressure.

There may be architectural limitations (like strong store ordering) that are relevant but in practice it hasn't made a huge difference.

tiborsaas · on June 18, 2018

I already posted about WebAssembly in this thread and I think you are correct. It looks completely different than anything else before. I can see functional languages targeting it directly.

Pica_soO · on June 18, 2018

Intel abandoned it themselves, internally? The whole concept of backwards compatibility to a extensive library of old fixed functions was give up about 10 years gao. All that remains is a bw-compatible Assembly-API and whatever proprietary happens to the microcode generated from this.

convolvatron · on June 18, 2018

there was at least one non-realized research project where the cache was exposed as part of the architecture to be managed by the compiler (ppc lets you do some of that).

i think thats a potentially very fruitful approach, but would it have helped the spectre situation in any way?

oh, what you're suggesting in an analog for compiler driven speculation? that may have helped, and is also probably worth thinking about

xenadu02 · on June 19, 2018

Every single CPU architecture that has decided to lean on "smart compilers" has failed. Intel flushed billions down the drain on Itanium. They lit $100 bills on fire trying to make a smart compiler. It never paid off. Think about that for a minute.

No matter how good your compiler I'll put my money on using hardware to solve Spectre.

In the future I'd expect caches to gain some kind of exclusive tagging or set-aside a per-core reorder area. Fetches will have to go there until the instruction that performed the fetch is retired successfully, then the entry can be made visible. Other threads or cores won't observe any changes to cache state based on speculative execution. It will be complicated, cost some transistors, and have a minor but measurable performance impact.

The perf impact being having to fetch the same memory address multiple times, possibly all the way from RAM, because the instruction that fetched it before you hasn't retired yet. I don't know what impact that will have on cache traffic. If we're all very lucky, it won't be possible to use L3/L2 as side channels and we can ignore them (we probably aren't that lucky).

pjmlp · on June 19, 2018

I just read Microsoft is having their own go at it.

https://www.theregister.co.uk/2018/06/18/microsoft_e2_edge_w...

ryanianian · on June 18, 2018

> but would it have helped the spectre situation in any way?

Presumably a compiler could put cache-lookups and/or cache-waits in non-predictably (perhaps choosing to do so when it suspects a cache-miss anyway).

krylon · on June 18, 2018

I think to many programmers assembly is the "GOTO" of programming languanges: From the day you start learning to program, you are told that all this fancy high-level-language stuff is there so you do not have to deal with assembly. So most people never go there.

I did go there, briefly, about ten years ago. It was wicked fun. But all in all, I may have written maybe 20 or 30 instructions of assembly in total. I did try to rewrite a few small, heavily-used functions from our code base in assembly only to discover that the code I came up with was practically identical to what the compiler emitted. At that point I figured that the people who told me that "you can't beat the compiler" were probably right and called it a day[0]. Alas, I never had the hardcore performance requirements that would make go back there. But it was fun to get a taste of it.

[0] At the same time, I was kind of proud that I did not get beat by the compiler. Then again, those functions were fairly trivial.

scottlamb · on June 18, 2018

> At that point I figured that the people who told me that "you can't beat the compiler" were probably right and called it a day

On that subject...my understanding [0] is:

These days, the best bet for beating the compiler is to use vendor intrinsics (for SIMD, encryption, bit-twiddling, etc). Shaving an instruction off the inner loop might give you a few percent; using SIMD lets you operate on 256 or 512 bits per instruction instead of 8, 16, 32, or 64. You might be able to show your inner loop is memory-bound (and thus prove further improvements have to come from algorithmic improvements / better cache locality, rather than continuing to fiddle with instructions).

The compiler automatically uses SIMD sometimes, but it can't do so reliably:

* The transformations require things the compiler isn't allowed to do, like increasing alignment of key variables or altering the larger algorithm.

* code that might run on older processor revisions needs multiple implementations selected at runtime. I think gcc has some magic extension ("target_clones"?) to do this relatively easily; otherwise you might need to write your own logic to decide which function pointer to use.

Note that each "vendor intrinsic" matches one assembly instruction, and it's valuable to understand assembly while writing them, but the actual code you check in can end in .cc (C++) or .rs (Rust) or whatever. Doing so means it can be inlined into functions written in the higher-level language, you don't have to encode knowledge about the platform's calling convention into your code, etc.

[0] Not from personal experience. Corrections welcome.

glangdale · on June 18, 2018

Your understanding is very good for someone without personal experience.

Automatic compiler use of SIMD is rarely that great unless you're in a nice big loop doing nice regular things. I've pretty much never seen it on the stuff I do.

Using intrinsics gets you 95% of the way there. I reach for asm only when I absolutely have to. It is a huge PITA. My irritation at the "bro, just write a .s file" people peaks when I'm trying to write a 200LOC function with 10 different variants based on (say) pipeline depth and unroll width. Yeah, because I'd like to spend the next year doing register allocation by hand.

The compiler is really good at doing routine stuff, and when I hand-edit the asm to do things that better fit my idea of regalloc and scheduling I usually make things worse. Where the compiler falls down is instruction selection and stuff that borders on algorithm design.

For example, I built a shift-or string matcher in SIMD where a first-stage was OK to have false positives (positives in shift-or are represented by zeros in the bit vector). I was able to get a big performance boost by tolerating these false positives when shifting SIMD bits and bringing in some zeros, but no compiler is going to know that a few false positives are OK in that circumstance.

IMO the best way to work is with intrinsics, a tiny bit of embedded asm for things that you can't get intrinsics for (I had to resort to gcc asm blocks to make a cmov happen) and close inspection of your object file (at least on the hotspots) to ensure that the code you're getting is what you think you're getting. It's possible to make minor screwups and suddenly see dozens of extra instructions pushing everything in and out of memory for no good reason.

The other place you can beat the compiler is by doing deeper/wider pipelining of branch free code. This is a dark art. Often going branch free is 10-20% worse than branchy when you have 1 iteration happening at a time but it will scale better when you are going lots of stuff at once - if you have (say) 12 different copies of your loop body happening in one iteration, and there's a mildly unpredictable branch per loop body, the branch miss on one iteration stops all the others from progressing too!

I occasionally blog on these things at branchfree.org and have some more low-level stuff brewing shortly.

krylon · on June 19, 2018

> These days, the best bet for beating the compiler is to use vendor intrinsics (for SIMD, encryption, bit-twiddling, etc)

Well, I would not count that as "beating" the compiler so much, but as "using it", "helping it", or something like that. ;-)

Anyway, when I still wrote C code for a living, we were using the OpenWatcom compiler, and as far as I could figure out at the time, it made no effort whatsoever to support SIMD, and the only way to use them was to drop to assembly. (OpenWatcom's support for inline assembly and even inline binary code was very nice, though.)

Senderman · on June 19, 2018

When I hobby-program, I use assembly. I find it extremely relaxing; doing most things in assembly requires attention and concentration, so it's like solving a puzzle or like physically building something; I don't think you go into 'problem solving' mode very much, so maybe it's a break from that.

Higher level languages let you skip most of the 'menial' work of laying out the code, so your brain power gets spent a pretty different way. I like each. Obviously some languages are better suited to certain tasks.

I don't have a point; I'm just sharing what sprang to mind when I read your comment.

I will say that for something like C, having some experience with assembly makes it a lot easier to get a feel for pointers, the stack. I genuinely think everybody should try it at least once; it's just enjoyable talking almost-directly to the machine.

ahartmetz · on June 18, 2018

I always describe it as: Assembly is dead simple, it is only hard to write nontrivial programs in it. Manual register allocation, other micromanagement, and the necessary changes when you change the logic, that is hell.

Veedrac · on June 19, 2018

You got 20-30 instructions in, drew, and decided it was impossible to do better? That's calling it quits a bit early.

krylon · on June 19, 2018

Like I said, my motivation was not very strong.

During my programming socialization, assembly was always characterized as this terrible thing they did back in the 1960s, so I was reluctant to even try to begin with. Secondly, I had read quite a bit about how nowadays, compilers were so good, it was not worth the effort.

My initial motivation to even give assembly a try was not performance, but accessing CPU-specific features (RDTSC).

tawayway · on June 20, 2018

Another old chestnut - learning some assembly allows you to read it, even if you don't ever need to write any. There is real value in understanding which instructions your compiler is emitting.

Beating the compiler is made easier if you can look at the compiler's answers.

pjmlp · on June 18, 2018

Loved the article.

Then one can follow up with some MS-DOS classics.

https://www.amazon.com/Peter-Nortons-Assembly-Language-Book/...

https://www.amazon.com/Advanced-S-DOS-Programming-Microsoft-...

https://www.amazon.com/Peter-Norton-Programmers-Bible-progra...

https://www.amazon.de/PC-Intern-Programming-Encyclopedia-Dev...

mhd · on June 18, 2018

I see your Norton and raise you an Abrash:

https://www.amazon.com/Zen-Assembly-Language-Knowledge-Progr...

https://www.amazon.com/Zen-Graphics-Programming-2nd-Applicat...

https://www.amazon.com/Zen-Code-Optimization-Ultimate-Softwa...

Zen of Asm is also online, I think a few of the other works, too.

http://www.jagregory.com/abrash-zen-of-asm/

disqard · on June 18, 2018

Ahhh, Zen of Graphic Programming was my one of first graphics books, which led straight to Foley, vanDam, et. al., and I was hooked :)

It's safe to say Abrash's book had a big impact on me.

Norton's books also make me nostalgic, so thank you for that, GP.

ghaff · on June 18, 2018

I'm sure I still have my paper copy someplace. I still remember that you did things differently on an 8086 than an 8088 for example because of the difference in the width of the external data bus. Cycle counting. I'm not sure I learned much that was practical from it but it was the ultimate in optimizing x86 assembler code.

As I recall, he was going to write a sequel but, by then, it would have been fairly pointless.

pjmlp · on June 18, 2018

That is unfair. :)

rzzzt · on June 18, 2018

One more from Mr. Norton: https://www.amazon.com/Peter-Norton-Programmers-Guide-IBM/dp...

It covers a wide range of interrupt routines, both DOS and the BIOS ones. Also, some PCJr sound chip programming got snuck in.

voltagex_ · on June 19, 2018

Peter Norton's Assembly Language book is also online: https://openlibrary.org/works/OL65474W/Assembly_language_boo... but crazily, there's a waitlist to "check out" the eBook.

Sigh.

I did once contact the current copyright owners (DK or Penguin, I think) to see if it could be re-released or released for free, but the paperwork scared me off. Hopefully someone has more luck one day.

zerr · on June 18, 2018

Obligatory mention: John Socha is the author of Peter Norton books as well as Norton Commander.

setquk · on June 18, 2018

I started with 6502, then ARM in 1989 (thanks acorn!), then a little bit of x86. The latter was a culture shock. It’s was like grooming the devil’s genitals in comparison to ARM. It drove me to C and Unix where I’ve been happy ever since (occasionally a bit of PIC assembly as well).

It’s not scary just nasty.

inamberclad · on June 19, 2018

The problem I keep running into is that everybody's asm is different looking, and not even internally consistent. Trying to parse all of this stuff:

    [BITS 16]  ;tell the assembler that its a 16 bit code

Okay, so this is an instruction to the assembler saying that we're only working with 16 bit registers and to emit 16 bit code. Straightforwards enough here.

    [ORG 0x7C00];Origin, tell the assembler that where the code will
    

            ;be in memory after it is been loaded

This means that the first instruction will be at physical address 0x7c00 when the code loads, right? What's the first instruction, the one below?

    mov ah, 0x0A ; Set the call type for the BIOS


    mov al, 66   ; Letter to display


    mov cx, 3    ; Times to print it


    mov bh, 0    ; Page number

Makes enough sense, just mov instructions al refers to the top 8 bits of register ax and al refers to the bottom 8 bits. I see that register bh is zeroed, but what about bl? Why not xor bx, bx or something like that? Don't instructions normally go in a section named .text? How does the BIOS know where to find this code and begin execution in the first place?

...

    TIMES 510 - ($ - $$) db 0    ;fill the rest of sector with 0

Wtf is that? What's a sector? What the syntax here? Are we repeating this 510 times, or 510 minus some value? The db command is for writing bytes, but where are the bytes written to?

    DW 0xAA55          ; add boot signature at the end of bootloader

Where are these bytes written? Why is DW capitalized here when db was left in lowercase before?

wruza · on June 19, 2018

ORG shifts all addresses below it to given offset. That means that external loading code is supposed to put it there, otherwise all non-relative instructions will load/store at wrong locations.

BL is not set because it is irrelevant. How you clear/save a register was up to you. https://en.wikipedia.org/wiki/INT_10H

.text is a section of executable format, but what you got is a boot loader. It doesn’t have sections, it’s just a first 512-byte boot disk sector that is loaded at 7C00h by BIOS and directly jmp’d into.

I forgot what exactly $-$$ is, but one of those is “current” address and another should be ORG address, but I’m not sure, so this expression turns into e.g. TIMES 410 DB 0, if 100 bytes used by so far.

AA55 is a boot sector end marker, something like canary value. Never looked into that in detail.

Assemblers were mostly case-insensitive, like Pascal and BASIC. Case is up to you, if you have any preference.

wruza · on June 19, 2018

As for your confusion, you brought uneasy example here. This is a boot sector, which requires some knowledge on the process, flat binary addressing by convention, and tricks that must fit in 510 bytes, if any. That seems hard (not really) because all you’ve got is a first disk sector and 64kB of BIOS. Once you boot up into OS by reading more sectors and executing these, things slowly begin to ease. Your regular DOS-based program will look like any other example on the internet.

Edit: oh, this code is from the article. Bad habit of reading comments first, sorry!

partycoder · on June 18, 2018

Some free quality resources, ready to use:

- Intel CPU programming guide https://www.intel.com/content/dam/www/public/us/en/documents...

- Calling conventions https://en.wikipedia.org/wiki/X86_calling_conventions https://en.wikibooks.org/wiki/X86_Disassembly/Calling_Conven...

- Web compiler output viewer https://godbolt.org/

- Operation code (opcode) reference http://ref.x86asm.net/

- Disassembler/debugger https://github.com/eteran/edb-debugger, https://x64dbg.com

- Disassembler/Reverse engineering tools http://hte.sourceforge.net/ https://github.com/radareorg/cutter

And this: https://github.com/codilime/veles (binary analysis tool)

zshrdlu · on June 19, 2018

Probably also need some syscall references: https://fresh.flatassembler.net/lscr/

partycoder · on June 20, 2018

Neatly organized. The "man" command also has documentation on syscalls.

A good introduction to syscalls, with some explanation is "The Linux Programming Interface", http://man7.org/tlpi/ (book)

buzzier · on June 19, 2018

+ "Reverse Engineering for Beginners" free book https://beginners.re/

glangdale · on June 18, 2018

Anyone who really wants to dig into this, and is willing to be a bit scared, should review the materials for the 15-410 operating systems course at Carnegie Mellon University. The fall edition of the page isn't up yet but there are probably some older versions kicking around.

If you are at CMU you should definitely take this course, as long as your pain tolerance is high.

This course teaches students to write a preemptive operating systems kernel from scratch. It is quite an experience (I TA'd it after finishing my PhD while trying to figure out what to do next).

The kernels the students wrote used to be able to boot on standard (but old) PC hardware. Sadly the modern USB stack is so complex that a conformant thing to talk to a USB keyboard is pretty much as complex as the student's whole project (but less educational). So non-legacy hardware that lacks a PS/2 keyboard no longer has this nice easy path to read/write stuff to console without a lot of setup.

dcomp · on June 18, 2018

What would be more interesting is the UEFI boot process. I don't think any new computer comes booting into 16bit real mode anymore. I would love to see a "Write your own UEFI bootable kernel" I'm sure going straight to a semi sane 32bit environment is much easier to deal with than starting at 16 and working your way up.

kccqzy · on June 18, 2018

> I don't think any new computer comes booting into 16bit real mode anymore

It most certainly does. See the chapter on 8086 emulation in the Intel Software Developer's Manual, Volume 3.

eadmund · on June 19, 2018

I would love to see a 'write your own UEFI kernel in assembly,' but every resource seems to assume that one is writing in C. I want my system to be in assembly, and to grow it from there.

wruza · on June 19, 2018

>32bit environment is much easier to deal with than starting at 16 and working your way up.

How much bytes do you think one requires for an OS loader? The complex part is loading all the code into memory from a file on disk partition without having a MB-sized fs driver, stdlib and loader in it yet. 16, 32 or 64 it doesn’t really matter.

mmjaa · on June 19, 2018

Sometimes its necessary to return to the past to remember the things we've abandoned in the rush to modernity.

I'm speaking, of course, of the wonderful Prince of Persia and its delights. So many treasures to be discovered for anyone interested in even a little bit of assembly-language programming.. it definitely sharpens my chops, anyway:

http://fabiensanglard.net/prince_of_persia/index.php

PRINCE OF PERSIA CODE REVIEW.

secure · on June 18, 2018

I love the built-in demos!

AnnoyingSwede · on June 19, 2018

The key to x86 assembler/machine code was avoiding the software interupts as they were painfully slow. Sure, when you are dealing with a 512byte boot sector you are better of offloading as much code as you can to software interupts, but for everything else the battle was to come up with something faster than the software interupts.

In many cases some basic routines used in qbasic was actually faster than their counterparty interupts in asm. A grand example of this would be using mode 13h (320x200) and interupt 10 to set all pixels on a screen to a single color, which could take up to 2 seconds using machine code and bios interupts (as it verifies vertical and horizontal refresh prior to setting the pixel). Using interupts however is relativly painfree as the author pointed out.

nsxwolf · on June 18, 2018

It's hard to find a tutorial that really tells you what is going on. You have to start with whatever operating system you're on, then figure out what the heck you're reading. What is x86 and what is some reserved word peculiar to the assembler you're using?

Here's macOS Hello, world code I see floating around a lot, and I can't make heads or tails of it:

  global start

  section .text
  start:
      push    dword msg.len
      push    dword msg
      push    dword 1
      mov     eax, 4
      sub     esp, 4
      int     0x80
      add     esp, 16

      push    dword 0
      mov     eax, 1
      sub     esp, 12
      int     0x80

  section .data

  msg:    db      "Hello, world!", 10
  .len:   equ     $ - msg

pjc50 · on June 18, 2018

Basically everything left-aligned is for the assembler, and the indented material is the mnemonics for the x86 set.

The assembler breaks down as:

      push    dword msg.len
      push    dword msg
      push    dword 1

Place a pointer to the message, its length, and 1 on the stack. Each of these decrements esp by 4.

      mov     eax, 4

We'll be doing syscall 4 in a sec

      sub     esp, 4

Reserve 4 bytes for the return

      int     0x80

Do the syscall! See https://opensource.apple.com/source/xnu/xnu-2782.20.48/bsd/k... ; this is 'write'

      add     esp, 16

Discard the return value and arguments from the stack.

      push    dword 0
      mov     eax, 1
      sub     esp, 12
      int     0x80

Similarly setup and do syscall 1 to exit.

nsxwolf · on June 18, 2018

But then there's that "msg.len", which I assume is some sort of assembler shortcut for getting the length of an array and not part of the x86.

pjc50 · on June 18, 2018

That's referring to the label msg.len, which is a value computed at compile time in the 'data' section at the bottom.

  msg:    db      "Hello, world!", 10
  .len:   equ     $ - msg

'msg' is a label, which will be assigned to an address at compile time. 'db' short for 'declare bytes' puts some bytes at that address. '.len' is then defined as the current output address ($) minus the start of the string.

e12e · on June 18, 2018

In addition to all these excellent comments, it can also be instructive to go "backwards" from C, here on Linux:

cat hello.c

  #include <unistd.h>

  char str[] = "Hello, World!\n";

  int main() {
    write(1, str, sizeof(str)-1);
    return 0;
  }

cat hello.s

  .file	"hello.c"
  	.intel_syntax noprefix
  	.globl	str
  	.data
  	.align 8
  	.type	str, @object
  	.size	str, 15
  str:
  	.string	"Hello, World!\n"
  	.text
  	.globl	main
  	.type	main, @function
  main:
  .LFB0:
  	.cfi_startproc
  	push	rbp
  	.cfi_def_cfa_offset 16
  	.cfi_offset 6, -16
  	mov	rbp, rsp
  	.cfi_def_cfa_register 6
  	mov	edx, 14
  	lea	rsi, str[rip]
  	mov	edi, 1
  	call	write@PLT
  	mov	eax, 0
  	pop	rbp
  	.cfi_def_cfa 7, 8
  	ret
  	.cfi_endproc
  .LFE0:
  	.size	main, .-main
  	.ident	"GCC: (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0"
  	.section	.note.GNU-stack,"",@progbits

Note the line: lea rsi, str[rip] which I belive is something along the lines of load effective address (lea), into rsi register, of "str" symbol offset with relative instruction pointer (rip - to allow for relative addressing).

You can compile and run with:

gcc -std=c11 hello.c && ./a.out

Prouduce hello.s from hello.c with: gcc -std=c11 hello.c -S -masm=intel

cesarb · on June 19, 2018

"lea" is a fun instruction.

Going piece by piece: on x86, each register has several sizes. For instance, rax is a 64-bit register, with its lower 32-bit half being eax, its lower 16-bit half being ax, and its lower 8-bit half being al (for historical reasons). So when a register name starts with "r", it's the 64-bit variant. Therefore, "rip" is the 64-bit address of the next instruction.

The instruction "mov rsi, str[rip]" would get the 64-bit address of the next instruction, add to it a fixed offset (which the assembler and linker compute for you, as the exact offset you need to get to the data at the "str" label), load 64 bits from that address, and put the result in the "rsi" register. And that's not even the most complex addressing mode; you can get a register, add to it another register multiplied by 2, 4, or 8, add to it a constant, and use it as the memory address to load, store, or even modify in-place.

The "lea" instruction (load effective address) is a way to get directly at the power of that complex address calculation logic for your own uses. Instead of using the computed memory address to get at the memory, it's the memory address that's put in the register. Therefore, where "mov rsi, str[rip]" would read from memory, "lea rsi, str[rip]" would put in rsi the memory address the "mov" would have read from.

This also allows for a few tricks. For instance, you can use "lea" to multiply a number by five, without using the multiplier: just use a register and the same register scaled by 4 (and you can also add a constant to the result, still in a single complex instruction).

eterps · on June 18, 2018

This makes it even easier: https://news.ycombinator.com/edit?id=17341952

kccqzy · on June 18, 2018

Some tips for learning:

1. It's much easier to study the disassembly of object files than executables. Or better yet, use -S to ask the compiler to emit assembler.

2. Don't use hello world as your first program. Those involve either library calls or system calls. Strings too, and strings are not so straightforward in any systems programming language. Start with single functions (not called main) that take two arguments and add them/subtract them/multiply them.

3. Try learning x86-64 first. It's easier to learn because you don't have to know about the stack to understand and write simple arithmetic functions, if they take no more than six arguments.

4. It usually helps to pass -Os to the compiler to have it generate more compact code. You won't see lots of superfluous code to load/store things to the stack (for debugability) but you also won't see autovectorized code.

CrankyCanuck · on June 19, 2018

Back in the late 80's, when I was in University, we had second year course where we had to do x86 assembly on PC XT or AT clones. Assignment #1 was messing with keyboard interrupts...easy peasy. Assignment #2? Write a video game...basically we had to do the "snake game" with a never ending line you could steer as it grew. Anyhow, I have to admit pretty much the whole class figured it out...assembly was not nearly as insane as we thought. By the end of that course, I was almost as comfy writing x86 assembly as Pascal. Fun times.

adricnet · on June 18, 2018

Thanks for the share this looks great!

It aligns quite well with the study (in diagrams and commented assembly) of x86 in POC || GTFO starting in pocorgtfo04.pdf chapter 3 provided by Shikhin Sethi.

chaotic_clanger · on June 19, 2018

it seems to me that x86 assembly is the less readable kind of assembly. i always found powerpc set more thought through.

bogomipz · on June 18, 2018

Tangential question - does anyone know what model IBM that is?

wglb · on June 18, 2018

It appears to be the original IBM-PC.

AnIdiotOnTheNet · on June 19, 2018

Yes, the IBM 5150 PC. In fact, the image is the one used on the Wikipedia page for "Desktop Computer".

bogomipz · on June 19, 2018

Ah yes, the 5150. Thank you.

tiborsaas · on June 18, 2018

I wasn't really familiar with webassebly, but when I've seen this source for fibonacci I immediately recognized familiar features which can't be said of the rest of the assembly implementations:

https://github.com/nebulet/nebulet/blob/master/wasm/fibonacc...

https://github.com/Hanks10100/wasm-examples/blob/master/simp...