The story I heard was that the author is one of those obsessive genius types who was having a debate about compiler efficiency, so he wrote an entire compiler-exploration tool to prove his point.
I can understand that.
This is a great article. Being able to descend into machine code/assembly is a valuable “tool of last resort.”
Personally, I avoid that like the plague, but an OS developer probably still needs to do this (I developed system stuff, back in the Dawn Times).
I started off with Machine Code, so it’s been a long, strange trip, for me...
IMHO you don't have to be obsessive to use godbolt.org in your daily work, nor should it be a "tool of last resort".
It really does help a lot to understand what the compiler will actually turn your high-level code into, because quite often the results are very surprising (if only to be shocked by the amount of assembly code that's generated from a few lines of C++ template code). Comparing the differences in output between different compilers and optimization levels is also very useful.
I consider Godbolt to be an educational and “thinking cap” tool, as opposed to a critical path optimization tool.
It is pretty damn cool, though.
It’s sort of like how an algebraic graphing app is most useful for students, as opposed to working engineers.
Nowadays, with on-chip threading, ASM stack dumps are a lot less useful than they used to be. Really, for me, munged function names are what I use to figure out the general vicinity of a bug, which I then pin down by examining the source.
I don't understand this tool. My compiler has a flag for generating assembly output. That should be sufficient, no? Also, it more accurately reflects what my compiler actually generates given all the other flags.
Your compiler does, yes. Godbolt gives you everyone else's compilers, across a huge range of versions and architectures, without having to install anything.
It's quite often easier to just slam something into godbolt than to do it locally with a temporary file and gcc -S.
An important feature of godbolt is quickly comparing the output and language feature support of different compilers, or different versions of the same compiler (e.g. what versions of MSVC support a specific C99 feature, or which new C++ feature is safe to use across clang, gcc and MSVC).
And most importantly, you can check what your code would look like when compiled for 6502 or Z80 ;)
I see, but I'm not convinced a web-tool is the best way to go about it if optimizing assembly is part of one's dayjob.
I do understand that it's nice to play with, and that it can have educational value. I would personally love to see WASM support, and a way to see e.g. LLVM intermediate code. But I'd want to see some more documentation on the website e.g. about the ABI so I know how the arguments of a function are passed in etc.
> I'm not convinced a web-tool is the best way to go about it if optimizing assembly is one's dayjob.
It isn't, because that's not what it's for. Besides, few people have optimizing assembly as a dayjob, and in that case you'd be writing it rather than reading it. Trying to juice a compiler into producing specific assembly output is frustrating and brittle. I suppose it's useful for seeing whether your autovectorisation has worked.
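For example (my own illustration, not something from the thread), a loop like this is the kind of thing worth pasting in at -O2 or -O3 to see whether the vectoriser kicked in:

/* If autovectorisation worked, the hot loop shows packed instructions
   (mulps/addps, or vmulps/vaddps with AVX) instead of scalar mulss/addss. */
void scale(float *dst, const float *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}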
But the best use of godbolt - uniquely as a web tool - is to show other people what the output is.
(I wish more people would stop before saying "tool X is useless for workflow Y" and ask themselves what the intended workflow of tool X is. Not everything is for everybody, that would be impossible.)
> more documentation on the website e.g. about the ABI so I know how the arguments of a function are passed in etc
It just calls the underlying compiler executables. Consult their documentation.
There's quite a range between "looking at compiler output is sometimes useful" and "optimizing assembly is one's day job".
We use it all the time, it's certainly not just a toy.
I have the luxury of having a build farm with a large range of compilers and platforms available, so I could use that to rig up something like godbolt. But that'd mean spending probably days on setting something up, plus ongoing maintenance, instead of just using an existing thing that does the job. If this kind of thing were really all I did, it'd probably be worth it to be able to customize etc. But it's not. It being a service (and thus basically automatically a web tool) is a valuable feature.
It's also a great tool to share and communicate results of such investigations. Again, easy for a web tool.
It's eye-opening how the most trivial, functionally equivalent code generates practically identical native binary output regardless of the language, be it C, Java, JavaScript or .NET.
It's eye-opening because it shows how small the actual differences in the executed code really are.
When I first tried this about 10 years ago, I was very surprised by how good the code generated by JavaScript JITs was. I expected type checks, lots of unnecessary memory accesses, etc. all over, but to my surprise the inner loop was practically the same as the gcc -O2 output for an equivalent C program.
Of course the same is not true for bigger blocks of code, because JavaScript does need to do a lot of bookkeeping behind the scenes.
> Note that my tutorial uses the AT&T assembly language syntax instead of Intel syntax.
This makes me curious: how widely used is the AT&T syntax these days? I could easily be wrong, but my impression was that most work is done in the Intel syntax.
This may be ignorance on my part, as the Intel syntax is all I've ever used on Intel processors. I've coded in a dozen other assembly languages for various processors, but Intel chips seem to be the only ones with two different assembly language syntaxes, right down to the source/destination order.
It's still the same division as always. Linux and other UNIXes on x86 and x86_64 generally use AT&T syntax. The rest of the x86 world uses Intel syntax. And of course *nix has always been a minority on x86.
> I've coded in a dozen other assembly languages for various processors, but Intel chips seem to be the only ones with two different assembly language syntaxes,
It's a historical quirk. Generally, today if there is a widely used and accepted assembler syntax targeting the architecture, UNIX porters would use that assembler's syntax. GAS accepts normal ARM style syntax, for example.
But that's today. x86 is ancient. And so is UNIX on the x86. Going all the way back to the late 1970s, the 8086 was one of the first microprocessors that could host a decent UNIX environment. The rush was on as soon as the chip was released. (Most of the porting fervour would shift to the 68000 about 18 months later when that came out.)
There was no established assembler aside from the one from Intel. MASM wouldn't be written for another couple years. UNIX at this point already had a full quasi-portable toolchain, which of course was familiar to the porters. And they planned to live in an all-UNIX world anyway. The only assembly anyone would ever write would, hopefully, be a dozen lines in the kernel. So they just adapted what they had and what worked. The old PDP-11 UNIX assembler, backwards syntax and prefixes and all. And here we are 40 years later.
The M68k has two different syntaxes too. The argument order is still the same (src, dest; the correct order, in my opinion), but other things are different, the main difference being indirect addressing, which is (a0) in Motorola syntax but a0@ in AT&T syntax.
So while the changes might not be as huge, it's definitely not just Intel.
I’m going to ignore characters such as % and $, and also the parameter order (sort of).
There’s also the fact that the Intel and AMD x86 manuals list them as
MNEMONIC dst, src1
GAS does this:
MNEMONIC src1, dst
Which is fine, but what about 3- and 4-operand instructions? I always have to look that up. 8087 FPU instructions in GAS are also, due to a historical quirk, in “Intel” form:
MNEMONIC dst, src
Then there’s that whole indirect syntax of:
segment:displacement(base,index,scale)
Whereas “Intel” syntax is much clearer:
segment:[base + index*scale + displacement]
This leads to weird things like:
[ebx*2+10] ; I can see what that’s doing
10(,%ebx,2) ; I need to remember what the order is and convert it in my head
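Both spellings encode the same effective-address computation; as a reading aid (my own aside, not the parent’s), it is just:

#include <stdint.h>

/* The address computed by both "[base + index*scale + disp]" (Intel)
   and "disp(base, index, scale)" (AT&T): */
uintptr_t effective_address(uintptr_t base, uintptr_t index,
                            uintptr_t scale, intptr_t disp) {
    return base + index * scale + (uintptr_t)disp;
}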
I’m not trying to convince anyone here. If you like “AT&T” syntax, that’s fine with me. I just prefer “Intel” syntax because of these reasons.
AT&T syntax is best syntax. Linux, GCC, LLVM, etc. all use it. Scott Wolchok writes it in a style that's more verbose than it needs to be. It's probably a good call since this is an educational blog post. Like there's some nuance to when you need suffixes like q on instructions, e.g. `movq %rdi,8(%rsp)`. For example, this is fine: `mov %rdi,8(%rsp)`, but you need the suffix if you move an immediate to memory, e.g. `movq $123,8(%rsp)`. I like to avoid them because mnemonics like `orb` don't appear to the untrained eye as the `or` instruction! The GNU-style suffixes can also appear multiple times within an instruction name in various combinations, e.g. pclmullqlqdq vs. pclmulhqhqdq.
I didn’t write it, I just copied the compiler output from Godbolt so that the article could be read on mobile, where Godbolt’s UI doesn’t work so well and switching tabs back and forth is difficult.
One thing is reading assembly language written by a human, another thing is reading assembly language generated by a compiler. Compilers are insane, especially when you turn on optimizations.
Another thing is that there is more than 1 asm syntax. Popular ones are the Intel syntax and the GNU assembly syntax. The way you read those is different.
If you want to get started in a simple way with little friction I suggest you do this.
1. Go to godbolt.org (Compiler Explorer)
2. Write a simple function (not a program; that will complicate things), such as one with no parameters that adds two hardcoded numbers. Like this:
int sum() {
    int result = 0;
    result += 1;
    return result;
}
3. Compile with no optimizations (-O0).
4. Read the output. Note that the colors represent each statement. Pay attention to that. Also, by hovering over an instruction for a few moments you can get a definition of what it does.
Then you can start making adjustments such as using parameters, adding control flow like if statements, invoking other functions, using float instead of int, etc.
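For instance (my own sketch, not part of the parent comment), a natural next step is giving the function parameters and watching how they arrive:

/* With gcc or clang on x86-64 at -O0, expect to see the arguments spilled
   from %edi and %esi to the stack, added, and the result returned in %eax;
   the exact instructions vary by compiler and version. */
int sum2(int a, int b) {
    return a + b;
}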
Another thing is calling conventions. Functions do not exist in assembly; they're an abstraction. There are many ways to represent a function call, and each platform's ABI picks one (which registers carry the arguments, what goes on the stack, who saves which registers).
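As a rough illustration of one such convention (assuming the x86-64 System V ABI used by Linux and macOS; the function itself is made up):

/* Under the x86-64 System V ABI, the first integer/pointer arguments arrive
   in %rdi, %rsi, %rdx, %rcx, %r8, %r9, and the integer result leaves in %rax. */
long weighted(long a, long b, long c) {   /* a -> %rdi, b -> %rsi, c -> %rdx */
    return a + 2 * b + 3 * c;             /* result -> %rax */
}
/* A caller loads those registers before `call` and reads %rax afterwards;
   which registers it must preserve across the call is also part of the convention. */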
(Edit: I just noticed (yes, I am slow) that you were probably referring to code written by a human for educational purposes. Sorry for the misunderstanding)
> One thing is reading assembly language written by a human, another thing is reading assembly language generated by a compiler. Compilers are insane, especially when you turn on optimizations.
In my experience, it's the opposite. Compiler-generated code has a very homogeneous global structure that is easy to understand even if the compiler does complex stuff at the instruction level. Human-written code does things that force me to be permanently on my guard concerning the control flow and the meaning of data, and I am not even talking about code (like malware) designed intentionally to be confusing. Examples from computer games:
- Calling a function X and wanting to jump directly to another function Y afterwards, by first pushing the address of Y on the stack and then calling X. Sometimes combined with a jump into the middle of another function.
- Manipulating stack frames, accessing local variables of the caller function, etc.
- Using the same location in memory to store completely different data types (imagine a C program using unions everywhere, even for a simple counter variable)
- Organising data at the bit level
- Something extremely popular in older games: Collection types with non-uniform elements. Imagine a kind of array or list where some elements are 4 bytes long and some are 6 bytes.
> Something extremely popular in older games: Collection types with non-uniform elements. Imagine a kind of array or list where some elements are 4 bytes long and some are 6 bytes.
There's this really funny reaction from the audience when Mike Acton suggests writing different types of scene graph nodes into a void* buffer to avoid unnecessary padding which would pollute the data cache. One of the Q&A questions at the end expresses distaste for it because it's using void* instead of proper types.
It's the objectively better solution in terms of performance, but it's just too yuck for many people.
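A rough sketch of that kind of layout in C (hypothetical node kinds, not Acton's actual code): variable-size records packed back to back in one byte buffer, with a leading tag that says how far to step.

#include <stdint.h>

enum { NODE_POS = 1, NODE_POS_ANGLE = 2 };   /* made-up node kinds */

/* Count records in a packed buffer where 5-byte and 7-byte records sit
   back to back with no padding between them. */
int count_nodes(const uint8_t *p, const uint8_t *end) {
    int n = 0;
    while (p < end) {
        switch (*p) {
        case NODE_POS:        p += 1 + 4; break;   /* tag + two int16 coordinates     */
        case NODE_POS_ANGLE:  p += 1 + 6; break;   /* tag + coordinates + int16 angle */
        default:              return -1;           /* unknown tag: bail out           */
        }
        n++;
    }
    return n;
}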
-O0 output is unnecessarily bloated (lots of pointless loading data into a register, saving it to the stack, and loading it back into the same register again), which also makes it hard to read.
I find -Og (on gcc) or -O1 to be the sweet spot.
Compile factorial with AVX enabled and the compiler will often go mad and churn out a 150-instruction body, even though anything but a small n overflows and thus invokes undefined behaviour anyway.
The major breakthrough I had in being able to comprehend assembly code was when I learned to look at it in basic blocks, rather than as a straight line. I don't know if that's super obvious (it seems obvious, written out that way), but just taking a listing and adding newlines at block boundaries is massively helpful to me.
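A tiny illustration (made-up function; the comment marks roughly where the boundaries fall):

/* Roughly how a compiler carves this into basic blocks:
     B0: entry        -- i = 0
     B1: loop test    -- i < n ?        (branch to B2 or B5)
     B2: body test    -- a[i] < 0 ?     (branch to B3 or B4)
     B3: early return -- return i
     B4: increment    -- i++            (jump back to B1)
     B5: exit         -- return -1
   Splitting the listing at the labels and branches gives you these chunks,
   which are far easier to follow than one long column of instructions. */
int first_negative(const int *a, int n) {
    for (int i = 0; i < n; i++)
        if (a[i] < 0)
            return i;
    return -1;
}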
That's definitely how a compiler does it, especially at the IR stage. It's much easier to track values within a localized region; this applies to both people and computers.
Sure, and I'm guessing I learned to read assembly this way from IDA Pro; I'd had to deal with assembly long before IDA was a thing, but I was never an especially able reader of other people's assembly.
But even today, like, when the eBPF verifier coughs up a hairball, I end up pulling the assembly into Emacs and then doing a quick pass marking up the basic blocks before I try to read it.
Back years ago, when I had much, much more free time, I binged the below to learn how to read x86 assembly and use things like gdb. It's a lot of information to take in, but I was getting into reverse engineering back then and have a bit of an obsessive personality, so I thought it was great. Still do, so I refer people to the resources anytime something like this pops up.
The perennial problem for me is the bifurcation from "AT&T" and "Intel" syntax. I don't really care if register names have a $, a %, or no prefix at all. It's the reversal of source and destination operands that is absolutely maddening.
Personally I prefer that destination operands are on the left, because that tracks with assignments in normal programming languages. I don't really understand why you would want the opposite.
And then apply a prefix syntax correction, to end up with
+= foo bar
(Or, equivalently, always mentally add the word "TO" to each operator that modifies an argument, which is most)
Also, most Intel instructions use one or two arguments as input and modify one of them as output. It sort of makes sense to have the output (or input/output) argument always be in the same position, regardless of whether an instruction takes 1 or 2 args. BTW, even in RISC-V, where "add" takes three args (i.e. the form is a = b + c), the output-only argument a comes first.
Yeah, this is how I think about it too. It's natural to have the destination operand always as the first one (i.e. left hand side). I think the GNU tools all use the AT&T syntax (destination is last), and I hate it.
AT&T simply adapted the assembly syntax of earlier machines, which ultimately reflected the architectures of even earlier hardware from a time when there weren't really higher-level languages.
Some bizarre high level conventions, like using '=' for assignment, were also adopted from earlier languages but are now commonplace.
A way to squint at assembly language: as backwards functional code where almost all subexpressions have been recursively extracted out into local definitions (except for leaf un- and binops) ... and then the resulting temporaries reused.
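A toy example of that squint (my own, in C rather than assembly): d = (a + b) * (a - c) becomes a chain of single-operation temporaries, much like the registers in the compiled output.

int combine(int a, int b, int c) {
    int t0 = a + b;     /* one temporary per leaf operation (an addl)   */
    int t1 = a - c;     /* another (a subl)                             */
    int t2 = t0 * t1;   /* temporaries, i.e. registers, then get reused */
    return t2;
}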
Why does assembly use cryptic instruction and variable names? imulq could be integer_multiply_64bit. Everyone has big monitors these days; there's no space that needs saving. Is it just historical inertia and/or “it’s not so bad; if you are smart like me, it’s easy”? Or are there good reasons to keep things so terse?
Because they're called mnemonics for a reason ;) They stop being cryptic after a few days of use, but you'd be stuck forever with overly long descriptive instruction names that repeat over and over and over again.
It's not about saving space, but about reducing visual noise and simplifying "visual pattern matching".
PS: Other CPUs do have somewhat friendlier assembly dialects (Motorola 68k comes to mind), but they all have short 3..5 letter mnemonics.
But that's approximately as true for other programming languages. (EDIT: Good point elsewhere in the discussion where it’s pointed out that assembly is significantly more verbose than nearly any other language so it’s desirable to keep it short.)
It'd be interesting to see an editor mode that could jump back and forth between a mnemonic view and a more descriptive one, perhaps even with argument labels to get rid of src vs. dest confusion.
I think much of the "modern confusion" around assembly code comes from being mainly exposed to raw disassembled compiler output instead of "sane" assembly code written by humans.
Back when writing assembly code was more or less mainstream, "high-level" macro assemblers were used to wrap assembly snippets into fairly advanced macros, which could lead to assembly code that was nearly on the same abstraction level as C: you could define structs and named constants, write complex constant expressions, and so on.
There were also dedicated assembly IDEs like ASM-ONE on the Amiga or Turbo Assembler on the PC, which made assembly programming quite comfortable (I guess the same can be achieved today with relatively little effort by writing a VSCode plugin).
The good reason for this terseness is that assembly is inherently verbose. A single line of code in a high-level language can easily become a dozen in assembly. Your proposal would be taking an already verbose language and making it even more verbose.
“imulq” is not just less typing than “integer_multiply_64bit”, it is less reading too.
If you are annoyed by something innocent like i-mul-q, then check SSE and AVX.
Btw, Intel syntax doesn’t have these b/w/l/q suffixes (it uses operand typing instead) and looks cleaner for simple instructions like mov, mul, add, and, etc. Personally, I don’t see any reason for them to be any wordier.
Upd: it is also easier to read when operands are aligned in the same column, because values move between registers from line to line.
There is a fixed set of instructions so anyone with a bit of experience will know what they mean. It’s different than being verbose with variable names that can be different in different programs.
There are a few CPUs, for which the official assembly language is based on algebraic expressions: instructions look like "R1 = R2 * 4", or "R4 = R1 AND R3". See for example the SHARC instruction set: https://fayllar.org/sharc-instruction-set.html
Algebraic assembly is obviously a brilliant idea, yet so few assemblers have followed suit. I guess the tradition of cryptic mnemonics is too strong...
I don't see how that would help much, as registers are a terrible way to name variables. I think you just want to avoid assembly as much as possible, in general.
Came here to comment and say the same thing. Assemblers already "know" what the opcode is doing (division, etc.) so it's not like it's a high-level-language intrusion to just write it out in a form which is readable in this manner.
The cryptic instruction names in assembly drive me nuts, too.
`integer_multiply_64bit` is much easier and faster for my eye to parse than `imulq`, which is a blob I have to stop and tease apart. I would also be fine with `imul_64` or `int_mul_64`; over time I could probably get used to `imul_q`.
But `imulq` jammed all together is a nope. I still hate `strrchr` and `atof` and all those silly names which parse poorly from the C standard library, and I've been programming C for many years.
I figure you could create a 1:1 assembly transpiler which does two things:
* Maps a bunch of aliases like `integer_multiply_64bit` to their official names (a toy sketch follows below).
* Uses parentheses, argument order, and dereferencing operators according to conventions which are more in line with what people are accustomed to seeing in popular modern programming languages.
Register naming seems a bit tougher, though, and I haven't figured out an approach for deriving intuitive aliases. Suggestions?
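The alias half is just a lookup table; a toy sketch (the long names are my inventions, only the right-hand mnemonics are real):

struct alias { const char *verbose; const char *mnemonic; };

const struct alias aliases[] = {
    { "integer_multiply_64bit",       "imulq" },
    { "move_64bit",                   "movq"  },
    { "jump_if_not_equal",            "jne"   },
    { "load_effective_address_64bit", "leaq"  },
};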
> “it’s not so bad. if you are smart like me, it’s easy”?
There are going to be responses that read like this, well-intentioned or no, but we just have to press on regardless.
The question is: why the effort to make assembly even more verbose?
People who write assembly (they still exist, I guess) will prefer "imulq" after a short time because it's much faster to type. Remember that a line like "a=2+b*f(x+1)" corresponds to 5 to 10 machine instructions.
People who have to read assembly don't really care because after a few minutes you know the mnemonics of the 20 most frequently used instructions (which constitute probably 99% of the code) anyway.
Computers that have to write or read a lot of assembly (some compilers do not directly generate binary machine code) are more efficient with a compact representation.
Okay, if we ignore the above cases, there are maybe five or six people in the world who might prefer "integer_signed_multiply_32bit" or "jump_relative_if_unsigned_less_or_equal".
> People who have to read assembly don't really care
I have to read assembly, and I do care. I'd much rather read something like imul.q than imulq. Assemblers could allow something like this by ignoring such periods or underscores, the same way that many modern programming languages allow you to write 1_000_000 as a synonym for 1000000.
Though a lenient assembler frontend wouldn't necessarily help me, since the assembly code I most often read is dumped by a disassembler I have no control over.
Each letter takes space, and back in the day there weren’t 8TB hard drives and 128-bit CPUs flowing from the distribution centers of Amazon with free 2-day shipping. You had to save space every way you could: mnemonic commands, tabs over spaces, etc...
Note to author: Because this is a personal blog, your RSS feed might as well include the entire post instead of the first few sentences. Just a suggestion.
This is what I get for duplicating the examples across Godbolt and the post itself. It was originally 3D for no particular reason and I cut it to keep things short.
This bothered me as well. The only thing I can think of is that one of the calls to libc functions is consuming rax off the stack. The code as displayed by Godbolt is the code as generated by clang, so it's not an error there.
EDIT: I've just noticed this line:
leaq -24(%rbp), %rsp
It does a manual pop of rax by restoring rsp to what it was before the prologue.
The calls to push and pop R15, R14, and RBX in the perilogue are manipulating the save area. The space below the save area is the locals area, and there are local variables in the function.
The inner save and restore of the stack pointer in R15 is adding further, block-scoped, locals to the locals area for the duration of that block, which is why it hasn't been combined with the outer perilogue (one of the things asked about in a footnote). If this function were more complex, the restoration of the original stack pointer from R15 would be well before the function epilogue, demonstrating the disconnect more clearly.
The actual epilogue begins with moving the stack pointer back up to the bottom of the save area, from wherever it happens to have been after the locals area was allocated, so that the save area can then be popped. In the absence of alloca() and variable-length arrays, this could be done by ADDing to the stack pointer register, as the size of the locals area is known by the compiler and fixed at compile time.
alloca() and variable-length arrays make things more complex, though, and this function contains at least one of those. Getting the amount to ADD right would involve retaining the sizes of the relevant alloca()s and variable-length arrays. (Also, compilers prefer LEA to ADD for the stack pointer, but for the sake of simplicity let's just treat this as ADD.)
Fortunately, the address of the bottom of the save area is known, and working out how much to ADD is thus unnecessary and the problem is moot. The bottom of the save area is by its nature at a fixed offset from the frame pointer, with a size known from which registers had to be saved, so the generated code just resets the stack pointer to that offset relative to the frame pointer, effectively popping off the entire locals area that way.
There's no reason to POP what was pushed from RAX back into the register, because that part of the stack is not part of the save area. It's a dummy local variable that ensures that the stack remains 16-byte aligned once all four of the return address, saved frame pointer, save area, and locals area are in place. As another footnote notes, PUSH RAX is just a quick way of subtracting from the stack pointer without using the ALU, which a SUB instruction would. RAX isn't actually being saved.
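A minimal C sketch of the kind of function being described (hypothetical code, not the article's): once a block contains a variable-length array or alloca(), the size of the locals area is no longer a compile-time constant, which is why the epilogue resets the stack pointer from a frame-pointer-relative address instead of ADDing a fixed amount.

#include <stddef.h>

int sum_of_squares(size_t n, const int *src) {
    int total = 0;
    {
        int tmp[n];                 /* VLA: frame size unknown at compile time */
        for (size_t i = 0; i < n; i++)
            tmp[i] = src[i] * src[i];
        for (size_t i = 0; i < n; i++)
            total += tmp[i];
    }   /* the saved stack pointer is restored here, at block scope,
           well before the function's own epilogue */
    return total;
}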
Ahh, I remember sweating bullets trying to defuse assembly bombs for labs in college. I love how looking at GDB starts out like literally staring at hieroglyphics, but by the end of the semester you get such an intuitive sense for things.
Haven't really 'used' that knowledge directly, and I don't think I will. But it taught me a ton about how computers work, which probably helps me somehow!
I've heard of people writing custom assembly to get more performance... what level of assembly expertise is required to achieve that? And, what does that look like?
Do you let the compiler emit assembly code and then start editing it to make it faster? Or, does it start from scratch? Are there known areas that the compiler has troubles with?
It's almost never 'assembly expertise' - it's architecture expertise - knowing a specific instruction you want to use in a specific way to get some kind of fusion or fill some functional unit using some knowledge that the compiler doesn't have.
The actual mechanics of how to write a program in assembly are trivial if you already know a language like C.
The few times I've been stuck writing code for a platform without access to a compiler, I just wrote it in pseudo-C then hand-compiled it. Once I got comfortable I could "compile" on full mental autopilot, checked out and watching TV at the same time. But it's horribly dull work, like doing arithmetic by hand.
I’d be happy to port the article to ARM64 if there was sufficient demand for that, but I think it’s still fairly niche among developers, right? (Apple, recent Android, niche server environments?)
A subset of 32-bit ARM is quite common these days in education, often replacing where MIPS was once used.
ARM64 has really swept in over the last couple of years. My two-year-old budget Android smartphone has an ARM64 chip. So does my Raspberry Pi. Lots of hardware hacker types likely to play around with assembly would be using ARM, and often 64-bit these days.
That said, it is relatively niche. A lot of those ARM64 machines are running 32 bit software still. But I do expect it to be the dominant architecture in just a couple years.