Say you master this knowledge, how would you go about finding a fitting job?
I'm starting a new chapter in my life and find it impossible to find a position that matches this collection of abilities; the few openings available responded with excuses ranging from "too old" to "too mediocre". It has gotten so bad that I'm now looking into switching to education, specifically teaching gifted youngsters, where my knowledge and insights might rub off, guide, inspire and motivate. But even that is proving to be difficult.
Background: I'm a '70s whizz-kid, Europe (NL), have an ATW-800, nine years of computer science education covering both hardware and software; bad college advice steered me away from the academic path. OS design (Linux, medical, HPC) in the '90s, small single-board computers in the '00s, fractal logic and calculating with structures in the '10s. My handicap: I'm too shy to stand in a spotlight or in front of a camera.
Finding someone willing to pay for properly efficient code means finding a place where that code runs a lot: either because it's running in your data centers (faster => lower energy cost) or because it's going to run on a lot of computers (a browser engine).
I like the libraries that ship with toolchains for using this stuff. Anyone shipping silicon will have toolchain(s), either built in house or via consultancies.
Cutting cycles off integer divide is a waste of time for one application and a serious win if it's the integer divide the compiler injects into all applications.
Further from the machine you get language runtimes (e.g. libc, libm), then application libraries (BLAS, LAPACK, MPI), where it starts to merge with ML or HPC tooling.
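To make the "compiler-injected divide" point concrete, here is a rough illustrative sketch (not from the book): compilers strength-reduce division by a constant into a multiply by a fixed-point reciprocal plus a shift, so cycles saved there are saved in every compiled program. The magic constant below is the standard one for unsigned division by 10.

    #include <cassert>
    #include <cstdint>

    // Unsigned x / 10 without a divide instruction: multiply by
    // ceil(2^35 / 10) = 0xCCCCCCCD and keep the high bits.
    // Compilers emit exactly this shape for division by a constant.
    uint32_t div10(uint32_t x) {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    }

    int main() {
        for (uint32_t x = 0; x < 1000000; x++)
            assert(div10(x) == x / 10);  // matches the real divide
    }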
It’s more than ‘where it runs a lot’ - it’s also where someone can make money by it being more efficient.
So, for instance, the JavaScript interpreter in a browser is meh unless the person doing the work is hired by someone who needs their SPA to run faster.
So: a megacorp looking for someone with a lot of experience making their software more efficient (think a large operator like MS Azure, Google, or AWS), especially if there are supply-chain issues making it hard or expensive to scale up their existing hardware. Also high-frequency trading, where nanoseconds matter.
Or older legacy systems (folks still using mainframes?) where the compliance and interoperability costs of switching to a new system mean they're trying to wring every cycle out of the existing one.
In my experience, the kinds of use cases for this level of optimization tend to be in fields where computers are used as actual computing devices. So think physics, biology, etc. I'd guess most jobs in the HPC space would allow you to get pretty close to this kind of work in practice, although your job title might still just be 'software engineer' or whatever.
Located in .nl, the oil and gas industry surely has some work that benefits from these kinds of optimizations. ASML probably does enough simulation as well. I'd guess there are a handful of financial firms operating on these timescales. Outside of those, the various universities will have groups in this space. I doubt that's an exhaustive list, just some areas where I'd expect to find heavily optimized code for specific hardware.
Not all the knowledge you've mastered will help you find a job, but some of it will help you become a better engineer.
I learned clarinet for several years when I was young, and I learned to code a simple CPU core with VHDL and an FPGA when I was in college. Now I'm writing C# in the healthcare industry.
I may never put that kind of experience on my resume, but it helps me when I work on waveform-editing projects, or when I need to understand what "parallel" means.
So I separate my knowledge into two parts: the things I need to get a job, and the things I'm simply interested in. The latter may or may not help you in the future, but you enjoy the time spent on them.
As for this book, it's mostly useless when you're writing some stupid web server in a language like Java or C#. But optimization is just an interest of mine.
If you're not fixated on this particular skill set but use it as your edge, then you will have a lot more options. You can find a company or a team that has a demanding problem to solve. Note the keyword is _problem_ instead of technology. You may not get to immediately apply your expertise, but you will spot plenty of opportunities fairly soon. Software engineers are really in an extraordinary time when many problems are screaming for efficient solutions and capable engineers.
Something that surprised me is that there is apparently a lot more money in sustainability than I assumed. Large companies are signing big checks. The market pressure is coming from long-term investors, government regulation, and generally increasing awareness of sustainability issues.
It's a _very_ wild guess, but there might be opportunities in that general area for a software engineer who has deep practical knowledge of performance.
We wrote a book (and SIGGRAPH courses) a few years ago, Multithreading for Visual Effects (http://www.multithreadingandvfx.org). The audience for this kind of book turned out to be pretty small.
I didn't expect to see Multithreading for Visual Effects here. I reviewed[1] the book a long time ago (in French[2]). It was an interesting book, full of real-world use cases.
Thank you, that is a nice, positive, in-depth review. It is good to discuss the technical details of 'Algorithms for Modern Hardware', but it also helps to show how real-world projects struggle or succeed in applying various optimizations: in large programs, Amdahl's law will move your focus from one serial bottleneck to the next, and some optimizations are so intrusive that the code becomes unmaintainable or undecipherable.
Casey is great but at times a bit... zealous? I distinctly remember him talking about how worthless having a file system is on a server, or for that matter an operating system.
His work stands at one extreme of how to architect a program, and is an extremely valuable reference, but he states his opinions as facts and that can be detrimental for the beginners he markets some of this material to.
He's not wrong on the server end if your goal is maximum performance. That's the whole point of unikernels and they can lead to some impressive speedups. More so when you consider how many servers are running single apps inside docker containers in some Linux VM running on a hypervisor which is usually Xen.
Massive inefficiencies that could all be cut down to a unikernel running on the hypervisor directly. Whether that's a good idea or not in practice is a different story, but the performance improvements and energy savings are real.
That said, his issue is less the strength of his convictions (which, as far as performance goes, are at least on-point) than his delivery. He has very little patience and is not looking to debate even when it might be beneficial to himself or to others.
I don't at all mean to say he is wrong about performance. Casey is brilliant, his approach to writing software is optimal or near optimal when viewed through the lens of performance.
And yet this is not the only lens through which to view software development. Correctness, expressiveness, portability, developer hours: there are many possible things to optimize for, and Casey barely gives these an ounce of weight compared to performance. The reason to target a program at a platform with an operating system and a file system is that writing those things yourself takes time, while using them affords flexibility and minimizes bugs from your bespoke code. Casey does not view these as problems.
Casey writes near-C for bare metal, or, when not on bare metal, directly against the platform API, and if you only learned from him, that's the only approach you would view as valid. When talking about his approach to writing code he constantly derides, sometimes very harshly, those who would think about program architecture any other way.
I don't think the Rustacean evangelists' obsession with memory safety is the last word in software development, or Linus's assessment of languages other than C, or Jonathan Blow's and Casey's performance-above-all-else orientation. These are ideas, opinions, and there's some value in representing them as such.
100% agreed. I think when it comes to performance, the biggest takeaway from Casey/Jonathan is to just have the computer do less work. If you can get away with just poking at a big buffer then that's what you should do. It keeps the code both simple and efficient, no need for 30 layers of abstractions.
But beyond that, IMO, it's fine to use higher-level constructs and target higher-level "platforms" (e.g. Electron) to improve portability, developer productivity, etc. Even within that setting you can avoid doing things that pessimize your performance, and that's often enough for a massive improvement.
Dogma is the bane of software development, be it abstraction at all costs, immutability at all costs, or performance at all costs.
> I distinctly remember him talking about how worthless having a file system is on a server, or for that matter an operating system.
If I recall correctly, his statement can be more charitably summarised as noting that many of the functions of an OS or FS are not useful in the specific use case he had, which was running a single program serving a website. He would therefore prefer to not have the additional complexity, attack surface and performance overhead. I do not think that this is very controversial? There are of course trade-offs involved, and he did mention that he was planning on running Linux as a way to get device drivers and something booting with reasonable effort, at least initially.
Filesystems being a worthless abstraction on, for example, a database server is the rationale for setting up a DBMS to use raw devices. This is not a popular approach and hasn't been for some time (I don't think any of the popular OSS databases have ever offered this approach, and AFAIK even Oracle is moving away from offering it) but it's not as though this is a radical idea.
edit: the Linux direct I/O feature that sort of obsoletes this use of raw devices was apparently developed in part by Oracle, which I guess might be interesting
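For anyone curious what using that feature looks like, a minimal Linux-specific sketch (the filename is made up; O_DIRECT requires block-aligned buffers, offsets and lengths, with 4096 being a common safe alignment):

    #define _GNU_SOURCE  // for O_DIRECT
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        // Bypass the page cache entirely; the application then manages
        // its own buffering, much as a DBMS on raw devices would.
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;  // aligned buffer

        ssize_t n = pread(fd, buf, 4096, 0);  // goes straight to the device
        printf("read %zd bytes\n", n);

        free(buf);
        return close(fd);
    }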
I think the profiling section could benefit from a section describing the sort of profiling noise described in this video: https://youtu.be/r-TLSBdHe1A?t=436
The noise source described is that changes to the code typically alter the alignment of code and/or data, and that alignment shift, not the change to the algorithm, is often the cause of the performance difference measured. The reason this is noise is that unrelated changes to the code might shift the alignment again.
IIRC, the speaker suggests improving compiler/profiler tools to randomise these alignments and benchmark many times.
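A toy way to observe the effect yourself (a hand-rolled illustration, not the tooling the talk proposes): time the identical loop over identical data at different starting offsets and watch alignment alone move the numbers.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Volatile reads keep the compiler from collapsing the loop.
    long run(const volatile char *p, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += p[i];
        return s;
    }

    int main() {
        const size_t N = 1 << 22;
        std::vector<char> raw(N + 64, 1);  // over-allocate so the start can shift
        for (size_t off = 0; off < 64; off += 8) {
            auto t0 = std::chrono::steady_clock::now();
            long s = run(raw.data() + off, N);
            auto t1 = std::chrono::steady_clock::now();
            long long ns = std::chrono::duration_cast<
                std::chrono::nanoseconds>(t1 - t0).count();
            printf("offset %2zu: %lld ns (checksum %ld)\n", off, ns, s);
        }
    }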
Actually, I thought of another thing: when optimizing some module, I often find myself making a test harness that puts a lot of load on the module in question. Then I tweak it until it is fast. The problem is that I've optimized it in a state where the data is likely to be cached and the branch predictor has learned how the branches go. But then when you put that module back into a full program, the rest of the program might have overwritten our module's state in the cache and branch predictor, so the performance we get is much lower. In which case, Getting Accurate Results requires flushing the cache and branch prediction state in our test harness. I'd be interested to see some ideas on how to do that.
Flushing the cache is easy: just read 4MB (or whatever your L3 cache size is) of some dummy data and pretty much everything you had cached before would be kicked out.
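As a sketch of that trick for a test harness (the 8 MiB size is a placeholder; size it to your actual last-level cache or larger, and call this between timed iterations):

    #include <cstddef>
    #include <vector>

    volatile long sink;  // keeps the loop from being optimized away

    void flush_caches() {
        static std::vector<char> dummy(8 << 20, 1);  // dummy data > L3 size
        long s = 0;
        for (size_t i = 0; i < dummy.size(); i += 64)  // touch every cache line
            s += dummy[i];
        sink = s;  // earlier working-set lines are now evicted
    }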
At the same time, resetting the branch predictor AFAIK can only be done by powering the CPU off and on again.
About two decades ago, I was fighting to optimize a thing where the right answer depended on the icache pressure that the rest of the system was creating. More icache pressure -> inlining bad.
Can't wait to read a copy. The title is great. I knew exactly what this was about before I clicked through.
I've also enjoyed your blog over the past few months, especially your post on the Eytzinger Layout [1], which we've implemented for TigerBeetle's new deterministic LSM-forest [2].
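For readers who haven't seen the post, a compressed sketch along its lines: permute the sorted array into BFS (heap) order, then search by walking k -> 2k, 2k+1, which is friendlier to caches and prefetching than classic binary search.

    #include <cstdio>

    const int n = 15;
    int a[n] = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29};  // sorted
    int b[n + 1];  // 1-indexed Eytzinger layout; b[0] is unused

    int build(int i = 0, int k = 1) {
        if (k <= n) {
            i = build(i, 2 * k);      // fill the left subtree first
            b[k] = a[i++];            // then this node
            i = build(i, 2 * k + 1);  // then the right subtree
        }
        return i;
    }

    // Index in b of the first element >= x (0 if every element is < x).
    int lower_bound(int x) {
        int k = 1;
        while (k <= n)
            k = 2 * k + (b[k] < x);     // branchless descent
        return k >> __builtin_ffs(~k);  // cancel the trailing right turns
    }

    int main() {
        build();
        printf("%d\n", b[lower_bound(8)]);  // prints 9
    }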
I can also understand your decision to write the examples in C++, given all the legacy code that's out there and the easy access to standard-library examples, which are often not optimal. However, if I may make one suggestion, it would be to take the extra time now to rewrite the examples in Zig: it's a clearer, simpler, newer language that makes sense for high-performance coding for the next 10-20 years. It has a great approach to SIMD intrinsics and shares many of the same performance values as your book. For example, the standard library ships with an SoA type that can generate struct-of-arrays layouts at comptime.
As an orthogonal language, it's also brilliant for book examples: notably very readable, and likely to compile on the first attempt. It would make the examples more engaging and accessible, bringing the cool performance ideas across cleanly to your readers, without abstraction overhead, unnecessary complexity or usability issues.
From a community point of view, I think this change would also find you a very engaged, supportive and concentrated systems community right from the get-go.
I think Julia would be a great choice for this. Its features like @code_native for inspecting machine code are great for this sort of performance analysis. Also, it has many of the algorithms mentioned already implemented in the language, so it would be relatively easy to translate.
Sure, C is a fine choice today. However, since the subject matter in this book can potentially have a half life of 10-20 years, perhaps a more modern C in the same performance space with easy compilation and quick learning curve might be better?
First: I love the collection of knowledge. Any project doing that is worth doing.
On the other hand: How do you teach performance engineering? I learned it in the Demoscene, which was simply a competition. "I can do this effect in 2 cycles less". Not everybody will be a good performance engineer. Or should be!
I feel like the people who do well in performance engineering will be competitive and find their wins in whatever environment. Maybe that is a good context for the book: this is what people did to win!
I think that in order to teach performance engineering, the person being taught has to actually care about performance, and I don't think many software folks deeply care about it.
Significantly increasing performance is one of the most rewarding tasks in software development, comparable only to deleting unused code. At least for me.
However, for most business applications, it doesn't bring much value. Modern hardware tolerates a lot of inefficiency, and having some step run a few seconds faster or slower won't matter much. In the last decade I spent maybe a day per year improving performance.
However, when I worked on some mini games recently, performance was important from day one. And I got a lot of joy from that process; programming was fun again!
Caring about whole-system performance is another step beyond that, because you mainly care about second-order effects and not going full blast using all the resources.
Even the performance people on here will argue with me when I give them correct advice, such as: you might not want to use all the CPUs and memory just because they're there.
(For instance because your process is lower priority and so is only scheduled on E-cores and so the # CPUs number you see doesn't actually apply to you.)
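A quick Linux-only way to check one version of this (affinity restrictions such as taskset or cgroups; priority-based E-core scheduling is a separate mechanism that portable APIs won't show you):

    #define _GNU_SOURCE  // for cpu_set_t and sched_getaffinity
    #include <sched.h>
    #include <cstdio>
    #include <thread>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        sched_getaffinity(0, sizeof(set), &set);  // 0 = this process
        // What the machine reports vs. what this process may actually use.
        printf("hardware threads: %u, usable by this process: %d\n",
               std::thread::hardware_concurrency(), CPU_COUNT(&set));
    }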
I think a lot of it comes down to language. In some languages (e.g. Python, Matlab, R), performance engineering is a constant fight against an interpreter, which is painful. Other languages (e.g. C) make it tedious to write fast code because you lack higher-level features.
"Hopefully, with the help of a professional editor — I still haven’t figured out how commas work in English." Haha, impressed by the humility and also the ability to discuss some pretty challenging topics given being not a native speaker. Also, very impressive speed ups on such a broad range of algorithms.
Wonderful! I love performance engineering, and it's hard to shake the feeling that I'm alone in that love in the modern software world!
One note: you've got a few sections on B-trees. B-trees may work well for on-disk data structures (e.g. databases, filesystems), but in my experience skip lists work better in memory, not to mention they're several times simpler to implement.
E.g., I got up to twice the performance moving from a B-tree to a skip list for a rope structure, at half the code size (and another 2x from embedding a gap buffer in each leaf).
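For anyone who hasn't written one, a bare-bones sketch of why skip lists are so simple (illustrative only, not the rope code mentioned above): each node carries a tower of forward pointers, and search drops a level whenever the next key would overshoot.

    #include <cstdio>
    #include <cstdlib>

    const int MAX_LEVEL = 16;

    struct Node {
        int key;
        Node *next[MAX_LEVEL];  // forward pointers, one per level
    };

    Node head = {};  // sentinel; all pointers start null
    int level = 1;   // number of levels currently in use

    void insert(int key) {
        int h = 1;  // coin-flip the new node's tower height
        while (h < MAX_LEVEL && rand() % 2) h++;
        if (h > level) level = h;

        Node *node = new Node();
        node->key = key;

        Node *cur = &head;
        for (int i = level - 1; i >= 0; i--) {
            while (cur->next[i] && cur->next[i]->key < key)
                cur = cur->next[i];
            if (i < h) {  // splice into each level the node spans
                node->next[i] = cur->next[i];
                cur->next[i] = node;
            }
        }
    }

    bool contains(int key) {
        Node *cur = &head;
        for (int i = level - 1; i >= 0; i--)
            while (cur->next[i] && cur->next[i]->key < key)
                cur = cur->next[i];
        cur = cur->next[0];
        return cur && cur->key == key;
    }

    int main() {
        int keys[] = {5, 1, 9, 3};
        for (int x : keys) insert(x);
        printf("%d %d\n", contains(3), contains(4));  // prints "1 0"
    }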
Do you also think skip lists would work better than B-trees for in-memory maps and sets, like Rust's BTreeMap and BTreeSet? From what I've read, the performance of those B-trees in the Rust standard library is already great. http://dtrace.org/blogs/bmc/2018/09/28/the-relative-performa...
I see segment trees in the catalog. They're really popular in competitive programming; there are various segment-tree techniques for solving data-structure problems. However, I have never seen one in real industry code. I once saw a splay tree in Chromium[1] and FreeBSD, but segment trees are rare. I wonder whether segment trees are actually practical?
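For reference, the iterative variant popular in competitive programming is tiny; a sketch with point updates and half-open range-sum queries:

    #include <cstdio>

    const int n = 8;     // number of elements (example size)
    long long t[2 * n];  // leaves at t[n..2n-1]; internal t[k] = t[2k] + t[2k+1]

    void update(int i, long long v) {  // set element i to v
        for (t[i += n] = v; i > 1; i >>= 1)
            t[i >> 1] = t[i] + t[i ^ 1];  // rebuild the path to the root
    }

    long long query(int l, int r) {  // sum of [l, r)
        long long s = 0;
        for (l += n, r += n; l < r; l >>= 1, r >>= 1) {
            if (l & 1) s += t[l++];  // l is a right child: take it, move right
            if (r & 1) s += t[--r];  // r is a right child: take its left sibling
        }
        return s;
    }

    int main() {
        for (int i = 0; i < n; i++) update(i, i + 1);
        printf("%lld\n", query(2, 5));  // 3 + 4 + 5 = 12
    }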
I may be misunderstanding the Wikipedia description on segment trees, but it feels like this is basically what bounding volume hierarchies are used for in computer graphics.
Got it. I think BVHs are more similar to interval trees. But after searching through related data structures, I found that the structure of a segment tree resembles that of an octree: a segment tree looks basically like a 1-D version of an octree.
Really great read, I found it very engaging! It does a great job balancing introducing concepts and teasing more advanced stuff for anyone who wants to explore further.
I wish I had this text when I took my intro to computer hardware course!
I found one or two copy editing errors (missing words, a broken link) but those are easily fixed and I'll make an issue for them.
This knowledge was either the third or fourth class in the CS sequence, Computer Organization. Are there CS programs that do NOT do this? Those would be useless CS departments.
The book we used was Tanenbaum's Structured Computer Organization. It was fantastic, as was his Operating Systems. This class was indeed where the light comes on, as you go from gates to logic to CPUs to signals to microcode to assembly and so on.
What Tanenbaum's book and this one lack is a bit of history. We can learn a lot from the examples of older computers and from what each generation's "innovations" brought us.
> The e-book/printed editions will most likely be sold on a “pay what you want” basis, and in either case, the web version will always be available online in full.
I will get out of Russia in a few months and then figure out the best way to deliver a hard copy.
> Pre-ordering / financially supporting the book. Due to my unfortunate citizenship and place of birth, you can’t — that is, until I find a way that at the same time complies with international sanctions, doesn’t sponsor the war, and won’t put me in prison for tax evasion.
> So, don’t bother. If you want to support this book, just share the articles you like on link aggregators and social media and help fix typos — that would be enough.
This breaks my heart. Supporting justified sanctions is one thing. Witnessing how sanctions are affecting a civilian as an individual is another.
I hope those sanctions keep piling up (as they’re among the very few things that are actually capable of ending the war) so people like Sergey can once again live a normal life.