Hotspot performance engineering fails

dang · on April 28, 2023

Recent and related:

Performance excuses debunked - https://news.ycombinator.com/item?id=35718912 - April 2023 (118 comments)

ummonk · on April 27, 2023

From my experience it's better to just consider performance from the get-go, and carefully consider which tech stack you're using and how the specific logic / system architecture you've chosen will be performant. It's much easier than being stuck with performance problems down the road that will need a painful rewrite.

The whole mantra of avoiding "premature optimizations" was applicable in an era when "optimizations" meant rewriting C code in assembly.

dilap · on April 27, 2023

Yep.

You need to be thinking about performance from the very beginning, if you're ever going to be fast.

Because, like the article said, "overall architecture trumps everything". You (probably) can't go back and fix that without doing a rewrite.

(Though it can be OK to have particular small parts where say "we'll do this in a slow way and it's clear how we'll swap it out into a faster way later if it matters".)

But if your approach is just "don't even worry about performance, that's premature optimization", you'll be in for a world of pain when you want to make it fast.

chii · on April 28, 2023

> you'll be in for a world of pain when you want to make it fast.

but you'll also be in a world of pain if you took so long to architect performance into your app/service that you miss your market window, or fail in some other important metric that causes your entire business to topple over.

It's better to have a product/market fit really early, but poor performance (which can be fixed), than to miss your opportunity to ship in time to get that marketshare and thus fail entirely! Try fixing that!

dilap · on April 28, 2023

Yeah, it could definitely, from a biz perspective, be worth being arbitrarily slow or sloppy. But purely from a tech perspective, expect something like a rewrite if speed becomes a priority.

Sadly, most software today is insanely slow(†), compared to what it could be, which I think reflects the unfortunate business reality that's it better to get anything out to market quickly than actually build fast software.

But you have to be careful or you'll end up in the situation of e.g., Microsoft, which recently had a video bragging that the latest version of Teams did something like take startup time down from 10 seconds to 5 seconds (video seems to be gone now, alas). They're apparently working heroically to try to improve the performance, but it's really really hard since slowness was baked into the architecture.

(† Just typing this comment in Safari is lagging pretty terribly. This on a Macbook M1 pro. Though of course just entering text in a textbox should be snappy even on a potato.)

MaxBarraclough · on April 28, 2023

> You need to be thinking about performance from the very beginning, if you're ever going to be fast.

It's the same thing with security, or for that matter portability. Changes in fundamental goals are hard to 'bolt on' after the system has been built.

bboygravity · on April 28, 2023

Example: if you chose to develop your app in Electron it's never ever going to be close to as fast and stable as it could be?

throwaway173738 · on April 28, 2023

Well it depends. Are you doing anything actually CPU-intensive at all? What makes something a problem is the business context. If you’re building a modeling system for mechanical engineers it might be a bad idea to write your back end in Node. But if you’re building an in-house tool for salespeople or technicians to use to enter data it might be perfectly reasonable to optimize for time to market. Performance is only a problem if it hinders the business or the end-customer in some way.

This also means you have to understand the customer’s problems before trying to optimize or before starting to architect your system.

smolder · on April 27, 2023

The other thing that's changed from the 'every optimization is premature' era is that shrinking CPUs don't result in big gains in frequency anymore -- Moore's law isn't going to make your python run at C speed no matter how long you wait for better hardware.

cma · on April 27, 2023

That and memory latency has improved much slower than everything else, so pointer chasing implicit throughout languages like Python is just horrendously slow. SRAM for bigger cache isn't scaling down anymore either in the last several process nodes.

0x000xca0xfe · on April 27, 2023

And the speed of light ensures that memory latencies won't get much better until CPUs are small cubes made of SRAM.

amluto · on April 27, 2023

Come again?

Modern servers seem to have about 100ns latency to main memory. The speed of light (actually electrical signals) delay is maybe 1-2ns.

0x000xca0xfe · on April 27, 2023

Thanks for the response, this made me dig around some more in this fascinating topic.

I found multiple papers that also simply use 100ns for RAM latency (e.g. this one[1]) but in reality it is a lot more complicated than one number. E.g. the time to first word for DDR3 is just above 6ns[2]. I'm not a CPU engineer but it looks like CPUs don't necessarily wait for the entire cache line and can utilize the first word straight away.

Also, the signal needs to travel both ways and the traces are not straight lines. I'm not an electronics engineer either so the best number for the maximum DRAM trace lengths I could find are 12-30cm, which would be 1-2ns indeed. This means that there is some room for improvement (in theory).

But even with ideal RAM it is physically impossible to beat register access at 0.17ns (i9-13900KS @ 6GHz) or even L1D cache at 0.7ns[3].

What is causing the extra 4-5ns latency of the RAM? Would it be physically possible to remove it?

[1] https://www.usenix.org/system/files/conference/hotcloud18/ho...

[2] https://en.wikipedia.org/wiki/CAS_latency

[3] https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memo...

pclmulqdq · on April 28, 2023

The extra delay comes from resistance and capacitance in the circuit, which makes electrical signals slow down (this is the ELI5 version). DRAM cells and long wires in the DRAM array are big capacitors and they are driven by circuits with fairly weak output (the resistance). Device physics makes these resistances and capacitances extremely hard to reduce. That is why on-device latency for DRAM has held fairly steady at ~5 ns for a very long time.

ummonk · on April 28, 2023

Motherboard manufacturers certainly have to consider DRAM trace lengths when supporting modern DDR5 clock speeds, which are now 6,000 Mhz for top of the line hardware.

The CAS latency for RAM is considerably higher and has been decoupled from this - as clock speeds go up CL multipliers go up and the CAS latency ends up being roughly the same - ~10 ns for the best RAM.

I'm not sure why CAS latency can't be driven lower. Row selection (tRCD) could be limited by the physics of charging up sense amplifiers, but I don't see why the logic for column address shouldn't be able to run faster. It's probably just something that hasn't been an area of optimization for the simple reason that there are many other much larger sources of latency.

And on that note, a CPU won't just jump to accessing some random word from RAM and copying it into a register. It will first look through L1, then L2, and then L3 before seeking the data from RAM. It will then copy an entire cache line into all three caches and load the target word into a register. This entire process results in ~100 ns latency.

actionfromafar · on April 27, 2023

Ehrm. Small spheres of SRAM, if I may.

speed_spread · on April 27, 2023

That's one way of dealing with corner cases.

ummonk · on April 28, 2023

Imagine having machine learning model weights just be encoded in blocks of SRAM right on the CPU die.

astrange · on April 28, 2023

Conversely, you can start putting ALUs and matrix multiplies in the DRAM.

sidpatil · on April 28, 2023

This is what I thought they meant when they mentioned "in-memory compute".

alberth · on April 28, 2023

That already exists.

https://youtu.be/GVsUOuSjvcg

pjmlp · on April 28, 2023

Not at all, but shipping a JIT compiler in the box would make it usable for plenty of workloads, even if C would still be faster.

Psychlist · on April 27, 2023

If you have a fast design/architecture, you may never need to optimise the code at all. But the flip side is that with a bad design or bad architecture optimising the implementation won't save you. With a sufficiently bad architecture starting again is the only reasonable choice.

I've seen code that does "fast" searches of a tree in a dumb way come out O(n^10) or worse (at some point you just stop counting), and the solution was not to search most of the tree at all. Find the relevant node and follow links from that.

Meanwhile in my day job performance really doesn't matter. We need a cloud system for the distributed high bandwidth side, but the smallest instances we can buy with the necessary bandwidth have so much CPU and RAM that even quite bad memory leaks take days to bring an instance down. Admittedly this is C++ with a sensible design (if I do say so myself) so ... good design and architecture means you don't have to optimise.

jamesfinlayson · on April 28, 2023

> If you have a fast design/architecture, you may never need to optimise the code at all. But the flip side is that with a bad design or bad architecture optimising the implementation won't save you. With a sufficiently bad architecture starting again is the only reasonable choice.

Yep, completely agree. I worked at a company with a poorly architected high-throughput system that was written in Perl. It got to a point where no more optimisations could make it scale, so it was rewritten. Of course the rewrite in a "faster language" was touted as the reason for its success but the truth was the new architecture didn't pound the database anywhere near as much.

ww520 · on April 28, 2023

That quote has been misunderstood by many. What Hoare and Knuth indented was for people to think of algorithm and design optimization first instead of doing micro optimization prematurely. Instead people stop all kinds of optimization.

VHRanger · on April 28, 2023

The very next sentence after the famous "premature optimization is the root of all evil" says that "a 12% gain easily achieved should always be done".

That sentence was always taken out of context

bravura · on April 28, 2023

I know this is probably an unpopular position, but my ML/NLP Ph.D. advisor always said not to do any optimization unless it would yield 10x better performance.

I guess that makes sense in the context of large-scale data research, where you should be thinking big if you really want to have outsized impact.

VHRanger · on April 28, 2023

I mean that's obvious nonsense.

Even in ML, which I work in, optimizing a data pipeline or model code to be 3x faster means you can run 3 experiments a day instead of 1.

There's a huge chilling effect from slow code in research iteration. But par for the course as academic code tends to be absolute trash.

hedora · on April 28, 2023

Ideally, a performance-related systems dissertation would achieve at least 10x vs. state of the art, but showing that is going to inevitably involve stacking many 10-20% optimizations just to get to par.

Of course, there is plenty of low quality academic work that only compares to some existing system that has incomprehensibly bad performance, then shows a 10% improvement. That sort of research work should definitely be avoided.

VHRanger · on April 28, 2023

My point isn't when you should be optimizing to the point that the optimization is publishable.

My point is that you should be writing fast optimized code by default in ML research because you tend to throw large volumes of data at that code multiple times a day, and slow code reduces the number of experiments you can run.

This isn't what I see however, lots of research code is horribly slow.

alpaca128 · on April 28, 2023

> I guess that makes sense in the context of large-scale data research

It makes the least sense in that context imho, as any optimization leads to larger differences the more data you process. Any performance difference is multiplied by the amount of times that code runs, that's why people look closer at hot loops than init code that runs once at startup.

usefulcat · on April 28, 2023

Sometimes a position is unpopular for good reason.

astrange · on April 28, 2023

That's about when you should rewrite a project from scratch vs incrementally improving it, but of course it depends on the size of the project.

attractivechaos · on April 27, 2023

A catch in Knuth's famous quote is how to define "premature". I am not old enough to see how programmers in his time thought about "premature", but my impression is quite a few modern programmers think all optimizations are premature.

taeric · on April 28, 2023

You can lookup the paper he says it in. The specific construct under question is the GOTO, and he showcases a topological sort that has an early exit that requires one. General framing is that you should use the easier to reason about constructs as you can, but it would be a shame to remove a tool from our toolbox that can make a significant impact.

One need only look at the programs he typically writes to see that he is far more obsessed with performance than most of us even know how to be.

(I am going off memory here, I should say. Mayhap I have the source wrong. Pretty sure that is the discussion, though.)

alpaca128 · on April 28, 2023

In the end optimization is always important, but different kinds in different stages of the project. Changing the structure is the easiest at the start when there's not much code, while doing crazy performance tweaks of a function that may be thrown away a day later is a waste of time.

lowbloodsugar · on April 27, 2023

To play devils advocate: we must be careful about what we mean by "optimizations". If a problem is not completely understood, then it may be better to stay in a high level language like Java so that one can rapidly iterate, than it would be to drop down to a strictly performance language, like Rust or C. It is far easier to downcode some service in a well architected Java system than it is to iterate a Rust system. It's also way better to have a working system that can be made cheaper by downcoding, than it is to have an optimal system that can't do the job because of its architecture.

Also, within the last few years I have written assembly for an application originally in Java, so that's still on the table. =)

adamnemecek · on April 27, 2023

I agree with the “premature optimization”. It’s one of those phrases like “correlation does not imply causation” that makes my blood boil. Like cool dude, did you just take freshman CS.

larsrc · on April 28, 2023

You'd be amazed at how much experienced programmers fret about local "optimisations" that are actually made irrelevant by the compiler and just serve to make the code harder to understand. And with incomprehensible code comes actual errors, not just slowness.

gashmol · on April 28, 2023

IMHO the crutial word is not premature but rather optimaization. Not everything is an optimazation.

Merging two loops into one is an optimazion that should be postponed until it's necessary and the system have been profiled.

Joining two large lists by a hash join instead of a nested loop is a design decision that should be made at the outset.

EVa5I7bHFq9mnYK · on April 28, 2023

At the time of Knuth, 100 programmers were lining up to wait for their turn to run a short program on a mainframe. Now, a programmer can start 100 machines by changing a number in a script. The ratio of programmer time cost : machine time cost changed dramatically.

patrulek · on April 28, 2023

Just wait for the EU to impose draconian fines if your service/data center produces more than xyz of CO2 or it produces more than product of your kind should produce on average. Then you will be basically forced to lean over performance.

EVa5I7bHFq9mnYK · on April 28, 2023

Are you saying JS will be outlawed? I'm ok with long prison sentences for that atrocity, not only fine.

secondcoming · on April 27, 2023

Well, Lemire is renowned for his SIMD algos.

selcuka · on April 28, 2023

> The whole mantra of avoiding "premature optimizations" was applicable in an era when "optimizations" meant rewriting C code in assembly.

Well, these days the equivalent to that seems to be rewriting {{ $HIGHER_LEVEL_LANGUAGE }} code in Rust.

I would still be conscious of "premature optimizations" and wouldn't rewrite all the things in Rust (or something else) before profiling my existing code.

pdimitar · on April 28, 2023

I mean, who's stopping you? Anyone holding a gun at your head to rewrite stuff in Rust?

manv1 · on April 27, 2023

TL; DR: "It's better to design a fast system from the get-go instead of trying to fix a slow system later."

That's basically true. I worked on a system that was Java/scala/spring/hibernate and it was just slow. It was slow when it was servicing an empty request, and it just went downhill from there. They just built it wrong...and they went ahead and built it wrong again.

Today, I could replace it was a few hundred lines of node in AWS/Lambda and get multiple orders of magnitude of performance.

selcuka · on April 28, 2023

We once rewrote an internal system because it was hard to maintain and the original developer had left the company. It was written in Java/Spring, and Kafka, but Kafka was being used like a database. It was just bad architecture.

We re-implemented it as a small Flask app with a PostgreSQL backend. We got rid of multiple servers, it was at least a magnitude more performant, and much easier to maintain and deploy as there were very little moving parts.

tonyarkles · on April 27, 2023

> Today, I could replace it was a few hundred lines of node in AWS/Lambda and get multiple orders of magnitude of performance.

I had a fun bake-off a few years back. I was in more of a devOPS role (i.e. mostly Ops but writing code here and there when needed) and we needed something akin to an API Gateway but with some very domain-specific routing logic. One of the developers and I talked it through, he wanted to do Node, I suggested it would be a perfect place for Go. We decided to do two parallel (~500 LOC) implementations over a weekend and run them head-to-head on Monday.

The code, logically, ended up coming out quite similar, which made us both pretty happy. Then... we started the benchmarking. They were neck and neck! For a fixed level of throughput, Go was only winning by maybe 5% on latency. That stayed true up until about 10krps, at which point Node flatlined because it was saturating a single CPU and Go just kept going and going and going until it saturated all of the cores on the VM we were testing on.

Could we have scaled out the Node version to multiple nodes in the cluster? Sure. At 10krps though, it was already using 2-3x the RAM that the Go version was using at 80krps, and replicating 8 copies of it vs the 2x we did with the Go version (just for redundancy) starts to have non-trivial resource costs.

And don't get me wrong, we had a bunch of the exact same Java/scala/spring/hibernate type stuff in the system as well, and it was dog-ass slow in comparison while also eating RAM like it was candy.

jerf · on April 28, 2023

What you hit there is that Node's HTTP server is written in C. This is not "cheating", as you saw in that cookoff, the performance is real. But as you go back into the JavaScript engine, you'll return to JavaScript levels of performance, which aren't dozens of times slower than Go, but definitely generally slower in a noticable way even on one core.

This isn't criticism, just something engineers should know. If you've just got a little tiny task, and it fits on one core (and one well-used core does a lot nowadays), a Node solution can be effectively near C. It does have a sharp rise in costs after that, relatively speaking, but that's still a nice little performance curve for a lot of use cases.

tonyarkles · on April 28, 2023

Oh definitely! Yeah, as reasonably performance glue between native modules I'd have no hesitation using Node. I've mentioned elsewhere in this thread about my day job processing 500MB/s worth of live imaging data... I wouldn't be considering Node for that :). In "normal operation" it saturates about 6 of the 12 cores we've got available on the "embedded" system (is it really "embedded" if it has 12 beefy ARM cores and 32GB of RAM?).

manv1 · on April 27, 2023

Yeah, the one time I used go it was pretty good. The big question is always whether your stuff spends more time waiting or more time processing. For the former, it's node. For the latter, it's go.

ilrwbwrkhv · on April 28, 2023

As an aside I am so grateful for people like Casey Muratori (whose article is referenced in that blog).

Their rants about speed and performance is so needed in this world where a chat app like Slack literally gets the fans spinning.

There are few of them remaining but they shine brightly in an otherwise messy and slow land.

Recently I had to touch a nuxt js project and my experience with it was so janky. The hot module replacement sometimes works, sometimes doesn't and basically it is worse than having to refresh manually every single time.

titzer · on April 28, 2023

Just optimizing the "hotspots" is not the only strategy to improving performance. As Daniel alludes to towards the end, it's about maximizing effort versus payoff.

E.g. you might find a hotspot that indicates 20% of the time is spent in a very small function. Yet that small function might be already close to optimal. After intense effort, you improve this function by 25%. Great, you saved 5% overall. But there might be a 5% somewhere else in the program that shouldn't exist at all, and if you remove it, you also get a 5% speedup.

Another example is that maybe you have a loop around 10 function calls that calls each of them 5 times. Each of them might show up as 10% of the execution time. But then you discover that the entire loop only needs to execute once. So you can speed up the program by 5x by only iterating once. Yet the proportion in the profile doesn't change; the program just does 5x less work.

Profiles can be deceiving.

0x000xca0xfe · on April 27, 2023

Optimizing for modern CPUs means optimizing for predictable memory accesses and program flow. Minimizing memory usage helps a lot, too.

Unfortunately this is pretty counterintuitive and most programming languages do not make it easy. And if you optimize for size you almost get laughed at.

usefulcat · on April 28, 2023

> And if you optimize for size you almost get laughed at.

Only by people who don't know much about optimization, which is the real problem. Memory caches are obviously of fixed size and relatively small.

sosodev · on April 27, 2023

This is true if your end goal is to have a super fast program but that is very rarely the case. The GTA online loading times issues went unnoticed for years because Rockstar just didn't care that the loading times were long. Users still played the game and spent a ton of money.

Performance hotspots often are the difference between acceptable and unacceptable performance. I'm sure I'm not the only person who has seen that be the case many times.

hinkley · on April 27, 2023

I don't think people understand the ways that we have adapted to delays. At least once a month I complain about how when we were kids, commercials were when you went for a pee break or to get a snack. There was no pause button. Bing watching on streaming always means you have to interrupt or wait twenty five minutes.

I suspect if you spied on a bunch of GTA players you'd find them launching the game and then going to the fridge, rather than the other way around.

usrusr · on April 28, 2023

And it's not impossible at all (even if perhaps not actually likely) that their entire microtransaction business would run noticeably worse if players were able to jump in and out of a session in an instant. That fridge run during launch? It's a sunken cost that should better be worth it. Now they are committed.

(edit: reading some other parts it turns out the experiment has actually been made, console versions that never had the startup bug are apparently doing fine)

eklitzke · on April 27, 2023

>This is true if your end goal is to have a super fast program but that is very rarely the case.

This is true in some banal sense, but kind of misses the point that there are certain domains where high performance software is a given, and in other domains it may rarely be important. If you're working on games, certain types of financial systems, autonomous vehicles, operating systems, etc. then high performance is critical and something you need to think about quite literally from day one.

tonyarkles · on April 27, 2023

> This is true in some banal sense, but kind of misses the point that there are certain domains where high performance software is a given

I work in a field where we're trying to squeeze the maximum amount of juice out of a fixed amount of compute (the hardware we're using only gets a rev every couple of years). My background (MSc + past work) was in primarily distributed systems performance analysis, and we definitely designed our system from day one to have an architecture that could support high performance.

The GP's comment irks me. There are so many tools I use day-to-day that are ancillary to the work I do where the performance is absolutely miserable. I stare at them in disbelief. I'm processing 500MB/s of high resolution image data on about 30W in my primary system. How the hell does it take 5 seconds for a friggin' email to load in a local application? How does it take 3 seconds for a password search dialog to open when I click on it? How does WhatsApp consume the same amount of memory as QGIS loaded up with hundreds of geoprojected high-resolution images?

I agree that many systems don't require maximum-throughput heavy optimization, but there's a spectrum here and it's infuriating to me how far left on that spectrum a lot of applications are.

vgatherps · on April 27, 2023

I feel the same frustration. I work in a field with stupendously tight latency constraints and am shocked by the disparity vs how much work we fit into tiny deadlines, vs how horrifyingly slow gui software written by well resourced mega corporations is.

It feels to me like user interfaces are somehow not considered high-performance applications because they aren't doing super-high-throughput stuff, they're "just a gui", they're running on a phone, etc. All of that is true but it misses that guis are latency/determinism sensitive applications.

I remember hearing some quote about how Apple was the only software company that systematically measured response time on their GUIs, and I'd believe it because my apple products are by far the snappiest and most responsive computing devices I have (the only thing that even competes is a very beefy desktop).

tonyarkles · on April 27, 2023

Yeah, exactly, like... we're doing microsecond-precise high-bandwidth imaging and processing it real-time (not in the Hard Real-Time sense, but in the "we don't have enough RAM to buffer more than a couple of seconds worth of frames and we don't post-process it after the fact" real-time sense) with a team of... 3-5 or so dedicated to the end-to-end flow from photons to ML engine to disk. The ML models themselves are a different team that we just have to bonk once in a while if they do something that hurts throughput too badly.

I'm sure we'd be bored as hell working on UI performance optimization, but if we could gamify it somehow... :D

xarope · on April 28, 2023

I'm now in a new position which requires me to interface with Microsoft products regularly; outlook, teams, etc, are exactly what you say; why does it takes 5 seconds to search for a locally cached email, meanwhile ripgrep has parsed my entire drive in about the same order of magnitude.

Kranar · on April 28, 2023

GTA5/online is based on one of the most sophisticated and highest performance gaming engines developed at the time (the RAGE engine).

The infamous loading bug [1] is something that only happened on the PC release of GTA5, which among all platforms was the least popular and came out two years after the console release.

Given how many modern games have downright failed due to performance issues/frame rate issues plaguing the game on release, I can assure you that had GTA5 released with significant performance issues it's quite likely that the reviews for the game would have bombed and it would have sold considerably less than it did.

[1] https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...

sosodev · on April 28, 2023

I mentioned GTA online because the article mentions it.

rightbyte · on April 28, 2023

Wasn't the loading time getting worse with time as items where added to the master item list too? I.e. initial testing was ok?

kristjansson · on April 28, 2023

Hunting hotspots is a great way to find out where you’ve done dumb things, but it can’t show you where you haven’t done smart things.

PaulHoule · on April 27, 2023

I went through a time when I was pitching a "boxes-and-lines" data processing tool like

https://www.knime.com/

which more-or-less passed JSON documents (instead of SQL rows) over the lines and found that the kind of people who bought and finance database startups wouldn't touch anything that couldn't be implemented with columnar processing.

I thought that this kind of system would advance the "low code" nature of these systems because with relational rows many kinds of data processing require splitting up the data into streams and joining them whereas an object-relational system lets you localize processing in a small area of the graph and also be able to reuse parts of a computation.

Columnar processing is so much faster than row-based processing and most investors and partners thought that customers really needed speed at the expense of being able to write simpler pipelines. Even though I had a nice demo of a hybrid batch/stream processing system (that gave correct answers), none of them cared. Thus, from one viewpoint, architecture is everything.

(Funny though, I later worked for a company that had a system like this that wasn't quite sure what algebra the tool worked on and the tool didn't quite always get the same answer on each run...)

continuational · on April 27, 2023

This is true, when you keep optimizing, you soon face death from a thousand paper cuts. But often, it's enough to find that bottleneck and make it a few times faster.

hinkley · on April 27, 2023

The solution to this is zone defense instead of man-man.

The sad fact is that a manager won't approve you working on something that'll save 1% CPU. But once the tall and medium tent poles have been knocked down, that's all there is left. There are hundreds of them, and they double or triple your response time and/or CPU load.

I've had much, much better outcomes by rejecting trying to achieve an N% speedup across the entire app, and instead picking one subject area of the code and finding 20% there. You deep dive into that section, fulling absorbing how it works and why it works, and you fix every problem you see that registers above the noise floor in your perf tool. Some second and third tier performance problems complement each other, and you can avoid one entirely by altering the other. The risk of the 1% changes can be amortized over both the effort you expended learning this code, and the testing time required to validate 3 large changes scattered across the codebase versus 8 changes in the same workflow. Much simpler to explain, much easier to verify.

Big wins feel good now but the company comes to expect them. In the place where I used this best, I delivered 20% performance improvements per release for something like 8 releases in a row, before I ran out of areas I hadn't touched before. Often I'd find a perf issue in how the current section of code talks to another, and that would inform what section of code I worked on next, while the problem domain was still fresh in my brain.

bastawhiz · on April 27, 2023

> And that explain why companies do full rewrites of their code for performance: the effort needed to squeeze more performance from the existing code becomes too much and a complete rewrite is cheaper.

The article provides reasons why optimization gets harder, but no arguments for why a rewrite is better. It's unclear whether the author is arguing for rewrites or whether they're simply pointing out why companies take them on.

Arguably, though, companies taking on a full rewrite surely must have considered the cost of optimization (versus naively saying "the system is slow, replace it!"—though maybe some did). Rewrites are big, expensive, and time-consuming. It means new bugs and unknown unknowns, and no time to add features or fix bugs because you're busy rewriting functional code. It's a scapegoat for lack of improvement or progress. You shouldn't take one on lightly.

At the same time, this post also neglects that some efficiency wins have little to do with the efficiency of the code, but rather the efficiency of the logic. An N+1 query in your application looks like your database is slow: you're wasting a ton of time sitting and waiting for your DB to return information! But the real problem is that you're repeatedly going back-and-forth to the database to query lots of little pieces of information that could have far more efficiently been queried all at once.

> It is relatively easy to double the performance of an unoptimized piece of code, but much harder to multiply it by 10. You quickly hit walls that can be unsurmountable: the effort needed to double the performance again would just be too much.

That's not really true, though. One bad SQL query can go from many seconds or minutes to milliseconds. One accidentally-quadratic algorithm can take orders of magnitude more time than a linear-time algorithm. One bad regexp can account for the majority of a request. Of course, as you fix the biggest performance problems, the only problems left are ones that are smaller than your biggest ones, so you'll have diminishing returns.

But it also begs the question, what choices has your existing code made that makes it _ten times_ slower than you want it to be? In my experience, you're doing work synchronously that could have been put in a queue and worked on asynchronously. It's more often "you're doing more work than you should" or "you're being inefficient with the resources you have available" than "a specific piece of code is computationally inefficient".

nemothekid · on April 27, 2023

>The article provides reasons why optimization gets harder, but no arguments for why a rewrite is better. It's unclear whether the author is arguing for rewrites or whether they're simply pointing out why companies take them on.

He didn't argue a rewrite is just "better"; his argument was that a rewrite was the only card of the table. The architecture was deficient and to get more performance you have to change the architecture, which means a rewrite.

I tend to agree; I take the view that most engineers are smart, and compilers/interpreters/virtual machines are even smarter so most targeted optimizations aren't going to result in very much gain. A codebase full of N+1 queries or unindexed queries never cared about performance to begin with.

For true gains, you will have to think about data which is the true bottleneck for most applications - getting data from memory, the disk or the network will be much longer that any instruction cycle. The way memory moves through your application is baked into your architecture and changing this will almost always involve a rewrite. To your final point,

>In my experience, you're doing work synchronously that could have been put in a queue and worked on asynchronously.

moving from a synchronous codebase to an async one almost always involves a rewrite.

josephg · on April 27, 2023

> I tend to agree; I take the view that most engineers are smart, and compilers/interpreters/virtual machines are even smarter so most targeted optimizations aren't going to result in very much gain.

This hasn’t been my experience. As an example, I find there’s an awful lot of performance left on the table in most programs because of the sloppy way programmers use memory.

Most programmers don’t think twice about allocating memory and pointer indirection, and most programs are full of it. But if you can refactor this stuff to use bigger objects and fewer allocations, and use object pools, arenas and inline allocation (smallvec, SSO and friends) where it’s appropriate you can usually improve your performance by several times in most “already optimized” programs. The performance comes from fewer malloc calls (malloc is expensive) and fewer cache misses (because locality improves). It’s like you say - Cache misses are super expensive on modern hardware. And you often don’t need a full rewrite to tweak this stuff - just some careful refactoring.

I once saw a 20x performance uplift in a benchmark because we were using some tree structure where each leaf node only stored a single value. We replaced it with something that had arrays at the leaves and larger arrays in the internal nodes and performance skyrocketed.

The compiler isn’t smart enough to suggest any of this stuff. If you use Box<X> in rust instead of X, it’ll happily give you a slow program. Slower than javascript in many cases. And the default Vec and String types in rust’s standard library allocate even if the contents would fit in a pointer.

In javascript, Python and friends you can’t even implement a lot of low-allocation data structures because every list and object is inescapably a pointer to a heap object. This is why JS will never be as fast as C - you can’t write fast, nontrivial data structures. There’s a ceiling on the performance in languages like this - and if you need more performance than JS can give you, then a rewrite might be the right call.

Another example: I had a bug yesterday where writing a 5mb JSON file took about 1 second. Turned out I wasn’t using a buffered writer. Wrapping my File in BufferedWriter::new() made the time taken drop from about 1 second to 0.01 seconds.

The compiler isn’t very smart. By all means, rewrite your software to be fast. But there’s also usually big performance wins to be had in almost any program if you take the time to look.

charcircuit · on April 27, 2023

It's always about architecture. In the micro these are the hotspots you optimize in the macro these are the large rewrites you see.

Performance is not the only thing that you should optimize your architecture for. Factors like adaptability, robustness, ease of understanding, speed of implementation, maintance cost, etc are things that you should consider. The factors that are the best today are not always still the best in the future which is why rewrites are a part of any software's life cycle.

aranchelk · on April 27, 2023

That stuff is boring. Don’t be a killjoy.

mattpallissard · on April 28, 2023

Cherry-picking here.

> But even if you can find the bottlenecks, they become more numerous with each iteration. Eventually, much of your code needs to be considered.

> The effort needed to optimize code grows exponentially.

> And that explain why companies do full rewrites of their code for performance: the effort needed to squeeze more performance from the existing code becomes too much and a complete rewrite is cheaper.

I agree with this somewhat. Yes the effort needed doesn't scale linearly. Yes you hit a point of diminishing returns. And ABSOLUTELY should you write things with _some_ thought for the future the first time.

However, The number of bottlenecks is hardly ever the issue. Most places have a really hard time profiling their applications. They don't know how to effectively. They don't have the instrumentation to do so. They don't think about the time complexity of various operations, standard library methods, storage, and network operations.

Rarely does it get more complicated as you knock out the low-hanging fruit. It just gets more tedious. Asking more questions about metrics. Running down more dead-end rabbit holes. And no, there is never an "end"; you just move from one bottleneck to another until you fix whatever your current issue is. This is true for organizations of all shapes, sizes, and skill levels.

Taking inefficient code that someone else wrote and fixing it in place has been my shtick for well over a decade now. In my experience a complete re-write, while fun, is almost never cheaper than improving what you already have or incrementally breaking apart your service and re-writing portions of it. If it is you have a small application that doesn't have a lot of man-hours invested in it.

MattPalmer1086 · on April 27, 2023

It was interesting to read about why hotspots are not the whole story in performance. They are still important though.

Facebook may have the resources and/or need to do complete rewrites of everything to squeeze out more performance, but most companies don't.

I've personally improved performance of a lot of code significantly by identifying hot spots. So calling hotspot performance engineering a fail seems a bit unnecessarily provocative.

jesse__ · on April 27, 2023

> Facebook may have the resources and/or need to do complete rewrites of everything to squeeze out more performance, but most companies don't.

Actually, if you watch the video Casey put together, he very clearly demonstrates most companies _do_.

MattPalmer1086 · on April 28, 2023

Hardly. He takes evidence from Facebook, Twitter and Uber. These are not small companies and they operate at vast scale, unlike the majority of companies who engage in software development.

kens · on April 28, 2023

Knuth's comments on optimization are very nuanced. I beg people to not use the phrase "premature optimization" without reading what he actually said.

See page 8 of https://pic.plover.com/knuth-GOTO.pdf

seer · on April 28, 2023

One trick I’ve learned is to strive to never be at the limits of the current architecture.

Clear requirements are well understood until they aren’t, most projects I’ve been involved in strongly resembled a factorio game - you build a system that’s well designed for its requirements and then the business comes and demands 10 times the size/performance/speed …

So I try to have a system where changes and expansions “feel” easy and natural, if they aren’t then thats a big flag to go ahead and start refactoring _now_ so when the business inevitably comes with its changes, you are ready.

samsquire · on April 28, 2023

I think if we can make architecture cheap to change, then we can make software that can be drastically optimised from a high level.

I think techniques such as differential dataflow, materialized views and scheduling can optimise many performance problems.

I often start with something fast and then try add functionality to it to make it useful.

Architecture decides your performance characteristics.

NovemberWhiskey · on April 27, 2023

I don't think I entirely agree with the premise here. Yes, it is extremely difficult to engineer performance in after the fact; but assuming you've got an architecture that's basically fit for purpose (from the performance perspective), then improving by targeting hotspots is sound, isn't it? That's literally Amdahl's law.

Sesse__ · on April 27, 2023

Amdahl’s law is specifically about the futility of optimizing by removing hotspots... (Or rather, that it can only take you so far.)

taneq · on April 28, 2023

No, it's about the limitations of optimizing individual parts of a program. A corollary would be the futility of optimizing non-hotspots if you want any significant gain.

If your code spends 80% of its time in a given function you can't speed up the overall task by more than 500% by optimizing that function alone. If it spends 1% of its time in a given function then you can't speed it up by more than 1.1% by optimizing that function.

astrange · on April 28, 2023

Performance isn't all first order wallclock time stuff. If you rarely enter one part of the program but that part accidentally deletes all your caches, everything else is going to get slower afterward.

taneq · on April 28, 2023

Yeah, I'm with you. The 80/20 rule isn't about "write code slow then speed up the inner loops", it's about "don't waste time optimizing the slow bits, which means don't waste time optimizing anything until you know which bits are the slow bits."

It's explicitly about micro-optimizations, not architecture. And it's about helping devs avoid bogging down doing pointless work making micro-optimizations that will never affect the performance metrics that you care about.

anonymoushn · on April 28, 2023

You can't wait until the program is working correctly and you've benchmarked it to start optimizing. By then it is too late, you've already committed to data layouts that cannot possibly be fast.

taneq · on April 28, 2023

Again, micro-optimisations. You still want to think about your architecture, data representations etc. right from the start. Just don’t start hand-coding asm to speed up your config file parser from day 1.

elvis10ten · on April 28, 2023

> Just don’t start hand-coding asm to speed up your config file parser from day 1.

The comment you reply to is talking about thinking about performance even during initial design. Based on your last sentence, you seem to be the only one thinking of micro-optimizations.

helen___keller · on April 27, 2023

This whole thing is basically a straw man. “Performance engineering works but sometimes it’s not enough to overcome a bad architecture”. Alright, was that actually in question in the first place?

vgatherps · on April 27, 2023

You'd be surprised how common the view "Performance doesn't matter now we'll just fix the hotspot later" is

helen___keller · on April 30, 2023

Again, that’s a straw man.

All of engineering is a trade off. You trade off performance to complete the implementation faster, you keep in mind the amount of effort you are trading off in the future in the chance you are wrong and this will need to be optimized.

That’s not the same as “performance doesn’t matter”. People who say “performance doesn’t matter we can optimize later” are using this as a shorthand for, time now is significantly more valuable than time in the future. For example, we must hit a deadline or must get a POC out.

nxpnsv · on April 28, 2023

I certainly have had several great successes with profiling, i wonder what i have been doing wrong….

usefulcat · on April 28, 2023

I don't think he's saying "don't profile", more like "don't assume that all programs spend 80% of the time in 20% of the code". I know I've seen plenty of cases where the function at the top of the callgrind output is ~5% or less of total execution time.

rightbyte · on April 28, 2023

Ye. I mean increasing the performance of semi-perpormant apps is really hard. It's tiny changes everywhere.

At some point you got even the high hanging fruit and you are changing the order of if-clauses to decrease cache footprint to increase speed.

nxpnsv · on April 29, 2023

Valid point

tasubotadas · on April 27, 2023

The guy invents a strawman to justify premature optimization.

turtleyacht · on April 27, 2023

> these lines of code [were] pulling data from memory and software cannot beat Physics. [These] are elementary operations...

> measuring big effects is easy, measuring small ones becomes impossible because the action of measuring interacts with the software

> to multiply the performance by N, you need ... 2^N optimizations

> why companies do full rewrite of their code for performance

saagarjha · on April 27, 2023

Why quote these lines?

turtleyacht · on April 27, 2023

Summarizing the article. Also gives me a way to evaluate performance/optimization. Ideas to hang hooks on.