In my experience it's better to consider performance from the get-go: think carefully about which tech stack you're using and whether the specific logic / system architecture you've chosen will perform well. That's much easier than being stuck with performance problems down the road that need a painful rewrite.
The whole mantra of avoiding "premature optimizations" was applicable in an era when "optimizations" meant rewriting C code in assembly.
You need to be thinking about performance from the very beginning if you're ever going to be fast.
Because, as the article said, "overall architecture trumps everything". You (probably) can't go back and fix that without a rewrite.
(Though it can be OK to have particular small parts where you say "we'll do this the slow way for now, and it's clear how we'd swap in something faster later if it matters".)
But if your approach is just "don't even worry about performance, that's premature optimization", you'll be in for a world of pain when you want to make it fast.
> you'll be in for a world of pain when you want to make it fast.
But you'll also be in a world of pain if you take so long architecting performance into your app/service that you miss your market window, or fall short on some other important metric and your entire business topples over.
It's better to have product/market fit really early but poor performance (which can be fixed) than to miss your chance to ship in time, lose that market share, and fail entirely. Try fixing that!
Yeah, from a business perspective it can definitely be worth being arbitrarily slow or sloppy. But purely from a tech perspective, expect something like a rewrite if speed later becomes a priority.
Sadly, most software today is insanely slow (†) compared to what it could be, which I think reflects the unfortunate business reality that it's better to get anything out to market quickly than to actually build fast software.
But you have to be careful, or you'll end up in the situation of, e.g., Microsoft, which recently had a video bragging that the latest version of Teams cut startup time from something like 10 seconds to 5 seconds (the video seems to be gone now, alas). They're apparently working heroically to improve the performance, but it's really, really hard, since the slowness was baked into the architecture.
(† Just typing this comment in Safari lags pretty terribly, and that's on a MacBook with an M1 Pro. Of course, just entering text in a text box should be snappy even on a potato.)
Well, it depends: are you actually doing anything CPU-intensive at all? What makes something a problem is the business context. If you're building a modeling system for mechanical engineers, it might be a bad idea to write your back end in Node. But if you're building an in-house tool for salespeople or technicians to enter data, it might be perfectly reasonable to optimize for time to market. Performance is only a problem if it hinders the business or the end customer in some way.
This also means you have to understand the customer’s problems before trying to optimize or before starting to architect your system.
The other thing that's changed since the 'every optimization is premature' era is that process shrinks no longer bring big gains in frequency -- Moore's law isn't going to make your Python run at C speed, no matter how long you wait for better hardware.
That, and memory latency has improved much more slowly than everything else, so the pointer chasing implicit throughout languages like Python is just horrendously slow. And SRAM for bigger caches hasn't been scaling down over the last several process nodes either.
Thanks for the response; it made me dig around some more in this fascinating topic.
I found multiple papers that also simply use 100 ns for RAM latency (e.g. this one[1]), but in reality it's a lot more complicated than a single number. E.g., the time to first word for DDR3 is just above 6 ns[2]. I'm not a CPU engineer, but it looks like CPUs don't necessarily wait for the entire cache line and can use the first word straight away.
Also, the signal needs to travel both ways, and the traces aren't straight lines. I'm not an electronics engineer either, but the best numbers I could find for maximum DRAM trace lengths are 12-30 cm, which would indeed be on the order of 1-2 ns. This means there is some room for improvement (in theory).
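For what it's worth, here's my back-of-the-envelope check of that trace-delay figure, assuming signals in FR4 PCB traces propagate at very roughly half the speed of light, about 15 cm/ns (my rough assumption, not a datasheet value):

```c
/* Rough sanity check of the trace-delay estimate above.
 * Assumption: ~0.5c propagation in FR4, i.e. about 15 cm/ns. */
#include <stdio.h>

int main(void) {
    const double cm_per_ns = 15.0;               /* ~0.5c in FR4, ballpark */
    const double lengths_cm[] = { 12.0, 30.0 };  /* trace lengths from above */
    for (int i = 0; i < 2; i++)
        printf("%4.0f cm of trace: ~%.1f ns one way\n",
               lengths_cm[i], lengths_cm[i] / cm_per_ns);
    return 0;                                    /* prints ~0.8 ns and ~2.0 ns */
}
```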
But even with ideal RAM it's physically impossible to beat register access at 0.17 ns (one clock cycle of an i9-13900KS @ 6 GHz) or even L1D cache at 0.7 ns[3].
What is causing the extra 4-5 ns of RAM latency? Would it be physically possible to remove it?
The extra delay comes from resistance and capacitance in the circuit, which make electrical signals slow down (this is the ELI5 version). DRAM cells and the long wires in the DRAM array are big capacitors, and they are driven by circuits with fairly weak output (the resistance). Device physics makes these resistances and capacitances extremely hard to reduce. That is why on-device latency for DRAM has held fairly steady at ~5 ns for a very long time.
Motherboard manufacturers certainly have to consider DRAM trace lengths when supporting modern DDR5 speeds, which are now 6,000 MT/s (a 3,000 MHz clock, transferring on both edges) for top-of-the-line hardware.
The CAS latency for RAM is considerably higher and has been decoupled from this - as clock speeds go up, the CL multipliers go up too, and the CAS latency in wall-clock time ends up roughly the same, ~10 ns for the best RAM.
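To make that concrete, here's the arithmetic for a hypothetical DDR5-6000 CL30 kit (CL30 is my illustrative assumption; actual kits vary):

```c
/* CAS latency in ns = CL cycles / memory clock in GHz.
 * DDR5-6000 transfers twice per clock, so the I/O clock is 3000 MHz. */
#include <stdio.h>

int main(void) {
    double transfer_rate_mts = 6000.0;                   /* DDR5-6000 */
    double clock_ghz = transfer_rate_mts / 2.0 / 1000.0; /* 3.0 GHz */
    int cl = 30;                          /* illustrative CL, kits vary */
    printf("CAS latency: %.1f ns\n", cl / clock_ghz);    /* -> 10.0 ns */
    return 0;
}
```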
I'm not sure why CAS latency can't be driven lower. Row selection (tRCD) could be limited by the physics of charging up sense amplifiers, but I don't see why the logic for column address shouldn't be able to run faster. It's probably just something that hasn't been an area of optimization for the simple reason that there are many other much larger sources of latency.
And on that note, a CPU won't just reach out to RAM for some random word and copy it into a register. It will first check L1, then L2, then L3 before going out to RAM. It will then typically pull an entire cache line into the cache hierarchy and load the target word into a register. This whole process is what adds up to ~100 ns of latency.
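If you want to see that number fall out on your own machine, here's a minimal (and deliberately unscientific) pointer-chasing sketch: it permutes a buffer much larger than L3 into one big random cycle so every load depends on the previous one, then times the average hop. Compile with something like `cc -O2` on Linux or macOS; the exact figure depends on your hardware, but it should land in the same rough ballpark.

```c
/* Minimal pointer-chasing sketch: a buffer far larger than L3, shuffled
 * into a single random cycle so the prefetcher can't help, then timed. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (1u << 24)     /* 16M entries * 8 bytes = 128 MB, well past L3 */
#define HOPS 20000000L

static unsigned long long xorshift64(unsigned long long *s) {
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: shuffle the identity into one big cycle. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    unsigned long long seed = 88172645463325252ULL;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&seed) % i);   /* j in [0, i) */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long i = 0; i < HOPS; i++) p = next[p];      /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per load (final index %zu)\n", ns / HOPS, p);
    free(next);
    return 0;
}
```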
If you have a fast design/architecture, you may never need to optimise the code at all. But the flip side is that with a bad design or bad architecture optimising the implementation won't save you. With a sufficiently bad architecture starting again is the only reasonable choice.
I've seen code that does "fast" searches of a tree in a dumb way come out at O(n^10) or worse (at some point you just stop counting), and the solution was to not search most of the tree at all: find the relevant node and follow links from there.
Meanwhile, in my day job, performance really doesn't matter. We need a cloud system for the distributed, high-bandwidth side, but the smallest instances we can buy with the necessary bandwidth have so much CPU and RAM that even quite bad memory leaks take days to bring an instance down. Admittedly this is C++ with a sensible design (if I do say so myself), so... good design and architecture mean you don't have to optimise.
> If you have a fast design/architecture, you may never need to optimise the code at all. But the flip side is that with a bad design or bad architecture optimising the implementation won't save you. With a sufficiently bad architecture starting again is the only reasonable choice.
Yep, completely agree. I worked at a company with a poorly architected high-throughput system that was written in Perl. It got to the point where no more optimisations could make it scale, so it was rewritten. Of course the rewrite in a "faster language" was touted as the reason for its success, but the truth was that the new architecture didn't pound the database anywhere near as much.
That quote has been misunderstood by many. What Hoare and Knuth intended was for people to think about algorithm and design optimization first instead of doing micro-optimizations prematurely. Instead, people stop doing optimization of any kind.
I know this is probably an unpopular position, but my ML/NLP Ph.D. advisor always said not to do any optimization unless it would yield 10x better performance.
I guess that makes sense in the context of large-scale data research, where you should be thinking big if you really want to have outsized impact.
Ideally, a performance-related systems dissertation would achieve at least 10x vs. the state of the art, but showing that will inevitably involve stacking many 10-20% optimizations just to get to par.
Of course, there is plenty of low-quality academic work that only compares against some existing system with incomprehensibly bad performance and then shows a 10% improvement. That sort of research should definitely be avoided.
My point isn't about optimizing to the point where the optimization itself is publishable.
My point is that you should be writing fast, optimized code by default in ML research, because you tend to throw large volumes of data at that code multiple times a day, and slow code reduces the number of experiments you can run.
That isn't what I see, however; lots of research code is horribly slow.
> I guess that makes sense in the context of large-scale data research
It makes the least sense in that context imho, as any optimization leads to larger differences the more data you process. Any performance difference is multiplied by the number of times that code runs; that's why people look more closely at hot loops than at init code that runs once at startup.
A catch in Knuth's famous quote is how to define "premature". I'm not old enough to have seen how programmers in his time thought about "premature", but my impression is that quite a few modern programmers think all optimizations are premature.
You can look up the paper he says it in. The specific construct in question is the GOTO, and he showcases a topological sort with an early exit that requires one (a rough sketch of that general shape is below). The general framing is that you should use the easier-to-reason-about constructs where you can, but it would be a shame to remove a tool from our toolbox that can make a significant impact.
One need only look at the programs he typically writes to see that he is far more obsessed with performance than most of us even know how to be.
(I am going off memory here, I should say. Mayhap I have the source wrong. Pretty sure that is the discussion, though.)
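To be clear, the sketch below is my own toy illustration of that "early exit" shape, not Knuth's actual example (his was the topological sort):

```c
/* Toy illustration of the "early exit" shape: bail out of a nested scan
 * the moment the target is found, via goto, instead of threading a flag
 * through both loop conditions. Not Knuth's example, just the general idea. */
#include <stdio.h>

#define ROWS 3
#define COLS 4

int main(void) {
    int grid[ROWS][COLS] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}};
    int target = 7, r, c;

    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            if (grid[r][c] == target)
                goto found;                 /* single, obvious exit point */

    printf("%d not present\n", target);
    return 1;

found:
    printf("%d found at (%d, %d)\n", target, r, c);
    return 0;
}
```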
In the end optimization is always important, but different kinds matter at different stages of a project. Changing the structure is easiest at the start, when there isn't much code, while doing crazy performance tweaks on a function that may be thrown away a day later is a waste of time.
To play devil's advocate: we must be careful about what we mean by "optimizations". If a problem is not completely understood, it may be better to stay in a high-level language like Java, so that one can iterate rapidly, than to drop down to a strictly performance-oriented language like Rust or C. It is far easier to downcode some service in a well-architected Java system than it is to iterate on a Rust system. It's also way better to have a working system that can be made cheaper by downcoding than to have an optimal system that can't do the job because of its architecture.
Also, within the last few years I have written assembly for an application originally in Java, so that's still on the table. =)
I agree about "premature optimization". It's one of those phrases, like "correlation does not imply causation", that make my blood boil. Like, cool dude, did you just take freshman CS?
You'd be amazed at how much experienced programmers fret about local "optimisations" that are actually made irrelevant by the compiler and just serve to make the code harder to understand. And with incomprehensible code comes actual errors, not just slowness.
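A contrived example of the kind of thing I mean (easy to check for yourself on godbolt.org): at -O2, gcc and clang compile both of these to essentially the same machine code, so the "clever" version buys nothing but confusion.

```c
/* Two versions of the same loop. Modern compilers at -O2 turn the plain
 * multiply into a shift (or fold it into addressing) on their own, so the
 * hand-"optimised" version is just harder to read for zero gain. */
unsigned sum_times_8(const unsigned *a, unsigned n) {
    unsigned s = 0;
    for (unsigned i = 0; i < n; i++)
        s += a[i] * 8;                /* says what it means */
    return s;
}

unsigned sum_times_8_clever(const unsigned *a, unsigned n) {
    unsigned s = 0;
    for (unsigned i = 0; i < n; i++)
        s += a[i] << 3;               /* same result, obscured intent */
    return s;
}
```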
In Knuth's time, 100 programmers lined up to wait for their turn to run a short program on a mainframe. Now a single programmer can spin up 100 machines by changing a number in a script. The ratio of programmer-time cost to machine-time cost has changed dramatically.
Just wait for the EU to impose draconian fines if your service/data center produces more than xyz of CO2, or more than a product of your kind should produce on average. Then you'll basically be forced to care about performance.
> The whole mantra of avoiding "premature optimizations" was applicable in an era when "optimizations" meant rewriting C code in assembly.
Well, these days the equivalent to that seems to be rewriting {{ $HIGHER_LEVEL_LANGUAGE }} code in Rust.
I would still be conscious of "premature optimizations" and wouldn't rewrite all the things in Rust (or something else) before profiling my existing code.