Torvalds' version of rule 5: “Bad programmers worry about the code. Good programmers worry about data structures and their relationships.”
Brooks's version: "Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."
Not all algorithms are obvious from a picture of the data structure; otherwise we wouldn't need computer science at all.
No, you won't spontaneously invent minimax search with alpha-beta pruning if you're just shown a picture of the game tree with board positions.
Some of the "low-hanging fruit" is obvious, but that's about it.
In the other direction, some algorithms are abstract: the same algorithmic structure will work for more than one data structure. Sometimes the process is what matters, not the low-level representational pieces. E.g. "map reduce".
Gee, don't show me "map reduce"; I need to see the servers and requests! No wait, I mean linked lists and atoms, ...
Rule 6: Every well-intentioned rule will be bastardized and used to justify horrible code.
Foo: "Hey, this 10000-element collection, which we do repeated lookups on... why are we using a list and not a HashSet?"
Bar: "Because lists are so much simpler than a HashSet. Let's just KISS"
Foo: "But... doing repeated lookups on a 10k sized list is so much slower than just using a HashSet!"
Bar: "Oh really? Have you profiled the entire application, and measured the latency impact caused by this decision? Come back to me once you've done so, and until then, we're not going to tune for speed."
One framework that helps me a lot with these discussions is to treat a lot of rules and advice as directional rather than absolute. For instance, if you have an intern who gets stuck and spends a whole week trying to figure out something that a teammate would be able to work out with him in 30 minutes, you might say, "ask for advice when you get stuck". But if you have a different intern who bothers you all the time instead of trying to figure out anything on their own, you might say, "try and figure things out for yourself". These are contradictory pieces of advice, but they're meant for different people so they're not truly contradictory.
Pike's rules are very good directionally for a certain clever and diligent type of programmer who knows their CS fundamentals but sometimes overthinks what they're doing. If you are Rob Pike and you work places like Bell Labs or Google, you are going to be surrounded by more of these people and your advice is going to be directionally aimed at them. If you work with people who are sloppy or have poor CS fundamentals, you're going to want to give them different advice than Rob Pike gives his coworkers.
(In your example, people don't usually get as militantly stupid as Bar unless Foo is somehow antagonizing them or being too argumentative with his advice, which is a completely different aspect of giving advice. Either that, or Bar is a particular combination of "smart enough to rationalize their original position" and "stubborn enough not to give up their original position".)
Ironically, in that situation, using a list might end up being more efficient due to cache locality. Or not. That's why measuring is so important, since performance can be a very counterintuitive subject. Hard data should always prevail over theory and guesswork.
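A sketch of what actually measuring the 10k-element scenario might look like, in Go (illustrative names; put it in a _test.go file and run with go test -bench=.):

```go
package lookup

import (
	"fmt"
	"testing"
)

const n = 10000 // the collection size from the example above

var (
	items  []string                       // the "simple" list
	s      = make(map[string]struct{}, n) // the hash set
	needle string
)

func init() {
	for i := 0; i < n; i++ {
		k := fmt.Sprintf("key-%d", i)
		items = append(items, k)
		s[k] = struct{}{}
	}
	needle = fmt.Sprintf("key-%d", n-1) // worst case for the scan
}

// BenchmarkSliceLookup times a linear scan over the slice.
func BenchmarkSliceLookup(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, it := range items {
			if it == needle {
				break
			}
		}
	}
}

// BenchmarkSetLookup times a single hash probe.
func BenchmarkSetLookup(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, ok := s[needle]
		_ = ok
	}
}
```

Whatever the numbers turn out to be on your hardware, they settle the argument faster than the meeting does.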
Bad practitioners in any field crudely invoke and naively apply a trite toolbox of mantras. Detached from reality and driven by insecure ego, they become the problem: through magical thinking and appeals to authority, they delude themselves about the reality of what they are actually doing.
In any skill, people can fall prey to cults of myth and mysticism, assigning merit based on adherence to orthodoxy rather than suitability. Programming is no different, and sometimes it's hard to hear anybody think over the mooing of all the sacred cows.
For any realistic workload containing a mix of failed lookups and hits midway through the list, we're talking about many thousands of comparisons per lookup, on average. Regardless, replace "list" with "linked list" in the above example, for illustrative purposes.
I agree that measurements and hard data are preferable to guesswork, but it takes time and energy to gather them as well. For minor decisions where the alternative proposal is only slightly more complex, but there's a very compelling reason to expect order-of-magnitude performance improvements, I would argue that gathering data is a waste of time.
If it's that bad, it should be easy to discover what is fastest.
But the way-too-common situation is that you look at a program under development and tell people "no way, that list will be too large to keep searching, you should design it around a hash set"; people reply something about premature optimization and keep going; then it gets released, is too slow to use, and you get to rewrite all their code under pressure.
Fortunately there are many languages which will provide a generic abstraction for you, so you don't have to deal with any of the complexity of implementing a hash set and you can just treat it like a set.
This is also how HashSet is implemented in Rust (internally it's a HashMap<T, ()>, and zero-sized structs are optimized out, just like in Go).
But by providing a set type, Rust is able to provide for users various set operations which Go requires you to write by hand. It becomes hard to argue that a set is simpler than a slice when you have to write your own methods dealing with union, disjunction, intersection, difference, subsets/supersets, etc every time.
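For illustration, here's roughly what "writing it by hand" looks like in Go, as a minimal generic sketch over the usual map-with-empty-struct-values representation (the type and method names are just for this example):

```go
package set

// Set is the idiomatic Go stand-in for a set type:
// a map whose values are zero-sized empty structs.
type Set[T comparable] map[T]struct{}

func (s Set[T]) Add(v T)           { s[v] = struct{}{} }
func (s Set[T]) Contains(v T) bool { _, ok := s[v]; return ok }

// Union returns a new set holding every element of s or other.
func (s Set[T]) Union(other Set[T]) Set[T] {
	out := make(Set[T], len(s)+len(other))
	for v := range s {
		out.Add(v)
	}
	for v := range other {
		out.Add(v)
	}
	return out
}

// Intersect returns the elements present in both sets.
func (s Set[T]) Intersect(other Set[T]) Set[T] {
	out := make(Set[T])
	for v := range s {
		if other.Contains(v) {
			out.Add(v)
		}
	}
	return out
}

// Difference returns the elements of s that are not in other.
func (s Set[T]) Difference(other Set[T]) Set[T] {
	out := make(Set[T])
	for v := range s {
		if !other.Contains(v) {
			out.Add(v)
		}
	}
	return out
}

// SubsetOf reports whether every element of s is also in other.
func (s Set[T]) SubsetOf(other Set[T]) bool {
	for v := range s {
		if !other.Contains(v) {
			return false
		}
	}
	return true
}
```

None of it is hard, but every team ends up re-deriving some subset of it.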
One could easily implement a hash set with an array; the reverse is not true at all. From the outside, arrays are extremely flexible, powerful, and complex constructs; a set, even a hash set, is not.
If it is easy to procure a set (as a hash set), then sets should be used wherever they could be used instead of arrays (unless there are performance concerns with using sets). In languages that do not include batteries (e.g. C), it makes sense to use an array simply because procuring a set isn't easy.
I'm probably missing something, but e.g. awk implements its 'arrays', at least logically, with something like a hash set, and indeed hash sets can be great for (say) sparse arrays.
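For instance, a sparse array as a hash map, sketched in Go (a toy example; missing keys read back as the zero value, much like an unset awk array cell):

```go
package main

import "fmt"

func main() {
	// Only the entries actually set are stored; the index
	// range itself costs nothing.
	sparse := map[int]float64{}
	sparse[3] = 1.5
	sparse[1000000] = 2.25

	fmt.Println(sparse[3])   // 1.5
	fmt.Println(sparse[42])  // 0 (never stored)
	fmt.Println(len(sparse)) // 2, regardless of the index range
}
```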
Lots of low-level data structures can be implemented in terms of one another, and which makes sense depends on circumstance.
Low level doesn't mean simple. In this case, arrays provide a lot of flexibility with manual indexing. They are a low level hammer that can do anything.
Sets are much more restricted: they don't have manual indexing, they don't guarantee iteration order without special semantics, and they don't do as much as arrays, even if they might be implemented with arrays.
Then why use an array when you only need a set and an array would be overkill?
Your "Get" method hides complexity. It could be that it loops over a list, or that it hashes to find the bin, then loops over the list of items in the bin, or any of other dozens of other possible implementations.
So the next question is "What's the implementation of the Get(key) method?" Or maybe "Why aren't you using your language's library methods to loop over that list?"
I'm considering cases where looping over a list means using a simpler algorithm than whatever's provided by "Get", applying rules #3+#4 after determining that "Get" is a bottleneck.
It's an ill-considered stream of consciousness where I wasn't clearly separating "simplicity of interface" from "simplicity of implementation".
Part of what I was going for is that a clean interface might hide a complex implementation. If you've followed rule 1 and see that "Get" is a bottleneck, it makes sense to apply rules 3+4, and see if there isn't a simpler implementation than the one provided by "Get".
The interface is the same. For example, in C# you have the ICollection<T> interface, which offers a Contains lookup method and which both HashSet<T> and List<T> implement.
The point of the OP is that among different implementations for the same interface, choose the simplest one unless you have empirical evidence that compels you to do otherwise.
Arrays didn't implement interfaces in C# until the LINQ release. But the main problem is that arrays have an interface that is much more complex than just ICollection: they have functionality you don't want to be used and that should be properly encapsulated.
The lookup method just checks whether an element exists in a collection or not. List and HashSet implement non-indexable collections, which have no notion of key. The Collection interface is very thin, and it's mostly used when you need to keep track of a set of elements, e.g. storing the nodes already visited in a DFS.
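To make that DFS example concrete, a minimal sketch (in Go rather than C#, with illustrative names) of a visited set doing exactly that membership-tracking job:

```go
package main

import "fmt"

// dfs walks the graph from start, using a set (a map with
// empty struct values) to remember nodes already visited.
func dfs(graph map[string][]string, start string) []string {
	visited := map[string]struct{}{}
	var order []string
	var walk func(node string)
	walk = func(node string) {
		if _, seen := visited[node]; seen {
			return // already handled; this is the set lookup
		}
		visited[node] = struct{}{}
		order = append(order, node)
		for _, next := range graph[node] {
			walk(next)
		}
	}
	walk(start)
	return order
}

func main() {
	graph := map[string][]string{
		"a": {"b", "c"},
		"b": {"a", "d"},
		"c": {"d"},
		"d": {},
	}
	fmt.Println(dfs(graph, "a")) // [a b d c]
}
```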
OK I know more about C# now than I ever planned on knowing.
The problem with the OP comment is that KISS and premature optimization are not diametrically opposed. They are two separate principles that mean different things.
Premature optimization is bad, but not because it necessarily violates KISS. Similarly, many people overcomplicate code for reasons that have nothing to do with optimization.
His argument reminds me of people who argue against free speech generally because we already ban people shouting fire in a cinema.
Simplicity is great when it's not borne of ignorance and when it doesn't over-simplify. Some things are unavoidably complex. Even the word "simple" is complex, and has many meanings. What if simple is defined as...
- relies on simple "Programming 101" concepts
- doesn't require using or learning a framework or library
- corollary to above, doesn't require ascertaining that said framework or library implements a .Get() method
- doesn't rely on anyone else's code and has no dependencies or references
- doesn't require understanding ancillary concepts that might be present such as hash tables
Then, to answer your original question, writing the loop is "simpler."
And just to turn things completely topsy-turvy, how about your .Get() method? If it's implemented the way such things usually are, it uses hash tables and is very efficient. But that means it's a premature optimization, right, and therefore the devil? (Well, we haven't defined "premature" either.)
The real answer is to think for yourself like a grown-up and don't rely on dogma. Every situation is different; tackle every problem by optimizing for what's important in that specific problem, using whatever resources and minding whatever constraints apply to that specific problem.
Selecting appropriate data structures and algorithms for a problem is not optimization. It's one of the basic skills and responsibilities of a programmer. That's why the bulk of the first volume of Knuth's _The Art of Computer Programming_ is about data structures.
Using the Set interface is probably the right thing. Then you profile, and if the set implementation you chose is too slow, you just swap it out for another implementation that implements the Set interface.
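A sketch of that swap in Go (hypothetical Set interface and names; callers only ever see the interface, so the concrete type is a one-line change after profiling):

```go
package main

import "fmt"

// Set is the interface callers program against.
type Set interface {
	Add(v int)
	Contains(v int) bool
}

// sliceSet: the simple choice, fine for small n.
type sliceSet []int

func (s *sliceSet) Add(v int) {
	if !s.Contains(v) {
		*s = append(*s, v)
	}
}

func (s *sliceSet) Contains(v int) bool {
	for _, x := range *s {
		if x == v {
			return true
		}
	}
	return false
}

// hashSet: amortized O(1) lookups, for when the profile
// says the scan is too slow.
type hashSet map[int]struct{}

func (h hashSet) Add(v int)           { h[v] = struct{}{} }
func (h hashSet) Contains(v int) bool { _, ok := h[v]; return ok }

func main() {
	var s Set = &sliceSet{} // swap to hashSet{} and nothing else changes
	s.Add(7)
	fmt.Println(s.Contains(7), s.Contains(8)) // true false
}
```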
Sorry, but Bar is right. Unless you have profiled, the list is fine.
It might even be faster (cache locality etc., as another commenter said, if the list is contiguous).
But even more so: you don't even need to profile, unless you have an actual problem with execution speed. If it's fast enough that you don't have to care, you don't have to care.
Having spent a great deal of time, most of my career really, doing performance optimization, some of this rings true, but much of it doesn't.
Rule 2 only applies if you can just decide that your current level of performance is "pretty great" and then parachute away to another project. Otherwise, if you find that, instead of 1 hot spot taking 80% of your time, you have 5 warm spots each taking 16% of your time, you have 5x as much thinking to do to fix your problems and get that potential 4x speedup (assuming you could cut that 80% down by a factor of 16).
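To make the arithmetic explicit (a quick check, assuming the other 20% is untouched):

\[
T_{\text{new}} = 0.20 + \frac{0.80}{16} = 0.25,
\qquad \text{speedup} = \frac{1}{0.25} = 4\times
\]

With five warm spots of 16% each, you have to cut all five by that factor to reach the same 25% and the same 4x.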
Rule 4 is an argument from bad implementation. It's harder to get fancy algorithms right, but what does it mean to say an algorithm is buggier than some other algorithm? Are we seriously meant to think that there are some bugs buried in Strassen's matrix multiply?
Rule 3 is dubious. It seems to imply that the performance cost of fancy algorithms on low N matters, which assumes that we are actually executing the algorithm quite a bit. Perhaps we've picked the wrong fancy algorithm? Maybe if you have a godzillion tiny matrix multiplies you shouldn't be using Strassen's but you might still get a performance win from doing something 'fancy' that's tuned for large numbers of small matrix multiplies...
As for Rule 5, this is probably the best of the lot. That's why I like to use languages that allow me to describe my data structures in detail... even going so far as to say that the contents of my user-built containers have types. Ahem.
So much this; there's nothing I dread more than a "flat profile".
These rules are great if you actually have a smoking gun; however, if you're getting killed by random memory access patterns, they rarely present that way.
In twenty-five years as a working programmer, I have fallen in love with a piece of technology four times: Emacs (on my third attempt), Tup (the build system by Mike Shal), React... and now TypeScript. I've been using it full-time for a few weeks now, and I can't imagine going back. If you like rule 5... you'll heart TypeScript.
Edit: I also spent ~12 years writing C# which is good. But TypeScript is actually more powerful in a way, because of the impossibility of typing everything. This means it's likely to get things like variadic generics or higher-kinded types long before C#, since (1) in C# everything has to be typed down to the smallest particle (so it's much harder to add features), and (2) changes to the CLR would be needed.
When you say "going back" - going back to what? I mean, is a language where all array keys are strings really the best platform for higher-kinded types?
Going back to untyped JS of course. It's not about having "the best" platform for types; that's just the point. TypeScript's system is incomplete, but it's plenty good enough to dramatically improve the experience. The tooling is now much more powerful. The type contracts also provide a whole new channel of formal communication between team members, not to mention unknown users if you're developing libraries.

And it's just way more fun. It's a rewarding challenge to write the most expressive types that you can for the problem you're solving. But you don't have to. If you want to start by writing data types and then fulfilling them, you can. If you want to start by writing code and then typing it, you can. It's brilliantly designed to fit with all kinds of idiomatic JavaScript.

The language server provides fast incremental compilation and integrates easily with all major editors. Structural typing means you can define types in-place and conform to interfaces that haven't been written yet. The type system's features are impressive and growing, including generics (with constraints and defaults), tagged unions, mapped types, guard conditions and flow control analysis.
It's not "the best" of both worlds, but it is both worlds. What other language is as "gelatinous" as JavaScript [0] but still gives you state-of-the-art type inference when you want it? Hack? [1] I don't know, but it's the same value proposition.
I can grasp the underlying point here (more code and more complex code both increase the chance you'll have a bug).
Again, this seems a bit like the assumption behind Rule 2: that the time spent figuring out the complex algorithm and ensuring it is bug-free is not worth spending (just like saying "oh, it's not worth smashing a hot spot if it's only 20%"). This will frequently be true, but not always. I suspect that most people would be happier to use OS schedulers, TCP implementations and compilers (to name a few things) written by people who don't get to "parachute out" when the easy solutions are exhausted.
It's just a cycle: procedural programming -> data oriented programming -> object oriented programming -> functional programming
procedural programming -> 'I hate all these imperative sequences ("flowcharts"), it's all about the data anyway !'
data oriented programming -> 'I hate keeping all these data relationships in sync, can't the data do that itself ?'
object oriented programming -> 'all this behavior is too complex, often does things I don't intend, can't it be simpler ?'
functional programming -> "it's simple, but way too dense (like algebra), can't we just write out what happens in sequences of imperative statements ?"
So we arrive at Wapper's "Battlestar" law of programming paradigms:
"All Of This Has Happened Before And All Of It Will Happen Again"
Object oriented programming is where really complex and large programs go. I agree with Pike: my favorite is data-oriented programming. Read the data and relationships, but only document them; don't encode the data schema directly. But here's the kicker:
The issue with these Pike 5 rules is that while some things are simple (and yes, people can screw up and make simple things complex), there are many things that are complex and can't be made simpler without destroying the functionality of the program. There is no way to make a CAD drawing program that is anything remotely close to these rules. Making such a program simple and predictable, while possible, also destroys its function.
> There is no way to make a CAD drawing program that is anything remotely close to these rules.
Eh? A CAD program will have a lot of features, but that's not the same as using complex algorithms. For the bits where you do use complex algorithms, with a good architecture you can confine the complexity to plugins and corners of the program.
All of the other rules would apply to almost anything, surely?
Maybe the best example of irreducible complexity would be a video player; the algorithm is handed to you in a multi-hundred-page spec and there's nothing you can do to avoid that.
>.. there are many things that are complex and can't be made simpler without destroying the functionality of the program.
This is because we haven't figured out how to make them simple yet. I believe all problems can be made simple once we come up with the proper system. Maybe we don't have the correct abstractions yet, or just haven't figured out the best data layout. We've only been collectively programming for a few decades; I think we have plenty of room left to advance the field. I'm not ready to just throw in the towel and say it's impossible.
Have architects figured out how to make designing any sort of building or bridge simple?
Maybe with powerful AI and Holodeck-like programs we can make it simple for humans to tell the computer to do anything they want easily, but short of that, I'm not seeing why every program would be simple to build.
However, the more powerful your tools, the more complex buildings and programs you're able to create. Programming hasn't necessarily gotten easier with more powerful abstractions, rather we create more complex programs now.
Agreed. Our field is still incredibly young, speaking as if we've tried everything that can be tried sounds a lot like when physicists were figuring out relativity. Many believed that the problems were simply unsolvable, that all of physics had been figured out in the late 1800s. Obviously they were wrong, and I think many of today's computer scientists will be wrong as well. Our field has a long way to go before it matures, in the meantime we need to find better abstractions.
Claiming that it's "all a cycle" is incredibly short sighted considering the time-scale.
"Rule 1. You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is."
Everything about reading this quote depends on what you think a "speed hack" is. Without practical agreement on that, you'll get a lot of people arguing past each other.
I'd say "a modification to the code designed to speed up that section of code". Are there other reasonable definitions? I thought it was pretty clear that the rule could be accurately paraphrased as "Don't optimize a section for speed until you're sure that's where it's needed."
Maybe. But are you saying you can't pull a method invocation you know will return the same value every time outside of a loop without profiling? I doubt Rob Pike believes that. Which leaves us with saying this is a rule of thumb, or what a speed hack is requires more commentary.
I'm ok with a rule of thumb. I just want to point out that they leave a lot of room for disagreement and that the most important part may end up being how you apply the rule in particular cases.
I think #5, which is probably going to be thought of as great wisdom for our age, is secretly an empty tautology. That is, my amateurish work in schemas and validating data structures and type systems has led me to think that there is a somewhat hard-to-see but extremely important bijection between data structures and the control structures that consume them. (In many ways this is theoretically a non-issue, as there are Church and Scott encodings of those data structures, but I think it also extends lists to for-loops and not just folds.) Seen that way, it's really just saying "the most important part of the algorithm is figuring out the right basic control structures to build the algorithm out of." Well, yeah, that's what programming is.
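A sketch of that bijection in Go (illustrative names, using generics): the fold over a cons list takes exactly one argument per constructor of the data type, so the control structure is a mirror image of the data structure.

```go
package main

import "fmt"

// List is a cons list: either nil (empty) or a head plus a tail.
type List[T any] struct {
	Head T
	Tail *List[T] // nil marks the empty list
}

// Fold takes one argument per constructor: a value for the
// empty case and a combining step for the cons case. Its shape
// mirrors the shape of List itself.
func Fold[T, R any](l *List[T], empty R, step func(T, R) R) R {
	if l == nil {
		return empty
	}
	return step(l.Head, Fold(l.Tail, empty, step))
}

func main() {
	// The list 1 -> 2 -> 3.
	l := &List[int]{1, &List[int]{2, &List[int]{3, nil}}}
	sum := Fold(l, 0, func(x, acc int) int { return x + acc })
	fmt.Println(sum) // 6
}
```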
I think the wisdom is that data structures are declarative. It's also much easier to build an image in your head of a data structure than an algorithm, because an algorithm is actions over time.
This is "obvious" for more experienced programmers, but vastly missed all the time.
And the "experience" will change per-environment! Think in how many apply something like this well enough when inside his programming language, yet fail when using a RDBMS/NOSQL/CACHE/KV store. Or moving between UI/Backend. Or using the network, etc.
ie. Is easy to see the truth at low-scale but do the same across all the stack is another thing.
----
I have an anecdote from my early days, when I was working on a school management app. After I had almost finished porting it from FoxPro for DOS 2.6 (procedural) to Visual FoxPro 3 (OO, procedural, SQL), some tasks were far slower than under DOS.
Eventually I "discovered" that this was because some tables were normalized "by rows" instead of "by columns" (i.e., the central process revolved around a table that, when printed on paper, looked very much like a pivot table). Then I changed the tables to map 1-to-1 to that pivot table. This massively improved the WHOLE program. Not only the speed: the simplification of the code and the reduction in lines of the program was very big!
Rule 5 is essentially the same as what Linus Torvalds said:
"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
Note the two qualifications in that sentence:
1. It is only ever about small efficiencies.
2. And only about small efficiencies in the 97% non-critical parts.
So:
1. It is always valid to concern yourself with large efficiencies.
2. In the critical 3%, it is also legitimate to worry about small efficiencies.
For context, later in the same paper (Structured Programming with goto statements):
“The conventional wisdom [..] calls for ignoring efficiency in the small; but I believe this is simply an overreaction [..] In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering."
So his actual message is pretty much the opposite of what is attributed to him.
The issue though is that optimizing for performance is, in most cases, optimizing for the wrong thing if it makes your code harder to understand, maintain and test.
Apropos of what was said in the book Rework: if we added 1000 more professors and expanded to more cities, would Harvard/MIT/Stanford be better schools? Everyone laughs at this, so why do we still think it is a good thing for a company?
Start small, fix as you go, and keep a customer focus (not a growth focus, in either company size or software complexity).
I can't speak for Harvard/MIT/Stanford, but the top-tier administrators at my state's flagship university would probably love to be able to engage in that kind of expansion.
I think the harder part, in my experience, is not just choosing data structures, but choosing ones that will remain relatively flexible through the life of the project, and changing them when they become problematic.
This is also a difficult investment to justify because, as always, it doesn't deliver any "user facing features". That makes me chuckle because most of the time the "feature" is that it continues to be possible to use the application at all.
Programming language 'sages' tend to blurt out these generalizations, although they are based on insights which are only PARTIALLY determinant.
Many advanced ('fancy', if you will) algorithms are based on mathematical proofs and relations, which are seldom obvious. Some of them have taken hundreds of years to reach proof and be capable of being used as a theorem.
If all we needed were 'obvious' algorithms we would have hardly solved any real-world non-rudimentary problems with computation.
> Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
This is an especially beautiful insight given that it has a parallel with molecular biology: proteins and other molecular structures dominate. Enzymes are made of "standard" proteins that transform particular molecular structures.
This, BTW, is also related to the usually overlooked "code is data" principle.
To be a good programmer one has to know molecular biology 101. It seems like the good guys (John McCarthy, the MIT Wizards, the Bell Labs and Erlang folks) did.
> If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
OMG, this is so true! I have seen people write complicated/fancy algorithms to do simple stuff, and 2 years later they don't even know how they did it!
The sum of these rules seem to imply that it's difficult to theoretically model programming, so always wait for your program to be fully written so you can perform some brute force empiricism -- and only then think about performance.
It is not true that you cannot "theoretically" (read: based on formal reasoning) model program performance. You don't need to run a program to determine that an O(n^2) approach will be slower than an O(log n) algorithm. You do not need to run a program to determine the impact of optimizing a part of your code; you can get a fairly good sense using Amdahl's law. Etc.
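For reference, Amdahl's law in its general form, where p is the fraction of total time spent in the part you optimize and s is the speedup of that part:

\[
S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}}
\]

So a region that accounts for 25% of the runtime can never yield more than a 1/(1 - 0.25) ≈ 1.33x overall speedup, no matter how large s gets.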
Like his creation, the Go language, Rob Pike's rules are somewhat condescending and patronizing toward those he deems lesser programmers.
Rule 3, however, is the exception. That is the take away gem based on experience.
> You don't need to run a program to determine that an O(n^2) approach will be slower than an O(log n) algorithm.
That depends on N and on the constant factors. Which means testing still might surprise you.
More to the point, if the O(n^2) algorithm is simpler and leaves you more time for profiling and fixing actual performance issues after the code works, you may end up with faster code by doing the dumb thing.
He doesn't imply that, from what I read. He simply says: be empirical and practical. Let the data (inputs and what you measure) speak for itself. This is like a simplified version of the scientific method in the context of programming.
Rule 3 is even more important today, with modern CPU caches. Running a simple O(n) scan over a (small-ish) bunch of integers (and often strings) will be significantly faster than storing them in some map, or in some cases even than a binary search.
Honestly, the more I learn about Pike the less respect I have for him. He has surely coded with some of the best, but I wonder where he has led us. I appreciate the goals of Go, but I see everything I worry about doubling up:
- Everything wrong about C; for instance, polymorphism in Go is...
- Not understanding ownership leads to a lack of design in terms of ownership
- Without a better sense of types, people's thinking is not explicit and careful enough
These are not "Rules of Programming", but are excellent "Guidelines for Performance". Having 30yrs under my belt I have come to all the same conclusions as Rob Pike over my years, so these are very deserving of some serious contemplation for those with < 10yrs experience.
In my current job we have lots of large and often 'sparsely populated' objects, which basically means you can never count on any of the properties you were hoping for actually being present. This makes the data objects essentially useless (at least for knowing what data you have to work with as you write code), and it violates Pike's point above about data structure design being critically important. In the modern world the younger generation of developers thinks "functional programming is great, and OOP (inheritance, etc.) is obsolete", and only after 20 years of coding do they look back and eventually realize OOP had it right all along.
In architectures with massive numbers of sparsely populated objects only the guy who originally wrote any given function will be able to understand it, and everyone else who tries to work in the code is pretty much screwed.
".. in the modern world", "... younger generation". Maybe you're just getting old? If "OOP had it right all along" there wouldn't be a revival of interest in FP. OOP encourages mutable state and the bigger your programme the harder it is to keep track of it. Rampant mutable state also makes concurrency difficult and concurrency is now more important than it was due to multi-core hardware.
"Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident."
This should have been #1.
Really. In all of my 20 years of designing and writing software products, I have learned that pretty much all of the work depends on the data model/the data structures.
(And also: once you have understood the data structures of a program, you kind of understand the program too.)
It is by far the number one mistake juniors make: focusing on the code rather than on the data structures.
I've found that even as a user it is very useful to have a mental image of what the "data model" looks like, abstractly. This is more important than knowing about all the features of an application.
I'm thinking user documentation should start with a quick explanation of the data model, and from there describe the operations that can be performed on it.
This reminds me of a quote about Blender, which I find has a remarkably solid architecture based on the idea of interactively editing a sort of “scene database”:
“Although some jokingly called Blender a ‘struct visualizer’, the name is actually quite accurate. During the first months of Blender's development, we did little but devise structures and write include files. During the years that followed, mostly tools and visualization methods were worked out.”
And the architecture of a compiler I’ve been working on has gotten considerably simpler and more solid as I’ve figured out the right data structures—with the right data model, especially if it’s enforced with types, the code practically writes itself. Finding that model is the hard part.
Brooks's version: "Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."