I certainly _do not_ want to denigrate the idea of writing code that's designed to be read and studied - I think that's a great idea, and try to do it. But I've never been really able to 'read' a codebase, and I don't really understand what people who do this are doing.
My technique for getting to know a codebase is to look at one thing in particular, probably with the help of a debugger, tracing the call stack and how the data flows and changes. I get to know a whole project in sections, focusing on individual bits of functionality.
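To make "tracing the call stack" concrete, here is a minimal sketch in Python of what a debugger session effectively shows you: which functions get called, and how deeply nested they are. It uses the standard `sys.setprofile` hook; the helper `trace_calls` and the toy functions `parse`, `total`, and `report` are made up for illustration and stand in for whatever piece of functionality you are focusing on.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func and record (depth, name) for every Python-level call it makes."""
    calls = []
    depth = 0

    def profiler(frame, event, arg):
        nonlocal depth
        if event == "call":
            calls.append((depth, frame.f_code.co_name))
            depth += 1
        elif event == "return":
            depth -= 1
        # C-level events (c_call, c_return, ...) are deliberately ignored.

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)
    return result, calls


# Hypothetical toy "codebase" to investigate.
def parse(text):
    return list(map(int, text.split()))


def total(numbers):
    return sum(numbers)


def report(text):
    return f"total={total(parse(text))}"


result, calls = trace_calls(report, "1 2 3")
for depth, name in calls:
    print("  " * depth + name)
```

The printed outline is the slice of the project that one task actually touches, which is usually far smaller than the whole codebase.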
And if I'm honest, I rarely ever 'read' code this way either - I usually do it because I want to add a feature or improve performance, or squash a bug, and I stop when I understand enough to be fairly sure I'm doing things in a responsible manner - keeping with the architecture and not making what I leave when I'm done any more additionally complex than is required to accomplish my task.
I don't understand how people can 'read code' in any sort of a straightforward manner. The phrasing seems to suggest to me a linear 'reading' like a book or a technical manual. But you can't read code like that. If a project is of even moderate size, there are too many interweaved dependencies. Even a very well factored codebase is at best a top-down or bottom-up tree, and you (or, at least, I) often can't understand the trunk without understanding the disparate branches and leaves. I generally can't keep all that in my head without a task to focus on.
I think it's arguably even irresponsible to suggest to newcomers that they should 'read code' because it sets them up for failure when they try to do it and the complexity inevitably overwhelms them.
Do I misunderstand what people mean when they say "read code"?
Yes. (As in both: yes I agree with your way of getting to know a codebase, and yes you misunderstand what at least some people mean by "read code".) See the "Code Is Not Literature" 2014 blog post that was discussed here last week (https://news.ycombinator.com/item?id=37134520), and the 2018 blog post that I linked to as concrete detail (https://news.ycombinator.com/item?id=37138832). Quoting Seibel quoting Knuth:
> > I got a copy of the source-code listing for it. I didn’t have a manual for the machine, so I wasn’t even sure what the machine language was. […] It was just basically the way you solve some kind of an unknown puzzle—make tables and charts and get a little more information here and make a hypothesis. In general when I’m reading a technical paper, it’s the same challenge. I’m trying to get into the author’s mind, trying to figure out what the concept is. The more you learn to read other people’s stuff, the more able you are to invent your own in the future, it seems to me.
> He’s not describing reading literature; he’s describing a scientific investigation.
So when people like Knuth talk about “reading code” (or even reading papers or books!), they're already describing a process like the one you described, not reading linearly.
> Start at the beginning, read some of these comments and familiarize yourself with use of the index, and then set yourself some problem or other. […] If you take any one of TeX’s primitives like “def” or something like that, you can look it up in the index under that name and it will say “def primitive” and it will refer you to the place where it was put in the hash table and that will refer you to what command code it has and you might be able to trace through looking at the index [the] whole history of def, how it comes through TeX. If you don’t like that problem, give yourself some other little task, saying “I wonder what this does” and that will just give you a reason for perusing the index and finding your way through the report… the main thing to do is just to get a little familiar with the notation and mess up the page—get the page a little black on the edges.
I think this bolsters my sense that we should not be calling this activity ‘reading’ but rather ‘investigating’ or ‘researching’ or ‘understanding’ or something (none of these are perfect - there must be a better word…)
Calling it ‘reading’ - especially if encouraging newcomers to programming to do it - is, I think, likely to discourage more than help.
I guess “studying” would be a good word: it's close enough to “reading”, while also having an educational connotation and no expectation of being a linear process.
> READING:
> Books are not scrolls.
> Scrolls must be read like the Torah from one end to the other.
> Books are random access -- a great innovation over scrolls.
> Make use of this innovation! Do NOT feel obliged to read a book from beginning to end.
I think that some codebases lend themselves to being read more than others. Consider for example GNU cat[0] vs. Plan9's[1], from which one can infer the overall readability of the two projects.
In particular, codebases that are composed of small, well-isolated components can be read one chunk at a time, like a book. But I wouldn't be surprised if most "professional grade" codebases consisted of organic, "cluttered" aggregates, which, as you observe, aren't really suited to being read, least of all linearly.
It also depends on one's intents, which are likely narrower in a professional setting (e.g. fixing a bug, implementing a feature; refactoring being a notable exception), than in a learning setting (e.g. learning how to write idiomatic parsers in Go by studying the Go parser itself). In this last case, curiosity might push you to read the code more deeply, compare different codebases, etc.
Finally, some languages also are more prone to enforce locality than others, impacting readability. See for example Linus arguing about C being more context-free than C++ [2].
Thanks. I’ve definitely ‘read’ things like both those ‘cat’s in the past. Even the GNU one is still small enough to be understood.
I agree with what you said about larger projects and the intent behind ‘reading’.
Linus’s opinion is interesting. I wonder how it’s changed over the years. I share his disdain for operator overloading on a visceral level, but to be fair it’s not much different from method or function naming really - those can end up misleadingly similarly named in large codebases too.
I guess we can put things this way: the closer the codebase is to a collection of uncorrelated chunks, the more one can read it "systematically". OTOH, the more the dependencies between the various chunks, the more useful/efficient a "reverse-engineering reading" becomes.
"Bad" naming and operator overloading both add such dependencies. As a side note regarding operator (well, symbol) overloading: it's just plain awful in mathematics. For example, a "+" sign can refer to an actual operator between scalars, between vectors, between functions, or between spaces, or to an essentially syntactical element (a portion of a name), etc.
It makes simple things so hard to understand for novices.
The best way to read it is close to how it executes. Following the call graph.
Find the definitions of data types (if they exist properly) or functions which operate on a looser data type (like if the data is a raw integer array but the program utilizes a suite of functions for manipulating and accessing that array in a more structured way). Identify either the entry point ("main") or some point of interest and work down and up the call graph to understand what's happening.
Most code is not written, outside scripts and often not even then, to be executed linearly. No reason to force linearity into reading the code either.
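As a rough illustration of "work down and up the call graph", here is a small Python sketch that builds a static call graph with the standard `ast` module and then walks it from an entry point. It only handles top-level functions called by bare name (no methods, imports, or dynamic dispatch), and the `SOURCE` snippet and helper names are invented for the example.

```python
import ast

# Hypothetical program to explore; only its structure matters here.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main():
    words = parse(load("input.txt"))
    print(len(words))
'''


def build_call_graph(source):
    """Map each top-level function name to the set of bare names it calls."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            callees = set()
            for child in ast.walk(node):
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    callees.add(child.func.id)
            graph[node.name] = callees
    return graph


def walk_down(graph, entry, depth=0, seen=None):
    """Return an indented outline of the functions reachable from entry."""
    seen = set() if seen is None else seen
    seen.add(entry)
    lines = ["  " * depth + entry]
    for callee in sorted(graph.get(entry, ())):
        # Skip builtins/external names and anything already visited.
        if callee in graph and callee not in seen:
            lines.extend(walk_down(graph, callee, depth + 1, seen))
    return lines


def callers_of(graph, target):
    """Work 'up' the graph: which functions call target?"""
    return sorted(name for name, callees in graph.items() if target in callees)


graph = build_call_graph(SOURCE)
print("\n".join(walk_down(graph, "main")))
print(callers_of(graph, "load"))
```

Starting at `main` gives the top-down view; calling `callers_of` on a point of interest gives the bottom-up view. Real tools (an IDE's "find usages", a debugger) do this far more thoroughly, but the reading order is the same.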
I have trouble making sense of a codebase until I start trying to edit it. That experience teaches me important stuff way faster and is way less watching-paint-dry boring than just “reading” code without some goal other than the reading itself.
I think you're on to something – you read for different reasons; sometimes it's to know where to intervene to fix a bug or add a feature, sometimes it's to know how a really specific algorithm works, sometimes it's to find the event loop or procedure that sequences a high-level process. One of my first "a-ha, so that's how it all works!" moments in coding was when I found the main loop in a decompiled codebase of Minecraft.
I did this for a nephew. It was a vanilla js implementation of a chess board.
Chess was popular in his friend group so he asked if he could code a chess game for him and his friends to play.
I designed a lesson around making a chess board. Board is an important specification. That is, I wrote it to be a digital chess board rather than a chess game. Coding the logic of checking if a move was legal or identifying a check was quickly getting out of reach of educational material.
So instead I explained we are making a virtual board. Anything you can do on a chess board you can do on ours. Move wherever you want, capture whatever you want, call out your own checks, etc.
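For anyone curious what "a board, not a game" looks like in code, here is a minimal sketch. The original was vanilla JS; this is a Python stand-in written for illustration, and every name in it (`new_board`, `square`, `move`) is made up. The point is that `move` does no legality checking at all.

```python
EMPTY = "."


def new_board():
    """Return an 8x8 board in the starting position; row 0 is rank 8."""
    back = list("rnbqkbnr")  # lowercase = black, uppercase = white
    return (
        [back[:]]
        + [["p"] * 8]
        + [[EMPTY] * 8 for _ in range(4)]
        + [["P"] * 8]
        + [[piece.upper() for piece in back]]
    )


def square(name):
    """Convert algebraic notation such as 'e2' to (row, col) indices."""
    col = ord(name[0]) - ord("a")
    row = 8 - int(name[1])
    return row, col


def move(board, src, dst):
    """Move whatever is on src to dst with no rule checks.

    Returns the captured piece, or None if dst was empty.
    """
    r1, c1 = square(src)
    r2, c2 = square(dst)
    captured = board[r2][c2]
    board[r2][c2] = board[r1][c1]
    board[r1][c1] = EMPTY
    return None if captured == EMPTY else captured


board = new_board()
move(board, "e2", "e4")             # an ordinary opening move
captured = move(board, "e4", "e8")  # illegal in chess, fine on a bare board
print("\n".join(" ".join(row) for row in board))
```

Leaving the rules to the players keeps the whole thing to a few dozen lines, which is about the right size for a first lesson.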
It also included a vanilla JS implementation of p2p peering (without signalling) that updated the board with the other user’s move, and had a chat function.
I feel it turned out great. We’ll see if he pursues his programming interest further.
One thing I did differently from the OP was that I included version control as part of the essentials of coding and manufactured my commits so they built up incrementally.
This way he could check out the last commit and play around with the final product, while also being able to see how it’s built by going through each commit chronologically.
The biggest hurdle was the restrictions on software downloading on his school-supplied laptop. Download vim? Blocked. Emacs? Blocked. Git? Blocked.
Had to get a cheapo refurb with linux just to get started.
I really like this idea. Reading other people's code is one of the best ways to learn to code in my opinion.
Yet at the same time, a part of learning / understanding from reading other people's code (given it's of good quality) is that you have to manually recreate what they're doing in your head without the comments.
I think for educational purposes the "educational comments" that you add have to strike a balance: explaining what the code does without just explaining away an entire function. To that end, it might be better to add more comments on specific lines, while letting the learner put the pieces of the different lines together themselves.
An extension that I'd like to see of this is many different types of codebases commented like this (OS, React, DBMS, etc etc) with additional information and diagrams about the overall design of the program and its different parts. Looking at and understanding a single file is all well and good, but without the context of how it fits into the bigger picture a lot of the potential learning is lost.
I am attempting to write a faithful implementation of Turing’s original paper where Turing machines were introduced, such that it could be used as a companion when reading the paper (my first time reading it would have been greatly aided by seeing a rubber-meets-the-road implementation). I am curious how a sort of “companion implementation for C.S. papers” codebase relates to this idea?
It appears to me the author is advocating Literate Programming[0]. If so, then this is a nice way to make "educational codebases" intended for study IMHO.
I adore Crockford's literary style of writing code, with copious comments, etc. I think there's a reason devs write so much more code than they read; code isn't written to be read, so it's a more significant cognitive load.
I wish I had the courage (?) to write verbose comments into my code at work, but I also want to keep it tidy for everyone.