Try to fix it one level deeper (matklad.github.io)
129 points by Smaug123 3 months ago | 61 comments



The title immediately brings to mind the Ousterhout classic, "Always Measure One Level Deeper" [1], and I imagine it was probably inspired by it. Also worth revisiting.

[1]: https://cacm.acm.org/research/always-measure-one-level-deepe...


I was not actually aware of the paper, and it is indeed pure gold, thanks for linking it. It is _also_ extremely timely, as with my current project (TigerBeetle), we are exactly at the point of transition from first-principles back-of-the-envelope performance architecture to measurement-driven performance engineering, and all the advice here is directly applicable!


See the Book review: A Philosophy of Software Design (2020) (johz.bearblog.dev)

https://news.ycombinator.com/item?id=27686818

for more in this vein.


I was reading about NASA's software engineering practices.

When they find a bug, they don't just fix the bug, they fix the engineering process that allowed the bug to occur in the first place.


Maintenance is never as rewarded as new features, there's probably some MBA logic behind it to do with avoiding commoditisation.

It's true in software, it's true in physical infrastructure (read about the sorry state of most dams).

Until we root cause that process I don't see much progress coming from this direction. On the plus side, CS principles are making their way into compilers. We're a long way from C.


Speaking of digging deeper, can you expand on that theory on why focus/man hours spent on maintenance leads to commoditization and why a company wants to avoid that?


Given enough iteration with the same incentives, two engineering teams might end up with the same sort of product overall. We see this with airframes. We established the prototypical airframe for the commercial airliner in the 1950s and haven't changed it in 70 years since. This is good for the airline but bad for the aircraft manufacturer. The airline can now choose between boeing or airbus or anyone else for their product. If boeing had some novel plane design that wasn't copied the world over, then the airline company would be beholden to them alone.


Sounds like automobiles are -- have been, until now -- an even more long-lasting example. The basic recipe has been the same for over a century: Four wheels, two to five doors, steering wheel and other controls at one of the front seats, internal combustion engine.

Only now is one of those components, the engine, being changed out. And looking at that from this perspective, it's apparently not as humongous a change as many are now claiming: For one thing, it's just one of the four (or more?) parts of the basic recipe; for another, it's not all that new -- electric motors are in wide use elsewhere, and were one of the alternatives in rather wide use in cars, too, before the industry settled on the current recipe ~a hundred years ago.

Anyway, the "beholden to them alone" bit doesn't seem to apply to the automotive industry, since pretty much all manufacturers are in on the switch-to-electrics idea. Which begs the question: Would there really be so much of a customer lock-in in the aeronautic space either? Just like the idea of driving cars by electric motors is unpatentable, what could Boeing (or anyone else) come up with in the basic design of an aeroplane that isn't too general an idea to be protectable, or (most probably) hasn't been tried before?


Top of my head, new things have unbounded potential, existing ones have known potential. We assume the new will be better.

I think it's part of the reason stocks almost always dip after positive earnings reports. No matter how positive it's always less than idealised.

You might think there's a trick where you can sell maintenance as a new thing but you've just invented the unnecessary rewrite.

To answer your question more directly, once something has been achieved it's safe to assume someone else can achieve it also, so the focus turns to the new thing. Why else would we develop hydrogen or neutron bombs when we already had perfectly good fission ones (they got commoditised)?


"Maintenance is never as rewarded as new features,"

And security work is rewarded even less!


> And security work is rewarded even less

While I do recognize that this is a pervasive problem, it seems counter-intuitive to me based on the tendency of the human brain to be risk averse.

It raises an interesting question of "why doesn't the risk of security breaches trigger the emotions associated with risk in those making the decision of how much to invest in security?".

Downstream of that is likely "Can we communicate the security risk story in a way that more appropriately triggers the associated risk emotions?"


What is the consequence for security breaches? Usually some negative press everyone forgets in a week. Maybe a lost sale or two, but that's hard to measure. If you're exceedingly unlucky, an inconsequential fine. At worst paying for two years of credit monitoring for your users.

What's the risk? The stock price will be back up by next week.


The people making the decision don't feel a direct negative impact. Someone's head might roll, but that's usually far up the chain where the comp and connections are high enough to not care. The POs making the day-to-day decisions are under more pressure for new features than they are for security.


easier to consider people to tend towards conservative rather than risk averse. if we were truly risk averse, society would be very different.


the easy explanation is that the cost of a breach is externalized, so decision makers gain benefit from savings in not investing in security.

Look at the crowdstrike failure as a recent example, but there's plenty more in the past.


This is such a powerful frame of mind. Bugs, software architecture, tooling choices, etc. all happen within organizational, social, political, and market machinery. A bug isn't just a technical failure, but a potential issue with the meta-structures in which the software is embedded.

Code review is one example of addressing the engineering process, but I also find it very helpful to consider business and political processes as well. Granted, NASA's concerns are very different than that of most companies, but as engineers and consultants, we have leeway to choose where and how to address bugs, beyond just the technical and immediate dev habits.

Soft skills matter hard.


It makes you wonder if there's been work on designing software that is resilient to bugs. Maybe you can test this by writing a given function in a variety of different ways, simulating some type of bug (fat fingering is probably easiest), and comparing outputs. Some of these functions might not work at all. Some might spit out the wrong result. But then there will probably be a few that are written in such a way as to get very close to the true result, and maybe that variance is acceptable for your purposes. Given how we currently write code (in English, in a way a human can read it), maybe it's not so realistic. But if we get to the point with our generative code where you can generate good quality machine code without having it transmuted to human readable code for human verification, then this is how we would be operating: looking at distributions of results from a billion putative functions.


To that example though, is NASA really the pinnacle of achievement in their field? Sure, it's not a very competitive field (e.g. compared to something like the restaurant industry) and most of their existence has been about R&D for tech there wasn't really a market for yet, but still SpaceX comes along and in a fraction of the time they're landing and reusing rockets, making space launches more attainable and significantly cheaper.

I'm hoping that example holds up, but I'm not well versed in that area so it may be a terrible counter-example. My overarching point is this: overly engineered code often produces less value than quickly executed code. We're not in the business of making computers do things artfully just for the beauty of the rigor and correctness of our systems. We're doing it to make computers do useful things for humanity.

You may think that spending an extra year perfecting a pace-maker might end up saving lives, but what if more people die in the year before you go to market than would've ended up dying had you launched with something almost perfect, but with potential defects?

Time is expensive in so many more ways than just capital spent.


SpaceX came along decades after NASA’s most famous projects. Would SpaceX have been able to do what they did if NASA hadn’t engineered to their standard earlier on?

My argument (and I’m just thought experimenting here) is that without NASA’s rigor, their programmes would have failed. Public support, and thus the market for space projects, would have dried up before SpaceX was able to “do it faster”.

(Feel free to shoot this down: I wasn’t there and I haven’t read any deep histories of the companies. I’m just brainstorming to explore the problem space.)


So I wonder if SpaceX has a lending library of all of the relevant (quality-related) NASA documents, printed out on dead trees. For light lunchtime reading.


The fallacy here is that you're assuming that doing things right takes more time.

Doing things right takes less time in my experience. You spend a little more time up front to figure out the right way to do something, and a lot of the time that investment pays dividends. The alternative is to just choose the quickest fix every time until eventually your code is so riddled with quick fixes that nobody knows how it works and it's impossible to get anything done.


It's tough to sell this to leaders and managers that there could be more benefit to quality and stability at the cost of cutting scope and losing a few oh so indispensable features. But their incentive is to dream up imaginative OKRs and come up with deadlines to show visible progress and justify their roles until the next quarter.


Which blog/post/book was this? Thanks


In the course of interviewing a bunch of developers, and employing a few of them, I've concluded that this ability/inclination/something to do this deeper digging is one of the things I prize most in a developer. They have to know when to go deep and when not to, though, and that's sometimes a hard balancing act.

I've never found a good way of screening for the ability, and more, for when not to go deep, because everyone will come up with some example if you ask, and it's not the sort of thing that I can see highlighting in a coding test (and _certainly_ not in a leet-code test!). If anyone has any suggestions on how to uncover it during the hiring process I'd be ecstatic!


"I've concluded that this ability/inclination/something to do this deeper digging is one of the things I prize most in a developer."

Where have you been all my life? It seems most of the teams I've been on value speed over future proofing bugs. The systems thinking approach is rare.

If you want to test for this, you can create a PR for a fake project. Make sure the project runs but has errors, code smells, etc. Have a few things like they talk about in the article, like a message about being out of disk space but missing critical message/logging infrastructure to cover other scenarios. The best part is, you can use the same PR for all levels that you're hiring for by expecting seniors to get X% of the bugs, mids to get X/2% and noobs to get X/4%.


"It seems most of teams I've been on value speed over future proofing bugs."

So, obviously, if one team is future proofing bugs, and the other team just blasts out localized short-term fixes as quickly as possible, there will come a point where the first team will overtake the second, because the second team's velocity will by necessity have to slow down more than the first's as the code base grows.

If the crossover point is ten years hence, then it only makes sense to be the second team.

However, what I find a bit horrifying as a developer is that my estimate of the crossover point keeps coming in. When I'm working by myself on greenfield code, I'd put it at about three weeks; yes, I'll go somewhat faster today if I just blast out code and skip the unit tests, but it's only weeks before I'm getting bitten by that. Bigger teams may have a somewhat farther cross over point, but it's still likely to be small single-digit months.

There is of course overdoing it and being too perfectionist, and that does get some people, but the people, teams, managers, and companies who always vote for the short term code blasting simply have no idea how much performance they are leaving on the table almost immediately.

Established code bases are slower to turn, naturally. But even so, I still think the constant short-term focus is vastly more expensive than those who choose it understand. And I don't even mean obvious stuff like "oh, you'll have more bugs" or "oh, it's so much harder to onboard", even if that's true... no, I mean, even by the only metric you seem to care about, the team that takes the time to fix fundamental issues and invests in better logging and metrics and all those things you think just slow you down can also smoke you on dev speed after a couple of months... and they'll have the solid code base, too!

"Make sure the project runs but has error, code smells, etc."

It is a hard problem to construct a test for this but it would be interesting to provide the candidate some code that compiles with warnings and just watch them react to the warnings. You may not learn everything you need but it'll certainly teach you something.


Slow is smooth, smooth is fast.


Unfortunately I believe there is no crossing point even in 10 years.

If a quick fix works, it is most likely a proper fix; if it doesn’t work, then you dig deeper. There is also the question of whether the feature being fixed is even worth spending so much time on.


A quick fix works now. It makes the next fix or change much harder because it just added a special case, or ignored an edge case that wasn't possible in the configuration at that time.


My main point is that that’s a false dichotomy.

There is a bunch of stuff that could be “fixed better” or “properly” if someone took a better look, but a lot of the time it is just good enough and is not somehow magically impeding a proper fix.


It is and it isn't a false dichotomy.

It is a false dichotomy in that, in the Aristotelian sense where "X -> Y" means that absolutely, positively every X must with 100% probability lead to Y, it is absolutely true that "This is a quick fix -> This is not the best fix" is false. Sometimes the quick fix is correct. A quick example: I'm doing some math of some sort and literally typed minus instead of plus. The quick fix to change minus to plus is reasonable.

(If you're wondering about testing, well, let's say I wrote unit tests to assert the wrong code. I've written plenty of unit tests that turn out to be asserting the wrong thing. So the quick fix may involve fixing those too.)

It is true in the sense that if you plot the quickness of the fix versus the correctness of the fix, you're not going to get a perfectly uniformly random two dimensional graph that would indicate they are uncorrelated. You'll get some sort of Pareto-optimal[1] front that will develop, becoming more pronounced as the problem and minimum size fix become larger (and they can get pretty large in programming). It'll be a bit loose, you'll get occasional outliers where you have otherwise fantastic code that just happened to have this tiny screw loose that caused a lot of problems everywhere and one quick fix can fix a lot of issues at once; I think a lot of us will see those once or twice a decade or so, but for the most part, there will develop a definite trend that once you eliminate all the fixes that are neither terribly fast nor terribly good for the long term, there will develop a fairly normal "looks like 1/x" curve of tradeoffs between speed and long-term value.

This is a very common pattern across many combinations of X and Y that don't literally, 100% oppose each other, but in the real world, with many complicated interrelated factors interacting with each other and many different distributions of effort and value interacting, do contradict each other... but only if you are actually on the Pareto frontier! For practical purposes in this case I think we usually are, at least relative to the local developers fixing the bug; nobody deliberately sets out to make a fix that is visibly obviously harder than it needs to be and less long-term valuable than it needs to be.

My favorite "false dichotomy" that arises is the supposed contradiction between security and usability. It's true they oppose each other... but only if your program is already roughly optimally usable and secure on the Pareto frontier and now you really can't improve one without diminishing the other. Most programs aren't actually there, and thus there are both usability and security improvements that can be made without affecting the other.

I'm posting this because this is one of those things that sounds really academic and abstruse and irrelevant, but if you learn to see it, becomes very practical and powerful for your own engineering.

[1]: https://en.wikipedia.org/wiki/Pareto_front
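
For anyone who wants to see the "eliminate everything that is neither fast nor good" step concretely, here is a minimal sketch. The candidate fixes, their hours, and their long-term value scores are all made up for illustration; only the filtering logic is the point.

```python
def pareto_front(fixes):
    """Return the fixes that no other fix dominates.

    Each fix is (name, hours, long_term_value).
    Fewer hours is better; more long-term value is better.
    """
    front = []
    for name, hours, value in fixes:
        dominated = any(
            other_hours <= hours
            and other_value >= value
            and (other_hours < hours or other_value > value)
            for _, other_hours, other_value in fixes
        )
        if not dominated:
            front.append((name, hours, value))
    # Sort by cost so the speed-versus-value tradeoff reads left to right.
    return sorted(front, key=lambda fix: fix[1])


# Hypothetical candidates for one bug: (name, hours, long-term value).
candidate_fixes = [
    ("flip the sign", 1, 2),
    ("add a special case", 2, 1),      # slower and worse than flipping the sign
    ("add a regression test", 4, 5),
    ("rewrite the module", 40, 9),
    ("rewrite it badly", 50, 3),       # slower and worse than the regression test
]

front = pareto_front(candidate_fixes)
for name, hours, value in front:
    print(f"{name}: {hours}h, value {value}")
```

Only the fixes left on the front involve a real tradeoff; the dominated ones can be dropped without giving anything up, which is the situation most "security vs. usability" arguments are actually in.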


Well, I mostly work on systems like HN or Facebook, where people don’t have their lives on the line and don’t care that much. If I cannot post my comment I can go on with my life.

Sometimes I get “hey you did too quick requests” while posting.

A proper fix would be a better check of whether I am really a bot or I just cast a vote and wrote a quick comment - but no one is going to care enough.

Whatever the long time dead dude was saying.


My impression is that bigger teams have a shorter crossover point.

Weirdly, teams seem to adapt better to bad code. But that adaptation occurs through meetings. And meetings just destroy a team's productivity.


>Weirdly, teams seem to adapt better to bad code.

The greenfield team usually adapts well to its own buggy code. They know the system so well inside-out that if a bug pops up they have a general idea why.

This is bad, because with natural fluctuation in team members this institutional knowledge is lost. New members don't have the benefit of knowing about the whole evolution with all its quirks, and don't have the unit tests from the previous team to prevent regressions.

This then slows velocity to near zero as the team gets replaced, and leads to the inevitable rewrite.


My experience is that people adapt differently to different issues, so in a team people can specialize in the types of bad code they handle best. So the code quality degrades their productivity less while they are working than if each person had to deal with the entire diversity alone.

But that implies a division of work that is not aligned with any communication-reducing objective.


I have seen enough BSers who claimed that they needed to “do the proper fix”, doing analysis and wasting everyone’s time.

They would be vocal about it and then spend weeks delivering nothing, “tweaking db indexes”, while I had immediately seen that the code was crap and needed slight changes. But I also don’t have time to fight all the fights in the company.


That's the thing, my comment wasn't about that long analysis or doing the proper fix. It's all about asking if this is the root cause or not, or is there a similar related bug not yet identified. You could find a root cause and bring it back to the team if it's going to take weeks. At that point the team has the say on if that fix is necessary.


That's a really good idea. Thanks


Knowing when to go down the rabbit hole is probably more about experience/age than anything. I work with a very intelligent junior that is constantly going down rabbit holes. His heart is in the right spot but sometimes you just need to make things work/get things done.

I used to do it a lot too and I kind of had a "shit, I'm getting old" moment the other day when I was telling him something along the lines of "yeah, we could probably fix that deeper but it's going to take 6 weeks of meetings and 3 departments to approve this. Is that really what you want to spend your time on?"

Like you said, it's definitely a balancing act and the older I get, the less I care about "doing things the right way" when no one actually cares or will know.

I get paid to knock out tickets, so that's what I'm going to do. I'll let the juniors spin their wheels and burn mental CPU on the deep dives and I'm around to lend a hand when they need it.


However, you have to overdo it a sufficient number of times when you’re still inexperienced, in order to gain the experience of when it’s worth it and when it’s not. You have to make mistakes in order to learn from them.


When it's worth it and when it's not seems to be more of a business question for the product owner. It's all opinion.

I've been on a team where I had 2 weeks left and they didn't want me working on anything high priority during that time so it wouldn't be half finished when I left. I had a couple of small stories I was assigned. Then I decided to cherry-pick the backlog to see how much tech debt I could close for the team before I left. I cleared something like 11 stories out of 100. I was then chewed out by the product owner because she "would have assigned [me] other higher priority stories". But the whole point was that I wasn't supposed to be on high priority tasks because I was leaving...


The product owner often isn’t technical enough, or into the technical weeds enough, to be able to assess how long it might take. You need the technical experience to have a feeling of the effort/risk/benefit profile. You also may have to start going down the hole to assess the situation in the first place.

The product owner can decide how much time would be worth it given a probable timeline, risks and benefits, but the experienced developer is needed to provide that input information. The developer has to present the case to the product owner, who can then make the decision about if, when, and how to proceed. Or, if the developer has sufficient slack and leeway, they can make the decision themselves within the latitude they’ve been given.


Yeah. The team agreed I should just do the two stories, which was what was committed to in that sprint. I got that done and then ripped through those other 11 stories in the slack time before I left the team. My TL supported that I didn't do anything wrong in picking up the stories. The PO still didn't like it.


Why product owner? (Perhaps rather not say team lead?)

Are these deeply technical product owners? Which ones would be best to make this decision and which less?


In a non-technical company with IT being a cost center, it seems that the product owner gets the final say. My TL supported me, but the PO was still upset.


Regardless, these deep dives are so valuable in teaching yourself, they can be worth it just for that.


Have you been asked "why do we never have the time to do it right, but always time to do it twice?"


His response is likely something like "I am an hourly contractor, I have however much time they want", or something with the same no-longer-gives-a-shit energy.

But their manager likely believes that deeper fixes aren't possible or useful for some shortsighted bean-counter reason. Not that bean counting isn't important, but the beans are often counted early and wrong.


Yeah, don't get me wrong, I'm not saying "don't care about anything and do a shitty job" but sometimes the extra effort just isn't worth it. I'm a perfectionist at heart but I have to weigh the cost of meeting my manager's goals or getting behind because I want it to be perfect. Then 6 months later my perfect thing gets hacked apart by a new request/change. Knowing when and where to go deeper and when to polish things is a learned skill and has more to do with politics and the internal workings of your company than with some ideal. Everything is in constant flux and having insight into smart deep dives isn't some black and white general issue. It's completely context dependent.


"yeah, we could probably fix that deeper but it's going to take 6 weeks of meetings and 3 departments to approve this. Is that really what you want to spend your time on?"

This is where a developer goes from junior to senior.


Such qualities can sometimes be unearthed when you ask candidates to deal with a problem they can't know the answer to. In the end the ability to go deep has a lot to do with them being confident in their ability to be able to understand things that are new to them.

Most people can go into a deep dive if you force them to do it, but how they conduct themselves while doing it can show you if this is a thing they would do on their own.


Mike Acton said the same thing in an interview. He said curiosity is the best indicator if a candidate will be a good hire.

Casey Muratori was interviewing him at HandmadeHero Con back in 2016. Here is the snippet: https://youtu.be/qWJpI2adCcs?si=ezSKud42PC3Ub-UO&t=3112


Surely there is a way to present some piece of buggy code to a candidate and ask what’s wrong with it (letting them use whatever tools they want, doing this on a whiteboard is senseless), let them determine what the bug is and see how they fix it. Obviously there are constraints that make the code in question difficult to construct (needs to be simple and small enough to fit in an interview without the candidate having ever seen the code before, but not too simple to make it a non-differentiating question; the bug has to have multiple levels to it where there’s an obvious fix and possibly several better but less obvious ways to fix the issue, etc)


I learned a similar mantra that I keep returning to: “there’s never just one problem.”

- How did this bug make it to production? Where’s the missing unit test? Code review?

- Could the error have been handled automatically? Or more gracefully?


This kind of reminds me of https://en.m.wikipedia.org/wiki/Five_whys.


>"There’s a bug! And it is sort-of obvious how to fix it. But if you don’t laser-focus on that, and try to perceive the surrounding context, it turns out that the bug is valuable, and it is pointing in the direction of a bigger related problem."

That is an absolutely stellar quote!

It's also more broadly applicable to life / problem solving / goal setting (if we replace the word 'bug' with 'problem' in the above quote):

"There’s a problem! And it is sort-of obvious how to fix it. But if you don’t laser-focus on that, and try to perceive the surrounding context, it turns out that the problem is valuable, and it is pointing in the direction of a bigger related problem."

In other words, in life / problem solving / goal setting -- smaller problems can be really valuable, because they can be pointers/signs/omens/subcases/indicators of/to larger surrounding problems in larger surrounding contexts...

(Just like bugs can be, in Software Engineering!)

Now if only our political classes (on both sides!) could see the problems that they typically see as problems -- as effects not causes (because that's what they all are, effects), of as-of-yet unseen larger problems, of which those smaller problems are pointers to, "hints at", subcases of, "indicators of" (use whatever terminology you prefer...)

Phrased another way, in life/legislation/problem solving/Software Engineering -- you always have to nail down first causes -- otherwise you're always in "Effectsville"... :-)

You don't want to live in "Effectsville" -- because anything you change will be changed back to what it was previously in the shortest time possible, because everything is an effect in Effectsville! :-)

Legislating something that is seen that is the effect of another, greater, as-of-yet unseen problem -- will not fix the seen problem!

Finally, all problems are always valuable -- but if and only if their surrounding context is properly perceived...

So, an excellent observation by the author, in the context of Software Engineering!


You need to choose your rabbit holes carefully.

In large and complex codebases, it's often more pragmatic to build a guard in your local area against that bug than to follow the bug all the way down the stack.

It's not optimal, and doesn't make the system better as a whole, but it's the only way to get things done.

That doesn't mean you should be silent though; you do need to contact the team that looks after that part of the system.
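
A rough sketch of what I mean by a local guard, assuming a made-up upstream function owned by another team. `upstream_discount` and its bug are hypothetical stand-ins, and the warning log is what stops the guard from silently hiding the problem.

```python
import logging

logger = logging.getLogger("local_guard")


def upstream_discount(order_total):
    """Stand-in for code another team owns (hypothetical).

    Pretend bug: it returns a negative discount for large orders.
    """
    if order_total >= 1000:
        return -1
    return order_total // 10


def safe_discount(order_total):
    """Local guard around the upstream call."""
    discount = upstream_discount(order_total)
    if discount < 0:
        # Guard only: the real fix belongs upstream. Log loudly so the
        # owning team can be given concrete evidence when contacted.
        logger.warning(
            "upstream_discount returned %s for total %s; clamping to 0",
            discount,
            order_total,
        )
        return 0
    return discount


print(safe_discount(100))   # normal path goes straight through
print(safe_discount(5000))  # guarded path, clamped to 0
```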


IMO it may be worth distinguishing between:

1. Diagnosing the "real causes" one level deeper

2. Implementing a "real fix" one level deeper

Sometimes they have huge overlap, but the first is much more consistently-desirable.

For example, it might be the most-practical fix is to add some "if this happens just retry" logic, but it would be beneficial to know--and leave a comment--that it occurs because of a race condition.
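
A minimal sketch of that retry-plus-comment idea. `TransientConflict`, `call_with_retry`, and the flaky operation are all hypothetical names for illustration; the part that matters is the comment recording the diagnosed cause next to the workaround.

```python
import time


class TransientConflict(Exception):
    """Hypothetical error raised when two writers collide."""


def call_with_retry(operation, attempts=3, delay_seconds=0.0):
    # Diagnosed root cause (not fixed here): `operation` can lose a race
    # with a concurrent writer and raise TransientConflict. Retrying is
    # the practical local fix; the deeper fix is removing the race.
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientConflict as error:
            last_error = error
            time.sleep(delay_seconds)
    raise last_error


# Usage: a fake operation that loses the race twice, then succeeds.
calls = {"count": 0}


def flaky_operation():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientConflict("lost the race")
    return "ok"


result = call_with_retry(flaky_operation)
print(result, "after", calls["count"], "calls")
```

Whoever reads the retry loop later then knows it is papering over a race rather than some unexplained flakiness.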


This seems like the code implementation way of shifting left. https://news.ycombinator.com/item?id=38187879


In enterprise monorepos I find this hard because "one level deeper" is often code you don't own.

Fun article, good mantra!


Echoes of "Hal fixing a light bulb" from Malcolm in the Middle



