The Value of In-House Expertise (danluu.com)
210 points by ingve on Sept 29, 2021 | 49 comments



Nice article, but I wish these would have a date on them!

Joel addressed the build vs. buy argument here (20 years ago).

If it’s a core business function — do it yourself, no matter what.

https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-...

I guess what's not obvious to many people is that maintaining and optimizing the kernel and JVM is a core business function for Twitter (but probably not writing a kernel!). Likewise CPU design is now a core business function for Apple. Anything "down" your dependency stack can be.

On the other hand, software for employees to file expense reports likely isn't, etc.


I think Dan addresses this in the post. "Core business" just isn't well enough defined to be useful. Why is optimizing the kernel Twitter's core business, but not writing it? Because the ROI on the former is high for Twitter, but the ROI on the latter is low.

If you're going to stretch things to call kernel maintenance and optimization Twitter's core business, then you have the consequence that you don't know what your core business is until you've spent a lot of time exploring which things will be effective uses of your money. Imo, that's too much of a stretch.


Yeah, now I see it even has the phrase "core business", which he says is vague. I don't think it's that vague if you just add the qualification about "down the stack".

It's hard for me to think of anything that should be brought in house that isn't down the stack, i.e. on the critical path to serving Twitter. It's just not obvious to outsiders what is down the stack for a given company (the "I could write Twitter in a weekend" people), but it should be more obvious to people working there.

e.g. the example of expense reports is obviously not down the stack -- if it goes down or is generally terrible, you can still serve your customers. As another example, in the old days, the big tech companies used to actually hire chefs and kitchen staff themselves. These days I believe it's all done through contractors.

Also I'd say the post assumes that saving money is always important... in their earlier stages, companies are frequently really wasteful and prefer to grow the business. An example is Dropbox starting on AWS and then building their own data centers to become cost efficient.


Lots of things are "down the stack". The servers, the building they sit in, the electricity that runs the building, and the coal/water/solar/wind that produces that electricity. Should Twitter run a power plant? Maybe they should optimize the design of current power plants? Also, what about the toilets that their programmers use - they could probably be optimized as well!


There may be an ROI for it, but I don't think it's part of their core business. If we extrapolate it out and Twitter becomes the best kernel optimizer on the planet, they still can't really sell kernel optimization as a product.


But one of Dan's points is that it could become part of their core business - they could launch e.g. a kernel optimization consulting service, much like Apple suddenly expanded into making their own CPUs.


The article list has dates: https://danluu.com/


> a single person found something that persistently reduced TCO by 0.5%, that would pay for the team in perpetuity,

There are several problems with this line of thinking, although as I will mention, it's not actually crazy, just problematic.

Attribution is one: how will you know whether the team actually did something that increased profits? Many teams are involved in a business. It's not at all simple to say he-did-this-and-she-did-that. In fact, much of office politics is exactly that: pie-cutting.

You also don't know whether a non-specialist might have figured out the problem. There are a lot of smart people at Twitter, right? Surely some of them work in adjacent areas and have occasional time to look at other things? And if a non-specialist might have solved it, what else might he have solved? Couldn't he also collect 0.5% slices of profit for the company?

How do big businesses ever lose money? They must have a load of specialists, right? And at any given moment some of them must be doing that one thing that makes them pay for themselves in perpetuity? The big danger is that you justify every expenditure this way: "they just need to find one thing". Security says they stopped a cyberattack that would have shut down the company for two days, the kernel team says they reduced runtime by 0.5%, sales claims to have raised prices by 0.5%. In the end there's a fair chance that all the claimed gains don't add up to your bottom line.

I remember as a quant trader we could buy any book we wanted. Programming, Linear Algebra, finance, whatever. "If you find just one good idea, that will pay for all the books." Hard to argue with considering the sums involved, but it's also hard to know exactly what ideas we got out of the books.

Finally, if someone claims to be making you money, they will also claim the money. Especially if it's clearly agreed (yep, AWS cost is lower by 0.5% exclusively because of a kernel-team action). So them saving 0.5% won't necessarily mean the company gets that extra profit. They may feel they deserve a raise, or new headcount to spread the work. Or you will decide not to pay them and they will leave.


> I remember as a quant trader we could buy any book we wanted. Programming, Linear Algebra, finance, whatever. "If you find just one good idea, that will pay for all the books." Hard to argue with considering the sums involved, but it's also hard to know exactly what ideas we got out of the books.

I agree with your overall sentiment, but think you might be overapplying it here. According to glassdoor [0], the average salary for a quant trader is $146k, not counting other forms of compensation and employee overhead. Even the most expensive book only needs to save a few hours of effort to pay for itself, let alone provide an idea you wouldn't otherwise have.
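
Back-of-envelope, with purely illustrative numbers (salary only, ignoring overhead):

  $146,000 / ~2,000 working hrs/yr ≈ $73/hr
  => a $75 book breaks even if it saves about one hour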

Even if it doesn't pay for itself directly, it is cheap enough to justify for retention reasons alone.

[0] https://www.glassdoor.com/Salaries/quantitative-trader-salar...


A corollary for this could be:

The value of minimizing external complexities [0]

For instance, if you design an application as HTML+PWA instead of native mobile apps, you just need a web developer who understands responsive CSS techniques and maybe someone with time to test a bunch of different devices all day. With native, you usually need 1 fairly-specialized developer per native target unless you have a lot of time to go to market (or a very simple app).

Another example could be designing your product to run on a single, bare-ass VM so you don't need to hire legions of level 30 kubernetes wizards to sort out your go-to-market strategy, or accountants to manage the byzantine nightmare that is AWS/Azure/et al. billing.

The fewer things you have to worry about, the less expertise you need to maintain.

  [0] What I mean by "external complexities" - Anything that is external to the problem domain for which the solution is originally being built. If you have a banking product, an internal complexity would be state management around account or customer activities. An external complexity would be a 3rd party vendor, reporting system, database, file, network, hardware, operating system, or any other non-domain types residing within the software product itself.


One fairly specialized developer is a lot harder to replace than a generic level 30 kubernetes wizard. If you lose that developer, the ramp-up time for a replacement could be >1 year. In addition, more standardized approaches have better defined practices, tooling, and security.


It's a shame that this gets forgotten so often. Most of the very high value extracting places pay a huge premium to recruit and retain the best people for anything that is remotely considered to be value-add.

Similarly overlooked: vertical integration is rarely (only) about costs but precisely about removing the conflicts and challenges inherent in outsourcing.


> a single person found something that persistently reduced TCO by 0.5%, that would pay for the team in perpetuity,

This means that when you are operating at hyperscale you need a world class team. But the tricky question is calculating when that happens!
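
As a rough, purely illustrative break-even sketch (every number below is assumed, none come from the article):

  infra spend:   $300M/yr (assumed)
  0.5% of TCO:   $1.5M/yr saved, persistently
  kernel team:   5 engineers x ~$300k fully loaded = $1.5M/yr

At the assumed $300M/yr, a single persistent 0.5% win pays for the whole team; at $30M/yr it pays for about half an engineer. Somewhere between those scales is where the hire starts to make sense.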


> I'm not going to do a team-by-team breakdown of teams that pay for themselves many times over because there are so many of them, even if I limit the scope to "teams that people are surprised that Twitter has".

I assume Dan means that "teams that pay for themselves" are teams where the total cost of employing the team is <= the decrease in company expenditure that can reasonably be attributed to the team.

If that is the case, two things come to mind:

1. What is the likelihood that that company expenditure would have decreased without the team? (over time bugs/improvements are fixed/implemented by other employees or outside parties (open source))

2. If instead of spending money on the team that decreased expenditure, the company had spent money on a team that increased revenue, what would the relative difference be for profits?

This may be complicated by the fact that the cost to process a byte of information, using the same process, almost always falls over time (am I wrong?).


> What is the likelihood that that company expenditure would have decreased without the team? (over time bugs/improvements are fixed/implemented by other employees or outside parties (open source))

Don't forget the time part of this. If the issue would eventually get fixed anyway, but not for 2 years, those 2 years still count in favor of having the team. Nobody is giving real numbers, but reading between the lines you can guess that some of these changes are saving the company $100,000/month, so 2 years is $2.4 million, and that assumes you get the fix at all.

Most of the above savings are on the electric bill: computers use less power, and therefore need less AC to cool them. Some of it is also that the company buys fewer computers.


The trick is to realize this is a gray area that can't be measured precisely, and >as a leader, your organization will find evidence to support whichever argument you favor<

It takes a special type of MBA brain to imagine you can project the future well enough to foresee the outcome of outsourcing a "cost center" and putting that budget into a team that will define their metrics in increased revenue.


> 2. If instead of spending money on the team that decreased expenditure, the company had spent money on a team that increased revenue, what would the relative difference be for profits?

Why not both?


> Despite a lot of claims otherwise, Scala uses more memory and is significantly slower than Java

Yup. Sometimes the advanced abstractions are worth it (Spark, maybe Akka), but those are niche cases.


akka is never the answer


> Another reason to have in-house expertise in various areas is that they easily pay for themselves ... If, in the lifetime of the specialist team like the kernel team, a single person found something that persistently reduced TCO by 0.5%, that would pay for the team in perpetuity ... people will also find configuration issues, etc., that have that kind of impact.

KEY observation that we forget when we wear many hats at small companies. This is the satisfying core reason 'why we debug'. Our deep dives matter. (Though not as much as they would at BigCos)


Hmm reducing TCO by 0.5% at Twitter would pay for the team in perpetuity. Reducing TCO by 0.5% at small Co. will pay for a beer once a month.


Why hello my old friend, economy of scale. We meet again.


yes -- at smaller places, the goal is usually fixing something urgently broken and the economics are harder to justify

which is really frustrating and leads old hands at small places to become obsessed with conservative build/buy decisions and low TCO tech

nice to be reminded that it's not like that everywhere, there are places where debugging labor pays consistent dividends


> which is really frustrating and leads old hands at small places to become obsessed with conservative build/buy decisions and low TCO tech

Can you clarify what you mean here? Are you saying that when people cut their teeth in big companies and then move to small startups, they tend to be overly conservative and favor low TCO tech, when instead they should be spending more money for, what, more growth? Should they be building more in-house, or buying more off the shelf? I'm just not clear which direction you're going here.


ah no, I'm saying that small co experience leads to visceral fear of tech that is non-mature / difficult to debug / unpredictable

(whether or not you're ex-big-5)

sorry, by 'low TCO' I wasn't just thinking about $/month -- I was thinking of technology that is non-experimental, easy to hire for and manage, that doesn't take one of your senior people a week per quarter to keep alive. TCO is the wrong word for that.


I'm not surprised that Twitter has those teams, I'm always surprised that more places don't. About 22 years ago I was the only person at the company I worked for at the time who could analyze Solaris core dumps, and understood enough about the JVM to diagnose deep problems. In the 5 years until the rest of the engineering staff caught up to where they stopped needing me for every incident, I probably saved enough money to pay my salary 100x over.

Never saw a dime of that. The one time someone offered me a spot bonus to come solve a problem, they reneged on it: A manager in charge of a project came to me for help. We're crashing all the time, he says. IBM and the team can't figure it out, he says. If you can solve this, I'll give you a $5000 spot bonus, he says.

I would have done it anyway, because it's my, you know, job? But whatever, I won't turn down free money.

So I wander over to the team that's been looking at this and get the lowdown. They keep getting out of memory errors.

Me: So what does the heapanalyzer output look like? Team: Huh?

Me: You...you've been having out of memory errors and haven't looked at the heap? Team: Buh?

So I get the heapdump and look at it. Immediately it's clear that the system is overflowing with http session objects.

Me: Anything in the log files related to sessions? Team: Just these messages about null pointer exceptions during session cleanup...do you think they're related somehow? Me: <Bangs head on desk>

A little more research revealed that there were two issues at play. The first was that we had a custom HttpSessionListener that was doing some cleanup when sessions were unbound. It would sometimes throw an exception. The second: we were using IBM WAS, and it turned out that when a sessionDestroyed method threw an exception, WAS would abort all session cleanup. So we'd wind up in a cycle: the session cleanup thread would start, process a few sessions, then hit one that threw an exception on cleanup, which would abort cleaning up any remaining sessions.

We did a quick fix of wrapping all the code in the sessionDestroyed method with a blanket try/catch and logging the exception for later fixing, and IBM later released a patch for WAS that fixed the session cleanup code to continue even if sessionDestroyed threw an exception.
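
For the curious, the workaround is the usual listener-guard pattern. A minimal sketch of the kind of fix described (class name, logger, and cleanUp are hypothetical stand-ins, not the actual code):

  import java.util.logging.Level;
  import java.util.logging.Logger;
  import javax.servlet.http.HttpSession;
  import javax.servlet.http.HttpSessionEvent;
  import javax.servlet.http.HttpSessionListener;

  // Hypothetical sketch: swallow and log any exception so one bad session
  // can't abort the container's entire cleanup pass.
  public class GuardedSessionListener implements HttpSessionListener {
      private static final Logger LOG =
          Logger.getLogger(GuardedSessionListener.class.getName());

      public void sessionCreated(HttpSessionEvent e) { }

      public void sessionDestroyed(HttpSessionEvent e) {
          try {
              cleanUp(e.getSession()); // whatever the original cleanup did
          } catch (Exception ex) {
              // Log for later fixing instead of propagating; an escaped
              // exception here is what made WAS stop cleaning up sessions.
              LOG.log(Level.WARNING, "session cleanup failed", ex);
          }
      }

      private void cleanUp(HttpSession s) { /* original cleanup logic */ }
  }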

So, I very quickly solved this problem and waited for my $5000 spot bonus. And waited. And waited...

I went back to the manager and asked him about it. Over the next few weeks, he proceeded to tell me the following series of stories:

* It was in the works, and I'd have it soon.

* He had to get approval from his superiors.

* Because so many people had worked on the problem, it was decided that it should be split among the group, and that I'd have to share it with the people that couldn't fix it.

* No bonus.

So even though it was his idea to try to bribe me to fix a problem, they still failed to follow through on it.

Another story: We had an issue once where they finally brought me in after a year of problems. One of our Java systems was failing intermittently, and the development team had given up and couldn't figure out what was wrong. The boss told me it was now my problem, that I was to dedicate 100% of my time to solving it, and that I could rewrite as much of the system as needed, basically total freedom (and responsibility). About halfway through the spiel where they were talking about the architecture and implementation, someone mentioned that the system was dumping core. I immediately stopped them right there.

Me: You realize that if it's a coredump, it's not our fault, right? Boss: Huh?

Me: If a Java program coredumps, it's either a bug in a 3rd party JNI library, a bug in the JVM, or a bug in the OS. What did the coredump show? Boss: Wha?

Me: You guys have had this problem for a year and haven't looked at the coredumps? Boss: Blurgh?

So I fire up dbx and take a look at the last few coredumps. Pretty much instantly I can see the problem is in a JDBC type 2 (JNI native code) driver for DB2. We contact IBM, and after a bunch of hemming and hawing they admit there's a problem that's fixed in the latest driver patch. We upgrade the driver and poof! the problem is gone.
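
(For anyone who hasn't driven dbx: the workflow is basically two commands. The session below is a mock-up, not real output.)

  $ dbx /path/to/java core   # the binary plus the core it dumped
  (dbx) where                # native stack of the faulting thread
    ...frames inside the DB2 type 2 driver's shared library, not the app...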

We had a year of failures, causing problems for customers, as well as all the wasted man hours trying to fix something in our code that simply could not have been fixed that way, all because the main dev team for this product had no idea how to debug it. I had an answer within 30 minutes of being brought in to the problem, and the solution was deployed within days.


Please tell me you found better pastures! A place where you were compensated and appreciated. :)

I've been at the current job for 10 years, receiving the highest level of performance review for all but the last year (the global pandemic was one of four life upheavals... I'm glad I pulled off a 'great' review vs. 'exceptional'), doing similar work of solving problems people can't seem to comprehend... yet I just can't seem to break into the next job title tier because "We don't need a principal."

...

I just found out my lead was promoted to principal. Once the divorce is off my plate, I plan on taking more risk and jumping ship.


I didn't mind so much, as I was well compensated in general. It was only in the last few years that Agile cultists took over and managed to ruin the place completely. After a long, painful decline, my entire group was laid off, from senior managers down to new hires. Fortunately, I immediately found a better job.

Oh, and I had been there for 22 years.


What amazing stories you have shared! Glad you changed jobs. Totally agree on Agile cultists. I am seeing a similar decline here at work.

Obsession with hour tracking takes priority over everything else. Things that one good engineer would do in 4-6 weeks on a single ticket are now broken into 100 little JIRA shitlets totaling 6 person-months spread across 5 developers. At the end of it no one really has a clue what has actually been achieved. But since 100 JIRA stories have been completed, it must be a great milestone.


Just chiming in to share my disdain and rancor against JIRA. "JIRA shitlets" is the best term I've heard in weeks.

To paraphrase T.S. Eliot:

"This is the way the world ends, not with a bang, but with a storm of whimpering JIRA shitlets."


The obsession with Agile, Scrum, sprints, and points ground our entire organization to a halt.


I wonder if that's a situation where the bosses above your boss think that there should be a proper "shape" to a team's rankings (e.g., can't have a team with too many principals, or all SWE 3s, etc). Even when one's manager is willing to go to bat for you, sometimes they get told they can only promote one person.


Oh, I'm sure that's the case. That, and they know I love the product and would (usually) be willing to put up with lower pay to keep working on it. But, losing assets in the divorce, I'm going to need to take more risk to get back on my feet. Only have a few decades to retirement, and need to rebuild that nest egg. So my priorities are changing, and I'm less willing to put up with people that don't appreciate me.


> yet I just can't get seem to break into the next job title tier because "We don't need a principal."

> I just found out my lead was just promoted to principal.

Brutal. Good luck.

Got a plan?


Yup! I'd told them last year as the divorce was starting that I'd need to be promoted within 18 months, or I'd start to become bitter and perform less well... and I'd leave before that happened.

Just put in for a GDC talk, hopefully that gets approved. Otherwise I'll stick around to give a talk at the Game Analytics Summit (big networking event for my niche of a field), and spin up some conversations. Also have contacts in most major studios... or I may just strike out and start doing consulting work. It was always a dream of mine to get back into global consulting work once the kids were grown up. They have been, but the ex was wanting to stick around for the grandkids.

Me, I'll be a free woman to make her mark on the world (once I can extract myself from the leech).


I hear you. Tracing and core dumps are just part of being a programmer. At least they should be.

Try having this conversation with anyone for whom modern architecture means not only that there is nothing to analyze, but that all software everywhere should just be restarted on every failure and all state thrown away, because "stateless" and "cloud". The best environment is one where no one can log in anywhere.

It's not a problem that no one can analyze anything because nothing is ever analyzed anyway. Software simply shouldn't be fixed, it should be built upon.

It seems to be pretty much everyone's state of mind these days, and I feel completely powerless about it.


Sometimes the root cause of a problem is buried one or two abstraction layers deeper than the "responsible team" is comfortable working at. This is where the "lower level" expert comes into play.

At my place the abstractions start at high level user facing code, reach down into the hardware interface (driver) code and down to the actual electronic circuit design.

EVERYWHERE something can go wrong. If the circuit deteriorates too fast, you need to get the material analysts in to "debug" it.

Rule of thumb: there is always one layer of abstraction below you where stuff can break, which feels like magic to you but is fixed at a glance by the lower-level guy.


I've just always considered understanding and being able to debug your environment to be a standard part of the job. As a professional software developer working in a Solaris environment, knowing how Solaris works, how to use the shell, tools, and other stuff is just basic.

Back in the 90's I worked at a company doing development on HP Apollo workstations. They were X based and used CDE for the desktop environment. When I started there, I invested some time to learn how it all hung together, how to leverage ToolTalk, and customized my system to do cool stuff.

There was another developer who had started there a month before me, and had the same workstation. They had a problem one day that they wanted my help with, so I went to their desk for the first time. I found their screen had the default X stipple pattern, and a single terminal window 80 characters wide, maybe 2/3rds of the screen tall, and placed somewhat off center in the screen.

I thought it slightly odd, but was distracted by the problem at hand. So at one point when they had some code up in the single window in vi, I asked them to open another terminal so we could do some other stuff. They just sort of sat there, so I asked again. They got agitated and snapped that they didn't know how to do that.

This was a professionally employed C programmer, with several years of experience prior to this, who had been working with this equipment for several months, and _didn't know how to open a second terminal window_. They didn't have CDE running, because when you log in you can select your desktop, and they kept picking just a plain, raw X session with the default xterm. They were completely and utterly uninterested in the capabilities of the hardware, OS, and desktop environment.

The same goes for the issue with the crashing JNI driver in my other comment. Maybe you don't know how to write your own JNI stuff, but I just expected that any developer who'd been using Java for any length of time would at least _know_ about how the JVM works in general, and what a coredump means in the context of a Java program. Specifically, that it's not your software that's causing it, and that trying to fix it by rewriting and debugging your Java code is a waste of time.


Sadly, the vast majority of developers might know that there is a tool called a debugger, and might also know how to set a breakpoint. But that's where it stops. No drive to see what your debugger offers you. Call stack? What's that? Conditional breakpoints? Huh?

I suppose that instead of aiming further down the abstraction layers, they instead aim up and try to get the hang of macro-level stuff, metaprogramming, abstractions, which I most often write off as "just a fad".

If something like containerization or container orchestration sticks around for 5 or 10 years I'll maybe take a proper look at it.

Until then it is "just a trend" and takes up mental space I feel is better spent digging down. Because that's where software interacts with hardware (the real world). That's the abstraction layer where you can "actually move the world", where goods are created.

Not everybody is capable or experienced enough to move across different abstraction layers and keep a proper mental model of each. SOMEWHERE you will probably have to draw a line.

Some people draw the line at "writing software". The execution target (understanding the hardware / runtime) is out of focus for effectively all programmers. I think this is partly because languages, tutorials, and bootcamps don't bother to dig deep there, because "the compiler will take care of it".

It just is not taught that way. What do you mean, Solaris? Sun? PDP? It's just a computer. Java is cross-platform. How does it do that? That's the JVM's job, not my job. I just write the software.


> I've just always considered understanding and being able to debug your environment to be a standard part of the job. As a professional software developer working in a Solaris environment, knowing how Solaris works, how to use the shell, tools, and other stuff is just basic.

If this was a standard part of the job, it wouldn't be my superpower. Working in a team where everyone (or almost everyone) can debug is amazing!


While it wasn't $5k, I was in a situation where I was told that I'd get something specific if I fixed a particular bug, and then didn't get it.

OTOH, I have gotten bonuses after the fact that weren't talked about at all beforehand. IMO, that works out a lot better for everyone.

I've decided that promised bonuses mean nothing and are a sign of deceit (why bribe me to do my job?) and to absolutely ignore them.

I do very much appreciate the unexpected bonuses, though. I've had them both in cash and time off and I'm not even sure which I prefer.



At a smaller scale, you don't necessarily need a dedicated team, you can just have a couple people who know to look for core dumps and know how to look at them (although gdb exe exe.dump ... bt gets you 90% of the way there 90% of the time), and whatever their real job is, it can be deprioritized to deal with urgent issues elsewhere without much fuss.
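
Spelled out, that one-liner is just (paths illustrative):

  $ gdb ./exe exe.dump   # the binary plus its core dump
  (gdb) bt               # backtrace of the crashing thread
  (gdb) bt full          # same, plus local variables, when you need more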

If you get to the point where the fixers are always mostly busy fixing things, you can give their real jobs away.


The section about Apple is quite wrong. Their products were held back by semiconductor technology as far back as I can remember. Some examples: never getting a laptop running with a G5, and with the early phone and tablet prototypes they certainly wanted to focus on efficiency in a small package. Buying PA Semi was integral to their product roadmap.

The stuff about Twitter I'm not sure about. You certainly don't need kernel expertise, or need to use the JVM at all, to build a similar product these days. It seems they could just be held back by legacy technology choices that don't really benefit the product, and which competitors probably won't need to bear the cost of supporting.

Companies should always be evaluating how critical their in-house technology is to their business and/or future product roadmap.


> Companies should always be evaluating how critical their in-house technology is to their business and/or future product roadmap.

For these sort of deep-expertise support teams, they would have to be considered over very long time horizons, since a kernel or JVM team might be needed to deal with a serious bug at any time, but there may not have been such a bug this year.


For larger companies that have the luxury of even contemplating developing in-house expertise, the next big question is: buy or hire external expertise, or develop it in-house? These aren't easy questions, since the investments are large and take a long time to pay off. It's easy to praise the good decisions some companies have made or laugh at the failures, but someone has to make these big calls at some point, and it's usually not the engineers and developers.


As engineers we should be thinking about this and telling management.

A few years back we decided to change the format all our graphics were stored in - which in turn meant calling new APIs to draw them. After a few rounds of meetings to figure out how many graphics there were and how much this would cost, I realized this wasn't something we should do. I continued to estimate, but I sent a strong email to my boss: "As a tech lead I forbid all in-house engineers from doing this work; there is no long-term value in learning how to do it, and we can hire third-party contractors who know the new API better than us. Also, we need to combine the contract with the other divisions needing to do this same thing: it isn't worth scripting things for just us, but a large contract will justify writing a script for some of the work, saving time and money." Immediately the whole tone of conversations changed around the company - managers (and I assume their technical people) realized I was right and all got together to get one contract to get the job done.



hmm I mean yes but

author mentions doing science because 'scala is slower than java' -- they're talking about in prod, but build times are also slower. why not just use better tools?

heard a one liner once about react I think which was like 'FB is hiring rocket scientists to get to par with 90s web performance'

twitter is a cool site but it isn't curing cancer, it isn't feeding people, it's solving rails bugs caused by a celeb selfie during the grammies.

is the subtext here 'hiring rocket scientists in organizational sea caves because you can't hire rocket scientists to run the company and impose good practices top down'



