Machine Learning: The High Interest Credit Card of Technical Debt (2014) (ai.google)
576 points by m1245 on June 18, 2018 | 88 comments



I recently came across this article:

https://en.wikipedia.org/wiki/Overfitting

Although it describes the issue as it pertains to statistics and machine learning, this is also exactly what ends up happening with a large codebase that has no clear requirements or test cases, where people just make incremental, piecemeal changes over time. You end up with an application that has been trained (overfitted) on historical data and use cases, but that breaks easily on slightly new variations: inputs that differ in some trivial way from anything the system has handled before, and that a better designed, cleaner, more abstract system would be able to deal with.

Given how much poor coding practices resemble machine learning (albeit in slow motion), it's hard to hold too much hope about what happens when you automate the process.


I really like this extension of the concept of overfitting to codebases in general.

I especially noticed this in libraries/packages that were "community owned" in a company--instead of one team owning the package and being the authority on deciding the long term roadmap and communicating with other teams about feature requests, deprecations, documentation, bug fixes, etc, the community at large, where "community" was very broadly defined as a team that for whatever reason had an interest in using/maintaining/adding onto the package, would collectively own the package.

Naturally, the result was exactly the scenario you described. Each team hacked on their own bit of functionality for their specific purpose, while doing their best to not affect or break the increasingly precarious tightrope of backwards compatibility. There was no long term architectural vision, so there was a definite need for refactoring--and yet no team had the incentive to invest the amount of time needed to do that. The documentation was woefully incomplete as well, and few people understood how the entire thing worked since each team would only interact with their small fraction of the code.


Two principles I live by (much to the annoyance of my bosses)

1. Don't fear the refactor.

2. If you don't want to rebuild your entire application from scratch, don't worry, a competitor will do it for you.

There's nothing wrong with creating something in increments. It's the fear of revisiting something that destroys a code base.


Your bosses might be right.

Technical debt, much like regular debt, can also be used as leverage to quickly gain a competitive advantage. While your competitors are busy refactoring/rebuilding perfect applications and creating hardly any more customer value, the scrappy startup that writes piles of spaghetti code might be building exactly what customers want.

Code quality != business value.


While this is clearly true (and is exactly what was being described when "technical debt" was coined), the unfortunate reality is that we often take on huge amounts of technical debt in order to fund the equivalent of pizza parties. Having eaten all the pizzas, we then have to pay back the debt and frequently the company can't afford it.

This is one of the reasons why you must not fear the refactor. Sometimes you need to get that code out the door because the business requires it. Then you need to pay back the debt -- by refactoring that mess every time you touch it in the future.

There is no such thing as "technical inflation" to magically wipe away our debt. It's important to have good lines of communication so that the business doesn't get used to squeezing development in order to eat pizza (because, why not? It's free!)


Piles of spaghetti code will give your customers what they want today, but rob them of the features they want tomorrow and make every future feature orders of magnitude more expensive to develop than they should be. That's the interest you pay on Tech Debt.

Much like regular debt, if you don't repay it, you go out of business and end up penniless.


> Code quality != business value

I don't think that's a given: in some circumstances code quality absolutely is business value. It might be better to say code quality can be, but isn't always, business value. As ever, context is the deciding factor.


Well, I would say technical debt is similar to the classic kind of debt: It may give you a short-term advantage (liquidity), but in the long term, there's interest on it. If not paid off, it grows exponentially.

So yeah, technical debt can be used as a tool, but it doesn't come for free.


I really don't think so. Apart from exceptional cases (if you're selling your code to another dev, maybe), code quality is never value to the user. That's not to say that good code quality is useless, of course, but the usefulness of code quality is not in the business value.


Technical debt already has a similar business concept in "expensive" money: funds that you raise from VCs on bad terms while in distress, because there's no other way to do what needs to be done fast enough. Programmers paid with expensive money who argue that they need more time to write high quality code because of the 'future' will seldom win that argument.


I think this is actually a better analogy than debt (notice that equity and debt are on the same side of the ledger); just as future valuation of the company is uncertain, so may also be the business value of the quick hack. I.e. in the same way that a certain VC investment may or may not be wise, the quick hack has the same uncertainty attached.

Add to this that the business people may have a bad grasp of the true cost of the hack, and the developers little insight into the business value of it, you get the current situation.


And when you clearly see that the whole product was a dead end, you can default on your technical debt. Saving you untold man hours.


> Technical debt, much like regular debt, can also be used as leverage to quickly gain a competitive advantage.

Unlike regular debt, technical debt is extremely hard to quantify.

You can't balance a business strategy if you can't estimate how much you're going to pay.


What tends to happen in reality is that once the code is 7-8 years old or more, and it has always been piecemeal spaghetti code, each change becomes exponentially more difficult to make.

There is the story by Robert C Martin about the company that made a really good C debugger back in the day. Then C++ came out and the company promised to make a version for it. Well, months came and went and eventually they went out of business. Because the first version of the debugger they wrote was awful code, it made changes really hard to make, and so they couldn't adapt to the changing market.


Business is mostly a math problem and most programmers don't really understand why they go to work.


Amen. Good enough working code gets your foot in the door. You pay later, but at least there is a later.


The danger with this view is that some people read it as "you never have to pay your debt back". But if the project lives long enough to be successful, you end up painting yourself into a corner where you cannot change a single thing without breaking something.


> 1. Don't fear the refactor.

Like most things in life, there is a balance. I have argued against large refactors many times. Often wanting to do a refactor is just a thinly disguised excuse to use some new technology (I'm as guilty of this as anyone else). Anytime a refactor comes up my goal is to figure out why:

1) What will the refactor fix?

2) What will the refactor potentially break? Are there tests around critical functionality?

3) Does the group proposing the refactor really understand the ins and outs of the application? When new people come into a system they often want to change it to fit their mental model of the problem, and miss subtleties of why the system is a certain way.

That being said, I evaluate small refactors anytime I have to touch a piece of code.


I am more inclined to your sentiment. Now there is no excuse for badly formatted code and being a lazy slob, and I never use the word refactor in the sense it is used here.

I often _redesign_ old code to meet new requirements and to support new features, but I would not call it refactoring.

I always strive to leave the code better than when I found it. But I would not name it refactoring.



"It's the fear of revisiting something that destroys a code base." So true.


Yep, when fear creeps in around modifying a part of an application it is time to have a very serious conversation about fixing that. It is one of the few cases where I find the refactor vs. creating customer value argument is more clear cut -- letting that fear linger is likely to spread to other parts of the code & turns into a human problem pretty fast.

Fixing might be a presentation, tests, documentation, refactoring, rewrite, deprecation, whatever. Just don't let it languish and the fear grow.


I wholeheartedly agree! Companies that delay tackling technical debt still ship features, but development gets slower and more error-prone as time passes. Because they keep shipping, they can fail to see how much faster they'd be shipping six months down the line if they tackled the debt that adds weeks to each feature being developed.

I've expanded on these thoughts before on my blog about technical debt inflation if anyone is interested https://scalabilitysolved.com/technical-debt-inflation/


Indeed. I used to be scared of database changes in case something went wrong. Now I realise the worst thing to do is to hack code on top of a poor database design to make up for it. That usually ends up far worse.


And one can extend that to businesses as well. How many established companies have been laid low by someone with a new process built on a more modern foundation?


It would be interesting if someone actually had data on this.

I suspect this is something "software engineering" researchers might study.


Whatsapp is a prime example in the tech vs tech space - ride-shares vs taxi services, automated freight loading, fedex vs ups in terms of automating their package sites. An old factory with 1000 workers not being able to compete on a cost basis with a new automated one is the story of the last 50 years I feel.


Agree with the other commenters, very interesting insight.

It maybe doesn't fit the metaphor quite as well, but as an operations person, I've frequently run into the "underfitting" problem. For example, we run Chef to manage our physical and virtual infrastructure. There are a ton of community-authored Chef cookbooks available. Which at first blush, sounds great. But often, they have grown over time to become these awful hydras that try to be all things to all people. PR after PR has added support for the specific use case of every organization that wants to run the cookbook in their own special way. The "Getting Started" section of the README eventually becomes a dumping ground of 900 attributes you need to set correctly, and yet somehow it still doesn't quite perform how you'd like.

In many cases, we've tried to use community cookbooks and even merge our own customizations back upstream. Only to eventually give up and write our own version that's 50 lines of Chef DSL/Ruby instead of 5,000 but does exactly what we need, the way we need, and no more. It's very possible to make a system too generic and configurable, to the point where it loses all meaning.


Found the exact same thing regarding the community cookbooks. We do use some though, it depends on the complexity and how well they work. I've either written some from scratch, taking pointers from the community ones or forked them to make them simpler and better suit our needs. Pull requests have been made where it makes sense.

Glad to hear we're not the only ones who found the community ones not perfect for every need.


> There are a ton of community-authored Chef cookbooks available. Which at first blush, sounds great.

Welcome to software development! Not as easy as it looks is it :)

EDIT you may find these articles helpful (or at the very least food for thought):

- https://blog.codinghorror.com/dependency-avoidance/

- https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-...


The problem with the analogy is that for a learning algorithm, there are clear definitions of the model complexity as it relates directly to the outcome being optimized. YAGNI applied to a model is a penalty term for parameters or various methods of regularization.
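
To make the "penalty term for parameters" concrete, here is a minimal sketch of L2 (ridge) regularization for a one-parameter linear model with no intercept. The function name `fit_slope` and the toy data are made up for illustration; real models have many parameters, but the shrinking effect is the same.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Least-squares slope for y ~ w * x with an L2 penalty on w.

    Minimizes sum((y - w*x)**2) + penalty * w**2, whose closed-form
    solution is w = sum(x*y) / (sum(x*x) + penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Hypothetical toy data lying exactly on y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unpenalized = fit_slope(xs, ys)              # plain least squares
penalized = fit_slope(xs, ys, penalty=14.0)  # the penalty shrinks w toward 0
```

The penalty trades a little fit on the data you have for a simpler model, which is the trade-off the comment says has no agreed equivalent for software.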

But when the “goal” of the system is just “arbitrary short term desires of management” you can easily point out the problems, but there is no agreement on what constraints you can use to trade-off against it.

Especially for extensibility, where you can get carried away easily with making a system extensible for future changes, many of which turn out to be wasted effort because you did not end up needing that flexibility anyway, and everything changed after Q2 earnings were announced, etc.

In those cases, it can actually be more effective engineering to “overfit” to just what the management wants right now, and just accept that you have to pay the pain of hacking extensibility in on a case by case basis. This definitely reduces wasted effort from a YAGNI point of view.

The closest thing I could think of to the same idea of “regularizing” software complexity would be Netflix’s ChaosMonkey [0], which is basically like Dropout [1] but for deployed service networks instead of neural networks.

Extending this idea to actual software would be quite cool. Something like the QuickCheck library for Haskell, but which somehow randomly samples extensibility needs and penalizes some notion of how hard the code would be to extend to that case. Not even sure how it would work...

[0]: < https://github.com/Netflix/chaosmonkey >

[1]: < https://en.m.wikipedia.org/wiki/Dropout_(neural_networks) >
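
For readers who have not met dropout, a minimal sketch of the idea follows, assuming the common "inverted dropout" formulation. The `dropout` helper and its arguments are illustrative and not taken from any particular library.

```python
import random


def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each unit with probability drop_prob and
    scale the survivors by 1 / (1 - drop_prob), so the expected value
    of each unit is unchanged."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_prob)
    return [a * keep_scale if rng.random() >= drop_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
layer = [1.0, 2.0, 3.0, 4.0]
noisy = dropout(layer, drop_prob=0.5, rng=rng)
```

The network has to keep working while random units are knocked out, much as services have to keep working while Chaos Monkey terminates random instances.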


Overfitting is a quantifiable problem. If you're not doing robust data segregation and CV you're not even engaging in elementary ML practices.


Only if the training data you got is representative of all future use cases. Good luck with that.


You can segment the validation to be data after a certain date, and train on data before that date. You get an accurate sense of how well the model will perform in the real world, as long as you make sure the data never borrows from the future.
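
A minimal sketch of that time-based split, assuming each record carries the date it was observed. The record layout and the name `split_by_date` are hypothetical.

```python
from datetime import date


def split_by_date(records, cutoff):
    """Split (day, feature, label) records into train and validation sets.

    Everything strictly before `cutoff` is used for training; everything
    on or after it is held out, so the model never sees the future.
    """
    train = [r for r in records if r[0] < cutoff]
    valid = [r for r in records if r[0] >= cutoff]
    return train, valid


# Hypothetical toy records: (day, feature, label)
records = [
    (date(2017, 11, 3), 0.2, 0),
    (date(2018, 2, 14), 0.9, 1),
    (date(2017, 6, 30), 0.4, 0),
    (date(2018, 5, 1), 0.7, 1),
]
train, valid = split_by_date(records, cutoff=date(2018, 1, 1))
```

Any feature computed from the full dataset (a global mean, for instance) would still leak future information, so such statistics need to be computed from the training portion only.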


That only ensures your model is accurate assuming real world parameters remain the same, which again, is prone to overfitting.

To use a real world example, financial models on mortgage backed securities were the root cause of the financial crisis, because they were based on decades of mortgages that were fundamentally different than the ones they were actually trying to model. Even if someone was constructing a model by training on data from say, 1957-1996, and validating using 1997-2006, they would have failed to accurately predict the collapse because the underlying factors that caused the recession (the housing bubble, prevalence of adjustable rate mortgages, lack of verification in applications) were essentially unseen in the decades of data prior to that.

Validation protects against overfitting only to a certain degree, and only to the extent that the underlying data generating phenomena don't ever change, which, in the real world, is generally a terrible assumption.


I'd probably put fraud ahead of models as the root cause. The entire purpose of those securities was to obscure the weakness of their fundamentals.


That's not hard and fast, though. While no model is perfect, robust models can "handle" outliers. Worst case, you know when it happens and train with more a priori.


Worst case? More like best case.

It's not about outliers. Let's say you're at a startup and you fit some model to your first 30 customers. It works great for your next 10 customers, but fails dramatically for your first enterprise client. Why? Because the enterprise client was fundamentally different from your previous 40 customers. If you fit your model on a population in which the relationship looks one way, then try to apply your model to a population with a different relationship, it will fail.

Machine learning and statistics are both applications of the same principles of probability and information theory. They work (for the most part) by modeling the world, capturing the relationships between random variables. A random variable can be any natural process that we can't express in precise terms, so we express it in probabilistic terms.

This is the same principle underlying the premise that "past results do not guarantee future success." The relationships between random variables in the world that affect success in anything -- stock market performance, legal outcomes, etc. -- might not be the same tomorrow as they are today.

And that's not even a matter of overfitting. That's just your ever-present real-world threat of having all your modeling work invalidated by forces outside your control. Overfitting happens when you, the data scientist, fit your model to random noise in the training data. An overfitted model will have bad generalization performance on held-out samples, even from the same population. It's not always easy or possible to detect overfitting, especially with small training sets.
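
The "fit your model to random noise" case can be sketched with a model that simply memorizes its training set. This is an illustrative toy, not anyone's production setup: the labels are coin flips, and `nearest_label`, `accuracy` and `sample` are made-up helper names.

```python
import random


def nearest_label(train, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(train, data):
    """Fraction of `data` whose label matches the 1-NN prediction."""
    hits = sum(nearest_label(train, x) == y for x, y in data)
    return hits / len(data)


rng = random.Random(42)  # fixed seed for repeatability


def sample(n):
    # Labels are coin flips, independent of x: there is nothing to learn.
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]


train_set = sample(200)
held_out = sample(200)  # same population, unseen points

# Each training point is its own nearest neighbour, so the noise is memorised.
train_acc = accuracy(train_set, train_set)
# On unseen points from the same population the model should do about as
# well as guessing.
held_out_acc = accuracy(train_set, held_out)
```

The gap between the two numbers is the symptom of overfitting, and it appears even though both sets come from the same population.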


What's the problem with that, though? Startups are usually advised to service one market, not several. If your first 40 customers were prosumers but then you have a prospective enterprise client, the logical response is to say no to the enterprise client and go after another 60 (or 60,000) prosumers.

Or at least understand that you're entering a new market and budget appropriately for development. Usually, if you're switching from prosumer to enterprise, you are very, very lucky if the sum total of changes you need to make is training a new machine learning model. To start out with, you usually need to get used to sales cycles that take 6-18 months, hiring a dedicated sales guy to manage the relationship, and handling custom development requests.


There's no problem with it, but some very intelligent people don't seem to realize that you can't just "use machine learning" and predict whatever you want. It's gotten better over the last few years, now that it's less new and magical than it used to be, but I still see it happen now and then.


Hopefully your analysts (which in this case includes your lawyers, accountants and statisticians) will tell you that the new client is different to the others and your models may not hold up and may need revision.

Hopefully you also listen to them.


Close. Extrapolation is possible using structural theories rather than only reduced form models.


Only if your structural theory is not-wrong enough.

Even if you KNOW that your model is not-wrong in the right direction and within acceptable orders of magnitude, how do you fit the parameters for that structural model? You need some kind of data, even if you're just using anecdata to pick magic constants.


All models are wrong, some are useful.

Fortunately models like these are often testable across many contexts, amenable to metastudies, available for calibration, etc.


That's my whole point. You just asserted that you can extrapolate outside a training set with a structural model. I am asserting that those "many contexts" and "metastudies" amount to a bigger, more representative training set.


What do you mean by CV? I'm not familiar with those terms. Thank you.


As sibling points out, cross validation, which is the front-line approach to avoiding overfitting for supervised classification problems.


It means cross validation. It is essentially a way of simulating how well your model will do when it encounters real world data.

When building a model, you divide your data into two parts, the training set and the testing set. The training set is usually larger (~80% of your original data set, although this can vary), and is used to fit your model. Then, you use the remaining data you set aside for the testing set by using your model to generate predictions for that data, and comparing it to the actual values for that data.

You can then compare the accuracy of the model on the training and testing sets to get an idea of whether your model generalizes well to the real world. If, for example, you find that your model has an accuracy of 95% on the training data, but 60% on your testing data, that means your model is overly tuned to features of the training data that may not actually be helpful for prediction in the real world.
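
The split described above is a single hold-out evaluation. k-fold cross-validation repeats it so that every data point is held out exactly once. A minimal sketch follows, using a deliberately trivial "model" that predicts the training mean; `k_fold_indices` and `cross_val_mse` are illustrative names, and real data would normally be shuffled before being split.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_mse(ys, k):
    """Average held-out squared error of a predict-the-training-mean model."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[j] for j in train_idx) / len(train_idx)
        mse = sum((ys[j] - prediction) ** 2 for j in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)


# Hypothetical target values
ys = [3.0, 5.0, 4.0, 6.0, 2.0, 4.0, 5.0, 3.0, 4.0, 4.0]
score = cross_val_mse(ys, k=5)  # 5 folds: train on 80%, test on 20% each time
```

Averaging over folds gives a steadier estimate than one 80/20 split, at the cost of fitting the model k times.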


Never seen the acronym (not really in the space) but I assume cross validation.


Camouflaged Vacuity


I assumed Code Versioning so that if you have robust data segmentation you have less uncertainty about the impact of change. However, I'm a tourist here and hope OP comes back to share.


Cross-validation: testing model fit on non-training data


I assumed Computer Vision.


Fantastic insight, really top-notch.

Just some random thoughts in no particular order - curious what you make of them:

- On the subject of incremental piecemeal changes over time with no requirements: don't you all find that in your workflows (when you're doing something for yourself), it is hard to step back and "architect" something? It is easier to just let it evolve.

- Likewise it takes real work and thought to organize something as simple as a spice rack. (I just keep opened packages of spices in the cupboard.) The knowledge that company is coming is one of the few pushes. But it kind of feels like it's being done for show.

- It's hard to add architecture when you know there's no team that is coding against it as an API. It's just you. It feels like that extra power is, kind of wasteful.

- The other thing is that it may be the case that you know there is some deeper level of architecture. In the case of my spices, for example, most of the opened spice packets I mentioned are actually mixes. (Such as grilled chicken spice mix.)

- If I had to architect my own spice rack, I should start by learning which spices I'm actually using more of. And since what I'm doing works, I don't actually care. Plus, it would be a step down: the first time I mixed my own spices, I would probably end up with a worse dish than pouring some out of a premixed packet.

- The first time you architect a "proper" framework rather than let your machine learning algorithm "overfit", the result is probably demonstrably worse.

- That is a lot of pressure on not architecting, and just continuing to (over)-fit.


This is where good logging helps.

The lifehack is to throw all your spices in a box and only pull them out when you need them and then leave them on the rack. Then throw away any spice you haven't used in n months and add it to a blacklist. The ones you use frequently should be prominently displayed and tested with extra care and possibly set up for autorenewal from the grocery.

Only introduce new spices when there's a recipe, and buy just the amount you need.

So too with code. Log your code paths, prune little used features, optimize the hell out of the most frequently used ones, introduce features sparingly and with purpose...
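
A minimal sketch of "log your code paths", assuming an in-memory counter is enough for illustration. In a real system the counts would go to whatever logging or metrics backend is already in place, and `track_usage`, `export_csv` and `export_xml` are made-up names.

```python
import functools
from collections import Counter

call_counts = Counter()  # hypothetical in-memory usage log


def track_usage(func):
    """Count every call so rarely used code paths can be found later."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] += 1
        return func(*args, **kwargs)
    return wrapper


@track_usage
def export_csv(rows):
    return "\n".join(",".join(map(str, row)) for row in rows)


@track_usage
def export_xml(rows):
    return "<rows/>"  # placeholder body for a rarely used feature


for _ in range(3):
    export_csv([(1, 2), (3, 4)])
export_xml([])

# Features called fewer than `min_calls` times are candidates for pruning.
min_calls = 2
prune_candidates = sorted(name for name, n in call_counts.items() if n < min_calls)
```

Note that a function that is never called does not appear in the counter at all, so a real audit would compare the counts against the full list of tracked features.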

I like this spice metaphor, thanks for it.


well-constructed != over-architected


Epicycles within epicycles eventually get replaced with a clean redesign (https://wikipedia.org/wiki/Paradigm_shift)

The tricky bit is mostly that you need a new theory of the data to have a better abstraction. That's the tricky bit.

Models generated by DL lack even a paradigm or theory or abstraction.


This is a brilliant insight.

One of the problems I’ve seen in research into technical debt is the lack of a good definition. This insight could form the basis of one.


"Given how much poor coding practices resemble machine learning (albeit in slow motion), it's hard to hold too much hope about what happens when you automate the process."

Your whole argument seems to be based on your personal experiences. Perhaps it is also thus vulnerable to some sort of overfitting :)


Hopefully code reviewers with institutional knowledge can advise on where to apply pruning and prevent code overfitting.

Pruning is also a common ML practice for reducing statistical overfitting, for example in decision trees.



which means it must be good, right?

I feel like there are a few of these frequent flyers... is there any way to figure out what they are?


"Show HN: Hacker News Classics" might fit the bill.

Discussion: https://news.ycombinator.com/item?id=16442888

App: http://jsomers.net/hn/

Source: https://github.com/jsomers/hacker-classics


Software Engineering Daily has a podcast episode with D. Sculley, one of the authors of this paper. It is quite interesting to listen to.

https://www.softwaredaily.com/post/5913c0e74ee01db33cacd027/...


If you're interested in this, be sure to read "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" [1] and the Rules of Machine Learning [2].

[1] https://ai.google/research/pubs/pub46555

[2] https://developers.google.com/machine-learning/rules-of-ml/


I love this paper. The further I get in my ML/stats career, the more relevant these lessons are. I would recommend anyone interested in building long lived ML products to read this.


This paper will eventually be seen as a landmark in the field of machine learning systems. Read it, learn, and write up what you discover along the way. This literature needs to grow.


I enjoy the paper as a starting point for discussions. However, given the varying definitions of technical debt, I think it is important to see additional elaborations and examples of real-world trade offs in production systems.

Perhaps the best overall wisdom this paper tries to impart is this: build awareness, culture, and tooling around your ML systems, both upstream and downstream. Never stop exploring and improving. Relentlessly try to slim down your models, simplify your pipelines, and bring people together to talk about all kinds of dependencies.


2014 feels like decades ago in AI. Looks like that paper is already a classic.


This is a great paper. It helped us think more rationally and helped us realize that we were not alone having difficulties running ML in production.


Is this still relevant?


I work at Google on a product driven by ML doing ranking and regression tasks. Can confirm, very relevant. That said, ML is usually superior to the rules and heuristics systems we've been able to come up with, so we take on the debt once we stop being able to improve our heuristics, but only once we've tried really hard at the heuristics such that we have a baseline quality bar to beat. That justifies the effort, but it's still a lot of work to be vigilant and keep an eye on shifts in signals, unintended dependencies, good metrics that mean something, etc.


What really bugged me for a while is how unbelievably easy it was to beat a very large amount of hand tuned code using ML. Going from 92.x% accuracy to 97% accuracy even without any tweaking at all feels like cheating.


Aha, and what you're actually getting is like 99% accuracy with incredibly costly false positives; as opposed to hand-tuned rules which made sure that most of the mistakes made are cheap ones.

If you are smart, then what you're doing is probably easily transformable to the set of rules you had before. At that point you can compare why exactly it's so good at the metric you're measuring so there's no "cheating".

Sadly, most ML consultants just take example code from one of the tutorials and then show you the metric it generates after a run.


Do you have more information on jaquesm's ML model than what he mentioned in the comment, or was it a sweeping claim about any high accuracy ML system?

There is no reason why the consequences of false positives and false negatives cannot be incorporated in the model itself. In fact for certain kinds of systems such as 'alarms', or 'imbalanced classes' this is pretty standard.
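
One simple way to take error costs into account is sketched below: keep the model's scores as they are and choose the decision threshold that minimizes total cost on labeled data. This is an assumption-laden illustration (the scores, labels, costs and function names are all made up); weighting the classes inside the training loss is another common route.

```python
def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Total cost of flagging every example whose score >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 0:
            cost += fp_cost   # false positive
        elif not flagged and label == 1:
            cost += fn_cost   # false negative
    return cost


def best_threshold(scores, labels, fp_cost, fn_cost):
    """Pick the candidate threshold with the lowest total cost."""
    candidates = sorted(set(scores)) + [float("inf")]  # inf = flag nothing
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, fp_cost, fn_cost))


# Hypothetical model scores and true labels (1 = positive class)
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]

cheap_fp = best_threshold(scores, labels, fp_cost=1, fn_cost=10)
costly_fp = best_threshold(scores, labels, fp_cost=10, fn_cost=1)
```

When false positives are the expensive mistake, the chosen threshold moves up and the system flags less, which is the behaviour the hand-tuned rules in the parent comment were built to guarantee.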


He doesn't and the incredibly sure way in which he speaks of a system without any knowledge of the context or the application is an interesting study in how online conversations derail.

Anyway, the misclassifications are much the same as with the original system, in fact on the same parts, only with far lower incidence. So to me it looks as if the ML system simply managed to extract a lot more features (and automatically) than I would have had time to do by hand. On top of that, it adapts more easily to new, previously unseen content, because I don't need to come up with a bunch of (reliable!) rules to tell those parts apart from the previous ones (this does require a complete retraining of the net).

For some subset of the problems available ML works very well indeed, for others it may be a marginal improvement and in many cases ML is just dragged in to a project even though it has no place there. If you're in the first category: consider yourself very lucky and reap the benefits.


> For some subset of the problems available ML works very well indeed, for others it may be a marginal improvement and in many cases ML is just dragged in to a project even though it has no place there

How to recognize which problems are well-suited for ML? Are there any rules of thumb for (relative) laymen already?


Are you at liberty to share, accuracy of what?


Can't speak for OP, but such accuracy numbers often hide a 20/20 hindsight bias.

After having built and run a rule-based system for a while, you always get tremendous subject matter expertise, a feel for what works.

Any rewrite of the system at that point will lead to much improved accuracy. The clarity is reflected in a better choice of the input signals, features, data preprocessing, metrics, workflows…

A "magic ML" (without domain understanding) beating well-tuned SME rules is a dangerous fantasy, in any non-trivial endeavour. In other words, without that clarity, you're better off gaining it first through simple iterations of rules, figuring out what matters.


Lego parts recognition.


nah, thankfully technical debt has been solved.


Gnome solved it. Just burn the universe every five years!


That should honestly be a more frequently employed tactic in codebases.


I'm definitely a believer in this. I read Carmack refer to it as "grooming" at one point. Codebases where components and subsystems are periodically rewritten naturally end up being easier to change, which is an undervalued metric.


Definitely agree. However it mostly depends on your product being well isolated little chunks of functionality which is rarely the case. So what happens is you end up with a perpetually half baked Cthulhu.


2.0 came out around Sep 2002 (a bit less than 16 years ago). 3.0 came out in Apr 2011 (7 years ago). Most of the components were the same between the last 2.32 release and 3.0, except for the very visible gnome-panel to gnome-shell switch. Gnome isn't a good example.


Sorry I forgot the /s on my original post.


It's more relevant every year.



