Machine Learning: The High-Interest Credit Card of Technical Debt [pdf] (googleusercontent.com)
135 points by sonabinu on Aug 4, 2015 | 26 comments



I've experienced issues like those mentioned. Imagine a recommendation service for online shopping, using a simple co-occurrence model, aka "People who viewed this also viewed". Hopefully, your model will eventually hit a steady state, with generally helpful recommendations. Now, imagine someone comes up with an almost-perfect replacement algorithm that provides a large lift in clickthrough rate, and it's being A/B tested against the original system.

The original co-occurrence system has access to all the data, including that from sessions exposed to the new algorithm. If the A/B test runs for long enough, the original system will learn the new system's behaviour and emulate it, because for a given seed item, a lot of the co-occurring clicks will be on items recommended by the new system. Although initially the new system will show a lift, eventually the two systems will tend towards showing the improved recommendations, and the lift will tend towards zero.
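
A toy illustration of that contamination (item names hypothetical; plain co-occurrence counts, no weighting or decay): once clicks driven by the new algorithm land in the shared click log, the old model's counts start reproducing the new model's recommendations.

    from collections import defaultdict

    # Co-occurrence counts: seed item -> {co-viewed item -> count}.
    # Both A/B arms write into the same click log, so sessions served by
    # the new algorithm also update these counts.
    cooccur = defaultdict(lambda: defaultdict(int))

    def record_session(viewed_items):
        # Count every pair of items viewed together in one session.
        for a in viewed_items:
            for b in viewed_items:
                if a != b:
                    cooccur[a][b] += 1

    def also_viewed(seed_item, k=5):
        # "People who viewed this also viewed": top-k co-viewed items.
        neighbours = cooccur[seed_item]
        return sorted(neighbours, key=neighbours.get, reverse=True)[:k]

    # Control-arm sessions.
    record_session(["camera", "tripod"])

    # Treatment-arm sessions: the new algorithm surfaced "lens_kit" next
    # to "camera", users clicked it, and those clicks land in the shared log...
    for _ in range(100):
        record_session(["camera", "lens_kit"])

    # ...so the old model now recommends it too, and the measured lift fades.
    print(also_viewed("camera"))  # ['lens_kit', 'tripod']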


Interesting!

Is this an example of where the old algorithm is capable of exploiting the information in its training database, but is not capable / not configured to ever explore? So by feeding it additional (context, recommendation, result) samples from the new algorithm, it is rapidly able to exploit the information to offer improved recommendations, even though it would never have proposed those recommendations?

More generally, it sounds like the old algorithm (and perhaps the new one too) is rigged to myopically make the best decision right now - to conservatively maximise the value of this recommendation - without considering the future value of carrying out some ongoing experimental work to try new things and grow a diverse training dataset, which could pay off in subsequent rounds of recommendation.

A simple-to-describe but sub-optimal strategy to improve this would be to use an epsilon-greedy recommendation system: e.g. set epsilon = 1%, so 99% of the time it makes a recommendation using the original algorithm, and 1% of the time it makes a recommendation at random (to gain novel information).
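
Something like this, where recommend() stands in for whatever the original algorithm exposes (the names and signature are made up):

    import random

    EPSILON = 0.01  # explore on 1% of requests

    def epsilon_greedy_recommend(seed_item, catalogue, recommend, k=5):
        # With probability 1 - EPSILON, exploit the existing recommender;
        # otherwise pick k random catalogue items to gather novel feedback.
        if random.random() < EPSILON:
            return random.sample(catalogue, k)  # explore
        return recommend(seed_item, k)          # exploit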

I read a little about this kind of thing a few years ago: explore/exploit tradeoffs, online learning, regret minimisation, bandit algorithms, contextual bandits, upper confidence bounds, ...


It sounds like you're talking about the idea of introducing noise in order to prevent stagnation and make sure learning continues.

One of the trivial ways to do this with a recommender system is to change the priority of some search results so that, say, a page 5 result shows up on page 1.
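
A minimal sketch of that kind of perturbation (function name and page size are hypothetical):

    import random

    PAGE_SIZE = 10

    def promote_deep_result(ranked_results, deep_page=5):
        # Occasionally pull one result from a deep page (e.g. page 5) onto
        # page 1 so the system keeps collecting feedback on it.
        results = list(ranked_results)
        start = (deep_page - 1) * PAGE_SIZE
        if len(results) > start:
            deep_idx = random.randrange(start, min(start + PAGE_SIZE, len(results)))
            results.insert(random.randrange(PAGE_SIZE), results.pop(deep_idx))
        return results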

You can also do something similar by introducing noise in neural networks for image processing.


A co-occurrence model isn't really meant to be used for exploration. I'm eliding a bunch of details, as it's just one of many different recommendation algorithms, and there's an exploration layer on top of the whole ensemble, which includes an epsilon-greedy component.


I may have missed the subtext/point of your earlier comment:

that the additional samples generated using the new candidate algorithm were visible to the existing algorithm, making comparison of the two algorithms difficult, and that this visibility (or its consequences) was not initially anticipated.


It sounds like biological growth: processes bootstrapped onto other processes.


>eventually the two systems will tend towards showing the improved recommendations, and the lift will tend towards zero

While this is probably true in many realistic cases, I'm skeptical on theoretical grounds.

Suppose the replacement algorithm happens to be run on a quantum computer. When you search for a book such as "I wish I knew a prime factor of 132,200,813,987,918,309", it near-instantly recommends you might be interested in the book "Interesting facts about 373,587,911".

If factoring has no efficient classical algorithm and the original system is running on conventional hardware, there's no way it can match the replacement.


>> The original co-occurrence system has access to all the data, including that from sessions exposed to the new algorithm.

This is obviously a bad A/B test design. The original algorithm shouldn't have access to the data generated by the new algorithm. Designing an adequate test for an ML system is often as hard as designing the ML system itself. And this is my main concern with machine learning.


> The original co-occurrence system has access to all the data, including that from sessions exposed to the new algorithm

Wouldn't the solution be to exclude that data from the original system?


The paper lists some of the downsides of machine learning but skips the upsides. Take something like "How Google Translate squeezes deep learning onto a phone", featured here recently (https://news.ycombinator.com/item?id=9969000): it may stop working well because the world changes and new fonts get trendy, but it is probably easier just to retrain on the new fonts than to adjust a non-machine-learning solution - if there even is one for that problem. "High-interest credit card" seems unfair; "occasional service charge" might be closer.


> The paper lists some of the downsides of machine learning but skips the upsides.

This is entirely correct but I do not think it is a valid criticism of the paper -- the paper does not set out to answer the question of "is machine learning worth using operationally?".


On the contrary, the first line of the abstract reads "Machine learning offers a fantastically powerful toolkit for building complex systems quickly."



A big downside, too, is incorrect usage of ML where a more mechanistic model would be more useful. Recent systems biology research is full of papers that look as ridiculous as someone trying to discover Newton's laws of motion with an SVM.

Hopefully things like probabilistic programming bridge the gap between ML and classical models.


Personally, I find the discussion of Undeclared Consumers much more worrying; I'd never thought of it in those terms before.

Even down to all those roll-your-own report GUIs plugged directly into unabstracted data sources.


I wonder how many of these issues are applicable to human processes ("headcount ML").

Glue code and correction cascades pretty much describe the average corporate reporting/BI department, with layers of complex Excel-based manual processes nobody really understands, which exist both due to piles of "corrections" ("we need it in 2 hours, just get it done dirty") and due to Excel limitations (like effectively reimplementing sharding and map-reduce in Excel with humans, because you've hit the 1M row limit). It is a wonder some of these businesses can reconcile their P&L with their cash flow.

Entanglement and feedback loops definitely apply to a lot of things like A/B testing or online marketing budgeting, and many so-called "data-driven" processes designed by managers miss out on a lot of important features/variables (and especially interaction terms), assuming they even worry about significance in the first place.


Can anyone suggest more resources on designing ML systems? Books, papers, talks... This was fascinating.


John Langford's hunch.net[0] has a lot of interesting stuff. One example: "Interactive machine learning is about doing machine learning in an interactive environment. It includes aspects of Reinforcement Learning and Active Learning, amongst others." [1] Tutorial slides [2].

A fair bit of this is theoretical, but the theory is extended to cover scenarios that more closely resemble operational use, rather than the textbook supervised learning world view, which assumes that (i) your training data appears magically, and you never get any more of it, and (ii) the predictions of the model are never actually used for anything.

[0] http://hunch.net/

[1] http://hunch.net/~jl/projects/interactive/index.html

[2] http://hunch.net/~jl/interact.pdf


Re: "And the requirement to maintain multiple versions of the same signal over time is a contributor to technical debt in its own right."

The authors don't elaborate on their reasoning here. Would someone care to comment on what they are getting at?

Now, how does one make the call whether this is technical debt? In my view, the answer hinges on one question: are multiple versions of a signal required while different parts of the system settle over different time periods? If so, then reality is intrinsically complex in that way, and I would not expect a system to gloss over that complexity. A system that simply reflects the inherent domain complexity is not adding technical debt, the way I see it. Am I missing something?
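
For concreteness, a toy sketch (signal names and weights made up) of what keeping multiple versions of the same signal can look like, with each consumer pinned to the version it was trained against:

    # Two versions of the same "user_engagement" signal kept alive because
    # different downstream models were trained against different definitions.
    def user_engagement_v1(user):
        return user["clicks_7d"]

    def user_engagement_v2(user):
        return 0.7 * user["clicks_7d"] + 0.3 * user["dwell_minutes_7d"]

    SIGNALS = {
        ("user_engagement", 1): user_engagement_v1,
        ("user_engagement", 2): user_engagement_v2,
    }

    def get_signal(name, version, user):
        # Each consumer stays pinned to the signal version it was trained
        # on, so both versions must be maintained until everyone migrates.
        return SIGNALS[(name, version)](user)

    user = {"clicks_7d": 12, "dwell_minutes_7d": 30}
    print(get_signal("user_engagement", 1, user))  # 12
    print(get_signal("user_engagement", 2, user))  # 17.4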


This paper was presented at a NIPS workshop last year.


Is there a video online?


No, as far as I know that workshop was not recorded.

https://nips.cc/Conferences/2014/Schedule?type=Workshop


The ultimate litmus test for any ML system is whether it impacts a company's bottom line, and the best and most direct place to prove this is in the financial markets, with a hedge fund for example.


So the proper test for an ML system is one that you would expect the average human with a 140-160 IQ and 5 years of experience to fail at performing manually? Because most active portfolio managers do not beat the market.


Isn't that the ultimate test of any system developed for a company? How does that have anything to do with the article?


Hedge funds are extraordinarily unrepresentative companies.



