Why Multi-armed Bandit algorithms are superior to A/B testing (with Math) (chrisstucchio.com)
209 points by yummyfajitas on June 3, 2012 | 50 comments



For what it's worth, I've been following this cross-Internet debate with more than a little professional interest. Cards on the table: I have coded A/B testing software, I frequently code and/or administer it for clients (often in ways which are provably suboptimal), and I am a dirty loyalty-free scientist-cum-capitalist-pig who would stab A/B testing in the back in a second if I thought there were an easier way to extract more money for the same amount of work.

I strongly, strongly suggest that anyone attempting to look at this problem from the perspective of a site owner rather than a mathematical abstraction read and digest btilly's comment from earlier this week:

http://news.ycombinator.com/item?id=4040616

The issues he lays out are very real in the course of practical use of site testing to actually make money. In particular, his #2 would scare the heck out of me, in a much deeper way than "A/B testing provably doesn't minimize regret" worries me in the other direction. (Or, e.g., other flaws with particular A/B testing implementations. For example, repeatedly checking the results of your A/B test and deciding to end it when you see significance has been explained quite a few times as a bad idea, with stats to match. However, even if you check like a hyperactive squirrel, you're still winning, you're just winning less often than you think you are. Take your B- in stats class but proceed to make motivational amounts of money for the business.)
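To make the "checking like a hyperactive squirrel" point concrete, here is a rough simulation under made-up parameters (base rate, batch size, and horizon are all assumptions, not numbers from the thread): even with no real difference between the variants, stopping at the first significant-looking check declares a "winner" far more often than the nominal 5% rate would suggest.

    # Rough, illustrative simulation of the peeking problem: both variants are
    # identical (the null is true), but we run a two-proportion z-test after every
    # batch of visitors and stop the moment it looks significant at ~5%.
    import random, math

    def peeking_false_positive_rate(runs=2000, p=0.05, batch=200, checks=50, z_crit=1.96):
        false_positives = 0
        for _ in range(runs):
            conv, n = [0, 0], [0, 0]
            for _ in range(checks):
                for arm in (0, 1):
                    for _ in range(batch):
                        n[arm] += 1
                        conv[arm] += random.random() < p
                pooled = (conv[0] + conv[1]) / (n[0] + n[1])
                se = math.sqrt(pooled * (1 - pooled) * (1 / n[0] + 1 / n[1]))
                if se > 0 and abs(conv[0] / n[0] - conv[1] / n[1]) / se > z_crit:
                    false_positives += 1   # "significant!" -- test stopped early, wrongly
                    break
        return false_positives / runs

    print(peeking_false_positive_rate())   # typically well above the nominal 0.05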

The worst possible takeaway you, personally, the typical HN reader, could possibly have from this debate is "Oh, I guess I shouldn't A/B test then." Pick A/B testing, bandit testing, whatever -- any option in the set, even with poor algorithms and/or the easiest errors I can think of, strictly dominates not testing at all. (Actually testing today is also better than "testing... someday", which from my own experience and that of clients I know is something which is very easy to slip into even if you theoretically know you should be doing it.)


Pick A/B testing, bandit testing, whatever -- any option in the set, even with poor algorithms and/or the easiest errors I can think of, strictly dominates not testing at all.

So I submitted this post and went off to boxing. On the train ride back, I thought, "I hope I don't make people think they shouldn't A/B test." And at the same time you were writing this comment, I added a conclusion to my blog post saying the same thing.

Bad A/B testing is an 80% solution. Good A/B testing is a 90% solution. Good bandit is a 95% solution. 80% >> 0.

A/B testing has another benefit to software engineers that bandit doesn't - it lets you delete code.


Unless I misunderstand bandit algorithms, there's a trivial modification that makes the actual, practical administration of them essentially identical to A/B testing with regards to when you can rip out code.

If A smashes B, then bandit will converge in a very obvious manner on A, and you pick it and delete the B code branch, accepting future regret from the possibility that B was in fact better as a cost of doing business. If A doesn't smash B, then at an arbitrary point in the future you realize it has not converged on either A or B, pick one branch using a traditional method like "I kind of like A myself", delete the B code, and go on to something that will actually matter for the business rather than trying to minimize your regret function when the difference between the two possible solutions is (provably) likely to be a non-motivational amount of money.
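Sketched as code, with arbitrary thresholds that are my assumptions rather than anything from the comment above (a 95% traffic share standing in for "converged in a very obvious manner", and a 90-day deadline for "an arbitrary point in the future"):

    # Administrative rule on top of any bandit that exposes per-arm play counts
    # and observed conversion rates: keep the branch the bandit has visibly
    # converged on, or, after a deadline, pick one and move on.
    def branch_to_keep(plays, means, days_running, share_cutoff=0.95, deadline_days=90):
        total = sum(plays)
        shares = [p / total for p in plays]
        if max(shares) >= share_cutoff:
            return shares.index(max(shares))   # A smashed B: delete the other branch
        if days_running >= deadline_days:
            return means.index(max(means))     # no clear winner: pick one and go do something else
        return None                            # keep the test running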

Please feel free to correct me if I'm wrong on this -- I have a bad flu today and take only marginal responsibility for my actions.


You are correct for practical purposes.

However, you are mathematically wrong. Doing things the way you describe is another form of A/B testing, just one designed to reduce the regret incurred during the test. You still get the linear growth in regret I described (i.e., there is some % chance you picked the wrong branch, and you are leaving money on the table forever after).

Of course, your way is also economically correct, and the theoretical CS guys are wrong. They shouldn't be trying to minimize regret, they should be minimizing time discounted regret.
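Spelled out in one standard notation (my notation, not the parent's), the distinction is between plain cumulative regret and a time-discounted version with some discount factor gamma:

    % Cumulative regret after T plays, where \mu^* is the best arm's conversion
    % rate and \mu_{a_t} is the rate of the arm actually shown at time t:
    R_T = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right)

    % The "economically correct" objective discounts future regret, e.g. with a
    % factor 0 < \gamma < 1 (the choice of discounting scheme is an assumption):
    R_\gamma = \sum_{t=1}^{\infty} \gamma^{t} \left( \mu^* - \mu_{a_t} \right)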

(Yes, I'm being extremely pedantic. I'm a mathematician, it comes with the territory.)


Well, fundamentally, website optimization is not a multi-armed bandit problem. You are not simply choosing between machines. Instead, each website change is another node on a tree, and you want to find the highest-performing path in a constantly changing world, with the additional constraint that you can't keep incompatible machines around for very long.

A guess at a good strategy in those situations would be something that enables you to choose a good path from a small number of options and keep doing so, on the basis that the cumulative advantage will be far more important than any particular choice in itself. It sounds like A/B testing would win if you could actually map the complexity of the problem correctly.


I think that, assuming stationary behavior (a bad assumption), given enough time either A will smash B or B will smash A. As in, it's almost impossible that they are exactly equal in performance, and given enough impressions even the smallest difference will be optimized for.

I think that's a bad thing, though. Optimizing a single conversion metric is only worthwhile for some (large or medium) effect size. Past that point, criteria like "I kind of like A" are much more important.

In A/B testing you see this when your ambitious testing campaign returns "insignificant". In MAB you see it when two choices run at roughly 50/50 enrollment for a long period of time.

So on these two extremes, I think practical use of A/B and MAB should be roughly identical. In the middle ground where A is usefully but not incredibly better than B, I feel they must differ.


This is correct. One can also modify most bandit algorithms so they stop exploring at some point.


This, I think.

Once you get your expected regret below the cost of maintaining the old code, rip out the old code!
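One way to make that concrete; every input here is something you would have to estimate yourself, so treat it as a back-of-the-envelope sketch rather than a formula from the thread:

    # "Rip out the old code" check, all inputs assumed/estimated:
    # p_wrong       -- current probability the trailing arm is actually better
    # lift_if_wrong -- extra revenue per visitor you'd forgo in that case
    # visitors_left -- visitors expected over the horizon you care about
    # maintenance   -- cost of keeping both code paths alive over that horizon
    def safe_to_delete(p_wrong, lift_if_wrong, visitors_left, maintenance):
        expected_regret = p_wrong * lift_if_wrong * visitors_left
        return expected_regret < maintenance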


Wouldn't bad A/B testing be a 50/50 solution? Is making decisions on data without statistical significance any better than throwing darts blind?


Consider the bad A/B testing algorithm implemented by Dr. Awesome. Instead of reporting statistical significance, if the result would be statistically significant at below a 10% chance of coincidence, he reliably reports "It was Awesome!" Dr. Awesome then promptly burns his notes to warm his awesome heart.

Even though many people would take issue with using 10% (a lot of practitioners like 5%) and Dr. Awesome has some serious issues with data retention policies, if you always follow Dr. Awesome's advice, you'll win 9 times for every time you lose. I'll take those odds.

Say Dr. Awesome also has one additional problem: one time in twenty, regardless of the results of the A/B test, he just can't help himself and says "It was Awesome!" anyhow. If you follow his advice, you'll now win approximately 5 times for every time you lose. I'll take those odds, too.


There is a large gap between true statistical significance and a fair coin toss. Bad testing (of whatever flavor: A/B, MAB, etc.) is likely to land somewhere in that gap. Most likely worse than proper testing, but also quite likely better than tossing a coin or throwing darts.


A prerequisite for making any of this work is having a statistically significant number of visits to your site on a daily basis, right?

I think many people first have to figure out how to cross that bridge before they start to worry about optimizing what's on the other side.


Thank you for saying that. For people who want to dive in deeper, the discussion below http://news.ycombinator.com/item?id=4053739 is highly worthwhile as well.

And you are absolutely right. Even with bad assumptions and techniques, actually testing beats not testing by such a ridiculous margin that you just need to do it.


In practice, to add to that: variations which perform statistically significantly better one day may perform worse the next. For example, because happy people like to click on one variation while stressed people prefer another.

Keep A/B tests running even after you've decided on the best variation, to make sure that the uplift you've observed is real! See http://john.freml.in/ab-testing-significance


I think concern #2 was addressed in the original article: add a "fade out" threshold so that results are fixed to some time span.


Let's settle this with science rather than rhetoric. I'd like to do some proper comparisons between bandit algorithms and A/B testing. Unfortunately we haven't been saving time series data at Myna, so we don't have any test data. If anyone has time series data from an A/B test, and is happy to donate it to the cause, please get in touch (email in profile).

Updated for clarity.


Write out what .CSV columns you need and what formats they need to be in, and I will happily get you this for a handful of A/B tests. (Though probably not faster than late July. As much as I love A/B testing there is the small matter of a wedding and honeymoon to throw a wee bit of a wrench into my near term schedule...)


Gosh, congratulations Patrick. You have helped so many of us. Good luck ;)


Thanks. I'll be in touch.


Rather than try to mine historical data, run an experiment to pit UCB against Neyman-Pearson inference. For some A/B tests, split the users into two groups. Treatment A is A/B testing, treatment B is UCB.

In A/B testing, follow appropriate A/B testing procedures: pick a sample size prior to the experiment that gives you appropriate power, or use Armitage's rule for optimal test termination. (Email me if you're interested, I'm happy to send over papers/scan relevant pages from his book.) However, it's probably best to use a fixed sample size, as that is what most real-life A/B test practitioners use. Picking the sample size can be a bit tricky, but as a rule of thumb, pick something that is large enough to detect differences in treatments as small as 1 percentage point.

In treatment group B, use the UCB1 procedure. Subject the users to whichever design UCB1 picks, and continue with the learning.

Do not share any information between treatment groups A and B.

Run these tests for a sufficient amount of time over a largish number of clients, and then use permutation tests to determine which treatment, UCB1 vs Neyman-Pearson, performs better.
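For anyone who wants to implement treatment group B, a minimal UCB1 sketch (this is the standard Auer/Cesa-Bianchi/Fischer formulation; pull() is a stand-in I made up for "show a visitor this design and observe a 0/1 conversion", not part of the proposal above):

    import math, random

    def ucb1(pull, n_arms, total_plays):
        plays = [0] * n_arms
        rewards = [0.0] * n_arms
        for arm in range(n_arms):                 # play each arm once to initialise
            rewards[arm] += pull(arm); plays[arm] += 1
        for t in range(n_arms, total_plays):
            scores = [rewards[a] / plays[a] + math.sqrt(2 * math.log(t) / plays[a])
                      for a in range(n_arms)]
            arm = scores.index(max(scores))       # arm with the highest upper confidence bound
            rewards[arm] += pull(arm); plays[arm] += 1
        return plays, rewards

    # Toy example: two designs with true conversion rates of 5% and 6%.
    plays, _ = ucb1(lambda a: random.random() < (0.05, 0.06)[a], 2, 100000)
    print(plays)   # the bulk of the plays typically lands on the 6% arm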

In all the simulations I've seen, UCB outperforms simple A/B testing, but it would be great to see some empirical evidence as well.


Most websites lack sufficient traffic to reach statistical significance in a short time frame. Sure, Google and Facebook can run a test and get real results in a day (or even hours), but the rest of us need weeks or months to do things properly.


That really depends on the level of significance and the size of the effect you wish to detect.


Unfortunately most of the A/B tests that I am involved with at the moment simply do not look anything like what you'd want for a traditional bandit approach. For a start the data is fundamentally 2-dimensional. I am testing by user, and users may take days to hit a conversion metric. Furthermore a large chunk of the population has to go in the test before I get any data at all about how any of them will perform.

This is not an uncommon scenario when testing an email program. And it points to operational reasons why A/B testing may be preferred even if you believe that it is statistically worse.


Actually you are correct that in situations with latency (i.e., you have the opportunity to play a machine at t=0, another opportunity to play at t=1, but you don't receive results of the plays until t=10), UCBx doesn't work.

It's not even a question of statistical power, it's just a matter of typing. Typical bandit algorithms use the function play_machine: Machine -> Float. In the situation you describe, the type signature is Machine -> Future[Float].

I'm working on a modified version which handles those situations, however. It's along the lines of UCB2, tweaking the formula so that the numerator is proportional to the number of plays you've submitted while the denominator is the number of plays you've received. Once I finish the math I'll write it up.
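Purely as an illustration of the kind of tweak being described (the parent explicitly says the math isn't finished, so this is a guess at the shape, not his algorithm): keep separate counts of plays submitted and results received, let the log term grow with submissions, and divide by the results actually received per arm.

    import math

    def delayed_ucb_scores(submitted_total, received, mean_reward):
        # submitted_total: plays sent out so far across all arms
        # received[a]:     results that have actually come back for arm a
        # mean_reward[a]:  observed mean over the received results for arm a
        return [mean_reward[a] + math.sqrt(2 * math.log(submitted_total) / received[a])
                if received[a] > 0 else float("inf")   # arms with no results yet go first
                for a in range(len(received))]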


The problems with UCBx in the real world are much bigger than that.

The biggest theoretical problem is how it performs when conversion rates are changing out from under it. Which can happen either because your business has natural fluctuations, or because you run overlapping tests - and adoption of one good test can mess up the analysis of unrelated tests that are running at the same time.

You can solve this problem simply by using only data collected in your exploration phase, and throwing away all data collected during exploitation for statistical purposes. However, I've yet to see anyone promoting MAB approaches - including you - express any awareness of this rather significant problem.


Could you explain the problem a little more? From what you wrote here, it sounds like the problem is that if the world changes, it can mess up your answers, and bandit will take exp(N) time to catch up.

This is a problem for A/B testing also - if the world changes after your test is finished, you just got the wrong answer. Or am I missing something?

This is why you run tests which you expect to be statistically independent of each other and why you stick to results you expect to be durable.

I'd love it if you email me about it (or post publicly on the topic). I'm thinking of writing a "real world bandit benchmarking suite" which will take into account as many such situations as possible.


The problem is that the world changes - constantly. What changes could mess you up? When we apply math to the real world we should assume as little as possible.

A/B testing is robust in the face of absolute changes in conversion rates so long as preferences remain consistent. Of course preferences do not always remain consistent, but that is substantially more likely to happen than that conversion rates do not budge.

Traditional MAB approaches are not robust in the face of absolute changes in conversion rates, even if preferences remain consistent. The problematic change is what happens if conversion rates improve while the worse version is ahead. Then you can come very solidly to the conclusion that the worse version is better, and be stuck on that for a depressingly long time. The smarter the MAB algorithm, the more solidly you can make that mistake.

Is this likely? Well, most businesses have regular fluctuations in conversion rates. Furthermore, websites under continuous improvement are constantly introducing changes. It is therefore not uncommon, while testing one thing, to make independent changes that are likely to improve conversion rates.

But as I said, the simple change to throw away data collected during exploitation makes a MAB approach also robust in the face of absolute changes in conversion rates, as long as preferences remain consistent. Doing so increases average regret by a constant factor in the case where conversion rates never change.

(There are complications in the more than 2 arm case. In particular the fact that you're uncertain about A vs C doesn't mean that you should continue exploring version B. So some observations should count as exploration between a strict subset of the versions. But that's a technical detail.)
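A sketch of that bookkeeping change on top of a plain epsilon-greedy bandit (my construction for illustration, not code from this thread): only plays made while exploring update the per-arm statistics, so exploitation traffic cannot skew the estimates when conversion rates drift.

    import random

    def epsilon_greedy_explore_only_stats(pull, n_arms, total_plays, epsilon=0.1):
        plays = [0] * n_arms        # exploration plays only
        rewards = [0.0] * n_arms    # rewards observed during exploration only
        for _ in range(total_plays):
            exploring = random.random() < epsilon or min(plays) == 0
            if exploring:
                arm = random.randrange(n_arms)
            else:
                arm = max(range(n_arms), key=lambda a: rewards[a] / plays[a])
            r = pull(arm)
            if exploring:           # results gathered while exploiting are thrown
                plays[arm] += 1     # away for statistical purposes, as suggested above
                rewards[arm] += r
        return plays, rewards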


First off, I'm all for settling this with data and math.

However, I'm afraid that some practical matters were lost in the previous discussions. As I see it, things happened somewhat like this: 1) A/B testing starts to become a Big Deal; 2) There's a flurry of articles about A/B testing of various quality; 3) Lots of people implement A/B testing, mostly in a poor way; 4) The 20 lines of code article, which would probably really help most people who haven't done A/B testing correctly; 5) Intense discussion about correctness and efficacy that won't impact those people.

I think getting to the bottom of this is important, but I think the multi-armed bandit article that started this would do far more good than harm for the droves of people trying and largely failing to get good data from their poorly done A/B testing.


I'm all for multi-armed bandits. So much so that I've founded a startup to bring them to the masses (http://mynaweb.com/ - sign up now!). I'm absolutely certain that an appropriate MAB algorithm will outperform A/B in the overwhelming majority of cases, and be simpler and more flexible to use in practice.

However, some reasonable objections have been raised in the previous discussions. (In case anyone is keeping a record, here are the ones I have been involved in:

- http://news.ycombinator.com/item?id=4052997
- http://news.ycombinator.com/item?id=4040022
- http://news.ycombinator.com/item?id=3928929
- http://news.ycombinator.com/item?id=3867380
- http://news.ycombinator.com/item?id=2831455 )

The typical way for a company to handle these objections would be to ignore them or publish a few pretty graphs and drown the objections in FUD. I think the community deserves better, and would like to do a proper comparison. It will not only be informative but also help us design a better algorithm. Everyone wins.


> I've founded a startup to bring them to the masses ... It will not only be informative but also help us design a better algorithm. Everyone wins.

Yes, this is how everyone wins. Those with the ability and drive can end up making a difference, largely through providing a framework or library for doing this right. This is why I think a continuing exploration of this is worthwhile. Until that time, I think the masses should be encouraged to do what works best (or at least better) for them in the present.

For now, I think I'll let this rest. I'm on the verge of ranting against a general trend (checking off boxes by following steps rather than understanding what you're doing), but this is the wrong place for that and it deserves more thought and time than I'd give it here.


What’s being glossed over here, and explains a lot of the confusion around which is the best method of “solving” the multi-arm-bandit problem, is the classical bias-variance tradeoff. All of the methods presume some model of the problem, and some of those models are more flexible than others. When a model is more flexible, it allows solutions to take on more shapes and must burn more of its training data choosing among those shapes. Models that are more biased toward a particular shape, on the other hand, can use more of their data for convergence and so converge more rapidly.

Which method is “best” depends on what you know about the problem. Does its optimal solution look a certain way? change over time? and so on. If you’re willing to bet on your answers to those questions, you can choose a method that’s biased toward your answers, and you’ll converge more rapidly on a solution. The risk, however, is that you’ll bet wrong and converge on a poor solution (because your biases rule out better solutions).

If you’re not willing to bet on your answers, you can choose a method that will place bets for you based on what it sees in the data. But now you’re burning some of your data on betting. So that’s the tradeoff: you can use more of your knowledge to place bets (and risk placing the wrong bets), or more of the data’s knowledge to place bets (and burn some of your data on betting). Where you adjust the slider between those two is up to you.

Which brings us back to our original question. Which method of solving the multi-arm-bandit problem is best? It depends a lot on where you want to adjust the slider. Which depends on your knowledge, aversion to risk, and expected payoffs.

In life, sometimes one size does not fit all. If you’re going to test one shoe size against another, make sure you know which foot will end up wearing the winner. Likewise, if you’re going to compare algorithms for solving the multi-arm-bandit problem, make sure you know the particulars of the problem you need to solve.


What is the half-life of this debate? 300 days or less? http://news.ycombinator.com/item?id=2831455

Sadly, the option to drop this problem over Germany doesn't exist any more.[1]

[1] http://en.wikipedia.org/wiki/Multi-armed_bandit#Empirical_mo...


As someone who works for a major e-commerce site, I am often the one who has the most influence when it comes time to decide which testing method to adopt. Multi-armed Bandit testing can be good, just as standard A/B testing can be good. But the factor which trumps all of these are the total costs of testing (and the return on investment to the business). One must consider the following before undertaking any of these testing methods:

1. Implementation Costs - How much time will it take to implement the testing code? Some tests are easier to implement than others.

2. Maintenance Costs - How much time will it cost to maintain the test for the duration of the testing period? We've ignored this in the past only to realize on occasion that implementation introduces bugs which incur cost and can be disruptive.

3. Opportunity Costs - What is the cost of doing the test versus not doing the test? Consider setup time, analysis, and final implementation.

After going through a few tests now, we have a pretty good sense for what the total cost to the business is. We don't really look at it as adopting one test method over the other, but instead rely upon the projected ROI to test this versus that, versus doing nothing.


If you've conducted multiple tests and "time to implement the testing code" is a major consideration, then you're doing it wrong. If ROI is also a major consideration, then again you're doing it wrong.

Seriously, to add an email test right now at the company I'm contracting for takes 2 lines of code. One appears in the program that sends the email and looks like:

    $email_state_contact->ab_test_version("test_1234", {A => 1, B => 1});
where test is the name of a test, and 1234 is a ticket number to avoid accidental conflicts of test names. The other appears in a template and looks something like this:

    .../[% ab_test.test_1234 == 'A' ? 'button1' : 'button2' %].png...
That's it. The test automatically shows up in the daily reports. When it wins, you get rid of that code and put the right thing in the template.

Done.


I can imagine a number of situations where the implementation is significantly more complex. While ideally A/B tests should be looking at relatively small changes, where each change is independent, many times people are making profoundly larger changes.

If you are testing the conversion rate in shopping carts, and the change involves a drastic redesign of the flow through the shopping cart process, that could be a serious technological difference and require substantial time to implement.

Not every test is as easy as changing the copy on an email.


Even if you're making larger and more complex changes, the overhead of your testing methodology remains the same. That is, how you measure things should be a fixed (small) effort; the cost of building the test is whatever the test is.

In other words, multi-armed bandit versus A/B test is something that you shouldn't be deciding based on the effort of the testing methodology.


I don't think he was referring to the technology behind the A/B test itself, but rather the technology behind the change that was being made.

That's how I interpreted his statement. I agree with you that the actual A/B testing overhead should be minimal and fairly trivial to put into place.


Disclaimer: I also have software for running AB/MVT as well as adaptive control problems (so bandits as well as extended sequential decisions) at www.conductrics.com.

I wouldn't sweat too much over UCB methods vs e-greedy or other heuristics for balancing explore/exploit. E-greedy (and e-greedy decreasing) is nice because it is simple. Softmax/Boltzmann is satisfying in that it selects arms weighted by the estimated means, and UCB-Tuned and UCB-Normal are nice because, like A/B testing, they take variance measures directly into account when selecting an arm.

Take a look at this paper from Doina Precup (who is super nice BTW) and Volodymyr Kuleshov: http://www.cs.mcgill.ca/~vkules/bandits.pdf They have comparisons between various methods. Guess what - the simple methods work just fine.

Of course there are various Bayesian versions - esp. of UCB. Nando de Freitas over at UBC has a recent ICML paper on using the Gaussian Process for bandits (based on a form of UCB). See http://www.cs.ubc.ca/~nando/papers/BayesBandits.pdf I have not given it a tight read, but I'm not sure what the practical return would be. Plus you have to fiddle with picking a kernel function, and I imagine length scales and the rest of the hyperparameters associated with GPs. I did read a working paper from Nando a few years back that used a random forest as a prior - I can't seem to find it now.

BTW - John Langford is program chair of this year's ICML over in Edinburgh. If you are in the UK it might be worth it to pop up and attend. Plus Chris Williams is there at Edinburgh, so maybe you can corner him about GPs. Although he has moved on from GPs, he still wrote (well, co-wrote) the book and is one of the smartest people I have ever met.
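For readers who haven't run into them, the two simplest selection rules mentioned above look roughly like this (the parameter values are arbitrary; tau is the softmax/Boltzmann temperature):

    import math, random

    def epsilon_greedy_choice(means, epsilon=0.1):
        if random.random() < epsilon:
            return random.randrange(len(means))    # explore uniformly at random
        return means.index(max(means))             # otherwise exploit the best estimate

    def softmax_choice(means, tau=0.1):
        weights = [math.exp(m / tau) for m in means]   # arms weighted by estimated mean
        r = random.uniform(0, sum(weights))
        for arm, w in enumerate(weights):
            r -= w
            if r <= 0:
                return arm
        return len(means) - 1                      # guard against floating-point edge cases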


Doesn't this method also (over)simplify? It goes into pseudo-explore mode anytime there is a >0 probability that the presumed worst case is actually better than the presumed best/better case. Shouldn't there be a threshold to that process, so that the presumed worst case must have at least a (for example) .05 probability of being better before the algorithm gives it a shot?


So am I getting this right?

(Edit) So the two steps to run are:

1. Run a traditional A/B test until 95% confidence is reached. This is full exploration.

2. Then, switch to the MAB after that, showing the better performing variant most of the time. As time increases, the display of the worse performing variants decreases.


Option 1 will NOT give you the correct answer. You CANNOT use confidence intervals as a stopping criterion. If you do this, you end up running many tests, and then you need to apply a multiple-test correction to account for this. Otherwise you run a VERY HIGH risk of picking the wrong result.

I emphasize, because this is a common mistake made by A/B test practitioners. For a fuller discussion of the problems, check out the papers by Armitage (frequentist) and Anscombe (Bayesian) on the topic. Or see my summary of the issue here:

http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-te...


Sorry I wasn't clear. I meant run #1 first, then run #2. I didn't mean them as different options.


This article suggests another option: run UCB1 (a specific MAB variation) from day one and you'll get the benefits of MAB without the limitations of A/B testing or the mathematical fallacies of the prior "20 lines of code" MAB for n-cell tests.


I thought the article was suggesting that step #1 is part of UCB1?


What, now it's indirect argument over HN?

Dear UX specialists with no knowledge of statistics: you can use the MAB algo with 2 choices, no problem. And it is a better way of getting 'the right choice'.

Dear statisticians: there's more to life (and to UX) than A/Bing (or MABing) everything.


> What, now it's indirect argument over HN?

This has obviously been standard procedure for a while now. I see this on almost a daily basis. Afraid your comment on a HN thread won't get enough traction? Make a blog post instead, meant expressly for submission to HN (or reddit, or ...)


Dear random internet poster. This argument has happened several times already on this site. I recommend reading previous discussion at http://news.ycombinator.com/item?id=4052997 and http://news.ycombinator.com/item?id=4040022 before being sure that this blog is obviously the final say.


I don't think most statisticians need to be taught the latter. This particular narrow slice of statistical methodology is dominant in certain areas of advertising, web design, and UX, but it's not as if it completely dominates the field of statistics, or the toolbox of most statisticians.


My guess is 98% of developers use neither.


You are, sadly, overshooting the worldwide population of A/B testing developers by at least an order of magnitude. Great news for my consulting business, bad news for everyone else.



