Multi-armed bandit isn't an algorithm, it's a model of how to view the problem. Like it or not, the problem web designers face fits the multi-armed bandit model pretty well. The algorithm called "MAB" in the article is one of many that have been developed for multi-armed bandit problems. Traditionally, the "MAB" of this article is known as "epsilon-greedy".
The point of multi-armed bandit situations is that there is a trade-off to be made between gaining new knowledge and exploiting existing knowledge. This comes up in your charts - the "MAB"s always have better conversion rates, because they balance between the two modes. The "A/B testing" always gain more information quickly because they ignore exploitation and only focus on exploration.
I should say also that multi-armed bandit algorithms also aren't supposed to be run as a temporary "campaign" - they are "set it and forget it". In epsilon-greedy, you never stop exploring, even after the campaign is over. In this way, you don't need to achieve "statistical significance" because you're never taking the risk of choosing one path for all time. In traditional A/B testing, there's always the risk of picking the wrong choice.
You aren't comparing A/B testing to a multi-armed bandit algorithm because both are multi-armed bandit algorithms. You're in a bandit situation either way. The strategy you were already using for your A/B tests is a different common bandit strategy called "epsilon-first" by wikipedia, and there is a bit of literature on how it compares to epsilon-greedy.
This comment just sold me on MAB. You can just keep on throwing variations on a design at the system without having to make tenuous decisions. I hope all the A/B tools implement this soon.
It seems like most people who use these content optimization tools don't really understand the statistics involved. What are your thought on this? How do you educate your users on the merit of your approach vs a/b testing when the topic is so complex?
Also, despite this being a slightly pro a/b testing post, I have to say it's actually made me more interested in trying out Myna's approach MAB algorithm.
Same way every product from GWO and T&T on down: show a pretty graph that ignores the underlying assumption that it's even possible to use statistics to conjure certainty from uncertainty, and trust that users will never know or care about the difference.
/former AB test software dev who fought my users to try to stop them from misinterpretation results, and failed.
If it gives you comfort, if there is a significant underlying difference and the calculations are done right, with high probability they will get the right answer even though they are misunderstanding the statistics.
Acceptance of this fact has avoided a lot of potential ulcers for me.
Just to be clear, we don't have anti-MAB stance or pro-A/B testing. The point was that MAB is not "better" as an earlier article titled (20 lines of code that beat A/B testing) had claimed. These methodologies clearly serve two different needs.
I should note here that if you use Myna, you will be using a much better multi-armed bandit approach than the epsilon-greedy which lost in this blog post.
See my longer top-level comment for some of the trade-offs.
I'd love to see someone who knows more explore this, but: Statistical tests and processes can not be blindly applied. They all have prerequisites before they mean anything. A common one is the assumption that the underlying distribution is Gaussian. In that case, average + standard deviation tells you something very useful. On the other hand, nothing stops you from taking the average and standard deviation of a Poisson process, but the numbers will be much less meaningful, and often quite deceptive.
I'm concerned that the assumptions that are necessary for those statistical tests are not being met. Generally the significance tests are based on an assumption that all samples are independent as far as I know, but MAB grossly violates that assumption, and I do mean grossly. MAB doesn't pass those tests because it is, itself, a statistical significance test in some sense, and it is "deliberately" not feeding the independence-base significance algorithms the data it thinks it should be getting. This is not a bug, it's a feature.
(Pardon the anthropomorphizing there. It's still sometimes the best way to say something quickly in English.)
In fact, the fact that you agree that MAB has a higher conversion rate, which in this context basically means nothing more and nothing less than works better, but that there's this measure Q on which it does worse, is probably better interpreted as evidence that Q is not a useful measurement, rather than that the thing that works better shouldn't be used due to its lack of Q-ness.
In this case, because you are measuring conversion vs. no-conversion, the underlying distribution would actually be binomial. With a large enough sample size, it looks similar to a Gaussian distribution, but still slightly different.
Also, I don't see the problem of lack of independence. People are still randomly assigned to one of the conditions in the MAB scheme, people in one sample don't affect people in the other sample, and each person submits one independent datapoint. The problem is just one of unequal sample sizes in your two conditions, so when you make the comparison (in a one factor ANOVA, say), you lose lots of statistical power because your effective sample size is essentially the harmonic mean of the two sample sizes (which is heavily biased towards the lower number of the two).
"People are still randomly assigned to one of the conditions in the MAB scheme"
No, they're only partially randomly assigned to one of the conditions. MAB has a memory based on what the results of the previous runs are, which is basically another way of phrasing that there's a dependence.
I'm not so sure. By that logic, no experiment has random assignment. I know when I place subject X in condition Y that I do so because I need Z people in condition Y. Thus, I have some information about where I've placed other people. This is no different?
The point is that with bandit models, you're placing people in a cohort as a function of the previous conversion rate of your cohorts. That's the dependence.
By contrast, randomly allocating X% of your population to cohort C_i is not a dependence on anything but the free parameter (X).
"The results will still be conditionally independent given the assignments"
If the assignments are the source of the problem, you can't just take them as "given" and assume that things are OK.
A simpler example: if I assign people to cohorts based on their age, then the results of my experiment may be "conditionally independent" with respect to (say) eye color, but it probably won't be conditionally independent with respect to income or reading level or pant size. In other words, the assumption of independence is violated with respect to all variables correlated with age.
With bandit optimization, our bucket allocation scheme is correlated with time, which makes it nearly impossible to apply conventional statistical tests on any metric that may also be correlated with time. And in web testing, that's nearly everything.
It's not flawed. Wouldn't you assume that whatever bias (due to "peeking" at results) was there for one algorithm would also be there for another? Also, to offset this effect somewhat, I waited till I saw statistical significance at least 10 times. (I know it's a heuristic) Moreover, I'm curious if there's a theoretical result that compares statistical power of randomization and MAB.
Though like a commenter said, comparing these two algorithms is actually like comparing apples to oranges, and that was precisely the conclusion.
Yes, I agree it is not a rigorous statistical analysis but I tried doing a fair comparison (based on assumptions/heuristics clearly described in the post). The blog post simply claims what is described there.
One glaring flaw in this article is that they're comparing a multi-arm bandit with a 10% exploration rate to a simple random split test with an implicit 50% exploration rate. Of course the split test will converge faster.
In their results when they compare a MAB with a 50% exploration rate to their split test you start to see a comparable amount of time to converge. Also they only show the results of one simulation with a lot of random in it. Given we're all stats nerds it would've been handy to see a box plot for each of the styles of simulation across multiple runs.
Having said all of that, my biggest concern around MAB as it's being sold is the lack of thought around the experiments. In the end it's just data and it doesn't mean anything without human intuition and preconceptions guiding it. Example: day of week, time of year/day, new page people are getting used to still, old one that garners lots of repeat traffic, etc. When running tests there's a lot more to consider besides just "whatever the data says".
We ran each simulation 25 times (with 10000 iterations each). And can share exact statistics in case you are interested. And yes, MAB with 50% exploration rate did perform similarly, but it didn't generalize when there were more than two versions.
Experiment and Optimization are different topics and scoring one by the criteria of the other will lead to mistaken conclusions.
Your simulation is a fiction. It only informs you of the behavior of the code you wrote. To connect statistical significance to reality requires honoring criteria you have not met.
Applying frequentest evaluation concepts to Bayesian and Markov learning methods will lead to mistaken conclusions because they follow from a different set of axioms. They both lead to numbers, but these numbers are not interchangeable and you must remember their qualifications when applying them to predicting reality.
In more frequentest terms, multi-arm bandit is searching for minimum total regret via a random walk, not rapid convergence to a significance predicate. It can do this in the face of changing conditions and weaker assumptions concerning control than frequentest significance requires, which is why such methods are now the norm in machine learning.
There are huge opportunities that you will not see with your current perspective. If you do not learn them, your competitors will.
I do not know of a particular reference for multi-arm bandit algorithms in specific, but they are a specialized case of Markov model learning described with vocabulary that predates a more modern broad and uniform understanding of these topics.
David Barber's recent book is very good.
If you want a broad understanding of common mistakes due to unstated metaphysical assumptions concerning statistics, experiment and action, read Judea Pearl's 2nd edition.
Convergence to statistical signifcance is a bad metric (just like how early stopping is such an issue) but you can map it to the rigorous parameter of statistical power, perhaps. From this, it's well proven that t-tests are statistically most powerful--there exists nothing better--given that you're simulating disjoint normals.
In reality, however, tests like the Wilcoxin test are 99% as powerful and more robust to misspecification in models.
I bring this up because while statistical power and signficance is a very important metric in the theory of picking good tests, it's actually a pretty terrible one in practice. Comparing MAB, which optimizes an entirely different loss parameter, to t/z-tests on power is sort of meaningless.
MAB can produce a cleaner workflow for many kinds of websites. Underperforming classes will be underrepresented and eventually pruned. The increased power of a batch test isn't necessarily so important in this context. I'm not even actually advocating MAB over other tests, just that you shouldn't spend too much time worry about power comparisons unless you're genuinely comparing apples to apples.
Yes, comparing MAB with A/B test is not accurate and that is what the conclusion (in summary) is. These algorithms optimize for different things, and our customers should know that.
This blog post misses to me what is the most important difference: If you are not keeping consistent assignment ratios of users into experiment buckets, then you are implicitly making the assumption that no other factors other than your A/B test can affect conversion.
A very common example would be you might have a form that converts higher on the weekends than the weekdays just because of some external factor. You fire up a test with this "multi-arm bandit method" on Friday morning with two variations that are exactly equal. By random chance variation A starts winning by the end of Friday, and then it gets shifted to 90% of impressions. Due to this external factor, that it is now the weekend and that naturally increases conversions, variation A's conversion rate is going to excel past that of variation B, maybe even significantly. However, it in fact isn't better at all.
Now this may be uncommon, but think about this: Your website is changing all the time. If you happen to make a change on your website that affects conversion rates and you do not keep consistent bucket assignment ratios, then any ongoing experiment is going to be tainted.
I appreciate that the "multi-armed bandit" method increases the conversion rates on your site during the test, but to do be honest when I am running a test I'm not worried about increasing the conversion rate during the test. I'm worried about increasing the conversion rates in the much longer future, and the best way to accurately do that is to get accurate and significant results from my test as quickly as possible.
This also bunks one of the advantages in the blog post that says you can add variations at any time. You cannot add variations at anytime, because if you do you are comparing apples to oranges. The conversion rates from time period A-->C cannot be assumed to be the same as B-->C, because there likely have been changes in between A and B that could affect the "base" conversion rates (like website changes, external factors like weekend vs weekday, etc). When you add a new variation, you are creating a new A/B test that must be analyzed independently.
However I have been thinking about it since, and it is possible to design a multi-armed bandit approach with logarithmic regret (though higher by a constant factor than a traditional approach), that can handle the varying performance ratio. It also would allow you to add variations at any time.
There remain operational differences, but this problem is fixable.
The human factor in split testing is rarely to almost never discussed. I've been building and running landing pages for around 7 years now, I can't count how many millions of people have flowed through them.
Here are some things I've found:
1) Absolute conversion rate. After a certain point, whatever you add will just detract from the performance of something else. That detraction could either be from the landing page itself (if your lucky) or some longer term variable (hope those "conversions" don't cost you too much money.) I have had both occur.
2) "Statistically significant" can just be noise when variables are fairly close to each other. After getting rid of the obvious losers, I've watched the "winner" of elements like button color change back and forth daily for weeks, with no clear winner, even with 30,000+ conversions a day flowing through. This is the kind of thing visualwebsiteoptimizer would write a case study on 1 hours worth of traffic and declare a winner.
3) You brand can be shit on when dealing with returning users. They are used to seeing one thing and now they see something else. Imagine if the colors, theme, and button locations changed every day (or to be fair, once a week) when you visited hn. Often "better converting" designs actually convert worse when introduced on existing customers.
4) Failure to account for external variables, especially when dealing with small sample sizes. Testing is often done most vigorously with paid traffic sources as the monetary incentive is direct. The traffic source itself is often a bigger determining factor behind the conversion rate than the design. Small budget/sample size, and you could end up with some pretty poor test results that the math will tell you are correct.
I am not saying don't test. I am saying a/b testing, split testing, multivariate testing, etc is abused
Helpfully, the final paragraph of the article seems to contain its takeaway message:
"So, comparing A/B testing and multi-armed bandit algorithms head to head is wrong because they are clearly meant for different purposes. A/B testing is meant for strict experiments where focus is on statistical significance, whereas multi-armed bandit algorithms are meant for continuous optimization where focus is on maintaining higher average conversion rate."
Like any business owner, at the end of the day I care more about conversion rates than about statistical significance.
I'm not a scientist trying to advance the sum of humanity's knowledge. I'm a business owner trying to find the shortest path between what my customers need and what I can profitably offer them.
In a way, statistical significance strikes me as a bit of a fool's errand, because significant results in one context may not be generalizable to another, which means even if we know for a near certainty what worked best, it's hard to apply that knowledge reliably in the future.
Of course, with MAB we could still wait for statistical significance if we want it, before turning off variations that are performing worse. And we can certainly still try to draw conceptually useful conclusions by designing our test variations in ways that facilitate easy comparison.
But with MAB and Myna it sounds like we can pretty well count on higher conversion rates at the end of the day, as well, and that counts for a lot in a business context.
I'm grateful to the VWO team for writing up their analysis and findings, and being so frank about the relative advantages of A/B and MAB. Their summary above tells me what I need to know.
Without concrete details about the experiment setup and dataset on which it was run (and ideally the source code,so we can be sure there aren't any bugs) this is essentially meaningless.
Anyone can run "simulations" to prove anything. Providing just a summary table is of little use. I am not saying that the Wingify folks are trying to mislead people - just that this article doesn't have sufficient rigor to justify its conclusions.
OTOH, many CS papers, even published ones, don't provide source code or datasets so people can replicate the results, so perhaps this is the 'new normal' ;) .
I am not saying that your code has bugs (or that it hasn't!).
However running experiments to prove that one 'process' is superior to another in a real world situation often involves more than running a chi square test on randomly generated data ;) (which is what I understood your code to be doing on a very brief glance at it- sorry if I got it wrong).
There is nothing wrong with starting your exploration with something like this, but it really isn't sufficient to make the claims in the blog post about specific ways in MAB is better/worse than A/B testing.
This seems to be methodologically dubious. I am not a stats expert, though I use it in my work, and I could be wrong but testing convergence to statistical significance doesn't seem to mean anything (mathematically/statistically)!.
I could be wrong - there are people with Stats PhD's here and I'll leave it to them to tell me if I am. But I've never heard of an experiment (in the formal sense) which ran till a significance checking algorithm crossed some tripwire level.
Is the sampling random? (especially for an MAB) What are the underlying distributions (and assumptions about them)?
In your article you (imo) either need to say something like "this test isn't really valid and so the conclusions shouldn't be taken very seriously" (which would be weird thing for a company with your product to blog!) or you should do some rigorous experiment design and be able to defend your results.
Right now it reads like some MBA type plugged in two black box formulae in an Excel sheet and drew conclusions from the results. (please note: I am not saying you did it that way -just saying that the article reads like not enough thought went into setting up a proper experiment)
(Statistical) Experiment Design is a lot of work and surprisingly hard to carry off well, and involves much subtlety - and theory! there are whole books written about it. Maybe time to buy a few? :)
Yes, I agree we should put the code on the post probably so people know what exactly was done. Yes, it is a quick and dirty simulation, and not a rigorous analysis. I hope someone more knowledgeable does shed light if the arguments have any fundamental oversights. The post is mainly meant for our customers who asked if MAB is "better" than A/B testing. It clearly isn't as both are meant for answering different types of questions.
I personally do not understand the need to compare bandit with A/B.
The goal of A/B is to decide as quickly as possible for the best one.
The goal of bandit is to optimize given content.
I wouldn't use bandit to "decide" the button color but as a simple recommendation system. This seems by far more natural to me as it reacts better with changing optima.
Example: Let's say I run a larger "fun content" media webpage. I have "awesome videos", "funny images" and "goofy articles". On the bottom and right side of each content page i show follow up content. I would use bandit here to optimize the mix of images,videos,text i recommend.
To spice it up: I would create cohorts for my typical behaviour of users (e.g. registered male user) and only consider interactions of the last two weeks into my bandit calculations.
"A/B testing is meant for strict experiments where focus is on statistical significance, whereas multi-armed bandit algorithms are meant for continuous optimization where focus is on maintaining higher average conversion rate."
Who cares about conversion rates, right? I'd much rather have statistical significance than more clicks on my buy button / signups for my site / etc.
The point is that with less significance you are less confidential what your conversion rate "would have been" under other courses of action. You can't say "this or that test has higher conversion" because of some simulation that has totally different structure than your real website.
Statistical about managing unknowns, not creating knowns, and everyone ignores this!
Multi-armed bandit is a general problem. There are many strategies for tackling it. An epsilon-greedy strategy is one strategy. It is not the only strategy, and it is not even a particularly good one. There are many others. In fact A/B testing is itself a solution to the multi-armed bandit approach.
In fact the standard way to classify strategies is their asymptotic average regret - how much time you spend pulling the bad lever. An epsilon-greedy strategy has linear regret - even after a version has won you keep pulling it a certain fraction of the time.
Let's consider an alternate approach. Run an A/B test, then forever after exploit it. This is also a linear regret algorithm. Given two levers and the way you're going to run the test, there is a probability that you will come to the wrong conclusion and forever after do the wrong thing. Most of the time your regret is fixed, but sometimes it is not.
So they are equal? Not exactly. A/B testing converges quickly, and has a very small amount of linear regret forever after. An epsilon-greedy approach converges more slowly. Furthermore to lower the long-term regret you have to drop epsilon, which slows convergence. By the time you lower it to a level that is comparable to an A/B testing approach, convergence is so slow as to be operationally infeasible.
Thus the conclusion of this article is that A/B testing is a better operational strategy than epsilon-regret. And it is. But there are known better strategies still. The algorithms that Myna uses, for instance, have logarithmic regret. In a multi-armed bandit situation, they will soundly beat A/B testing just as A/B testing soundly beats epsilon-regret.
At that point you're left with two questions. The first is whether multi-armed bandit is a good model for website optimization. The second is whether there are any operational advantages to one approach over the other.
The answer to the first, for traditional multi-armed bandit algorithms, is a sound no. While relative conversion rates remain comparable, absolute conversion rates vary for all sorts of reasons. (Time of day, day of week, longer cycles, external website improvements, etc.) That was my big criticism the last time this discussion popped up, and there was no good answer to it.
But behind the scenes Noel (the CEO of Myna) and I have been in a discussion. I believe that I can produce a multi-armed bandit algorithm that has logarithmic regret and will work in the face of ever varying conversion rates, as long as there is a consistent relative difference between the versions.
At that point you are left with the operational differences. If you want to test at a granularity of "user", a website often has to put most of its active population into the test, before collecting enough data to know which version was good. With A/B testing you will eventually put everyone into the better version. With a multi-armed bandit, half your current users are stuck in the wrong version.
In effect you have a limited population of levers available to pull, and have to pull a good chunk of them before you start getting feedback on the rewards.
You can avoid this problem by testing at the granularity of an impression. Whether or not this is OK to do is a business question, and the answer will not always be yes. If it is not, then a multi-armed bandit approach is inappropriate to use.
> I believe that I can produce a multi-armed bandit algorithm that has logarithmic regret and will work in the face of ever varying conversion rates, as long as there is a consistent relative difference between the versions.
That sounds interesting. As you undoubtedly know, there is a lot of literature on multi armed bandit problems and also on the non-stochastic multi armed bandit problem. The latter has the advantage of not assuming anything about the way in which the sequence of rewards is generated.
There is a a line of work in the non stochastic setting which sounds related to what you are describing. It's called the problem of tracking the best expert: http://www.cse.ucsc.edu/~mark/papers/track-long.ps
The idea in this problem is to do well as compared not to the best fixed action but rather the best piecewise constant sequence of actions. There are many variations of this problem, but the basic idea is that you can have low regret as compared to a changing sequence of actions so long as that sequence doesn't change too "quickly" for some definition of quickly.
I know there is a lot of literature, which I'm unfortunately less familiar with than I'd like to be.
Here is the specific problem that I am interested in.
Suppose we have a multi-armed bandit where the probability of reward are different on every pull. The distribution of rewards are not known. It is known that there is one arm which, on every pull, has a minimum percentage chance by which it is more likely to produce a reward if pulled than any other arm. Which arm this is is unknown, as is the minimum percentage.
In other words payoff percentages change quickly, but the winner is always a winner, and your job is to find it. I have an algorithm that succeeds with only logarithmic regret. I have no idea whether anyone else has come up with the same idea, and there is so much literature (much of which I presume is locked up behind paywalls) that I cannot easily verify whether anyone else has looked at this variant.
I think what you're describing is basically the non-stochastic setting. The idea there is to assume absolutely nothing about the rewards (they can be generated by quickly changing distributions or even an adversary) but still try to do as well as the best single arm. The standard algorithm for this problem is the multiplicative weights update method, also called the exponentially weighted average method or the hedge algorithm. It's a really simple method where you keep a probability distribution over the arms and then reweight the probabilities at each time step using something proportional to the exponential of the rewards of the arms. If you run it for T time steps on a problem involving N arms, it gives you regret O(sqrt(T log N)). This is actually the best possible bound without additional assumptions. This is a good paper on it: http://cns.bu.edu/~gsc/CN710/FreundSc95.pdf
There are many, many different variations of the method and result. For example, you can show improved results assuming different things about the rewards. For exapmle, you can show O(log(T)) regret with certain additional assumptions. This is a really good book on different non-stochastic online learning problems:
http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Predi...
The article is interesting, but assumed you get to pull all of the arms at once, and see all of the results. That's different from the website optimization problem.
The book looks like it would take a long time to work through, and they don't get to the multi-armed bandit problem until chapter 6. It goes into my todo list, but is likely to take a while to get off of it...
Yes, I realized after posting that this is probably a better paper to link to: http://cseweb.ucsd.edu/~yfreund/papers/bandits.pdf
The algorithms for the partial information setting are sometimes surprisingly similar to the algorithms where you see all the results. The algorithm in the paper linked above is essentially the same algorithm but with a small exploration probability. The regret bound gets worse by a factor of sqrt(N), however.
I don't think the non-stochastic setting is entirely appropriate. What we've been discussing is a situation where the differences in conversion rates are constant in expectation, though absolute conversion rates vary. So it's basically a stochastic setting, which should allow more leverage. Also, as I recall for the non-stochastic setting the definition of regret changes so the regret bounds are not directly comparable.
My take on btilly's problem was that the difference between the convergence rates is not constant but rather known to be bounded away from zero by some unknown constant. That is, there is some value epsilon such that the difference in conversion rate is on every round at least epsilon. However, at some rounds it could be more than epsilon, and it doesn't necessarily vary according to any fixed distribution. I think this is therefore roughly equivalent to the non stochastic setting additionally assuming the best arm has large (relative) reward.
I agree though that if the difference in conversion rate were stochastic that could potentially make the problem much easier.
This is a good coversation.
So the Disclaimer, my company Conductrics.com allows you to use algos similar to bandits as well as AB/MVT.
AB can be thought of as more of a form of epoch- greedy - you play uniform random then play greedy. One advantage of e-greedy is that if your environment is not stationary - you are still sampling from the arms - its sort of an insurance premium.
To address the differences in reward signals based on the environment, there is the option to model the environment with features - since it maybe that Fridays require a different selection than Sundays - not sure why you would a priori assume that the relative values, or even just the arm rankings are independent of environmental variables.
One other point- if you just want a super quick hack to pick winners (or at least pick from te set of higher performing arms) you can just optimistically seed your estimates for each arm - then just pick greedy. Not claiming it is optimal or anything but it requires almost no coding overhead. Of course you need the problem to be stationary.
Regardless of which algo approach you use, I do think it is useful to at least think in terms of bandits or reimforcemt learning when tackling these problems.
There is also another area where improvement can be had. As far as I know most A/B testing just looks at the decisions that led to a conversion. But what about those that didn't lead to a conversion? There is information in that too, and it depends on time.
Suppose you pulled decision X for some user two minutes ago, and the user hasn't converted yet. Suppose that you pulled decision Y for some user two weeks ago, and the user hasn't converted yet. Do these two give you the same information? No: the user that got Y is less likely to convert than the user that got X, simply because of the time difference.
What you could do to incorporate this extra information is model the conversion process more explicitly, for example as each lever resulting in a particular exponentially distributed time-to-conversion (and then do Bayesian inference over those parameters).
From a statistics perspective, a few things are missing. Most importantly, a discussion of statistical power and a discussion of why exactly it is that they test until statistical difference is found. Every scientist knows that if you test a big enough sample, you are more than likely going to find a statistical difference, regardless of whether it actually exists. Hence, using only that as your heuristic for what makes one algorithm better than another is not very useful.
Agreed that statistical power wasn't used to calculate how long test should have been run. I know statistical significance can be found earlier if you are "looking" for it, that is why I ran the simulations until it was found at least 10 times. (I know it is not the most scientific way, but I used it as a heuristic; I don't know how to use statistical power in case of MAB. Probably 'statistical power' concept is not valid at all for MAB. Need to study more.)
This blog post is describing epsilon-greedy exploration/exploitation as applied to A/B testing. It's not really multi-armed at all, since the multi would imply > 2. The real beauty of exploration/exploitation is the ability to go beyond two simultaneous tests. Although they do show a test with three options, this would be far better if it emphasized the multi more.
I would like to have seen more tests vs the MAB-90 algorithm as this appeared the most interesting one to me. Also the article appears to imply that A/B testing is useful in the situation where you want a cleaner statistically significant result which I believe is actually a false argument.
Both methods can supply statistically significant results, there is no reason that A/B testing is fundamentally superior in getting a statistically significant result, although arguably A/B testing is more straight forward.
True, both methods can provide statistical significance. My only point is that in general A/B testing may require less number of visitors as compared to A/B for same amount of confidence.
Talk about being too smart to solve the problem... The article is comparing apples to concrete blocks. When you're hungry, you want apples. When you're building a retaining wall, you want concrete blocks. One can't substitute for the other.
Statisticians, academics, math enthusiasts, and People Who Spend Too Much Time Building Tables and Graphs want concrete blocks. They want to build a case or a report or something that suggests a point using a mathematical model. They deal with "statistical significance" because that's how the tools they understand work. Thus A/B testing.
Web workers, businesses, advertisers, people looking for better conversion rates, etc. want to eat. They want apples. Talking about statistical significance, trial sizes, and frequency measures don't help if they don't get more apples. The article showed that MAB gets more apples than A/B and by a wide margin.
Comparing the two isn't just missing the point, it's drawing an equivalency between "statistical significance" (whatever the tools say it means) and actual realworld significance, a.k.a. better results. If A/B testing fails to even simulate better results, the proper word for that is worse.
(And don't even get me started about trying to run a statistical experiment until significant results are found... someday that practice will be labeled as fraud.)
If you have a question where the answer can be used to inform other decisions (e.g. Which is more important for our customers, price or support level?) then an A/B test will give you a clear answer that you can apply elsewhere.
If you don't really care what the answer is and just want the best option (e.g. should the hero shot be looking up or down) then a bandit is the way to go.
Reading the conclusions it seems that each approach - a/b testing vs multi-armed bandit - has its merits. Maybe Visual Website Optimizer should be method agnostic and offer customers the choice of method?
Yes, we have been discussing this internally. Providing both options is great, but that assumes customers have a complete understanding of when to use which method.
2. You do not indicate whether you’ve run enough simulations to attain a high statistical significance (over the meta- result that each campaign statistical significance has been attained)
3. Correct me if I’m wrong, but I don’t think A/B testing should be run “until statistical significance has been attained” because it foils the purpose of a statistical test, that is, you should rather define before hand a number of iterations, wait until they have been completed, and then check whether it gives you statistical significance
4. The purpose of the epsilon-greedy algorithm is that it can be run for very long duration without knowing which statistical significance is best because it will adapt the most commonly seen version of your site to this particular version (ie it is made to be run almost indefinitely, and will -- in effect -- automatically remove the non-working versions after a while!
A/B testing is linear regression against a single binary variable. There are many better techniques. Multi-armed bandit is just one of them. A more direct comparison would be simply using a continuous independent variable, or at least one with more than two values. You can have more variables pretty easily, too.
Of course you get statistical significance with fewer observations if you have fewer variables to model. Although you'd get even significance with even fewer observations if you had a continuous independent variable. But more variables produces a higher r-squared. As in, you can explain more of the variation in your conversion rate.
The prevalence of A/B testing demonstrates a sad lack of statistics training. It's a problem in many science fields as well, especially biology.
Was this analysis based on testing only two variations, A vs. B, using A/B testing then MAB?
If so then that ignores one of the best advantages of MAB is you can compare variations A, B, C, D, E, F, ... all simultaneously, where if using A/B testing you'd have to coordinate a tournament between all of those variations to determine the best of the bunch ... so seems it would take just as long with A/B testing (as thus a large sample population) to arrive at knowing the best variation from many choices.
I'm confused: surely the right way to compare would be to say the A/B tests pick the best option 100% of the time after 'reaching statistical significance'[1] - this is at least apples to apples.
[1] Like others here, I believe testing until you see significance is Doing It Wrong.
I found it very interesting that MAB-90 was outperforming RAND on both convergence speed and average conversion rate in the first experiment. This seems a very interesting result, albeit not the one you might be looking for in the article. I find it rather disturbing that this dataset was removed in later experiments. It actually seems to provide a very valuable insight namely: Multi-Armed Bandit strategies might be better than A/B testing but the exploration factor should be far higher because of the need for rapid convergence to significance.
It depends on how much traffic you have. If you have enough for 10% to quickly provide statistically significant amounts of data, then there is no need to provide more traffic to that variation to determine whether it is worthwhile or not - you can split 90% to the already well performing version.
My imaginary ideal would be a genetic algorithm that mutates the HTML and CSS in random ways. You'll probably have users with completely unusable pages but that's the price you pay for finding the optimal solution as it converges over time, probably a really, really long time.
I cannot help but think that under a sufficiently precise set of assumptions (prior knowledge about the various one-armed-bandits, cost of losses, gain of winnings, expectations of being still in business in the future, etc), there is a provably optimal strategy.
From a EE's point of view, what the OP said is that there's too much noise involved in multi-armed bandit algorithm. If that's the case, a low pass loop filter is needed. It's kind of like locking a PLL to a reference clock right?
Maybe a better way to do the exploitation phase would be to sample from the different choices with probabilities proportional to their expected payoff instead of just choosing the one with maximum expected payoff.
(From reading the post quickly, I'm not left with the impression that the author understands the research they are 'refuting'. (Which I also only read quickly and cannot help understand.) This is just my impression.)
Of course, I don't claim to understand intricacies of MAB completely. This is on my understanding of this algorithm in last couple of days, but I'd like to be illuminated by someone more knowledgable if there's anything wrong with my understanding. In fact, that would be quite helpful for our customers as well.
The point of multi-armed bandit situations is that there is a trade-off to be made between gaining new knowledge and exploiting existing knowledge. This comes up in your charts - the "MAB"s always have better conversion rates, because they balance between the two modes. The "A/B testing" always gain more information quickly because they ignore exploitation and only focus on exploration.
I should say also that multi-armed bandit algorithms also aren't supposed to be run as a temporary "campaign" - they are "set it and forget it". In epsilon-greedy, you never stop exploring, even after the campaign is over. In this way, you don't need to achieve "statistical significance" because you're never taking the risk of choosing one path for all time. In traditional A/B testing, there's always the risk of picking the wrong choice.
You aren't comparing A/B testing to a multi-armed bandit algorithm because both are multi-armed bandit algorithms. You're in a bandit situation either way. The strategy you were already using for your A/B tests is a different common bandit strategy called "epsilon-first" by wikipedia, and there is a bit of literature on how it compares to epsilon-greedy.
http://en.wikipedia.org/wiki/Multi-armed_bandit#Common_bandi...