To preface this comment, I generally really like Evan Miller's writing on statistics, but this statement:
> anyone who lacks a firm understanding of statistical power should not be designing or interpreting A/B tests.
Completely ignores that you can approach A/B testing from an entirely Bayesian perspective (which Miller has written about fairly well in the past).
I might be a bit biased in favor of Bayesian methods, but I'm honestly a bit surprised that many people still run A/B tests using a frequentist framework such as this one.
The biggest reason I would argue for dismissing the NHST (null-hypothesis significance testing) approach is that A/B tests are not really controlled experiments, at least not in the same way clinical trials are. User behavior is always observational even if you have a control group. Whether you go NHST or Bayesian is a coin toss in a controlled environment, but when looking at user behavior on a website it's much better to think in Bayesian terms.
Other reasons why a Bayesian framework should be used (a concrete sketch follows this list):
- Marketers care about "Probability A is better than B", not "failure to reject null hypothesis".
- If you've been running A/B tests for years, you most certainly have a very good prior over the distribution of conversion rates. It's literally wasting your time and money not to make use of these.
- If you're measuring human behavior on a website, there are absolutely confounding variables at play and you want to model these explicitly. It's much easier to do this when thinking in a Bayesian framework.
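To make the first two points concrete, here's a minimal sketch of the kind of calculation I mean, assuming Beta priors on conversion rates; every count and prior parameter below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative counts, not real data.
conversions_a, visitors_a = 120, 4000
conversions_b, visitors_b = 145, 4000

# Informative Beta prior, e.g. centred near a historical ~3% conversion rate.
prior_alpha, prior_beta = 30, 970

# Posterior for each variant is Beta(prior + conversions, prior + non-conversions).
post_a = rng.beta(prior_alpha + conversions_a, prior_beta + visitors_a - conversions_a, 100_000)
post_b = rng.beta(prior_alpha + conversions_b, prior_beta + visitors_b - conversions_b, 100_000)

print("P(B > A) ≈", (post_b > post_a).mean())
print("Posterior mean lift ≈", (post_b / post_a - 1).mean())
```

The posterior probability that B beats A is exactly the quantity marketers ask about, and the prior is where years of historical conversion data comes in.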
I've been designing and running A/B tests of all sorts of complexity at all sorts of companies for a very long time, and my experience has been that approaching the problem from a Bayesian perspective is by far the superior way to run your tests.
I generally agree with the pros you list for Bayesian methods, but I think in companies that are doing a lot of experiments and are not just optimizing a checkout page, they don't hold as well. For example, you often are testing out features that really are different from any feature before, and so a prior is harder to get alignment on. There are also frequentist methods like CUPED to mitigate the effect of confounding variables (sketched below), and then usually you're analyzing the results across user segments so that you can try to look for heterogeneous treatment effects. (FWIW, having analyzed a lot of experiments, I've been surprised at how often there is NOT a heterogeneous treatment effect across user segments; the baselines are often very different, but the direction and general magnitude of the lifts are typically similar. And, the confounder that is most often relevant is simply some broad "engagement" dimension: users who use the product a lot might behave differently from users who do not.)
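For anyone unfamiliar with CUPED, here's a rough sketch of the core idea (variance reduction using a pre-experiment covariate); the data and variable names below are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: a pre-experiment metric X correlates with the in-experiment metric Y.
pre_metric = rng.gamma(shape=2.0, scale=5.0, size=n)           # e.g. sessions before the test
treatment = rng.integers(0, 2, size=n)                         # random assignment
y = 0.5 * pre_metric + 0.3 * treatment + rng.normal(0, 3, n)   # in-experiment metric with a small lift

# CUPED adjustment: remove the part of Y predictable from the pre-experiment covariate.
theta = np.cov(pre_metric, y)[0, 1] / np.var(pre_metric)
y_cuped = y - theta * (pre_metric - pre_metric.mean())

for label, metric in [("raw", y), ("CUPED", y_cuped)]:
    lift = metric[treatment == 1].mean() - metric[treatment == 0].mean()
    print(f"{label}: estimated lift = {lift:.3f}, variance = {metric.var():.2f}")
```

The adjusted metric has the same expected lift but lower variance, which is where the sensitivity gain comes from.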
But what is really interesting to me about your comment is:
> A/B Tests are not really controlled experiments, at least not in the same way clinical trials are. User behavior is always observational even if you have a control group
Could you expand on that? Isn't clinical data also observational: you observe the patient after some time, and see what their symptoms are or what various endpoint measurements are?
> Could you expand on that? Isn't clinical data also observational
Generally in statistics there is a divide between controlled experiments and observational studies. The former is how most medical trials are run and the latter is how most anthropologists work.
In medical trials you can make sure all the demographics, lifestyle differences etc are controlled for before you even start the trial. In the more extreme case of experiments in the physical sciences you can often tightly control for everything involved where the only difference between test and control is precisely the variable of interest.
In anthropology you can't go back in time and say "what if this society had a higher ratio of male/female and didn't go to war?", you can only model what happened and attempt to bake some causal assumptions into your model to see what happened. This is why detailed regression analysis is very important in these fields.
Having a test and control group in your A/B test is not really enough to establish the equivalent of a controlled study of rats in a laboratory, or bacteria in petri dishes. In my experience it really helps to include causal models regarding what you know of your A/B testing population to ensure that what you're observing is really the case. I've had concrete examples where it looked like one variant was winning, and then checking the causal assumptions showed that we were accidentally measuring something else.
In practice you can reframe all of this into being some sort of ANOVA test and fit it just fine into a classical framework, but I find starting with a Bayesian methodology makes the process easier to reason about.
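As an illustration of the kind of check being described (not the commenter's actual example), here's a toy case where a variant appears to win only because the split is accidentally correlated with a traffic segment; all names and numbers are invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000

# Imagine the test accidentally routed more "paid" traffic to variant B.
segment = rng.choice(["organic", "paid"], size=n, p=[0.6, 0.4])
variant = np.where(
    rng.random(n) < np.where(segment == "paid", 0.65, 0.45), "B", "A"
)  # assignment correlated with segment -> a lurking confounder
base_rate = np.where(segment == "paid", 0.08, 0.03)  # paid traffic converts better regardless of variant
converted = rng.random(n) < base_rate                # note: no true variant effect at all

df = pd.DataFrame({"segment": segment, "variant": variant, "converted": converted})

# Naive comparison suggests B "wins"...
print(df.groupby("variant")["converted"].mean())

# ...but conditioning on the segment shows the lift disappears within each stratum.
print(df.groupby(["segment", "variant"])["converted"].mean())
```

In a correctly randomized test this shouldn't happen, but assignment bugs or skewed traffic sources can quietly reintroduce it, which is why checking the assumptions matters.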
Many A/B tests as RCTs do suffer from the fact that they're randomizations of a convenience sample of users coming to a website, but it's always seemed to me that randomization addresses most confounders due to population heterogeneity. Are there specific confounding variables you worry about that are not addressed by randomization in this way?
One challenge for me is how to explain to marketers what the "Probability A is better than B" means from a Bayesian perspective. How do you convey to them that such a probability is not a frequency of an event? I haven't met anyone who is not a statistician that won't default to "you know, how often it happens" when pressed to define probability. My attempts to engage marketers on this fall like water off a duck's back.
In general I prefer NHST methods because I often require reasonably tight error control. Do you do any kind of calibration to ensure error rate control when doing your procedures? Or are your clients comfortable with less tight inferential error controls?
This comment reminded me of yet another reason to prefer the Bayesian setup for A/B tests: you work with the data you have, not some bizarre idea of a "sample size".
Especially with proper priors (which again, everyone running an A/B test should have at this point) you don't have to "plan a sample size". You work with the data you have, and assess the risks in the decision given what you know. There is likewise no risk of "early stopping".
If you're doing Bayesian analysis and only have say 100 visitors, and are using priors, the only situation in which you'll have a strong posterior is when the winning variant is notably superior to the current one. All of the ad hoc rules Frequentists put in place are not necessary when you do proper analysis of your posterior probability that A > B.
Again we see where Bayesian vs Frequentist doesn't really matter for a clinical trial but does for running an A/B test. In a clinical trial you are going to have to plan how many people are involved anyway, so doing a power calculation makes sense since you'll need some number and that's a good way to get one.
For running an A/B test the bigger problem is usually time, not numbers (an exception being email campaigns, in which both tend to matter). In the Bayesian setup, if your boss says "I need an answer tomorrow morning" you can take the data you've got and show the probabilities of improvement as well as the distribution of the risks if you're wrong. This allows you to make riskier moves when it makes sense, and be more conservative when it does not. This is something that out-of-the-box NHST does not allow.
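A minimal sketch of what that "answer tomorrow morning" summary could look like, using Beta posteriors; the counts and prior below are invented:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative counts after an early peek -- a small sample, not a planned "n".
conv_a, n_a = 9, 180
conv_b, n_b = 15, 175

# Weakly informative prior around a historical ~5% conversion rate (made-up numbers).
a0, b0 = 5, 95

post_a = rng.beta(a0 + conv_a, b0 + n_a - conv_a, 200_000)
post_b = rng.beta(a0 + conv_b, b0 + n_b - conv_b, 200_000)

prob_b_better = (post_b > post_a).mean()
# Expected loss of shipping B: how much conversion we give up, on average,
# in the scenarios where A is actually better.
expected_loss_b = np.maximum(post_a - post_b, 0).mean()

print(f"P(B > A) ≈ {prob_b_better:.2f}")
print(f"Expected loss if we ship B anyway ≈ {expected_loss_b:.4f} (absolute conversion rate)")
```

The probability of improvement plus the expected loss is enough to decide whether a risky early call is acceptable or whether to wait for more data.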
I am also team bayes for all the reasons you stated, but do want to argue a couple counterpoints:
* While you don't have to have a fixed sample size up front, you can still "cheat" in a Bayesian analysis if you peek constantly and end early on promising results that you want to win, and let them run longer otherwise. So you want to do something to account for this (put some structure in place, approach with skepticism, laugh and put on sunglasses, whatever).
* It's very often useful in practice to have some idea of what kind of answer you're going to see in how long, for planning reasons -- for example, rather than your boss saying "I need an answer tomorrow" they say "I need an answer as quick as you can". Bayesian methods give you the flexibility to be risky when you need to and accurately account for uncertainty, but sometimes you still need to predict and strategize around ideas like "We'll be about this certain in 2 days, and about this certain in 1 week, and about this certain in 4 weeks, and it seems like planning on next Tuesday is the right call".
I've found that understanding these frequentist methods helps inform my guesstimates of how experiments will play out with regard to sample size and impact, and helps me honestly evaluate the trade-offs in tests I wasn't running myself -- A/B testing is really widespread, so I feel it's important to understand frequentist tests well even if you intend never to use them if you can help it.
Well, at least in my day job, a very common question is "how much more data do we need to collect?" or "how long will the experiment take?". A response of "The analysis will be Bayesian, so those questions don't apply" is not helpful. Planning matters!
Even a Bayesian with a proper prior can make good guesses about sample sizes, for example, by saying that they have a goal to reduce the length of the 95% credible interval for the most relevant parameter by 80%.
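For example, a back-of-the-envelope version of that planning calculation might look like this, assuming a Beta prior on a conversion rate and that the observed rate roughly matches the prior mean (all numbers illustrative):

```python
import numpy as np
from scipy import stats

# Prior over the conversion rate (illustrative), roughly centred on 3%.
a0, b0 = 30, 970
prior_width = np.diff(stats.beta.ppf([0.025, 0.975], a0, b0))[0]

# Rough planning: assume the observed rate matches the prior mean, and see how the
# posterior 95% credible interval narrows as visitors accumulate.
rate_guess = a0 / (a0 + b0)
for n in [1_000, 5_000, 20_000, 80_000]:
    conv = rate_guess * n
    width = np.diff(stats.beta.ppf([0.025, 0.975], a0 + conv, b0 + n - conv))[0]
    print(f"n={n:>6}: CrI width = {width:.4f} ({width / prior_width:.0%} of prior width)")
```

That kind of table answers "how long will it take?" without ever committing to a frequentist stopping rule.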
BTW, asking the question "what is the difference between power and significance/p-value" is a good litmus test for detecting insufficient competence in statistics. As the author puts it, unless you know this stuff well you shouldn't really be judging the "scientific" numbers.
In fact you're probably (probably!) better off building a good intuition within the domain and then improvising the tests, in the sense that this way you're less likely to arrive at bogus conclusions, as opposed to someone who thinks they'll just apply some formula like this was a high school maths problem - here we plug in some magic that speaks back the truth. I've seen it time and again that (especially senior) people can rapidly and accurately judge the results of things based on very "un-scientific" improvised methods and numbers. YMMV.
Broadly, there are two kinds of error: false positives and false negatives. A false positive arises when chance conspires to produce what you consider to be an anomalous result even though no real effect exists. False negatives arise because your detector wasn't sensitive enough to notice a real anomalous result.
The standard statistical practice is to fix your risk of false positives through the preemptive choice of a p-value threshold and then to attempt to minimize your risk of false negatives through increases in the "power" of your experiment. But, honestly, these two are in tension with one another. There are lots of possible choices.
It's important to know the difference, though. It changes how you interpret results. A positive result might be doubted because you, as a reader, would have preferred more protection against false positives, and thus a smaller p-value threshold. A negative result might be doubted because you believe that the experimenter under-powered their experiment. You might also, in a repeating experimental process, wish to obtain greater power by sacrificing false-positive protection. This is as easy as changing the p-value threshold.
A last perspective is that in typical scientific practice, you basically control false positives by pushing the p-value threshold down, which makes the experiment more challenging (reducing power). Then, you re-establish power through more expensive experiments (use better tools, run the experiment for longer, improve controls, improve data efficiency). Often the "typical" p-value is set by the community in relation to what level of false positives the overall community is willing to tolerate in publication. That's why psychologists are happy with a 5% threshold whereas physicists might seek a threshold of 10^-3 or smaller. The physicists are holding themselves to a much more expensive experimental standard and would vastly prefer false negatives.
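To see how the threshold drives experimental cost, here is the standard normal-approximation sample-size calculation for a two-proportion test; the baseline rate and lift below are made-up examples:

```python
from scipy.stats import norm

def n_per_arm(p1, p2, alpha, power):
    """Approximate sample size per arm for a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Illustrative: baseline 3% conversion, hoping to detect a lift to 3.3%, at 80% power.
for alpha in [0.05, 0.01, 0.001]:
    print(f"alpha={alpha}: ~{n_per_arm(0.03, 0.033, alpha, 0.80):,.0f} visitors per arm")
```

Tightening alpha from 0.05 to 0.001 at the same power roughly doubles the required traffic in this example, which is the "more expensive experimental standard" trade-off in miniature.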
All you have to do is ask an author of a paper in medicine or psychology that quotes a p-value to define p-value for you. That will open your eyes if you’re not already clued in.
While this is all true, I think the reason why people so often run underpowered tests is not because they lack "a firm understanding of statistical power" but because we have not made a compelling case for an alternative. UXR surveys? In my experience those are even more underpowered than A/B tests. User testing? Often just as subjective as gut calls.
Gut calls (ie product sense) seem like the only real choice, but I think there really needs to be an argument made for WHY that is the way, rather than just against other options.
Yes, in some cases, teams want an experiment to back them up.
If a decision maker cannot dedicate the necessary resources to properly power a test, it arguably would be more rational to use another process to make the decision. Perhaps a thoughtfully designed team discussion, or even: "Person P will make the decision because the uncertainty is relatively high, consensus is relatively low, and additional discussion is unlikely to clarify things -- meanwhile we could go ahead and try the thing and see what happens!"
> If a decision maker cannot dedicate the necessary resources to properly power a test
Apologies if this is what you meant, but in a lot of cases it's not a matter of not dedicating enough resources so much as there aren't enough resources (ie visitors, users, posts, etc) to detect a reasonable effect size. And in those cases, it seems like the author's point is that running a test is actively worse than a gut call or focus group or whatever else. But to your point, product teams often don't like gut calls because they feel subjective and unilateral, so having a test allows some degree of shared accountability for decisions.
I tend to notice when people write 'the only choice' or 'the only real choice' when they probably mean something more like
(1) an option that, when examined, seems best; or (2) an unexamined (sometimes dogmatic) choice.
In some cases, a person may have a third intent: by linguistically limiting a choice set, they can frame or limit a discussion (for good or ill).
> Significance is the probability of seeing an effect where no effect exists.
In my experience, the term significance level is more common and clearer.
From Wikipedia:
> More precisely, a study's defined significance level, denoted by α (alpha), is the probability of the study rejecting the null hypothesis, given that the null hypothesis is true; and the p-value of a result, p, is the probability of obtaining a result at least as extreme, given that the null hypothesis is true. The result is statistically significant, by the standards of the study, when p ≤ α.
Also, I prefer to use the term statistical significance over just significance because it is clearer to a broad audience across fields. From Wikipedia:
> The term significance does not imply importance here, and the term statistical significance is not the same as research significance, theoretical significance, or practical significance. For example, the term clinical significance refers to the practical importance of a treatment effect.
Here is a one page visual diagram (GPL licensed) I created a while back to show a large number of statistical classification metrics in a way that reveals an underlying symmetry. It also shows synonyms for each metric across fields.
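For concreteness, the p ≤ α decision rule from the quote above, applied to a two-proportion z-test with invented counts:

```python
import numpy as np
from scipy.stats import norm

# Illustrative counts for variants A and B.
conv = np.array([120, 145])
n = np.array([4000, 4000])
p_hat = conv / n
p_pool = conv.sum() / n.sum()

# Two-sided two-proportion z-test.
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
z = (p_hat[1] - p_hat[0]) / se
p_value = 2 * norm.sf(abs(z))

alpha = 0.05  # significance level chosen before the test
print(f"z = {z:.2f}, p = {p_value:.3f}, significant at alpha={alpha}: {p_value <= alpha}")
```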
What I don't understand is why power would be so relevant. I want to know if going from A to B would increase my revenue. I run an A/B test and see statistical significance, even if a minor one. I now know that B is better than A.
I suppose the need for a power calculation comes in when considering effort. If I need 10 engineers for a month to build out a feature that won't get the power it needs for a year, it may not be worth it.
Power is important if you want to distinguish between noise and true signals. In other words, if you care about P(real effect | significant).
From good ol' Bayes we get:
P(real|sig) = P(sig|real) x P(real) / P(sig)
P(sig|real) is the power; so if you have more power, all other things being equal (a bit of a weaselly caveat), the likelihood that your stat sig result is real is higher.
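Plugging made-up numbers into that formula shows how much the base rate of real effects and the power both matter:

```python
# A worked version of the formula above, with purely illustrative numbers:
# suppose 10% of the ideas you test have a real effect.
p_real = 0.10
alpha = 0.05  # false positive rate when there is no effect

for power in [0.8, 0.3]:
    p_sig = power * p_real + alpha * (1 - p_real)
    p_real_given_sig = power * p_real / p_sig
    print(f"power={power}: P(real | significant) ≈ {p_real_given_sig:.2f}")
```

At 30% power, most "significant" results in this scenario are false positives, even though alpha never changed.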
> What I don't understand is why power would be so relevant.
Doing the A/B test itself has a cost greater than just building the feature and releasing it (supporting two variants in production), and beyond that you also need to take into account the cost of acting on the results (i.e. if control wins, what do you do? if it's a tie, what do you do? it's best to budget for the maximum possible effort, or the expected effort -- whereas expecting the variant to win handily is budgeting for the minimum possible effort).
I've seen multiple businesses that always schedule around shipping an A/B test and context switching to the next project while the results stream in. Any result that isn't shipping the variant after x weeks is a huge inconvenience that throws off multiple teams, which means all those cognitive biases start to creep in and make it comfortable to declare loser variants as wins or ties.
While it's easy to write this behavior off as yet another way that groups make irrational decisions, I think the bit of truth in there is that sometimes, the cost of running the strictest, science-iest A/B test is simply too high. Power is a key part of how you reason that out up front, so you can make a rational decision not to test, or to modify your test to make the payoff worth it. For example:
* Let's set the goal metric for something higher up the funnel which is further from our true goal (more $) but happens much more often, so we can see the effect in 1 week instead of 2 months
* We really need to do this for [variety of business strategical decisions], so let's structure our experiment to make sure it won't cost us more than $X in a worst case scenario and find out in a few days rather than wait 2 months
> I now know that B is better than A.
You know that B outperformed A in the experiment. Checking statistical significance is like asking a trustworthy person "Are you sure?" and them saying "Yeah, I'm pretty sure". It's a percentage because it's sometimes wrong, and this doesn't account for the massive amount of real world factors that can still mean an experiment conducted with bulletproof math behind the analysis is still taking people down the wrong path.
Not just effort: there usually are many costs associated with change. Perhaps user disorientation, downtime, as well as a lack of long term understanding of how the change plays out in combination with other factors.
A small improvement may not be worth it right now.
Perhaps it can be deferred. Perhaps it makes sense to bundle it with other changes later. *
* Some changes might reflect better on a product when they are rolled out together; e.g. to signal a major release.
> I run an A/B test and see statistical significance
That’s exactly what power gives you: a fighting chance to detect anything. Not sure where the misunderstanding comes from – it’s about the worthwhileness of tests themselves, not “features”.
You may think about power visually by relating it to confidence intervals – higher power → a more precise (expected) estimate. Low power → bands so wide that you might as well use an RNG.
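One way to see the "wide bands" point, assuming a roughly 3% baseline conversion rate (numbers illustrative):

```python
import numpy as np
from scipy.stats import norm

# Illustrative: baseline ~3% conversion, equal split, 95% confidence interval
# on the difference in conversion rates between the two arms.
p = 0.03
z = norm.ppf(0.975)

for n_per_arm in [500, 5_000, 50_000, 500_000]:
    se_diff = np.sqrt(2 * p * (1 - p) / n_per_arm)
    half_width = z * se_diff
    print(f"n per arm = {n_per_arm:>7}: ±{half_width:.4f} around the observed lift "
          f"(baseline is {p:.2%})")
```

At 500 users per arm the interval is roughly ±2 percentage points around a 3% baseline, so essentially any observed lift is compatible with the data.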
I don't know if it's my civic-minded love for humanity or my laziness, but whenever someone describes a common failure mode and then expects me to do work to catch it every time it inevitably happens, my brain screams "no!". The tooling, processes and community should make this problem impossible by design and/or easily caught in a semi-automated manner.
Good science just doesn't reduce to a script. You need to actually understand what you are doing to interpret the results correctly.
It's why people spend a decade getting a PhD, and many years more still to become a professor. You can't do good science by putting numbers in a spreadsheet and have the computer do all the thinking for you.
I missed the connection between your comment and its parent comment. Would you please elaborate?
Perhaps you are thinking about a situation where the result of an AB test is accepted and incorporated into a product without reflection on the associated factors and context?
I took the GP to mean they wanted tooling to do design and interpret A/B tests for them rather than having to consider cases like false negatives and what have you, to reduce them down to something that's easier to understand.
My point is that the process of designing experiments and interpreting their outcome is nothing other than science.
Science is hard and goes much deeper than the trivial stuff like significance and power. Even scientists, who know all these formulae inside and out struggle with constructing experiments and interpreting the outcome.
To properly think about experimental outcomes, you need to actually have a firm grasp of statistics. The moment you try to convert it into a message that doesn't require understanding statistics is the moment it stops being informative.
In my work experience so far, when it comes to A/B testing what I've observed is that:
* Better tooling in general increases the proportion of people doing it _wrong_, because when it was harder, the difficulty selected for people who wanted to do it right. Making it easier means more people do it right, but also that more people who would otherwise not do it at all can now do it, and do it wrong.
* Making aspects impossible by design will impress and amaze you with how humans and groups figure out how to not only prevail over what you tried to make impossible, but do it even harder than if you did nothing at all.
The best I think we can hope for is making things by default easier to catch, or easier to find later, or less deceptive.
Hard things are hard and that's ok. I think we should spend more time channeling our empathy into aspiring for ourselves and others to be better and do hard things, and making the ability to learn and do hard things accessible to everyone who wants it, as opposed to trying to pretend hard things are easy.
I'm not confident that it makes a big difference, because in the face of anything "idiot-proof", nature will provide a better idiot.
It is really easy to A/B test having a small static button vs a large, flashing, jumping button with sound, and measure engagement as "ever clicked the button". It passes statistical power and significance since most of your users now click the button, even if none of them did before. The failure here is that the button is simply annoying, and clicking the button is not a legitimate engagement with the feature, hence the statistical power to answer the question "does a flashing button promote engagement with the feature" is actually zero.
I hear you, and our perceptual lenses are part of our tooling. Our mental tooling.
A change in mindset can be encouraged and developed by a combination of process and community. This includes educational processes supported by public policy, market mechanisms, and communities.
I believe we need a revolution on how we educate ourselves. How we get there is non-obvious.
If sound statistical thinking was taught, nurtured, and synthesized across many domains, I think it is likely that humanity would be better off.
Right you are, but in defense of the author, understanding this concept while looking through readouts done by past colleagues has also made me feel depressed at times.
Can these problems be circumvented by using nonparametric bootstrap methods to directly infer significance from an empirical distribution of non-significant results?
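In case it helps anchor the question, here's what a plain nonparametric bootstrap of an A/B lift looks like on synthetic data; note that this gives an interval for the observed data but doesn't by itself address the power problem of small samples:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative per-user conversion outcomes for the two arms (synthetic).
a = rng.random(4000) < 0.030
b = rng.random(4000) < 0.036

# Nonparametric bootstrap of the difference in conversion rates.
diffs = np.empty(10_000)
for i in range(diffs.size):
    diffs[i] = rng.choice(b, b.size).mean() - rng.choice(a, a.size).mean()

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"observed lift = {b.mean() - a.mean():.4f}, bootstrap 95% CI = [{lo:.4f}, {hi:.4f}]")
```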