I am an engineering manager in a large ecommerce company, overseeing machine learning and our in-house A/B testing framework.
One of our big tenets in my organization is that null-hypothesis significance testing and any associated methodologies (random effects, frequentist experiment design, and various enhancements to FGLS regression) are simply not applicable and not useful for answering questions of policy (any kind of policy, which subsumes pretty much all use cases of running tests).
We take a Bayesian approach from the ground up, put lots of research into weakly informative priors for all tests, develop meaningful posterior predictive checks and we state policy inference goals up front to understand what we are looking for (predictive accuracy? measurement of causal effect sizes? understanding risk in choosing between different options that have differently shaped posteriors?).
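To give a concrete (and entirely hypothetical) flavor of the simplest version of this, here is a minimal sketch of an A/B conversion-rate comparison with a weakly informative Beta prior and a basic posterior predictive check; every number and name in it is invented for illustration, not taken from our framework.

    # Hypothetical sketch: Beta prior on conversion rates, posterior for the
    # lift of B over A, and a simple posterior predictive check. All numbers
    # are made up.
    import numpy as np

    rng = np.random.default_rng(0)

    # observed data (hypothetical): visitors and conversions per variant
    n_a, conv_a = 20_000, 510
    n_b, conv_b = 20_000, 560

    # weakly informative Beta(2, 50) prior: conversion rates of a few percent
    alpha0, beta0 = 2, 50
    post_a = rng.beta(alpha0 + conv_a, beta0 + n_a - conv_a, size=50_000)
    post_b = rng.beta(alpha0 + conv_b, beta0 + n_b - conv_b, size=50_000)

    lift = post_b - post_a
    print("P(B better than A):", (lift > 0).mean())
    print("95% credible interval for lift:", np.quantile(lift, [0.025, 0.975]))

    # posterior predictive check: do replicated datasets from the fitted model
    # reproduce the observed conversion count for variant A? (the tail
    # probability should sit well away from 0 or 1 if the model is adequate)
    rep_conv_a = rng.binomial(n_a, post_a)
    print("PPC tail probability:", (rep_conv_a >= conv_a).mean())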
One solid paper is this: http://www.stat.columbia.edu/~gelman/research/published/retr... which discusses “type m” (magnitude) and “type s” (sign) error probabilities in Bayesian analyses, and how that can provide some benefits and flexibility that NHST methods and cookie-cutter power designs cannot.
Your mileage may vary, but my org has found this to be night and day better than frequentist approaches and we have no interest in going back to frequentist testing really for any use case.
For feature analysis (should we enable this new feature) I think Bayesian approaches are far better.
I work in a slightly different space: qualification of builds (this assumes you have feature flags, so the code being qualified shouldn't enable any new features, but it may include refactoring as well as code that will later be enabled by a flag, etc.).
For this, I originally wanted to push my org in a more Bayesian direction, but it didn't work. What did work was forcing org leaders to sit down and actually decide how costly a real regression (a true positive) was, and therefore how much developer time they were willing to sink into chasing ghosts.
If you can say "we expect to spend 4 hours of developer time investigating an alert, and at our current sensitivity 1 in 10 alerts will be real", then you can decide whether that trade-off is acceptable.
Ultimately we can't control all the variables (our sensitivity requirements are informed by things like SLOs that we don't directly control), but it does help make informed decisions about prioritization.
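To make that concrete as arithmetic (treating the numbers above purely as hypotheticals):

    # Hypothetical numbers only: turn alert precision and investigation time
    # into an expected developer-time cost per real regression caught.
    hours_per_alert = 4        # expected investigation time per alert
    precision = 1 / 10         # fraction of alerts that are real at current sensitivity

    alerts_per_real_issue = 1 / precision
    dev_hours_per_real_issue = hours_per_alert * alerts_per_real_issue
    print(f"~{dev_hours_per_real_issue:.0f} developer-hours per real regression caught")
    # -> ~40 developer-hours; leadership can then decide whether catching one
    #    real regression is worth roughly a week of engineering time at this
    #    sensitivity, or whether the alert threshold should move.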
The type m/s stuff is really neat, though in my space we don't care too much about it; that may be because we have conventionally gargantuan sample sizes.
Something that helped me understand where frequentist logic sat was Computer Age Statistical Inference. A lot of the tools of frequentism were developed for a time that is quite different to our own.
Just personally, I don't think frequentism is incompatible... it seems to have just been built up to frame everything in terms of problems that are tractable (e.g. t-tests), and that is effective but comes with pitfalls. In economics, as an example, that toolset seems to have gone pretty far.
What I like about Bayesianism, to my ill-informed mind, is that Bayes' theorem feels parallel to other parts of statistics that are effective: Kalman filters, Metropolis-Hastings, MDPs; even Elo is Bayesian (Glicko makes this explicit). And, though this is in no way empirical, when the results are directly comparable I have seen better results from the Bayesian side (even when the Bayesian model is at an information deficit). No idea if that generalises beyond my activities (largely sports modelling), there are still many pitfalls, and implementation can be tricky (I still don't understand Bayesian Data Analysis)... but it is pretty useful.
I am philosophically Bayesian, but given that I work with large datasets, I am practically frequentist.
I actually think that many of the people who promote Bayes everywhere have never tried to run a simple Stan regression on 100k+ data-points (pro-tip: sample, then sample some more, give up as it's taking too long).
That being said, from a philosophical point of view, Bayes is definitely the way I think.
In my org we frequently run Bayesian regression fitting on datasets of millions of samples.
The key is: don't use pymc or stan for large data; just write your own MCMC code and your own log likelihoods for your models. It's very easy and very fast, even in Python.
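As a rough sketch of what that can look like (this is not our actual code - the model, priors, and tuning constants are all invented for illustration): a plain random-walk Metropolis sampler over a hand-written log posterior for a linear regression, where each likelihood evaluation is a single vectorized numpy pass over the data.

    # Rough sketch only: random-walk Metropolis over a hand-written log
    # posterior for Bayesian linear regression on synthetic data.
    import numpy as np

    rng = np.random.default_rng(0)

    # synthetic data: y = 2*x - 1 + noise, a million rows
    n = 1_000_000
    x = rng.normal(size=n)
    y = 2.0 * x - 1.0 + rng.normal(scale=0.5, size=n)

    def log_posterior(theta):
        """Gaussian likelihood with N(0, 10^2) priors on (slope, intercept, log_sigma)."""
        slope, intercept, log_sigma = theta
        sigma = np.exp(log_sigma)
        resid = y - (slope * x + intercept)
        log_lik = -0.5 * np.dot(resid, resid) / sigma**2 - n * log_sigma  # constants dropped
        log_prior = -0.5 * np.dot(theta, theta) / 10.0**2
        return log_lik + log_prior

    def metropolis(log_post, init, n_steps=4_000, step=3e-4):
        """Plain random-walk Metropolis; returns the chain of states."""
        theta = np.asarray(init, dtype=float)
        lp = log_post(theta)
        chain = np.empty((n_steps, theta.size))
        for i in range(n_steps):
            proposal = theta + rng.normal(scale=step, size=theta.size)
            lp_prop = log_post(proposal)
            if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
                theta, lp = proposal, lp_prop
            chain[i] = theta
        return chain

    # start near a cheap least-squares fit so burn-in is short
    slope0, intercept0 = np.polyfit(x, y, 1)
    sigma0 = np.std(y - (slope0 * x + intercept0))
    chain = metropolis(log_posterior, init=[slope0, intercept0, np.log(sigma0)])
    print(chain[1_000:].mean(axis=0))   # posterior means after discarding burn-in

The loop itself is pure Python, but the per-step cost is dominated by one vectorized pass over the data, which is what keeps this workable at millions of rows; in practice you'd also watch acceptance rates and run multiple chains.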
We do still use pymc and stan for other, smaller modeling tasks.
Yeah, but it's not worth it for my purposes. Given the kinds of data and the wide variety of problems we deal with, it would be too large an investment relative to the rewards.
Great comment. I can see how the Bayesian setting uniquely equips you for dealing with non-point testing settings in which reasoning about alternative hypotheses in order to do power, type m, and type s design would be hard.
I'd be very curious about two follow-up questions here:
1 - A frequentist approach to the above could still be formulated in a minimax sort of way, but then you have to deal with alternatives which are close to your null. It's not like this problem goes away for Bayesians; it still seems like the final sample sizes you calculate could end up being very sensitive to the prior. Does this happen in practice?
2 - What kind of optimization goals do your users prefer when trading off power, type m, and type s? My guess based on this formulation is that it's something of the flavor "max power s.t. P(type m or type s or type I) <= alpha", but I wanted to check.
For 1 - keep in mind that frequentist inference cannot be used to support a statement like “the probability that variant A is better than variant B is X” or “the distribution of improvement from variant B over variant A is Y” - the only question a frequentist analysis can answer is, “assuming the null hypothesis distribution is true, the unlikeliness of the observed data is Z.” From that point of view, we find frequentist analysis simply is epistemologically unsuited for comparing multiple (or a continuum of) policy options, period.

Given this, we start to ask totally new types of experiment design questions. We no longer ask an unphysical question such as, “assuming effect size X and sample size N, what is the probability of falsely rejecting the null” - the idea of “rejecting the null” doesn’t map to any notion of optimal policy selection, so we just don’t care about such a question when designing an experiment. Instead we ask ourselves, “if the true effect size is X, what is the probability I will make a mistake in estimating that effect size, and how large a mistake?” or “if the true relationship between the target and the covariate is positive, what is the probability I’ll mistakenly think it’s negative?” - basing experiment design on how my physical beliefs can be wrong helps me make decisions. Basing it on tail properties of a null distribution does not.
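To make that design question concrete (numbers entirely made up): given an assumed true effect and the standard error you'd get at a candidate sample size, you can directly compute the chance of getting the sign wrong and the typical magnitude error - quantities in the spirit of what the Gelman & Carlin paper calls type s and type m.

    # Hypothetical sketch: sign ("type s") and magnitude ("type m") risk for an
    # effect estimate, as a function of the assumed true effect and standard error.
    import numpy as np

    def sign_and_magnitude_risk(true_effect, se, n_sims=200_000, seed=0):
        rng = np.random.default_rng(seed)
        est = rng.normal(true_effect, se, size=n_sims)   # sampling distribution of the estimate
        p_wrong_sign = (np.sign(est) != np.sign(true_effect)).mean()
        mean_exaggeration = (np.abs(est) / abs(true_effect)).mean()
        return p_wrong_sign, mean_exaggeration

    # a small assumed effect, measured noisily vs. with larger samples (smaller se)
    for se in (0.15, 0.05, 0.01):
        p_s, exag = sign_and_magnitude_risk(true_effect=0.1, se=se)
        print(f"se={se}: P(wrong sign)={p_s:.3f}, mean |estimate|/|truth|={exag:.2f}")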
For 2 - it really depends on the experiment. For causal inference, both type s and type m need to be very low, but the type I error rate can be high (I don’t care about rejecting a null). For inference where I only care about final predictive accuracy, this may not matter. For example, in an extreme case I could have two perfectly collinear predictors, which means their coefficients can be arbitrary as long as the sum yields the true coefficient on the underlying linear component. If my goal is causal inference of the effect size, this would ruin it, since type m error can be unboundedly bad. But if the goal is overall predictive accuracy, it doesn’t matter at all - I just ignore the coefficients.
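A tiny synthetic illustration of the collinearity point: with two exactly collinear predictors the individual coefficients aren't identified, so any causal reading of them can be arbitrarily wrong, while the fitted values are untouched.

    # Synthetic illustration: perfectly collinear predictors give arbitrary
    # coefficients but perfectly fine predictions.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1_000)
    X = np.column_stack([x, 2.0 * x])                 # second column is exactly 2x the first
    y = 3.0 * x + rng.normal(scale=0.1, size=1_000)   # true linear component is 3*x

    # lstsq returns the minimum-norm solution; any (b1, b2) with b1 + 2*b2 = 3
    # fits equally well, so neither coefficient is individually meaningful.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("coefficients:", beta)                              # not [3, 0]
    print("mean abs prediction error:", np.abs(X @ beta - y).mean())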
> “it still seems like the final sample sizes you calculate could end up being very sensitive to the prior.”
I prefer to flip this around. A frequentist model has a prior too, whether anyone wants to admit it or not - usually some unrealistic flat / uninformative prior or an improper prior. The results of a frequentist method are equally sensitive to this implicit (huge) assumption. A Bayesian approach at least makes it explicit, admits the sensitivity, puts the range of prior choices out in the open for skeptical review, lets you carry out sensitivity analysis, and very often relies on real data and domain expertise to posit a much more physically plausible prior.
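One way to make that explicitness pay off concretely (with made-up numbers) is a quick sensitivity check: refit under a few prior widths and see whether the conclusion actually moves. In a simple conjugate normal model that's a few lines:

    # Hypothetical prior-sensitivity check for a normal-mean model with known
    # observation noise: vary the prior sd and see how much the posterior moves.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=0.3, scale=1.0, size=2_000)   # made-up observations
    n, ybar, sigma = data.size, data.mean(), 1.0

    for prior_sd in (0.1, 1.0, 10.0):                   # from fairly tight to very diffuse
        # conjugate normal-normal update with prior mean 0
        post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
        post_mean = post_var * (n * ybar / sigma**2)
        print(f"prior sd={prior_sd}: posterior mean={post_mean:.3f}, sd={np.sqrt(post_var):.4f}")
    # with 2,000 observations the substantive conclusion is essentially unchanged
    # across these priors; with 20 observations it would shift noticeably, which
    # is exactly what a sensitivity analysis is meant to surface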