
I generally agree with the pros you list for Bayesian methods, but I think in companies that are running a lot of experiments, and not just optimizing a checkout page, they don't hold up as well. For example, you are often testing features that really are different from any feature before, so a prior is harder to get alignment on. There are also frequentist techniques like CUPED that use pre-experiment covariates to reduce variance, and then usually you're analyzing the results across user segments to look for heterogeneous treatment effects.

(FWIW, having analyzed a lot of experiments, I've been surprised at how often there is NOT a heterogeneous treatment effect across user segments: the baselines are often very different, but the direction and general magnitude of the lifts are typically similar. And the confounder that is most often relevant is simply some broad "engagement" dimension: users who use the product a lot might behave differently from users who do not.)
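
To make the CUPED point concrete, here's a minimal sketch of the adjustment as I understand it (the function name and toy data are mine):

    import numpy as np

    def cuped_adjust(y, x):
        # CUPED: subtract the part of y predicted by the pre-experiment
        # covariate x. theta = cov(x, y) / var(x) minimizes the variance
        # of the adjusted metric; its mean is unchanged in expectation,
        # so the usual difference-in-means test still applies.
        theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
        return y - theta * (x - x.mean())

    # Toy data: pre-experiment engagement (x) predicts the metric (y).
    rng = np.random.default_rng(0)
    x = rng.normal(10, 3, size=10_000)
    y = 0.8 * x + rng.normal(0, 1, size=10_000)
    print(np.var(y), np.var(cuped_adjust(y, x)))  # variance drops sharply

You then run the ordinary difference-in-means test on the adjusted metric, just with tighter intervals.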

In my experience, the biggest blocker to using Bayesian approaches has been the data and computation requirements. There are some closed-form solutions (e.g. https://www.evanmiller.org/bayesian-ab-testing.html , also https://www.chrisstucchio.com/blog/2014/bayesian_asymptotics... ), but even those are computationally expensive at scale, even though they spare you a lot of MCMC sampling.
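
For what it's worth, the closed form in that Evan Miller post is short enough to sketch. This is my transcription of his Pr(p_B > p_A) formula for Beta posteriors (the function name is mine, and it assumes integer success counts):

    import numpy as np
    from scipy.special import betaln

    def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b):
        # Pr(p_B > p_A) when p_A ~ Beta(alpha_a, beta_a) and
        # p_B ~ Beta(alpha_b, beta_b), computed in log space for stability.
        total = 0.0
        for i in range(int(alpha_b)):  # one term per success in the B arm
            total += np.exp(
                betaln(alpha_a + i, beta_a + beta_b)
                - np.log(beta_b + i)
                - betaln(1 + i, beta_b)
                - betaln(alpha_a, beta_a)
            )
        return total

    # Beta(1, 1) priors; A converted 100/1000 users, B converted 120/1000.
    print(prob_b_beats_a(1 + 100, 1 + 900, 1 + 120, 1 + 880))

Note the loop does work proportional to the number of successes, which is exactly the scale problem: even the "closed form" gets expensive when conversions are in the millions.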

But what is really interesting to me about your comment is:

> A/B Tests are not really controlled experiments, at least not in the same way clinical trials are. User behavior is always observational even if you have a control group

Could you expand on that? Isn't clinical data also observational: you observe the patient after some time and record their symptoms or various endpoint measurements?




> Could you expand on that? Isn't clinical data also observational

Generally in statistics there is a divide between controlled experiments and observational studies. The former is how most medical trials are run and the latter is how most anthropologists work.

In medical trials you can make sure demographics, lifestyle differences, etc. are controlled for before you even start the trial. In the more extreme case of experiments in the physical sciences, you can often control everything involved so tightly that the only difference between test and control is precisely the variable of interest.

In anthropology you can't go back in time and ask "what if this society had a higher male/female ratio and didn't go to war?"; you can only model what happened and attempt to bake some causal assumptions into that model to estimate what might have been. This is why detailed regression analysis is so important in these fields.

Having a test and control group in your A/B test is not really enough to establish the equivalent of a controlled study of rats in a laboratory, or of bacteria in petri dishes. In my experience it really helps to include causal models of what you know about your A/B testing population, to make sure that what you're observing is really what you think it is. I've had concrete examples where it looked like one variant was winning, and then checking the causal assumptions revealed that we were accidentally measuring something else.
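
To illustrate that failure mode with a made-up example (the numbers and setup are mine): if assignment accidentally correlates with an engagement segment, the aggregate comparison measures engagement rather than the variant:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000
    engaged = rng.random(n) < 0.3  # 30% highly engaged users
    # The bug: engaged users are over-represented in variant B.
    variant_b = rng.random(n) < np.where(engaged, 0.7, 0.4)
    # Ground truth: conversion depends only on engagement, not the variant.
    converted = rng.random(n) < np.where(engaged, 0.20, 0.05)

    print("aggregate:", converted[variant_b].mean(),
          "vs", converted[~variant_b].mean())
    for name, seg in [("engaged", engaged), ("casual", ~engaged)]:
        print(name + ":", converted[seg & variant_b].mean(),
              "vs", converted[seg & ~variant_b].mean())

In aggregate B looks like a clear winner, but within each segment the variants are identical; conditioning on the confounder is the kind of causal check that catches this.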

In practice you can reframe all of this as some sort of ANOVA and fit it just fine within a classical framework, but I find starting with a Bayesian methodology makes the process easier to reason about.
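
As a sketch of that classical reframing (the synthetic data and column names are mine): fit metric ~ variant * segment and read off the ANOVA table, where the interaction term is the heterogeneous-treatment-effect check:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(7)
    n = 5_000
    df = pd.DataFrame({
        "variant": rng.choice(["A", "B"], n),
        "segment": rng.choice(["casual", "engaged"], n, p=[0.7, 0.3]),
    })
    base = np.where(df["segment"] == "engaged", 5.0, 1.0)  # very different baselines
    lift = np.where(df["variant"] == "B", 0.2, 0.0)        # similar lift in both segments
    df["metric"] = base + lift + rng.normal(0, 1, n)

    model = smf.ols("metric ~ C(variant) * C(segment)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))  # main effects plus the interaction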



