Annoying A/B testing mistakes (posthog.com)
292 points by Twixes on June 16, 2023 | 149 comments



On point 7 (Testing an unclear hypothesis), while agreeing with the overall point, I strongly disagree with the examples.

> Bad Hypothesis: Changing the color of the "Proceed to checkout" button will increase purchases.

This is succinct and clear, and it is obvious what the variable/measure will be.

> Good hypothesis: User research showed that users are unsure of how to proceed to the checkout page. Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page. This will then lead to more purchases.

> User research showed that users are unsure of how to proceed to the checkout page.

Not a hypothesis, but a problem statement. Cut the fluff.

> Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page.

This is now two hypotheses.

> This will then lead to more purchases.

Sorry I meant three hypotheses.


* Turns out, folks are seeing the "buy" button just fine. They just aren't smitten with the product. Making "buy" more attention-grabbing gets them to the decision point sooner, so they close the window.

* Turns out, folks see the "buy". Many don't understand why they would want it. Some of those are converted after noticing and reading an explanatory blurb in the lower right. A more prominent "buy" button distracts from that, leading to more "no".

* For some reason, a flashing puke-green "buy" button is less noticeable, as evidenced by users closing the window at a much higher rate.

Including untestable reasoning in a chain of hypotheses leads to false confirmation of your clever hunches.


The biggest issue with those three hypotheses is that one of them, noticing the button, almost certainly isn't being tested. But how the test goes will inform how people think about that hypothesis.


Rate of traffic on the checkout page, divided by overall traffic.

We see a lot of ghosts in A/B testing because we are loosey goosey about our denominators. Mathematicians apparently hate it when we do that.


That doesn't test noticing the button, that tests clicking the button. If the color changes it is possible that fewer people notice it but are more likely to click in a way that increases total traffic. Or more people notice it but are less likely to click in a way that reduces traffic.


Good observation that the noticing doesn’t get tested.

Would there be any benefit from knowing the notice rate though? After all, the intended outcome is increased sales by clicking.


This is what I was driving at in my original comment - the intermediary steps are not of interest (from the POV of the hypothesis/overall experiment), so why mention them at all.


Probably not, but then that hypothesis should not be part of the experiment.


It is surely helpful to have a "mechanism of action" so that you're not just blindly AB testing and falling victim to coincidences like in https://xkcd.com/882/ .

Not sure if people do this, but with a mechanism of action in place you can state a prior belief and turn your AB testing results into actual posteriors instead of frequentist metrics like p-values which are kind of useless.


That xkcd comic highlights the problem with observational (as opposed to controlled) studies. TFA is about A/B testing, i.e. controlled studies. It’s the fact that you (the investigator) are controlling the treatment assignment that allows you to draw causal conclusions. What you happen to believe about the mechanism of action doesn’t matter, at least as far as the outcome of this particular experiment is concerned. Of course, your conjectured mechanism of action is likely to matter for what you decide to investigate next.

Also, frequentism / Bayesianism is orthogonal to causal / correlational interpretations.


I think what kevinwang is getting at is that if you A/B test a static version A against enough versions of B, at some point you will get statistically significant results just by repeating it often enough.

Having a control doesn't mean you can't fall victim to this.


You control statistical power and the error rate, and choose to accept a % of false results.


AB tests are still vulnerable to p-hacking-esque things (though usually unintentional). Run enough of them and your p value is gonna come up by chance sometimes.

Observational ones are particularly prone because you can slice and dice the world into near-infinite observation combinations, but people often do that with AB tests too. Shotgun approach, test a bunch of approaches until something works, but if you'd run each of those tests for different significance levels, or for twice as long, or half as long, you could very well see the "working" one fail and a "failing" one work.


The xkcd comic seems more about the multiple comparisons problem (https://en.wikipedia.org/wiki/Multiple_comparisons_problem), which could arise in both an observational or controlled setting.


I don't think these examples are bad. From a clarity standpoint, where you have multiple people looking at your experiments, the first one is quite bad and the second one is much more informative.

Requiring a user problem, proposed solution, and expected outcome for any test is also good discipline.

Maybe it's just getting into pedantry with the word "hypothesis" and you would expect the other information elsewhere in the test plan?


The problem is the hand-wavy "user research".

If you have done that properly, why A/B test? If you did it improperly, why bother?

A/B testing starts from a hypothesis, because A/B testing is done to inform a Bayesian analysis to identify causes.

If one already knows that the reason is 'button not visible enough', A/B testing is almost pointless.

Not entirely pointless, because you can still do A/B testing to validate that the change is in the right direction, but investing developer time in production-quality code and risking the business just to validate something one already knows seems crazy compared to just asking a focus group.

When you are unsure about the answer, that's when investing in A/B testing for discovery makes the most sense.


> ab testing is almost pointless

Except you can never be certain that the changes made were impactful in the direction you're hoping unless you measure it. Otherwise it's just wishful thinking.


I didn't say anything to the contrary; the quotation loses all the context.

But if you want to verify a hypothesis and control for confounding factors, the A/B test needs to be part of a Bayesian analysis, and if you're doing that, why also pay for the prior research?

By going down the path of user research > production-quality release > validation of the hypothesis, you are basically paying for research twice and paying for development once, regardless of whether the testing is successful or not.

It's more efficient to either use a Bayesian hypothesis + A/B testing for research (pay for development once per hypothesis, collect evidence, and steer in the direction the evidence points) or use user research over a set of POCs (pay for research once per hypothesis, develop in the direction the research points).

If your research needs validation, you paid for research you might not need. If you start research already knowing the prior (the user doesn't see the button), you're not actually doing research, you're just gold-plating a hunch, so why pay for research at all? Just skip to the testing phase. If you want to learn from the users, you do A/B testing, but again, not against a hunch, but against a set of hypotheses, so you can eliminate confounding factors and narrow down the confidence interval.


Having a clearly stated hypothesis and supplying appropriate context separately isn't pedantry. It is semantics, but words result in actions that matter.


As kevinwang has pointed out in slightly different terms: the hypothesis that seems wooly to you seems sharply pointed to others (and vice versa) because explanationless hypotheses ("changing the colour of the button will help") are easily variable (as are the colour of the xkcd jelly beans), while hypotheses that are tied strongly to an explanation are not. You can test an explanationless hypothesis, but that doesn't get you very far, at least in understanding.

As usual here I'm channeling David Deutsch's language and ideas on this, I think mostly from The Beginning of Infinity, which he delightfully and memorably explains using a different context here: https://vid.puffyan.us/watch?v=folTvNDL08A (the yt link if you're impatient: https://youtu.be/watch?v=folTvNDL08A - the part I'm talking about starts at about 9:36, but it's a very tight talk and you should start from the beginning).

Incidentally, TED head Chris Anderson said that one of these Deutsch TED talks - not sure if this one or the earlier one - was his all-time favourite.

plagiarist:

> That doesn't test noticing the button, that tests clicking the button. If the color changes it is possible that fewer people notice it but are more likely to click in a way that increases total traffic.

"Critical rationalists" would first of all say: it does test noticing the button, but tests are a shot at refuting the theory, here by showing no effect. But also, and less commonly understood: even if there is no change in your A/B - an apparently successful refutation of the "people will click more because they'll notice the colour" theory - experimental tests are also fallible, just as everything else.


Will watch the TED talk, thanks for sharing. I come at this from a medical/epidemiological background prior to building software, and no doubt this shapes my view on the language we use around experimentation, so it is interesting to hear different reasoning.


Good to see an open mind! I think most critical rationalists would say that epidemiology is a den of weakly explanatory theories.

Even though I agree, I'm not sure that's 100% epidemiology's fault by any means: it's just a very difficult subject, at least without measurement technology, computational power, and probably (machine or human) learning and theory-building that even now we don't have. But, there must be opportunities here for people making better theories.


The biggest mistake engineers make is determining sample sizes. It is not trivial to determine the sample size for a trial without prior knowledge of effect sizes. Instead of waiting for a fixed sample size, I would recommend using a sequential testing framework: set a stopping condition and perform a test for each new batch of sample units.

This is called optional stopping and it is not possible using a classic t-test, since Type I and II errors are only valid at a determined sample size. However, other tests make it possible: see safe anytime-valid statistics [1, 2] or, simply, bayesian testing [3, 4].

[1] https://arxiv.org/abs/2210.01948

[2] https://arxiv.org/abs/2011.03567

[3] https://pubmed.ncbi.nlm.nih.gov/24659049/

[4] http://doingbayesiandataanalysis.blogspot.com/2013/11/option...
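
To make the Bayesian flavour [3, 4] concrete, here is a minimal sketch of a batch-by-batch stopping rule using a Beta-Binomial model. It is only an illustration: data_batches and the 0.99 decision threshold are placeholders, not a recommendation.

  import numpy as np

  rng = np.random.default_rng(0)

  def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
      # Beta(1, 1) priors; each arm's posterior is Beta(1 + conversions, 1 + non-conversions).
      post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
      post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
      return (post_b > post_a).mean()

  conv_a = n_a = conv_b = n_b = 0
  for a_conv, a_n, b_conv, b_n in data_batches:   # placeholder: each item is one batch of results
      conv_a += a_conv; n_a += a_n
      conv_b += b_conv; n_b += b_n
      p = prob_b_beats_a(conv_a, n_a, conv_b, n_b)
      if p > 0.99 or p < 0.01:                    # placeholder decision threshold
          break                                   # stop as soon as the posterior is decisive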


People often don’t determine sample sizes at all! And doing power calculations without an idea of effect size isn’t just hard but impossible. It’s one of the inputs to the formula. But at least it’s fast so you can sort of guess and check.

Anytime valid inference helps with this situation, but it doesn’t solve it. If you’re trying to detect a small effect, it would be nicer to figure out you need a million samples up front versus learning that because your test with 1,000 samples a day took three years.

Still, anytime is way better than fixed IMO. Fixed almost never really exists. Every A/B testing platform I’ve seen allows peeking.

I work with the author of the second paper you listed. The math looks advanced, but it’s very easy to implement.
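
For reference, the up-front calculation being described is quick to sketch. This is just the standard two-proportion formula with made-up inputs (a 5% baseline rate and a 0.5-point minimum detectable effect), not anyone's production code:

  from scipy.stats import norm

  def n_per_arm(p_base, mde, alpha=0.05, power=0.8):
      # Sample size per arm for a two-sided two-proportion z-test,
      # given a baseline rate and an absolute minimum detectable effect.
      p_alt = p_base + mde
      z_a = norm.ppf(1 - alpha / 2)
      z_b = norm.ppf(power)
      variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
      return (z_a + z_b) ** 2 * variance / mde ** 2

  print(n_per_arm(0.05, 0.005))  # roughly 31k users per arm for a 5% -> 5.5% lift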


The biggest mistake is engineers owning experimentation. They should be owned by data scientists.

I realize that is a luxury, though, but I also see this trend in blue chip companies.


Did a data scientist write this? You don't need to be a member of a priesthood to run experiments. You just need to know what you're doing.


I agree with both sides here. :) DS should own experimentation, AND engineers should be able to run a majority of experiments independently.

As a data scientist at a "blue chip company", my team owns experimentation, but that doesn't mean we run all the experiments. Our role is to create guidelines, processes, and tooling so that engineers can run their own experiments independently most of the time. Part of that is also helping engineers recognize when they're dealing with a difficult/complex/unusual case where they should bring DS in for more bespoke hands-on support. We probably only look at <10% of experiments (either in the setup or results phase or both), because engineers/PMs are able to set up, run, and draw conclusions from most of the experiments without needing us.


... and by some definition you'd be a data scientist yourself. (Regardless of your job title)


Surprised no one said this yet, so I'll bite the bullet.

I don't think A/B testing is a good idea at all for the long term.

Seems like a recipe for having your software slowly evolved into a giant heap of dark patterns. When a metric becomes a target, it ceases to be a good metric.


More or less, it tells you the "cost" of removing an accidental dark pattern. For example we had three plans and a free plan. The button for the free plan was under the plans, front-and-center ... unless you had a screen/resolution that most of our non-devs/designers had.

So, for users' most common resolution, the button was just below the fold.

This was an accident though some of our users called us out for it -- suggesting we'd removed the free plan altogether.

So, we a/b tested moving the button to the top.

Moving it would REALLY have hurt the bottom line, and it explained some growth we'd experienced. To remove the "dark pattern" would mean laying off some people.

I think you can guess which one was chosen and is still implemented.


When an organization has many people, I think that many of these are a continuum from accidental to intentional.


When I left that company it had grown massive and the product was full of dark patterns… I mean bugs, seriously, they were tracked as bugs that no one could fix without severe consequences. No one put them there on purpose. When you have hundreds of devs working on the same dozen files (onboarding/payments/etc) there are bound to be bad merges (when a git merge results in valid but incorrect code), misunderstandings of requirements, etc.


Good multivariate testing and (statistically significant) data don't do that. They show you lots of ways to improve your UX, and whether your guesses at improving UX actually work. Example from TFA:

> more people signed up using Google and Github, overall sign-ups didn't increase, and nor did activation

Less friction on login for the user, 0 gains in conversions, they shipped it anyway. That's not a dark pattern.

If you're intentionally trying to make dark patterns it will help with that too I guess; the same way a hammer can build a house, or tear it down, depending on use.


I often see this argument, and although I can happily accept the examples given in defence as making sense, I never see an argument that this multivariate approach solves the problem in general and doesn't merely ameliorate some of the worst cases (I suppose I'm open to the idea that it could at least get it from "worse than the disease" to "actually useful in moderation").

Fundamentally, if you pick some number of metrics, you're always leaving some number of possible metrics "dark", right? Is there some objective method of deciding which metrics should be chosen, and which shouldn't?


"user trust" is a good one, abeit hard to measure

Rolled out some tests to streamline cancelling subscriptions in response to user feedback, with Marketing's begrudging approval.

Short term, predictably, we saw an increase in cancellations, then a decrease and eventual levelling out. Long term we continued to see an increase in subscriptions after rollout, and focused on more important questions like "how do we provide a good product that a user doesn't want to cancel?"


So, it's just a process of trial and error, in terms of what metrics to choose and how to weight them?


> Seems like a recipe for having your software slowly evolved into a giant heap of dark patterns.

Just don't test for dark patterns?


Well, how does one "just not do" that though, specifically?


First determine if what you want to test for is a dark pattern?


And how do you determine that? I'm not trying to be coy here, I genuinely don't understand.

Because you're not testing for patterns, what you test is some measurable metric(s) you want to maximise (or minimise), right? So how can you determine which metrics lead to dark patterns, without just using them and seeing if dark patterns emerge? And how do you spot these dark patterns if by their very nature they're undetectable by the metrics you chose to test first?


The "patterns" in dark patterns doesn't mean they're an emergent property of the system. You test whether a change improves a metric in A/B tests. You avoid accidental dark patterns in the change like you avoid bugs that cause accidental data loss in the change: you think carefully about what you're doing, maybe a reviewer looks it over, and so on. This isn't perfect, but nothing is.


[flagged]


Well this discussion isn't helpful at all.

Why reply at all if you're just gonna waste my time?


Let's ship the project of those that bang the table, and confirm our biases instead.


Please try to be serious and don't put words in my mouth. I'm actually trying to learn and have a serious discussion here.

Thanks.


What they're describing is a serious problem in modern product innovation, so maybe it is you who should take it seriously, aye?

Let's rephrase: if we are not to test for changes in user behaviour that give positive signal to progressive innovation, then what should we do? And how should we avoid the loudest voices in a room full of whiteboards creating a product that biases towards the needs of tech company and startup employees?


I don't think it should even be legal. Why do these corporations think they can perform human experimentation on unwitting subjects for profit?


What if it's at an airport queue, where they are testing how to improve queue times and whether having the queue be in a straight line or in zig-zag makes it faster for passing security checks?

Should the passengers sign an agreement before being "experimented on", and having them be split in two groups, where one stays in a straight line and one in zig-zag?


I built an internal a/b testing platform with a team of 3-5 over the years. It needed to handle extreme load (hundreds of millions of participants in some cases). Our team also had a sister team responsible for teaching/educating teams about how to do proper a/b testing -- they also reviewed implementations/results on-demand.

Most of the a/b tests they reviewed (note the survivorship bias here, they were reviewed because they were surprising results) were incorrectly implemented and had to be redone. Most companies I worked at before or since did NOT have a team like this, and blindly trusted the results without hunting for biases, incorrect implementations, bugs, or other issues.


> It needed to handle extreme load (hundreds of millions of participants in some cases).

I can see extreme loads being valuable for an A/B test of a pipeline change or something that needs that load... but for the kinds of A/B testing UX and marketing does, leveraging statistical significance seems to be a smart move. There is a point beyond which a larger sample is only trivially more accurate than a smaller one.

https://en.wikipedia.org/wiki/Sample_size_determination


Even if you're testing 1% of 5 million visitors, you still need to handle the load for 5 million visitors. Most of the heavy experiments came from AI-driven assignments (vs. behavioral). In this case the AI would generate very fine-grained buckets and assign users into them as needed.


Do you know if there were common mistakes for the incorrect implementations? Were they simple mistakes or more because someone misunderstood a nuance of stats?


I don't remember many specifics, but IIRC, most of the implementation-related ones were due to an anti-pattern from the older a/b testing framework. Basically, the client would try to determine if the user was eligible to be in the A/B test (instead of relying on the framework), then, in an API handler, get the user's assignment. This would mean the UI would think the user wasn't in the A/B test at all, while the API would see the user as in the A/B test. In this case, the user would be experiencing the 'control' while the framework thought they were experiencing something else.

That was a big one for awhile, and it would skew results.

Hmmm, another common one was doing geographic experiments when part of the experiment couldn't be geofenced for technological reasons. Or forgetting that a user could leave a geofence and removing access to the feature after they'd already been given access to it.

Almost all cases boiled down to showing the user one thing while thinking we were showing them something else.


I wonder if that falls under mistake #4 from the article, or if there's another category of mistake: "Actually test what you think you're testing." Seems simple but with a big project I could see that being the hardest part.


I actually just read it (the best I could, the page is really janky on my device). I didn’t see this mistake on there, and it was the most common one we saw by a wide margin in the beginning.

Number 2 (1 in the article) was solved by the platform. We had two activation points for UI experiments. The first was getting the user's assignment (which could be cached for offline usage). At that point they became part of the test, but there was a secondary one that happened when the component under test became visible (whether it was a page view or a button). If you turned on this feature for the test, you could analyze it using either the first or the secondary point.

One issue we saw with that (which is potentially specific to this implementation), was people forgetting to fire the secondary for the control. That was pretty common but you usually figured that out within a few hours when you got an alert that your distribution looked biased (if you specify a 10:20 split, you should get a 10:20 ratio of activity).
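
For what it's worth, the kind of ratio alert described above can be as simple as a chi-square goodness-of-fit test against the configured split. A rough sketch with invented counts (not the platform's actual implementation):

  from scipy.stats import chisquare

  observed = [9_400, 20_600]                     # invented assignment counts for a 10:20 split
  total = sum(observed)
  expected = [total * 10 / 30, total * 20 / 30]

  stat, p = chisquare(observed, f_exp=expected)
  if p < 0.001:
      print("possible sample ratio mismatch - check exposure/activation logging")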


Same experience here for the most part. We're working on migrating away from an internal tool which has a lot of problems: flags can change in the middle of user sessions, limited targeting criteria, changes to flags require changes to code, no distinction between feature flags and experiments, experiments often target populations that vary greatly, experiments are "running" for months and in some cases years...

Our approach to fixing these problems starts with having a golden path for running an experiment which essentially fits the OP. It's still going to take some work to educate everyone but the whole "golden path" culture makes it easier.


When we started working on the internal platform, these were exactly the problems we had. When we were finally deleting the old code, we found a couple of experiments that had been running for nearly half a decade.

For giggles, we ran an analysis on those experiments: no difference between a & b.

That's usually the best result you can get, honestly. It means you get to make a decision of whether to go with a or b. You can pick the one you like better.


That's a great outcome for a do-no-harm test. Horrible outcome when you're expecting a positive effect.


It’s an experiment, you shouldn’t be “expecting” anything. You hypothesize an effect, but that doesn’t mean it will be there and if you prove it wrong, you continue to iterate.


> you shouldn’t be “expecting” anything

This is the biggest lie in experimentation. Of course you expect something. Why are you running this test over all other tests?

What I'm challenging is that if a team has spent three months building a feature, you a/b test it and find no effect, that is not a good outcome. Having a tie where you get to choose anything is worse than having a winner that forces your hand. At least you have the option to improve your product.


> What I'm challenging is that if a team has spent three months building a feature, you a/b test it and find no effect, that is not a good outcome.

That's a great outcome. At one company we spent a few months building a feature only for it to fail the test, now that was a bad outcome. The feature's code was so good, we ended up refactoring it to look like the old feature and switching to that. So there was a silver lining, I guess.

The key takeaway was to never a/b test a feature that big again. Instead we would spend a few weeks building something that didn't need to scale or be feature complete (IOW, an MVP/POC with shitty code).

If it had come out that there was no difference, we would have gone with the new version's code because it was so well built -- alternatively, if the code was shit, we probably would have thrown it out. That's why it's the best result. You can write shitty POC code and toss it out -- or keep it if you really want.


> Of course you expect something. Why are you running this test over all other tests?

Because it has the best chance to prove/disprove your hypothesis. That's it. Even if it doesn't, all that means is that the metrics you're measuring are not connected to what you're doing. There is more to learn and explore.

So, you can hope that it will prove or disprove your hypothesis, but there is no rational reason to expect it to go either way.


But why this hypothesis? Sometimes people do tests just to learn as much as they can, but 95%+ of the time they’re trying to improve their product.

> there is no rational reason to expect it to go either way.

Flipping a coin has the same probability of heads during a new moon as during a full moon. I’m going to jump ahead and expect that you agree with that statement.

If I phrase that as a hypothesis and do an experiment, suddenly there’s no rational reason to expect it to go either way? Of course there is. The universe didn’t come into being when I started my experiment.

Null hypothesis testing is a mental hack. A very effective one, but a hack. There is no null. Even assuming zero knowledge isn’t the most rational thing. But, the hack is that history has shown that when people try to act like they know nothing, they end up with better results. People are so overconfident that pretending they knew nothing improved things! This doesn’t mean it’s the truth, or even the best option.


I’d suggest reading up on the experimental method. There’s also a really good book: Trustworthy Online Controlled Experiments.

You are trying to apply science to commercial applications. It works, but you cannot twist it to your will or it stops working and serves no purpose other than a voodoo dance.

> Flipping a coin has the same probability of heads during a new moon as during a full moon. I’m going to jump ahead and expect that you agree with that statement.

As absurd as it sounds, it’s a valid experiment and I actually couldn’t guess if the extra light from a full moon would have a measurable effect on a coin flip. Theoretically it would, as light does impart a force… but whether or not we could realistically measure it would be interesting.

Yes, I’m playing devil's advocate, but “if the button is blue, more people will convert” is just as absurd a hypothesis, yet it produced results.


Late response: I’ve read that book. I also work as a software engineer on the Experimentation Platform - Analysis team at Netflix. I’m not saying that makes me right, but I think it supports that my opinion isn’t from a lack of exposure.

> You are trying to apply science to commercial applications. It works, but you cannot twist it to your will or it stops working and serves no purpose other than a voodoo dance.

With this paragraph, you’ve actually built most of the bridge between my viewpoint and yours. I think the common scientific method works in software sometimes. When it does, there are simple changes to make it so that it will give better results. But most of the time, people are in the will-twisting voodoo dance.

People also bend their problems so hard to fit science that it’s just shocking to me. In no other context do I experience a rational, analytical adult arguing that they’re unsure if a full moon will measurably affect a coin flip. If someone in a crystal shop said such a thing, they’d call it woo.


The one mistake I assume happens too much is trying to measure "engagement".

Imagine a website is testing a redesign, and they want to decide if people like it by measuring how long they spend on the site to see if it's more "engaging". But the new site makes information harder to find, so they spend more time on the site browsing and trying to find what they're looking for.

Management goes, "Oh, users are delighted with the new site! Look how much time they spend on it!" not realizing how frustrated the users are.


Engagement is my favorite form of metrics pseudoscience. A classic example is when engagement actually goes up, not because the design change is better, but because it frustrates and confuses the user, causing them to click around more and remain on the site longer. Without a focus group, there's really no way to determine whether the users are actually "delighted".

EDIT: For some reason it didn't compute with me that you already referred to the same example. I've seen that exact scenario play out in real life, though.


I bet the reddit redesign used a similar faulty measurement of engagement.

"People spent more time scrolling the feed, people must enjoy it!"

No, the feed takes up more space, so now I can only fit 1 or 2 items on my screen at once, rather than 10, so I have to scroll more to see more content.


That would not surprise me in the least! In fact, that's exactly what happened at a company I used to work for (that shall remain nameless). At the behest of the design team, we implemented a complete redesign of our site which included changing the home page so that at most only two media items could be on-screen at a time, and the ads which used to be simple banners now were woven between the feed of items.

I remember sitting in a meeting where we had A/B tested this new homepage, and witnessing some data analyst guy giving a presentation which included how "engagement in the B-group was increased by N-percent!!!" The directors of web content were awestruck by this despite no context or explanation as to why supposed "engagement" was higher with the new design. The test wasn't even carried out for a long duration of time. For all anyone knew, users were confused and spent more time clicking around because they were looking for something they were accustomed to in the original design. And no, it did not matter that I brought up my reasons for skepticism; anything that made a number increase made it into the final design.

Then, we actually had focus groups, long after the point at which we should have been consulting them, and the feedback we received was overwhelmingly lukewarm or negative. Much of it vindicated my concerns the entire time; users didn't actually like scrolling. Then again, I guess if they're viewing more ads, then who cares what the user thinks?? Never have I felt more like I was living in a Dilbert comic than that time.


If that also resulted in little or no change in how often you (and everyone) opened reddit each day, then it is a "success" for them. They have your eyeballs for longer, so you likely see more ads.

If only they were trying to maximise enjoyment and not addictiveness. They don't care at all about enjoyment, just like Facebook doesn't care about genuine connection to family and friends, or twitter to useful and constructive discussion that leads to positive social change.


LinkedIn is a good example, I think. One day I got a “you have a new message” email. I clicked it, thinking, well, someone has messaged me, right? It turned out to be just bullshit, someone in my network had just posted something.

I’m sure the first few of those got a lot of clicks, but it prompted me to ignore absolutely everything that comes from LinkedIn except for actual connection requests from people I know. Lots of clicks but also lots of pissed off people. I guess the latter is harder to measure.


This isn't really the case, because engagement isn't the only metric people look at. They might notice the new design increases engagement but hurts retention of new users. If your site makes money as a function of engagement and no other metrics are being hurt, is it really a problem if people spend more time on your site due to a "poor" design?


That's not Simpson's Paradox. Simpson's Paradox is when the aggregate winner is different from the winner in each element of a partition, not just some of them


Also, doesn't their suggested approach amount to multiple testing? In other words, a kind of p-hacking: https://en.wikipedia.org/wiki/Multiple_comparisons_problem

Edit - and this: http://www.stat.columbia.edu/~gelman/research/unpublished/p_...


Yeah, a good AB testing framework would either refuse to let you break things down too much or have a large warning about the results not being significant, but that doesn't always stop the business-types from trying to wiggle in some way for them to show a win.


Yes, I don’t think it’s possible to observe a simpson’s paradox in a simple conversion test, either.

Simpson’s paradox is about spurious correlations between variables - conversion analysis is pure Bayesian probability.

It shouldn’t be possible to have a group as a whole increase its probability to convert, while having every subgroup decrease its probability to convert - the aggregate has to be an average of the subgroup changes.


Are you sure?

Consider the case where iOS users are more likely to convert than Android users, but you currently have very few iOS users. You then A/B test a new design that imitates iOS, but has awful copy. Both iOS and Android users are less likely to convert, but it attracts more iOS users.

The group as a whole has higher conversion because of the demographic shift, but every subgroup has less.
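
To put made-up numbers on that: say the control arm gets 90 Android visitors converting at 10% and 10 iOS visitors at 40%, i.e. 9 + 4 = 13 conversions per 100 (13%). The variant arm gets 50 Android visitors at 8% and 50 iOS visitors at 36%, i.e. 4 + 18 = 22 per 100 (22%). Both subgroups convert worse under the variant, yet the aggregate rate rises because the mix shifts toward the higher-converting iOS group.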


I don't follow. If one bucket has many more iOS users, it seems like you have done a bad job randomizing your treatment?


It could be self-selection happening after you randomized the groups. For example a desktop landing page advertising an app, which might be installed on either mobile operating system.


Simpson's paradox is sometimes about spurious correlations, but the original paradox Simpson wrote about was simply a binary question with 84 subgroups, where 3 or 4 subgroups with the outlying answer just had a significant enough amount of all samples, and a significant enough effect, to mutate the whole.


Exactly.

On that topic – what do you do when you observe that in your test results? What's the right way to interpret the data?


Let's consider an example that would be a case of Simpson's Paradox. Suppose you are A/B testing two different landing pages, and you want to know which will make more people become habitual users. You partition on whether the user adds at least one friend in their first 5 minutes on the platform.

It might be that landing page A makes people who add a friend in the first 5 minutes more likely to become habitual users, and it also makes people who don't add a friend in the first 5 minutes more likely to become habitual users. But page A makes people less likely to add a friend in the first 5 minutes, and people who add a friend in the first 5 minutes are overwhelmingly more likely to become habitual users than people who don't.

So, in this case at least, it seems like the aggregate statistics are most relevant, but the fact that page A is bad mainly because it makes people less likely to add a friend in the first 5 minutes is also very interesting; maybe there is some way of combining A and B to get the good qualities of each and avoid the bad qualities of both.


With random bucketing happening at the global level for any test, the proper thing to do is to take any segments that show interesting (and hopefully statistically significant) results that differ from the global results and test those segments individually so the random bucketing happens at that segment level.

There are two issues at play here -- one is that the sample sizes for the segments may not be high enough, the other is that the more segments you look at, the greater the probability of finding a false positive.
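
On the second issue, one hedge is to at least adjust segment-level p-values for multiple comparisons before following anything up. A minimal sketch (the segment names and p-values are invented):

  from statsmodels.stats.multitest import multipletests

  segment_p = {"iOS": 0.03, "Android": 0.20, "France": 0.04, "US": 0.60, "new users": 0.01}

  reject, p_adj, _, _ = multipletests(list(segment_p.values()), alpha=0.05, method="fdr_bh")
  for name, p, keep in zip(segment_p, p_adj, reject):
      # Only segments that survive the correction are worth a dedicated follow-up test.
      print(name, round(p, 3), "re-test this segment" if keep else "likely noise")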


It can only happen with unequal populations. If you decide to include people in the control or test group randomly you're fine (you can use statistical tests to rule out sample bias).


I'm the author of this blog. Thank you for calling this out! I'll update the example to fix this :)


The example is fine, you are calling out confounding variables. Just call them confounders, instead of Simpson's paradox.


FWIW, the arithmetic in that example also has a glitch. For the "Mobile" "Control" case, 100/3000 is about 3% rather than 10%.


What it is is confounding


I want an A/B test framework that automatically optimizes the size of the groups to maximize revenue.

At first, it would pick say a 50/50 split. Then as data rolls in that shows group A is more likely to convert, shift more users over to group A. Keep a few users on B to keep gathering data. Eventually, when enough data has come in, it might turn out that flow A doesn't work at all for users in France - so the ideal would be for most users in France to end up in group B, whereas the rest of the world is in group A.

I want the framework to do all this behind the scenes - and preferably with statistical rigorousness. And then to tell me which groups have diminished to near zero (allowing me to remove the associated code).


Sadly, there is an issue with the Novelty Effect[1]. If you push traffic to the current winner, it probably won't validate that it's the actual winner. So you may trade more conversions now, for a higher churn than you can tolerate later.

For example, you run two campaigns:

1. Get my widgets, one year only 19.99!

2. Get my widgets, first year only 19.99!

The first one may win, but they all cancel in the second year because they thought it was only for one year. They all leave reviews complaining that you scammed them.

So, I would venture that this idea is a bad one, but sounds good on paper.

[1]: https://medium.com/geekculture/the-novelty-effect-an-importa...

PS. A/B tests don't just provide you with evidence that one solution might be better than the other, they also provide some protection in that a number of participants will get the status-quo.


> So, I would venture that this idea is a bad one, but sounds good on paper.

It's a great idea, it's just vulnerable to non-stationary effects (novelty effect, seasonality, etc). But it's actually no worse than fixed time horizon testing for your example if you run the test less than a year. You A/B test that copy for a month, push everyone to A, and you're still not going to realize it's actually worse.


Yeah. If churn is part of the experiment, then even after you stop the a/b test for treatment, you may have to wait at least a year before you have the final results.


As others have mentioned, you're referring to Thompson sampling and plenty of testing providers offer this (and if you have any DS on staff, they'll be more than happy to implement it).

My experience is that there's a good reason why this hasn't taken off: the returns for this degree of optimization are far lower than you think.

I once worked with a very eager, but junior DS who thought that we should build out a massive internal framework for doing this. He didn't quite understand the math behind it, so I built him a demo to explain the basics. What we realized in running the demo under various conditions is that the total return on adding this complexity to optimization was negligible at the scale we were operating at and required much more complexity than our current setup.

This pattern repeats in a lot of DS related optimization in my experience. The difference between a close guess and perfectly optimal is often surprisingly little. Many DS teams perform optimizations on business processes that yield a lower improvement in revenue than the salary of the DS that built it.


Small nit: it’s a bad idea if NPV of future returns is less than the cost. If someone making $100k/yr can produce one $50k/yr optimization that NPV’s out to $260k, it’s worth it. I suspect you meant that, just a battle I have at work a lot with people who only look at single-year returns.


Besides complexity, a price you pay with multi-armed bandits is that you learn less about the non-optimal options (because as your confidence grows that an option is not the best, you run fewer samples through it). It turns out the people running these experiments are often not satisfied to learn "A is better than B." They want to know "A is 7% better than B," but a MAB system will only run enough B samples to make the first statement.


Get yourself a multi-armed bandit and some Thompson sampling https://engineering.ezcater.com/multi-armed-bandit-experimen...
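
For anyone curious what that looks like, a minimal Beta-Bernoulli Thompson sampling sketch is below. The two conversion rates are made up (and unknown to the algorithm in a real test); this ignores the novelty/seasonality caveats discussed elsewhere in the thread.

  import numpy as np

  rng = np.random.default_rng(1)
  true_rates = {"A": 0.05, "B": 0.06}   # made up; in reality these are what you're trying to learn
  wins = {k: 0 for k in true_rates}     # conversions observed per variant
  trials = {k: 0 for k in true_rates}   # visitors assigned per variant

  for _ in range(100_000):
      # Draw a plausible conversion rate for each variant from its Beta posterior,
      # then show this visitor whichever variant drew the highest sample.
      draws = {k: rng.beta(1 + wins[k], 1 + trials[k] - wins[k]) for k in true_rates}
      arm = max(draws, key=draws.get)
      trials[arm] += 1
      wins[arm] += rng.random() < true_rates[arm]

  print(trials)   # most traffic drifts to B, while A keeps getting a trickle of exploration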


Curious about your use case.

Is the idea that you want to optimize the conversion and then remove the experiment code, keeping the winning variant?

Or would you prefer to keep the code in and have it continuously optimize variants?


I'd expect to be running tens of experiments at any one time. Some of those experiments might be variations in wording or colorschemes - others might be entirely different signup flows.

I'd let the experiment framework decide (ie. optimize) who gets shown what.

Over time, the maintenance burden of tens of experiments (and every possible user being in any combination of experiments) would exceed the benefits, so then I'd want to end some experiments, keeping just whatever variant performs best. And I'd be making new experiments with new ideas.


There might be a particular situation where B might be more effective than A, and therefore should be kept, if only for that specific situation. There might be a cutoff point where maintaining B would cost more than it's worth, but that's a parameter you will have to determine for each test.


That sounds like you don't want A/B testing at all.


Indeed - I really want A/B testing combined with conversion optimization.


Why can't users just tell me what works!


Exactly. That just sounds like a Bayesian update



Posthog is on developerdan's "Ads & Tracking" blocklist[1], if you're wondering why this doesn't load.

[1]: https://github.com/lightswitch05/hosts/blob/master/docs/list...


Just noticed that myself. It's also in the Adguard DNS list.


Another challenge, related more to implementation than theory, is having too many experiments running in parallel.

As a company grows there will be multiple experiments running in parallel executed by different teams. The underlying assumption is that they are independent, but it is not necessarily true or at least not entirely correct. For example a graphics change on the main page together with a change in the login logic.

Obviously this can be solved by communication, for example documenting running experiments, but like many other aspects in AB testing there is a lot of guesswork and gut feeling involved.


A better solve is E2E or unit tests to make sure A/B segments aren't conflicting. At the enterprise level there's simply too many teams testing too much to keep track of it in, say, a spreadsheet.


The biggest mistake engineers make about A/B testing is not recognizing local maxima. Your test may be super successful, but there may be an even better solution that's significantly different than what you've arrived at.

It's important to not only A/B test minor changes, but occasionally throw in some major changes to see if it moves the same metric, possibly even more than your existing success.


If I read the first mistake correctly, then getFeatureFlag() has the side effect of counting how often it was called, and this is used to calculate the outcome of the experiment? Wow. I don't know what to say....


Yeah I felt that way too. Initially I thought I wasn't sure what I was missing, since the only difference is that the order of the checks is switched, and the function will still return the same true/false in both cases. Then I thought about side effects and it felt icky.


That's how every one of these tools works, that’s the whole point of using them: you only call them when you’re going to actually show the variation to the user. If you’re running a test that modifies the homepage only, you shouldn’t be calling that decision method in, say, your global navigation code that you show everyone. Or, for instance, if your test only affects how the header looks for subscribers, you have to put an outer if statement “if subscriber“ before the “if test variation.“ How else would it correctly know exactly who saw the test?
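
A rough sketch of the difference in Python-ish pseudocode (get_variant, the user object, and the checks are placeholders, not PostHog's actual API):

  # Anti-pattern: the user is enrolled in (and counted toward) the experiment
  # even when the tested UI never applies to them.
  variant = get_variant("new-checkout-button")
  if user.is_subscriber and on_checkout_page:
      render_checkout_button(variant)

  # Better: only ask for an assignment once we know this user will actually see the variation.
  if user.is_subscriber and on_checkout_page:
      variant = get_variant("new-checkout-button")
      render_checkout_button(variant)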


This is indeed the case. Have run into a few surprising things like this when implementing posthog experiments recently


When you call the feature flag, it’s going to put the user into one of the groups. The article is saying you don’t want to add irrelevant users (in the example, ones that had already done the action they were testing) because it’s going to skew your results.


The point is from an api design perspective something like

"posthog.getFeatureFlag('experiment-key')"

doesn't look like it's actually performing a mutation.


Writing an article about developer mistakes is easier than redesigning your rubbish API though.


Yep gross...


Another one: don’t program your own AB testing framework! Every time I’ve seen engineers try to build this on their own, it fails an AA test (where both versions are the same so there should be no difference). Common reasons are overly complicated randomization schemes (keep it simple!) and differences in load times between test and control.


I don't keep up with it that much, but it seems like the ecosystem has kind of collapsed in the last few years? Like you have optimizely and its competitors that are fully focused on huge enterprise with "call us" pricing right out the gate. VWO has a clunky & aged tech stack that was already causing problems when I used it a couple years ago and seems unchanged since then.

If you're a medium-small business I see why you'd be tempted to roll your own. Trustworthy options under $15k/year are not apparent.


Shouldn't AA tests fail a certain percentage of the time? Typically, 5% of the time?


Enough traffic.

Isn’t the biggest problem with A/B testing that very few web sites even have enough traffic to properly measure statistical differences?

Essentially making A/B testing for 99.9% of websites useless.


I have worked for some pretty arrogant business types who fancy themselves “data driven” but actually knew nothing about statistics. What that actually meant was they forced us to run AB tests for every change, and when the tests nearly always showed no particular statistical significance, they would accept the insignificant results if they supported their agenda, or if the insignificant results were against their desired outcome, they would run the test longer until it happened to flop the other way. The whole thing was such a joke. You definitely need some very smart math people to do this in a way that isn't pointless.


This. Ron Kohavi 1) has some excellent resources on this 2). There is a lot of noise in data, that is very often misattributed to 'findings' in the context of A/B testing. Replication of A/B tests should be much more common in the CRO industry, it can lead to surprising yet sobering insights into real effects.

1) https://experimentguide.com/ 2) https://bit.ly/ABTestingIntuitionBusters


A/B testing works fine even at a hundred users per day. More visitors means you can run more tests and notice smaller differences, but that’s also a lot of work which smaller sites don’t really justify.


Ad 7)

> Good hypothesis: User research showed that users are unsure of how to proceed to the checkout page. Changing the button's color will lead to more users noticing it (…)

Mind that you have to prove first that this proposition is actually true. Your user research is probably exploratory, qualitative data based on a small sample. At this point, it's rather an assumption. You have to transform and test this (by quantitative means) for validity and significance. Only then can you proceed to the button hypothesis. Otherwise, you are still testing multiple things at once, based on an unclear hypothesis, while merely assuming that part of this hypothesis is actually valid.


In practice you often cannot test that in a quantitative way. Especially since it’s about a state of mind.

However, you should not dismiss qualitative results out of hand.

If you do usability testing of the checkout flow with five participants and three actually verbalize the hypothesis during checkout (“hm, I’m not sure how to get to the next step here”, “I don’t see the button to continue”, after searching for 30s: “ah, there it is!” – after all of which a good moderator would also ask follow-up questions to better understand why they think it was hard for them to find the way to the next step and what their expectations were) then that’s plenty of evidence for the first part of the hypothesis, allowing you to move on to testing the second part. It would be madness to quantitatively verify the first part. A total waste of resources.

To be honest: with (hypothetical) evidence as clear as that from user research I would probably skip the A/B testing and go straight to implementing a solution if the problem is obvious enough and there are best practice examples. Only if designers are unsure about whether their proposed solution to the problem actually works would I consider testing that.

Also: quantitative studies are not the savior you want them to be. Especially if it’s about details in the perception of users … and that’s coming from me, a user researcher who loves to do quantitative product evaluation and isn’t even all that firm in all qualitative methods.


You really have to be able to build samples based on the first part of the hypothesis: you should test 4 groups for a crosstab. (Also, homogeneity may be an issue.) Transitioning from qualitative to quantitative methods is really the tricky part in social research.

Mind that 3/5 doesn't meet the criteria of a binary test. In statistical terms, you know nothing; this could still be random. Moreover, even if metrics are suggesting that some users are spending considerable time, you still don't know why: it's still an assumption based on a negligible sample. So, the first question should really be, how do I operationalize the variable "user is disoriented", and what does this exactly mean. (Otherwise, you're in for spurious correlations of all sorts. I.e. you still don't know why some users display disorientation and others don't. Instead of addressing the underlying issue, you rather fix this with an obtrusive button design, which may have a negative impact on the other group.)
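
For a quick sense of how little 3/5 pins down quantitatively, an exact binomial confidence interval is one line (a sketch, assuming scipy is available):

  from scipy.stats import binomtest

  ci = binomtest(3, 5).proportion_ci(confidence_level=0.95)
  print(ci)  # roughly (0.15, 0.95): consistent with "a few users" and with "almost everyone"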


I think you are really missing the forest for the trees.

Everything you say is completely correct. But useful? Or worthwhile? Or even efficient?

The goal is not to find out why some users are disoriented and some are not. Well, I guess indirectly it is. But getting there with rigor is a nightmare and to my mind not worthwhile in most cases. The hypothesis developed from the usability test would be “some users are disoriented during checkout”. That to me would be enough evidence to actually tackle that problem of disorientation, especially since to me 3/5 would indicate a relatively strong signal (not in terms of telling me the percentage of users affected by this problem, just that it’s likely the problem affects more than just a couple people).

The more mysterious question to me would actually be whether that disorientation also leads to people not ordering. Which is a plausible assumption – but not trivially answerable for sure. (Usability testing can provide some hints toward answering that question – but task based usability testing is always a bit artificial in its setup.)

Operationalizing “user is disoriented” is a nightmare and not something I would recommend at all (at least not as the first step) if you are reasonably sure that disorientation is a problem (because some users mention it during usability testing) and you can offer plausible solutions (a new design based on best practices and what users told you they think makes them feel disoriented).

Operationalizing something like disorientation is much more fraught with danger (and just operationalizing it in the completely wrong way without even knowing) than identifying a problem and based on reasonableness arguments implementing a potential solution and seeing whether the desired metric improves.

I agree that it would be an awesome research project to actually operationalize disorientation. But worthwhile when supporting actual product teams? Doubtful …


> The more mysterious question to me would actually be whether that disorientation also leads to people not ordering.

This is actually the crucial question. The disorientation is an indication of a fundamental mismatch between the internal model established by the user and the presentation. It may be an issue of the flow of information (users are hesitant, because they realize at this point that this is not about what they thought it may be) or on the usability/design side of things (the user has established an operational model, but this is not how it operates). Either way, there's a considerable dissonance, which will probably hurt the product: your presentation does not work for the user, you're not communicating on the same level, and this will probably be perceived as an issue of quality or even fitness for the purpose, maybe even intrigue. (Shouting at the user may provide a superficial fix, but will not address the potential damage.) – Which leads to the question: what is the actual problem and what caused it in the first place? (I'd argue, any serious attempt to operationalize this variable will inevitably lead you towards this much more serious issue. Operationalization is difficult for a reason. If you want to have a controlled experiment, you must control all your variables – and the attempt to do so may hint at deeper issues.)

BTW, there's also a potential danger in just taking the articulations of user dislike at face value: a classic trope in TV media research was audiences criticising the outfit of the presenter, while the real issue was a dissonance/mismatch in audio and visual presentation. Not that users could pinpoint this; hence, they would rather blame how the anchor was dressed…


What you say is all true and I actually completely agree with you (and like how you articulate those points – great to read it distilled that way) but at the same time probably not a good idea at all to do in most circumstances.

It is alright to decide that in certain cases you can act with imperfect information.

But to be clear, I actually think there may be situations where pouring a lot of effort into really understanding confusion is worthwhile. It's just very context dependent. (And I think you consistently underrate the progress you can make in understanding confusion or any other thing impacting conversion and use by using qualitative methods.)


Regarding underestimating qualitative methods: I'm actually all for them. It may turn out, it's all you need. (Maybe, a quantitative test will be required to prove your point, but it will probably not contribute much to a solution.) It's really that I think that A/B testing is somewhat overrated. (Especially, since you will probably not really know what you're actually measuring without appropriate preparation, which will provide the heavy lifting already. A/B testing should really be just about whether you can generalize on a solution and the assumptions behind this or not. Using this as a tool for optimization, on the other hand, may be rather dangerous, as it doesn't suggest any relations between your various variables, or the various layers of fixes you apply.)


This is adding considerable effort and weight to the process.

The alternative is just running the experiment, which would take 10 minutes to set up, and see the results. The A/B test will help measure the qualitative finding in quantitative terms. It's not perfect but it is practical.


> The solution is to use an A/B test running time calculator to determine if you have the required statistical power to run your experiment and for how long you should run your experiment.

Wouldn't it be better to have an A/B testing system that just counts how many users have been in each assignment group and ends when you have the required statistical power?

Time just seems like a stand in for "that should be enough", when in reality you might have a change in how many users get exposed that differs from your expectations.


Running the experiment until you have a specific pre-determined number of observations is okay.

However, the deceptively similar scheme of running it until the results are statistically significant is not okay!


If you want statistical significance of 1/20 and you check 20 times... you are likely to find it.
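
A quick simulation makes the inflation concrete. Both arms below have the same 5% conversion rate; the sample sizes, number of peeks, and thresholds are made up for illustration:

  import numpy as np
  from scipy.stats import norm

  rng = np.random.default_rng(2)

  def z_test_p(conv_a, n_a, conv_b, n_b):
      # Two-sided two-proportion z-test p-value.
      p_pool = (conv_a + conv_b) / (n_a + n_b)
      se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
      if se == 0:
          return 1.0
      z = (conv_a / n_a - conv_b / n_b) / se
      return 2 * norm.sf(abs(z))

  hits_once = hits_peeking = 0
  for _ in range(2_000):                         # 2,000 simulated A/A "experiments"
      a = rng.random(20_000) < 0.05              # arm A, 5% true rate
      b = rng.random(20_000) < 0.05              # arm B, same 5% true rate
      # One look at the end of the test:
      hits_once += z_test_p(a.sum(), a.size, b.sum(), b.size) < 0.05
      # Peeking after every 1,000 users per arm, stopping at the first "significant" look:
      hits_peeking += any(
          z_test_p(a[:n].sum(), n, b[:n].sum(), n) < 0.05
          for n in range(1_000, 20_001, 1_000)
      )

  print(hits_once / 2_000, hits_peeking / 2_000)   # about 0.05 vs. well above 0.05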


Point one seems to be an API naming issue. I would not anticipate getFeatureFlag to increment a hit counter. Seems like it should be called something like participateInFlagTest or whatever. Or maybe it should take a (key, arbitraryId) instead of just (key), use the hash of the id to determine if the flag is set, and idempotently register a hit for the id.


Thanks for posting this. It’s to the point and easy to understand. And much needed - most companies seem to do testing without teaching the intricacies involved.


> Relying too much on A/B tests for decision-making

Need I say more? Or just keep tweaking your website until it becomes a mindless, grey, sludge.


Plus, mind the Honeymoon Effect:

something new performs better because it's new.

If you have a platform with lots of returning users, this one will hit you again and again.

So even if you have a winner after the test and make the change permanent, revisit it 2 months later and see if you are now really better off.

Summing up all the changes from a/b tests, there's a high chance you just end up with an average platform.


If anyone from posthog is reading this, please fix your RSS feed. The link actually points back to the blog homepage.


Will take a look, thanks for the heads up!


Every engineer? Electrical engineers? Kernel developers? Defense workers?

I hesitate to write this (because I don't want to be negative) but I get a sense that most software "engineers" have a very narrow view of the industry at large. Or this forum leans a particular way.

I haven't A/B tested in my last three roles. Two of them were defense jobs, my current job deals with the Linux kernel.


> Two of them were defense jobs, my current job deals with the Linux kernel.

I don't work on the kernel, but one of the most professionally useful talks about the Linux kernel was an engineer talking about how to use statistical tests on perf related changes with small effects[1]. It's not an _online_ A/B technique but sometimes you pay attention to how other fields approach things in order to learn how to improve your own field.

[1]: https://lca2021.linux.org.au/schedule/presentation/31/


I used to get knots in my hair about these distinctions, but in retrospect, I was just being pedantic. It's a headline— not a synopsis or formal tagging system. Context makes it perfectly clear to most in a web-focused software industry crowd which "engineers" might be doing a/b testing. Also, my last three jobs haven't included a lot of stuff I read about here; why should that affect the headline?


Was going to say the same thing. Lots of articles have clickbait titles, but this one is especially bad. Even among software engineers, only a small percentage will ever do any A/B testing, not to mention that often "scientists" or other roles are in charge of designing, running and analyzing A/B test experiments.


A very large percentage of product engineers at some of the biggest tech companies, eg Meta, regularly run their own experiments.


> 6. Not accounting for seasonality

Doesn't the online nature of an A/B automatically account for this?


In the second table, shouldn't the mobile control conversion rate be 3.33%?


Annoying illegal cookie consent banner?


Probably off-topic, but how do I opt out of most A/B testing?



