This is the biggest lie in experimentation. Of course you expect something. Why are you running this test over all other tests?
What I'm challenging is this: if a team has spent three months building a feature, and you A/B test it and find no effect, that is not a good outcome. Having a tie where you get to choose anything is worse than having a winner that forces your hand. With a winner, at least you have the option to improve your product.
> What I'm challenging is this: if a team has spent three months building a feature, and you A/B test it and find no effect, that is not a good outcome.
That's a great outcome. At one company we spent a few months building a feature only for it to fail the test; now that was a bad outcome. The feature's code was so good that we ended up refactoring it to look like the old feature and switching to that. So there was a silver lining, I guess.
The key takeaway was to never A/B test a feature that big again. Instead we would spend a few weeks building something that didn't need to scale or be feature-complete. (IOW, shitty MVP/POC code.)
If it had come out that there was no difference, we would have gone with the new version's code because it was so well built -- alternatively, if the code was shit, we probably would have thrown it out. That's why it's the best result. You can write shitty POC code and toss it out -- or keep it if you really want.
> Of course you expect something. Why are you running this test over all other tests?
Because it has the best chance to prove/disprove your hypothesis. That's it. Even if it doesn't, all that means is that the metrics you're measuring are not connected to what you're doing. There is more to learn and explore.
So, you can hope that it will prove or disprove your hypothesis, but there is no rational reason to expect it to go either way.
But why this hypothesis? Sometimes people do tests just to learn as much as they can, but 95%+ of the time they’re trying to improve their product.
> there is no rational reason to expect it to go either way.
Flipping a coin has the same probability of heads during a new moon as during a full moon. I’m going to jump ahead and expect that you agree with that statement.
If I phrase that as a hypothesis and do an experiment, suddenly there’s no rational reason to expect it to go either way? Of course there is. The universe didn’t come into being when I started my experiment.
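To make that concrete, here’s a rough sketch of what that “experiment” would look like (Python; the sample sizes are made up, and since the flips are simulated as fair under both phases, the null is true by construction):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate 10,000 fair coin flips under each moon phase.
# Both probabilities are 0.5, so any "effect" we find is noise.
new_moon = rng.binomial(n=1, p=0.5, size=10_000)
full_moon = rng.binomial(n=1, p=0.5, size=10_000)

# Compare the two proportions via a chi-squared test on the 2x2 table.
table = [
    [new_moon.sum(), len(new_moon) - new_moon.sum()],
    [full_moon.sum(), len(full_moon) - full_moon.sum()],
]
chi2, p_value, _, _ = stats.chi2_contingency(table)
print(f"p-value: {p_value:.3f}")  # almost always non-significant, as expected
```

Running it just confirms what you already rationally expected before collecting a single flip.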
Null hypothesis testing is a mental hack. A very effective one, but a hack. There is no null. Even assuming zero knowledge isn’t the most rational thing. But the hack works: history has shown that when people try to act like they know nothing, they end up with better results. People are so overconfident that pretending they knew nothing improved things! That doesn’t mean it’s the truth, or even the best option.
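To illustrate the “zero knowledge” point: here’s a minimal beta-binomial sketch (Python; the conversion counts and prior strengths are invented for illustration) contrasting a flat “know nothing” prior with a prior that encodes an existing belief about the baseline rate:

```python
from scipy import stats

conversions, trials = 60, 1000  # made-up observed data

# "Know nothing" prior: Beta(1, 1), i.e. uniform over all possible rates.
flat_posterior = stats.beta(1 + conversions, 1 + trials - conversions)

# Informed prior: we already believe the rate is near 5%.
# Beta(50, 950) encodes roughly 1,000 trials' worth of prior evidence.
informed_posterior = stats.beta(50 + conversions, 950 + trials - conversions)

print(f"flat prior:     posterior mean = {flat_posterior.mean():.4f}")
print(f"informed prior: posterior mean = {informed_posterior.mean():.4f}")
# The informed posterior is pulled toward the prior belief:
# acting as if you know nothing is a choice, not a neutral default.
```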
I’d suggest reading up on the experimental method. There’s also a really good book: *Trustworthy Online Controlled Experiments*.
You are trying to apply science to commercial applications. It works, but you cannot twist it to your will or it stops working and serves no purpose other than a voodoo dance.
> Flipping a coin has the same probability of heads during a new moon as during a full moon. I’m going to jump ahead and expect that you agree with that statement.
As absurd as it sounds, it’s a valid experiment, and I actually couldn’t guess whether the extra light from a full moon would have a measurable effect on a coin flip. Theoretically it would, as light does impart a force… but whether or not we could realistically measure it would be interesting.
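For a sense of scale, here’s a back-of-the-envelope sample-size calculation (Python; `delta` is an arbitrary stand-in for a vanishingly small shift, not a real physical estimate):

```python
from scipy import stats

# Hypothetical effect: the full moon shifts P(heads) from 0.5 to 0.5 + delta.
delta = 1e-9  # arbitrary, illustrative guess at a minuscule effect
p1, p2 = 0.5, 0.5 + delta

# Standard sample-size formula for comparing two proportions
# at alpha = 0.05 (two-sided) with 80% power.
z_alpha = stats.norm.ppf(1 - 0.05 / 2)
z_beta = stats.norm.ppf(0.80)
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / delta ** 2

print(f"flips needed per group: {n:.2e}")  # roughly 4e18
```

At one flip per second, 4e18 flips per group would take on the order of 10^11 years, so “realistically measure” is doing a lot of work here.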
Yes, I’m playing devil’s advocate, but “if the button is blue, more people will convert” is just as absurd a hypothesis, yet it produced results.
Late response: I’ve read that book. I also work as a software engineer on the Experimentation Platform - Analysis team at Netflix. I’m not saying that makes me right, but I think it shows my opinion doesn’t come from a lack of exposure.
> You are trying to apply science to commercial applications. It works, but you cannot twist it to your will or it stops working and serves no purpose other than a voodoo dance.
With this paragraph, you’ve actually built most of the bridge between my viewpoint and yours. I think the common scientific method works in software sometimes. When it does, there are simple changes that make it give better results. But most of the time, people are in the will-twisting voodoo dance.
People also bend their problems so hard to fit science that it’s just shocking to me. In no other context do I experience a rational, analytical adult arguing that they’re unsure if a full moon will measurably affect a coin flip. If someone in a crystal shop said such a thing, they’d call it woo.