> Good hypothesis: User research showed that users are unsure of how to proceed to the checkout page. Changing the button's color will lead to more users noticing it (…)
Mind that you first have to show that this premise is actually true. Your user research is probably exploratory, qualitative data based on a small sample. At this point, it's more of an assumption. You have to transform and test it (by quantitative means) for validity and significance. Only then can you proceed to the button hypothesis. Otherwise, you are still testing multiple things at once, based on an unclear hypothesis, while merely assuming that part of this hypothesis is actually valid.
In practice you often cannot test that in a quantitative way. Especially since it’s about a state of mind.
However, you should not dismiss qualitative results out of hand.
If you do usability testing of the checkout flow with five participants and three actually verbalize the hypothesis during checkout (“hm, I’m not sure how to get to the next step here”, “I don’t see the button to continue”, after searching for 30s: “ah, there it is!” – after all of which a good moderator would also ask follow-up questions to better understand why they found it hard to find the way to the next step and what their expectations were), then that's plenty of evidence for the first part of the hypothesis, allowing you to move on to testing the second part. It would be madness to quantitatively verify the first part. A total waste of resources.
To be honest: with (hypothetical) evidence as clear as that from user research I would probably skip the A/B testing and go straight to implementing a solution if the problem is obvious enough and there are best practice examples. Only if designers are unsure about whether their proposed solution to the problem actually works would I consider testing that.
Also: quantitative studies are not the savior you want them to be. Especially if it’s about details in the perception of users … and that’s coming from me, a user researcher who loves to do quantitative product evaluation and isn’t even all that well versed in qualitative methods.
You really have to be able to build samples based on the first part of the hypothesis: you should test 4 groups for a crosstab. (Also, homogeneity may be an issue.) Transitioning from qualitative to quantitative methods is really the tricky part in social research.
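A rough sketch of what such a crosstab test could look like, assuming the four groups are the 2×2 combination of “showed signs of disorientation” vs. “did not” and “completed checkout” vs. “abandoned” (that grouping and the counts below are entirely made up for illustration):

```python
# Hypothetical 2x2 crosstab: disorientation (rows) x checkout completion (columns).
# The counts are invented purely to illustrate the mechanics of the test.
from scipy.stats import chi2_contingency

#                 completed  abandoned
table = [[180, 120],   # showed signs of disorientation
         [260,  90]]   # no signs of disorientation

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
# A small p-value would suggest an association between the two variables,
# but it says nothing about the direction of causation or about *why* users are disoriented.
```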
Mind that 3/5 doesn't meet the criteria of a binary test. In statistical terms, you know nothing; this is still random. Moreover, even if metrics suggest that some users are spending considerable time, you still don't know why: it's still an assumption based on a negligible sample. So the first question should really be: how do I operationalize the variable "user is disoriented", and what exactly does this mean? (Otherwise, you're in for spurious correlations of all sorts. That is, you still don't know why some users display disorientation and others don't. Instead of addressing the underlying issue, you fix it with an obtrusive button design, which may have a negative impact on the other group.)
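To make the "this is still random" point concrete, here is a rough sketch of an exact binomial test on 3 hits out of 5 participants; the null rates are arbitrary, chosen only for illustration:

```python
# Exact binomial test for 3 "hits" out of 5 participants.
# The null proportions below are arbitrary illustrations, not claims about real users.
from scipy.stats import binomtest

for p0 in (0.5, 0.3):
    result = binomtest(k=3, n=5, p=p0, alternative="greater")
    print(f"H0: true rate = {p0:.0%}  ->  p-value = {result.pvalue:.3f}")

# Against a 50% null the p-value is 0.5, and even against a 30% null it is ~0.16 --
# with n = 5, the observation is indistinguishable from chance in this sense.
```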
I think you are really missing the forest for the trees.
Everything you say is completely correct. But useful? Or worthwhile? Or even efficient?
The goal is not to find out why some users are disoriented and some are not. Well, I guess indirectly it is. But getting there with rigor is a nightmare and, to my mind, not worthwhile in most cases. The hypothesis developed from the usability test would be “some users are disoriented during checkout”. That to me would be enough evidence to actually tackle that problem of disorientation, especially since to me 3/5 would indicate a relatively strong signal (not in terms of telling me the percentage of users affected by this problem, just that the problem likely affects more than just a couple of people).
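As a back-of-the-envelope for why 3/5 feels like more than noise to me – a sketch only, assuming a uniform prior over the share of users affected:

```python
# Posterior over the share of users affected, given 3 of 5 participants hit the problem.
# Uniform Beta(1, 1) prior; this is a back-of-the-envelope, not a claim about real prevalence.
from scipy.stats import beta

posterior = beta(1 + 3, 1 + 2)           # Beta(4, 3)
low, high = posterior.interval(0.95)     # central 95% credible interval

print(f"Posterior mean: {posterior.mean():.2f}")
print(f"95% credible interval: {low:.2f} .. {high:.2f}")
print(f"P(more than 20% of users affected): {posterior.sf(0.2):.2f}")
# The interval is wide (n = 5, after all), but very small prevalences are unlikely:
# a signal that the problem affects more than just a couple of people,
# not an estimate of the exact percentage.
```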
The more mysterious question to me would actually be whether that disorientation also leads to people not ordering. Which is a plausible assumption – but not trivially answerable for sure. (Usability testing can provide some hints toward answering that question – but task based usability testing is always a bit artificial in its setup.)
Operationalizing “user is disoriented” is a nightmare and not something I would recommend at all (at least not as the first step) if you are reasonably sure that disorientation is a problem (because some users mention it during usability testing) and you can offer plausible solutions (a new design based on best practices and what users told you they think makes them feel disoriented).
Operationalizing something like disorientation is much more fraught with danger (including operationalizing it in a completely wrong way without even knowing it) than identifying a problem, implementing a potential solution based on reasonableness arguments, and seeing whether the desired metric improves.
I agree that it would be an awesome research project to actually operationalize disorientation. But worthwhile when supporting actual product teams? Doubtful …
> The more mysterious question to me would actually be whether that disorientation also leads to people not ordering.
This is actually the crucial question. The disorientation is an indication of a fundamental mismatch between the internal model established by the user and the presentation. It may be an issue of the flow of information (users are hesitant because they realize at this point that this is not about what they thought it was) or on the usability/design side of things (the user has established an operational model, but this is not how it operates). Either way, there's a considerable dissonance, which will probably hurt the product: your presentation does not work for the user, you're not communicating on the same level, and this will probably be perceived as an issue of quality or even fitness for purpose, perhaps even as intrigue. (Shouting at the user may provide a superficial fix, but will not address the potential damage.) – Which leads to the question: what is the actual problem and what caused it in the first place? (I'd argue, any serious attempt to operationalize this variable will inevitably lead you towards this much more serious issue. Operationalization is difficult for a reason. If you want to have a controlled experiment, you must control all your variables – and the attempt to do so may point you toward deeper issues.)
BTW, there's also a potential danger in taking users' articulations of dislike at face value: a classic trope in TV media research was audiences criticizing the outfit of the presenter, while the real issue was a dissonance/mismatch between the audio and the visual presentation. Not that viewers could pinpoint this; hence they would rather blame how the anchor was dressed…
What you say is all true and I actually completely agree with you (and I like how you articulate those points – great to read it distilled that way), but at the same time it's probably not a good idea to actually do in most circumstances.
It is alright to decide that in certain cases you can act with imperfect information.
But to be clear, I actually think there may be situations where pouring a lot of effort into really understanding confusion is worth it. It's just very context dependent. (And I think you consistently underrate the progress you can make in understanding confusion, or any other thing impacting conversion and use, by using qualitative methods.)
Regarding underestimating qualitative methods: I'm actually all for them. It may turn out that's all you need. (Maybe a quantitative test will be required to prove your point, but it will probably not contribute much to a solution.) It's really that I think A/B testing is somewhat overrated. (Especially since you will probably not really know what you're actually measuring without appropriate preparation – which will have done the heavy lifting already. A/B testing should really just be about whether you can generalize a solution and the assumptions behind it or not. Using it as a tool for optimization, on the other hand, may be rather dangerous, as it doesn't suggest any relations between your various variables, or the various layers of fixes you apply.)
This is adding considerable effort and weight to the process.
The alternative is just running the experiment, which would take 10 minutes to set up, and seeing the results. The A/B test will help measure the qualitative finding in quantitative terms. It's not perfect, but it is practical.
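As a sketch of how the results of such a quick experiment might be read afterwards (the conversion counts below are purely hypothetical):

```python
# Two-proportion z-test on hypothetical A/B results for the checkout button change.
# Counts are invented for illustration; in practice they'd come from the experiment tooling.
from math import sqrt
from scipy.stats import norm

conversions_a, visitors_a = 480, 10_000   # control
conversions_b, visitors_b = 540, 10_000   # new button

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
p_pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pooled * (1 - p_pooled) * (1 / visitors_a + 1 / visitors_b))

z = (p_b - p_a) / se
p_value = norm.sf(z)   # one-sided: did the new button convert better?

print(f"control {p_a:.1%} vs variant {p_b:.1%}, z = {z:.2f}, p = {p_value:.3f}")
# This tells you *whether* the change moved the metric, not *why* --
# the "why" is the part the qualitative research has to carry.
```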