• by jefftk on 1/7/2022, 12:00:52 PM

    Summary: if you run an experiment where you try to rush users to convert, and you only run the experiment for a short time, it will look great even though it might be a net loss overall, because you're capturing a larger proportion of conversions in the experiment group.

    You can also run into this sort of problem with user learning effects, where initially a large change in the UI can give a large change in behavior due to novelty, but then it wears off over time. Running experiments longer helps a lot in both cases.
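
    A minimal simulation of that pull-forward effect (illustrative numbers, not from the post): the treatment converts slightly fewer users overall but makes them act faster, so a short measurement window flips the conclusion.

        # Pull-forward bias sketch: made-up rates and delays.
        import numpy as np

        rng = np.random.default_rng(0)
        n = 100_000  # users per arm

        def simulate(convert_rate, mean_delay_days):
            """Day of conversion per user, or np.inf if they never convert."""
            converts = rng.random(n) < convert_rate
            delays = rng.exponential(mean_delay_days, size=n)
            return np.where(converts, delays, np.inf)

        control = simulate(0.10, mean_delay_days=7.0)    # 10% convert, slowly
        treatment = simulate(0.09, mean_delay_days=1.0)  # 9% convert, but fast

        for window in (7, 60):
            cr_c = np.mean(control <= window)
            cr_t = np.mean(treatment <= window)
            print(f"{window:>2}-day window: control {cr_c:.2%}, treatment {cr_t:.2%}")

        # The treatment "wins" the 7-day readout even though it loses once the
        # full conversion lag has played out.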

  • by snarfy on 1/7/2022, 1:39:46 PM

    I've dealt with this enough that at this point I'm convinced all companies that do this fail to see the users through the metrics. A/B testing is overvalued.

  • by ssharp on 1/7/2022, 4:57:19 PM

    This whole example seems like it boils down more to a poor test/analysis plan than to anything that truly speaks to the value of Bayesian approaches:

    1) It's almost always a bad idea to decide a test based on one week's worth of data, regardless of which statistical approach you take

    2) There's not really any info on why Fisher's exact test is used. It seems like most A/B testing software has adopted Bayesian methods, but the ones that haven't, I believe, use Student's t-test and require sample sizing up front (a rough sizing sketch follows this list)

    3) The conversion delay issue was not addressed in the measurement plan. There are clear ways to address it, both tactically and mathematically. From a tactical standpoint, on most testing platforms you'd be able to change the test allocation to 0%, which would allow previously bucketed users to continue to be measured on subsequent visits while not letting any new users in. You could also just run the test long enough that the conversion lag no longer has a major impact on results (this may or may not be possible, depending on how long and fat the lag tail is).
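
    On point 2, a rough sketch of what sample sizing up front looks like for a two-proportion test (the baseline rate and lift below are made-up, and the library choice is mine):

        # Hypothetical sizing for a conversion-rate test using statsmodels.
        from statsmodels.stats.power import NormalIndPower
        from statsmodels.stats.proportion import proportion_effectsize

        p_baseline = 0.03            # assumed baseline conversion rate
        p_treatment = 0.03 * 1.10    # assumed +10% relative lift worth detecting

        effect = proportion_effectsize(p_treatment, p_baseline)
        n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                                 power=0.8, alternative="two-sided")
        print(f"~{n_per_arm:,.0f} users per arm")  # on the order of tens of thousands here

    At low traffic, that kind of number alone usually rules out calling a test after one week.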

  • by rwilson4 on 1/7/2022, 2:40:44 PM

    The author discusses "pull forward", where the real impact of a change is to make people purchase earlier, but we don't necessarily observe incremental purchases. This isn't necessarily bad; I'd rather have a dollar today than a dollar next week.

    This can be quantified by plotting the incremental conversions observed by day x. We might see a big initial lift that degrades over time. If it eventually degrades to zero, there are no truly incremental conversions, just pull-forward. But if we end up pulling forward a meaningful number of purchases by a month or more, that can be valuable to the business!
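
    A sketch of that by-day-x view (hypothetical data layout with one row per user; the column names are mine):

        # Cumulative conversion rate by day-since-exposure, per arm, plus lift.
        import pandas as pd

        def incremental_by_day(df, horizon=30):
            days = (df["conversion_date"] - df["exposure_date"]).dt.days
            rows = []
            for x in range(1, horizon + 1):
                by_arm = (df.assign(converted=days <= x)
                            .groupby("arm")["converted"].mean())
                rows.append({"day": x,
                             "control": by_arm["control"],
                             "treatment": by_arm["treatment"],
                             "incremental": by_arm["treatment"] - by_arm["control"]})
            return pd.DataFrame(rows)

        # If "incremental" decays toward zero as x grows, the lift was pure
        # pull-forward; if it levels off above zero, some of it is real.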

    I wouldn't immediately jump to a complicated mathematical model to handle this situation, I would consider the business implications first and foremost.

    I also urge anyone considering Bayesian methods for A/B testing to read up on the likelihood principle vs the strong repeated sampling principle (I documented my thoughts here [0]). Bayesian methods always satisfy the likelihood principle; frequentist methods always satisfy repeated sampling. In many situations both methods satisfy both principles, and then the two approaches will give similar answers. But based on many years doing A/B testing, I wouldn't give up repeated sampling lightly. Bayesian and frequentist methods are not blindly interchangeable.

    On the other hand, if repeated sampling is not important in your use case, then by all means prefer the Bayesian approach! I just want people to consider the trade-offs.

    [0]: https://adventuresinwhy.com/post/bayesian_ab_testing/
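
    One concrete place where the repeated sampling principle bites in A/B testing is optional stopping. A quick simulation (mine, with illustrative parameters, not from the linked post): peeking at a fixed-sample z-test after every batch of users inflates the false-positive rate far above the nominal 5%, even when the two arms are identical.

        # Optional-stopping illustration: two identical arms, 20 peeks.
        import numpy as np
        from scipy.stats import norm

        rng = np.random.default_rng(1)
        p_true, batch, n_peeks, sims = 0.05, 1_000, 20, 2_000
        false_positives = 0

        for _ in range(sims):
            a = rng.binomial(batch, p_true, n_peeks).cumsum()  # cumulative conversions, arm A
            b = rng.binomial(batch, p_true, n_peeks).cumsum()  # cumulative conversions, arm B
            n = batch * np.arange(1, n_peeks + 1)              # cumulative users per arm
            pooled = (a + b) / (2 * n)
            se = np.sqrt(pooled * (1 - pooled) * 2 / n)
            z = (a - b) / (n * se)
            if (2 * norm.sf(np.abs(z)) < 0.05).any():          # "significant" at any peek
                false_positives += 1

        print(f"False-positive rate with peeking: {false_positives / sims:.0%}")  # well above 5%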

  • by DebtDeflation on 1/7/2022, 2:24:21 PM

    >the new website version implemented urgency features that gave the users the impression that the product they were considering for purchase would soon be unavailable or would drastically increase in price. This lead to the fact that some users were annoyed by this alarmist messaging and design, and now didn't convert anymore even though they might have under the old version

    This basic principle has far broader implications than website design and A/B testing. Managers at large corporations have learned to pull all sorts of levers to optimize the short term value of some metric (typically the one upon which their compensation depends) often in direct opposition to the long term interests of the corporation (and even the value of that metric beyond the next few quarters).

  • by vlozko on 1/7/2022, 4:33:37 PM

    This article reminded me of an experience I had as a developer for an online retailer. The product browse team, the one responsible for showing lists of products from searches, categories, brands, etc., had a slew of A/B tests to measure the product-detail-page viewing conversion rate. One of these tests included the bright idea of removing the product name from the individual product cards. Trouble is, the names often contained differentiating descriptors. While dogfooding the app, I was thoroughly frustrated with the need to constantly go back and forth between browse and detail pages. When I spoke to the browse team about it, they were patting themselves on the back over how amazing their detail-page viewing conversion rates were. It has made me a skeptic of A/B tests ever since.

  • by Darge on 1/7/2022, 1:01:28 PM

    Question to HN folks: What are, in your opinion, the best resources for a Computer Science graduate to learn how to apply statistics like this?

  • by knuthsat on 1/7/2022, 2:33:02 PM

    A/B tests work fine if the signal you are measuring is strong. This is not the case here.

    Is it even fine to use the distribution assumptions in the later analysis?

    Looks like these assumptions, combined with a higher conversion rate on day 2 for control, are the main reason for the surprising result (control is obviously spread out).

  • by robomartin on 1/7/2022, 6:41:01 PM

    Well, this is where it comes down to understanding statistics. Yes, that subject we all hated in college and could not wait to pass and forget about. I think I can say that most A/B testing is, to be kind, statistically flawed. At the same time, it may be that only some of the largest websites have enough traffic to do it right (whatever that means).

    And then there's the big question: How much of business did you lose in the process of arriving at what seems like an optimal solution (which might just be a local peak, rather than a global optimum point)?

    That said, what's the alternative? To optimize, or not, that is the question.

  • by antux on 1/7/2022, 2:35:38 PM

    A/B testing can be useful at times but it's largely overrated because you're only discovering the best design out of the ones you test. That means there could be a far better design that you failed to include in the experiment.

    Just because one design converts more than the other doesn't mean it's the design with optimal UX. I've seen many tests where the designs included already had faulty UX. This is why it's better to have a trained UX designer on your team who can fix basic flaws and present the best version of various designs for testing.

  • by jklinger410 on 1/7/2022, 3:50:14 PM

    Every marketing team should have a data scientist and A/B tests are almost always done incorrectly.

  • by Snoozus on 1/7/2022, 4:11:19 PM

    If a carpenter told me he preferred screws over nails or vice versa I would think he was probably just a bad carpenter.

    If he went on to post about his preference on the Internet with some made up examples I would be sure that he couldn't be trusted.

  • by ffhhj on 1/7/2022, 7:42:49 PM

    Correlation does not imply causation. Why do people believe A/B-test-based decisions actually improve conversion rates in the long run? These tests could be eroding foundations like usability and slowly pushing your followers to other sites.

  • by marban on 1/7/2022, 12:59:45 PM

    It sucks to suck at maths.

  • by clircle on 1/7/2022, 7:05:08 PM

    It seems statistically dubious to me that the point k = infinity is in the sample space and that the model is well defined.

    Does this model require a proper prior on p?

    Cool blog post.
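
    For what it's worth, one standard way to make a "never converts" atom well defined (this may or may not match the post's parameterization) is a mixture: each user converts with probability p, and conditional on converting, the delay k is geometric; k = infinity then simply carries mass 1 - p, and users who haven't converted yet are treated as censored rather than as k = infinity.

        # Hedged sketch of a mixture ("cure-rate"-style) delay model, not
        # necessarily the post's model. Delays are counted from day 1.
        import numpy as np
        from scipy.stats import geom

        def log_likelihood(p, theta, delays, n_not_yet, days_observed):
            # Converted users: convert (prob p) with an observed geometric delay.
            ll = len(delays) * np.log(p) + geom.logpmf(delays, theta).sum()
            # Not-yet-converted users: either never will (prob 1 - p) or will
            # convert after the observation window (prob p * P(k > days_observed)).
            ll += n_not_yet * np.log(1 - p + p * geom.sf(days_observed, theta))
            return ll

        # With proper priors on p and theta (e.g. Beta priors), the posterior
        # is proper as well.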

  • by throw1234651234 on 1/7/2022, 1:37:44 PM

    Google Optimize flickers the screen with React, and it makes non-mobile layouts appear mobile (it messes with screen-width presets). That is all.