Trustworthy Online Controlled Experiments
Diane Tang, Ron Kohavi, and Ya Xu
“Good data scientists are skeptics: they look at anomalies, they question results, and they invoke Twyman’s Law when the results look too good.”
What I like
Very thorough book on experimental design, ethics, and methodologies
Makes a strong case for A/B (and A/A) testing and shows how to run them
Nice real-world case studies and examples to illustrate complex topics (e.g. how Bing runs A/B tests)
What’s missing
Very technical at times, diving into the weeds of backend programming and statistical models; some ideas could have been communicated more simply
Ends pretty abruptly with a fairly dry chapter. Would have liked some ramping down after all of the content
Key Topics
Experiments, Methods, Comparing Products, Data, Metrics, Research, Triangulation
Review
I originally picked up this book, widely considered the de facto reference on A/B testing, when an insurance client I was working with asked us to devise a means of comparing a hypothetical credit card product with a competitor's. They wanted to understand the appetite within their customer base for a custom insurance credit card offered by their company. After evaluating the concept with a critical lens we chose, for a variety of reasons, not to run the study and moved on from the idea, which left this book sitting on my bookshelf for quite some time until I recently decided to pick it up.
Statistics and research methods have always been a bit challenging for me. In school, math was never my strong suit and I found it difficult to wrap my head around complex algorithms. However, I've come to the conclusion that every product designer needs to understand them to a certain degree in order to make data-driven decisions rather than just listening to the Highest Paid Person's Opinion (HiPPO, hence the cover). Sometimes it's for our own understanding, but more often than not it's for the business's understanding: we need a way to justify decisions by showing their impact on the metrics that matter most to the business. That's where A/B testing comes in.
This book is a great (albeit technical) primer on how to run A/B tests within a large organization, and on how to set up A/B testing infrastructure within any organization. It goes so far as to weigh processing control and variant conditions server-side versus client-side and discusses the trade-offs of each. It dives into the details of how to set up, ramp up, ramp down, and analyze the results of a study, and how to avoid bias when designing and analyzing an experiment. The content is very good and I enjoyed it, though I admittedly had a hard time grasping many of the more technical topics. The big picture has stuck with me, though, and I think I now know enough to run an A/B test and to avoid issues like sample-ratio mismatch.
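For the curious, the sample-ratio mismatch check itself turns out to be simple: a chi-squared test comparing observed assignment counts against the designed split. Here's a minimal sketch, not from the book, with hypothetical counts:

```python
from scipy.stats import chisquare

# Hypothetical assignment counts for an experiment designed as a 50/50 split.
control_users, treatment_users = 821_588, 815_482
total = control_users + treatment_users

# Chi-squared test of the observed counts against the designed split.
_, p_value = chisquare(
    f_obs=[control_users, treatment_users],
    f_exp=[total / 2, total / 2],
)

# A tiny p-value means the split deviates from the design: a sample-ratio
# mismatch, and the results should not be trusted until the cause is found.
if p_value < 0.001:
    print(f"Possible SRM (p = {p_value:.2e}): investigate before trusting results.")
```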
Coming out on the other side of reading it, I certainly feel more confident aligning experimental design with business goals and drivers. There’s more I need to learn about statistics, and hopefully I can pick some of these skills up in an MBA program that I’ve just recently applied to.
Learnings
An online controlled experiment, or A/B test, is a data-driven means of comparing a control experience against one or more variants to drive strategic decision making.
There are four questions we need to answer when designing an experiment: what is the randomization unit, what population are we targeting, how large does the sample need to be, and how long should the experiment run?
Twyman’s Law states that if information from data seems too good to be true, we should be skeptical of it and look for bias.
When running controlled experiments within an organization it is important to set up the right infrastructure and tooling.
Latency matters on the web, so much so that a 10ms decrease can result in millions of dollars of revenue increase for companies like Bing.
There are three main types of metrics we should measure: goal metrics, driver metrics, and guardrail metrics.
An Overall Evaluation Criterion (OEC) is typically a single metric composed of multiple goal metrics, used to judge an experiment's success in one number.
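As an illustration (the metric names and weights below are my own invention, not the book's), an OEC might be a weighted combination of normalized goal metrics:

```python
# Hypothetical OEC: a weighted sum of goal metrics, each normalized by its
# baseline so the components are comparable.
weights = {"sessions_per_user": 0.5, "revenue_per_user": 0.3, "queries_per_session": 0.2}

def oec(metrics: dict[str, float], baselines: dict[str, float]) -> float:
    return sum(w * metrics[name] / baselines[name] for name, w in weights.items())

baseline = {"sessions_per_user": 4.1, "revenue_per_user": 2.5, "queries_per_session": 1.8}
variant  = {"sessions_per_user": 4.3, "revenue_per_user": 2.4, "queries_per_session": 1.9}
print(oec(variant, baseline))  # > 1.0 suggests the variant improved the OEC
```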
A meta-analysis within an organization involves studying past experiments for patterns on what worked and didn’t work, and guidance on how to run new ones.
Ethics are a set of rules that govern what we should or should not do.
There are a lot of ways to source user or customer data, which can be triangulated to help validate or form hypotheses.
Observational studies are simplified versions of A/B tests that lack a randomized control condition and the guardrails that make controlled experiments trustworthy.
When setting up a digital experiment it is important to distinguish between making changes client-side and server-side.
When selecting a randomization unit for a study one of the best types you can select is user-level randomization, particularly by using their user ID.
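A common implementation, sketched below with assumed names (this isn't the book's code), hashes the user ID with an experiment-specific salt so assignment is sticky per user but independent across experiments:

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, buckets: int = 1000) -> int:
    # Hashing the user ID with an experiment-specific salt means the same
    # user always lands in the same bucket for a given experiment, while
    # buckets stay independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def variant_for(user_id: str, experiment: str) -> str:
    # 50/50 split: buckets 0-499 are treatment, 500-999 are control.
    return "treatment" if assign_bucket(user_id, experiment) < 500 else "control"

print(variant_for("user-12345", "checkout-promo-v1"))
```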
When running an online controlled experiment, it is important to ramp up a variant’s exposure as opposed to releasing it all at once.
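Building on the bucketing sketch above, a ramp can be as simple as raising the treatment bucket threshold over time (the stages here are illustrative):

```python
# Illustrative ramp schedule: the fraction of traffic exposed to treatment.
RAMP_STAGES = [0.005, 0.05, 0.25, 0.50]  # 0.5% -> 5% -> 25% -> 50%

def in_treatment(user_id: str, experiment: str, stage: int) -> bool:
    # Reuses assign_bucket() from the sketch above. Because the hash is
    # stable, users exposed at an early stage remain exposed as the ramp
    # widens; no one flips back and forth between variants.
    return assign_bucket(user_id, experiment) < RAMP_STAGES[stage] * 1000
```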
When analyzing experiments there are two major errors: a Type I error is concluding there is a difference between variant and control when there isn't one; a Type II error is concluding there is no difference when there actually is.
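In practice this comes down to a significance test. A minimal two-sample t-test sketch with simulated data:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Simulated per-user revenue for control and treatment (invented numbers).
control = rng.normal(loc=2.50, scale=1.0, size=10_000)
treatment = rng.normal(loc=2.55, scale=1.0, size=10_000)

t_stat, p_value = ttest_ind(treatment, control)

# Rejecting when p < 0.05 bounds the Type I error rate at 5%. Type II
# errors are controlled separately, by running with enough users to reach
# adequate statistical power (conventionally 80%).
print(f"p = {p_value:.4f}; significant at 5%: {p_value < 0.05}")
```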
Variance, the degree to which your data deviates from the mean, is one of the most important quantities in experimentation; if your variance is too high it can be difficult to achieve a statistically significant result.
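The book offers a rule of thumb that makes this concrete: for roughly 80% power at a 5% significance level you need about 16σ²/δ² users per group, where δ is the smallest effect you want to detect. A quick sketch with made-up numbers:

```python
def users_per_group(std_dev: float, min_detectable_effect: float) -> int:
    # Rule of thumb from the book: n ≈ 16 * variance / delta^2 per group,
    # for ~80% power at a 5% significance level.
    return int(16 * std_dev**2 / min_detectable_effect**2)

# Hypothetical metric: revenue per user with sigma = $1.00, aiming to
# detect a $0.05 change. Higher variance or a smaller effect -> more users.
print(users_per_group(std_dev=1.0, min_detectable_effect=0.05))  # 6400 per group
```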
Triggering is the process by which a user enters an experiment only once they meet a condition the change could actually affect. For example, if you want to test a promotional deal for users who have $30 in their cart, only users whose carts reach $30 are triggered into the experiment and counted in the analysis.
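A sketch of that triggering logic (the $30 threshold comes from the example; the function names are my own):

```python
TRIGGER_THRESHOLD = 30.00  # the $30 cart condition from the example

def maybe_trigger(user_id: str, cart_total: float) -> str | None:
    # Users who never reach the trigger condition are excluded from the
    # analysis entirely; the promo could not have affected them, and
    # excluding them reduces variance in the measured effect.
    if cart_total < TRIGGER_THRESHOLD:
        return None  # not triggered
    # Reuses variant_for() from the randomization sketch above.
    return variant_for(user_id, "cart-promo-experiment")
```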
Leakage occurs when control and treatment units are not completely isolated from each other and impact one another's experience.
Long-term effects are studied to understand whether a treatment's impact persists over time, since short-term results (e.g. novelty effects) can differ from lasting ones.