Trustworthy Online Controlled Experiments

Ron Kohavi, Diane Tang, and Ya Xu

Good data scientists are skeptics: they look at anomalies, they question results, and they invoke Twyman’s Law when the results look too good.


What I like

  • Very thorough book on experimental design, ethics, and methodologies

  • Definitely makes the case for A/B (and A/A) testing and explains how to run them

  • Nice real-world case studies and examples to illustrate complex topics (e.g. how Bing runs A/B tests)

What’s missing

  • Very technical at times, diving into the weeds of backend engineering and statistical models; some ideas could have been communicated more simply

  • Ends pretty abruptly with a fairly dry chapter. Would have liked some ramping down after all of the content


Key Topics

Experiments, Methods, Comparing Products, Data, Metrics, Research, Triangulation


Review

I originally picked up this book, widely considered the de facto reference on A/B testing, when an insurance client I was working with asked us to devise a means of comparing a hypothetical credit card product with a competitor’s. They wanted to understand the appetite within their customer base for a custom insurance credit card offered by their own company. For a variety of reasons we chose not to run the study after evaluating the concept with a critical lens, and we moved on from the idea, which left this book sitting on my bookshelf for quite some time until I recently decided to pick it up.

Statistics and research methods have always been a bit challenging for me. In school, math was never my strong suit, and I found it difficult to wrap my head around complex algorithms. However, I’ve come to the conclusion that every product designer needs to understand them to a certain degree in order to make data-driven decisions rather than just deferring to the Highest Paid Person’s Opinion (HiPPO, hence the cover). Sometimes that understanding is for ourselves, but more often than not it’s for the business: we need a way to justify decisions by showing their impact on the metrics that matter most to the business. That’s where A/B testing comes in.

This book is a great (albeit technical) primer on how to run A/B tests within a large organization, or how to set up an infrastructure for A/B testing within any organization. It goes so far as to compare processing control and variant conditions server-side versus client-side and discusses the trade-offs of each. It dives into the details of how to set up, ramp up, ramp down, and analyze an experiment, and how to avoid bias along the way. The content in this book is very good and I enjoyed it, though I admittedly had a hard time grasping some of the more technical topics. The big picture has stuck with me, though, and I think I now know enough about how to run an A/B test and how to avoid issues like sample ratio mismatch (see the sketch below).
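As a note to my future self, here is a minimal sketch of the kind of sample ratio mismatch check the book describes; the counts and the 50/50 design are hypothetical, and using scipy’s chi-square goodness-of-fit test is my own choice, not something the book prescribes.

```python
from scipy import stats

# Hypothetical assignment counts from an experiment designed as a 50/50 split.
control_users, treatment_users = 50_210, 49_310
total = control_users + treatment_users

# Chi-square goodness-of-fit against the expected 50/50 ratio. A very small
# p-value flags a sample ratio mismatch, meaning the randomization or the
# logging is likely broken and the experiment's results shouldn't be trusted.
chi2, p_value = stats.chisquare(
    [control_users, treatment_users],
    f_exp=[total / 2, total / 2],
)
print(f"SRM check p-value: {p_value:.6f}")  # treat p < 0.001 as a red flag
```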

Coming out on the other side of reading it, I certainly feel more confident aligning experimental design with business goals and drivers. There’s more I need to learn about statistics, and hopefully I can pick some of these skills up in an MBA program that I’ve just recently applied to.


Learnings

  • An online controlled experiment, or A/B test, is a data-driven means of comparing two or more variants of something to drive strategic decision making.

  • There are four questions to answer when designing an experiment: what is the randomization unit, what population are we targeting, how large does the sample need to be, and how long should the experiment run?

  • Twyman’s Law states that if information from data seems too good to be true, we should be skeptical of it and look for bias.

  • When running controlled experiments within an organization it is important to set up the right infrastructure and tooling. 

  • Latency matters on the web, so much so that a 10ms decrease can result in millions of dollars of revenue increase for companies like Bing.

  • There are three main types of metrics to measure: goal, driver, and guardrail metrics.

  • An Overall Evaluation Criterion (OEC) is a single metric, often composed of multiple goal metrics, that captures what an experiment is ultimately trying to improve.

  • A meta-analysis within an organization involves studying past experiments for patterns on what worked and didn’t work, and guidance on how to run new ones.

  • Ethics are a set of rules that govern what we should or should not do.

  • There are a lot of ways to source user or customer data, which can be triangulated to help validate or form hypotheses.

  • Observational studies lack the randomized control condition of an A/B test, and without those guardrails they are not as trustworthy.

  • When setting up a digital experiment it is important to distinguish between making changes on the client side and on the server side.

  • When selecting a randomization unit for a study, user-level randomization, typically keyed on a stable user ID, is one of the best choices (see the bucketing sketch after this list).

  • When running an online controlled experiment, it is important to ramp up a variant’s exposure as opposed to releasing it all at once.

  • When analyzing experiments there are two major errors: a Type I error is concluding there is a difference between variant and control when there isn’t one, and a Type II error is concluding there is no difference when there actually is (see the t-test sketch after this list).

  • Variance, the degree to which your data spreads around the mean, is one of the most important quantities in experiment analysis; if variance is too high it can be difficult to achieve a statistically significant result.

  • Triggering is the process by which a user enters an experiment’s analysis. For example, if you want to test a promotional deal for users who have $30 in their cart, only users who actually reach that state are triggered and counted, so untriggered users don’t dilute the measured effect.

  • Leakage occurs when control and treatment units are not completely isolated from each other and end up affecting one another’s experience.

  • Long-term effect analysis examines whether a treatment’s impact holds up, grows, or fades over time, rather than only in the short window of the experiment.
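To make the user-level randomization idea concrete for myself, here is a minimal sketch of deterministic bucketing by user ID; the function name, the experiment name, and the 50/50 split are my own assumptions rather than anything taken from the book.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user into control or treatment.

    Hashing the user ID together with the experiment name keeps a user's
    assignment stable across sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# The same user always lands in the same group for a given experiment.
print(assign_variant("user-12345", "checkout-promo"))
```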
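And here is a small simulation (mine, not the book’s) tying the Type I/Type II error and variance bullets together: a two-sample t-test on synthetic per-user revenue, where inflating the variance or shrinking the sample makes the same true lift harder to detect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic per-user revenue for control and treatment; the true lift is +0.2.
control = rng.normal(loc=10.0, scale=4.0, size=5_000)
treatment = rng.normal(loc=10.2, scale=4.0, size=5_000)

# Welch's two-sample t-test. Declaring a difference (p < 0.05) when there is
# no real lift would be a Type I error; failing to detect a real lift
# (p >= 0.05) would be a Type II error.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Increasing `scale` (variance) or shrinking `size` widens the confidence
# interval, which makes the same true lift harder to detect.
```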
