## The What, Why, and How of A/B Testing in Machine Learning

### A/B tests are a key tool of business decision-making

September 16, 2022

A/B tests are a key tool of business decision-making. In this article, we’ll discuss the what and why of A/B testing, how an A/B test is designed, and how to easily set them up in production.We’ll also briefly touch on other kinds of experiments that can be run in production, and how to set them up with the Wallaroo experimentation framework. You can try theshadow deployment examplealong with otherWallaroo tutorialsthrough theFree Wallaroo Community Edition.

An A/B test, also called a controlled experiment or a randomized control trial, is a statistical

method of determining which of a set of variants is the best. A/B tests allow organizations

and policy-makers to make smarter, data-driven decisions that are less dependent on

guesswork.

In the simplest version of an A/B test, subjects are randomly assigned to either the control

group (group A) or the treatment group (group B). Subjects in the treatment group receive

the treatment (such as a new medicine, a special offer, or a new web page design) while the

control group proceeds as normal without the treatment. Data is then collected on the

outcomes and used to study the effects of the treatment.

This idea has been around for a long time. Historically, farmers have divided their fields into sections to test whether various treatments can improve their crop yield. Something like an A/B nutrition test even appears in the Old Testament!

“Please test your servants for ten days. Let us be given vegetables to eat and water to drink. You can then compare our appearance with the appearance of the young men who eat the royal rations…” (Daniel 1:12-13)

In 1747, Dr. James Lind conducted one of the earliest clinical trials, testing the efficacy of citrus fruit for curing scurvy.

Today, A/B tests are an important business tool, used to make decisions in areas like product pricing, website design, marketing campaign design, and brand messaging. A/B testing lets organizations quickly experiment and iterate in order to continually improve their business.

In data science, A/B tests can also be used to choose between two models in production, by measuring which model performs better in the real world. In this formulation, the control is often an existing model that is currently in production, sometimes called the ** champion**. The treatment is a new model being considered to replace the old one. This new model is sometimes called the

**. In our discussion, we’ll use the terms**

*challenger**champion*and

*challenger*, rather than

*control*and

*treatment*.

Keep in mind that in machine learning, the terms

*experiments*and

*trials*also often refer to the process of finding a training configuration that works best for the problem at hand (this is sometimes called hyperparameter optimization). In this article, we will use the term

*experiment*to refer to the use of A/B tests to compare the performance of different models in production.

**Designing a Machine Learning A/B Test**

A/B tests are a useful way to rely less on opinions and intuition and to be more data-driven in decision-making, but there are a few principles to keep in mind. The experimenter has to decide on a number of things.

**First, decide what you are trying to measure. **We’ll call this the *Overall Evaluation Criterion *or *OEC*. This may be different and more business-focused than the loss function used while training the models, but it must be something you can measure. Common examples are revenue, click-thru rate, conversion rate, or process completion rate. **Second, decide how much better is “better”. **You might want to just say “Success is when the challenger is better than the champion,” but that’s actually not a testable question, at least not in the statistical sense. You have to decide *how much *better the challenger has to be. Let’s define two quantities:

●

y_{0}:The champion’s assumed OEC. Since the champion has been running for a while, we should have a good idea of this value. For example, if we are measuring conversion rate, then we might already know that the champion typically achieves a conversion rate ofy2%._{0 }=●

δ: theminimum delta effect sizewe want to reliably detect. This is how much better the challenger needs to be for us to declare it “the winner.” For example, we may decide to switch to our challenger model if it improves the conversion rate by at least 1% – that is, we want the challenger to have a conversion rate of at least 0.02 * (1.01) = 0.0202. This meansδ= 0.002.

Note that some sample size calculators (we’ll get to that below) specify minimum effect size as a relative delta; in our example, the relative delta is 0.01 (1%), and the absolute delta is 0.002.

**Third, decide how much error you want to tolerate**. Again, you probably want to say “none,” but that isn’t practical. The less error you can tolerate, the more data you need, and in an online setting, the longer you have to run the test. In the classical statistics formulation, an A/B test has the following parameters to describe the error:

●

α:thesignificance, orfalse positive ratethat we are willing to tolerate. Ideally, we wantαas small as possible; in practice,αis usually set to 0.05. You can think of this as meaning that if we run an A/B test over and over again, we will incorrectly pick an inferior challenger 5% of the time.●

β:thepower, ortrue positive ratewe want to achieve. Ideally, we would likeβnear 1; in practice,βis usually set to 0.8. You can think of this as meaning that if we run an A/B test over and over again we will correctly pick a superior challenger 80% of the time.Note that

αandβare talking about incompatible circumstances (that’s why they don’t add up to 1). The first case assumes the challenger is worse, the other case assumes it’s better; finding out which situation we are in is the whole point of an A/B test.

There’s one last parameter in an A/B test:

●

the minimum number of examples (per model) we have to examine to make sure our false positive raten:αand true positive rateβthresholds are met. Or as it’s commonly said: “to make sure we achieve statistical significance.”Note that n is per model: so if you are routing your customers between A and B with a 50-50 split, you need a total experiment size of 2*n customers. If you are routing 90% of your traffic to A and 10% to B, then B has to see at least n customers (and A will then see around 9*n). So a 50-50 split is the most efficient, although you may prefer an unbalanced split for other reasons, like safety or stability.

To run an A/B test, the experimenter picks **α**, **β**, and the minimum effect size **δ**, and then determines *n*. We won’t go into the formula for calculating *n *here; so-called power calculators or sample-size calculators exist to do that for you. Here’s one for rates, from Statsig; it defaults to **α** = 0.05, **β** = 0.8, and split ratio of 50-50. Feel free to play around to get a sense of how big sample sizes have to be in different situations.

Once you’ve run the A/B test long enough to achieve the necessary *n*, measure the OEC for each model. If OEC_{challenger }– OEC_{champion }> **δ**, then the challenger wins! Otherwise, you may wish to stick with the champion model.

**Some Practical Considerations**

**Splitting your subjects: **When splitting your subjects up randomly between models, make sure the process is truly random, and think through any interference between the two groups. Do they communicate or influence each other in some way? Does the randomization method cause an unintended bias? Any bias in group assignments can invalidate the results. Also, make sure the assignment is consistent so that each subject always gets the same treatment. For example, a specific customer should not get different prices every time they reload the pricing page.

**A/A Tests: **It can be a good idea to run an A/A test, where both groups are control or treatment groups. This can help surface unintentional biases or errors in the processing and can give a better feeling for how random variations can affect intermediate results.

**Don’t Peek!: **Due to human nature, it’s difficult not to peek at the results early and draw conclusions or stop the experiment before the minimum sample size is reached. Resist the temptation. Sometimes the “wrong” model can get lucky for a while. You want to run a test long enough to be confident that the behavior you see is really representative and not just a weird fluke. **The more sensitive a test is, the longer it will take: **The resolution of an A/B test (how small a delta effect size you can detect) increases as the square of the samples. In other words, if you want to halve the delta effect size you can detect, you have to *quadruple *your sample size.

**Extensions to A/B Testing**

##### Bayesian A/B Tests

The classical (AKA frequentist) statistical approach to A/B testing that we described above can be a bit unintuitive for some people. In particular, note that the definitions of **α** and **β** posit that we run the A/B test over and over; in actuality, we generally run it only once (for a specific A and B). The Bayesian approach takes the data from a single run as a given, and asks, “What OEC values are consistent with what I’ve observed?”

The general steps for a Bayesian analysis are roughly:

1) Specify prior beliefs about possible values of the OEC for the experiment groups. An example prior might be that conversion rates for both groups are different and both between 0 and 10%.

2) Define a statistical model using a Bayesian analysis tool (ie. using distributional techniques) and flat, uninformative, or equal priors for each group.

3) Collect data and update the beliefs on possible values for the OEC parameters as you go. The distributions of possible OEC parameters start out encompassing a wide range of possible values, and as the experiment continues the distributions tend to narrow and separate (if there is a difference).

4) Continue the experiment as long as it seems valuable to refine the estimates of the OEC. From the posterior distributions of the effect sizes, it is possible to estimate the delta effect size.

Posterior distributions of a Bayesian treatment/control test. Source: Win Vector Blog

Note that a Bayesian approach to A/B testing does not necessarily make the test any

shorter; it simply makes quantifying the uncertainties in the experiment more

straightforward, and arguably more intuitive. For a worked example of frequentist and

Bayesian approaches to treatment/control experiments (in the context of clinical trials) see

this blog post from Win Vector LLC.

##### Multi-Armed Bandits

If you want to minimize the waiting until the end of an experiment before taking action,

consider Multi-Armed Bandit approaches. Multi-armed bandits dynamically adjust the

percentage of new requests that go to each option, based on that option’s past

performance. Essentially, the better performing a model is, the more traffic it gets—but some

small amount of traffic still goes to poorly performing models, so the experiment can still

collect information about them. This balances the trade-off between exploitation (extracting

maximal value by using models that appear to be the best) and exploration (collecting

information about other models, in case they turn out to be better than they currently

appear). If a multi-armed bandit experiment is run long enough, it will eventually converge

to the best model, if one exists.

Multi-armed bandit tests can be useful if you can’t run a test long enough to achieve

statistical significance; ironically, this situation often occurs when the delta effect size is

small, so even if you pick the wrong model, you don’t lose much. In fact, the

exploitation-exploration tradeoff means that you potentially gain more value during the

experiment than you would have running a standard A/B test.

## A/B Testing in Production

Once you’ve designed your A/B test, the Wallaroo ML deployment platform can help you

get it up and running quickly and easily. The platform provides specialized pipeline

configurations for setting up production experiments, including A/B tests. All the models in

an experimentation pipeline receive data via the same endpoint; the pipeline takes care of

allocating the requests to each of the models as desired.

Requests can be allocated in a number of ways. For an A/B test, you would use random

split. In this allocation scheme, requests are distributed randomly in the proportions you

specify: 50-50, 80-20, or whatever is appropriate. If session information is provided, the

pipeline ensures that it is respected: for example, customer ID information can be used to

make sure a specific customer always sees the output from the same model.

The Wallaroo pipeline keeps track of which requests have been routed to each model and

the resulting inferences. This information can then be used to calculate OECs to determine

each model’s performance.

#### Other Types of Experiments

Wallaroo experimentation pipelines allow other kinds of experiments to be run in

production.

With key split, requests are distributed according to the value of a key, or query attribute. For

example, in a credit card company scenario, gold card customers might be routed to model

A, platinum cardholders to model B, and all other cardholders to model C. This is not a good

way to split for A/B tests but can be useful for other situations, for example, a slow rollout of

a new model.

With shadow deployments, all the models in the experiment pipeline get all the data, and all

inferences are logged. However, the pipeline only outputs the inferences from one

model–the default, or champion model.

Shadow deployments are useful for “sanity checking” a model before it goes truly live. For

example, you might have built a smaller, leaner version of an existing model using

knowledge distillation or other model optimization techniques, as discussed here. A shadow

deployment of the new model alongside the original model can help ensure that the new

model meets desired accuracy and performance requirements before it’s put into

production.

*A/B tests and other types of experimentation are part of the ML lifecycle. The ability toquickly experiment and test new models in the real world helps data scientists tocontinually learn, innovate, and improve AI-driven decision processes. The Wallarooplatform helps to optimize the last mile of the ML model journey. *

*This post is in collaboration with our friends at Wallaroo. If you are interested in learning how Wallaroo can help you with ML deployment, reach out to them at [email protected].You can try a shadow deployment example along with other Wallaroo tutorials through theFree Wallaroo Community Edition.*

Want to be notified when content like this drops on our blog? Check out our MLOps Community newsletter.