Understanding Multi-Armed Bandits Experiments
05/30/2022

The concept of A/B testing is a widely used methodology to compare the performance on one or more variables of two (or more) variants. The comparison represents a universal practice used by advertisers for data-based evaluation of advertising effectiveness. This is correct, since subjective perception is irrelevant for economically effective decisions. Now the question must be asked whether there are any alternatives to A/B testing that might produce (even) more optimal solutions. This is what we want to deal with today.
Introductory understanding: The mathematics of A/B testing
The task is to compare two variants (classic A/B testing) with regard to the performance of a selected variable. Possible variables are the number of conversions (form submissions, mail sends, downloads or calls) or metrics of web analytics (bounce rate, exit rate or dwell time). The two test variants should only differ in one particular aspect (different visuals, title or headline wording, coloring, text length) so that a difference in impact can be justified by the variable. “Random” test subjects are shown one of the two variants at a time, i.e., at no moment is the user aware of being part of a test.
The absolute “dominance” of the original or the test variant with regard to a target variable (wording: “more conversions”, “lower bounce rate”) does not represent the decision criterion for a “better” variant in A/B testing. We would like to be able to formulate statistically significant statements that the test course is not a “coincidence”. A Chi square independence test is used. This tests the stochastic independence of the correlation of variables. For the calculation of the statistical value, also for a basic understanding, the development of the cross table is suitable in the first step. We use fictitious data for this:
Original | Variant | ∑ | |
Visitors without conversions | 500 | 450 | 950 |
Visitors with conversions | 20 | 5 | 25 |
∑ | 520 | 455 | 975 |
The Chi Square value is calculated by forming the following sum:
∑ (observed frequency – expected frequency)^2 / (expected frequency).
The observed values can be taken from the crosstab. The calculation of the expected frequency requires the total sum of the variants as well as the sample size.
Original | Variant | ∑ | |
Visitors without conversions | 950*520/975 | 950*455/975 | 950 |
Visitors with conversions | 25*520/975 | 25*455/975 | 25 |
∑ | 520 | 455 | 975 |
The expected frequencies are:
Original | Variant | ∑ | |
Visitors without conversions | 506,67 | 443,33 | 950 |
Visitors with conversions | 13,33 | 11,67 | 25 |
∑ | 520 | 455 | 975 |
! A significance can only exist if a sample with more than 50 “random” persons is examined.
For the calculation of the chi-square value all values are available:
(500-506,67)^2/506,67 + (450-443,33)^2/443,33 + (20-13,33)^2/13,33 + (5-11,67)^2/11,67 = 7,33791
The level of a chi-square value does not yet allow a final interpretation. What is needed is the critical value from the Chi Square distribution table. The choice of the critical value depends on the significance level & the number of degrees of freedom.
The degrees of freedom can be calculated by the formula: (number of columns – 1) * (number of rows – 1) (excluding sum row & sum column). In our 2×2 crosstab, this results in a value of (2-1)*(2-1) = 1.
The higher the significance level is set, the more sensitive/strict the result is seen.

If we choose a significance level of 1%, the critical value of the Chi Square distribution is 6.63. The correct wording is now: the hypothesis that conversions are independent of different variants can be rejected at a 1% significance level. The number of conversions is highly significantly different.
Thus, a statistical statement can be formulated about the data set at hand. Despite all the joy about the result, only one question arises: Does the world always consist of only two options? If the (invisible) third version would double the conversions, only the second best option is chosen without knowing the best one. A suboptimal version does not make economic sense. At this point, it should be noted that A/B tests can also be conducted with multiple variants. An exemplary test model can start with the parallel display of several variants and compare the best variants again halfway through the test period (see the following simulation).
Multi-Armed Bandits – The new standard
The explanatory example of an alternative approach is, first, the idea of parallel pursuit of several strategies and intensification of those that are strongest with respect to the objective. This concept is referred to as exploitation. We reward the best variant in the steady progression of our test. We explore each variant sufficiently to identify those with the highest forced performance. A selection of mutil-armed bandit algorithms will be briefly described below:
- Epsilon (ε) Greedy: Initially an intial number of purely exploratory trials, then randomly select variants in ε*100 percent of the time. In 1-ε time, the best variant is chosen. Simple example: If ε is set to 0.2, the algorithm explores random variants 20% of the time, and 80% of that time for the variant with the highest success. Optionally decreasing ε with time results in the most successful variant being explored more and more intensively (The term of 1-ε becomes larger as ε becomes smaller). This is referred to as Decayed Epsilon Greedy.
- Upper Confidence Bounds (UCB): If the algorithm is “unsure” about certain variants, it intensifies the exploration process of the variant. If the degree of success of a variant has become clearer to the algorithm, the intensity of the exploration process is reduced (identical concept to Decayed Epsilon Greedy).
- Thompson Sampling: Represents a Bayesian alternative. Determination of a probability distribution based on previous results. For a new trial, the algorithm selects the variant based on the a posteriori probability of greatest success (A-posteriori = obtained from experience).
