Applications of Bayesian Optimization in Digital Context
In the previous article about multi-armed bandit algorithms, different (not all) variants of multi-armed bandit algorithms were shown. In order to understand the quite high-performance variant of Thompson Sampling, the term and the concept of Bayesian statistics must be understood.
Can we estimate the range of values of conversion rates?
Let’s imagine the scenario that in the past an average conversion rate of 2.5% was measured for our online store. This rate was created on the basis of 1000 store visitors, i.e. 25 store visitors actually made a purchase. The marketing department now estimates that a fourfold increase in the number of visitors based on the evaluated search volume data is possible through extensive search engine optimization. How certain can we be that the conversion rate will increase? Can we express that as a percentage, how likely a certain range of values is?
In probability theory, it is common for (infinitely) repeated passes to force the approximation of a true value or distribution. Bayesian statistics, on the other hand, works by gradually combining existing knowledge (e.g., about conversion rates) with new knowledge. This gradually produces a new a posteriori probability distribution, which in turn can be used for new data.
Let’s assume that Θ (“theta”) represents the true value of the conversion rate. Θ is often used for an unknown parameter of a probability distribution. This ranges between 0 and 1 → [0,1]. By estimating Θ, we find answers to the following questions:
- With what probability does Θ range between 0.5% and 1%?
- How many conversions are possible in a range of 0.5% and 1%?
Choice of an initial distribution (A priori)
Does it seem plausible that all conversion rates are equally likely? No. A conversion rate above 10%, 25% or even 50% is not realistic. Thus, an inital model that sees all rates as equally likely is not suitable. A suitable distribution is e.g. the beta distribution. However, it is also possible to choose another distribution that represents one’s own presumed distribution of values.
For the development of a beta distribution two parameters are necessary, which gives the curve its form. To quickly arrive at a suitable distribution, it can help to align the shaping parameters a and b according to the expected value. The expected value here can represent a realistic conversion rate, e.g. 2% (= 0.02).
The expected value of a beta distribution basically corresponds to the quotient a / (a+b). If, for example, b = 30, then we resolve to a:
a / a + 30 = 0.02 (2%)
a = 0.02a + 0.6
0.98a = 0.6 | / 0.98
a = 0.61224489…
Graphically it looks like this:
For a better understanding, here are more examples:
a = 2, b = 10, expected value therefore approx. 16.67%.
a = 2, b = 20, expected value therefore approx. 9,09%.
Now we make use of the set of rules of Bayesian statistics: To develop our a posteriori probability distribution, we now need an initial data set, which was given as an example at the beginning of this article (1000 visitors, 2.5% conversion rate).
The exact algebra to derive the following trick requires advanced knowledge of Bayesian statistics. The main thing to understand at this point is that our initial distribution can be improved by a simple parameter expansion, as follows:
Let us assume that we have chosen a beta distribution with parameters a = 2, b = 20 at the beginning. Our function is therefore f(a,b) = f(2,20). If we basically assume a beta distribution, then our a posteriori probability distribution is given by the following function:
Note: The binomial distribution was used as likelihood in the derivation.
What does this mean for our case?
The parameter a is extended by the absolute number of conversions and b by the number of cases where no conversion was achieved.
New parameters a = 27, b = 995.
Now we come to the final. If we form the integral of e.g. 0.02 to 1 of this function, we can give a probability of how likely a conversion rate of 1% / 2% / 3% / 4% / 5% and higher could be. By hand this makes little sense, ideally a corresponding programming code does this (example in Python):
import scipy as sp from scipy import stats betacdf = sp.stats.beta(27,995).cdf print((betacdf(1)-betacdf(0.01))*100) → 99,999%, print((betacdf(1)-betacdf(0.02))*100) → 90,897% print((betacdf(1)-betacdf(0.03))*100) → 22,796% print((betacdf(1)-betacdf(0.04))*100) → 0,785% print((betacdf(1)-betacdf(0.05))*100) → 0.006%
Interpretation: Assuming a beta distribution of conversion rates, we can assume that, according to the available data set, the conversion rate lies between 1% and 100% with a probability of 99.9999%,
of 90,897% between 2% and 100%,
from 22,796% between 3% and 100%,
from 0,785% between 4% and 100%,
0.006% between 5% and 100%.
Special case: All conversion rates are equally likely
If we assume that all conversion rates are equally likely (a = b = 1!) at the beginning, our function with appropriate initial parameters is:
f(1,1) → with our new findings we get f(1+25,1+1000-25) → f(26,976).
Compared to our previous model f(27,995), this gives almost the same distribution.
From this we learn that the initial distribution is not really crucial in such a conversion rate measurement. What matters most is the choice of the optimal distribution.
Develop optimization algorithms based on Bayesian statistics
The previous explanations are merely a theoretical introduction to viewing and calculating conversion rates from a statistical perspective. Now it is time to show concrete use cases.
In digital marketing, permanent optimization processes are necessary to achieve visibility effects. The aim is always to approach the optimium (“the best possible marketing”). Two questions need to be answered here:
- What is the realistic optimium?
- Which measures lead us step by step to the realistic optimium?
Now we cross again the field of Multi-Armed Bandit Algorithms. We will now approach a Monte Carlo Simulation with a test aspect as an example. Specifically, it will be about the formulation of meta-descriptions. If you are not familiar with search engine optimization, these are the longer descriptions below the title of an organic search result:
In the world of search engine optimization, the definition & formulation of the meta description is an effective tool to move more users to the website via search engines. The limited number of words requires a considered formulation – an optimum is difficult to determine. It seems sensible to compare several alternatives and decide on the best one – in the spirit of multi-armed bandit algorithms.
Thompson Sampling & Monte Carlo Simulation for Meta Descriptions
We formulate two different meta descriptions X and Y and formulate an identical title for both variants. During the clearly defined test periods with identical runtimes, the result is 150 impressions & 40 clicks for variant X and 200 impressions & 55 clicks for variant Y. Impressions here are users who typed in a specific keyword and were shown one of the two variants accordingly. You cannot have different titles & meta descriptions defined at the same time with the same document.
If we were to do another data collection and it turned out that variant X gets 45 clicks on 180 impressions & variant Y gets 40 clicks on 180 impressions, then the curves would change as follows:
Variant Y would suddenly have a higher expected value and would be preferable. What do we do now? We leave the algorithmic decision to a multi-armed bandit algorithm, e.g. Thompson Sampling.
With the help of a Monte Carlo simulation, the application of Thompson Sampling can be well illustrated. For this, Chris Stucchio has designed a corresponding test script. We set two fictitious click-through rates for fictitious meta-descriptions. Why? To check whether Thomspon sampling actually leads to the choice of the better variant. Variant A has a 15% click-through rate and variant B has a much better rate of 45%. Now we simulate 1000 random visitors and get the following evaluation after executing the script:
The findings from the experiment:
- The algorithm correctly prefers the second meta-description by the higher CTR value (and this already after 100 test subjects!).
- The algorithm correctly converges to the highest possible click-through rate of 45%.
Further applications of Thompson Sampling / Bayesian inference in practice:
- Amazon (http://www.kdd.org/kdd2017/papers/view/an-efficient-bandit-algorithm-for-realtime-multivariate-optimization)
- Facebook (https://www.youtube.com/watch?v=A-JJvYaBPUU&t=1474s)
- Netflix (https://info.dataengconf.com/hubfs/SF%2018%20-%20Slides/DS/A%20Multi-Armed%20Bandit%20Framework%20for%20Recommendations%20at%20Netflix.pdf)
- Apple (https://machinelearning.apple.com/research/interpretable-adaptive-optimization)