Association analysis in web analytics
06/04/2022

The Apriori algorithm belongs to the family of association analyses. From a set of transactions over arbitrary items, e.g. products in a shopping cart, visited web pages or travel destinations covered by different bloggers, it tries to identify useful associations based on how often items occur together.
The terms premise and conclusion are important at this point. The following relationship must be found:
A → B, where A is called the premise and B the conclusion. Two further quantities drive the Apriori algorithm: support and confidence. A minimum support and a minimum confidence are chosen at the start of the algorithm; both are probabilistic measures. The support of an item (or item set) is the probability that it occurs in a transaction. The confidence is the probability of the conclusion given the premise, in other words: if the premise items are present in a transaction, how likely is it that the conclusion items are present as well? This will become clearer in a concrete example.
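To make the two measures concrete, here is a minimal sketch in Python, assuming each transaction is represented as a set of items (the function names are only for illustration):

# Sketch of the two measures used by Apriori; transactions are assumed to be sets of items.
def support(itemset, transactions):
    # Share of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(premise, conclusion, transactions):
    # P(conclusion | premise) = support(premise and conclusion) / support(premise)
    return support(premise | conclusion, transactions) / support(premise, transactions)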
Example of the Apriori algorithm in web analytics
Imagine we are a supplier of a nasal spray for people suffering from a cold or the flu. Our website has a simple structure:
Page 1 is the main page
Page 2 is a topic page about cold and flu
Page 3 would be a conversion-oriented landing page for nasal spray
Page 4 is a topic page on treatment methods for acute colds
Page 5 would be a company presentation page (“About Us”).
The company asks the question: “We spent a lot of time working on page 4. If someone wants to buy our nasal spray, that is, expresses an intention to buy by visiting page 3, did they look at page 4 beforehand? Is page 4 therefore a crucial source of information in the buying process?”
Five users have visited our website, each navigating through various subpages. In our sample, the following user behavior emerged:
       | Page 1 | Page 2 | Page 3 | Page 4 | Page 5
User 1 | x      | x      |        | x      | x
User 2 |        | x      | x      | x      |
User 3 |        |        | x      | x      |
User 4 |        | x      | x      | x      |
User 5 | x      |        | x      |        |
For this example we set the minimum support to 40% and the minimum confidence to 60%. We could also demand a confidence of 100%; with these thresholds we express how certain we want to be about the discovered association rules.
In the first step the support is easy to calculate. How often was the main page (page 1) accessed across users? Exactly twice. Since we have five users in total, page 1 has a support of 40% (2 out of 5 users). The support is calculated for every page. Page 5 (“About us”) was visited by only one user, i.e. by 20% of users (1 out of 5), which is below our minimum support of 40%. Page 5 is therefore dropped from the analysis.
                     | Page 1    | Page 2    | Page 3    | Page 4    | Page 5
User 1               | x         | x         |           | x         | x
User 2               |           | x         | x         | x         |
User 3               |           |           | x         | x         |
User 4               |           | x         | x         | x         |
User 5               | x         |           | x         |           |
#                    | 2         | 3         | 4         | 4         | 1
Support              | 2/5 = 40% | 3/5 = 60% | 4/5 = 80% | 4/5 = 80% | 1/5 = 20%
>= 40% (min support) | OK!       | OK!       | OK!       | OK!       | –
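This first pass can be sketched in a few lines of Python. The page labels are placeholders, and the five visits are encoded as sets according to the table above:

# Minimum-support check for single pages (first Apriori pass), assuming the
# visits from the table above, encoded as sets of placeholder page labels.
visits = [
    {'page1', 'page2', 'page4', 'page5'},   # User 1
    {'page2', 'page3', 'page4'},            # User 2
    {'page3', 'page4'},                     # User 3
    {'page2', 'page3', 'page4'},            # User 4
    {'page1', 'page3'},                     # User 5
]
min_support = 0.4

# Count for every page in how many visits it occurs and keep only the frequent ones
all_pages = set().union(*visits)
support = {page: sum(page in v for v in visits) / len(visits) for page in all_pages}
frequent_pages = [page for page, s in support.items() if s >= min_support]
print(sorted(frequent_pages))   # page5 (20%) drops out, page1 to page4 remain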
Pages 1–4 reach the minimum support. From all pages that reached the minimum support, pairs of two are formed.
For example:
{Page 1, Page 2}: Main page and the topic page appear together.
{Page 1, Page 3}: Main page and landing page appear together.
All new candidates: {Page 1, Page 2}; {Page 1, Page 3}; {Page 1, Page 4}; {Page 2, Page 3}; {Page 2, Page 4}; {Page 3, Page 4}. Again we create a table, this time to check how often the pages of each pair occur together.
                     | Page 1, Page 2 | Page 1, Page 3 | Page 1, Page 4 | Page 2, Page 3 | Page 2, Page 4 | Page 3, Page 4
User 1               | x              |                | x              |                | x              |
User 2               |                |                |                | x              | x              | x
User 3               |                |                |                |                |                | x
User 4               |                |                |                | x              | x              | x
User 5               |                | x              |                |                |                |
#                    | 1              | 1              | 1              | 2              | 3              | 3
Support [%]          | 20%            | 20%            | 20%            | 40%            | 60%            | 60%
>= 40% (min support) | –              | –              | –              | OK!            | OK!            | OK!
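The second pass follows the same pattern: build all candidate pairs from the frequent single pages and keep only those that reach the minimum support. Again a sketch with placeholder page labels and the visits from the table above:

from itertools import combinations

# Second Apriori pass: candidate pairs from the frequent single pages,
# filtered by the minimum support of 40%. Visits as in the table above.
visits = [
    {'page1', 'page2', 'page4', 'page5'},   # User 1
    {'page2', 'page3', 'page4'},            # User 2
    {'page3', 'page4'},                     # User 3
    {'page2', 'page3', 'page4'},            # User 4
    {'page1', 'page3'},                     # User 5
]
min_support = 0.4
frequent_pages = ['page1', 'page2', 'page3', 'page4']       # result of the first pass

candidate_pairs = list(combinations(frequent_pages, 2))     # the six candidates
frequent_pairs = [pair for pair in candidate_pairs
                  if sum(set(pair) <= v for v in visits) / len(visits) >= min_support]
print(frequent_pairs)   # [('page2', 'page3'), ('page2', 'page4'), ('page3', 'page4')]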
The pairs Page 2 + 3, Page 2 + 4 and Page 3 + 4 reach the minimum support. Now we can already derive the first rules. Each pair yields two association rules to check, and it is important to check both directions:
{Page 2, Page 3}: Page 2 → Page 3? Page 3 → Page 2? (If page 2 occurs, does page 3 occur? And vice versa.)
{Page 2, Page 4}: Page 2 → Page 4? Page 4 → Page 2? (If page 2 occurs, does page 4 occur? And vice versa.)
{Page 3, Page 4}: Page 3 → Page 4? Page 4 → Page 3? (If page 3 occurs, does page 4 occur? And vice versa.)
Why are both directions checked? Because confidence is not symmetric: its denominator is the support of the premise, so A → B and B → A generally have different confidence values.
How often do premise and conclusion occur together? These values can be read from the pair table above: {Page 2, Page 3} occurs in two out of five cases, {Page 2, Page 4} in three out of five cases, and {Page 3, Page 4} likewise in three out of five cases. These counts are the numerators of the fractions we build; for Page 2 > Page 3 and Page 3 > Page 2, for example, the numerator is two. The denominator is the number of occurrences of the premise, i.e. the item before the “>”; these counts can be read from the single-page table above.
If the resulting fraction reaches the minimum confidence of 60%, we have found a rule. And indeed, associations emerge: in 100% of the cases a user who called up page 2 also called up page 4. This can be interpreted in many ways. Is the content of page 2 too weak, so that additional information from page 4 is needed?
           | 2 > 3        | 3 > 2     | 2 > 4      | 4 > 2     | 3 > 4     | 4 > 3
Confidence | 2/3 = 66.67% | 2/4 = 50% | 3/3 = 100% | 3/4 = 75% | 3/4 = 75% | 3/4 = 75%
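The same confidences can be recomputed directly from the counts in the two tables above; the page labels are placeholders:

# Confidence of the pair rules in both directions:
# conf(A -> B) = count({A, B}) / count({A}).
single_count = {'page2': 3, 'page3': 4, 'page4': 4}           # from the single-page table
pair_count = {('page2', 'page3'): 2, ('page2', 'page4'): 3,   # from the pair table
              ('page3', 'page4'): 3}

for (a, b), both in pair_count.items():
    print(f'{a} -> {b}: {both / single_count[a]:.2%}')
    print(f'{b} -> {a}: {both / single_count[b]:.2%}')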
According to the algorithm, only the item sets that reached the minimum support in the previous iteration need to be considered when building the candidates for the next one. For our example this makes little difference, since only one candidate set remains for the final iteration.
Now we check combinations of three pages. Only one set remains to be checked, {Page 2, Page 3, Page 4}, since page 5 was already removed in the first iteration and page 1 in the second.
                     | Page 2 + Page 3 + Page 4
User 1               |
User 2               | x
User 3               |
User 4               | x
User 5               |
#                    | 2
Support              | 2/5 = 40%
>= 40% (min support) | OK!
We check the minimum confidence for six rules:
Page 2 > Page 3, Page 4
Page 3, Page 4 > Page 2
Page 3 > Page 2, Page 4
Page 2, Page 4 > Page 3
Page 4 > Page 2, Page 3
Page 2, Page 3 > Page 4
Here one must look carefully. Take Page 3, Page 4 → Page 2, for example: {Page 3, Page 4} is the premise, so its pair count goes into the denominator, which is therefore three. The numerator is the same for all six rules: the three-page combination occurs twice, so the numerator is two. Two rules fail to reach the minimum confidence.
           | 2 > 3,4      | 3,4 > 2      | 3 > 2,4   | 2,4 > 3      | 4 > 2,3   | 2,3 > 4
Confidence | 2/3 = 66.67% | 2/3 = 66.67% | 2/4 = 50% | 2/3 = 66.67% | 2/4 = 50% | 2/2 = 100%
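As a small sketch, these six confidences can be computed directly from the counts established above; the numerator is always the count of the triple (two), the denominator the count of the respective premise:

# Confidence of the six rules derived from the frequent triple {page2, page3, page4}.
premise_count = {
    ('page2',): 3, ('page3',): 4, ('page4',): 4,                           # single-page counts
    ('page2', 'page3'): 2, ('page2', 'page4'): 3, ('page3', 'page4'): 3,   # pair counts
}
triple_count = 2   # {page2, page3, page4} occurs in two of the five visits

for premise, count in premise_count.items():
    conclusion = tuple(sorted({'page2', 'page3', 'page4'} - set(premise)))
    print(f'{premise} -> {conclusion}: {triple_count / count:.2%}')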
All association rules in the overview
Page 2 > Page 3: If the topic subpage on colds and flu is called up, then the landing page is also called up with a confidence of approx. 66.67%.
Page 2 > Page 4: If the topic subpage on colds and flu is called up, then the topic page on treatment methods is also called up (100% confidence!).
Page 4 > Page 2: With a confidence of 75%, a user who visits the topic page on treatment methods also visits the topic subpage on colds and flu.
Page 3 > Page 4: With a confidence of 75%, a user visits the topic page on treatment methods if the landing page was visited beforehand.
Page 4 > Page 3: The landing page is visited with a confidence of 75% if the topic page on treatment methods was previously visited.
Page 2 > Page 3, Page 4: With a confidence of 66.67%, the landing page and the topic page on treatment methods are visited if the topic page on cold and flu was previously visited.
Page 3, Page 4 > Page 2: With a confidence of 66.67%, the topic page on cold and flu is also viewed when the landing page and the topic page on treatment methods are viewed.
Page 2, Page 4 > Page 3: With a confidence of 66.67%, the landing page is also accessed when the topic page on cold and flu and the topic page on treatment methods are accessed.
Page 2, Page 3 > Page 4: 100% of the time: If the landing page and the topic page on the common cold are called up, the topic page on treatment methods is also called up.
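The whole worked example can also be reproduced with the efficient-apriori library that is used in the script below. With min_support=0.4 and min_confidence=0.6 it should return the nine rules listed above; the page labels are placeholders for the real URLs:

from efficient_apriori import apriori

# The five example visits, encoded with placeholder page labels as in the tables above
visits = [
    ('page1', 'page2', 'page4', 'page5'),   # User 1
    ('page2', 'page3', 'page4'),            # User 2
    ('page3', 'page4'),                     # User 3
    ('page2', 'page3', 'page4'),            # User 4
    ('page1', 'page3'),                     # User 5
]

# Same thresholds as in the worked example: 40% minimum support, 60% minimum confidence
itemsets, rules = apriori(visits, min_support=0.4, min_confidence=0.6)
for rule in sorted(rules, key=lambda r: -r.confidence):
    print(rule)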
Technical example with VisitorProfile interface from Matomo (formerly: Piwik)
With a few lines of Python we can iterate over the output of the Matomo API and run the analysis on the users’ click behavior for any period of time. Feel free to test the script, but note that it only works with a self-hosted version of Matomo; Matomo for WordPress does not work! Please install efficient-apriori beforehand (pip install efficient-apriori); the remaining imports (urllib, xml.etree.ElementTree) are part of the Python standard library.
import xml.etree.ElementTree as ET
import urllib.request
from efficient_apriori import apriori

# Self-hosted Matomo instance and API token
token = 'INSERT_MATOMO_TOKEN'
host = 'http://example.com'

# Request the visit details of the last 10 days in XML format
request_url = host + '/analytics/?module=API&method=Live.getLastVisitsDetails&idSite=1&period=day&date=last10&format=xml&token_auth=' + token

opener = urllib.request.build_opener()
tree = ET.parse(opener.open(request_url))
root = tree.getroot()

# Collect one click path (list of visited URLs) per visit
click_paths = []
for child in root:
    clickHistory = child.find('actionDetails')
    paths = []
    for row in clickHistory:
        url = row.find('url')
        if url is not None:   # some action entries (e.g. events) carry no URL
            paths.append(url.text)
    click_paths.append(paths)

# Remove duplicate URLs within a visit while keeping their order
candidates = []
for l in click_paths:
    l = list(dict.fromkeys(l))
    candidates.append(l)

# Run the Apriori algorithm on the collected click paths
itemsets, rules = apriori(candidates, min_support=0.5, min_confidence=1)
for x in rules:
    print(x)
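The thresholds in the call above (min_support=0.5, min_confidence=1) are deliberately strict. To mirror the worked example they could be relaxed, for instance:

# Same thresholds as in the worked example above
itemsets, rules = apriori(candidates, min_support=0.4, min_confidence=0.6)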
