Search Keywords clustering with the Levenshtein algorithm & Affinity Propagation
01/02/2023

Part of the data mining board used internally is keyword clustering, based on the Levenshtein algorithm. What is a one-time piece of diligence work during an SEO audit becomes a time-consuming challenge in weekly and otherwise recurring reporting. The Levenshtein algorithm can generate a rough but automated keyword clustering up front, so that even for keyword lists with three to five digits of entries an acceptable keyword sorting has already taken place. Doing this by hand would be unthinkable. In practice, the code snippets below generate 80-90 clusters for about 800 keywords. The precision of this algorithm of course also depends significantly on the shape of the keywords at hand.
First of all, we need two queries against the Google Search Analytics API to produce two lists (lost and gained keywords). For this we need two additional time bounds: for the given start and end dates, a second query is prepared that covers the period two weeks earlier. Comparing the two result sets yields the lost and gained keywords.
start = datetime.strptime(flags.start_date, '%Y-%m-%d') - timedelta(days=14)
end = datetime.strptime(flags.end_date, '%Y-%m-%d') - timedelta(days=14)
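The window shift above can be isolated into a small helper. This is a sketch under the assumption of `%Y-%m-%d` date strings; the function name `shifted_window` is hypothetical and not part of the reporting script:

```python
from datetime import datetime, timedelta

def shifted_window(start_date: str, end_date: str, offset_days: int = 14):
    # Shift both bounds back by the same offset to get the comparison window.
    fmt = '%Y-%m-%d'
    start = datetime.strptime(start_date, fmt) - timedelta(days=offset_days)
    end = datetime.strptime(end_date, fmt) - timedelta(days=offset_days)
    return start.strftime(fmt), end.strftime(fmt)
```

Using the same offset for both bounds keeps the two windows equally long, so the keyword sets are comparable.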
From the two lists, the set difference is formed in each direction:
newKeywords = list(set(new_keywords) - set(old_keywords))
lostKeywords = list(set(old_keywords) - set(new_keywords))
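To make the two directions concrete, here is a minimal illustration with made-up keyword lists (the sample data is hypothetical):

```python
# Keywords returned for the current window and the window two weeks earlier.
old_keywords = ['seo audit', 'keyword clustering', 'levenshtein']
new_keywords = ['seo audit', 'affinity propagation', 'levenshtein']

# Present now but not before -> gained; present before but not now -> lost.
gained = sorted(set(new_keywords) - set(old_keywords))
lost = sorted(set(old_keywords) - set(new_keywords))
```

Note that `set()` discards duplicates and ordering, which is why the script sorts the words again before clustering.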
The queries differ only in the request parameters startDate and endDate:
QUERY_REQUEST = {
    'startDate': flags.start_date,
    'endDate': flags.end_date,
    'dimensions': ['query'],
    'searchType': 'web',
    'dimensionFilterGroups': [{
        'filters': [{
            'dimension': 'country',
            'expression': 'DEU'
        }]
    }],
    'rowLimit': 5000
}

QUERY_PREWEEK_REQUEST = {
    'startDate': start.strftime("%Y-%m-%d"),
    'endDate': end.strftime("%Y-%m-%d"),
    'dimensions': ['query'],
    'searchType': 'web',
    'dimensionFilterGroups': [{
        'filters': [{
            'dimension': 'country',
            'expression': 'DEU'
        }]
    }],
    'rowLimit': 5000
}
The function levenshtein is then called on the two lists, and the values are stored in separate CSV files. The parameter y is optional; its reason for existence only makes sense in the context of the entire SEO reporting script. If only the clustering is needed, the function can be simplified by removing y and the if-conditions.
# Cluster algorithm: Levenshtein distances + Affinity Propagation
import csv

import numpy as np
import sklearn.cluster
import distance

def levenshtein(x, y='won'):
    words = np.asarray(sorted(x))
    # Negative, scaled distances serve as the precomputed similarity matrix.
    lev_similarity = -5 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])
    affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
    affprop.fit(lev_similarity)
    for cluster_id in np.unique(affprop.labels_):
        exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
        cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
        cluster_str = ", ".join(cluster)
        neues_cluster = "%s: %s" % (exemplar, cluster_str)
        if y == 'won':
            with open('new-keywords-' + start.strftime('%Y-%m-%d') + '-' + end.strftime('%Y-%m-%d') + '.csv', 'a', newline='') as csvfile:
                der_writer = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                der_writer.writerow([neues_cluster, ","])
        if y == 'lost':
            with open('lost-keywords-' + start.strftime('%Y-%m-%d') + '-' + end.strftime('%Y-%m-%d') + '.csv', 'a', newline='') as csvfile:
                der_writer = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                der_writer.writerow([neues_cluster, ","])
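The script above relies on the `distance` package for the edit distance itself. To make that building block concrete, here is a minimal pure-Python Levenshtein implementation (a sketch for illustration; the function name `levenshtein_distance` is not part of the reporting script):

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic programming with a single rolling row.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

Because Affinity Propagation expects similarities rather than distances, the script negates (and scales) this value before fitting.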
You can experiment here. The weighting factor (here 5) produces plausible results. Beyond keyword clustering, applications such as cleaning up misspellings, brand filters, and similar methods are conceivable.
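As a quick sketch of the misspelling-cleanup idea, the standard library's difflib offers a comparable fuzzy match without any clustering step (the sample queries and the `corrections` mapping are hypothetical):

```python
from difflib import get_close_matches

# Raw search queries, some of them misspelled.
queries = ['levenstein', 'lewenshtein', 'affinity propagation']
# Canonical spellings to map them onto.
canonical = ['levenshtein', 'affinity propagation']

# For each query, take the single closest canonical term above the cutoff.
corrections = {q: get_close_matches(q, canonical, n=1, cutoff=0.8) for q in queries}
```

The cutoff controls how aggressive the matching is; a high value like 0.8 keeps only near-identical spellings together, much like a strong weighting does in the clustering above.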
