malerAI: Scraping Wikipedia To Build Highly-Dynamic Niche Website (Co-Product, Part 1)
03/01/2023

To develop the test environment called “malerAI”, a web server application, several MySQL databases and the Python library Wikipedia-API were first installed on the local machine to build a text corpus. The corpus initially contains the complete raw text of each article. Since the first prototype should cover a certain heterogeneity of content, we decided on the category of artists who were represented at the documenta; this listing can be taken from the German-language version of Wikipedia.
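As a rough illustration of this corpus-building step, the following sketch fetches a single page with Wikipedia-API and stores its raw text in a MySQL table. The connection credentials, the table layout and the use of pymysql are assumptions for the example, not the actual malerAI setup:
import wikipediaapi
import pymysql

# German-language Wikipedia, plain-text extracts
wiki = wikipediaapi.Wikipedia(language='de')

# hypothetical connection details and corpus table
conn = pymysql.connect(host='localhost', user='maler', password='secret', database='malerai')
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS corpus (
            title    VARCHAR(255) PRIMARY KEY,
            raw_text MEDIUMTEXT
        )
    """)
    page = wiki.page('Joseph Beuys')  # any category member works here
    cur.execute("REPLACE INTO corpus (title, raw_text) VALUES (%s, %s)", (page.title, page.text))
conn.commit()
conn.close()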

Extracting Information
! Bug Warning: The regular expressions, and thus the code, are not yet error-free: the content of the last section of each Wikipedia page is never transferred. This is because we extract with a regular expression that always interprets the beginning of the next h2 tag as the end of a section, and the last section is not followed by another h2 tag.
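One possible fix (not yet applied in the code below) is to let the lookahead in the expression also accept the end of the extract, so that the last section is terminated correctly. A minimal sketch, with escaped_title and page_html as placeholders for the cleaned headline and the page’s HTML extract:
import re

def extract_section(escaped_title, page_html):
    # stop at the next <h2> or, for the last section, at the end of the extract
    match = re.search(escaped_title + r'</h2>((.|\n)*?(?=<h2>|\Z))', page_html)
    return match[1] if match else None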
The following libraries are required:
import wikipediaapi
from bs4 import BeautifulSoup
import re
from collections import Counter
import xlsxwriter
You can save the data into a database or into an Excel workbook. Let’s use a simple workbook:
# create workbook
workbook = xlsxwriter.Workbook('extractor_results.xlsx')
# increasing the size of the cell
my_format = workbook.add_format({'text_wrap': True})
We need the following functions:
def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
It removes the HTML tags from the paragraphs (you can keep them, if you need them).
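A quick sanity check with a made-up snippet of extract HTML:
sample = '<p>Joseph Beuys war ein <i>deutscher</i> Künstler.</p>'
print(remove_html_tags(sample))  # Joseph Beuys war ein deutscher Künstler.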
def regexCleaner(x, y):
    # escape characters in the headline that have a special meaning inside a regular expression
    # (the second argument, the tag object, is currently unused)
    if "(" in x:
        x = x.replace("(", "[(]")
        x = x.replace(")", "[)]")
    if "</i>" in x:
        x = x.replace("</i>", "<\\/i>")
    if "</strong>" in x:
        x = x.replace("</strong>", "<\\/strong>")
    # ....
    return x
It’s important to have this function: some headlines are formatted in unusual ways and have to be converted before they can be used inside a regular expression. The “# ....” indicates that there are more cases, some of which even have to be hard-coded.
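If you only need the headline escaped for the regular expression itself, Python’s built-in re.escape is a simpler alternative; the hard-coded replacements above are mainly needed for the extra HTML cases:
import re

headline = 'Ausstellungen (Auswahl)'
pattern = re.escape(headline)  # escapes the parentheses and any other regex metacharacters
print(re.search(pattern + r'</h2>', '<h2>Ausstellungen (Auswahl)</h2><p>...</p>') is not None)  # True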
The final code:
wiki_html = wikipediaapi.Wikipedia(language='INSERT_LANGUAGE', extract_format=wikipediaapi.ExtractFormat.HTML)
cat = wiki_html.page('INSERT_CATEGORY')
print("Number Of Entries: %s" % len(cat.categorymembers.keys()))
headlines = []
m = 1
for key in cat.categorymembers.keys():
    temp_headlines = []
    temp_content = []
    wiki_name = key
    page_py = wiki_html.page(wiki_name)
    # the first <p> of the HTML extract serves as the abstract
    abstract = remove_html_tags(re.findall('<p>(.*)', str(page_py.text))[0])
    # Collect paragraphs
    Soup = BeautifulSoup(page_py.text, 'lxml')
    heading_tags = ["h2"]
    number_of_sections = len(Soup.find_all(heading_tags))
    s = 1
    for tags in Soup.find_all(heading_tags):
        # BUG!: the regex below doesn't work for the last element (see the warning above)
        if s == number_of_sections:
            break
        else:
            headlines.append(tags.text.strip())
            temp_headlines.append(tags.text.strip())
            print("- " + tags.text.strip())
            # Get the paragraph content between this <h2> and the next one
            try:
                raw_content = re.search(regexCleaner(tags.text.strip(), tags) + r'</h2>((.|\n)*?(?=<h2>))', str(page_py.text))[1]
                temp_content.append(remove_html_tags(raw_content))
            except Exception:
                temp_content.append("regex error")
        s += 1
    # Excel export: one worksheet per entry
    worksheet = workbook.add_worksheet(str(m))
    worksheet.write('A1', 'Title')
    worksheet.write('B1', 'Abstract')
    # helper (not shown here) that writes temp_headlines as the remaining column headers
    createExcelColumnNames(temp_headlines)
    worksheet.write(1, 0, wiki_name, my_format)
    worksheet.write(1, 1, abstract, my_format)
    i = 2
    for c in temp_content:
        worksheet.write(1, i, c, my_format)
        i += 1
    print("(" + str(m) + "/" + str(len(cat.categorymembers.keys())) + ") done")
    print("-----------------------------------------------------")
    m += 1
workbook.close()
print("------------------- SUCCESS -------------------------")
print("Final Evaluation")
print(Counter(headlines))
print("-----------------------------------------------------")
Wiki Data Analysis and Database Consistency
The problem with parsing the subpages is the lack of consistency in the heading structure, so exporting the correct text content was not possible without an additional classification algorithm. For example, biographical information may be found under the headings “Life”, “Life and Work” and “Biography”, where we consider all three as a general text block of biographical information.
We first computed the actual distribution of the headings. For this, the BeautifulSoup Python library was used to extract the heading texts of each entry. The distribution of the top 20 headings, translated from German into English, is given in absolute and relative numbers in Table 1.
Headline Text | Absolute Frequency | Relative Frequency (share of pages) |
“Web links” | 2374 | 0.9137 |
“References” | 2105 | 0.8102 |
“Literature” | 1340 | 0.5157 |
“Life and Work” | 1120 | 0.4311 |
“Life” | 968 | 0.3725 |
“Exhibitions (Selection)” | 528 | 0.2032 |
“Exhibitions” | 446 | 0.1716 |
“Awards” | 425 | 0.1635 |
“Work” | 335 | 0.1289 |
“Literature & Sources” | 220 | 0.0846 |
“Works” | 172 | 0.0662 |
“Works (Selection)” | 147 | 0.0565 |
“Awards (Selection)” | 125 | 0.0481 |
“Solo exhibitions (Selection)” | 75 | 0.0288 |
“Filmography” | 72 | 0.0277 |
“Reception” | 70 | 0.0269 |
“Honours” | 67 | 0.0257 |
“Publications” | 65 | 0.0250 |
“Movie” | 60 | 0.0230 |
“Writings” | 59 | 0.0227 |
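The relative frequency is the absolute count divided by the number of scraped pages (2,598, cf. Table 2), i.e. the share of pages that contain the heading. A short sketch of how such a table can be produced, assuming the headlines list and the cat object from the extractor script above are still in memory:
from collections import Counter

headline_counter = Counter(headlines)
total_pages = len(cat.categorymembers.keys())

# 20 most common headings with absolute and relative frequency
for headline, count in headline_counter.most_common(20):
    print("%s | %d | %.4f" % (headline, count, count / total_pages))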
The heterogeneous use of the headings prompted us to start the development with a manual classification, since the fairly similar use of terms suggested that a precise classification was possible. The chosen categories and their coverage then serve as a means to evaluate the quality of this classification. It should be mentioned that categories such as “Weblinks”, “Literature”, “See also” or “References” were removed, as they serve no purpose for the development of the test environment; their contents were still exported for possible later processing. We measured the proportion of headings that could not be assigned to any of our categories by computing the precision of the retrieved paragraphs: precision = classified / (classified + non-classified).
We chose the first 100 artist pages to build a naïve substring-matching classification algorithm and a blacklist to sort out content that is not relevant for us, e.g. web links, sources or publications. We are primarily interested in biographical and work-related information. The classifier is a simple case-sensitive string-in-string approach that assigns every scraped headline to one of four possible segments: “Biography”, “Exhibitions”, “Awards” and “Works”; a sketch of this approach is shown below.
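A minimal sketch of such a substring-matching classifier; the keyword lists and the blacklist entries below are illustrative German examples, not the exact lists used in malerAI:
# illustrative keyword lists (German headings); the real lists are more extensive
SEGMENTS = {
    'Biography': ['Leben', 'Biografie', 'Biographie'],
    'Exhibitions': ['Ausstellung', 'Einzelausstellung'],
    'Awards': ['Auszeichnung', 'Preis', 'Ehrung'],
    'Works': ['Werk', 'Arbeiten'],
}
BLACKLIST = ['Weblinks', 'Literatur', 'Einzelnachweise', 'Siehe auch', 'Publikationen']

def classify_headline(headline):
    """Return a segment name, 'blacklisted', or None for non-classified headlines."""
    if any(entry in headline for entry in BLACKLIST):
        return 'blacklisted'
    for segment, keywords in SEGMENTS.items():
        if any(keyword in headline for keyword in keywords):  # case-sensitive substring match
            return segment
    return None

print(classify_headline('Leben und Werk'))           # Biography (first matching segment wins)
print(classify_headline('Ausstellungen (Auswahl)'))  # Exhibitions
print(classify_headline('Weblinks'))                 # blacklisted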
The results of the manual classification are shown in Table 2.
n | Biography | Exhibition(s) | Awards | Works | Classified | Non-classified | Precision |
50 | 45 | 26 | 20 | 19 | 110 | 22 | 83.333% |
100 | 88 | 49 | 36 | 41 | 214 | 45 | 82.625% |
500 | 451 | 241 | 150 | 200 | 1042 | 243 | 81.089% |
1000 | 914 | 504 | 306 | 414 | 2138 | 477 | 81.759% |
2598 | 2388 | 1333 | 862 | 1073 | 5656 | 1383 | 80.352% |
The manual classifier works with solid precision on this dataset and is therefore sufficient for our needs, although affinity propagation could be useful for clustering the headings more efficiently. With the simple string-in-string approach we reach a precision of 80.35% on the full dataset (5,656 of 7,039 headings classified, see Table 2).
Extra: Affinity Propagation with Normed Levenshtein Distance
Alternatively, you can try to cluster all the headlines and their content by using Affinity Propagation with a normed Levenshtein distance, for example. Here is a quick script that can be adjusted with regard to the damping value and the similarity weighting. The example input looks like this:
{'Weblinks': 1818, 'Literatur': 1343, 'Leben und Werk': 1115, 'Leben': 973, 'Ausstellungen (Auswahl)': 536, 'Ausstellungen': 446, 'Auszeichnungen': 429, 'Werk': 351, 'Literatur und Quellen': 218, 'Werke': 170, 'Einzelnachweise': 152, 'Werke (Auswahl)': 147, 'Auszeichnungen (Auswahl)': 128, 'Einzelausstellungen (Auswahl)': 76, 'Rezeption': 72, 'Filmografie': 71, 'Ehrungen': 67, ...., 'Literaturauswahl': 1}
And have a try with this code:
from ast import literal_eval
from sklearn.cluster import AffinityPropagation
import distance
import numpy as np

# load categories (dict)
with open('dict-categories-demo.txt', "r") as r:
    dict_categories_as_string = r.read()
dict_categories = literal_eval(dict_categories_as_string)
categories = np.asarray(sorted(dict_categories.keys()))

def normedLevenshtein(x, y):
    # Levenshtein distance scaled by the longer string and turned into a similarity in [0, 1]
    levD = distance.levenshtein(x, y)
    return 1 - (levD / max(len(x), len(y)))

lev_similarity = np.array([[normedLevenshtein(w1, w2) for w1 in categories] for w2 in categories])
affprop = AffinityPropagation(affinity="precomputed", damping=0.9)
affprop.fit(lev_similarity)
cluster_centers_indices = affprop.cluster_centers_indices_
labels = affprop.labels_
n_clusters_ = len(cluster_centers_indices)

for cluster_id in np.unique(affprop.labels_):
    exemplar = categories[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(categories[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print("%s: %s" % (exemplar, cluster_str))
