Topic modeling

Topic modeling is an unsupervised approach for identifying topics in a corpus. Identifying topics among the comments helps to answer which topics are dominant in the comments section and whether they are dominated by right-wing comments. To identify the topics, Latent Dirichlet Allocation (LDA) is used. LDA is the most widely used model for topic modeling and learns the topic-word mappings from the corpus over several iterations [Atteveldt, 2022].

First, the libraries for preprocessing the dataset are imported. Stopwords are removed using the stopword list from the Natural Language Toolkit (NLTK), and the spaCy library is used for lemmatization. Lemmatization is a common preprocessing step for topic modeling because it has been shown to lead to better results [May et al., 2019].

from IPython.core.display_functions import display
import pandas as pd
from cleantext import clean
import pickle
from pathlib import Path
import string
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore') # Suppress warnings to keep the output readable

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

import spacy
# The model has to be installed via the following command: $(env) python -m spacy download de_core_news_md
# The docker image already contains the model
spacy_model_german = spacy.load("de_core_news_md", disable=["parser", "ner"])

To preprocess the text, we first create a stopword list and a punctuation list. Then we define functions for lemmatizing the text and for removing punctuation marks. Both functions take the text as input: remove_punctuation returns the cleaned text, and get_lemmatized_text returns a list of lemmas. The function tokenize_and_lemmatize_text performs the complete preprocessing. First, line breaks and emojis are removed; the cleantext library is used for the emojis. Then the two functions for removing punctuation and for lemmatization are applied. Since spaCy returns a double dash -- as a token for punctuation marks or unknown characters, these tokens are removed to ensure that only meaningful tokens are included. Then the stopwords are removed. At the end, whitespace-only and single-character tokens are removed; such tokens can remain, for example, if the comment consists only of emojis.

stop_words = set(stopwords.words('german'))
regular_punctuation = list(string.punctuation)

def get_lemmatized_text(text: str):
    lemmatized_text = []
    document = spacy_model_german(text)
    for word in document:
        lemmatized_text.append(word.lemma_)
    return lemmatized_text

def remove_punctuation(text):
    for punc in regular_punctuation:
        if punc in text:
            text = text.replace(punc, ' ')
    return text.strip()

def tokenize_and_lemmatize_text(text: str):
    text = clean(text, no_emoji=True, lang="de", no_urls=True)
    text = remove_punctuation(text)
    tokens = get_lemmatized_text(text)
    tokens = [token for token in tokens if token != "--"]
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [token for token in tokens if not token.isspace()] # remove spaces
    tokens = [token for token in tokens if len(token) > 1] # remove single characters
    return tokens
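
The filtering stage at the end of tokenize_and_lemmatize_text can be tried out in isolation. The following is a minimal sketch that skips spaCy and cleantext entirely: filter_tokens is a hypothetical helper, and the tiny stopword set is a stand-in for the NLTK list, both chosen only for illustration.

```python
import string

# Hypothetical stand-in for the NLTK German stopword list (illustration only)
demo_stop_words = {"der", "die", "das", "und", "ist"}


def remove_punctuation(text: str) -> str:
    """Replace every ASCII punctuation mark with a space."""
    table = str.maketrans({punc: " " for punc in string.punctuation})
    return text.translate(table).strip()


def filter_tokens(tokens: list[str], stop_words: set[str]) -> list[str]:
    """Apply the same token filters as the preprocessing function, in one pass."""
    return [
        token
        for token in tokens
        if token != "--"               # placeholder token produced by spaCy
        and token not in stop_words    # stopwords
        and not token.isspace()        # whitespace-only tokens
        and len(token) > 1             # single characters and empty strings
    ]


# Lowercase first so that the lowercase stopword set matches
text = remove_punctuation("Der Panzer ist groß, und das ist gut!".lower())
tokens = filter_tokens(text.split(" "), demo_stop_words)
print(tokens)  # ['panzer', 'groß', 'gut']
```

The empty string that results from splitting on the double space left behind by the comma is dropped by the length filter, which is the same effect the last filter has in the real pipeline.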

To preprocess all comments, they are first loaded from the CSV file and converted to strings. This step is necessary because otherwise the preprocessing later fails with an error stating that a comment is not a string. This problem is probably caused by special characters being interpreted incorrectly when the comments are saved or read in. Only comments with more than one remaining token are kept. The result is a list that contains one list of tokens per comment, as the output shows.

comments = pd.read_csv("data/youtube_comments_500.csv")
comments["Comments"] = comments["Comments"].astype(str)
preprocessed_comments = []
for index, row in tqdm(comments.iterrows()):
    preprocessed_text: list = tokenize_and_lemmatize_text(row["Comments"])
    if len(preprocessed_text) > 1:
        preprocessed_comments.append(preprocessed_text)

406242it [24:50, 272.51it/s]

preprocessed_comments[0:2]

[['Tag',
  'groß',
  'Bericht',
  'immer',
  'Panzer',
  'liefern',
  'ganz',
  'schön',
  'sinnlos',
  'Vermittlung',
  'Neuigkeit'],
 ['scholz',
  'gut',
  'Weiss',
  'wieso',
  'brauchen',
  'Verteidigungsminister',
  'Stelle',
  'sparen']]

The preprocessing is quite time-consuming (about 25 minutes for the 406,242 comments, as the progress bar above shows), which is why the processed comments are saved as a pickle file. Pickle makes it possible to save the nested list directly and load it again unchanged. Saving the comments as a CSV file makes little sense here, as the token lists of the comments have different lengths.

file = Path("data/youtube_comments_500_preprocessed.pkl")
if not file.exists():
    with file.open("wb") as fw:
        pickle.dump(preprocessed_comments, fw)

with file.open("rb") as fr:
    preprocessed_comments = pickle.load(fr)
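
As a small illustration of why pickle suits this data, the sketch below round-trips a ragged list of token lists through a temporary file. The sample tokens and the file name are made up for the example, and only the standard library is used.

```python
import pickle
import tempfile
from pathlib import Path

# Two token lists of different lengths, as produced by the preprocessing
sample_comments = [["Tag", "groß", "Bericht"], ["scholz", "gut"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "sample_preprocessed.pkl"  # hypothetical file name

    with pkl_path.open("wb") as fw:
        pickle.dump(sample_comments, fw)

    with pkl_path.open("rb") as fr:
        restored_comments = pickle.load(fr)

# The nested structure survives unchanged, including the differing lengths
print(restored_comments == sample_comments)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.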

To implement the LDA model, the gensim library is used. For the visualisation of the topics and results at the end, the pyLDAvis library is used. This library allows interactive exploration of the results, as we will see later.

For the LDA model, a dictionary must first be created that maps each word to an integer ID. This allows us to represent the text corpus, i.e. the comments, in bag-of-words format. Very frequent and very infrequent words are removed from the dictionary to improve the model: in the code below, words that occur in fewer than 20 comments or in more than 80% of the comments are dropped. Very common words carry little meaning and are therefore less likely to be associated with specific topics. Infrequent words, on the other hand, could belong to a topic, but because the LDA model learns only a few topics, such a topic is unlikely to be found.
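
To make the dictionary and the bag-of-words format more concrete, here is a minimal pure-Python sketch of the idea. It is not gensim's implementation: build_dictionary and to_bow are hypothetical helpers, and the thresholds are scaled down to fit the toy corpus.

```python
from collections import Counter


def build_dictionary(docs, no_below=2, no_above=0.8):
    """Map each kept word to an integer ID, filtering by document frequency."""
    # Count in how many documents each word occurs (not how often overall)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(w for w, df in doc_freq.items() if no_below <= df <= max_docs)
    return {word: idx for idx, word in enumerate(kept)}


def to_bow(doc, word2id):
    """Represent a document as sorted (word ID, count) pairs."""
    counts = Counter(word2id[w] for w in doc if w in word2id)
    return sorted(counts.items())


docs = [
    ["krieg", "panzer", "ukraine", "ja"],
    ["krieg", "ukraine", "ukraine", "ja"],
    ["regierung", "wählen", "ja"],
    ["regierung", "partei", "ja"],
]
word2id = build_dictionary(docs)
print(word2id)                              # {'krieg': 0, 'regierung': 1, 'ukraine': 2}
print([to_bow(doc, word2id) for doc in docs])
```

In this toy corpus "ja" is dropped because it occurs in every document, while "panzer", "wählen" and "partei" are dropped because each occurs in only one document.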

The hyperparameters alpha and beta (called eta in gensim) of the LDA model have to be chosen carefully, as they strongly influence the model performance. Fortunately, gensim's LDA model can learn both priors from the data when they are set to "auto". Another important hyperparameter is the number of topics, which has to be specified upfront. The choice of this hyperparameter is often based on domain knowledge, and there is no good theoretical solution for this problem [Atteveldt, 2022]. One possible approach is to systematically increase or decrease the number of topics. However, this would require a lot of computing resources, and it is difficult to decide whether one distribution of topics is better than another. It was therefore decided to try out the following three values for the number of topics: 6, 10 and 20.
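
The influence of alpha can be illustrated without training a model. In LDA, alpha is the parameter of the Dirichlet prior over each document's topic mixture. The sketch below uses only the standard library and draws topic mixtures from a symmetric Dirichlet prior for a small and a large alpha. The helper sample_dirichlet and the chosen values are assumptions made for illustration and are unrelated to the priors gensim learns with "auto".

```python
import random


def sample_dirichlet(alpha: float, num_topics: int, rng: random.Random) -> list[float]:
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha: float, num_topics: int = 5, n: int = 2000) -> float:
    """Average share of the dominant topic over n sampled documents."""
    rng = random.Random(42)  # fixed seed so the run is repeatable
    return sum(max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(n)) / n


sparse = mean_largest_share(alpha=0.1)   # small alpha: one topic tends to dominate
dense = mean_largest_share(alpha=10.0)   # large alpha: topics are mixed more evenly
print(f"alpha=0.1  -> mean dominant topic share: {sparse:.2f}")
print(f"alpha=10.0 -> mean dominant topic share: {dense:.2f}")
```

With a small alpha most sampled documents are dominated by a single topic, whereas a large alpha spreads each document across all topics, which is one reason the choice of prior affects the topics a model finds.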

To compare the models, two metrics are used. The perplexity measures how well the model predicts the words in the corpus. The coherence measures how semantically related the top words within each topic are. However, the best model according to these metrics is not always the most interpretable model from a human perspective [Atteveldt, 2022].

import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel

dictionary = corpora.Dictionary(preprocessed_comments)
dictionary.filter_extremes(no_below=20, no_above=0.8)
corpus = [dictionary.doc2bow(text) for text in preprocessed_comments]

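# Train and save one model per candidate number of topics (6, 10 and 20)
Path("lda_models").mkdir(exist_ok=True)
for num_topics in (6, 10, 20):
    lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15, alpha="auto", eta="auto")
    lda_model.save(f'lda_models/lda_model_{num_topics}.gensim')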

The scores of both evaluation metrics decrease when more topics are added [Atteveldt, 2022]. This behavior can also be seen in the models trained here. Note that gensim's log_perplexity returns a per-word likelihood bound rather than the perplexity itself, which is why the values reported below are negative. In theory, one looks for the inflection point at which the values of the two metrics fall at a much slower rate [Atteveldt, 2022]. However, since only three models were trained here, such an approach is difficult to carry out, especially since both metrics continue to decrease at a similar rate.

# Load the three trained models, keyed by their number of topics
lda_models = {k: gensim.models.ldamodel.LdaModel.load(f"lda_models/lda_model_{k}.gensim") for k in (6, 10, 20)}

# Compute Perplexity
for k, model in lda_models.items():
    print(f'LDA_{k} Perplexity: ', model.log_perplexity(corpus))

# Compute Coherence Score
for k, model in lda_models.items():
    coherence_model = CoherenceModel(model=model, texts=preprocessed_comments, dictionary=dictionary, coherence='c_v')
    print(f'LDA_{k} Coherence Score: ', coherence_model.get_coherence())

LDA_6 Perplexity:  -8.334388322182123
LDA_10 Perplexity:  -9.166142281809835
LDA_20 Perplexity:  -13.330936599649911
LDA_6 Coherence Score:  0.6025754127903082
LDA_10 Coherence Score:  0.5074582503859234
LDA_20 Coherence Score:  0.4254128399081809
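
gensim's log_perplexity returns a per-word likelihood bound, and the library logs the corresponding perplexity estimate as 2 raised to the negative bound. Assuming the "Perplexity" values printed above are these bounds, they can be converted as in the following sketch. The bounds are copied (rounded) from the output above, and the variable names are chosen only for this example.

```python
# Per-word likelihood bounds copied from the output above (rounded)
bounds = {6: -8.3344, 10: -9.1661, 20: -13.3309}

# gensim's perplexity estimate is 2 ** (-bound)
perplexities = {k: 2 ** (-bound) for k, bound in bounds.items()}

for k, perplexity in perplexities.items():
    print(f"LDA_{k}: bound = {bounds[k]:.4f}, perplexity estimate = {perplexity:,.1f}")
```

A more negative bound corresponds to a larger perplexity estimate, so on this scale the 6-topic model has the lowest perplexity of the three. The values are computed on the training corpus, not on held-out comments.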

If we look at the topics that the different models have found, we can identify some clear topics, but there are also others that make little sense. Overall, the topics in the 6-topic model are too broad, and it is difficult to find clear topics there. In the models with 10 and 20 topics, on the other hand, some clear topics are easy to find. However, it is difficult to say which of the two models is better.

lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_6.gensim")
display(pd.DataFrame({f"Topic {n}":[word for (word,word_weight) in words] for (n, words) in lda_model.show_topics(formatted=False, num_words=15)}))

	Topic 0	Topic 1	Topic 2	Topic 3	Topic 4	Topic 5
0	Jahr	ja	Regierung	mehr	innen	Nachricht
1	Deutschland	gut	Frau	sollen	Sektendepp	The
2	Russland	schon	grün	geben	Coronaleugner	Reichsbürger
3	Krieg	mal	endlich	Mensch	hetzen	Reinhard
4	USA	wer	Politiker	Deutschland	Behauptung	Youtube
5	Ukraine	immer	Medium	Land	Lüge	Antwort
6	seit	gehen	wählen	warum	Putinanhimmler	Weihnachten
7	Putin	kommen	Volk	müssen	Lara	oh
8	Waffe	deutsch	Demokratie	tun	Beweis	Mrscrewy
9	russisch	ganz	eur	vieler	Croft	de
10	EU	sagen	Herr	Leute	Name	jährig
11	10	sehen	Partei	Kind	Michael	Müller
12	Europa	wissen	wann	Problem	beweisen	dürsch
13	ukrain	einfach	Berlin	dürfen	Jinping	Henning
14	Panzer	heute	EU	dafür	drohen	Kreml

lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_10.gensim")
display(pd.DataFrame({f"Topic {n}":[word for (word,word_weight) in words] for (n, words) in lda_model.show_topics(formatted=False, num_topics=10, num_words=15)}))

	Topic 0	Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9
0	ja	grün	Russland	Polizei	10	schön	dumm	innen	12	sollen
1	mehr	öffentlich	Krieg	Euro	Million	afd	erster	Sektendepp	14	wer
2	gut	ard	USA	Absonderung	000	Kommentar	Coronaleugner	hetzen	Mädchen	sagen
3	Deutschland	Wahrheit	Putin	Mathias	rd	klein	schreiben	Lüge	11	sehen
4	schon	links	Ukraine	Korruption	2022	halt	na	fordern	xi	einfach
5	mal	Aussage	EU	Milliarde	20	wählen	Behauptung	selber	Hans	warum
6	geben	The	Waffe	verkaufen	sterben	danken	nix	bleiben	jährig	wissen
7	Jahr	Ausländer	endlich	europäisch	Corona	echt	Name	Putinanhimmler	Müller	tun
8	deutsch	Angst	russisch	Wiese	100	lange	Nachricht	Beweis	00	Leute
9	immer	etc	Europa	Pergon	ca	Partei	beweisen	Jinping	40	Kind
10	gehen	Rex	ukrain	Hetzer	Swongeböte	Demokratie	darauf	gesinnungsbraun	24	finden
11	kommen	rot	China	kriminell	tot	Berlin	vielleicht	Klimawandel	Blödsinn	Frau
12	Mensch	traurig	Panzer	öl	kämpfen	stimmen	Propaganda	behaupten	geh	denken
13	ganz	Wilhelm	Volk	Inflation	pro	lesen	offensichtlich	immer	35	Problem
14	Land	Imperator	Scholz	steigen	Freiheit	grüne	drohen	Papst	Fckafd	lassen

lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_20.gensim")
display(pd.DataFrame({f"Topic {n}":[word for (word,word_weight) in words] for (n, words) in lda_model.show_topics(formatted=False, num_topics=20, num_words=15)}))

	Topic 0	Topic 1	Topic 2	Topic 3	Topic 4	Topic 5	Topic 6	Topic 7	Topic 8	Topic 9	Topic 10	Topic 11	Topic 12	Topic 13	Topic 14	Topic 15	Topic 16	Topic 17	Topic 18	Topic 19
0	grün	Jahr	Tagesschau	schreiben	Polizei	ja	sehen	EU	deutsch	stimmen	neu	The	Deutschland	jemand	Russland	Wort	innen	Mensch	Frau	nie
1	na	seit	erster	Name	öffentlich	mehr	echt	Politik	berichten	zurück	schön	darauf	Regierung	verlieren	Krieg	co2	Coronaleugner	Land	Kind	Thema
2	grüne	letzter	eur	nennen	verbieten	gut	falsch	Politiker	Volk	teuer	endlich	Antwort	afd	ne	USA	Frankreich	hetzen	Welt	bitte	voll
3	ard	10	Berlin	Michael	Klimawandel	schon	beide	danken	Recht	12	dumm	tatsächlich	Staat	trotzdem	Putin	Flüchtling	fordern	Geld	Kommentar	absolut
4	linker	nächster	Reinhard	dank	Klima	mal	lieb	Russe	darüber	Preis	Scholz	oft	weit	typisch	Ukraine	weltweit	Behauptung	brauchen	nehmen	Wahrheit
5	SPD	Million	sowas	kannst	schützen	sollen	Weg	vergessen	Schuld	14	wann	etwa	zeigen	Mathias	Waffe	bekannt	Lüge	stehen	alt	and
6	links	zwei	kaufen	Gott	Fakt	geben	suchen	krank	Bild	30	ach	Nazi	Demokratie	Rente	russisch	wahrscheinlich	selber	ab	Mann	handeln
7	sofort	000	spielen	mögen	fahren	immer	Merkel	raus	lächerlich	11	wünschen	drohen	Medium	kriegen	Europa	Youtube	Beweis	eigen	Frage	Bundestag
8	offensichtlich	Milliarde	wahr	einsam	Polizist	wer	Hand	Ampel	freuen	vorbei	sitzen	toll	wählen	xi	ukrain	europäisch	Nachricht	wegen	hören	25
9	Angst	2022	oh	Iran	Gewalt	gehen	Seite	wieso	willst	treffen	Propaganda	Thomas	Meinung	Wiese	China	Gedanke	beweisen	leben	Herr	super
10	Korruption	20	Fußball	treiben	mussn	kommen	Lösung	dr	hoffentlich	schaden	Glück	warten	Bürger	Präsident	Panzer	Generation	bleiben	gehören	lesen	zerstören
11	Wahl	Monat	Freiheit	Inflation	fliegen	ganz	Familie	Geschichte	leisten	extrem	Bundeswehr	zweiter	schnell	schuldig	nein	Rakete	Aussage	davon	Gesellschaft	verfolgen
12	erkennen	Euro	Mal	Covid	bedeuten	sagen	bestimmt	demokratisch	erwarten	00	fehlen	Gegenteil	egal	Beitrag	ukraine	Gericht	behaupten	leider	jung	angst
13	abschaffen	Habeck	etc	Regime	Böller	einfach	verkaufen	Hilfe	entscheiden	Luft	Müller	Bericht	Partei	Lügner	liefern	Alexander	wm	Leben	lernen	Argument
14	blöd	50	rechter	Merz	rechtlich	warum	Schnettka	Hitler	danach	Sanktion	beenden	40	politisch	drauf	Frieden	Nacht	Beleidigung	schaffen	Schule	Afghanistan

So far, only the top words for each topic have been considered, without taking the word weights into account. A word cloud can include this weighting and thus gives a better understanding of a topic. To keep the presentation clear, only the model with 10 topics is visualised using word clouds. This is also the model that performs best in the evaluation with the pyLDAvis library in the next section. Looking at the word clouds shown below, the following topics could be identified:

Topic 0: This topic could be about Germany and domestic issues.
Topic 2: This topic is obviously about the Ukraine war.
Topic 5: This topic seems to be related to the “AFD”. The terms suggest that it could be partly about calls to vote for the “AFD” or a statement that the “AFD” has been voted for in the past.
Topic 6: This topic is about covid and covid deniers. However, it cannot be clearly stated whether these are more comments by covid deniers or rather comments by people who are upset about the covid deniers and criticise them.
Topic 7: This topic could be about comments where others complain about the lateral thinking movement.

from matplotlib import pyplot as plt
from wordcloud import WordCloud
import matplotlib.colors as mcolors

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]

# color_func reads the loop variable i at draw time, so each topic is drawn in its own color
cloud = WordCloud(background_color='white', width=2500, height=1800, max_words=10, colormap='tab10', color_func=lambda *args, **kwargs: cols[i], prefer_horizontal=1.0)

lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_10.gensim")
topics = lda_model.show_topics(formatted=False)

fig, axes = plt.subplots(5, 2, figsize=(10,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    plt.gca().imshow(cloud)
    plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))
    plt.gca().axis('off')


plt.subplots_adjust(wspace=0, hspace=0)
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()

Analysis of German YouTube comments: How people from the lateral thinking movement dominate the comments section under videos of the "tagesschau" channel
