Topic modeling

Topic modeling is an unsupervised approach for identifying topics in a corpus. Identifying topics among the comments helps to answer which topics dominate the comments section and whether these topics are dominated by right-wing comments. To identify the topics, Latent Dirichlet Allocation (LDA) is used. LDA is the most widely used model for topic modeling and learns the topic-word mappings from the corpus over several iterations [Atteveldt, 2022].

First, the libraries for preprocessing the dataset are imported. To remove stopwords, the stopword list from the Natural Language Toolkit (NLTK) is used. In addition, the spaCy library is used for lemmatization. Lemmatization is a common preprocessing step for topic modeling because it can lead to better results [May et al., 2019].

from IPython.core.display_functions import display
import pandas as pd
from cleantext import clean
import pickle
from pathlib import Path
import string
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')  # Suppress warnings to keep the output readable

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

import spacy
# The model has to be installed via the following command: $(env) python -m spacy download de_core_news_md
# The docker image already contains the model
spacy_model_german = spacy.load("de_core_news_md", disable=["parser", "ner"])

To preprocess the text, we first create a stopword list and a punctuation list. Then we define functions for lemmatizing the text and for removing punctuation marks. Both functions expect the text as input and return the processed result. The function tokenize_and_lemmatize_text performs the complete preprocessing. First the line breaks and the emojis are removed using the cleantext library; the call below also strips URLs (no_urls=True). Then the two functions for removing punctuation and for lemmatization are applied. Since spaCy returns a double hyphen (--) as the lemma for punctuation marks or unknown characters, these tokens are removed to ensure that only meaningful tokens are kept. Then the stopwords are removed. Finally, whitespace-only and single-character tokens are removed; such tokens can occur, for example, if the comment consists only of emojis.

stop_words = set(stopwords.words('german'))
regular_punctuation = list(string.punctuation)

def get_lemmatized_text(text: str):
    lemmatized_text = []
    document = spacy_model_german(text)
    for word in document:
        lemmatized_text.append(word.lemma_)
    return lemmatized_text

def remove_punctuation(text):
    for punc in regular_punctuation:
        if punc in text:
            text = text.replace(punc, ' ')
    return text.strip()

def tokenize_and_lemmatize_text(text: str):
    text = clean(text, no_emoji=True, lang="de", no_urls=True)
    text = remove_punctuation(text)
    tokens = get_lemmatized_text(text)
    tokens = [token for token in tokens if token != "--"]
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [token for token in tokens if not token.isspace()] # remove spaces
    tokens = [token for token in tokens if len(token) > 1] # remove single characters
    return tokens
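
The token filters at the end of tokenize_and_lemmatize_text can be tried in isolation. The following is a minimal sketch that leaves out cleantext and spaCy: whitespace splitting stands in for the tokenizer and lemmatizer, and toy_stop_words is a hypothetical, hand-written replacement for the NLTK list. The helper names strip_punctuation and filter_tokens are also illustrative and not part of the notebook.

```python
import string

# Hypothetical stand-in for the NLTK German stopword list
toy_stop_words = {"der", "die", "das", "und", "ist"}

def strip_punctuation(text: str) -> str:
    """Replace every ASCII punctuation character with a space."""
    table = str.maketrans({p: " " for p in string.punctuation})
    return text.translate(table).strip()

def filter_tokens(tokens: list[str]) -> list[str]:
    """Apply the same token filters as the pipeline above, in the same order."""
    tokens = [t for t in tokens if t != "--"]                # placeholder lemma
    tokens = [t for t in tokens if t not in toy_stop_words]  # stopwords
    tokens = [t for t in tokens if not t.isspace()]          # whitespace-only tokens
    tokens = [t for t in tokens if len(t) > 1]               # single characters
    return tokens

# Hand-written token list that triggers every filter once
print(filter_tokens(["Panzer", "--", "der", " ", "a", "liefern"]))

# Whitespace splitting stands in for the spaCy tokenizer and lemmatizer
text = strip_punctuation("Der Panzer ist groß, und das ist sinnlos!")
tokens = filter_tokens(text.lower().split())
print(tokens)
```

The membership test against the stopword set is case-sensitive, so with a lowercase stopword set a capitalized token such as "Der" is only removed if it is lowercased first, as done in the second call of the sketch.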

To preprocess all comments, they are first loaded from the CSV file and then cast to strings. This cast is necessary because some entries are otherwise not strings, which causes an error later in the pipeline. The problem is probably caused by special characters being interpreted incorrectly when the comments are saved or read in. Comments that yield at most one token after preprocessing are skipped. The result is a list that contains one list of tokens per comment, as the output shows.

comments = pd.read_csv("data/youtube_comments_500.csv")
comments["Comments"] = comments["Comments"].astype(str)
preprocessed_comments = []
for index, row in tqdm(comments.iterrows()):
    preprocessed_text: list = tokenize_and_lemmatize_text(row["Comments"])
    if len(preprocessed_text) > 1:
        preprocessed_comments.append(preprocessed_text)
406242it [24:50, 272.51it/s]
preprocessed_comments[0:2]
[['Tag',
  'groß',
  'Bericht',
  'immer',
  'Panzer',
  'liefern',
  'ganz',
  'schön',
  'sinnlos',
  'Vermittlung',
  'Neuigkeit'],
 ['scholz',
  'gut',
  'Weiss',
  'wieso',
  'brauchen',
  'Verteidigungsminister',
  'Stelle',
  'sparen']]

The preprocessing is quite time-consuming (about 25 minutes for the roughly 406,000 comments, as the progress bar above shows), which is why the processed comments are saved as a pickle file. A pickle file stores the nested Python list directly, so it can be loaded again without any conversion. Saving the comments as a CSV file makes little sense here, as the token list of each comment has a different length.

file = Path("data/youtube_comments_500_preprocessed.pkl")
# Write the cache only once; afterwards the tokens are loaded from disk
if not file.exists():
    with file.open("wb") as fw:
        pickle.dump(preprocessed_comments, fw)
with file.open("rb") as fr:
    preprocessed_comments = pickle.load(fr)

To implement the LDA model, the gensim library is used. For the visualisation of the topics and results at the end, the pyLDAvis library is used. This library allows interactive exploration of the results, as we will see later.

For the LDA model, a dictionary must first be created, which maps each word to an integer ID. This allows the text corpus, i.e. the comments, to be represented in bag-of-words format. When creating the dictionary, very frequent and very infrequent words are dropped to improve the model (here: words that occur in fewer than 20 comments or in more than 80% of the comments). Very common words carry little meaning and are therefore unlikely to be associated with a specific topic. Infrequent words, on the other hand, could belong to a topic, but since the LDA model learns only a few topics, it is unlikely that such a rare topic would be identified.
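
What the dictionary, the frequency filter and the bag-of-words format do can be illustrated without gensim. The following plain-Python sketch is a simplified stand-in, not gensim's implementation: the corpus is made up, the thresholds are scaled down to fit it, and gensim's additional keep_n limit is not modelled.

```python
from collections import Counter

# Toy corpus (hypothetical); each inner list is one tokenized comment
docs = [
    ["krieg", "panzer", "ukraine"],
    ["krieg", "putin"],
    ["krieg", "panzer", "panzer"],
    ["krieg", "wahl"],
]

# Document frequency: in how many comments does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc))

no_below = 2    # keep words that occur in at least 2 comments
no_above = 0.8  # keep words that occur in at most 80% of the comments
kept = sorted(
    word for word, df in doc_freq.items()
    if df >= no_below and df / len(docs) <= no_above
)

# Mapping word -> integer ID, then each comment as sorted (word_id, count) pairs
word_to_id = {word: idx for idx, word in enumerate(kept)}
bow_corpus = [
    sorted(Counter(word_to_id[w] for w in doc if w in word_to_id).items())
    for doc in docs
]
print(word_to_id)
print(bow_corpus)
```

In this toy corpus "krieg" is dropped because it occurs in every comment, the words that occur only once are dropped as too rare, and comments without any remaining word end up as empty bag-of-words vectors.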

The hyperparameters alpha and beta (called eta in gensim) of the LDA model have to be chosen carefully, as they strongly influence the model performance. Fortunately, gensim's LDA model can learn both hyperparameters from the data (alpha="auto" and eta="auto"). Another important hyperparameter is the number of topics, which has to be specified upfront. This choice is often based on domain knowledge, and there is no good theoretical solution for the problem [Atteveldt, 2022]. One possible approach is to systematically increase or decrease the number of topics. However, this would require a lot of computing resources, and it is difficult to decide whether one distribution of topics is better than another. It was therefore decided to try out three values for the number of topics: 6, 10 and 20.
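
In LDA, alpha is the concentration parameter of the Dirichlet prior over each comment's topic mixture, and eta plays the same role for each topic's word distribution. The sketch below is only meant to build intuition for why this value matters: it samples topic mixtures from a symmetric Dirichlet with the standard library and reports how large the share of the strongest topic is on average. The function names and the chosen alpha values are illustrative assumptions and unrelated to the values gensim learns.

```python
import random

def sample_dirichlet(alpha: float, k: int, rng: random.Random) -> list[float]:
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics
    by normalizing k independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha: float, k: int = 10, n: int = 500, seed: int = 0) -> float:
    """Average share of the single most probable topic over n sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_largest_share(alpha), 2))
```

Small alpha values concentrate each sampled mixture on one or two topics, while large values spread the probability mass almost evenly across all topics.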

To compare the models, two metrics are used. Perplexity measures how well the model predicts the words that actually occur in the comments. Coherence measures how semantically related the top words within a topic are. However, the model that scores best on these metrics is not always the most interpretable one from a human perspective [Atteveldt, 2022].
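
Coherence scores a single topic by checking how often its top words occur together. The c_v measure used in this notebook is fairly involved, so the sketch below swaps in the simpler UMass-style score to show the idea: for every pair of top words it compares the number of documents containing both words with the number containing the higher-ranked word. The function name, the averaging over pairs and the toy documents are assumptions made for illustration; this is not the computation gensim performs for c_v.

```python
import math
from itertools import combinations

def umass_coherence(top_words: list[str], docs: list[list[str]]) -> float:
    """UMass-style coherence: mean of log((D(w_i, w_j) + 1) / D(w_i)) over all
    pairs where w_i is ranked above w_j; D counts documents containing the words.
    Every top word must occur in at least one document."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words: str) -> int:
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = [
        math.log((doc_count(w_i, w_j) + 1) / doc_count(w_i))
        for w_i, w_j in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)

# Toy corpus (hypothetical)
docs = [
    ["krieg", "panzer", "ukraine"],
    ["krieg", "panzer"],
    ["wahl", "partei"],
    ["wahl", "partei", "krieg"],
]

coherent = umass_coherence(["krieg", "panzer", "ukraine"], docs)
mixed = umass_coherence(["krieg", "wahl", "ukraine"], docs)
print(round(coherent, 3), round(mixed, 3))
```

Scores closer to zero indicate that the top words tend to appear in the same documents, so the topic whose words co-occur scores higher than the topic that mixes unrelated words.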

import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
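# Map every token to an integer ID
dictionary = corpora.Dictionary(preprocessed_comments)
# Drop tokens that occur in fewer than 20 comments or in more than 80% of all comments
dictionary.filter_extremes(no_below=20, no_above=0.8)
# Bag-of-words representation: one list of (token_id, count) pairs per comment
corpus = [dictionary.doc2bow(text) for text in preprocessed_comments]
# alpha="auto" and eta="auto" let gensim learn both priors from the corpus
lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary, passes=15, alpha="auto", eta="auto")
# The target directory lda_models/ must already exist
lda_model.save('lda_models/lda_model_20.gensim')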

The score of both evaluation metrics will decrease when adding more topics [Atteveldt, 2022]. This behavior can also be seen in the models trained here. In theory, one looks for the inflection point at which the values of the two metrics fall at a much slower rate [Atteveldt, 2022]. However, since only a few models were trained here, such an approach is difficult to carry out, especially since both metrics continue to decrease (at a similar rate).

lda_model_6 = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_6.gensim")
lda_model_10 = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_10.gensim")
lda_model_20 = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_20.gensim")

# Compute perplexity (log_perplexity returns the per-word likelihood bound, not the perplexity itself)
print('LDA_6 Perplexity: ', lda_model_6.log_perplexity(corpus))
print('LDA_10 Perplexity: ', lda_model_10.log_perplexity(corpus))
print('LDA_20 Perplexity: ', lda_model_20.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda_6 = CoherenceModel(model=lda_model_6, texts=preprocessed_comments, dictionary=dictionary, coherence='c_v')
coherence_lda_6 = coherence_model_lda_6.get_coherence()
print('LDA_6 Coherence Score: ', coherence_lda_6)

coherence_model_lda_10 = CoherenceModel(model=lda_model_10, texts=preprocessed_comments, dictionary=dictionary, coherence='c_v')
coherence_lda_10 = coherence_model_lda_10.get_coherence()
print('LDA_10 Coherence Score: ', coherence_lda_10)

coherence_model_lda_20 = CoherenceModel(model=lda_model_20, texts=preprocessed_comments, dictionary=dictionary, coherence='c_v')
coherence_lda_20 = coherence_model_lda_20.get_coherence()
print('LDA_20 Coherence Score: ', coherence_lda_20)
LDA_6 Perplexity:  -8.334388322182123
LDA_10 Perplexity:  -9.166142281809835
LDA_20 Perplexity:  -13.330936599649911
LDA_6 Coherence Score:  0.6025754127903082
LDA_10 Coherence Score:  0.5074582503859234
LDA_20 Coherence Score:  0.4254128399081809
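
The values printed as perplexity are negative because gensim's log_perplexity returns a per-word likelihood bound rather than the perplexity itself; gensim's own log output reports the perplexity estimate as 2 to the power of the negated bound. As a small sketch, the conversion applied to the rounded values from the output above looks like this (the dictionary names are illustrative):

```python
# Per-word likelihood bounds as printed above (rounded), keyed by number of topics
bounds = {6: -8.334, 10: -9.166, 20: -13.331}

# Perplexity estimate as used in gensim's log output: 2 ** (-bound)
perplexity = {k: 2 ** (-bound) for k, bound in bounds.items()}

for k, value in perplexity.items():
    print(f"LDA_{k} perplexity estimate: {value:.1f}")
```

On this scale a more negative bound corresponds to a larger perplexity estimate.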

If we look at the topics that the different models have found, we can identify some clear topics, but there are also some other topics that make little sense. Generally, the topics in the 6-topic model are too general. It is difficult to find clear topics there. In the models with 10 and 20 topics, on the other hand, it is easy to find some clear topics. However, it is difficult to say which of the two models is better.

lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_6.gensim")
display(pd.DataFrame({f"Topic {n}":[word for (word,word_weight) in words] for (n, words) in lda_model.show_topics(formatted=False, num_words=15)}))
    Topic 0      Topic 1  Topic 2     Topic 3      Topic 4         Topic 5
0   Jahr         ja       Regierung   mehr         innen           Nachricht
1   Deutschland  gut      Frau        sollen       Sektendepp      The
2   Russland     schon    grün        geben        Coronaleugner   Reichsbürger
3   Krieg        mal      endlich     Mensch       hetzen          Reinhard
4   USA          wer      Politiker   Deutschland  Behauptung      Youtube
5   Ukraine      immer    Medium      Land         Lüge            Antwort
6   seit         gehen    wählen      warum        Putinanhimmler  Weihnachten
7   Putin        kommen   Volk        müssen       Lara            oh
8   Waffe        deutsch  Demokratie  tun          Beweis          Mrscrewy
9   russisch     ganz     eur         vieler       Croft           de
10  EU           sagen    Herr        Leute        Name            jährig
11  10           sehen    Partei      Kind         Michael         Müller
12  Europa       wissen   wann        Problem      beweisen        dürsch
13  ukrain       einfach  Berlin      dürfen       Jinping         Henning
14  Panzer       heute    EU          dafür        drohen          Kreml
lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_10.gensim")
display(pd.DataFrame({f"Topic {n}":[word for (word,word_weight) in words] for (n, words) in lda_model.show_topics(formatted=False, num_topics=10, num_words=15)}))
    Topic 0      Topic 1     Topic 2   Topic 3      Topic 4     Topic 5     Topic 6         Topic 7          Topic 8   Topic 9
0   ja           grün        Russland  Polizei      10          schön       dumm            innen            12        sollen
1   mehr         öffentlich  Krieg     Euro         Million     afd         erster          Sektendepp       14        wer
2   gut          ard         USA       Absonderung  000         Kommentar   Coronaleugner   hetzen           Mädchen   sagen
3   Deutschland  Wahrheit    Putin     Mathias      rd          klein       schreiben       Lüge             11        sehen
4   schon        links       Ukraine   Korruption   2022        halt        na              fordern          xi        einfach
5   mal          Aussage     EU        Milliarde    20          wählen      Behauptung      selber           Hans      warum
6   geben        The         Waffe     verkaufen    sterben     danken      nix             bleiben          jährig    wissen
7   Jahr         Ausländer   endlich   europäisch   Corona      echt        Name            Putinanhimmler   Müller    tun
8   deutsch      Angst       russisch  Wiese        100         lange       Nachricht       Beweis           00        Leute
9   immer        etc         Europa    Pergon       ca          Partei      beweisen        Jinping          40        Kind
10  gehen        Rex         ukrain    Hetzer       Swongeböte  Demokratie  darauf          gesinnungsbraun  24        finden
11  kommen       rot         China     kriminell    tot         Berlin      vielleicht      Klimawandel      Blödsinn  Frau
12  Mensch       traurig     Panzer    öl           kämpfen     stimmen     Propaganda      behaupten        geh       denken
13  ganz         Wilhelm     Volk      Inflation    pro         lesen       offensichtlich  immer            35        Problem
14  Land         Imperator   Scholz    steigen      Freiheit    grüne       drohen          Papst            Fckafd    lassen
lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_20.gensim")
display(pd.DataFrame({f"Topic {n}":[word for (word,word_weight) in words] for (n, words) in lda_model.show_topics(formatted=False, num_topics=20, num_words=15)}))
Topic 0 Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16 Topic 17 Topic 18 Topic 19
0 grün Jahr Tagesschau schreiben Polizei ja sehen EU deutsch stimmen neu The Deutschland jemand Russland Wort innen Mensch Frau nie
1 na seit erster Name öffentlich mehr echt Politik berichten zurück schön darauf Regierung verlieren Krieg co2 Coronaleugner Land Kind Thema
2 grüne letzter eur nennen verbieten gut falsch Politiker Volk teuer endlich Antwort afd ne USA Frankreich hetzen Welt bitte voll
3 ard 10 Berlin Michael Klimawandel schon beide danken Recht 12 dumm tatsächlich Staat trotzdem Putin Flüchtling fordern Geld Kommentar absolut
4 linker nächster Reinhard dank Klima mal lieb Russe darüber Preis Scholz oft weit typisch Ukraine weltweit Behauptung brauchen nehmen Wahrheit
5 SPD Million sowas kannst schützen sollen Weg vergessen Schuld 14 wann etwa zeigen Mathias Waffe bekannt Lüge stehen alt and
6 links zwei kaufen Gott Fakt geben suchen krank Bild 30 ach Nazi Demokratie Rente russisch wahrscheinlich selber ab Mann handeln
7 sofort 000 spielen mögen fahren immer Merkel raus lächerlich 11 wünschen drohen Medium kriegen Europa Youtube Beweis eigen Frage Bundestag
8 offensichtlich Milliarde wahr einsam Polizist wer Hand Ampel freuen vorbei sitzen toll wählen xi ukrain europäisch Nachricht wegen hören 25
9 Angst 2022 oh Iran Gewalt gehen Seite wieso willst treffen Propaganda Thomas Meinung Wiese China Gedanke beweisen leben Herr super
10 Korruption 20 Fußball treiben mussn kommen Lösung dr hoffentlich schaden Glück warten Bürger Präsident Panzer Generation bleiben gehören lesen zerstören
11 Wahl Monat Freiheit Inflation fliegen ganz Familie Geschichte leisten extrem Bundeswehr zweiter schnell schuldig nein Rakete Aussage davon Gesellschaft verfolgen
12 erkennen Euro Mal Covid bedeuten sagen bestimmt demokratisch erwarten 00 fehlen Gegenteil egal Beitrag ukraine Gericht behaupten leider jung angst
13 abschaffen Habeck etc Regime Böller einfach verkaufen Hilfe entscheiden Luft Müller Bericht Partei Lügner liefern Alexander wm Leben lernen Argument
14 blöd 50 rechter Merz rechtlich warum Schnettka Hitler danach Sanktion beenden 40 politisch drauf Frieden Nacht Beleidigung schaffen Schule Afghanistan

So far, only the top words of each topic have been considered, without taking their weights into account. A word cloud can include these weights and thus give a better understanding of a topic. To keep the visualisation clear, only the model with 10 topics is visualised using word clouds. This is also the model that performs best in the evaluation with the pyLDAvis library in the next section. Looking at the word clouds shown below, the following topics can be identified:

  • Topic 0: This topic could be about Germany and domestic issues.

  • Topic 2: This topic is clearly about the war in Ukraine.

  • Topic 5: This topic seems to be related to the “AfD”. The terms suggest that it could partly be about calls to vote for the “AfD” or statements that the “AfD” has been voted for in the past.

  • Topic 6: This topic is about Covid and Covid deniers. However, it cannot be clearly stated whether these are comments by Covid deniers or comments by people who are upset about Covid deniers and criticise them.

  • Topic 7: This topic could be about comments in which users complain about the lateral thinking (“Querdenken”) movement.
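
The word clouds are built from the (word, weight) pairs that show_topics(formatted=False) returns for each topic. The sketch below mimics that structure with made-up words and weights to show how the pairs become the word-to-weight dictionary that is passed on as frequencies; toy_topics and shares_by_topic are illustrative names.

```python
# Hypothetical data in the shape returned by show_topics(formatted=False):
# a list of (topic_id, [(word, weight), ...]) tuples; the weights are made up
toy_topics = [
    (0, [("krieg", 0.05), ("panzer", 0.03), ("ukraine", 0.02)]),
    (1, [("wahl", 0.04), ("partei", 0.01)]),
]

shares_by_topic = {}
for topic_id, word_weights in toy_topics:
    frequencies = dict(word_weights)  # word -> weight mapping
    total = sum(frequencies.values())
    # Relative weight of each word among the displayed top words
    shares_by_topic[topic_id] = {
        word: round(weight / total, 2) for word, weight in frequencies.items()
    }

print(shares_by_topic)
```

A word with a larger weight is drawn in a larger font, so the relative shares give a rough idea of how dominant a word will look within its cloud.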

from matplotlib import pyplot as plt
from wordcloud import WordCloud
import matplotlib.colors as mcolors

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]

cloud = WordCloud(background_color='white', width=2500, height=1800, max_words=10, colormap='tab10', color_func=lambda *args, **kwargs: cols[i], prefer_horizontal=1.0)

lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_10.gensim")
topics = lda_model.show_topics(formatted=False)

fig, axes = plt.subplots(5, 2, figsize=(10,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    # color_func above reads the loop variable i, so each topic gets its own color
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    ax.imshow(cloud)
    ax.set_title('Topic ' + str(i), fontdict=dict(size=16))
    ax.axis('off')


plt.subplots_adjust(wspace=0, hspace=0)
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()
[Figure: word clouds of the top words for each topic of the 10-topic LDA model]