Topic modeling
Topic modeling is an unsupervised approach for identifying topics in a corpus. Identifying the topics among the comments helps to answer the question of which topics are dominant in the comments section and whether they are dominated by right-wing comments. To identify topics, Latent Dirichlet Allocation (LDA) is used. LDA is the most widely used model for topic modeling and learns the topic-word mappings from the corpus over several iterations [Atteveldt, 2022].
First, the libraries for preprocessing the dataset are imported. To remove stopwords, the stopword list from the Natural Language Toolkit (NLTK) is used. In addition, the spaCy library is used for lemmatization. Lemmatization is an important and frequently used preprocessing step for topic modeling, because it has been shown to lead to better results [May et al., 2019].
from IPython.core.display_functions import display
import pandas as pd
from cleantext import clean
import pickle
from pathlib import Path
import string
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')  # Disable warnings to improve output formatting
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
import spacy
# The model has to be installed via the following command: $(env) python -m spacy download de_core_news_md
# The docker image already contains the model
spacy_model_german = spacy.load("de_core_news_md", disable=["parser", "ner"])
For preprocessing the text, we first create a stopword list and a punctuation list. Then we define methods for lemmatizing the text and for removing punctuation marks. Both methods expect the text as input, perform the corresponding processing and return the processed text. The method tokenize_and_lemmatize_text performs the complete preprocessing. First, line breaks, URLs and emojis are removed; for this, the cleantext library is used. Then the two methods for removing punctuation and performing lemmatization are applied. Since spaCy returns a double hyphen -- as the token for punctuation marks or unknown characters, these tokens are removed to ensure that only meaningful tokens remain. Then the stopwords are removed. Finally, empty and single-character tokens are removed; such tokens can occur, for example, when a comment consists only of emojis.
stop_words = set(stopwords.words('german'))
regular_punctuation = list(string.punctuation)

def get_lemmatized_text(text: str):
    # Run the spaCy pipeline and collect the lemma of every token
    lemmatized_text = []
    document = spacy_model_german(text)
    for word in document:
        lemmatized_text.append(word.lemma_)
    return lemmatized_text

def remove_punctuation(text):
    # Replace every punctuation character with a space
    for punc in regular_punctuation:
        if punc in text:
            text = text.replace(punc, ' ')
    return text.strip()

def tokenize_and_lemmatize_text(text: str):
    text = clean(text, no_emoji=True, lang="de", no_urls=True)
    text = remove_punctuation(text)
    tokens = get_lemmatized_text(text)
    tokens = [token for token in tokens if token != "--"]
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [token for token in tokens if not token.isspace()]  # remove whitespace-only tokens
    tokens = [token for token in tokens if len(token) > 1]  # remove single characters
    return tokens
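As a quick sanity check, the pipeline can be applied to a single comment. The example text below is made up, and the exact lemmas depend on the spaCy model, so the commented output is only indicative:

example_comment = "Die Panzer werden doch völlig sinnlos geliefert!! 😡"
print(tokenize_and_lemmatize_text(example_comment))
# Indicative output: a list of lemmas without stopwords, punctuation and emojis,
# e.g. ['Panzer', 'völlig', 'sinnlos', 'liefern']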
To preprocess all comments, they are first loaded from the CSV file, and then it is ensured that all comments are in string format. This step is necessary because otherwise an error is raised later complaining that a comment is not a string; the problem is probably caused by special characters being interpreted incorrectly when the comments are saved or read. Comments that end up with one token or fewer after preprocessing are dropped. In the end we get a list that contains a list of tokens for each comment, as we can see in the output.
comments = pd.read_csv("data/youtube_comments_500.csv")
comments["Comments"] = comments["Comments"].astype(str)

preprocessed_comments = []
for index, row in tqdm(comments.iterrows()):
    preprocessed_text: list = tokenize_and_lemmatize_text(row["Comments"])
    if len(preprocessed_text) > 1:
        preprocessed_comments.append(preprocessed_text)
406242it [24:50, 272.51it/s]
preprocessed_comments[0:2]
[['Tag',
'groß',
'Bericht',
'immer',
'Panzer',
'liefern',
'ganz',
'schön',
'sinnlos',
'Vermittlung',
'Neuigkeit'],
['scholz',
'gut',
'Weiss',
'wieso',
'brauchen',
'Verteidigungsminister',
'Stelle',
'sparen']]
The preprocessing is quite time-consuming (about 25 minutes for the roughly 400,000 comments, as the progress bar above shows), which is why the processed comments are saved as a pickle file. Pickle makes it possible to store the nested list directly and load it again later. Saving the comments as a CSV makes little sense here, as the list of tokens has a different length for each comment.
file = Path("data/youtube_comments_500_preprocessed.pkl")
if not file.exists():
    with open(file, "wb") as fw:
        pickle.dump(preprocessed_comments, fw)

with open(file, "rb") as fr:
    preprocessed_comments = pickle.load(fr)
To implement the LDA model, the gensim library is used. For the visualisation of the topics and results at the end, the pyLDAvis library is used. This library allows interactive exploration of the results, as we will see later.
For the LDA model, a dictionary must first be created, which is a mapping between words and word IDs. This allows us to subsequently represent the text corpus, i.e. the comments, in a bag-of-words format. When creating the dictionary, very frequent and very infrequent words are filtered out to improve the model. Very common words carry little meaning and are therefore unlikely to be associated with specific topics. Infrequent words, on the other hand, could belong to a topic, but since the LDA model learns only a few topics, it is unlikely that such a topic would be identified.
The hyperparameters alpha and beta (in gensim called eta) of the LDA model have to be chosen carefully, as they strongly influence the model performance. Fortunately, gensim's LDA model can automatically find a good choice for both hyperparameters. Another important hyperparameter is the number of topics, which has to be specified upfront. The choice of this hyperparameter is often based on domain knowledge, and there is no good theoretical solution to this problem [Atteveldt, 2022]. One possible approach to finding a good value is to systematically increase or decrease the number of topics. However, on the one hand this would require a lot of computing resources, and on the other hand it is difficult to decide whether one distribution of topics is better than another. It was therefore decided to try out the following three values for the number of topics: 6, 10 and 20.
To compare which option performs best, two metrics are used. The perplexity measures how well the model fits the actual word distribution. The coherence measures how semantically coherent the top words within a topic are. However, the best model according to these metrics is not always the most interpretable model from a human perspective [Atteveldt, 2022].
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel

# Mapping between words and integer word IDs
dictionary = corpora.Dictionary(preprocessed_comments)
# Ignore words that occur in fewer than 20 comments or in more than 80% of them
dictionary.filter_extremes(no_below=20, no_above=0.8)
# Represent each comment in bag-of-words format, i.e. as (word ID, count) pairs
corpus = [dictionary.doc2bow(text) for text in preprocessed_comments]

# Train the 20-topic model; alpha and eta are optimized automatically during training
lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary, passes=15, alpha="auto", eta="auto")
lda_model.save('lda_models/lda_model_20.gensim')
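To make the bag-of-words format concrete, the dictionary and the corpus created above can be inspected directly. The IDs and counts in the comments below are only placeholders, as the concrete values depend on the data:

print(corpus[0])      # each comment becomes a list of (word ID, count) pairs, e.g. [(0, 1), (1, 1), ...]
print(dictionary[0])  # looks up the word stored under ID 0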
The score of both evaluation metrics decreases when more topics are added [Atteveldt, 2022]. This behavior can also be seen in the models trained here. In theory, one looks for the inflection point at which the values of the two metrics start to fall at a much slower rate [Atteveldt, 2022]. However, since only a few models were trained here, such an approach is difficult to carry out, especially since both metrics continue to decrease at a similar rate.
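The models with 6 and 10 topics that are loaded below were presumably trained in the same way as the 20-topic model above; a minimal sketch of such a training loop, assuming identical settings apart from num_topics:

for n in (6, 10, 20):
    model = gensim.models.ldamodel.LdaModel(corpus, num_topics=n, id2word=dictionary, passes=15, alpha="auto", eta="auto")
    model.save(f'lda_models/lda_model_{n}.gensim')  # naming scheme matching the files loaded below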
lda_model_6 = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_6.gensim")
lda_model_10 = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_10.gensim")
lda_model_20 = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_20.gensim")
# Compute Perplexity
print('LDA_6 Perplexity: ', lda_model_6.log_perplexity(corpus))
print('LDA_10 Perplexity: ', lda_model_10.log_perplexity(corpus))
print('LDA_20 Perplexity: ', lda_model_20.log_perplexity(corpus))
# Compute Coherence Score
coherence_model_lda_6 = CoherenceModel(model=lda_model_6, texts=preprocessed_comments, dictionary=dictionary, coherence='c_v')
coherence_lda_6 = coherence_model_lda_6.get_coherence()
print('LDA_6 Coherence Score: ', coherence_lda_6)
coherence_model_lda_10 = CoherenceModel(model=lda_model_10, texts=preprocessed_comments, dictionary=dictionary, coherence='c_v')
coherence_lda_10 = coherence_model_lda_10.get_coherence()
print('LDA_10 Coherence Score: ', coherence_lda_10)
coherence_model_lda_20 = CoherenceModel(model=lda_model_20, texts=preprocessed_comments, dictionary=dictionary, coherence='c_v')
coherence_lda_20 = coherence_model_lda_20.get_coherence()
print('LDA_20 Coherence Score: ', coherence_lda_20)
LDA_6 Perplexity: -8.334388322182123
LDA_10 Perplexity: -9.166142281809835
LDA_20 Perplexity: -13.330936599649911
LDA_6 Coherence Score: 0.6025754127903082
LDA_10 Coherence Score: 0.5074582503859234
LDA_20 Coherence Score: 0.4254128399081809
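Note that gensim's log_perplexity returns the per-word likelihood bound rather than the perplexity itself; the actual perplexity is obtained as 2 to the power of the negative bound, so the more negative values above correspond to a higher perplexity:

bound = lda_model_6.log_perplexity(corpus)
print(2 ** (-bound))  # perplexity = 2 ** (-bound); for the 6-topic model roughly 2 ** 8.33 ≈ 320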
If we look at the topics found by the different models, we can identify some clear topics, but there are also some that make little sense. The topics in the 6-topic model are generally too broad; it is difficult to find clear topics there. In the models with 10 and 20 topics, on the other hand, some clear topics are easy to find. However, it is difficult to say which of these two models is better.
lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_6.gensim")
display(pd.DataFrame({f"Topic {n}":[word for (word,word_weight) in words] for (n, words) in lda_model.show_topics(formatted=False, num_words=15)}))
 | Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|---|---|
0 | Jahr | ja | Regierung | mehr | innen | Nachricht |
1 | Deutschland | gut | Frau | sollen | Sektendepp | The |
2 | Russland | schon | grün | geben | Coronaleugner | Reichsbürger |
3 | Krieg | mal | endlich | Mensch | hetzen | Reinhard |
4 | USA | wer | Politiker | Deutschland | Behauptung | Youtube |
5 | Ukraine | immer | Medium | Land | Lüge | Antwort |
6 | seit | gehen | wählen | warum | Putinanhimmler | Weihnachten |
7 | Putin | kommen | Volk | müssen | Lara | oh |
8 | Waffe | deutsch | Demokratie | tun | Beweis | Mrscrewy |
9 | russisch | ganz | eur | vieler | Croft | de |
10 | EU | sagen | Herr | Leute | Name | jährig |
11 | 10 | sehen | Partei | Kind | Michael | Müller |
12 | Europa | wissen | wann | Problem | beweisen | dürsch |
13 | ukrain | einfach | Berlin | dürfen | Jinping | Henning |
14 | Panzer | heute | EU | dafür | drohen | Kreml |
lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_10.gensim")
display(pd.DataFrame({f"Topic {n}":[word for (word,word_weight) in words] for (n, words) in lda_model.show_topics(formatted=False, num_topics=10, num_words=15)}))
 | Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | ja | grün | Russland | Polizei | 10 | schön | dumm | innen | 12 | sollen |
1 | mehr | öffentlich | Krieg | Euro | Million | afd | erster | Sektendepp | 14 | wer |
2 | gut | ard | USA | Absonderung | 000 | Kommentar | Coronaleugner | hetzen | Mädchen | sagen |
3 | Deutschland | Wahrheit | Putin | Mathias | rd | klein | schreiben | Lüge | 11 | sehen |
4 | schon | links | Ukraine | Korruption | 2022 | halt | na | fordern | xi | einfach |
5 | mal | Aussage | EU | Milliarde | 20 | wählen | Behauptung | selber | Hans | warum |
6 | geben | The | Waffe | verkaufen | sterben | danken | nix | bleiben | jährig | wissen |
7 | Jahr | Ausländer | endlich | europäisch | Corona | echt | Name | Putinanhimmler | Müller | tun |
8 | deutsch | Angst | russisch | Wiese | 100 | lange | Nachricht | Beweis | 00 | Leute |
9 | immer | etc | Europa | Pergon | ca | Partei | beweisen | Jinping | 40 | Kind |
10 | gehen | Rex | ukrain | Hetzer | Swongeböte | Demokratie | darauf | gesinnungsbraun | 24 | finden |
11 | kommen | rot | China | kriminell | tot | Berlin | vielleicht | Klimawandel | Blödsinn | Frau |
12 | Mensch | traurig | Panzer | öl | kämpfen | stimmen | Propaganda | behaupten | geh | denken |
13 | ganz | Wilhelm | Volk | Inflation | pro | lesen | offensichtlich | immer | 35 | Problem |
14 | Land | Imperator | Scholz | steigen | Freiheit | grüne | drohen | Papst | Fckafd | lassen |
lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_20.gensim")
display(pd.DataFrame({f"Topic {n}":[word for (word,word_weight) in words] for (n, words) in lda_model.show_topics(formatted=False, num_topics=20, num_words=15)}))
 | Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 | Topic 11 | Topic 12 | Topic 13 | Topic 14 | Topic 15 | Topic 16 | Topic 17 | Topic 18 | Topic 19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | grün | Jahr | Tagesschau | schreiben | Polizei | ja | sehen | EU | deutsch | stimmen | neu | The | Deutschland | jemand | Russland | Wort | innen | Mensch | Frau | nie |
1 | na | seit | erster | Name | öffentlich | mehr | echt | Politik | berichten | zurück | schön | darauf | Regierung | verlieren | Krieg | co2 | Coronaleugner | Land | Kind | Thema |
2 | grüne | letzter | eur | nennen | verbieten | gut | falsch | Politiker | Volk | teuer | endlich | Antwort | afd | ne | USA | Frankreich | hetzen | Welt | bitte | voll |
3 | ard | 10 | Berlin | Michael | Klimawandel | schon | beide | danken | Recht | 12 | dumm | tatsächlich | Staat | trotzdem | Putin | Flüchtling | fordern | Geld | Kommentar | absolut |
4 | linker | nächster | Reinhard | dank | Klima | mal | lieb | Russe | darüber | Preis | Scholz | oft | weit | typisch | Ukraine | weltweit | Behauptung | brauchen | nehmen | Wahrheit |
5 | SPD | Million | sowas | kannst | schützen | sollen | Weg | vergessen | Schuld | 14 | wann | etwa | zeigen | Mathias | Waffe | bekannt | Lüge | stehen | alt | and |
6 | links | zwei | kaufen | Gott | Fakt | geben | suchen | krank | Bild | 30 | ach | Nazi | Demokratie | Rente | russisch | wahrscheinlich | selber | ab | Mann | handeln |
7 | sofort | 000 | spielen | mögen | fahren | immer | Merkel | raus | lächerlich | 11 | wünschen | drohen | Medium | kriegen | Europa | Youtube | Beweis | eigen | Frage | Bundestag |
8 | offensichtlich | Milliarde | wahr | einsam | Polizist | wer | Hand | Ampel | freuen | vorbei | sitzen | toll | wählen | xi | ukrain | europäisch | Nachricht | wegen | hören | 25 |
9 | Angst | 2022 | oh | Iran | Gewalt | gehen | Seite | wieso | willst | treffen | Propaganda | Thomas | Meinung | Wiese | China | Gedanke | beweisen | leben | Herr | super |
10 | Korruption | 20 | Fußball | treiben | mussn | kommen | Lösung | dr | hoffentlich | schaden | Glück | warten | Bürger | Präsident | Panzer | Generation | bleiben | gehören | lesen | zerstören |
11 | Wahl | Monat | Freiheit | Inflation | fliegen | ganz | Familie | Geschichte | leisten | extrem | Bundeswehr | zweiter | schnell | schuldig | nein | Rakete | Aussage | davon | Gesellschaft | verfolgen |
12 | erkennen | Euro | Mal | Covid | bedeuten | sagen | bestimmt | demokratisch | erwarten | 00 | fehlen | Gegenteil | egal | Beitrag | ukraine | Gericht | behaupten | leider | jung | angst |
13 | abschaffen | Habeck | etc | Regime | Böller | einfach | verkaufen | Hilfe | entscheiden | Luft | Müller | Bericht | Partei | Lügner | liefern | Alexander | wm | Leben | lernen | Argument |
14 | blöd | 50 | rechter | Merz | rechtlich | warum | Schnettka | Hitler | danach | Sanktion | beenden | 40 | politisch | drauf | Frieden | Nacht | Beleidigung | schaffen | Schule | Afghanistan |
So far, only the top words of each topic have been considered, without taking the weighting of the words into account. With a wordcloud it is possible to include this weighting and thus get a better understanding of each topic. To keep things clear, it was decided to visualise the model with 10 topics using wordclouds. This is also the model that performs best in the evaluation with the pyLDAvis library in the next section. Looking at the wordclouds shown below, the following topics can be identified:
Topic 0: This topic could be about Germany and domestic issues.
Topic 2: This topic is obviously about the Ukraine war.
Topic 5: This topic seems to be related to the “AFD”. The terms suggest that it could be partly about calls to vote for the “AFD” or a statement that the “AFD” has been voted for in the past.
Topic 6: This topic is about Covid and Covid deniers. However, it cannot be clearly determined whether these are mostly comments by Covid deniers themselves or rather comments by people who are upset about the Covid deniers and criticise them.
Topic 7: This topic could be about comments in which people complain about the lateral thinking ("Querdenker") movement.
from matplotlib import pyplot as plt
from wordcloud import WordCloud
import matplotlib.colors as mcolors

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
# color_func reads the loop variable i at call time, so every topic gets its own color
cloud = WordCloud(background_color='white', width=2500, height=1800, max_words=10, colormap='tab10', color_func=lambda *args, **kwargs: cols[i], prefer_horizontal=1.0)

lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_10.gensim")
topics = lda_model.show_topics(formatted=False)

fig, axes = plt.subplots(5, 2, figsize=(10, 10), sharex=True, sharey=True)
for i, ax in enumerate(axes.flatten()):
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    ax.imshow(cloud)
    ax.set_title('Topic ' + str(i), fontdict=dict(size=16))
    ax.axis('off')

plt.subplots_adjust(wspace=0, hspace=0)
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()
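Beyond the aggregated wordclouds, the trained model can also assign a topic distribution to an individual comment via gensim's get_document_topics. A minimal sketch; the printed probabilities are placeholders that depend on the trained model:

doc_topics = lda_model.get_document_topics(corpus[0])
print(doc_topics)  # e.g. [(2, 0.71), (5, 0.18), ...] -- (topic ID, probability) pairs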
