Topic modeling visualization

The pyLDAvis library allows interactive exploration of the topics found by an LDA model. Unfortunately, the HTML output of this library partially breaks the layout of the Jupyter Book page. For this reason, the visualizations have been moved to a separate section. The code that produces the outputs is collapsed by default because the cells are not formatted entirely correctly.

A particularly useful aspect of the evaluation that this library makes possible is seeing how the topics overlap. Ideally, each topic would be distinct and would not overlap with any other. Examined from this angle, the 6-topic model has two overlaps: topics 1 and 4 overlap partially, and topics 2 and 5 overlap to a large extent. With only 6 topics in total, this relatively large share of overlapping topics does not speak for the quality of the model. The model with 10 topics, on the other hand, has only two smaller overlaps and therefore seems to be better. The model with 20 topics does not seem to be an improvement, as it also contains many overlaps. From these observations, it can be concluded that the 10-topic model is probably the best of the three.
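The visual judgment of overlap can be complemented with a number. The following is a minimal sketch, not part of the original analysis: it computes the Jensen-Shannon divergence between topic-word distributions, which is the measure pyLDAvis uses by default to place topics on its intertopic distance map. The function names, the toy matrix, and the threshold of 0.3 are assumptions made for illustration. For a trained gensim model, the rows could presumably be taken from `lda_model.get_topics()`.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    0.0 means identical distributions, 1.0 means no shared support.
    """
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero weight contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def overlapping_pairs(topic_word, threshold=0.3):
    """Return (i, j, divergence) for topic pairs whose divergence is below threshold.

    The threshold is an arbitrary illustrative cut-off, not a recommended value.
    """
    pairs = []
    for i, j in combinations(range(len(topic_word)), 2):
        d = js_divergence(topic_word[i], topic_word[j])
        if d < threshold:
            pairs.append((i, j, d))
    return pairs


# Toy topic-word matrix (each row sums to 1), invented for illustration only.
toy_topics = [
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
]
close = overlapping_pairs(toy_topics)
for i, j, d in close:
    print(f"topics {i} and {j} are close (JS divergence {d:.3f})")
```

Circles that overlap in the two-dimensional projection do not necessarily have a small divergence in the full vocabulary space, so a check like this is a complement to the plot rather than a replacement for it.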

If we take a closer look at the model with 10 topics, it is noticeable that a large proportion of the comments fall under topics 1 and 10. Topic 10 is very vague and cannot be characterized in detail. Topic 1, however, could be domestic policy. This topic therefore seems to play a major role in the comments, which is not surprising.
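The share of comments per topic can also be checked directly. The sketch below is an illustration under assumptions, not the method used above: it counts, for each comment, the topic with the highest probability and reports the resulting shares. The function name and the toy data are hypothetical. The input format mirrors the list of `(topic_id, probability)` pairs that gensim's `lda_model.get_document_topics(bow)` returns for a single document.

```python
from collections import Counter


def dominant_topic_shares(doc_topics):
    """Share of documents whose most probable topic is each topic id.

    doc_topics: one list of (topic_id, probability) pairs per document.
    """
    counts = Counter(
        max(dist, key=lambda pair: pair[1])[0]
        for dist in doc_topics
        if dist  # skip documents with an empty distribution
    )
    total = sum(counts.values())
    # most_common() orders topics from the largest share to the smallest.
    return {topic: n / total for topic, n in counts.most_common()}


# Toy per-document distributions, invented for illustration only.
toy_doc_topics = [
    [(0, 0.8), (9, 0.2)],
    [(0, 0.6), (3, 0.4)],
    [(9, 0.7), (0, 0.3)],
    [(3, 0.9), (9, 0.1)],
]
shares = dominant_topic_shares(toy_doc_topics)
for topic, share in shares.items():
    print(f"topic {topic}: {share:.0%} of documents")
```

Two caveats apply when comparing this with the plots. gensim numbers topics from 0, whereas pyLDAvis labels them from 1, so with `sort_topics=False` gensim topic 0 should correspond to topic 1 in the visualization. In addition, the circle sizes in pyLDAvis reflect each topic's share of tokens rather than of documents, so the two measures can differ when comment lengths vary.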

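import pickle
import warnings

import gensim
import gensim.corpora as corpora
import pyLDAvis.gensim_models
from IPython.display import display

# Suppress library warnings so they do not clutter the rendered page.
warnings.filterwarnings('ignore')

# Load the preprocessed (tokenized) YouTube comments.
with open("data/youtube_comments_500_preprocessed.pkl", "rb") as fr:
    preprocessed_comments = pickle.load(fr)

# Rebuild the dictionary and bag-of-words corpus that pyLDAvis needs.
# Drop tokens that occur in fewer than 20 comments or in more than 80% of them.
dictionary = corpora.Dictionary(preprocessed_comments)
dictionary.filter_extremes(no_below=20, no_above=0.8)
corpus = [dictionary.doc2bow(text) for text in preprocessed_comments]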

LDA model with 6 topics

lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_6.gensim")
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary, sort_topics=False)
display(pyLDAvis.display(lda_display))

LDA model with 10 topics

lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_10.gensim")
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

LDA model with 20 topics

lda_model = gensim.models.ldamodel.LdaModel.load("lda_models/lda_model_20.gensim")
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)