Analyzing Employee Satisfaction in Major Consulting Firms from Glassdoor Reviews — Part 3 (Topic Modeling & LDA)

Hyeon Gu Kim
7 min read · Feb 5, 2022
Photo by Austin Distel on Unsplash

Team Members: Lucy Hwang, Rhiannon Pytlak, Hyeon Gu Kim, Mario Gonzalez, Namit Agrawal, Sophia Scott, Sungho Park

In previous blogs, we scraped, preprocessed, and lemmatized the data. We also calculated lift scores, which gave us insight into which words tend to appear together (like "possibilities" and "career opportunities", for instance). Now we move further: in this blog, you will see how we applied topic modeling and Latent Dirichlet Allocation (LDA) to our data.

  1. Scrape consulting firms reviews from Glassdoor using Selenium
  2. Preprocess and lemmatize data
  3. Create word clouds
  4. Calculate lift scores
  5. Topic Modeling & Latent Dirichlet Allocation (LDA) (We’re here)
  6. Cosine Similarity
  7. Sentiment Analysis

What is Topic Modeling & LDA?

As always, let's start by defining what topic modeling and LDA are. First off, topic modeling is simply estimating the topic of a document. LDA is a topic modeling algorithm that automatically discovers "latent" topics from given sentences. For example, say one of our documents is "I like to eat broccoli and bananas". The topic of that sentence would be food. If a sentence is "Chinchillas and kittens are cute", then LDA will classify it under a cute-animals topic.

LDA is a topic modeling algorithm that automatically discovers "latent" topics from given sentences.

The reason LDA is called Latent "Dirichlet" Allocation is that the model is based on the Dirichlet distribution. To understand how LDA works, we first need to understand how natural language is viewed in the model. From a statistical viewpoint, documents are seen as probability distributions over topics, and topics are seen as probability distributions over words. The figure below shows the overall process of LDA:

https://www.oreilly.com/library/view/scala-machine-learning/9781788479042/b52dd4a0-3b72-43a4-b708-aeb2c2acbeb1.xhtml

Notice that the words that make up each topic are denoted as probabilities. LDA automatically calculates the probability of each word in each document and categorizes topics by those probabilities. A topic is nothing but a collection of dominant keywords that are typical representatives, and LDA considers each document as a mixture of topics in certain proportions.
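In symbols, this view is simply a mixture: the probability of seeing word $w$ in document $d$ decomposes over the $K$ topics,

$$p(w \mid d) = \sum_{t=1}^{K} p(w \mid t)\, p(t \mid d),$$

where the per-document topic proportions $p(t \mid d)$ and the per-topic word distributions $p(w \mid t)$ are each given Dirichlet priors, which is where the name comes from.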

LDA represents documents as mixtures of topics and spits out words with certain probabilities.

So to sum it up, LDA is a topic modeling algorithm that automatically infers the topics of each document from the probabilities of its words.

Creating Bag-of-Words Corpus using Gensim.corpora

Before LDA, there is a prerequisite: we needed to create a Bag-of-Words (BoW) corpus. In other words, we had to build a corpus that records each word of a document and its frequency. We first created a dictionary using Gensim's corpora.Dictionary(), which gives each tokenized word a unique ID. Next, the dictionary was used to create the BoW corpus via the doc2bow() function. As its name implies, doc2bow() converts documents to BoW so they can later serve as input to a topic model such as LDA. Here is the code snippet we used to create the BoW corpus:

BoW corpus code snippet
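Since the snippet above is an image, here is a minimal sketch of the same steps, assuming the lemmatized reviews live in a list of token lists (the name tokenized_docs below is hypothetical):

```python
import gensim.corpora as corpora

# Hypothetical stand-in for our lemmatized reviews: one token list per review
tokenized_docs = [
    ["great", "people", "good", "culture"],
    ["people", "smart", "clients", "demanding"],
]

# Assign a unique integer ID to every word
dictionary = corpora.Dictionary(tokenized_docs)

# Convert each document into a bag-of-words: a list of (word_id, frequency) pairs
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

print(dictionary.token2id)  # e.g. {'culture': 0, 'good': 1, 'great': 2, 'people': 3, ...}
print(bow_corpus[0])        # e.g. [(0, 1), (1, 1), (2, 1), (3, 1)]
```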

The process and the output are as follows:

In the dictionary creation phase, notice that a unique ID is assigned to each word. For example, ID 25 was given to the word "people", as the green arrow shows. In the BoW phase, I picked the first three lists from the BoW corpus as examples where the word "people" can be found. Lastly, the output corpus was formed; it holds each word's unique ID and its frequency, so the word "people" is represented as (25, 1). Because such a corpus full of numbers is hard for humans to read, we converted the IDs back to their words to make the output more readable and interpretable.
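That readable conversion is nearly a one-liner; a sketch using the objects from the snippet above:

```python
# Map each (word_id, frequency) pair back to (word, frequency) for readability
readable_corpus = [
    [(dictionary[word_id], freq) for word_id, freq in doc]
    for doc in bow_corpus
]
print(readable_corpus[0])  # e.g. [('culture', 1), ('good', 1), ('great', 1), ('people', 1)]
```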

Applying LDA using Gensim.models

Now we are ready to apply LDA! We used Gensim.models.ldamodel.LdaModel() to apply LDA to our data. Below is the code snippet of LDA and its result:

Code snippet of LDA and its result (number of topic = 5)
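Again, the original snippet is an image; here is a minimal sketch of the call, reusing bow_corpus and dictionary from above (the passes and random_state values are illustrative, not necessarily what we used):

```python
from gensim.models.ldamodel import LdaModel

lda_model = LdaModel(
    corpus=bow_corpus,   # bag-of-words representation of the reviews
    id2word=dictionary,  # maps word IDs back to words
    num_topics=5,        # number of latent topics to extract
    random_state=42,     # illustrative: fixes the seed for reproducibility
    passes=10,           # illustrative: passes through the corpus during training
)

# Print the top keywords and their weights for each of the 5 topics
for topic_id, keywords in lda_model.print_topics():
    print(topic_id, keywords)
```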

Notice that we passed our dictionary as the "id2word" parameter of the LDA function; this parameter maps word IDs back to words. The "num_topics" parameter is the number of requested latent topics to be extracted from the training corpus. Here, we first tried 5 topics. We then used something called the coherence score to evaluate the model.

Perplexity and coherence score

We also tried the perplexity score, which likewise measures how good the model is (the lower, the better). With 5 topics, we achieved a perplexity of -6.59 and a coherence score of 0.218. But we knew we could do better than this! Therefore, we experimented with different numbers of topics.
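For reference, both scores can be computed in a few lines with Gensim; a sketch using the model and data from the earlier snippets:

```python
from gensim.models import CoherenceModel

# Gensim reports a per-word log-likelihood bound, which is why the value is negative
perplexity = lda_model.log_perplexity(bow_corpus)

# c_v coherence (higher is better) needs the original tokenized texts
coherence_model = CoherenceModel(
    model=lda_model,
    texts=tokenized_docs,
    dictionary=dictionary,
    coherence="c_v",
)
coherence = coherence_model.get_coherence()

print(f"Perplexity: {perplexity:.2f}, Coherence: {coherence:.3f}")
```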

Hyperparameter Tuning

Below is the code snippet of hyperparameter tuning:

Code snippet for hyperparameter tuning
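In spirit, the tuning loop looks like the sketch below, reusing the names from the earlier snippets: train one model per topic count, score each with coherence, and keep the best.

```python
coherence_scores = []
topic_range = range(2, 41)  # try every topic count from 2 to 40

for k in topic_range:
    model = LdaModel(corpus=bow_corpus, id2word=dictionary,
                     num_topics=k, random_state=42, passes=10)
    cm = CoherenceModel(model=model, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    coherence_scores.append(cm.get_coherence())

best_k = topic_range[coherence_scores.index(max(coherence_scores))]
print(f"Best number of topics: {best_k}")  # 32 in our case
```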

We experimented with topic counts from 2 to 40 and found that 32 topics led to the highest coherence score. The graph of the hyperparameter tuning is as follows:

Hyperparameter tuning graph

We trained the same LDA model with 32 topics, and here are the model's scores:

Scores of LDA model after hyperparameter tuning

Hurray! We were able to improve our LDA model by tuning the number of topics. Now that we have optimized the model, let's take a closer look at its output.

Interpretation of LDA output

Output from LDA (Number of topics = 32)

There are 32 topics, and each topic is expressed as an equation that looks like a linear combination. Let's scrutinize the result. According to LDA, Topic 0 is represented as 0.153*"lots" + 0.106*"diversity" + 0.034*"clients" + 0.031*"inclusion". This means that "lots", "diversity", "clients", and "inclusion" are the top 4 keywords contributing to Topic 0. The coefficients are the weights on each keyword: the weight of "lots" on Topic 0 is 0.153, the weight of "work" on Topic 9 is 0.089, the weight of "flexibility" on Topic 26 is 0.095, and so forth. Each weight reflects how important a keyword is to its topic. From the weights and the keywords, we can estimate what the topics are. For instance, Topic 0 could be "learning opportunities", and Topic 26 could be "flexible and fast-paced environment".
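These weights can also be pulled out programmatically rather than parsed from the printed strings; a sketch:

```python
# Top 4 keywords and their weights for Topic 0
for word, weight in lda_model.show_topic(0, topn=4):
    print(f"{word}: {weight:.3f}")
```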

Visualization of LDA using pyLDAvis

There is a nice tool for visualizing the topic clusters: pyLDAvis. The package provides a really cool interactive visualization where you can browse through each topic cluster and see a histogram of its words along with each term's estimated frequency within the selected topic. Although I can't embed the interactive graph here, I took a screenshot of it:

Screenshot of interactive visualization from pyLDAvis
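Producing the visualization takes only a few lines; a sketch using the gensim adapter (the module path below is for pyLDAvis 3.x; older releases used pyLDAvis.gensim instead):

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Build the interactive visualization from the trained model
vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)

# Save it as a standalone HTML page you can open in any browser
pyLDAvis.save_html(vis, "lda_topics.html")
```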

Each bubble on the left-hand plot represents a topic. The larger the bubble, the more prevalent that topic is. A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart rather than clustered in one quadrant. A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart, which is our case. Topic 1 is selected in the screenshot above, and the right side of the graph shows the 30 most relevant terms for Topic 1. The blue bars indicate overall term frequency, and the red bars represent the estimated term frequency within Topic 1.

There is a large overlap near the center. When we examined those clusters, it turned out that most of those topics are related to people and clients. Topics 11, 6, 5, and 3, for example, all have "people" or "clients" among their keywords.

LDA vs PCA?

While revisiting this project, I realized that LDA is actually very similar to PCA in terms of reducing dimensionality. As you saw above, we were able to reduce the dimensionality of our data to 32 topics. Both PCA and LDA look for combinations of the features that best explain the data.

One caveat: the "LDA" in the classic LDA-vs-PCA comparison is usually Linear Discriminant Analysis, a different technique that happens to share the acronym. That LDA is supervised, while PCA is unsupervised; Latent Dirichlet Allocation, our topic model, is unsupervised as well. Note that PCA ignores class labels entirely, which can make its results harder to interpret.

Linear Discriminant Analysis focuses on finding a feature subspace that maximizes the separability between the groups.

PCA focuses on capturing the direction of maximum variation in the data set.

Conclusion & Thoughts

Please note that our data definitely needs some improvement. As you might have noticed, the LDA output contains many generic keywords like "great" and "nice". Because the data shown in this blog comes from "pros" reviews, it is natural for "great" and "nice" to appear often. In hindsight, I think we should have gone back and removed those obvious words so that the model could classify topics more cleanly. Examining the interactive graph above, I noticed that many topics overlap, which could well be caused by the words "great" and "nice". If we had more time, I would definitely try to fix this issue, perhaps as sketched below.
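One cheap fix would be to filter such domain-specific stopwords out of the token lists before rebuilding the dictionary and BoW corpus; a sketch reusing the earlier names (the stopword list is illustrative):

```python
# Illustrative list of generic review words to drop
domain_stopwords = {"great", "nice", "good", "lots"}

filtered_docs = [
    [token for token in doc if token not in domain_stopwords]
    for doc in tokenized_docs
]

# Rebuild the dictionary and BoW corpus on the filtered documents, then retrain LDA
dictionary = corpora.Dictionary(filtered_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in filtered_docs]
```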

Please click here for Part 4 (Cosine Similarity) of this project!
