Analyzing Employee Satisfaction in Major Consulting Firms from Glassdoor Reviews — Part 1 (Scraping, Preprocessing & Lemmatization, and Word Clouds)

Hyeon Gu Kim
6 min read · Jan 22, 2022
Photo by Austin Distel on Unsplash

Team Members: Lucy Hwang, Rhiannon Pytlak, Hyeon Gu Kim, Mario Gonzalez, Namit Agrawal, Sophia Scott, Sungho Park

Every company promises core values to its employees. But how do employees truly feel about the company they work for? Are those values actually practiced and upheld? In this project, our team scraped Glassdoor reviews of five major consulting firms (BCG, Deloitte, EY, KPMG, PwC) to check whether each company practices what it preaches. Here are the steps we took to tackle the problem:

  1. Scrape consulting firm reviews from Glassdoor using Selenium
  2. Preprocess and lemmatize data
  3. Create word clouds
  4. Calculate lift scores
  5. Perform topic modeling with Latent Dirichlet Allocation (LDA)
  6. Compute cosine similarity
  7. Run sentiment analysis

Now that we have the steps listed, let’s dive into the project!

1. Scrape consulting firm reviews from Glassdoor using Selenium

In order to analyze employee satisfaction from Glassdoor reviews, we obviously need to scrape those reviews from Glassdoor first! Because there are far too many pages and reviews to scrape manually, we implemented an automatic scraper using Selenium, a tool that automates web browsers. Along with Selenium, we imported the XLWT library to write the scraped data into an Excel spreadsheet.

We took reference from the following link for scraping: https://github.com/robertlandlord/glassdoor_scraper

Code snippet of scraper
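For orientation, here is a rough sketch of what the scraper's skeleton could look like; the URL, sheet layout, and column names below are placeholders rather than the exact code we used:

# Sketch of the scraper setup, using the Selenium 3-style API from this project.
# The URL and column headers are illustrative placeholders.
from selenium import webdriver
import xlwt

driver = webdriver.Chrome()
driver.get("https://www.glassdoor.com/Reviews/index.htm")  # placeholder URL

# Prepare an Excel workbook to hold the scraped reviews.
workbook = xlwt.Workbook()
sheet = workbook.add_sheet("reviews")
headers = ["Date", "Title", "Rating", "Status", "Pros", "Cons"]
for col, header in enumerate(headers):
    sheet.write(0, col, header)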

After logging into Glassdoor through Selenium, we could inspect the HTML of each page. For example, say we have the following review; its HTML looks like this:

An example of a review and its HTML snippet

Under the “mb-xxsm.mt-0.css-5j5djr” classes, we scraped all the review titles on the current page using Selenium’s “find_elements_by_class_name” method. Note that this method returns a list of WebElements rather than plain strings.

titles = driver.find_elements_by_class_name("mb-xxsm.mt-0.css-5j5djr")
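Because the returned items are elements rather than strings, extracting the actual title text looks roughly like this:

# Each WebElement carries the rendered text of the matched node.
title_texts = [title.text for title in titles]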

Similarly, we performed the same procedure to scrape the other elements of each review, such as the timestamp, rating, and employment status. Then, for each full review, we scraped the necessary information using the following code:

Code snippet for scraping each full review
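As a hedged sketch of that per-review loop (the class names below are placeholders, not Glassdoor’s real markup, which changes frequently):

# Placeholder class names; the real ones must be read from Glassdoor's current HTML.
reviews = driver.find_elements_by_class_name("review-container")

row = 1
for review in reviews:
    title = review.find_element_by_class_name("review-title").text
    rating = review.find_element_by_class_name("review-rating").text
    pros = review.find_element_by_class_name("review-pros").text
    cons = review.find_element_by_class_name("review-cons").text
    for col, value in enumerate([title, rating, pros, cons]):
        sheet.write(row, col, value)
    row += 1

workbook.save("glassdoor_reviews.xls")  # placeholder filename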

That’s it (at least for the scraping part)! After repeating the process above for each page (our scraper covered 200 to 300 pages), we inserted the scraped information into an Excel spreadsheet, which ended up looking like this:

Resulting Excel spreadsheet from scraping

2. Preprocess and lemmatize data

After successfully scraping the reviews from Glassdoor, we need to preprocess and lemmatize the data so that it is ready to be analyzed! Note that we used the NLTK library for handling the text.

First, we read the scraped Excel spreadsheet using Pandas, which gives us a data frame like the one below:

Pandas data frame of scraped Glassdoor reviews about PwC
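Reading the spreadsheet back in is essentially a one-liner; the filename here is a placeholder:

import pandas as pd

# xlwt produces .xls files; reading them requires the xlrd package.
df = pd.read_excel("glassdoor_reviews.xls")
df.head()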

We had 11 columns, and we mainly focused on the “Pros” and “Cons” columns. Let’s see what the values of these columns look like and figure out whether anything needs treatment. Below is one example from the “Pros” column:

“The people, you will gain a ton of experience at a very fast rate, flexible working hours, work from home / hybrid environment, and is a job that constantly challenges you.”

First, we removed all punctuation, commas, extra whitespace, slashes, and so on, and converted the string to lowercase. The resulting string looks like this:

“the people you will gain a ton of experience at a very fast rate flexible working hours work from home hybrid environment and is a job that constantly challenges you”
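A minimal sketch of this cleaning step using Python’s built-in re module (the exact implementation may have differed):

import re

def basic_clean(text):
    text = text.lower()                        # lowercase everything
    text = re.sub(r"[^\w\s]", " ", text)       # strip punctuation, commas, slashes
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace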

Since our main goal is to analyze employee satisfaction, we didn’t need every word of each sentence, only the keywords that represent it. To extract those keywords, we used the stopwords corpus from nltk.corpus to remove all stopwords in the sentence (e.g. “a”, “the”, “is”, “are”, “because”, “should”, etc.), removed any remaining punctuation (e.g. !”#$%&’()*+,-./:;<=>?@[\]^_`{|}~) using string.punctuation, and then used word_tokenize from nltk.tokenize to tokenize each word. Below is the code snippet and the result of this procedure:

Removing stopwords and punctuation, and tokenizing words using NLTK library
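Roughly, that step looks like the following with NLTK (note the one-time downloads of the tokenizer models and the stopword list):

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of NLTK's tokenizer models and stopword list.
nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def extract_keywords(text):
    # Tokenize a cleaned sentence and drop stopwords and stray punctuation.
    return [token for token in word_tokenize(text)
            if token not in stop_words and token not in string.punctuation]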

Now we can understand what the reviewer is trying to say with fewer words! We repeated this process for all sentences (pros and cons) and created two new columns: pros_clean and cons_clean:

Dataframe with pros_clean and cons_clean
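Applying the cleaning and keyword extraction to both columns is a straightforward Pandas apply; the column names assume the spreadsheet shown above:

# Build the two new token-list columns from the raw review text.
df["pros_clean"] = df["Pros"].astype(str).apply(lambda s: extract_keywords(basic_clean(s)))
df["cons_clean"] = df["Cons"].astype(str).apply(lambda s: extract_keywords(basic_clean(s)))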

Next, we manually selected some attributes from the most frequent words for the lemmatization process. What is lemmatization? Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Unlike stemming, lemmatization reduces inflected words properly, ensuring that the root word actually belongs to the language. Here is an example showing the difference between lemmatization and stemming:

Image from https://medium.com/geekculture/introduction-to-stemming-and-lemmatization-nlp-3b7617d84e65

For lemmatization, we made a dictionary that holds keywords for each of the five main attributes of a company: work life balance, culture value, career opportunity, company benefit, and senior management.

Dictionary of five main attributes of a company and its keywords
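The keyword lists below are illustrative stand-ins; the actual dictionary was assembled by hand from the most frequent words in the reviews:

# Illustrative keyword lists only; the real dictionary was built manually
# from the top frequent words in the scraped reviews.
attribute_keywords = {
    "work life balance":  ["balance", "hours", "flexible", "flexibility", "workload"],
    "culture value":      ["culture", "values", "people", "environment", "diversity"],
    "career opportunity": ["career", "growth", "promotion", "learning", "opportunities"],
    "company benefit":    ["pay", "salary", "benefits", "compensation", "perks"],
    "senior management":  ["management", "managers", "leadership", "partners", "seniors"],
}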

Using the above dictionary, we replaced a word with the corresponding key of the dictionary whenever the word appeared in that key’s list of values. Below is a summary of the preprocessing, tokenization, and lemmatization procedure and the resulting words:

Preprocessing, tokenization, lemmatization procedure
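A sketch of that replacement step; the inverted-lookup helper here is an assumption for illustration, not necessarily the exact implementation:

# Invert the dictionary once so each keyword maps back to its attribute.
word_to_attribute = {word: attribute
                     for attribute, words in attribute_keywords.items()
                     for word in words}

def lemmatize_tokens(tokens):
    # Replace any token found in the dictionary with its attribute name.
    return [word_to_attribute.get(token, token) for token in tokens]

df["pros_clean"] = df["pros_clean"].apply(lemmatize_tokens)
df["cons_clean"] = df["cons_clean"].apply(lemmatize_tokens)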

Lastly, we put all of our sentences (Pros & Cons) together to create a word cloud for each company. Using the WordCloud library, we could generate them easily. Since PwC has been the running example of this post, here is the word cloud of PwC based on its Glassdoor reviews:

Word cloud of PwC Glassdoor reviews
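Generating the cloud itself takes only a few lines with the WordCloud library (assuming the pros_clean and cons_clean token lists from earlier):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Flatten the token lists from both columns into one big string.
tokens = [t for toks in df["pros_clean"] for t in toks] + \
         [t for toks in df["cons_clean"] for t in toks]
text = " ".join(tokens)

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()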

In this part, we covered the first three steps: scraping Glassdoor reviews, preprocessing and lemmatizing the data, and creating word clouds. Although we can already derive some insights about how the employees of each company feel by looking at the lemmatized word lists and the word clouds, we can definitely do more! Now that all the data is ready, it’s time to analyze it and measure employee satisfaction with different approaches: lift scores, topic modeling & LDA, and sentiment analysis.

Click here for Part 2 (Calculating Lift) of the project!
