Analyzing Employee Satisfaction in Major Consulting Firms from Glassdoor Reviews, Part 1 (Scraping, Preprocessing & Lemmatization, and Word Clouds)
Team Members: Lucy Hwang, Rhiannon Pytlak, Hyeon Gu Kim, Mario Gonzalez, Namit Agrawal, Sophia Scott, Sungho Park
Every company promises core values to employees. However, how do employees truly feel about the company they work for? Are company values being practiced and upheld? In this project, our team decided to scrape Glassdoor reviews of five major consulting firms (BCG, Deloitte, EY, KPMG, PwC) and check whether each company practices what it preaches. Here are the steps we took to tackle the problem:
- Scrape consulting firms reviews from Glassdoor using Selenium
- Preprocess and lemmatize data
- Create word clouds
- Calculate lift scores
- Topic Modeling & Latent Dirichlet Allocation (LDA)
- Cosine Similarity
- Sentiment Analysis
Now that we have the steps listed, let’s dive into the project!
1. Scrape consulting firms reviews from Glassdoor using Selenium
In order to analyze employee satisfaction from Glassdoor reviews, we first need to scrape those reviews from Glassdoor! Because there are far too many pages and reviews to collect by hand, we decided to build an automated scraper using Selenium, a tool that automates web browsers. Along with Selenium, we also imported the xlwt library so that we could write the scraped data into an Excel spreadsheet.
We used the following repository as a reference for our scraper: https://github.com/robertlandlord/glassdoor_scraper
After logging in to Glassdoor through Selenium, we could inspect the HTML of the page. For example, say we have the following review. The HTML of the review is as follows:
We scraped all the titles on the current review page from the elements with the classes “mb-xxsm.mt-0.css-5j5djr”. We used Selenium’s “find_elements_by_class_name” method to find elements by class name. This method returns a list of WebElement objects, one per title; the title string itself is available through each element’s text attribute.
titles = driver.find_elements_by_class_name("mb-xxsm.mt-0.css-5j5djr")
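For readers on a recent Selenium 4 release, where the find_elements_by_* helpers are no longer available, the same lookup could be sketched roughly as below. This is an illustration rather than our original scraper: the function name scrape_titles is hypothetical, the class names are the ones quoted above and may no longer match Glassdoor's markup, and the locator string "css selector" is the value of Selenium's By.CSS_SELECTOR constant, written out so the sketch has no imports.

```python
# Class names quoted in the article; Glassdoor's markup may have changed since.
TITLE_SELECTOR = ".mb-xxsm.mt-0.css-5j5djr"


def scrape_titles(driver):
    """Return the visible text of every review title on the current page.

    `driver` is assumed to be a logged-in Selenium WebDriver (hypothetical
    setup, not shown here).
    """
    # "css selector" is the string behind selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", TITLE_SELECTOR)
    # Each item is a WebElement; .text holds the rendered title string.
    return [element.text.strip() for element in elements]
```

Calling scrape_titles(driver) once per page yields a plain list of strings that can be written to the spreadsheet directly.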
Similarly, we followed the same procedure to scrape other elements such as the timestamp, ratings, and status of each review. Then, for each full review, we scraped the necessary information using the following code:
That’s it (at least for the scraping part)! After repeating the above process for each page (our scraper scraped from 200 to 300 pages), we inserted the scraped information into the Excel spreadsheet, and now we have a spreadsheet as follows:
2. Preprocess and lemmatize data
After successfully scraping the reviews from Glassdoor, we need to preprocess and lemmatize the data so that it is ready to be analyzed! Please note that we used the NLTK library for handling the text.
First, we read the scraped Excel spreadsheet with pandas, which returns a data frame like the one below:
We had 11 columns and mainly focused on the “Pros” and “Cons” columns. Let’s see what the values in these columns look like and figure out whether anything needs treatment. Below is one example from the “Pros” column:
“The people, you will gain a ton of experience at a very fast rate, flexible working hours, work from home / hybrid environment, and is a job that constantly challenges you.”
First, we removed all punctuation (commas, slashes, etc.) and extra whitespace, and converted the string to lowercase. The resulting string looks like this:
“the people you will gain a ton of experience at a very fast rate flexible working hours work from home hybrid environment and is a job that constantly challenges you”
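A minimal sketch of this cleaning step, using only the Python standard library, might look like the following. The function name clean_text is our own label for illustration, and replacing each punctuation character with a space before collapsing whitespace is one reasonable way to get the result shown above, not necessarily the exact code we ran.

```python
import re
import string

# Map every ASCII punctuation character to a space so that words joined
# by a slash or comma do not get glued together.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Lowercase the text, drop punctuation, and collapse extra whitespace."""
    text = text.lower().translate(_PUNCT_TO_SPACE)
    # Collapse runs of whitespace left behind by the removed punctuation.
    return re.sub(r"\s+", " ", text).strip()
```

Applied to the example review above, clean_text returns the lowercase, punctuation-free string shown.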
Since our main goal is to analyze employee satisfaction, we didn’t need all the words in each sentence. We only needed the keywords that can represent the whole sentence. To extract these keywords, we used the stopwords corpus from nltk.corpus to remove all the stopwords in the sentence (e.g. “a”, “the”, “is”, “are”, “because”, “should”, etc.) and removed any remaining punctuation (e.g. !”#$%&’()*+,-./:;<=>?@[\]^_`{|}~) using string.punctuation. Then we used word_tokenize from nltk.tokenize to split each sentence into individual words. Below is the code snippet and the result of this procedure:
Now we can understand what the reviewer is trying to say with fewer words! We repeated this process for all sentences (pros and cons) and created two new columns: pros_clean and cons_clean:
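The keyword-extraction step described above can be sketched without any downloads as follows. To keep the example self-contained, it swaps in a small hand-picked stopword set (DEMO_STOPWORDS, a stand-in we made up for illustration) for NLTK's English list, and uses str.split in place of word_tokenize, which is a fair approximation only because punctuation has already been stripped.

```python
# Tiny stand-in for NLTK's English stopword list (assumption, for demo only).
DEMO_STOPWORDS = frozenset({
    "a", "the", "is", "are", "because", "should", "you", "will",
    "of", "at", "very", "from", "and", "that",
})


def extract_keywords(cleaned_text, stopwords=DEMO_STOPWORDS):
    """Split an already-cleaned string into tokens and drop stopwords."""
    tokens = cleaned_text.split()  # stand-in for nltk's word_tokenize
    return [token for token in tokens if token not in stopwords]
```

With NLTK installed and its stopwords data downloaded, the same function could be called with stopwords=set(nltk.corpus.stopwords.words("english")) to use the full list.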
Next, we manually picked some attributes from the most frequent words for the lemmatization process. What is lemmatization? Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Unlike stemming, which simply chops off word endings, lemmatization reduces inflected words to a root form that is an actual word in the language. Here is an example of lemmatization and stemming and their difference:
For lemmatization, we made a dictionary that holds keywords for each of the five main attributes of a company: work life balance, culture value, career opportunity, company benefit, senior management.
Using the above dictionary, we replaced a word with one of the keys of the dictionary whenever the word appears among that key’s values. Below is a summary of the preprocessing, tokenization, and lemmatization procedure and the resulting words:
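As a rough illustration of this replacement step, the sketch below inverts an attribute-to-keywords dictionary into a keyword-to-attribute lookup and applies it token by token. The keyword lists are hypothetical examples chosen for the demo, not the dictionary we actually built, and the names build_lookup and map_to_attributes are likewise our own labels.

```python
# Hypothetical keyword lists for illustration; the real dictionary differs.
ATTRIBUTE_KEYWORDS = {
    "work_life_balance": ["hours", "flexible", "overtime"],
    "culture_value": ["people", "culture", "environment"],
    "career_opportunity": ["experience", "learning", "promotion"],
    "company_benefit": ["salary", "benefits", "bonus"],
    "senior_management": ["management", "partners", "leadership"],
}


def build_lookup(attribute_keywords):
    """Invert {attribute: [keywords]} into {keyword: attribute}."""
    return {
        keyword: attribute
        for attribute, keywords in attribute_keywords.items()
        for keyword in keywords
    }


def map_to_attributes(tokens, lookup):
    """Replace each token with its attribute; leave unknown tokens as-is."""
    return [lookup.get(token, token) for token in tokens]


LOOKUP = build_lookup(ATTRIBUTE_KEYWORDS)
```

Building the inverted lookup once keeps each replacement a single dictionary access instead of a scan over every keyword list.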
Lastly, we decided to put all of our sentences (Pros & Cons) together to create a word cloud for each company, which the WordCloud library made easy. Since PwC has been the running example in this blog, here is the word cloud of PwC based on its Glassdoor reviews:
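The word cloud is driven by how often each word appears, so the step of pooling pros and cons can be sketched as a simple frequency count. This is an illustrative sketch: the function name word_frequencies is our own, and it assumes each review has already been reduced to a list of tokens as in the earlier steps.

```python
from collections import Counter
from itertools import chain


def word_frequencies(pros_tokens, cons_tokens):
    """Count token occurrences across all pros and cons token lists."""
    counts = Counter()
    # Each element of pros_tokens / cons_tokens is one review's token list.
    for tokens in chain(pros_tokens, cons_tokens):
        counts.update(tokens)
    return dict(counts)
```

A dictionary of this shape can be passed to the WordCloud library's generate_from_frequencies method to render the image.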
In this part, we covered the first three steps: scraping Glassdoor reviews, preprocessing and lemmatizing the data, and creating word clouds. Although we can derive some insights about how employees feel about their companies by looking at the lemmatized word lists and the word clouds, we can definitely do more! Now that we have all the data ready, it’s time to analyze it and measure employees’ satisfaction with their companies using different approaches: lift, topic modeling & LDA, and sentiment analysis.