Web Scraping Employee Happiness
If you are in the finance space, you are surely already quite aware of environmental, social, and governance (ESG) investing. If you are not, it is an incredibly exciting and relatively new development that aims to add historically qualitative factors to financial analysis. For this blog post, I will be diving into the popular ESG-related concept of employee happiness. My personal value-driven and long-term-focused investment philosophy aligns well with the fundamentals of ESG, and therefore I would like to explore the merit of some ESG metrics that might work well for a trading bot down the line. I will start this exploration by quantifying employee happiness.
Quantifying Employee Happiness
- Input: list of stock company names
- Output: same list of companies ranked in order of happiest employees to unhappiest employees based on a composite score
The resulting table in this analysis can be used in a variety of ways. It can be a tool to filter companies for investment, employment, or partnership. For my investment purposes, I will most likely be using the results to aid in a company filtering process.
Unlike the moving average strategy post I wrote previously, the task set out here required a lot of assumptions. Employee happiness cannot be drilled down into a single metric, yet for this project, that is exactly what I am attempting to do. Therefore, many assumptions were made to produce the resulting composite score.
To simplify this process, I used the first 70 companies listed in alphabetical order from the S&P 500. I wanted enough results to make a solid case for comparison, but, given the time delays I built into the scraper, I didn’t want it to take forever. That said, these scrapers can be used to get data on any company as long as the sites used have available data. Here’s the code I used to grab the company names:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
soup = BeautifulSoup(page.content, 'html.parser')

# The constituents table lists the ticker first and the company name second
table = soup.find(id='constituents').tbody
table_rows = table.find_all('tr')

companies = []
for row in table_rows[1:71]:  # skip the header row, keep the first 70 companies
    elements = row.find_all('td')
    companies.append(elements[1].text.strip())
For anybody familiar with employee review sites, Glassdoor is widely considered the market leader. The strength of its offering, however, comes in the form of text-based employee reviews. Other sites provide far more numerical data for the companies in question. I found that Indeed and Comparably both offer extensive numerical data about a wide range of companies, so I went with those two options. Down the line, I may well use text-based reviews to complement this analysis. If you are interested in web scraping Glassdoor, the sign-in pop-up is the challenge. Within this project notebook, I have a commented-out Glassdoor scraper that can bypass this issue.
Here is where the largest assumptions had to be made. There were two issues to overcome. (1) For each company I was scraping about 20 metrics. To condense these into a single composite score, I grouped the metrics into five categories: company culture, company opportunity, company perks and benefits, company executive team, and company employee treatment. Each category combined three or more of the original metrics, weighted equally, and the final composite score equally weights all five categories. The composite score therefore runs on the assumption that employee happiness can be summed up by equally weighting those five categories. For the purposes of this exploration project, that calculation will have to do. (2) Not all metrics were available for all companies. To address this problem, each calculation reflects only the metrics that were available, so the final composite scores are not necessarily built from the same underlying metrics.
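To make that weighting scheme concrete, here is a minimal sketch of the calculation. The category names come from the post, but the metric values are hypothetical, and `None` stands in for a metric the scraper could not find:

```python
# Hypothetical metrics for one company, grouped into the five categories;
# None marks a metric that was unavailable on Indeed or Comparably.
categories = {
    'culture': [4.1, 3.8, 4.0],
    'opportunity': [3.5, None, 3.9],
    'perks_benefits': [4.2, 4.4, 4.1],
    'executive_team': [3.0, 3.2, None],
    'employee_treatment': [None, None, None],  # often entirely missing
}

def category_score(metrics):
    """Equal-weight average of whichever metrics are available."""
    available = [m for m in metrics if m is not None]
    return sum(available) / len(available) if available else None

def composite_score(categories):
    """Equal-weight average of the category scores that could be computed."""
    scores = [category_score(m) for m in categories.values()]
    available = [s for s in scores if s is not None]
    return sum(available) / len(available)

print(round(composite_score(categories), 2))  # -> 3.75
```

Because unavailable metrics and categories simply drop out of the averages, two companies' composite scores can rest on different underlying data, which is exactly the caveat described above.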
The main challenge I faced was getting my IP address blocked while web scraping. The article below was extremely helpful. My biggest takeaway from being blocked a few times is that it is crucial to stay patient. If you are in a rush while web scraping, you are setting yourself up for trouble, as so many sites now have protections in place against fast-paced, automated behavior. Be patient and be random!
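As one illustration of "be patient and be random" (a sketch, not the exact delays used in this project), a helper like this inserts a randomized pause between requests so the scraper never hits a site at a machine-like constant rate:

```python
import random
import time

def random_delay(min_s=2.0, max_s=6.0):
    """Sleep for a random interval between min_s and max_s seconds.

    Varying the gap between requests avoids the fixed-interval
    pattern that rate-limiting protections detect most easily.
    """
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Call random_delay() after each requests.get(...) in the scraping loop.
```

The 2–6 second bounds are an assumption; tune them to whatever the target site tolerates.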
5 strategies to write unblock-able web scrapers in Python
Below are the first 20 companies in the dataframe, listed in descending order by final composite score:
A couple of things to note about the final product. (1) The data does hold up: after some separate digging to confirm the results, I found the information in the dataframe to be generally quite accurate. (2) The company employee treatment metric is missing for most companies. Even so, it is worth keeping in, since it adds accuracy to the final composite score for the companies that do report it.
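The ranking shown above is just a sort on the composite column. A minimal pandas sketch, with hypothetical company scores, looks like this:

```python
import pandas as pd

# Hypothetical slice of the results; the real frame holds ~70 companies
# plus the five category columns alongside the composite score.
df = pd.DataFrame({
    'company': ['3M', 'Abbott Laboratories', 'Adobe'],
    'composite_score': [3.61, 3.74, 4.02],
})

# Rank from happiest to unhappiest by composite score.
ranked = df.sort_values('composite_score', ascending=False).reset_index(drop=True)
print(ranked.head(20))
```

Resetting the index keeps the row labels running 0, 1, 2, … in ranked order, which makes the top-20 slice read cleanly.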
While far from perfect, the project was a success. That said, the code certainly needs a lot of optimization, and I may move toward text-based reviews going forward. With that in mind, if anybody reading this has suggestions for improving any element of this project, please reach out; I would love to discuss. Finally, below I have included my code for scraping Indeed and Comparably, along with the code that gets you into Glassdoor. Feel free to check out the full notebook here.