Web Scraping Corporate Governance and Diversity & Inclusion Data


This post covers one more round of web scraping. My past few blog posts have detailed how I have been acquiring data up to this point: environmental, employee happiness, FCF and ROIC data have all now been collected. Here, I discuss the last two pieces of the puzzle: corporate governance and diversity & inclusion data. The puzzle is admittedly a long way from complete, but in the coming weeks I will apply some data science techniques to the scraped data to see whether there are any interesting correlations and relationships to share. The remainder of this post focuses entirely on corporate governance and diversity & inclusion data.

Quantifying Corporate Governance and Diversity & Inclusion Data

  • Input: list of company names and stock tickers
  • Output: (1) data frame of the same companies ranked from best to worst on corporate governance data; (2) data frame of the same companies ranked from best to worst on diversity & inclusion data.
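To make the input and output concrete, here is a minimal sketch of the ranking step. It assumes each company already has a single numeric score and that a higher score is better. The function name `rank_companies`, the placeholder tickers and the scores are all made up for illustration, and plain Python is used instead of a data frame to keep the example self-contained.

```python
from typing import Dict, List, Tuple


def rank_companies(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Return (rank, ticker, score) rows sorted from best (highest) score to worst."""
    # Sort tickers by score, highest first (assumption: higher means better).
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Attach a 1-based rank to each row.
    return [(rank, ticker, score) for rank, (ticker, score) in enumerate(ordered, start=1)]


# Placeholder tickers and made-up scores, for illustration only.
example_scores = {"AAA": 0.25, "BBB": 0.40, "CCC": 0.10}

for rank, ticker, score in rank_companies(example_scores):
    print(rank, ticker, score)
```

The resulting rows could be handed to a data frame constructor with the columns rank, ticker and score if a data frame is preferred.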


Because I am bringing two data sets into the same post, I will break them up accordingly.

Corporate Governance

For corporate governance, I pulled board diversity data. While I plan on expanding this to encompass further metrics, such as executive compensation relative to median salary and the presence of super-voting share structures, I will start with just board diversity. You can check out the site where I got the data here.
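The board diversity figure reported later in this post is the percentage of women on each board. As a small sketch of that calculation, assuming the scraped data gives a count of women and a total board size (the function name and the numbers are hypothetical):

```python
def percent_women_on_board(women: int, board_size: int) -> float:
    """Return the share of board seats held by women, as a percentage."""
    if board_size <= 0:
        raise ValueError("board_size must be positive")
    if not 0 <= women <= board_size:
        raise ValueError("women must be between 0 and board_size")
    return 100.0 * women / board_size


# Made-up example: 3 women on a 12-person board.
print(percent_women_on_board(3, 12))  # 25.0
```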

Diversity & Inclusion

Diversity & inclusion will also certainly need to be expanded upon. But, for the purpose of this post, I rated companies on this metric by counting the number of times “diversity” or “inclusion” appeared in their DEF 14A filings (definitive proxy statements). Here was my thinking: because I looked back 5 years, companies that have considered diversity and inclusion an essential part of their business would likely have a higher mention count than those that have not weighed it as heavily. One large shortcoming of this strategy is that there is no way to weed out companies that are verbally greenwashing in their DEF 14A filings. Companies may see value in simply talking about diversity & inclusion while not actually doing much about it in practice.
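The counting itself can be sketched in a few lines. This is not necessarily how my scraper does it; it assumes the text of each filing has already been downloaded as a string, and the names `count_mentions` and `total_mentions` and the sample text are made up for illustration. Matching is case-insensitive and limited to the whole words "diversity" and "inclusion", so related forms such as "diverse" or "inclusive" are not counted.

```python
import re
from typing import Iterable

# Whole-word, case-insensitive match for the two terms of interest.
MENTION_PATTERN = re.compile(r"\b(diversity|inclusion)\b", re.IGNORECASE)


def count_mentions(filing_text: str) -> int:
    """Count occurrences of 'diversity' or 'inclusion' in one filing."""
    return len(MENTION_PATTERN.findall(filing_text))


def total_mentions(filings: Iterable[str]) -> int:
    """Sum the mention counts over several filings (e.g. 5 years of DEF 14As)."""
    return sum(count_mentions(text) for text in filings)


# Made-up sample text standing in for downloaded filings.
sample_filings = [
    "Our board values diversity. Diversity and inclusion guide our hiring.",
    "No relevant language appears in this filing.",
]

print(total_mentions(sample_filings))  # 3
```

A raw count like this favors longer filings, so dividing by the filing's word count would be one way to make companies more comparable.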


Below is the data pulled for the percentage of females on the board of the companies in question:

Below is the number of “diversity” and “inclusion” mentions for the same companies:


The web scrapers detailed in the last few posts will be used to scrape available data for all companies in the S&P 500. Further feature engineering will allow me to create a single overarching metric for ranking the S&P 500 companies. Stay tuned for the next post, which will break down these companies by rating. I may also try to line these metrics up with the ESG metrics provided by MSCI, Sustainalytics, Refinitiv and S&P Global to see how they all compare. Until next time.




Michael Wirtz
