Web Scraping Corporate Governance and Diversity & Inclusion Data

Michael Wirtz
3 min read · Mar 20, 2021



Here we will dive into one more round of web scraping. My past few blog posts have detailed how I have been acquiring data up to this point: environmental, employee happiness, FCF and ROIC data have all now been collected. This post covers the last two pieces of the puzzle, corporate governance and diversity & inclusion data. The puzzle is admittedly a long way from complete, but in the coming weeks I will apply different data science techniques to the scraped data to see whether there are any interesting correlations and relationships to share. The remainder of this post focuses entirely on corporate governance and diversity & inclusion data.

Quantifying Corporate Governance and Diversity & Inclusion Data

  • Input: list of company names and stock tickers
  • Output: (1) data frame of the same companies ranked from best to worst on corporate governance data; (2) data frame of the same companies ranked from best to worst on diversity & inclusion data.
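
As a rough sketch of that input/output shape, the ranking step could look like the following in pandas. The column names (`pct_female_board`, `di_mentions`), the helper `rank_companies`, and the placeholder companies and values are all assumptions made for illustration; they are not the actual scraped data or the scraper code from earlier posts. The sketch also assumes that a higher value means a better score on both metrics.

```python
import pandas as pd


def rank_companies(df: pd.DataFrame, score_col: str) -> pd.DataFrame:
    """Return a copy of df sorted best-to-worst by score_col, with a 1-based rank."""
    # Assumption: higher values are better, so sort in descending order
    ranked = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    # method="min" gives tied companies the same (best) rank
    ranked["rank"] = ranked[score_col].rank(ascending=False, method="min").astype(int)
    return ranked


# Placeholder companies and made-up values, for illustration only
companies = pd.DataFrame({
    "company": ["Company A", "Company B", "Company C"],
    "ticker": ["AAA", "BBB", "CCC"],
    "pct_female_board": [25.0, 40.0, 33.3],
    "di_mentions": [12, 7, 30],
})

governance_ranking = rank_companies(companies, "pct_female_board")
di_ranking = rank_companies(companies, "di_mentions")

print(governance_ranking[["ticker", "pct_female_board", "rank"]])
print(di_ranking[["ticker", "di_mentions", "rank"]])
```

Keeping the two rankings as separate data frames mirrors the two outputs listed above and leaves the question of combining them for later.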


Because this post covers two data sets, I will discuss each one separately.

Corporate Governance

For corporate governance, the data I pulled was board diversity, measured as the percentage of women on each company's board. I plan to expand this to further metrics, such as executive compensation relative to median salary and the presence of super-voting share structures, but I will start with just board diversity. You can check out the site where I got the data here.

Diversity & Inclusion

Diversity & inclusion will certainly need to be expanded upon as well. For the purpose of this post, though, I rated companies on this metric by counting the number of times “diversity” or “inclusion” appeared in their DEF 14A filings (definitive proxy statements). Here was my thinking: because I looked 5 years back, companies that have considered diversity and inclusion an essential part of their business would likely have a higher mention count than those that have not weighed it as heavily. One large shortcoming of this strategy is that there is no way to weed out companies that are verbally greenwashing in their DEF 14A filings. Companies may see value in simply talking about diversity & inclusion while not actually doing much about it in practice.
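
The counting idea can be sketched in a few lines of Python. This is a minimal illustration, not the scraper used for this post: it assumes the filing text has already been downloaded and converted to plain text, and the function names (`count_mentions`, `total_mentions`) and the stand-in snippets are hypothetical.

```python
import re
from collections import Counter

TERMS = ("diversity", "inclusion")


def count_mentions(text: str, terms=TERMS) -> Counter:
    """Count case-insensitive whole-word occurrences of each term in text."""
    counts = Counter()
    for term in terms:
        # \b keeps the match to whole words, so "inclusive" is not counted
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts


def total_mentions(filings_by_year: dict) -> int:
    """Sum mentions of all terms across every filing in the window."""
    return sum(sum(count_mentions(text).values()) for text in filings_by_year.values())


# Stand-in snippets keyed by year, not real filing text
filings = {
    2019: "Our board values diversity. Diversity and inclusion guide hiring.",
    2020: "The committee reviewed inclusion goals.",
}

print(count_mentions(filings[2019]))
print(total_mentions(filings))
```

Whole-word matching is a design choice here: it ignores related words such as “diverse” or “inclusive”, which keeps the count close to the two terms named above but may undercount companies that phrase things differently.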


Below is the data pulled for the percentage of women on the boards of the companies in question:

Below is the number of “diversity” and “inclusion” mentions for the same companies:


The web scrapers detailed in the last few posts will be used to scrape the available data for all companies in the S&P 500. Further feature engineering will allow me to create a single overarching metric for ranking the S&P 500 companies. Stay tuned for the next post, which will look to break down these companies by rating. I may also try to line these metrics up with the ESG metrics provided by MSCI, Sustainalytics, Refinitiv and S&P Global to see how they compare. Until next time.
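
The feature engineering for that overarching metric has not been decided yet, but one possible approach is to scale each metric to a common 0-1 range and take a weighted average. The sketch below is only an assumption about how that could work: the min-max scaling, the equal weights, the function names and the sample values are all hypothetical and chosen for illustration.

```python
def min_max(values):
    """Scale a list of numbers to the 0-1 range; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(metrics, weights):
    """Weighted average of min-max scaled metrics, one score per company."""
    n = len(next(iter(metrics.values())))
    scaled = {name: min_max(vals) for name, vals in metrics.items()}
    total_weight = sum(weights[name] for name in metrics)
    return [
        sum(weights[name] * scaled[name][i] for name in metrics) / total_weight
        for i in range(n)
    ]


# Made-up values for three placeholder companies, in the same order per metric
metrics = {
    "pct_female_board": [25.0, 40.0, 10.0],
    "di_mentions": [2, 12, 22],
}
# Equal weights are an arbitrary starting point, not a recommendation
weights = {"pct_female_board": 0.5, "di_mentions": 0.5}

scores = composite_scores(metrics, weights)
print(scores)
```

Scaling first keeps a metric with a large raw range, such as mention counts, from dominating one expressed in percentages; the weights are where any judgment about relative importance would go.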