How to Lie Using Data Science

Michael Wirtz
5 min read · Oct 13, 2020

“I only believe in statistics that I doctored myself” -Winston Churchill

Intro

Backed by a passion for sustainable finance, I now find myself in a data science coding bootcamp. My undergraduate education did not give me the opportunity to go past the basics of any type of coding. That was a problem for many reasons, and one only exacerbated by the presence of the following crucial concepts in sustainable finance: materiality and greenwashing. Let’s briefly define what these are. Materiality is used to describe any identifiable metric that can be linked to the profitability of a firm. A common “material” metric, unsurprisingly, is a firm’s relationship to the environment. Carbon emissions, for example, would be considered a “material” metric for many firms. Greenwashing is a direct result of the concept of materiality. Greenwashing is the implementation of deceptive marketing techniques in an attempt to convince the consuming public that one firm’s products are more environmentally friendly than another’s.

The focus on these concepts in the world of finance is a result of a growing mountain of data, and I needed to understand the inner workings of that analysis myself. The only way to do that was to go straight to the source. As I worked through my first few weeks of the bootcamp, I was astounded by the leeway and variance present in data interpretation and inference. I quickly understood how difficult it is to draw confident conclusions from this data.

Within this gray area that encompasses some of the data around sustainable finance, there is room for foul play — room for the misrepresentation of your data. In an off-brand attempt to play devil’s advocate, I will digress from sustainability and ethics, and I’m going to walk you through how to doctor any data to your liking.

Top 5 Best Ways to Lie with Your Data

We will not be discussing mean, median and mode here. Let’s step it up a notch. We are here to fool those who think they cannot be fooled — the snobby, high-brow, pseudo-intellectual types. Therefore, we must not be so obvious. We must lie within the bounds of truth, misrepresenting it in an attempt to deceive the undeceivable. I will lay these techniques out plainly and simply. With these building blocks you will have the power of analytical deception at the ready.

1. Generalize and Overfit Your Data!

Once you have identified a correlation, you need not specify the limited range of the variables over which you observed it. Does the relationship hold outside that range? How does it proceed beyond the graph in the negative and positive directions? These are both questions with which we do not concern ourselves. Say it with me: “Extrapolate your data!” As they say, measure once, extrapolate forever. Keep this graph below in mind next time you find a use-case. Feel free to extrapolate any way that tickles your fancy.

https://www.mathworks.com/company/newsletters/articles/fitting-and-extrapolating-u-s-census-data.html
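To see how far this trick can carry you, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the "real" process is a made-up logistic curve that saturates at 100, and the helper names `true_value` and `fit_line` are hypothetical. We observe the curve only where it looks like a straight line, fit that line, and then extrapolate well past the data.

```python
import math


def true_value(x):
    """Hypothetical 'real' process: logistic growth that saturates at 100."""
    return 100.0 / (1.0 + math.exp(-(x - 5.0)))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Observe only the narrow region x = 3..7, where the curve looks roughly linear.
observed_x = [3.0, 4.0, 5.0, 6.0, 7.0]
observed_y = [true_value(x) for x in observed_x]
slope, intercept = fit_line(observed_x, observed_y)

# Extrapolate far outside the observed range.
x_far = 20.0
predicted_far = slope * x_far + intercept
actual_far = true_value(x_far)

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"at x = {x_far}: line predicts {predicted_far:.1f}, process gives {actual_far:.1f}")
```

Inside the observed range the line looks perfectly respectable. At x = 20 it predicts more than three times the ceiling that the made-up process can ever reach, and nothing in the fit itself warns you about that.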

2. Make Your Visualizations Work for You!

You probably already know about this one, but that’s for a reason: it works. The human mind has a set way that it likes to read graphs. Take advantage of this psychological weakness we all have. Humans love themselves a pattern. Ha! Here are 3 quick ways to mess with the susceptible human mind:

Truncated Axis:

https://glean.info/beware-lies-fake-data-visualizations/
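The arithmetic behind a truncated axis is simple enough to sketch. The numbers and the helper `apparent_ratio` here are hypothetical; the point is only that the drawn height of a bar is its value minus wherever the axis starts, so moving the start changes how big a difference looks without changing a single data point.

```python
def apparent_ratio(smaller, larger, axis_start=0.0):
    """Ratio of drawn bar heights when the y-axis starts at axis_start."""
    if axis_start >= smaller:
        raise ValueError("axis_start must lie below both values")
    return (larger - axis_start) / (smaller - axis_start)


# Hypothetical figures: two products scoring 50 and 52 on some metric.
score_a, score_b = 50.0, 52.0

honest = apparent_ratio(score_a, score_b, axis_start=0.0)
truncated = apparent_ratio(score_a, score_b, axis_start=49.0)

print(f"axis from 0:  bar B looks {honest:.2f}x as tall as bar A")
print(f"axis from 49: bar B looks {truncated:.2f}x as tall as bar A")
```

A 4% difference drawn from zero becomes a bar three times as tall once the axis starts at 49.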

Cherry-Pick Data:

https://www.skepticblog.org/2012/04/11/cherry-picked-data-and-deliberate-distortions/
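Cherry-picking can even be automated. The sketch below uses an invented series (a steady upward trend plus a slow cycle) and a hypothetical helper `slope`; it scans every 10-point window and keeps the one that tells the most convenient story. None of this comes from the linked article; it is just one way the idea might look in code.

```python
import math


def slope(ys):
    """Least-squares slope of ys against the time steps 0, 1, 2, ..."""
    n = len(ys)
    mean_t = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(ys))
    return sxy / sxx


# Hypothetical series: steady upward trend plus a slow cycle.
series = [0.1 * t + 2.0 * math.sin(t / 3.0) for t in range(60)]

full_slope = slope(series)

# Try every window of 10 consecutive points and keep the most negative slope.
window = 10
window_slopes = [
    slope(series[start:start + window])
    for start in range(len(series) - window + 1)
]
worst_start = min(range(len(window_slopes)), key=window_slopes.__getitem__)
cherry_slope = window_slopes[worst_start]

print(f"slope over all 60 points: {full_slope:+.3f}")
print(f"slope over points {worst_start}..{worst_start + window - 1}: {cherry_slope:+.3f}")
```

The full series trends upward, yet the chosen window slopes downward. Show only that window and the "decline" is technically true.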

Confusing Graph Choice:

https://venngage.com/blog/misleading-graphs/

3. Use Small, Biased, and Non-Random Samples!

This one is very important. Do not shy away from the power of sampling. The beauty of it is that you have the power to say almost anything you want. All it takes is a small, targeted, and certainly not random sample. Nobody needs to know. And who will even ask? As long as it looks good and it says what you want it to say, get it out there. See below to understand just how much more you can get away with when using a small sample:

https://trialsjournal.biomedcentral.com/articles/10.1186/1745-6215-15-264
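For a rough sense of how much a small sample lets you get away with, consider a fair coin standing in for a treatment that does nothing at all. This is an illustrative assumption, not a result from the linked paper, and `prob_at_least_heads` is a hypothetical helper. The sketch computes the exact chance of seeing an impressive-looking "70% success rate" purely by luck.

```python
from math import comb


def prob_at_least_heads(n_flips, min_heads):
    """Exact P(at least min_heads heads) in n_flips tosses of a fair coin."""
    favourable = sum(comb(n_flips, k) for k in range(min_heads, n_flips + 1))
    return favourable / 2 ** n_flips


# Chance of "70% or better" when the true rate is 50%.
small = prob_at_least_heads(10, 7)       # 10 subjects
large = prob_at_least_heads(1000, 700)   # 1000 subjects

print(f"n = 10:   P(at least 70% heads) = {small:.4f}")
print(f"n = 1000: P(at least 70% heads) = {large:.3e}")
```

With 10 subjects, luck alone hands you the headline roughly one time in six. With 1000 subjects it essentially never does, which is exactly why the aspiring liar keeps the sample small.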

4. Tie Experts to Your Study!

Once again, let’s return to the weaknesses of the brain. Humans have a tendency to put a weird amount of trust into the people or institutions that they believe to be ‘trusted’ authorities. Anytime you can swing it, find a way to tangentially tie one of these experts to your study. Look for loopholes! Impress your findings upon people with authority!

5. Imply Causation from Correlation!

This one is just plain obvious, and there is a reason that I saved it for last. You can certainly find strange correlations and claim causation. You can also use the techniques above to imply causation between almost any two variables. The possibilities are endless! Take a look at this weird one:

https://e-abm.com/correlation-does-not-imply-causation/
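Manufacturing a strange correlation of your own takes very little. In this sketch, two invented series (`series_a` and `series_b`, both hypothetical) share nothing except that each drifts upward over time. The helper names `pearson` and `diffs` are also assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def diffs(values):
    """Step-to-step changes, which strip out a shared linear trend."""
    return [b - a for a, b in zip(values, values[1:])]


# Two hypothetical, unrelated quantities that both happen to grow over time.
steps = range(30)
series_a = [3.0 * t + 5.0 * math.sin(t) for t in steps]
series_b = [0.5 * t + math.cos(1.7 * t) for t in steps]

r_levels = pearson(series_a, series_b)
r_changes = pearson(diffs(series_a), diffs(series_b))

print(f"correlation of the raw series:          {r_levels:.2f}")
print(f"correlation of the step-to-step changes: {r_changes:.2f}")
```

The raw series correlate strongly because both trend upward. Once the trend is removed by looking at step-to-step changes, most of that correlation goes away, so report the first number and stay quiet about the second.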

Conclusion

There isn’t much to say here. Take those nuggets of wisdom and go wild. At the least, remember that if you aren’t tricking other people with your doctored statistics, other people are probably trying to trick you with theirs. Therefore, dive into the data yourself!
