This is what is meant when you hear “truth isn’t truth” or “alternative facts”. They are talking more about the results of the data analysis than the data itself. Data doesn’t lie. There are many different ways of misinterpreting data, and here are just a few:
It is hard to design a data capture. It is harder still to pull together datasets that contain all of the information needed to create an informed and factual story. Whether a dimension was missed in the capture, or the slicing of the sample removed an important segment of the population, it is easy to accidentally skew the data and the outcomes interpreted from it.
Survivor bias is when the data only represents some segments of the population, yet the question is treated as if it can be answered from the existing data alone.
An example of survivor bias is the story (told much better in this link), which I’m sure all of you know well, of Abraham Wald, a mathematician who worked with the Statistical Research Group during WWII and recognised survivorship bias in the data he was being presented. He was given data showing all of the bullet holes in the planes returning from combat and tasked with confirming that all planes should have additional armour placed in those areas. Armour is expensive and heavy, and any increase to the weight of the plane meant a reduction in its payload, so minimising the amount of armour added was in the best interests of all. Rather than confirm, Abraham suggested armour be added to the areas for which there was no data: the places where the holes didn’t show, as it was evident that planes shot in those areas were not returning.
Seeing the incomplete dataset and understanding the question being asked, Abraham was able to put together a model that better reflected reality.
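To make the effect concrete, here is a minimal simulation sketch of the returning-planes problem. Everything in it is an assumption for illustration: the section names, the loss chances in `LOSS_CHANCE_PER_HIT`, and the `fly_missions` helper are invented, and this is not Wald's actual model. It only shows how counting holes on the survivors under-represents the areas where a hit is most dangerous.

```python
import random

# Hypothetical plane sections and the chance that a single hit there downs the plane.
# These numbers are made up for illustration only.
LOSS_CHANCE_PER_HIT = {"engine": 0.6, "cockpit": 0.4, "fuselage": 0.05, "wings": 0.05}


def fly_missions(n_planes, seed=0):
    """Return (hits on all planes, hits on returning planes) counted per section."""
    rng = random.Random(seed)
    sections = list(LOSS_CHANCE_PER_HIT)
    all_hits = dict.fromkeys(sections, 0)
    returned_hits = dict.fromkeys(sections, 0)
    for _ in range(n_planes):
        # Each plane takes 1 to 4 hits, spread evenly over the sections.
        hits = [rng.choice(sections) for _ in range(rng.randint(1, 4))]
        # The plane returns only if it survives every one of its hits.
        survived = all(rng.random() > LOSS_CHANCE_PER_HIT[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                returned_hits[s] += 1
    return all_hits, returned_hits


def shares(counts):
    """Convert raw hit counts to each section's share of the total."""
    total = sum(counts.values())
    return {section: count / total for section, count in counts.items()}


all_hits, returned_hits = fly_missions(10_000)
print("share of hits, every plane:   ", shares(all_hits))
print("share of hits, returning only:", shares(returned_hits))
```

Hits are spread evenly across the sections in the full population, but the engine and cockpit shrink in the returning-only view. Someone who only sees the second line would conclude those areas are rarely hit, which is exactly the wrong reading.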
For another example of survivor bias, have a look at a story about everyone’s favourite internet animal, cats.
This one shows up a lot. CEO: “So how can we make this a positive?” Sales team: “Sure, sales are going down, but sales in product A are going up. Let’s report that.” Basically, how do we reduce the size of the frame until we see what we want to see?
In many cases data is removed to create focus and cut noise, showing only relevant information. Unfortunately, this can sometimes also remove important data, whether on purpose or not. This is similar to survivorship bias in that a dimension is missing or the sample is not fully representative of the population. The difference is that the data was removed after the capture in order to present the story.
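Shrinking the frame is easy to show with a few lines of arithmetic. The sketch below uses invented quarterly figures (the `sales` numbers and the `framed_growth` helper are hypothetical) to show how the same data reports a decline or a success depending on which products are left inside the frame.

```python
# Hypothetical quarterly sales figures, invented for illustration.
sales = {
    "Q1": {"product A": 100, "product B": 500, "product C": 400},
    "Q2": {"product A": 130, "product B": 420, "product C": 350},
}


def growth(before, after):
    """Fractional change from one period to the next."""
    return (after - before) / before


def framed_growth(products):
    """Growth between Q1 and Q2 when only `products` are inside the frame."""
    q1 = sum(sales["Q1"][p] for p in products)
    q2 = sum(sales["Q2"][p] for p in products)
    return growth(q1, q2)


full_frame = framed_growth(["product A", "product B", "product C"])
narrow_frame = framed_growth(["product A"])

print(f"all products:   {full_frame:+.0%}")
print(f"product A only: {narrow_frame:+.0%}")
```

With these made-up numbers the full frame shows sales down 10% while the product A frame shows them up 30%. Neither number is wrong; the story changes with what was left out.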
An example of this is the WHO’s cancer agency, where data was removed from a study to strengthen the hypothesis that glyphosate probably causes cancer in humans and animals. In summary, in order to support the pre-study conclusion that glyphosate causes cancer, findings that showed it to be non-carcinogenic were removed from the result set. The study was adjusted to strengthen an existing result and so reduce resistance to the classification of glyphosate.
Of course, once the data was made public the opposite occurred, and conspiracies are now seen in every slice of the data.
Final example for now. This goes to causation and correlation. It happens quite a lot when combining different datasets and making incomplete links between them. Linking sunscreen sales by date and time to ice cream revenue doesn’t mean that putting on sunscreen compels you to run and buy a softy. There is data missing from the link that is important to the correlation, probably that it is summertime and hot out. For a few more odd correlations check out this site.
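The missing-variable problem can be sketched with simulated data. In the example below all of the numbers are assumptions: temperature is generated at random, and both sunscreen sales and ice cream revenue are made to depend on temperature only, never on each other. The `pearson` helper is a plain implementation of the Pearson correlation coefficient.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)

# Hypothetical daily data over ten years: temperature drives both series,
# and neither series has any effect on the other.
temperature = [rng.uniform(5, 35) for _ in range(3650)]
sunscreen = [3.0 * t + rng.gauss(0, 10) for t in temperature]
ice_cream = [5.0 * t + rng.gauss(0, 15) for t in temperature]

all_days = pearson(sunscreen, ice_cream)
print("all days:", round(all_days, 2))

# Hold the missing variable roughly fixed: keep only days between 20 and 22 degrees.
mild = [(s, i) for t, s, i in zip(temperature, sunscreen, ice_cream) if 20 <= t <= 22]
mild_sunscreen, mild_ice_cream = zip(*mild)
mild_days = pearson(mild_sunscreen, mild_ice_cream)
print("20-22 degree days only:", round(mild_days, 2))
```

Across all days the two series are strongly correlated. Once temperature is held roughly constant the correlation largely disappears, which is the hint that the link between the two datasets was incomplete.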
An example of correlation errors is Google Flu Trends, which was meant to predict flu outbreaks up to 10 days before the CDC using data generated by search keywords. The thought was that the more people searching for flu-related terms, the more people there were with the flu looking for relief. As Google described it:
We have found a close relationship between how many people search for flu-related topics and how many people actually have flu symptoms. Of course, not every person who searches for “flu” is actually sick, but a pattern emerges when all the flu-related search queries are added together. We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening. By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.
As it turns out, not only is not every person who searches for “flu” sick, there are also a whole lot of search terms that can be related to “flu” that do not indicate sickness. And interestingly, just a few months after Google Flu Trends was released, swine flu came about and was completely missed by Google Flu Trends.
So when you are presented with data, remember:
- “the data says so” is only relative to the model, not necessarily reality
- you are listening to a story, you need to decide what you take away
- if something is too good to be true, look at the rest of the data
- Nicolas Cage actually has nothing to do with the number of people who drown falling into a pool
Thanks for reading, hope it was useful, or enjoyable at least.