
I’m Bayesed and I know it

If you’re too young to realize where the title reference comes from, I’m gonna make you lose your mind. It has something to do with parties and rocks and anthems. Actually, no, I just want you to have a good time so I’ll instead ask you to take a look at the title picture. What did you notice?

I am obviously drawing your attention to both the title and picture for a reason. With the title, you might not have realized there was a “pattern” to it till I pointed it out. With the picture, if you only took a quick glance, you might have seen just sheep.

If you managed to figure both out without me having to point it out, you can stop reading.

For those of you who did not, the reason I (tried to) pull one over on you is that I want to talk about bias. No, not the bias in Machine Learning, which refers to a model not being able to "learn" anything. The bias I am going to talk about is more serious and has to do with humans, not machines. You've probably figured it out by now… we're going to talk about cognitive biases. Why? Because they have a lot to do with data-driven decision making and Data Science in general.

Let’s get right to it. Below you will find a number of cognitive biases we humans have. In this post, I will only talk about one; it’s one that has a significant impact on our lives but perhaps has not been talked about as much. It’s called the base-rate fallacy.

Cognitive Biases
Source & interactive version: Wikipedia


Think about the last time you had to work on a project. Did it take more time than you initially expected? For me, this always happens when working in teams. You have a deadline in a week; since it took the team 4 days to get the last project done, you all think it will take a maximum of 4 days this time around. Suddenly it’s 11:50 PM, 5 minutes before the deadline, and you’re scrambling to get everything done after having spent a 4th day eating kebabs from the corner shop.

I too am guilty of blaming my teammates or my boss, and not myself, whenever this happens (which is itself a bias, the self-serving bias, by the way), but in reality, I ignored the base rate because, as a human, I am not very good at separating information from probabilities. To explain what the base-rate fallacy is, I will defer to experts on the subject:

Amos Tversky and Daniel Kahneman, two of the most well-known behavioral scientists, carried out an experiment in which they gave participants descriptions of 5 individuals selected (at random) from a pool of 70 lawyers and 30 engineers. The participants had to predict whether each of the 5 people was a lawyer or an engineer. Kahneman and Tversky found that the participants’ predictions completely ignored the composition of the pool (30% engineers and 70% lawyers, which is the prior probability of each description belonging to each group). Instead, participants seemed to base their predictions of each person’s occupation on how similar the description was to that of a typical lawyer or engineer. Clearly, the participants were biased. If they weren’t, with no other information about a sample of 5, they would have been better off predicting roughly 3 to 4 lawyers and 1 to 2 engineers (the expected values are 3.5 lawyers and 1.5 engineers).
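To see how much the base rate matters here, consider a small sketch in Python. The likelihoods below are invented for illustration (they are not from the actual study): suppose a description "sounds like an engineer" with probability 0.9 if the person really is an engineer, and 0.5 if they are a lawyer. Even with such a strongly engineer-flavored description, Bayes' theorem says the person is still more likely to be a lawyer, because lawyers dominate the pool:

```python
# Base rates from the pool of 100 people
p_engineer = 0.30          # 30 engineers
p_lawyer = 0.70            # 70 lawyers

# Assumed (hypothetical) likelihoods of an "engineer-sounding" description
p_desc_given_eng = 0.9
p_desc_given_law = 0.5

# Bayes' theorem: P(engineer | description)
posterior = (p_desc_given_eng * p_engineer) / (
    p_desc_given_eng * p_engineer + p_desc_given_law * p_lawyer
)
print(round(posterior, 3))  # 0.435 -- still more likely a lawyer!
```

The point is that the participants acted as if the posterior were driven entirely by the description, when the 70/30 prior should have pulled their answers heavily toward "lawyer".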

What does this bias mean for you and your future group projects? Instead of relying only on how long it took your own team to finish its last project, start collecting data on how long it took other teams before you to complete the same or similar projects! That way, you won’t ignore the base rates.

Quick side note: apparently there is quite a large sub-group of dedicated machine learning researchers trying to build algorithms that make predictions based solely on similarity. Pedro Domingos, in his book “The Master Algorithm”, describes these people as analogizers.


Getting back to the point, the base-rate fallacy is not limited to group projects. It has far-reaching implications for decision making and, by association, for Data Science. By ignoring base rates, we might attach a higher probability to someone not paying their credit-card bills based on, for example, how they look (the halo effect, or why conmen succeed) rather than on their actual ability to pay. Another example is doctors misdiagnosing patients because they failed to account for the base rate of a disease’s occurrence, essentially attaching too high a probability to very rare diseases.
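The doctor example can be made concrete with a quick calculation. The numbers below are hypothetical (a disease affecting 1 in 1,000 people, a test with 99% sensitivity and a 5% false-positive rate), but they show how a positive result from an accurate-sounding test still leaves only a small posterior probability of disease once the base rate is accounted for:

```python
# Hypothetical screening-test numbers (not real medical data)
prevalence = 0.001        # base rate: 1 in 1,000 people have the disease
sensitivity = 0.99        # P(positive test | disease)
false_positive = 0.05     # P(positive test | no disease)

# Total probability of testing positive
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Bayes' theorem: P(disease | positive test)
posterior = sensitivity * prevalence / p_positive
print(round(posterior, 3))  # 0.019 -- under 2%, despite the "99% accurate" test
```

A doctor who ignores the base rate intuitively treats that positive test as near-certain disease; the prior says otherwise.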

Now put your Data Scientist hat on and think about how Machine Learning can prevent this bias from interfering in decision making. Let me give you a hint:

Bayes, as it turns out, is not biased.

Bayes’ theorem, first proposed in the 1700s by Thomas Bayes, takes base rates into account by incorporating the prior probability when calculating the posterior probability. I won’t spend time in this post explaining how it works, since Pedro Domingos does it better, so I highly recommend his book to those eager to learn about the tribes that inhabit “ML Land”.

What I will say is that Bayes’ theorem, applied in a Machine Learning algorithm called Naive Bayes, is behind one of the most important (and trendy) algorithms in Machine Learning today.
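As a sketch of the idea (a toy illustration, not a production implementation), here is a minimal hand-rolled Naive Bayes classifier on an invented spam/ham dataset. Note how the class priors, i.e. the base rates, enter the prediction directly, which is exactly what the participants in Tversky and Kahneman's experiment failed to do:

```python
from collections import Counter

# Invented toy training data: word lists labelled spam or ham
train = [
    (["win", "money", "now"], "spam"),
    (["win", "prize"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["project", "deadline", "tomorrow"], "ham"),
]

# Priors = base rates of each class in the training data
labels = [y for _, y in train]
priors = {c: labels.count(c) / len(labels) for c in set(labels)}

# Per-class word counts, plus the overall vocabulary
counts = {c: Counter() for c in priors}
for words, c in train:
    counts[c].update(words)
vocab = {w for words, _ in train for w in words}

def predict(words):
    """Pick the class maximizing prior * product of word likelihoods,
    with add-one (Laplace) smoothing for unseen words."""
    scores = {}
    for c in priors:
        total = sum(counts[c].values())
        p = priors[c]  # start from the base rate
        for w in words:
            p *= (counts[c][w] + 1) / (total + len(vocab))
        scores[c] = p
    return max(scores, key=scores.get)

print(predict(["win", "money"]))          # spam
print(predict(["deadline", "tomorrow"]))  # ham
```

The "naive" part is the assumption that words are independent given the class; even so, the prior keeps the classifier anchored to the base rates in a way human intuition often is not.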


You might be wondering what I am trying to prove. There are articles out there that describe Data Science and/or Machine Learning as something that will bring about a lot of unemployment, filter bubbles and whatnot. The perception of this industry is that of computer scientists who are paid to write thousands of lines of code because it’s cheaper to automate than to pay loan officers or, worse, of companies that want to push fake news to influence people’s decisions.

My purpose here is to give you a different perspective. In reality, a lot of Data Science is based on ideas that were developed (or rediscovered) to get rid of the cognitive biases that all humans have, the loan officer and the Data Scientist alike. With fewer biases, decisions we were previously getting wrong are being made correctly. For example, we no longer have the problem of someone being denied a loan simply because a loan officer did not like the way they dressed (without realizing this was affecting his or her decision).

I am not saying it’s all sunshine and rainbows, as there are plenty of ethical dilemmas in Data Science that are subject to ongoing debate. What cannot be debated is that, because of Data Science, there is less human bias in decision making, and that has improved all of our lives.

Would you agree?