Grandma’s Guide to Data Preparation (The missing pancetta)

I usually speak to my Grandma on Thursdays. True to tradition, we did so this past week. As soon as she picked up the phone, she first asked me if I had eaten and only then if I was doing well. Ah, Grandmas! I told her about the recent heatwave in Europe and how France recorded its highest temperature (EVER) on Friday. Sure, the temperature is an irrelevant detail for me to tell you, but, like Big Data, it is very relevant for the future of humanity! Anyway, we picked up the conversation from where it had ended last week.

Grandma: So, tell me, how is substituting bacon for pancetta similar to how you deal with missing values in your datasets?

Me: Actually, we don’t always do something akin to substituting bacon for pancetta. The choice really depends on how much data we have, how important the missing value is and how well we know the problem (or the business).

Let me try to explain it better using an example. Imagine I am supposed to make dinner for my Italian friend Filippo, my pollo-pescatarian friend Jasmine, and myself. I know Filippo prefers authenticity, so pancetta is the ideal choice for him. Jasmine would prefer I use salmon. Me? I prefer chicken because, you know, being a student is hard on the wallet and chicken is cheap! (The why for this is an entirely different problem for the future of humanity.)

To solve the problem of the missing pancetta, or in a Data Scientist’s case, the problem of missing values, we could do one of three things:

a) Find a compromise: think of pancetta as one extreme of taste (and source) and salmon as another. Chicken is probably somewhere in between and something that is quite commonly eaten around the world. So, if I choose chicken, I am using a middle ground between the two. In Data Preparation, we can do something similar. We can substitute missing values by using a central tendency measure like an average (mean), a middle value (median) or the most “common” value (mode). In cases where there isn’t a large dispersion in my data this works pretty well. It wouldn’t if Jasmine were vegetarian.

b) “Guesstimate”: If I know Filippo had a pancetta and mozzarella calzone and Jasmine had sushi for lunch, they probably don’t want to repeat the same over dinner. Knowing that, I could guess that they wouldn’t mind chicken. A similar process can be carried out in Data Preparation. Using Bayesian formulas or Decision Trees, both of which use conditional probabilities, we can make an educated guess as to what the missing value may be.

c) Raise my hands and say I don’t know: Now imagine I don’t know Filippo and Jasmine very well. I don’t know whether Filippo demands authenticity or whether Jasmine has recently decided to become vegetarian. When we Data Scientists run into a situation where we simply don’t know enough about the problem (or the business) to be able to find a middle ground or guesstimate, but we have enough data to work with without having to do either, we simply ignore the missing values. If we don’t have enough data to work with, we replace the missing values with a global constant. Think of it as substituting meat with cheese because it’s a constant in almost every pasta dish out there. By doing so, at least we are aware that we have substituted rather than put something incomplete on the table.
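For readers who would rather see the three options than taste them, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the toy data, the function names, and the choice of `None` to mark a missing value. Note that the "guesstimate" function is only a simple frequency count of what similar rows contain, standing in for the Bayesian or Decision Tree approach rather than implementing either.

```python
from collections import Counter, defaultdict
from statistics import mean, median, mode


def impute_central(values, how="mean"):
    """Option (a): fill None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[how](observed)
    return [fill if v is None else v for v in values]


def impute_conditional(rows, given, target):
    """Option (b), simplified: fill a missing `target` with the most common
    target seen among rows that share the same `given` value."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[target] is not None:
            counts[row[given]][row[target]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row[target] is None and counts[new_row[given]]:
            new_row[target] = counts[new_row[given]].most_common(1)[0][0]
        filled.append(new_row)
    return filled


def drop_missing(values):
    """Option (c), plenty of data: ignore the missing values."""
    return [v for v in values if v is not None]


def impute_constant(values, constant="unknown"):
    """Option (c), too little data: fill with a clearly marked global constant."""
    return [constant if v is None else v for v in values]


# Hypothetical toy data, invented for this example.
ages = [23, 25, None, 31, 25, None, 40]
diners = [
    {"lunch": "sushi", "dinner": "chicken"},
    {"lunch": "sushi", "dinner": "chicken"},
    {"lunch": "sushi", "dinner": "salmon"},
    {"lunch": "salad", "dinner": "salmon"},
    {"lunch": "sushi", "dinner": None},
]

if __name__ == "__main__":
    print(impute_central(ages, "median"))
    print(impute_conditional(diners, "lunch", "dinner")[-1])
    print(drop_missing(ages))
    print(impute_constant(["cheese", None, "cheese"]))
```

The global constant is deliberately something that cannot be mistaken for a real value, so the substitution stays visible later on, just like the cheese.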

Grandma: That makes total sense! If only I used whatever Bayesian formulas are to guess what you wanted to eat maybe you would eat some more. Now that you’ve explained to me why Data Preparation is necessary and how you do it, I am curious to know what happens after you’ve completed the process?

Me: I think you might agree, Grandma, that once most people have thoughtfully decided on the dish and painstakingly prepared the ingredients, the next problem is knowing how to cook it. In the end, execution is everything! It’s no different in our case. Most non-technical managers don’t care much about what we used to solve the problem; they only care about how we did it. The how is where algorithms and models come into play. Algorithms help us cook the data so that we can serve a solution. I will tell you more about it when we speak next week!

Grandma: OK, I am looking forward to it. You say it’s very hot out there, take care of yourself!