“You do not really understand something unless you can explain it to your grandmother.” We’ve all heard a variant of this quote, often credited to Albert Einstein. In technology circles, it points to how hard it is to explain cutting-edge advances to the layperson. This is the first in a series of posts in which I will try to explain Big Data concepts as I understand them, as if I were at the dinner table explaining them to my grandmother.
Why, you ask? Well, because, as in most fields, the Big Data industry is filled with technical jargon that only the people in it understand. That jargon makes it harder for outsiders to follow what we do. I am not saying the work is simple; it is incredibly difficult. But without simplified explanations for outsiders (such as decision makers without technical knowledge), even the best analysis will bear no fruit.
Without further ado, let’s get to the topic for the next few posts, which is Data Preparation.
Grandma: What is this data preparation, and why is it necessary?
Me: Would you, the greatest chef in the world, feed me week-old fish? Or substitute bacon for pancetta in your carbonara because you had no pancetta left in the fridge? Or put nuts in my food knowing I am allergic to them? Well, depending on the business problem we are trying to solve, we need to train algorithms that help us decide on a solution. Each algorithm has its own requirements for the data it can be fed. Data preparation is like what you do to prepare for our dinners! You must ensure that the data is i) not old, since stale data becomes a quality problem, ii) not missing a lot of important “ingredients” (values) and iii) consistent with what the algorithm needs to work optimally. Ignore preparation and you may still put food on the table, but your grandson might not come to visit very often!
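For readers who are not my grandmother, the three checks above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the field names (`ingredient`, `quantity_g`, `purchased_on`), the seven-day freshness limit and the `check_record` helper are all made up for this example, and real data preparation involves far more than this.

```python
from datetime import date, timedelta

# Hypothetical field names, chosen only for this illustration.
REQUIRED_FIELDS = ("ingredient", "quantity_g", "purchased_on")


def check_record(record, today, max_age_days=7):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []

    # i) Freshness: anything max_age_days old or older counts as stale.
    purchased = record.get("purchased_on")
    if purchased is not None and (today - purchased) >= timedelta(days=max_age_days):
        problems.append("stale")

    # ii) Missing "ingredients": required fields with no value.
    missing = [name for name in REQUIRED_FIELDS if record.get(name) is None]
    if missing:
        problems.append("missing: " + ", ".join(missing))

    # iii) Consistency: the quantity must be a number, not free text.
    quantity = record.get("quantity_g")
    if quantity is not None and not isinstance(quantity, (int, float)):
        problems.append("quantity_g is not a number")

    return problems


today = date(2024, 1, 8)
fridge = [
    {"ingredient": "fish", "quantity_g": 400, "purchased_on": date(2024, 1, 1)},
    {"ingredient": "pancetta", "quantity_g": None, "purchased_on": date(2024, 1, 7)},
    {"ingredient": "pasta", "quantity_g": "two handfuls", "purchased_on": date(2024, 1, 6)},
    {"ingredient": "eggs", "quantity_g": 300, "purchased_on": date(2024, 1, 7)},
]

for item in fridge:
    print(item["ingredient"], check_record(item, today))
```

In this made-up fridge, the fish is flagged as stale, the pancetta has a missing quantity, the pasta has an inconsistent quantity, and only the eggs pass all three checks.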
Grandma: I don’t want you to stop visiting, so I make sure that I have everything I need, but you know, my memory is not what it used to be. Besides, your father buys food every week and puts it in the fridge for me. How can I make sure stale fish doesn’t end up on our dinner table?
Me: What you’re describing, Grandma, is a problem we face quite a lot. We rely on other people to put food in our fridge, which in my world is called a Data Warehouse (or a Data Lake, or something else). To make sure nobody is putting in old data, we use metadata. It is a layer of information on top of the data that tells us when the data was collected, when it was put into the warehouse, by whom and the like. Think of it as a Post-it note you might ask Dad to put on the fridge every time he goes shopping. He should write the date he went shopping, what he bought and what he threw out of the fridge because it was too old to eat. That way everyone knows what is in the fridge, and you can always go back and look at the Post-it if you need a reminder!
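For the technically inclined, here is a minimal sketch of what such a Post-it note could look like in Python. Everything in it is an assumption for illustration: the `PostItNote` class, its field names, the `latest_note` and `is_fresh` helpers and the seven-day limit are hypothetical, and real warehouses track metadata in their own catalog tools rather than in a hand-written class.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PostItNote:
    """Hypothetical metadata record for one load into the 'fridge'."""
    collected_on: date   # when the data was collected (the shopping date)
    loaded_on: date      # when it was put into the warehouse (the fridge)
    loaded_by: str       # who put it there
    added: list = field(default_factory=list)    # what was bought
    removed: list = field(default_factory=list)  # what was thrown out


def latest_note(notes):
    """Return the note for the most recent load."""
    return max(notes, key=lambda note: note.loaded_on)


def is_fresh(note, today, max_age_days=7):
    """True if the data was collected fewer than max_age_days ago."""
    return (today - note.collected_on).days < max_age_days


notes = [
    PostItNote(date(2024, 1, 1), date(2024, 1, 1), "Dad", added=["fish", "eggs"]),
    PostItNote(date(2024, 1, 7), date(2024, 1, 8), "Dad",
               added=["pancetta"], removed=["fish"]),
]

today = date(2024, 1, 10)
newest = latest_note(notes)
print(newest.loaded_by, newest.added, newest.removed)
print("newest load still fresh:", is_fresh(newest, today))
print("first load still fresh:", is_fresh(notes[0], today))
```

The point of the sketch is that freshness is decided from the note, not from the food itself: anyone reading the metadata can tell what was loaded, by whom and whether it is still safe to use.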
Grandma: That is a great idea. I am going to ask him to do that from now on. I understand the problem with old food, as it can make you sick. But what’s the problem with substituting some ingredients for similar ones? Isn’t it better to have carbonara with bacon than just pasta with cream?!
Me: You have a point, and we do substitute. There are techniques for doing that, but I think we’ve talked enough about the topic for today. I promise to answer your question some other day.