A gentle introduction to K-means
TLDR: K-means partitions data points into “k” clusters. But what are clusters? They are simply groups of data points that are similar to each other. In Machine Learning parlance, clustering is an unsupervised learning task: nobody labels the data beforehand, and the machine groups the data points based on the patterns it finds in them.
Imagine yourself as a kid after a summer break, ready to go back to school. A new term means new textbooks and notebooks. In front of you is a pile of books, each with a sleeve. Your brother has spent all day making sure every one of them has a sleeve and that they all look identical. You can’t tell at a glance which is a notebook and which is a textbook.
Your task is to put a red sticker on the textbooks and a blue sticker on the notebooks.
With no guidance, you try to make sense of the stack of books in front of you. You stare at them for a few seconds and realize that even though they might have the same sleeves, textbooks and notebooks are a bit different from each other. Eventually you figure out that the textbooks are bulky and have a hard cover whereas the notebooks are less bulky and have paper covers. Armed with this information, you put the stickers accordingly.
Explaining it in data terms
To explain this in data terms, the stack of books is essentially your dataset, and instead of just a few, you have hundreds or even millions of books to label.
Since it’s an unsupervised task, i.e., one without any prior input from anyone, you group the books based on your own observations. After grouping, the notebooks have blue stickers and the textbooks have red ones. They are separated and grouped together to form clusters.
As in the image above, the data points start out messy, with no labels assigned to them. The algorithm tries to understand the similarities among them and to create clusters or groups that reflect these similarities.
A clustering algorithm like K-means is therefore designed to identify the similarities that help differentiate the data points. In K-means, we use Euclidean distance to measure similarity: the smaller the distance between two data points, the more similar they are. The K denotes the number of clusters that you want to see in your data: for our books, it’s two. For a visual representation, check out how the groups are extracted below.
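To make the distance idea concrete, here is a minimal Python sketch of the sticker task. Everything in it is an assumption made for illustration: the two features (thickness and cover stiffness), the book values, and the two cluster centers, which are hand-picked here rather than found by K-means. It only shows how Euclidean distance decides which of the K = 2 groups a book is closest to.

```python
import math

# Hypothetical features per book: (thickness in cm, cover stiffness from 0 to 1).
books = {
    "book_a": (3.2, 0.9),
    "book_b": (0.8, 0.1),
    "book_c": (2.9, 1.0),
    "book_d": (1.1, 0.2),
}

# Assumed cluster centers for K = 2; hand-picked for this example, not learned.
centers = {
    "red (textbook-like)": (3.0, 1.0),
    "blue (notebook-like)": (1.0, 0.0),
}

def nearest_center(point, centers):
    """Return the name of the center with the smallest Euclidean distance."""
    return min(centers, key=lambda name: math.dist(point, centers[name]))

# Give every book the sticker of its closest center.
stickers = {title: nearest_center(features, centers)
            for title, features in books.items()}

for title, sticker in stickers.items():
    print(title, "->", sticker)
```

The thick, stiff-covered books end up with the red sticker and the thin, soft-covered ones with the blue sticker, purely because of which center they sit closer to.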
Clustering in the real world
So what? Why would you ever need clustering in the real world? One useful example is customer segmentation via cluster analysis.
In this process, customers who share similar traits are grouped together to form a cluster. You might have noticed that frequent and regular customers receive various offers, discounts and rewards. This is often done via cluster analysis. The customers who are regulars and purchase more frequently are grouped together into a cluster, and that cluster is then targeted with a marketing campaign. So, if you want to receive more discounts and rewards, you know what to do.
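As a rough illustration of segmentation, the sketch below groups a handful of made-up customers into two clusters. The data, the two features, the starting centers and the helper names `mean` and `kmeans` are all assumptions for this example. The loop is one common way to run K-means (assign each point to its nearest center, then move each center to the average of its points) and is meant as a preview, not a definitive implementation.

```python
import math

# Made-up customers: (purchases per month, average basket in tens of dollars).
customers = [
    (12, 8.0), (10, 7.5), (11, 9.0),   # frequent shoppers
    (1, 2.0), (2, 1.5), (1, 3.0),      # occasional shoppers
]

def mean(points):
    """Component-wise average of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def kmeans(points, centers, iterations=10):
    """Alternate between nearest-center assignment and re-averaging the centers."""
    for _ in range(iterations):
        # Label each point with the index of its closest center.
        labels = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        new_centers = []
        for i in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else centers[i])
        if new_centers == centers:
            break  # Centers stopped moving, so the grouping is stable.
        centers = new_centers
    return labels, centers

# Naive, assumed initialization: start from two of the data points.
labels, centers = kmeans(customers, [customers[0], customers[3]])

for customer, label in zip(customers, labels):
    print(customer, "-> cluster", label)
```

With this toy data the frequent shoppers land in one cluster and the occasional shoppers in the other; a campaign could then target the first group.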
We’ve just given an overview of how clustering works, but we’ve still left a few questions unanswered: How do you choose the value of “K”? How does the K-means algorithm figure out how to separate the clusters, and when to stop? All of this will be answered in Part 2 of this series, so stay tuned!
The Xabit Team