Anyone fresh out of high school or with a GED can write a few lines of code and implement machine learning algorithms. That accessibility is a good thing: coding and learning resources are available to everyone around the world. We don't even need to write the algorithms from scratch, since multiple libraries already implement them. The time freed up by using a library instead of writing an algorithm from scratch can be spent on parameter tuning to get better results.
In this post we will look inside the KMeans algorithm. KMeans is a clustering algorithm that divides the data into multiple clusters based on how strongly each data point is associated with a cluster. It can be found in the sklearn.cluster package:
class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')
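To see it in action, here is a minimal sketch of fitting KMeans on toy data; the blob dataset and its parameters are illustrative, not part of the original example.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 illustrative points grouped around 3 centres
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

print(kmeans.cluster_centers_)  # coordinates of the 3 learned centroids
print(kmeans.labels_[:10])      # cluster assignments of the first 10 points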
Continuing the book clustering example: we know that there are only two types of books, hence the two stickers. But what if we bought a third sticker? The third sticker would be of no use, because there aren't any books that need a different sticker. The additional sticker costs money and, in the end, serves no purpose.
Explaining this in data terms:
So, imagine there are 100 clusters but only 50 data points. Even if we assume every data point gets its own unique cluster, exactly 50 clusters remain empty, with no association to any data point. In each iteration the algorithm must still go through all the clusters, even though we know there are more clusters than data points. This not only takes more time but also produces a poor model: since there are very few data points in each cluster, any new association is hard to determine, and we will end up classifying a new data point into the wrong cluster. Such a model is also computationally expensive and drives up cost for the user.
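In fact, scikit-learn refuses to fit such a configuration outright. A quick sketch (the random data here is purely illustrative):

import numpy as np
from sklearn.cluster import KMeans

X_small = np.random.rand(50, 2)      # 50 data points, 2 features
KMeans(n_clusters=100).fit(X_small)  # raises ValueError: n_clusters exceeds n_samples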
So we need to determine the appropriate number of clusters for our algorithm. There are two main methods in K-means to determine the appropriate cluster count.
Elbow method
In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. Using the “elbow” or “knee of a curve” as a cut-off point is a common heuristic in mathematical optimization to choose a point where diminishing returns are no longer worth the additional cost.
In clustering, this means one should choose a number of clusters so that adding another cluster doesn’t give much better modelling of the data.
Plotting the inertia (the within-cluster sum of squared distances) against k, we look for a point that resembles a sharp elbow. Only in the ideal case, however, is there a sharp elbow.
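Here is a minimal sketch of producing that curve, assuming X is the feature array from the earlier snippet:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, random_state=42).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

plt.plot(k_values, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()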
Since in practice we often cannot identify a sharp elbow point, we try another method to determine the number of clusters.
The Silhouette method
Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified.
It measures how close each point in one cluster is to the points in the neighbouring clusters. The measure ranges from -1 to +1. The farther apart the clusters' points are, the more distinct the clusters; this is indicated by a value near +1. A negative value indicates that data points have been classified into the wrong clusters, and 0 indicates that a point lies close to the decision boundary between two neighbouring clusters.
A higher silhouette score indicates that the data points are placed in the right clusters. In our example, the silhouette plot shows that k values of 2, 4, 5, 6, 7 and 8 are not good picks due to their low scores; hence the optimal value of k is 3.
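A minimal sketch of computing those scores with scikit-learn, again assuming the X from the first snippet (silhouette_score needs at least two clusters, so k starts at 2):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)  # mean silhouette coefficient over all points
    print(f"k={k}: silhouette score = {score:.3f}")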
Besides the number of clusters, there are a few essential hyperparameters that help make our model better. Let's have a look at them.
n_init int, default=10
This is the number of times the k-means algorithm will be run with different centroid seeds; the best run in terms of inertia is kept as the final result.
max_iter int, default=300
The maximum number of iterations of the k-means algorithm for a single run.
algorithm, default="lloyd"
This parameter selects which K-means variant to use. The default is "lloyd"; we can also use "elkan", which can be more efficient on datasets with well-defined clusters.
init, default="k-means++"
This parameter is the method of initialization. By default "k-means++" is used: it selects the initial cluster centroids by sampling based on an empirical probability distribution of each point's contribution to the overall inertia. This technique speeds up convergence.
There are other parameters, such as tol (the tolerance for declaring convergence) and copy_x, which mainly matter for larger problems where computational efficiency becomes a concern.
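Pulling the parameters above together, here is a sketch of a fully specified estimator; the values are illustrative starting points, not recommendations:

from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,        # chosen via the elbow/silhouette analysis
    init="k-means++",    # smarter seeding than random initialization
    n_init=10,           # re-run with 10 different seeds, keep the best
    max_iter=300,        # cap on iterations per run
    tol=1e-4,            # tolerance for declaring convergence
    algorithm="lloyd",   # or "elkan" for well-defined clusters
    random_state=42,     # reproducible results
)
kmeans.fit(X)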
The number of clusters is the most important hyperparameter to select. However, running the elbow and silhouette methods and finding the right k is not enough to create a strong model. We must explore the other hyperparameters and do the necessary research on the various methods. Once everything is ready, we can set up our KMeans algorithm and tune it to perfection.
Here is a refresher on how to do it (a compact sketch follows the list):
- Find the optimal value of k using the elbow or silhouette method.
- Learn about the various parameters.
- Tinker with the parameters by changing them from their default values.
- Repeat until the results stop improving.
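A compact sketch of that loop: score a few candidate settings and keep the best by silhouette. The parameter grid here is illustrative.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_score, best_params = -1.0, None
for k in (2, 3, 4):
    for algo in ("lloyd", "elkan"):
        labels = KMeans(n_clusters=k, algorithm=algo,
                        random_state=42).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score = score
            best_params = {"n_clusters": k, "algorithm": algo}

print(best_params, best_score)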
If we want to take our KMeans a step further, we can weight features by their importance so each one contributes according to its relevance. We can also change the distance measure from Euclidean distance to the Mahalanobis distance, and scale the data using PCA to get an efficient model.
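Note that scikit-learn's KMeans only supports Euclidean distance, so a Mahalanobis-style variant needs a workaround; one common approximation is to whiten the data with PCA first, since Euclidean distance in the whitened space behaves like Mahalanobis distance in the original space. A sketch of that preprocessing chain, with illustrative settings:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pipeline = make_pipeline(
    StandardScaler(),                  # put every feature on the same scale
    PCA(n_components=2, whiten=True),  # reduce to 2 whitened components
    KMeans(n_clusters=3, random_state=42),
)
pipeline.fit(X)
labels = pipeline[-1].labels_          # cluster assignments from the final step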
Happy Learning!
The Xabit Team