By Neha Kanneganti
Clustering is a well-known and powerful unsupervised machine learning method. The goal is to group our data into different sets, where elements within the same set are more similar to one another than to elements from other sets. While there are many algorithms for generating clusters of similar objects, this blog post will introduce the k-means algorithm. There are many applications for this algorithm. For example, it is essential for search engines such as Google: when a user searches for something, the search results are grouped using clustering algorithms. It can also be used to recommend new friends on social media or to identify different types of tissue in medical scans.
What is K-Means Clustering?
In simple words, k-means clustering is an algorithm that groups the data into K clusters. It is based on the idea that examples closer to each other are more similar. Here are the exact steps of the k-means algorithm:
Pick K random centroids (1 centroid per cluster)
Assign each example to a cluster based on the closest centroid
Recompute each centroid based on the examples in its cluster
Repeat steps 2 and 3 until the centroids no longer change
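To make these steps concrete, here is a minimal sketch of the algorithm in Python with NumPy. The function name and parameters are illustrative rather than from any particular library, and for simplicity the sketch assumes no cluster ever ends up empty:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Group the rows of X into k clusters using the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K random centroids (here: K random examples from X)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each example to the cluster of its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (assumes no cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

The `max_iters` cap is just a safety guard; on well-separated data the loop typically converges in a handful of iterations.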
Please note that K is a parameter, so it is chosen before the algorithm runs. After the algorithm has run, every example is assigned to exactly one of the K clusters.
Example of K-Means Clustering
To further clarify the algorithm and demonstrate its use, let’s walk through the steps above; this should give you a better understanding of how the k-means algorithm works. For our example, here is our starting data, and our goal is to find five clusters for the thousand data points below:
Step 1: Pick K random centroids
A centroid is a point that we pick that will represent the center of a cluster. In the example down below, the centroid is represented by the star. Also, keep in mind the following:
The centroid doesn’t need to be located where an example in our dataset is
The centroid is not counted as an example in our dataset
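For instance, since a centroid doesn’t have to coincide with any example, one simple option is to sample K random points inside the data’s bounding box. This is an illustrative sketch (the dataset `X` here is made up):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((1000, 2))  # illustrative 2-D dataset

k = 5
# Pick K random points inside the data's bounding box; they do not need
# to coincide with any example in X, and they are not examples themselves.
low, high = X.min(axis=0), X.max(axis=0)
centroids = rng.uniform(low, high, size=(k, X.shape[1]))
```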
Step 2: Assign each example to a cluster
Using the distance formula, we will calculate the distance from each example to every centroid. Then, we will group each example into the cluster represented by the closest centroid.
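This assignment step can be sketched with NumPy broadcasting; the arrays below are illustrative stand-ins for the examples and the current centroids:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 2))        # illustrative examples
centroids = rng.random((5, 2))   # illustrative current centroids

# Euclidean distance from every example to every centroid, shape (1000, 5)
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
# Assign each example to the cluster of its closest centroid
labels = dists.argmin(axis=1)
```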
Step 3: Recompute the centroids
Now that every example belongs to a cluster, we recompute each centroid as the mean of the examples in its cluster. Notice how the centroids shift to the centers of these new clusters.
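This recompute step can be sketched as taking the per-cluster mean; the `labels` array below is an illustrative stand-in for the assignments from the previous step, and the sketch assumes every cluster has at least one example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 2))                # illustrative examples
labels = rng.integers(0, 5, size=1000)   # illustrative cluster assignments

# New centroid for each cluster = mean of the examples assigned to it
# (assumes no cluster is empty; a robust implementation would handle that)
centroids = np.array([X[labels == j].mean(axis=0) for j in range(5)])
```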
Step 4: Repeat steps 2 and 3 until the centroids no longer change
Now that the centroids are updated, we can repeat the process: we again measure the distance from each example to each centroid and assign each example to the cluster of the closest centroid. In short, we keep repeating the previous two steps until the centroids stop changing. Once they do, the algorithm is complete and we have found our clusters!
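Putting the loop together, a minimal sketch of this repeat-until-convergence logic looks like the following (the dataset is illustrative, the iteration cap is a safety guard, and the sketch assumes no cluster goes empty):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((200, 2))  # illustrative dataset
k = 3
# Start from K random examples as the initial centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):  # safety cap on iterations
    # Step 2: assign each example to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of its cluster
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 4: stop once the centroids no longer change
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```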
Notice that because the centroid moved, some of the examples changed what color/cluster they belong to.
The centroids moved only slightly this time, but we’re getting close! Repeat!
The centroids didn’t change this time, so the algorithm is complete! We’ve found our clusters!
Hopefully, you are now familiar with k-means clustering and how the algorithm works, and you have another useful concept in your machine learning toolkit!