Implementing K-Means Clustering from Scratch: Optimization Techniques and Practical Applications

Rahul Jain
3 min read · May 23, 2024


Introduction

K-means clustering is a popular unsupervised learning algorithm used to partition a dataset into distinct groups or clusters. The goal is to minimize the variance within each cluster, effectively grouping similar data points together. In this blog post, we’ll delve into the implementation of K-means from scratch, explore optimization techniques, and discuss some of its recent practical applications.

Understanding K-Means Clustering

K-means clustering works by iteratively assigning data points to clusters and updating the cluster centroids until convergence. Here’s a step-by-step breakdown of the algorithm:

  1. Initialization: Choose the number of clusters, K, and randomly initialize the cluster centroids.
  2. Assignment: Assign each data point to the nearest centroid based on Euclidean distance.
  3. Update: Calculate the new centroids by taking the mean of all data points assigned to each cluster.
  4. Convergence: Repeat the assignment and update steps until the centroids no longer change significantly.

Implementing K-Means from Scratch

Let’s implement K-means clustering from scratch in Python:

import numpy as np

def initialize_centroids(X, k):
    # Pick k distinct data points at random as the starting centroids
    indices = np.random.choice(X.shape[0], k, replace=False)
    return X[indices]

def assign_clusters(X, centroids):
    # Euclidean distance from every point to every centroid (shape: k x n),
    # then label each point with the index of its nearest centroid
    distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)

def update_centroids(X, labels, k):
    # New centroid = mean of the points assigned to each cluster
    # (assumes every cluster received at least one point)
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

def kmeans(X, k, max_iters=100):
    centroids = initialize_centroids(X, k)
    for _ in range(max_iters):
        labels = assign_clusters(X, centroids)
        new_centroids = update_centroids(X, labels, k)
        # Stop once the centroids stop moving
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage
X = np.random.rand(100, 2)  # Generate random 2-D data
k = 3                       # Number of clusters
centroids, labels = kmeans(X, k)

Optimization Techniques

While the basic implementation of K-means is straightforward, several optimizations can improve its efficiency and effectiveness:

  1. K-means++ Initialization: Instead of purely random initialization, use the K-means++ scheme to choose the initial centroids. It spreads the starting centroids apart, which typically speeds up convergence and reduces the chance of a poor local optimum.

def initialize_centroids_kmeans_plusplus(X, k):
    # First centroid: a uniformly random data point
    centroids = [X[np.random.choice(X.shape[0])]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen centroid
        distances = np.min([np.linalg.norm(X - c, axis=1)**2 for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to that distance
        prob = distances / np.sum(distances)
        cumulative_prob = np.cumsum(prob)
        r = np.random.rand()
        for j, p in enumerate(cumulative_prob):
            if r < p:
                centroids.append(X[j])
                break
    return np.array(centroids)
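
To use this initializer with the earlier implementation, one option is to make the initialization step swappable. Here is a minimal sketch; kmeans_pp and init_fn are names introduced here for illustration, not part of the code above:

def kmeans_pp(X, k, max_iters=100, init_fn=initialize_centroids_kmeans_plusplus):
    # Same loop as kmeans(), but with a pluggable initializer
    centroids = init_fn(X, k)
    for _ in range(max_iters):
        labels = assign_clusters(X, centroids)
        new_centroids = update_centroids(X, labels, k)
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans_pp(X, k)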

  2. Elbow Method for Optimal K: Determine the number of clusters by plotting the sum of squared errors (SSE) for different values of K and looking for an “elbow” point where the SSE reduction slows down.

import matplotlib.pyplot as plt

def calculate_sse(X, centroids, labels):
    # Sum of squared distances from each point to its assigned centroid
    sse = 0
    for i in range(len(centroids)):
        sse += np.sum((X[labels == i] - centroids[i])**2)
    return sse

# Run K-means for several candidate cluster counts and record the SSE
# (a separate loop variable keeps the earlier k = 3 intact)
sse_values = []
for n_clusters in range(1, 10):
    centroids_k, labels_k = kmeans(X, n_clusters)
    sse_values.append(calculate_sse(X, centroids_k, labels_k))

plt.plot(range(1, 10), sse_values)
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.title('Elbow Method')
plt.show()

  3. Mini-Batch K-means: For large datasets, Mini-Batch K-means processes small, random subsets (mini-batches) of the data, which significantly speeds up computation.

from sklearn.cluster import MiniBatchKMeans

# Fit on small random mini-batches instead of the full dataset at each step
mini_batch_kmeans = MiniBatchKMeans(n_clusters=k, batch_size=10)
mini_batch_kmeans.fit(X)
centroids = mini_batch_kmeans.cluster_centers_
labels = mini_batch_kmeans.labels_

Practical Implementations

  1. Image Compression: K-means clustering can compress images by reducing the number of colors. Each pixel is assigned to the nearest color centroid, and the image is reconstructed using these centroids.
  2. Customer Segmentation: In marketing, K-means clustering segments customers into distinct groups based on purchasing behavior, enabling targeted marketing strategies.
  3. Anomaly Detection: K-means can detect anomalies by identifying points that do not fit well into any cluster, which is useful in fraud detection and network security (see the sketch after this list).
  4. Document Clustering: In natural language processing, K-means clusters documents into topics, aiding in information retrieval and text analysis.
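
As a quick illustration of the anomaly-detection idea, the sketch below flags points that lie unusually far from their assigned centroid, reusing the kmeans function defined earlier; the 95th-percentile cutoff is an arbitrary choice for demonstration, not a recommended threshold:

centroids, labels = kmeans(X, k)
# Distance from each point to the centroid of its own cluster
dist_to_centroid = np.linalg.norm(X - centroids[labels], axis=1)
# Treat the farthest ~5% of points as candidate anomalies (illustrative cutoff)
threshold = np.percentile(dist_to_centroid, 95)
anomalies = X[dist_to_centroid > threshold]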

Conclusion

Implementing K-means clustering from scratch offers a deep understanding of the algorithm and its optimization techniques. By incorporating methods like K-means++ initialization and the elbow method, we can enhance clustering performance. K-means continues to find practical applications across various domains, making it a valuable tool in the data scientist’s toolkit.

Feel free to experiment with the provided code, tweak the parameters, and apply it to your own datasets. Happy clustering!

Written by Rahul Jain
Lead Data Scientist @ Rockwell Automation