Learn AI Series (#22) - K-Means Clustering - Finding Groups
What will I learn
- You will learn unsupervised learning -- discovering structure in data without any labels at all;
- the K-Means algorithm from scratch: assign points, compute centroids, repeat until convergence;
- what K-Means actually optimizes (within-cluster sum of squares) and why it converges to local minima;
- choosing K with the elbow method and silhouette scores;
- K-Means++ initialization and why the default random pick can go badly wrong;
- scaling features for distance-based algorithms (same lesson as episode #20, different context);
- the limitations of K-Means: when spherical cluster assumptions break down and what to watch for;
- a practical customer segmentation example using scikit-learn's Pipeline.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups (this post)
Learn AI Series (#22) - K-Means Clustering - Finding Groups
Welcome to what I'd call Arc 2 of this series. Everything we've built across episodes #1 through #21 was supervised learning -- we always had labels telling the model the right answer. "This apartment costs 350K." "This regime is bullish." "This data point belongs to class 1." We fed features AND answers into the model, and it learned the mapping between them. In the mini project last episode (#21) we even compared five different supervised algorithm families head-to-head on the same labeled dataset. That's the supervised paradigm in a nutshell.
But what if you DON'T have labels?
What if you have a database of 50,000 customers and you want to know: are there natural groups in here? Do some customers behave similarly to each other but differently from others? Nobody has gone through and tagged each customer as "Group A" or "Group B" -- that's the whole point. You want the algorithm to DISCOVER the structure. Which users belong together? Which products cluster naturally? Which sensor readings indicate the same type of event?
This is unsupervised learning, and clustering is its most intuitive form: grouping similar things together, without anyone telling you what the groups should be. Today we start with the simplest and most widely-used clustering algorithm: K-Means.
Let's dive right in.
A completely different question
In supervised learning, the question was always: "given these features, what's the correct label?" And we had a ground truth to check against. Accuracy, F1, R-squared -- every metric from episode #13 measures how well the model's answer matches reality.
In unsupervised learning, there IS no ground truth. There are no labels. The question changes to: "is there structure in this data, and can the algorithm find it?" That's a fundamentally harder question to answer, because you can't just compute "accuracy" when there's nothing to be accurate AGAINST. We'll need different evaluation tools -- and more importantly, different intuition about what "good" means.
K-Means answers the simplest version of this question: "given N data points, can you group them into K clusters where each point belongs to the cluster whose center is closest?" That's it. The whole algorithm fits in your head. But simple doesn't mean trivial -- there's surprising depth underneath.
The algorithm: assign, update, repeat
K-Means alternates between two steps:
- Assign each data point to the nearest centroid (cluster center)
- Update each centroid to the mean of its assigned points
Repeat until the assignments stop changing. That's the entire algorithm. Let me implement it from scratch so you can see every moving part:
import numpy as np
def kmeans(X, k, max_iters=100):
    n, d = X.shape
    # Initialize: pick k random data points as starting centroids
    idx = np.random.choice(n, k, replace=False)
    centroids = X[idx].copy()
    for iteration in range(max_iters):
        # ASSIGN: compute distance from every point to every centroid
        # X[:, None] is (n, 1, d), centroids[None, :] is (1, k, d)
        # Broadcasting gives (n, k, d) differences
        distances = np.linalg.norm(
            X[:, None] - centroids[None, :], axis=2
        )
        # For each point, which centroid is closest?
        labels = np.argmin(distances, axis=1)
        # UPDATE: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) for i in range(k)
        ])
        # Convergence check: did anything move?
        if np.allclose(centroids, new_centroids):
            print(f"Converged at iteration {iteration}")
            break
        centroids = new_centroids
    return labels, centroids
The distance computation is the heart of the algorithm. X[:, None] - centroids[None, :] uses NumPy broadcasting (remember episode #8, where we talked about how broadcasting generalizes element-wise operations across arrays of different shapes?) to compute the difference between every point and every centroid simultaneously. The result is an (n, k, d) array -- n points, k centroids, d dimensions. Taking the norm along axis=2 collapses the d-dimensional differences into a single distance number, giving us an (n, k) distance matrix: each point's distance to each centroid.
Then np.argmin(distances, axis=1) picks, for each row (each point), the column (centroid) with the smallest distance. Assignment done. Centroid update is just the mean of each group. Clean, fast, vectorized.
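If the shapes feel abstract, here's a tiny sanity check with made-up sizes (5 points, 3 centroids, 2 dimensions) that cross-checks the broadcast against an explicit double loop:

```python
import numpy as np

# Made-up sizes: 5 points, 3 centroids, 2 dimensions
rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))
cents = rng.normal(size=(3, 2))

diff = pts[:, None] - cents[None, :]
print(diff.shape)   # (5, 3, 2): every point minus every centroid

dists = np.linalg.norm(diff, axis=2)
print(dists.shape)  # (5, 3): one distance per point-centroid pair

# Cross-check against an explicit double loop
loop = np.array([[np.linalg.norm(p - c) for c in cents] for p in pts])
assert np.allclose(dists, loop)
```

Same numbers, no Python loops over the data: that's the whole point of the vectorized version.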
Let's test it:
np.random.seed(42)
# Three well-separated clusters in 2D
cluster_1 = np.random.randn(100, 2) + np.array([0, 0])
cluster_2 = np.random.randn(100, 2) + np.array([5, 5])
cluster_3 = np.random.randn(100, 2) + np.array([10, 0])
X = np.vstack([cluster_1, cluster_2, cluster_3])
labels, centroids = kmeans(X, k=3)
print(f"Centroid positions:\n{centroids}")
print(f"Cluster sizes: {[np.sum(labels == i) for i in range(3)]}")
On well-separated blobs like this, K-Means converges in a handful of iterations and nails the grouping. Every point ends up in the cluster it belongs to, the centroids sit right at the centers of the original blobs, and convergence takes maybe 5-8 iterations. Simple data, simple algorithm, clean result.
But the real world is rarely this clean. Let's look at what happens when things get more interesting.
What K-Means actually optimizes
Behind the simple assign-update loop, K-Means is minimizing a specific objective: the within-cluster sum of squares (WCSS), also called inertia. It's the total squared distance from each point to its assigned centroid:
WCSS = sum over all points i: ||x_i - centroid(cluster(i))||^2
Each iteration of K-Means is guaranteed to decrease (or maintain) this value. The assign step can't increase WCSS because each point moves to the CLOSEST centroid. The update step can't increase WCSS because the mean minimizes sum of squared distances within a group (that's a standard calculus result from episode #9). So WCSS goes down or stays the same at every step, and since there are only finitely many possible assignments, the algorithm must converge.
Here's the critical catch though: it converges to a local minimum, not necessarily the global one. Different random starting positions for the centroids can lead to different final clusterings with different WCSS values. K-Means finds a grouping that's locally optimal -- meaning you can't improve it by moving a single point -- but there might be a better grouping out there that the algorithm missed because it started in the wrong place.
# Same data, different random seeds -> different results
results = []
for seed in range(10):
    np.random.seed(seed)
    labels, centroids = kmeans(X, k=3)
    # Compute WCSS
    wcss = sum(
        np.sum((X[labels == i] - centroids[i])**2)
        for i in range(3)
    )
    results.append((seed, wcss))
    print(f"Seed {seed}: WCSS = {wcss:.1f}")
best = min(results, key=lambda r: r[1])
worst = max(results, key=lambda r: r[1])
print(f"\nBest: seed={best[0]}, WCSS={best[1]:.1f}")
print(f"Worst: seed={worst[0]}, WCSS={worst[1]:.1f}")
On our well-separated three-cluster data, most seeds will find the same (or nearly the same) global minimum because the clusters are obvious. But on messier data with overlapping groups, you'll see real variation between runs. The practical solution: run K-Means multiple times and keep the best result -- which is exactly what scikit-learn does by default (more on that below).
This is conceptually similar to how we discussed gradient descent converging to local minima in episode #7 -- the training loop for neural networks has the same problem, and the same solution applies: try multiple starting points.
Choosing K: the elbow method
The most awkward part of K-Means: you have to tell it how many clusters to find. But if you already knew that, you'd already understand quite a bit about your data's structure. In practice, K is often unknown, and choosing it is part of the analysis.
The elbow method plots WCSS against K and looks for the "elbow" -- the point where adding more clusters stops producing significant improvement:
from sklearn.cluster import KMeans
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)
print("K | WCSS | Improvement")
print("-" * 50)
for i, (k, inertia) in enumerate(zip(K_range, inertias)):
    if i == 0:
        improvement = "--"
    else:
        drop = inertias[i-1] - inertia
        improvement = f"-{drop:.0f} ({drop/inertias[i-1]*100:.0f}%)"
    bar = "#" * int(inertia / max(inertias) * 40)
    print(f"K={k:<2d} | {inertia:>10.1f} | {improvement:>20s} {bar}")
With three true clusters, you'll see a massive drop in WCSS from K=1 to K=2, another big drop from K=2 to K=3, and then the improvements become small and gradual from K=4 onward. That "elbow" at K=3 is where the marginal benefit of another cluster drops sharply. It's visual, it's intuitive, and it often works.
Having said that, on real data the elbow is frequently ambiguous. The curve bends gradually rather than sharply, and reasonable people can disagree about where the elbow is. It's a heuristic, not a definitive answer -- and you should always combine it with domain knowledge. If you're segmenting customers and the business team says "we need 4 marketing strategies," well, K=4 might be the right answer regardless of what the elbow plot shows ;-)
Silhouette score: measuring cluster quality
The silhouette score is a more principled approach to evaluating clustering quality. For each data point, it computes two things:
- a: the mean distance to other points in the same cluster (how tight is my cluster?)
- b: the mean distance to points in the nearest other cluster (how separated am I from the next closest group?)
The silhouette for that point is (b - a) / max(a, b), and it ranges from -1 to +1:
- Near +1: the point is well inside its cluster and far from other clusters. Good.
- Near 0: the point sits on the boundary between two clusters. Ambiguous.
- Negative: the point is closer to another cluster than its own. Probably misassigned.
The average silhouette across all points gives you a single number for the entire clustering. Higher is better.
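To ground the formula, here's a sketch that computes the silhouette of a single point by hand on two synthetic blobs (the two-cluster case, so "nearest other cluster" is unambiguous) and checks it against scikit-learn's silhouette_samples:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
# Two tight, well-separated blobs with known labels
X_two = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
labels_two = np.array([0] * 20 + [1] * 20)

i = 0  # silhouette of point 0, by hand
same = X_two[labels_two == labels_two[i]]
other = X_two[labels_two != labels_two[i]]
# a: mean distance to OTHER points in my cluster (exclude the point itself)
a = np.linalg.norm(same - X_two[i], axis=1).sum() / (len(same) - 1)
# b: mean distance to the nearest other cluster (only one other cluster here)
b = np.linalg.norm(other - X_two[i], axis=1).mean()
s = (b - a) / max(a, b)

assert np.isclose(s, silhouette_samples(X_two, labels_two)[i])
print(f"a={a:.3f}  b={b:.3f}  silhouette={s:.3f}")
```

With tight, far-apart blobs, a is small and b is large, so the score lands near +1 -- exactly the "well inside its cluster" case from the list above.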
from sklearn.metrics import silhouette_score, silhouette_samples
best_k = 2
best_sil = -1
print(f"{'K':>3s} {'Silhouette':>12s} {'Visual'}")
print("-" * 45)
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels = km.fit_predict(X)
    sil = silhouette_score(X, cluster_labels)
    bar = "#" * int(sil * 40)
    print(f"K={k} {sil:>12.3f} {bar}")
    if sil > best_sil:
        best_sil = sil
        best_k = k
print(f"\nBest K by silhouette: {best_k} (score={best_sil:.3f})")
On well-separated data, the silhouette peaks sharply at the true K. On messier data, it gives you a ranking: K=3 scores 0.72, K=4 scores 0.65, K=2 scores 0.58 -- so K=3 produces the best-quality clusters. It's more informative than the elbow method because it considers both cohesion (tightness within clusters) and separation (distance between clusters), not just total variance.
You can also look at individual silhouette values to identify problematic points -- the ones with negative scores that might be assigned to the wrong cluster. This is like the per-class analysis we did with classification reports in episode #13, but for unsupervised problems.
K-Means++ initialization: don't start stupid
Our from-scratch implementation picked initial centroids randomly from the data. This can go badly wrong. Imagine three clusters arranged in a line: cluster A on the left, cluster B in the middle, cluster C on the right. If two of your three random centroids land in cluster A and one in cluster C, then cluster B gets split between the two leftmost centroids. The algorithm converges, but to a bad solution.
K-Means++ (Arthur & Vassilvitskii, 2007) fixes this with smarter initialization:
- Pick the first centroid at random from the data points
- For each remaining centroid, choose a point with probability proportional to its squared distance from the nearest existing centroid
- Repeat until you have K centroids
The effect is that centroids are spread apart. Points far from any existing centroid have a higher chance of being chosen, so the initial centroids are likely to land in different clusters from the start. Let me show you the difference:
# Compare random init vs K-Means++ on a tricky dataset
np.random.seed(42)
X_tricky = np.vstack([
    np.random.randn(200, 2) * 0.5 + np.array([-3, 0]),
    np.random.randn(200, 2) * 0.5 + np.array([0, 0]),
    np.random.randn(200, 2) * 0.5 + np.array([3, 0]),
])
# Run 50 times with random init
random_wcss = []
for seed in range(50):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed)
    km.fit(X_tricky)
    random_wcss.append(km.inertia_)
# Run 50 times with K-Means++
kpp_wcss = []
for seed in range(50):
    km = KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=seed)
    km.fit(X_tricky)
    kpp_wcss.append(km.inertia_)
print("Random init:")
print(f"  Best WCSS:  {min(random_wcss):.1f}")
print(f"  Worst WCSS: {max(random_wcss):.1f}")
print(f"  Std dev:    {np.std(random_wcss):.1f}")
print(f"  Hit optimal (within 1%): "
      f"{sum(1 for w in random_wcss if w < min(random_wcss)*1.01)}/50")
print("\nK-Means++ init:")
print(f"  Best WCSS:  {min(kpp_wcss):.1f}")
print(f"  Worst WCSS: {max(kpp_wcss):.1f}")
print(f"  Std dev:    {np.std(kpp_wcss):.1f}")
print(f"  Hit optimal (within 1%): "
      f"{sum(1 for w in kpp_wcss if w < min(kpp_wcss)*1.01)}/50")
K-Means++ is dramatically more consistent. Random init might find the optimal solution in only 30-60% of runs; K-Means++ nails it in 90%+ of them. The theoretical guarantee is that K-Means++ seeding yields an expected WCSS within an O(log K) factor of the optimal clustering -- much better than the "no guarantee at all" you get from random initialization.
Scikit-learn uses K-Means++ by default (init='k-means++'), and you should never change this unless you have a very specific reason. Combined with n_init=10 (run 10 times, keep the best), sklearn's KMeans is quite robust out of the box.
Feature scaling: mandatory for distance-based methods
This should sound familiar by now. In episode #20 we saw that SVMs use distances (via the kernel), so feature scaling with StandardScaler was mandatory. K-Means has exactly the same requirement, for exactly the same reason: the algorithm computes Euclidean distances, and features with large ranges will dominate features with small ranges.
Imagine clustering customer data with two features: "annual income" (range: 20,000 to 200,000) and "age" (range: 18 to 80). Without scaling, the distance between two points is almost entirely determined by the income difference. A customer with income 100K and age 25 is "closer" to a customer with income 101K and age 75 than to one with income 120K and age 26. That's absurd -- but it's what Euclidean distance gives you when the features have wildly different scales.
The fix is identical to what we've been doing since episode #14: standardize features to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Simulated customer data: income (large range) and age (small range)
np.random.seed(42)
n_customers = 300
income = np.concatenate([
    np.random.normal(35000, 8000, 100),
    np.random.normal(75000, 12000, 100),
    np.random.normal(120000, 15000, 100),
])
age = np.concatenate([
    np.random.normal(28, 4, 100),
    np.random.normal(45, 6, 100),
    np.random.normal(60, 5, 100),
])
X_cust = np.column_stack([income, age])
# Without scaling: income dominates
km_raw = KMeans(n_clusters=3, n_init=10, random_state=42)
labels_raw = km_raw.fit_predict(X_cust)
sil_raw = silhouette_score(X_cust, labels_raw)
# With scaling: both features contribute equally
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=3, n_init=10, random_state=42))
])
labels_scaled = pipe.fit_predict(X_cust)
sil_scaled = silhouette_score(
    StandardScaler().fit_transform(X_cust), labels_scaled
)
print(f"Without scaling: silhouette = {sil_raw:.3f}")
print(f"With scaling: silhouette = {sil_scaled:.3f}")
print(f"\nCluster sizes (raw): {np.bincount(labels_raw)}")
print(f"Cluster sizes (scaled): {np.bincount(labels_scaled)}")
The scaled version should produce cleaner, more balanced clusters because both income AND age contribute to the distance calculation. Without scaling, age is basically invisible -- you're clustering on income alone. This is the same lesson from episodes #14, #16, and #20, but applied to an unsupervised context. Any time an algorithm computes distances, you need to scale first. Period.
K-Means with scikit-learn: the production version
In practice you'll use sklearn's implementation rather than the from-scratch version. It handles K-Means++ initialization, multiple restarts, convergence checking, and various edge cases (like empty clusters during iteration):
from sklearn.cluster import KMeans
km = KMeans(
    n_clusters=3,
    init='k-means++',  # default, don't change
    n_init=10,         # run 10 times, keep best
    max_iter=300,      # max iterations per run
    random_state=42
)
# fit_predict: fit the model AND return cluster labels in one call
labels = km.fit_predict(X)
print(f"Cluster centers:\n{km.cluster_centers_}")
print(f"Inertia (WCSS): {km.inertia_:.1f}")
print(f"Iterations used: {km.n_iter_}")
print(f"Cluster sizes: {np.bincount(labels)}")
Same fit/predict API from episode #16. Same Pipeline integration. You could swap KMeans into any pipeline where you had a supervised model -- the interface is consistent. The main difference is that fit doesn't take a y argument (there are no labels to learn from), and predict assigns new, unseen points to the nearest centroid.
That last part is worth emphasizing: once fitted, you can use km.predict(X_new) on new data to assign it to the existing clusters WITHOUT refitting. This is useful in production -- you cluster your historical customer base once, then assign each new customer to a segment as they sign up.
A practical example: customer segmentation
Let me tie things together with a more realistic example. Customer segmentation is one of the most common real-world applications of K-Means, and it illustrates both the power and the subtlety of unsupervised learning.
np.random.seed(42)
n = 500
# Generate synthetic customer data with 4 natural segments
# Segment A: young, low income, high engagement (students)
# Segment B: middle-aged, medium income, medium engagement (regulars)
# Segment C: middle-aged, high income, low engagement (affluent)
# Segment D: senior, medium income, high engagement (retirees)
segments = [
    (125, [25, 22000, 45, 12]),  # young students
    (150, [42, 55000, 22, 8]),   # working regulars
    (125, [48, 95000, 8, 3]),    # affluent infrequent
    (100, [65, 45000, 38, 15]),  # engaged retirees
]
data_blocks = []
for size, means in segments:
    block = np.column_stack([
        np.random.normal(means[0], 5, size),     # age
        np.random.normal(means[1], 8000, size),  # income
        np.random.normal(means[2], 8, size),     # monthly visits
        np.random.normal(means[3], 3, size),     # products purchased
    ])
    data_blocks.append(block)
X_seg = np.vstack(data_blocks)
feature_names = ['age', 'income', 'monthly_visits', 'products_bought']
print(f"Dataset: {X_seg.shape[0]} customers, {X_seg.shape[1]} features")
print("Feature ranges:")
for i, name in enumerate(feature_names):
    print(f"  {name:>20s}: {X_seg[:,i].min():.0f} - {X_seg[:,i].max():.0f}")
Now let's find the clusters. We don't "know" there are 4 segments -- that's what the algorithm needs to discover. We'll use the elbow method and silhouette scores to figure out K, then interpret whatever the model finds:
# Step 1: find K
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_seg)
scores = {}
print(f"{'K':>3s} {'WCSS':>10s} {'Silhouette':>12s}")
print("-" * 30)
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)
    scores[k] = (km.inertia_, sil)
    print(f"K={k} {km.inertia_:>10.1f} {sil:>12.3f}")
best_k = max(scores, key=lambda k: scores[k][1])
print(f"\nBest K by silhouette: {best_k}")
Once we've picked K, let's cluster and interpret:
# Step 2: cluster with best K, then interpret
km_final = KMeans(n_clusters=best_k, n_init=10, random_state=42)
final_labels = km_final.fit_predict(X_scaled)
# Translate centroids back to original scale for interpretation
centroids_original = scaler.inverse_transform(km_final.cluster_centers_)
print(f"\n{'Cluster':>9s}", end="")
for name in feature_names:
    print(f" {name:>16s}", end="")
print(f" {'Size':>6s}")
print("-" * 80)
for i in range(best_k):
    print(f"{'Cluster ' + str(i):>9s}", end="")
    for j in range(len(feature_names)):
        print(f" {centroids_original[i, j]:>16.1f}", end="")
    print(f" {np.sum(final_labels == i):>6d}")
The beauty of this approach is that the cluster centroids (translated back to the original feature scale) directly tell you what each segment looks like. "Cluster 0: average age 25, income 22K, 45 visits, 12 products" -- that's your young, engaged, low-income segment. You didn't label this. The algorithm discovered it from the data structure alone.
This is the fundamental appeal of unsupervised learning: it finds patterns you didn't know to look for. You might have hypothesized that income matters for segmentation, but maybe the algorithm reveals that monthly visits and products purchased create a more meaningful grouping than you expected. The data speaks for itself -- your job is to interpret what it says and decide if the discovered structure is actionable.
The limitations: when K-Means breaks down
K-Means makes strong assumptions about your data, and when those assumptions are violated, it fails predictably. Understanding these limitations is just as important as understanding the algorithm itself, because blindly applying K-Means to data that violates its assumptions gives you clusters that look reasonable in the output but are genuinely wrong.
Assumption 1: Clusters are spherical (or at least blob-shaped). K-Means uses Euclidean distance, which treats all directions equally. It finds round-ish groups. If your actual clusters are elongated ovals, crescent shapes, or concentric rings, K-Means will split them into spherical chunks rather than finding the natural structure. Remember the make_moons dataset from episode #20? Two interlocking crescents. K-Means would split those horizontally or vertically -- it CAN'T find the crescent shape because that's not what Euclidean distance measures.
Assumption 2: Clusters are roughly equal in size. K-Means tends to create balanced clusters even when the real groups are very different sizes. If you have 1,000 points in a big diffuse cluster and 50 points in a tiny tight cluster, K-Means will steal points from the big cluster to balance things out.
Assumption 3: Clusters are roughly equal in density. If one cluster is packed tight and another is spread out, the centroid of the sparse cluster gets pulled toward the dense one. The boundary ends up in the wrong place.
Assumption 4: You know K in advance. The elbow and silhouette methods help, but they're heuristics. On real data, the "right" number of clusters is often a judgment call that depends on your use case, not a mathematical truth the algorithm can reveal.
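Assumptions 2 and 3 are easy to see with a synthetic sketch (made-up blob sizes: one large diffuse group next to a small tight one):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# 1,000 points in a big diffuse blob, 50 in a small tight one nearby
big = rng.normal([0, 0], 2.0, (1000, 2))
small = rng.normal([5, 0], 0.3, (50, 2))
X_uneven = np.vstack([big, small])

km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels_uneven = km.fit_predict(X_uneven)
print(np.bincount(labels_uneven))
# The cluster containing the tight blob ends up far larger than 50:
# the boundary sits between the centroids and steals points from the big blob.
```

Instead of recovering the 1,000/50 split, K-Means drags a slice of the diffuse blob into the small cluster, because the boundary between two centroids is always the perpendicular midline, regardless of how many points or how much spread sits on each side.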
Let me demonstrate with a simple failure case:
from sklearn.datasets import make_moons
# Two crescent shapes -- K-Means can't handle this
X_moons, y_moons = make_moons(n_samples=300, noise=0.08, random_state=42)
km_moons = KMeans(n_clusters=2, n_init=10, random_state=42)
labels_moons = km_moons.fit_predict(X_moons)
# How well did K-Means do? Compare to true labels
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(y_moons, labels_moons)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"(1.0 = perfect match, 0.0 = random)")
print(f"\nK-Means splits the moons vertically/horizontally,")
print(f"not along the crescent boundary.")
The Adjusted Rand Index (ARI) measures how well two sets of labels agree, corrected for chance. An ARI of 1.0 means perfect agreement; 0.0 means the clustering is no better than random assignment. K-Means will score poorly on the moons dataset because it physically cannot discover non-convex cluster shapes.
These limitations are NOT flaws -- they're consequences of the mathematical assumptions that make K-Means fast and interpretable. For data that roughly matches the assumptions (blob-shaped, similarly-sized groups), K-Means is hard to beat. For more complex cluster geometries, other algorithms exist that make fewer assumptions at the cost of more computation. We'll explore those next.
Putting it all together: the full pipeline
Here's a complete, production-style K-Means pipeline using the same sklearn patterns we've been building since episode #16:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# The pipeline: scale, then cluster
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=4, n_init=10, random_state=42))
])
# Fit on our customer data
pipe.fit(X_seg)
labels = pipe.predict(X_seg)
# Evaluate
X_scaled = pipe.named_steps['scaler'].transform(X_seg)
sil = silhouette_score(X_scaled, labels)
inertia = pipe.named_steps['kmeans'].inertia_
print(f"Pipeline results:")
print(f" Clusters: {len(np.unique(labels))}")
print(f" Silhouette score: {sil:.3f}")
print(f" WCSS (inertia): {inertia:.1f}")
print(f" Cluster sizes: {np.bincount(labels)}")
# Use the fitted pipeline to assign NEW customers
new_customers = np.array([
    [30, 28000, 40, 10],  # looks like a young engaged customer
    [50, 100000, 5, 2],   # looks like an affluent infrequent one
    [70, 50000, 35, 14],  # looks like a retiree
])
new_labels = pipe.predict(new_customers)
print(f"\nNew customer assignments: {new_labels}")
The pipeline ensures that new data gets the same scaling transformation that was applied during fitting. This is the same pattern we used for StandardScaler with SVMs in episode #20 and with logistic regression in episode #16. Consistent, reproducible, no accidental data leakage from the scaler.
What K-Means taught us about unsupervised thinking
This episode marks a fundamental shift in how we think about machine learning. In supervised learning, success was clearly defined: does the model's prediction match the known answer? In unsupervised learning, we're asking whether the discovered structure is meaningful and useful -- and those are harder questions that often require domain knowledge to answer.
K-Means is the simplest entry point into this world. The algorithm is elegant, the implementation is straightforward, and the results are easy to interpret (just look at the cluster centroids). But its simplicity comes with real limitations -- the assumption of spherical, equal-sized clusters, the requirement to specify K in advance, and the sensitivity to initialization and scaling.
The evaluation tools we learned today -- WCSS/inertia, the elbow method, silhouette scores, Adjusted Rand Index -- will carry forward to every clustering method we encounter. They're the unsupervised equivalents of accuracy, F1, and R-squared from the supervised world.
And the idea of reducing the dimensionality of your data before clustering -- removing noisy features so the distance calculation reflects real structure -- connects to concepts we'll explore much more deeply in upcoming episodes. If you have 50 features and only 5 of them are relevant for grouping, the signal gets diluted in all that noise. There are systematic ways to identify and extract the most informative dimensions from high-dimensional data, and they pair beautifully with clustering.
Let's recap
We built our first unsupervised learning algorithm from scratch, and here's the full picture:
- K-Means groups data into K clusters by iteratively assigning points to the nearest centroid and recomputing centroids as the mean of each group. The algorithm always converges -- but to a local minimum, not necessarily the global one;
- The objective being minimized is WCSS (within-cluster sum of squares, a.k.a. inertia) -- the total squared distance from each point to its assigned centroid. Multiple restarts (sklearn's n_init=10 default) help find better local minima;
- Choose K using the elbow method (plot WCSS vs K, look for the bend) or silhouette scores (measure both cohesion and separation, pick the K with the highest average silhouette). Both are heuristics -- combine with domain knowledge;
- K-Means++ initialization spreads initial centroids apart using distance-weighted random selection, dramatically improving consistency over random initialization. Sklearn uses this by default;
- Feature scaling is mandatory -- K-Means uses Euclidean distance, so features with large ranges will dominate. Always put StandardScaler in a Pipeline before KMeans (same lesson as SVMs in episode #20);
- K-Means assumes spherical, equally-sized, equally-dense clusters. It fails predictably on crescent shapes, ring shapes, or clusters with very different sizes and densities;
- For blob-shaped data with roughly equal groups, K-Means is fast, simple, and hard to beat. For more complex cluster geometries, other algorithms exist that we'll explore next.