Learn AI Series (#23) - Advanced Clustering - Beyond K-Means

What will I learn

You will learn DBSCAN -- density-based clustering that finds arbitrarily shaped clusters and identifies noise points automatically;
hierarchical clustering -- building a tree of nested clusters with dendrograms, so you can see ALL possible groupings at once;
Gaussian Mixture Models -- soft clustering where points belong to multiple clusters with probabilities, not hard assignments;
HDBSCAN -- the algorithm that combines density-based and hierarchical approaches into what many practitioners consider the best general-purpose clustering method;
how to choose the right clustering algorithm for your data, and when K-Means is still the right answer;
cluster validation beyond silhouette scores -- stability analysis and why domain knowledge is the ultimate validator.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#23) - Advanced Clustering - Beyond K-Means

Last episode we built K-Means from scratch and used it to segment customers, choose K with the elbow method and silhouette scores, and understand what "unsupervised learning" actually means. And then I was honest with you about its limitations: K-Means assumes spherical clusters, requires you to specify K upfront, can't handle noise, and struggles when clusters have different sizes or densities. We even demonstrated the failure on the make_moons crescent dataset -- K-Means just cut them in half vertically, completely missing the actual structure.

So what do you do when your data doesn't cooperate with those assumptions?

Today we explore four algorithms that answer that question, each approaching the fundamental problem of "what is a cluster?" from a completely different angle. DBSCAN says clusters are dense regions. Hierarchical clustering says they're a tree of nested groups. Gaussian Mixture Models say they're overlapping probability distributions. And HDBSCAN says "all of the above, but let me figure it out automatically." By the end of this episode you'll have a full toolkit for clustering problems, and -- more importantly -- you'll know WHICH tool to reach for when.

Let's dive right in.

DBSCAN: density is the answer

K-Means defines clusters by their centers. A cluster is "all the points closest to this centroid." That's why it finds spherical blobs -- the centroid is equidistant from the cluster boundary in every direction, which is literally the definition of a sphere. If your data isn't blob-shaped, the centroid concept breaks down.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) throws away the centroid concept entirely. In stead, it defines clusters by density: a cluster is a dense region of points, separated from other dense regions by sparse areas. No centroids. No assumption about shape. If the dense region happens to be a crescent, a ring, or a wiggly snake, DBSCAN will find it -- because it never assumed the cluster had to be round.

The algorithm has two parameters:

eps: the radius of the neighborhood around each point
min_samples: the minimum number of points within that radius to count as a "core point"

DBSCAN classifies every data point into one of three categories:

Core points: have at least min_samples neighbors within distance eps. These are in the thick of a cluster.
Border points: are within eps of a core point, but don't have enough neighbors themselves. They're on the edge.
Noise points: aren't within eps of ANY core point. They don't belong to any cluster. They get label -1.

That last one is huge. K-Means forces every single point into a cluster, even if that point is clearly an outlier sitting in the middle of nowhere. DBSCAN says "nah, that's noise" and labels it accordingly. For any real-world dataset with messy outliers (which is... all of them), this is a massive advantage.

The algorithm works by expanding clusters from core points. Pick an unvisited core point. Add all its neighbors. For each neighbor that's also a core point, recursively add THEIR neighbors too. Keep expanding until you run out of core points to chain together. Everything connected through core point neighborhoods is one cluster. Then move on to the next unvisited core point and start a new cluster.

Let's see it handle the exact dataset that K-Means couldn't:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons, make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

# Two crescent shapes -- K-Means failed here in episode #22
X_moons, y_moons = make_moons(n_samples=300, noise=0.08, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X_moons)

n_clusters = len(set(labels) - {-1})
n_noise = np.sum(labels == -1)
ari = adjusted_rand_score(y_moons, labels)

print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
print(f"Cluster sizes: {[np.sum(labels == i) for i in range(n_clusters)]}")
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"(Recall: K-Means scored poorly on this same data)")

DBSCAN finds the two crescents perfectly. ARI close to 1.0. And it didn't need us to tell it there were two clusters -- it discovered that from the density structure alone. Compare that to K-Means on the same data from last episode, which scored poorly on ARI because it physically cannot find non-convex shapes. This is the kind of problem DBSCAN was designed for.

Choosing eps: the k-distance plot

The tricky part is picking good values for eps and min_samples. Too small an eps and real clusters get fragmented into tiny pieces. Too large and distinct clusters merge together. The k-distance plot is the standard approach for finding a good eps:

from sklearn.neighbors import NearestNeighbors

# For each point, find the distance to its k-th nearest neighbor
# (where k = min_samples)
k = 5
nn = NearestNeighbors(n_neighbors=k)
nn.fit(X_moons)
distances, _ = nn.kneighbors(X_moons)

# Sort the k-th neighbor distances in ascending order
k_distances = np.sort(distances[:, k-1])

print("k-distance plot (look for the 'knee'):")
print(f"{'Percentile':>12s}  {'Distance':>10s}")
print("-" * 26)
for pct in [10, 25, 50, 75, 90, 95, 99]:
    idx = int(len(k_distances) * pct / 100)
    print(f"{pct:>10d}%  {k_distances[idx]:>10.4f}")

print(f"\nSuggested eps range: "
      f"{k_distances[int(len(k_distances)*0.85)]:.3f} - "
      f"{k_distances[int(len(k_distances)*0.95)]:.3f}")

The idea is this: points inside a cluster will have small distances to their k-th neighbor (they're surrounded by other points). Noise points will have large distances (they're isolated). When you sort these distances, there's typically a sharp jump -- the knee -- where you transition from in-cluster points to noise points. That knee value is a good candidate for eps. It's a heuristic, not an exact science, but it works surprisingly well in practice.

For min_samples, a common rule of thumb is min_samples = 2 * n_features. For 2D data, that's 4 or 5. For high-dimensional data, you need larger values to avoid finding spurious clusters in sparse regions.

Where DBSCAN struggles

DBSCAN has one fundamental weakness: it uses a single global eps value. This means it assumes all clusters have roughly similar density. If you have a tight, compact cluster AND a diffuse, spread-out cluster in the same dataset, no single eps works for both. A small eps finds the tight cluster but shatters the diffuse one into fragments. A large eps captures the diffuse cluster but merges the tight one with whatever's nearby.

# Clusters with VERY different densities
np.random.seed(42)
X_mixed = np.vstack([
    np.random.randn(200, 2) * 0.3 + np.array([0, 0]),    # tight cluster
    np.random.randn(200, 2) * 2.0 + np.array([8, 8]),     # diffuse cluster
])

# Try small eps: finds tight, shatters diffuse
db_small = DBSCAN(eps=0.3, min_samples=5)
labels_small = db_small.fit_predict(X_mixed)
print(f"Small eps (0.3): {len(set(labels_small) - {-1})} clusters, "
      f"{np.sum(labels_small == -1)} noise")

# Try large eps: merges everything
db_large = DBSCAN(eps=2.0, min_samples=5)
labels_large = db_large.fit_predict(X_mixed)
print(f"Large eps (2.0): {len(set(labels_large) - {-1})} clusters, "
      f"{np.sum(labels_large == -1)} noise")

# No single eps works for both
db_mid = DBSCAN(eps=0.8, min_samples=5)
labels_mid = db_mid.fit_predict(X_mixed)
print(f"Mid eps   (0.8): {len(set(labels_mid) - {-1})} clusters, "
      f"{np.sum(labels_mid == -1)} noise")

There's no winning move here with standard DBSCAN. This varying-density problem is exactly what motivates HDBSCAN, which we'll get to shortly. But first, let's explore a completely different approach to clustering.

Hierarchical clustering: the tree of groups

What if you didn't have to choose a specific number of clusters at all? What if you could see ALL possible groupings simultaneously -- from "every point is its own cluster" all the way up to "everything is one big cluster" -- and then pick the level of granularity that makes sense for your problem?

That's what hierarchical clustering does. It builds a dendrogram -- a tree structure that shows how points group together at every level of similarity. You can "cut" the tree at any height to get as many or as few clusters as you want.

Agglomerative (bottom-up) hierarchical clustering is the most common variant. The algorithm is conceptually simple:

Start with every point as its own cluster (N clusters)
Find the two closest clusters
Merge them into one cluster (now N-1 clusters)
Repeat until everything is one cluster

The result is a complete merge history that you can inspect at any level:

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate blob data with 4 clear groups
X_blobs, y_blobs = make_blobs(
    n_samples=200, centers=4, cluster_std=0.8, random_state=42
)

# Compute the linkage matrix (full merge history)
Z = linkage(X_blobs, method='ward')

# Print the last merges -- large distances indicate natural boundaries
print("Last 8 merges (largest distances = natural cluster boundaries):")
print(f"{'Step':>6s}  {'Cluster A':>10s}  {'Cluster B':>10s}  "
      f"{'Distance':>10s}  {'New size':>10s}")
print("-" * 52)
for i in range(8):
    row = Z[-(i+1)]
    print(f"{len(Z)-i:>6d}  {int(row[0]):>10d}  {int(row[1]):>10d}  "
          f"{row[2]:>10.2f}  {int(row[3]):>10d}")

The key insight is in those merge distances. You'll see a few merges with VERY large distances compared to the others. Those are the jumps where the algorithm had to bridge a big gap to combine two well-separated groups. The number of large jumps tells you the natural number of clusters. If the last three merges have distances of 15.2, 14.8, and 13.9, but the one before that jumps to 5.1, there's a clear gap -- cutting the tree just above 5.1 gives you 4 clusters.

Linkage methods: how "distance between clusters" is defined

The whole algorithm hinges on one question: how do you define the distance between two clusters? Different answers give different results, and the choice matters more than you might think.

Single linkage: minimum distance between any two points in different clusters. Finds elongated, chain-like clusters but is very sensitive to noise (a single bridge point between two clusters will merge them prematurely).
Complete linkage: maximum distance between any two points. Creates compact, roughly spherical clusters. Robust to noise but can split elongated natural clusters.
Average linkage: mean distance between all pairs of points. A balanced compromise between single and complete.
Ward linkage: minimizes the increase in total within-cluster variance at each merge. This is closest in spirit to K-Means (it optimizes the same kind of objective) and is usually the best default for general use.

# Cut the tree at 4 clusters with Ward linkage
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels_hc = hc.fit_predict(X_blobs)

ari_hc = adjusted_rand_score(y_blobs, labels_hc)
print(f"Ward linkage, 4 clusters: ARI = {ari_hc:.3f}")
print(f"Cluster sizes: {[np.sum(labels_hc == i) for i in range(4)]}")

# Compare with single linkage
hc_single = AgglomerativeClustering(n_clusters=4, linkage='single')
labels_single = hc_single.fit_predict(X_blobs)
ari_single = adjusted_rand_score(y_blobs, labels_single)
print(f"Single linkage, 4 clusters: ARI = {ari_single:.3f}")

The dendrogram is the killer feature of hierarchical clustering. It shows you the ENTIRE clustering structure, not just one partition. You can visually inspect where the natural boundaries are, explore different granularities, and make an informed decision about how many clusters to use. That visual information is something K-Means and DBSCAN simply don't provide.

Having said that, hierarchical clustering has a serious scalability problem. Computing all pairwise distances requires O(n^2) memory, and the merge process itself is O(n^3) for some linkage methods. For a dataset of 10,000 points, that's manageable. For 100,000 points, you're looking at 10 billion distance computations. For a million points, forget it. K-Means handles a million points without breaking a sweat. Hierarchical clustering chokes. So think of it as a tool for moderate-sized datasets where the hierarchy itself is valuable -- taxonomy, organizational structure, biological classification -- not for large-scale problems.

Gaussian Mixture Models: soft clustering with probabilities

Every clustering method we've seen so far makes a hard assignment: each point belongs to exactly one cluster. Period. No ambiguity. But what happens with a point sitting right on the boundary between two clusters? Hard assignment forces a binary choice -- cluster 0 or cluster 1 -- even when the honest answer is "it could be either."

Gaussian Mixture Models (GMMs) solve this by providing soft assignments: each point gets a probability of belonging to each cluster. A point deep inside cluster 0 might have probabilities [0.98, 0.01, 0.01]. A boundary point might be [0.55, 0.40, 0.05]. That uncertainty information is genuinely useful, and it's lost in every hard clustering method.

A GMM models the data as a mixture of K Gaussian distributions (bell curves), each with its own mean (center), covariance (shape and orientation), and mixing weight (how big a share of the data it accounts for). The algorithm finds these parameters using Expectation-Maximization (EM) -- which is essentially a generalized version of K-Means that alternates between two steps:

E-step (Expectation): compute the probability that each point belongs to each Gaussian (soft assignment)
M-step (Maximization): update each Gaussian's parameters to best fit the points assigned to it (weighted by those probabilities)

If that sounds familiar, it should. K-Means does the same two steps -- assign points to the nearest centroid, recompute centroids. The difference is that K-Means uses hard assignments (each point fully belongs to one cluster) while GMM uses soft assignments (each point partially belongs to every cluster). K-Means is actually a special case of GMM where the covariances are all identical spheres and the assignments are forced to be 0 or 1.

from sklearn.mixture import GaussianMixture

# Same blob data
gm = GaussianMixture(
    n_components=4, covariance_type='full', random_state=42
)
gm.fit(X_blobs)

# Hard assignments (for comparison)
labels_gm = gm.predict(X_blobs)
ari_gm = adjusted_rand_score(y_blobs, labels_gm)

# Soft assignments: the actual probabilities
probs = gm.predict_proba(X_blobs)

# Find the most uncertain point
uncertainty = 1 - np.max(probs, axis=1)
most_uncertain_idx = np.argmax(uncertainty)
most_certain_idx = np.argmin(uncertainty)

print(f"GMM, 4 components: ARI = {ari_gm:.3f}")
print(f"\nMost uncertain point (index {most_uncertain_idx}):")
for i, p in enumerate(probs[most_uncertain_idx]):
    if p > 0.01:
        print(f"  Cluster {i}: {p:.1%}")

print(f"\nMost certain point (index {most_certain_idx}):")
for i, p in enumerate(probs[most_certain_idx]):
    if p > 0.01:
        print(f"  Cluster {i}: {p:.1%}")

print(f"\nUncertainty distribution:")
print(f"  Points with >95% confidence: "
      f"{np.sum(np.max(probs, axis=1) > 0.95)}/{len(probs)}")
print(f"  Points with <80% confidence: "
      f"{np.sum(np.max(probs, axis=1) < 0.80)}/{len(probs)}")

The soft assignments are practically useful in ways that go beyond just "nice to know." If you're segmenting customers and a customer has 60% probability of being in the "high-value" segment and 40% probability of being in the "growth potential" segment, you might want to target them with marketing for BOTH segments rather than picking one. That nuance is invisible to hard clustering methods.

GMM covariance types and model selection

The covariance_type parameter controls how flexible each cluster's shape can be:

'full': each cluster has its own full covariance matrix. Can model any elliptical shape at any angle. Most flexible but requires the most data to estimate reliably.
'diag': covariance matrices are diagonal (axis-aligned ellipses only). Less flexible but more stable with limited data.
'spherical': each cluster is a sphere (but different sizes allowed). This is close to K-Means.
'tied': all clusters share the same covariance matrix.

For choosing the number of components (K), GMMs offer something K-Means doesn't: principled model selection through information criteria. No elbow-staring required.

print(f"{'K':>3s}  {'AIC':>10s}  {'BIC':>10s}  {'Best?':>6s}")
print("-" * 36)

best_bic = float('inf')
best_k = 2

for n in range(2, 9):
    gm = GaussianMixture(n_components=n, random_state=42)
    gm.fit(X_blobs)
    aic = gm.aic(X_blobs)
    bic = gm.bic(X_blobs)
    marker = ""
    if bic < best_bic:
        best_bic = bic
        best_k = n
        marker = " <--"
    print(f"K={n}  {aic:>10.0f}  {bic:>10.0f}{marker}")

print(f"\nBest K by BIC: {best_k}")

AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) both balance how well the model fits the data against how many parameters it uses. Lower is better. BIC penalizes complexity more strongly than AIC, so it tends to pick simpler models -- fewer clusters. In practice, BIC is generally preferred for cluster selection because it's less prone to overfitting. This is conceptually similar to the regularization ideas we discussed in the supervised learning episodes -- you want the simplest model that explains the data well, not the most complex one that fits every wiggle.

HDBSCAN: the best of both worlds

Remember how DBSCAN struggled with clusters of varying density because it uses one fixed eps? HDBSCAN (Hierarchical DBSCAN) solves this elegantly. In stead of running DBSCAN at one eps value, it conceptually runs it at EVERY possible eps, builds a hierarchy from the results (like hierarchical clustering), and then extracts the most stable clusters from that hierarchy.

Think of it this way: as you gradually increase eps from 0 to infinity, dense clusters appear and persist for a long range of eps values. Noise points flicker in and out -- they briefly join a cluster at some eps and then get absorbed into a bigger cluster as eps grows. HDBSCAN identifies the clusters that were stable across the widest range of density thresholds. Stable = real. Flickering = noise.

from sklearn.cluster import HDBSCAN

# The mixed-density dataset that broke DBSCAN
hdb = HDBSCAN(min_cluster_size=15, min_samples=5)
labels_hdb = hdb.fit_predict(X_mixed)

n_clusters = len(set(labels_hdb) - {-1})
n_noise = np.sum(labels_hdb == -1)

print(f"HDBSCAN on mixed-density data:")
print(f"  Clusters found: {n_clusters}")
print(f"  Noise points: {n_noise}")
for i in range(n_clusters):
    size = np.sum(labels_hdb == i)
    print(f"  Cluster {i}: {size} points")

The key parameter is min_cluster_size -- the minimum number of points to form a cluster. This is much more intuitive than DBSCAN's eps because you're thinking about "how big does a group need to be to count?" rather than "what's the spatial scale of my clusters?" A marketing analyst knows "a customer segment should have at least 50 people to be actionable." They don't know the Euclidean distance threshold that separates their customer clusters.

HDBSCAN also provides cluster membership probabilities (similar to GMM's soft assignments):

# Membership strengths
print(f"\nMembership probability distribution:")
assigned = labels_hdb >= 0
if hasattr(hdb, 'probabilities_') and hdb.probabilities_ is not None:
    probs_hdb = hdb.probabilities_[assigned]
    print(f"  Mean probability (assigned points): {probs_hdb.mean():.3f}")
    print(f"  Points with prob > 0.9: "
          f"{np.sum(probs_hdb > 0.9)}/{len(probs_hdb)}")
    print(f"  Points with prob < 0.5: "
          f"{np.sum(probs_hdb < 0.5)}/{len(probs_hdb)}")

Points deep inside a cluster get high probability. Points on the fringes get lower probability. This gives you confidence information without the Gaussian assumption that GMMs make. You get the shape flexibility of DBSCAN plus the uncertainty quantification of GMMs plus the multi-scale analysis of hierarchical clustering. That's why HDBSCAN has become the go-to clustering algorithm for many practitioners in the last few years.

The main limitation: HDBSCAN is transductive. It can't predict cluster assignments for new, unseen data points. You cluster your existing data, and that's it. If new data arrives, you need to re-cluster everything. For production systems where you need to assign incoming data to existing clusters in real-time (like the customer segmentation pipeline we built in episode #22 with K-Means), HDBSCAN isn't the right choice. K-Means and GMMs can both do predict(X_new) on new data because they learn a model (centroids or Gaussians) that generalizes. HDBSCAN learns a partitioning of the specific dataset it saw.

The complete comparison: all five methods head-to-head

Let's put everything together and run all five clustering algorithms on the same datasets. This is the unsupervised equivalent of the model bake-off we did in episode #21 for supervised learning:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Dataset 1: well-separated blobs (easy)
X1, y1 = make_blobs(n_samples=300, centers=4,
                     cluster_std=0.8, random_state=42)

# Dataset 2: crescent moons (shape challenge)
X2, y2 = make_moons(n_samples=300, noise=0.08, random_state=42)

# Dataset 3: mixed density
np.random.seed(42)
X3 = np.vstack([
    np.random.randn(200, 2) * 0.3 + [0, 0],
    np.random.randn(200, 2) * 2.0 + [8, 8],
])
y3 = np.array([0]*200 + [1]*200)

datasets = [
    ("Blobs (easy)", X1, y1, 4),
    ("Moons (shape)", X2, y2, 2),
    ("Mixed density", X3, y3, 2),
]

for name, X, y_true, expected_k in datasets:
    print(f"\n{'='*50}")
    print(f"Dataset: {name}")
    print(f"{'='*50}")

    # K-Means
    km = KMeans(n_clusters=expected_k, n_init=10, random_state=42)
    l_km = km.fit_predict(X)

    # DBSCAN
    db = DBSCAN(eps=0.5, min_samples=5)
    l_db = db.fit_predict(X)

    # Agglomerative
    hc = AgglomerativeClustering(n_clusters=expected_k, linkage='ward')
    l_hc = hc.fit_predict(X)

    # GMM
    gm = GaussianMixture(n_components=expected_k, random_state=42)
    l_gm = gm.fit_predict(X)

    # HDBSCAN
    hdb = HDBSCAN(min_cluster_size=15)
    l_hdb = hdb.fit_predict(X)

    results = {
        'K-Means': l_km, 'DBSCAN': l_db,
        'Hierarchical': l_hc, 'GMM': l_gm, 'HDBSCAN': l_hdb
    }

    print(f"{'Method':>14s}  {'ARI':>6s}  {'Clusters':>8s}  {'Noise':>6s}")
    print("-" * 40)
    for method, labels in results.items():
        n_cl = len(set(labels) - {-1})
        n_noise = np.sum(labels == -1)
        ari = adjusted_rand_score(y_true, labels) if n_cl > 0 else 0.0
        print(f"{method:>14s}  {ari:>6.3f}  {n_cl:>8d}  {n_noise:>6d}")

You'll see a clear pattern in these results. On the blobs dataset (spherical, well-separated), pretty much everything works. K-Means nails it. GMM nails it. Hierarchical nails it. Even DBSCAN and HDBSCAN do well. When the data fits everyone's assumptions, all algorithms agree.

On the moons dataset, K-Means and GMM struggle (they can't find non-convex shapes), while DBSCAN and HDBSCAN score near perfect. Hierarchical clustering with single linkage would also work here (it can chain through elongated shapes), but Ward linkage acts more like K-Means.

On the mixed density dataset, DBSCAN struggles (single eps problem), HDBSCAN handles it cleanly, and the others depend on whether they can accommodate the density difference.

This comparison illustrates the fundamental lesson: there is no universally best clustering algorithm. Each one makes different assumptions, and the right choice depends on your data. Knowing their strengths and weaknesses is what makes you effective.

Validating clusters: are they real?

Evaluating clustering quality is harder than evaluating supervised models. In classification, we had ground truth labels and could compute accuracy, F1, ARI -- all the metrics from episode #13. In real unsupervised problems, there IS no ground truth. That's the whole point -- if we had labels, we wouldn't need clustering.

The silhouette score from episode #22 is a good starting point. It measures cohesion (how tight is each cluster?) versus separation (how far apart are clusters?), and it works for any algorithm. But it has blind spots -- it tends to favor convex clusters (same assumption as K-Means), so it can give misleading scores for DBSCAN-style non-convex clusters that are perfectly valid.

Stability analysis is a more robust approach. The idea: if your clusters are real, they should appear consistently when you perturb the data slightly. Run the algorithm multiple times on bootstrapped samples (random subsets with replacement) and check whether the same clusters keep emerging:

from sklearn.utils import resample

def clustering_stability(X, algorithm, n_runs=20):
    """Run clustering on bootstrapped samples and measure
    agreement between runs using ARI."""
    all_labels = []

    for i in range(n_runs):
        X_boot, idx = resample(X, return_indices=True,
                               random_state=i)
        boot_labels = algorithm.fit_predict(X_boot)

        # Map back to original indices
        full_labels = np.full(len(X), -1)
        full_labels[idx] = boot_labels
        all_labels.append(full_labels)

    # Compare every pair of runs
    ari_scores = []
    for i in range(n_runs):
        for j in range(i+1, n_runs):
            # Only compare points that appeared in both runs
            mask = (all_labels[i] >= 0) & (all_labels[j] >= 0)
            if mask.sum() > 10:
                ari = adjusted_rand_score(
                    all_labels[i][mask], all_labels[j][mask]
                )
                ari_scores.append(ari)

    return np.mean(ari_scores), np.std(ari_scores)

# Test stability on the blob data
mean_ari, std_ari = clustering_stability(
    X_blobs, KMeans(n_clusters=4, n_init=10, random_state=42)
)
print(f"K-Means stability on blobs: ARI = {mean_ari:.3f} +/- {std_ari:.3f}")

mean_ari, std_ari = clustering_stability(
    X_blobs, DBSCAN(eps=1.0, min_samples=5)
)
print(f"DBSCAN stability on blobs:  ARI = {mean_ari:.3f} +/- {std_ari:.3f}")

High stability (mean ARI near 1.0 with low variance) means the clusters are robust -- they're probably real structure in the data, not artifacts of the specific sample. Low stability means the clusters shift around every time you resample -- they might be noise that the algorithm is forcing into groups.

And the most important validator of all? Domain knowledge. Show the clusters to someone who understands the data. If a biologist looks at your gene expression clusters and says "yes, these correspond to known cell types," that beats any metric. If a marketing team looks at your customer segments and says "we can build a campaign around each of these," the clusters are useful -- which is ultimately what matters in unsupervised learning. Remember what I said in episode #22: in unsupervised learning, the question isn't "is the model accurate?" but "is the structure the model found meaningful and useful?" That's a harder question, and it requires human judgment to answer.

A practical decision guide

After covering five clustering algorithms across two episodes, here's the decision process I recommend. And I should note -- this is the kind of practical guidance you only develop after running these algorithms on a bunch of different real datasets and seeing what happens. The textbook descriptions make them all sound equally good ;-)

Start with K-Means if your clusters are roughly blob-shaped and you have a reasonable guess for K (or can estimate it with the elbow/silhouette methods from episode #22). It's fast, scalable, interpretable, and works in production with predict() on new data. If it gives you good silhouette scores and the results make domain sense, stop there. Don't over-complicate things.

Use DBSCAN when you expect arbitrary cluster shapes AND noise points, and when your clusters have roughly similar density. It's great for spatial data, anomaly detection (the noise points!), and problems where you genuinely don't know K. The eps parameter takes some tuning via k-distance plots but it's manageable.

Use hierarchical clustering when the hierarchy itself is the point -- when you want to see relationships at multiple levels of granularity. Taxonomy, organizational analysis, document clustering where you want both broad topics and specific sub-topics. Keep the dataset under a few thousand points. It pairs beautifully with dendrograms for visual exploration.

Use GMMs when you want probability assignments, when your clusters might be elliptical (not spherical), or when you need principled model selection via BIC. Also useful when you need a proper generative model -- GMMs can generate new synthetic data points from the learned distribution, which none of the other methods can do.

Use HDBSCAN as the robust default when you're not sure what to expect. It handles arbitrary shapes, varying densities, noise, and doesn't require specifying K or eps. The only real downsides are computational cost on very large datasets and the inability to predict on new data. If those aren't constraints, HDBSCAN is an excellent first choice.

What we covered today

This episode completes our clustering toolkit. Combined with episode #22, you now have five distinct approaches to finding structure in unlabeled data:

DBSCAN defines clusters by density -- regions of many closely packed points surrounded by sparse areas. It finds arbitrary shapes, automatically identifies noise, and doesn't require K. Weakness: struggles with clusters of varying density because it uses a single global eps value;
Hierarchical clustering builds a tree (dendrogram) showing how points merge at every level of similarity. You can cut the tree at any height to get any number of clusters. The dendrogram IS the result -- it shows all possible groupings. Weakness: O(n^2) memory, O(n^3) time for some linkage methods. Not practical for large datasets;
Gaussian Mixture Models provide soft clustering with probabilities, model elliptical cluster shapes through covariance matrices, and support principled model selection via AIC/BIC. Technically a special case of EM (Expectation-Maximization), which generalizes the K-Means assign-update loop to probabilistic assignments;
HDBSCAN runs DBSCAN at every possible density threshold, builds a hierarchy, and extracts the most stable clusters. Adapts to varying densities naturally. Provides membership probabilities. The best general-purpose algorithm for most situations -- but can't predict on new data;
No single algorithm is best. K-Means for blobs, DBSCAN for shaped clusters at uniform density, hierarchical for exploring structure, GMM for probabilities and ellipses, HDBSCAN when you don't know what to expect. The right tool depends on the data and the question.

Clustering is inherently harder to evaluate than supervised learning because there's no ground truth. Silhouette scores, stability analysis, and domain knowledge form a three-legged stool of validation -- and the domain knowledge leg is often the strongest. A cluster is "real" when it's useful and meaningful, not just when a metric says so.

We've now built a solid unsupervised learning foundation. We can find groups in data. The natural next question is: what if your data lives in 50 dimensions, and most of those dimensions are noise? Distances become meaningless in high-dimensional spaces (everything is far from everything else), and clustering algorithms that depend on distance start to choke. There are techniques for compressing high-dimensional data down to its essential structure -- keeping the signal, discarding the noise -- and they pair beautifully with the clustering methods we've just learned.

Bedankt en tot de volgende keer!

Hive account@scipio

Learn AI Series (#23) - Advanced Clustering - Beyond K-Means

Learn AI Series (#23) - Advanced Clustering - Beyond K-Means

What will I learn

Requirements

Difficulty

Curriculum (of the Learn AI Series):

Learn AI Series (#23) - Advanced Clustering - Beyond K-Means

DBSCAN: density is the answer

Choosing eps: the k-distance plot

Where DBSCAN struggles

Hierarchical clustering: the tree of groups

Linkage methods: how "distance between clusters" is defined

Gaussian Mixture Models: soft clustering with probabilities

GMM covariance types and model selection

HDBSCAN: the best of both worlds

The complete comparison: all five methods head-to-head

Validating clusters: are they real?

A practical decision guide

What we covered today

Bedankt en tot de volgende keer!

Curriculum (of the `Learn AI Series`):