Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
What will I learn
- why PCA fails on nonlinear data -- the Swiss roll problem and the limits of linear projections;
- t-SNE -- the algorithm that made high-dimensional visualization mainstream, how it works with probability distributions over neighbors, and why the t-distribution is the key innovation;
- how to read t-SNE plots honestly -- what's meaningful, what's an artifact, and the perplexity traps that fool beginners;
- UMAP -- the modern alternative that's faster, preserves more global structure, and can transform new data;
- a head-to-head comparison of PCA vs t-SNE vs UMAP on the handwritten digits dataset we used in episode #24;
- practical guidelines for when to use each method, and the embedding trap that catches people using these as feature extractors.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP (this post)
Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
At the end of episode #24 I flagged PCA's fundamental limitation: it's linear. It finds straight axes in feature space and projects your data onto them. For a lot of problems, that's exactly what you need -- PCA gave us clean visualization of the digits dataset, powerful denoising, and an excellent preprocessing step for distance-based algorithms like K-Means (episode #22) and SVMs (episode #20). But the moment your data lives on a curved surface in high-dimensional space, PCA's linear projections break down. Points that are far apart along the curve get squashed on top of each other in the projection. The geometry that actually matters -- the distances along the surface -- is exactly what PCA throws away.
Today we tackle two algorithms designed specifically for this problem: t-SNE and UMAP. Both are primarily visualization tools -- they take your 50- or 500-dimensional data and produce a 2D scatter plot where similar points cluster together and dissimilar points spread apart. They've become absolutely indispensable for exploring embeddings, verifying clustering results, and communicating structure in high-dimensional datasets to humans who can only see two dimensions at a time. If you've ever seen one of those beautiful scatter plots where handwritten digits form distinct colored clouds, or where word embeddings arrange themselves by meaning -- that was almost certainly t-SNE or UMAP.
Having said that, these algorithms come with serious gotchas that most tutorials gloss over. I'm going to teach you both the power AND the pitfalls, because a badly-read t-SNE plot is worse than no plot at all -- it gives you false confidence in structure that might not exist.
Let's dive right in.
The Swiss roll: where PCA falls apart
The classic demonstration of PCA's linearity problem is the Swiss roll -- a 2D surface that's been rolled up into a 3D spiral, like a sheet of paper curled into a tube. The points on this surface have a natural 2D structure (position along the strip), but that structure is embedded in 3D via a nonlinear transformation (the rolling).
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
# Generate a Swiss roll -- a 2D manifold curved through 3D space
X_roll, color = make_swiss_roll(n_samples=1500, noise=0.5, random_state=42)
print(f"Swiss roll shape: {X_roll.shape}")
print(f"Color range: {color.min():.1f} to {color.max():.1f}")
print(f"(color = position along the unrolled strip)")
# Try PCA: project 3D -> 2D
pca_2d = PCA(n_components=2)
X_pca = pca_2d.fit_transform(X_roll)
print(f"\nPCA explained variance: {pca_2d.explained_variance_ratio_.round(3)}")
print(f"PCA output shape: {X_pca.shape}")
PCA projects the Swiss roll onto a flat plane by finding the two directions of maximum variance. The problem? The two widest directions of the roll are NOT the directions that preserve the strip's structure. PCA sees the spiral from outside and squashes it flat, which means points that were far apart along the strip (different colors) end up overlapping in the projection. Imagine looking at a cinnamon roll from above -- you'd see all the layers stacked on top of each other, with no way to tell which part of the spiral a given point belongs to.
What would a good projection look like? It would "unroll" the spiral back into a flat strip, preserving the neighborhood relationships along the surface. Points with similar colors (nearby on the original strip) should stay close, and points with very different colors (far apart on the strip) should be far apart. That requires bending -- a fundamentally nonlinear operation that PCA cannot do.
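You can quantify how deceptive the rolled-up geometry is. In this sketch (mine -- the gap threshold of 6, roughly one winding, is an arbitrary choice), the color value returned by make_swiss_roll serves as the position along the unrolled strip:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.metrics import pairwise_distances

X_roll, color = make_swiss_roll(n_samples=1500, noise=0.5, random_state=42)

# Pairwise 3D (ambient) distances, and the gap along the unrolled strip
# (color is the spiral parameter, so |color_i - color_j| tracks how far
# apart two points are along the surface).
D = pairwise_distances(X_roll)
G = np.abs(color[:, None] - color[None, :])

# Among pairs that sit roughly a full winding apart on the strip (gap > 6),
# find how close the spiral brings them in 3D.
far_on_strip = G > 6
closest_3d = D[far_on_strip].min()
typical_3d = np.median(D[D > 0])

print(f"Closest 3D distance among far-on-the-strip pairs: {closest_3d:.2f}")
print(f"Typical 3D distance between random points: {typical_3d:.2f}")
# The spiral folds distant parts of the strip close together in 3D --
# exactly the relationships a linear projection cannot keep separated.
```

The closest far-on-the-strip pair ends up much nearer in 3D than a typical random pair, which is why any flat projection is forced to overlap them.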
Let me show you numerically how badly PCA mixes things up:
# Measure how well PCA preserves local neighborhoods
from sklearn.neighbors import NearestNeighbors
# In the original 3D space, find each point's 10 nearest neighbors
nn_3d = NearestNeighbors(n_neighbors=10)
nn_3d.fit(X_roll)
_, neighbors_3d = nn_3d.kneighbors(X_roll)
# In PCA's 2D projection, find each point's 10 nearest neighbors
nn_pca = NearestNeighbors(n_neighbors=10)
nn_pca.fit(X_pca)
_, neighbors_pca = nn_pca.kneighbors(X_pca)
# How many 3D neighbors are still neighbors in 2D?
overlap = 0
for i in range(len(X_roll)):
    shared = len(set(neighbors_3d[i]) & set(neighbors_pca[i]))
    overlap += shared
avg_overlap = overlap / len(X_roll) / 10
print(f"Neighborhood preservation (PCA): {avg_overlap:.1%}")
print(f"(100% = all 3D neighbors are still neighbors in 2D)")
print(f"(10% = random -- neighbors are completely scrambled)")
PCA will preserve maybe 40-60% of neighborhoods on the Swiss roll. That sounds acceptable until you realize that for a visualization tool, scrambling 40-60% of your local relationships means the plot is actively misleading -- points that appear close are often NOT actually similar. On data with more complex nonlinear structure (and real datasets are almost always more complex than a clean Swiss roll), PCA's neighborhood preservation gets even worse.
t-SNE: probability distributions over neighbors
t-SNE (t-distributed Stochastic Neighbor Embedding, van der Maaten & Hinton, 2008) approaches the problem from a completely different angle than PCA. Instead of preserving global variance, it focuses on preserving local neighborhoods -- making sure that points which are close in the high-dimensional space remain close in the 2D embedding.
The algorithm works in two distinct phases. Understanding both is important because they explain both t-SNE's power and its quirks.
Phase 1: model the high-dimensional neighborhoods
For every pair of points in the original space, t-SNE computes a probability that measures how likely they are to be neighbors. Specifically, for each point i, it places a Gaussian (bell curve) centered on i with some width sigma_i. Points close to i get high probability under this Gaussian; distant points get low probability. The width sigma_i is set automatically so that the effective number of neighbors (measured by a quantity called the perplexity) matches a target value you specify.
Think of it like this: each point has a "field of view" determined by the perplexity. Low perplexity (say 5) means each point only cares about its 5 closest neighbors. High perplexity (50) means each point considers a broader neighborhood. The perplexity doesn't directly set the number of neighbors -- it's a smooth continuous version of that idea using information theory (related to the entropy concepts from episode #9).
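The perplexity-entropy link is easy to see numerically. Here's a toy sketch (my own helper, not scikit-learn's internals) that builds a point's Gaussian neighbor distribution for several widths and reports 2^entropy -- the perplexity:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))  # toy high-dimensional data

def perplexity_of_point(X, i, sigma):
    """Perplexity of point i's Gaussian neighbor distribution: 2**entropy."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)
    p = np.exp(-d2 / (2 * sigma ** 2))
    p[i] = 0.0                      # a point is not its own neighbor
    p /= p.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return 2 ** entropy

# A wider Gaussian spreads probability over more neighbors, so the
# perplexity (the effective neighbor count) rises with sigma.
for sigma in [0.5, 1.0, 2.0, 4.0]:
    print(f"sigma = {sigma:>3.1f} -> perplexity = "
          f"{perplexity_of_point(X, 0, sigma):6.1f}")
```

t-SNE works this relationship backwards: you pick the perplexity, and it binary-searches a sigma_i for every point until the perplexity matches.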
Phase 2: find a 2D layout that matches
Now t-SNE creates a similar set of pair probabilities in 2D, but using a Student t-distribution (with one degree of freedom, so it's actually a Cauchy distribution) instead of a Gaussian. It then uses gradient descent to minimize the KL divergence between the high-dimensional probability distribution and the 2D one. KL divergence (Kullback-Leibler divergence) measures how different two probability distributions are -- it's zero when they match perfectly and positive otherwise.
Why the t-distribution instead of a Gaussian in 2D? This is the key innovation that makes the whole thing work. Gaussians fall off quickly -- in 2D, you only have a flat plane to work with, and if you tried to preserve all the neighborhood relationships from high-D space using Gaussians, everything would get crushed into the center. The t-distribution has heavier tails, meaning distant points in high-D are allowed to be VERY far apart in 2D without paying a big cost in the objective function. This prevents the crowding problem -- a fundamental geometric issue where you're trying to embed a high-dimensional neighborhood structure into a lower-dimensional space that simply doesn't have enough room for everything.
from sklearn.manifold import TSNE
# t-SNE on the Swiss roll
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
    max_iter=1000  # this parameter was called n_iter in scikit-learn < 1.5
)
X_tsne = tsne.fit_transform(X_roll)
# Measure neighborhood preservation (same test as PCA)
nn_tsne = NearestNeighbors(n_neighbors=10)
nn_tsne.fit(X_tsne)
_, neighbors_tsne = nn_tsne.kneighbors(X_tsne)
overlap_tsne = 0
for i in range(len(X_roll)):
    shared = len(set(neighbors_3d[i]) & set(neighbors_tsne[i]))
    overlap_tsne += shared
avg_tsne = overlap_tsne / len(X_roll) / 10
print(f"Neighborhood preservation (t-SNE): {avg_tsne:.1%}")
print(f"Neighborhood preservation (PCA): {avg_overlap:.1%}")
print(f"\nt-SNE output shape: {X_tsne.shape}")
print(f"t-SNE KL divergence: {tsne.kl_divergence_:.4f}")
t-SNE should preserve significantly more neighborhoods than PCA on the Swiss roll. Where PCA got 40-60%, t-SNE typically gets 70-85%+. The improvement comes from allowing the embedding to bend and fold -- something PCA's linear projection fundamentally cannot do.
Reading t-SNE plots: the rules (CRITICAL)
t-SNE produces gorgeous visualizations that practically beg to be over-interpreted. I've seen people draw all kinds of conclusions from t-SNE plots that the algorithm simply does not support. So here are the rules for honest interpretation. Burn these into your brain before you ever show a t-SNE plot to a colleague.
Rule 1: Cluster existence is meaningful. If t-SNE shows distinct, well-separated groups, those groups almost certainly exist in the original data. The algorithm preserves local structure, so tight clusters in the plot correspond to tight clusters in high-dimensional space. This is the ONE thing you can trust.
Rule 2: Distances BETWEEN clusters are NOT meaningful. Two clusters that are far apart in the t-SNE plot might not be more different than two clusters that are close together. The relative positions of clusters are an artifact of the optimization process -- they depend on the random initialization and the gradient descent path, not on the actual inter-cluster distances in the original space. Do NOT say "cluster A and cluster B are similar because they're close in the t-SNE plot."
Rule 3: Cluster SIZES are NOT meaningful. t-SNE can inflate tight clusters and compress diffuse ones. A large cloud in the plot isn't necessarily a big group in the original space, and a tiny dot isn't necessarily a small group. The algorithm adjusts point density to match the target perplexity, which distorts the visual scale.
Rule 4: Cluster SHAPES are NOT meaningful. Don't interpret the elongation, curvature, or internal structure of clusters. t-SNE can produce all sorts of shapes depending on the perplexity, number of iterations, and random seed.
Let me demonstrate why perplexity matters so much:
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.cluster import DBSCAN

digits = load_digits()
X_dig, y_dig = digits.data, digits.target

# Same data, different perplexity values
for perp in [5, 15, 30, 50, 100]:
    tsne_p = TSNE(
        n_components=2,
        perplexity=perp,
        random_state=42,
        max_iter=1000  # called n_iter in scikit-learn < 1.5
    )
    X_emb = tsne_p.fit_transform(X_dig)
    # Measure separation: how well can KNN classify in 2D?
    knn = KNeighborsClassifier(n_neighbors=5)
    acc = cross_val_score(knn, X_emb, y_dig, cv=5).mean()
    # Count visible clusters: how many tight groups does DBSCAN see?
    db = DBSCAN(eps=3.0, min_samples=10)
    db_labels = db.fit_predict(X_emb)
    n_clusters = len(set(db_labels) - {-1})
    print(f"Perplexity {perp:>3d}: KNN acc = {acc:.1%}, "
          f"visible clusters = {n_clusters}, "
          f"KL div = {tsne_p.kl_divergence_:.3f}")
Low perplexity (5) creates many small, fragmented clusters -- it's looking at such a tight neighborhood that it tears apart cohesive groups. High perplexity (100+) shows more global structure but can merge distinct groups into blobs. The "right" perplexity depends on your data and what you want to see. The standard advice is to try several values (15, 30, 50 are good starting points) and compare. If your conclusion changes dramatically with perplexity, it's probably not a robust conclusion.
Rule 5: The result is non-deterministic. Different random seeds produce different-looking plots. If a particular structure only appears with one specific seed, it's probably noise, not signal. Run it 3-4 times and trust only the patterns that are consistent across runs.
Rule 6: t-SNE cannot transform new data. Once you've computed an embedding for 1,000 points, you can't add point 1,001 without rerunning the entire computation. t-SNE doesn't learn a mapping -- it finds a specific arrangement for a specific dataset. This is the same transductive limitation we saw with HDBSCAN in episode #23. For production systems where new data arrives continuously, this is a real problem.
UMAP: the modern default
UMAP (Uniform Manifold Approximation and Projection, McInnes et al., 2018) has rapidly become the go-to alternative to t-SNE. It produces visually similar embeddings but with several practical advantages that matter a lot in day-to-day work.
The mathematical foundation is different -- UMAP comes from algebraic topology and category theory, specifically the theory of fuzzy simplicial sets (which... yeah, that sounds terrifying, but you absolutely do not need to understand the theory to use UMAP effectively). In practice, the algorithm works like this:
- Build a weighted nearest-neighbor graph in high-dimensional space. Each point connects to its K nearest neighbors, with edge weights that decay with distance.
- Find a low-dimensional layout that preserves the structure of this graph as closely as possible, using stochastic gradient descent on a cross-entropy loss.
The conceptual model is closer to graph layout algorithms than to probability matching (which is what t-SNE does). This is part of why UMAP tends to preserve global structure better than t-SNE -- it's explicitly modeling the connectivity of the data, not just pairwise probabilities.
import umap
# UMAP on the Swiss roll
reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    random_state=42
)
X_umap = reducer.fit_transform(X_roll)
# Neighborhood preservation
nn_umap = NearestNeighbors(n_neighbors=10)
nn_umap.fit(X_umap)
_, neighbors_umap = nn_umap.kneighbors(X_umap)
overlap_umap = 0
for i in range(len(X_roll)):
    shared = len(set(neighbors_3d[i]) & set(neighbors_umap[i]))
    overlap_umap += shared
avg_umap = overlap_umap / len(X_roll) / 10
print(f"Neighborhood preservation comparison:")
print(f" PCA: {avg_overlap:.1%}")
print(f" t-SNE: {avg_tsne:.1%}")
print(f" UMAP: {avg_umap:.1%}")
UMAP's key parameters
n_neighbors is the UMAP equivalent of t-SNE's perplexity. It controls the balance between local and global structure:
- Low values (5-15): fine-grained local structure, more fragmented clusters, more detail
- High values (50-200): smoother global patterns, broader groupings, less local detail
min_dist controls how tightly UMAP is allowed to pack points together:
- Low values (0.0-0.1): tight, distinct clusters with clear boundaries
- High values (0.3-0.8): more uniform distribution, softer cluster boundaries, better for seeing the overall topology
Let me show you how these parameters interact:
# Parameter exploration on the digits dataset
print(f"{'n_neighbors':>12s} {'min_dist':>9s} {'KNN acc':>8s}")
print("-" * 35)
for n_neigh in [5, 15, 50, 100]:
    for md in [0.0, 0.1, 0.5]:
        red = umap.UMAP(
            n_neighbors=n_neigh,
            min_dist=md,
            random_state=42
        )
        X_emb = red.fit_transform(X_dig)
        knn = KNeighborsClassifier(n_neighbors=5)
        acc = cross_val_score(knn, X_emb, y_dig, cv=5).mean()
        print(f"{n_neigh:>12d} {md:>9.1f} {acc:>7.1%}")
You'll notice that UMAP is fairly robust across parameter choices -- the accuracy doesn't swing wildly like t-SNE with different perplexity values. That stability is one of its biggest practical advantages. You don't need to spend an hour tuning parameters to get a useful visualization.
Why UMAP wins on speed
Here's where UMAP really pulls ahead. t-SNE computes pairwise affinities between ALL points -- that's O(n^2) in memory and compute. On 10,000 points it's fine. On 100,000 points it starts to hurt. On a million points, you're looking at terabytes of pairwise distances and hours of gradient descent.
UMAP only needs nearest-neighbor computations (via NN-descent through the pynndescent library -- approximate but fast), and its optimization runs on the sparse graph instead of the dense pairwise matrix. The practical difference is massive:
import time
# Speed comparison on increasing dataset sizes
from sklearn.datasets import make_blobs
for n_samples in [1000, 3000, 5000]:
    X_speed, _ = make_blobs(
        n_samples=n_samples, n_features=50,
        centers=10, random_state=42
    )
    # t-SNE
    start = time.time()
    TSNE(n_components=2, perplexity=30,
         random_state=42).fit_transform(X_speed)
    t_tsne = time.time() - start
    # UMAP
    start = time.time()
    umap.UMAP(n_neighbors=15,
              random_state=42).fit_transform(X_speed)
    t_umap = time.time() - start
    print(f"n={n_samples:>5d}: t-SNE = {t_tsne:>6.1f}s, "
          f"UMAP = {t_umap:>6.1f}s, "
          f"speedup = {t_tsne/t_umap:.1f}x")
On moderate datasets (a few thousand points), UMAP is typically 3-5x faster. On larger datasets (50K+), the speedup can be 10-50x. And the gap grows with data size because t-SNE's O(n^2) scaling hits harder and harder while UMAP's approximate nearest-neighbor approach scales much more gracefully.
UMAP can transform new data
This is the killer feature that t-SNE doesn't have. Once you've fitted a UMAP model, you can use it to project new, unseen data points into the same embedding space:
# Fit UMAP on training data, transform test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_dig, y_dig, test_size=0.2, random_state=42
)
# Fit on training set only
reducer_fit = umap.UMAP(n_neighbors=15, random_state=42)
X_train_2d = reducer_fit.fit_transform(X_train)
# Transform test set (WITHOUT refitting!)
X_test_2d = reducer_fit.transform(X_test)
print(f"Training set embedded: {X_train_2d.shape}")
print(f"Test set embedded: {X_test_2d.shape}")
# Are the test points landing in the right clusters?
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_2d, y_train)
test_acc = knn.score(X_test_2d, y_test)
print(f"KNN accuracy on UMAP-transformed test set: {test_acc:.1%}")
This makes UMAP viable for production workflows where you need to visualize incoming data against an existing embedding, or where you're using the low-dimensional representation as input to a downstream model. t-SNE can't do this at all -- every new point requires rerunning the whole optimization from scratch. Remember the same distinction from episode #23: K-Means and GMMs can predict() on new data, but HDBSCAN is transductive. UMAP is inductive (like K-Means); t-SNE is transductive (like HDBSCAN).
The big comparison: PCA vs t-SNE vs UMAP
Let's put all three methods head-to-head on the digits dataset -- the same 8x8 pixel images of handwritten digits that we visualized with PCA in episode #24. We know the ground truth labels (10 digit classes), so we can objectively measure how well each method separates them:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import silhouette_score
# Standardize first (as we learned in episode #24 -- always scale before PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_dig)
# PCA
X_pca_2d = PCA(n_components=2).fit_transform(X_scaled)
# t-SNE
X_tsne_2d = TSNE(
    n_components=2, perplexity=30, random_state=42
).fit_transform(X_scaled)
# UMAP
X_umap_2d = umap.UMAP(
    n_neighbors=15, min_dist=0.1, random_state=42
).fit_transform(X_scaled)
# Compare on three metrics
print(f"{'Method':>8s} {'KNN acc':>8s} {'Silhouette':>11s} "
      f"{'Var explained':>14s}")
print("-" * 50)
for name, X_2d in [('PCA', X_pca_2d),
                   ('t-SNE', X_tsne_2d),
                   ('UMAP', X_umap_2d)]:
    knn = KNeighborsClassifier(n_neighbors=5)
    acc = cross_val_score(knn, X_2d, y_dig, cv=5).mean()
    sil = silhouette_score(X_2d, y_dig)
    if name == 'PCA':
        var_exp = PCA(n_components=2).fit(
            X_scaled).explained_variance_ratio_.sum()
        var_str = f"{var_exp:.1%}"
    else:
        var_str = "N/A"
    print(f"{name:>8s} {acc:>7.1%} {sil:>11.3f} {var_str:>14s}")
The results tell a clear story. PCA preserves global structure and gives you explained variance metrics, but its 2D projection overlaps many digit classes because the linear projection can't capture the nonlinear manifold structure of handwritten digits. t-SNE creates dramatically better-separated clusters -- you'll see each digit class form its own tight island, with KNN accuracy jumping from somewhere in the 60-75% range (PCA) to 95%+ (t-SNE). UMAP performs similarly to t-SNE on separation quality, often with slightly better inter-cluster spacing.
The key insight is that digits that look "similar" to humans (3 and 8 share curved strokes, 7 and 1 are both straight-ish) are close together in ALL three embeddings. But PCA overlaps them because the differences are nonlinear, while t-SNE and UMAP can pull them apart by following the manifold structure.
Combining with clustering: the practical workflow
These visualization methods pair beautifully with the clustering techniques from episodes #22 and #23. The standard exploratory workflow looks like this:
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score
# Step 1: UMAP embedding (fast, good global structure)
embedding = umap.UMAP(
    n_neighbors=15, min_dist=0.1, random_state=42
).fit_transform(X_scaled)
# Step 2: Cluster in the embedding space
# (DBSCAN works great on UMAP output because
# UMAP creates well-separated dense clusters)
db = DBSCAN(eps=0.5, min_samples=10)
cluster_labels = db.fit_predict(embedding)
n_found = len(set(cluster_labels) - {-1})
n_noise = np.sum(cluster_labels == -1)
ari = adjusted_rand_score(y_dig, cluster_labels)
print(f"UMAP + DBSCAN pipeline:")
print(f" Clusters found: {n_found}")
print(f" Noise points: {n_noise}")
print(f" Adjusted Rand Index: {ari:.3f}")
# Compare: K-Means on the UMAP embedding
km = KMeans(n_clusters=10, n_init=10, random_state=42)
km_labels = km.fit_predict(embedding)
ari_km = adjusted_rand_score(y_dig, km_labels)
print(f"\nUMAP + K-Means:")
print(f" ARI: {ari_km:.3f}")
# Compare: K-Means directly on full 64D data
km_full = KMeans(n_clusters=10, n_init=10, random_state=42)
km_full_labels = km_full.fit_predict(X_scaled)
ari_full = adjusted_rand_score(y_dig, km_full_labels)
print(f"\nK-Means on full 64D (no reduction):")
print(f" ARI: {ari_full:.3f}")
DBSCAN on UMAP output works surprisingly well because UMAP creates exactly the kind of structure DBSCAN is designed for -- dense, well-separated clusters with clear gaps between them. Remember from episode #23 how DBSCAN defines clusters as dense regions separated by sparse areas? UMAP's embeddings naturally have that property. The combination of UMAP's nonlinear embedding with DBSCAN's density-based clustering is one of the most powerful unsupervised analysis pipelines in modern practice.
Having said that, a caution: clustering on t-SNE or UMAP output is an exploratory tool, NOT a rigorous analysis method. The embedding introduces distortions (remember Rules 2-4 about t-SNE plots), and those distortions can create artificial cluster boundaries. Always validate the discovered clusters by going back to the original high-dimensional data and checking whether the groups make sense there too.
The embedding trap: don't use these as features
A word of caution that trips up more people than I'd like to admit. t-SNE and UMAP produce beautiful 2D representations that visually separate your classes. So why not use those 2D coordinates as features for a downstream classifier?
Because it's a trap. Here's why:
# The embedding trap -- don't do this
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
# "Good" pipeline: SVM on PCA features
pipe_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=20)),
    ('svm', SVC(kernel='rbf', random_state=42))
])
acc_pca = cross_val_score(pipe_pca, X_dig, y_dig, cv=5).mean()
# "Tempting" approach: SVM on UMAP features
# WARNING: this is methodologically flawed!
X_umap_all = umap.UMAP(
    n_neighbors=15, random_state=42
).fit_transform(X_scaled)
acc_umap = cross_val_score(
    SVC(kernel='rbf', random_state=42),
    X_umap_all, y_dig, cv=5
).mean()
# Correct comparison: UMAP fitted on EACH fold's training set
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_accs = []
for train_idx, test_idx in skf.split(X_scaled, y_dig):
    reducer = umap.UMAP(n_neighbors=15, random_state=42)
    X_train_emb = reducer.fit_transform(X_scaled[train_idx])
    X_test_emb = reducer.transform(X_scaled[test_idx])
    svm = SVC(kernel='rbf', random_state=42)
    svm.fit(X_train_emb, y_dig[train_idx])
    fold_accs.append(svm.score(X_test_emb, y_dig[test_idx]))
acc_umap_proper = np.mean(fold_accs)
print(f"PCA (20 comp) + SVM: {acc_pca:.1%}")
print(f"UMAP (leaky) + SVM: {acc_umap:.1%}")
print(f"UMAP (proper folds) + SVM: {acc_umap_proper:.1%}")
The "leaky" UMAP approach fits the embedding on ALL data (including the test fold), then evaluates with cross-validation. That's data leakage -- the UMAP embedding already "saw" the test points and arranged them nicely. The proper approach fits UMAP only on each training fold and uses transform on the test fold. The difference in accuracy can be substantial, and it shows how much of the "improvement" from UMAP features was actually just information leakage from the test set.
More fundamentally, these embeddings are optimized for visual separation, not predictive power. They can create structure that doesn't exist in the original data. Random noise projected through t-SNE can look like it has clusters. For actual ML pipelines, stick with PCA or (later in this series) autoencoders as preprocessing. Use t-SNE and UMAP for what they're designed for: exploration and visualization.
Installing umap-learn
One practical note: UMAP isn't included in scikit-learn. You need to install it separately:
# Install (run in your terminal, not in Python)
# pip install umap-learn
# Note: the package is "umap-learn" but you import it as "umap"
import umap
print(f"UMAP version: {umap.__version__}")
The package name on PyPI is umap-learn (not umap -- there's a different, unrelated package called umap that you do NOT want). You import it as just umap. The library depends on numba for JIT compilation, which sometimes makes the first run slow (compilation overhead) but subsequent runs fast. If you get installation issues with numba, try pip install numba separately first.
When to use which: the practical decision guide
After covering three different dimensionality reduction approaches across two episodes, here's my practical recommendation. And like the clustering decision guide from episode #23, this comes from running these on many different datasets, not just textbook examples ;-)
Use PCA when you need:
- A fast, deterministic baseline that always gives the same result
- Preprocessing before ML models (especially distance-based ones like SVMs and K-Means)
- To understand how much information each component carries (explained variance)
- A method that can project new data without refitting
- A setting where linear relationships are sufficient for your purposes
Use t-SNE when you need:
- The absolute best local neighborhood preservation for visualization
- Publication-quality plots for papers or presentations where cluster structure matters
- A moderate dataset size (under 30,000-50,000 points -- above that, it gets painfully slow)
- A workflow where you can afford to try multiple perplexity values and random seeds
Use UMAP when you need:
- Fast visualization of large datasets (50K+ points, no problem)
- Global structure preservation alongside local structure
- The ability to transform new data into the same embedding
- A general-purpose nonlinear embedding that "just works" with minimal tuning
- The embedding as an exploratory preprocessing step before clustering
In practice, my workflow is: PCA first for a quick sanity check and to see if there's obvious linear structure. If PCA doesn't separate things well, UMAP for a more detailed exploration. t-SNE only when I specifically need the finest local structure and I'm preparing a figure for a publication. UMAP has become my default for almost everything -- it's faster, more stable across parameters, preserves more global structure, and the ability to transform new data makes the whole pipeline more useful.
What comes next
We've now completed a solid unsupervised learning block. We can find groups in data (K-Means, DBSCAN, HDBSCAN, GMMs from episodes #22-23), and we can compress and visualize high-dimensional data (PCA from #24, t-SNE and UMAP from today). These tools work together: reduce dimensions, then cluster; or cluster in high-D, then visualize in 2D to check the results.
The natural question that follows from all of this is: what about the data points that DON'T belong to any group? We've been focused on finding structure -- clusters, components, manifolds. But sometimes the most interesting data points are the ones that sit outside all the normal patterns. The outliers. The anomalies. The transactions that don't look like any other transaction in the database. Detecting those anomalous points is its own field of ML, and the tools we've built (distances, density, clustering, dimensionality reduction) are exactly the foundation it builds on.
So, what have we learned today?
- PCA is linear -- it can't unfold curved manifolds like the Swiss roll. Points that are far apart along the surface get projected on top of each other. For nonlinear structure, you need different tools;
- t-SNE (van der Maaten & Hinton, 2008) preserves local neighborhoods by matching probability distributions between high-D and 2D. The t-distribution in 2D prevents the crowding problem. It produces excellent cluster visualizations but is slow (O(n^2)), non-deterministic, and cannot transform new data;
- Reading t-SNE plots honestly: cluster existence is real, but cluster distances, sizes, and shapes are artifacts. Always try multiple perplexity values. If your conclusion changes with perplexity, it's not robust;
- UMAP (McInnes et al., 2018) builds a nearest-neighbor graph and finds a low-D layout preserving its structure. It's 3-50x faster than t-SNE, preserves more global structure, and CAN transform new data. It's the practical default for most visualization tasks;
- The combination of UMAP + DBSCAN is a powerful exploratory pipeline -- UMAP creates dense, separated clusters that DBSCAN is perfectly designed to find;
- Don't use t-SNE/UMAP output as ML features -- they're visualization tools optimized for visual separation, not predictive power. For preprocessing, use PCA (or autoencoders, which we'll build later with neural networks). Watch out for data leakage when evaluating embeddings;
- PCA for baselines and preprocessing, t-SNE for publication figures, UMAP as the general-purpose default. That's the practical hierarchy.