Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
What will I learn
- You will learn why vanilla gradient descent isn't enough for real neural networks;
- stochastic gradient descent and mini-batches -- making training feasible on large datasets;
- momentum -- accelerating through narrow valleys in the loss landscape;
- RMSProp -- adaptive learning rates per parameter;
- Adam -- combining momentum and adaptive rates, the current default;
- AdamW -- decoupled weight decay, the standard in modern deep learning;
- implementing every optimizer from scratch in NumPy so you see exactly what's happening;
- practical guidance on when to use what (spoiler: start with AdamW).
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam (this post)
Back in episode #7, gradient descent was beautifully simple: compute the gradient, take a step in the opposite direction, repeat. With a single parameter and a smooth loss function, that's all you need. But neural networks have millions of parameters, the loss landscape is a high-dimensional surface full of ridges, plateaus, saddle points, and sharp ravines -- and vanilla gradient descent handles none of these well. It zigzags, it gets stuck, it crawls through flat regions at an agonizing pace while wasting time oscillating through narrow valleys.
Last episode we covered the techniques that keep deep networks from blowing up during training -- He initialization, batch normalization, dropout, learning rate scheduling, early stopping. Those are about stability. This episode is about speed and quality -- the optimization algorithms that actually drive the learning process. By the end, you'll understand why Adam is the default choice in practice, when SGD with momentum still wins, and how to implement all of them from scratch in NumPy. Here we go!
From batch to stochastic: making training feasible
In our episode #7 training loop, we computed gradients on the entire dataset at each step. This is batch gradient descent -- accurate gradients, but painfully slow. With 1 million training samples, each gradient step requires a full forward and backward pass over all 1 million samples before you can update a single weight. If that takes 10 seconds per step, and you need 10,000 steps to converge, that's over a day of training. And this is a small dataset by modern standards.
Stochastic gradient descent (SGD) takes the opposite approach: compute the gradient on a single random sample and update immediately. Each update is noisy (the gradient from one sample is a rough approximation of the true gradient), but fast. You make many small, imprecise steps instead of few large, precise ones. Over time, the noise averages out -- the expected value of the stochastic gradient equals the true gradient, so on average you're heading in the right direction.
Mini-batch SGD is the practical compromise everyone uses: compute the gradient on a small random batch (typically 32-256 samples), balancing noise reduction and computational efficiency. This is what everyone means by "SGD" in practice -- nobody uses single-sample SGD, and nobody uses full-batch gradient descent on large datasets.
import numpy as np

def sgd_update(weights, gradients, lr=0.01):
    """Vanilla SGD: step in the opposite direction of the gradient."""
    return [w - lr * g for w, g in zip(weights, gradients)]

def create_batches(X, y, batch_size=32):
    """Yield random mini-batches for training."""
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

# Quick demonstration
np.random.seed(42)
X_demo = np.random.randn(256, 4)
y_demo = np.random.randint(0, 2, 256)
batch_count = 0
for X_batch, y_batch in create_batches(X_demo, y_demo, batch_size=32):
    batch_count += 1
print(f"Dataset: {len(X_demo)} samples")
print(f"Batch size: 32")
print(f"Batches per epoch: {batch_count}")
print(f"Each batch is a random subset -- different every epoch")
The batch size is itself a hyperparameter with interesting properties. Smaller batches (16-32) add more noise to the gradient, which acts as implicit regularization and can actually help the network escape sharp local minima that don't generalize well (remember our discussion of flat vs sharp minima in episode #40). Larger batches (256-1024) provide cleaner gradient estimates and better utilize GPU parallelism -- modern GPUs can process 32 samples almost as fast as they process 1, so you're essentially getting 32x more gradient information for free.
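To make the "noisy but unbiased" idea concrete, here is a minimal sketch on a made-up linear regression problem (the data, the `mse_grad` helper and the batch sizes are assumptions for illustration only, not part of the series' codebase). It samples many mini-batch gradients at a fixed point and compares them with the full-batch gradient: the average lands close to the full gradient, while the spread shrinks as the batch grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: linear regression with a mean-squared-error loss
n_samples, n_features = 10_000, 5
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + 0.5 * rng.normal(size=n_samples)
w = np.zeros(n_features)  # the point at which we evaluate the gradient

def mse_grad(X_part, y_part, w):
    """Gradient of the mean squared error over the given samples."""
    residual = X_part @ w - y_part
    return 2.0 * X_part.T @ residual / len(y_part)

full_grad = mse_grad(X, y, w)

noise_by_batch = {}
for batch_size in [1, 32, 256]:
    grads = []
    for _ in range(2000):
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        grads.append(mse_grad(X[idx], y[idx], w))
    grads = np.array(grads)
    # Distance between the average mini-batch gradient and the full gradient
    bias = np.linalg.norm(grads.mean(axis=0) - full_grad)
    # Average per-component standard deviation of the mini-batch gradients
    noise = grads.std(axis=0).mean()
    noise_by_batch[batch_size] = noise
    print(f"batch={batch_size:>4d}  |mean - full gradient|={bias:.4f}  "
          f"avg std per component={noise:.4f}")
```

The exact numbers depend on the random seed, but the pattern should hold: the noise drops roughly with the square root of the batch size, which is why going from 32 to 256 samples buys less than you might hope.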
The current practical wisdom: use the largest batch size your GPU memory can fit, then adjust the learning rate proportionally. This is the linear scaling rule: if you double the batch size, double the learning rate. The intuition is straightforward -- a bigger batch gives a more accurate gradient estimate, so you can afford to take a larger step. This rule works well in practice up to moderately large batches (a few thousand), though it breaks down for very large batches where other considerations dominate.
# The linear scaling rule in action
base_lr = 0.001
base_batch = 32
print(f"{'Batch Size':>12s} {'Learning Rate':>15s} {'Steps/Epoch (1M samples)':>25s}")
print("-" * 56)
for bs in [32, 64, 128, 256, 512, 1024]:
    lr = base_lr * (bs / base_batch)
    steps = (1_000_000 + bs - 1) // bs
    print(f"{bs:>12d} {lr:>15.6f} {steps:>25,d}")
print(f"\nLarger batches = fewer steps, but each step is more expensive")
print(f"GPU parallelism makes larger batches nearly free up to hardware limits")
Momentum: building up speed
Imagine rolling a ball down a hilly landscape. Without momentum, the ball stops at every slight uphill bump and gets stuck in every little depression. With momentum, it builds speed going downhill and powers through small bumps, eventually settling in a deeper valley.
The problem momentum solves is oscillation. In loss landscapes with narrow valleys (where the gradient is steep across the valley but shallow along it), vanilla SGD zigzags back and forth across the valley, making very slow progress along the direction that actually matters. The gradient points steeply across the valley at each step, so the optimizer bounces from wall to wall while barely moving forward. You've probably seen this if you've ever trained a network and watched the loss bounce up and down while trending slowly downward.
Momentum accumulates gradient direction over time: if the gradient consistently points the same way (along the valley), the effective step grows larger. If the gradient oscillates (across the valley), the back-and-forth directions cancel out. This smoothing effect dramatically reduces zigzagging and accelerates convergence in the consistent direction.
class SGDMomentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None

    def update(self, weights, gradients):
        if self.velocity is None:
            self.velocity = [np.zeros_like(w) for w in weights]
        for i in range(len(weights)):
            self.velocity[i] = (self.momentum * self.velocity[i]
                                - self.lr * gradients[i])
            weights[i] += self.velocity[i]
        return weights
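The velocity accumulates past gradients with exponential decay controlled by the momentum coefficient (typically 0.9). At each step, the previous velocity is scaled by 0.9 and the new gradient step (lr * gradient) is added at full weight. If the gradient keeps pointing the same way, the velocity builds up towards a magnitude of lr * gradient / (1 - 0.9), which is 10x the plain SGD step. This effectively means the optimizer "remembers" roughly the last 10 gradient directions (because 0.9^10 is about 0.35, so a gradient from 10 steps ago still carries about 35% of its original weight, and older ones fade quickly after that).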
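A quick way to check this behaviour is to feed the momentum update a constant gradient and watch the velocity build up. The snippet below is a small sketch using the same update rule as the SGDMomentum class above; the constant gradient of 1.0 is an assumption chosen purely for illustration.

```python
lr, beta = 0.01, 0.9
grad = 1.0        # hypothetical constant gradient
velocity = 0.0
history = []
for step in range(1, 101):
    # Same rule as SGDMomentum.update: decay the old velocity, add the new step
    velocity = beta * velocity - lr * grad
    history.append(velocity)

plain_step = -lr * grad                 # what vanilla SGD would do every step
terminal = -lr * grad / (1 - beta)      # limit of the geometric series

print(f"Plain SGD step:    {plain_step:.5f}")
print(f"Terminal velocity: {terminal:.5f}")
for step in [1, 5, 10, 50, 100]:
    v = history[step - 1]
    print(f"step {step:>3d}: velocity={v:.5f}  ({v / plain_step:.2f}x the plain step)")
```

With momentum 0.9 the step size settles at about ten times the plain SGD step, which is worth keeping in mind when you carry a learning rate over from an optimizer without momentum.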
Let me show you the difference between plain SGD and SGD with momentum on a toy optimization problem:
def rosenbrock_grad(x, y):
    """Gradient of the Rosenbrock function -- a classic test surface
    with a narrow curved valley. Minimum at (1, 1)."""
    dx = -2 * (1 - x) + 400 * x * (x**2 - y)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

# Compare vanilla SGD vs momentum on the Rosenbrock function
def optimize(start, optimizer_fn, n_steps=200):
    pos = start.copy()
    path = [pos.copy()]
    for _ in range(n_steps):
        grad = rosenbrock_grad(pos[0], pos[1])
        pos = optimizer_fn(pos, grad)
        path.append(pos.copy())
    return np.array(path)

start = np.array([-1.0, 1.0])
lr = 0.0001

# Vanilla SGD
sgd_path = optimize(start, lambda p, g: p - lr * g, 500)

# SGD with momentum (manual tracking)
velocity = np.zeros(2)
def momentum_step(p, g, v=velocity):
    v[:] = 0.9 * v - lr * g
    return p + v
mom_path = optimize(start, momentum_step, 500)

print(f"Rosenbrock function: minimum at (1.0, 1.0)\n")
print(f"{'':>5s} {'SGD':>20s} {'SGD+Momentum':>20s}")
print("-" * 50)
for step in [0, 50, 100, 200, 300, 500]:
    if step < len(sgd_path) and step < len(mom_path):
        s = sgd_path[step]
        m = mom_path[step]
        print(f"{step:>5d} ({s[0]:>7.4f}, {s[1]:>7.4f}) "
              f"({m[0]:>7.4f}, {m[1]:>7.4f})")
sgd_dist = np.sqrt((sgd_path[-1][0]-1)**2 + (sgd_path[-1][1]-1)**2)
mom_dist = np.sqrt((mom_path[-1][0]-1)**2 + (mom_path[-1][1]-1)**2)
print(f"\nFinal distance to minimum:")
print(f" SGD: {sgd_dist:.6f}")
print(f" SGD+Momentum: {mom_dist:.6f}")
Momentum doesn't just speed up convergence -- it changes what minima the optimizer finds. The accumulated velocity acts like physical inertia, helping the optimizer roll past sharp local minima (which are narrow and can't "catch" a fast-moving optimizer) and settle in broader, flatter minima that tend to generalize better. This connection between optimizer dynamics and generalization is an active research area, but the practical takeaway is clear: momentum almost always helps, and 0.9 is the standard default ;-)
Nesterov momentum: look before you leap
There's a clever variant of momentum called Nesterov accelerated gradient (NAG). Standard momentum computes the gradient at the current position and then applies the velocity. Nesterov momentum first takes a step in the direction of the accumulated velocity (a "lookahead" to where you're about to land), then computes the gradient at that lookahead position. The idea is that you're correcting from a better-informed position.
class NesterovMomentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None

    def update(self, weights, gradients_at_lookahead):
        """Note: gradients should be computed at (weights + momentum * velocity),
        not at the current weights. This is the lookahead correction."""
        if self.velocity is None:
            self.velocity = [np.zeros_like(w) for w in weights]
        for i in range(len(weights)):
            self.velocity[i] = (self.momentum * self.velocity[i]
                                - self.lr * gradients_at_lookahead[i])
            weights[i] += self.velocity[i]
        return weights

print("Nesterov vs standard momentum:")
print(" Standard: compute gradient at position, THEN move")
print(" Nesterov: move to lookahead position, THEN compute gradient there")
print(" Nesterov converges faster on convex problems")
print(" In practice: marginal improvement; most frameworks offer it as an option")
In practice, Nesterov momentum provides a small but consistent improvement over standard momentum, especially on convex problems. The improvement on non-convex neural network losses is less clear-cut, but since it's essentially free (just a minor code change), it is a reasonable thing to switch on. Frameworks such as PyTorch and Keras expose it as a flag on their SGD optimizer.
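The class above leaves the lookahead gradient to the caller, so it helps to see the whole loop once. The sketch below is a minimal, self-contained comparison on a toy quadratic loss; the `quad_grad`, `quad_loss` and `run` helpers, the starting point and the hyperparameters are all assumptions picked for illustration. The only difference between the two variants is where the gradient is evaluated.

```python
import numpy as np

def quad_grad(w):
    """Gradient of the toy loss 0.5 * (100 * w[0]**2 + w[1]**2)."""
    return np.array([100.0 * w[0], w[1]])

def quad_loss(w):
    return 0.5 * (100.0 * w[0]**2 + w[1]**2)

def run(nesterov, lr=0.01, beta=0.9, n_steps=200):
    w = np.array([1.0, 10.0])
    v = np.zeros_like(w)
    for _ in range(n_steps):
        # Nesterov evaluates the gradient at the lookahead point w + beta * v;
        # standard momentum evaluates it at the current position w
        point = w + beta * v if nesterov else w
        v = beta * v - lr * quad_grad(point)
        w = w + v
    return w

w_standard = run(nesterov=False)
w_nesterov = run(nesterov=True)
print(f"Standard momentum: w={w_standard}, loss={quad_loss(w_standard):.3e}")
print(f"Nesterov momentum: w={w_nesterov}, loss={quad_loss(w_nesterov):.3e}")
```

Both variants converge on this toy problem. How much Nesterov helps on a real network is something you would have to measure for your own setup.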
RMSProp: adaptive learning rates
Here's a problem that momentum doesn't solve: different parameters in a neural network need different learning rates. A weight connected to a frequently-active input feature receives large, consistent gradients and might overshoot with a high learning rate. A weight connected to a rarely-active feature receives tiny, sparse gradients and needs a bigger learning rate to make any progress at all. Using a single global learning rate is a compromise -- too high for some parameters, too low for others.
RMSProp (Hinton, 2012 -- technically never published as a paper, introduced in a Coursera lecture, which is kind of amazing for such an influential algorithm) adapts the learning rate per parameter. It tracks the moving average of squared gradients for each parameter and divides the learning rate by the square root of this running average:
class RMSProp:
    def __init__(self, lr=0.001, decay=0.99, eps=1e-8):
        self.lr = lr
        self.decay = decay
        self.eps = eps
        self.cache = None

    def update(self, weights, gradients):
        if self.cache is None:
            self.cache = [np.zeros_like(w) for w in weights]
        for i in range(len(weights)):
            self.cache[i] = (self.decay * self.cache[i]
                             + (1 - self.decay) * gradients[i]**2)
            weights[i] -= (self.lr * gradients[i]
                           / (np.sqrt(self.cache[i]) + self.eps))
        return weights
What does this do in practice? Parameters with consistently large gradients accumulate a large cache value, which reduces their effective learning rate -- they take smaller, more careful steps. Parameters with small or sparse gradients get a small cache, which increases their effective rate -- they get bigger steps to compensate for the weak signal. This per-parameter adaptation means you don't need to hand-tune different learning rates for different layers or parameter groups (which would be a nightmare in a 100-layer network).
The eps (epsilon, typically 1e-8) prevents division by zero when the cache is zero or near zero for parameters that rarely receive any gradient at all. It's a tiny detail, but without it a parameter whose gradient has been exactly zero so far produces 0/0 = NaN, and that NaN then spreads through the rest of the network -- one of those defensive coding habits that becomes automatic after you've debugged NaN losses a few times.
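Here is a minimal sketch of that failure mode, using a made-up two-parameter example in which the second parameter has not received any gradient yet. The values are assumptions for illustration; the update formula is the one from the RMSProp class above.

```python
import numpy as np

lr, decay, eps = 0.001, 0.99, 1e-8
grad = np.array([0.5, 0.0])          # second parameter: no gradient so far
cache = (1 - decay) * grad**2        # cache after the very first update

# Without eps the second entry is 0 / sqrt(0) = 0 / 0
with np.errstate(divide="ignore", invalid="ignore"):
    step_no_eps = lr * grad / np.sqrt(cache)

# With eps the denominator is never exactly zero
step_with_eps = lr * grad / (np.sqrt(cache) + eps)

print(f"Without eps: {step_no_eps}")
print(f"With eps:    {step_with_eps}")
```

The first parameter gets the same step either way; only the parameter with a zero cache turns into NaN when eps is missing.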
# Demonstrating adaptive learning rates
np.random.seed(42)
# Simulate parameters with very different gradient magnitudes
grad_large = np.random.randn(1000) * 10  # large feature
grad_small = np.random.randn(1000) * 0.01  # rare feature
rms = RMSProp(lr=0.001)
weights = [np.zeros(1000), np.zeros(1000)]
for step in range(100):
    grads = [grad_large * np.random.randn(), grad_small * np.random.randn()]
    weights = rms.update(weights, grads)
# Effective lr uses the same formula as RMSProp.update: lr / (sqrt(cache) + eps)
print("RMSProp adapts step sizes to gradient magnitudes:")
print(f" Large-gradient params: cache ~{rms.cache[0].mean():.4f} "
      f"-> effective lr ~{0.001 / (np.sqrt(rms.cache[0].mean()) + 1e-8):.6f}")
print(f" Small-gradient params: cache ~{rms.cache[1].mean():.8f} "
      f"-> effective lr ~{0.001 / (np.sqrt(rms.cache[1].mean()) + 1e-8):.6f}")
print(f" Ratio: {np.sqrt(rms.cache[0].mean()) / np.sqrt(rms.cache[1].mean()):.0f}x difference in step size")
Adam: the best of both worlds
Adam (Adaptive Moment Estimation, Kingma & Ba, 2015) combines the ideas of momentum and RMSProp into a single optimizer. It maintains two running averages:
- First moment (like momentum): the exponential moving average of gradients -- tracks the direction
- Second moment (like RMSProp): the exponential moving average of squared gradients -- tracks the scale
The combination gives you momentum's ability to build speed in consistent directions AND RMSProp's per-parameter learning rate adaptation. Having said that, there's one more trick that makes Adam work well in practice: bias correction.
class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None  # first moment (momentum)
        self.v = None  # second moment (RMSProp)
        self.t = 0  # timestep

    def update(self, weights, gradients):
        if self.m is None:
            self.m = [np.zeros_like(w) for w in weights]
            self.v = [np.zeros_like(w) for w in weights]
        self.t += 1
        for i in range(len(weights)):
            # Update biased first moment estimate
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * gradients[i]
            # Update biased second moment estimate
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * gradients[i]**2
            # Bias correction (critical for early steps!)
            m_hat = self.m[i] / (1 - self.beta1**self.t)
            v_hat = self.v[i] / (1 - self.beta2**self.t)
            # Update weights
            weights[i] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return weights
The bias correction step deserves a closer look because it's one of those details that seems fiddly but is mathematically crucial. Both moment estimates are initialized to zero. At timestep 1, after seeing one gradient, m = 0.9 * 0 + 0.1 * gradient = 0.1 * gradient. But the true expected value of the first moment should be close to gradient, not 0.1 * gradient. The bias correction divides by (1 - 0.9^1) = 0.1, recovering the full gradient estimate. By timestep 10, the correction factor is (1 - 0.9^10) = 0.651, still significant. By timestep 100, it's (1 - 0.9^100) = 0.99997, almost 1 -- the correction becomes negligible.
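Without bias correction, the early steps are mis-scaled. At step 1 the first moment is 10x too small, but the second moment is 1000x too small, so its square root is about 32x too small. The ratio m / sqrt(v) therefore comes out roughly 3x larger than intended, exactly when the estimates are least trustworthy. Let me show you the correction factors: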
# Bias correction matters most in early training
print(f"{'Step':>5s} {'1-beta1^t':>12s} {'1-beta2^t':>12s} "
      f"{'m correction':>14s} {'v correction':>14s}")
print("-" * 63)
for t in [1, 2, 5, 10, 20, 50, 100, 500]:
    bc1 = 1 - 0.9**t
    bc2 = 1 - 0.999**t
    print(f"{t:>5d} {bc1:>12.8f} {bc2:>12.8f} "
          f"{1/bc1:>14.4f}x {1/bc2:>14.4f}x")
print(f"\nAt step 1:")
print(f" Without correction: effective m = 0.1 * gradient (10x too small)")
print(f" With correction: effective m = gradient (correct)")
print(f" Without correction: effective v = 0.001 * grad^2 (1000x too small)")
print(f" With correction: effective v = grad^2 (correct)")
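What matters for the actual weight update is the ratio m / sqrt(v), so it is worth checking how the two biases combine. The sketch below feeds a constant gradient (an assumption, chosen so that the "correct" ratio is exactly 1) through the two moving averages and compares the raw ratio with the bias-corrected one.

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
grad = 2.0       # hypothetical constant gradient
m, v = 0.0, 0.0
raw_ratio = {}
corrected_ratio = {}
for t in range(1, 1001):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    raw_ratio[t] = m / np.sqrt(v)                 # what you get without correction
    corrected_ratio[t] = m_hat / np.sqrt(v_hat)   # what Adam actually uses

print(f"{'Step':>5s} {'raw m/sqrt(v)':>15s} {'corrected':>12s}")
for t in [1, 10, 100, 1000]:
    print(f"{t:>5d} {raw_ratio[t]:>15.4f} {corrected_ratio[t]:>12.4f}")
```

With these defaults the uncorrected ratio stays above 1 for a long time, because the second moment (beta2=0.999) warms up far more slowly than the first (beta1=0.9). The corrected ratio is 1 from the very first step.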
Adam's default hyperparameters (lr=0.001, beta1=0.9, beta2=0.999) work well for the vast majority of problems. This is its killer feature: you rarely need to tune them. In my experience, unless you're squeezing the last 0.1% of performance out of a model, the defaults just work. Set and forget. That alone makes Adam worth using over manual SGD tuning for most practical scenarios.
AdamW: fixing weight decay
Standard Adam has a subtle but important problem with weight decay (L2 regularization). In vanilla SGD, L2 regularization adds lambda * w to the gradient. This is mathematically equivalent to shrinking all weights by a factor of (1 - lr * lambda) at each step -- simple and effective. But in Adam, the adaptive learning rates interfere with this. The weight decay term gets divided by the square root of the second moment estimate just like the regular gradient, which means different parameters get different amounts of regularization based on their gradient history. That's not what you want -- you want uniform regularization across all parameters, independent of how noisy or stable their gradients have been.
AdamW (Loshchilov & Hutter, 2019) fixes this by decoupling weight decay from the gradient-based update. Instead of adding the weight decay to the gradient (where it gets scaled by Adam's adaptive learning rates), it applies the decay directly to the weights, as a separate term alongside the Adam step:
class AdamW:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=0.01):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.wd = weight_decay
        self.m = None
        self.v = None
        self.t = 0

    def update(self, weights, gradients):
        if self.m is None:
            self.m = [np.zeros_like(w) for w in weights]
            self.v = [np.zeros_like(w) for w in weights]
        self.t += 1
        for i in range(len(weights)):
            # Standard Adam moment updates
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * gradients[i]
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * gradients[i]**2
            m_hat = self.m[i] / (1 - self.beta1**self.t)
            v_hat = self.v[i] / (1 - self.beta2**self.t)
            # Adam update + DECOUPLED weight decay
            weights[i] -= self.lr * (m_hat / (np.sqrt(v_hat) + self.eps)
                                     + self.wd * weights[i])
        return weights
The difference is subtle in code but significant in practice. With coupled weight decay (standard Adam), the regularization strength varies per parameter based on gradient statistics. With decoupled weight decay (AdamW), every parameter gets the same proportional shrinkage regardless of its gradient history. This leads to better generalization, especially for large models. AdamW is the standard optimizer for training transformers and large language models -- the paper has over 10,000 citations for good reason.
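To see the difference in isolation, consider a deliberately extreme, hypothetical case: two weights of very different size and a loss gradient of exactly zero, so that weight decay is the only force acting. The sketch below computes a single first step (t=1, where the bias-corrected moments equal the gradient and its square) for the coupled and the decoupled variant.

```python
import numpy as np

lr, wd, eps = 0.001, 0.01, 1e-8
w = np.array([1.0, 0.001])    # one large weight, one small weight
loss_grad = np.zeros(2)       # hypothetical: the loss exerts no pull at all

# Coupled (Adam + L2): the decay term is folded into the gradient first
g = loss_grad + wd * w
m_hat, v_hat = g, g**2        # bias-corrected moments at step t=1
w_coupled = w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Decoupled (AdamW): Adam sees only the loss gradient, decay acts on w directly
g = loss_grad
m_hat, v_hat = g, g**2
w_decoupled = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)

shrink_coupled = 1 - w_coupled / w
shrink_decoupled = 1 - w_decoupled / w
print(f"Relative shrinkage, coupled:   {shrink_coupled}")
print(f"Relative shrinkage, decoupled: {shrink_decoupled}")
```

In the coupled variant the decay term is normalized like any other gradient, so both weights move by about lr in absolute terms: the large weight barely changes while the small one is almost wiped out. In the decoupled variant both shrink by the same factor (1 - lr * wd).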
Comparing optimizers: a concrete example
Theory is nice, but let me show you these optimizers actually working on a real-ish optimization problem. We'll create a 2D loss surface with the kind of pathological features that trip up vanilla SGD -- a narrow valley with very different curvatures along its two axes:
def ill_conditioned_loss(w):
    """A loss function with a narrow valley.
    Very steep along w[0], very shallow along w[1].
    Minimum at (0, 0)."""
    return 0.5 * (100 * w[0]**2 + w[1]**2)

def ill_conditioned_grad(w):
    return np.array([100 * w[0], w[1]])

class VanillaSGD:
    """Plain gradient descent, wrapped to match the optimizer interface."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, weights, gradients):
        return [w - self.lr * g for w, g in zip(weights, gradients)]

def run_optimizer(optimizer, start, n_steps=100):
    """Run an optimizer and track the path."""
    w = [start.copy()]
    weights = [start.copy()]
    for _ in range(n_steps):
        grad = ill_conditioned_grad(weights[0])
        weights = optimizer.update(weights, [grad])
        w.append(weights[0].copy())
    return w

start = np.array([1.0, 10.0])

# Run each optimizer
paths = {}
paths['SGD (lr=0.01)'] = run_optimizer(VanillaSGD(lr=0.01), start, 200)
opt_mom = SGDMomentum(lr=0.01, momentum=0.9)
paths['SGD+Mom'] = run_optimizer(opt_mom, start, 200)
opt_rms = RMSProp(lr=0.01)
paths['RMSProp'] = run_optimizer(opt_rms, start, 200)
opt_adam = Adam(lr=0.01)
paths['Adam'] = run_optimizer(opt_adam, start, 200)

# Compare final positions
print(f"Starting point: ({start[0]:.1f}, {start[1]:.1f})")
print(f"True minimum: (0.0, 0.0)")
print(f"\n{'Optimizer':>15s} {'Final w[0]':>12s} {'Final w[1]':>12s} "
      f"{'Loss':>12s}")
print("-" * 55)
for name, path in paths.items():
    final = path[-1]
    loss = ill_conditioned_loss(final)
    print(f"{name:>15s} {final[0]:>12.6f} {final[1]:>12.6f} "
          f"{loss:>12.6f}")
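On this kind of ill-conditioned problem, the differences are stark, though not entirely in the way you might expect. Vanilla SGD is held hostage by the steep direction: w[0] diverges once the learning rate exceeds 0.02 and oscillates for anything between 0.01 and 0.02, so you are stuck with a small step that makes glacial progress along the shallow direction (at lr=0.01, w[1] shrinks by only 1% per step and is still around 1.34 after 200 steps). Momentum damps the oscillation and multiplies the effective step along w[1], which is why it gets closest to the minimum in this run. RMSProp and Adam normalize away the 100x scale difference, so both coordinates move at a similar pace, but every step is then roughly lr in size regardless of the gradient, and with lr=0.01 that is not enough to cover the distance from w[1]=10 in 200 steps. On this toy problem they need a larger learning rate or more steps. Adam's real advantage shows up in networks with millions of parameters whose gradient scales you cannot inspect and tune by hand: it handles the scale mismatch AND builds momentum without per-parameter tuning.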
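For plain gradient descent on a quadratic like the one above you do not even need to run the optimizer: each coordinate is simply multiplied by (1 - lr * curvature) at every step. The sketch below works that out for a few learning rates (the particular values are assumptions chosen to sit on either side of the stability limit 2 / 100 = 0.02).

```python
import numpy as np

curvatures = np.array([100.0, 1.0])   # second derivatives along w[0] and w[1]
start = np.array([1.0, 10.0])
n_steps = 200

results = {}
for lr in [0.005, 0.01, 0.015, 0.021]:
    # Per-step multiplier for each coordinate under vanilla gradient descent
    factors = 1.0 - lr * curvatures
    final = start * factors**n_steps
    results[lr] = (factors, final)
    print(f"lr={lr:<6} factors=({factors[0]:+.3f}, {factors[1]:+.3f})  "
          f"w after {n_steps} steps=({final[0]:.3e}, {final[1]:.3e})")
```

A negative factor means the coordinate flips sign every step (oscillation), and a factor with magnitude above 1 means it grows instead of shrinking (divergence). The steep direction dictates the largest usable learning rate, and the shallow direction pays for it.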
When to use what: the honest guide
After implementing all of these, here's the practical recommendation:
AdamW is the default. Use it unless you have a specific, measured reason not to. It's robust, requires minimal tuning, works well across architectures from tiny MLPs to billion-parameter transformers. Default parameters: lr=0.001, beta1=0.9, beta2=0.999, weight_decay=0.01. If your model trains and converges, move on to more important problems -- architecture, data quality, and regularization matter far more than optimizer choice.
SGD + Momentum can find flatter minima that generalize slightly better, especially for computer vision models (CNNs trained on ImageNet, for example). Many ImageNet-winning models were trained with SGD. But it requires more careful learning rate tuning, a good learning rate schedule (warmup + cosine annealing, as we discussed in episode #40), and generally more hyperparameter work. The potential reward is slightly better final accuracy at the cost of significantly more tuning effort.
RMSProp is rarely used standalone since Adam effectively superseded it. You'll still encounter it as the default in some reinforcement learning settings (the original DQN paper used it), and some older codebases default to it, but for new projects Adam/AdamW is the better choice.
The honest truth: optimizer choice is a second-order effect. The difference between Adam and well-tuned SGD with momentum is typically less than 1% in final performance. The difference between good data and bad data, or a well-designed architecture and a poorly-designed one, or proper regularization and none at all -- those are first-order effects that can swing performance by 10-20%. Spend your time on architecture and data. Use AdamW and move on.
# Summary table
print("=" * 65)
print(f"{'Optimizer':>15s} {'Pros':>25s} {'Cons':>20s}")
print("=" * 65)
print(f"{'Vanilla SGD':>15s} {'Simple, well-understood':>25s} "
f"{'Slow, needs tuning':>20s}")
print(f"{'SGD+Momentum':>15s} {'Better minima for CNNs':>25s} "
f"{'Needs LR schedule':>20s}")
print(f"{'RMSProp':>15s} {'Per-param adaptation':>25s} "
f"{'Superseded by Adam':>20s}")
print(f"{'Adam':>15s} {'Robust defaults':>25s} "
f"{'L2 reg is coupled':>20s}")
print(f"{'AdamW':>15s} {'Best all-around':>25s} "
f"{'Marginal over Adam':>20s}")
print("=" * 65)
print(f"\nDefault recommendation: AdamW with lr=0.001, wd=0.01")
print(f"Try SGD+Momentum only if you have budget for hyperparameter search")
Summary
Here's the quick reference of everything we covered:
- Mini-batch SGD makes training feasible on large datasets by computing gradients on small random batches instead of the full dataset. The linear scaling rule links batch size to learning rate;
- Momentum (typically 0.9) accumulates gradient direction over time, smoothing oscillations and accelerating consistent movement. It also helps the optimizer roll past sharp local minima toward flatter, better-generalizing solutions;
- Nesterov momentum adds a lookahead correction -- compute the gradient at where you're about to land, not where you currently are. A small but consistent improvement over standard momentum;
- RMSProp adapts learning rates per parameter by tracking squared gradient history. Large gradients get smaller steps, small gradients get larger steps. Introduced in a Coursera lecture, never formally published -- yet cited thousands of times;
- Adam combines momentum (first moment) and RMSProp (second moment) with bias correction. The bias correction is critical for accurate estimates in early training. Default hyperparameters (lr=0.001, beta1=0.9, beta2=0.999) work across a huge range of problems;
- AdamW decouples weight decay from the adaptive update, ensuring uniform regularization across all parameters regardless of gradient history. The standard for training transformers and modern deep learning;
- Practical advice: start with AdamW. Tune later if you need to -- or don't. Optimizer choice matters less than architecture and data quality.
With this episode, we've now covered every major component of the modern neural network training recipe: architecture (episode #38), learning (episode #39), stability techniques (episode #40), and optimization (this episode). The next step is to stop building everything from scratch and start using a proper deep learning framework that handles all of this for us, with GPU acceleration and automatic differentiation. That's where things get really interesting -- because once you're not limited by hand-coded NumPy, the scale of problems you can tackle opens up dramatically ;-)