Learn AI Series (#48) - Recurrent Neural Networks - Sequences
What will I learn
- You will learn why order matters -- and why feedforward networks can't handle sequential data;
- the vanilla RNN -- hidden state carries memory from one timestep to the next;
- backpropagation through time (BPTT) -- how gradients flow through sequences;
- the vanishing gradient problem in RNNs -- why vanilla RNNs forget;
- implementing an RNN from scratch in NumPy;
- character-level language models -- generating text one character at a time;
- building RNN-based classifiers and generators in PyTorch.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences (this post)
Solutions to Episode #47 Exercises
Exercise 1: Build a detect_and_filter function that runs Faster R-CNN inference, filters by confidence, applies NMS, and returns cleaned results.
import torch
import torch.nn as nn
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
from torchvision.models.detection import FasterRCNN_ResNet50_FPN_V2_Weights
from torchvision.ops import nms
def detect_and_filter(model, image, confidence_threshold=0.5, iou_threshold=0.5):
    """Run detection, filter by confidence, apply NMS."""
    model.eval()
    with torch.no_grad():
        predictions = model(image)
    results = []
    for pred in predictions:
        boxes = pred['boxes']
        scores = pred['scores']
        labels = pred['labels']
        # Filter by confidence
        conf_mask = scores > confidence_threshold
        boxes = boxes[conf_mask]
        scores = scores[conf_mask]
        labels = labels[conf_mask]
        # Apply NMS. Note: nms() ignores labels, so overlapping boxes of
        # different classes suppress each other as well. For strictly
        # per-class suppression, torchvision.ops.batched_nms takes the labels.
        keep = nms(boxes, scores, iou_threshold)
        results.append({
            'boxes': boxes[keep],
            'scores': scores[keep],
            'labels': labels[keep]
        })
    return results
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn_v2(weights=weights)
# Test on 3 random images
batch = [torch.randn(3, 480, 640) for _ in range(3)]
detector.eval()
with torch.no_grad():
    raw_preds = detector(batch)
filtered = detect_and_filter(detector, batch, confidence_threshold=0.3, iou_threshold=0.5)
for i in range(3):
    raw_count = (raw_preds[i]['scores'] > 0.01).sum().item()
    filtered_count = len(filtered[i]['scores'])
    print(f"Image {i}: {raw_count} raw detections -> {filtered_count} after filtering")
Using torchvision.ops.nms is cleaner than reimplementing NMS from scratch -- it's written in C++ and handles edge cases. The key insight: filtering happens in two stages. First, confidence thresholding removes weak detections. Second, NMS removes duplicate boxes that overlap too much. In production you'd tune both thresholds depending on whether you prefer precision (fewer false positives) or recall (fewer misses).
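To make that second stage less of a black box, here is a minimal NumPy sketch of greedy NMS. It only illustrates the idea and is not torchvision's implementation; the helper names `iou_one_to_many` and `greedy_nms` and the three toy boxes are made up for this example.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box, drop everything overlapping it too much, repeat."""
    order = np.argsort(-scores)          # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]   # survivors go to the next round
    return keep

# Two near-duplicate boxes plus one box far away (toy data)
boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],
                  [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
keep = greedy_nms(boxes, scores, iou_threshold=0.5)
print(f"Kept box indices: {keep}")
```

The second box overlaps the first almost completely, so it gets suppressed, while the distant third box survives. Raising `iou_threshold` makes the filter more tolerant of overlaps, lowering it makes it more aggressive.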
Exercise 2: Minimal segmentation pipeline with MiniUNet and synthetic circle data.
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
class UNetBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU())
    def forward(self, x):
        return self.conv(x)
class MiniUNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc1 = UNetBlock(3, 64)
        self.enc2 = UNetBlock(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = UNetBlock(128, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = UNetBlock(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = UNetBlock(128, 64)
        self.output = nn.Conv2d(64, n_classes, 1)
    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.output(d1)
class CircleDataset(Dataset):
    def __init__(self, n_samples=200, size=64):
        self.images = []
        self.masks = []
        for _ in range(n_samples):
            img = torch.randn(3, size, size) * 0.1
            mask = torch.zeros(size, size, dtype=torch.long)
            cx, cy = torch.randint(10, size-10, (2,))
            r = torch.randint(5, 15, (1,)).item()
            yy, xx = torch.meshgrid(torch.arange(size), torch.arange(size), indexing='ij')
            circle = ((xx - cx)**2 + (yy - cy)**2) < r**2
            mask[circle] = 1
            img[:, circle] += 0.5  # make circle brighter
            self.images.append(img)
            self.masks.append(mask)
    def __len__(self):
        return len(self.images)
    def __getitem__(self, idx):
        return self.images[idx], self.masks[idx]
train_data = CircleDataset(200)
test_data = CircleDataset(50)
train_loader = DataLoader(train_data, batch_size=16, shuffle=True)
test_loader = DataLoader(test_data, batch_size=50)
model = MiniUNet(n_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(30):
    model.train()
    for imgs, masks in train_loader:
        pred = model(imgs)
        loss = loss_fn(pred, masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
model.eval()
with torch.no_grad():
    for imgs, masks in test_loader:
        pred = model(imgs).argmax(dim=1)
        accuracy = (pred == masks).float().mean()
print(f"Per-pixel accuracy on test set: {accuracy.item():.1%}")
# Expect 95%+ accuracy -- circles are easy patterns for a U-Net
Even a tiny U-Net crushes this synthetic task because the circle boundaries are sharp and consistent. On real medical images with ambiguous boundaries, you'd need much more data and training time -- but the architecture pattern is identical.
Exercise 3: Feature extractor comparison between ResNet-18 and ResNet-50.
import torch
import torch.nn as nn
from torchvision import models
# Build feature extractors
resnet18 = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
fe18 = nn.Sequential(*list(resnet18.children())[:-1])
fe18.eval()
resnet50 = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
fe50 = nn.Sequential(*list(resnet50.children())[:-1])
fe50.eval()
# Extract features for 100 synthetic images
images = torch.randn(100, 3, 224, 224)
with torch.no_grad():
    feat18 = fe18(images).squeeze(-1).squeeze(-1)  # (100, 512)
    feat50 = fe50(images).squeeze(-1).squeeze(-1)  # (100, 2048)
# Compute similarity matrices
cos = nn.CosineSimilarity(dim=0)
sim18 = torch.zeros(100, 100)
sim50 = torch.zeros(100, 100)
for i in range(100):
    for j in range(100):
        sim18[i, j] = cos(feat18[i], feat18[j])
        sim50[i, j] = cos(feat50[i], feat50[j])
# Flatten upper triangle and compute correlation
mask = torch.triu(torch.ones(100, 100, dtype=torch.bool), diagonal=1)
flat18 = sim18[mask]
flat50 = sim50[mask]
# Pearson correlation. Covariance and standard deviations both use the
# population (divide-by-N) form so the normalizations match;
# Tensor.std() would default to the N-1 version.
mean18, mean50 = flat18.mean(), flat50.mean()
cov = ((flat18 - mean18) * (flat50 - mean50)).mean()
std18 = ((flat18 - mean18) ** 2).mean().sqrt()
std50 = ((flat50 - mean50) ** 2).mean().sqrt()
correlation = cov / (std18 * std50)
print(f"ResNet-18 features: {feat18.shape}")
print(f"ResNet-50 features: {feat50.shape}")
print(f"Similarity matrix correlation: {correlation.item():.3f}")
print(f"\nHigh correlation means both architectures find the same images")
print(f"'similar' -- pretrained features are largely universal across models")
On random data the correlation is moderate. On real images it would be substantially higher, because both models have learned similar feature hierarchies from ImageNet -- edges, textures, shapes, object parts. This universality of pretrained features is precisely why transfer learning works so well regardless of which specific backbone you choose (as we discussed in episode #46).
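The nested loop in the solution is fine for 100 images, but the same comparison can be written without loops. A minimal NumPy sketch, with random arrays standing in for the extracted features (the names `cosine_matrix`, `feats_a` and `feats_b` are placeholders for this example, not part of the exercise):

```python
import numpy as np

def cosine_matrix(feats):
    """Pairwise cosine similarity of row vectors: normalize rows, then one matmul."""
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(100, 512))    # stand-in for ResNet-18 features
feats_b = rng.normal(size=(100, 2048))   # stand-in for ResNet-50 features

sim_a = cosine_matrix(feats_a)
sim_b = cosine_matrix(feats_b)

# Upper triangle without the diagonal: every image pair counted exactly once
iu = np.triu_indices(100, k=1)
correlation = np.corrcoef(sim_a[iu], sim_b[iu])[0, 1]
print(f"Pairs compared: {len(iu[0])}")
print(f"Similarity matrix correlation: {correlation:.3f}")
```

Because the two stand-in feature sets are drawn independently, the correlation here should land close to zero. With features extracted from the same images by two pretrained models, the two similarity matrices are no longer independent, which is the effect the exercise measures.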
On to today's episode
Three episodes on CNNs (episodes #45-47) and we've covered a LOT of ground -- convolution theory, classic-to-modern architectures, transfer learning, detection, segmentation, and style transfer. CNNs conquered images by exploiting spatial structure: local filters, weight sharing, pooling, hierarchical feature maps. Powerful stuff.
But what about data where order is the structure? Text, speech, music, stock prices, sensor readings -- all of these are sequences where the meaning depends on what came before. "The dog bit the man" means something very different from "The man bit the dog." Same words, same bag-of-words representation -- completely opposite meaning. Shuffle the words in a sentence and you get nonsense. Shuffle the notes in a melody and you get noise.
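Feedforward networks (everything we've built so far, including CNNs) process each input independently. They keep no memory of what they saw on the previous call and have no concept of "before" or "after." A fully connected network goes even further: you could shuffle the input features and, as long as you shuffled them the same way during training and inference, the network wouldn't care. CNNs do care about spatial neighborhoods, but they still treat their input as a single snapshot rather than as steps unfolding over time. That's a fundamental limitation when your data is inherently sequential.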
Today we fix that. Here we go!
Why feedforward networks can't do sequences
A feedforward network that predicts the next word in a sentence would need to see the entire preceding context as a fixed-size input. But sentences vary in length. The relevant context might be 3 words back or 300 words back. You could pad everything to the maximum length, but that's wasteful and it doesn't solve the deeper issue: the network still treats position 1 and position 100 as independent input features rather than steps in a process.
Consider how you'd build a sentiment classifier with what we know so far. You could take the word embeddings from episode #31, average them into a single vector, and feed that to a feedforward network. That works -- sort of. But it loses all word order information. "Not good" and "good not" produce the same average embedding. "I thought this movie would be terrible but it was actually amazing" has both "terrible" and "amazing" -- the average doesn't capture that the sentence's meaning pivots on "but."
import torch
import torch.nn as nn
import numpy as np
# The problem: averaging embeddings destroys order
# These two "sentences" would produce identical inputs:
sentence_a = ["not", "good"] # negative
sentence_b = ["good", "not"] # same words, reversed order
# Simulated embeddings (in practice from episode #31)
vocab = {"not": torch.tensor([0.1, -0.5, 0.3]),
"good": torch.tensor([0.8, 0.6, 0.2])}
avg_a = torch.stack([vocab[w] for w in sentence_a]).mean(dim=0)
avg_b = torch.stack([vocab[w] for w in sentence_b]).mean(dim=0)
print(f"Average embedding for '{' '.join(sentence_a)}': {avg_a}")
print(f"Average embedding for '{' '.join(sentence_b)}': {avg_b}")
print(f"Identical? {torch.allclose(avg_a, avg_b)}")
print(f"\nOrder destroyed -- feedforward sees no difference")
We need an architecture that processes elements one at a time, in order, and maintains some kind of running memory of what it has seen so far. That's exactly what Recurrent Neural Networks provide.
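As a quick preview of that idea, the sketch below pushes the same two toy "sentences" through a running-memory update in NumPy. The weights `W_x` and `W_h` are random and untrained, and the function name `run_memory` is invented for this illustration; the point is only that the final state depends on the order, while the average does not.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {"not": np.array([0.1, -0.5, 0.3]),
       "good": np.array([0.8, 0.6, 0.2])}

# Random, untrained weights: just enough to show order sensitivity
W_x = rng.normal(size=(3, 4))   # input -> memory
W_h = rng.normal(size=(4, 4))   # memory -> memory

def run_memory(words):
    """Read words one at a time, updating a 4-dimensional running memory."""
    h = np.zeros(4)
    for w in words:
        h = np.tanh(emb[w] @ W_x + h @ W_h)
    return h

h_a = run_memory(["not", "good"])
h_b = run_memory(["good", "not"])
avg_a = np.mean([emb[w] for w in ["not", "good"]], axis=0)
avg_b = np.mean([emb[w] for w in ["good", "not"]], axis=0)

print("Averages identical?        ", np.allclose(avg_a, avg_b))
print("Recurrent states identical?", np.allclose(h_a, h_b))
```

The update inside the loop is the one the next section formalizes.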
The vanilla RNN
The core idea of an RNN is almost disappointingly simple. At each timestep t, the network takes two inputs: the current input x_t and the previous hidden state h_{t-1}. It produces a new hidden state h_t that combines both:
h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
That's it. The hidden state is the RNN's memory -- a compressed representation of everything it has seen so far in the sequence. At each step, it merges new information (from x_t) with accumulated history (from h_{t-1}). The tanh squashes every value into the range (-1, 1), preventing the hidden state from blowing up as it accumulates information over many timesteps.
Three weight matrices do all the work. W_xh transforms the current input into hidden-state space. W_hh transforms the previous hidden state -- this is the recurrence, the loop that gives RNNs their name. W_hy maps from hidden state to output: y_t = W_hy @ h_t + b_y. The same matrices are shared across all timesteps, just like how CNN filters are shared across spatial positions (episode #45). This weight sharing means an RNN can process sequences of any length with the same number of parameters. A model trained on 10-word sentences can be run on 1000-word documents without any architectural change (whether it still performs well at that length is a separate question we'll get to below).
import numpy as np
class SimpleRNN:
    def __init__(self, in_d, hid_d, out_d):
        s = 0.01
        self.Wxh = np.random.randn(in_d, hid_d) * s
        self.Whh = np.random.randn(hid_d, hid_d) * s
        self.Why = np.random.randn(hid_d, out_d) * s
        self.bh = np.zeros(hid_d)
        self.by = np.zeros(out_d)
        self.hid_d = hid_d
    def forward(self, inputs, h=None):
        if h is None:
            h = np.zeros(self.hid_d)
        states = []
        for x_t in inputs:
            h = np.tanh(x_t @ self.Wxh + h @ self.Whh + self.bh)
            states.append(h)
        return states[-1] @ self.Why + self.by, states
rnn = SimpleRNN(10, 32, 5)
seq = [np.random.randn(10) for _ in range(20)]
out, states = rnn.forward(seq)
print(f"Sequence length: {len(seq)}")
print(f"Hidden state dim: {states[0].shape}")
print(f"Output shape: {out.shape}")
print(f"Parameters: {10*32 + 32*32 + 32*5 + 32 + 5} total")
Notice the structure: at each timestep, the hidden state h is updated by combining input and previous state through a tanh nonlinearity. The output (for classification or prediction) is typically computed from the final hidden state -- the one that has "seen" the entire sequence. For sequence-to-sequence tasks you can compute outputs at every timestep, but for simple classification taking the last state is the standard approach.
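The difference between the two output styles is just where you apply the output projection. A small NumPy sketch, using freshly initialized random weights rather than the class above so that it runs on its own (all variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
in_d, hid_d, out_d, T = 10, 32, 5, 20
Wxh = rng.normal(size=(in_d, hid_d)) * 0.01
Whh = rng.normal(size=(hid_d, hid_d)) * 0.01
Why = rng.normal(size=(hid_d, out_d)) * 0.01
bh, by = np.zeros(hid_d), np.zeros(out_d)

seq = rng.normal(size=(T, in_d))
h = np.zeros(hid_d)
states = []
for x_t in seq:
    h = np.tanh(x_t @ Wxh + h @ Whh + bh)
    states.append(h)
states = np.stack(states)          # (T, hid_d): one hidden state per timestep

y_all = states @ Why + by          # many-to-many: one output per timestep
y_last = states[-1] @ Why + by     # many-to-one: output from the final state only
print(f"Per-timestep outputs: {y_all.shape}")
print(f"Final-state output:   {y_last.shape}")
```

The last row of `y_all` is identical to `y_last`, so the many-to-one case is simply the many-to-many case with every output but the last one thrown away.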
Unrolling the RNN
A useful way to think about RNNs is to "unroll" them through time. If you have a 50-timestep sequence, the RNN becomes equivalent to a 50-layer deep network where every layer shares the same weights. You can literally draw it out -- copy the RNN cell 50 times left to right, connect each hidden state to the next, and you have a standard feedforward graph that regular backpropagation works on.
# Visualizing the unrolling concept
# A 5-step RNN is equivalent to this feedforward computation:
np.random.seed(42)
in_d, hid_d = 4, 8
Wxh = np.random.randn(in_d, hid_d) * 0.01
Whh = np.random.randn(hid_d, hid_d) * 0.01
bh = np.zeros(hid_d)
# 5-step sequence
inputs = [np.random.randn(in_d) for _ in range(5)]
h = np.zeros(hid_d)
print("Unrolled RNN computation:")
for t, x_t in enumerate(inputs):
    h_new = np.tanh(x_t @ Wxh + h @ Whh + bh)
    change = np.linalg.norm(h_new - h)
    print(f"  t={t}: input norm={np.linalg.norm(x_t):.3f}, "
          f"hidden norm={np.linalg.norm(h_new):.3f}, "
          f"state change={change:.4f}")
    h = h_new
print(f"\nFinal hidden state encodes the ENTIRE sequence history")
print(f"Same Wxh and Whh used at every timestep (weight sharing)")
This unrolling perspective explains both the power and the weakness of RNNs. The power: a single set of weights can process sequences of arbitrary length, learning general patterns rather than position-specific ones. The weakness: that chain of multiplications through 50 (or 500) copies of the same weight matrix creates gradient flow problems. And this is where the trouble starts.
Backpropagation through time (BPTT)
Training an RNN requires computing gradients through the temporal connections. Since the hidden state at timestep t depends on timestep t-1, which depends on t-2, and so on, the gradient must flow backward through all timesteps. This is backpropagation through time -- conceptually identical to regular backpropagation (episode #39), but applied to the unrolled computational graph.
The chain rule multiplies gradients through all "layers" (timesteps). For a loss computed at the final timestep, the gradient with respect to early hidden states involves multiplying the Jacobian matrices at each intermediate step. Each multiplication involves the recurrent weight matrix W_hh and the derivative of tanh.
# Demonstrating how gradients flow through time
import torch
in_d, hid_d = 4, 8
# Scale first, then mark as requiring grad. Writing
# torch.randn(..., requires_grad=True) * 0.1 creates a non-leaf tensor
# whose .grad stays None after backward().
Wxh = (torch.randn(in_d, hid_d) * 0.1).requires_grad_()
Whh = (torch.randn(hid_d, hid_d) * 0.1).requires_grad_()
bh = torch.zeros(hid_d, requires_grad=True)
# Forward pass through 20 timesteps
seq_len = 20
inputs = [torch.randn(in_d) for _ in range(seq_len)]
h = torch.zeros(hid_d)
hidden_states = [h]
for x_t in inputs:
    h = torch.tanh(x_t @ Wxh + h @ Whh + bh)
    h.retain_grad()  # keep d(loss)/d(h_t) so we can inspect it below
    hidden_states.append(h)
# Loss at final timestep
target = torch.randn(hid_d)
loss = ((hidden_states[-1] - target) ** 2).sum()
loss.backward()
print(f"Gradient flowed back through {seq_len} timesteps")
print(f"Wxh gradient norm: {Wxh.grad.norm():.6f}")
print(f"Whh gradient norm: {Whh.grad.norm():.6f}")
print(f"bh gradient norm: {bh.grad.norm():.6f}")
# Check gradient magnitude at different timesteps:
# how strongly the final loss depends on each hidden state
print(f"\nGradient norm w.r.t. hidden state through time:")
for t in [0, 5, 10, 15, 19]:
    print(f"  t={t:>2d}: grad norm={hidden_states[t+1].grad.norm():.2e}")
The takeaway: BPTT is just regular backprop on the unrolled graph. PyTorch handles it automatically through autograd (episode #42) -- you don't need to implement the temporal gradient chain yourself. But understanding what PyTorch is doing under the hood matters, because the next section explains why it often breaks down.
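If you want to see what autograd is doing for you, the temporal chain can be written by hand in a few lines. The following is a minimal NumPy sketch under simplifying assumptions of my own: no bias terms, the loss is just the sum of the final hidden state, and the function names `forward` and `bptt` are invented for this example. A finite-difference check on a single entry of Whh confirms the hand-written gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
in_d, hid_d, T = 3, 4, 6
Wxh = rng.normal(size=(in_d, hid_d)) * 0.5
Whh = rng.normal(size=(hid_d, hid_d)) * 0.5
xs = rng.normal(size=(T, in_d))

def forward(Wxh, Whh):
    """Return all hidden states (hs[0] is the zero start state) and the loss."""
    hs = [np.zeros(hid_d)]
    for x_t in xs:
        hs.append(np.tanh(x_t @ Wxh + hs[-1] @ Whh))
    return hs, hs[-1].sum()

def bptt(Wxh, Whh):
    """Walk the unrolled graph backward, accumulating weight gradients."""
    hs, _ = forward(Wxh, Whh)
    dWxh, dWhh = np.zeros_like(Wxh), np.zeros_like(Whh)
    dh = np.ones(hid_d)                    # dL/dh_T for L = sum(h_T)
    for t in range(T, 0, -1):
        da = dh * (1.0 - hs[t] ** 2)       # back through tanh
        dWxh += np.outer(xs[t - 1], da)    # same Wxh used at every step
        dWhh += np.outer(hs[t - 1], da)    # same Whh used at every step
        dh = da @ Whh.T                    # hand the gradient to h_{t-1}
    return dWxh, dWhh

dWxh, dWhh = bptt(Wxh, Whh)

# Numerical check on one entry of Whh (central difference)
eps = 1e-6
i, j = 1, 2
Wp, Wm = Whh.copy(), Whh.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
numeric = (forward(Wxh, Wp)[1] - forward(Wxh, Wm)[1]) / (2 * eps)
print(f"analytic={dWhh[i, j]:.6f}  numeric={numeric:.6f}")
```

Note the line `dh = da @ Whh.T`: every step backward multiplies by the recurrent matrix and by the tanh derivative. That repeated multiplication is the mechanism behind the problem discussed next.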
The vanishing gradient problem (again)
We encountered vanishing gradients in episode #40 for deep feedforward networks. In RNNs, the problem is worse -- and more fundamental.
The gradient at timestep t with respect to the hidden state at timestep k (where k is much earlier) is a product of (t - k) per-step factors. Each step multiplies by W_hh and the derivative of tanh. The tanh derivative is always between 0 and 1, so unless W_hh is large enough to compensate (roughly, eigenvalues with magnitude above 1, which causes the opposite problem -- exploding gradients), the product shrinks exponentially with the number of steps.
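Written out, the argument looks like this. It is a sketch based on the update rule h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) from earlier; the norm is taken to be the spectral norm (largest singular value), which is a choice made here to get a clean bound.

```latex
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}\!\left(1 - h_t^{2}\right) W_{hh}
\qquad\Longrightarrow\qquad
\frac{\partial h_t}{\partial h_k}
  = \prod_{i=k+1}^{t} \operatorname{diag}\!\left(1 - h_i^{2}\right) W_{hh}
  \quad \text{(factors ordered from } i = t \text{ down to } i = k+1\text{)}

\left\lVert \frac{\partial h_t}{\partial h_k} \right\rVert
  \le \prod_{i=k+1}^{t}
      \left\lVert \operatorname{diag}\!\left(1 - h_i^{2}\right) \right\rVert
      \, \lVert W_{hh} \rVert
  \le \lVert W_{hh} \rVert^{\,t-k}
```

The last step uses the fact that every diagonal entry 1 - h_i^2 lies between 0 and 1. So whenever the largest singular value of W_hh is below 1, the bound decays exponentially in t - k, and the tanh factors can only shrink it further.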
# Demonstrating vanishing gradients empirically
torch.manual_seed(42)
def test_gradient_flow(seq_len, hidden_dim=32):
    """Measure how much gradient reaches the start of the sequence."""
    Wxh = torch.randn(1, hidden_dim) * 0.1
    Whh = torch.randn(hidden_dim, hidden_dim) * 0.1
    # Track the gradient w.r.t. the initial hidden state. The gradient w.r.t.
    # Whh sums contributions from all timesteps and is dominated by the most
    # recent ones, so it would hide the effect we want to see.
    h0 = torch.zeros(hidden_dim, requires_grad=True)
    h = h0
    for _ in range(seq_len):
        x = torch.randn(1)
        h = torch.tanh(x @ Wxh + h @ Whh)
    loss = h.sum()
    loss.backward()
    return h0.grad.norm().item()
print("Gradient magnitude vs sequence length:")
for length in [5, 10, 20, 50, 100, 200]:
    grad_norm = test_gradient_flow(length)
    print(f"  Length {length:>3d}: gradient norm = {grad_norm:.3e}")
Concretely: if each multiplication reduces the gradient by a factor of 0.9, then after 10 steps you retain 0.9^10 = 0.35 of the original signal. After 50 steps: 0.9^50 = 0.005. After 100 steps: essentially zero. Information from the beginning of the sequence produces no measurable gradient -- the network literally cannot learn from it.
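The arithmetic is easy to reproduce. The per-step factor of 0.9 is the hypothetical value from the paragraph above, not something measured from a real network; the short snippet below also computes how many steps it takes before less than 1% of the signal is left.

```python
import math

factor = 0.9  # hypothetical per-step shrink factor from the text
retained = {steps: factor ** steps for steps in (10, 50, 100)}
for steps, value in retained.items():
    print(f"{steps:>3d} steps: {value:.6f} of the original gradient")

# Smallest number of steps after which less than 1% remains
steps_to_1pct = math.ceil(math.log(0.01) / math.log(factor))
print(f"Below 1% after {steps_to_1pct} steps")
```

Swap in a different `factor` to see how sensitive the horizon is: the decay is exponential, so small changes in the per-step factor move the usable context length a lot.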
In practice, vanilla RNNs can learn dependencies spanning roughly 10-20 timesteps. Beyond that, the gradient signal is too weak. Consider a sentence like: "The cat, which sat on the mat that belonged to my grandmother who lives in Amsterdam, was ..." -- by the time the RNN processes "was," the gradient path back to "cat" passes through so many multiplications that the model can't learn the subject-verb agreement. It effectively forgets what the sentence was about.
This is NOT a bug you can fix with better initialization or learning rate tuning. It's a mathematical consequence of repeatedly multiplying through the same weight matrix. Gradient clipping (capping the gradient norm) helps with the exploding side:
# Gradient clipping -- standard RNN training technique
model_params = [torch.randn(10, 10, requires_grad=True) for _ in range(3)]
# Simulate large gradients
for p in model_params:
    p.grad = torch.randn_like(p) * 100  # artificially large
# Before clipping
total_norm_before = torch.sqrt(sum(p.grad.norm()**2 for p in model_params))
print(f"Gradient norm before clipping: {total_norm_before:.1f}")
# Clip gradients to max norm of 5.0
torch.nn.utils.clip_grad_norm_(model_params, max_norm=5.0)
total_norm_after = torch.sqrt(sum(p.grad.norm()**2 for p in model_params))
print(f"Gradient norm after clipping: {total_norm_after:.1f}")
print(f"\nClipping prevents explosions but can't fix vanishing")
print(f"You can't clip a gradient into existence -- if it's gone, it's gone")
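What clipping by norm does is simple enough to write out. The sketch below mirrors the idea in NumPy: compute one global norm over all gradients and, if it exceeds the limit, rescale everything by the same factor. The function name `clip_by_global_norm` and the small epsilon are choices made for this illustration and are not taken from the PyTorch source.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients by one shared factor so their joint norm <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale up
    return [g * scale for g in grads], total_norm

rng = np.random.default_rng(0)
grads = [rng.normal(size=(10, 10)) * 100 for _ in range(3)]  # artificially large
clipped, before = clip_by_global_norm(grads, max_norm=5.0)
after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"Norm before: {before:.1f}, after: {after:.3f}")

# A vanished gradient passes through untouched: clipping never scales up
tiny = [np.full((10, 10), 1e-9)]
tiny_clipped, tiny_norm = clip_by_global_norm(tiny, max_norm=5.0)
print(f"Tiny gradient norm stays at {tiny_norm:.1e}")
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only the step size changes. The `min(1.0, ...)` is also why clipping is powerless against vanishing gradients.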
The solution requires a different architecture -- one that provides shortcut paths for gradient flow, similar to how ResNet skip connections (episode #46) solved the depth problem for CNNs. That's exactly what LSTMs and GRUs provide, and we'll build them in the next episode.
RNNs in PyTorch
PyTorch provides nn.RNN with optimized CUDA implementations. Let's build a proper text classifier:
import torch
import torch.nn as nn
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, n_classes)
    def forward(self, x):
        embedded = self.embedding(x)          # (batch, seq_len, embed_dim)
        output, h_n = self.rnn(embedded)      # h_n: final hidden state
        return self.classifier(h_n.squeeze(0))
model = TextClassifier(vocab_size=1000, embed_dim=64, hidden_dim=128, n_classes=4)
x = torch.randint(0, 1000, (8, 50)) # batch of 8, sequence length 50
pred = model(x)
print(f"Input: {x.shape}")
print(f"Output: {pred.shape}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
The nn.Embedding layer converts integer token IDs to dense vectors -- similar to the word embeddings from episode #31, but learned jointly with the model rather than pretrained separately. Each word in the vocabulary gets a learnable vector, and the network adjusts these vectors during training so that words used in similar contexts develop similar representations.
The RNN processes the embedded sequence and returns two things: the output at every timestep (useful for sequence labeling) and the final hidden state h_n (useful for classification). For our text classifier, we use h_n -- it's the model's summary of the entire sequence, condensed into a single vector that the linear layer maps to class logits (a softmax, applied inside the cross-entropy loss during training, turns those into probabilities).
batch_first=True means the input tensor shape is (batch, sequence, features) instead of PyTorch's default (sequence, batch, features). Always set this -- the default ordering is a historical artifact that trips everyone up.
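A quick shape check makes the relationship between the two return values concrete. This is a small sketch with arbitrary sizes; note that `batch_first` affects the input and `output`, while `h_n` keeps the layer dimension first either way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(8, 50, 64)            # (batch, seq_len, features)
output, h_n = rnn(x)

print(f"output: {tuple(output.shape)}")   # hidden state at every timestep
print(f"h_n:    {tuple(h_n.shape)}")      # final hidden state, layer dim first

# For a single-layer, unidirectional RNN the last timestep of `output`
# is the same tensor of values as the final hidden state
same = torch.allclose(output[:, -1, :], h_n[-1])
print(f"output[:, -1] equals h_n[-1]: {same}")
```

That equality is why `h_n.squeeze(0)` and `output[:, -1, :]` are interchangeable in the classifier above. It stops being true once you pad sequences to different lengths or go bidirectional, so in those cases prefer `h_n`.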
Multi-layer and bidirectional RNNs
You can stack multiple RNN layers and process sequences in both directions:
# Stacking RNN layers
rnn_deep = nn.RNN(input_size=64, hidden_size=128, num_layers=3,
batch_first=True, dropout=0.2)
x = torch.randn(8, 50, 64) # batch=8, seq_len=50, features=64
output, h_n = rnn_deep(x)
print(f"3-layer RNN:")
print(f" Output shape: {output.shape}") # (8, 50, 128) -- last layer's output
print(f" Hidden shape: {h_n.shape}") # (3, 8, 128) -- one h per layer
print(f" Parameters: {sum(p.numel() for p in rnn_deep.parameters()):,}")
# Bidirectional RNN
rnn_bidir = nn.RNN(input_size=64, hidden_size=128, num_layers=2,
batch_first=True, bidirectional=True)
output_bi, h_n_bi = rnn_bidir(x)
print(f"\nBidirectional 2-layer RNN:")
print(f" Output shape: {output_bi.shape}") # (8, 50, 256) -- forward+backward concat
print(f" Hidden shape: {h_n_bi.shape}") # (4, 8, 128) -- 2 layers x 2 directions
print(f" Parameters: {sum(p.numel() for p in rnn_bidir.parameters()):,}")
Stacking multiple layers lets each layer learn increasingly abstract representations of the sequence, similar to how stacking CNN layers builds a feature hierarchy. The first layer might learn character-level or word-level patterns; deeper layers capture higher-level semantic patterns.
Bidirectional RNNs run the sequence both forward and backward, concatenating the two hidden states at each position. This gives every timestep access to context from both the past and the future. For tasks like sentiment analysis or named entity recognition (where you need the full sentence context to classify each word), bidirectional RNNs are substantially better than unidirectional ones. The tradeoff: you can't use bidirectional RNNs for generation tasks where you're predicting the next token, because the "future" doesn't exist yet at inference time ;-)
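Conceptually a bidirectional layer is just two independent RNNs plus some bookkeeping. Here is a rough NumPy sketch of that bookkeeping with random, untrained weights (the helper names `make_params` and `run` are made up for this example, and biases are left out to keep it short):

```python
import numpy as np

rng = np.random.default_rng(0)
in_d, hid_d, T = 4, 3, 5

def make_params():
    return (rng.normal(size=(in_d, hid_d)) * 0.5,
            rng.normal(size=(hid_d, hid_d)) * 0.5)

def run(seq, Wxh, Whh):
    """Plain RNN pass, returning the hidden state at every position."""
    h, states = np.zeros(hid_d), []
    for x_t in seq:
        h = np.tanh(x_t @ Wxh + h @ Whh)
        states.append(h)
    return np.stack(states)

fwd_params, bwd_params = make_params(), make_params()  # separate weights per direction
seq = rng.normal(size=(T, in_d))

fwd = run(seq, *fwd_params)                 # state t has seen x_0 .. x_t
bwd = run(seq[::-1], *bwd_params)[::-1]     # flipped back: state t has seen x_t .. x_{T-1}
both = np.concatenate([fwd, bwd], axis=1)   # (T, 2 * hid_d)
print(f"Concatenated states: {both.shape}")

# Change only the LAST input and look at position 0
seq2 = seq.copy()
seq2[-1] += 1.0
fwd2 = run(seq2, *fwd_params)
bwd2 = run(seq2[::-1], *bwd_params)[::-1]
print("Forward state at t=0 changed? ", not np.array_equal(fwd[0], fwd2[0]))
print("Backward state at t=0 changed?", not np.array_equal(bwd[0], bwd2[0]))
```

The forward state at position 0 cannot react to a change at the end of the sequence, while the backward state can. That is the "access to the future" the paragraph above describes, and also the reason it is unavailable when generating.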
Training an RNN: a complete example
Let's train a real RNN on a simple sequence classification task. We'll generate synthetic data where the label depends on patterns across the sequence:
from torch.utils.data import TensorDataset, DataLoader
# Generate synthetic sequence data
# Rule: if sum of first half > sum of second half, class 1, else class 0
torch.manual_seed(42)
n_samples = 2000
seq_len = 30
n_features = 5
X = torch.randn(n_samples, seq_len, n_features)
first_half = X[:, :seq_len//2, :].sum(dim=(1, 2))
second_half = X[:, seq_len//2:, :].sum(dim=(1, 2))
y = (first_half > second_half).long()
# Split
X_train, X_test = X[:1600], X[1600:]
y_train, y_test = y[:1600], y[1600:]
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)
test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=200)
class SeqClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(5, 64, num_layers=2, batch_first=True, dropout=0.1)
        self.fc = nn.Linear(64, 2)
    def forward(self, x):
        _, h_n = self.rnn(x)
        return self.fc(h_n[-1])  # use last layer's final hidden state
model = SeqClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(30):
    model.train()
    for xb, yb in train_loader:
        loss = loss_fn(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
    if epoch % 10 == 0:
        model.eval()
        with torch.no_grad():  # evaluation needs no autograd graph
            correct = sum((model(xb).argmax(1) == yb).sum().item()
                          for xb, yb in test_loader)
        print(f"Epoch {epoch:>2d}: test accuracy = {correct/len(y_test):.1%}")
Notice the clip_grad_norm_ call -- gradient clipping is standard practice when training RNNs. Without it, exploding gradients can cause NaN losses within a few batches. A max norm of 1.0 to 5.0 is typical. This task requires the RNN to compare patterns across the full sequence length, which tests its memory capacity. With seq_len=30, a vanilla RNN can handle it. Push it to 200 and performance degrades -- the vanishing gradient rears its head.
Character-level language models
A fun and instructive application: train an RNN to generate text one character at a time. The model sees a character and predicts the next one. At generation time, you feed the prediction back as input, producing text character by character. This is an autoregressive model -- each output becomes the next input, creating a feedback loop that generates arbitrary-length sequences from a single seed character.
class CharRNN(nn.Module):
    def __init__(self, vocab_sz, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_sz, hid)
        self.rnn = nn.RNN(hid, hid, batch_first=True)
        self.fc = nn.Linear(hid, vocab_sz)
    def forward(self, x, h=None):
        out, h = self.rnn(self.embed(x), h)
        return self.fc(out), h
    def generate(self, start, c2i, i2c, length=100, temp=0.8):
        self.eval()
        idx = c2i[start]
        result = [start]
        h = None
        with torch.no_grad():
            for _ in range(length):
                logits, h = self(torch.tensor([[idx]]), h)
                probs = torch.softmax(logits[0, -1] / temp, dim=0)
                idx = torch.multinomial(probs, 1).item()
                result.append(i2c[idx])
        return ''.join(result)
# Quick demo with a tiny vocabulary
text = "hello world hello python hello neural network"
chars = sorted(set(text))
c2i = {c: i for i, c in enumerate(chars)}
i2c = {i: c for c, i in c2i.items()}
vocab_size = len(chars)
# Prepare training data: predict next character
data = torch.tensor([c2i[c] for c in text])
X_char = data[:-1].unsqueeze(0) # all chars except last
y_char = data[1:].unsqueeze(0) # all chars except first
model = CharRNN(vocab_size, hid=64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(200):
    logits, _ = model(X_char)
    loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), y_char.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        print(f"Epoch {epoch}: loss={loss.item():.4f}")
print(f"\nGenerated: {model.generate('h', c2i, i2c, length=40, temp=0.5)}")
The temperature parameter is worth understanding because it appears in every generative model you'll encounter. It scales the logits before softmax: lower temperature (say 0.3) makes the probability distribution sharper -- the model picks the most likely character almost every time, producing repetitive but "safe" text. Higher temperature (say 1.5) flattens the distribution -- less likely characters get picked more often, producing varied but chaotic text. At temperature 1.0 the model samples from its learned distribution exactly.
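You can watch the distribution sharpen and flatten with a few lines of NumPy. The logits below are made-up scores for four imaginary characters, and `softmax_with_temperature` is a helper written for this illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temp):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = np.asarray(logits, dtype=float) / temp
    scaled -= scaled.max()          # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.0]       # made-up scores for four characters
for temp in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(f"temp={temp}: {np.round(probs, 3)}")

p_low = softmax_with_temperature(logits, 0.3)
p_one = softmax_with_temperature(logits, 1.0)
p_high = softmax_with_temperature(logits, 1.5)
```

The ranking of the characters never changes, only how much probability mass the top choice takes. As the temperature approaches zero, sampling turns into a plain argmax.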
Trained on Shakespeare, such a model produces surprisingly convincing pseudo-Shakespeare after just a few epochs. Trained on Python code, it generates syntactically plausible (if nonsensical) Python -- it learns indentation rules, bracket matching, and keyword patterns. The model has zero understanding of any of these domains. It has learned statistical patterns of which characters follow which sequences of characters, nothing more. But those statistical patterns are often enough to fool a casual reader ;-)
Sequence-to-sequence basics
So far we've used RNNs for classification (one output from the final state) and generation (one output per timestep). There's a third pattern: many-to-many with different input and output lengths. Translation ("hello" -> "bonjour"), summarization (paragraph -> sentence), or time series forecasting (past 30 values -> next 10 values).
# Simple many-to-many: predict next N values from past M values
class SeqToSeq(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_len):
        super().__init__()
        self.encoder = nn.RNN(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.RNN(hidden_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, input_dim)
        self.output_len = output_len
        self.hidden_dim = hidden_dim

    def forward(self, x):
        # Encode input sequence
        _, h = self.encoder(x)
        # Decode: generate output_len steps
        # Start with zeros as initial decoder input
        dec_input = torch.zeros(x.size(0), 1, self.hidden_dim)
        outputs = []
        for _ in range(self.output_len):
            dec_out, h = self.decoder(dec_input, h)
            pred = self.fc(dec_out)
            outputs.append(pred)
            dec_input = dec_out  # feed decoder's output back as input
        return torch.cat(outputs, dim=1)
model = SeqToSeq(input_dim=1, hidden_dim=32, output_len=10)
x = torch.randn(16, 30, 1) # 16 sequences, 30 timesteps, 1 feature
out = model(x)
print(f"Input (past 30 values): {x.shape}")
print(f"Output (next 10 values): {out.shape}")
This encoder-decoder pattern is fundamental. The encoder processes the input and compresses it into a hidden state. The decoder uses that hidden state to generate the output sequence step by step. Having said that, vanilla RNNs struggle here because the entire input sequence must be compressed into a single hidden state vector -- a severe information bottleneck for long sequences. We'll see much better approaches when we cover attention mechanisms and transformers in upcoming episodes.
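The bottleneck can be seen directly in a small NumPy sketch. It assumes an untrained encoder with random weights and reuses the vanilla RNN update from earlier in this episode; the `encode` helper and the sequence lengths are illustrative choices, not part of the `SeqToSeq` model above.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 1, 32

# Randomly initialised (untrained) encoder weights, for illustration only
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def encode(sequence):
    """Apply the vanilla RNN update over a sequence; return the final hidden state."""
    h = np.zeros(hidden_dim)
    for x_t in sequence:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

short_seq = rng.normal(size=(30, input_dim))    # 30 timesteps
long_seq = rng.normal(size=(3000, input_dim))   # 3000 timesteps

h_short = encode(short_seq)
h_long = encode(long_seq)
print(f"30 input values   -> summary of {h_short.size} numbers")
print(f"3000 input values -> summary of {h_long.size} numbers")
```

Whether the input has 30 or 3000 timesteps, the decoder receives the same number of values (`hidden_dim`), so longer inputs have to be squeezed harder into the same vector.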
When RNNs are still used (and when they're not)
The honest truth: vanilla RNNs are rarely used in production in 2026. LSTMs and GRUs (next episode) largely tamed the vanishing gradient problem and dominated sequence modeling from roughly 2014 to 2017. Then transformers addressed the efficiency problem -- RNNs must process sequences step by step (inherently sequential), while transformers process all positions in parallel during training, making much better use of GPU hardware.
So why spend an entire episode on vanilla RNNs? Because they're the foundation everything else builds on. LSTMs and GRUs are direct extensions -- they add gating mechanisms to the same recurrent structure, so if you understand the vanilla RNN loop, you understand 80% of what makes gated architectures work. The concept of hidden state carrying memory through a sequence appears everywhere in modern architectures, even ones that aren't called "recurrent." State space models (Mamba and friends), which are making a comeback in 2025-2026, are essentially sophisticated RNN variants with clever parameterizations.
For small models on edge devices -- think smartwatches, hearing aids, IoT sensors -- RNN variants remain competitive because they process one timestep at a time with constant memory. A standard transformer over a 1000-step sequence computes attention weights between every pair of the 1000 positions, so that memory grows quadratically with sequence length. An RNN needs to store one hidden state vector, regardless of sequence length. That memory efficiency matters when your entire model budget is 50KB.
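A back-of-the-envelope comparison makes the gap concrete. This is a rough sketch under stated assumptions: it counts only the full attention score matrix for a single head in a single layer, ignores weights, other activations and memory-efficient attention implementations, and uses a hypothetical `hidden_dim` of 64. Both function names are made up for this example.

```python
def attention_score_floats(seq_len, num_heads=1):
    """Floats in the full seq_len x seq_len attention score matrix, per layer."""
    return num_heads * seq_len * seq_len

def rnn_state_floats(hidden_dim):
    """Floats in one hidden state vector; independent of sequence length."""
    return hidden_dim

hidden_dim = 64  # hypothetical size, chosen for illustration
for seq_len in (100, 1000, 10000):
    attn = attention_score_floats(seq_len)
    rnn = rnn_state_floats(hidden_dim)
    print(f"seq_len={seq_len:>6}: attention scores={attn:>12,}  rnn state={rnn}")
```

Making the sequence ten times longer multiplies the attention score count by one hundred, while the RNN state stays the same size.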
And perhaps most importantly: the vanishing gradient problem and its solutions (gating, skip connections, careful initialization) are fundamental concepts that appear across all of deep learning. You'll see the same ideas in every architecture we study from here on.
What we built today
- Feedforward networks process inputs independently -- RNNs maintain hidden state that carries information across timesteps;
- The vanilla RNN update: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h) -- hidden state merges new input with accumulated history;
- Weight sharing across timesteps means RNNs handle variable-length sequences with fixed parameters;
- BPTT unrolls the RNN into a deep network -- gradients multiply through all timesteps;
- Vanishing gradients limit vanilla RNNs to ~10-20 timestep dependencies -- long-range patterns are lost. Gradient clipping helps with explosions but can't fix vanishing;
- Bidirectional RNNs process sequences both forward and backward -- great for classification, unusable for generation;
- Character-level language models predict the next character given history -- temperature controls generation randomness;
- The encoder-decoder pattern compresses input into a hidden state, then generates output step by step;
- Vanilla RNNs are rarely used directly today, but the concepts (hidden state, weight sharing, BPTT, gating) underpin EVERYTHING in modern sequence modeling.
We've now entered the world of sequence models. The vanilla RNN is limited by vanishing gradients, but the architecture pattern -- step through a sequence, maintain state, share weights -- is the right idea. It just needs a better memory mechanism. That mechanism involves adding gates that control what information to keep, what to forget, and what to output. These gates give the network explicit control over its memory flow, creating shortcut paths for gradients that solve the vanishing problem elegantly. We'll build both LSTM and GRU cells from scratch, compare them, and see why they dominated NLP for years ;-)
Exercises
Exercise 1: Implement a complete RNN forward and backward pass from scratch in NumPy. Build a VanillaRNN class with forward(inputs) that stores all intermediate hidden states, and backward(d_output) that computes gradients for W_xh, W_hh, W_hy, b_h, and b_y using BPTT. Test it on a 10-step sequence with input_dim=4, hidden_dim=8, output_dim=3. Print the gradient norms for each weight matrix and verify they're non-zero.
Exercise 2: Build a sentiment-style binary sequence classifier using nn.RNN. Generate synthetic data: 1000 sequences of length 20, where each element is a random float (drawn from a distribution, such as a standard normal, in which values above 1.5 actually occur). Label = 1 if the sequence contains more than 3 values above 1.5, else 0. Hold out part of the data as a test set and train a single-layer RNN with hidden_dim=32 for 50 epochs. Report test accuracy and compare against a simple feedforward baseline that takes the full flattened sequence as input. Which performs better, and why?
Exercise 3: Implement a sequence length experiment. Train the same RNN architecture (hidden_dim=64) on sequences of length 10, 25, 50, 100, and 200. The task: classify whether the first element of the sequence is positive or negative (the rest are noise). Plot (or print) accuracy vs sequence length. At what length does the vanilla RNN start failing? This directly measures the vanishing gradient's practical impact on memory span.