Learn AI Series (#50) - Sequence-to-Sequence Models

What will I learn

You will learn the encoder-decoder architecture -- how two RNNs work together to transform one sequence into another;
the bottleneck problem -- why compressing an entire input into a single vector is a fundamental limitation;
teacher forcing -- a training trick that speeds up convergence but creates a gap between training and inference;
beam search -- generating better outputs by exploring multiple candidates simultaneously;
real applications: machine translation, summarization, and conversational models;
building a complete seq2seq training loop with scheduled sampling in PyTorch;
a preview of the attention mechanism that solves the bottleneck (covered in full next episode).

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#50) - Sequence-to-Sequence Models

Solutions to Episode #49 Exercises

Exercise 1: Implement the LSTMCell class and track cell state/hidden state norms over 50 timesteps, compared with vanilla RNN.

import torch
import torch.nn as nn

class LSTMCellManual(nn.Module):
    def __init__(self, in_d, hid_d):
        super().__init__()
        self.gates = nn.Linear(in_d + hid_d, 4 * hid_d)
        self.hid_d = hid_d

    def forward(self, x_t, h_prev, c_prev):
        combined = torch.cat([x_t, h_prev], dim=-1)
        gates = self.gates(combined)
        f, i, g, o = gates.chunk(4, dim=-1)
        f = torch.sigmoid(f)
        i = torch.sigmoid(i)
        g = torch.tanh(g)
        o = torch.sigmoid(o)
        c = f * c_prev + i * g
        h = o * torch.tanh(c)
        return h, c

torch.manual_seed(42)
in_d, hid_d = 8, 32

# LSTM cell tracking
cell = LSTMCellManual(in_d, hid_d)
h = torch.zeros(1, hid_d)
c = torch.zeros(1, hid_d)

print("LSTM cell state and hidden state norms:")
for t in range(50):
    x_t = torch.randn(1, in_d)
    h, c = cell(x_t, h, c)
    if t in [0, 10, 20, 30, 40, 49]:
        print(f"  t={t:>2d}: h_norm={h.norm():.4f}, c_norm={c.norm():.4f}")

# Vanilla RNN tracking
Wxh = torch.randn(in_d, hid_d) * 0.1
Whh = torch.randn(hid_d, hid_d) * 0.1
bh = torch.zeros(hid_d)
h_rnn = torch.zeros(hid_d)

print("\nVanilla RNN hidden state norms:")
for t in range(50):
    x_t = torch.randn(in_d)
    h_rnn = torch.tanh(x_t @ Wxh + h_rnn @ Whh + bh)
    if t in [0, 10, 20, 30, 40, 49]:
        print(f"  t={t:>2d}: h_norm={h_rnn.norm():.4f}")

The LSTM cell state is never squashed (it is a running sum, gated by f and i), so its norm can grow as information accumulates through the input gate, while every entry of the hidden state stays inside (-1, 1) because of the output gate and tanh. The vanilla RNN hidden state norm typically saturates quickly near a fixed value because tanh squashes everything -- it can't selectively store or release information the way the LSTM gates do.

Exercise 2: Sequence copying task demonstrating LSTM's long-range memory vs vanilla RNN.

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(42)
n_samples = 3000
seq_len = 40
copy_len = 5

# Generate data: input = [5 random digits, then 35 zeros]; target = the first 5 digits
X = torch.zeros(n_samples, seq_len, 1)
y = torch.zeros(n_samples, copy_len, dtype=torch.long)
for i in range(n_samples):
    digits = torch.randint(1, 10, (copy_len,))
    X[i, :copy_len, 0] = digits.float() / 9.0  # normalize
    y[i] = digits - 1  # class indices 0-8

X_tr, X_te = X[:2400], X[2400:]
y_tr, y_te = y[:2400], y[2400:]
loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=64, shuffle=True)

class CopyModel(nn.Module):
    def __init__(self, rnn_type="lstm", hid=64):
        super().__init__()
        if rnn_type == "lstm":
            self.rnn = nn.LSTM(1, hid, batch_first=True)
        else:
            self.rnn = nn.RNN(1, hid, batch_first=True)
        self.fc = nn.Linear(hid, 9 * copy_len)  # 9 classes x 5 positions

    def forward(self, x):
        out, _ = self.rnn(x)
        final = out[:, -1, :]  # last timestep
        return self.fc(final).view(-1, copy_len, 9)

for rnn_type in ["rnn", "lstm"]:
    model = CopyModel(rnn_type)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(40):
        model.train()
        for xb, yb in loader:
            logits = model(xb)
            loss = nn.CrossEntropyLoss()(logits.view(-1, 9), yb.view(-1))
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
            opt.step()

    model.eval()
    with torch.no_grad():
        preds = model(X_te).argmax(-1)
        acc = (preds == y_te).float().mean()
    print(f"{rnn_type.upper():>4s}: copy accuracy = {acc:.1%}")

The LSTM should achieve high accuracy (80%+) because the cell state can carry the 5 digits across the 35-zero gap. The vanilla RNN typically struggles badly -- the gradient from the output at the last timestep can't reach back through 35 or more timesteps to learn what the initial digits were.

Exercise 3: Unidirectional vs bidirectional LSTM on a neighbor-dependent tagging task.

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(42)
n_samples = 2000
seq_len = 20

X = torch.randn(n_samples, seq_len, 1)
y = torch.zeros(n_samples, seq_len, dtype=torch.long)
for i in range(n_samples):
    for j in range(1, seq_len - 1):
        if X[i, j-1, 0] > 0 and X[i, j+1, 0] > 0:
            y[i, j] = 1

X_tr, X_te = X[:1600], X[1600:]
y_tr, y_te = y[:1600], y[1600:]
loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=64, shuffle=True)

class Tagger(nn.Module):
    def __init__(self, bidir=False):
        super().__init__()
        self.lstm = nn.LSTM(1, 64, batch_first=True, bidirectional=bidir)
        hid = 128 if bidir else 64
        self.fc = nn.Linear(hid, 2)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out)  # (batch, seq_len, 2)

for bidir in [False, True]:
    label = "Bidirectional" if bidir else "Unidirectional"
    model = Tagger(bidir=bidir)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(30):
        model.train()
        for xb, yb in loader:
            logits = model(xb)
            loss = nn.CrossEntropyLoss()(logits.view(-1, 2), yb.view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()

    model.eval()
    with torch.no_grad():
        preds = model(X_te).argmax(-1)
        acc = (preds == y_te).float().mean()
    print(f"{label:>15s}: tagging accuracy = {acc:.1%}")

The bidirectional model should win clearly because labeling position j requires knowing both the left neighbor (j-1) and the right neighbor (j+1). The unidirectional LSTM only sees past context when processing position j -- it literally cannot see the right neighbor yet, so it's missing half the information the label depends on.

On to today's episode

Fifty episodes. Half a hundred. When I started this series with "What Machine Learning Actually Is" I honestly wasn't sure I'd make it past twenty, let alone fifty ;-) But here we are, and the timing is perfect because we're entering what I consider the most exciting stretch of the whole series.

Last episode we built LSTMs and GRUs -- gated architectures that finally gave recurrent networks actual long-term memory. We compared them on the "remember the first element" task and saw that vanilla RNNs collapse at sequence length 100 while LSTMs and GRUs handle it with ease. The gates largely tamed the vanishing gradient problem, and for roughly three years (2014-2017), gated RNNs were the dominant architecture for almost everything involving sequences.

But everything we've done with RNNs so far maps an input sequence to a single output: a sentiment label, a class, a next-character prediction. What about the problems where both the input AND the output are sequences -- and they can have completely different lengths? Translate an English sentence into Dutch. Summarize a 500-word paragraph into two sentences. Turn a user's question into a chatbot response.

You can't solve this with a single RNN. A standard RNN produces one output per timestep, so the output length is locked to the input length. "How are you?" has three words but "Hoe gaat het met je?" has five. You need an architecture that can read an input of any length and then generate an output of any length -- independently.
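To make the length constraint concrete, here is a minimal sketch showing that a plain `nn.RNN` emits exactly one output vector per input timestep. The layer sizes and the sequence lengths are arbitrary choices for illustration, not values from this series.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

out_lengths = {}
for src_len in [3, 5, 12]:
    x = torch.randn(1, src_len, 8)       # (batch, src_len, features)
    out, h_n = rnn(x)                    # out: (batch, src_len, hidden)
    out_lengths[src_len] = out.size(1)   # number of per-step outputs
    print(f"input length {src_len:>2d} -> output length {out.size(1):>2d}")
```

Whatever the input length, the output length matches it, so a three-word input can never directly yield a five-word output.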

Here we go!

The encoder-decoder architecture

The sequence-to-sequence (seq2seq) model, introduced by Sutskever, Vinyals, and Le at Google in 2014, does exactly this by chaining two RNNs together: an encoder that reads the input and a decoder that produces the output.

The encoder is an RNN (typically an LSTM, as we built in episode #49) that processes the input sequence one token at a time, building up a hidden state that progressively accumulates information about the entire input. When it reaches the end of the input, its final hidden state is a fixed-size vector that (in theory) captures everything the model needs to know about the input. This vector is called the context vector.

The decoder is a separate RNN that takes the context vector as its initial hidden state and generates the output sequence, one token at a time. At each step, it produces a probability distribution over the output vocabulary, picks the most likely token (or samples from the distribution), and feeds that token back as input for the next step. A special end-of-sequence token (EOS) tells the decoder when to stop.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_sz, emb_d, hid_d, n_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_sz, emb_d)
        self.lstm = nn.LSTM(emb_d, hid_d, n_layers, batch_first=True)

    def forward(self, src):
        outputs, (h, c) = self.lstm(self.embed(src))
        return h, c  # context vector

class Decoder(nn.Module):
    def __init__(self, vocab_sz, emb_d, hid_d, n_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_sz, emb_d)
        self.lstm = nn.LSTM(emb_d, hid_d, n_layers, batch_first=True)
        self.fc = nn.Linear(hid_d, vocab_sz)

    def forward(self, token, h, c):
        out, (h, c) = self.lstm(self.embed(token), (h, c))
        return self.fc(out), h, c

# Quick demo
enc = Encoder(vocab_sz=1000, emb_d=64, hid_d=128)
dec = Decoder(vocab_sz=800, emb_d=64, hid_d=128)

src = torch.randint(0, 1000, (4, 12))  # batch=4, src_len=12
h, c = enc(src)
print(f"Encoder input: {src.shape}")
print(f"Context vector h: {h.shape}, c: {c.shape}")

# Decoder generates one token at a time
token = torch.randint(0, 800, (4, 1))  # random stand-in for a start token
logits, h, c = dec(token, h, c)
print(f"Decoder output logits: {logits.shape}")  # (4, 1, 800)

The architecture is strikingly clean. The encoder reads "How are you?" and compresses its meaning into the final (h, c) pair. The decoder receives this pair as its starting state and generates "Hoe gaat het met je?" one word at a time. The two networks share no weights -- they're completely separate LSTMs connected only by the context vector.

This separation is powerful because the input and output can have entirely different vocabularies (English vs Dutch), different lengths, and different structures. The context vector is the only bridge between them -- a learned compression of the input that the decoder unpacks into the output.
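The generation loop described above (feed each predicted token back in, stop at EOS) can be sketched as follows. This is a hedged, self-contained illustration with untrained layers: the token ids `SOS`, `EOS`, the cap `MAX_LEN`, and all layer sizes are assumptions made up for the example, so the generated tokens are meaningless and only the control flow matters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
SRC_VOCAB, TGT_VOCAB, EMB, HID = 50, 40, 16, 32
SOS, EOS, MAX_LEN = 1, 2, 15   # hypothetical special-token ids and length cap

src_embed = nn.Embedding(SRC_VOCAB, EMB)
enc_lstm = nn.LSTM(EMB, HID, batch_first=True)
tgt_embed = nn.Embedding(TGT_VOCAB, EMB)
dec_lstm = nn.LSTM(EMB, HID, batch_first=True)
proj = nn.Linear(HID, TGT_VOCAB)

def greedy_decode(src):
    """Encode src of shape (1, src_len), then emit tokens until EOS or MAX_LEN."""
    with torch.no_grad():
        _, (h, c) = enc_lstm(src_embed(src))        # final states = context vector
        token = torch.tensor([[SOS]])               # decoder starts from SOS
        generated = []
        for _ in range(MAX_LEN):
            out, (h, c) = dec_lstm(tgt_embed(token), (h, c))
            token = proj(out).argmax(-1)            # (1, 1): most likely next token
            generated.append(token.item())
            if generated[-1] == EOS:                # EOS ends the output
                break
    return generated

src = torch.randint(3, SRC_VOCAB, (1, 7))
tokens = greedy_decode(src)
print(f"source length: {src.size(1)}, generated length: {len(tokens)}")
print(f"generated tokens: {tokens}")
```

Note that the output length is decided by the decoder (EOS or the cap), not by the source length.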

The bottleneck problem

Here's the fundamental flaw: the context vector is a single fixed-size vector, regardless of whether the input is 5 words or 500 words. A 256-dimensional hidden state must compress "The quick brown fox jumps over the lazy dog" into the same number of floats as it compresses an entire Wikipedia article.
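A quick way to see the fixed size is to count the floats in the context for inputs of very different lengths. The sketch below uses an untrained single-layer LSTM with an assumed vocabulary of 1000 and hidden size 256. Only the shapes are of interest here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(1000, 64)
lstm = nn.LSTM(64, 256, batch_first=True)

context_sizes = {}
with torch.no_grad():
    for src_len in [5, 50, 500]:
        src = torch.randint(0, 1000, (1, src_len))
        _, (h, c) = lstm(embed(src))
        # The context is the pair (h, c); its size ignores src_len entirely
        context_sizes[src_len] = h.numel() + c.numel()
        print(f"src_len={src_len:>3d}: h {tuple(h.shape)}, c {tuple(c.shape)}, "
              f"floats in context = {context_sizes[src_len]}")
```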

For short sequences (under 20-30 tokens), this works reasonably well. The LSTM's cell state is expressive enough to retain the important bits. But as sequences get longer, the compression becomes lossy. Information from the beginning of the input gets progressively overwritten by information from later timesteps -- the same issue we discussed with vanilla RNNs in episode #48, now manifesting as an information bottleneck rather than a gradient problem.

import torch
import torch.nn as nn

def measure_bottleneck(seq_len, hidden_dim=128, n_trials=5):
    """Measure how much a change to token 0 still shows up in the final state."""
    rel_diffs = []
    for trial in range(n_trials):
        # Seed before building the encoder so every sequence length is
        # measured with the same set of (untrained) encoders
        torch.manual_seed(42 + trial)
        encoder = nn.LSTM(16, hidden_dim, batch_first=True)
        encoder.eval()

        # Create two sequences that differ only at position 0
        base = torch.randn(1, seq_len, 16)
        modified = base.clone()
        modified[0, 0, :] = torch.randn(16)  # change ONLY the first token

        with torch.no_grad():
            _, (h_base, _) = encoder(base)
            _, (h_mod, _) = encoder(modified)

        # How different are the context vectors, relative to their size?
        diff = (h_base - h_mod).norm().item()
        rel_diffs.append(diff / h_base.norm().item())
    return sum(rel_diffs) / n_trials  # mean relative difference over trials

print("Bottleneck impact -- how well does the encoder remember token 0?")
print(f"{'Seq len':>10s}  {'Relative diff':>14s}  {'Assessment':>20s}")
for length in [5, 10, 25, 50, 100, 200]:
    rel_diff = measure_bottleneck(length)
    quality = "strong signal" if rel_diff > 0.1 else "weak signal" if rel_diff > 0.01 else "almost lost"
    print(f"{length:>10d}  {rel_diff:>14.6f}  {quality:>20s}")

Empirically, early seq2seq models for machine translation showed a sharp performance drop as sentence length increased beyond about 20 words. The quality was impressive for short sentences but degraded rapidly. A translation system that works brilliantly on "I love coffee" but falls apart on real paragraphs isn't very useful ;-)

The solution -- attention -- came just months later and solved this problem elegantly. We'll build it in the next episode. But first, there are two more important concepts in seq2seq training and inference that you need to understand.

Teacher forcing

Training a seq2seq model involves a subtle chicken-and-egg problem. The decoder generates tokens one at a time, and each generated token becomes the input for the next step. But during early training, the model's predictions are essentially random. If the first generated word is wrong, every subsequent word is conditioned on that wrong input, and the entire output goes off the rails. The model can't learn from complete garbage.

Teacher forcing solves this by feeding the correct previous token (from the ground truth target) as input to the decoder at each step, rather than the model's own prediction. During training, the decoder for "Hoe gaat het met je?" receives "Hoe" as input at step 2 (regardless of what it predicted at step 1), "gaat" at step 3, and so on. This gives the model clean inputs to learn from, even when its predictions are still unreliable.

import torch
import torch.nn as nn

# Demonstrating the difference between teacher forcing and free running
torch.manual_seed(42)
vocab_size = 50
hidden_dim = 32
seq_len = 8

decoder = nn.LSTM(16, hidden_dim, batch_first=True)
embed = nn.Embedding(vocab_size, 16)
output_proj = nn.Linear(hidden_dim, vocab_size)

target = torch.randint(0, vocab_size, (1, seq_len))  # ground truth
h = torch.randn(1, 1, hidden_dim)
c = torch.zeros(1, 1, hidden_dim)

# Teacher forcing: always feed ground truth
print("Teacher forcing (always correct input):")
h_tf, c_tf = h.clone(), c.clone()
for t in range(seq_len):
    inp = embed(target[:, t:t+1])
    out, (h_tf, c_tf) = decoder(inp, (h_tf, c_tf))  # carry the state forward
    pred = output_proj(out).argmax(-1).item()
    print(f"  t={t}: input={target[0, t].item():>3d} (ground truth), "
          f"predicted={pred:>3d}")

# Free running: feed own predictions
print("\nFree running (own predictions as input):")
h_fr, c_fr = h.clone(), c.clone()
inp_token = target[:, 0:1]  # start with first ground truth token
for t in range(seq_len):
    inp = embed(inp_token)
    out, (h_fr, c_fr) = decoder(inp, (h_fr, c_fr))
    pred = output_proj(out).argmax(-1)
    print(f"  t={t}: input={inp_token[0, 0].item():>3d} "
          f"{'(ground truth)' if t == 0 else '(own prediction)'}, "
          f"predicted={pred[0, 0].item():>3d}")
    inp_token = pred  # feed prediction back

The downside is exposure bias: during training, the model always sees correct previous tokens, but during inference, it sees its own predictions -- which contain errors. The model has never learned to recover from its own mistakes. If it generates "Hoe gaat" but then makes an error at step 3, it has no experience operating under that condition because it never encountered errors during training.

Scheduled sampling gradually transitions from teacher forcing to model predictions during training -- start with 100% teacher forcing, and over the course of training, randomly replace some ground-truth inputs with the model's own predictions. By the end of training, the model has seen some of its own errors and learned to cope. Curriculum learning starts with short, easy sequences and gradually increases difficulty. Neither fully solves the problem, but both help.
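How fast to move away from teacher forcing is a design choice. As a rough sketch, here are three decay shapes often used for the teacher forcing probability: linear, exponential, and inverse sigmoid. The function names and default constants are illustrative assumptions. The linear defaults mirror the `max(0.2, 1.0 - epoch * 0.03)` form used in this episode's training loop.

```python
import math

def linear_decay(step, start=1.0, floor=0.2, slope=0.03):
    """Drop by a fixed amount per step, never below `floor`."""
    return max(floor, start - slope * step)

def exponential_decay(step, k=0.9):
    """Multiply by k (< 1) at every step."""
    return k ** step

def inverse_sigmoid_decay(step, k=10.0):
    """Stay high for a while, then fall off; k (>= 1) sets how long it stays high."""
    return k / (k + math.exp(step / k))

print(f"{'step':>4s}  {'linear':>7s}  {'exp':>7s}  {'inv_sig':>7s}")
for step in [0, 5, 10, 20, 29]:
    print(f"{step:>4d}  {linear_decay(step):>7.3f}  "
          f"{exponential_decay(step):>7.3f}  {inverse_sigmoid_decay(step):>7.3f}")
```

All three start near 1.0 (mostly ground truth) and decrease, so the model sees more of its own predictions as training progresses.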

In practice, most implementations use teacher forcing throughout training (it's simpler and faster) and accept the exposure bias. The gap between training and inference performance is real but manageable for most applications, especially once attention is added.
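Part of why pure teacher forcing is faster: since every decoder input is known in advance, the whole target can be pushed through the LSTM in one call instead of a Python loop. A minimal sketch, assuming position 0 of the target holds a start token and using zero states as a stand-in for the encoder's context (all sizes are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, EMB, HID = 30, 16, 32
embed = nn.Embedding(VOCAB, EMB)
dec_lstm = nn.LSTM(EMB, HID, batch_first=True)
proj = nn.Linear(HID, VOCAB)

tgt = torch.randint(0, VOCAB, (4, 9))   # (batch, tgt_len); position 0 = start token
h0 = torch.zeros(1, 4, HID)             # stand-in for the encoder's final hidden state
c0 = torch.zeros(1, 4, HID)             # stand-in for the encoder's final cell state

dec_in = tgt[:, :-1]    # decoder inputs:  tokens 0 .. T-2
dec_tgt = tgt[:, 1:]    # decoder targets: tokens 1 .. T-1 (shifted by one)

out, _ = dec_lstm(embed(dec_in), (h0, c0))   # all timesteps in a single call
logits = proj(out)                           # (4, 8, VOCAB)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), dec_tgt.reshape(-1))
print(f"logits: {tuple(logits.shape)}, loss: {loss.item():.4f}")
```

The shift by one is the whole trick: at every position the decoder sees the correct previous token and is scored on the next one.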

Beam search

At inference time, the decoder generates one token per step and feeds it back as input. The simplest approach -- greedy decoding -- picks the highest-probability token at each step. But greedy decoding can miss better sequences. If "I" is the most likely first word but "The" leads to a much better overall translation, greedy decoding is stuck with "I" forever.
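A toy example with made-up probabilities shows how this happens. The two-step "language model" below is purely hypothetical: "I" is the likelier first word, but every continuation of "I" is mediocre, while "The" has one very strong continuation.

```python
# Hypothetical next-token probabilities for a two-word output
first = {"I": 0.6, "The": 0.4}
second = {
    "I": {"am": 0.4, "was": 0.3, "will": 0.3},
    "The": {"cat": 0.9, "dog": 0.1},
}

# Greedy: commit to the best token at each step
g1 = max(first, key=first.get)
g2 = max(second[g1], key=second[g1].get)
greedy = ((g1, g2), first[g1] * second[g1][g2])

# Exhaustive: score every complete two-word sequence
all_seqs = {(w1, w2): p1 * p2
            for w1, p1 in first.items()
            for w2, p2 in second[w1].items()}
best_seq = max(all_seqs, key=all_seqs.get)
best = (best_seq, all_seqs[best_seq])

print(f"greedy: {greedy[0]} with probability {greedy[1]:.2f}")
print(f"best:   {best[0]} with probability {best[1]:.2f}")
```

Exhaustive scoring is only feasible here because the toy vocabulary is tiny; with a real vocabulary the number of sequences explodes.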

Beam search keeps k candidate sequences alive at each step instead of one (k is called the beam width). At step 1, keep the top k tokens. At step 2, expand each of those k candidates by all possible next tokens, score all k x V combinations (where V is vocabulary size), and keep only the top k. Continue until all beams produce an EOS token or hit a maximum length.

import torch
import torch.nn as nn
import torch.nn.functional as F

def beam_search(decoder, embed, proj, h, c, start_tok, eos_tok,
                beam_width=5, max_len=20):
    """Simple beam search decoder."""
    # Each beam: (log_prob, token_sequence, hidden, cell)
    beams = [(0.0, [start_tok], h, c)]
    completed = []

    for step in range(max_len):
        candidates = []
        for log_prob, seq, h_b, c_b in beams:
            if seq[-1] == eos_tok:
                completed.append((log_prob, seq))
                continue
            inp = embed(torch.tensor([[seq[-1]]]))
            out, (h_new, c_new) = decoder(inp, (h_b, c_b))
            log_probs = F.log_softmax(proj(out[0, -1]), dim=0)

            topk_lp, topk_idx = log_probs.topk(beam_width)
            for lp, idx in zip(topk_lp.tolist(), topk_idx.tolist()):
                candidates.append((log_prob + lp, seq + [idx],
                                   h_new.clone(), c_new.clone()))

        # Keep top beam_width candidates
        candidates.sort(key=lambda x: x[0], reverse=True)
        beams = candidates[:beam_width]

        if len(beams) == 0:
            break

    # Add any remaining beams to completed
    completed.extend([(lp, seq) for lp, seq, _, _ in beams])
    completed.sort(key=lambda x: x[0], reverse=True)
    return completed

# Demo
vocab_sz = 100
hid = 64
decoder_rnn = nn.LSTM(32, hid, batch_first=True)
embed_layer = nn.Embedding(vocab_sz, 32)
proj_layer = nn.Linear(hid, vocab_sz)

h0 = torch.randn(1, 1, hid)
c0 = torch.zeros(1, 1, hid)

results = beam_search(decoder_rnn, embed_layer, proj_layer,
                      h0, c0, start_tok=1, eos_tok=2,
                      beam_width=5, max_len=10)

print(f"Beam search results (beam_width=5):")
for i, (lp, seq) in enumerate(results[:5]):
    # Length normalization
    norm_lp = lp / len(seq)
    print(f"  Beam {i}: log_prob={lp:.2f}, "
          f"norm_log_prob={norm_lp:.2f}, "
          f"length={len(seq)}, tokens={seq[:8]}...")

With beam width k=1, beam search reduces to greedy decoding. With k=5 (a common choice), you explore 5 parallel hypotheses at each step. The computational cost scales linearly with k, and in practice k=5-10 gives most of the benefit -- increasing beyond that rarely helps.

One interesting side effect: beam search tends to produce shorter outputs than expected, because shorter sequences accumulate fewer negative log-probability terms. A common fix is length normalization -- dividing the total log-probability by the sequence length (raised to some power between 0 and 1) to avoid this bias towards short outputs.
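Here is a small sketch of that re-ranking with two hand-picked hypotheses. The helper name `length_normalized` and the exponent `alpha=0.7` are assumptions for illustration. With `alpha=1.0` it reduces to the plain per-token average used in the demo above, and with `alpha=0.0` it leaves the raw score untouched.

```python
def length_normalized(log_prob, length, alpha=0.7):
    """Divide the total log-probability by length**alpha."""
    return log_prob / (length ** alpha)

# (total log-probability, token sequence): made-up values
hypotheses = [
    (-4.0, [1, 7, 2]),                # short output
    (-6.0, [1, 7, 9, 4, 3, 8, 2]),    # longer output, lower raw score
]

raw_best = max(hypotheses, key=lambda hyp: hyp[0])
norm_best = max(hypotheses,
                key=lambda hyp: length_normalized(hyp[0], len(hyp[1])))

for lp, seq in hypotheses:
    print(f"len={len(seq)}: raw={lp:.2f}, "
          f"normalized={length_normalized(lp, len(seq)):.3f}")
print(f"raw ranking picks length {len(raw_best[1])}, "
      f"normalized ranking picks length {len(norm_best[1])}")
```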

Putting it together: the complete seq2seq training loop

Here's how encoder and decoder connect in a full training loop. The encoder processes the source sequence and hands its final states to the decoder. The decoder generates one token per step, using scheduled sampling to gradually transition from teacher forcing to its own predictions:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

class Seq2Seq(nn.Module):
    def __init__(self, enc_vocab, dec_vocab, emb_d=64, hid_d=128):
        super().__init__()
        self.encoder = Encoder(enc_vocab, emb_d, hid_d)
        self.decoder = Decoder(dec_vocab, emb_d, hid_d)
        self.dec_vocab = dec_vocab

    def forward(self, src, tgt, tf_ratio=0.5):
        h, c = self.encoder(src)
        inp = tgt[:, :1]  # start token
        outputs = []
        for t in range(1, tgt.size(1)):
            out, h, c = self.decoder(inp, h, c)
            outputs.append(out)
            use_tf = torch.rand(1).item() < tf_ratio
            inp = tgt[:, t:t+1] if use_tf else out.argmax(-1)
        return torch.cat(outputs, dim=1)

# Synthetic sequence reversal task
# Model learns to reverse a sequence of integers
torch.manual_seed(42)
vocab_sz = 20
n_samples = 3000
max_len = 10

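src_data = torch.randint(2, vocab_sz, (n_samples, max_len))  # 0=pad, 1=start token (SOS)
# Target = SOS followed by the reversed source, so the decoder predicts all 10 tokens
sos = torch.full((n_samples, 1), 1, dtype=torch.long)
tgt_data = torch.cat([sos, src_data.flip(1)], dim=1)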

X_tr, X_te = src_data[:2400], src_data[2400:]
y_tr, y_te = tgt_data[:2400], tgt_data[2400:]
loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=64, shuffle=True)

model = Seq2Seq(enc_vocab=vocab_sz, dec_vocab=vocab_sz)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(30):
    model.train()
    # Scheduled sampling: decrease teacher forcing over time
    tf_ratio = max(0.2, 1.0 - epoch * 0.03)

    for src_b, tgt_b in loader:
        logits = model(src_b, tgt_b, tf_ratio=tf_ratio)
        loss = nn.CrossEntropyLoss()(
            logits.reshape(-1, vocab_sz),
            tgt_b[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        optimizer.step()

    if epoch % 5 == 0:
        model.eval()
        with torch.no_grad():
            test_logits = model(X_te, y_te, tf_ratio=0.0)  # no TF at test
            preds = test_logits.argmax(-1)
            token_acc = (preds == y_te[:, 1:]).float().mean()
            seq_acc = (preds == y_te[:, 1:]).all(dim=1).float().mean()
        print(f"Epoch {epoch:>2d} (TF={tf_ratio:.2f}): "
              f"token_acc={token_acc:.1%}, seq_acc={seq_acc:.1%}")

The tf_ratio parameter controls the teacher forcing rate. We start at 1.0 (always teacher force) and decrease to 0.2 over training -- this is the scheduled sampling strategy. At test time we set it to 0.0 (pure free running, no teacher forcing). The sequence reversal task is a good diagnostic: it requires the model to memorize the entire input before generating output in reverse order. If the model can reverse length-10 sequences, the encoder-decoder information flow is working.

Notice we report both token accuracy (what fraction of individual output tokens are correct) and sequence accuracy (what fraction of complete output sequences are entirely correct). Sequence accuracy is much harder -- getting 9 out of 10 tokens right scores 0% on sequence accuracy. This gap matters in production: a translation that gets one word wrong might still be understandable, but an API call with one wrong character is completely broken.
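The gap between the two metrics is easy to reproduce on a hand-made batch. The tensors below are invented for illustration: three sequences of four tokens, two of which contain a single wrong token.

```python
import torch

targets = torch.tensor([[5, 6, 7, 8],
                        [1, 2, 3, 4],
                        [9, 9, 9, 9]])
preds = torch.tensor([[5, 6, 7, 8],    # fully correct
                      [1, 2, 3, 0],    # one wrong token
                      [9, 9, 0, 9]])   # one wrong token

matches = preds == targets
token_acc = matches.float().mean().item()            # 10 of 12 tokens correct
seq_acc = matches.all(dim=1).float().mean().item()   # 1 of 3 sequences correct
print(f"token accuracy:    {token_acc:.1%}")
print(f"sequence accuracy: {seq_acc:.1%}")
```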

Applications that changed the field

The seq2seq framework turned out to be remarkably general. Once you frame a problem as "input sequence -> output sequence," the same architecture applies:

Machine translation was the original and most prominent application. Google's Neural Machine Translation system (GNMT, 2016) used an 8-layer encoder, 8-layer decoder with attention and residual connections. It replaced the phrase-based statistical system that Google Translate had used for a decade, and the quality jump was so dramatic that users noticed immediately -- prompting the famous New York Times article about Google Translate's "overnight improvement."

Text summarization treats a long document as the input sequence and a short summary as the output. The model learns to compress -- retaining key facts while discarding details. Abstractive summarization (generating new words, not just extracting sentences) became possible with seq2seq.

Conversational models treat the conversation history as the input and the response as the output. Google's "A Neural Conversational Model" (2015) showed that a seq2seq model trained on movie subtitles could produce surprisingly coherent (if sometimes bizarre) dialogue without any hand-coded rules. This was a direct ancestor of modern chatbots, though the transformer architecture later replaced LSTMs as the backbone.

Code generation treats a natural language description as input and code as output. Speech recognition uses audio features as the input sequence and text as the output. Image captioning uses a CNN as the encoder (replacing the RNN encoder with a visual feature extractor, like the pretrained ResNet features we built in episode #47) and an RNN decoder to generate captions word by word.

The power of the framework lies in its generality: any problem that can be cast as "understand this input, then produce that output" fits the encoder-decoder paradigm. The specific choice of encoder and decoder can vary -- you can mix CNNs, RNNs, and (later) transformers freely.
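As a rough sketch of that mix-and-match idea, the captioning setup can be imitated with a tiny CNN that produces the decoder's initial state. Everything here is hypothetical and untrained: `TinyImageEncoder` and `CaptionDecoder` are made-up names, the CNN is a toy stand-in for a real feature extractor, and initializing the cell state with zeros is just one possible choice.

```python
import torch
import torch.nn as nn

class TinyImageEncoder(nn.Module):
    """Hypothetical CNN encoder: image -> initial (h, c) for an LSTM decoder."""
    def __init__(self, hid_d):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),    # (batch, 16, 1, 1)
        )
        self.to_hidden = nn.Linear(16, hid_d)

    def forward(self, images):
        feats = self.features(images).flatten(1)              # (batch, 16)
        h = torch.tanh(self.to_hidden(feats)).unsqueeze(0)    # (1, batch, hid_d)
        c = torch.zeros_like(h)                               # assumption: empty cell
        return h, c

class CaptionDecoder(nn.Module):
    """Same shape of decoder as in this episode: one token in, logits out."""
    def __init__(self, vocab_sz, emb_d, hid_d):
        super().__init__()
        self.embed = nn.Embedding(vocab_sz, emb_d)
        self.lstm = nn.LSTM(emb_d, hid_d, batch_first=True)
        self.fc = nn.Linear(hid_d, vocab_sz)

    def forward(self, token, h, c):
        out, (h, c) = self.lstm(self.embed(token), (h, c))
        return self.fc(out), h, c

torch.manual_seed(0)
enc = TinyImageEncoder(hid_d=64)
dec = CaptionDecoder(vocab_sz=200, emb_d=32, hid_d=64)

images = torch.randn(4, 3, 32, 32)            # batch of 4 fake RGB images
h, c = enc(images)
token = torch.ones(4, 1, dtype=torch.long)    # hypothetical start-token id 1
logits, h, c = dec(token, h, c)
print(f"context h: {tuple(h.shape)}, first-step logits: {tuple(logits.shape)}")
```

The decoder does not care where its initial state came from, which is exactly what makes the encoder swappable.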

Multi-layer encoders and decoders

In practice, production seq2seq models stack multiple LSTM layers (as we covered in episode #49). Each layer captures increasingly abstract representations. With a stacked encoder, the context is the final hidden and cell state of every layer, and each one initializes the matching decoder layer -- which is why the decoder mirrors the encoder's depth:

import torch
import torch.nn as nn

class DeepEncoder(nn.Module):
    def __init__(self, vocab_sz, emb_d, hid_d, n_layers=2, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_sz, emb_d)
        self.lstm = nn.LSTM(emb_d, hid_d, n_layers,
                            batch_first=True, dropout=dropout)
        self.drop = nn.Dropout(dropout)

    def forward(self, src):
        emb = self.drop(self.embed(src))
        outputs, (h, c) = self.lstm(emb)
        return outputs, h, c

class DeepDecoder(nn.Module):
    def __init__(self, vocab_sz, emb_d, hid_d, n_layers=2, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_sz, emb_d)
        self.lstm = nn.LSTM(emb_d, hid_d, n_layers,
                            batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hid_d, vocab_sz)
        self.drop = nn.Dropout(dropout)

    def forward(self, token, h, c):
        emb = self.drop(self.embed(token))
        out, (h, c) = self.lstm(emb, (h, c))
        return self.fc(self.drop(out)), h, c

# 2-layer encoder-decoder
enc = DeepEncoder(vocab_sz=5000, emb_d=128, hid_d=256, n_layers=2)
dec = DeepDecoder(vocab_sz=4000, emb_d=128, hid_d=256, n_layers=2)

src = torch.randint(0, 5000, (8, 30))  # batch=8, src_len=30
enc_outs, h, c = enc(src)

print(f"Encoder:")
print(f"  All outputs: {enc_outs.shape}")  # (8, 30, 256) -- all timesteps
print(f"  Hidden: {h.shape}")               # (2, 8, 256) -- 2 layers
print(f"  Cell: {c.shape}")                 # (2, 8, 256) -- 2 layers

# Decoder step
token = torch.randint(0, 4000, (8, 1))
logits, h, c = dec(token, h, c)
print(f"\nDecoder step:")
print(f"  Logits: {logits.shape}")  # (8, 1, 4000)
print(f"  Params: enc={sum(p.numel() for p in enc.parameters()):,}, "
      f"dec={sum(p.numel() for p in dec.parameters()):,}")

The DeepEncoder returns ALL encoder outputs (not just the final state) -- we'll need those for attention in the next episode. For now, only the final hidden and cell states matter. Notice that the encoder outputs have shape (batch, seq_len, hidden_dim) -- one hidden state per input timestep. The attention mechanism will learn to selectively focus on these intermediate states rather than relying solely on the compressed final state.

The dropout pattern is the same as our sentiment LSTM from episode #49: dropout on embeddings, dropout between LSTM layers (via the dropout argument), and dropout before the output projection. This combination is the standard regularization recipe for recurrent sequence models.

The attention preview

Having said that, the bottleneck problem is real, and it limits seq2seq to relatively short sequences. But what if, instead of compressing the entire input into a single context vector, the decoder could look back at the encoder's hidden states at every timestep? At each decoding step, it could decide which parts of the input are most relevant to the current output token and focus on those.

That's the attention mechanism, and it changed everything. Instead of one context vector, the decoder gets a different context vector at each step -- a weighted combination of all encoder hidden states, where the weights are computed from what the decoder is currently trying to produce. Translating the verb? Pay attention to the input verb. Translating the subject? Look back at the input subject.

import torch
import torch.nn.functional as F

# Conceptual sketch of attention
batch_size = 4
src_len = 20
hidden_dim = 128

# These come from the encoder (all timestep outputs)
encoder_outputs = torch.randn(batch_size, src_len, hidden_dim)

# This is the decoder's current hidden state
decoder_hidden = torch.randn(batch_size, hidden_dim)

# Attention scores: how relevant is each encoder position?
scores = torch.bmm(
    encoder_outputs,
    decoder_hidden.unsqueeze(-1)
).squeeze(-1)  # (batch, src_len)

# Softmax gives us attention weights (sum to 1)
attention_weights = F.softmax(scores, dim=1)

# Weighted sum of encoder outputs = context vector for THIS step
context = torch.bmm(
    attention_weights.unsqueeze(1),
    encoder_outputs
).squeeze(1)  # (batch, hidden_dim)

print(f"Attention weights shape: {attention_weights.shape}")
print(f"Context vector shape: {context.shape}")
print(f"Attention weights sum: {attention_weights[0].sum():.4f}")
print(f"\nWeight distribution (first sequence):")
print(f"  Max attention at position: {attention_weights[0].argmax().item()}")
print(f"  Top-3 positions: {attention_weights[0].topk(3).indices.tolist()}")

Attention didn't just fix the bottleneck -- it made seq2seq models dramatically better at every sequence length. And it opened the door to an even bigger idea: what if you removed the RNN entirely and built the whole model out of attention? That idea became the transformer, and it's where this series is headed. We'll build attention from scratch in the next episode, and transformers in the episodes after that.

The bottom line

Sequence-to-sequence chains two RNNs: an encoder that compresses the input into a context vector, and a decoder that generates the output from that vector;
The context vector bottleneck limits performance on long sequences -- a fixed-size vector can't faithfully represent arbitrarily long inputs;
Teacher forcing feeds ground-truth tokens during training for stable learning, but creates exposure bias (the model never sees its own errors during training);
Scheduled sampling gradually replaces teacher forcing with the model's own predictions during training, partially mitigating exposure bias;
Beam search explores multiple candidate outputs in parallel, usually outperforming greedy decoding at modest computational cost;
The encoder-decoder framework is general: translation, summarization, conversation, code generation, and captioning all fit the same pattern;
Multi-layer encoders and decoders capture increasingly abstract representations, matching the stacked architecture from episode #49;
Attention solves the bottleneck by letting the decoder look back at all encoder states -- we'll build it from scratch next episode.

We're at a turning point in this series. Everything from episodes #37 through #49 -- perceptrons, forward passes, backpropagation, training challenges, optimization, PyTorch, CNNs, RNNs, LSTMs -- has been building toward the architecture that changed everything. Seq2seq showed that you can chain neural networks together in creative ways to solve problems that no single network can handle. Attention showed that the decoder doesn't have to squeeze everything through a single fixed-size vector. And the transformer (coming very soon) showed that you don't need recurrence at all: attention alone, scaled up, is all you need ;-)

Exercises

Exercise 1: Build a sequence sorting seq2seq model. Generate training data where the input is a sequence of 8 random integers (range 2-19) and the target is the same integers sorted in ascending order. Train the Seq2Seq model from this episode for 40 epochs with scheduled sampling. Report both token accuracy and full-sequence accuracy on a held-out test set. This task is harder than reversal because the model must compare values across the entire input, not just mirror positions.

Exercise 2: Implement a greedy decoder vs beam search comparison. Train a seq2seq model on the reversal task from this episode, then generate outputs using (a) greedy decoding (beam width 1) and (b) beam search with beam widths 3, 5, and 10. For each setting, compute token accuracy on 200 test sequences. At what beam width do you stop seeing improvement? How does beam search affect the average output length compared to greedy?

Exercise 3: Build a bottleneck analysis experiment. Train seq2seq models with hidden dimensions of 32, 64, 128, and 256 on a reversal task with sequences of length 15. For each hidden dimension, report token accuracy. Then repeat with sequence length 30. At what hidden dimension does the model start handling the longer sequences? This directly measures how much information the context vector can carry -- larger hidden dim = larger bottleneck, but does doubling the dimension double the effective memory?

Much respect for staying with me this far. Thanks!

Hive account@scipio
