Chapter 12: Understanding LSTM Networks
“LSTMs are explicitly designed to avoid the long-term dependency problem.”
Based on: “Understanding LSTM Networks” (Christopher Olah, 2015)
📄 Original Blog Post: colah.github.io
12.1 The Most Influential Blog Post in Deep Learning
If Karpathy’s RNN post (Chapter 11) showed what RNNs can do, Colah’s LSTM post showed how they work. Published in August 2015, it became perhaps the most-read technical blog post in deep learning history.
Why? Because it made LSTMs intuitive.
graph TB
subgraph "The Problem"
V["Vanilla RNNs forget<br/>after ~10-20 steps"]
end
subgraph "The Solution"
L["LSTMs remember for<br/>hundreds of steps"]
end
V -->|"How?"| L
G["Gated memory cells:<br/>Forget, Input, Output gates"]
L --> G
style G fill:#ffe66d,color:#000
12.2 The Problem of Long-Term Dependencies
When Context Is Far Away
Consider predicting the last word:
“I grew up in France… [many sentences later]… I speak fluent __.”
The answer (“French”) depends on information from far back!
graph LR
subgraph "Short Gap (Easy)"
W1["'The clouds are in the'"]
P1["→ 'sky'"]
end
subgraph "Long Gap (Hard)"
W2["'I grew up in France...<br/>[100 words]...<br/>I speak fluent'"]
P2["→ 'French'"]
end
W1 --> P1
W2 -->|"Need to remember 'France'"| P2
style P2 fill:#ff6b6b,color:#fff
Why Vanilla RNNs Fail
Recall from Chapter 11: gradients flow through repeated matrix multiplications.
\[\frac{\partial h_T}{\partial h_0} = \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}} \approx \prod_{t=1}^{T} W_{hh}\]
graph LR
subgraph "Gradient Flow"
G0["∂L/∂h₀"]
G1["×W"]
G2["×W"]
G3["×W"]
GT["×W^T"]
end
G0 --> G1 --> G2 --> G3 --> GT
P["If ||W|| < 1: gradient → 0<br/>If ||W|| > 1: gradient → ∞"]
GT --> P
style P fill:#ff6b6b,color:#fff
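To make this concrete, here is a toy Python illustration (numbers of my own choosing, not from the original post) that treats the recurrent weight as a single scalar multiplied 100 times:

steps = 100
for w in (0.9, 1.0, 1.1):
    print(f"after {steps} steps, a per-step factor of {w} gives {w ** steps:.3e}")
# 0.9 -> ~2.7e-05 (vanishes), 1.0 -> 1.0 (only the knife-edge case survives), 1.1 -> ~1.4e+04 (explodes)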
12.3 LSTM: The Core Idea
The Cell State: A Conveyor Belt
LSTMs add a cell state that runs through time with minimal modification:
graph LR
subgraph "Cell State Flow"
C0["C₀"] -->|"+"| C1["C₁"]
C1 -->|"+"| C2["C₂"]
C2 -->|"+"| C3["C₃"]
end
K["Cell state = 'conveyor belt'<br/>Information flows with<br/>only linear interactions"]
C1 --> K
style K fill:#4ecdc4,color:#fff
The key insight: additive updates (not multiplicative) preserve gradients!
The Four Components
graph TB
subgraph "LSTM Cell"
FG["Forget Gate<br/>'What to remove?'"]
IG["Input Gate<br/>'What to add?'"]
CS["Cell State Update<br/>'New candidate values'"]
OG["Output Gate<br/>'What to output?'"]
end
FG --> CELL["Cell State C_t"]
IG --> CELL
CS --> CELL
CELL --> OG --> H["Hidden State h_t"]
style CELL fill:#ffe66d,color:#000
12.4 The Forget Gate
“What Should We Forget?”
The forget gate decides what information to remove from the cell state.
graph LR
subgraph "Forget Gate"
H["h_{t-1}"]
X["x_t"]
CONCAT["[h,x]"]
SIG["σ(W_f·[h,x] + b_f)"]
F["f_t ∈ (0,1)"]
end
H --> CONCAT
X --> CONCAT
CONCAT --> SIG --> F
F -->|"×"| CPREV["C_{t-1}"]
E["f_t = 0: forget completely<br/>f_t = 1: remember completely"]
F --> E
The Equation
\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]
Example
When we see a new subject in a sentence:
- Old subject: “The cat, which was sitting on the mat, …”
- New subject: “The dogs ran…”
- Forget gate: Clear the “singular” information, prepare for “plural”
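As a concrete sketch, here is the forget-gate computation in NumPy, with toy dimensions and random weights of my own choosing (purely illustrative, not the post's code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, inputs = 4, 3                             # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
W_f = rng.normal(size=(hidden, hidden + inputs))  # weights acting on [h_{t-1}, x_t]
b_f = np.zeros(hidden)

h_prev = rng.normal(size=hidden)                  # h_{t-1}
x_t = rng.normal(size=inputs)                     # x_t

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)   # each entry is in (0, 1): near 1 keeps that slot of C_{t-1}, near 0 erases it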
12.5 The Input Gate
“What New Information Should We Store?”
Two parts:
- Input gate: Which values to update
- Candidate values: What the new values are
graph TB
subgraph "Input Gate Layer"
H1["h_{t-1}"]
X1["x_t"]
SIG1["σ(W_i·[h,x] + b_i)"]
I["i_t (what to update)"]
end
subgraph "Candidate Values"
H2["h_{t-1}"]
X2["x_t"]
TANH["tanh(W_C·[h,x] + b_C)"]
CT["C̃_t (candidates)"]
end
H1 --> SIG1 --> I
X1 --> SIG1
H2 --> TANH --> CT
X2 --> TANH
I -->|"×"| MUL["i_t × C̃_t"]
CT --> MUL
style MUL fill:#ffe66d,color:#000
The Equations
\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\]
\[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]
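A similarly minimal NumPy sketch, with hand-picked toy numbers, shows how the input gate and the candidate values combine into the term that will be added to the cell state:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

concat = np.array([0.1, -0.4, 0.8])        # [h_{t-1}, x_t], already concatenated
W_i = np.array([[0.5, -0.2, 0.1],
                [0.3,  0.7, -0.6]])        # input-gate weights (2-dimensional cell)
W_C = np.array([[-0.4, 0.2, 0.9],
                [ 0.1, 0.5, 0.3]])         # candidate weights

i_t = sigmoid(W_i @ concat)                # which slots to write, each in (0, 1)
C_tilde = np.tanh(W_C @ concat)            # what to write, each in (-1, 1)
print(i_t * C_tilde)                       # the term that will be added to the cell state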
12.6 Updating the Cell State
Forget Old + Add New
graph LR
subgraph "Cell State Update"
COLD["C_{t-1}"]
F["f_t"]
FORGET["f_t × C_{t-1}<br/>(forget)"]
I["i_t"]
CT["C̃_t"]
ADD["i_t × C̃_t<br/>(add new)"]
CNEW["C_t"]
end
COLD --> FORGET
F --> FORGET
I --> ADD
CT --> ADD
FORGET -->|"+"| CNEW
ADD --> CNEW
style CNEW fill:#4ecdc4,color:#fff
The Equation
\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]
This is the heart of the LSTM: an additive update to the cell state!
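As a tiny worked example with made-up numbers (not from the original post): if the forget gate keeps the first slot but erases the second, and the input gate writes only to the second slot, the update acts like a per-slot overwrite:
\[f_t = \begin{bmatrix}1\\0\end{bmatrix},\; C_{t-1} = \begin{bmatrix}2\\3\end{bmatrix},\; i_t = \begin{bmatrix}0\\1\end{bmatrix},\; \tilde{C}_t = \begin{bmatrix}0.5\\-0.5\end{bmatrix} \;\Rightarrow\; C_t = \begin{bmatrix}1 \cdot 2 + 0 \cdot 0.5\\ 0 \cdot 3 + 1 \cdot (-0.5)\end{bmatrix} = \begin{bmatrix}2\\-0.5\end{bmatrix}\]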
12.7 The Output Gate
“What Should We Output?”
The output gate controls what part of the cell state becomes the hidden state.
graph TB
subgraph "Output Gate"
H["h_{t-1}"]
X["x_t"]
SIG["σ(W_o·[h,x] + b_o)"]
O["o_t"]
end
subgraph "Hidden State"
C["C_t"]
TANH["tanh(C_t)"]
MUL["o_t × tanh(C_t)"]
HT["h_t"]
end
H --> SIG --> O
X --> SIG
C --> TANH --> MUL
O --> MUL --> HT
style HT fill:#ffe66d,color:#000
The Equations
\[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\]
\[h_t = o_t \odot \tanh(C_t)\]
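A minimal NumPy sketch with hand-picked numbers (my own, purely illustrative) shows how the output gate selectively exposes the cell state:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

C_t = np.array([2.0, -1.5, 0.3])              # cell state after the update
o_t = sigmoid(np.array([3.0, -3.0, 0.0]))     # gate pre-activations picked by hand: ~0.95, ~0.05, 0.5

h_t = o_t * np.tanh(C_t)
print(h_t)   # the first slot is exposed almost fully, the second is mostly hidden from the output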
12.8 The Complete LSTM Cell
All Together
graph TB
subgraph "Complete LSTM"
INPUT["x_t, h_{t-1}"]
FG["Forget Gate<br/>f_t = σ(W_f·[h,x]+b)"]
IG["Input Gate<br/>i_t = σ(W_i·[h,x]+b)"]
CAND["Candidate<br/>C̃_t = tanh(W_C·[h,x]+b)"]
OG["Output Gate<br/>o_t = σ(W_o·[h,x]+b)"]
CUPDATE["C_t = f_t⊙C_{t-1} + i_t⊙C̃_t"]
HUPDATE["h_t = o_t⊙tanh(C_t)"]
OUTPUT["h_t, C_t"]
end
INPUT --> FG
INPUT --> IG
INPUT --> CAND
INPUT --> OG
FG --> CUPDATE
IG --> CUPDATE
CAND --> CUPDATE
CUPDATE --> HUPDATE
OG --> HUPDATE
HUPDATE --> OUTPUT
Summary of Gates
| Gate | Symbol | Function | Activation |
|---|---|---|---|
| Forget | f_t | What to erase | Sigmoid (0-1) |
| Input | i_t | What to write | Sigmoid (0-1) |
| Candidate | C̃_t | New values | Tanh (-1 to 1) |
| Output | o_t | What to output | Sigmoid (0-1) |
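Putting the four equations together, here is a minimal single-step LSTM cell in NumPy (a sketch with toy sizes and variable names of my own, not a reference implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step, following the four equations above."""
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate values
    o_t = sigmoid(W_o @ z + b_o)             # output gate

    C_t = f_t * C_prev + i_t * C_tilde       # additive cell-state update
    h_t = o_t * np.tanh(C_t)                 # hidden state
    return h_t, C_t

hidden, inputs = 4, 3                        # toy sizes, random weights (illustration only)
rng = np.random.default_rng(0)
params = []
for _ in range(4):
    params += [rng.normal(size=(hidden, hidden + inputs)), np.zeros(hidden)]

h, C = np.zeros(hidden), np.zeros(hidden)
for t in range(5):                           # unroll over a short random sequence
    h, C = lstm_step(rng.normal(size=inputs), h, C, params)
print(h, C)

This single cell already covers every row of the summary table above; PyTorch's nn.LSTM (Section 12.13) adds batching, layer stacking, and optimized kernels on top.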
12.9 Why LSTMs Solve Vanishing Gradients
The Gradient Highway
The cell state provides a path for gradients to flow unchanged:
graph LR
subgraph "Gradient Flow Through Cell State"
CT["∂L/∂C_T"]
C2["∂L/∂C_{T-1}"]
C1["∂L/∂C_{T-2}"]
C0["∂L/∂C_0"]
end
CT -->|"× f_{T}"| C2
C2 -->|"× f_{T-1}"| C1
C1 -->|"×..."| C0
K["If f ≈ 1, gradients flow!<br/>Network learns when to forget."]
C1 --> K
style K fill:#ffe66d,color:#000
The Key Difference
| Vanilla RNN | LSTM |
|---|---|
| Multiplicative: h_{t} = tanh(Wh_{t-1}) | Additive: C_t = f⊙C_{t-1} + i⊙C̃ |
| Gradient: ∏W (vanishes/explodes) | Gradient: ∏f (controlled by gates) |
| No control over memory | Explicit forget/remember |
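A back-of-the-envelope comparison with made-up numbers: over 100 steps, a vanilla RNN whose effective per-step factor is 0.5 crushes the gradient, while an LSTM whose learned forget gates sit near 1 lets most of it through:

steps = 100
vanilla = 0.5 ** steps     # ~7.9e-31: the gradient is effectively gone
gated = 0.99 ** steps      # ~0.37: most of the signal survives
print(vanilla, gated)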
12.10 Variants of LSTM
Peephole Connections
Let gates look at the cell state directly:
graph LR
subgraph "Peephole LSTM"
C["C_{t-1}"]
F["Forget gate"]
I["Input gate"]
O["Output gate"]
end
C -->|"peephole"| F
C -->|"peephole"| I
C -->|"peephole"| O
E["Gates can 'peek' at<br/>cell state for decisions"]
C --> E
Coupled Forget and Input Gates
Instead of separate forget and input decisions:
\[C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t\]
We only add new information in proportion to what we forget (simpler: one decision instead of two).
GRU (Gated Recurrent Unit)
A popular simplification:
graph TB
subgraph "GRU (Simpler Alternative)"
Z["Update gate z_t"]
R["Reset gate r_t"]
H["h̃_t = tanh(W·[r⊙h,x])"]
OUT["h_t = (1-z)⊙h + z⊙h̃"]
end
Z --> OUT
R --> H
H --> OUT
K["Only 2 gates (vs 3 in LSTM)<br/>No separate cell state<br/>Often similar performance"]
OUT --> K
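For comparison, here is a minimal single-step GRU in the same toy NumPy style (a sketch under my own naming and sizes, not a reference implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step: two gates, no separate cell state."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                                  # update gate
    r_t = sigmoid(W_r @ concat)                                  # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    return (1 - z_t) * h_prev + z_t * h_tilde                    # interpolate old and new

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.normal(size=(hidden, hidden + inputs)) for _ in range(3))
h_t = gru_step(rng.normal(size=inputs), np.zeros(hidden), W_z, W_r, W_h)
print(h_t)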
12.11 LSTM in Practice
Stacked LSTMs
Multiple LSTM layers for more capacity:
graph TB
subgraph "2-Layer LSTM"
X["x_t"]
L1["LSTM Layer 1"]
L2["LSTM Layer 2"]
Y["y_t"]
end
X --> L1 --> L2 --> Y
L1 -->|"h_1 becomes input"| L2
Bidirectional LSTMs
Process sequence in both directions:
graph LR
subgraph "Bidirectional"
X1["x₁"]
X2["x₂"]
X3["x₃"]
F1["→"]
F2["→"]
F3["→"]
B1["←"]
B2["←"]
B3["←"]
end
X1 --> F1 --> F2 --> F3
X3 --> B3 --> B2 --> B1
O["Outputs combine<br/>forward and backward"]
F2 --> O
B2 --> O
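A minimal PyTorch sketch (toy sizes of my own) shows how the forward and backward states sit side by side in the output tensor:

import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True)
x = torch.randn(30, 4, 8)            # (seq_len, batch, input_size)
output, _ = bilstm(x)                # output shape: (30, 4, 2 * 16)

forward_states = output[..., :16]    # left-to-right hidden states
backward_states = output[..., 16:]   # right-to-left hidden states
# Downstream layers usually consume the concatenation (output itself) or a sum of the two directions.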
12.12 When to Use LSTMs
Good For
| Task | Why LSTM Works |
|---|---|
| Language Modeling | Long-range syntax dependencies |
| Speech Recognition | Acoustic patterns over time |
| Machine Translation | Encoder needs full sentence |
| Time Series | Long-term trends and patterns |
| Music Generation | Musical structure and motifs |
When to Consider Alternatives
| Situation | Alternative |
|---|---|
| Very long sequences (1000+) | Transformers (Chapter 16) |
| Need parallelization | Transformers |
| Simple patterns | Vanilla RNN or 1D CNN |
| Real-time requirements | Lightweight architectures |
12.13 Implementation
PyTorch LSTM
import torch
import torch.nn as nn

# Two-layer bidirectional LSTM
lstm = nn.LSTM(
    input_size=256,       # dimension of each input vector
    hidden_size=512,      # dimension of the hidden state
    num_layers=2,         # stacked layers
    bidirectional=True,   # process the sequence in both directions
    dropout=0.5,          # applied between layers (not within a layer)
)

# Usage: by default the input is (seq_len, batch, input_size); pass batch_first=True to change this
input_sequence = torch.randn(30, 8, 256)
h_0 = torch.zeros(2 * 2, 8, 512)   # (num_layers * num_directions, batch, hidden_size)
c_0 = torch.zeros(2 * 2, 8, 512)
output, (h_n, c_n) = lstm(input_sequence, (h_0, c_0))

# output: hidden states for every time step, shape (30, 8, 2 * 512)
# h_n: final hidden state for each layer and direction
# c_n: final cell state for each layer and direction
Key Hyperparameters
| Parameter | Typical Range | Notes |
|---|---|---|
| hidden_size | 128-1024 | Larger = more capacity |
| num_layers | 1-4 | More layers usually call for more dropout |
| dropout | 0.2-0.5 | Applied between layers only |
| learning_rate | 1e-4 to 1e-2 | Often lower than for CNNs |
12.14 Connection to Other Chapters
graph TB
CH12["Chapter 12<br/>Understanding LSTMs"]
CH12 --> CH11["Chapter 11: RNN Effectiveness<br/><i>The problem LSTMs solve</i>"]
CH12 --> CH13["Chapter 13: RNN Regularization<br/><i>Dropout for LSTMs</i>"]
CH12 --> CH15["Chapter 15: Attention<br/><i>Attending over LSTM states</i>"]
CH12 --> CH16["Chapter 16: Transformers<br/><i>Attention replaces recurrence</i>"]
style CH12 fill:#ff6b6b,color:#fff
12.15 Key Equations Summary
Forget Gate
\(f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\)
Input Gate
\(i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\)
Candidate Values
\(\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\)
Cell State Update
\(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\)
Output Gate
\(o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\)
Hidden State
\(h_t = o_t \odot \tanh(C_t)\)
12.16 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Cell state = memory highway<br/>with additive updates"]
T2["Forget gate: what to erase"]
T3["Input gate: what to write"]
T4["Output gate: what to show"]
T5["Additive updates preserve<br/>gradients over long sequences"]
end
T1 --> C["LSTMs solve the vanishing<br/>gradient problem through<br/>gated memory cells that<br/>learn what to remember<br/>and what to forget."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
LSTMs solve the vanishing gradient problem through gated memory cells—with forget, input, and output gates—that maintain a cell state “highway” enabling information and gradients to flow across hundreds of time steps.
Exercises
- Conceptual: Explain in your own words why additive updates to the cell state help with gradient flow, compared to multiplicative updates in vanilla RNNs.
- Trace Through: Given f_t = 0.9, i_t = 0.3, C_{t-1} = [1, 2, 3], C̃_t = [0.5, 0.5, 0.5], compute C_t.
- Implementation: Implement an LSTM cell from scratch in NumPy and verify it matches PyTorch’s output.
- Comparison: Train both a vanilla RNN and an LSTM on a task requiring memory of 50+ time steps (e.g., the adding problem). Compare learning curves.
References & Further Reading
| Resource | Link |
|---|---|
| Original Blog Post (Colah) | colah.github.io |
| Original LSTM Paper (Hochreiter & Schmidhuber) | Paper |
| GRU Paper (Cho et al.) | arXiv:1406.1078 |
| LSTM: A Search Space Odyssey | arXiv:1503.04069 |
| Deep Learning Book Ch. 10 | deeplearningbook.org |
| PyTorch LSTM Documentation | PyTorch |
Next Chapter: Chapter 13: Recurrent Neural Network Regularization — We explore how to apply dropout to RNNs correctly, preventing overfitting while maintaining the benefits of recurrence.