Chapter 11: The Unreasonable Effectiveness of Recurrent Neural Networks
“There’s something magical about Recurrent Neural Networks.”
Based on: “The Unreasonable Effectiveness of Recurrent Neural Networks” (Andrej Karpathy, 2015)
📄 Original Blog Post: karpathy.github.io
11.1 A Blog Post That Changed Everything
In May 2015, Andrej Karpathy (who would later become Director of AI at Tesla and work alongside Ilya Sutskever at OpenAI) published a blog post that captured the imagination of the machine learning community.
With simple character-level RNNs, he generated:
- Shakespeare plays
- Wikipedia articles
- LaTeX mathematics
- Linux source code
- Baby names
The results were surprisingly coherent—and deeply fascinating.
graph TB
subgraph "What RNNs Generated"
S["Shakespeare<br/>'PANDARUS: Alas, I think he shall be come...'"]
W["Wikipedia<br/>'Naturalism and decision for the...'"]
C["C Code<br/>'static int<br/> get_device_state...'"]
L["LaTeX<br/>'\\begin{theorem}...'"]
end
R["Simple RNN<br/>~single layer<br/>~hundreds of units"]
R --> S
R --> W
R --> C
R --> L
style R fill:#ffe66d,color:#000
11.2 What Are Recurrent Neural Networks?
The Core Idea
Unlike feedforward networks, RNNs have loops:
graph LR
subgraph "Feedforward Network"
I1["Input"] --> H1["Hidden"] --> O1["Output"]
end
subgraph "Recurrent Network"
I2["Input"] --> H2["Hidden"]
H2 --> O2["Output"]
H2 -->|"loop"| H2
end
Unrolling Through Time
The loop means the network processes sequences step by step:
graph LR
subgraph "Unrolled RNN"
X0["x₀"] --> H0["h₀"]
H0 --> Y0["y₀"]
X1["x₁"] --> H1["h₁"]
H0 --> H1
H1 --> Y1["y₁"]
X2["x₂"] --> H2["h₂"]
H1 --> H2
H2 --> Y2["y₂"]
X3["x₃"] --> H3["h₃"]
H2 --> H3
H3 --> Y3["y₃"]
end
K["Same weights used at every step!"]
H1 --> K
style K fill:#ffe66d,color:#000
The Equations
\[h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)\]
\[y_t = W_{hy} h_t + b_y\]
The hidden state $h_t$ is the network’s memory—it carries information from past inputs.
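To see concretely why $h_t$ acts as memory, substitute the recurrence into itself; after two steps the hidden state is already a nested function of the most recent inputs (and, through $h_{t-2}$, of everything before them):
\[h_t = \tanh\big(W_{xh} x_t + W_{hh} \tanh(W_{xh} x_{t-1} + W_{hh} h_{t-2} + b_h) + b_h\big)\]
Each earlier input reaches $h_t$ only through repeated multiplication by $W_{hh}$, which is exactly the mechanism behind the vanishing-gradient problem discussed in Section 11.11.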
11.3 Sequence Modeling Tasks
The Power of Flexible I/O
graph TB
subgraph "RNN Configurations"
O2O["One-to-One<br/>Standard feedforward"]
O2M["One-to-Many<br/>Image captioning"]
M2O["Many-to-One<br/>Sentiment analysis"]
M2M["Many-to-Many<br/>Translation"]
M2Ms["Many-to-Many (synced)<br/>Video labeling"]
end
I["Input"] --> O2O
I --> O2M
I --> M2O
I --> M2M
I --> M2Ms
Specific Examples
| Configuration | Task | Input | Output |
|---|---|---|---|
| One-to-Many | Image Captioning | Image | “A cat sitting on a couch” |
| Many-to-One | Sentiment | Review text | Positive/Negative |
| Many-to-Many | Translation | English | French |
| Synced M-to-M | Video tagging | Video frames | Labels per frame |
11.4 Character-Level Language Models
The Training Setup
Karpathy’s key insight: train at the character level.
graph LR
subgraph "Training on 'hello'"
H["'h'"] -->|"predict"| E["'e'"]
E -->|"predict"| L1["'l'"]
L1 -->|"predict"| L2["'l'"]
L2 -->|"predict"| O["'o'"]
end
T["Input: 'hell'<br/>Target: 'ello'<br/>Shifted by one character"]
style T fill:#ffe66d,color:#000
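A minimal sketch of how these shifted training pairs might be prepared in NumPy (the names `one_hot` and `char_to_ix` are illustrative, not from the original post):

import numpy as np

text = "hello"
chars = sorted(set(text))                       # vocabulary: ['e', 'h', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(chars)}

def one_hot(index, size):
    # Encode a character index as a one-hot column vector
    v = np.zeros((size, 1))
    v[index] = 1.0
    return v

# Inputs are 'hell', targets are 'ello': the same text shifted by one character
inputs  = [one_hot(char_to_ix[ch], len(chars)) for ch in text[:-1]]
targets = [char_to_ix[ch] for ch in text[1:]]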
Why Character-Level?
| Aspect | Word-Level | Character-Level |
|---|---|---|
| Vocabulary | ~50,000 words | ~100 characters |
| Unknown words | Problem | No problem |
| Spelling/structure | Can’t learn | Learns naturally |
| Model size | Larger | Smaller |
11.5 Training an RNN
The Algorithm: Backpropagation Through Time (BPTT)
graph TB
subgraph "BPTT"
F["Forward pass:<br/>Compute all h_t and y_t"]
L["Compute loss at each step"]
B["Backward pass:<br/>Gradients flow back through time"]
U["Update weights"]
end
F --> L --> B --> U
P["Same weights → sum gradients<br/>from all timesteps"]
B --> P
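The "loss at each step" is simply the cross-entropy between the softmaxed output and the true next character, summed over the sequence. A minimal sketch, assuming `outputs` is a list of unnormalized score vectors of shape (vocab_size, 1) and `targets` a list of true next-character indices:

import numpy as np

def sequence_loss(outputs, targets):
    # Total cross-entropy loss over one sequence
    loss = 0.0
    for y, target in zip(outputs, targets):
        p = np.exp(y - np.max(y))        # softmax, shifted for numerical stability
        p /= np.sum(p)
        loss += -np.log(p[target, 0])    # negative log-probability of the true character
    return loss

Because the same weight matrices are reused at every step, the backward pass accumulates their gradients across all timesteps before a single update.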
Truncated BPTT
For long sequences, truncate the backward pass:
graph LR
subgraph "Truncated BPTT (length 25)"
S1["Steps 1-25"]
S2["Steps 26-50"]
S3["Steps 51-75"]
end
S1 -->|"forward hidden state"| S2
S2 -->|"forward hidden state"| S3
N["Backprop only within each chunk<br/>But hidden state flows across"]
S2 --> N
style N fill:#ffe66d,color:#000
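A sketch of the chunking loop under these assumptions; the `train_on_chunk` helper (forward, backward, and weight update within one chunk) is hypothetical and stands in for a full BPTT implementation:

def train_epoch(text, char_to_ix, h, params, chunk_len=25):
    # Truncated BPTT: backprop within each chunk, but carry the hidden state across chunks
    for start in range(0, len(text) - chunk_len, chunk_len):
        inputs  = [char_to_ix[ch] for ch in text[start : start + chunk_len]]
        targets = [char_to_ix[ch] for ch in text[start + 1 : start + chunk_len + 1]]
        # Hypothetical helper: gradients are computed only within this chunk
        loss, h = train_on_chunk(inputs, targets, h, params)
        # h flows into the next chunk, but gradients do not
    return h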
11.6 Sampling from the Model
Temperature-Controlled Generation
At test time, sample characters from the output distribution:
\[P(x_i) = \frac{\exp(y_i / T)}{\sum_j \exp(y_j / T)}\]
graph TB
subgraph "Temperature Effect"
TL["T → 0 (low)<br/>More confident<br/>More repetitive"]
TM["T = 1 (normal)<br/>Balanced<br/>Natural diversity"]
TH["T → ∞ (high)<br/>More random<br/>Less coherent"]
end
TL --> S["Sampled text"]
TM --> S
TH --> S
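A minimal sketch of temperature-controlled sampling from one output vector y (unnormalized scores over the character vocabulary); the sampled character is then fed back in as the next input:

import numpy as np

def sample_char(y, temperature=1.0):
    # Sample a character index from RNN scores y at a given temperature
    scaled = y.ravel() / temperature
    p = np.exp(scaled - np.max(scaled))     # temperature-scaled softmax, stabilized
    p /= p.sum()
    return np.random.choice(len(p), p=p)    # low T -> near-greedy, high T -> near-uniform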
Example Outputs at Different Temperatures
T=0.5 (conservative):
"The little house was a great way to the rest of the
company and the same time the same time..."
T=1.0 (balanced):
"The little boat was floating down the river when
suddenly a strange sound came from the forest..."
T=1.5 (creative):
"The jrkled boat was fringly down the xiver,
poppingly strantz sounds..."
11.7 Shakespeare Generation
Training Data
~4.4MB of Shakespeare’s complete works, concatenated into a single text file.
Generated Output
PANDARUS:
Alas, I think he shall be come approached and the day
When little sorrow sits and death makes all the same,
And thou art weeping in the gentle wind,
With nature's tears that morning yet doth move.
LUCETTA:
I would the gods had made thee a piece of virtue!
Not real Shakespeare—but captures:
- Character names
- Dialogue structure
- Iambic-ish rhythm
- Shakespearean vocabulary
graph TB
subgraph "What the RNN Learned"
S["Structure<br/>Speaker: Dialogue format"]
V["Vocabulary<br/>'Alas', 'thou', 'doth'"]
P["Patterns<br/>Line breaks, punctuation"]
M["Meaning-ish<br/>Coherent phrases"]
end
R["All from predicting<br/>the next character!"]
S --> R
V --> R
P --> R
M --> R
style R fill:#ffe66d,color:#000
11.8 Wikipedia Generation
Generated Article
Naturalism and decision for the fundamental unity of the
science of religion and history of the present century...
The game was played on October 1, 2011, and the team
finished with a 4-3 record in the season...
The RNN learned:
- Wiki article structure
- Factual-sounding (but fabricated) content
- Citations and references
- Date formats
11.9 Code Generation
LaTeX
\begin{theorem}
Let $\mathcal{M}$ be a maximal subgroup of $G$ and let
$\alpha$ be a $p$-group. Then the following are equivalent:
\end{theorem}
The RNN learned LaTeX syntax, mathematical notation, and theorem structure!
Linux Source Code
static int __init request_resource(struct resource *root,
                                   struct resource *new)
{
    struct resource *conflict;

    write_lock(&resource_lock);
    conflict = __request_resource(root, new);
    write_unlock(&resource_lock);
    return conflict ? -EBUSY : 0;
}
Valid C syntax, proper indentation, realistic function names!
11.10 Visualizing Hidden States
What Do the Hidden Units Learn?
Karpathy visualized individual hidden units:
graph TB
subgraph "Discovered Features"
Q["Quote detector<br/>Activates inside quotes"]
I["Indent tracker<br/>Tracks nesting level"]
N["Newline predictor<br/>Fires at line ends"]
C["Comment detector<br/>Knows when in comment"]
end
E["Interpretable features emerge<br/>without explicit supervision!"]
Q --> E
I --> E
N --> E
C --> E
style E fill:#ffe66d,color:#000
The Quote Detector
He said "hello there" to her.
^^^^^^^^^^^^
Unit activates!
One hidden unit learned to track whether the current position is inside a quotation—purely from predicting the next character!
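One way to reproduce this kind of inspection is to run the network over a string and record a single unit's activation after every character. A sketch (the `unit_trace` name is illustrative; with trained weights, a "quote detector" unit would light up between the quotation marks):

import numpy as np

def unit_trace(text, unit, params, char_to_ix):
    # Return the activation of one hidden unit after each character of `text`
    W_xh, W_hh, b_h = params["W_xh"], params["W_hh"], params["b_h"]
    h = np.zeros((W_hh.shape[0], 1))
    trace = []
    for ch in text:
        x = np.zeros((W_xh.shape[1], 1))
        x[char_to_ix[ch]] = 1.0
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        trace.append(float(h[unit, 0]))
    return trace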
11.11 The Vanishing Gradient Problem
Why Vanilla RNNs Struggle
graph LR
subgraph "Gradient Flow"
H0["h₀"] -->|"×W"| H1["h₁"]
H1 -->|"×W"| H2["h₂"]
H2 -->|"×W"| H3["h₃"]
H3 -->|"×W"| HN["h_n"]
end
P["Gradient = W^n<br/>If ||W|| < 1: vanishes<br/>If ||W|| > 1: explodes"]
HN --> P
style P fill:#ff6b6b,color:#fff
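A quick numerical illustration of this effect (ignoring the tanh-derivative factors, which only shrink gradients further): repeatedly multiplying a vector by a matrix scaled below norm 1 drives it toward zero, while scaling above 1 blows it up.

import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((100, 100)))   # random orthogonal (norm-preserving) matrix

for scale in (0.9, 1.1):
    W = scale * Q
    v = np.ones(100)                  # stand-in for a backpropagated gradient
    for _ in range(50):               # 50 timesteps of gradient flow
        v = W.T @ v
    print(scale, np.linalg.norm(v))   # ~0.05 for scale 0.9, ~1170 for scale 1.1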
Practical Impact
- Short-term: RNNs work well
- Long-term: Information gets lost
- ~10-20 steps: Practical limit for vanilla RNNs
This is why Chapter 12 (LSTMs) is so important!
11.12 Implementation Insights
Minimal RNN in NumPy
import numpy as np

# Forward pass for one RNN step
def rnn_step(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # Update hidden state
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)
    # Compute output (unnormalized scores over the vocabulary)
    y = W_hy @ h + b_y
    return h, y

# Full forward pass over a sequence of input vectors
def rnn_forward(inputs, h0, params):
    h = h0
    outputs = []
    for x in inputs:
        h, y = rnn_step(x, h, **params)
        outputs.append(y)
    return outputs, h
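A quick smoke test of the sketch above, with randomly initialized weights (the sizes are illustrative):

vocab_size, hidden_size = 65, 128
rng = np.random.default_rng(0)
params = {
    "W_xh": rng.standard_normal((hidden_size, vocab_size)) * 0.01,
    "W_hh": rng.standard_normal((hidden_size, hidden_size)) * 0.01,
    "W_hy": rng.standard_normal((vocab_size, hidden_size)) * 0.01,
    "b_h":  np.zeros((hidden_size, 1)),
    "b_y":  np.zeros((vocab_size, 1)),
}

# Three dummy one-hot "characters" as input; outputs holds per-step score vectors
inputs = [np.eye(vocab_size)[:, [i]] for i in (0, 1, 2)]
outputs, h_final = rnn_forward(inputs, np.zeros((hidden_size, 1)), params)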
Key Hyperparameters
| Parameter | Typical Value |
|---|---|
| Hidden size | 128-512 |
| Layers | 1-3 |
| Learning rate | 0.001-0.01 |
| Sequence length | 25-100 |
| Gradient clipping | 1-5 |
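Gradient clipping here can be as simple as clamping every gradient element into a fixed range before the parameter update, in the style of min-char-rnn (the `d`-prefixed names below are the hypothetical gradient arrays produced by BPTT):

for dparam in (dW_xh, dW_hh, dW_hy, db_h, db_y):
    np.clip(dparam, -5, 5, out=dparam)   # clamp elementwise to [-5, 5] to limit exploding updates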
11.13 Why This Blog Post Matters
Impact on the Field
timeline
title RNN Renaissance
2015 : Karpathy's blog post
: RNNs become "cool"
2015-16 : Explosion of RNN papers
: Sequence-to-sequence models
2016 : Google Neural Machine Translation
: LSTMs in production
2017 : Attention Is All You Need
: Transformers begin to replace RNNs
2020s : Transformers dominate
: But RNN ideas persist
The Demonstration Effect
The blog post showed that:
- RNNs are accessible: Simple to implement and understand
- Character-level works: Don’t need complex tokenization
- Structure emerges: Networks learn syntax, format, style
- Visualization matters: Seeing hidden states builds intuition
11.14 Connection to Other Chapters
graph TB
CH11["Chapter 11<br/>RNN Effectiveness"]
CH11 --> CH12["Chapter 12: LSTMs<br/><i>Solution to vanishing gradients</i>"]
CH11 --> CH13["Chapter 13: RNN Regularization<br/><i>Making RNNs generalize</i>"]
CH11 --> CH15["Chapter 15: Attention<br/><i>Better than just hidden state</i>"]
CH11 --> CH16["Chapter 16: Transformers<br/><i>Attention without recurrence</i>"]
style CH11 fill:#ff6b6b,color:#fff
11.15 Key Equations Summary
RNN Hidden State Update
\[h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)\]
Output Computation
\[y_t = W_{hy} h_t + b_y\]
Softmax with Temperature
\[P(x_i) = \frac{\exp(y_i / T)}{\sum_j \exp(y_j / T)}\]
Cross-Entropy Loss
\[L = -\sum_t \log P(x_{t+1} | x_1, ..., x_t)\]
11.16 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["RNNs process sequences<br/>by maintaining hidden state"]
T2["Character-level models<br/>learn structure automatically"]
T3["Generated text captures<br/>syntax, style, and patterns"]
T4["Hidden units develop<br/>interpretable features"]
T5["Vanishing gradients limit<br/>vanilla RNN memory"]
end
T1 --> C["Karpathy's blog showed that<br/>simple RNNs can generate<br/>surprisingly coherent text,<br/>code, and structured data—<br/>inspiring a generation of<br/>sequence modeling research."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Karpathy’s blog post demonstrated that character-level RNNs can generate surprisingly coherent Shakespeare, Wikipedia, and source code—revealing the unreasonable effectiveness of recurrence for sequence modeling.
Exercises
1. Implementation: Implement a minimal character-level RNN in NumPy and train it on a small text corpus. Generate samples at different temperatures.
2. Visualization: Train an RNN and visualize the activation of hidden units over a sequence. Can you find interpretable features?
3. Comparison: Compare character-level and word-level language models on the same dataset. What are the trade-offs?
4. Analysis: Why does the RNN learn to close parentheses and quotes correctly? What does this tell us about what it’s learning?
References & Further Reading
| Resource | Link |
|---|---|
| Original Blog Post (Karpathy) | karpathy.github.io |
| min-char-rnn.py | GitHub Gist |
| Understanding LSTM Networks | Colah’s Blog |
| Visualizing RNNs | Karpathy Thesis |
| Sequence to Sequence Learning | arXiv:1409.3215 |
| Deep Learning Book Ch. 10 | deeplearningbook.org |
Next Chapter: Chapter 12: Understanding LSTM Networks — We explore how Long Short-Term Memory networks solve the vanishing gradient problem with gated memory cells, enabling learning over much longer sequences.