Chapter 16: Attention Is All You Need
“We propose the Transformer, a model architecture eschewing recurrence and convolutions entirely and relying solely on attention mechanisms.”
Based on: “Attention Is All You Need” (Ashish Vaswani, Noam Shazeer, Niki Parmar, et al., 2017)
| 📄 Original Paper: arXiv:1706.03762 | NeurIPS 2017 |
16.1 The Paper That Changed Everything
In 2017, a team from Google Brain published a paper with a bold claim: “Attention Is All You Need.”
They eliminated:
- ❌ Recurrence (RNNs/LSTMs)
- ❌ Convolutions
They kept:
- ✅ Attention mechanisms
The result: The Transformer—the architecture that powers GPT, BERT, and virtually every modern LLM.
graph TB
subgraph "Before Transformers"
R["RNNs/LSTMs<br/>Sequential processing"]
C["CNNs<br/>Convolutional layers"]
end
subgraph "After Transformers"
T["Transformers<br/>Pure attention"]
end
R -->|"Replaced by"| T
C -->|"Replaced by"| T
I["Parallel processing<br/>Better long-range dependencies<br/>Foundation for LLMs"]
T --> I
style T fill:#ffe66d,color:#000
Figure: Transformers replaced RNNs and CNNs for sequence processing, enabling parallel processing, better long-range dependencies, and serving as the foundation for modern large language models.
16.2 Why Eliminate Recurrence?
The Sequential Bottleneck
RNNs process sequences one element at a time:
graph LR
subgraph "RNN Processing"
X1["x₁"] --> H1["h₁"]
X2["x₂"] --> H2["h₂"]
X3["x₃"] --> H3["h₃"]
X4["x₄"] --> H4["h₄"]
end
P["Cannot parallelize!<br/>Must wait for h₁ to compute h₂"]
H1 --> P
style P fill:#ff6b6b,color:#fff
Figure: RNNs process sequences sequentially, creating a bottleneck where each step must wait for the previous one, preventing parallelization and slowing training.
This makes training slow and limits scalability.
The Solution: Parallel Attention
Transformers process all positions simultaneously:
graph TB
subgraph "Transformer Processing"
X["[x₁, x₂, x₃, x₄]"]
ATT["Self-Attention<br/>(all positions at once)"]
OUT["[y₁, y₂, y₃, y₄]"]
end
X --> ATT --> OUT
K["All positions computed<br/>in parallel!"]
ATT --> K
style K fill:#4ecdc4,color:#fff
Figure: Transformers process all sequence positions simultaneously through self-attention, enabling full parallelization across the sequence during training (autoregressive decoding at inference time still generates one token at a time).
16.3 The Transformer Architecture
High-Level Overview
graph TB
subgraph "Transformer"
ENC["Encoder<br/>(6 layers)"]
DEC["Decoder<br/>(6 layers)"]
end
subgraph "Encoder Layer"
SA["Self-Attention"]
FF["Feed Forward"]
ADD1["Add & Norm"]
ADD2["Add & Norm"]
end
subgraph "Decoder Layer"
MSA["Masked Self-Attention"]
CA["Cross-Attention"]
FF2["Feed Forward"]
ADD3["Add & Norm"]
ADD4["Add & Norm"]
ADD5["Add & Norm"]
end
ENC --> DEC
SA --> ADD1 --> FF --> ADD2
MSA --> ADD3 --> CA --> ADD4 --> FF2 --> ADD5
Figure: High-level Transformer architecture showing encoder (6 layers) and decoder (6 layers), each with self-attention, feed-forward networks, and residual connections with layer normalization.
Key Components
- Self-Attention: Each position attends to all positions
- Multi-Head Attention: Multiple attention mechanisms in parallel
- Position Encoding: Injects positional information
- Feed-Forward Networks: Point-wise transformations
- Residual Connections: Skip connections (like ResNet!)
- Layer Normalization: Normalization after each sub-layer
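To make the overall shape concrete, here is a minimal sketch that assembles these components with PyTorch's built-in `nn.Transformer`, using the base-model hyperparameters from the paper (d_model = 512, 8 heads, 6 encoder and 6 decoder layers, d_ff = 2048, dropout 0.1). The random tensors stand in for already-embedded source and target sequences; this illustrates the interface, not the paper's training setup.

```python
import torch
import torch.nn as nn

# Base-model hyperparameters from the paper: d_model=512, 8 heads,
# 6 encoder layers, 6 decoder layers, d_ff=2048, dropout=0.1.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,
)

src = torch.randn(2, 10, 512)  # (batch, source length, d_model), already embedded
tgt = torch.randn(2, 7, 512)   # (batch, target length, d_model)

# Causal mask so decoder positions cannot attend to future target positions.
tgt_mask = model.generate_square_subsequent_mask(7)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 7, 512])
```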
16.4 Scaled Dot-Product Attention
The Core Mechanism
graph TB
subgraph "Scaled Dot-Product Attention"
Q["Queries Q"]
K["Keys K"]
V["Values V"]
DOT["QK^T"]
SCALE["÷ √d_k"]
SOFT["Softmax"]
MUL["× V"]
OUT["Attention(Q, K, V)"]
end
Q --> DOT
K --> DOT
DOT --> SCALE --> SOFT --> MUL
V --> MUL
MUL --> OUT
F["Formula:<br/>Attention = softmax(QK^T/√d_k) V"]
OUT --> F
style F fill:#ffe66d,color:#000
Figure: Scaled dot-product attention mechanism. Queries Q are matched against keys K, scaled by √d_k, passed through softmax, then used to weight values V, producing the attention output.
The Formula
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
Why Scale by √d_k?
Without scaling, dot products grow large → softmax saturates → tiny gradients. With unit-variance query and key components, the dot product q·k has variance d_k, so dividing by √d_k keeps the logits at roughly unit scale.
graph LR
subgraph "Without Scaling"
L["Large dot products<br/>→ Saturated softmax<br/>→ Vanishing gradients"]
end
subgraph "With Scaling"
S["Scaled by √d_k<br/>→ Stable softmax<br/>→ Good gradients"]
end
L -->|"Fixed by"| S
style S fill:#4ecdc4,color:#fff
Figure: Scaling by √d_k prevents large dot products that would cause softmax saturation and vanishing gradients, ensuring stable training.
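A minimal sketch of scaled dot-product attention in PyTorch, following the formula above. The tensor shapes are illustrative, and the optional mask argument anticipates the masked attention used in the decoder (Section 16.8).

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (..., q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                      # attention weights
    return weights @ v, weights

q = torch.randn(1, 4, 64)   # (batch, query positions, d_k)
k = torch.randn(1, 6, 64)   # (batch, key positions, d_k)
v = torch.randn(1, 6, 64)   # (batch, key positions, d_v)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([1, 4, 64]) torch.Size([1, 4, 6])
```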
16.5 Multi-Head Attention
Why Multiple Heads?
Different heads learn different types of relationships:
graph TB
subgraph "Multi-Head Attention"
Q["Q"]
K["K"]
V["V"]
H1["Head 1<br/>Syntactic relations"]
H2["Head 2<br/>Semantic relations"]
H3["Head 3<br/>Long-range dependencies"]
H4["Head 4<br/>Positional patterns"]
CONCAT["Concat"]
PROJ["Linear projection"]
OUT["Output"]
end
Q --> H1
K --> H1
V --> H1
Q --> H2
K --> H2
V --> H2
Q --> H3
K --> H3
V --> H3
Q --> H4
K --> H4
V --> H4
H1 --> CONCAT
H2 --> CONCAT
H3 --> CONCAT
H4 --> CONCAT
CONCAT --> PROJ --> OUT
Figure: Multi-head attention allows the model to attend to different types of relationships simultaneously—syntactic, semantic, long-range dependencies, and positional patterns—then concatenates and projects the results.
The Formula
\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]
where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\).
Each head has its own learned projection matrices!
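A compact, self-contained sketch along the lines of the formula above: project Q, K, V, split into heads, apply scaled dot-product attention per head, concatenate, and project with W^O. Hyperparameters follow the base model (d_model = 512, h = 8); this is an illustration, not the paper's reference code.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head attention: project, split into heads, attend, merge."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # all W_i^Q fused into one matrix
        self.w_k = nn.Linear(d_model, d_model)   # all W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # all W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        b, t, _ = x.shape
        return x.view(b, t, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        q, k, v = self._split(self.w_q(q)), self._split(self.w_k(k)), self._split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = scores.softmax(dim=-1) @ v                         # per-head attention
        out = out.transpose(1, 2).contiguous().view(b, -1, self.num_heads * self.d_head)
        return self.w_o(out)                                     # concat + final projection

mha = MultiHeadAttention()
x = torch.randn(2, 10, 512)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 512])
```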
16.6 Position Encoding
The Problem
Attention has no inherent notion of order. We need to inject positional information.
Solution: Sinusoidal Position Encoding
\(PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\) \(PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)
graph TB
subgraph "Position Encoding"
P1["Position 0<br/>[sin(0), cos(0), sin(0/100), cos(0/100), ...]"]
P2["Position 1<br/>[sin(1), cos(1), sin(1/100), cos(1/100), ...]"]
P3["Position 2<br/>[sin(2), cos(2), sin(2/100), cos(2/100), ...]"]
end
ADD["Add to embeddings"]
P1 --> ADD
P2 --> ADD
P3 --> ADD
K["Learned relative positions<br/>Can extrapolate to longer sequences"]
ADD --> K
style K fill:#ffe66d,color:#000
Figure: Sinusoidal position encoding adds positional information to token embeddings. Each position gets a unique encoding based on sine and cosine functions, allowing the model to learn relative positions and extrapolate to longer sequences.
Why Sinusoidal?
- Deterministic: No learned parameters
- Extrapolates: Can handle sequences longer than training
- Relative positions: Model can learn relative distances
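A short sketch of the sinusoidal encoding defined above; the helper name and toy shapes are ours. Note that the encoding is added to the token embeddings, not concatenated.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)           # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                     # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
token_embeddings = torch.randn(2, 50, 512)     # toy embeddings for a batch of 2
x = token_embeddings + pe.unsqueeze(0)         # added to embeddings, not concatenated
print(x.shape)  # torch.Size([2, 50, 512])
```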
16.7 The Encoder
Encoder Layer Structure
graph TB
subgraph "Encoder Layer"
X["Input"]
SA["Multi-Head<br/>Self-Attention"]
ADD1["Add & Norm"]
FF["Feed Forward<br/>(2 linear layers)"]
ADD2["Add & Norm"]
OUT["Output"]
end
X --> SA --> ADD1
X -->|"residual"| ADD1
ADD1 --> FF --> ADD2
ADD1 -->|"residual"| ADD2
ADD2 --> OUT
K["Residual connections +<br/>Layer normalization<br/>(Pre-norm style)"]
ADD2 --> K
style K fill:#ffe66d,color:#000
Figure: Encoder layer structure with multi-head self-attention, feed-forward network, residual connections, and layer normalization. The residual connections help with gradient flow and training stability.
Feed-Forward Network
\[\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\]
Two linear transformations with a ReLU activation in between, applied identically at every position.
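As a sketch, the position-wise feed-forward network is just two linear layers with a ReLU in between, applied to every position independently; d_ff = 2048 follows the base model.

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W1, b1
            nn.ReLU(),                  # max(0, .)
            nn.Linear(d_ff, d_model),   # W2, b2
        )

    def forward(self, x):
        return self.net(x)

ffn = PositionwiseFeedForward()
print(ffn(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```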
16.8 The Decoder
Decoder Layer Structure
graph TB
subgraph "Decoder Layer"
Y["Input"]
MSA["Masked Multi-Head<br/>Self-Attention"]
ADD1["Add & Norm"]
CA["Multi-Head<br/>Cross-Attention"]
ADD2["Add & Norm"]
FF["Feed Forward"]
ADD3["Add & Norm"]
OUT["Output"]
end
Y --> MSA --> ADD1
Y -->|"residual"| ADD1
ADD1 --> CA --> ADD2
ADD1 -->|"residual"| ADD2
ENC_OUT["Encoder Output"] --> CA
ADD2 --> FF --> ADD3
ADD2 -->|"residual"| ADD3
ADD3 --> OUT
Figure: Decoder layer structure with masked self-attention (prevents looking ahead), cross-attention (attends to encoder output), feed-forward network, and residual connections with layer normalization.
Masked Self-Attention
Prevents positions from attending to future positions:
graph LR
subgraph "Masked Attention"
Y1["y₁"] -->|"✓"| Y1
Y2["y₂"] -->|"✓"| Y1
Y2 -->|"✓"| Y2
Y3["y₃"] -->|"✓"| Y1
Y3 -->|"✓"| Y2
Y3 -->|"✓"| Y3
Y3 -->|"✗"| Y4
end
K["Mask ensures<br/>autoregressive property"]
Y3 --> K
style K fill:#ffe66d,color:#000
Figure: Masked attention prevents decoder positions from attending to future positions (marked with ✗), ensuring the autoregressive property where each position only sees previous positions.
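The mask itself is just a lower-triangular matrix. A minimal sketch of how such a causal mask can be built in PyTorch (the exact representation varies between implementations):

```python
import torch

seq_len = 4
# Lower-triangular boolean mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
# Positions where the mask is False are set to -inf before the softmax,
# so they receive zero attention weight.
```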
Cross-Attention
Decoder attends to encoder outputs:
- Queries (Q): From decoder
- Keys (K): From encoder
- Values (V): From encoder
This connects encoder and decoder!
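A brief sketch of cross-attention using PyTorch's `nn.MultiheadAttention`: the decoder state supplies the queries while the encoder output supplies the keys and values. Shapes and module choice are illustrative.

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

encoder_output = torch.randn(2, 10, 512)   # keys and values come from the encoder
decoder_state = torch.randn(2, 7, 512)     # queries come from the decoder

out, attn_weights = cross_attn(query=decoder_state,
                               key=encoder_output,
                               value=encoder_output)
print(out.shape)           # torch.Size([2, 7, 512])
print(attn_weights.shape)  # torch.Size([2, 7, 10]): each target position over all source positions
```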
16.9 Why Transformers Work So Well
Advantages
graph TB
subgraph "Transformer Advantages"
P["Parallelization<br/>All positions at once"]
L["Long-range dependencies<br/>Direct attention paths"]
I["Interpretability<br/>Attention weights"]
S["Scalability<br/>Efficient training"]
end
B["Better performance<br/>on many tasks"]
P --> B
L --> B
I --> B
S --> B
style B fill:#4ecdc4,color:#fff
Figure: Key advantages of Transformers: parallelization (all positions processed simultaneously), long-range dependencies (direct attention paths), interpretability (attention weights), and scalability (efficient training).
Comparison with RNNs
| Aspect | RNN | Transformer |
|---|---|---|
| Parallelization | ❌ Sequential | ✅ Parallel |
| Long-range | Hard (gradients vanish) | Easy (direct attention) |
| Training speed | Slow | Fast |
| Memory | O(n) | O(n²) for attention |
16.10 Experimental Results
WMT 2014 English-German
xychart-beta
title "BLEU Scores on WMT'14 En-De"
x-axis ["Best ConvS2S", "Transformer (base)", "Transformer (big)"]
y-axis "BLEU Score" 0 --> 30
bar [25.2, 27.3, 28.4]
Figure: Transformer performance on WMT'14 English-German translation. The base model achieves 27.3 BLEU and the big model reaches 28.4 BLEU, outperforming the best previously reported convolutional sequence-to-sequence model (about 25.2 BLEU).
Training Speed
The big Transformer trained in about 3.5 days on 8 P100 GPUs, a small fraction of the training cost reported for earlier state-of-the-art models such as ConvS2S.
Key Findings
- Faster training: Despite O(n²) attention, parallelization wins
- Better quality: State-of-the-art BLEU scores
- Scalable: The big model keeps the same depth (6 layers) but widens the model (larger d_model, more heads, bigger feed-forward layers), improving results further
16.11 The Impact
What Transformers Enabled
timeline
title Transformer Revolution
2017 : Transformer paper
: Attention is all you need
2018 : BERT, GPT-1
: Pre-trained Transformers
2019 : GPT-2
: Large language models
2020 : GPT-3
: 175B parameters
2022 : ChatGPT
: Transformer-based chatbot
2023 : GPT-4, Claude
: Advanced reasoning
2024 : Gemini, GPT-4 Turbo
: Multimodal Transformers
Figure: Timeline of the Transformer revolution, from the original 2017 paper through BERT, GPT models, ChatGPT, and modern multimodal systems, showing how Transformers became the foundation of modern AI.
Modern Applications
- Language Models: GPT, BERT, T5, PaLM
- Vision: Vision Transformers (ViT)
- Multimodal: CLIP, DALL-E
- Code: Codex, GitHub Copilot
- Science: AlphaFold 2, scientific LLMs
16.12 Understanding Self-Attention
What Does It Learn?
graph TB
subgraph "Self-Attention Patterns"
S1["Syntactic:<br/>'The cat' → 'cat' attends to 'The'"]
S2["Semantic:<br/>'bank' → attends to 'river' or 'money'"]
S3["Long-range:<br/>'it' → attends to 'cat' (50 words away)"]
S4["Coreference:<br/>'he' → attends to 'John'"]
end
K["Learns linguistic<br/>and semantic patterns"]
S1 --> K
S2 --> K
S3 --> K
S4 --> K
style K fill:#ffe66d,color:#000
Figure: Self-attention learns various linguistic patterns: syntactic relationships (determiner-noun), semantic relationships (word sense disambiguation), long-range dependencies (pronoun resolution), and coreference (entity tracking).
Visualization Example
Input: "The animal didn't cross the street because it was too wide"
Attention from "it":
- "animal": 0.4
- "street": 0.3
- "cross": 0.2
- Others: 0.1
Model learned: "it" refers to "street"!
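For readers who want to reproduce this kind of inspection, the sketch below shows one way to read attention weights out of PyTorch's `nn.MultiheadAttention`. With random, untrained weights the printed numbers are meaningless; in a trained Transformer the weight on "animal" would dominate, as in the paper's visualizations.

```python
import torch
import torch.nn as nn

tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]

# Toy embeddings -- a trained model would supply real ones.
embeddings = torch.randn(1, len(tokens), 512)

self_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
_, attn = self_attn(embeddings, embeddings, embeddings)   # attn: (1, seq, seq), averaged over heads

it_pos = tokens.index("it")
weights = attn[0, it_pos]                                  # how strongly "it" attends to each token
for tok, w in sorted(zip(tokens, weights.tolist()), key=lambda p: -p[1])[:3]:
    print(f"{tok}: {w:.2f}")
```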
16.13 Connection to Other Chapters
graph TB
CH16["Chapter 16<br/>Transformers"]
CH16 --> CH15["Chapter 15: NMT Attention<br/><i>Foundation: attention mechanism</i>"]
CH16 --> CH8["Chapter 8: ResNet<br/><i>Residual connections</i>"]
CH16 --> CH9["Chapter 9: Identity Mappings<br/><i>Pre-norm architecture</i>"]
CH16 --> CH14["Chapter 14: Relational RNNs<br/><i>Self-attention in RNNs</i>"]
CH16 --> CH25["Chapter 25: Scaling Laws<br/><i>Transformers scale beautifully</i>"]
style CH16 fill:#ff6b6b,color:#fff
Figure: Transformers connect to multiple chapters: attention mechanisms (Chapter 15), residual connections (Chapter 8), identity mappings (Chapter 9), relational reasoning (Chapter 14), and scaling laws (Chapter 25).
16.14 Key Equations Summary
Scaled Dot-Product Attention
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
Multi-Head Attention
\(\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\) \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)
Position Encoding
\(PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\) \(PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)
Feed-Forward Network
\[\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\]
Layer Normalization
\[\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]
16.15 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Transformers eliminate<br/>recurrence and convolution"]
T2["Self-attention processes<br/>all positions in parallel"]
T3["Multi-head attention captures<br/>different relationship types"]
T4["Position encoding injects<br/>order information"]
T5["Foundation for all<br/>modern LLMs"]
end
T1 --> C["The Transformer architecture<br/>replaced RNNs and CNNs for most<br/>sequence tasks by using pure attention,<br/>enabling parallel processing and<br/>better long-range dependencies—<br/>becoming the foundation of GPT, BERT,<br/>and virtually every modern language model."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
Figure: Key takeaways from the Transformer architecture: elimination of recurrence/convolution, parallel self-attention, multi-head attention, and scalability that enabled modern large language models.
In One Sentence
The Transformer architecture eliminates recurrence and convolution, relying solely on multi-head self-attention to process sequences in parallel, achieving state-of-the-art results and becoming the foundation for all modern large language models.
Exercises
- Conceptual: Explain why self-attention can be parallelized while RNNs cannot. What are the computational complexity trade-offs?
- Mathematical: Derive why scaling by √d_k prevents softmax saturation. What happens if d_k is very large?
- Implementation: Implement a single-head self-attention layer from scratch in PyTorch. Test it on a simple sequence.
- Analysis: Compare the memory requirements of a Transformer vs. an LSTM for a sequence of length n. When does each win?
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Vaswani et al., 2017) | arXiv:1706.03762 |
| The Annotated Transformer | Harvard NLP |
| Illustrated Transformer | Jay Alammar Blog |
| BERT Paper | arXiv:1810.04805 |
| GPT-3 Paper | arXiv:2005.14165 |
| Vision Transformers | arXiv:2010.11929 |
Next Chapter: Chapter 17: The Annotated Transformer — We dive into a line-by-line implementation walkthrough of the Transformer, making every detail concrete and implementable.