Chapter 26: GPipe - Efficient Training of Giant Neural Networks
“We introduce GPipe, a pipeline parallelism library that enables efficient training of giant neural networks by partitioning models across multiple accelerators.”
Based on: “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism” (Yanping Huang, Youlong Cheng, Ankur Bapna, et al., 2018)
| 📄 Original Paper: arXiv:1811.06965 | NeurIPS 2019 |
26.1 The Problem: Models Too Large for a Single GPU
As models grow (following the scaling laws from Chapter 25), their parameters, activations, and optimizer state exceed the memory of a single GPU:
graph TB
subgraph "The Problem"
MODEL["Large Model<br/>(e.g., 1B parameters)"]
GPU["Single GPU<br/>(e.g., 16GB memory)"]
FAIL["❌ Out of Memory"]
end
MODEL --> GPU --> FAIL
K["Model doesn't fit<br/>on one GPU"]
FAIL --> K
style FAIL fill:#ff6b6b,color:#fff
Solutions
- Data parallelism: Replicate model, split data (Chapter 24)
- Model parallelism: Split model across GPUs (this chapter)
- Pipeline parallelism: Split model into stages (GPipe)
26.2 Pipeline Parallelism vs Other Approaches
Comparison
graph TB
subgraph "Parallelism Strategies"
DP["Data Parallelism<br/>Same model, different data"]
MP["Model Parallelism<br/>Split layers across GPUs"]
PP["Pipeline Parallelism<br/>Split into stages (GPipe)"]
end
DP -->|"Good for"| D1["Many small models"]
MP -->|"Good for"| M1["Wide layers"]
PP -->|"Good for"| P1["Deep sequential models"]
style PP fill:#4ecdc4,color:#fff
Why Pipeline Parallelism?
For deep sequential models (Transformers, ResNets):
- Natural partitioning: split the model by layers
- Better GPU utilization than naive model parallelism (once micro-batching is added, Section 26.5)
- Simpler to implement than intra-layer (tensor) model parallelism
26.3 The GPipe Architecture
Basic Idea
Split the model into stages, each on a different GPU:
graph TB
subgraph "GPipe Pipeline"
INPUT["Input batch"]
S1["Stage 1<br/>(GPU 1)<br/>Layers 1-4"]
S2["Stage 2<br/>(GPU 2)<br/>Layers 5-8"]
S3["Stage 3<br/>(GPU 3)<br/>Layers 9-12"]
S4["Stage 4<br/>(GPU 4)<br/>Layers 13-16"]
OUTPUT["Output"]
end
INPUT --> S1 --> S2 --> S3 --> S4 --> OUTPUT
K["Each GPU processes<br/>different layers"]
S2 --> K
style K fill:#ffe66d,color:#000
Forward Pass
Data flows sequentially through the stages (a minimal code sketch follows this list):
- GPU 1 processes input → sends to GPU 2
- GPU 2 processes → sends to GPU 3
- GPU 3 processes → sends to GPU 4
- GPU 4 processes → outputs result
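To make the stage placement concrete, here is a minimal sketch in PyTorch, assuming four GPUs are available; the 16-layer toy model and the even split into four stages are illustrative choices, not GPipe's actual partitioning algorithm (which balances stages by estimated cost).

```python
import torch
import torch.nn as nn

# Toy 16-layer model expressed as a plain sequence of layers (illustrative sizes).
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(16)]
)

# Split the sequence into 4 contiguous stages and place each on its own GPU.
num_stages = 4
layers_per_stage = len(model) // num_stages
stages = [
    nn.Sequential(
        *list(model.children())[s * layers_per_stage:(s + 1) * layers_per_stage]
    ).to(f"cuda:{s}")
    for s in range(num_stages)
]

def forward(x):
    # Naive sequential forward: run each stage on its GPU, then copy the
    # activation tensor to the next stage's device.
    for s, stage in enumerate(stages):
        x = stage(x.to(f"cuda:{s}"))
    return x

output = forward(torch.randn(256, 1024))  # result lives on cuda:3
```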
26.4 The Pipeline Bubble Problem
Naive Pipeline
If we process one batch at a time:
gantt
title Naive Pipeline (Inefficient)
dateFormat X
axisFormat %s
section GPU 1
Batch 1 :0, 4
Idle :4, 12
Batch 2 :16, 4
section GPU 2
Idle :0, 4
Batch 1 :4, 4
Idle :8, 12
Batch 2 :20, 4
section GPU 3
Idle :0, 8
Batch 1 :8, 4
Idle :12, 12
Batch 2 :24, 4
section GPU 4
Idle :0, 12
Batch 1 :12, 4
Idle :16, 12
Batch 2 :28, 4
Problem: Most GPUs are idle most of the time!
The Bubble
graph TB
subgraph "Pipeline Bubble"
START["Pipeline starts"]
FILL["Filling pipeline<br/>(GPUs idle)"]
STEADY["Steady state<br/>(all GPUs busy)"]
DRAIN["Draining pipeline<br/>(GPUs idle)"]
end
START --> FILL --> STEADY --> DRAIN
K["Bubble = wasted compute<br/>during fill and drain"]
FILL --> K
DRAIN --> K
style K fill:#ff6b6b,color:#fff
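A quick back-of-the-envelope calculation (plain Python, illustrative only) shows how bad this is: with a single batch in flight, only one of the $p$ stages is busy at any moment.

```python
def naive_utilization(num_stages: int) -> float:
    # With one batch in flight, exactly one stage works at any moment,
    # so average utilization is 1 / num_stages.
    return 1.0 / num_stages

for p in (2, 4, 8):
    print(f"{p} stages: naive utilization = {naive_utilization(p):.1%}")
# Prints 50.0%, 25.0%, and 12.5% respectively.
```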
26.5 Micro-Batching: The Solution
Split Batch into Micro-Batches
Instead of one large batch, split into micro-batches:
graph TB
subgraph "Micro-Batching"
BATCH["Large batch<br/>(e.g., 256 samples)"]
MB1["Micro-batch 1<br/>(64 samples)"]
MB2["Micro-batch 2<br/>(64 samples)"]
MB3["Micro-batch 3<br/>(64 samples)"]
MB4["Micro-batch 4<br/>(64 samples)"]
end
BATCH --> MB1
BATCH --> MB2
BATCH --> MB3
BATCH --> MB4
K["Process multiple<br/>micro-batches in pipeline"]
MB1 --> K
style K fill:#4ecdc4,color:#fff
Pipeline with Micro-Batches
gantt
title GPipe Pipeline with Micro-Batches
dateFormat X
axisFormat %s
section GPU 1
MB1 :0, 1
MB2 :1, 1
MB3 :2, 1
MB4 :3, 1
section GPU 2
MB1 :1, 1
MB2 :2, 1
MB3 :3, 1
MB4 :4, 1
section GPU 3
MB1 :2, 1
MB2 :3, 1
MB3 :4, 1
MB4 :5, 1
section GPU 4
MB1 :3, 1
MB2 :4, 1
MB3 :5, 1
MB4 :6, 1
Result: Much better GPU utilization!
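The schedule above can be written down in a few lines of plain Python: micro-batch $i$ reaches stage $s$ at clock tick $s + i$, and the whole forward pass takes $m + p - 1$ ticks. This is only a sketch of the timing, not an implementation.

```python
# Reproduce the schedule in the chart: p = 4 stages, m = 4 micro-batches.
p, m = 4, 4
total_ticks = m + p - 1  # 7 ticks for the forward pass

for stage in range(p):
    row = []
    for tick in range(total_ticks):
        mb = tick - stage  # this stage sees micro-batch `mb` at this tick
        row.append(f"MB{mb + 1}" if 0 <= mb < m else "----")
    print(f"GPU {stage + 1}: " + "  ".join(row))
```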
26.6 Gradient Accumulation
The Challenge
Each micro-batch produces gradients, but we need gradients for the full batch:
graph TB
subgraph "Gradient Accumulation"
MB1["Micro-batch 1<br/>→ grad₁"]
MB2["Micro-batch 2<br/>→ grad₂"]
MB3["Micro-batch 3<br/>→ grad₃"]
MB4["Micro-batch 4<br/>→ grad₄"]
ACCUM["Accumulate:<br/>grad = grad₁ + grad₂ + grad₃ + grad₄"]
UPDATE["Update weights"]
end
MB1 --> ACCUM
MB2 --> ACCUM
MB3 --> ACCUM
MB4 --> ACCUM
ACCUM --> UPDATE
K["Sum gradients across<br/>micro-batches before update"]
ACCUM --> K
style K fill:#ffe66d,color:#000
Mathematical Formulation
For micro-batches $m_1, …, m_k$:
\[\nabla L = \sum_{i=1}^k \nabla L(m_i)\]
Then update: $\theta \leftarrow \theta - \alpha \nabla L$. (If the loss is averaged over the batch rather than summed, the micro-batch gradients are likewise averaged, weighted by micro-batch size; either way the result equals the gradient computed on the full batch.)
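The following is a minimal gradient-accumulation loop in PyTorch on a single device, just to isolate the accumulation idea; the model, data, and micro-batch count are illustrative. Scaling each micro-batch loss by $1/k$ makes the accumulated gradient match the mean loss over the full batch (assuming equal-sized micro-batches).

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

batch_x = torch.randn(256, 1024)
batch_y = torch.randint(0, 10, (256,))
k = 4  # number of micro-batches

optimizer.zero_grad()
for mb_x, mb_y in zip(batch_x.chunk(k), batch_y.chunk(k)):
    loss = loss_fn(model(mb_x), mb_y) / k  # scale so the sum matches the full-batch mean
    loss.backward()                        # .grad buffers accumulate across micro-batches
optimizer.step()                           # one update for the whole mini-batch
```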
26.7 The Complete GPipe Algorithm
Forward Pass
graph TB
subgraph "Forward Pass"
SPLIT["Split batch into<br/>micro-batches"]
P1["Process MB₁ on Stage 1"]
P2["Process MB₁ on Stage 2"]
P3["Process MB₁ on Stage 3"]
P4["Process MB₁ on Stage 4"]
STORE["Store activations<br/>for backward pass"]
end
SPLIT --> P1 --> P2 --> P3 --> P4 --> STORE
K["Pipeline processes<br/>multiple micro-batches<br/>in parallel"]
P2 --> K
style K fill:#4ecdc4,color:#fff
Backward Pass
graph TB
subgraph "Backward Pass"
LOSS["Compute loss<br/>on final stage"]
B4["Backward through Stage 4"]
B3["Backward through Stage 3"]
B2["Backward through Stage 2"]
B1["Backward through Stage 1"]
GRAD["Accumulate gradients"]
end
LOSS --> B4 --> B3 --> B2 --> B1 --> GRAD
K["Gradients flow backward<br/>through pipeline"]
B2 --> K
style K fill:#ffe66d,color:#000
Memory Management
GPipe stores activations for the backward pass:
- Forward: Store activations at each stage
- Backward: Use stored activations to compute gradients
This requires significant memory, but it is what makes the gradient computation correct (Section 26.9 shows how re-materialization reduces the memory cost).
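Putting the pieces together, here is a hedged end-to-end sketch in PyTorch (assuming four GPUs; sizes and hyperparameters are illustrative). It reproduces GPipe's gradient semantics — micro-batched forward and backward passes with gradient accumulation and one synchronous update per mini-batch — but for simplicity it runs micro-batches one after another instead of overlapping them across stages the way the real library does.

```python
import torch
import torch.nn as nn

devices = [torch.device(f"cuda:{i}") for i in range(4)]  # assumes 4 GPUs
full_model = nn.Sequential(
    *[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(16)]
)
layers_per_stage = len(full_model) // len(devices)
stages = [
    nn.Sequential(
        *list(full_model.children())[i * layers_per_stage:(i + 1) * layers_per_stage]
    ).to(d)
    for i, d in enumerate(devices)
]

params = [p for stage in stages for p in stage.parameters()]
optimizer = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.MSELoss()

def train_step(batch_x, batch_y, num_micro_batches=8):
    optimizer.zero_grad()
    for mb_x, mb_y in zip(batch_x.chunk(num_micro_batches),
                          batch_y.chunk(num_micro_batches)):
        x = mb_x
        for stage, dev in zip(stages, devices):
            x = stage(x.to(dev))                     # forward through each stage
        loss = loss_fn(x, mb_y.to(devices[-1])) / num_micro_batches
        loss.backward()                              # gradients accumulate per stage
    optimizer.step()                                 # single synchronous update

train_step(torch.randn(256, 512), torch.randn(256, 512))
```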
26.8 Efficiency Analysis
Pipeline Utilization
With $p$ stages and $m$ micro-batches:
Ideal utilization (ignoring overhead): \(\text{Utilization} = \frac{m}{m + p - 1}\)
xychart-beta
title "Pipeline Utilization vs Micro-Batches"
x-axis "Micro-Batches (m)" [4, 8, 16, 32, 64]
y-axis "Utilization %" 0 --> 100
line "4 stages" [50, 73, 84, 91, 95]
line "8 stages" [31, 53, 70, 82, 90]
Key insight: More micro-batches → higher pipeline utilization. The trade-off is that very small micro-batches use each GPU less efficiently (smaller per-step workloads and relatively more communication).
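The curve above follows directly from the formula; a small plain-Python helper reproduces it:

```python
def pipeline_utilization(num_micro_batches: int, num_stages: int) -> float:
    # Ideal utilization m / (m + p - 1), ignoring communication and recomputation.
    m, p = num_micro_batches, num_stages
    return m / (m + p - 1)

for p in (4, 8):
    points = ", ".join(
        f"m={m}: {pipeline_utilization(m, p):.0%}" for m in (4, 8, 16, 32, 64)
    )
    print(f"{p} stages -> {points}")
```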
26.9 Memory Efficiency: Re-materialization
The Memory Problem
Storing all activations for the backward pass uses a lot of memory:
graph TB
subgraph "Memory Usage"
FORWARD["Forward pass:<br/>Store all activations"]
MEM["Memory = O(batch_size × layers)"]
PROBLEM["❌ Out of memory<br/>for large models"]
end
FORWARD --> MEM --> PROBLEM
style PROBLEM fill:#ff6b6b,color:#fff
Gradient Checkpointing
Re-materialization: recompute activations during the backward pass instead of storing them:
graph TB
subgraph "Gradient Checkpointing"
FWD["Forward: Store only<br/>checkpoint activations"]
BWD["Backward: Recompute<br/>intermediate activations"]
SAVE["Saves memory<br/>at cost of recomputation"]
end
FWD --> BWD --> SAVE
K["Trade compute<br/>for memory"]
SAVE --> K
style K fill:#ffe66d,color:#000
GPipe uses this to train even larger models.
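In PyTorch, the same idea is available as activation checkpointing. The sketch below assumes a reasonably recent PyTorch in which `torch.utils.checkpoint.checkpoint_sequential` accepts `use_reentrant`; the stage size and segment count are illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# One pipeline stage with 8 layers (illustrative sizes).
stage = torch.nn.Sequential(
    *[torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()) for _ in range(8)]
)
x = torch.randn(64, 512, requires_grad=True)

# Keep activations only at 2 segment boundaries during the forward pass;
# interior activations are recomputed during the backward pass.
y = checkpoint_sequential(stage, 2, x, use_reentrant=False)
y.sum().backward()
```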
26.10 Experimental Results
Model Size
GPipe enabled training of very large models:
xychart-beta
title "Model Size with GPipe"
x-axis ["Single GPU", "4 GPUs", "8 GPUs", "16 GPUs"]
y-axis "Parameters (Billions)" 0 --> 20
bar [0.5, 2.0, 4.0, 8.0]
Speedup
xychart-beta
title "Training Speedup"
x-axis ["1 GPU", "4 GPUs (GPipe)", "8 GPUs (GPipe)"]
y-axis "Speedup (×)" 0 --> 8
bar [1.0, 3.2, 5.8]
Near-linear speedup with good micro-batch sizing!
26.11 Comparison with Other Methods
Data Parallelism
| Aspect | Data Parallelism | GPipe (Pipeline) |
|---|---|---|
| Model size | Limited by single GPU | Can exceed single GPU |
| Communication | All-reduce gradients | Point-to-point activations |
| Efficiency | High for small models | High for large models |
| Complexity | Simple | Moderate |
Model Parallelism
| Aspect | Model Parallelism | GPipe |
|---|---|---|
| GPU utilization | Low (sequential) | High (pipelined) |
| Synchronization | Frequent | Batched |
| Memory | Distributed | Checkpointed |
26.12 Modern Variants
PipeDream
An asynchronous pipeline: new batches keep flowing instead of draining the pipeline before every weight update:
graph TB
subgraph "PipeDream"
ASYNC["Asynchronous updates<br/>(no waiting)"]
FAST["Faster training"]
STALE["Stale gradients<br/>(trade-off)"]
end
ASYNC --> FAST --> STALE
style FAST fill:#4ecdc4,color:#fff
Megatron-LM
Tensor parallelism + pipeline parallelism:
- Split layers within stages (tensor parallelism)
- Split across stages (pipeline parallelism)
DeepSpeed
Microsoft’s library combining:
- Pipeline parallelism
- ZeRO (Zero Redundancy Optimizer)
- Gradient checkpointing
26.13 Connection to Scaling Laws (Chapter 25)
Enabling Large Models
Scaling laws predict better performance with larger models. GPipe enables training those models:
graph TB
subgraph "Scaling Laws + GPipe"
SL["Scaling Laws:<br/>Larger models → Better"]
GPIPE["GPipe:<br/>Enables large models"]
LARGE["Train 10B+ parameter models"]
end
SL --> GPIPE --> LARGE
K["GPipe makes scaling laws<br/>practically achievable"]
LARGE --> K
style K fill:#4ecdc4,color:#fff
26.14 Connection to Other Chapters
graph TB
CH26["Chapter 26<br/>GPipe"]
CH26 --> CH25["Chapter 25: Scaling Laws<br/><i>Enables large models</i>"]
CH26 --> CH24["Chapter 24: Deep Speech 2<br/><i>Data parallelism</i>"]
CH26 --> CH16["Chapter 16: Transformers<br/><i>Deep sequential models</i>"]
CH26 --> CH8["Chapter 8: ResNet<br/><i>Very deep networks</i>"]
style CH26 fill:#ff6b6b,color:#fff
26.15 Key Equations Summary
Pipeline Utilization
\[\text{Utilization} = \frac{m}{m + p - 1}\]
Where:
- $m$ = number of micro-batches
- $p$ = number of pipeline stages
Gradient Accumulation
\[\nabla L = \sum_{i=1}^m \nabla L(\text{MB}_i)\]
Memory with Checkpointing
\[\text{Memory} = O\left(\frac{\text{batch\_size}}{m} \times \text{layers\_per\_stage}\right)\]
26.16 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Pipeline parallelism splits<br/>model into stages across GPUs"]
T2["Micro-batching improves<br/>GPU utilization"]
T3["Gradient accumulation<br/>combines micro-batch gradients"]
T4["Enables training models<br/>larger than single GPU memory"]
T5["Near-linear speedup<br/>with good configuration"]
end
T1 --> C["GPipe enables efficient training<br/>of giant neural networks by splitting<br/>models into pipeline stages across<br/>multiple GPUs, using micro-batching<br/>to improve utilization and gradient<br/>accumulation to maintain correctness."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
GPipe enables efficient training of giant neural networks by splitting models into pipeline stages across multiple GPUs, using micro-batching to improve utilization and gradient accumulation to maintain training correctness.
🎉 Part VI Complete!
You’ve finished the Scaling and Efficiency section. You now understand:
- End-to-end speech recognition at scale (Chapter 24)
- Scaling laws that predict performance (Chapter 25)
- Pipeline parallelism for giant models (Chapter 26)
Next up: Part VII - The Future of Intelligence, where we explore what comes next!
Exercises
- Conceptual: Explain why pipeline parallelism is better than model parallelism for deep sequential models. What are the trade-offs?
- Mathematical: Calculate the pipeline utilization for 8 stages with 16 micro-batches. How many micro-batches are needed for 90% utilization?
- Analysis: Compare the memory requirements of GPipe with and without gradient checkpointing. When is checkpointing worth the recomputation cost?
- Extension: How would you modify GPipe for models with skip connections (like ResNet)? What additional challenges arise?
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Huang et al., 2018) | arXiv:1811.06965 |
| PipeDream Paper | arXiv:1806.03377 |
| Megatron-LM Paper | arXiv:1909.08053 |
| DeepSpeed Paper | arXiv:1910.02054 |
| Gradient Checkpointing | arXiv:1604.06174 |
| GPipe Implementation | TensorFlow |
Next Chapter: Chapter 27: The Future of Intelligence — We conclude the book by exploring emerging directions, open questions, and the future of AI research.