Chapter 25: Scaling Laws for Neural Language Models
“We study empirical scaling laws for language model performance on the cross-entropy loss. We find that performance scales as a power-law with model size, dataset size, and compute.”
Based on: “Scaling Laws for Neural Language Models” (Jared Kaplan, Sam McCandlish, Tom Henighan, et al., 2020)
| 📄 Original Paper: arXiv:2001.08361 | OpenAI |
25.1 The Scaling Question
As neural networks get bigger, how does performance improve? Is it linear? Exponential? Something else?
graph TB
subgraph "The Scaling Question"
Q1["Double the model size<br/>→ Double the performance?"]
Q2["Double the data<br/>→ Double the performance?"]
Q3["Double the compute<br/>→ Double the performance?"]
end
A["We need empirical laws<br/>to understand scaling"]
Q1 --> A
Q2 --> A
Q3 --> A
style A fill:#ffe66d,color:#000
This paper provides empirical answers based on training hundreds of models.
25.2 The Three Dimensions of Scaling
Compute, Data, and Model Size
graph TB
subgraph "Scaling Dimensions"
C["Compute<br/>(FLOPs)"]
D["Data<br/>(tokens)"]
M["Model Size<br/>(parameters)"]
end
P["Performance<br/>(cross-entropy loss)"]
C --> P
D --> P
M --> P
K["All three affect performance<br/>but in different ways"]
P --> K
style K fill:#4ecdc4,color:#fff
The Relationship
\[L(N, D, C) = \text{performance as a function of model size } N, \text{ data } D, \text{ and compute } C\]
25.3 The Power Law Discovery
Model Size Scaling
Performance scales as a power law with model size:
xychart-beta
title "Loss vs Model Size (Power Law)"
x-axis "Parameters (N)" [1e6, 1e7, 1e8, 1e9, 1e10]
y-axis "Cross-Entropy Loss" 2.0 --> 3.5
line "Empirical" [3.2, 2.8, 2.5, 2.2, 2.0]
The Formula
\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}\]
Where:
- $N_c$ = critical model size
- $\alpha_N$ ≈ 0.076 (empirically determined)
Key insight: Performance improves, but with diminishing returns.
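A minimal sketch of this law in code, assuming the exponent quoted above; the critical size $N_c$ below is an order-of-magnitude placeholder rather than the paper's exact fitted constant:

```python
# Sketch of the model-size law L(N) = (N_c / N)**alpha_N (reducible loss only).
# alpha_N is the exponent quoted above; N_C is an order-of-magnitude placeholder.
ALPHA_N = 0.076
N_C = 8.8e13  # placeholder critical scale, in (non-embedding) parameters

def loss_from_size(n_params: float) -> float:
    """Cross-entropy loss predicted from parameter count alone (nats/token)."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e7, 1e8, 1e9, 1e10):
    print(f"N = {n:.0e} params  ->  L(N) ≈ {loss_from_size(n):.2f}")

# Every 10x in parameters divides the reducible loss by only 10**ALPHA_N ≈ 1.19.
```

Because the exponent is so small, each factor of ten in parameters removes less than 20% of the reducible loss, which is the diminishing-returns pattern discussed below.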
25.4 Dataset Size Scaling
Data Scaling Law
Similarly, performance scales with dataset size:
xychart-beta
title "Loss vs Dataset Size (Power Law)"
x-axis "Tokens (D)" [1e7, 1e8, 1e9, 1e10, 1e11]
y-axis "Cross-Entropy Loss" 2.0 --> 3.5
line "Empirical" [3.0, 2.6, 2.3, 2.1, 2.0]
The Formula
\[L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}\]
Where $\alpha_D$ ≈ 0.095 (empirically determined).
25.5 Compute Scaling
Compute-Dependent Performance
When compute is the limiting factor:
xychart-beta
title "Loss vs Compute (Power Law)"
x-axis "Compute (FLOPs)" [1e18, 1e19, 1e20, 1e21, 1e22]
y-axis "Cross-Entropy Loss" 2.0 --> 3.5
line "Empirical" [3.1, 2.7, 2.4, 2.1, 1.9]
The Formula
\[L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}\]
Where $\alpha_C$ ≈ 0.050 (empirically determined).
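One way to internalize how shallow these exponents are is to ask how much more of each resource it takes to halve the loss. The sketch below does that arithmetic, taking the single-variable power laws of the last three subsections at face value:

```python
# From L = (x_c / x)**alpha, halving the loss requires multiplying resource x
# by a factor of 2**(1 / alpha). Exponents are the ones quoted above.
EXPONENTS = {
    "parameters (alpha_N = 0.076)": 0.076,
    "tokens     (alpha_D = 0.095)": 0.095,
    "compute    (alpha_C = 0.050)": 0.050,
}

for resource, alpha in EXPONENTS.items():
    factor = 2.0 ** (1.0 / alpha)
    print(f"Halve the loss using {resource}: ~{factor:,.0f}x more needed")
```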
25.6 The Unified Scaling Law
Combining All Three
The full scaling law accounts for all dimensions:
\[L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty\]
Where $L_\infty$ is the irreducible loss (theoretical minimum).
graph TB
subgraph "Unified Scaling"
N["Model Size<br/>α_N ≈ 0.076"]
D["Dataset Size<br/>α_D ≈ 0.095"]
L["Irreducible Loss<br/>L_∞"]
SUM["Additive combination"]
LOSS["Final Loss"]
end
N --> SUM
D --> SUM
L --> SUM
SUM --> LOSS
K["Each dimension contributes<br/>additively to the loss"]
SUM --> K
style K fill:#ffe66d,color:#000
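The additive form can be written as a small function and evaluated before any training run. In the sketch below the exponents come from the text; $N_c$, $D_c$, and $L_\infty$ are illustrative placeholders that would need to be refit jointly on your own training runs, so the printed number only demonstrates the mechanics:

```python
# Unified scaling law sketch: L(N, D) = (N_c/N)**a_N + (D_c/D)**a_D + L_inf.
# Exponents follow the text; N_C, D_C, L_INF are illustrative placeholders and
# must be refit jointly before the predictions mean anything.
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13   # placeholder critical scales
L_INF = 1.7                  # placeholder irreducible loss

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted cross-entropy loss from model size and dataset size alone."""
    model_term = (N_C / n_params) ** ALPHA_N   # shrinks as the model grows
    data_term = (D_C / n_tokens) ** ALPHA_D    # shrinks as the dataset grows
    return model_term + data_term + L_INF      # floor set by the irreducible loss

# Example: score a hypothetical 1B-parameter model on 100B tokens before training it.
print(f"L(1e9, 1e11) ≈ {predicted_loss(1e9, 1e11):.2f}")
```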
25.7 Optimal Allocation
The Compute Budget Question
Given a fixed compute budget $C$, how should we allocate it between:
- Model size $N$
- Training data $D$
- Training steps $S$
graph TB
subgraph "Optimal Allocation"
C["Compute Budget C"]
N["Model Size N"]
D["Data Size D"]
S["Training Steps S"]
end
C -->|"Allocate"| N
C -->|"Allocate"| D
C -->|"Allocate"| S
K["C = 6NDS<br/>(6 FLOPs per parameter per token)"]
C --> K
style K fill:#4ecdc4,color:#fff
The Optimal Ratio
Empirically, the optimal allocation is:
- Model parameters: Scale with compute as $N \propto C^{0.73}$
- Training tokens: Scale as $D \propto C^{0.27}$
graph TB
subgraph "Optimal Allocation"
C["Compute C"]
N["N ∝ C^0.73<br/>(73% to model)"]
D["D ∝ C^0.27<br/>(27% to data)"]
end
C --> N
C --> D
K["Most compute should go<br/>to larger models, not more data"]
N --> K
style K fill:#ffe66d,color:#000
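A sketch of how these exponents might be applied to a concrete budget. The proportionalities alone do not fix absolute values, so the anchor point below (a 1B-parameter model on 20B tokens) is a made-up calibration, and $C \approx 6ND$ serves as a consistency check:

```python
# Compute-optimal allocation sketch using the Kaplan-style exponents above.
# N ∝ C**0.73 and D ∝ C**0.27 fix only ratios, so we scale from a hypothetical
# anchor point; the anchor (1B params, 20B tokens) is an assumption, not a fit.
P_N, P_D = 0.73, 0.27
N0, D0 = 1e9, 2e10           # hypothetical anchor allocation
C0 = 6 * N0 * D0             # ≈ 6 FLOPs per parameter per token processed

def allocate(compute_budget: float) -> tuple[float, float]:
    """Scale the anchor allocation to a new budget with the Kaplan exponents."""
    scale = compute_budget / C0
    return N0 * scale ** P_N, D0 * scale ** P_D

for budget in (1e21, 1e22, 1e23):
    n, d = allocate(budget)
    print(f"C = {budget:.0e} FLOPs -> N ≈ {n:.1e} params, D ≈ {d:.1e} tokens "
          f"(check: 6ND ≈ {6 * n * d:.1e})")
```

Because the two exponents sum to one, the recovered $6ND$ always matches the budget; swapping in Chinchilla-style exponents (0.5/0.5) keeps that property but shifts the split heavily toward data.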
25.8 Diminishing Returns
Why Power Laws Matter
Power laws mean diminishing returns:
graph TB
subgraph "Diminishing Returns"
X1["10× model size<br/>→ ~1.2× better"]
X2["100× model size<br/>→ ~1.4× better"]
X3["1000× model size<br/>→ ~1.6× better"]
end
K["Each order of magnitude<br/>gives less improvement"]
X1 --> K
X2 --> K
X3 --> K
style K fill:#ff6b6b,color:#fff
The Implications
- 10× compute → ~1.1× lower loss
- 100× compute → ~1.3× lower loss
- 1000× compute → ~1.4× lower loss
This is why training GPT-3 required massive compute for incremental gains.
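A quick check of those figures, assuming the compute exponent $\alpha_C \approx 0.050$ quoted earlier:

```python
# Loss improvement from multiplying compute by k, using L(C) ∝ C**(-alpha_C):
# the reducible loss shrinks by a factor of k**alpha_C.
ALPHA_C = 0.050
for k in (10, 100, 1000):
    print(f"{k:>5}x compute -> loss divided by ~{k ** ALPHA_C:.2f}")
```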
25.9 The Chinchilla Paper (Follow-up)
Challenging the Allocation
The Chinchilla paper (2022) found different optimal ratios:
graph TB
subgraph "Allocation Comparison"
K["Kaplan et al. (2020)<br/>N ∝ C^0.73, D ∝ C^0.27"]
C["Chinchilla (2022)<br/>N ∝ C^0.5, D ∝ C^0.5"]
end
K -->|"More to model"| DIFF["Different optimal ratio"]
C -->|"Equal allocation"| DIFF
K2["Chinchilla suggests<br/>more data is needed"]
DIFF --> K2
style K2 fill:#ffe66d,color:#000
The Debate
- Kaplan et al.: favor much larger models trained on comparatively little data
- Chinchilla: scale model size and data in equal proportion
The discrepancy is largely attributed to methodological differences (notably learning-rate schedules and the range of scales studied), and Chinchilla-style balanced scaling is now the more common guidance for compute-optimal pretraining. The better choice still depends on the use case: when inference cost matters more than training cost, a smaller model trained on extra data can win overall. The rough comparison below shows how different the two prescriptions look at the same budget.
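The sketch below contrasts the two prescriptions at one fixed budget. The Chinchilla branch uses the widely quoted "roughly 20 tokens per parameter" heuristic from Hoffmann et al.; the Kaplan branch scales from a made-up anchor point, so both columns are illustrative rather than exact:

```python
# Contrast Kaplan-style vs Chinchilla-style allocation at one compute budget.
# Chinchilla branch: D ≈ 20*N (common rule of thumb), with C ≈ 6*N*D.
# Kaplan branch: N ∝ C**0.73 scaled from a hypothetical anchor (1B params, 20B tokens).
C_BUDGET = 1e23  # FLOPs

# Chinchilla-style: solve C = 6 * N * (20 * N)  =>  N = sqrt(C / 120).
n_chinchilla = (C_BUDGET / 120) ** 0.5
d_chinchilla = 20 * n_chinchilla

# Kaplan-style: scale the anchor point with the 0.73 / 0.27 exponents.
N0, D0 = 1e9, 2e10
scale = C_BUDGET / (6 * N0 * D0)
n_kaplan, d_kaplan = N0 * scale ** 0.73, D0 * scale ** 0.27

print(f"Chinchilla-style: N ≈ {n_chinchilla:.1e} params, D ≈ {d_chinchilla:.1e} tokens")
print(f"Kaplan-style:     N ≈ {n_kaplan:.1e} params, D ≈ {d_kaplan:.1e} tokens")
# Same budget, very different shapes: the Chinchilla recipe trains a smaller
# model on far more tokens.
```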
25.10 Practical Implications
For Training Large Models
graph TB
subgraph "Training Strategy"
BUDGET["Compute Budget"]
ALLOC["Allocate: 73% model, 27% data<br/>(or 50/50 per Chinchilla)"]
TRAIN["Train until convergence"]
EVAL["Evaluate on validation"]
end
BUDGET --> ALLOC --> TRAIN --> EVAL
K["Use scaling laws to<br/>predict performance<br/>before training"]
EVAL --> K
style K fill:#4ecdc4,color:#fff
Predicting Performance
You can estimate performance before training:
\[L(N, D) = \left(\frac{N_c}{N}\right)^{0.076} + \left(\frac{D_c}{D}\right)^{0.095} + L_\infty\]
25.11 The Compute Frontier
Historical Scaling
timeline
title Approximate Training Compute Over Time
2012 : AlexNet
: ~10^17 FLOPs
2015 : ResNet
: ~10^18 FLOPs
2018 : BERT
: ~10^19 FLOPs
2020 : GPT-3
: ~10^23 FLOPs
2023 : GPT-4
: ~10^25 FLOPs (estimated)
Future Projections
If trends continue:
- 2025: ~10^27 FLOPs
- 2030: ~10^30 FLOPs
But diminishing returns mean each order of magnitude gives less improvement.
25.12 Connection to MDL (Chapter 1)
The Compression View
From Chapter 1, MDL minimizes: $L(H) + L(D|H)$
For scaling laws:
- L(H): Model description length (scales with $N$)
- L(D|H): Data description length given model (scales with $D$)
graph TB
subgraph "MDL ↔ Scaling Laws"
MDL["MDL: L(H) + L(D|H)"]
SCALE["Scaling: L(N) + L(D)"]
end
MDL -->|"equivalent"| SCALE
K["Scaling laws quantify<br/>the MDL trade-off"]
SCALE --> K
style K fill:#4ecdc4,color:#fff
25.13 The Data Efficiency Question
How Much Data Is Enough?
graph TB
subgraph "Data Efficiency"
SMALL["Small model<br/>Needs less data"]
LARGE["Large model<br/>Needs more data"]
end
K["Larger models are<br/>more data-hungry<br/>but also more capable"]
SMALL --> K
LARGE --> K
style K fill:#ffe66d,color:#000
The Sweet Spot
There’s an optimal model size for a given dataset:
- Too small: Underfits
- Too large: Overfits (needs more data)
25.14 Connection to Other Chapters
graph TB
CH25["Chapter 25<br/>Scaling Laws"]
CH25 --> CH1["Chapter 1: MDL<br/><i>Model-data trade-off</i>"]
CH25 --> CH23["Chapter 23: VLAE<br/><i>Rate-distortion scaling</i>"]
CH25 --> CH24["Chapter 24: Deep Speech 2<br/><i>Scale enables performance</i>"]
CH25 --> CH26["Chapter 26: GPipe<br/><i>Training large models</i>"]
style CH25 fill:#ff6b6b,color:#fff
25.15 Key Equations Summary
Model Size Scaling
\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076\]
Dataset Size Scaling
\[L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095\]
Compute Scaling
\[L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050\]
Unified Law
\[L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty\]
Optimal Allocation (Kaplan)
\[N \propto C^{0.73}, \quad D \propto C^{0.27}\]
Compute Formula
\[C \approx 6ND, \quad D = \text{total tokens processed}\]
25.16 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Performance scales as<br/>power laws with N, D, C"]
T2["Diminishing returns:<br/>10× compute → ~1.4× better"]
T3["Optimal allocation:<br/>~73% to model, ~27% to data"]
T4["Can predict performance<br/>before training"]
T5["Connects to MDL principles"]
end
T1 --> C["Scaling laws reveal that neural<br/>network performance improves as<br/>power laws with model size, data,<br/>and compute, with diminishing returns<br/>that guide optimal resource allocation<br/>for training large language models."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Scaling laws reveal that neural network performance improves as power laws with model size, data, and compute, with diminishing returns that guide optimal resource allocation for training large language models.
Exercises
- Conceptual: Why do power laws lead to diminishing returns? What would linear or exponential scaling imply?
- Mathematical: If a model with $10^9$ parameters achieves loss 2.5, what loss would you expect from a $10^{10}$ parameter model (assuming optimal data allocation)?
- Analysis: Compare the Kaplan et al. optimal allocation (73/27) with Chinchilla’s (50/50). Under what conditions would each be better?
- Extension: How would you modify the scaling laws for different architectures (CNNs, GNNs)? What factors might change?
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Kaplan et al., 2020) | arXiv:2001.08361 |
| Chinchilla Paper (Hoffmann et al., 2022) | arXiv:2203.15556 |
| GPT-3 Paper | arXiv:2005.14165 |
| Scaling Vision Transformers (Zhai et al., 2021) | arXiv:2106.04560 |
| Broken Neural Scaling Laws (Caballero et al., 2022) | arXiv:2210.14891 |
| Compute Trends | Epoch AI |
Next Chapter: Chapter 26: GPipe - Efficient Training of Giant Neural Networks — We explore pipeline parallelism, a technique for training models that don’t fit on a single GPU, enabling the massive models predicted by scaling laws.