Chapter 25: Scaling Laws for Neural Language Models

“We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.”

Based on: “Scaling Laws for Neural Language Models” (Jared Kaplan, Sam McCandlish, Tom Henighan, et al., 2020)

📄 Original Paper: arXiv:2001.08361 (OpenAI)

25.1 The Scaling Question

As neural networks get bigger, how does performance improve? Is it linear? Exponential? Something else?

graph TB
    subgraph "The Scaling Question"
        Q1["Double the model size<br/>→ Double the performance?"]
        Q2["Double the data<br/>→ Double the performance?"]
        Q3["Double the compute<br/>→ Double the performance?"]
    end
    
    A["We need empirical laws<br/>to understand scaling"]
    
    Q1 --> A
    Q2 --> A
    Q3 --> A
    
    style A fill:#ffe66d,color:#000

This paper provides empirical answers based on training hundreds of models.


25.2 The Three Dimensions of Scaling

Compute, Data, and Model Size

graph TB
    subgraph "Scaling Dimensions"
        C["Compute<br/>(FLOPs)"]
        D["Data<br/>(tokens)"]
        M["Model Size<br/>(parameters)"]
    end
    
    P["Performance<br/>(cross-entropy loss)"]
    
    C --> P
    D --> P
    M --> P
    
    K["All three affect performance<br/>but in different ways"]
    
    P --> K
    
    style K fill:#4ecdc4,color:#fff

The Relationship

\[L = L(N, D, C)\]

where $N$ is the number of (non-embedding) model parameters, $D$ is the number of training tokens, and $C$ is the total training compute in FLOPs. Lower cross-entropy loss means better performance.

25.3 The Power Law Discovery

Model Size Scaling

Performance scales as a power law with model size:

xychart-beta
    title "Loss vs Model Size (Power Law)"
    x-axis "Parameters (N)" [1e6, 1e7, 1e8, 1e9, 1e10]
    y-axis "Cross-Entropy Loss" 2.0 --> 3.5
    line "Empirical" [3.2, 2.8, 2.5, 2.2, 2.0]

The Formula

\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}\]

Where:

  • $N_c$ = a fitted constant that sets the scale (roughly $8.8 \times 10^{13}$ non-embedding parameters in the paper's fit)
  • $\alpha_N \approx 0.076$ (empirically determined)

Key insight: Performance improves, but with diminishing returns.
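
To make the formula concrete, here is a minimal Python sketch that evaluates $L(N)$. The constants are approximately the values reported in the paper ($\alpha_N \approx 0.076$, $N_c \approx 8.8 \times 10^{13}$ non-embedding parameters); treat them as illustrative rather than a precise reproduction of the fit.

```python
# Minimal sketch: model-size scaling law L(N) = (N_c / N) ** alpha_N.
# Constants are roughly those reported by Kaplan et al. (2020) and are
# illustrative, not a re-fit.

ALPHA_N = 0.076   # model-size exponent
N_C = 8.8e13      # scale constant (non-embedding parameters)

def loss_from_model_size(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) when data is not the bottleneck."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e6, 1e7, 1e8, 1e9, 1e10):
    print(f"N = {n:.0e}  ->  L(N) ≈ {loss_from_model_size(n):.2f}")
```

The exact numbers depend on tokenization and the fitted constants, so read the outputs as order-of-magnitude illustrations of the trend, not predictions for any particular model.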


25.4 Dataset Size Scaling

Data Scaling Law

Similarly, performance scales with dataset size:

xychart-beta
    title "Loss vs Dataset Size (Power Law)"
    x-axis "Tokens (D)" [1e7, 1e8, 1e9, 1e10, 1e11]
    y-axis "Cross-Entropy Loss" 2.0 --> 3.5
    line "Empirical" [3.0, 2.6, 2.3, 2.1, 2.0]

The Formula

\[L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}\]

Where $D_c$ is a fitted scale constant (roughly $5.4 \times 10^{13}$ tokens in the paper's fit) and $\alpha_D \approx 0.095$ (empirically determined).
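
As a quick worked example of what this exponent means: under the single-variable form above, growing the dataset by a factor of 10 multiplies the (reducible) loss by a fixed factor, regardless of the starting point:

\[\frac{L(10D)}{L(D)} = \left(\frac{D_c}{10D}\right)^{\alpha_D} \Big/ \left(\frac{D_c}{D}\right)^{\alpha_D} = 10^{-0.095} \approx 0.80\]

So each 10× increase in data trims roughly 20% off the reducible loss.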


25.5 Compute Scaling

Compute-Dependent Performance

When compute is the limiting factor:

xychart-beta
    title "Loss vs Compute (Power Law)"
    x-axis "Compute (FLOPs)" [1e18, 1e19, 1e20, 1e21, 1e22]
    y-axis "Cross-Entropy Loss" 2.0 --> 3.5
    line "Empirical" [3.1, 2.7, 2.4, 2.1, 1.9]

The Formula

\[L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}\]

Where $\alpha_C \approx 0.050$ (empirically determined; this exponent assumes the budget is allocated optimally across model size and training steps).
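
The same arithmetic applies to compute: with $\alpha_C \approx 0.050$, a 10× larger budget multiplies the reducible loss by

\[10^{-\alpha_C} = 10^{-0.050} \approx 0.89,\]

i.e. only about an 11% reduction per order of magnitude when compute is the binding constraint.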


25.6 The Unified Scaling Law

Combining Model and Data

A combined scaling law captures model size and data together (compute enters indirectly, through how large a model you train and on how many tokens):

\[L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty\]

Where $L_\infty$ is the irreducible loss (the entropy of the text itself, which no model can go below). Note that this additive form is the simpler parameterization later popularized by the Chinchilla paper; Kaplan et al.'s original joint fit uses $L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$, which behaves similarly in the regimes studied.
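
Below is a minimal Python sketch of evaluating this additive form. The exponents and scale constants are roughly the single-variable values quoted above, and $L_\infty$ is a placeholder: in a real joint fit (for example the Chinchilla-style $E + A N^{-\alpha} + B D^{-\beta}$ parameterization) all of the constants are re-estimated together, so do not expect this naive combination to reproduce the single-variable curves exactly.

```python
# Sketch of the combined (additive) scaling law
#     L(N, D) = (N_c / N)**alpha_N + (D_c / D)**alpha_D + L_inf
# Exponents/scales are roughly the single-variable fits from the paper;
# L_INF is a purely illustrative placeholder, not a value from the paper.

ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13   # parameters, tokens (approximate paper values)
L_INF = 0.0                 # set > 0 to model an irreducible-loss floor

def combined_loss(n_params: float, n_tokens: float) -> float:
    return (N_C / n_params) ** ALPHA_N + (D_C / n_tokens) ** ALPHA_D + L_INF

print(combined_loss(1e9, 1e11))   # 1B parameters, 100B tokens
print(combined_loss(1e10, 1e11))  # 10x the parameters, same data
```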

graph TB
    subgraph "Unified Scaling"
        N["Model Size<br/>α_N ≈ 0.076"]
        D["Dataset Size<br/>α_D ≈ 0.095"]
        L["Irreducible Loss<br/>L_∞"]
        SUM["Additive combination"]
        LOSS["Final Loss"]
    end
    
    N --> SUM
    D --> SUM
    L --> SUM
    SUM --> LOSS
    
    K["Each dimension contributes<br/>additively to the loss"]
    
    SUM --> K
    
    style K fill:#ffe66d,color:#000

25.7 Optimal Allocation

The Compute Budget Question

Given a fixed compute budget $C$, how should we allocate it between:

  • Model size $N$
  • Training data $D$
  • Training steps $S$

graph TB
    subgraph "Optimal Allocation"
        C["Compute Budget C"]
        N["Model Size N"]
        D["Data Size D"]
        S["Training Steps S"]
    end
    
    C -->|"Allocate"| N
    C -->|"Allocate"| D
    C -->|"Allocate"| S
    
    K["C = 6NDS<br/>(6 FLOPs per parameter per token)"]
    
    C --> K
    
    style K fill:#4ecdc4,color:#fff
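
A back-of-the-envelope helper for this rule is sketched below. $C \approx 6ND$ is the standard approximation for dense transformers (roughly 2 FLOPs per parameter per token in the forward pass and 4 in the backward pass); the example numbers are the commonly cited GPT-3 configuration.

```python
# Back-of-the-envelope training compute for a dense transformer:
#   C ≈ 6 * N * D   (≈2 FLOPs/param/token forward, ≈4 backward)

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

# Example: 175B parameters trained on 300B tokens (roughly GPT-3)
# lands near 3e23 FLOPs, matching the commonly quoted figure.
print(f"{training_flops(175e9, 300e9):.2e}")
```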

The Optimal Ratio

Empirically, the compute-optimal allocation scales as:

  • Model parameters: $N \propto C^{0.73}$
  • Training tokens: $D \propto C^{0.27}$

(These exponents describe how $N$ and $D$ should grow as the budget $C$ grows, not a percentage split of the budget itself.)

graph TB
    subgraph "Optimal Allocation"
        C["Compute C"]
        N["N ∝ C^0.73<br/>(73% to model)"]
        D["D ∝ C^0.27<br/>(27% to data)"]
    end
    
    C --> N
    C --> D
    
    K["Most compute should go<br/>to larger models, not more data"]
    
    N --> K
    
    style K fill:#ffe66d,color:#000
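
Since the paper's absolute fitted constants for $N_{\text{opt}}(C)$ are not reproduced here, a practical way to use these exponents is to scale from a reference run you already trust. The sketch below does exactly that; the reference point is hypothetical.

```python
# Scale a known (compute, model-size) reference point to a new budget using
# the Kaplan et al. exponents N ∝ C^0.73, D ∝ C^0.27.  The reference run
# below is hypothetical, chosen only to illustrate the scaling.

A_N = 0.73

def kaplan_allocation(c_new: float, c_ref: float, n_ref: float):
    """Return (N_opt, D_opt) for budget c_new, scaled from a reference run."""
    n_opt = n_ref * (c_new / c_ref) ** A_N
    d_opt = c_new / (6.0 * n_opt)   # keep C ≈ 6*N*D consistent (so D ∝ C^0.27)
    return n_opt, d_opt

# Hypothetical reference: a 1e9-parameter model trained with 1e20 FLOPs.
# Ask how to spend a 100x larger budget.
n, d = kaplan_allocation(1e22, c_ref=1e20, n_ref=1e9)
print(f"N ≈ {n:.2e} parameters, D ≈ {d:.2e} tokens")
```

Because $0.73 + 0.27 = 1$, tying $D$ to the budget via $C \approx 6ND$ automatically yields the $D \propto C^{0.27}$ scaling.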

25.8 Diminishing Returns

Why Power Laws Matter

Power laws mean diminishing returns:

graph TB
    subgraph "Diminishing Returns"
        X1["10× model size<br/>→ ~1.2× better"]
        X2["100× model size<br/>→ ~1.4× better"]
        X3["1000× model size<br/>→ ~1.6× better"]
    end
    
    K["Each order of magnitude<br/>gives less improvement"]
    
    X1 --> K
    X2 --> K
    X3 --> K
    
    style K fill:#ff6b6b,color:#fff

The Implications

  • 10× compute → loss roughly 1.1× lower
  • 100× compute → loss roughly 1.3× lower
  • 1000× compute → loss roughly 1.4× lower

(With $\alpha_C \approx 0.050$, scaling compute by a factor of $k$ shrinks the reducible loss by a factor of $k^{0.050}$.)

This is why training GPT-3 required massive compute for incremental gains.
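
These factors follow directly from the exponents; a few lines of Python reproduce them (a factor of, say, 1.4 means the reducible loss becomes 1.4× lower):

```python
# Loss-reduction factor k**alpha when a resource is scaled by k,
# assuming the single-variable power laws with the exponents quoted above.

EXPONENTS = {"model size (alpha_N)": 0.076,
             "data (alpha_D)": 0.095,
             "compute (alpha_C)": 0.050}

for name, alpha in EXPONENTS.items():
    row = ", ".join(f"{k:>5g}x -> {k ** alpha:.2f}x lower loss"
                    for k in (10, 100, 1000))
    print(f"{name}: {row}")
```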


25.9 The Chinchilla Paper (Follow-up)

Challenging the Allocation

The Chinchilla paper (2022) found different optimal ratios:

graph TB
    subgraph "Allocation Comparison"
        K["Kaplan et al. (2020)<br/>N ∝ C^0.73, D ∝ C^0.27"]
        C["Chinchilla (2022)<br/>N ∝ C^0.5, D ∝ C^0.5"]
    end
    
    K -->|"More to model"| DIFF["Different optimal ratio"]
    C -->|"Equal allocation"| DIFF
    
    K2["Chinchilla suggests<br/>more data is needed"]
    
    DIFF --> K2
    
    style K2 fill:#ffe66d,color:#000

The Debate

  • Kaplan et al.: Larger models with less data
  • Chinchilla: Balanced model and data scaling

Both papers fit power laws; the disagreement is over where the compute-optimal frontier lies. The difference is largely attributed to details of the experimental setup (notably how learning-rate schedules were handled), and in practice Chinchilla's more data-heavy allocation has become the standard recipe for compute-optimal pretraining. Which one you follow still depends on your compute budget and how the model will be used.
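
To see how different the two recipes are in practice, the sketch below splits a fixed budget both ways. Chinchilla's recommendation is often summarized as roughly 20 training tokens per parameter; the Kaplan-style answer is scaled from a hypothetical reference run, since the paper's absolute constants are not reproduced here.

```python
# Contrast the two allocation recipes for a fixed budget C (FLOPs), using C ≈ 6ND.

def chinchilla_allocation(c: float, tokens_per_param: float = 20.0):
    """Rule-of-thumb Chinchilla split: D ≈ 20 N, so N = sqrt(C / (6 * 20))."""
    n = (c / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n          # (parameters, tokens)

def kaplan_allocation(c: float, c_ref: float = 1e20, n_ref: float = 1e9):
    """Kaplan-style split, scaled from a hypothetical reference run."""
    n = n_ref * (c / c_ref) ** 0.73
    return n, c / (6.0 * n)

for c in (1e21, 1e23):
    print(f"C = {c:.0e}")
    print("  Chinchilla: N ≈ %.2e, D ≈ %.2e" % chinchilla_allocation(c))
    print("  Kaplan    : N ≈ %.2e, D ≈ %.2e" % kaplan_allocation(c))
```

The qualitative pattern is the point: for the same budget, the Kaplan exponents favor a much larger model trained on fewer tokens, while Chinchilla trains a smaller model on far more data.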


25.10 Practical Implications

For Training Large Models

graph TB
    subgraph "Training Strategy"
        BUDGET["Compute Budget"]
        ALLOC["Allocate: 73% model, 27% data<br/>(or 50/50 per Chinchilla)"]
        TRAIN["Train until convergence"]
        EVAL["Evaluate on validation"]
    end
    
    BUDGET --> ALLOC --> TRAIN --> EVAL
    
    K["Use scaling laws to<br/>predict performance<br/>before training"]
    
    EVAL --> K
    
    style K fill:#4ecdc4,color:#fff

Predicting Performance

You can estimate performance before training:

\[L(N, D) = \left(\frac{N_c}{N}\right)^{0.076} + \left(\frac{D_c}{D}\right)^{0.095} + L_\infty\]
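
In practice the constants are not taken on faith: they are re-fit on a handful of small pilot runs and then extrapolated. Below is a minimal sketch of that workflow; the pilot data points are made up for illustration, and only the model-size law is fit (the same log-log regression works for data or compute).

```python
import numpy as np

# Hypothetical pilot runs: (non-embedding parameters, final validation loss).
# For L = (N_c / N)**alpha, log L is linear in log N, so an ordinary
# least-squares fit in log space recovers the constants.
pilot_runs = [(1e6, 3.95), (1e7, 3.35), (1e8, 2.80), (1e9, 2.38)]

log_n = np.log([n for n, _ in pilot_runs])
log_l = np.log([l for _, l in pilot_runs])

slope, intercept = np.polyfit(log_n, log_l, deg=1)
alpha_n = -slope                     # L = exp(intercept) * N**slope
n_c = np.exp(intercept / alpha_n)    # rewrite as (N_c / N)**alpha_n

print(f"alpha_N ≈ {alpha_n:.3f}, N_c ≈ {n_c:.2e}")
print(f"Extrapolated loss at 1e10 parameters: {(n_c / 1e10) ** alpha_n:.2f}")
```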

25.11 The Compute Frontier

Historical Scaling

timeline
    title Compute Scaling Over Time
    2012 : AlexNet
         : ~10^17 FLOPs
    2015 : ResNet
         : ~10^18 FLOPs
    2018 : BERT
         : ~10^20 FLOPs
    2020 : GPT-3
         : ~3×10^23 FLOPs
    2023 : GPT-4
         : ~10^25 FLOPs (estimated)

Future Projections

If trends continue:

  • 2025: ~10^27 FLOPs
  • 2030: ~10^30 FLOPs

But power-law returns mean each additional order of magnitude buys only the same modest multiplicative reduction in loss, so absolute gains keep shrinking.


25.12 Connection to MDL (Chapter 1)

The Compression View

From Chapter 1, MDL minimizes: $L(H) + L(D|H)$

For scaling laws:

  • L(H): Model description length (scales with $N$)
  • L(D|H): Data description length given model (scales with $D$)

graph TB
    subgraph "MDL ↔ Scaling Laws"
        MDL["MDL: L(H) + L(D|H)"]
        SCALE["Scaling: L(N) + L(D)"]
    end
    
    MDL -->|"equivalent"| SCALE
    
    K["Scaling laws quantify<br/>the MDL trade-off"]
    
    SCALE --> K
    
    style K fill:#4ecdc4,color:#fff

25.13 The Data Efficiency Question

How Much Data Is Enough?

graph TB
    subgraph "Data Efficiency"
        SMALL["Small model<br/>Needs less data"]
        LARGE["Large model<br/>Needs more data"]
    end
    
    K["Larger models are<br/>more data-hungry<br/>but also more capable"]
    
    SMALL --> K
    LARGE --> K
    
    style K fill:#ffe66d,color:#000

The Sweet Spot

There’s an optimal model size for a given dataset:

  • Too small: Underfits
  • Too large: Overfits (needs more data)

25.14 Connection to Other Chapters

graph TB
    CH25["Chapter 25<br/>Scaling Laws"]
    
    CH25 --> CH1["Chapter 1: MDL<br/><i>Model-data trade-off</i>"]
    CH25 --> CH23["Chapter 23: VLAE<br/><i>Rate-distortion scaling</i>"]
    CH25 --> CH24["Chapter 24: Deep Speech 2<br/><i>Scale enables performance</i>"]
    CH25 --> CH26["Chapter 26: GPipe<br/><i>Training large models</i>"]
    
    style CH25 fill:#ff6b6b,color:#fff

25.15 Key Equations Summary

Model Size Scaling

\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076\]

Dataset Size Scaling

\[L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095\]

Compute Scaling

\[L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050\]

Unified Law

\[L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty\]

Optimal Allocation (Kaplan)

\[N \propto C^{0.73}, \quad D \propto C^{0.27}\]

Compute Formula

\[C \approx 6ND \quad \text{(about 6 FLOPs per parameter per training token)}\]

25.16 Chapter Summary

graph TB
    subgraph "Key Takeaways"
        T1["Performance scales as<br/>power laws with N, D, C"]
        T2["Diminishing returns:<br/>10× compute → ~1.4× better"]
        T3["Optimal allocation:<br/>~73% to model, ~27% to data"]
        T4["Can predict performance<br/>before training"]
        T5["Connects to MDL principles"]
    end
    
    T1 --> C["Scaling laws reveal that neural<br/>network performance improves as<br/>power laws with model size, data,<br/>and compute, with diminishing returns<br/>that guide optimal resource allocation<br/>for training large language models."]
    T2 --> C
    T3 --> C
    T4 --> C
    T5 --> C
    
    style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px

In One Sentence

Scaling laws reveal that neural network performance improves as power laws with model size, data, and compute, with diminishing returns that guide optimal resource allocation for training large language models.


Exercises

  1. Conceptual: Why do power laws lead to diminishing returns? What would linear or exponential scaling imply?

  2. Mathematical: If a model with $10^9$ parameters achieves loss 2.5, what loss would you expect from a $10^{10}$ parameter model (assuming optimal data allocation)?

  3. Analysis: Compare the Kaplan et al. compute-optimal exponents ($N \propto C^{0.73}$, $D \propto C^{0.27}$) with Chinchilla’s ($N, D \propto C^{0.5}$). Under what conditions would each be better?

  4. Extension: How would you modify the scaling laws for different architectures (CNNs, GNNs)? What factors might change?


References & Further Reading

  • Original paper: Kaplan et al. (2020), “Scaling Laws for Neural Language Models”, arXiv:2001.08361
  • Chinchilla paper: Hoffmann et al. (2022), “Training Compute-Optimal Large Language Models”, arXiv:2203.15556
  • GPT-3 paper: Brown et al. (2020), “Language Models are Few-Shot Learners”, arXiv:2005.14165
  • Scaling laws for vision: Zhai et al. (2021), “Scaling Vision Transformers”, arXiv:2106.04560
  • Broken Neural Scaling Laws: Caballero et al. (2022), arXiv:2210.14891
  • Compute trends over time: Epoch AI

Next Chapter: Chapter 26: GPipe - Efficient Training of Giant Neural Networks — We explore pipeline parallelism, a technique for training models that don’t fit on a single GPU, enabling the massive models predicted by scaling laws.




Educational content based on public research papers. All original papers are cited with links to their sources.