Chapter 1: The Minimum Description Length Principle
“The best model is the one that provides the shortest description of the data.”
Based on: “A Tutorial Introduction to the Minimum Description Length Principle” (Peter Grünwald, 2004)
📄 Original Paper: arXiv:math/0406077
1.1 Why Start Here?
Before we dive into neural networks, transformers, and the architectural marvels of modern AI, we need to understand a fundamental question that has puzzled scientists for centuries:
How do we choose the best explanation for our observations?
Imagine you’re an astronomer in ancient times, tracking the movement of planets across the sky. You could create an incredibly complex model with epicycles upon epicycles that perfectly matches every observation. Or you could propose that planets orbit the sun in ellipses—a simpler explanation that might not fit every data point perfectly but captures the underlying truth.
The Minimum Description Length (MDL) principle gives us a mathematically rigorous way to balance model complexity against how well the model fits the data. This isn’t just philosophy—it’s the theoretical foundation for:
- Why neural networks need regularization
- Why dropout and weight decay work
- Why simpler models often generalize better
- Why “Occam’s Razor” is more than folk wisdom
1.2 The Core Intuition
The Two-Part Code
The MDL principle can be summarized in one powerful idea:
The best hypothesis H for data D is the one that minimizes:
L(H) + L(D|H)
Where:
- L(H) = the length (in bits) needed to describe the hypothesis/model
- L(D|H) = the length (in bits) needed to describe the data given the hypothesis
graph LR
subgraph "Total Description Length"
A["L(H)<br/>Model Complexity"]
B["L(D|H)<br/>Data Given Model"]
end
A --> C["Total = L(H) + L(D|H)"]
B --> C
C --> D["MINIMIZE THIS"]
style A fill:#ff6b6b,color:#fff
style B fill:#4ecdc4,color:#fff
style D fill:#ffe66d,color:#000
Figure: The MDL principle minimizes the sum of model complexity L(H) and data encoding cost L(D|H).
A Simple Example: Fitting Points
Suppose you have 10 data points and want to fit a curve:
graph TB
subgraph "Model A: Straight Line"
A1["L(H) = 10 bits<br/>(2 parameters: slope, intercept)"]
A2["L(D|H) = 50 bits<br/>(some points don't fit perfectly)"]
A3["Total = 60 bits"]
end
subgraph "Model B: Degree-9 Polynomial"
B1["L(H) = 100 bits<br/>(10 parameters)"]
B2["L(D|H) = 0 bits<br/>(fits all points exactly)"]
B3["Total = 100 bits"]
end
subgraph "Model C: Quadratic"
C1["L(H) = 15 bits<br/>(3 parameters)"]
C2["L(D|H) = 30 bits<br/>(reasonable fit)"]
C3["Total = 45 bits ✓"]
end
style A3 fill:#ffcc00
style B3 fill:#ff6666
style C3 fill:#66ff66
Figure: Comparing three models for fitting 10 data points. The quadratic (Model C) achieves the optimal balance with 45 total bits—simpler than the polynomial (100 bits) and better-fitting than the line (60 bits).
The polynomial that passes through all 10 points has zero error but requires specifying 10 coefficients. The straight line is simple but might miss important patterns. The quadratic often wins—it’s the sweet spot between complexity and fit.
1.3 Information Theory Foundations
Bits as a Universal Currency
Why measure in bits? Because bits give us a universal, objective measure of complexity that doesn’t depend on arbitrary choices.
The key insight from information theory:
The optimal code length for an event with probability p is -log₂(p) bits
This is Shannon’s source coding theorem. If something is very likely (p ≈ 1), it takes few bits to encode. If something is rare (p ≈ 0), it takes many bits.
graph TD
subgraph "Probability ↔ Code Length"
P1["p = 1/2"] --> L1["-log₂(1/2) = 1 bit"]
P2["p = 1/4"] --> L2["-log₂(1/4) = 2 bits"]
P3["p = 1/8"] --> L3["-log₂(1/8) = 3 bits"]
P4["p = 1/1024"] --> L4["-log₂(1/1024) = 10 bits"]
end
Figure: The relationship between probability and code length. Lower probability events require more bits to encode, following Shannon’s information theory.
Why This Matters for Machine Learning
When a model assigns high probability to the observed data, it takes fewer bits to encode the data using that model. This creates a direct link:
Good predictions ⟺ Short code lengths ⟺ Low L(D|H)
1.4 The Fundamental Trade-off
Overfitting vs Underfitting Through the MDL Lens
graph TB
subgraph "The MDL Trade-off Curve"
direction LR
U["UNDERFITTING<br/>Simple model<br/>High L(D|H)<br/>Low L(H)"]
O["OPTIMAL<br/>Right complexity<br/>Balanced"]
V["OVERFITTING<br/>Complex model<br/>Low L(D|H)<br/>High L(H)"]
end
U -.->|"Add complexity"| O
O -.->|"Add more complexity"| V
style U fill:#ff9999
style O fill:#99ff99
style V fill:#ff9999
Figure: The MDL trade-off spectrum. Underfitting models are too simple (high data cost), overfitting models are too complex (high model cost), while optimal models balance both costs.
Visualizing the Trade-off
Imagine plotting total description length against model complexity:
xychart-beta
title "MDL Trade-off Visualization"
x-axis "Model Complexity" [Simple, Low, Medium-Low, Medium, Medium-High, High, Complex]
y-axis "Bits Required" 0 --> 200
line "L(H) - Model Cost" [10, 25, 45, 70, 100, 135, 175]
line "L(D|H) - Data Cost" [150, 80, 45, 35, 32, 31, 30]
line "Total L(H)+L(D|H)" [160, 105, 90, 105, 132, 166, 205]
Figure: Visualization of the MDL trade-off. As model complexity increases, L(H) increases while L(D|H) decreases. The optimal model minimizes the total (shown at “Medium” complexity with 90 bits).
The minimum of the total curve gives us the optimal model complexity. This is MDL in action!
1.5 Two-Part Codes in Detail
The Sender-Receiver Game
A helpful way to understand MDL is through a communication game:
Setup:
- A sender observes data D
- A sender and receiver have agreed on a set of possible models/hypotheses
- The sender must transmit D to the receiver using as few bits as possible
Strategy:
- First, send which model H to use (costs L(H) bits)
- Then, send the data encoded using model H (costs L(D|H) bits)
sequenceDiagram
participant S as Sender
participant R as Receiver
Note over S: Has data D, chooses model H
S->>R: Send model H (L(H) bits)
Note over R: Now knows the model
S->>R: Send D encoded with H (L(D|H) bits)
Note over R: Decodes to get D
Note over S,R: Total cost: L(H) + L(D|H) bits
Figure: The sender-receiver communication game. The sender first transmits the model H (costing L(H) bits), then transmits the data encoded using that model (costing L(D|H) bits). MDL minimizes the total transmission cost.
The MDL principle says: choose the model that minimizes total transmission cost.
Example: Encoding a Sequence
Suppose we want to transmit the sequence: 0000000001111111111
Model 1: No pattern (raw encoding)
- L(H) = 0 bits (no model to describe)
- L(D|H) = 19 bits (one bit per character)
- Total = 19 bits
Model 2: Run-length encoding
- L(H) = ~5 bits (describe the RLE scheme)
- L(D|H) = ~8 bits (say “9 zeros, then 10 ones”)
- Total ≈ 13 bits ✓
Model 3: Elaborate compression with dictionary
- L(H) = 50 bits (complex scheme description)
- L(D|H) = 2 bits
- Total = 52 bits ✗
The second model wins—it finds the right level of abstraction.
1.6 Mathematical Formulation
Formal Definition
Given:
- A class of models M = {M₁, M₂, M₃, …}
- Observed data D
The MDL estimate is:
\[\hat{M}_{MDL} = \arg\min_{M \in \mathcal{M}} \left[ L(M) + L(D|M) \right]\]For Parametric Models
When models are parameterized (like neural networks with weights θ):
\[\hat{\theta}_{MDL} = \arg\min_{\theta} \left[ L(\theta) + L(D|\theta) \right]\]Where:
- L(θ) often relates to the precision needed to specify parameters
- L(D|θ) is typically the negative log-likelihood: $-\log P(D|\theta)$
graph LR
subgraph "Connecting MDL to ML"
A["L(θ) = Prior<br/>-log P(θ)"]
B["L(D|θ) = Likelihood<br/>-log P(D|θ)"]
C["Total = -log P(θ)P(D|θ)<br/>= -log P(θ|D) + const"]
end
A --> C
B --> C
C --> D["This is MAP estimation!"]
style D fill:#ffe66d
Figure: MDL connects to Bayesian inference. The model description length L(θ) corresponds to the negative log-prior, while L(D|θ) corresponds to the negative log-likelihood. Together, they form the negative log-posterior (MAP estimation).
The MDL ↔ Bayesian Connection
MDL and Bayesian inference are deeply connected:
| MDL Concept | Bayesian Equivalent |
|---|---|
| L(H) | -log P(H) = Prior |
| L(D|H) | -log P(D|H) = Negative log-likelihood |
| Minimize L(H) + L(D|H) | Maximize P(H)P(D|H) = MAP |
However, MDL has advantages:
- Doesn’t require specifying priors
- Provides an objective, code-based justification
- Works even when Bayesian interpretation is unclear
1.7 Crude MDL vs Refined MDL
Crude (Two-Part) MDL
What we’ve discussed so far is “crude” MDL:
- Pick the best model
- Describe model + data given model
- Simple to understand and implement
Refined (One-Part) MDL
Modern MDL theory uses a more sophisticated approach:
Instead of separating model and data, use a universal code that automatically adapts to the data’s complexity.
graph TB
subgraph "Crude MDL"
A1["Step 1: Choose model H*"]
A2["Step 2: Encode H*"]
A3["Step 3: Encode D given H*"]
A1 --> A2 --> A3
end
subgraph "Refined MDL"
B1["Use a single universal code"]
B2["Code automatically<br/>adapts complexity"]
B3["No explicit model selection"]
B1 --> B2 --> B3
end
Figure: Comparison of crude MDL (explicit model selection) versus refined MDL (universal coding that automatically adapts to complexity without explicit model choice).
The refined version uses concepts like:
- Normalized Maximum Likelihood (NML)
- Stochastic Complexity
- Prequential codes
We’ll keep our focus on the intuitive two-part version, which captures the essential ideas.
1.8 MDL in Practice: Machine Learning Applications
Regularization as MDL
Every regularization technique can be viewed through the MDL lens:
graph TB
subgraph "Regularization = MDL"
L["Loss Function<br/>= L(D|θ)"]
R["Regularization Term<br/>= L(θ)"]
T["Total Objective<br/>= L(D|θ) + λL(θ)"]
end
L --> T
R --> T
subgraph "Examples"
E1["L2 Regularization<br/>||θ||² ↔ Gaussian prior on weights"]
E2["L1 Regularization<br/>||θ||₁ ↔ Laplace prior on weights"]
E3["Dropout<br/>↔ Model averaging/compression"]
end
T --> E1
T --> E2
T --> E3
Figure: Regularization in machine learning is equivalent to MDL. The loss function corresponds to L(D|θ), regularization to L(θ), and common techniques like L2, L1, and dropout all have MDL interpretations.
Neural Network Compression
The MDL perspective explains why:
-
Pruning works: Removing unnecessary weights reduces L(θ) without much increase in L(D|θ)
-
Quantization works: Using fewer bits per weight reduces L(θ)
-
Knowledge distillation works: A smaller model with similar predictions has lower total description length
Model Selection
MDL provides principled model selection:
# Pseudo-code for MDL model selection
def mdl_select(models, data):
best_model = None
best_mdl = float('inf')
for model in models:
# Fit model to data
model.fit(data)
# Compute description lengths
L_model = compute_model_complexity(model) # L(H)
L_data = -model.log_likelihood(data) # L(D|H)
total_mdl = L_model + L_data
if total_mdl < best_mdl:
best_mdl = total_mdl
best_model = model
return best_model
1.9 Historical Context and Significance
The Intellectual Lineage
timeline
title Evolution of Simplicity Principles
1347 : William of Ockham
: "Occam's Razor"
: Prefer simpler explanations
1948 : Claude Shannon
: Information Theory
: Bits as universal measure
1960s : Andrey Kolmogorov
: Ray Solomonoff
: Algorithmic Information Theory
1978 : Jorma Rissanen
: MDL Principle
: Formal framework
1993 : Hinton & Van Camp
: MDL for Neural Nets
: "Keeping NNs Simple"
2020s : Modern Deep Learning
: Implicit regularization
: Lottery ticket hypothesis
Figure: Historical timeline of simplicity principles, from Occam’s Razor (1347) through information theory (1948), algorithmic information theory (1960s), MDL (1978), neural network applications (1993), to modern deep learning (2020s).
Why Ilya Included This
The MDL principle appears in Ilya’s reading list because it provides:
- Theoretical grounding: Understanding WHY certain techniques work
- Design principles: Guiding architecture and regularization choices
- Generalization insight: Explaining the bias-variance trade-off
- Connection to physics: Links to thermodynamics and complexity
1.10 Connections to Other Papers in This Book
The MDL principle connects directly to several other papers in Ilya’s list:
graph TB
MDL["Chapter 1<br/>MDL Principle"]
MDL --> KC["Chapter 2<br/>Kolmogorov Complexity<br/><i>Theoretical foundation</i>"]
MDL --> NN["Chapter 3<br/>Keeping NNs Simple<br/><i>Direct application</i>"]
MDL --> VLAE["Chapter 23<br/>Variational Lossy AE<br/><i>Rate-distortion theory</i>"]
MDL --> SCALE["Chapter 25<br/>Scaling Laws<br/><i>Optimal allocation</i>"]
MDL --> COFFEE["Chapter 4<br/>Coffee Automaton<br/><i>Complexity dynamics</i>"]
style MDL fill:#ff6b6b,color:#fff
style KC fill:#4ecdc4
style NN fill:#4ecdc4
style VLAE fill:#95e1d3
style SCALE fill:#95e1d3
style COFFEE fill:#4ecdc4
Figure: MDL connects to multiple chapters throughout the book, providing theoretical foundations (Kolmogorov Complexity), direct applications (Keeping NNs Simple), and connections to modern topics (VLAE, Scaling Laws).
1.11 Key Equations Summary
The Fundamental Equation
\[\text{MDL Objective: } \min_H \left[ L(H) + L(D|H) \right]\]Code Length and Probability
\[L(x) = -\log_2 P(x) \text{ bits}\]For Continuous Parameters
\[L(\theta) \approx \frac{k}{2} \log n + \text{const}\]Where k = number of parameters, n = number of data points
(This is the BIC/MDL approximation)
Stochastic Complexity (Advanced)
\[SC(D) = -\log P(D|\hat{\theta}_{ML}) + \text{COMP}\]Where COMP measures the “richness” of the model class.
1.12 Practical Takeaways
For Practitioners
-
When choosing between models, prefer the one that would require fewer bits to transmit your predictions to a colleague who knows the model
-
Regularization strength λ can be interpreted as the relative cost of model bits vs data bits
-
Early stopping in training often finds the MDL-optimal point
-
Model architecture choices should consider description length—more parameters need more data to justify
Mental Model
graph LR
subgraph "The MDL Mindset"
Q["When in doubt, ask:"]
Q --> A1["'How many bits to describe<br/>my model?'"]
Q --> A2["'How many bits to describe<br/>my errors?'"]
Q --> A3["'Can I reduce the total?'"]
end
Figure: The MDL mindset provides a practical framework for model design: always consider both model complexity and data encoding cost, and seek to minimize their sum.
1.13 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["MDL = Occam's Razor<br/>formalized in bits"]
T2["Total cost = Model complexity<br/>+ Encoding residuals"]
T3["Equivalent to regularized<br/>maximum likelihood"]
T4["Foundation for understanding<br/>generalization"]
end
T1 --> C["The simplest model that<br/>explains the data well<br/>will generalize best"]
T2 --> C
T3 --> C
T4 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
Figure: Key takeaways from the MDL principle, showing how it formalizes Occam’s Razor, connects to regularization, and provides a foundation for understanding generalization.
In One Sentence
The Minimum Description Length principle tells us that the best model is one that achieves the shortest combined description of itself and the data it explains—formalizing the intuition that good theories are both simple and accurate.
Exercises
-
Conceptual: You have two weather prediction models. Model A uses 50 parameters and predicts tomorrow’s weather correctly 95% of the time. Model B uses 5 parameters and predicts correctly 90% of the time. Which might have lower MDL? Under what conditions?
-
Coding: Implement a simple MDL-based polynomial degree selection for curve fitting. Compare the selected degree to what cross-validation chooses.
-
Thought Experiment: How would you apply MDL thinking to decide between a decision tree and a neural network for a classification problem with 1,000 training examples?
References & Further Reading
| Resource | Link |
|---|---|
| Original Tutorial Paper (Grünwald, 2004) | arXiv:math/0406077 |
| Rissanen’s Original MDL Paper (1978) | IEEE |
| Grünwald’s MDL Book (2007) | MIT Press |
| MDL Tutorial Slides | CWI Amsterdam |
| Wallace - Minimum Message Length (2005) | Springer |
| MacKay - Information Theory (2003) | Free Online |
| Blier & Ollivier - DL Description Length (2018) | arXiv:1802.07044 |
| Lotfi et al. - PAC-Bayes Compression (2022) | arXiv:2211.13609 |
Next Chapter: Chapter 2: Kolmogorov Complexity and Algorithmic Randomness — We’ll explore the theoretical foundation that MDL builds upon: the idea that complexity can be measured by the shortest program that produces an output.