Chapter 2: Kolmogorov Complexity and Algorithmic Randomness
“The complexity of an object is the length of the shortest program that produces it.”
Based on: “Kolmogorov Complexity and Algorithmic Randomness” (Shen, Uspensky, Vereshchagin)
| 📄 Original Book: AMS Mathematical Surveys | LIRMM PDF |
2.1 From MDL to Something Deeper
In Chapter 1, we learned that MDL measures the quality of a model by how compactly it lets us describe data. But this raises a fundamental question:
What is the ultimate limit of compression? What is the “true” complexity of a string or object?
Kolmogorov Complexity answers this by defining complexity in terms of computation itself. It’s one of the most beautiful ideas in computer science—and it has profound implications for understanding intelligence, randomness, and the nature of patterns.
2.2 The Central Definition
Kolmogorov Complexity Defined
The Kolmogorov Complexity K(x) of a string x is:
The length of the shortest program that outputs x and then halts.
graph LR
subgraph "Kolmogorov Complexity"
P["Shortest Program p"]
U["Universal<br/>Turing Machine"]
X["Output: x"]
P -->|"input"| U
U -->|"produces"| X
end
L["K(x) = |p| = length of p"]
P --> L
style P fill:#ff6b6b,color:#fff
style X fill:#4ecdc4,color:#fff
style L fill:#ffe66d,color:#000
Figure: Kolmogorov complexity K(x) is defined as the length of the shortest program p that produces string x when run on a universal Turing machine U.
Intuitive Examples
| String | Description | Approximate K(x) |
|---|---|---|
| 0000000000000000 | “Print 0 sixteen times” | ~20 bits |
| 0101010101010101 | “Print 01 eight times” | ~22 bits |
| 1100100100001111... (π digits) | “Compute π to n digits” | ~log(n) + C |
| 01101000111010001... (random) | No pattern; must store literally | ~n bits |
The key insight: Strings with patterns are compressible. Random strings are not.
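One quick way to feel this is to compress the strings with an off-the-shelf compressor. This is not K(x) itself (that is uncomputable), only an upper bound, but the gap is visible. A minimal sketch in Python, using only the standard library:

```python
import os
import zlib

# Not K(x) itself, just an upper bound from a real compressor.
patterned = b"0" * 16_000           # "print '0' 16,000 times"
periodic  = b"01" * 8_000           # "print '01' 8,000 times"
random_ish = os.urandom(16_000)     # no exploitable pattern, with overwhelming probability

for name, s in [("patterned", patterned), ("periodic", periodic), ("random", random_ish)]:
    compressed = zlib.compress(s, 9)
    print(f"{name:9s}  raw = {len(s):6d} bytes   compressed = {len(compressed):6d} bytes")
# The patterned strings collapse to a few dozen bytes; the random bytes do not
# shrink at all (zlib's small header even makes them slightly longer).
```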
2.3 The Universal Turing Machine
Why “Universal”?
Kolmogorov Complexity depends on a choice of programming language or Turing machine. But here’s the remarkable fact:
The choice of language only changes K(x) by a constant!
This is called the Invariance Theorem.
graph TB
subgraph "Invariance Theorem"
U1["Universal TM U₁<br/>(Python-like)"]
U2["Universal TM U₂<br/>(C-like)"]
U3["Universal TM U₃<br/>(Lisp-like)"]
X["String x"]
U1 -->|"K_U₁(x)"| X
U2 -->|"K_U₂(x)"| X
U3 -->|"K_U₃(x)"| X
end
EQ["|K_U₁(x) - K_U₂(x)| ≤ c<br/>where c is a constant<br/>independent of x"]
style EQ fill:#ffe66d,color:#000
Figure: The invariance theorem states that Kolmogorov complexity is universal—different universal Turing machines (U₁, U₂, U₃) give complexities that differ by at most a constant, making K(x) machine-independent.
Why This Matters
The constant c is just the length of a compiler/interpreter from one language to another. For any string x:
\[|K_{U_1}(x) - K_{U_2}(x)| \leq c_{12}\]
This means Kolmogorov Complexity is essentially unique: for long strings, the choice of language doesn't matter.
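To see where the constant comes from, let \(i_{12}\) be an interpreter for \(U_1\) written for \(U_2\), and \(i_{21}\) the reverse. Prefixing any \(U_1\)-program with \(i_{12}\) gives a \(U_2\)-program for the same output, so

\[K_{U_2}(x) \leq K_{U_1}(x) + |i_{12}| \quad\text{and}\quad K_{U_1}(x) \leq K_{U_2}(x) + |i_{21}|,\]

hence \(|K_{U_1}(x) - K_{U_2}(x)| \leq \max(|i_{12}|, |i_{21}|)\), a constant independent of x.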
2.4 Properties of Kolmogorov Complexity
Fundamental Properties
graph TB
subgraph "Key Properties of K(x)"
P1["Upper Bound<br/>K(x) ≤ |x| + c<br/><i>Can always just store x</i>"]
P2["Subadditivity<br/>K(x,y) ≤ K(x) + K(y) + c<br/><i>Combining never costs much extra</i>"]
P3["Symmetry of Information<br/>K(x,y) = K(x) + K(y|x) + O(log)<br/><i>Order doesn't matter much</i>"]
P4["Uncomputability<br/>K(x) is not computable!<br/><i>No algorithm can compute it</i>"]
end
Figure: Key properties of Kolmogorov complexity: it’s bounded above by string length, satisfies subadditivity and symmetry of information, but is fundamentally uncomputable.
The Incomputability Shocker
One of the most surprising facts:
There is no algorithm that computes K(x) for all strings x.
This follows from the halting problem. If we could compute K(x), we could solve the halting problem.
graph TB
subgraph "Why K(x) is Uncomputable"
A["Assume algorithm A<br/>computes K(x)"]
B["Use A to find first string x<br/>with K(x) > n"]
C["Output x using program:<br/>'Find first x with K(x)>n'"]
D["This program has length ~log(n)"]
E["Contradiction!<br/>K(x) > n but described in log(n) bits"]
end
A --> B --> C --> D --> E
style E fill:#ff6b6b,color:#fff
Figure: Proof by contradiction that K(x) is uncomputable. If an algorithm could compute K(x), we could use it to find strings with high complexity, then describe them with short programs—a contradiction.
This is related to the Berry Paradox: “The smallest positive integer not definable in under sixty letters.”
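The argument in the figure can be written out as a (deliberately impossible) program. The sketch below assumes a callable `K` that returns exact Kolmogorov complexity; no such callable can exist, which is exactly the point, and `first_incompressible` is a hypothetical name used only for this argument:

```python
from typing import Callable

def first_incompressible(n: int, K: Callable[[str], int]) -> str:
    """Return the first binary string x (shortest first, then lexicographic)
    with K(x) > n, given a HYPOTHETICAL exact-complexity oracle K."""
    length = 0
    while True:
        for i in range(2 ** length):
            x = format(i, f"0{length}b") if length else ""
            if K(x) > n:
                return x
        length += 1

# The contradiction: this routine (with the assumed program for K baked in)
# plus the number n describes its output in O(log n) bits for large n, yet
# the output was chosen precisely because K(x) > n. Hence no computable K exists.
```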
2.5 Randomness as Incompressibility
What Makes a String Random?
The deepest insight of Kolmogorov Complexity is its definition of randomness:
A string x is random if K(x) ≈ |x|
In other words, a random string cannot be compressed—it has no patterns to exploit.
graph LR
subgraph "The Compression Spectrum"
direction LR
A["Highly Patterned<br/>K(x) << |x|<br/>e.g., 0000...0000"]
B["Some Structure<br/>K(x) < |x|<br/>e.g., text, code"]
C["Random<br/>K(x) ≈ |x|<br/>e.g., noise"]
end
A -.->|"Increasing randomness"| B
B -.->|"Increasing randomness"| C
style A fill:#66ff66
style B fill:#ffff66
style C fill:#ff6666
Most Strings Are Random
A counting argument shows:
- There are 2ⁿ strings of length n
- There are fewer than 2^(n-k) programs of length < n-k
- Therefore, fewer than a 2^(-k) fraction of the strings of length n can be compressed by more than k bits
pie title "Strings of Length n"
"Compressible by ≥10 bits" : 0.1
"Compressible by 5-9 bits" : 3
"Compressible by 1-4 bits" : 10
"Incompressible (random)" : 86.9
Figure: Distribution of compressibility for strings of length n. The vast majority (86.9%) are incompressible (random), while only a small fraction can be significantly compressed.
Randomness is the norm. Patterns are the exception.
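The counting argument is easy to sanity-check numerically; a minimal sketch:

```python
# Counting bound: fewer than 2**(n-k) programs are shorter than n-k bits, so
# fewer than a 2**(-k) fraction of the 2**n strings of length n can be
# compressed by more than k bits.
n = 1000
for k in (1, 5, 10, 20):
    bound = (2 ** (n - k) - 1) / 2 ** n   # exact upper bound on that fraction
    print(f"k = {k:2d}: at most {bound:.8%} of length-{n} strings have K(x) < {n - k}")
```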
2.6 Conditional Kolmogorov Complexity
Complexity Given Side Information
We can define conditional complexity K(x|y):
K(x|y) = the length of the shortest program that outputs x given y as auxiliary input
graph LR
subgraph "Conditional Complexity"
Y["Auxiliary input: y"]
P["Shortest program p"]
U["Universal TM"]
X["Output: x"]
Y --> U
P --> U
U --> X
end
L["K(x|y) = |p|"]
P --> L
Key Relationship
\[K(x,y) = K(x) + K(y|x) + O(\log K(x,y))\]
This says: to describe the pair (x, y), describe x, then describe y given x. The logarithmic term covers the bookkeeping needed to separate the two descriptions.
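The chain rule also suggests a practical proxy: approximate K(y|x) by the extra space a real compressor needs for y after it has already seen x, i.e. C(xy) − C(x). A minimal sketch with zlib (the strings are illustrative, and compressed sizes are only rough upper bounds on K):

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes: a crude upper-bound proxy for K."""
    return len(zlib.compress(data, 9))

x = b"the quick brown fox jumps over the lazy dog. " * 200
y = b"the quick brown fox jumps over the lazy cat. " * 200  # nearly identical to x

print("C(x)              =", c(x))
print("C(y)              =", c(y))
print("C(xy)             =", c(x + y))
print("K(y|x) proxy      =", c(x + y) - c(x))  # small: given x, y adds little new information
print("K(x) + K(y) proxy =", c(x) + c(y))      # noticeably larger than C(xy)
```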
2.7 Algorithmic Information Theory
Shannon vs Kolmogorov
Two notions of information:
| Shannon Entropy | Kolmogorov Complexity |
|---|---|
| Average over probability distribution | Individual object |
| Requires knowing P(x) | No probabilities needed |
| Computable | Uncomputable |
| About typical strings | About specific string |
graph TB
subgraph "Two Views of Information"
S["Shannon Entropy H(X)<br/>= E[-log P(X)]<br/>= Average code length"]
K["Kolmogorov Complexity K(x)<br/>= Shortest program for x<br/>= Absolute information content"]
end
C["Connection:<br/>For most x ~ P,<br/>K(x) ≈ H(X)"]
S --> C
K --> C
style C fill:#ffe66d,color:#000
The Beautiful Connection
For a source with entropy H:
- Most length-n sequences drawn from the source have K(x) ≈ nH
- Only atypical strings deviate significantly
This connects the probabilistic (Shannon) and algorithmic (Kolmogorov) views of information.
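A rough empirical illustration, using Python's zlib as a stand-in for an optimal code: draw a biased coin sequence and compare the compressor's bits per symbol with the Shannon entropy H(p). Real compressors do not reach H exactly, but the rate lands well below 1 bit per symbol:

```python
import math
import random
import zlib

p = 0.1                       # a biased coin: P('1') = 0.1
n = 100_000
random.seed(0)
sample = "".join("1" if random.random() < p else "0" for _ in range(n))

entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # H(p) in bits/symbol
zlib_bits = 8 * len(zlib.compress(sample.encode(), 9))

print(f"Shannon entropy H(p) = {entropy:.3f} bits/symbol")
print(f"zlib rate            = {zlib_bits / n:.3f} bits/symbol")
# zlib is not an optimal code, so it won't hit H(p) exactly, but the rate falls
# well below the 1 bit/symbol an incompressible bit string of this length needs.
```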
2.8 Applications in AI and Machine Learning
Why This Matters for Deep Learning
graph TB
KC["Kolmogorov<br/>Complexity"]
KC --> G["Generalization<br/><i>Models that find short<br/>descriptions generalize</i>"]
KC --> L["Learning Theory<br/><i>PAC-learning and<br/>compression bounds</i>"]
KC --> I["Induction<br/><i>Solomonoff's universal prior</i>"]
KC --> N["Neural Nets<br/><i>Implicit compression<br/>in learned weights</i>"]
KC --> R["Representation<br/><i>Learned features = compression</i>"]
Solomonoff Induction
Ray Solomonoff used Kolmogorov Complexity to define the universal prior:
\[P(x) = \sum_{p:\, U(p) = x} 2^{-|p|}\]
where U is a prefix-free universal machine (so the sum converges). This assigns higher probability to strings with short descriptions. It is the theoretically optimal way to predict, but it is uncomputable.
graph LR
subgraph "Solomonoff's Universal Prior"
A["All programs p<br/>that output x"]
B["Weight by 2^(-|p|)<br/>shorter = more probable"]
C["Sum to get P(x)"]
end
A --> B --> C
D["Optimal prediction<br/>(but uncomputable)"]
C --> D
style D fill:#ff6b6b,color:#fff
Deep Learning as Compression
Modern neural networks can be viewed as finding compressed representations:
graph LR
subgraph "Neural Network as Compressor"
I["Input<br/>(high dimensional)"]
E["Encoder<br/>(learned)"]
L["Latent Space<br/>(compressed)"]
D["Decoder<br/>(learned)"]
O["Output<br/>(reconstructed)"]
end
I --> E --> L --> D --> O
K["K(data) ≈ K(weights) + K(residuals)"]
style L fill:#ffe66d,color:#000
2.9 The Halting Problem Connection
Why Computation Limits Matter
Kolmogorov Complexity is deeply connected to the halting problem:
graph TB
subgraph "The Uncomputability Web"
HP["Halting Problem<br/><i>Undecidable</i>"]
KC["Kolmogorov Complexity<br/><i>Uncomputable</i>"]
BP["Berry Paradox<br/><i>Self-reference</i>"]
GT["Gödel's Theorems<br/><i>Incompleteness</i>"]
end
HP <--> KC
KC <--> BP
BP <--> GT
GT <--> HP
C["All stem from<br/>self-reference and<br/>diagonalization"]
HP --> C
KC --> C
BP --> C
GT --> C
style C fill:#ff6b6b,color:#fff
Practical Implication
We can’t compute K(x), but we can:
- Upper bound it by finding any program that outputs x
- Approximate it using practical compressors (gzip, neural nets), as sketched below
- Use the concept for theoretical analysis
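A minimal sketch of the compressor-based approach: each compressor's output (plus its fixed decompressor) is a program that prints x, so its length bounds K(x) from above up to a constant, and we can take the best of several. Assumes only the Python standard library:

```python
import bz2
import lzma
import zlib

def k_upper_bound_bits(x: bytes) -> int:
    """Best upper bound (in bits) on K(x) obtainable from a few standard compressors.

    Each compressed output, together with its fixed-size decompressor, is a
    program that prints x, so its length bounds K(x) from above up to a constant.
    """
    candidates = [
        len(zlib.compress(x, 9)),
        len(bz2.compress(x, 9)),
        len(lzma.compress(x)),
        len(x) + 1,              # fallback: store x literally
    ]
    return 8 * min(candidates)

print(k_upper_bound_bits(b"abc" * 10_000))          # very regular: tiny bound
print(k_upper_bound_bits(bytes(range(256)) * 40))   # repetitive: still a small bound
```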
2.10 Prefix-Free Complexity
A Technical Refinement
The standard Kolmogorov Complexity has a subtle issue: programs aren’t self-delimiting. Prefix-free complexity K̃(x) fixes this:
K̃(x) uses only prefix-free codes—no program is a prefix of another
graph TB
subgraph "Prefix-Free vs Standard"
S["Standard K(x)<br/>Programs: 0, 1, 00, 01, 10, 11...<br/>'00' is prefix of '001'"]
P["Prefix-Free K̃(x)<br/>Programs: 0, 10, 110, 1110...<br/>No program is prefix of another"]
end
A["Advantage: K̃(x) obeys<br/>Kraft inequality:<br/>Σ 2^(-K̃(x)) ≤ 1"]
P --> A
style A fill:#ffe66d,color:#000
This makes K̃(x) behave more like a proper probability distribution and is essential for Solomonoff induction.
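A tiny check of the two claims in the figure, using the unary-style code {0, 10, 110, ...} as the prefix-free program set (a sketch, not a real programming language):

```python
from itertools import combinations

# Prefix-free "programs" from the figure: 0, 10, 110, 1110, ...
codewords = ["1" * k + "0" for k in range(20)]

# No codeword is a prefix of another.
assert all(not a.startswith(b) and not b.startswith(a)
           for a, b in combinations(codewords, 2))

# Kraft inequality: the total weight 2**(-|w|) never exceeds 1.
kraft_sum = sum(2 ** -len(w) for w in codewords)
print(f"Kraft sum over {len(codewords)} codewords = {kraft_sum:.6f} (<= 1)")
```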
2.11 Randomness Definitions Compared
Multiple Views of Randomness
Kolmogorov Complexity leads to several equivalent definitions of randomness:
graph TB
subgraph "Equivalent Randomness Definitions"
ML["Martin-Löf Random<br/><i>Passes all effective<br/>statistical tests</i>"]
KC["Kolmogorov Random<br/><i>K(x₁...xₙ) ≥ n - c<br/>for all prefixes</i>"]
CH["Chaitin Random<br/><i>K̃(x₁...xₙ) ≥ n<br/>infinitely often</i>"]
SC["Schnorr Random<br/><i>Effective martingales<br/>don't succeed</i>"]
end
ML <-->|"equivalent"| KC
KC <-->|"equivalent"| CH
EQ["All capture the same<br/>fundamental concept!"]
ML --> EQ
KC --> EQ
CH --> EQ
style EQ fill:#66ff66,color:#000
The Universality is Remarkable
These definitions come from completely different perspectives:
- Statistical (Martin-Löf): passes all tests
- Compressibility (Kolmogorov): can’t be compressed
- Information (Chaitin): high prefix-free complexity
- Betting (Schnorr): can’t be predicted profitably
The first three (Martin-Löf, Kolmogorov, Chaitin) define exactly the same random sequences; Schnorr randomness is closely related but strictly weaker.
2.12 Connections to Other Chapters
graph TB
KC["Chapter 2<br/>Kolmogorov Complexity"]
KC --> MDL["Chapter 1: MDL<br/><i>KC provides theoretical<br/>foundation for MDL</i>"]
KC --> NN["Chapter 3: Simple NNs<br/><i>Weight description length<br/>approximates KC</i>"]
KC --> SI["Chapter 27: Machine Superintelligence<br/><i>AIXI uses Solomonoff<br/>induction (KC-based)</i>"]
KC --> COFFEE["Chapter 4: Coffee Automaton<br/><i>Complexity dynamics<br/>vs KC</i>"]
KC --> SCALE["Chapter 25: Scaling Laws<br/><i>Optimal description<br/>length allocation</i>"]
style KC fill:#ff6b6b,color:#fff
2.13 The Omega Number: A Mind-Bending Consequence
Chaitin’s Constant Ω
Define Ω as the probability that a random program halts:
\[\Omega = \sum_{p\,\text{halts}} 2^{-|p|}\]
where the sum ranges over the halting programs of a prefix-free universal machine. This number is:
- Well-defined (the sum converges)
- Uncomputable (knowing n bits of Ω solves halting for programs up to length n)
- Random (its binary expansion is Kolmogorov random)
- Algorithmically unknowable beyond a certain point
graph TB
subgraph "The Omega Number"
O["Ω = 0.0078749969978123..."]
P1["Encodes solution to<br/>halting problem"]
P2["Knowing n bits solves<br/>halting for programs ≤ n"]
P3["Maximally unknowable<br/>yet precisely defined"]
end
O --> P1
O --> P2
O --> P3
style O fill:#ff6b6b,color:#fff
This shows that there is a precise boundary to what mathematics can know—and Kolmogorov Complexity lets us locate it.
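The second property in the list above ("knowing n bits of Ω solves halting for programs up to length n") corresponds to a concrete, if forever-hypothetical, procedure. In the sketch below, `omega_prefix`, `programs_up_to`, and `run_for` are hypothetical placeholders (the bits of Ω are precisely what we can never obtain); the point is only to show the dovetailing logic:

```python
from fractions import Fraction

def decide_halting(omega_prefix, programs_up_to, run_for):
    """Decide halting for every program of length <= n bits, given the first
    n bits of Ω.

    HYPOTHETICAL: omega_prefix (a string of n '0'/'1' characters) can never
    actually be obtained; programs_up_to(t) should yield all program
    bitstrings of length <= t, and run_for(p, t) reports whether program p
    halts within t steps on the reference machine.
    """
    n = len(omega_prefix)
    omega_n = Fraction(int(omega_prefix, 2), 2 ** n)   # truncation: omega_n <= Ω
    halted, mass, t = set(), Fraction(0), 1
    # Dovetail: the discovered halting mass climbs toward Ω, so it eventually
    # reaches the truncation omega_n (this is where termination comes from).
    while mass < omega_n:
        for p in programs_up_to(t):
            if p not in halted and run_for(p, t):
                halted.add(p)
                mass += Fraction(1, 2 ** len(p))
        t += 1
    # The undiscovered mass is now < 2**(-n), so any still-pending program of
    # length <= n can never halt.
    return {p: (p in halted) for p in programs_up_to(n)}
```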
2.14 Practical Approximations
Since K(x) is Uncomputable…
We use approximations:
graph TB
subgraph "Approximating Kolmogorov Complexity"
G["Gzip, Bzip2, LZ77<br/><i>Standard compressors</i>"]
N["Neural Compressors<br/><i>Learned compression</i>"]
A["Arithmetic Coding<br/><i>Optimal for known distributions</i>"]
B["BWT-based<br/><i>Context modeling</i>"]
end
ALL["All provide upper bounds<br/>K(x) ≤ |compressed(x)| + c"]
G --> ALL
N --> ALL
A --> ALL
B --> ALL
style ALL fill:#ffe66d,color:#000
The Normalized Compression Distance
A practical similarity measure:
\[NCD(x,y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}\]
where C(s) is the compressed length of s under a real compressor and xy is the concatenation of x and y. This approximates the normalized information distance, and it works for spam detection, plagiarism detection, genomics, and more.
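A minimal NCD implementation along these lines, assuming zlib as the compressor C (any real compressor can be substituted); the sample inputs are illustrative:

```python
import os
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance with zlib playing the role of C."""
    cx, cy, cxy = (len(zlib.compress(s, 9)) for s in (x, y, x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"The quick brown fox jumps over the lazy dog. " * 50
b = b"The quick brown fox leaps over the lazy dog! " * 50   # near-duplicate of a
c = os.urandom(len(a))                                      # unrelated noise

print(f"NCD(a, b) = {ncd(a, b):.3f}   # similar inputs   -> small")
print(f"NCD(a, c) = {ncd(a, c):.3f}   # unrelated inputs -> near 1")
```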
2.15 Key Equations Summary
The Fundamental Definition
\(K(x) = \min\{|p| : U(p) = x\}\)
Invariance Theorem
\(|K_{U_1}(x) - K_{U_2}(x)| \leq c_{12}\)
Upper Bound
\(K(x) \leq |x| + O(1)\)
Chain Rule
\(K(x,y) = K(x) + K(y|x) + O(\log K(x,y))\)
Symmetry of Information
\(K(x|y) + K(y) = K(y|x) + K(x) + O(\log K(x,y))\)
Randomness Criterion
\(x \text{ is random} \Leftrightarrow K(x) \geq |x| - O(1)\)
2.16 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["K(x) = shortest<br/>program outputting x"]
T2["Random = Incompressible<br/>K(x) ≈ |x|"]
T3["K(x) is uncomputable<br/>but approximable"]
T4["Foundation for<br/>learning theory"]
end
T1 --> C["Complexity is about<br/>INFORMATION, not syntax.<br/>Patterns = Compression.<br/>Randomness = Structure-free."]
T2 --> C
T3 --> C
T4 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Kolmogorov Complexity defines the absolute information content of an object as the length of the shortest program that generates it—providing the theoretical foundation for understanding patterns, randomness, and the limits of compression.
2.17 Philosophical Implications
For Artificial Intelligence
- Learning is compression: A good model finds short descriptions of data
- Generalization requires patterns: Random data cannot be learned
- There are fundamental limits: Some truths are forever unknowable
- Simplicity is objective: Despite different programming languages, complexity is essentially unique
For Understanding Intelligence
graph LR
subgraph "Intelligence as Compression"
O["Observations"]
I["Intelligence"]
P["Predictions"]
O -->|"compress"| I
I -->|"decompress"| P
end
L["Finding short descriptions<br/>= Understanding<br/>= Intelligence"]
I --> L
style L fill:#ffe66d,color:#000
Exercises
- Estimation: Estimate K(x) for the string “aaabbbccc” (repeated 100 times). Compare to K(y) for a random string of the same length.
- Proof Sketch: Explain intuitively why K(x) cannot be computed by any algorithm.
- Practical: Use a compressor (gzip, zlib) to compute the approximate Normalized Compression Distance between two text documents. Does it correlate with semantic similarity?
- Thought Experiment: If an AI could compute K(x), what problems could it solve? What would this imply about P vs NP?
References & Further Reading
| Resource | Link |
|---|---|
| Original Book (Shen et al., 2017) | LIRMM PDF |
| Li & Vitányi Textbook (2019) | Springer |
| Kolmogorov’s Original Paper (1965) | Springer |
| Chaitin’s Omega Number | arXiv:math/0404335 |
| Solomonoff Induction Original (1964) | |
| Hutter - Universal AI (2005) | Springer |
| Schmidhuber - Speed Prior (2002) | arXiv:cs/0011017 |
| Grünwald & Vitányi - Shannon vs Kolmogorov (2008) | arXiv:0804.2459 |
| Cover & Thomas - Information Theory Ch.14 | Wiley |
Next Chapter: Chapter 3: Keeping Neural Networks Simple — We’ll see how MDL and Kolmogorov Complexity ideas directly apply to training neural networks, in Hinton & Van Camp’s seminal 1993 paper.