Chapter 23: Variational Lossy Autoencoder
“We propose a lossy compression framework that connects variational autoencoders to rate-distortion theory.”
Based on: “Variational Lossy Autoencoder” (Xi Chen, Diederik P. Kingma, Tim Salimans, et al., 2016)
| 📄 Original Paper: arXiv:1611.02731 | ICLR 2017 |
23.1 Connecting VAEs to Information Theory
Variational Autoencoders (VAEs) are powerful generative models. But their connection to information theory and compression wasn’t fully understood until this paper.
graph TB
subgraph "The Connection"
VAE["Variational Autoencoder<br/>(generative model)"]
MDL["MDL / Rate-Distortion<br/>(compression theory)"]
VLAE["Variational Lossy Autoencoder<br/>(unifies both)"]
end
VAE --> VLAE
MDL --> VLAE
K["VLAE provides information-theoretic<br/>interpretation of VAEs"]
VLAE --> K
style K fill:#ffe66d,color:#000
This connects back to Chapter 1 (MDL) and Chapter 3 (Keeping NNs Simple)!
23.2 The Standard VAE
Architecture Recap
graph TB
subgraph "Standard VAE"
X["Input x"]
ENC["Encoder<br/>q(z|x)"]
Z["Latent z ~ q(z|x)"]
DEC["Decoder<br/>p(x|z)"]
X_REC["Reconstructed x̂"]
end
X --> ENC --> Z --> DEC --> X_REC
L["Loss = -log p(x|z) + KL(q(z|x) || p(z))"]
Z --> L
style L fill:#ff6b6b,color:#fff
The ELBO
\[\mathcal{L}_{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - KL(q(z|x) \| p(z))\]

23.3 The Posterior Collapse Problem
What Is Posterior Collapse?
When the encoder learns to ignore the latent code:
graph TB
subgraph "Posterior Collapse"
X["Input x"]
ENC["Encoder<br/>q(z|x) ≈ p(z)"]
Z["z becomes independent<br/>of x!"]
DEC["Decoder<br/>p(x|z) ≈ p(x)"]
end
X --> ENC --> Z --> DEC
P["Encoder learns nothing!<br/>Latent code is useless"]
Z --> P
style P fill:#ff6b6b,color:#fff
Why It Happens
The KL term $KL(q(z|x) \| p(z))$ can dominate the objective, pushing $q(z|x)$ toward the prior $p(z)$. This is especially likely when the decoder is powerful enough (e.g., autoregressive) to model the data well without using $z$, so the cheapest way to lower the loss is to drive the KL term to zero.
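A quick diagnostic, sketched below under the usual diagonal-Gaussian-encoder assumption (PyTorch; `mu` and `logvar` are illustrative encoder outputs, not names from the paper): compute the closed-form KL per latent dimension and flag dimensions whose average KL is near zero, since those dimensions carry essentially no information about $x$.

```python
import torch

def gaussian_kl_per_dim(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.

    mu, logvar: (batch, latent_dim) encoder outputs.
    Returns the per-dimension KL averaged over the batch, shape (latent_dim,).
    """
    # KL for each example and dimension: 0.5 * (mu^2 + var - logvar - 1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kl.mean(dim=0)

# Illustrative check on random "encoder outputs"
mu, logvar = torch.randn(256, 16), torch.randn(256, 16)
kl_per_dim = gaussian_kl_per_dim(mu, logvar)
collapsed = kl_per_dim < 1e-2   # heuristic threshold: ~0 nats => dimension ignored
print(f"collapsed dims: {int(collapsed.sum())} / {kl_per_dim.numel()}")
```

In a collapsed model most dimensions sit near zero KL, which is exactly the $q(z|x) \approx p(z)$ regime shown in the diagram above.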
23.4 Rate-Distortion Theory
The Fundamental Trade-off
Rate-Distortion theory (from information theory) formalizes compression:
graph TB
subgraph "Rate-Distortion Trade-off"
R["Rate<br/>(bits to encode)"]
D["Distortion<br/>(reconstruction error)"]
end
T["Trade-off: Lower rate → Higher distortion<br/>Higher rate → Lower distortion"]
R --> T
D --> T
style T fill:#ffe66d,color:#000
The Rate-Distortion Function
\[R(D) = \min_{p(\hat{x}|x)} I(X; \hat{X}) \quad \text{s.t.} \quad \mathbb{E}[d(X, \hat{X})] \leq D\]

Where:
- $R(D)$ = minimum rate for distortion $D$
- $I(X; \hat{X})$ = mutual information
- $d(X, \hat{X})$ = distortion measure
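To make these definitions concrete, a Bernoulli($p$) source under Hamming distortion has the closed-form rate-distortion function $R(D) = H_b(p) - H_b(D)$ for $0 \leq D \leq \min(p, 1-p)$ and $R(D) = 0$ otherwise (Cover & Thomas, Ch. 10). A small Python check of the trade-off:

```python
import math

def binary_entropy(q: float) -> float:
    """H_b(q) in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def rate_distortion_bernoulli(p: float, D: float) -> float:
    """R(D) = H_b(p) - H_b(D) for a Bernoulli(p) source with Hamming distortion."""
    if D >= min(p, 1 - p):
        return 0.0  # distortion budget is large enough to need no bits at all
    return binary_entropy(p) - binary_entropy(D)

# Allowing more distortion requires fewer bits per symbol
for D in (0.0, 0.05, 0.1, 0.2):
    print(f"D = {D:.2f}  ->  R(D) = {rate_distortion_bernoulli(0.5, D):.3f} bits/symbol")
```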
23.5 VLAE: The Connection
VAE as Lossy Compression
The VLAE paper shows that VAEs can be interpreted as lossy compressors:
graph TB
subgraph "VLAE Interpretation"
X["Input x"]
ENC["Encoder: x → z<br/>(compression)"]
Z["Latent z<br/>(compressed representation)"]
DEC["Decoder: z → x̂<br/>(decompression)"]
X_REC["Reconstructed x̂<br/>(lossy)"]
end
X --> ENC --> Z --> DEC --> X_REC
R["Rate = I(x; z)<br/>(information in latent)"]
D["Distortion = -log p(x|z)<br/>(reconstruction error)"]
Z --> R
X_REC --> D
style R fill:#4ecdc4,color:#fff
style D fill:#ff6b6b,color:#fff
The VLAE Objective
\[\mathcal{L}_{VLAE} = \underbrace{\mathbb{E}_{q(z|x)}[-\log p(x|z)]}_{\text{Distortion}} + \beta \underbrace{KL(q(z|x) \| p(z))}_{\text{Rate}}\]

Where $\beta$ controls the rate-distortion trade-off.
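A minimal loss-function sketch of this objective, assuming a Bernoulli decoder over pixels and a diagonal Gaussian encoder (PyTorch; `x_logits`, `mu`, and `logvar` are assumed model outputs, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def vlae_style_loss(x, x_logits, mu, logvar, beta=1.0):
    """Distortion + beta * Rate, per the objective above.

    x:          (batch, D) binary targets in [0, 1]
    x_logits:   (batch, D) decoder logits for p(x|z)
    mu, logvar: (batch, latent_dim) parameters of q(z|x)
    """
    # Distortion: E_q[-log p(x|z)], here a Bernoulli negative log-likelihood per example
    distortion = F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").sum(dim=1)
    # Rate: KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians
    rate = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1)
    loss = (distortion + beta * rate).mean()
    return loss, distortion.mean(), rate.mean()
```

Logging the two terms separately during training gives a direct readout of the (rate, distortion) point a trained model occupies.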
23.6 Understanding Rate and Distortion
Rate: Mutual Information
\[R = I(x; z) = \mathbb{E}_x[KL(q(z|x) \| p(z))] - KL(q(z) \| p(z))\]

where $q(z) = \mathbb{E}_x[q(z|x)]$ is the aggregate posterior; in practice the rate is usually upper-bounded by the average per-example KL, $\mathbb{E}_x[KL(q(z|x) \| p(z))]$.

graph TB
subgraph "Rate Interpretation"
R1["High rate: z contains<br/>lots of information about x"]
R2["Low rate: z is<br/>nearly independent of x"]
end
K["Rate = how much information<br/>we store in the latent code"]
R1 --> K
R2 --> K
style K fill:#ffe66d,color:#000
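The rate can be estimated more tightly than the average KL by approximating the aggregate posterior $q(z)$ with the minibatch mixture, following the decomposition above. The sketch below is the standard Monte Carlo minibatch estimator (PyTorch), not code from the paper, and it is biased for small batches:

```python
import math
import torch

def diag_gaussian_logpdf(z, mu, logvar):
    """log N(z; mu, diag(exp(logvar))), summed over the last dimension."""
    return -0.5 * (((z - mu).pow(2) / logvar.exp())
                   + logvar + math.log(2 * math.pi)).sum(dim=-1)

def estimate_rate(mu, logvar):
    """Minibatch estimate of I(x; z) = E[log q(z|x) - log q(z)], in nats.

    mu, logvar: (B, d) posterior parameters for a batch of B examples.
    The aggregate posterior q(z) is approximated by the batch mixture.
    """
    B = mu.shape[0]
    z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)   # one sample per example
    log_qzx = diag_gaussian_logpdf(z, mu, logvar)           # log q(z_i | x_i)
    # log q(z_i) ~= log (1/B) sum_j q(z_i | x_j): evaluate each sample under every posterior
    pairwise = diag_gaussian_logpdf(z.unsqueeze(1), mu.unsqueeze(0), logvar.unsqueeze(0))
    log_qz = torch.logsumexp(pairwise, dim=1) - math.log(B)
    return (log_qzx - log_qz).mean()
```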
Distortion: Reconstruction Error
\[D = \mathbb{E}_{q(z|x)}[-\log p(x|z)]\]

graph TB
subgraph "Distortion Interpretation"
D1["Low distortion: x̂ ≈ x<br/>(good reconstruction)"]
D2["High distortion: x̂ ≠ x<br/>(poor reconstruction)"]
end
K["Distortion = how well<br/>we can reconstruct x from z"]
D1 --> K
D2 --> K
style K fill:#ffe66d,color:#000
23.7 The β-VAE Connection
Controlling the Trade-off
The $\beta$ parameter here plays the same role as in β-VAE:
xychart-beta
title "Rate-Distortion Trade-off (β-VAE)"
x-axis "β" [0, 0.5, 1, 2, 4, 8]
y-axis "Value" 0 --> 10
line "Rate (I(x;z))" [8, 6, 4, 2, 1, 0.5]
line "Distortion (-log p(x|z))" [1, 2, 3, 4, 6, 8]
- β < 1: Prioritize reconstruction (high rate, low distortion)
- β = 1: Standard VAE (balanced)
- β > 1: Prioritize compression (low rate, high distortion)
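For intuition about how $\beta$ moves along this curve, the scalar Gaussian case can be worked out in closed form: a source with variance $\sigma^2$ has $R(D) = \tfrac{1}{2}\ln(\sigma^2/D)$ nats for $D \leq \sigma^2$, so minimizing $D + \beta R$ gives $D^* = \beta/2$ (capped at $\sigma^2$). This analytical shortcut is an illustration, not a result from the paper:

```python
import math

def gaussian_rd_point(sigma2: float, beta: float):
    """Optimal (rate, distortion) for a scalar Gaussian source when minimizing D + beta * R.

    R(D) = 0.5 * ln(sigma2 / D) nats for D <= sigma2; the optimum sits at D* = beta / 2.
    """
    D = min(beta / 2.0, sigma2)          # beyond beta = 2 * sigma2, send zero bits
    R = 0.5 * math.log(sigma2 / D) if D < sigma2 else 0.0
    return R, D

# Larger beta => lower rate, higher distortion, matching the bullets above
for beta in (0.25, 0.5, 1.0, 2.0, 4.0):
    R, D = gaussian_rd_point(sigma2=1.0, beta=beta)
    print(f"beta = {beta:>4}: rate = {R:.3f} nats, distortion = {D:.3f}")
```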
23.8 Bits-Back Coding Connection
Back to Chapter 3
Remember bits-back coding from Chapter 3 (Hinton & Van Camp)?
graph TB
subgraph "Bits-Back in VLAE"
Q["q(z|x)<br/>(encoder distribution)"]
S["Sample z ~ q(z|x)"]
M["Message encoded<br/>in sampling randomness"]
R["Rate reduction<br/>via bits-back"]
end
Q --> S --> M --> R
K["The 'bits-back' argument<br/>reduces effective rate"]
R --> K
style K fill:#ffe66d,color:#000
VLAE makes this connection explicit!
23.9 Hierarchical VLAE
Multi-Scale Compression
VLAE can be extended to hierarchical models:
graph TB
subgraph "Hierarchical VLAE"
X["x"]
Z1["z₁<br/>(coarse)"]
Z2["z₂<br/>(fine)"]
X_REC["x̂"]
end
X --> Z1 --> Z2 --> X_REC
K["Multiple levels of<br/>compression/abstraction"]
Z2 --> K
style K fill:#4ecdc4,color:#fff
Each level adds more detail, creating a multi-resolution representation.
23.10 Experimental Results
Image Compression
VLAE achieves competitive compression rates:
xychart-beta
title "Compression Performance (bits per pixel)"
x-axis ["JPEG", "PNG", "VLAE (β=0.1)", "VLAE (β=1.0)"]
y-axis "Bits per Pixel" 0 --> 10
bar [2.5, 4.0, 1.8, 0.8]
Learned codecs like VLAE can outperform hand-designed formats, particularly at aggressive (low bit-rate) compression settings.
Generation Quality
VLAE also generates high-quality samples, balancing compression and generation.
23.11 Connection to MDL (Chapter 1)
The MDL View
From Chapter 1, MDL minimizes: $L(H) + L(D \mid H)$
For VLAE:
- $L(H)$ = Rate: $KL(q(z|x) \| p(z))$ (description length of the latent code)
- $L(D \mid H)$ = Distortion: $-\log p(x|z)$ (description length of the data given the latent)
graph TB
subgraph "MDL ↔ VLAE"
MDL["MDL: L(H) + L(D|H)"]
VLAE["VLAE: Rate + Distortion"]
end
MDL -->|"equivalent"| VLAE
K["VLAE is MDL for<br/>lossy compression!"]
VLAE --> K
style K fill:#4ecdc4,color:#fff
23.12 Practical Implications
For Compression
graph TB
subgraph "Compression Applications"
IMG["Images"]
VLAE["VLAE encoder"]
Z["Compressed z"]
STORE["Storage/Transmission"]
VLAE_DEC["VLAE decoder"]
IMG_REC["Reconstructed image"]
end
IMG --> VLAE --> Z --> STORE --> VLAE_DEC --> IMG_REC
K["Learned compression<br/>better than hand-designed"]
STORE --> K
style K fill:#ffe66d,color:#000
For Generation
VLAE can also generate new samples by sampling from $p(z)$ and decoding.
23.13 Modern Variants
VQ-VAE
Vector Quantized VAE uses discrete latents:
graph TB
subgraph "VQ-VAE"
X["x"]
ENC["Encoder"]
Z_CONT["Continuous z"]
VQ["Vector Quantization<br/>(discretize)"]
Z_DISC["Discrete z"]
DEC["Decoder"]
X_REC["x̂"]
end
X --> ENC --> Z_CONT --> VQ --> Z_DISC --> DEC --> X_REC
K["Discrete codes enable<br/>autoregressive modeling"]
Z_DISC --> K
style K fill:#4ecdc4,color:#fff
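A minimal sketch of the VQ-VAE quantization step with the straight-through gradient estimator (van den Oord et al., 2017); the codebook size and tensor shapes are arbitrary illustrations, and the codebook/commitment losses of the full method are omitted:

```python
import torch

def vector_quantize(z_e, codebook):
    """Nearest-neighbour quantization with a straight-through gradient.

    z_e:      (batch, dim) continuous encoder outputs
    codebook: (K, dim) embedding vectors
    """
    # Squared distance from every encoding to every codebook entry: (batch, K)
    dists = (z_e.unsqueeze(1) - codebook.unsqueeze(0)).pow(2).sum(dim=-1)
    indices = dists.argmin(dim=1)      # discrete codes, usable by an autoregressive prior
    z_q = codebook[indices]            # quantized latents
    # Straight-through estimator: forward pass uses z_q, gradients flow back to z_e
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, indices

# Illustrative shapes: 512 codes of dimension 64, batch of 32 encodings
codebook = torch.randn(512, 64)
z_q, codes = vector_quantize(torch.randn(32, 64), codebook)
```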
β-VAE and Disentanglement
Higher $\beta$ encourages disentangled representations (Chapter 3 connection!).
23.14 Connection to Other Chapters
graph TB
CH23["Chapter 23<br/>VLAE"]
CH23 --> CH1["Chapter 1: MDL<br/><i>Rate = L(H), Distortion = L(D|H)</i>"]
CH23 --> CH3["Chapter 3: Simple NNs<br/><i>Bits-back coding</i>"]
CH23 --> CH2["Chapter 2: Kolmogorov<br/><i>Compression perspective</i>"]
CH23 --> CH25["Chapter 25: Scaling Laws<br/><i>Optimal allocation</i>"]
style CH23 fill:#ff6b6b,color:#fff
23.15 Key Equations Summary
VLAE Objective
\[\mathcal{L}_{VLAE} = \mathbb{E}_{q(z|x)}[-\log p(x|z)] + \beta\, KL(q(z|x) \| p(z))\]

Rate (Mutual Information)

\[R = I(x; z) \leq \mathbb{E}_x[KL(q(z|x) \| p(z))]\]

Distortion

\[D = \mathbb{E}_{q(z|x)}[-\log p(x|z)]\]

Rate-Distortion Trade-off

\[\min_{q(z|x),\, p(x|z)} D + \beta R\]

23.16 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["VLAE connects VAEs to<br/>rate-distortion theory"]
T2["Rate = mutual information<br/>I(x; z)"]
T3["Distortion = reconstruction<br/>error -log p(x|z)"]
T4["β controls rate-distortion<br/>trade-off"]
T5["Connects to MDL and<br/>bits-back coding"]
end
T1 --> C["Variational Lossy Autoencoders<br/>provide an information-theoretic<br/>framework for understanding VAEs<br/>as lossy compressors, connecting<br/>generative modeling to rate-distortion<br/>theory and MDL principles."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Variational Lossy Autoencoders provide an information-theoretic interpretation of VAEs as lossy compressors, where rate (mutual information) trades off with distortion (reconstruction error), connecting back to MDL and bits-back coding principles.
🎉 Part V Complete!
You’ve finished the Advanced Architectures section. You now understand:
- Pointer Networks for variable outputs (Chapter 18)
- Set2Seq for unordered inputs (Chapter 19)
- Neural Turing Machines with external memory (Chapter 20)
- Message Passing for graphs (Chapter 21)
- Relation Networks for reasoning (Chapter 22)
- VLAE connecting compression and generation (Chapter 23)
Next up: Part VI - Scaling and Efficiency, where we explore training neural networks at massive scale!
Exercises
- Conceptual: Explain the connection between VLAE’s rate-distortion trade-off and MDL’s model-data trade-off from Chapter 1.
- Mathematical: Derive why $I(x; z) \leq \mathbb{E}_x[KL(q(z|x) \| p(z))]$. When does equality hold?
- Implementation: Implement a simple VLAE and vary $\beta$ to see the rate-distortion trade-off. Plot the curve.
- Analysis: Compare VLAE compression to standard methods (JPEG, PNG). What are the advantages and disadvantages of learned compression?
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Chen et al., 2016) | arXiv:1611.02731 |
| β-VAE Paper | arXiv:1804.03599 |
| VQ-VAE Paper | arXiv:1711.00937 |
| Rate-Distortion Theory | Cover & Thomas Ch. 10 |
| Bits-Back with RANS | arXiv:1901.04866 |
| Hierarchical VAE | arXiv:1606.04934 |
Next Chapter: Chapter 24: Deep Speech 2 — We begin Part VI by exploring end-to-end speech recognition at scale, showing how deep learning revolutionized speech processing.