Chapter 10: Multi-Scale Context Aggregation by Dilated Convolutions
“Dilated convolutions support exponentially expanding receptive fields without losing resolution.”
Based on: “Multi-Scale Context Aggregation by Dilated Convolutions” (Fisher Yu, Vladlen Koltun, 2015)
| 📄 Original Paper: arXiv:1511.07122 | ICLR 2016 |
10.1 The Dense Prediction Problem
So far we’ve focused on image classification: one label per image. But many vision tasks require dense prediction: a label for every pixel.
graph LR
subgraph "Classification"
I1["Image"] --> L1["'cat'"]
end
subgraph "Dense Prediction"
I2["Image"] --> L2["Label per pixel"]
end
E["Semantic Segmentation<br/>Depth Estimation<br/>Optical Flow"]
L2 --> E
style E fill:#ffe66d,color:#000
Dense prediction creates a fundamental tension:
- Large receptive field: Need to see global context
- High resolution: Need to preserve spatial detail
Standard CNNs sacrifice one for the other.
10.2 The Resolution Problem
What Happens in Classification CNNs
graph LR
subgraph "Classification CNN"
I["224×224"]
C1["112×112"]
C2["56×56"]
C3["28×28"]
C4["14×14"]
C5["7×7"]
FC["1×1<br/>(global pool)"]
end
I --> C1 --> C2 --> C3 --> C4 --> C5 --> FC
P["Resolution lost at each stage!<br/>Fine details destroyed."]
C5 --> P
style P fill:#ff6b6b,color:#fff
Naive Solutions and Their Problems
| Approach | Problem |
|---|---|
| Remove pooling | Receptive field too small |
| Upsampling at end | Information already lost |
| Larger filters | Too many parameters |
| Skip connections (U-Net) | Complex, still loses some info |
10.3 Dilated Convolutions: The Key Idea
What Is Dilation?
A dilated convolution (also called atrous convolution) spreads out the filter:
graph TB
subgraph "Standard 3×3 Conv (dilation=1)"
S["■ ■ ■<br/>■ ■ ■<br/>■ ■ ■"]
SR["Receptive field: 3×3"]
end
subgraph "Dilated 3×3 Conv (dilation=2)"
D2["■ · ■ · ■<br/>· · · · ·<br/>■ · ■ · ■<br/>· · · · ·<br/>■ · ■ · ■"]
D2R["Receptive field: 5×5"]
end
subgraph "Dilated 3×3 Conv (dilation=4)"
D4["■ · · · ■ · · · ■<br/>· · · · · · · · ·<br/>· · · · · · · · ·<br/>· · · · · · · · ·<br/>■ · · · ■ · · · ■<br/>..."]
D4R["Receptive field: 9×9"]
end
K["Same number of parameters,<br/>MUCH larger receptive field!"]
D4 --> K
style K fill:#ffe66d,color:#000
The Formula
For a 1D signal with filter w and dilation rate r:
\[(F *_r w)(p) = \sum_{s} F(p + r \cdot s) \cdot w(s)\]

The filter samples the input at intervals of r instead of 1.
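The formula can be checked numerically. A minimal sketch of 1D dilated convolution in plain Python (cross-correlation form, as deep learning frameworks use, over valid positions only):

```python
# Minimal 1D dilated cross-correlation matching the formula:
# (F *_r w)(p) = sum_s F(p + r*s) * w(s)
def dilated_conv1d(F, w, r):
    k = len(w)
    span = (k - 1) * r + 1  # effective filter extent on the input
    return [sum(F[p + r * s] * w[s] for s in range(k))
            for p in range(len(F) - span + 1)]

F = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]
print(dilated_conv1d(F, w, 1))  # taps 1 apart -> [-2, -2, -2, -2, -2]
print(dilated_conv1d(F, w, 2))  # taps 2 apart, extent 5 -> [-4, -4, -4]
```

With r=2 the same 3-tap filter spans 5 input samples, so fewer valid positions remain but each output sees a wider window.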
10.4 Exponentially Growing Receptive Fields
The Power of Stacked Dilations
Stack dilated convolutions with exponentially increasing rates:
graph TB
subgraph "Stacked Dilated Convolutions"
L0["Input"]
L1["Layer 1: dilation=1<br/>RF: 3"]
L2["Layer 2: dilation=2<br/>RF: 7"]
L3["Layer 3: dilation=4<br/>RF: 15"]
L4["Layer 4: dilation=8<br/>RF: 31"]
L5["Layer 5: dilation=16<br/>RF: 63"]
end
L0 --> L1 --> L2 --> L3 --> L4 --> L5
R["Receptive field grows EXPONENTIALLY<br/>while parameters grow LINEARLY!"]
L5 --> R
style R fill:#4ecdc4,color:#fff
Comparison: Standard vs Dilated
| Layers | Standard Conv RF | Dilated Conv RF |
|---|---|---|
| 1 | 3 | 3 |
| 2 | 5 | 7 |
| 3 | 7 | 15 |
| 4 | 9 | 31 |
| 5 | 11 | 63 |
| 6 | 13 | 127 |
With the same depth and parameters, dilated convolutions achieve 10× larger receptive fields!
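The table's values follow from the recurrence $RF_l = RF_{l-1} + (k-1) \cdot r_l$; a short script reproduces both columns:

```python
# Receptive field of stacked k x k convs: RF_l = RF_{l-1} + (k-1) * r_l
def receptive_field(dilations, k=3):
    rf = 1
    for r in dilations:
        rf += (k - 1) * r
    return rf

depths = range(1, 7)
standard = [receptive_field([1] * L) for L in depths]
dilated = [receptive_field([2 ** i for i in range(L)]) for L in depths]
print(standard)  # [3, 5, 7, 9, 11, 13]
print(dilated)   # [3, 7, 15, 31, 63, 127]
```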
10.5 The Context Module
Architecture
The paper proposes a context module that can be appended to any CNN:
graph TB
subgraph "Context Module"
I["Input feature map<br/>(from base network)"]
C1["3×3 Conv, dilation=1"]
C2["3×3 Conv, dilation=2"]
C3["3×3 Conv, dilation=4"]
C4["3×3 Conv, dilation=8"]
C5["3×3 Conv, dilation=16"]
C6["3×3 Conv, dilation=1"]
C7["1×1 Conv (output)"]
O["Dense prediction"]
end
I --> C1 --> C2 --> C3 --> C4 --> C5 --> C6 --> C7 --> O
N["7 layers capture context<br/>at exponentially increasing scales"]
C5 --> N
Module Variants
| Variant | Description |
|---|---|
| Basic | All layers have same channels |
| Large | More channels in middle layers |
| Front-end | VGG-16 adapted with dilations |
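A minimal PyTorch sketch of the basic context module, following the dilation schedule in the diagram above. The channel width `C` is a placeholder and initialization details (the paper uses identity initialization) are omitted:

```python
import torch
import torch.nn as nn

def context_module(C=64):
    """Basic context module sketch: 3x3 convs with dilations 1,2,4,8,16,
    a final 3x3 rate-1 layer, then a linear 1x1 output conv. Channel
    width C is kept constant, as in the paper's 'basic' variant."""
    rates = [1, 2, 4, 8, 16, 1]
    layers = []
    for r in rates:
        layers += [nn.Conv2d(C, C, 3, padding=r, dilation=r),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(C, C, 1))  # 1x1 output conv
    return nn.Sequential(*layers)

m = context_module(64)
x = torch.randn(1, 64, 56, 56)
print(m(x).shape)  # torch.Size([1, 64, 56, 56]) -- resolution preserved
```

Because `padding=dilation` at every layer, the module can be appended to any backbone without changing feature-map size.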
10.6 Adapting Classification Networks
Removing Pooling, Adding Dilation
The key insight: you can convert a classification CNN into a dense prediction network by:
- Removing the last pooling layers
- Replacing the subsequent convolutions with dilated convolutions
- Maintaining resolution throughout
graph TB
subgraph "VGG for Classification"
V1["Conv1-2 (224)"]
P1["Pool → 112"]
V2["Conv3-4 (112)"]
P2["Pool → 56"]
V3["Conv5-7 (56)"]
P3["Pool → 28"]
V4["Conv8-10 (28)"]
P4["Pool → 14"]
V5["Conv11-13 (14)"]
P5["Pool → 7"]
FC["FC layers"]
end
subgraph "VGG for Dense Prediction"
VD1["Conv1-2 (224)"]
PD1["Pool → 112"]
VD2["Conv3-4 (112)"]
PD2["Pool → 56"]
VD3["Conv5-7 (56)"]
VD4["Conv8-10, dilation=2 (56)"]
VD5["Conv11-13, dilation=4 (56)"]
OUT["Output (56)"]
end
V1 --> P1 --> V2 --> P2 --> V3 --> P3 --> V4 --> P4 --> V5 --> P5 --> FC
VD1 --> PD1 --> VD2 --> PD2 --> VD3 --> VD4 --> VD5 --> OUT
K["Resolution maintained!<br/>Large receptive field via dilation."]
OUT --> K
style K fill:#ffe66d,color:#000
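The conversion can be sketched for a single stage: drop a stride-2 pool and dilate the convolutions that followed it, so their receptive field on the input is unchanged while resolution is kept. Layer shapes and channel counts here are illustrative, not VGG's exact configuration:

```python
import torch
import torch.nn as nn

# Classification stage: pool halves resolution, then standard 3x3 convs.
stage_classification = nn.Sequential(
    nn.MaxPool2d(2),                    # 56 -> 28
    nn.Conv2d(256, 512, 3, padding=1),
    nn.Conv2d(512, 512, 3, padding=1),
)

# Dense-prediction stage: pool removed; convs dilated by the lost stride (2).
stage_dense = nn.Sequential(
    nn.Conv2d(256, 512, 3, padding=2, dilation=2),
    nn.Conv2d(512, 512, 3, padding=2, dilation=2),
)

x = torch.randn(1, 256, 56, 56)
print(stage_classification(x).shape)  # torch.Size([1, 512, 28, 28])
print(stage_dense(x).shape)           # torch.Size([1, 512, 56, 56])
```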
10.7 Why Dilation Works
Theoretical Justification
graph TB
subgraph "Multi-Scale Processing"
S1["Scale 1: Fine details<br/>(dilation=1)"]
S2["Scale 2: Local context<br/>(dilation=2,4)"]
S3["Scale 3: Global context<br/>(dilation=8,16)"]
end
A["All scales combined<br/>in single forward pass"]
S1 --> A
S2 --> A
S3 --> A
style A fill:#ffe66d,color:#000
Information Aggregation
Each output pixel aggregates information from:
- Nearby pixels: Fine-grained appearance
- Medium distance: Object parts and structure
- Far away: Scene context and global semantics
10.8 The Gridding Problem
What Is Gridding?
A subtle issue: dilated convolutions can cause gridding artifacts:
graph TB
subgraph "Gridding Problem"
D["Dilation = 2"]
G["Grid pattern in activations:<br/>■ · ■ · ■<br/>· · · · ·<br/>■ · ■ · ■"]
P["Not all pixels equally<br/>contribute to output"]
end
D --> G --> P
style P fill:#ff6b6b,color:#fff
Solutions
- Hybrid Dilated Convolution (HDC): Use non-uniform dilation rates
- Return to dilation=1: Final layers with standard convolution
- Multi-scale fusion: Combine different dilation branches
graph LR
subgraph "HDC Pattern"
H["1, 2, 3 instead of 1, 2, 4"]
E["No common divisor > 1<br/>→ All pixels covered"]
end
H --> E
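The "no common divisor" intuition can be verified by tracking which input offsets actually reach an output through a stack of dilated layers (shown here in 1D with kernel size 3, a simplified sketch of the coverage argument):

```python
# Which input offsets contribute to one output through stacked dilated convs?
def covered_offsets(dilations, k=3):
    offsets = {0}
    for r in dilations:
        offsets = {o + r * s for o in offsets for s in range(k)}
    return offsets

# Repeated even rates leave gaps (gridding); HDC-style rates cover everything.
for rates in ([2, 2, 2], [1, 2, 3]):
    cov = covered_offsets(rates)
    holes = [o for o in range(max(cov) + 1) if o not in cov]
    print(rates, "holes:", holes)
```

With rates `[2, 2, 2]` every odd offset is a hole, i.e. those pixels never influence the output; with `[1, 2, 3]` the span is covered completely.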
10.9 Results and Applications
Semantic Segmentation Results
xychart-beta
title "Pascal VOC 2012 Segmentation (mIoU %)"
x-axis ["FCN-8s", "DeepLab", "This Paper", "This + CRF"]
y-axis "mIoU %" 60 --> 80
bar [62.2, 71.6, 73.5, 75.3]
Where Dilated Convolutions Are Used
graph TB
subgraph "Applications"
S["Semantic Segmentation<br/>(DeepLab, PSPNet)"]
D["Depth Estimation"]
O["Object Detection<br/>(Feature Pyramid Networks)"]
A["Audio (WaveNet)"]
M["Medical Imaging"]
end
DC["Dilated<br/>Convolutions"]
DC --> S
DC --> D
DC --> O
DC --> A
DC --> M
style DC fill:#4ecdc4,color:#fff
10.10 WaveNet: Dilated Convolutions for Audio
A Surprising Application
Google’s WaveNet used dilated 1D convolutions for audio generation:
graph TB
subgraph "WaveNet Architecture"
I["Input samples"]
L1["Dilated Conv, r=1"]
L2["Dilated Conv, r=2"]
L3["Dilated Conv, r=4"]
L4["Dilated Conv, r=8"]
L5["... r=512"]
O["Next sample prediction"]
end
I --> L1 --> L2 --> L3 --> L4 --> L5 --> O
R["Receptive field of thousands<br/>of samples = seconds of audio"]
L5 --> R
style R fill:#ffe66d,color:#000
This allows modeling long-range dependencies in sequential data—a precursor to the attention mechanisms we’ll see in Part IV!
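A hedged sketch of a WaveNet-style stack in PyTorch: causal dilated 1D convolutions with exponentially growing rates. The real WaveNet adds gated activations, residual and skip connections, which are omitted here:

```python
import torch
import torch.nn as nn

class CausalDilatedStack(nn.Module):
    """Causal dilated 1D convs with rates 1, 2, 4, ..., 2^(layers-1).
    Left-padding ensures each output depends only on past/current samples."""
    def __init__(self, channels=8, layers=10, k=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, k, dilation=2 ** i)
            for i in range(layers)
        )
        self.pads = [(k - 1) * 2 ** i for i in range(layers)]

    def forward(self, x):
        for conv, pad in zip(self.convs, self.pads):
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))  # causal pad
        return x

net = CausalDilatedStack(channels=8, layers=10)
x = torch.randn(1, 8, 2000)
print(net(x).shape)  # length preserved; RF = 1 + sum((k-1)*2^i) = 1024 samples
```

Ten layers with kernel size 2 already cover 1024 past samples; WaveNet repeats such blocks to reach receptive fields of many thousands.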
10.11 Comparison with Other Approaches
Dense Prediction Methods
graph TB
subgraph "Approaches to Dense Prediction"
E["Encoder-Decoder<br/>(U-Net)"]
D["Dilated Convolutions<br/>(This paper)"]
P["Pyramid Pooling<br/>(PSPNet)"]
A["Attention<br/>(Later work)"]
end
subgraph "Trade-offs"
E --> TE["+ Skip connections<br/>- Complex architecture"]
D --> TD["+ Simple modification<br/>- Gridding artifacts"]
P --> TP["+ Multi-scale features<br/>- Fixed scales"]
A --> TA["+ Adaptive<br/>- Expensive"]
end
Modern Best Practice
Most state-of-the-art models combine these approaches:
- Dilated convolutions in backbone
- Multi-scale pooling
- Skip connections
- Sometimes attention
10.12 Implementation Details
Dilated Convolution in Code
```python
# PyTorch dilated convolution
import torch.nn as nn

# Standard convolution
conv_standard = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=1,
    dilation=1,  # default
)

# Dilated convolution (rate=2)
conv_dilated = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=2,  # padding = dilation for 'same' output with a 3x3 kernel
    dilation=2,
)

# Both have the same number of parameters!
```
Padding for Dilated Convolutions
For “same” padding with dilation rate r and kernel size k:
\[\text{padding} = \frac{(k - 1) \cdot r}{2}\]

For a 3×3 kernel:
- dilation=1 → padding=1
- dilation=2 → padding=2
- dilation=4 → padding=4
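A quick check that this padding rule preserves spatial size at every rate, and that the parameter count is unchanged (using the same PyTorch layers as above):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
for r in (1, 2, 4):
    pad = (3 - 1) * r // 2  # equals r for a 3x3 kernel
    conv = nn.Conv2d(64, 64, 3, padding=pad, dilation=r)
    assert conv(x).shape == x.shape  # 'same' resolution at every rate
    # 64*64*9 weights + 64 biases, regardless of dilation:
    print(r, sum(p.numel() for p in conv.parameters()))
```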
10.13 Connection to Other Chapters
graph TB
CH10["Chapter 10<br/>Dilated Convolutions"]
CH10 --> CH7["Chapter 7: CS231n<br/><i>Receptive field basics</i>"]
CH10 --> CH8["Chapter 8: ResNet<br/><i>Combined with dilations<br/>in modern nets</i>"]
CH10 --> CH16["Chapter 16: Transformers<br/><i>Alternative for<br/>global context</i>"]
CH10 --> CH24["Chapter 24: Deep Speech 2<br/><i>Also needs long-range<br/>dependencies</i>"]
style CH10 fill:#ff6b6b,color:#fff
10.14 Key Equations Summary
Dilated Convolution (1D)
\[(F *_r w)(p) = \sum_{s} F(p + r \cdot s) \cdot w(s)\]

Receptive Field Growth
For L layers with dilation rates $r_1, r_2, …, r_L$ and kernel size k:
\[RF = 1 + \sum_{i=1}^{L} (k-1) \cdot r_i\]

Exponential Dilation Schedule
\[r_i = 2^{i-1}\]

Gives receptive field: $RF = (k-1)(2^L - 1) + 1$ (e.g., 63 for $k=3$, $L=5$, matching the table above)
Output Size (with ‘same’ padding)
\[H_{out} = H_{in} \quad \text{(resolution preserved!)}\]

10.15 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Dilated convolutions expand<br/>receptive field without<br/>losing resolution"]
T2["Exponential dilation rates<br/>give exponential RF growth"]
T3["Same parameters as<br/>standard convolutions"]
T4["Essential for dense<br/>prediction tasks"]
T5["Used in segmentation,<br/>audio, and more"]
end
T1 --> C["Dilated convolutions solve<br/>the fundamental tension between<br/>resolution and receptive field,<br/>enabling dense prediction with<br/>global context awareness."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Dilated convolutions expand the receptive field exponentially by sampling inputs at regular intervals, enabling dense prediction networks to capture multi-scale context without sacrificing spatial resolution.
🎉 Part II Complete!
You’ve finished the Convolutional Neural Networks section. You now understand:
- How AlexNet started the revolution (Chapter 6)
- The complete CNN foundations from CS231n (Chapter 7)
- How ResNet enabled training 100+ layers (Chapter 8)
- Why identity mappings matter (Chapter 9)
- How dilated convolutions solve dense prediction (Chapter 10)
Next up: Part III - Sequence Models and Recurrent Networks, where we tackle sequential data with RNNs and LSTMs.
Exercises
1. Calculation: Calculate the receptive field of a network with 5 layers of 3×3 convolutions with dilation rates [1, 2, 4, 8, 16].
2. Implementation: Implement a context module as described in the paper and apply it to a semantic segmentation task.
3. Comparison: Compare the number of parameters and receptive field for (a) 7 standard 3×3 convs and (b) 7 dilated 3×3 convs with exponential dilation.
4. Analysis: Why might dilated convolutions cause gridding artifacts? Propose and test a solution.
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Yu & Koltun, 2015) | arXiv:1511.07122 |
| DeepLab (Uses Atrous Conv) | arXiv:1606.00915 |
| WaveNet (Dilated 1D Conv) | arXiv:1609.03499 |
| PSPNet (Multi-scale) | arXiv:1612.01105 |
| Understanding Dilated Convolutions | Blog Post |
| Hybrid Dilated Convolution | arXiv:1702.08502 |
Next Chapter: Chapter 11: The Unreasonable Effectiveness of RNNs — We begin Part III by exploring Andrej Karpathy’s famous blog post on how recurrent neural networks can generate surprisingly coherent text, code, and more.