Chapter 10: Multi-Scale Context Aggregation by Dilated Convolutions
“Dilated convolutions support exponentially expanding receptive fields without losing resolution.”
Based on: “Multi-Scale Context Aggregation by Dilated Convolutions” (Fisher Yu, Vladlen Koltun, 2015)
| 📄 Original Paper: arXiv:1511.07122 | ICLR 2016 |
10.1 The Dense Prediction Problem
So far we’ve focused on image classification: one label per image. But many vision tasks require dense prediction: a label for every pixel.
graph LR
subgraph "Classification"
I1["Image"] --> L1["'cat'"]
end
subgraph "Dense Prediction"
I2["Image"] --> L2["Label per pixel"]
end
E["Semantic Segmentation<br/>Depth Estimation<br/>Optical Flow"]
L2 --> E
style E fill:#ffe66d,color:#000
Dense prediction creates a fundamental tension:
- Large receptive field: Need to see global context
- High resolution: Need to preserve spatial detail
Standard CNNs sacrifice one for the other.
10.2 The Resolution Problem
What Happens in Classification CNNs
graph LR
subgraph "Classification CNN"
I["224×224"]
C1["112×112"]
C2["56×56"]
C3["28×28"]
C4["14×14"]
C5["7×7"]
FC["1×1<br/>(global pool)"]
end
I --> C1 --> C2 --> C3 --> C4 --> C5 --> FC
P["Resolution lost at each stage!<br/>Fine details destroyed."]
C5 --> P
style P fill:#ff6b6b,color:#fff
Naive Solutions and Their Problems
| Approach | Problem |
|---|---|
| Remove pooling | Receptive field too small |
| Upsampling at end | Information already lost |
| Larger filters | Too many parameters |
| Skip connections (U-Net) | Complex, still loses some info |
10.3 Dilated Convolutions: The Key Idea
What Is Dilation?
A dilated convolution (also called atrous convolution) spreads out the filter:
graph TB
subgraph "Standard 3×3 Conv (dilation=1)"
S["■ ■ ■<br/>■ ■ ■<br/>■ ■ ■"]
SR["Receptive field: 3×3"]
end
subgraph "Dilated 3×3 Conv (dilation=2)"
D2["■ · ■ · ■<br/>· · · · ·<br/>■ · ■ · ■<br/>· · · · ·<br/>■ · ■ · ■"]
D2R["Receptive field: 5×5"]
end
subgraph "Dilated 3×3 Conv (dilation=4)"
D4["■ · · · ■ · · · ■<br/>· · · · · · · · ·<br/>· · · · · · · · ·<br/>· · · · · · · · ·<br/>■ · · · ■ · · · ■<br/>..."]
D4R["Receptive field: 9×9"]
end
K["Same number of parameters,<br/>MUCH larger receptive field!"]
D4 --> K
style K fill:#ffe66d,color:#000
The Formula
For a 1D signal with filter w and dilation rate r:
\[(F *_r w)(p) = \sum_{s} F(p + r \cdot s) \cdot w(s)\]

The filter samples the input at intervals of r instead of 1.
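The formula can be checked numerically. A minimal sketch of 1D dilated convolution in plain Python (cross-correlation form, as deep learning frameworks use, over valid positions only):

```python
# Minimal 1D dilated cross-correlation matching the formula:
# (F *_r w)(p) = sum_s F(p + r*s) * w(s)
def dilated_conv1d(F, w, r):
    k = len(w)
    span = (k - 1) * r + 1  # effective filter extent on the input
    return [sum(F[p + r * s] * w[s] for s in range(k))
            for p in range(len(F) - span + 1)]

F = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]
print(dilated_conv1d(F, w, 1))  # taps 1 apart -> [-2, -2, -2, -2, -2]
print(dilated_conv1d(F, w, 2))  # taps 2 apart, extent 5 -> [-4, -4, -4]
```

With r=2 the same 3-tap filter spans 5 input samples, so fewer valid positions remain but each output sees a wider window.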
10.4 Exponentially Growing Receptive Fields
The Power of Stacked Dilations
Stack dilated convolutions with exponentially increasing rates:
graph TB
subgraph "Stacked Dilated Convolutions"
L0["Input"]
L1["Layer 1: dilation=1<br/>RF: 3"]
L2["Layer 2: dilation=2<br/>RF: 7"]
L3["Layer 3: dilation=4<br/>RF: 15"]
L4["Layer 4: dilation=8<br/>RF: 31"]
L5["Layer 5: dilation=16<br/>RF: 63"]
end
L0 --> L1 --> L2 --> L3 --> L4 --> L5
R["Receptive field grows EXPONENTIALLY<br/>while parameters grow LINEARLY!"]
L5 --> R
style R fill:#4ecdc4,color:#fff
Comparison: Standard vs Dilated
| Layers | Standard Conv RF | Dilated Conv RF |
|---|---|---|
| 1 | 3 | 3 |
| 2 | 5 | 7 |
| 3 | 7 | 15 |
| 4 | 9 | 31 |
| 5 | 11 | 63 |
| 6 | 13 | 127 |
With the same depth and parameters, dilated convolutions achieve 10× larger receptive fields!
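The table's values follow from the recurrence $RF_l = RF_{l-1} + (k-1) \cdot r_l$; a short script reproduces both columns:

```python
# Receptive field of stacked k x k convs: RF_l = RF_{l-1} + (k-1) * r_l
def receptive_field(dilations, k=3):
    rf = 1
    for r in dilations:
        rf += (k - 1) * r
    return rf

depths = range(1, 7)
standard = [receptive_field([1] * L) for L in depths]
dilated = [receptive_field([2 ** i for i in range(L)]) for L in depths]
print(standard)  # [3, 5, 7, 9, 11, 13]
print(dilated)   # [3, 7, 15, 31, 63, 127]
```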
10.5 The Context Module
Architecture
The paper proposes a context module that can be appended to any CNN:
graph TB
subgraph "Context Module"
I["Input feature map<br/>(from base network)"]
C1["3×3 Conv, dilation=1"]
C2["3×3 Conv, dilation=2"]
C3["3×3 Conv, dilation=4"]
C4["3×3 Conv, dilation=8"]
C5["3×3 Conv, dilation=16"]
C6["3×3 Conv, dilation=1"]
C7["1×1 Conv (output)"]
O["Dense prediction"]
end
I --> C1 --> C2 --> C3 --> C4 --> C5 --> C6 --> C7 --> O
N["7 layers capture context<br/>at exponentially increasing scales"]
C5 --> N
Module Variants
| Variant | Description |
|---|---|
| Basic | All layers have same channels |
| Large | More channels in middle layers |
| Front-end | VGG-16 adapted with dilations |
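A minimal PyTorch sketch of the basic context module, following the dilation schedule in the diagram above. The channel width `C` is a placeholder and initialization details (the paper uses identity initialization) are omitted:

```python
import torch
import torch.nn as nn

def context_module(C=64):
    """Basic context module sketch: 3x3 convs with dilations 1,2,4,8,16,
    a final 3x3 rate-1 layer, then a linear 1x1 output conv. Channel
    width C is kept constant, as in the paper's 'basic' variant."""
    rates = [1, 2, 4, 8, 16, 1]
    layers = []
    for r in rates:
        layers += [nn.Conv2d(C, C, 3, padding=r, dilation=r),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(C, C, 1))  # 1x1 output conv
    return nn.Sequential(*layers)

m = context_module(64)
x = torch.randn(1, 64, 56, 56)
print(m(x).shape)  # torch.Size([1, 64, 56, 56]) -- resolution preserved
```

Because `padding=dilation` at every layer, the module can be appended to any backbone without changing feature-map size.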
10.6 Adapting Classification Networks
Removing Pooling, Adding Dilation
The key insight: you can convert a classification CNN into a dense prediction network by:
- Removing the last pooling layers
- Replacing the subsequent convolutions with dilated convolutions
- Maintaining resolution throughout
graph TB
subgraph "VGG for Classification"
V1["Conv1-2 (224)"]
P1["Pool → 112"]
V2["Conv3-4 (112)"]
P2["Pool → 56"]
V3["Conv5-7 (56)"]
P3["Pool → 28"]
V4["Conv8-10 (28)"]
P4["Pool → 14"]
V5["Conv11-13 (14)"]
P5["Pool → 7"]
FC["FC layers"]
end
subgraph "VGG for Dense Prediction"
VD1["Conv1-2 (224)"]
PD1["Pool → 112"]
VD2["Conv3-4 (112)"]
PD2["Pool → 56"]
VD3["Conv5-7 (56)"]
VD4["Conv8-10, dilation=2 (56)"]
VD5["Conv11-13, dilation=4 (56)"]
OUT["Output (56)"]
end
V1 --> P1 --> V2 --> P2 --> V3 --> P3 --> V4 --> P4 --> V5 --> P5 --> FC
VD1 --> PD1 --> VD2 --> PD2 --> VD3 --> VD4 --> VD5 --> OUT
K["Resolution maintained!<br/>Large receptive field via dilation."]
OUT --> K
style K fill:#ffe66d,color:#000
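The conversion can be sketched for a single stage: drop a stride-2 pool and dilate the convolutions that followed it, so their receptive field on the input is unchanged while resolution is kept. Layer shapes and channel counts here are illustrative, not VGG's exact configuration:

```python
import torch
import torch.nn as nn

# Classification stage: pool halves resolution, then standard 3x3 convs.
stage_classification = nn.Sequential(
    nn.MaxPool2d(2),                    # 56 -> 28
    nn.Conv2d(256, 512, 3, padding=1),
    nn.Conv2d(512, 512, 3, padding=1),
)

# Dense-prediction stage: pool removed; convs dilated by the lost stride (2).
stage_dense = nn.Sequential(
    nn.Conv2d(256, 512, 3, padding=2, dilation=2),
    nn.Conv2d(512, 512, 3, padding=2, dilation=2),
)

x = torch.randn(1, 256, 56, 56)
print(stage_classification(x).shape)  # torch.Size([1, 512, 28, 28])
print(stage_dense(x).shape)           # torch.Size([1, 512, 56, 56])
```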
10.7 Why Dilation Works
Theoretical Justification
graph TB
subgraph "Multi-Scale Processing"
S1["Scale 1: Fine details<br/>(dilation=1)"]
S2["Scale 2: Local context<br/>(dilation=2,4)"]
S3["Scale 3: Global context<br/>(dilation=8,16)"]
end
A["All scales combined<br/>in single forward pass"]
S1 --> A
S2 --> A
S3 --> A
style A fill:#ffe66d,color:#000
Information Aggregation
Each output pixel aggregates information from:
- Nearby pixels: Fine-grained appearance
- Medium distance: Object parts and structure
- Far away: Scene context and global semantics
10.8 The Gridding Problem
What Is Gridding?
A subtle issue: dilated convolutions can cause gridding artifacts:
graph TB
subgraph "Gridding Problem"
D["Dilation = 2"]
G["Grid pattern in activations:<br/>■ · ■ · ■<br/>· · · · ·<br/>■ · ■ · ■"]
P["Not all pixels equally<br/>contribute to output"]
end
D --> G --> P
style P fill:#ff6b6b,color:#fff
Solutions
- Hybrid Dilated Convolution (HDC): Use non-uniform dilation rates
- Return to dilation=1: Final layers with standard convolution
- Multi-scale fusion: Combine different dilation branches
graph LR
subgraph "HDC Pattern"
H["1, 2, 3 instead of 1, 2, 4"]
E["No common divisor > 1<br/>→ All pixels covered"]
end
H --> E
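The "no common divisor" intuition can be verified by tracking which input offsets actually reach an output through a stack of dilated layers (shown here in 1D with kernel size 3, a simplified sketch of the coverage argument):

```python
# Which input offsets contribute to one output through stacked dilated convs?
def covered_offsets(dilations, k=3):
    offsets = {0}
    for r in dilations:
        offsets = {o + r * s for o in offsets for s in range(k)}
    return offsets

# Repeated even rates leave gaps (gridding); HDC-style rates cover everything.
for rates in ([2, 2, 2], [1, 2, 3]):
    cov = covered_offsets(rates)
    holes = [o for o in range(max(cov) + 1) if o not in cov]
    print(rates, "holes:", holes)
```

With rates `[2, 2, 2]` every odd offset is a hole, i.e. those pixels never influence the output; with `[1, 2, 3]` the span is covered completely.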
10.9 Results and Applications
Semantic Segmentation Results
xychart-beta
title "Pascal VOC 2012 Segmentation (mIoU %)"
x-axis ["FCN-8s", "DeepLab", "This Paper", "This + CRF"]
y-axis "mIoU %" 60 --> 80
bar [62.2, 71.6, 73.5, 75.3]
Where Dilated Convolutions Are Used
graph TB
subgraph "Applications"
S["Semantic Segmentation<br/>(DeepLab, PSPNet)"]
D["Depth Estimation"]
O["Object Detection<br/>(Feature Pyramid Networks)"]
A["Audio (WaveNet)"]
M["Medical Imaging"]
end
DC["Dilated<br/>Convolutions"]
DC --> S
DC --> D
DC --> O
DC --> A
DC --> M
style DC fill:#4ecdc4,color:#fff
10.10 WaveNet: Dilated Convolutions for Audio
A Surprising Application
Google’s WaveNet used dilated 1D convolutions for audio generation:
graph TB
subgraph "WaveNet Architecture"
I["Input samples"]
L1["Dilated Conv, r=1"]
L2["Dilated Conv, r=2"]
L3["Dilated Conv, r=4"]
L4["Dilated Conv, r=8"]
L5["... r=512"]
O["Next sample prediction"]
end
I --> L1 --> L2 --> L3 --> L4 --> L5 --> O
R["Receptive field of thousands<br/>of samples = seconds of audio"]
L5 --> R
style R fill:#ffe66d,color:#000
This allows modeling long-range dependencies in sequential data—a precursor to the attention mechanisms we’ll see in Part IV!
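A hedged sketch of a WaveNet-style stack in PyTorch: causal dilated 1D convolutions with exponentially growing rates. The real WaveNet adds gated activations, residual and skip connections, which are omitted here:

```python
import torch
import torch.nn as nn

class CausalDilatedStack(nn.Module):
    """Causal dilated 1D convs with rates 1, 2, 4, ..., 2^(layers-1).
    Left-padding ensures each output depends only on past/current samples."""
    def __init__(self, channels=8, layers=10, k=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, k, dilation=2 ** i)
            for i in range(layers)
        )
        self.pads = [(k - 1) * 2 ** i for i in range(layers)]

    def forward(self, x):
        for conv, pad in zip(self.convs, self.pads):
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))  # causal pad
        return x

net = CausalDilatedStack(channels=8, layers=10)
x = torch.randn(1, 8, 2000)
print(net(x).shape)  # length preserved; RF = 1 + sum((k-1)*2^i) = 1024 samples
```

Ten layers with kernel size 2 already cover 1024 past samples; WaveNet repeats such blocks to reach receptive fields of many thousands.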
10.11 Comparison with Other Approaches
Dense Prediction Methods
graph TB
subgraph "Approaches to Dense Prediction"
E["Encoder-Decoder<br/>(U-Net)"]
D["Dilated Convolutions<br/>(This paper)"]
P["Pyramid Pooling<br/>(PSPNet)"]
A["Attention<br/>(Later work)"]
end
subgraph "Trade-offs"
E --> TE["+ Skip connections<br/>- Complex architecture"]
D --> TD["+ Simple modification<br/>- Gridding artifacts"]
P --> TP["+ Multi-scale features<br/>- Fixed scales"]
A --> TA["+ Adaptive<br/>- Expensive"]
end
Modern Best Practice
Most state-of-the-art models combine these approaches:
- Dilated convolutions in backbone
- Multi-scale pooling
- Skip connections
- Sometimes attention
10.12 Implementation Details
Dilated Convolution in Code
```python
# PyTorch dilated convolution
import torch.nn as nn

# Standard convolution
conv_standard = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=1,
    dilation=1,  # default
)

# Dilated convolution (rate=2)
conv_dilated = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=2,  # padding = dilation for 'same' output with a 3x3 kernel
    dilation=2,
)

# Both have the same number of parameters!
```
Padding for Dilated Convolutions
For “same” padding with dilation rate r and kernel size k:
\[\text{padding} = \frac{(k - 1) \cdot r}{2}\]

For a 3×3 kernel:
- dilation=1 → padding=1
- dilation=2 → padding=2
- dilation=4 → padding=4
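A quick check that this padding rule preserves spatial size at every rate, and that the parameter count is unchanged (using the same PyTorch layers as above):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
for r in (1, 2, 4):
    pad = (3 - 1) * r // 2  # equals r for a 3x3 kernel
    conv = nn.Conv2d(64, 64, 3, padding=pad, dilation=r)
    assert conv(x).shape == x.shape  # 'same' resolution at every rate
    # 64*64*9 weights + 64 biases, regardless of dilation:
    print(r, sum(p.numel() for p in conv.parameters()))
```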
10.13 Connection to Other Chapters
graph TB
CH10["Chapter 10<br/>Dilated Convolutions"]
CH10 --> CH7["Chapter 7: CS231n<br/><i>Receptive field basics</i>"]
CH10 --> CH8["Chapter 8: ResNet<br/><i>Combined with dilations<br/>in modern nets</i>"]
CH10 --> CH16["Chapter 16: Transformers<br/><i>Alternative for<br/>global context</i>"]
CH10 --> CH24["Chapter 24: Deep Speech 2<br/><i>Also needs long-range<br/>dependencies</i>"]
style CH10 fill:#ff6b6b,color:#fff
10.14 Key Equations Summary
Dilated Convolution (1D)
\[(F *_r w)(p) = \sum_{s} F(p + r \cdot s) \cdot w(s)\]

Receptive Field Growth
For L layers with dilation rates $r_1, r_2, …, r_L$ and kernel size k:
\[RF = 1 + \sum_{i=1}^{L} (k-1) \cdot r_i\]

Exponential Dilation Schedule
\[r_i = 2^{i-1}\]

Gives receptive field: $RF = (k-1)(2^L - 1) + 1$ (e.g., 63 for $k=3$, $L=5$, matching the table above)
Output Size (with ‘same’ padding)
\[H_{out} = H_{in} \quad \text{(resolution preserved!)}\]

10.15 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Dilated convolutions expand<br/>receptive field without<br/>losing resolution"]
T2["Exponential dilation rates<br/>give exponential RF growth"]
T3["Same parameters as<br/>standard convolutions"]
T4["Essential for dense<br/>prediction tasks"]
T5["Used in segmentation,<br/>audio, and more"]
end
T1 --> C["Dilated convolutions solve<br/>the fundamental tension between<br/>resolution and receptive field,<br/>enabling dense prediction with<br/>global context awareness."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Dilated convolutions expand the receptive field exponentially by sampling inputs at regular intervals, enabling dense prediction networks to capture multi-scale context without sacrificing spatial resolution.
🎉 Part II Complete!
You’ve finished the Convolutional Neural Networks section. You now understand:
- How AlexNet started the revolution (Chapter 6)
- The complete CNN foundations from CS231n (Chapter 7)
- How ResNet enabled training 100+ layers (Chapter 8)
- Why identity mappings matter (Chapter 9)
- How dilated convolutions solve dense prediction (Chapter 10)
Next up: Part III - Sequence Models and Recurrent Networks, where we tackle sequential data with RNNs and LSTMs.
Exercises
1. Calculation: Calculate the receptive field of a network with 5 layers of 3×3 convolutions with dilation rates [1, 2, 4, 8, 16].
2. Implementation: Implement a context module as described in the paper and apply it to a semantic segmentation task.
3. Comparison: Compare the number of parameters and receptive field for (a) 7 standard 3×3 convs and (b) 7 dilated 3×3 convs with exponential dilation.
4. Analysis: Why might dilated convolutions cause gridding artifacts? Propose and test a solution.
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Yu & Koltun, 2015) | arXiv:1511.07122 |
| DeepLab (Uses Atrous Conv) | arXiv:1606.00915 |
| WaveNet (Dilated 1D Conv) | arXiv:1609.03499 |
| PSPNet (Multi-scale) | arXiv:1612.01105 |
| Understanding Dilated Convolutions | Blog Post |
| Hybrid Dilated Convolution | arXiv:1702.08502 |
Next Chapter: Chapter 11: The Unreasonable Effectiveness of RNNs — We begin Part III by exploring Andrej Karpathy’s famous blog post on how recurrent neural networks can generate surprisingly coherent text, code, and more.