Chapter 10: Multi-Scale Context Aggregation by Dilated Convolutions

“Dilated convolutions support exponentially expanding receptive fields without losing resolution.”

Based on: “Multi-Scale Context Aggregation by Dilated Convolutions” (Fisher Yu, Vladlen Koltun, 2015)

📄 Original Paper: arXiv:1511.07122 (ICLR 2016)

10.1 The Dense Prediction Problem

So far we’ve focused on image classification: one label per image. But many vision tasks require dense prediction: a label for every pixel.

graph LR
    subgraph "Classification"
        I1["Image"] --> L1["'cat'"]
    end
    
    subgraph "Dense Prediction"
        I2["Image"] --> L2["Label per pixel"]
    end
    
    E["Semantic Segmentation<br/>Depth Estimation<br/>Optical Flow"]
    
    L2 --> E
    
    style E fill:#ffe66d,color:#000

Dense prediction creates a fundamental tension:

  • Large receptive field: Need to see global context
  • High resolution: Need to preserve spatial detail

Standard CNNs sacrifice one for the other.


10.2 The Resolution Problem

What Happens in Classification CNNs

graph LR
    subgraph "Classification CNN"
        I["224×224"]
        C1["112×112"]
        C2["56×56"]
        C3["28×28"]
        C4["14×14"]
        C5["7×7"]
        FC["1×1<br/>(global pool)"]
    end
    
    I --> C1 --> C2 --> C3 --> C4 --> C5 --> FC
    
    P["Resolution lost at each stage!<br/>Fine details destroyed."]
    
    C5 --> P
    
    style P fill:#ff6b6b,color:#fff

Naive Solutions and Their Problems

| Approach | Problem |
| --- | --- |
| Remove pooling | Receptive field too small |
| Upsampling at end | Information already lost |
| Larger filters | Too many parameters |
| Skip connections (U-Net) | Complex, still loses some info |

10.3 Dilated Convolutions: The Key Idea

What Is Dilation?

A dilated convolution (also called an atrous convolution) inserts gaps between the filter's taps, spreading the same filter over a larger area:

graph TB
    subgraph "Standard 3×3 Conv (dilation=1)"
        S["■ ■ ■<br/>■ ■ ■<br/>■ ■ ■"]
        SR["Receptive field: 3×3"]
    end
    
    subgraph "Dilated 3×3 Conv (dilation=2)"
        D2["■ · ■ · ■<br/>· · · · ·<br/>■ · ■ · ■<br/>· · · · ·<br/>■ · ■ · ■"]
        D2R["Receptive field: 5×5"]
    end
    
    subgraph "Dilated 3×3 Conv (dilation=4)"
        D4["■ · · · ■ · · · ■<br/>· · · · · · · · ·<br/>· · · · · · · · ·<br/>· · · · · · · · ·<br/>■ · · · ■ · · · ■<br/>..."]
        D4R["Receptive field: 9×9"]
    end
    
    K["Same number of parameters,<br/>MUCH larger receptive field!"]
    
    D4 --> K
    
    style K fill:#ffe66d,color:#000

The Formula

For a 1D signal with filter w and dilation rate r:

\[(F *_r w)(p) = \sum_{s} F(p + r \cdot s) \cdot w(s)\]

The filter samples the input at intervals of r instead of 1.
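
A minimal sketch of the formula in PyTorch (names and values here are illustrative): a 3-tap filter with dilation r=2 samples the input two positions apart.

import torch
import torch.nn.functional as F

signal = torch.arange(10, dtype=torch.float32).view(1, 1, -1)   # x = 0, 1, ..., 9
weight = torch.tensor([[[1.0, 2.0, 3.0]]])                      # w = [1, 2, 3]

out = F.conv1d(signal, weight, dilation=2)   # taps land 2 positions apart
# First output: 1*x[0] + 2*x[2] + 3*x[4] = 0 + 4 + 12 = 16
print(out[0, 0, 0])   # tensor(16.)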


10.4 Exponentially Growing Receptive Fields

The Power of Stacked Dilations

Stack dilated convolutions with exponentially increasing rates:

graph TB
    subgraph "Stacked Dilated Convolutions"
        L0["Input"]
        L1["Layer 1: dilation=1<br/>RF: 3"]
        L2["Layer 2: dilation=2<br/>RF: 7"]
        L3["Layer 3: dilation=4<br/>RF: 15"]
        L4["Layer 4: dilation=8<br/>RF: 31"]
        L5["Layer 5: dilation=16<br/>RF: 63"]
    end
    
    L0 --> L1 --> L2 --> L3 --> L4 --> L5
    
    R["Receptive field grows EXPONENTIALLY<br/>while parameters grow LINEARLY!"]
    
    L5 --> R
    
    style R fill:#4ecdc4,color:#fff

Comparison: Standard vs Dilated

| Layers | Standard Conv RF | Dilated Conv RF |
| --- | --- | --- |
| 1 | 3 | 3 |
| 2 | 5 | 7 |
| 3 | 7 | 15 |
| 4 | 9 | 31 |
| 5 | 11 | 63 |
| 6 | 13 | 127 |

With the same depth and parameter count, the dilated stack reaches a receptive field roughly 10× larger (127 vs. 13 at six layers).
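
A small sketch reproducing the table above: for a stack of stride-1 convolutions, the receptive field accumulates as RF = 1 + Σ (k−1)·r_i, layer by layer.

def receptive_field(kernel_size, dilations):
    rf = 1
    fields = []
    for r in dilations:
        rf += (kernel_size - 1) * r   # each layer adds (k-1) * dilation
        fields.append(rf)
    return fields

print(receptive_field(3, [1] * 6))               # standard: [3, 5, 7, 9, 11, 13]
print(receptive_field(3, [1, 2, 4, 8, 16, 32]))  # dilated:  [3, 7, 15, 31, 63, 127]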


10.5 The Context Module

Architecture

The paper proposes a context module that can be appended to any CNN:

graph TB
    subgraph "Context Module"
        I["Input feature map<br/>(from base network)"]
        C1["3×3 Conv, dilation=1"]
        C2["3×3 Conv, dilation=2"]
        C3["3×3 Conv, dilation=4"]
        C4["3×3 Conv, dilation=8"]
        C5["3×3 Conv, dilation=16"]
        C6["3×3 Conv, dilation=1"]
        C7["1×1 Conv (output)"]
        O["Dense prediction"]
    end
    
    I --> C1 --> C2 --> C3 --> C4 --> C5 --> C6 --> C7 --> O
    
    N["7 layers capture context<br/>at exponentially increasing scales"]
    
    C5 --> N
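
A minimal PyTorch sketch of the context module above, assuming ReLU activations and an equal channel count in every layer (the paper's "basic" variant); channels and num_classes are placeholder arguments.

import torch.nn as nn

def context_module(channels, num_classes, dilations=(1, 2, 4, 8, 16, 1)):
    layers = []
    for d in dilations:
        layers += [nn.Conv2d(channels, channels, kernel_size=3,
                             padding=d, dilation=d),   # 'same' padding = dilation
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(channels, num_classes, kernel_size=1))  # output layer
    return nn.Sequential(*layers)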

Module Variants

| Variant | Description |
| --- | --- |
| Basic | All layers have the same number of channels |
| Large | More channels in middle layers |
| Front-end | VGG-16 adapted with dilations |

10.6 Adapting Classification Networks

Removing Pooling, Adding Dilation

The key insight: you can convert a classification CNN into a dense prediction network by:

  1. Removing the last pooling layers
  2. Replacing the subsequent convolutions with dilated convolutions
  3. Maintaining resolution throughout (a code sketch of the conversion follows the diagram below)

graph TB
    subgraph "VGG for Classification"
        V1["Conv1-2 (224)"]
        P1["Pool → 112"]
        V2["Conv3-4 (112)"]
        P2["Pool → 56"]
        V3["Conv5-7 (56)"]
        P3["Pool → 28"]
        V4["Conv8-10 (28)"]
        P4["Pool → 14"]
        V5["Conv11-13 (14)"]
        P5["Pool → 7"]
        FC["FC layers"]
    end
    
    subgraph "VGG for Dense Prediction"
        VD1["Conv1-2 (224)"]
        PD1["Pool → 112"]
        VD2["Conv3-4 (112)"]
        PD2["Pool → 56"]
        VD3["Conv5-7 (56)"]
        VD4["Conv8-10, dilation=2 (56)"]
        VD5["Conv11-13, dilation=4 (56)"]
        OUT["Output (56)"]
    end
    
    V1 --> P1 --> V2 --> P2 --> V3 --> P3 --> V4 --> P4 --> V5 --> P5 --> FC
    VD1 --> PD1 --> VD2 --> PD2 --> VD3 --> VD4 --> VD5 --> OUT
    
    K["Resolution maintained!<br/>Large receptive field via dilation."]
    
    OUT --> K
    
    style K fill:#ffe66d,color:#000
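
A minimal sketch of the recipe, not the paper's exact front-end: drop one late pooling layer and dilate the convolutions that followed it, so the pretrained weights still cover the same effective neighborhood (the paper removes the last two pooling layers and compensates with dilations 2 and 4). The layer indices assume a recent torchvision's VGG-16 ordering.

import torch.nn as nn
import torchvision

vgg = torchvision.models.vgg16(weights=None).features
vgg[23] = nn.Identity()           # remove a late MaxPool2d (index per torchvision layout)
for idx in (24, 26, 28):          # the three convolutions that followed that pool
    vgg[idx].dilation = (2, 2)    # sample twice as far apart...
    vgg[idx].padding = (2, 2)     # ...and pad to keep the spatial size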

10.7 Why Dilation Works

Theoretical Justification

graph TB
    subgraph "Multi-Scale Processing"
        S1["Scale 1: Fine details<br/>(dilation=1)"]
        S2["Scale 2: Local context<br/>(dilation=2,4)"]
        S3["Scale 3: Global context<br/>(dilation=8,16)"]
    end
    
    A["All scales combined<br/>in single forward pass"]
    
    S1 --> A
    S2 --> A
    S3 --> A
    
    style A fill:#ffe66d,color:#000

Information Aggregation

Each output pixel aggregates information from:

  • Nearby pixels: Fine-grained appearance
  • Medium distance: Object parts and structure
  • Far away: Scene context and global semantics

10.8 The Gridding Problem

What Is Gridding?

A subtle issue: dilated convolutions can cause gridding artifacts:

graph TB
    subgraph "Gridding Problem"
        D["Dilation = 2"]
        G["Grid pattern in activations:<br/>■ · ■ · ■<br/>· · · · ·<br/>■ · ■ · ■"]
        P["Not all pixels equally<br/>contribute to output"]
    end
    
    D --> G --> P
    
    style P fill:#ff6b6b,color:#fff

Solutions

  1. Hybrid Dilated Convolution (HDC): use dilation rates that share no common factor (sketched after the diagram below)
  2. Degridding layers: finish with standard (dilation=1) convolutions
  3. Multi-scale fusion: combine branches with different dilation rates

graph LR
    subgraph "HDC Pattern"
        H["1, 2, 3 instead of 1, 2, 4"]
        E["No common divisor > 1<br/>→ All pixels covered"]
    end
    
    H --> E
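
A minimal sketch of solution 1: a block whose dilation rates share no common factor, so consecutive layers fill in each other's sampling gaps. The rates shown are one possible HDC-style choice, not a prescribed schedule.

import torch.nn as nn

def hdc_block(channels, rates=(1, 2, 3)):    # rates with no common factor > 1
    layers = []
    for r in rates:
        layers += [nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)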

10.9 Results and Applications

Semantic Segmentation Results

xychart-beta
    title "Pascal VOC 2012 Segmentation (mIoU %)"
    x-axis ["FCN-8s", "DeepLab", "This Paper", "This + CRF"]
    y-axis "mIoU %" 60 --> 80
    bar [62.2, 71.6, 73.5, 75.3]

Where Dilated Convolutions Are Used

graph TB
    subgraph "Applications"
        S["Semantic Segmentation<br/>(DeepLab, PSPNet)"]
        D["Depth Estimation"]
        O["Object Detection<br/>(Feature Pyramid Networks)"]
        A["Audio (WaveNet)"]
        M["Medical Imaging"]
    end
    
    DC["Dilated<br/>Convolutions"]
    
    DC --> S
    DC --> D
    DC --> O
    DC --> A
    DC --> M
    
    style DC fill:#4ecdc4,color:#fff

10.10 WaveNet: Dilated Convolutions for Audio

A Surprising Application

DeepMind’s WaveNet used dilated 1D convolutions for autoregressive audio generation:

graph TB
    subgraph "WaveNet Architecture"
        I["Input samples"]
        L1["Dilated Conv, r=1"]
        L2["Dilated Conv, r=2"]
        L3["Dilated Conv, r=4"]
        L4["Dilated Conv, r=8"]
        L5["... r=512"]
        O["Next sample prediction"]
    end
    
    I --> L1 --> L2 --> L3 --> L4 --> L5 --> O
    
    R["Receptive field of thousands<br/>of samples = seconds of audio"]
    
    L5 --> R
    
    style R fill:#ffe66d,color:#000

This allows modeling long-range dependencies in sequential data—a precursor to the attention mechanisms we’ll see in Part IV!
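
A minimal sketch of the idea, not DeepMind's implementation: a causal dilated 1D stack, where left-padding by (k−1)·r keeps each output from seeing future samples. With kernel size 2 and dilations doubling up to 512, ten layers reach a receptive field of 1,024 samples.

import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    def __init__(self, channels, dilation, kernel_size=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # pad on the left only
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

stack = nn.Sequential(*[CausalDilatedConv1d(64, 2 ** i) for i in range(10)])
# Dilations 1, 2, 4, ..., 512 give a receptive field of 1,024 samples.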


10.11 Comparison with Other Approaches

Dense Prediction Methods

graph TB
    subgraph "Approaches to Dense Prediction"
        E["Encoder-Decoder<br/>(U-Net)"]
        D["Dilated Convolutions<br/>(This paper)"]
        P["Pyramid Pooling<br/>(PSPNet)"]
        A["Attention<br/>(Later work)"]
    end
    
    subgraph "Trade-offs"
        E --> TE["+ Skip connections<br/>- Complex architecture"]
        D --> TD["+ Simple modification<br/>- Gridding artifacts"]
        P --> TP["+ Multi-scale features<br/>- Fixed scales"]
        A --> TA["+ Adaptive<br/>- Expensive"]
    end

Modern Best Practice

Most state-of-the-art models combine these approaches:

  • Dilated convolutions in backbone
  • Multi-scale pooling
  • Skip connections
  • Sometimes attention

10.12 Implementation Details

Dilated Convolution in Code

# PyTorch dilated convolution
import torch.nn as nn

# Standard convolution
conv_standard = nn.Conv2d(
    in_channels=64, 
    out_channels=64,
    kernel_size=3, 
    padding=1,
    dilation=1  # default
)

# Dilated convolution (rate=2)
conv_dilated = nn.Conv2d(
    in_channels=64,
    out_channels=64, 
    kernel_size=3,
    padding=2,      # padding = dilation keeps 'same' size for a 3×3 kernel
    dilation=2
)

# Both have same number of parameters!
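
A quick check (sketch, continuing from the layers defined above): both have identical parameter counts, and with padding = dilation each preserves the spatial size.

import torch

x = torch.randn(1, 64, 64, 64)
n_std = sum(p.numel() for p in conv_standard.parameters())
n_dil = sum(p.numel() for p in conv_dilated.parameters())
print(n_std, n_dil)                                     # 36928 36928
print(conv_standard(x).shape, conv_dilated(x).shape)    # both (1, 64, 64, 64)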

Padding for Dilated Convolutions

For “same” padding with dilation rate r and kernel size k:

\[\text{padding} = \frac{(k - 1) \cdot r}{2}\]

For 3×3 kernel:

  • dilation=1 → padding=1
  • dilation=2 → padding=2
  • dilation=4 → padding=4
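
A one-line helper (sketch) for the 'same'-padding rule above; valid for odd kernel sizes.

def same_padding(kernel_size, dilation):
    return (kernel_size - 1) * dilation // 2

print([same_padding(3, r) for r in (1, 2, 4)])   # [1, 2, 4]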

10.13 Connection to Other Chapters

graph TB
    CH10["Chapter 10<br/>Dilated Convolutions"]
    
    CH10 --> CH7["Chapter 7: CS231n<br/><i>Receptive field basics</i>"]
    CH10 --> CH8["Chapter 8: ResNet<br/><i>Combined with dilations<br/>in modern nets</i>"]
    CH10 --> CH16["Chapter 16: Transformers<br/><i>Alternative for<br/>global context</i>"]
    CH10 --> CH24["Chapter 24: Deep Speech 2<br/><i>Also needs long-range<br/>dependencies</i>"]
    
    style CH10 fill:#ff6b6b,color:#fff

10.14 Key Equations Summary

Dilated Convolution (1D)

\[(F *_r w)(p) = \sum_{s} F(p + r \cdot s) \cdot w(s)\]

Receptive Field Growth

For $L$ layers with dilation rates $r_1, r_2, \dots, r_L$ and kernel size $k$:

\[RF = 1 + \sum_{i=1}^{L} (k-1) \cdot r_i\]

Exponential Dilation Schedule

\[r_i = 2^{i-1}\]

Gives receptive field: $RF = (2^L - 1) \cdot (k-1) + 1$. For $k=3$ and $L=5$, this is $31 \cdot 2 + 1 = 63$, matching the table in Section 10.4.

Output Size (with ‘same’ padding)

\[H_{out} = H_{in} \quad \text{(resolution preserved!)}\]

10.15 Chapter Summary

graph TB
    subgraph "Key Takeaways"
        T1["Dilated convolutions expand<br/>receptive field without<br/>losing resolution"]
        T2["Exponential dilation rates<br/>give exponential RF growth"]
        T3["Same parameters as<br/>standard convolutions"]
        T4["Essential for dense<br/>prediction tasks"]
        T5["Used in segmentation,<br/>audio, and more"]
    end
    
    T1 --> C["Dilated convolutions solve<br/>the fundamental tension between<br/>resolution and receptive field,<br/>enabling dense prediction with<br/>global context awareness."]
    T2 --> C
    T3 --> C
    T4 --> C
    T5 --> C
    
    style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px

In One Sentence

Dilated convolutions expand the receptive field exponentially by sampling inputs at regular intervals, enabling dense prediction networks to capture multi-scale context without sacrificing spatial resolution.


🎉 Part II Complete!

You’ve finished the Convolutional Neural Networks section. You now understand:

  • How AlexNet started the revolution (Chapter 6)
  • The complete CNN foundations from CS231n (Chapter 7)
  • How ResNet enabled training 100+ layers (Chapter 8)
  • Why identity mappings matter (Chapter 9)
  • How dilated convolutions solve dense prediction (Chapter 10)

Next up: Part III - Sequence Models and Recurrent Networks, where we tackle sequential data with RNNs and LSTMs.


Exercises

  1. Calculation: Calculate the receptive field of a network with 5 layers of 3×3 convolutions with dilation rates [1, 2, 4, 8, 16].

  2. Implementation: Implement a context module as described in the paper and apply it to a semantic segmentation task.

  3. Comparison: Compare the number of parameters and receptive field for (a) 7 standard 3×3 convs and (b) 7 dilated 3×3 convs with exponential dilation.

  4. Analysis: Why might dilated convolutions cause gridding artifacts? Propose and test a solution.


References & Further Reading

| Resource | Link |
| --- | --- |
| Original Paper (Yu & Koltun, 2015) | arXiv:1511.07122 |
| DeepLab (Uses Atrous Conv) | arXiv:1606.00915 |
| WaveNet (Dilated 1D Conv) | arXiv:1609.03499 |
| PSPNet (Multi-scale) | arXiv:1612.01105 |
| Understanding Dilated Convolutions | Blog Post |
| Hybrid Dilated Convolution | arXiv:1702.08502 |

Next Chapter: Chapter 11: The Unreasonable Effectiveness of RNNs — We begin Part III by exploring Andrej Karpathy’s famous blog post on how recurrent neural networks can generate surprisingly coherent text, code, and more.




Educational content based on public research papers. All original papers are cited with links to their sources.