Chapter 8: Deep Residual Learning for Image Recognition
“We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously.”
Based on: “Deep Residual Learning for Image Recognition” (Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, 2015)
| 📄 Original Paper: arXiv:1512.03385 | CVPR 2016 Best Paper |
8.1 The Depth Revolution
After AlexNet’s 8 layers (Chapter 6), researchers raced to go deeper. VGGNet reached 19 layers. But then something strange happened:
Adding more layers made performance WORSE.
This wasn’t overfitting—even training error increased! There was something fundamentally limiting about deep networks.
graph LR
subgraph "The Paradox (2015)"
A["20-layer network<br/>Training error: 10%"]
B["56-layer network<br/>Training error: 12%"]
end
Q["Why does more depth<br/>hurt TRAINING error?"]
A --> Q
B --> Q
style Q fill:#ff6b6b,color:#fff
ResNet solved this problem and enabled training networks with 152 layers—and even 1000+ layers in experiments.
8.2 The Degradation Problem
Not Overfitting—Something Deeper
xychart-beta
title "The Degradation Problem"
x-axis "Iterations" [0, 20, 40, 60, 80, 100]
y-axis "Error %" 0 --> 40
line "20-layer train" [35, 18, 12, 10, 9, 8]
line "20-layer test" [38, 22, 16, 14, 13, 12]
line "56-layer train" [38, 25, 18, 15, 14, 13]
line "56-layer test" [40, 28, 22, 19, 18, 17]
Key observation: The 56-layer network has higher training error than the 20-layer network. This rules out overfitting!
The Identity Mapping Argument
Theoretically, a deeper network should never be worse:
If a shallow network achieves some accuracy, a deeper network could just learn identity mappings for the extra layers and match it.
graph LR
subgraph "Theoretical Construction"
S["20-layer network<br/>(good performance)"]
I["+ 36 identity layers<br/>(pass input through)"]
D["= 56-layer network<br/>(should match performance)"]
end
S --> I --> D
P["But in practice,<br/>networks can't learn this!"]
D --> P
style P fill:#ff6b6b,color:#fff
The problem: Stacked nonlinear layers have difficulty learning identity mappings.
8.3 The Residual Learning Framework
The Key Insight
Instead of learning H(x) directly, learn the residual F(x) = H(x) - x:
\[H(x) = F(x) + x\]
graph LR
subgraph "Traditional Block"
X1["x"] --> L1["Conv-ReLU-Conv"]
L1 --> H1["H(x)"]
end
subgraph "Residual Block"
X2["x"] --> L2["Conv-ReLU-Conv"]
X2 --> ADD["⊕"]
L2 -->|"F(x)"| ADD
ADD --> H2["H(x) = F(x) + x"]
end
K["The skip connection<br/>makes identity easy:<br/>just set F(x) = 0"]
ADD --> K
style K fill:#ffe66d,color:#000
Why Residuals Are Easier to Learn
graph TB
subgraph "Learning Identity"
T["Traditional: Learn H(x) = x<br/>Need weights to compute identity"]
R["Residual: Learn F(x) = 0<br/>Just push weights toward zero"]
end
R --> E["Much easier optimization!<br/>Weights naturally initialize near zero"]
style E fill:#4ecdc4,color:#fff
If the optimal function is close to identity, the residual F(x) is close to zero—which is easy to learn.
8.4 The Residual Block Architecture
Basic Building Block
graph TB
subgraph "Basic Residual Block"
X["x"]
C1["3×3 Conv, 64"]
BN1["Batch Norm"]
R1["ReLU"]
C2["3×3 Conv, 64"]
BN2["Batch Norm"]
ADD["⊕"]
R2["ReLU"]
OUT["output"]
end
X --> C1 --> BN1 --> R1 --> C2 --> BN2 --> ADD
X -->|"identity shortcut"| ADD
ADD --> R2 --> OUT
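A minimal PyTorch sketch of the basic block diagrammed above, assuming the input and output channel counts match (names like conv1 are illustrative, not torchvision's exact implementation):

import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # save input for the shortcut
        out = self.relu(self.bn1(self.conv1(x)))  # Conv -> BN -> ReLU
        out = self.bn2(self.conv2(out))           # Conv -> BN
        return self.relu(out + identity)          # add shortcut, then ReLU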
Bottleneck Block (For Deeper Networks)
For ResNet-50 and beyond, use a bottleneck design:
graph TB
subgraph "Bottleneck Block"
X["x (256 channels)"]
C1["1×1 Conv, 64<br/>(reduce)"]
BN1["BN + ReLU"]
C2["3×3 Conv, 64<br/>(process)"]
BN2["BN + ReLU"]
C3["1×1 Conv, 256<br/>(expand)"]
BN3["Batch Norm"]
ADD["⊕"]
R["ReLU"]
OUT["output (256 channels)"]
end
X --> C1 --> BN1 --> C2 --> BN2 --> C3 --> BN3 --> ADD
X -->|"identity"| ADD
ADD --> R --> OUT
E["1×1 convs reduce/expand channels<br/>3×3 conv works in low dimension<br/>→ Saves computation"]
style E fill:#ffe66d,color:#000
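A sketch of the bottleneck block under the same assumptions (256 input channels, a 64-channel middle, as in the diagram):

import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),        # 1x1 reduce
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False),  # 3x3 process
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),        # 1x1 expand
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + x)  # identity shortcut, then ReLU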
Handling Dimension Changes
When spatial dimensions or channel counts change, the identity can no longer be added directly; a 1×1 convolution projects the shortcut to the matching shape:
graph TB
subgraph "Projection Shortcut"
X["x (64 ch, 56×56)"]
CONV["Conv layers<br/>(→ 128 ch, 28×28)"]
PROJ["1×1 Conv, stride 2<br/>(projection)"]
ADD["⊕"]
OUT["output (128 ch, 28×28)"]
end
X --> CONV --> ADD
X --> PROJ --> ADD
ADD --> OUT
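A sketch of the projection shortcut for the dimensions in the diagram (64 → 128 channels, with stride 2 halving the spatial size):

import torch.nn as nn

# 1x1 conv with stride 2 matches both the channel count and the spatial size
downsample = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)
# in the block's forward pass: out = conv_branch(x) + downsample(x)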
8.5 The Full ResNet Architecture
ResNet Variants
| Model | Layers | Parameters | Top-5 Error |
|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 10.92% |
| ResNet-34 | 34 | 21.8M | 9.46% |
| ResNet-50 | 50 | 25.6M | 7.48% |
| ResNet-101 | 101 | 44.5M | 6.58% |
| ResNet-152 | 152 | 60.2M | 6.16% |
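The parameter counts can be sanity-checked with torchvision (a quick sketch; no pretrained weights are needed just to count parameters):

from torchvision.models import resnet18, resnet34, resnet50

for ctor in (resnet18, resnet34, resnet50):
    model = ctor()  # randomly initialized; we only need the architecture
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{ctor.__name__}: {n_params / 1e6:.1f}M parameters")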
ResNet-34 Architecture
graph TB
subgraph "ResNet-34 Structure"
I["Input 224×224×3"]
C0["7×7 Conv, 64, stride 2"]
P0["3×3 MaxPool, stride 2"]
subgraph "Stage 1 (56×56)"
S1["3 Basic Blocks<br/>64 filters"]
end
subgraph "Stage 2 (28×28)"
S2["4 Basic Blocks<br/>128 filters"]
end
subgraph "Stage 3 (14×14)"
S3["6 Basic Blocks<br/>256 filters"]
end
subgraph "Stage 4 (7×7)"
S4["3 Basic Blocks<br/>512 filters"]
end
GAP["Global Avg Pool"]
FC["FC 1000"]
SM["Softmax"]
end
I --> C0 --> P0 --> S1 --> S2 --> S3 --> S4 --> GAP --> FC --> SM
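A sketch of how the stages stack (it ignores the downsampling between stages, which uses the projection shortcuts from section 8.4; make_stage is an illustrative helper, not the paper's code):

import torch.nn as nn

def make_stage(block, channels, num_blocks):
    # Stack num_blocks residual blocks at a fixed channel width
    return nn.Sequential(*[block(channels) for _ in range(num_blocks)])

# ResNet-34 body, matching the stage counts above (BasicBlock from section 8.4):
# stages = [make_stage(BasicBlock, c, n)
#           for c, n in [(64, 3), (128, 4), (256, 6), (512, 3)]]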
The Numbers: ResNet-50
graph LR
subgraph "ResNet-50 Layer Count"
C1["conv1: 1 layer"]
B1["Stage 1: 3 blocks × 3 = 9"]
B2["Stage 2: 4 blocks × 3 = 12"]
B3["Stage 3: 6 blocks × 3 = 18"]
B4["Stage 4: 3 blocks × 3 = 9"]
FC["FC: 1 layer"]
end
T["Total: 1 + 9 + 12 + 18 + 9 + 1 = 50"]
C1 --> T
B1 --> T
B2 --> T
B3 --> T
B4 --> T
FC --> T
8.6 Why Skip Connections Work
Gradient Flow
The key benefit is improved gradient flow during backpropagation:
graph TB
subgraph "Without Skip Connections"
G1["Gradient must flow through<br/>every layer's weights"]
G2["Vanishing gradients in deep nets"]
G3["Early layers barely updated"]
end
subgraph "With Skip Connections"
S1["Gradient has direct path<br/>through identity shortcuts"]
S2["Gradients can 'skip' problematic layers"]
S3["All layers receive strong gradients"]
end
G1 --> G2 --> G3
S1 --> S2 --> S3
style S3 fill:#4ecdc4,color:#fff
style G3 fill:#ff6b6b,color:#fff
Mathematical View
For a residual block: \(y = F(x) + x\)
The gradient: \(\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \frac{\partial y}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \left(\frac{\partial F}{\partial x} + 1\right)\)
The +1 ensures gradients always flow, even if ∂F/∂x is small!
graph LR
subgraph "Gradient Highway"
L["Loss"]
Y["y = F(x) + x"]
X["x"]
L -->|"∂L/∂y"| Y
Y -->|"∂L/∂y × (∂F/∂x + 1)"| X
end
P["The '+1' prevents<br/>gradient vanishing!"]
Y --> P
style P fill:#ffe66d,color:#000
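This can be checked numerically with autograd. A toy sketch where the residual branch's weights are exactly zero:

import torch

x = torch.randn(4, requires_grad=True)
w = torch.zeros(4)            # residual branch weights pushed to zero
F = lambda t: w * t           # a tiny stand-in for F(x)

F(x).sum().backward()
print(x.grad)                 # zeros: the gradient vanished through F alone

x.grad = None
(F(x) + x).sum().backward()
print(x.grad)                 # ones: the identity path contributes the "+1"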
8.7 Ensemble Interpretation
ResNets as Implicit Ensembles
A remarkable insight: ResNets can be viewed as an ensemble of shallow networks.
graph TB
subgraph "Unrolled View"
X["x"]
B1["Block 1: F₁(x)"]
B2["Block 2: F₂(·)"]
B3["Block 3: F₃(·)"]
OUT["Output"]
end
X --> B1
X --> OUT
B1 --> B2
B1 --> OUT
B2 --> B3
B2 --> OUT
B3 --> OUT
E["Each possible path through<br/>skip connections = one 'sub-network'<br/>ResNet-20 has 2^20 paths!"]
style E fill:#ffe66d,color:#000
With n blocks, there are 2^n paths. The network is implicitly averaging over exponentially many sub-networks!
8.8 Experimental Results
ImageNet Performance
xychart-beta
title "ImageNet Top-5 Error (2012-2015)"
x-axis ["AlexNet", "VGG-19", "GoogLeNet", "ResNet-152"]
y-axis "Error %" 0 --> 20
bar [15.3, 7.3, 6.7, 3.6]
The Depth Experiment
xychart-beta
title "Effect of Depth with ResNet"
x-axis "Layers" [20, 32, 44, 56, 110]
y-axis "Error %" 0 --> 12
line "Plain Network" [9.5, 10.5, 11.2, 12.0, 13.5]
line "ResNet" [8.5, 7.5, 6.8, 6.2, 5.5]
Without skip connections: deeper = worse. With skip connections: deeper = better!
Going Extremely Deep
The paper trained a 1202-layer ResNet on CIFAR-10:
- It worked! (no optimization issues)
- But performance was slightly worse than 110-layer (overfitting)
8.9 Design Choices and Ablations
Identity vs. Projection Shortcuts
graph TB
subgraph "Shortcut Options"
A["Option A: Zero-padding<br/>(for dimension change)"]
B["Option B: Projection shortcuts<br/>(only for dimension change)"]
C["Option C: All projection shortcuts"]
end
R["Results: B is best<br/>C wastes parameters<br/>A slightly worse than B"]
A --> R
B --> R
C --> R
Pre-activation vs. Post-activation
The original ResNet uses “post-activation”:
- Conv → BN → ReLU → Conv → BN → Add → ReLU
Chapter 9 will explore “pre-activation” which is even better:
- BN → ReLU → Conv → BN → ReLU → Conv → Add
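A sketch of this pre-activation ordering as a block (detailed in Chapter 9); note there is no ReLU after the addition, so the identity path is completely clean:

import torch.nn as nn

class PreActBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.branch(x)  # pure identity path, no post-add ReLU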
8.10 Beyond Image Classification
ResNet for Other Tasks
graph TB
subgraph "ResNet Applications"
CLS["Image Classification<br/>(original task)"]
DET["Object Detection<br/>(Faster R-CNN backbone)"]
SEG["Semantic Segmentation<br/>(FCN, DeepLab)"]
POSE["Pose Estimation<br/>(feature extractor)"]
GEN["Image Generation<br/>(ResNet blocks in GANs)"]
end
R["ResNet became the<br/>default backbone for<br/>almost everything"]
CLS --> R
DET --> R
SEG --> R
POSE --> R
GEN --> R
style R fill:#ffe66d,color:#000
Transfer Learning with ResNet
Pre-trained ResNets are the foundation of transfer learning in vision:
# Common pattern (a sketch; the training loop itself is elided)
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(pretrained=True)         # load ImageNet weights
model.fc = nn.Linear(2048, num_classes)   # replace the final layer for the new task
# Fine-tune with a standard training loop on your dataset
8.11 The Broader Impact
What ResNet Changed
timeline
title Impact of ResNet
2015 : ResNet paper
: 152 layers trained successfully
: Won ImageNet, COCO, etc.
2016 : CVPR Best Paper
: Became standard backbone
2017 : ResNeXt, DenseNet
: Variations on the theme
2018 : ResNet in everything
: Detection, segmentation, GANs
2020s : Still widely used
: Foundation for Vision Transformers
The Skip Connection Legacy
Skip connections appeared everywhere after ResNet:
| Architecture | Skip Connection Variant |
|---|---|
| DenseNet | Connect to ALL previous layers |
| U-Net | Skip connections across encoder-decoder |
| Highway Networks | Learned gating |
| Transformers | Residual connections after attention |
8.12 Connection to Earlier Chapters
graph TB
CH8["Chapter 8<br/>ResNet"]
CH8 --> CH6["Chapter 6: AlexNet<br/><i>ResNet continues the<br/>depth revolution</i>"]
CH8 --> CH7["Chapter 7: CS231n<br/><i>Understanding why<br/>gradients vanish</i>"]
CH8 --> CH3["Chapter 3: Simple NNs<br/><i>Regularization through<br/>architecture</i>"]
CH8 --> CH9["Chapter 9: Identity Mappings<br/><i>Improving ResNet further</i>"]
CH8 --> CH16["Chapter 16: Transformers<br/><i>Also use residual connections!</i>"]
style CH8 fill:#ff6b6b,color:#fff
8.13 Implementation Details
Key Training Settings
| Hyperparameter | Value |
|---|---|
| Optimizer | SGD with momentum 0.9 |
| Learning rate | 0.1, divided by 10 at epochs 30, 60 |
| Weight decay | 0.0001 |
| Batch size | 256 |
| Epochs | 90 |
| Data augmentation | Random crop, horizontal flip |
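A sketch of this recipe in PyTorch (model is assumed to be a ResNet module defined elsewhere):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# divide the learning rate by 10 at epochs 30 and 60
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1)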
Weight Initialization
He initialization for ReLU networks:
\[W \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{\text{in}}}}\]
The factor of 2 in the variance accounts for ReLU zeroing half the activations.
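In PyTorch this corresponds to kaiming_normal_; a sketch (mode='fan_in' matches the \(n_{\text{in}}\) in the formula, though torchvision's ResNet uses fan_out):

import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_in',
                                nonlinearity='relu')

model.apply(init_weights)  # 'model' is assumed to be an nn.Module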
8.14 Key Equations Summary
Residual Learning
\[y = F(x, \{W_i\}) + x\]
Gradient Through Residual Block
\[\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \left(1 + \frac{\partial F}{\partial x}\right)\]
Bottleneck Computation
\[F(x) = W_3 \cdot \sigma(W_2 \cdot \sigma(W_1 \cdot x))\]
where W₁ is 1×1 (reduce), W₂ is 3×3 (process), W₃ is 1×1 (expand), and σ denotes ReLU.
He Initialization
\[\text{Var}(W) = \frac{2}{n_{\text{in}}}\]
8.15 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Skip connections solve<br/>the degradation problem"]
T2["Deeper networks can now<br/>be trained effectively"]
T3["Gradients flow through<br/>identity shortcuts"]
T4["ResNets work as implicit<br/>ensembles of paths"]
T5["Became the default<br/>backbone for vision"]
end
T1 --> C["ResNet's simple insight—<br/>learn residuals, not direct mappings—<br/>enabled training networks 10× deeper<br/>and transformed computer vision"]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
ResNet introduced skip connections that let networks learn residual functions, solving the degradation problem and enabling training of 100+ layer networks that achieved superhuman performance on ImageNet.
Exercises
- Conceptual: Explain in your own words why learning F(x) = 0 is easier than learning H(x) = x for a stack of nonlinear layers.
- Calculation: In a bottleneck block with 256 input channels, the middle 3×3 conv has 64 channels. How many parameters does this block have? Compare to a basic block with 256 channels.
- Implementation: Implement a ResNet-18 from scratch and train it on CIFAR-10. Plot training curves with and without skip connections.
- Analysis: ResNet-1202 works but is worse than ResNet-110 on CIFAR. Why might this be? What does this tell us about the limits of depth?
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (He et al., 2015) | arXiv:1512.03385 |
| Identity Mappings Paper (He et al., 2016) | arXiv:1603.05027 |
| ResNeXt Paper | arXiv:1611.05431 |
| DenseNet Paper | arXiv:1608.06993 |
| PyTorch ResNet Implementation | torchvision |
| Residual Networks as Ensembles | arXiv:1605.06431 |
| He Initialization Paper | arXiv:1502.01852 |
Next Chapter: Chapter 9: Identity Mappings in Deep Residual Networks — We explore how to improve ResNet further by rethinking the order of operations within residual blocks.