Chapter 9: Identity Mappings in Deep Residual Networks
“When the identity shortcut is truly identity, information flows freely.”
Based on: “Identity Mappings in Deep Residual Networks” (Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, 2016)
| 📄 Original Paper: arXiv:1603.05027 | ECCV 2016 |
9.1 Improving on a Breakthrough
Just months after ResNet revolutionized deep learning, the same team asked a crucial question:
Is the original residual unit design optimal?
The answer was no. By carefully analyzing information flow in residual networks, they discovered a superior design that further improved training and generalization.
graph LR
subgraph "Evolution"
R1["ResNet v1<br/>(Original, 2015)"]
R2["ResNet v2<br/>(This paper, 2016)"]
end
R1 -->|"Pre-activation<br/>design"| R2
B["Better gradients<br/>Better generalization<br/>Easier optimization"]
R2 --> B
style B fill:#ffe66d,color:#000
9.2 Analyzing Information Flow
The Ideal: Pure Identity Shortcuts
The key insight: for optimal gradient flow, the shortcut should be a pure identity mapping—no modifications.
graph TB
subgraph "Original ResNet (v1)"
X1["x"]
F1["F(x)"]
ADD1["⊕"]
R1["ReLU"]
OUT1["output"]
X1 --> F1 --> ADD1
X1 --> ADD1
ADD1 --> R1 --> OUT1
end
subgraph "Problem"
P["ReLU after addition<br/>modifies the skip path!"]
end
ADD1 --> P
style P fill:#ff6b6b,color:#fff
The ReLU after addition means the shortcut path is not a pure identity—information gets modified.
Mathematical Analysis
For a series of residual units, if shortcuts are identity:
\[x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)\]
The gradient becomes:
\[\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right)\]
graph LR
subgraph "Gradient Flow"
G["∂L/∂x_l = ∂L/∂x_L × (1 + ...)"]
end
I["The '1' ensures gradients<br/>propagate directly from any<br/>layer to any other layer!"]
G --> I
style I fill:#ffe66d,color:#000
If the shortcut is NOT identity, this beautiful property breaks down.
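This additive "1" term is easy to check numerically. The sketch below is not from the paper; it assumes PyTorch and uses tiny Linear layers as stand-ins for the residual functions F. It stacks 50 identity-shortcut updates and shows that the gradient reaching the input still carries the direct identity contribution.
# Minimal check (assumes PyTorch; tiny Linear layers stand in for F):
# with identity shortcuts, dL/dx_0 keeps a direct "+1" contribution.
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 16
residuals = [nn.Linear(dim, dim) for _ in range(depth)]
for f in residuals:
    nn.init.normal_(f.weight, std=1e-3)   # small branches: the identity term dominates
    nn.init.zeros_(f.bias)

x0 = torch.randn(1, dim, requires_grad=True)
x = x0
for f in residuals:
    x = x + f(x)            # x_{l+1} = x_l + F(x_l), shortcut untouched

x.sum().backward()          # dL/dx_L is all ones for a sum loss
# The identity path carries those ones straight back to x_0, plus a tiny
# correction from the residual branches:
print(x0.grad.mean().item())   # ~1.0, even through 50 stacked units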
9.3 The Pre-activation Design
Moving BN and ReLU Before Convolutions
The solution: rearrange operations so the shortcut is truly identity.
graph TB
subgraph "Original (Post-activation)"
direction TB
X1["x"]
C1a["Conv"]
B1a["BN"]
R1a["ReLU"]
C1b["Conv"]
B1b["BN"]
ADD1["⊕"]
R1c["ReLU"]
O1["output"]
X1 --> C1a --> B1a --> R1a --> C1b --> B1b --> ADD1
X1 --> ADD1
ADD1 --> R1c --> O1
end
subgraph "Pre-activation (This Paper)"
direction TB
X2["x"]
B2a["BN"]
R2a["ReLU"]
C2a["Conv"]
B2b["BN"]
R2b["ReLU"]
C2b["Conv"]
ADD2["⊕"]
O2["output"]
X2 --> B2a --> R2a --> C2a --> B2b --> R2b --> C2b --> ADD2
X2 --> ADD2
ADD2 --> O2
end
K["Now the shortcut is<br/>PURE IDENTITY!"]
ADD2 --> K
style K fill:#4ecdc4,color:#fff
The Key Difference
| Aspect | Post-activation | Pre-activation |
|---|---|---|
| Shortcut | Modified by ReLU | Pure identity |
| BN location | After conv | Before conv |
| ReLU location | After addition | Before conv |
| Gradient flow | Slightly impeded | Completely free |
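To make the reordering in the table above concrete, here is a minimal runnable sketch with toy layer sizes (not the paper's code); conv1, conv2, bn1, and bn2 are shared placeholders used only for this comparison.
# Runnable sketch of the two orderings (toy sizes, illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(16, 16, 3, padding=1, bias=False)
conv2 = nn.Conv2d(16, 16, 3, padding=1, bias=False)
bn1, bn2 = nn.BatchNorm2d(16), nn.BatchNorm2d(16)

def post_activation_unit(x):
    # ResNet v1: conv-BN-ReLU-conv-BN, add, then ReLU
    out = bn2(conv2(F.relu(bn1(conv1(x)))))
    return F.relu(x + out)      # this final ReLU also modifies the shortcut path

def pre_activation_unit(x):
    # ResNet v2: BN-ReLU-conv-BN-ReLU-conv, then add and nothing else
    out = conv2(F.relu(bn2(conv1(F.relu(bn1(x))))))
    return x + out              # the shortcut reaches the output untouched

x = torch.randn(2, 16, 8, 8)
print(post_activation_unit(x).shape, pre_activation_unit(x).shape)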
9.4 Why Pre-activation Works Better
Gradient Highway
With pre-activation, gradients flow through an uninterrupted highway:
graph TB
subgraph "Gradient Propagation"
L["Loss"]
XL["x_L (final)"]
XK["x_k (middle)"]
X0["x_0 (input)"]
end
L --> XL
XL -->|"direct path"| XK
XK -->|"direct path"| X0
H["No ReLU or BN<br/>blocks the highway"]
XL --> H
XK --> H
style H fill:#ffe66d,color:#000
Regularization Effect of BN
Placing BN before convolution has a subtle benefit:
graph LR
subgraph "BN as Regularizer"
I["Input x"]
B["BN normalizes"]
C["Conv sees normalized input"]
end
I --> B --> C
E["Weights don't need to<br/>adapt to input scale<br/>→ Better optimization"]
C --> E
9.5 Experimental Comparison
Comparing Unit Designs
The paper systematically tests different arrangements:
graph TB
subgraph "Tested Variants"
A["(a) Original<br/>post-activation"]
B["(b) BN after addition"]
C["(c) ReLU before addition"]
D["(d) ReLU-only pre-act"]
E["(e) Full pre-activation"]
end
R["Results on CIFAR-10<br/>ResNet-110:<br/>(a) 6.61%<br/>(b) 8.17%<br/>(c) 7.84%<br/>(d) 6.71%<br/>(e) 6.37% ✓"]
A --> R
B --> R
C --> R
D --> R
E --> R
style E fill:#4ecdc4,color:#fff
Deeper Networks Benefit More
xychart-beta
title "Pre-activation Advantage vs Depth"
x-axis "Layers" [110, 164, 1001]
y-axis "Error % Reduction" 0 --> 2
bar [0.24, 0.50, 1.03]
The deeper the network, the more pre-activation helps!
Results on CIFAR-10/100
| Model | Original | Pre-activation | Improvement |
|---|---|---|---|
| ResNet-110 | 6.61% | 6.37% | 0.24% |
| ResNet-164 | 5.93% | 5.46% | 0.47% |
| ResNet-1001 | 7.61% | 4.92% | 2.69% |
The 1001-layer pre-activation network achieves 4.92% error—remarkable!
9.6 Shortcut Connection Analysis
What Happens with Non-Identity Shortcuts?
The paper analyzes various shortcut modifications:
graph TB
subgraph "Shortcut Variants"
I["(a) Identity<br/>h(x) = x"]
S["(b) Scaling<br/>h(x) = λx"]
G["(c) Gating<br/>h(x) = g(x)⊙x"]
C["(d) 1×1 Conv<br/>h(x) = Wx"]
D["(e) Dropout<br/>h(x) = dropout(x)"]
end
R["Identity is best!<br/>Any modification hurts."]
I --> R
S --> R
G --> R
C --> R
D --> R
style I fill:#4ecdc4,color:#fff
Why Non-Identity Hurts
For a scaling shortcut h(x) = λx, the forward pass becomes:
\[x_L = \lambda^{L-l} x_l + \text{residuals}\]
The shortcut signal (and the corresponding gradient term) is scaled by the factor λ^(L-l):
- If λ > 1: it explodes exponentially with depth
- If λ < 1: it vanishes exponentially with depth
Even learned scaling (gating) performs worse than simple identity!
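A few lines of plain Python make the exponential scaling tangible (illustrative values only):
# Illustrative only: how a lambda^(L-l) scaling factor behaves with depth.
for lam in (0.8, 1.0, 1.2):
    for depth in (10, 50, 100):
        print(f"lambda = {lam:.1f}, depth = {depth:3d}: factor = {lam ** depth:.3e}")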
9.7 The Information Flow Perspective
Clean Signal Propagation
graph LR
subgraph "Pre-activation View"
X["Signal x"]
ADD["Additive updates<br/>from residual functions"]
Y["Output"]
end
X -->|"flows unchanged"| Y
ADD -->|"adds refinements"| Y
I["The network learns<br/>'refinements' to an<br/>identity mapping"]
Y --> I
style I fill:#ffe66d,color:#000
Connection to Unrolled View
Remember from Chapter 8: ResNets can be viewed as ensembles. Pre-activation makes each path cleaner:
graph TB
subgraph "Each Path"
P1["Path 1: Identity only"]
P2["Path 2: Identity + F₁"]
P3["Path 3: Identity + F₂"]
P4["Path 4: Identity + F₁ + F₂"]
end
E["All paths share the same<br/>clean identity baseline"]
P1 --> E
P2 --> E
P3 --> E
P4 --> E
9.8 Implementation Details
Pre-activation Residual Block Code
# Pre-activation residual block (runnable PyTorch version)
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)
        # 1x1 projection shortcut, used only when the dimensions change
        self.shortcut = None
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride, bias=False)

    def forward(self, x):
        # Pre-activation: BN-ReLU before the first convolution
        out = F.relu(self.bn1(x))
        # Identity case: pass x through untouched (pure identity shortcut).
        # Dimension-change case: apply the projection to the PRE-activated input.
        shortcut = x if self.shortcut is None else self.shortcut(out)
        # Residual path: conv -> BN-ReLU -> conv
        out = self.conv1(out)
        out = self.conv2(F.relu(self.bn2(out)))
        # Addition with no ReLU afterwards, so the shortcut stays identity
        return out + shortcut
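A quick sanity check of the block above (assuming PyTorch), covering both the identity case and the 1×1 projection case:
# Quick usage check (toy input, shapes only).
import torch

block_same = PreActBlock(64, 64)               # pure identity shortcut
block_down = PreActBlock(64, 128, stride=2)    # 1x1 projection shortcut

x = torch.randn(1, 64, 32, 32)
print(block_same(x).shape)    # torch.Size([1, 64, 32, 32])
print(block_down(x).shape)    # torch.Size([1, 128, 16, 16])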
Key Implementation Note
When dimensions change, apply the projection to the pre-activated input:
graph TB
subgraph "Dimension Change"
X["x"]
BN["BN-ReLU"]
PROJ["1×1 projection"]
CONV["Conv path"]
ADD["⊕"]
end
X --> BN
BN --> PROJ --> ADD
BN --> CONV --> ADD
N["Projection applied to<br/>pre-activated features"]
BN --> N
style N fill:#ffe66d,color:#000
9.9 Impact on Modern Architectures
Pre-activation Became Standard
timeline
title Adoption of Pre-activation
2016 : This paper
: Pre-activation ResNet
2017 : WideResNet
: Uses pre-activation
2018 : Many detection models
: Pre-act backbones
2019 : EfficientNet discussion
: Considered pre-act
2020s : Still relevant
: ResNet-RS uses it
Connection to Transformers
Interestingly, Transformers use a similar pattern:
graph TB
subgraph "Transformer Block"
X["x"]
LN["LayerNorm"]
ATT["Attention"]
ADD["⊕"]
end
X --> LN --> ATT --> ADD
X --> ADD
P["Pre-LN Transformer<br/>= Same principle as<br/>Pre-activation ResNet!"]
ADD --> P
style P fill:#ffe66d,color:#000
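As a rough illustration of the parallel, here is a minimal Pre-LN block sketch. It assumes PyTorch's nn.MultiheadAttention and is a generic stand-in, not any specific Transformer implementation. Note how LayerNorm sits inside each residual branch and the addition is the last operation, just like pre-activation.
# Minimal Pre-LN Transformer block sketch (generic stand-in, toy sizes).
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Normalization lives on the residual branch; the shortcut is pure identity.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x

x = torch.randn(2, 10, 64)
print(PreLNBlock()(x).shape)   # torch.Size([2, 10, 64])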
9.10 Deeper Analysis: Why 1001 Layers Work
Training Ultra-Deep Networks
The paper trains a 1001-layer ResNet on CIFAR-10:
graph TB
subgraph "ResNet-1001"
S["Structure:<br/>3 stages × 333 blocks"]
P["Parameters: ~10M"]
T["Training: Converges smoothly"]
R["Result: 4.92% error!"]
end
W["Without pre-activation:<br/>7.61% error<br/>Optimization struggles"]
S --> R
R --> C["Pre-activation enables<br/>training networks with<br/>1000+ layers"]
W --> C
style C fill:#4ecdc4,color:#fff
Gradient Analysis
For ResNet-1001 with pre-activation:
graph LR
subgraph "Gradient Magnitude"
E["Early layers"]
M["Middle layers"]
L["Late layers"]
end
V["All layers receive<br/>gradients of similar<br/>magnitude!"]
E --> V
M --> V
L --> V
style V fill:#ffe66d,color:#000
Without pre-activation, early layer gradients are much smaller.
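The claim can be illustrated with a toy experiment (a sketch with 1-D residual branches standing in for conv units, not the paper's measurement): gradients flowing back through 100 identity-shortcut blocks reach the earliest blocks without being attenuated exponentially.
# Toy illustration (not the paper's experiment): with identity shortcuts,
# the gradient reaching early blocks is not attenuated exponentially.
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 100, 32
# 1-D residual branches standing in for conv units: norm -> ReLU -> linear.
branches = nn.ModuleList(
    [nn.Sequential(nn.LayerNorm(dim), nn.ReLU(), nn.Linear(dim, dim)) for _ in range(depth)]
)

x = torch.randn(8, dim, requires_grad=True)
taps = []                      # keep intermediate x_l tensors to inspect their gradients
h = x
for branch in branches:
    h = h + branch(h)          # identity shortcut: x_{l+1} = x_l + F(x_l)
    h.retain_grad()
    taps.append(h)

loss = h.pow(2).mean()
loss.backward()

for l in (0, depth // 2, depth - 1):
    print(f"after block {l:3d}: grad norm = {taps[l].grad.norm().item():.4f}")
print(f"at the input    : grad norm = {x.grad.norm().item():.4f}")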
9.11 Comparison Summary
The Full Picture
graph TB
subgraph "Original vs Pre-activation"
O["Original ResNet<br/>• ReLU after addition<br/>• Shortcut slightly modified<br/>• Works well for 150 layers"]
P["Pre-activation ResNet<br/>• BN-ReLU before conv<br/>• Pure identity shortcut<br/>• Works for 1000+ layers"]
end
O --> C["Choose based on depth<br/>and optimization needs"]
P --> C
When to Use Which
| Scenario | Recommendation |
|---|---|
| Standard vision (50-152 layers) | Either works |
| Very deep (200+ layers) | Pre-activation preferred |
| Training instability | Try pre-activation |
| Following recent papers | Check what they use |
9.12 Connection to Other Chapters
graph TB
CH9["Chapter 9<br/>Identity Mappings"]
CH9 --> CH8["Chapter 8: ResNet<br/><i>Original residual learning</i>"]
CH9 --> CH7["Chapter 7: CS231n<br/><i>BN and optimization</i>"]
CH9 --> CH16["Chapter 16: Transformers<br/><i>Pre-LN uses same principle!</i>"]
CH9 --> CH3["Chapter 3: Simple NNs<br/><i>Information flow matters</i>"]
style CH9 fill:#ff6b6b,color:#fff
9.13 Key Equations Summary
Identity Shortcut Forward Pass
\[x_{l+1} = x_l + F(x_l, W_l)\]
Gradient with Identity Shortcut
\[\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} + \frac{\partial \mathcal{L}}{\partial x_L} \cdot \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F_i\]
Direct Signal Propagation
\[x_L = x_0 + \sum_{i=0}^{L-1} F(x_i, W_i)\]
The output is the input plus the sum of residuals—no multiplicative factors!
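For contrast, a plain chain without shortcuts, x_{l+1} = H(x_l, W_l), propagates gradients through a product of Jacobians (a standard observation, stated here for comparison rather than taken from the paper):
\[\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \prod_{i=l}^{L-1} \frac{\partial H(x_i, W_i)}{\partial x_i}\]
A product of many Jacobians can easily vanish or explode; the additive identity form instead keeps a constant "1" path from any layer to any other.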
9.14 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Identity shortcuts should<br/>be PURE identity"]
T2["Pre-activation: move<br/>BN-ReLU before conv"]
T3["Enables training<br/>1000+ layer networks"]
T4["Better gradient flow<br/>to all layers"]
T5["Same principle used<br/>in Transformers (Pre-LN)"]
end
T1 --> C["For residual networks,<br/>keeping the shortcut as pure<br/>identity is crucial for deep<br/>networks. Pre-activation<br/>achieves this elegantly."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
By moving batch normalization and ReLU before the convolutions, pre-activation ResNets achieve pure identity shortcuts that enable cleaner gradient flow and successful training of networks with over 1000 layers.
Exercises
- Conceptual: Draw the computational graph for both post-activation and pre-activation residual blocks. Trace the gradient flow and identify where it gets “impeded” in the original design.
- Mathematical: For a scaling shortcut h(x) = 0.9x stacked 100 times, what fraction of the original signal remains? What does this mean for gradient flow?
- Implementation: Modify a ResNet-50 implementation to use pre-activation blocks. Compare training curves on CIFAR-10.
- Analysis: Why do you think the improvement from pre-activation is larger for deeper networks? Connect this to the gradient flow analysis.
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (He et al., 2016) | arXiv:1603.05027 |
| ResNet v1 Paper | arXiv:1512.03385 |
| Wide Residual Networks | arXiv:1605.07146 |
| ResNet-RS (Revisiting ResNets) | arXiv:2103.07579 |
| Pre-LN Transformer Analysis | arXiv:2002.04745 |
| PyTorch Pre-act ResNet | GitHub |
Next Chapter: Chapter 10: Dilated Convolutions for Multi-Scale Context — We explore how dilated (atrous) convolutions enable exponentially increasing receptive fields without losing resolution, crucial for dense prediction tasks.