Chapter 7: CS231n - Convolutional Neural Networks for Visual Recognition
“The course that taught a generation how deep learning actually works.”
Based on: CS231n: Convolutional Neural Networks for Visual Recognition (Stanford University)
| 📄 Course Materials: CS231n Website | Course Notes | Video Lectures |
7.1 Why a Course in Ilya’s List?
While most entries in the 30u30 are research papers, CS231n is a Stanford course. Its inclusion tells us something important: understanding deep learning requires not just reading papers, but building comprehensive mental models.
CS231n, created at Stanford by Fei-Fei Li and taught in its early offerings by Andrej Karpathy (who worked with Ilya at OpenAI), became the definitive resource for understanding CNNs from the ground up.
graph TB
subgraph "What CS231n Covers"
F["Foundations<br/>Image classification basics"]
B["Backprop<br/>How gradients flow"]
C["CNNs<br/>Architecture deep dive"]
T["Training<br/>Optimization tricks"]
A["Architectures<br/>AlexNet to ResNet"]
end
F --> B --> C --> T --> A
R["Complete understanding<br/>of visual recognition"]
A --> R
style R fill:#ffe66d,color:#000
Figure: CS231n course coverage. The course builds from foundations (image classification basics) through backpropagation, CNNs, training techniques, to modern architectures, providing complete understanding of visual recognition.
7.2 Image Classification: The Core Problem
The Challenge
Given an image, assign it to one of several categories.
graph LR
subgraph "Input"
I["Image<br/>224×224×3<br/>= 150,528 numbers"]
end
subgraph "Output"
O["Category<br/>'cat' or 'dog' or ..."]
end
I -->|"???"| O
C["The 'semantic gap':<br/>Pixels → Meaning"]
style C fill:#ff6b6b,color:#fff
Figure: The image classification challenge. An input image (224×224×3 = 150,528 numbers) must be mapped to a category label, bridging the semantic gap between pixels and meaning.
Why It’s Hard
graph TB
subgraph "Challenges"
V["Viewpoint variation"]
S["Scale variation"]
D["Deformation"]
O["Occlusion"]
I["Illumination"]
B["Background clutter"]
C["Intra-class variation"]
end
ALL["Same object can look<br/>COMPLETELY different"]
V --> ALL
S --> ALL
D --> ALL
O --> ALL
I --> ALL
B --> ALL
C --> ALL
Figure: Challenges in image classification. Viewpoint, scale, deformation, occlusion, illumination, background clutter, and intra-class variation mean the same object can look completely different, making classification difficult.
The Data-Driven Approach
Instead of writing rules, learn from examples:
- Collect a dataset of images with labels
- Train a classifier using machine learning
- Evaluate on new images
7.3 From Pixels to Features
The Naive Approach: Compare Pixels
Nearest Neighbor: Find the training image most similar to the test image.
graph LR
subgraph "L1 Distance"
T["Test image"]
TR["Training images"]
D["d(I₁,I₂) = Σ|I₁ - I₂|"]
end
T --> D
TR --> D
D --> N["Nearest neighbor<br/>label"]
Figure: L1 distance (Manhattan distance) measures pixel-wise differences between images. While simple, it’s sensitive to small shifts and doesn’t capture semantic similarity, making it poor for image classification.
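For concreteness, here is a minimal NumPy sketch of the nearest-neighbor classifier with the L1 distance above; the array names (`X_train`, `y_train`, `X_test`) are illustrative, not taken from the course code.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_test):
    """Label each test image with the label of its L1-nearest training image.
    Images are assumed to be flattened into rows of shape (num_pixels,)."""
    predictions = np.empty(len(X_test), dtype=y_train.dtype)
    for i, x in enumerate(X_test):
        # L1 distance to every training image: sum of absolute pixel differences
        distances = np.sum(np.abs(X_train - x), axis=1)
        predictions[i] = y_train[np.argmin(distances)]
    return predictions
```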
Problems:
- Slow at test time (compare to ALL training images)
- Pixel distance ≠ semantic similarity
- Curse of dimensionality
The Solution: Learn Features
graph LR
subgraph "Feature Learning"
I["Raw pixels<br/>150,528 dims"]
F["Learned features<br/>~4,096 dims"]
C["Classifier<br/>1,000 classes"]
end
I -->|"CNN"| F
F -->|"FC"| C
K["Key insight:<br/>Learn WHAT to look for"]
F --> K
style K fill:#ffe66d,color:#000
Figure: Feature learning approach. Instead of comparing raw pixels, learn features that capture semantic information. Neural networks automatically learn hierarchical features from low-level (edges) to high-level (object parts).
7.4 Linear Classifiers: The Building Block
The Score Function
A linear classifier computes:
\[f(x, W) = Wx + b\]
Where:
- x is the image (flattened to a vector)
- W is the weight matrix
- b is the bias vector
graph LR
subgraph "Linear Classifier"
X["x<br/>[3072×1]<br/>(32×32×3 image)"]
W["W<br/>[10×3072]"]
B["b<br/>[10×1]"]
S["Scores<br/>[10×1]"]
end
X --> M["Wx + b"]
W --> M
B --> M
M --> S
Figure: Linear classifier architecture. Input features x are multiplied by weight matrix W and added to bias b, producing class scores. This is the simplest learnable classifier, serving as a building block for neural networks.
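A minimal NumPy sketch of this score function, using the CIFAR-10 shapes from the figure (32×32×3 images, 10 classes); the random initialization is purely illustrative.

```python
import numpy as np

def linear_scores(x, W, b):
    """Compute class scores f(x, W) = Wx + b.
    x: flattened image, shape (3072,); W: (10, 3072); b: (10,)."""
    return W.dot(x) + b  # shape (10,): one score per class

# Toy usage with random numbers (shapes follow the CIFAR-10 example above)
x = np.random.randn(3072)
W = 0.001 * np.random.randn(10, 3072)
b = np.zeros(10)
print(linear_scores(x, W, b).shape)  # (10,)
```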
Geometric Interpretation
Each row of W is a template for a class:
graph TB
subgraph "What W Learns"
R1["Row 1 = 'car' template"]
R2["Row 2 = 'dog' template"]
R3["Row 3 = 'cat' template"]
end
I["Input image"]
I -->|"dot product"| R1
I -->|"dot product"| R2
I -->|"dot product"| R3
S["Higher score = better match"]
R1 --> S
R2 --> S
R3 --> S
Figure: What the weight matrix W learns. Each row of W acts as a template for one class. The classifier computes how well the input matches each template via dot product, with higher scores indicating better matches.
Limitations of Linear Classifiers
graph TB
subgraph "Linear = One Template Per Class"
L["Cannot handle:<br/>• Multiple modes (car from front vs side)<br/>• Complex decision boundaries<br/>• Compositional structure"]
end
S["Solution: Stack multiple layers<br/>→ Neural Networks"]
L --> S
style S fill:#4ecdc4,color:#fff
Figure: Linear classifier limitation. A linear classifier can only learn one template per class, making it unable to handle multiple modes or complex decision boundaries. This motivates the need for non-linear neural networks.
7.5 Loss Functions: Measuring “Badness”
The Goal
Quantify how wrong our predictions are so we can improve.
Softmax Loss (Cross-Entropy)
The most common loss for classification:
\[L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)\]
graph TB
subgraph "Softmax Pipeline"
S["Raw scores<br/>[3.2, 5.1, -1.7]"]
E["Exponentiate<br/>[24.5, 164.0, 0.18]"]
N["Normalize<br/>[0.13, 0.87, 0.001]"]
L["-log(correct class prob)<br/>If correct=1: -log(0.87)=0.14"]
end
S --> E --> N --> L
Figure: Softmax pipeline. Raw class scores are passed through softmax to convert them into probabilities (summing to 1). This provides a probabilistic interpretation of the classifier’s predictions.
Intuition
- High score for correct class → Low loss
- Low score for correct class → High loss
- Forces probabilities to be calibrated
Regularization
Add penalty to prevent overfitting:
\[L = \frac{1}{N}\sum_i L_i + \lambda R(W)\]
Common choices:
- L2: $R(W) = \sum W^2$ (prefer smaller weights)
- L1: $R(W) = \sum |W|$ (prefer sparse weights)
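A minimal NumPy sketch combining the softmax loss and the L2 penalty above; subtracting the per-row maximum is a standard numerical-stability trick, and the variable names are illustrative.

```python
import numpy as np

def softmax_loss(scores, y, W, lam=1e-3):
    """scores: (N, C) class scores, y: (N,) correct class indices,
    W: weight matrix (for the L2 penalty), lam: regularization strength."""
    # Subtract the per-row max for numerical stability (does not change the probabilities)
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    data_loss = -np.mean(np.log(probs[np.arange(len(y)), y]))
    reg_loss = lam * np.sum(W * W)   # L2 regularization: prefer smaller weights
    return data_loss + reg_loss
```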
7.6 Optimization: Finding Good Weights
The Core Problem
\[W^* = \arg\min_W L(W)\]
Find weights that minimize loss.
Gradient Descent
graph TB
subgraph "Gradient Descent Loop"
I["Initialize W randomly"]
C["Compute loss L(W)"]
G["Compute gradient ∇L"]
U["Update: W ← W - η∇L"]
CH["Check convergence"]
end
I --> C --> G --> U --> CH
CH -->|"Not done"| C
CH -->|"Done"| F["Final W*"]
Figure: Gradient descent training loop. Forward pass computes predictions and loss, backward pass computes gradients, then weights are updated. This iterative process minimizes the loss function.
Learning Rate: The Critical Hyperparameter
xychart-beta
title "Effect of Learning Rate"
x-axis "Steps" [0, 10, 20, 30, 40, 50]
y-axis "Loss" 0 --> 10
line "Too small (slow)" [10, 9.5, 9.0, 8.6, 8.2, 7.9]
line "Just right" [10, 6, 3, 1.5, 0.8, 0.5]
line "Too large (unstable)" [10, 8, 12, 5, 15, 20]
Figure: Effect of learning rate on training. Too low: slow convergence. Just right: smooth convergence to minimum. Too high: overshooting and divergence. Learning rate is the most important hyperparameter.
Stochastic Gradient Descent (SGD)
Instead of computing gradient on ALL data:
- Sample a mini-batch (e.g., 32 or 64 images)
- Compute gradient on mini-batch
- Update weights
graph LR
subgraph "SGD"
D["Full Dataset<br/>50,000 images"]
B["Mini-batch<br/>64 images"]
G["Gradient estimate"]
end
D -->|"sample"| B
B --> G
A["Advantages:<br/>• Much faster per update<br/>• Noise helps escape local minima<br/>• Works for huge datasets"]
G --> A
Figure: Stochastic Gradient Descent (SGD). Instead of computing gradients on the full dataset (expensive), compute gradients on small random batches. This is faster, uses less memory, and often generalizes better.
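A minimal NumPy sketch of this mini-batch SGD loop; `loss_and_grad` stands in for whatever function computes the loss and its gradient (for example, the softmax loss above plus backprop) and is not a specific course API.

```python
import numpy as np

def sgd_train(W, X, y, loss_and_grad, lr=1e-3, batch_size=64, num_steps=1000):
    """Vanilla mini-batch SGD. `loss_and_grad(W, X_batch, y_batch)` is a
    placeholder returning (loss, dW)."""
    for step in range(num_steps):
        # Sample a random mini-batch instead of using the full dataset
        idx = np.random.choice(len(X), batch_size, replace=False)
        loss, dW = loss_and_grad(W, X[idx], y[idx])
        W -= lr * dW                      # update step: W = W - eta * grad_L
        if step % 100 == 0:
            print(f"step {step}: loss {loss:.4f}")
    return W
```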
7.7 Backpropagation: The Magic Behind Learning
The Chain Rule
For a composition of functions:
\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}\]
graph LR
subgraph "Computational Graph"
X["x"] --> M["* (multiply)"]
W["w"] --> M
M --> A["+ (add)"]
B["b"] --> A
A --> L["Loss"]
end
subgraph "Forward Pass"
F["Compute outputs<br/>left to right"]
end
subgraph "Backward Pass"
BK["Compute gradients<br/>right to left"]
end
Figure: Computational graph representation. Operations are nodes, data flows as edges. Backpropagation computes gradients by traversing the graph backward, applying the chain rule at each node.
Local Gradients
Each operation has simple local gradients:
| Operation | Forward | Local Gradient |
|---|---|---|
| Add: x + y | z = x + y | ∂z/∂x = 1, ∂z/∂y = 1 |
| Multiply: x × y | z = x × y | ∂z/∂x = y, ∂z/∂y = x |
| Max: max(x, y) | z = max(x,y) | ∂z/∂x = 1 if x>y else 0 |
| ReLU: max(0, x) | z = max(0,x) | ∂z/∂x = 1 if x>0 else 0 |
Backprop in Action
graph LR
subgraph "Example: f = (x + y) × z"
X["x=−2"] --> ADD["q = x+y<br/>= 1"]
Y["y=3"] --> ADD
ADD --> MUL["f = q×z<br/>= −4"]
Z["z=−4"] --> MUL
end
subgraph "Backward"
MUL -->|"∂f/∂f = 1"| BM["∂f/∂q = z = −4<br/>∂f/∂z = q = 1"]
BM -->|"∂f/∂q = −4"| BA["∂f/∂x = 1×(−4) = −4<br/>∂f/∂y = 1×(−4) = −4"]
end
Figure: Example computational graph for f = (x + y) × z. Forward pass computes the function value, backward pass computes gradients using the chain rule, propagating gradients from output to inputs.
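The same example written out as straight-line Python, so that each chain-rule step is explicit.

```python
# Forward pass for f = (x + y) * z with the values from the figure
x, y, z = -2.0, 3.0, -4.0
q = x + y            # q = 1
f = q * z            # f = -4

# Backward pass: apply the chain rule node by node, from output to inputs
df_df = 1.0          # gradient of f with respect to itself
df_dq = z * df_df    # multiply gate: local gradient is the other input, so -4
df_dz = q * df_df    # 1
df_dx = 1.0 * df_dq  # add gate: distributes the incoming gradient unchanged, so -4
df_dy = 1.0 * df_dq  # -4
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 1.0
```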
7.8 Neural Networks: Stacking Layers
From Linear to Non-linear
A 2-layer neural network:
\[f = W_2 \cdot \max(0, W_1 x + b_1) + b_2\]
graph LR
subgraph "2-Layer Network"
X["Input x"]
H["Hidden layer<br/>h = ReLU(W₁x + b₁)"]
O["Output<br/>s = W₂h + b₂"]
end
X --> H --> O
N["Non-linearity is ESSENTIAL<br/>Without it: W₂W₁x = Wx<br/>(still linear!)"]
H --> N
style N fill:#ffe66d,color:#000
Figure: Two-layer neural network. Input x passes through a hidden layer with ReLU activation, then to output. The non-linearity (ReLU) is essential—without it, the network would still be linear (W₂W₁x = Wx).
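A minimal NumPy sketch of this two-layer forward pass; the layer sizes (3072 inputs, 100 hidden units, 10 classes) are illustrative.

```python
import numpy as np

def two_layer_forward(x, W1, b1, W2, b2):
    """f = W2 @ max(0, W1 @ x + b1) + b2: a 2-layer net with a ReLU hidden layer."""
    h = np.maximum(0, W1.dot(x) + b1)   # hidden layer: the ReLU is the non-linearity
    return W2.dot(h) + b2               # output scores

# Toy shapes: 3072-dim input, 100 hidden units, 10 classes (illustrative)
x = np.random.randn(3072)
W1, b1 = 0.01 * np.random.randn(100, 3072), np.zeros(100)
W2, b2 = 0.01 * np.random.randn(10, 100), np.zeros(10)
print(two_layer_forward(x, W1, b1, W2, b2).shape)  # (10,)
```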
Universal Approximation
With enough hidden units, a 2-layer network can approximate any continuous function.
But: “Can approximate” ≠ “Easy to learn”
7.9 Convolutional Neural Networks
The Key Ideas
graph TB
subgraph "CNN Principles"
L["Local connectivity<br/>Each neuron sees small region"]
S["Spatial hierarchy<br/>Build up from simple to complex"]
W["Weight sharing<br/>Same filter across image"]
T["Translation equivariance<br/>Shift input → shift output"]
end
L --> I["Inspired by visual cortex"]
S --> I
W --> I
T --> I
Figure: CNN principles. Local connectivity (each neuron sees a small region), spatial hierarchy (build from simple to complex), weight sharing (same filter across image), and translation equivariance (shift input shifts output). These are inspired by the visual cortex.
The Convolution Operation
graph TB
subgraph "Convolution"
I["Input volume<br/>H × W × C"]
F["Filter<br/>k × k × C"]
O["Output<br/>(H-k+1) × (W-k+1) × 1"]
end
I --> C["Slide filter,<br/>compute dot products"]
F --> C
C --> O
M["Multiple filters<br/>→ Multiple output channels"]
O --> M
Figure: The convolution operation. An input volume (H×W×C) is convolved with a filter (k×k×C) by sliding the filter and computing dot products. Multiple filters produce multiple output channels.
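A deliberately naive NumPy sketch of a single-filter convolution (stride 1, no padding), just to make the sliding dot product explicit; real implementations vectorize this or use im2col.

```python
import numpy as np

def conv2d_naive(x, filt):
    """x: (H, W, C) input volume, filt: (k, k, C) filter.
    Output shape: (H-k+1, W-k+1), i.e. one output channel per filter."""
    H, W, C = x.shape
    k = filt.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]        # local k x k x C region
            out[i, j] = np.sum(patch * filt)      # dot product with the filter
    return out
```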
Convolution Arithmetic
For input size N, filter size F, stride S, padding P:
\[\text{Output size} = \frac{N - F + 2P}{S} + 1\]
graph LR
subgraph "Example"
I["Input: 32×32<br/>Filter: 5×5<br/>Stride: 1<br/>Padding: 2"]
O["Output: (32-5+4)/1+1<br/>= 32×32"]
end
I --> O
N["'Same' padding<br/>preserves spatial size"]
O --> N
Figure: Convolution arithmetic example. With input 32×32, filter 5×5, stride 1, and padding 2, the output is 32×32 (same size). This “same” padding preserves spatial dimensions.
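The arithmetic as a small helper function, using the same notation (N, F, S, P) as the formula above.

```python
def conv_output_size(n, f, s=1, p=0):
    """Spatial output size: (N - F + 2P) / S + 1. Raises if the filter does not fit evenly."""
    assert (n - f + 2 * p) % s == 0, "filter does not tile the input evenly"
    return (n - f + 2 * p) // s + 1

print(conv_output_size(32, 5, s=1, p=2))  # 32: 'same' padding preserves spatial size
```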
Pooling
Reduce spatial dimensions:
graph TB
subgraph "Max Pooling (2×2, stride 2)"
I["4×4 input"]
P["Take max in each 2×2 region"]
O["2×2 output"]
end
I --> P --> O
E["• Reduces computation<br/>• Provides invariance<br/>• Increases receptive field"]
O --> E
Figure: Max pooling operation. A 4×4 input is divided into 2×2 regions, and the maximum value in each region is taken, producing a 2×2 output. This reduces computation, provides invariance, and increases the receptive field.
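A minimal NumPy sketch of 2×2 max pooling with stride 2 on a single channel, using a reshape trick rather than explicit loops.

```python
import numpy as np

def max_pool_2x2(x):
    """x: (H, W) single-channel input with H and W even. Output: (H/2, W/2)."""
    H, W = x.shape
    # Reshape so each 2x2 window gets its own pair of axes, then take the max over them
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))   # [[ 5  7] [13 15]]
```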
7.10 CNN Architectures in Detail
Common Pattern
graph LR
subgraph "Typical CNN"
I["Input"]
C1["CONV-ReLU"]
C2["CONV-ReLU"]
P1["POOL"]
C3["CONV-ReLU"]
C4["CONV-ReLU"]
P2["POOL"]
F1["FC-ReLU"]
F2["FC"]
S["Softmax"]
end
I --> C1 --> C2 --> P1 --> C3 --> C4 --> P2 --> F1 --> F2 --> S
Figure: Typical CNN architecture. Input passes through convolutional layers (CONV-ReLU), pooling layers (POOL), fully connected layers (FC-ReLU), and finally softmax for classification. This pattern extracts hierarchical features.
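One way this pattern might look in code, sketched with PyTorch (listed in the further-reading table) for 32×32×3 inputs; the filter counts and hidden size are illustrative choices, not a prescribed architecture.

```python
import torch.nn as nn

# A possible instantiation of the CONV-ReLU-POOL pattern above for CIFAR-10-sized inputs
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, 10),                           # class scores; softmax is applied in the loss
)
```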
Layer-by-Layer Understanding
| Layer Type | Purpose | Parameters |
|---|---|---|
| CONV | Extract local features | Filter weights |
| ReLU | Add non-linearity | None |
| POOL | Downsample, add invariance | None |
| FC | Combine features for classification | Weight matrix |
| Softmax | Convert to probabilities | None |
Receptive Field
graph TB
subgraph "Receptive Field Growth"
L1["Layer 1: 3×3 filter<br/>RF = 3×3"]
L2["Layer 2: 3×3 filter<br/>RF = 5×5"]
L3["Layer 3: 3×3 filter<br/>RF = 7×7"]
end
L1 --> L2 --> L3
I["Deeper layers 'see'<br/>larger image regions"]
L3 --> I
style I fill:#ffe66d,color:#000
Figure: Receptive field growth through layers. Layer 1 with 3×3 filters has a 3×3 receptive field. Layer 2 has 5×5, Layer 3 has 7×7. Deeper layers “see” larger image regions, enabling hierarchical feature learning.
7.11 Training Neural Networks
Weight Initialization
graph TB
subgraph "Initialization Matters!"
Z["All zeros?<br/>→ All neurons identical<br/>→ No learning"]
L["Too large?<br/>→ Activations explode<br/>→ Gradients explode"]
S["Too small?<br/>→ Activations vanish<br/>→ Gradients vanish"]
end
G["Xavier/He initialization:<br/>Scale by 1/√n or 2/√n"]
Z --> G
L --> G
S --> G
style G fill:#4ecdc4,color:#fff
Figure: Weight initialization matters. All zeros makes all neurons identical (no learning). Too large causes activations and gradients to explode. Too small causes them to vanish. Xavier/He initialization scales weights by 1/√n (Xavier) or √(2/n) (He) to maintain proper activation variance.
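A minimal NumPy sketch of the two initialization rules, written for a fully connected layer with `n_in` inputs and `n_out` outputs.

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier initialization: variance ~ 1/n_in keeps the activation scale roughly constant."""
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out):
    """He initialization: variance ~ 2/n_in, accounting for ReLU zeroing half the units."""
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
```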
Batch Normalization
Normalize activations within each mini-batch:
\[\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\]
Then scale and shift:
\[y_i = \gamma \hat{x}_i + \beta\]
graph LR
subgraph "Batch Norm Benefits"
B1["Faster training"]
B2["Higher learning rates"]
B3["Less sensitive to initialization"]
B4["Regularization effect"]
end
BN["Batch<br/>Normalization"]
BN --> B1
BN --> B2
BN --> B3
BN --> B4
Figure: Batch normalization benefits. It enables faster training, higher learning rates, less sensitivity to initialization, and has a regularization effect. It normalizes activations within each mini-batch.
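A minimal NumPy sketch of the training-time forward pass for a fully connected layer, following the two formulas above. (At test time, running averages of μ and σ² collected during training are used instead of the batch statistics.)

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch; gamma, beta: (D,) learned scale and shift."""
    mu = x.mean(axis=0)                       # per-feature mean over the mini-batch
    var = x.var(axis=0)                       # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize to zero mean, unit variance
    return gamma * x_hat + beta               # then scale and shift
```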
Dropout (Revisited)
During training: randomly drop (zero) each neuron, keeping it with probability p. During testing: use all neurons and scale activations by p, or use inverted dropout, which scales by 1/p during training so that no scaling is needed at test time.
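A minimal NumPy sketch of inverted dropout with keep probability p.

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: scale by 1/p during training so no scaling is needed at test time."""
    if not train:
        return x                                  # test time: use all neurons unchanged
    mask = (np.random.rand(*x.shape) < p) / p     # keep each unit with probability p
    return x * mask
```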
7.12 Hyperparameter Tuning
The Key Hyperparameters
graph TB
subgraph "Hyperparameters to Tune"
LR["Learning rate<br/>(most important!)"]
REG["Regularization strength"]
BS["Batch size"]
ARCH["Architecture choices<br/>(layers, filters, etc.)"]
end
subgraph "Strategy"
C["Coarse-to-fine search"]
L["Log-scale for LR, reg"]
V["Validate on held-out set"]
end
LR --> C
REG --> L
BS --> V
Figure: Key hyperparameters to tune. Learning rate is most important. Use coarse-to-fine search, log-scale for learning rate and regularization, and validate on a held-out set. Architecture choices (layers, filters) also matter significantly.
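A minimal sketch of a coarse random search on a log scale; `train_and_validate` is a hypothetical placeholder for your own training plus validation routine.

```python
import numpy as np

def random_search(train_and_validate, num_trials=20):
    """Sample learning rate and regularization strength on a log scale.
    `train_and_validate(lr, reg)` is a placeholder returning validation accuracy."""
    best_params, best_acc = None, -np.inf
    for _ in range(num_trials):
        lr = 10 ** np.random.uniform(-5, -1)     # e.g. 1e-5 ... 1e-1, sampled on a log scale
        reg = 10 ** np.random.uniform(-5, -1)
        acc = train_and_validate(lr=lr, reg=reg)
        if acc > best_acc:
            best_params, best_acc = (lr, reg), acc
    return best_params, best_acc
```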
Learning Rate Schedules
xychart-beta
title "Learning Rate Schedules"
x-axis "Epoch" [0, 20, 40, 60, 80, 100]
y-axis "Learning Rate" 0 --> 0.1
line "Step decay" [0.1, 0.1, 0.01, 0.01, 0.001, 0.001]
line "Cosine annealing" [0.1, 0.08, 0.05, 0.02, 0.005, 0.001]
line "Warmup + decay" [0.01, 0.1, 0.08, 0.04, 0.01, 0.001]
Figure: Learning rate schedules. Step decay reduces LR at fixed intervals. Cosine annealing smoothly decreases following a cosine curve. Warmup + decay starts low, increases, then decreases. Different schedules work better for different problems.
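Two of these schedules as small Python helpers; the drop factor and interval are illustrative defaults.

```python
import math

def step_decay(lr0, epoch, drop=0.1, every=30):
    """Multiply the learning rate by `drop` every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    """Smoothly decay from lr0 to lr_min following a half cosine."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))
```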
7.13 Transfer Learning
The Power of Pre-trained Models
graph TB
subgraph "Transfer Learning"
P["Pre-trained CNN<br/>(on ImageNet)"]
R["Remove last layer"]
A["Add new classifier<br/>(for your task)"]
F["Fine-tune"]
end
P --> R --> A --> F
W["Works because early layers<br/>learn generic features:<br/>edges, textures, shapes"]
R --> W
style W fill:#ffe66d,color:#000
Figure: Transfer learning process. Start with a pre-trained CNN (on ImageNet), remove the last layer, add a new classifier for your task, and fine-tune. This works because early layers learn generic features (edges, textures, shapes) that transfer across tasks.
When to Use
| Your Data Size | Strategy |
|---|---|
| Very small (<1K) | Use CNN as fixed feature extractor |
| Small (1K-10K) | Fine-tune top layers |
| Medium (10K-100K) | Fine-tune more layers |
| Large (>100K) | Fine-tune all or train from scratch |
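A minimal sketch of the fixed-feature-extractor strategy from the table, assuming a recent torchvision; the number of target classes (5) is illustrative.

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet and replace its final layer for a new task
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Feature-extractor mode (very small datasets): freeze every pre-trained weight
for param in model.parameters():
    param.requires_grad = False

# New classifier head for, say, 5 custom classes; only this layer will be trained
model.fc = nn.Linear(model.fc.in_features, 5)
```

For larger datasets, skip the freezing step (or unfreeze only the top blocks) and fine-tune with a small learning rate, as suggested by the table above.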
7.14 Visualizing and Understanding CNNs
What Do CNNs Learn?
graph TB
subgraph "Visualization Techniques"
V1["Visualize filters<br/>(layer 1: Gabor-like)"]
V2["Visualize activations<br/>(what fires for an image)"]
V3["Gradient-based saliency<br/>(what pixels matter)"]
V4["Occlusion experiments<br/>(block parts, see effect)"]
V5["Feature inversion<br/>(reconstruct from features)"]
end
I["Understanding what<br/>the network sees"]
V1 --> I
V2 --> I
V3 --> I
V4 --> I
V5 --> I
Figure: Visualization techniques for understanding CNNs. Visualize filters (layer 1 shows Gabor-like patterns), activations (what fires for an image), gradient-based saliency (what pixels matter), occlusion experiments (block parts to see effect), and feature inversion (reconstruct from features).
The Hierarchy of Features
| Layer | Typical Features |
|---|---|
| Conv1 | Edges, colors, simple textures |
| Conv2 | Corners, contours, textures |
| Conv3 | Parts (eyes, wheels, windows) |
| Conv4 | Object parts, more abstract |
| Conv5 | Whole objects, scenes |
| FC | Class-specific combinations |
7.15 Connections to Other Chapters
graph TB
CH7["Chapter 7<br/>CS231n"]
CH7 --> CH6["Chapter 6: AlexNet<br/><i>The breakthrough that<br/>CS231n explains</i>"]
CH7 --> CH8["Chapter 8: ResNet<br/><i>Going deeper with<br/>skip connections</i>"]
CH7 --> CH3["Chapter 3: Simple NNs<br/><i>Theoretical foundation<br/>for regularization</i>"]
CH7 --> CH16["Chapter 16: Transformers<br/><i>Attention mechanisms<br/>in vision</i>"]
style CH7 fill:#ff6b6b,color:#fff
Figure: CS231n connects to multiple chapters: AlexNet (the breakthrough it explains), ResNet (going deeper with skip connections), Keeping NNs Simple (theoretical foundation for regularization), and Transformers (attention mechanisms in vision).
7.16 Key Equations Summary
Linear Classifier
\(f(x, W, b) = Wx + b\)
Softmax Loss
\(L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)\)
SGD Update
\(W \leftarrow W - \eta \nabla_W L\)
Convolution Output Size
\(O = \frac{N - F + 2P}{S} + 1\)
Batch Normalization
\(\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y = \gamma\hat{x} + \beta\)
7.17 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["CNNs learn hierarchical<br/>features automatically"]
T2["Backprop + SGD enable<br/>training deep networks"]
T3["Architecture matters:<br/>depth, normalization, etc."]
T4["Transfer learning:<br/>reuse pre-trained features"]
T5["Visualization helps<br/>understand what CNNs learn"]
end
T1 --> C["CS231n provides the<br/>complete mental model<br/>for understanding CNNs—<br/>from pixels to predictions"]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
Figure: Key takeaways from CS231n. CNNs learn hierarchical features automatically, backpropagation and SGD enable training deep networks, architecture matters (depth, normalization), and transfer learning leverages pre-trained models for new tasks.
In One Sentence
CS231n provides the complete foundation for understanding convolutional neural networks—from the mathematics of backpropagation to the practical techniques that make modern computer vision possible.
Exercises
- Calculation: For a 224×224×3 input with a 7×7 conv layer (96 filters, stride 2, no padding), what is the output size? How many parameters?
- Implementation: Implement forward and backward pass for a fully-connected layer from scratch in NumPy.
- Visualization: Train a small CNN on CIFAR-10 and visualize the first-layer filters. What patterns do you see?
- Transfer Learning: Use a pre-trained ResNet to classify a custom dataset. Compare fine-tuning vs. feature extraction.
References & Further Reading
| Resource | Link |
|---|---|
| CS231n Course Website | cs231n.stanford.edu |
| Course Notes (GitHub) | cs231n.github.io |
| Video Lectures (2017) | YouTube Playlist |
| Assignment Solutions | GitHub |
| PyTorch Tutorial | pytorch.org/tutorials |
| Visualizing CNNs (Zeiler & Fergus) | arXiv:1311.2901 |
| Batch Normalization Paper | arXiv:1502.03167 |
Next Chapter: Chapter 8: Deep Residual Learning (ResNet) — We explore the breakthrough that enabled training networks with 100+ layers through skip connections, fundamentally changing how we think about network depth.