Chapter 13: Recurrent Neural Network Regularization

“We apply dropout to the non-recurrent connections of LSTM units.”

Based on: “Recurrent Neural Network Regularization” (Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals, 2014)

📄 Original Paper: arXiv:1409.2329

13.1 The Regularization Challenge for RNNs

We know from Chapter 6 (AlexNet) that dropout is crucial for preventing overfitting in deep networks. But applying dropout to RNNs is tricky.

graph TB
    subgraph "The Problem"
        D["Standard dropout on RNN"]
        B["Breaks temporal dependencies"]
        W["Worse performance!"]
    end
    
    D --> B --> W
    
    style W fill:#ff6b6b,color:#fff

This 2014 paper by Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals solved the problem and became the standard approach for regularizing RNNs.


13.2 Why Standard Dropout Fails in RNNs

The Temporal Dependency Problem

In RNNs, the hidden state carries information across time:

graph LR
    subgraph "Information Flow"
        H0["h₀"] --> H1["h₁"]
        H1 --> H2["h₂"]
        H2 --> H3["h₃"]
    end
    
    P["If we dropout h₁ randomly,<br/>h₂ loses connection to h₀!"]
    
    H1 --> P
    
    style P fill:#ff6b6b,color:#fff

What Happens with Standard Dropout

graph TB
    subgraph "Standard Dropout (WRONG)"
        X1["x₁"] --> D1["Dropout"]
        H0["h₀"] --> D1
        D1 --> H1["h₁"]
        
        X2["x₂"] --> D2["Dropout"]
        H1 --> D2
        D2 --> H2["h₂"]
    end
    
    P["Different dropout masks at each step<br/>→ Information can't flow consistently<br/>→ Network can't learn long dependencies"]
    
    D2 --> P
    
    style P fill:#ff6b6b,color:#fff

The network needs consistent information flow to learn temporal patterns.


13.3 The Solution: Dropout on Non-Recurrent Connections

Key Insight

Apply dropout only to the non-recurrent connections (input → hidden), not to the recurrent connections (hidden → hidden).

graph TB
    subgraph "Correct Dropout Application"
        X["x_t"]
        H_PREV["h_{t-1}"]
        
        DROP["Dropout<br/>(only on x_t)"]
        NO_DROP["No Dropout<br/>(on h_{t-1})"]
        
        LSTM["LSTM Cell"]
        H_NEW["h_t"]
    end
    
    X --> DROP --> LSTM
    H_PREV --> NO_DROP --> LSTM
    LSTM --> H_NEW
    
    K["Recurrent path stays intact!<br/>Temporal dependencies preserved."]
    
    LSTM --> K
    
    style K fill:#4ecdc4,color:#fff

For LSTM Specifically

graph TB
    subgraph "LSTM with Proper Dropout"
        X["x_t"]
        H["h_{t-1}"]
        C["C_{t-1}"]
        
        DROP["Dropout on x_t"]
        
        FG["Forget Gate<br/>f_t = σ(W_f·[h,dropout(x)]+b)"]
        IG["Input Gate<br/>i_t = σ(W_i·[h,dropout(x)]+b)"]
        CAND["Candidate<br/>C̃_t = tanh(W_C·[h,dropout(x)]+b)"]
        OG["Output Gate<br/>o_t = σ(W_o·[h,dropout(x)]+b)"]
        
        UPDATE["Update C_t, h_t"]
    end
    
    X --> DROP
    H --> FG
    H --> IG
    H --> CAND
    H --> OG
    
    DROP --> FG
    DROP --> IG
    DROP --> CAND
    DROP --> OG
    
    FG --> UPDATE
    IG --> UPDATE
    CAND --> UPDATE
    OG --> UPDATE

Key point: The hidden state h_{t-1} is never dropped out—only the input x_t.
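
A minimal PyTorch sketch of this idea (our own illustration, not the paper's code): wrap nn.LSTMCell so that dropout touches only the incoming x_t, while h_{t-1} and C_{t-1} pass through untouched.

import torch
import torch.nn as nn

class LSTMStepWithInputDropout(nn.Module):
    """One LSTM timestep with dropout on the non-recurrent input only."""
    def __init__(self, input_size, hidden_size, p=0.5):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.drop = nn.Dropout(p)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        x_t = self.drop(x_t)                          # non-recurrent connection: dropped
        h_t, c_t = self.cell(x_t, (h_prev, c_prev))   # recurrent path: untouched
        return h_t, c_t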


13.4 Dropout Between Layers

Stacked RNNs

For multi-layer RNNs, apply dropout between layers (not within):

graph TB
    subgraph "Multi-Layer RSTM with Dropout"
        X["x_t"]
        L1["LSTM Layer 1"]
        DROP["Dropout<br/>(between layers)"]
        L2["LSTM Layer 2"]
        Y["y_t"]
    end
    
    X --> L1 --> DROP --> L2 --> Y
    
    N["Dropout applied to h₁<br/>before feeding to layer 2<br/>But NOT on recurrent connections"]
    
    DROP --> N
    
    style N fill:#ffe66d,color:#000

The Pattern

| Connection Type | Dropout? |
| --- | --- |
| Input → Hidden | ✅ Yes |
| Hidden → Hidden (recurrent) | ❌ No |
| Hidden → Hidden (between layers) | ✅ Yes |
| Hidden → Output | ✅ Yes |
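
To make the pattern concrete, here is a hand-unrolled two-layer LSTM sketch (our own illustration, not the paper's code): dropout sits on the input and between the layers, never on the states carried across time.

import torch
import torch.nn as nn

class TwoLayerLSTMWithDropout(nn.Module):
    def __init__(self, input_size, hidden_size, p=0.5):
        super().__init__()
        self.cell1 = nn.LSTMCell(input_size, hidden_size)
        self.cell2 = nn.LSTMCell(hidden_size, hidden_size)
        self.drop = nn.Dropout(p)

    def forward(self, inputs, state1, state2):
        # inputs: (seq_len, batch, input_size); state1/state2: (h, c) tuples
        outputs = []
        for x_t in inputs:
            x_t = self.drop(x_t)              # input -> hidden: dropout
            state1 = self.cell1(x_t, state1)  # recurrent path of layer 1: clean
            h1 = self.drop(state1[0])         # layer 1 -> layer 2: dropout
            state2 = self.cell2(h1, state2)   # recurrent path of layer 2: clean
            outputs.append(state2[0])
        return torch.stack(outputs), state1, state2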

13.5 Experimental Setup

Language Modeling Task

The paper evaluates on:

  • Penn Treebank: Small dataset (~1M words)
  • Large corpus: ~90M words

Task: Predict next word given previous words.

graph LR
    subgraph "Language Modeling"
        W1["'The'"] --> W2["'cat'"]
        W2 --> W3["'sat'"]
        W3 --> W4["'on'"]
        W4 --> W5["'the'"]
    end
    
    T["Predict next word<br/>given all previous"]
    
    W1 --> T
    W2 --> T
    W3 --> T
    W4 --> T
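
Concretely, the training pairs are just the token sequence shifted by one position (a toy illustration, not the paper's preprocessing):

tokens  = ["The", "cat", "sat", "on", "the", "mat"]
inputs  = tokens[:-1]   # ["The", "cat", "sat", "on", "the"]
targets = tokens[1:]    # ["cat", "sat", "on", "the", "mat"]
# at step t the model reads inputs[t] (plus the hidden state summarizing
# everything earlier) and is trained to predict targets[t]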

Architecture

  • 2-layer LSTM
  • 650 hidden units per layer
  • Dropout rate: 0.5 (on non-recurrent connections)
  • Embedding size: 650

13.6 Results: Penn Treebank

Without Dropout

Perplexity: ~120 (baseline)

With Proper Dropout

Perplexity: ~78 (35% improvement!)

xychart-beta
    title "Penn Treebank Perplexity"
    x-axis ["Baseline", "With Dropout"]
    y-axis "Perplexity" 0 --> 140
    bar [120, 78]

Comparison with Other Methods

| Method | Perplexity |
| --- | --- |
| Baseline LSTM | 120 |
| With dropout (this paper) | 78 |
| Previous best | ~82 |

State-of-the-art at the time!


13.7 Results: Large Corpus

Scaling to Big Data

On a 90M word corpus:

graph TB
    subgraph "Large Corpus Results"
        NOD["No dropout<br/>Perplexity: 68"]
        DROP["With dropout<br/>Perplexity: 48"]
    end
    
    I["30% improvement<br/>even on large dataset!"]
    
    DROP --> I
    
    style I fill:#4ecdc4,color:#fff

Key Finding

Dropout helps even when you have lots of data—it’s not just for small datasets!


13.8 Why This Works

Information Theory Perspective

graph TB
    subgraph "What Dropout Does"
        R["Regularizes input processing"]
        P["Preserves temporal structure"]
        G["Prevents co-adaptation<br/>of input features"]
    end
    
    subgraph "What It Doesn't Do"
        N["Doesn't break<br/>recurrent connections"]
        M["Doesn't interfere with<br/>memory mechanisms"]
    end
    
    R --> S["Better generalization"]
    P --> S
    G --> S
    N --> S
    M --> S
    
    style S fill:#ffe66d,color:#000

The Recurrent Path Stays Clean

The hidden state → hidden state connection remains deterministic (no dropout), allowing:

  • Consistent gradient flow
  • Long-term memory to work
  • Temporal patterns to be learned

13.9 Implementation Details

PyTorch Code

import torch.nn as nn

class RegularizedLSTM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
        # Dropout on input embeddings
        self.input_dropout = nn.Dropout(0.5)
        
        # LSTM with dropout between layers
        self.lstm = nn.LSTM(
            embed_size, 
            hidden_size, 
            num_layers,
            dropout=0.5  # Between layers only!
        )
        
        # Dropout before output
        self.output_dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden_size, vocab_size)
    
    def forward(self, x, hidden):
        # Embed and dropout input
        x = self.embedding(x)
        x = self.input_dropout(x)
        
        # LSTM (dropout applied between layers internally)
        out, hidden = self.lstm(x, hidden)
        
        # Dropout before output
        out = self.output_dropout(out)
        out = self.fc(out)
        
        return out, hidden
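
A quick usage sketch (shapes follow PyTorch's default sequence-first layout; the sizes mirror the paper's medium setup of 650 units, 35 unroll steps, batch size 20):

import torch

model = RegularizedLSTM(vocab_size=10000, embed_size=650,
                        hidden_size=650, num_layers=2)

seq_len, batch, hidden = 35, 20, 650
x  = torch.randint(0, 10000, (seq_len, batch))   # token ids
h0 = torch.zeros(2, batch, hidden)               # (num_layers, batch, hidden_size)
c0 = torch.zeros(2, batch, hidden)

logits, state = model(x, (h0, c0))               # logits: (seq_len, batch, vocab_size)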

Key Points

  1. Input dropout: Apply to embeddings/input
  2. LSTM dropout: PyTorch’s dropout parameter handles between-layer dropout
  3. Output dropout: Apply before final linear layer
  4. No recurrent dropout: PyTorch's nn.LSTM never applies dropout to the recurrent (hidden → hidden) connections; the dropout argument only affects the outputs passed between stacked layers

13.10 Variational Dropout (Advanced)

A Refinement

Later work introduced variational dropout: use the same dropout mask across all timesteps.

graph TB
    subgraph "Standard Dropout"
        M1["Mask t=1"]
        M2["Mask t=2"]
        M3["Mask t=3"]
    end
    
    subgraph "Variational Dropout"
        M["Same mask<br/>for all timesteps"]
    end
    
    V["More consistent<br/>Better for some tasks"]
    
    M --> V

This is closer to the original dropout philosophy but adapted for sequences.
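
A minimal sketch of the idea, sometimes called "locked" dropout (our own illustration, not the reference implementation from the variational dropout paper): sample one mask per sequence and reuse it at every timestep.

import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    """Applies the same dropout mask at every timestep of a sequence."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        # x: (seq_len, batch, features)
        if not self.training or self.p == 0:
            return x
        # one mask per (batch, feature), broadcast over the time dimension
        mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - self.p)
        return x * mask / (1 - self.p)   # rescale to keep expected activations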


13.11 Connection to Other Regularization Techniques

Weight Tying

A complementary technique from later language-modeling work (not used in this paper) is weight tying: sharing the same weights between the input embedding and the output projection.

graph LR
    subgraph "Weight Tying"
        E["Embedding Matrix<br/>V × d"]
        O["Output Matrix<br/>V × d"]
    end
    
    T["Share weights:<br/>E = O^T"]
    
    E --> T
    O --> T
    
    B["Benefits:<br/>• Fewer parameters<br/>• Better generalization"]
    
    T --> B
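
In PyTorch, tying is a one-line parameter share (a sketch, assuming embed_size == hidden_size, as in the 650/650 model above):

# inside a model's __init__, after creating the embedding and output layers:
self.fc = nn.Linear(hidden_size, vocab_size)
self.fc.weight = self.embedding.weight   # both are (vocab_size, d); biases stay separate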

Other Techniques

| Technique | Where Applied | Effect |
| --- | --- | --- |
| Dropout | Non-recurrent connections | Prevents overfitting |
| Weight tying | Embedding = output projection | Parameter efficiency |
| Gradient clipping | All gradients | Prevents gradient explosion |
| Early stopping | Training loop | Prevents overfitting |
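
Gradient clipping, for instance, is a single line in the training step (a sketch using PyTorch's built-in utility):

loss.backward()
# rescale gradients if their global norm exceeds the threshold
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
optimizer.zero_grad()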

13.12 Modern Best Practices

Current Recommendations

graph TB
    subgraph "RNN Regularization (2020s)"
        D1["Input dropout: 0.1-0.3"]
        D2["Between-layer dropout: 0.2-0.5"]
        D3["Output dropout: 0.1-0.3"]
        G["Gradient clipping: 1-5"]
        W["Weight tying: Often helpful"]
    end
    
    B["Best practices"]
    
    D1 --> B
    D2 --> B
    D3 --> B
    G --> B
    W --> B

When Using Transformers

Note: Transformers (Chapter 16) use dropout differently:

  • Dropout on attention weights
  • Dropout on feedforward layers
  • No recurrent connections to worry about!

13.13 Connection to Other Chapters

graph TB
    CH13["Chapter 13<br/>RNN Regularization"]
    
    CH13 --> CH6["Chapter 6: AlexNet<br/><i>Dropout for CNNs</i>"]
    CH13 --> CH12["Chapter 12: LSTMs<br/><i>What we're regularizing</i>"]
    CH13 --> CH3["Chapter 3: Simple NNs<br/><i>MDL view of regularization</i>"]
    CH13 --> CH16["Chapter 16: Transformers<br/><i>Different dropout pattern</i>"]
    
    style CH13 fill:#ff6b6b,color:#fff

13.14 Key Equations Summary

LSTM with Input Dropout

\[x'_t = \text{dropout}(x_t)\]
\[f_t = \sigma(W_f \cdot [h_{t-1}, x'_t] + b_f)\]
\[i_t = \sigma(W_i \cdot [h_{t-1}, x'_t] + b_i)\]
\[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x'_t] + b_C)\]
\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]
\[o_t = \sigma(W_o \cdot [h_{t-1}, x'_t] + b_o)\]
\[h_t = o_t \odot \tanh(C_t)\]

Note: h_{t-1} is never dropped out!

Perplexity

\[\text{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i | w_1, ..., w_{i-1})\right)\]

Lower perplexity = better model.
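
In code, perplexity is simply the exponential of the average per-token cross-entropy (in nats); a minimal sketch:

import math

def perplexity(total_nll, n_tokens):
    """total_nll: summed negative log-likelihood over all predicted tokens."""
    return math.exp(total_nll / n_tokens)

# an average loss of ~4.36 nats per token corresponds to perplexity ~78
print(perplexity(4.36 * 1000, 1000))   # ≈ 78.3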


13.15 Chapter Summary

graph TB
    subgraph "Key Takeaways"
        T1["Standard dropout breaks<br/>temporal dependencies in RNNs"]
        T2["Apply dropout ONLY to<br/>non-recurrent connections"]
        T3["Never dropout the<br/>hidden state → hidden state path"]
        T4["Dropout between layers<br/>is safe and effective"]
        T5["35% improvement on<br/>language modeling tasks"]
    end
    
    T1 --> C["Proper RNN regularization requires<br/>careful application of dropout—<br/>only on non-recurrent connections—<br/>to preserve temporal structure<br/>while preventing overfitting."]
    T2 --> C
    T3 --> C
    T4 --> C
    T5 --> C
    
    style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px

In One Sentence

This paper showed that dropout should be applied only to the non-recurrent connections of RNNs, preserving temporal dependencies while cutting language modeling perplexity by roughly 35%.


Exercises

  1. Conceptual: Explain why dropping out the hidden state in an RNN breaks temporal dependencies, but dropping out the input doesn’t.

  2. Implementation: Implement a language model with and without proper dropout. Compare perplexity on a validation set.

  3. Analysis: The paper shows dropout helps even on large datasets. Why might this be? (Hint: think about what dropout does beyond just preventing overfitting.)

  4. Comparison: Compare the dropout strategy in this paper with the dropout used in Transformers (Chapter 16). What are the differences and why?


References & Further Reading

| Resource | Link |
| --- | --- |
| Original Paper (Zaremba et al., 2014) | arXiv:1409.2329 |
| Variational Dropout for RNNs | arXiv:1512.05287 |
| Recurrent Dropout | arXiv:1603.05118 |
| PyTorch LSTM Dropout | Documentation |
| Language Modeling Tutorial | PyTorch Tutorials |

Next Chapter: Chapter 14: Relational Recurrent Neural Networks — We explore how self-attention mechanisms can be integrated into recurrent networks, bridging toward the Transformer architecture.




Educational content based on public research papers. All original papers are cited with links to their sources.