Chapter 14: Relational Recurrent Neural Networks
“We introduce a memory module that uses self-attention to allow memories to interact.”
Based on: “Relational Recurrent Neural Networks” (Adam Santoro, Ryan Faulkner, David Raposo, et al., 2018)
| 📄 Original Paper: arXiv:1806.01822 | NeurIPS 2018 |
14.1 Bridging RNNs and Attention
By 2018, attention mechanisms (Chapter 15) and Transformers (Chapter 16) were showing remarkable success. Could the relational power of self-attention be combined with the sequential, stateful processing of RNNs?
This paper introduces Relational Recurrent Neural Networks (RRNNs): RNNs enhanced with self-attention mechanisms in their memory.
graph TB
subgraph "The Combination"
R["RNN<br/>(Sequential processing)"]
A["Self-Attention<br/>(Relational reasoning)"]
RRNN["Relational RNN<br/>(Best of both)"]
end
R --> RRNN
A --> RRNN
B["Enables:<br/>• Sequential processing<br/>• Relational reasoning<br/>• Long-range dependencies"]
RRNN --> B
style RRNN fill:#ffe66d,color:#000
14.2 The Motivation: Relational Reasoning
What Is Relational Reasoning?
Understanding relationships between entities:
graph TB
subgraph "Relational Reasoning Tasks"
Q1["'The red ball is to the<br/>left of the blue cube'"]
Q2["'Who is older: Alice or Bob?'"]
Q3["'If A > B and B > C,<br/>then A > C'"]
end
R["Requires comparing<br/>and relating entities"]
Q1 --> R
Q2 --> R
Q3 --> R
style R fill:#ffe66d,color:#000
Why Standard RNNs Struggle
Standard RNNs process sequentially—they can’t easily compare distant elements:
graph LR
subgraph "Standard RNN"
X1["x₁"] --> H1["h₁"]
X2["x₂"] --> H2["h₂"]
X3["x₃"] --> H3["h₃"]
X4["x₄"] --> H4["h₄"]
end
P["To compare x₁ and x₄,<br/>information must flow<br/>through h₁→h₂→h₃→h₄<br/>(gradients vanish!)"]
H1 --> P
style P fill:#ff6b6b,color:#fff
14.3 The Relational Memory Core
Architecture Overview
The key innovation: a Relational Memory Core (RMC) that uses self-attention:
graph TB
subgraph "Relational Memory Core"
X["Input x_t"]
M_PREV["Memory M_{t-1}<br/>(N memory slots)"]
ATT["Self-Attention<br/>over memory slots"]
UPDATE["Memory Update"]
M_NEW["Memory M_t"]
H["Hidden State h_t"]
end
X --> ATT
M_PREV --> ATT
ATT --> UPDATE
X --> UPDATE
UPDATE --> M_NEW
UPDATE --> H
K["Memory slots can attend<br/>to each other = relational reasoning!"]
ATT --> K
style K fill:#4ecdc4,color:#fff
Memory as a Set of Slots
Instead of a single hidden state, maintain N memory slots:
\[M_t = [m_t^1, m_t^2, ..., m_t^N]\]
Each slot can represent different aspects or entities.
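Concretely, the memory can be held as an N × d tensor, one row per slot. Below is a minimal illustration; the names `num_slots` and `slot_size` are ours, not the paper's:

```python
import torch

# Memory as a set of N slots, each a d-dimensional vector (illustrative sizes).
num_slots, slot_size = 4, 64

# Initialised here to small random values; zeros are another common choice (see 14.10).
memory = 0.1 * torch.randn(num_slots, slot_size)   # shape (N, d), i.e. M_t

print(memory.shape)  # torch.Size([4, 64])
```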
14.4 Self-Attention in Memory
How Memory Slots Interact
graph TB
subgraph "Self-Attention Mechanism"
M["Memory M = [m¹, m², m³]"]
Q["Queries Q = M·W_Q"]
K["Keys K = M·W_K"]
V["Values V = M·W_V"]
ATT["Attention(Q, K, V)"]
M_OUT["Updated Memory"]
end
M --> Q
M --> K
M --> V
Q --> ATT
K --> ATT
V --> ATT
ATT --> M_OUT
E["Each memory slot can<br/>attend to all others!"]
ATT --> E
style E fill:#ffe66d,color:#000
The Attention Operation
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
This allows each memory slot to:
- Query other slots
- Compare with other slots
- Aggregate information from relevant slots (sketched in code below)
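A minimal single-head sketch of this operation over memory slots, assuming each slot is a row of an (N, d) tensor and the projection matrices `w_q`, `w_k`, `w_v` are illustrative stand-ins for learned parameters:

```python
import math
import torch
import torch.nn.functional as F

def slot_self_attention(memory, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over memory slots.

    memory: (N, d) tensor of N slots; w_q, w_k, w_v: (d, d_k) projection matrices.
    Implements Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    """
    q = memory @ w_q                               # (N, d_k): each slot issues a query
    k = memory @ w_k                               # (N, d_k): each slot offers a key
    v = memory @ w_v                               # (N, d_k): and a value
    scores = q @ k.T / math.sqrt(k.shape[-1])      # (N, N) slot-to-slot compatibilities
    weights = F.softmax(scores, dim=-1)            # each slot's attention over all slots
    return weights @ v                             # (N, d_k) aggregated information

# Example: 4 slots of size 64, projected to d_k = 64.
d, d_k = 64, 64
memory = torch.randn(4, d)
w_q, w_k, w_v = (torch.randn(d, d_k) * d ** -0.5 for _ in range(3))
print(slot_self_attention(memory, w_q, w_k, w_v).shape)  # torch.Size([4, 64])
```

The (N, N) score matrix is exactly the "every slot compares itself with every other slot" step that gives the memory its relational character.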
14.5 The Complete RRNN Architecture
Forward Pass
graph TB
subgraph "RRNN Step"
X["x_t (input)"]
M_PREV["M_{t-1} (previous memory)"]
subgraph "Relational Memory Core"
CONCAT["Concat [x_t, M_{t-1}]"]
ATT["Multi-Head Self-Attention"]
FF["Feedforward"]
NORM1["Layer Norm"]
NORM2["Layer Norm"]
end
M_NEW["M_t (new memory)"]
H["h_t (hidden state)"]
end
X --> CONCAT
M_PREV --> CONCAT
CONCAT --> ATT --> NORM1 --> FF --> NORM2 --> M_NEW
M_NEW --> H
K["Transformer-like structure<br/>within RNN framework"]
ATT --> K
style K fill:#ffe66d,color:#000
Mathematical Formulation
1. Concatenate input with memory: \(X_t = [x_t; M_{t-1}]\)
2. Apply self-attention: \(M'_t = \text{MultiHeadAttention}(X_t, X_t, X_t)\)
3. Feedforward and normalize: \(M_t = \text{LayerNorm}(M'_t + \text{FF}(M'_t))\)
4. Extract hidden state: \(h_t = \text{ReadHead}(M_t)\) (one full step is sketched in code below)
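Here is an illustrative PyTorch version of one such step. It follows the chapter's simplified formulation rather than the paper's exact code (the published RMC additionally gates the memory update, LSTM-style); the class name `RelationalMemoryCell` and the hyperparameters are assumptions of ours:

```python
import torch
import torch.nn as nn

class RelationalMemoryCell(nn.Module):
    """Simplified relational memory step: self-attention + feedforward + read head."""

    def __init__(self, slot_size=64, num_slots=4, num_heads=4):
        super().__init__()
        self.num_slots = num_slots
        self.attn = nn.MultiheadAttention(slot_size, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(slot_size, slot_size), nn.ReLU(),
            nn.Linear(slot_size, slot_size),
        )
        self.norm1 = nn.LayerNorm(slot_size)
        self.norm2 = nn.LayerNorm(slot_size)

    def forward(self, x_t, memory):
        # x_t: (batch, slot_size); memory: (batch, num_slots, slot_size)
        x_slot = x_t.unsqueeze(1)                        # treat the input as one extra slot
        full = torch.cat([x_slot, memory], dim=1)        # X_t = [x_t ; M_{t-1}]
        attended, _ = self.attn(full, full, full)        # multi-head self-attention
        attended = self.norm1(full + attended)           # residual + layer norm
        out = self.norm2(attended + self.ff(attended))   # feedforward + layer norm
        new_memory = out[:, 1:, :]                       # keep the memory rows as M_t
        h_t = new_memory.mean(dim=1)                     # simple read head (see 14.9)
        return h_t, new_memory

# Example: run the cell over a short sequence.
cell = RelationalMemoryCell()
memory = 0.1 * torch.randn(2, 4, 64)                     # (batch=2, N=4 slots, d=64)
for x_t in torch.randn(5, 2, 64):                        # sequence of length 5
    h_t, memory = cell(x_t, memory)
print(h_t.shape, memory.shape)                           # (2, 64) and (2, 4, 64)
```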
14.6 Multi-Head Attention
Why Multiple Heads?
graph TB
subgraph "Single Head"
M["Memory"]
ATT1["One attention pattern"]
OUT1["One relationship type"]
end
subgraph "Multi-Head"
M2["Memory"]
ATT2["Head 1: Spatial relations"]
ATT3["Head 2: Temporal relations"]
ATT4["Head 3: Semantic relations"]
OUT2["Multiple relationship types"]
end
M --> ATT1 --> OUT1
M2 --> ATT2 --> OUT2
M2 --> ATT3 --> OUT2
M2 --> ATT4 --> OUT2
K["Different heads capture<br/>different types of relationships"]
ATT2 --> K
style K fill:#ffe66d,color:#000
Multi-Head Formulation
\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]
where each head is \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\).
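To make the head-splitting concrete, here is a hand-rolled, single-example sketch of the formula; the projection matrices are illustrative stand-ins for learned parameters:

```python
import math
import torch

def multi_head_attention(memory, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention over N memory slots (no batch dimension, for clarity).

    memory: (N, d); w_q, w_k, w_v, w_o: (d, d); d must be divisible by num_heads.
    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); output = Concat(head_1..h) W^O.
    """
    n, d = memory.shape
    d_head = d // num_heads
    # Project, then split the last dimension so each head sees its own d_head slice.
    q = (memory @ w_q).view(n, num_heads, d_head).transpose(0, 1)   # (h, N, d_head)
    k = (memory @ w_k).view(n, num_heads, d_head).transpose(0, 1)
    v = (memory @ w_v).view(n, num_heads, d_head).transpose(0, 1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)            # (h, N, N)
    heads = torch.softmax(scores, dim=-1) @ v                       # (h, N, d_head)
    concat = heads.transpose(0, 1).reshape(n, d)                    # Concat(head_1, ..., head_h)
    return concat @ w_o                                             # apply W^O

# Example: 4 slots, model size 64, 4 heads of size 16 each.
d, n, h = 64, 4, 4
memory = torch.randn(n, d)
w_q, w_k, w_v, w_o = (torch.randn(d, d) * d ** -0.5 for _ in range(4))
print(multi_head_attention(memory, w_q, w_k, w_v, w_o, h).shape)    # torch.Size([4, 64])
```

Each head has its own projections and therefore its own (N, N) attention pattern, which is what lets different heads specialise in different kinds of relationships.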
14.7 Applications and Results
bAbI Tasks
The paper evaluates on bAbI—a suite of reasoning tasks:
graph TB
subgraph "bAbI Tasks"
T1["Question Answering<br/>'Where is the apple?'"]
T2["Counting<br/>'How many objects?'"]
T3["Lists/Sets<br/>'What is first?'"]
T4["Positional Reasoning<br/>'Who is left of X?'"]
end
R["RRNN outperforms<br/>standard RNNs and<br/>even some attention models"]
T1 --> R
T2 --> R
T3 --> R
T4 --> R
style R fill:#4ecdc4,color:#fff
Results Summary
| Model | bAbI Accuracy |
|---|---|
| LSTM | ~60% |
| Attention-based | ~75% |
| RRNN | ~85% |
Language Modeling
RRNN also improves language modeling:
- Better perplexity than LSTM
- Captures long-range dependencies
- Learns relational patterns in text
14.8 Comparison with Other Architectures
RRNN vs Standard RNN
graph TB
subgraph "Standard RNN"
S1["Single hidden state"]
S2["Sequential processing"]
S3["Limited relational reasoning"]
end
subgraph "RRNN"
R1["Multiple memory slots"]
R2["Sequential + relational"]
R3["Explicit relationship modeling"]
end
S1 --> C["RRNN adds relational<br/>reasoning capability"]
R1 --> C
style C fill:#ffe66d,color:#000
RRNN vs Transformer
| Aspect | RRNN | Transformer |
|---|---|---|
| Processing | Sequential | Parallel |
| Memory | Recurrent slots | All positions |
| Relational reasoning | ✅ Yes | ✅ Yes |
| Long sequences | Good | Excellent |
| Training speed | Slower | Faster |
14.9 The Read Head
Extracting Hidden State
The read head extracts information from memory:
graph TB
subgraph "Read Head"
M["Memory M_t<br/>[m¹, m², ..., m^N]"]
W["Learnable weights<br/>or attention"]
H["h_t"]
end
M --> W --> H
O["Options:<br/>• Weighted sum<br/>• Attention-based<br/>• Concatenation"]
W --> O
style O fill:#ffe66d,color:#000
Simple Read Head
\[h_t = \frac{1}{N}\sum_{i=1}^{N} m_t^i\]
Or, with learned attention weights: \(h_t = \sum_{i=1}^{N} \alpha_i m_t^i\), where the \(\alpha_i\) are a softmax over the slots.
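Both options fit in a couple of lines; the scoring vector `w` below is an illustrative stand-in for a learned parameter:

```python
import torch
import torch.nn.functional as F

def mean_read_head(memory):
    """h_t = (1/N) * sum_i m_t^i: average over the N slots."""
    return memory.mean(dim=0)

def attention_read_head(memory, w):
    """h_t = sum_i alpha_i * m_t^i with alpha = softmax(M w): a learned weighting."""
    alpha = F.softmax(memory @ w, dim=0)              # (N,) one weight per slot
    return (alpha.unsqueeze(-1) * memory).sum(dim=0)  # weighted sum of slots

memory = torch.randn(4, 64)                            # 4 slots of size 64
w = torch.randn(64)                                    # illustrative scoring vector
print(mean_read_head(memory).shape, attention_read_head(memory, w).shape)
```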
14.10 Training RRNNs
Challenges
graph TB
subgraph "Training Challenges"
C1["More parameters<br/>(attention matrices)"]
C2["Slower than standard RNN<br/>(O(N²) attention)"]
C3["Memory initialization<br/>(how to start?)"]
end
S["Solutions:<br/>• Careful initialization<br/>• Gradient clipping<br/>• Learning rate scheduling"]
C1 --> S
C2 --> S
C3 --> S
Initialization
Memory slots are typically initialized to small random values or zeros; the network learns to use them effectively.
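A minimal sketch of memory initialization together with the gradient clipping mentioned above, assuming a standard PyTorch training step (the tiny linear model is a placeholder for the full RRNN):

```python
import torch
import torch.nn as nn

# Memory initialisation: small random values; all-zeros is another common choice.
batch, num_slots, slot_size = 32, 4, 64
memory = 0.1 * torch.randn(batch, num_slots, slot_size)

# Gradient clipping inside one training step.
model = nn.Linear(slot_size, slot_size)                # placeholder for the RRNN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(memory).pow(2).mean()                     # placeholder loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # tame exploding gradients
optimizer.step()
optimizer.zero_grad()
```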
14.11 Connection to Neural Turing Machines
Similar Concepts
RRNNs share ideas with Neural Turing Machines (Chapter 20):
graph TB
subgraph "Shared Ideas"
M["External memory<br/>(multiple slots)"]
A["Attention-based<br/>memory access"]
R["Relational operations"]
end
NTM["Neural Turing Machine<br/>(Chapter 20)"]
RRNN["Relational RNN<br/>(This chapter)"]
M --> NTM
M --> RRNN
A --> NTM
A --> RRNN
R --> RRNN
D["RRNN: Simpler, more focused<br/>on relational reasoning"]
RRNN --> D
14.12 Modern Perspective
Legacy and Impact
timeline
title RRNN in Context
2017 : Attention Is All You Need
: Transformers introduced
2018 : Relational RNNs
: Combining RNN + Attention
2019 : Transformer dominance
: Most tasks use Transformers
2020s : RNNs still used
: For specific sequential tasks
: RRNN ideas in some models
Where RRNNs Fit Today
- Research: Interesting hybrid approach
- Production: Less common than pure Transformers
- Insight: Shows how to add relational reasoning to RNNs
- Bridge: Connects RNN and Transformer ideas
14.13 Connection to Other Chapters
graph TB
CH14["Chapter 14<br/>Relational RNNs"]
CH14 --> CH12["Chapter 12: LSTMs<br/><i>Standard RNN baseline</i>"]
CH14 --> CH15["Chapter 15: Attention<br/><i>Attention mechanism used here</i>"]
CH14 --> CH16["Chapter 16: Transformers<br/><i>Similar self-attention</i>"]
CH14 --> CH20["Chapter 20: Neural Turing Machines<br/><i>External memory concept</i>"]
CH14 --> CH22["Chapter 22: Relational Reasoning<br/><i>Similar reasoning tasks</i>"]
style CH14 fill:#ff6b6b,color:#fff
14.14 Key Equations Summary
Memory Update
\[M'_t = \text{MultiHeadAttention}([x_t; M_{t-1}], [x_t; M_{t-1}], [x_t; M_{t-1}])\]
\[M_t = \text{LayerNorm}(M'_t + \text{FF}(M'_t))\]
Self-Attention
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
Multi-Head
\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]
Hidden State
\[h_t = \text{ReadHead}(M_t)\]
14.15 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["RRNNs combine RNN sequential<br/>processing with self-attention"]
T2["Relational Memory Core uses<br/>multi-head self-attention"]
T3["Memory slots can attend to<br/>each other for reasoning"]
T4["Better than standard RNNs<br/>on relational tasks"]
T5["Bridge between RNNs and<br/>Transformers"]
end
T1 --> C["Relational RNNs demonstrate how<br/>self-attention can enhance recurrent<br/>networks, enabling explicit relational<br/>reasoning while maintaining sequential<br/>processing capabilities."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Relational Recurrent Neural Networks enhance standard RNNs with a self-attention-based memory core, enabling explicit relational reasoning between memory slots while maintaining sequential processing.
🎉 Part III Complete!
You’ve finished the Sequence Models and Recurrent Networks section. You now understand:
- How RNNs generate text and code (Chapter 11)
- How LSTMs solve vanishing gradients (Chapter 12)
- How to properly regularize RNNs (Chapter 13)
- How attention can enhance RNNs (Chapter 14)
Next up: Part IV - Attention and Transformers, where we explore the attention mechanism that revolutionized sequence modeling!
Exercises
1. Conceptual: Explain how self-attention in RRNNs enables relational reasoning that standard RNNs struggle with.
2. Comparison: Compare the computational complexity of an RRNN, a standard RNN, and a Transformer for a sequence of length T.
3. Implementation: Implement a simple RRNN with 4 memory slots and single-head attention, and test it on a simple relational reasoning task.
4. Analysis: Why might RRNNs be less popular than Transformers today? What are the trade-offs?
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Santoro et al., 2018) | arXiv:1806.01822 |
| Attention Is All You Need | arXiv:1706.03762 |
| Neural Turing Machines | arXiv:1410.5401 |
| bAbI Dataset | GitHub |
| Relational Networks Paper | arXiv:1706.01427 |
Next Chapter: Chapter 15: Neural Machine Translation with Attention — We begin Part IV by exploring how attention mechanisms were first successfully applied to sequence-to-sequence models, solving the bottleneck problem in neural machine translation.