Chapter 20: Neural Turing Machines
“We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes.”
Based on: “Neural Turing Machines” (Alex Graves, Greg Wayne, Ivo Danihelka, 2014)
| 📄 Original Paper: arXiv:1410.5401 | Google DeepMind |
20.1 The Memory Problem
Standard neural networks have fixed-size internal memory (hidden states). But many tasks require:
- Long-term storage: Remember information for many steps
- Selective recall: Retrieve specific memories
- Variable capacity: Handle different amounts of information
graph TB
subgraph "Standard Neural Network"
H["Hidden state h<br/>(fixed size)"]
L["Limited capacity<br/>Information decays"]
end
subgraph "Neural Turing Machine"
M["External Memory M<br/>(large, persistent)"]
C["Controller can read/write<br/>selectively"]
end
H --> L
M --> C
K["NTM: External memory<br/>that can be read from<br/>and written to"]
C --> K
style K fill:#ffe66d,color:#000
20.2 What Is a Neural Turing Machine?
The Analogy
Just as a Turing Machine has:
- A finite control (program)
- An infinite tape (memory)
- A read/write head
A Neural Turing Machine has:
- A neural network controller
- An external memory matrix
- Attention-based read/write heads
graph TB
subgraph "Turing Machine Analogy"
TM["Turing Machine<br/>Finite control + Tape"]
NTM["Neural Turing Machine<br/>Neural controller + Memory"]
end
TM -->|"inspired"| NTM
K["Key difference:<br/>NTM is differentiable<br/>→ Can be trained with backprop!"]
NTM --> K
style K fill:#4ecdc4,color:#fff
20.3 The NTM Architecture
High-Level Overview
graph TB
subgraph "Neural Turing Machine"
INPUT["Input x_t"]
CONTROLLER["Controller<br/>(Neural Network)"]
MEMORY["Memory M<br/>[N × M matrix]"]
READ["Read Head<br/>(attention)"]
WRITE["Write Head<br/>(attention)"]
OUTPUT["Output y_t"]
end
INPUT --> CONTROLLER
CONTROLLER --> READ
CONTROLLER --> WRITE
READ --> MEMORY
WRITE --> MEMORY
MEMORY --> READ
CONTROLLER --> OUTPUT
READ --> OUTPUT
Components
- Controller: LSTM or feedforward network
- Memory: N × M matrix (N locations, M features each)
- Read Head: Attention mechanism to read
- Write Head: Attention mechanism to write
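To make the shapes concrete, here is a minimal sketch of the pieces and their dimensions. The sizes (N=128, M=20, an 8-bit input, a 100-unit controller) are illustrative assumptions, not values fixed by the architecture:

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not fixed by the architecture)
N, M = 128, 20                   # memory: 128 locations x 20 features
input_size, controller_size = 8, 100

memory = torch.zeros(N, M)       # the external memory matrix
# The controller sees the external input x_t concatenated with the
# previous read vector r_{t-1}, so its input width is input_size + M.
controller = nn.LSTMCell(input_size + M, controller_size)
```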
20.4 Memory Addressing
Content-Based Addressing
Find memory locations similar to a key:
graph TB
subgraph "Content-Based Addressing"
K["Key k_t<br/>(from controller)"]
M["Memory M<br/>[N locations]"]
SIM["Similarity<br/>K(M[i], k_t)"]
W_C["Content weights w^c<br/>(softmax over similarities)"]
end
K --> SIM
M --> SIM
SIM --> W_C
K2["Similar locations<br/>get high weights"]
W_C --> K2
style K2 fill:#ffe66d,color:#000
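As a toy illustration (the memory contents and key below are made up), the key strength β controls how sharply the softmax focuses on the most similar rows:

```python
import torch
import torch.nn.functional as F

# Toy memory with 4 rows of 3 features; the key matches rows 0 and 2 best
M_mem = torch.tensor([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.9, 0.1, 0.0],
                      [0.0, 0.0, 1.0]])
k = torch.tensor([1.0, 0.0, 0.0])

sims = F.cosine_similarity(M_mem, k.unsqueeze(0), dim=1)   # one score per row
for beta in (1.0, 5.0, 50.0):
    w_c = F.softmax(beta * sims, dim=0)
    print(beta, w_c)   # larger beta concentrates the weights on the best matches
```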
Location-Based Addressing
Shift attention to adjacent locations:
graph LR
subgraph "Location-Based Addressing"
W_PREV["Previous weights<br/>[0.1, 0.7, 0.1, 0.1]"]
SHIFT["Shift operation<br/>s = [0.1, 0.8, 0.1]"]
W_SHIFT["Shifted weights<br/>[0.1, 0.1, 0.7, 0.1]"]
end
W_PREV --> SHIFT --> W_SHIFT
K["Allows moving attention<br/>to neighboring locations"]
W_SHIFT --> K
style K fill:#ffe66d,color:#000
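A minimal sketch of the shift as a circular convolution: with all of the shift mass on the "+1" offset, the attention moves one slot to the right (wrapping around at the edges).

```python
import torch

w_prev = torch.tensor([0.1, 0.7, 0.1, 0.1])
s = torch.tensor([0.0, 0.0, 1.0])   # shift distribution over offsets (-1, 0, +1)

# Circular shift: all of the mass on offset +1 moves each weight one slot right
w_shifted = sum(s[j] * torch.roll(w_prev, j - 1) for j in range(3))
print(w_shifted)   # tensor([0.1000, 0.1000, 0.7000, 0.1000])
```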
Combined Addressing
graph TB
subgraph "Combined Addressing"
W_C["Content weights"]
W_L["Location weights"]
INTERP["Interpolation<br/>g_t"]
CONV["Convolution<br/>(shift)"]
SHARP["Sharpening<br/>γ_t"]
W["Final weights"]
end
W_C --> INTERP
W_L --> INTERP
INTERP --> CONV --> SHARP --> W
K["Combines content similarity<br/>with spatial locality"]
W --> K
style K fill:#4ecdc4,color:#fff
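Putting the pieces together, here is a sketch of one full addressing step. The function name `address` and the shapes are assumptions; the four steps follow the equations summarized in Section 20.15:

```python
import torch
import torch.nn.functional as F

def address(memory, w_prev, k, beta, g, s, gamma):
    # One NTM addressing step (a sketch): memory [N, M], w_prev [N],
    # k [M], s [3], and beta / g / gamma are scalars.
    # 1. Content addressing: softmax over beta-scaled cosine similarities
    w_c = F.softmax(beta * F.cosine_similarity(memory, k.unsqueeze(0), dim=1), dim=0)
    # 2. Interpolation between content weights and the previous weights
    w_g = g * w_c + (1 - g) * w_prev
    # 3. Circular convolution with the shift distribution s over offsets (-1, 0, +1)
    w_s = sum(s[j] * torch.roll(w_g, j - 1) for j in range(3))
    # 4. Sharpening
    w = w_s ** gamma
    return w / w.sum()
```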
20.5 Reading from Memory
The Read Operation
graph TB
subgraph "Read Operation"
M["Memory M<br/>[N × M]"]
W["Attention weights w_t<br/>[N × 1]"]
R["Read vector r_t<br/>[M × 1]"]
end
M -->|"weighted sum"| R
W --> R
F["r_t = Σ w_t[i] × M[i]<br/>= w_t^T M"]
R --> F
style F fill:#ffe66d,color:#000
Mathematical Formulation
\[r_t = \sum_{i=1}^{N} w_t(i)\, M_t(i)\]
Where $w_t$ is the attention distribution over memory locations.
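In code, the read is just a weighted sum, i.e. a single matrix-vector product (the shapes below are illustrative):

```python
import torch

M_mem = torch.randn(8, 5)                   # memory: 8 locations, 5 features each
w = torch.softmax(torch.randn(8), dim=0)    # attention weights over the 8 locations
r = w @ M_mem                               # read vector r_t, shape [5]
```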
20.6 Writing to Memory
The Write Operation
Two steps: erase then add.
graph TB
subgraph "Write Operation"
M_OLD["M_{t-1}[i]"]
E["Erase vector e_t"]
ERASE["M_t'[i] = M_{t-1}[i] ⊙ (1 - w_t[i]e_t)"]
A["Add vector a_t"]
ADD["M_t[i] = M_t'[i] + w_t[i]a_t"]
M_NEW["M_t[i]"]
end
M_OLD --> ERASE
E --> ERASE
ERASE --> ADD
A --> ADD
ADD --> M_NEW
K["Selective erasure + addition<br/>at attended locations"]
ADD --> K
style K fill:#ffe66d,color:#000
The Equations
Erase: \(M_t'(i) = M_{t-1}(i) \odot [1 - w_t(i) e_t]\)
Add: \(M_t(i) = M_t'(i) + w_t(i) a_t\)
Where:
- $e_t$ = erase vector (what to remove)
- $a_t$ = add vector (what to add)
- $w_t$ = write weights (where to write)
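A small worked example of the two write steps (the values are made up; the write weights put all attention on row 1):

```python
import torch

N, M = 4, 3
memory = torch.ones(N, M)
w = torch.tensor([0.0, 1.0, 0.0, 0.0])   # write attention focused on row 1
e = torch.ones(M)                         # erase everything at attended rows
a = torch.tensor([5.0, 6.0, 7.0])         # content to add

memory = memory * (1 - w.unsqueeze(1) * e)   # erase step
memory = memory + w.unsqueeze(1) * a         # add step
print(memory)   # row 1 becomes [5., 6., 7.]; the other rows are untouched
```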
20.7 The Controller
LSTM Controller
The controller can be an LSTM:
graph TB
subgraph "LSTM Controller"
X["x_t (input)"]
R_PREV["r_{t-1} (previous read)"]
H_PREV["h_{t-1} (previous hidden)"]
LSTM["LSTM Cell"]
H["h_t"]
K["k_t (key)"]
BETA["β_t (key strength)"]
G["g_t (interpolation gate)"]
S["s_t (shift vector)"]
GAMMA["γ_t (sharpening)"]
E["e_t (erase)"]
A["a_t (add)"]
end
X --> LSTM
R_PREV --> LSTM
H_PREV --> LSTM
LSTM --> H
H --> K
H --> BETA
H --> G
H --> S
H --> GAMMA
H --> E
H --> A
K --> READ["Read Head"]
BETA --> READ
G --> READ
S --> READ
GAMMA --> READ
E --> WRITE["Write Head"]
A --> WRITE
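In code, these head parameters are typically produced by a linear layer on top of $h_t$, split apart and squashed into each parameter's valid range. This is a sketch of one common parameterization; the specific activation functions are a convention seen in public implementations (an assumption here), not mandated by the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadParams(nn.Module):
    # Maps the controller state h_t to one head's parameters.
    def __init__(self, hidden_size, M):
        super().__init__()
        self.M = M
        # k, e, a are M-dimensional; beta, g, gamma are scalars; s has 3 entries
        self.fc = nn.Linear(hidden_size, 3 * M + 6)

    def forward(self, h):
        k, e, a, beta, g, gamma, s = torch.split(
            self.fc(h), [self.M, self.M, self.M, 1, 1, 1, 3], dim=-1)
        return {
            "k": torch.tanh(k),              # content key
            "beta": F.softplus(beta),        # key strength >= 0
            "g": torch.sigmoid(g),           # interpolation gate in (0, 1)
            "s": F.softmax(s, dim=-1),       # shift distribution over (-1, 0, +1)
            "gamma": 1 + F.softplus(gamma),  # sharpening factor >= 1
            "e": torch.sigmoid(e),           # erase vector in (0, 1)
            "a": a,                          # add vector (unbounded)
        }
```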
20.8 Learning Algorithms from Examples
Copy Task
Task: Copy a sequence to memory, then output it.
graph LR
subgraph "Copy Task"
INPUT["Input: [A, B, C, D, E]"]
WRITE["Write to memory"]
READ["Read from memory"]
OUTPUT["Output: [A, B, C, D, E]"]
end
INPUT --> WRITE --> READ --> OUTPUT
K["NTM learns to:<br/>1. Store sequence in memory<br/>2. Retrieve it later"]
READ --> K
style K fill:#4ecdc4,color:#fff
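A sketch of how copy-task training data is usually generated. The 8-bit vectors and the extra delimiter channel follow the common setup; exact details vary by implementation:

```python
import torch

def copy_task_batch(batch_size=16, seq_len=5, bits=8):
    # Random binary sequences; an extra input channel carries the
    # end-of-sequence delimiter. After the delimiter, the model receives
    # zero input and must reproduce the stored sequence.
    seq = torch.randint(0, 2, (batch_size, seq_len, bits)).float()
    inputs = torch.zeros(batch_size, 2 * seq_len + 1, bits + 1)
    inputs[:, :seq_len, :bits] = seq
    inputs[:, seq_len, bits] = 1.0            # delimiter flag
    targets = torch.zeros(batch_size, 2 * seq_len + 1, bits)
    targets[:, seq_len + 1:, :] = seq         # expected output after the delimiter
    return inputs, targets
```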
Repeat Copy Task
Task: Copy a sequence, then repeat it N times.
graph TB
subgraph "Repeat Copy"
INPUT["Input: [A, B, C] + N=3"]
STORE["Store in memory"]
REPEAT["Repeat N times"]
OUTPUT["Output: [A,B,C, A,B,C, A,B,C]"]
end
INPUT --> STORE --> REPEAT --> OUTPUT
K["Learns to count<br/>and repeat!"]
REPEAT --> K
Associative Recall
Task: Given key-value pairs, retrieve value for a key.
graph TB
subgraph "Associative Recall"
STORE["Store: (A→1, B→2, C→3)"]
QUERY["Query: 'A'"]
RETRIEVE["Retrieve: '1'"]
end
STORE --> QUERY --> RETRIEVE
K["Learns associative<br/>memory lookup"]
RETRIEVE --> K
Priority Sort
Task: Sort a sequence by priority.
The NTM learns a sorting algorithm!
20.9 Why NTMs Are Powerful
Capabilities
graph TB
subgraph "NTM Capabilities"
C1["Long-term memory<br/>(persistent storage)"]
C2["Selective access<br/>(content-based)"]
C3["Algorithmic learning<br/>(from examples)"]
C4["Variable capacity<br/>(memory size)"]
end
P["More powerful than<br/>standard RNNs"]
C1 --> P
C2 --> P
C3 --> P
C4 --> P
style P fill:#4ecdc4,color:#fff
Comparison with RNNs
| Aspect | RNN | NTM |
|---|---|---|
| Memory | Hidden state (fixed) | External memory (large) |
| Persistence | Decays over time | Persistent until overwritten |
| Capacity | Limited by hidden size | Limited by memory size |
| Access | Sequential | Content-based + location-based |
| Algorithms | Implicit | Can learn explicit algorithms |
20.10 Training NTMs
Challenges
graph TB
subgraph "Training Challenges"
C1["Memory initialization<br/>(how to start?)"]
C2["Addressing stability<br/>(weights can be noisy)"]
C3["Gradient flow<br/>(through memory operations)"]
C4["Curriculum learning<br/>(start simple, increase difficulty)"]
end
S["Solutions:<br/>• Careful initialization<br/>• Gradient clipping<br/>• Scheduled sampling"]
C1 --> S
C2 --> S
C3 --> S
C4 --> S
Curriculum Learning
Start with simple tasks, gradually increase complexity:
- Copy (short sequences)
- Repeat Copy (with small N)
- Associative Recall (few pairs)
- Priority Sort (short sequences)
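One minimal way to implement such a schedule is to grow the task size whenever recent training loss drops below a threshold. The threshold, step, and cap below are illustrative assumptions:

```python
def next_length(current_len, recent_loss, threshold=0.01, step=1, max_len=20):
    # Grow the task (e.g. copy sequence length) once the model has
    # mastered the current difficulty; otherwise keep training at it.
    if recent_loss < threshold:
        return min(current_len + step, max_len)
    return current_len
```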
20.11 Connection to Other Architectures
Similar Ideas
graph TB
subgraph "Memory-Augmented Networks"
NTM["Neural Turing Machine<br/>(This chapter)"]
DNC["Differentiable Neural Computer<br/>(Graves, 2016)"]
MEMN2N["Memory Networks<br/>(Weston, 2014)"]
RELRNN["Relational RNN<br/>(Chapter 14)"]
end
NTM --> DNC
NTM --> MEMN2N
NTM --> RELRNN
K["All use external memory<br/>with attention-based access"]
NTM --> K
style K fill:#ffe66d,color:#000
Evolution
- NTM (2014): Basic external memory
- DNC (2016): Improved addressing, temporal links
- Memory Networks: Question answering with memory
- Relational RNNs: Self-attention in memory
20.12 Modern Perspective
Legacy and Impact
timeline
title NTM Impact
2014 : NTM paper
: External differentiable memory
2016 : DNC
: Improved NTM
2017 : Transformers
: Attention everywhere
2020s : Modern LLMs
: Implicit memory in parameters
: Retrieval-augmented generation
Where NTMs Fit Today
- Research: Still interesting for algorithmic tasks
- Production: Less common (Transformers dominate)
- Insight: Shows how to add memory to neural nets
- RAG: Modern retrieval-augmented generation uses similar ideas
20.13 Implementation Details
Memory Matrix
import torch

class Memory:
    def __init__(self, N, M):
        # N memory locations, each storing a vector of M features
        self.memory = torch.zeros(N, M)

    def read(self, weights):
        # weights: [N] attention distribution over locations
        # Weighted sum of memory rows -> read vector of size [M]
        return torch.matmul(weights, self.memory)

    def write(self, weights, erase, add):
        # weights: [N] write attention, erase: [M], add: [M]
        # Erase, then add, scaled by the attention at each location
        self.memory = self.memory * (1 - weights.unsqueeze(1) * erase)
        self.memory = self.memory + weights.unsqueeze(1) * add
Addressing Mechanism
import torch
import torch.nn.functional as F

def content_addressing(key, memory, beta):
    # key: [M], memory: [N, M], beta: scalar key strength
    # Cosine similarity between the key and every memory row, sharpened by beta
    similarities = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)  # [N]
    return F.softmax(beta * similarities, dim=0)

def location_addressing(prev_weights, shift, gamma):
    # prev_weights: [N], shift: [3] over offsets (-1, 0, +1), gamma: scalar >= 1
    # Circular convolution of the previous weights with the shift distribution
    shifted = sum(shift[j] * torch.roll(prev_weights, j - 1) for j in range(3))
    # Sharpening: w^gamma, renormalized (see Section 20.15)
    sharpened = shifted ** gamma
    return sharpened / sharpened.sum()
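A quick usage sketch combining the pieces above for a single read (the values are arbitrary, and the interpolation gate is skipped for brevity):

```python
import torch

mem = Memory(N=8, M=5)
mem.memory = torch.randn(8, 5)                 # pretend something was written earlier
key = torch.randn(5)

w = content_addressing(key, mem.memory, beta=5.0)
w = location_addressing(w, shift=torch.tensor([0., 1., 0.]), gamma=2.0)
r = mem.read(w)
print(w.sum(), r.shape)                        # weights sum to 1, r has shape [5]
```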
20.14 Connection to Other Chapters
graph TB
CH20["Chapter 20<br/>Neural Turing Machines"]
CH20 --> CH14["Chapter 14: Relational RNNs<br/><i>External memory concept</i>"]
CH20 --> CH15["Chapter 15: NMT Attention<br/><i>Attention mechanism</i>"]
CH20 --> CH18["Chapter 18: Pointer Networks<br/><i>Pointing to positions</i>"]
CH20 --> CH16["Chapter 16: Transformers<br/><i>Self-attention evolution</i>"]
style CH20 fill:#ff6b6b,color:#fff
20.15 Key Equations Summary
Content-Based Addressing
\[w_t^c(i) = \frac{\exp\big(\beta_t\, K(k_t, M_t(i))\big)}{\sum_j \exp\big(\beta_t\, K(k_t, M_t(j))\big)}\]
Where $K$ is cosine similarity.
Location-Based Addressing
Interpolation: \[w_t^g = g_t\, w_t^c + (1 - g_t)\, w_{t-1}\]
Shift (circular convolution): \[w_t'(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i-j)\]
Sharpening: \[w_t(i) = \frac{w_t'(i)^{\gamma_t}}{\sum_j w_t'(j)^{\gamma_t}}\]
Read Operation
\[r_t = \sum_{i=1}^{N} w_t^r(i)\, M_t(i)\]
Write Operation
Erase: \(M_t'(i) = M_{t-1}(i) \odot [1 - w_t^w(i)\, e_t]\)
Add: \(M_t(i) = M_t'(i) + w_t^w(i)\, a_t\)
20.16 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["NTMs add external<br/>differentiable memory"]
T2["Content-based addressing<br/>finds similar memories"]
T3["Location-based addressing<br/>shifts attention spatially"]
T4["Can learn algorithms<br/>from examples"]
T5["More powerful than<br/>standard RNNs"]
end
T1 --> C["Neural Turing Machines extend<br/>neural networks with external,<br/>differentiable memory that can be<br/>read from and written to using<br/>attention-based addressing, enabling<br/>learning of algorithmic behaviors<br/>from examples."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Neural Turing Machines couple neural network controllers with external, differentiable memory matrices that can be accessed through content-based and location-based attention, enabling networks to learn algorithmic behaviors like copying, sorting, and associative recall from examples.
Exercises
- Conceptual: Explain the difference between content-based and location-based addressing. When would you use each?
- Implementation: Implement a simple NTM with a feedforward controller for the copy task. Start with sequences of length 3.
- Analysis: Compare the memory capacity of an NTM with N=128, M=20 vs. an LSTM with hidden size 256. When does each have advantages?
- Extension: How would you modify an NTM to handle variable-length memory (adding/removing memory locations dynamically)?
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Graves et al., 2014) | arXiv:1410.5401 |
| Differentiable Neural Computer (Graves et al., 2016) | Nature |
| Memory Networks (Weston et al., 2014) | arXiv:1410.3916 |
| End-to-End Memory Networks (Sukhbaatar et al., 2015) | arXiv:1503.08895 |
| NTM PyTorch Implementation | GitHub |
Next Chapter: Chapter 21: Neural Message Passing for Quantum Chemistry — We explore how message passing provides a unified framework for graph neural networks, with applications to molecular property prediction.