Chapter 18: Pointer Networks
“We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence.”
Based on: “Pointer Networks” (Oriol Vinyals, Meire Fortunato, Navdeep Jaitly, 2015)
| 📄 Original Paper: arXiv:1506.03134 | NeurIPS 2015 |
18.1 The Variable-Length Output Problem
Standard sequence-to-sequence models have a fixed output vocabulary. But many problems require outputs that are positions or elements from the input:
graph TB
subgraph "Standard Seq2Seq"
I1["Input: [A, B, C, D]"]
O1["Output: 'word1', 'word2'<br/>(from fixed vocabulary)"]
end
subgraph "Pointer Network"
I2["Input: [A, B, C, D]"]
O2["Output: 2, 0, 3<br/>(pointers to input positions)"]
end
I1 --> O1
I2 --> O2
K["Outputs are INDICES<br/>into the input sequence!"]
O2 --> K
style K fill:#ffe66d,color:#000
Example Problems
| Problem | Input | Output |
|---|---|---|
| Convex Hull | Points | Indices of hull vertices |
| Traveling Salesman | Cities | Order of visit (indices) |
| Sorting | Numbers | Sorted order (indices) |
| Finding Maximum | Array | Index of max element |
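As a concrete illustration of index-valued outputs, the sorting row above can be phrased with NumPy's `argsort`: the answer is not a new sequence of values but a permutation of input positions (a minimal sketch, not from the original paper):

```python
import numpy as np

x = np.array([3.1, 0.2, 2.7, 1.5])   # input sequence
order = np.argsort(x)                 # indices that sort the input
print(order)        # [1 3 2 0]  -> pointers into x
print(x[order])     # [0.2 1.5 2.7 3.1]
```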
18.2 Limitations of Standard Seq2Seq
The Vocabulary Problem
graph TB
subgraph "Standard Approach"
I["Input: 10 points<br/>in 2D space"]
V["Vocabulary: All possible<br/>coordinate pairs?<br/>→ INFINITE!"]
O["Output: Convex hull"]
end
I --> V --> O
P["Cannot enumerate all<br/>possible outputs!"]
V --> P
style P fill:#ff6b6b,color:#fff
Why This Fails
- Variable input size: Vocabulary would need to include all possible inputs
- Combinatorial explosion: For n inputs, there are n! possible orderings
- Generalization: Can’t generalize to new input sizes
18.3 The Pointer Network Solution
Core Idea
Instead of predicting words from a vocabulary, point to positions in the input:
graph TB
subgraph "Pointer Network"
X["Input sequence<br/>x₁, x₂, ..., x_n"]
E["Encoder<br/>(RNN/LSTM)"]
H["Hidden states<br/>h₁, h₂, ..., h_n"]
D["Decoder<br/>(RNN/LSTM)"]
S["Decoder state s_t"]
P["Pointer mechanism<br/>P(i | s_t, H)"]
O["Output: index i"]
end
X --> E --> H
H --> D --> S
S --> P
H --> P
P --> O
K["P(i) = probability of<br/>pointing to position i"]
P --> K
style K fill:#ffe66d,color:#000
The Pointer Mechanism
At each decoding step, compute attention over input positions:
\[u_j^i = v^T \tanh(W_1 e_j + W_2 d_i)\]
\[P(C_i \mid C_1, ..., C_{i-1}, \mathcal{P}) = \text{softmax}(u^i)\]
Where:
- $e_j$ = encoder hidden state for input position $j$
- $d_i$ = decoder hidden state at output step $i$
- $v$, $W_1$, $W_2$ = learnable parameters
- $\mathcal{P}$ = the input sequence
- $P(C_i)$ = probability distribution over input positions
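A minimal PyTorch sketch of this scoring rule (class and variable names mirror the equation but are otherwise illustrative; a shared hidden size for encoder and decoder states is assumed):

```python
import torch
import torch.nn as nn

class PointerScorer(nn.Module):
    """Computes u_j^i = v^T tanh(W1 e_j + W2 d_i) and softmaxes over positions j."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # applied to encoder states e_j
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # applied to decoder state d_i
        self.v = nn.Linear(hidden_dim, 1, bias=False)            # projects to a scalar score per position

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, n, hidden)   dec_state: (batch, hidden)
        scores = self.v(torch.tanh(self.W1(enc_states) + self.W2(dec_state).unsqueeze(1)))
        return torch.softmax(scores.squeeze(-1), dim=-1)          # (batch, n): P(C_i = j)
```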
18.4 Architecture Details
Encoder
Standard RNN/LSTM encoder:
graph LR
subgraph "Encoder"
X1["x₁"] --> H1["h₁"]
X2["x₂"] --> H2["h₂"]
X3["x₃"] --> H3["h₃"]
X4["x₄"] --> H4["h₄"]
end
H["H = {h₁, h₂, h₃, h₄}<br/>(all encoder states)"]
H1 --> H
H2 --> H
H3 --> H
H4 --> H
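A minimal sketch of such an encoder in PyTorch (embedding 2D input points into a hidden space before a single-layer LSTM; all sizes are illustrative assumptions):

```python
import torch.nn as nn

class PointerEncoder(nn.Module):
    def __init__(self, input_dim=2, hidden_dim=256):
        super().__init__()
        self.embed = nn.Linear(input_dim, hidden_dim)   # e.g. 2D coordinates -> hidden vector
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, x):                 # x: (batch, n, input_dim)
        h, (h_n, c_n) = self.lstm(self.embed(x))
        return h, (h_n, c_n)              # h: (batch, n, hidden) = H; final state seeds the decoder
```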
Decoder with Pointer
graph TB
subgraph "Decoder Step"
S_PREV["s_{i-1}<br/>(previous decoder state)"]
C_PREV["C_{i-1}<br/>(previous pointer)"]
RNN["Decoder RNN"]
S_NEW["s_i"]
ATT["Attention over H<br/>(encoder states)"]
P["Pointer distribution<br/>P(i) over input positions"]
C_NEW["C_i<br/>(selected index)"]
end
S_PREV --> RNN
C_PREV --> RNN
RNN --> S_NEW
S_NEW --> ATT
H_ALL["H (all encoder states)"] --> ATT
ATT --> P --> C_NEW
K["C_i = argmax P(i)"]
P --> K
style K fill:#ffe66d,color:#000
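A single decoding step from the diagram might be sketched as follows, reusing the hypothetical `PointerScorer` from Section 18.3 and feeding the previously selected input element back into the decoder:

```python
import torch
import torch.nn as nn

class PointerDecoderStep(nn.Module):
    def __init__(self, input_dim=2, hidden_dim=256):
        super().__init__()
        self.embed = nn.Linear(input_dim, hidden_dim)
        self.cell = nn.LSTMCell(hidden_dim, hidden_dim)
        self.pointer = PointerScorer(hidden_dim)          # sketch from Section 18.3

    def forward(self, prev_input, state, enc_states):
        # prev_input: input element selected at the previous step, (batch, input_dim)
        h, c = self.cell(self.embed(prev_input), state)   # update decoder state s_i
        probs = self.pointer(enc_states, h)               # P(i) over input positions
        index = probs.argmax(dim=-1)                      # C_i = argmax P(i)  (greedy choice)
        return index, probs, (h, c)
```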
18.5 Comparison: Attention vs Pointer
Standard Attention (Bahdanau)
graph TB
subgraph "Standard Attention"
Q["Query (decoder state)"]
K["Keys (encoder states)"]
V["Values (encoder states)"]
ATT["Attention weights"]
C["Context vector<br/>(weighted sum of values)"]
end
Q --> ATT
K --> ATT
V --> ATT
ATT --> C
O["Output: word from<br/>fixed vocabulary"]
C --> O
Pointer Network
graph TB
subgraph "Pointer Network"
Q["Query (decoder state)"]
K["Keys (encoder states)"]
ATT["Attention weights"]
P["Pointer distribution<br/>(over input positions)"]
end
Q --> ATT
K --> ATT
ATT --> P
O["Output: INDEX<br/>(position in input)"]
P --> O
style O fill:#4ecdc4,color:#fff
Key difference: Pointer networks use attention weights directly as output probabilities, not to create a context vector.
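In code, the difference is essentially one line: whether the attention weights are used to mix the values into a context vector or returned directly as the output distribution (a schematic sketch; `scores` stands for the unnormalized $u^i_j$ from Section 18.3):

```python
import torch

def standard_attention(scores, values):
    # scores: (batch, n), values: (batch, n, d)
    weights = torch.softmax(scores, dim=-1)
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)  # weighted sum of values
    return context          # later fed into a softmax over a FIXED vocabulary

def pointer_attention(scores):
    # scores: (batch, n)
    return torch.softmax(scores, dim=-1)   # the weights ARE the output distribution over positions
```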
18.6 Application: Convex Hull
Problem Setup
Given n points, find the convex hull (the smallest convex polygon containing all the points).
graph TB
subgraph "Convex Hull Problem"
P1["Input: Points<br/>P₁, P₂, ..., Pₙ"]
ENC["Encoder<br/>(processes points)"]
DEC["Decoder<br/>(with pointer)"]
HULL["Output: Indices<br/>of hull vertices<br/>e.g., [2, 5, 8, 3]"]
end
P1 --> ENC --> DEC --> HULL
K["Output length varies<br/>based on input!"]
HULL --> K
style K fill:#ffe66d,color:#000
Training
- Input: Sequence of 2D points
- Target: Sequence of indices forming convex hull
- Loss: Cross-entropy over pointer distribution
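Training pairs can be generated on the fly with a classical hull solver; a minimal sketch using `scipy.spatial.ConvexHull` (the starting vertex and orientation convention simply has to be kept consistent between data generation and training):

```python
import numpy as np
from scipy.spatial import ConvexHull

def make_example(n=10, seed=None):
    rng = np.random.default_rng(seed)
    points = rng.uniform(0.0, 1.0, size=(n, 2))   # input: n random 2D points
    hull = ConvexHull(points)
    target = hull.vertices.tolist()               # indices of hull vertices (counter-clockwise in 2D)
    return points, target

points, target = make_example(n=10, seed=0)
print(target)   # a variable-length list of indices into `points`
```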
18.7 Application: Traveling Salesman Problem
TSP as Sequence Learning
Given n cities, find the shortest tour that visits each city exactly once and returns to the start.
graph LR
subgraph "TSP"
C["Cities:<br/>C₁, C₂, C₃, C₄"]
P["Pointer Network"]
T["Tour:<br/>[2, 4, 1, 3]"]
end
C --> P --> T
K["Output is permutation<br/>of input indices"]
T --> K
style K fill:#ffe66d,color:#000
Results
In the original paper, Pointer Networks learned tours close to optimal for small instances (roughly n ≤ 20) and competitive with the approximate solvers used to produce the training data for up to 50 cities, all without any hand-coded search or optimization algorithm.
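Evaluating a predicted tour needs only the pointed-to order of cities; a small helper like the following (illustrative, not from the paper) computes the length of a closed tour from a predicted permutation:

```python
import numpy as np

def tour_length(cities, order):
    # cities: (n, 2) coordinates; order: permutation of range(n) produced by the pointer network
    route = cities[np.asarray(order)]
    diffs = np.diff(np.vstack([route, route[:1]]), axis=0)   # close the loop back to the start
    return float(np.linalg.norm(diffs, axis=1).sum())
```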
18.8 Application: Delaunay Triangulation
Problem
Given a set of points, find the Delaunay triangulation (a triangulation in which no point lies inside any triangle's circumcircle).
graph TB
subgraph "Delaunay Triangulation"
P["Points"]
ENC["Encoder"]
DEC["Decoder"]
T["Triangles<br/>(as index triplets)"]
end
P --> ENC --> DEC --> T
K["Output: sequences of<br/>3 indices per triangle"]
T --> K
In the original paper, Pointer Networks learned to produce largely valid triangulations for small point sets, though exact-match accuracy drops as the number of points grows.
18.9 Why Pointer Networks Work
Advantages
graph TB
subgraph "Advantages"
A1["Variable output length<br/>(depends on input)"]
A2["No fixed vocabulary<br/>(works with any input)"]
A3["Generalizes to new sizes<br/>(trained on n, works on m)"]
A4["Learns combinatorial structure<br/>(permutations, selections)"]
end
S["Solves problems that<br/>standard seq2seq cannot"]
A1 --> S
A2 --> S
A3 --> S
A4 --> S
style S fill:#4ecdc4,color:#fff
Comparison
| Aspect | Standard Seq2Seq | Pointer Network |
|---|---|---|
| Output vocabulary | Fixed, finite | Variable, input-dependent |
| Output length | Variable (ends on EOS token) | Variable, tied to input size |
| Generalization | To new words | To new input sizes |
| Combinatorial problems | Difficult | Natural fit |
18.10 Training Details
Loss Function
Standard cross-entropy over pointer distribution:
\[\mathcal{L} = -\sum_{i=1}^{m} \log P(C_i^* \mid C_1^*, ..., C_{i-1}^*, \mathcal{P})\]
Where $C_i^*$ is the target pointer at step $i$ and $m$ is the output length.
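In code, with teacher forcing this reduces to gathering the log-probability of each gold index; a minimal sketch (the tensor shapes are assumptions about how step-wise pointer distributions are stacked):

```python
import torch

def pointer_loss(pointer_probs, targets, eps=1e-9):
    # pointer_probs: (batch, m, n) P(C_i = j) per step; targets: (batch, m) gold indices C_i*
    log_p = torch.log(pointer_probs + eps)
    picked = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # log P(C_i* | ...)
    return -picked.sum(dim=-1).mean()                              # mean negative log-likelihood over the batch
```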
Inference
Greedy decoding: \(C_i = \arg\max_j P(C_i = j \mid C_1, ..., C_{i-1}, \mathcal{P})\)
Alternatively, beam search over pointer sequences often yields better solutions.
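Putting the pieces together, a greedy decoding loop over the hypothetical encoder and decoder-step modules sketched earlier might look like this (the stopping criterion, e.g. a special end-of-output position or a task-specific output length, is left as an assumption):

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder_step, x, max_steps):
    # x: (batch, n, input_dim)
    enc_states, state = encoder(x)
    state = (state[0].squeeze(0), state[1].squeeze(0))   # (batch, hidden) each
    prev = x[:, 0, :]          # conventional start element (an assumption; a learned start token also works)
    outputs = []
    for _ in range(max_steps):
        idx, probs, state = decoder_step(prev, state, enc_states)
        outputs.append(idx)
        prev = x[torch.arange(x.size(0)), idx]           # feed the selected input element back in
        # a real implementation would stop here on an end-of-output pointer
    return torch.stack(outputs, dim=1)                   # (batch, max_steps) predicted indices
```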
18.11 Connection to Attention
Pointer = Attention as Output
graph TB
subgraph "Evolution"
ATT["Attention (Bahdanau)<br/>Weights → Context vector"]
PTR["Pointer Networks<br/>Weights → Output indices"]
TRANS["Transformers<br/>Self-attention everywhere"]
end
ATT -->|"reuse weights"| PTR
PTR -->|"influences"| TRANS
K["Pointer shows attention<br/>can be the output itself"]
PTR --> K
style K fill:#ffe66d,color:#000
Modern Perspective
Modern models often combine:
- Attention for context
- Pointer mechanisms for selection
- Copy mechanisms (similar idea)
18.12 Copy Mechanisms
Related Idea
Copy mechanisms (used in summarization, etc.) are similar:
graph TB
subgraph "Copy Mechanism"
GEN["Generate from vocabulary"]
COPY["Copy from input"]
GATE["Gate: choose<br/>generate or copy"]
OUT["Output"]
end
GEN --> GATE
COPY --> GATE
GATE --> OUT
K["Hybrid: can generate<br/>OR point to input"]
GATE --> K
style K fill:#4ecdc4,color:#fff
Pointer Networks are a pure version: only pointing, no generation.
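A sketch of the gating step used in pointer-generator models (following the general form of the Pointer-Generator paper cited in Further Reading; names like `p_gen` are illustrative):

```python
import torch

def mix_generate_and_copy(vocab_probs, copy_probs, src_token_ids, p_gen):
    # vocab_probs: (batch, V) distribution over the fixed vocabulary
    # copy_probs:  (batch, n) pointer distribution over input positions
    # src_token_ids: (batch, n) vocabulary id of each input token
    # p_gen: (batch, 1) gate in [0, 1] choosing generate vs copy
    out = p_gen * vocab_probs
    out = out.scatter_add(-1, src_token_ids, (1.0 - p_gen) * copy_probs)  # add copy mass onto source tokens
    return out
```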
18.13 Limitations
What Pointer Networks Can’t Do
graph TB
subgraph "Limitations"
L1["Only outputs indices<br/>(cannot generate new content)"]
L2["Input-dependent vocabulary<br/>(cannot handle unseen elements)"]
L3["Combinatorial complexity<br/>(n! permutations for n elements)"]
end
S["For some tasks, need<br/>hybrid approaches"]
L1 --> S
L2 --> S
L3 --> S
When to Use Alternatives
- Need new content: Use standard seq2seq or copy mechanisms
- Very large inputs: Attention over all positions is expensive
- Structured outputs: May need specialized architectures
18.14 Modern Applications
Where Pointer Networks Appear
graph TB
subgraph "Applications"
A1["Code generation<br/>(pointing to variables)"]
A2["Question answering<br/>(pointing to spans)"]
A3["Entity linking<br/>(pointing to mentions)"]
A4["Combinatorial optimization<br/>(TSP, scheduling)"]
end
M["Modern models often<br/>combine generation + pointing"]
A1 --> M
A2 --> M
A3 --> M
A4 --> M
In Transformers
Modern LLMs use similar mechanisms:
- Span selection: Point to start/end positions
- Copy attention: Attend to input positions
- Retrieval: Point to relevant documents
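Extractive question-answering heads are a direct descendant of the span-selection idea above: two pointer distributions over input positions, one for the span start and one for the end (a schematic sketch over transformer token representations; sizes are assumptions):

```python
import torch
import torch.nn as nn

class SpanPointerHead(nn.Module):
    """Points to a start and an end position in the input, as in extractive QA."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.start = nn.Linear(hidden_dim, 1)
        self.end = nn.Linear(hidden_dim, 1)

    def forward(self, token_states):                     # (batch, n, hidden)
        start_probs = torch.softmax(self.start(token_states).squeeze(-1), dim=-1)
        end_probs = torch.softmax(self.end(token_states).squeeze(-1), dim=-1)
        return start_probs, end_probs                     # two pointer distributions over positions
```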
18.15 Connection to Other Chapters
graph TB
CH18["Chapter 18<br/>Pointer Networks"]
CH18 --> CH15["Chapter 15: NMT Attention<br/><i>Attention mechanism foundation</i>"]
CH18 --> CH19["Chapter 19: Seq2Seq for Sets<br/><i>Order-invariant processing</i>"]
CH18 --> CH20["Chapter 20: Neural Turing Machines<br/><i>External memory access</i>"]
CH18 --> CH16["Chapter 16: Transformers<br/><i>Self-attention evolution</i>"]
style CH18 fill:#ff6b6b,color:#fff
18.16 Key Equations Summary
Pointer Scores
\[u_j^i = v^T \tanh(W_1 e_j + W_2 d_i)\]
Pointer Distribution
\[P(C_i \mid C_1, ..., C_{i-1}, \mathcal{P}) = \text{softmax}(u^i)\]
Loss Function
\[\mathcal{L} = -\sum_{i=1}^{m} \log P(C_i^* \mid C_1^*, ..., C_{i-1}^*, \mathcal{P})\]
Inference
\[C_i = \arg\max_j P(C_i = j \mid C_1, ..., C_{i-1}, \mathcal{P})\]
18.17 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Pointer Networks output<br/>indices into input sequence"]
T2["No fixed vocabulary—<br/>works with variable inputs"]
T3["Attention weights become<br/>output probabilities"]
T4["Natural for combinatorial<br/>optimization problems"]
T5["Generalizes to new<br/>input sizes"]
end
T1 --> C["Pointer Networks solve the<br/>variable-length output problem by<br/>using attention to point to input<br/>positions, enabling solutions to<br/>combinatorial problems like TSP<br/>and convex hull that standard<br/>seq2seq models cannot handle."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Pointer Networks use attention mechanisms to output indices pointing to positions in the input sequence, enabling variable-length outputs and solving combinatorial optimization problems that require selecting or ordering input elements.
Exercises
- Conceptual: Explain why a standard seq2seq model with a fixed vocabulary cannot solve the convex hull problem, but a Pointer Network can.
- Implementation: Implement a simple Pointer Network for finding the maximum element in a sequence. Train it on sequences of varying lengths.
- Analysis: Compare the computational complexity of Pointer Networks vs standard seq2seq for an input of length n and output of length m.
- Extension: How would you modify a Pointer Network to handle the case where you want to output both indices AND generate new tokens? (Hint: look up “copy mechanisms”)
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Vinyals et al., 2015) | arXiv:1506.03134 |
| Copy Mechanism Paper | arXiv:1603.06393 |
| Pointer-Generator Networks | arXiv:1704.04368 |
| Neural Combinatorial Optimization | arXiv:1611.09940 |
| PyTorch Pointer Network | GitHub Examples |
Next Chapter: Chapter 19: Order Matters: Sequence to Sequence for Sets — We explore how to handle unordered input sets while maintaining the ability to produce ordered outputs, addressing a fundamental challenge in set-to-sequence learning.