Chapter 18: Pointer Networks

“We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence.”

Based on: “Pointer Networks” (Oriol Vinyals, Meire Fortunato, Navdeep Jaitly, 2015)

📄 Original Paper: arXiv:1506.03134 (NeurIPS 2015)

18.1 The Variable-Length Output Problem

Standard sequence-to-sequence models have a fixed output vocabulary. But many problems require outputs that are positions or elements from the input:

graph TB
    subgraph "Standard Seq2Seq"
        I1["Input: [A, B, C, D]"]
        O1["Output: 'word1', 'word2'<br/>(from fixed vocabulary)"]
    end
    
    subgraph "Pointer Network"
        I2["Input: [A, B, C, D]"]
        O2["Output: 2, 0, 3<br/>(pointers to input positions)"]
    end
    
    I1 --> O1
    I2 --> O2
    
    K["Outputs are INDICES<br/>into the input sequence!"]
    
    O2 --> K
    
    style K fill:#ffe66d,color:#000

Example Problems

| Problem | Input | Output |
|---|---|---|
| Convex Hull | Points | Indices of hull vertices |
| Traveling Salesman | Cities | Order of visit (indices) |
| Sorting | Numbers | Sorted order (indices) |
| Finding Maximum | Array | Index of max element |
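
To make “outputs as indices” concrete, here is a tiny non-learned Python illustration of the sorting and maximum rows of the table (the data values are arbitrary):

```python
# Toy illustration: the output is a sequence of positions in the input,
# not tokens drawn from a fixed vocabulary.
numbers = [0.7, 0.1, 0.9, 0.4]

# Sorting expressed as pointers: the indices that would sort the input.
sorted_pointers = sorted(range(len(numbers)), key=lambda i: numbers[i])
print(sorted_pointers)   # [1, 3, 0, 2]

# Finding the maximum expressed as a single pointer.
max_pointer = max(range(len(numbers)), key=lambda i: numbers[i])
print(max_pointer)       # 2
```

A Pointer Network learns to produce exactly this kind of index-valued output from examples, which matters for problems where no hand-written procedure is available.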

18.2 Limitations of Standard Seq2Seq

The Vocabulary Problem

graph TB
    subgraph "Standard Approach"
        I["Input: 10 points<br/>in 2D space"]
        V["Vocabulary: All possible<br/>coordinate pairs?<br/>→ INFINITE!"]
        O["Output: Convex hull"]
    end
    
    I --> V --> O
    
    P["Cannot enumerate all<br/>possible outputs!"]
    
    V --> P
    
    style P fill:#ff6b6b,color:#fff

Why This Fails

  • Variable input size: The output vocabulary would have to cover every element that could ever appear in an input
  • Combinatorial explosion: For n inputs there are n! possible orderings, far too many to enumerate as output classes
  • Generalization: A model with a fixed output layer cannot transfer to input sizes it never saw during training

18.3 The Pointer Network Solution

Core Idea

Instead of predicting words from a vocabulary, point to positions in the input:

graph TB
    subgraph "Pointer Network"
        X["Input sequence<br/>x₁, x₂, ..., x_n"]
        E["Encoder<br/>(RNN/LSTM)"]
        H["Hidden states<br/>h₁, h₂, ..., h_n"]
        D["Decoder<br/>(RNN/LSTM)"]
        S["Decoder state s_t"]
        P["Pointer mechanism<br/>P(i | s_t, H)"]
        O["Output: index i"]
    end
    
    X --> E --> H
    H --> D --> S
    S --> P
    H --> P
    P --> O
    
    K["P(i) = probability of<br/>pointing to position i"]
    
    P --> K
    
    style K fill:#ffe66d,color:#000

The Pointer Mechanism

At each decoding step, compute attention over input positions:

\[u_j^i = v^T \tanh(W_1 e_j + W_2 d_i), \quad j \in (1, \ldots, n)\]

\[P(C_i | C_1, ..., C_{i-1}, \mathcal{P}) = \text{softmax}(u^i)\]

Where:

  • $e_j$ = encoder hidden state for input position $j$ (the $h_j$ shown in the diagrams)
  • $d_i$ = decoder hidden state at output step $i$
  • $\mathcal{P}$ = the input sequence itself
  • $P(C_i | \cdot)$ = the resulting probability distribution over the $n$ input positions
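
A minimal PyTorch sketch of this scoring step, for a single unbatched decoder step (the module and variable names are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    """Additive (Bahdanau-style) scoring whose softmax is used directly as the output distribution."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # applied to encoder states e_j
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # applied to decoder state d_i
        self.v = nn.Linear(hidden_dim, 1, bias=False)            # v^T

    def forward(self, encoder_states, decoder_state):
        # encoder_states: (n, hidden_dim); decoder_state: (hidden_dim,)
        scores = self.v(torch.tanh(self.W1(encoder_states) + self.W2(decoder_state)))  # u^i_j
        return torch.softmax(scores.squeeze(-1), dim=-1)  # P(C_i = j) over the n input positions
```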

18.4 Architecture Details

Encoder

Standard RNN/LSTM encoder:

graph LR
    subgraph "Encoder"
        X1["x₁"] --> H1["h₁"]
        X2["x₂"] --> H2["h₂"]
        X3["x₃"] --> H3["h₃"]
        X4["x₄"] --> H4["h₄"]
    end
    
    H["H = {h₁, h₂, h₃, h₄}<br/>(all encoder states)"]
    
    H1 --> H
    H2 --> H
    H3 --> H
    H4 --> H

Decoder with Pointer

graph TB
    subgraph "Decoder Step"
        S_PREV["s_{i-1}<br/>(previous decoder state)"]
        C_PREV["C_{i-1}<br/>(previous pointer)"]
        
        RNN["Decoder RNN"]
        S_NEW["s_i"]
        
        ATT["Attention over H<br/>(encoder states)"]
        P["Pointer distribution<br/>P(i) over input positions"]
        C_NEW["C_i<br/>(selected index)"]
    end
    
    S_PREV --> RNN
    C_PREV --> RNN
    RNN --> S_NEW
    S_NEW --> ATT
    H_ALL["H (all encoder states)"] --> ATT
    ATT --> P --> C_NEW
    
    K["C_i = argmax P(i)"]
    
    P --> K
    
    style K fill:#ffe66d,color:#000
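
Continuing the sketch, one decoder step wraps an LSTM cell around the `PointerAttention` module from Section 18.3; the embedding of the previously selected input element serves as the next decoder input, matching the conditioning on C_{i-1} above (shapes and names remain illustrative):

```python
import torch
import torch.nn as nn

class PointerDecoderStep(nn.Module):
    """One decoding step: update the decoder state, then point into the input."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        self.pointer = PointerAttention(hidden_dim)  # defined in the Section 18.3 sketch

    def forward(self, prev_selected, state, encoder_states):
        # prev_selected: (input_dim,) embedding of the previously chosen element x_{C_{i-1}}
        # state: (h, c), each of shape (1, hidden_dim); encoder_states: (n, hidden_dim)
        h, c = self.cell(prev_selected.unsqueeze(0), state)
        probs = self.pointer(encoder_states, h.squeeze(0))   # P(C_i = j), shape (n,)
        return int(torch.argmax(probs)), probs, (h, c)       # greedy choice C_i = argmax_j P(j)
```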

18.5 Comparison: Attention vs Pointer

Standard Attention (Bahdanau)

graph TB
    subgraph "Standard Attention"
        Q["Query (decoder state)"]
        K["Keys (encoder states)"]
        V["Values (encoder states)"]
        ATT["Attention weights"]
        C["Context vector<br/>(weighted sum of values)"]
    end
    
    Q --> ATT
    K --> ATT
    V --> ATT
    ATT --> C
    
    O["Output: word from<br/>fixed vocabulary"]
    
    C --> O

Pointer Network

graph TB
    subgraph "Pointer Network"
        Q["Query (decoder state)"]
        K["Keys (encoder states)"]
        ATT["Attention weights"]
        P["Pointer distribution<br/>(over input positions)"]
    end
    
    Q --> ATT
    K --> ATT
    ATT --> P
    
    O["Output: INDEX<br/>(position in input)"]
    
    P --> O
    
    style O fill:#4ecdc4,color:#fff

Key difference: Pointer networks use attention weights directly as output probabilities, not to create a context vector.
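
The contrast fits in a few lines of illustrative code: given the same attention weights, standard attention folds them back into a context vector that feeds further computation, while a pointer network reads them off directly as the output distribution.

```python
import torch

n, hidden_dim = 4, 8
encoder_states = torch.randn(n, hidden_dim)
weights = torch.softmax(torch.randn(n), dim=-1)   # attention weights from any scoring function

context = weights @ encoder_states   # standard attention: weighted sum used as extra decoder input
pointer_dist = weights               # pointer network: the weights themselves are P(C_i = j)
print(context.shape, int(pointer_dist.argmax()))
```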


18.6 Application: Convex Hull

Problem Setup

Given n points, find the convex hull (the smallest convex polygon containing all of the points).

graph TB
    subgraph "Convex Hull Problem"
        P1["Input: Points<br/>P₁, P₂, ..., Pₙ"]
        ENC["Encoder<br/>(processes points)"]
        DEC["Decoder<br/>(with pointer)"]
        HULL["Output: Indices<br/>of hull vertices<br/>e.g., [2, 5, 8, 3]"]
    end
    
    P1 --> ENC --> DEC --> HULL
    
    K["Output length varies<br/>based on input!"]
    
    HULL --> K
    
    style K fill:#ffe66d,color:#000

Training

  • Input: A sequence of 2D points
  • Target: The sequence of indices forming the convex hull (a data-generation sketch follows below)
  • Loss: Cross-entropy over the pointer distribution
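
A hedged sketch of generating such (input, target) pairs with SciPy: the paper samples points uniformly in the unit square, and `scipy.spatial.ConvexHull` conveniently returns the hull vertices as indices into the input array (the starting vertex and ordering convention used for training targets is a detail glossed over here):

```python
import numpy as np
from scipy.spatial import ConvexHull

def make_convex_hull_example(n_points=10, seed=0):
    """One (input, target) pair: random 2D points and the indices of their convex hull."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(0.0, 1.0, size=(n_points, 2))   # input sequence of 2D points
    hull_indices = ConvexHull(points).vertices            # counterclockwise indices into `points`
    return points, hull_indices

points, target = make_convex_hull_example()
print(points.shape, target)   # (10, 2) and an index sequence such as [3 7 0 9 2]
```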

18.7 Application: Traveling Salesman Problem

TSP as Sequence Learning

Given n cities, find the shortest tour that visits each city exactly once and returns to the start.

graph LR
    subgraph "TSP"
        C["Cities:<br/>C₁, C₂, C₃, C₄"]
        P["Pointer Network"]
        T["Tour:<br/>[2, 4, 1, 3]"]
    end
    
    C --> P --> T
    
    K["Output is permutation<br/>of input indices"]
    
    T --> K
    
    style K fill:#ffe66d,color:#000

Results

In the paper, Pointer Networks trained purely from example tours produce solutions close to those of the solvers used to generate the training data for instances of up to 50 cities, with quality degrading as the instances grow larger — all without an explicit optimization algorithm.


18.8 Application: Delaunay Triangulation

Problem

Given a set of points, find their Delaunay triangulation (a triangulation in which no point lies inside the circumcircle of any triangle).

graph TB
    subgraph "Delaunay Triangulation"
        P["Points"]
        ENC["Encoder"]
        DEC["Decoder"]
        T["Triangles<br/>(as index triplets)"]
    end
    
    P --> ENC --> DEC --> T
    
    K["Output: sequences of<br/>3 indices per triangle"]
    
    T --> K

In the paper, Pointer Networks learn to produce largely correct triangulations for small point sets, with accuracy dropping as the number of points grows.


18.9 Why Pointer Networks Work

Advantages

graph TB
    subgraph "Advantages"
        A1["Variable output length<br/>(depends on input)"]
        A2["No fixed vocabulary<br/>(works with any input)"]
        A3["Generalizes to new sizes<br/>(trained on n, works on m)"]
        A4["Learns combinatorial structure<br/>(permutations, selections)"]
    end
    
    S["Solves problems that<br/>standard seq2seq cannot"]
    
    A1 --> S
    A2 --> S
    A3 --> S
    A4 --> S
    
    style S fill:#4ecdc4,color:#fff

Comparison

| Aspect | Standard Seq2Seq | Pointer Network |
|---|---|---|
| Output vocabulary | Fixed, finite | Variable, input-dependent |
| Output length | Fixed or learned | Variable, input-dependent |
| Generalization | To new words | To new input sizes |
| Combinatorial problems | Difficult | Natural fit |

18.10 Training Details

Loss Function

Standard cross-entropy over pointer distribution:

\[\mathcal{L} = -\sum_{i=1}^{m} \log P(C_i^* | C_1^*, ..., C_{i-1}^*, \mathcal{P})\]

Where $C_i^*$ is the target pointer at step $i$ and $m$ is the length of the target sequence.
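
Under teacher forcing (feeding the ground-truth pointers $C_{i-1}^*$ back into the decoder), this reduces to ordinary cross-entropy over the per-step pointer logits. A minimal sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

def pointer_loss(pointer_logits, targets):
    """pointer_logits: (m, n) scores u^i over n positions for m steps; targets: (m,) indices C_i^*."""
    # reduction="sum" matches the summed form above; the default mean also works in practice.
    return F.cross_entropy(pointer_logits, targets, reduction="sum")

logits = torch.randn(4, 10, requires_grad=True)        # e.g. m = 4 decoding steps, n = 10 inputs
loss = pointer_loss(logits, torch.tensor([2, 5, 8, 3]))
loss.backward()                                         # gradients flow into the pointer scores
```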

Inference

Greedy decoding: \(C_i = \arg\max_j P(C_i = j | C_1, ..., C_{i-1}, \mathcal{P})\)

Or use beam search, which keeps several candidate sequences and typically finds better solutions.
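
For outputs that must be permutations or repeat-free selections (TSP tours, hull vertices), decoding is usually constrained to valid sequences; the paper applies such a constraint for TSP at inference time, and masking already-chosen positions is one common way to implement it. A sketch reusing the `PointerDecoderStep` from Section 18.4:

```python
import torch

def greedy_decode(decoder_step, inputs, encoder_states, start_input, init_state, num_steps):
    """Greedy pointer decoding that never points at an already-selected position.

    inputs:         (n, input_dim) embedded input elements, fed back once selected
    encoder_states: (n, hidden_dim) encoder states the pointer attends over
    """
    mask = torch.zeros(inputs.size(0), dtype=torch.bool)   # True = position already chosen
    prev_input, state, chosen = start_input, init_state, []
    for _ in range(num_steps):
        _, probs, state = decoder_step(prev_input, state, encoder_states)
        probs = probs.masked_fill(mask, 0.0)                # forbid repeats
        idx = int(torch.argmax(probs))                      # C_i = argmax over remaining positions
        mask[idx] = True
        chosen.append(idx)
        prev_input = inputs[idx]                            # x_{C_i} becomes the next decoder input
    return chosen
```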


18.11 Connection to Attention

Pointer = Attention as Output

graph TB
    subgraph "Evolution"
        ATT["Attention (Bahdanau)<br/>Weights → Context vector"]
        PTR["Pointer Networks<br/>Weights → Output indices"]
        TRANS["Transformers<br/>Self-attention everywhere"]
    end
    
    ATT -->|"reuse weights"| PTR
    PTR -->|"influences"| TRANS
    
    K["Pointer shows attention<br/>can be the output itself"]
    
    PTR --> K
    
    style K fill:#ffe66d,color:#000

Modern Perspective

Modern models often combine:

  • Attention for context
  • Pointer mechanisms for selection
  • Copy mechanisms (similar idea)

18.12 Copy Mechanisms

Copy mechanisms (used in summarization, etc.) are similar:

graph TB
    subgraph "Copy Mechanism"
        GEN["Generate from vocabulary"]
        COPY["Copy from input"]
        GATE["Gate: choose<br/>generate or copy"]
        OUT["Output"]
    end
    
    GEN --> GATE
    COPY --> GATE
    GATE --> OUT
    
    K["Hybrid: can generate<br/>OR point to input"]
    
    GATE --> K
    
    style K fill:#4ecdc4,color:#fff

Pointer Networks are a pure version: only pointing, no generation.
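
A hedged sketch of the gate used in pointer-generator-style models (see the references): the final distribution mixes a vocabulary distribution with the pointer distribution, after mapping input positions onto their vocabulary ids. The `scatter_add` aggregation is a common implementation choice, not something defined in the Pointer Network paper.

```python
import torch

def mix_generate_and_copy(vocab_probs, copy_probs, input_token_ids, p_gen):
    """Final distribution = p_gen * P_vocab + (1 - p_gen) * P_copy projected onto the vocabulary.

    vocab_probs:     (V,) distribution over a fixed output vocabulary
    copy_probs:      (n,) pointer/attention distribution over input positions
    input_token_ids: (n,) vocabulary id of each input token
    p_gen:           scalar gate in [0, 1], usually predicted from the decoder state
    """
    copied = torch.zeros_like(vocab_probs).scatter_add(0, input_token_ids, copy_probs)
    return p_gen * vocab_probs + (1.0 - p_gen) * copied
```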


18.13 Limitations

What Pointer Networks Can’t Do

graph TB
    subgraph "Limitations"
        L1["Only outputs indices<br/>(cannot generate new content)"]
        L2["Input-dependent vocabulary<br/>(cannot handle unseen elements)"]
        L3["Combinatorial complexity<br/>(n! permutations for n elements)"]
    end
    
    S["For some tasks, need<br/>hybrid approaches"]
    
    L1 --> S
    L2 --> S
    L3 --> S

When to Use Alternatives

  • Need new content: Use standard seq2seq or copy mechanisms
  • Very large inputs: Attention over all positions is expensive
  • Structured outputs: May need specialized architectures

18.14 Modern Applications

Where Pointer Networks Appear

graph TB
    subgraph "Applications"
        A1["Code generation<br/>(pointing to variables)"]
        A2["Question answering<br/>(pointing to spans)"]
        A3["Entity linking<br/>(pointing to mentions)"]
        A4["Combinatorial optimization<br/>(TSP, scheduling)"]
    end
    
    M["Modern models often<br/>combine generation + pointing"]
    
    A1 --> M
    A2 --> M
    A3 --> M
    A4 --> M

In Transformers

Modern LLMs use similar mechanisms:

  • Span selection: Point to start/end answer positions (see the sketch after this list)
  • Copy attention: Attend to input positions
  • Retrieval: Point to relevant documents
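
As an illustration of the span-selection case, extractive question-answering heads amount to two pointers over token positions, one for the answer start and one for the end. A minimal sketch (not any particular model's API):

```python
import torch
import torch.nn as nn

class SpanPointerHead(nn.Module):
    """Two pointer distributions over token positions: answer start and answer end."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.start = nn.Linear(hidden_dim, 1)
        self.end = nn.Linear(hidden_dim, 1)

    def forward(self, token_states):                           # token_states: (seq_len, hidden_dim)
        start_logits = self.start(token_states).squeeze(-1)    # (seq_len,)
        end_logits = self.end(token_states).squeeze(-1)        # (seq_len,)
        return start_logits.softmax(-1), end_logits.softmax(-1)
```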

18.15 Connection to Other Chapters

graph TB
    CH18["Chapter 18<br/>Pointer Networks"]
    
    CH18 --> CH15["Chapter 15: NMT Attention<br/><i>Attention mechanism foundation</i>"]
    CH18 --> CH19["Chapter 19: Seq2Seq for Sets<br/><i>Order-invariant processing</i>"]
    CH18 --> CH20["Chapter 20: Neural Turing Machines<br/><i>External memory access</i>"]
    CH18 --> CH16["Chapter 16: Transformers<br/><i>Self-attention evolution</i>"]
    
    style CH18 fill:#ff6b6b,color:#fff

18.16 Key Equations Summary

Pointer Scores

\[u_j^i = v^T \tanh(W_1 e_j + W_2 d_i)\]

Pointer Distribution

\[P(C_i | C_1, ..., C_{i-1}, \mathcal{P}) = \text{softmax}(u^i)\]

Loss Function

\[\mathcal{L} = -\sum_{i=1}^{m} \log P(C_i^* | C_1^*, ..., C_{i-1}^*, \mathcal{P})\]

Inference

\[C_i = \arg\max_j P(C_i = j | C_1, ..., C_{i-1}, \mathcal{P})\]

18.17 Chapter Summary

graph TB
    subgraph "Key Takeaways"
        T1["Pointer Networks output<br/>indices into input sequence"]
        T2["No fixed vocabulary—<br/>works with variable inputs"]
        T3["Attention weights become<br/>output probabilities"]
        T4["Natural for combinatorial<br/>optimization problems"]
        T5["Generalizes to new<br/>input sizes"]
    end
    
    T1 --> C["Pointer Networks solve the<br/>variable-length output problem by<br/>using attention to point to input<br/>positions, enabling solutions to<br/>combinatorial problems like TSP<br/>and convex hull that standard<br/>seq2seq models cannot handle."]
    T2 --> C
    T3 --> C
    T4 --> C
    T5 --> C
    
    style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px

In One Sentence

Pointer Networks use attention mechanisms to output indices pointing to positions in the input sequence, enabling variable-length outputs and solving combinatorial optimization problems that require selecting or ordering input elements.


Exercises

  1. Conceptual: Explain why a standard seq2seq model with a fixed vocabulary cannot solve the convex hull problem, but a Pointer Network can.

  2. Implementation: Implement a simple Pointer Network for finding the maximum element in a sequence. Train it on sequences of varying lengths.

  3. Analysis: Compare the computational complexity of Pointer Networks vs standard seq2seq for an input of length n and output of length m.

  4. Extension: How would you modify a Pointer Network to handle the case where you want to output both indices AND generate new tokens? (Hint: look up “copy mechanisms”)


References & Further Reading

| Resource | Link |
|---|---|
| Original Paper (Vinyals et al., 2015) | arXiv:1506.03134 |
| Copy Mechanism Paper (Gu et al., 2016) | arXiv:1603.06393 |
| Pointer-Generator Networks (See et al., 2017) | arXiv:1704.04368 |
| Neural Combinatorial Optimization (Bello et al., 2016) | arXiv:1611.09940 |
| PyTorch Pointer Network | GitHub examples |

Next Chapter: Chapter 19: Order Matters: Sequence to Sequence for Sets — We explore how to handle unordered input sets while maintaining the ability to produce ordered outputs, addressing a fundamental challenge in set-to-sequence learning.




Educational content based on public research papers. All original papers are cited with links to their sources.