Chapter 18: Pointer Networks
“We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence.”
Based on: “Pointer Networks” (Oriol Vinyals, Meire Fortunato, Navdeep Jaitly, 2015)
| 📄 Original Paper: arXiv:1506.03134 | NeurIPS 2015 |
18.1 The Variable-Length Output Problem
Standard sequence-to-sequence models have a fixed output vocabulary. But many problems require outputs that are positions or elements from the input:
graph TB
subgraph "Standard Seq2Seq"
I1["Input: [A, B, C, D]"]
O1["Output: 'word1', 'word2'<br/>(from fixed vocabulary)"]
end
subgraph "Pointer Network"
I2["Input: [A, B, C, D]"]
O2["Output: 2, 0, 3<br/>(pointers to input positions)"]
end
I1 --> O1
I2 --> O2
K["Outputs are INDICES<br/>into the input sequence!"]
O2 --> K
style K fill:#ffe66d,color:#000
Example Problems
| Problem | Input | Output |
|---|---|---|
| Convex Hull | Points | Indices of hull vertices |
| Traveling Salesman | Cities | Order of visit (indices) |
| Sorting | Numbers | Sorted order (indices) |
| Finding Maximum | Array | Index of max element |
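As a concrete illustration of index-valued outputs, the sorting row above can be phrased with NumPy's `argsort`: the answer is not a new sequence of values but a permutation of input positions (a minimal sketch, not from the original paper):

```python
import numpy as np

x = np.array([3.1, 0.2, 2.7, 1.5])   # input sequence
order = np.argsort(x)                 # indices that sort the input
print(order)        # [1 3 2 0]  -> pointers into x
print(x[order])     # [0.2 1.5 2.7 3.1]
```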
18.2 Limitations of Standard Seq2Seq
The Vocabulary Problem
graph TB
subgraph "Standard Approach"
I["Input: 10 points<br/>in 2D space"]
V["Vocabulary: All possible<br/>coordinate pairs?<br/>→ INFINITE!"]
O["Output: Convex hull"]
end
I --> V --> O
P["Cannot enumerate all<br/>possible outputs!"]
V --> P
style P fill:#ff6b6b,color:#fff
Why This Fails
- Variable input size: Vocabulary would need to include all possible inputs
- Combinatorial explosion: For n inputs, there are n! possible orderings
- Generalization: Can’t generalize to new input sizes
18.3 The Pointer Network Solution
Core Idea
Instead of predicting words from a vocabulary, point to positions in the input:
graph TB
subgraph "Pointer Network"
X["Input sequence<br/>x₁, x₂, ..., x_n"]
E["Encoder<br/>(RNN/LSTM)"]
H["Hidden states<br/>h₁, h₂, ..., h_n"]
D["Decoder<br/>(RNN/LSTM)"]
S["Decoder state s_t"]
P["Pointer mechanism<br/>P(i | s_t, H)"]
O["Output: index i"]
end
X --> E --> H
H --> D --> S
S --> P
H --> P
P --> O
K["P(i) = probability of<br/>pointing to position i"]
P --> K
style K fill:#ffe66d,color:#000
The Pointer Mechanism
At each decoding step, compute attention over input positions:
\[u_j^i = v^T \tanh(W_1 e_j + W_2 d_i)\]
\[P(C_i \mid C_1, ..., C_{i-1}, \mathcal{P}) = \text{softmax}(u^i)\]
Where:
- $e_j$ = encoder hidden state for input position $j$
- $d_i$ = decoder hidden state at output step $i$
- $v$, $W_1$, $W_2$ = learnable parameters
- $\mathcal{P}$ = the input sequence
- $P(C_i)$ = probability distribution over input positions
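A minimal PyTorch sketch of this scoring rule (class and variable names mirror the equation but are otherwise illustrative; a shared hidden size for encoder and decoder states is assumed):

```python
import torch
import torch.nn as nn

class PointerScorer(nn.Module):
    """Computes u_j^i = v^T tanh(W1 e_j + W2 d_i) and softmaxes over positions j."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # applied to encoder states e_j
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # applied to decoder state d_i
        self.v = nn.Linear(hidden_dim, 1, bias=False)            # projects to a scalar score per position

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, n, hidden)   dec_state: (batch, hidden)
        scores = self.v(torch.tanh(self.W1(enc_states) + self.W2(dec_state).unsqueeze(1)))
        return torch.softmax(scores.squeeze(-1), dim=-1)          # (batch, n): P(C_i = j)
```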
18.4 Architecture Details
Encoder
Standard RNN/LSTM encoder:
graph LR
subgraph "Encoder"
X1["x₁"] --> H1["h₁"]
X2["x₂"] --> H2["h₂"]
X3["x₃"] --> H3["h₃"]
X4["x₄"] --> H4["h₄"]
end
H["H = {h₁, h₂, h₃, h₄}<br/>(all encoder states)"]
H1 --> H
H2 --> H
H3 --> H
H4 --> H
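A minimal sketch of such an encoder in PyTorch (embedding 2D input points into a hidden space before a single-layer LSTM; all sizes are illustrative assumptions):

```python
import torch.nn as nn

class PointerEncoder(nn.Module):
    def __init__(self, input_dim=2, hidden_dim=256):
        super().__init__()
        self.embed = nn.Linear(input_dim, hidden_dim)   # e.g. 2D coordinates -> hidden vector
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, x):                 # x: (batch, n, input_dim)
        h, (h_n, c_n) = self.lstm(self.embed(x))
        return h, (h_n, c_n)              # h: (batch, n, hidden) = H; final state seeds the decoder
```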
Decoder with Pointer
graph TB
subgraph "Decoder Step"
S_PREV["s_{i-1}<br/>(previous decoder state)"]
C_PREV["C_{i-1}<br/>(previous pointer)"]
RNN["Decoder RNN"]
S_NEW["s_i"]
ATT["Attention over H<br/>(encoder states)"]
P["Pointer distribution<br/>P(i) over input positions"]
C_NEW["C_i<br/>(selected index)"]
end
S_PREV --> RNN
C_PREV --> RNN
RNN --> S_NEW
S_NEW --> ATT
H_ALL["H (all encoder states)"] --> ATT
ATT --> P --> C_NEW
K["C_i = argmax P(i)"]
P --> K
style K fill:#ffe66d,color:#000
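A single decoding step from the diagram might be sketched as follows, reusing the hypothetical `PointerScorer` from Section 18.3 and feeding the previously selected input element back into the decoder:

```python
import torch
import torch.nn as nn

class PointerDecoderStep(nn.Module):
    def __init__(self, input_dim=2, hidden_dim=256):
        super().__init__()
        self.embed = nn.Linear(input_dim, hidden_dim)
        self.cell = nn.LSTMCell(hidden_dim, hidden_dim)
        self.pointer = PointerScorer(hidden_dim)          # sketch from Section 18.3

    def forward(self, prev_input, state, enc_states):
        # prev_input: input element selected at the previous step, (batch, input_dim)
        h, c = self.cell(self.embed(prev_input), state)   # update decoder state s_i
        probs = self.pointer(enc_states, h)               # P(i) over input positions
        index = probs.argmax(dim=-1)                      # C_i = argmax P(i)  (greedy choice)
        return index, probs, (h, c)
```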
18.5 Comparison: Attention vs Pointer
Standard Attention (Bahdanau)
graph TB
subgraph "Standard Attention"
Q["Query (decoder state)"]
K["Keys (encoder states)"]
V["Values (encoder states)"]
ATT["Attention weights"]
C["Context vector<br/>(weighted sum of values)"]
end
Q --> ATT
K --> ATT
V --> ATT
ATT --> C
O["Output: word from<br/>fixed vocabulary"]
C --> O
Pointer Network
graph TB
subgraph "Pointer Network"
Q["Query (decoder state)"]
K["Keys (encoder states)"]
ATT["Attention weights"]
P["Pointer distribution<br/>(over input positions)"]
end
Q --> ATT
K --> ATT
ATT --> P
O["Output: INDEX<br/>(position in input)"]
P --> O
style O fill:#4ecdc4,color:#fff
Key difference: Pointer networks use attention weights directly as output probabilities, not to create a context vector.
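In code, the difference is essentially one line: whether the attention weights are used to mix the values into a context vector or returned directly as the output distribution (a schematic sketch; `scores` stands for the unnormalized $u^i_j$ from Section 18.3):

```python
import torch

def standard_attention(scores, values):
    # scores: (batch, n), values: (batch, n, d)
    weights = torch.softmax(scores, dim=-1)
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)  # weighted sum of values
    return context          # later fed into a softmax over a FIXED vocabulary

def pointer_attention(scores):
    # scores: (batch, n)
    return torch.softmax(scores, dim=-1)   # the weights ARE the output distribution over positions
```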
18.6 Application: Convex Hull
Problem Setup
Given n points, find the convex hull (the smallest convex polygon containing all the points).
graph TB
subgraph "Convex Hull Problem"
P1["Input: Points<br/>P₁, P₂, ..., Pₙ"]
ENC["Encoder<br/>(processes points)"]
DEC["Decoder<br/>(with pointer)"]
HULL["Output: Indices<br/>of hull vertices<br/>e.g., [2, 5, 8, 3]"]
end
P1 --> ENC --> DEC --> HULL
K["Output length varies<br/>based on input!"]
HULL --> K
style K fill:#ffe66d,color:#000
Training
- Input: Sequence of 2D points
- Target: Sequence of indices forming convex hull
- Loss: Cross-entropy over pointer distribution
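Training pairs can be generated on the fly with a classical hull solver; a minimal sketch using `scipy.spatial.ConvexHull` (the starting vertex and orientation convention simply has to be kept consistent between data generation and training):

```python
import numpy as np
from scipy.spatial import ConvexHull

def make_example(n=10, seed=None):
    rng = np.random.default_rng(seed)
    points = rng.uniform(0.0, 1.0, size=(n, 2))   # input: n random 2D points
    hull = ConvexHull(points)
    target = hull.vertices.tolist()               # indices of hull vertices (counter-clockwise in 2D)
    return points, target

points, target = make_example(n=10, seed=0)
print(target)   # a variable-length list of indices into `points`
```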
18.7 Application: Traveling Salesman Problem
TSP as Sequence Learning
Given n cities, find the shortest tour that visits each city exactly once and returns to the start.
graph LR
subgraph "TSP"
C["Cities:<br/>C₁, C₂, C₃, C₄"]
P["Pointer Network"]
T["Tour:<br/>[2, 4, 1, 3]"]
end
C --> P --> T
K["Output is permutation<br/>of input indices"]
T --> K
style K fill:#ffe66d,color:#000
Results
In the original paper, Pointer Networks learned tours close to optimal for small instances (roughly n ≤ 20) and competitive with the approximate solvers used to produce the training data for up to 50 cities, all without any hand-coded search or optimization algorithm.
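Evaluating a predicted tour needs only the pointed-to order of cities; a small helper like the following (illustrative, not from the paper) computes the length of a closed tour from a predicted permutation:

```python
import numpy as np

def tour_length(cities, order):
    # cities: (n, 2) coordinates; order: permutation of range(n) produced by the pointer network
    route = cities[np.asarray(order)]
    diffs = np.diff(np.vstack([route, route[:1]]), axis=0)   # close the loop back to the start
    return float(np.linalg.norm(diffs, axis=1).sum())
```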
18.8 Application: Delaunay Triangulation
Problem
Given a set of points, find the Delaunay triangulation (a triangulation in which no point lies inside any triangle's circumcircle).
graph TB
subgraph "Delaunay Triangulation"
P["Points"]
ENC["Encoder"]
DEC["Decoder"]
T["Triangles<br/>(as index triplets)"]
end
P --> ENC --> DEC --> T
K["Output: sequences of<br/>3 indices per triangle"]
T --> K
In the original paper, Pointer Networks learned to produce largely valid triangulations for small point sets, though exact-match accuracy drops as the number of points grows.
18.9 Why Pointer Networks Work
Advantages
graph TB
subgraph "Advantages"
A1["Variable output length<br/>(depends on input)"]
A2["No fixed vocabulary<br/>(works with any input)"]
A3["Generalizes to new sizes<br/>(trained on n, works on m)"]
A4["Learns combinatorial structure<br/>(permutations, selections)"]
end
S["Solves problems that<br/>standard seq2seq cannot"]
A1 --> S
A2 --> S
A3 --> S
A4 --> S
style S fill:#4ecdc4,color:#fff
Comparison
| Aspect | Standard Seq2Seq | Pointer Network |
|---|---|---|
| Output vocabulary | Fixed, finite | Variable, input-dependent |
| Output length | Variable (ends on EOS token) | Variable, tied to input size |
| Generalization | To new words | To new input sizes |
| Combinatorial problems | Difficult | Natural fit |
18.10 Training Details
Loss Function
Standard cross-entropy over pointer distribution:
\[\mathcal{L} = -\sum_{i=1}^{m} \log P(C_i^* \mid C_1^*, ..., C_{i-1}^*, \mathcal{P})\]
Where $C_i^*$ is the target pointer at step $i$ and $m$ is the output length.
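In code, with teacher forcing this reduces to gathering the log-probability of each gold index; a minimal sketch (the tensor shapes are assumptions about how step-wise pointer distributions are stacked):

```python
import torch

def pointer_loss(pointer_probs, targets, eps=1e-9):
    # pointer_probs: (batch, m, n) P(C_i = j) per step; targets: (batch, m) gold indices C_i*
    log_p = torch.log(pointer_probs + eps)
    picked = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # log P(C_i* | ...)
    return -picked.sum(dim=-1).mean()                              # mean negative log-likelihood over the batch
```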
Inference
Greedy decoding: \(C_i = \arg\max_j P(C_i = j \mid C_1, ..., C_{i-1}, \mathcal{P})\)
Alternatively, beam search over pointer sequences often yields better solutions.
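Putting the pieces together, a greedy decoding loop over the hypothetical encoder and decoder-step modules sketched earlier might look like this (the stopping criterion, e.g. a special end-of-output position or a task-specific output length, is left as an assumption):

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder_step, x, max_steps):
    # x: (batch, n, input_dim)
    enc_states, state = encoder(x)
    state = (state[0].squeeze(0), state[1].squeeze(0))   # (batch, hidden) each
    prev = x[:, 0, :]          # conventional start element (an assumption; a learned start token also works)
    outputs = []
    for _ in range(max_steps):
        idx, probs, state = decoder_step(prev, state, enc_states)
        outputs.append(idx)
        prev = x[torch.arange(x.size(0)), idx]           # feed the selected input element back in
        # a real implementation would stop here on an end-of-output pointer
    return torch.stack(outputs, dim=1)                   # (batch, max_steps) predicted indices
```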
18.11 Connection to Attention
Pointer = Attention as Output
graph TB
subgraph "Evolution"
ATT["Attention (Bahdanau)<br/>Weights → Context vector"]
PTR["Pointer Networks<br/>Weights → Output indices"]
TRANS["Transformers<br/>Self-attention everywhere"]
end
ATT -->|"reuse weights"| PTR
PTR -->|"influences"| TRANS
K["Pointer shows attention<br/>can be the output itself"]
PTR --> K
style K fill:#ffe66d,color:#000
Modern Perspective
Modern models often combine:
- Attention for context
- Pointer mechanisms for selection
- Copy mechanisms (similar idea)
18.12 Copy Mechanisms
Related Idea
Copy mechanisms (used in summarization, etc.) are similar:
graph TB
subgraph "Copy Mechanism"
GEN["Generate from vocabulary"]
COPY["Copy from input"]
GATE["Gate: choose<br/>generate or copy"]
OUT["Output"]
end
GEN --> GATE
COPY --> GATE
GATE --> OUT
K["Hybrid: can generate<br/>OR point to input"]
GATE --> K
style K fill:#4ecdc4,color:#fff
Pointer Networks are a pure version: only pointing, no generation.
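A sketch of the gating step used in pointer-generator models (following the general form of the Pointer-Generator paper cited in Further Reading; names like `p_gen` are illustrative):

```python
import torch

def mix_generate_and_copy(vocab_probs, copy_probs, src_token_ids, p_gen):
    # vocab_probs: (batch, V) distribution over the fixed vocabulary
    # copy_probs:  (batch, n) pointer distribution over input positions
    # src_token_ids: (batch, n) vocabulary id of each input token
    # p_gen: (batch, 1) gate in [0, 1] choosing generate vs copy
    out = p_gen * vocab_probs
    out = out.scatter_add(-1, src_token_ids, (1.0 - p_gen) * copy_probs)  # add copy mass onto source tokens
    return out
```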
18.13 Limitations
What Pointer Networks Can’t Do
graph TB
subgraph "Limitations"
L1["Only outputs indices<br/>(cannot generate new content)"]
L2["Input-dependent vocabulary<br/>(cannot handle unseen elements)"]
L3["Combinatorial complexity<br/>(n! permutations for n elements)"]
end
S["For some tasks, need<br/>hybrid approaches"]
L1 --> S
L2 --> S
L3 --> S
When to Use Alternatives
- Need new content: Use standard seq2seq or copy mechanisms
- Very large inputs: Attention over all positions is expensive
- Structured outputs: May need specialized architectures
18.14 Modern Applications
Where Pointer Networks Appear
graph TB
subgraph "Applications"
A1["Code generation<br/>(pointing to variables)"]
A2["Question answering<br/>(pointing to spans)"]
A3["Entity linking<br/>(pointing to mentions)"]
A4["Combinatorial optimization<br/>(TSP, scheduling)"]
end
M["Modern models often<br/>combine generation + pointing"]
A1 --> M
A2 --> M
A3 --> M
A4 --> M
In Transformers
Modern LLMs use similar mechanisms:
- Span selection: Point to start/end positions
- Copy attention: Attend to input positions
- Retrieval: Point to relevant documents
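Extractive question-answering heads are a direct descendant of the span-selection idea above: two pointer distributions over input positions, one for the span start and one for the end (a schematic sketch over transformer token representations; sizes are assumptions):

```python
import torch
import torch.nn as nn

class SpanPointerHead(nn.Module):
    """Points to a start and an end position in the input, as in extractive QA."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.start = nn.Linear(hidden_dim, 1)
        self.end = nn.Linear(hidden_dim, 1)

    def forward(self, token_states):                     # (batch, n, hidden)
        start_probs = torch.softmax(self.start(token_states).squeeze(-1), dim=-1)
        end_probs = torch.softmax(self.end(token_states).squeeze(-1), dim=-1)
        return start_probs, end_probs                     # two pointer distributions over positions
```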
18.15 Connection to Other Chapters
graph TB
CH18["Chapter 18<br/>Pointer Networks"]
CH18 --> CH15["Chapter 15: NMT Attention<br/><i>Attention mechanism foundation</i>"]
CH18 --> CH19["Chapter 19: Seq2Seq for Sets<br/><i>Order-invariant processing</i>"]
CH18 --> CH20["Chapter 20: Neural Turing Machines<br/><i>External memory access</i>"]
CH18 --> CH16["Chapter 16: Transformers<br/><i>Self-attention evolution</i>"]
style CH18 fill:#ff6b6b,color:#fff
18.16 Key Equations Summary
Pointer Scores
\[u_j^i = v^T \tanh(W_1 e_j + W_2 d_i)\]
Pointer Distribution
\[P(C_i \mid C_1, ..., C_{i-1}, \mathcal{P}) = \text{softmax}(u^i)\]
Loss Function
\[\mathcal{L} = -\sum_{i=1}^{m} \log P(C_i^* \mid C_1^*, ..., C_{i-1}^*, \mathcal{P})\]
Inference
\[C_i = \arg\max_j P(C_i = j \mid C_1, ..., C_{i-1}, \mathcal{P})\]
18.17 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Pointer Networks output<br/>indices into input sequence"]
T2["No fixed vocabulary—<br/>works with variable inputs"]
T3["Attention weights become<br/>output probabilities"]
T4["Natural for combinatorial<br/>optimization problems"]
T5["Generalizes to new<br/>input sizes"]
end
T1 --> C["Pointer Networks solve the<br/>variable-length output problem by<br/>using attention to point to input<br/>positions, enabling solutions to<br/>combinatorial problems like TSP<br/>and convex hull that standard<br/>seq2seq models cannot handle."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Pointer Networks use attention mechanisms to output indices pointing to positions in the input sequence, enabling variable-length outputs and solving combinatorial optimization problems that require selecting or ordering input elements.
Exercises
- Conceptual: Explain why a standard seq2seq model with a fixed vocabulary cannot solve the convex hull problem, but a Pointer Network can.
- Implementation: Implement a simple Pointer Network for finding the maximum element in a sequence. Train it on sequences of varying lengths.
- Analysis: Compare the computational complexity of Pointer Networks vs standard seq2seq for an input of length n and output of length m.
- Extension: How would you modify a Pointer Network to handle the case where you want to output both indices AND generate new tokens? (Hint: look up “copy mechanisms”)
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Vinyals et al., 2015) | arXiv:1506.03134 |
| Copy Mechanism Paper | arXiv:1603.06393 |
| Pointer-Generator Networks | arXiv:1704.04368 |
| Neural Combinatorial Optimization | arXiv:1611.09940 |
| PyTorch Pointer Network | GitHub Examples |
Next Chapter: Chapter 19: Order Matters: Sequence to Sequence for Sets — We explore how to handle unordered input sets while maintaining the ability to produce ordered outputs, addressing a fundamental challenge in set-to-sequence learning.