Chapter 19: Order Matters: Sequence to Sequence for Sets
“We present a simple architecture that can handle sets as input, while producing sequences as output.”
Based on: “Order Matters: Sequence to Sequence for Sets” (Oriol Vinyals, Samy Bengio, Manjunath Kudlur, 2015)
| 📄 Original Paper: arXiv:1511.06391 | ICLR 2016 |
19.1 The Set-to-Sequence Problem
Many real-world problems involve:
- Input: A set (unordered collection)
- Output: A sequence (ordered list)
graph LR
subgraph "The Challenge"
S["Set Input<br/>{A, B, C}<br/>(order doesn't matter)"]
SEQ["Sequence Output<br/>[A, C, B]<br/>(order matters!)"]
end
Q["How to process unordered input<br/>to produce ordered output?"]
S --> Q
Q --> SEQ
style Q fill:#ffe66d,color:#000
Example Problems
| Problem | Input (Set) | Output (Sequence) |
|---|---|---|
| Sorting | {3, 1, 4, 2} | [1, 2, 3, 4] |
| Set Cover | {candidate subsets} | [chosen subsets, in order] |
| Permutation | {elements} | [permutation] |
| Set Operations | {A, B, C} | [A ∪ B, B ∩ C, …] |
19.2 Why Standard Seq2Seq Fails
The Order Sensitivity Problem
Standard RNNs are order-sensitive:
graph TB
subgraph "Standard RNN"
O1["Order 1: [A, B, C]"]
O2["Order 2: [B, A, C]"]
O3["Order 3: [C, A, B]"]
RNN["RNN processes<br/>sequentially"]
H1["h₁ ≠ h₂ ≠ h₃<br/>(different hidden states)"]
end
O1 --> RNN
O2 --> RNN
O3 --> RNN
RNN --> H1
P["Same set, different orders<br/>→ Different representations!"]
H1 --> P
style P fill:#ff6b6b,color:#fff
The Problem
For a set {A, B, C}, there are 3! = 6 possible orderings. A standard RNN produces a different representation for each ordering, even though the underlying set is the same.
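To make the order sensitivity concrete, the snippet below feeds the same three element vectors to an LSTM in two different orders and compares the final hidden states, then repeats the comparison with a simple sum. This is a minimal sketch (PyTorch, with illustrative dimensions), not code from the paper:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
elements = torch.randn(3, 4)               # the "set" {A, B, C} as three vectors

order1 = elements[[0, 1, 2]].unsqueeze(0)  # presented as [A, B, C]
order2 = elements[[1, 0, 2]].unsqueeze(0)  # presented as [B, A, C]

_, (h1, _) = lstm(order1)
_, (h2, _) = lstm(order2)
print(torch.allclose(h1, h2))   # False: the RNN state depends on input order

# A symmetric reduction such as a sum ignores the presentation order
print(torch.allclose(order1.sum(dim=1), order2.sum(dim=1)))   # True
```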
19.3 The Read-Process-Write Architecture
High-Level Design
graph TB
subgraph "Read-Process-Write"
READ["READ<br/>Encode set elements<br/>(order-invariant)"]
PROCESS["PROCESS<br/>Reason about set<br/>(maintains invariance)"]
WRITE["WRITE<br/>Generate sequence<br/>(order matters)"]
end
READ --> PROCESS --> WRITE
K["Key: READ and PROCESS<br/>are order-invariant<br/>WRITE produces order"]
PROCESS --> K
style K fill:#4ecdc4,color:#fff
The Three Stages
- READ: Encode each set element independently
- PROCESS: Aggregate and reason about the set
- WRITE: Generate output sequence (order matters)
19.4 The Read Stage
Order-Invariant Encoding
Process each element independently:
graph TB
subgraph "Read Stage"
S["Set: {A, B, C}"]
E1["Encoder(A)"]
E2["Encoder(B)"]
E3["Encoder(C)"]
H["H = {h_A, h_B, h_C}<br/>(order-invariant set)"]
end
S --> E1
S --> E2
S --> E3
E1 --> H
E2 --> H
E3 --> H
K["Each element encoded<br/>independently → order doesn't matter"]
H --> K
style K fill:#ffe66d,color:#000
Implementation
```python
def read_stage(input_set, encoder):
    """READ: encode each element independently with one shared encoder."""
    encodings = []
    for element in input_set:
        encoding = encoder(element)  # same encoder weights for every element
        encodings.append(encoding)
    # Returned as a list for convenience, but nothing downstream may depend
    # on this order -- the collection is treated as an unordered set.
    return encodings
```
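As a possible usage, the shared encoder can be any module applied element-wise; a small MLP is enough for toy inputs (the module and dimensions below are illustrative assumptions, not the paper's setup):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 8))
numbers = [torch.tensor([3.0]), torch.tensor([1.0]), torch.tensor([4.0])]
H = torch.stack(read_stage(numbers, encoder))   # (3, 8): one row per element
```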
19.5 The Process Stage
Set Aggregation
Aggregate the encoded elements in an order-invariant way:
graph TB
subgraph "Process Stage"
H["H = {h_A, h_B, h_C}"]
subgraph "Options"
SUM["Sum: Σ h_i"]
MEAN["Mean: (1/n)Σ h_i"]
MAX["Max: max(h_i)"]
ATT["Attention: Σ α_i h_i"]
end
R["Representation R<br/>(order-invariant)"]
end
H --> SUM --> R
H --> MEAN --> R
H --> MAX --> R
H --> ATT --> R
K["All operations are<br/>symmetric (order-invariant)"]
R --> K
style K fill:#ffe66d,color:#000
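Note that in the original paper the Process block is itself an LSTM that repeatedly attends over the memory of element encodings for a fixed number of steps; the simple poolings above are the most common lightweight alternatives. A sketch of those poolings over a tensor H of shape (n, d):

```python
import torch

def aggregate(H, mode="sum"):
    """PROCESS: reduce set encodings H (shape n x d) to a single vector.

    All three reductions are symmetric: permuting the rows of H
    leaves the result unchanged.
    """
    if mode == "sum":
        return H.sum(dim=0)
    if mode == "mean":
        return H.mean(dim=0)
    if mode == "max":
        return H.max(dim=0).values
    raise ValueError(f"unknown mode: {mode}")
```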
Attention-Based Processing
Use attention to create a context:
\[c = \sum_{i=1}^{n} \alpha_i h_i\]
where the attention weights $\alpha_i$ can depend on:
- The current decoder state
- The set elements themselves
- A learned query
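With a learned query (or the current decoder state) as $q$, one simple realization uses dot-product scores; the function name and dimensions below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def attention_pool(H, q):
    """c = sum_i alpha_i * h_i with alpha = softmax over set elements.

    H: (n, d) element encodings; q: (d,) query vector.
    Permuting the rows of H permutes the weights identically,
    so the context c is order-invariant.
    """
    scores = H @ q                  # (n,) dot-product scores f(q, h_i)
    alpha = F.softmax(scores, dim=0)
    return alpha @ H                # (d,) weighted sum of encodings
```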
19.6 The Write Stage
Generating Ordered Output
The write stage uses a decoder that produces sequences:
graph TB
subgraph "Write Stage"
R["Set Representation R<br/>(from Process)"]
D["Decoder RNN"]
Y1["y₁"]
Y2["y₂"]
Y3["y₃"]
end
R --> D --> Y1 --> D --> Y2 --> D --> Y3
K["Decoder maintains state<br/>→ Order matters in output"]
Y3 --> K
style K fill:#4ecdc4,color:#fff
The Decoder
Standard RNN decoder that:
- Takes set representation as initial context
- Can attend to set elements during generation
- Produces ordered sequence
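In the paper itself the Write block is a pointer-style LSTM that selects elements of the input set (see Chapter 18); the sketch below instead implements the generic attention decoder described above, with hypothetical names and dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WriteStage(nn.Module):
    """WRITE: one step of an autoregressive decoder that attends to the set."""

    def __init__(self, d, vocab_size):
        super().__init__()
        self.cell = nn.GRUCell(input_size=2 * d, hidden_size=d)
        self.out = nn.Linear(d, vocab_size)

    def step(self, y_prev_emb, s, H):
        # y_prev_emb: (d,) embedding of the previous output token
        # s: (d,) decoder state; H: (n, d) set encodings
        alpha = F.softmax(H @ s, dim=0)      # attend to set elements
        c = alpha @ H                        # context c_t
        x = torch.cat([y_prev_emb, c]).unsqueeze(0)
        s = self.cell(x, s.unsqueeze(0))[0]  # order-dependent recurrent state
        return self.out(s), s                # logits for y_t, new state
```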
19.7 Complete Architecture
Full Pipeline
graph TB
subgraph "Set2Seq Architecture"
INPUT["Input Set<br/>{x₁, x₂, ..., x_n}"]
READ["READ<br/>h_i = Encoder(x_i)"]
H["H = {h₁, h₂, ..., h_n}"]
PROCESS["PROCESS<br/>R = Aggregate(H)"]
WRITE["WRITE<br/>Decoder(R) → [y₁, y₂, ..., y_m]"]
OUTPUT["Output Sequence"]
end
INPUT --> READ --> H --> PROCESS --> WRITE --> OUTPUT
K["Order-invariant input<br/>→ Order-dependent output"]
OUTPUT --> K
style K fill:#ffe66d,color:#000
Mathematical Formulation
Read: \(h_i = \text{Encoder}(x_i), \quad i = 1, ..., n\)
Process: \(R = \text{Aggregate}(\{h_1, ..., h_n\})\)
Write: \(y_t = \text{Decoder}(y_{<t}, R, H)\)
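Gluing the three equations together, one forward pass (greedy decoding for a fixed output length m) could look like the sketch below; `embed`, `start_emb`, and the helper functions from the earlier sketches are illustrative assumptions:

```python
import torch

def set2seq_forward(x_set, encoder, write_stage, embed, start_emb, m):
    """Read -> Process -> Write for one input set, greedy decoding."""
    H = torch.stack(read_stage(x_set, encoder))   # READ: (n, d) encodings
    s = aggregate(H, mode="sum")                  # PROCESS: summary R as initial state
    y_prev, outputs = start_emb, []
    for _ in range(m):                            # WRITE: autoregressive loop
        logits, s = write_stage.step(y_prev, s, H)
        y_t = int(logits.argmax())
        outputs.append(y_t)
        y_prev = embed(torch.tensor(y_t))         # feed back the chosen token
    return outputs
```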
19.8 Application: Sorting
Learning to Sort
graph TB
subgraph "Sorting Task"
INPUT["Input: {3, 1, 4, 2}"]
READ["Read: Encode each number"]
PROCESS["Process: Understand set"]
WRITE["Write: Generate sorted order"]
OUTPUT["Output: [1, 2, 3, 4]"]
end
INPUT --> READ --> PROCESS --> WRITE --> OUTPUT
K["Learns sorting algorithm<br/>from examples!"]
OUTPUT --> K
style K fill:#4ecdc4,color:#fff
Results
The model learns to sort small sets of numbers purely from input-output examples, without being given an explicit sorting algorithm.
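Training pairs for this task are trivial to generate: present the numbers in an arbitrary order and use their sorted order as the target. A minimal sketch (the helper name is illustrative):

```python
import random

def make_sorting_example(n=5, low=0, high=99):
    """Return (input_set, target_sequence) for the sorting task."""
    values = [random.randint(low, high) for _ in range(n)]  # unordered input
    return values, sorted(values)                           # target sequence

# e.g. input [41, 3, 87, 12, 56]  ->  target [3, 12, 41, 56, 87]
```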
19.9 Application: Set Operations
Learning Set Operations
graph TB
subgraph "Set Operations"
INPUT["Input: {A, B, C}"]
PROCESS["Process: Understand set"]
WRITE["Write: Generate operations"]
OUTPUT["Output: [A∪B, B∩C, ...]"]
end
INPUT --> PROCESS --> WRITE --> OUTPUT
K["Can learn to perform<br/>set operations"]
OUTPUT --> K
19.10 Why Order Matters in Output
The Write Order Effect
Even with order-invariant input processing, the order of writing matters:
graph TB
subgraph "Write Order"
R["Same representation R"]
W1["Write order 1<br/>→ [A, B, C]"]
W2["Write order 2<br/>→ [B, A, C]"]
W3["Write order 3<br/>→ [C, A, B]"]
end
R --> W1
R --> W2
R --> W3
K["Decoder state evolves<br/>→ Different outputs possible"]
W1 --> K
style K fill:#ffe66d,color:#000
Autoregressive Generation
The decoder is autoregressive: each output depends on previous outputs, creating order.
19.11 Comparison with Standard Seq2Seq
Key Differences
graph TB
subgraph "Standard Seq2Seq"
S1["Sequential input<br/>[x₁, x₂, x₃]"]
E1["Order-sensitive encoder"]
D1["Decoder"]
O1["Output sequence"]
end
subgraph "Set2Seq"
S2["Set input<br/>{x₁, x₂, x₃}"]
E2["Order-invariant encoder"]
P2["Order-invariant processor"]
D2["Order-sensitive decoder"]
O2["Output sequence"]
end
S1 --> E1 --> D1 --> O1
S2 --> E2 --> P2 --> D2 --> O2
K["Set2Seq: Input order doesn't matter<br/>Output order does matter"]
D2 --> K
style K fill:#4ecdc4,color:#fff
When to Use Each
| Scenario | Architecture |
|---|---|
| Input has natural order | Standard Seq2Seq |
| Input is a set | Set2Seq |
| Both input and output are sets | Set-to-set models |
| Need order-invariant processing | Set2Seq |
19.12 Attention in Set2Seq
Attending to Set Elements
During writing, the decoder can attend to set elements:
graph TB
subgraph "Attention During Write"
H["H = {h₁, h₂, h₃}<br/>(set encodings)"]
S["s_t<br/>(decoder state)"]
ATT["Attention<br/>α_i = f(s_t, h_i)"]
C["c_t = Σ α_i h_i"]
Y["y_t"]
end
H --> ATT
S --> ATT
ATT --> C
S --> Y
C --> Y
K["Decoder can focus on<br/>different set elements<br/>at different steps"]
ATT --> K
style K fill:#ffe66d,color:#000
This allows the decoder to “look back” at the set while generating.
19.13 Training and Inference
Training
- Input: Set (can be presented in any order)
- Target: Sequence (specific order)
- Loss: Standard sequence loss (cross-entropy)
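Concretely, each time an example is presented the input set can be re-shuffled (its order should not matter), and the loss is ordinary teacher-forced cross-entropy over the target sequence. A sketch reusing the hypothetical helpers from the earlier sections:

```python
import random

import torch
import torch.nn.functional as F

def sequence_loss(x_set, target_ids, encoder, write_stage, embed, start_emb):
    """Teacher-forced cross-entropy for one (set, target sequence) pair."""
    x_set = list(x_set)
    random.shuffle(x_set)                       # input order must not matter
    H = torch.stack(read_stage(x_set, encoder))
    s = aggregate(H, mode="sum")
    y_prev, loss = start_emb, 0.0
    for target in target_ids:
        logits, s = write_stage.step(y_prev, s, H)
        loss = loss + F.cross_entropy(logits.unsqueeze(0), torch.tensor([target]))
        y_prev = embed(torch.tensor(target))    # teacher forcing
    return loss / len(target_ids)
```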
Inference
graph TB
subgraph "Inference"
S["Input set<br/>(any order)"]
READ["Read stage"]
PROCESS["Process stage"]
WRITE["Write stage<br/>(greedy or beam search)"]
O["Output sequence"]
end
S --> READ --> PROCESS --> WRITE --> O
K["Output order determined by<br/>decoder, not input order"]
WRITE --> K
style K fill:#ffe66d,color:#000
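A useful sanity check at inference time is that shuffling the input set leaves the generated sequence unchanged (up to floating-point ties), since only the Write stage is order-sensitive. Using the hypothetical helpers sketched in earlier sections:

```python
import random
import torch

x = [torch.tensor([v]) for v in (3.0, 1.0, 4.0, 2.0)]
out1 = set2seq_forward(x, encoder, write_stage, embed, start_emb, m=4)
random.shuffle(x)
out2 = set2seq_forward(x, encoder, write_stage, embed, start_emb, m=4)
print(out1 == out2)   # expected True: Read and Process are order-invariant
```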
19.14 Connection to Other Chapters
graph TB
CH19["Chapter 19<br/>Seq2Seq for Sets"]
CH19 --> CH18["Chapter 18: Pointer Networks<br/><i>Also handles variable outputs</i>"]
CH19 --> CH15["Chapter 15: NMT Attention<br/><i>Attention mechanism</i>"]
CH19 --> CH21["Chapter 21: Message Passing<br/><i>Set processing in graphs</i>"]
CH19 --> CH22["Chapter 22: Relational Reasoning<br/><i>Pairwise set operations</i>"]
style CH19 fill:#ff6b6b,color:#fff
19.15 Key Equations Summary
Read Stage
\[h_i = \text{Encoder}(x_i), \quad \forall x_i \in S\]
Process Stage
\[R = \text{Aggregate}(\{h_1, ..., h_n\})\]
Common aggregations:
- Sum: $R = \sum_i h_i$
- Mean: $R = \frac{1}{n}\sum_i h_i$
- Attention: $R = \sum_i \alpha_i h_i$
Write Stage
\[y_t = \text{Decoder}(y_{<t}, R, \{h_1, ..., h_n\})\]
Attention During Write
\[\alpha_{it} = \text{softmax}_i(f(s_t, h_i)), \qquad c_t = \sum_i \alpha_{it} h_i\]
19.16 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Read-Process-Write architecture<br/>for set-to-sequence"]
T2["Read and Process stages<br/>are order-invariant"]
T3["Write stage produces<br/>ordered output"]
T4["Attention allows decoder<br/>to focus on set elements"]
T5["Learns to perform operations<br/>like sorting from examples"]
end
T1 --> C["Set2Seq architectures solve the<br/>fundamental challenge of processing<br/>unordered inputs to produce ordered<br/>outputs through order-invariant<br/>encoding and aggregation, followed<br/>by an order-sensitive decoder."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
Set2Seq architectures use a Read-Process-Write design in which order-invariant encoding and aggregation of set elements feed an order-sensitive decoder that generates sequences, allowing models to learn operations like sorting directly from examples.
Exercises
- Conceptual: Explain why standard RNNs are order-sensitive for inputs, and how Set2Seq solves this problem.
- Implementation: Implement a simple Set2Seq model for learning to sort numbers. Compare performance when input is presented in different orders.
- Analysis: Compare the computational complexity of Set2Seq vs standard Seq2Seq. When does each have advantages?
- Extension: How would you modify Set2Seq to handle both input and output as sets (set-to-set mapping)?
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Vinyals et al., 2015) | arXiv:1511.06391 |
| Deep Sets Paper | arXiv:1703.06114 |
| Set Transformer Paper | arXiv:1810.00825 |
| Neural Sort Paper | arXiv:1803.08840 |
Next Chapter: Chapter 20: Neural Turing Machines — We explore networks with external, differentiable memory that can be read from and written to, enabling learning of algorithms from examples.