Chapter 15: Neural Machine Translation by Jointly Learning to Align and Translate
“We introduce an attention mechanism that allows the model to automatically search for parts of the source sentence that are relevant to predicting a target word.”
Based on: “Neural Machine Translation by Jointly Learning to Align and Translate” (Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, 2014)
| 📄 Original Paper: arXiv:1409.0473 | ICLR 2015 |
15.1 The Bottleneck Problem
Before attention, neural machine translation used a simple encoder-decoder architecture:
graph LR
subgraph "Encoder-Decoder (Pre-Attention)"
E["Encoder RNN<br/>'I am happy'"]
B["Bottleneck<br/>Single vector"]
D["Decoder RNN<br/>'Je suis heureux'"]
end
E --> B --> D
P["Problem: All information<br/>compressed into one vector!"]
B --> P
style P fill:#ff6b6b,color:#fff
This bottleneck limited performance, especially for long sentences.
15.2 The Encoder-Decoder Architecture
How It Worked (Without Attention)
graph TB
subgraph "Encoder"
X1["x₁ 'I'"] --> H1["h₁"]
X2["x₂ 'am'"] --> H2["h₂"]
X3["x₃ 'happy'"] --> H3["h₃"]
end
H1 --> C["Context vector c<br/>= h₃ (last hidden state)"]
H2 --> C
H3 --> C
subgraph "Decoder"
C --> S1["s₁"]
S1 --> Y1["'Je'"]
S1 --> S2["s₂"]
S2 --> Y2["'suis'"]
S2 --> S3["s₃"]
S3 --> Y3["'heureux'"]
end
K["All source information<br/>must fit in c!"]
C --> K
style K fill:#ff6b6b,color:#fff
The Limitation
For a long sentence:
- The encoder must compress the entire source sentence into one fixed-length vector
- The decoder must reconstruct every target word from that single vector
- Information loss is inevitable (a minimal code sketch of this design follows below)
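Here is that sketch: a bare-bones pre-attention encoder-decoder in PyTorch. The GRU cells and all sizes are assumptions for illustration, not any particular system's setup; the point is that the decoder sees the source only through the single vector c.

import torch
import torch.nn as nn

hidden, vocab = 256, 10_000          # hypothetical sizes
encoder = nn.GRU(hidden, hidden)     # consumes embedded source tokens
decoder = nn.GRU(hidden, hidden)     # consumes embedded target tokens
project = nn.Linear(hidden, vocab)

src = torch.randn(12, 1, hidden)     # 12 embedded source words, batch of 1
_, c = encoder(src)                  # c: [1, 1, hidden] -- the whole sentence in one vector

tgt = torch.randn(9, 1, hidden)      # 9 embedded target words (teacher forcing)
out, _ = decoder(tgt, c)             # the decoder is initialized with c and nothing else
logits = project(out)                # [9, 1, vocab] scores over the target vocabulary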
15.3 The Attention Solution
Key Insight
Instead of a single context vector, use a different context vector for each decoding step:
graph TB
subgraph "With Attention"
E1["h₁ 'I'"]
E2["h₂ 'am'"]
E3["h₃ 'happy'"]
ATT["Attention Mechanism"]
C1["c₁ (for 'Je')"]
C2["c₂ (for 'suis')"]
C3["c₃ (for 'heureux')"]
end
E1 --> ATT
E2 --> ATT
E3 --> ATT
ATT --> C1
ATT --> C2
ATT --> C3
K["Each target word gets<br/>its own context vector!"]
ATT --> K
style K fill:#4ecdc4,color:#fff
The Attention Mechanism
At each decoding step, compute a weighted sum of all encoder hidden states:
\[c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j\]
where $\alpha_{ij}$ is the attention weight: how much to focus on source word $j$ when generating target word $i$.
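In code, the context vector is simply a weighted average of the encoder states. A minimal sketch with assumed shapes ($T_x$ source positions, bidirectional states of width 2 × hidden):

import torch

T_x, hidden = 5, 256
h = torch.randn(T_x, 2 * hidden)                 # encoder states h_1 .. h_Tx
alpha = torch.softmax(torch.randn(T_x), dim=0)   # attention weights for decoding step i

c_i = (alpha.unsqueeze(-1) * h).sum(dim=0)       # c_i = sum_j alpha_ij * h_j -> [2 * hidden]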
15.4 Computing Attention Weights
The Alignment Model
Attention weights are computed using an alignment model:
\[e_{ij} = a(s_{i-1}, h_j)\]
where:
- $s_{i-1}$ is the previous decoder hidden state
- $h_j$ is the $j$-th encoder hidden state
- $a$ is an alignment function (learned neural network)
graph LR
subgraph "Alignment Model"
S["s_{i-1}<br/>(decoder state)"]
H["h_j<br/>(encoder state)"]
A["Alignment function a(·,·)"]
E["e_{ij}<br/>(alignment score)"]
end
S --> A
H --> A
A --> E
K["Measures how well<br/>s_{i-1} and h_j match"]
E --> K
style K fill:#ffe66d,color:#000
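A sketch of the alignment model as a small feed-forward network. The paper's additive form is usually written $v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$; the layer sizes below are assumptions, not the paper's exact hyperparameters.

import torch
import torch.nn as nn

hidden = 256
W_a = nn.Linear(hidden, hidden, bias=False)       # acts on the decoder state s_{i-1}
U_a = nn.Linear(2 * hidden, hidden, bias=False)   # acts on a (bidirectional) encoder state h_j
v_a = nn.Linear(hidden, 1, bias=False)

def alignment_score(s_prev, h_j):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)."""
    return v_a(torch.tanh(W_a(s_prev) + U_a(h_j)))

s_prev = torch.randn(1, hidden)
h_j = torch.randn(1, 2 * hidden)
e_ij = alignment_score(s_prev, h_j)               # one scalar score per (i, j) pair, shape [1, 1]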
Softmax Normalization
Convert alignment scores to probabilities:
\[\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}\]
graph TB
subgraph "Attention Weight Computation"
E1["e_{i1} = 0.5"]
E2["e_{i2} = 2.0"]
E3["e_{i3} = 1.0"]
SOFT["Softmax"]
A1["α_{i1} = 0.12"]
A2["α_{i2} = 0.66"]
A3["α_{i3} = 0.22"]
end
E1 --> SOFT
E2 --> SOFT
E3 --> SOFT
SOFT --> A1
SOFT --> A2
SOFT --> A3
K["Weights sum to 1<br/>Focus on h₂ most"]
A2 --> K
style K fill:#ffe66d,color:#000
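The numbers in the diagram can be reproduced directly; this is a throwaway check, not code from the paper:

import torch

e_i = torch.tensor([0.5, 2.0, 1.0])     # alignment scores e_i1, e_i2, e_i3
alpha_i = torch.softmax(e_i, dim=0)
print(alpha_i)                          # tensor([0.1402, 0.6285, 0.2312]) -- sums to 1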
15.5 The Complete Architecture
Encoder: Bidirectional RNN
graph TB
subgraph "Bidirectional Encoder"
X1["x₁"] --> F1["→ h₁"]
X2["x₂"] --> F2["→ h₂"]
X3["x₃"] --> F3["→ h₃"]
X1 --> B1["← h₁"]
X2 --> B2["← h₂"]
X3 --> B3["← h₃"]
F1 --> H1["h₁ = [→h₁; ←h₁]"]
B1 --> H1
F2 --> H2["h₂ = [→h₂; ←h₂]"]
B2 --> H2
F3 --> H3["h₃ = [→h₃; ←h₃]"]
B3 --> H3
end
K["Each h_j contains<br/>context from both directions"]
H2 --> K
style K fill:#4ecdc4,color:#fff
Why bidirectional? Each encoder hidden state should contain information about the entire sentence, not just what came before.
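In PyTorch the bidirectional encoder is a single module; the concatenation h_j = [→h_j; ←h_j] is handled by the bidirectional=True flag. Sizes here are assumptions for illustration.

import torch
import torch.nn as nn

embed, hidden = 128, 256
encoder = nn.GRU(embed, hidden, bidirectional=True)   # one left-to-right and one right-to-left RNN

x = torch.randn(3, 1, embed)      # "I am happy": 3 embedded words, batch of 1
h_all, h_last = encoder(x)
print(h_all.shape)                # [3, 1, 512]: each h_j is [forward h_j ; backward h_j]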
Decoder with Attention
graph TB
subgraph "Decoder Step i"
S_PREV["s_{i-1}<br/>(previous decoder state)"]
Y_PREV["y_{i-1}<br/>(previous output)"]
H_ALL["{h₁, h₂, ..., h_T}<br/>(all encoder states)"]
ATT["Attention Mechanism<br/>c_i = Σ α_{ij} h_j"]
CONCAT["[s_{i-1}, c_i]"]
RNN["Decoder RNN"]
OUT["y_i<br/>(output word)"]
end
S_PREV --> ATT
H_ALL --> ATT
ATT --> CONCAT
S_PREV --> CONCAT
Y_PREV --> RNN
CONCAT --> RNN
RNN --> OUT
K["Context c_i guides<br/>word generation"]
ATT --> K
style K fill:#ffe66d,color:#000
15.6 Visualizing Attention
The Alignment Matrix
Attention weights form an alignment matrix:
graph TB
subgraph "Alignment Matrix"
direction LR
E1["'I'"]
E2["'am'"]
E3["'happy'"]
D1["'Je'"] -->|"0.8"| E1
D1 -->|"0.1"| E2
D1 -->|"0.1"| E3
D2["'suis'"] -->|"0.1"| E1
D2 -->|"0.7"| E2
D2 -->|"0.2"| E3
D3["'heureux'"] -->|"0.05"| E1
D3 -->|"0.1"| E2
D3 -->|"0.85"| E3
end
K["Visualization shows<br/>which source words<br/>each target word attends to"]
D3 --> K
style K fill:#ffe66d,color:#000
Example Visualization
Source: "I am happy"
Target: "Je suis heureux"
Attention weights:
|  | I | am | happy |
|---|---|---|---|
| Je | 0.80 | 0.10 | 0.10 |
| suis | 0.10 | 0.70 | 0.20 |
| heureux | 0.05 | 0.10 | 0.85 |
The model learns to align “Je” with “I”, “suis” with “am”, and “heureux” with “happy”!
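Plotting the matrix as a heatmap is the usual way to inspect what the model has learned. A sketch using matplotlib with the toy weights above (a generic recipe, not the paper's plotting code):

import matplotlib.pyplot as plt
import numpy as np

weights = np.array([[0.80, 0.10, 0.10],    # Je
                    [0.10, 0.70, 0.20],    # suis
                    [0.05, 0.10, 0.85]])   # heureux
source = ["I", "am", "happy"]
target = ["Je", "suis", "heureux"]

fig, ax = plt.subplots()
ax.imshow(weights, cmap="Greys")
ax.set_xticks(range(len(source)))
ax.set_xticklabels(source)
ax.set_yticks(range(len(target)))
ax.set_yticklabels(target)
ax.set_xlabel("source")
ax.set_ylabel("target")
plt.show()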
15.7 Why Attention Works
Benefits
graph TB
subgraph "Benefits of Attention"
B1["No bottleneck<br/>All encoder states accessible"]
B2["Automatic alignment<br/>Learns word correspondences"]
B3["Handles long sentences<br/>No information compression"]
B4["Interpretable<br/>Attention weights show alignment"]
end
R["Better translation quality"]
B1 --> R
B2 --> R
B3 --> R
B4 --> R
style R fill:#4ecdc4,color:#fff
Comparison
| Aspect | Without Attention | With Attention |
|---|---|---|
| Context | Single vector c | Different c_i per step |
| Long sentences | Poor (bottleneck) | Good (no compression) |
| Alignment | Implicit | Explicit (learned) |
| Interpretability | Black box | Visualizable |
15.8 Experimental Results
WMT’14 English-French
xychart-beta
title "BLEU Scores on WMT'14 En-Fr"
x-axis ["Baseline RNN", "RNNsearch-30", "RNNsearch-50"]
y-axis "BLEU Score" 0 --> 35
bar [28.5, 31.5, 34.6]
RNNsearch is the paper's attention model; RNNencdec is the attention-free encoder-decoder baseline. The -30/-50 suffix is the maximum sentence length used during training.
Key Findings
- Long sentences: the attention model significantly outperforms the baseline, and the gap widens with sentence length
- Alignment quality: the learned attention weights correspond closely to intuitive word alignments
- Robust to length: RNNsearch-50 shows almost no degradation as sentences grow longer, unlike the baseline
15.9 The Alignment Function
Implementation Options
The alignment function $a(s_{i-1}, h_j)$ can be implemented in several ways (sketched in code below):
Option 1: Concatenation + MLP (additive; the form used in this paper) \(a(s_{i-1}, h_j) = v^T \tanh(W[s_{i-1}; h_j])\)
Option 2: Dot product (Luong et al., 2015) \(a(s_{i-1}, h_j) = s_{i-1}^T h_j\)
Option 3: General / bilinear (Luong et al., 2015) \(a(s_{i-1}, h_j) = s_{i-1}^T W h_j\)
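A side-by-side sketch of the three scoring functions, assuming for simplicity that the decoder and encoder states have the same size (the dot product requires this); names and sizes are ours, for illustration only:

import torch
import torch.nn as nn

hidden = 256
W_cat = nn.Linear(2 * hidden, hidden)            # Option 1: concat + MLP (additive)
v = nn.Linear(hidden, 1, bias=False)
W_gen = nn.Linear(hidden, hidden, bias=False)    # Option 3: general / bilinear

def score_additive(s, h):
    return v(torch.tanh(W_cat(torch.cat([s, h], dim=-1))))

def score_dot(s, h):
    return (s * h).sum(dim=-1, keepdim=True)

def score_general(s, h):
    return (s * W_gen(h)).sum(dim=-1, keepdim=True)

s, h = torch.randn(1, hidden), torch.randn(1, hidden)
print(score_additive(s, h), score_dot(s, h), score_general(s, h))   # three scalar e_ij scores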
graph TB
subgraph "Alignment Function"
S["s_{i-1}"]
H["h_j"]
A["a(s, h)"]
E["e_{ij}"]
end
S --> A
H --> A
A --> E
K["Learned to measure<br/>relevance/compatibility"]
A --> K
style K fill:#ffe66d,color:#000
15.10 Connection to Modern Attention
The Foundation
This paper laid the foundation for:
graph TB
subgraph "Evolution"
BAH["Bahdanau Attention<br/>(This paper, 2014)"]
LUONG["Luong Attention<br/>(2015)"]
TRANS["Transformer Attention<br/>(2017)"]
end
BAH --> LUONG --> TRANS
K["All use the same core idea:<br/>weighted combination of<br/>source representations"]
TRANS --> K
style K fill:#ffe66d,color:#000
Differences
| Aspect | Bahdanau (This) | Luong | Transformer |
|---|---|---|---|
| Query | Previous decoder state $s_{i-1}$ | Current decoder state $s_i$ | Linear projection of each token representation |
| Keys | Encoder states | Encoder states | Linear projections of all tokens (self- and cross-attention) |
| Computation | Additive (concat + tanh) | Multiplicative (dot / general) | Scaled dot-product |
15.11 Implementation Details
PyTorch Implementation Sketch
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """One decoding step with Bahdanau (additive) attention."""

    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Alignment model a(s_{i-1}, h_j) = v^T tanh(W [s_{i-1}; h_j]).
        # Encoder states come from a bidirectional RNN, so they are 2 * hidden wide.
        self.attention = nn.Linear(hidden_size + hidden_size * 2, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)
        # The decoder GRU consumes [embedded y_{i-1}; context c_i].
        self.decoder_rnn = nn.GRU(hidden_size + hidden_size * 2, hidden_size)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, encoder_outputs, decoder_hidden, prev_output):
        # encoder_outputs: [src_len, batch, hidden * 2]  (all h_j)
        # decoder_hidden:  [1, batch, hidden]            (s_{i-1})
        # prev_output:     [1, batch, hidden]            (embedding of y_{i-1})
        src_len = encoder_outputs.size(0)

        # Score every source position against the previous decoder state.
        s_prev = decoder_hidden.repeat(src_len, 1, 1)                # [src_len, batch, hidden]
        energy = torch.tanh(self.attention(
            torch.cat([s_prev, encoder_outputs], dim=-1)))           # [src_len, batch, hidden]
        scores = self.v(energy)                                      # e_{ij}: [src_len, batch, 1]

        # Softmax over source positions gives the attention weights alpha_{ij}.
        attention_weights = F.softmax(scores, dim=0)

        # Context vector c_i = sum_j alpha_{ij} h_j.
        context = (attention_weights * encoder_outputs).sum(dim=0, keepdim=True)

        # One decoder RNN step conditioned on [y_{i-1}; c_i].
        decoder_input = torch.cat([prev_output, context], dim=-1)
        decoder_output, decoder_hidden = self.decoder_rnn(decoder_input, decoder_hidden)

        # Distribution over the target vocabulary for y_i.
        output = self.output(decoder_output.squeeze(0))
        return output, decoder_hidden, attention_weights.squeeze(-1)
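As a quick shape check, the decoder step can be exercised with random tensors (a hypothetical smoke test, not part of the paper):

src_len, batch, hidden, vocab = 7, 4, 256, 10_000
decoder = AttentionDecoder(hidden, vocab)

encoder_outputs = torch.randn(src_len, batch, hidden * 2)   # h_1 .. h_T from a bidirectional encoder
decoder_hidden = torch.zeros(1, batch, hidden)              # initial decoder state s_0
prev_output = torch.randn(1, batch, hidden)                 # embedding of the previous target word

logits, decoder_hidden, attn = decoder(encoder_outputs, decoder_hidden, prev_output)
print(logits.shape, attn.shape)     # torch.Size([4, 10000]) torch.Size([7, 4])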
15.12 Connection to Other Chapters
graph TB
CH15["Chapter 15<br/>NMT with Attention"]
CH15 --> CH12["Chapter 12: LSTMs<br/><i>Encoder-decoder uses LSTMs</i>"]
CH15 --> CH16["Chapter 16: Transformers<br/><i>Self-attention evolution</i>"]
CH15 --> CH14["Chapter 14: Relational RNNs<br/><i>Attention in RNNs</i>"]
CH15 --> CH24["Chapter 24: Deep Speech 2<br/><i>Attention for speech</i>"]
style CH15 fill:#ff6b6b,color:#fff
15.13 Key Equations Summary
Encoder (Bidirectional)
\(\overrightarrow{h_j} = \text{RNN}(\overrightarrow{h_{j-1}}, x_j)\) \(\overleftarrow{h_j} = \text{RNN}(\overleftarrow{h_{j+1}}, x_j)\) \(h_j = [\overrightarrow{h_j}; \overleftarrow{h_j}]\)
Attention Weights
\(e_{ij} = a(s_{i-1}, h_j)\) \(\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}\)
Context Vector
\[c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j\]
Decoder
\(s_i = \text{RNN}(s_{i-1}, [y_{i-1}; c_i])\) \(P(y_i | y_{<i}, x) = \text{softmax}(W_o s_i)\)
15.14 Chapter Summary
graph TB
subgraph "Key Takeaways"
T1["Attention solves the<br/>bottleneck problem"]
T2["Different context vector<br/>for each decoding step"]
T3["Automatic alignment<br/>between source and target"]
T4["Bidirectional encoder<br/>captures full context"]
T5["Foundation for<br/>modern attention mechanisms"]
end
T1 --> C["Bahdanau attention introduced<br/>the idea of dynamically focusing<br/>on relevant parts of the input,<br/>eliminating the bottleneck in<br/>sequence-to-sequence models and<br/>enabling better translation quality."]
T2 --> C
T3 --> C
T4 --> C
T5 --> C
style C fill:#ffe66d,color:#000,stroke:#000,stroke-width:2px
In One Sentence
This paper introduced attention mechanisms to neural machine translation, allowing the decoder to dynamically focus on different parts of the source sentence for each target word, eliminating the information bottleneck and dramatically improving translation quality.
Exercises
- Conceptual: Explain why a single context vector creates a bottleneck, and how attention solves this problem.
- Visualization: Draw the attention alignment matrix for translating “The cat sat on the mat” into French. What patterns do you expect?
- Implementation: Implement a simple attention mechanism for a character-level seq2seq model. Visualize the attention weights.
- Analysis: Compare Bahdanau attention (additive) with dot-product attention. What are the trade-offs?
References & Further Reading
| Resource | Link |
|---|---|
| Original Paper (Bahdanau et al., 2014) | arXiv:1409.0473 |
| Effective Approaches to Attention-based NMT (Luong et al., 2015) | arXiv:1508.04025 |
| Neural Machine Translation Tutorial | PyTorch |
| Attention Visualization Tool | GitHub |
Next Chapter: Chapter 16: Attention Is All You Need (Transformers) — We explore the paper that eliminated recurrence entirely, using only attention mechanisms to create the Transformer architecture that powers modern LLMs.