Sequence to Sequence Learning with Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le

2014 · NeurIPS

Sequence to Sequence Learning with Neural Networks

Problem

Framing

Fixed-width networks did not handle variable-length input-output mappings with non-monotonic alignment. The paper closes this gap with a deep encoder-decoder LSTM trained end-to-end, plus source-sentence reversal that shortens effective dependencies. On WMT'14 En→Fr, direct decoding reaches 34.81 BLEU and reranking reaches 36.5 BLEU.

Currently Used Methods

Foundational

@hochreiterLSTM1997 — gated recurrent memory for long-range sequence dependencies.
- Limitation in context: no encoder-decoder transduction for variable-length translation.
@choGRU2014 — neural encoder-decoder for phrase scoring in SMT.
- Limitation in context: used for rescoring, not strong direct large-scale translation.
@bahdanauAttention2014 — neural translation with differentiable alignment.
- Limitation in context: lower reported BLEU than the reversed deep LSTM.
Connectionist Sequence Classification — neural sequence mapping under monotonic alignment assumptions.
- Limitation in context: cannot express general non-monotonic translation alignments.

Proposed Method

Architecture

The model uses separate encoder and decoder LSTMs. Both are 4-layer networks with 1000 cells per layer and 1000-dimensional embeddings, so the sentence representation is 8000-dimensional. The encoder reads the source in reverse order; the decoder predicts target tokens left-to-right until $<\mathrm{EOS}>$ .

Encoder-decoder schematic: source tokens are read in reverse into recurrent states, then target tokens are generated left-to-right until the end token.

Loss / Objective

Training maximizes conditional log-likelihood of the target sequence given the source.

p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})

\max \; \frac{1}{|\mathcal{S}|} \sum_{(T,S) \in \mathcal{S}} \log p(T \mid S)

Sampling Rule / Algorithm

Decoding uses left-to-right beam search over target prefixes.

\hat{T} = \arg\max_T \, p(T \mid S)

Training Procedure

4 LSTM layers.
1000 cells per layer.
1000-dimensional word embeddings.
Source vocabulary 160,000; target vocabulary 80,000.
Parameter initialization: uniform in $[-0.08, 0.08]$ .
SGD without momentum.
Learning rate 0.7.
After 5 epochs, halve learning rate every half epoch.
Total training: 7.5 epochs.
Batch size: 128.
Gradient clipping: if $\|g\|_2 > 5$ , rescale to norm 5.
Length-bucketed minibatches.
8 GPUs.
Training time: about 10 days.

Evaluation

Datasets

WMT'14 English→French.
12M selected sentence pairs.
304M English words.
348M French words.
Source vocabulary 160k; target vocabulary 80k.
OOV words mapped to UNK.

Metrics

Cased BLEU on ntst14.
Test perplexity for source-reversal comparison.

Headline results

WMT'14 En→Fr direct, 5 reversed LSTMs, beam 12: BLEU 34.81.
WMT'14 En→Fr direct, 5 reversed LSTMs, beam 2: BLEU 34.50.
WMT'14 En→Fr direct, SMT baseline: BLEU 33.30.
WMT'14 En→Fr reranking 1000-best, 5 reversed LSTMs: BLEU 36.5.
Source reversal: perplexity 5.8 → 4.7, BLEU 25.9 → 30.6.

Results page with two BLEU tables for direct decoding and SMT reranking, plus a PCA plot of learned sentence representations.

Ablations

Source order reversal: sharply improves optimization and translation quality.
Depth: each added LSTM layer cuts perplexity by nearly 10%.
Beam size: beam 2 captures most gains over larger beams.
Ensemble size: 5 models outperform single-model decoding and reranking.

Method Strengths and Weaknesses

Strengths

Direct neural decoding beats the phrase-based baseline: 34.81 vs 33.30 BLEU.
Source reversal yields a large gain: 25.9 → 30.6 BLEU.
Long-sentence performance degrades only slightly past 35 words.
Learned representations capture word order and active-passive invariance.

Weaknesses

Fixed-vector encoding creates a source-information bottleneck.
Small target vocabulary causes UNK errors on rare words.
Best reranking score still trails the top WMT'14 system: 36.5 vs 37.0.
Training is expensive: 384M parameters and roughly 10 days.

Suggestions from the authors

Extend the small-vocabulary setup to handle rare and unseen words better.
Improve direct translation beyond the current unoptimized architecture.
Train multilingual systems with separate encoder and decoder LSTMs.
Explain why source reversal improves optimization and memory use.

Sequence to Sequence Learning with Neural Networks

Sequence to Sequence Learning with Neural Networks

Problem

Framing

Currently Used Methods

Foundational

Proposed Method

Architecture

Loss / Objective

Sampling Rule / Algorithm

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers