Sequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural Networks
Problem
Framing
Fixed-width networks did not handle variable-length input-output mappings with non-monotonic alignment. The paper closes this gap with a deep encoder-decoder LSTM trained end-to-end, plus source-sentence reversal that shortens effective dependencies. On WMT'14 En→Fr, direct decoding reaches 34.81 BLEU and reranking reaches 36.5 BLEU.
Currently Used Methods
Foundational
- @hochreiterLSTM1997 — gated recurrent memory for long-range sequence dependencies.
- Limitation in context: no encoder-decoder transduction for variable-length translation.
- @choGRU2014 — neural encoder-decoder for phrase scoring in SMT.
- Limitation in context: used for rescoring, not strong direct large-scale translation.
- @bahdanauAttention2014 — neural translation with differentiable alignment.
- Limitation in context: lower reported BLEU than the reversed deep LSTM.
- Connectionist Sequence Classification — neural sequence mapping under monotonic alignment assumptions.
- Limitation in context: cannot express general non-monotonic translation alignments.
Proposed Method
Architecture
The model uses separate encoder and decoder LSTMs. Both are 4-layer networks with 1000 cells per layer and 1000-dimensional embeddings, so the sentence representation is 8000-dimensional. The encoder reads the source in reverse order; the decoder predicts target tokens left-to-right until .

Loss / Objective
Training maximizes conditional log-likelihood of the target sequence given the source.
Sampling Rule / Algorithm
Decoding uses left-to-right beam search over target prefixes.
Training Procedure
- 4 LSTM layers.
- 1000 cells per layer.
- 1000-dimensional word embeddings.
- Source vocabulary 160,000; target vocabulary 80,000.
- Parameter initialization: uniform in .
- SGD without momentum.
- Learning rate 0.7.
- After 5 epochs, halve learning rate every half epoch.
- Total training: 7.5 epochs.
- Batch size: 128.
- Gradient clipping: if , rescale to norm 5.
- Length-bucketed minibatches.
- 8 GPUs.
- Training time: about 10 days.
Evaluation
Datasets
- WMT'14 English→French.
- 12M selected sentence pairs.
- 304M English words.
- 348M French words.
- Source vocabulary 160k; target vocabulary 80k.
- OOV words mapped to UNK.
Metrics
- Cased BLEU on ntst14.
- Test perplexity for source-reversal comparison.
Headline results
- WMT'14 En→Fr direct, 5 reversed LSTMs, beam 12: BLEU 34.81.
- WMT'14 En→Fr direct, 5 reversed LSTMs, beam 2: BLEU 34.50.
- WMT'14 En→Fr direct, SMT baseline: BLEU 33.30.
- WMT'14 En→Fr reranking 1000-best, 5 reversed LSTMs: BLEU 36.5.
- Source reversal: perplexity 5.8 → 4.7, BLEU 25.9 → 30.6.

Ablations
- Source order reversal: sharply improves optimization and translation quality.
- Depth: each added LSTM layer cuts perplexity by nearly 10%.
- Beam size: beam 2 captures most gains over larger beams.
- Ensemble size: 5 models outperform single-model decoding and reranking.
Method Strengths and Weaknesses
Strengths
- Direct neural decoding beats the phrase-based baseline: 34.81 vs 33.30 BLEU.
- Source reversal yields a large gain: 25.9 → 30.6 BLEU.
- Long-sentence performance degrades only slightly past 35 words.
- Learned representations capture word order and active-passive invariance.
Weaknesses
- Fixed-vector encoding creates a source-information bottleneck.
- Small target vocabulary causes UNK errors on rare words.
- Best reranking score still trails the top WMT'14 system: 36.5 vs 37.0.
- Training is expensive: 384M parameters and roughly 10 days.
Suggestions from the authors
- Extend the small-vocabulary setup to handle rare and unseen words better.
- Improve direct translation beyond the current unoptimized architecture.
- Train multilingual systems with separate encoder and decoder LSTMs.
- Explain why source reversal improves optimization and memory use.
Links
Prior Papers
- @hochreiterLSTM1997 — provides the gated recurrent cell that makes long-range seq2seq optimization feasible.
Further Papers
- @bahdanauAttention2014 — replaces the fixed-vector bottleneck with learned alignment over encoder states.
- @choGRU2014 — overlaps in encoder-decoder translation and neural sequence transduction.
- @vaswaniAttentionAllNeed2017 — pushes sequence transduction beyond recurrence with attention-only architectures.