Sequence to Sequence Learning with Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le

2014 · NeurIPS

Problem

Framing

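Standard deep neural networks require fixed-dimensional inputs and outputs, so they could not handle variable-length input-output mappings with non-monotonic alignment. The paper closes this gap with a deep encoder-decoder LSTM trained end-to-end, plus source-sentence reversal, which shortens the effective dependencies between early source words and early target words. On WMT'14 En→Fr, direct decoding with an ensemble of 5 reversed LSTMs reaches 34.81 BLEU, and using the LSTMs to rerank an SMT baseline's 1000-best list reaches 36.5 BLEU.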

Currently Used Methods

Foundational

Proposed Method

Architecture

The model uses separate encoder and decoder LSTMs. Both are 4-layer networks with 1000 cells per layer and 1000-dimensional embeddings. Each layer carries a hidden vector and a cell vector, so the sentence representation is 4 × 1000 × 2 = 8000-dimensional. The encoder reads the source in reverse order; the decoder predicts target tokens left-to-right until it emits <EOS>.

Encoder-decoder schematic: source tokens are read in reverse into recurrent states, then target tokens are generated left-to-right until the end token.
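The input preparation and the state-size arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `prepare_pair` and `sentence_state_size` are hypothetical, and the assumption that the target sequence is simply terminated with `<EOS>` is a simplification.

```python
EOS = "<EOS>"

def prepare_pair(source_tokens, target_tokens):
    """Reverse the source and terminate the target with EOS.

    Hypothetical helper: only the source is reversed, the target
    keeps its natural left-to-right order.
    """
    encoder_input = list(reversed(source_tokens))
    decoder_target = list(target_tokens) + [EOS]
    return encoder_input, decoder_target

def sentence_state_size(num_layers=4, cells_per_layer=1000):
    """Count the real numbers handed from encoder to decoder.

    Each LSTM layer contributes a hidden vector and a cell vector,
    hence the factor of 2.
    """
    return num_layers * cells_per_layer * 2

enc, dec = prepare_pair(["a", "b", "c"], ["x", "y"])
size = sentence_state_size()
```

With the defaults, `sentence_state_size()` reproduces the 8000-dimensional figure quoted above.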

Loss / Objective

Training maximizes conditional log-likelihood of the target sequence given the source.

$$
p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})
$$

$$
\max \; \frac{1}{|\mathcal{S}|} \sum_{(T,S) \in \mathcal{S}} \log p(T \mid S)
$$
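As a rough illustration of the objective, the log of the product over target positions becomes a sum of per-token log-probabilities, and the training criterion averages that sum over sentence pairs. The sketch below assumes the per-token probabilities have already been produced by some model; the function names and the toy numbers are hypothetical.

```python
import math

def sequence_log_prob(step_probs):
    """Sum of log p(y_t | v, y_1..y_{t-1}) over target positions.

    `step_probs` holds the probability the model assigned to each
    correct target token, in order (assumed to be given).
    """
    return sum(math.log(p) for p in step_probs)

def mean_log_likelihood(corpus_step_probs):
    """Average sequence log-probability over all training pairs."""
    total = sum(sequence_log_prob(probs) for probs in corpus_step_probs)
    return total / len(corpus_step_probs)

# Toy corpus of two sentence pairs with made-up token probabilities.
toy_corpus = [[0.5, 0.5], [0.25]]
objective = mean_log_likelihood(toy_corpus)
```

Training would adjust the model parameters to push this average upward; the gradient step itself is outside the scope of this sketch.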

Sampling Rule / Algorithm

Decoding uses left-to-right beam search over target prefixes.

$$
\hat{T} = \arg\max_T \, p(T \mid S)
$$
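Beam search only approximates this argmax: it keeps a small number of partial hypotheses, extends each by one token, and moves a hypothesis to the finished set once it ends in `<EOS>`. The sketch below is a toy version under stated assumptions: `next_token_probs` is a hypothetical callback standing in for the decoder, and `toy_model` is an invented distribution used only to exercise the code.

```python
import math

EOS = "<EOS>"

def beam_search(next_token_probs, beam_size=2, max_len=10):
    """Return (log_prob, tokens) of the best hypothesis found.

    `next_token_probs(prefix)` is a hypothetical callback mapping a
    tuple of tokens to a dict {token: probability}.
    """
    beam = [(0.0, [])]          # live partial hypotheses
    finished = []               # hypotheses that already emitted EOS
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for token, p in next_token_probs(tuple(prefix)).items():
                if p > 0.0:
                    candidates.append((score + math.log(p), prefix + [token]))
        if not candidates:
            break
        # Keep only the highest-scoring extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == EOS:
                finished.append((score, seq))
            else:
                beam.append((score, seq))
        if not beam:
            break
    return max(finished or beam, key=lambda c: c[0])

def toy_model(prefix):
    """Invented next-token table; unknown prefixes end the sentence."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    return table.get(prefix, {EOS: 1.0})

wide_score, wide_tokens = beam_search(toy_model, beam_size=2)
greedy_score, greedy_tokens = beam_search(toy_model, beam_size=1)
```

In this toy table a beam of size 1 commits to the locally best first token `a` and ends with probability 0.3, while a beam of size 2 keeps `b` alive and finds the higher-probability path through `z` (0.36).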

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Results page with two BLEU tables for direct decoding and SMT reranking, plus a PCA plot of learned sentence representations.

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers