Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, Bart van Merrienboer

2014 · EMNLP

Problem

Framing

Phrase-based SMT used count-derived phrase scores and shallow neural features, but lacked a learned conditional model for variable-length phrase pairs. This paper closes that gap with an encoder–decoder RNN that maps source phrases to target phrases through a fixed context vector and improves Moses test BLEU from 33.30 to 33.87.

Currently Used Methods

Foundational

Proposed Method

Architecture

An encoder RNN reads the source phrase $\mathbf{x}=(x_1,\ldots,x_T)$ into a fixed-length vector $\mathbf{c}$. A decoder RNN predicts target tokens autoregressively from its previous outputs and $\mathbf{c}$. The recurrent unit adds reset and update gates that control how much of the previous hidden state is kept or overwritten at each step.

Figure (architecture diagram): an encoder reads source tokens into a context vector $\mathbf{c}$, and a decoder generates target tokens conditioned on that vector.
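The encoder's role described above can be illustrated with a minimal sketch: a recurrent step is folded over the source tokens and the final state serves as the fixed-size summary. The names `encode` and `toy_step` are hypothetical, and the toy step is a stand-in for the gated unit, not the paper's parameterization.

```python
import math

def encode(source_vectors, step_fn, h0):
    """Fold a variable-length source phrase into one fixed-size vector."""
    h = list(h0)
    for x_t in source_vectors:
        h = step_fn(x_t, h)  # the state is updated once per source token
    return h  # used as the context vector c in this sketch

def toy_step(x_t, h):
    # Hypothetical stand-in for the recurrent unit (assumption, for illustration).
    return [math.tanh(0.5 * h_j + x_j) for h_j, x_j in zip(h, x_t)]

# Phrases of different lengths map to context vectors of the same size.
short = encode([[0.1, -0.2]], toy_step, h0=[0.0, 0.0])
longer = encode([[0.1, -0.2], [0.3, 0.0], [-0.5, 0.4]], toy_step, h0=[0.0, 0.0])
```

Whatever the phrase length, the decoder only ever sees a vector of the same dimensionality, which is what makes the model applicable to variable-length phrase pairs.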

Loss / Objective

The model maximizes conditional log-likelihood over phrase pairs.

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(\mathbf{y}_n \mid \mathbf{x}_n)$$
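As a small illustration of this objective, the sketch below averages per-pair log-probabilities over a set of phrase pairs. The function name `mean_log_likelihood` and the lookup-table "model" are assumptions made for the example; in practice `log_prob_fn` would be the encoder-decoder itself.

```python
import math

def mean_log_likelihood(pairs, log_prob_fn):
    """Average of log p(y_n | x_n) over N phrase pairs (the quantity to maximize)."""
    return sum(log_prob_fn(x, y) for x, y in pairs) / len(pairs)

# Hypothetical toy model: a lookup table of conditional probabilities.
toy_table = {
    ("the end", "la fin"): 0.5,
    ("first time", "première fois"): 0.25,
}

def toy_log_prob(x, y):
    return math.log(toy_table[(x, y)])

pairs = list(toy_table.keys())
objective = mean_log_likelihood(pairs, toy_log_prob)
```

Training would adjust the parameters behind `log_prob_fn` to push this average upward, typically with a gradient-based optimizer.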

Sampling Rule / Algorithm

The decoder factorizes the target phrase left-to-right.

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T'} p\left(y_t \mid y_{t-1}, \ldots, y_1, \mathbf{c}\right)$$

$$\mathbf{h}_{\langle t \rangle} = f\left(\mathbf{h}_{\langle t-1 \rangle}, y_{t-1}, \mathbf{c}\right)$$
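The factorization can be sketched as a loop that accumulates per-step log-probabilities while updating the decoder state. This is a minimal illustration under stated assumptions: `step_fn` plays the role of $f$, `output_fn` is a hypothetical scoring function over the vocabulary, and `start_id` is an assumed start symbol; none of these names come from the paper.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def phrase_log_prob(target_ids, c, step_fn, output_fn, h0, start_id=0):
    """Sum over t of log p(y_t | y_<t, c), i.e. the log of the product above."""
    h, prev, log_p = h0, start_id, 0.0
    for y_t in target_ids:
        h = step_fn(h, prev, c)                  # h_t = f(h_{t-1}, y_{t-1}, c)
        probs = softmax(output_fn(h, prev, c))   # distribution over the vocabulary
        log_p += math.log(probs[y_t])
        prev = y_t
    return log_p

# Hypothetical toy components: a state that never changes and uniform
# scores over a 4-word vocabulary.
def keep_state(h, prev, c):
    return h

def uniform_scores(h, prev, c):
    return [0.0, 0.0, 0.0, 0.0]

lp = phrase_log_prob([2, 0, 3], c=[0.0], step_fn=keep_state,
                     output_fn=uniform_scores, h0=[0.0])
```

With uniform scores every step contributes $\log(1/4)$, so the three-token phrase scores $3\log(1/4)$; a trained model would instead concentrate probability on likely next tokens.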

Gated Hidden Unit

The new recurrent unit interpolates between copying the old state and writing a candidate state.

$$r_j = \sigma\left([\mathbf{W}_r \mathbf{x}]_j + [\mathbf{U}_r \mathbf{h}_{\langle t-1 \rangle}]_j\right), \qquad z_j = \sigma\left([\mathbf{W}_z \mathbf{x}]_j + [\mathbf{U}_z \mathbf{h}_{\langle t-1 \rangle}]_j\right)$$

$$\tilde{h}_j^{\langle t \rangle} = \phi\left([\mathbf{W} \mathbf{x}]_j + [\mathbf{U}(\mathbf{r} \odot \mathbf{h}_{\langle t-1 \rangle})]_j\right)$$

$$h_j^{\langle t \rangle} = z_j h_j^{\langle t-1 \rangle} + (1-z_j)\, \tilde{h}_j^{\langle t \rangle}$$
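One step of the gated unit can be written out directly from these equations. The sketch below uses plain Python lists and assumes $\phi = \tanh$ and no bias terms; the helper names `sigmoid`, `matvec`, and `gated_unit_step` are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gated_unit_step(x, h_prev, W_r, U_r, W_z, U_z, W, U):
    # Reset gate r and update gate z, computed elementwise.
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), matvec(U_r, h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(W_z, x), matvec(U_z, h_prev))]
    # Candidate state: the reset gate masks the old state before it is mixed in.
    gated = [r_j * h_j for r_j, h_j in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b)  # phi = tanh is an assumption of this sketch
               for a, b in zip(matvec(W, x), matvec(U, gated))]
    # Interpolate: z_j near 1 copies the old state, z_j near 0 writes the candidate.
    return [z_j * h_j + (1.0 - z_j) * c_j
            for z_j, h_j, c_j in zip(z, h_prev, h_tilde)]

# Toy check with all-zero weights: both gates equal 0.5 and the candidate is 0,
# so the new state is exactly half the old one.
zeros_in = [[0.0], [0.0]]             # 2x1 input weights
zeros_rec = [[0.0, 0.0], [0.0, 0.0]]  # 2x2 recurrent weights
h_new = gated_unit_step([1.0], [2.0, -4.0],
                        zeros_in, zeros_rec, zeros_in, zeros_rec,
                        zeros_in, zeros_rec)
```

Note that some library implementations swap the roles of $z_j$ and $1-z_j$ in the final interpolation; the sketch follows the convention of the equations above.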

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Qualitative analysis

The model's phrase scores favor linguistically regular target phrases rather than simply tracking raw phrase-table frequency, and sampling from the decoder produces plausible target phrases for a given source phrase.

Table 4: Samples from RNN Encoder–Decoder for selected source phrases

Source | Samples from RNN Encoder–Decoder
at the end of the | [à la fin de la] (×11)
for the first time | [pour la première fois] (×24) [pour la première fois que] (×2)
in the United States and | [aux États-Unis et] (×6) [dans les États-Unis et] (×4)
, as well as | [, ainsi que] [,] [ainsi que] [, ainsi qu’] [et UNK]
one of the most | [l’ un des plus] (×9) [l’ un des] (×5) [l’ une des plus] (×2)

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers