Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

2014 · ICLR

Problem

Framing

Earlier encoder–decoder NMT models compressed the full source sentence into one fixed-length vector, and translation quality degraded sharply on long inputs. This paper replaces that bottleneck with a soft alignment over the encoder annotations, recomputed at each decoding step. On WMT’14 English→French, the best attentive model reaches 28.45 BLEU over all test sentences.

Currently Used Methods

Direct antecedents

Proposed Method

Architecture

A bidirectional encoder produces annotations $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$, the concatenation of the forward and backward hidden states at source position $j$. The decoder updates its state $s_i$ from $s_{i-1}$, the previous target word $y_{i-1}$, and a step-specific context $c_i$, then predicts $y_i$. The architecture figure shows decoder states attending over all source annotations with weights $\alpha_{ij}$.

Architecture diagram: decoder states $s_{t-1}$, $s_t$ receive a context from attention weights over bidirectional source annotations $h_1, \ldots, h_{T_x}$.
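
The annotation construction described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a simple tanh recurrence in place of a gated hidden unit, and the names `rnn_pass` and `bidirectional_annotations` and all weight values are hypothetical.

```python
import math

def rnn_pass(inputs, W, U):
    """Run a plain tanh RNN over `inputs` and return every hidden state.

    Assumption: a simple tanh recurrence stands in for a gated unit.
    W maps the input to the hidden space, U maps the previous hidden state.
    """
    hidden_size = len(U)
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(W[r], x))
                + sum(u * hi for u, hi in zip(U[r], h))
            )
            for r in range(hidden_size)
        ]
        states.append(h)
    return states

def bidirectional_annotations(inputs, fwd, bwd):
    """Concatenate forward and backward states: h_j = [fwd_j; bwd_j]."""
    forward = rnn_pass(inputs, *fwd)
    # Read the sentence right to left, then restore source order.
    backward = rnn_pass(inputs[::-1], *bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]

# Hypothetical toy weights: input size 3, hidden size 2 per direction.
W_f = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]
U_f = [[0.1, 0.0], [0.0, 0.1]]
W_b = [[-0.4, 0.2, 0.3], [0.6, -0.1, 0.5]]
U_b = [[0.2, 0.1], [-0.1, 0.2]]

source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
annotations = bidirectional_annotations(source, (W_f, U_f), (W_b, U_b))
```

Each annotation has twice the per-direction hidden size. Its forward half summarizes the words up to position $j$ and its backward half summarizes the words from position $j$ onward, so every annotation carries context from both sides of the word.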

Loss / Objective

Training maximizes the conditional log-likelihood of the target sentence given the source sentence.

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{T_y} p\left(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}\right)$$
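
As a small worked example of this factorization, the log-likelihood of a target sentence is the sum of the per-token log-probabilities. The sketch below assumes the per-step probabilities are already available; the function name and the probability values are hypothetical and chosen only for illustration.

```python
import math

def sentence_log_likelihood(step_probs):
    """Return sum_i log p(y_i | y_1..y_{i-1}, x) for one target sentence."""
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each step probability must lie in (0, 1]")
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities the model assigns to each correct target token.
step_probs = [0.5, 0.25, 0.8]

log_likelihood = sentence_log_likelihood(step_probs)
# Exponentiating recovers the product form of p(y | x).
joint_probability = math.exp(log_likelihood)
```

Working in log space turns the product into a sum, which avoids numerical underflow on long target sentences.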

Sampling Rule / Algorithm

At each decoding step, the model recomputes alignment weights and the context vector before predicting the next token.

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = v_a^\top \tanh\left(W_a s_{i-1} + U_a h_j\right)$$
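
The three formulas above map directly onto a single attention step. The following is a minimal pure-Python sketch, assuming small dense lists for vectors and matrices; the helper names and the toy parameter values are hypothetical, and a real model would learn $W_a$, $U_a$, and $v_a$ during training.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def alignment_scores(s_prev, annotations, W_a, U_a, v_a):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j) for every source position j."""
    Ws = matvec(W_a, s_prev)  # shared across all j, so computed once
    scores = []
    for h in annotations:
        Uh = matvec(U_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(v * t for v, t in zip(v_a, hidden)))
    return scores

def softmax(scores):
    """alpha_ij = exp(e_ij) / sum_k exp(e_ik), shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attend(s_prev, annotations, W_a, U_a, v_a):
    """Return (alpha_i, c_i) where c_i = sum_j alpha_ij h_j."""
    alpha = softmax(alignment_scores(s_prev, annotations, W_a, U_a, v_a))
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(alpha, annotations)) for d in range(dim)]
    return alpha, context

# Hypothetical toy parameters (2-dimensional states and annotations).
W_a = [[0.5, 0.0], [0.0, 0.5]]
U_a = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, -1.0]
s_prev = [0.1, -0.2]
annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

alpha, context = attend(s_prev, annotations, W_a, U_a, v_a)
```

The weights in `alpha` are positive and sum to one, so `context` is a convex combination of the annotations. Because `s_prev` changes at every step, the weights and the context are recomputed for each target token.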

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 1: BLEU scores on the WMT’14 English→French test set.

| Model        | All   | No UNK |
|--------------|-------|--------|
| RNNencdec-30 | 13.93 | 24.19  |
| RNNsearch-30 | 21.50 | 31.44  |
| RNNencdec-50 | 17.82 | 26.71  |
| RNNsearch-50 | 28.45 | 33.08  |
| Moses        | 30.64 | 33.30  |

Ablations

Results plot: BLEU versus source-sentence length for RNNsearch and RNNencdec, with attentive models degrading much more slowly.

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers