Deep Contextualized Word Representations

Matthew E. Peters, Mark Neumann

2018 · NAACL

Problem

Framing

Static word embeddings assign a single vector to each word type, so they cannot capture polysemy or context-dependent syntax. ELMo addresses this with a deep bidirectional language model (biLM) whose internal layers are combined with task-specific weights into token-specific features. Across six benchmarks, it reports relative error reductions of 6–20%.

Currently Used Methods

Foundational

Proposed Method

Architecture

ELMo uses a 2-layer biLM with separate forward and backward LSTMs. Each LSTM layer has 4096 hidden units projected down to 512 dimensions, with a residual connection from the first layer to the second. The context-insensitive token layer is a character CNN with 2048 filters, followed by two highway layers and a linear projection to 512 dimensions.
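The sizes above can be collected into a small configuration object to make the resulting shapes explicit. This is only a bookkeeping sketch: the class and field names are hypothetical and not taken from any released code, and it assumes that each contextual layer concatenates the forward and backward 512-dimensional projections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiLMConfig:
    """Hyperparameters quoted in the text; field names are illustrative."""

    num_lstm_layers: int = 2
    lstm_hidden_units: int = 4096
    projection_dim: int = 512
    char_cnn_filters: int = 2048
    num_highway_layers: int = 2

    @property
    def num_representation_layers(self) -> int:
        # Token layer (j = 0) plus one representation per LSTM layer.
        return self.num_lstm_layers + 1

    @property
    def contextual_dim(self) -> int:
        # Assumption: forward and backward projections are concatenated.
        return 2 * self.projection_dim


config = BiLMConfig()
print(config.num_representation_layers, config.contextual_dim)
```

Under these assumptions, each token has three representations available for mixing, and each contextual layer is 1024-dimensional.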

Loss / Objective

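The biLM jointly maximizes the log-likelihood of the forward and backward language models. The two directions share the token-representation and softmax parameters but keep separate LSTM parameters. For each downstream task, the L+1 layer representations of a token are then collapsed into one vector by a task-specific scalar mix: softmax-normalized layer weights and a single scale factor.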

$$
\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) \Big)
$$

$$
\mathrm{ELMo}^{\mathrm{task}}_k = \gamma^{\mathrm{task}} \sum_{j=0}^{L} s^{\mathrm{task}}_j \, \mathbf{h}^{\mathrm{LM}}_{k,j}
$$
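The two formulas can be illustrated with a minimal NumPy sketch. It assumes the per-token log-probabilities and the layer activations have already been computed by some biLM, and that the mixing weights are obtained by a softmax over unconstrained parameters. The function names `bilm_log_likelihood` and `scalar_mix` are hypothetical.

```python
import numpy as np


def bilm_log_likelihood(fwd_logp, bwd_logp):
    """Joint objective for one sequence.

    fwd_logp[k] stands for log p(t_k | t_1..t_{k-1}) and
    bwd_logp[k] for log p(t_k | t_{k+1}..t_N).
    """
    fwd = np.asarray(fwd_logp, dtype=float)
    bwd = np.asarray(bwd_logp, dtype=float)
    if fwd.shape != bwd.shape:
        raise ValueError("forward and backward terms must align per token")
    return float(np.sum(fwd + bwd))


def scalar_mix(layers, raw_weights, gamma):
    """Collapse (L+1, N, D) layer activations into (N, D) ELMo vectors."""
    layers = np.asarray(layers, dtype=float)
    raw = np.asarray(raw_weights, dtype=float)
    if raw.shape != (layers.shape[0],):
        raise ValueError("need exactly one weight per layer")
    # Softmax over layers gives the normalized weights s_j.
    shifted = np.exp(raw - raw.max())
    s = shifted / shifted.sum()
    # Weighted sum over the layer axis, then scale by gamma.
    return gamma * np.tensordot(s, layers, axes=1)


# Toy input: 3 layers, 2 tokens, 4 dimensions, with constant activations.
toy_layers = np.stack([np.full((2, 4), v) for v in (0.0, 1.0, 2.0)])
mixed = scalar_mix(toy_layers, raw_weights=[0.0, 0.0, 0.0], gamma=2.0)
```

With equal raw weights the mix reduces to a plain average of the layers scaled by gamma. In a real downstream model both `raw_weights` and `gamma` would be trained along with the task parameters.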

Algorithm

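The downstream model freezes the biLM weights, runs the biLM once over each input to obtain its layer representations, and concatenates the mixed ELMo vector with the token representation at the input of the task model. For some tasks, the ELMo vector is also concatenated with the hidden state at the output of the task RNN.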

$$
[\mathbf{x}_k; \mathrm{ELMo}^{\mathrm{task}}_k], \qquad [\mathbf{h}_k; \mathrm{ELMo}^{\mathrm{task}}_k]
$$
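The concatenation itself is a single feature-axis join per token. The sketch below assumes the ELMo vectors were precomputed from the frozen biLM. The helper name `augment_with_elmo` and the 300- and 1024-dimensional toy sizes are illustrative choices, not values stated in this summary.

```python
import numpy as np


def augment_with_elmo(base, elmo):
    """Concatenate per-token features [base_k; ELMo_k] along the last axis.

    `base` may hold input embeddings x_k or task-RNN hidden states h_k.
    """
    base = np.asarray(base, dtype=float)
    elmo = np.asarray(elmo, dtype=float)
    if base.shape[0] != elmo.shape[0]:
        raise ValueError("both inputs must cover the same number of tokens")
    return np.concatenate([base, elmo], axis=-1)


# Toy example: 5 tokens, 300-dim static embeddings, 1024-dim ELMo vectors.
x = np.zeros((5, 300))
elmo_vectors = np.ones((5, 1024))
enhanced = augment_with_elmo(x, elmo_vectors)
```

Because the biLM is frozen, `elmo_vectors` depends only on the input text and the learned mixing weights, so the same helper applies at the input and at the output of the task model.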

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Results plots: baseline vs. ELMo curves for SNLI accuracy and SRL F1 as training data increases from 0.1% to 100%.

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers