Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts

2020 · JMLR

Problem

Framing

Transfer learning for NLP had fragmented across incompatible objectives, architectures, and task formats, which confounded comparisons between methods. T5 addresses this by casting every task as text-to-text and scaling a single encoder-decoder recipe, reaching strong results that include GLUE 89.7 and SuperGLUE 89.3.

Currently Used Methods

Foundational

Proposed Method

Architecture

T5 uses a standard encoder-decoder Transformer with shared input/output embeddings, relative position embeddings, ReLU feed-forward blocks, and dropout 0.1. The baseline uses 12 encoder layers, 12 decoder layers, $d_{\mathrm{model}}=768$, $d_{\mathrm{ff}}=3072$, and 12 heads.
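To make the baseline dimensions concrete, the sketch below collects them in a config object and counts the weights in the attention and feed-forward blocks. It is a rough illustration, not the authors' code: `BaselineConfig` and `block_weight_count` are hypothetical names, and the count assumes that the per-head dimension is $d_{\mathrm{model}}$ divided by the number of heads, that linear layers have no bias terms, and that embeddings, layer norms, and relative position parameters are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineConfig:
    """Baseline hyperparameters as listed in the text (hypothetical container)."""
    num_encoder_layers: int = 12
    num_decoder_layers: int = 12
    d_model: int = 768
    d_ff: int = 3072
    num_heads: int = 12
    dropout: float = 0.1

    @property
    def d_head(self) -> int:
        # Assumption: d_model is split evenly across the attention heads.
        return self.d_model // self.num_heads


def block_weight_count(cfg: BaselineConfig) -> int:
    """Count attention and feed-forward weights only (no embeddings, no norms)."""
    # Q, K, V and output projections: four d_model x (num_heads * d_head) matrices.
    attention = 4 * cfg.d_model * cfg.num_heads * cfg.d_head
    # Two feed-forward matrices: d_model x d_ff and d_ff x d_model.
    feed_forward = 2 * cfg.d_model * cfg.d_ff
    # Encoder layer: self-attention + feed-forward.
    encoder = cfg.num_encoder_layers * (attention + feed_forward)
    # Decoder layer: self-attention + cross-attention + feed-forward.
    decoder = cfg.num_decoder_layers * (2 * attention + feed_forward)
    return encoder + decoder


if __name__ == "__main__":
    cfg = BaselineConfig()
    print(cfg.d_head, block_weight_count(cfg))
```

Under these assumptions the per-head dimension is 64 and the two stacks hold 198,180,864 weights; the embedding table would add to that figure depending on vocabulary size.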

Figure: the text-to-text framework. Prompted inputs for translation, acceptability, semantic similarity, and summarization all map through one T5 model to textual outputs.
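The task formatting can be sketched as a small function that prepends a task prefix to the raw input. The prefixes below follow the examples shown in the paper's overview figure as best recalled; the function names `format_example` and `stsb_target` and the task keys are hypothetical. The STS-B helper assumes the similarity score is rounded to the nearest 0.2 and emitted as a string, so that a regression task also has a textual target.

```python
def format_example(task: str, **fields: str) -> str:
    """Turn a task name plus raw fields into a single prefixed input string."""
    if task == "translate_en_de":
        return f"translate English to German: {fields['text']}"
    if task == "cola":
        return f"cola sentence: {fields['sentence']}"
    if task == "stsb":
        return (
            f"stsb sentence1: {fields['sentence1']} "
            f"sentence2: {fields['sentence2']}"
        )
    if task == "summarize":
        return f"summarize: {fields['text']}"
    raise ValueError(f"unknown task: {task!r}")


def stsb_target(score: float) -> str:
    """Render a similarity score as text, rounded to the nearest 0.2 (assumed)."""
    return f"{round(score * 5) / 5:.1f}"


if __name__ == "__main__":
    print(format_example("translate_en_de", text="That is good."))
    print(format_example("cola", sentence="The course is jumping well."))
    print(stsb_target(3.79))
```

Because every task reduces to a string-in, string-out pair, the same model, loss, and decoding procedure apply to all of them.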

Loss / Objective

All tasks use teacher-forced maximum likelihood over target text.

$$\mathcal{L}(\theta) = - \sum_{t=1}^{|\mathbf{y}|} \log p_{\theta}(y_t \mid \mathbf{x}, y_{<t})$$
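The objective above can be computed directly from per-step decoder logits. The sketch below is a minimal pure-Python illustration, assuming the decoder has already been run with the gold prefix $y_{<t}$ as input at each step (teacher forcing), so that `step_logits[t]` holds the unnormalized scores for position $t$. The names `log_softmax` and `teacher_forced_nll` are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized score vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum_exp for v in logits]


def teacher_forced_nll(
    step_logits: Sequence[Sequence[float]], target_ids: Sequence[int]
) -> float:
    """Sum of -log p(y_t | x, y_<t) over all target positions."""
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    return -sum(
        log_softmax(logits)[token]
        for logits, token in zip(step_logits, target_ids)
    )


if __name__ == "__main__":
    # Toy case: 3 target tokens, vocabulary of 4, uniform predictions.
    logits = [[0.0, 0.0, 0.0, 0.0]] * 3
    print(teacher_forced_nll(logits, [1, 0, 3]))  # equals 3 * log(4)
```

In practice the sum is usually averaged over tokens in a batch, but the per-sequence form above matches the equation term by term.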

Algorithm

Pre-training uses span corruption: each corrupted span in $\mathbf{x}$ is replaced by a unique sentinel token, and the target concatenates the missing spans, each preceded by its sentinel.

$$\tilde{\mathbf{x}} = C(\mathbf{x}), \qquad \mathbf{y}^{\star} = S(\mathbf{x}, C)$$

$$p_{\theta}(\mathbf{y}^{\star} \mid \tilde{\mathbf{x}}) = \prod_{t=1}^{|\mathbf{y}^{\star}|} p_{\theta}(y_t^{\star} \mid \tilde{\mathbf{x}}, y_{<t}^{\star})$$
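A small sketch can make the corruption map $C$ and the target builder $S$ concrete. The function below takes a token list and hand-picked spans, and returns the corrupted input $\tilde{\mathbf{x}}$ and the target $\mathbf{y}^{\star}$. Several things here are assumptions for illustration: the name `span_corrupt`, the sentinel strings `<X>`, `<Y>`, `<Z>`, the whitespace tokenization, and the convention that the target ends with one extra sentinel. In actual pre-training the spans are sampled at random rather than given by hand.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str],
    spans: Sequence[Tuple[int, int]],
    sentinels: Sequence[str] = ("<X>", "<Y>", "<Z>"),
) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel; build the matching target."""
    spans = sorted(spans)
    if len(spans) + 1 > len(sentinels):
        raise ValueError("need one sentinel per span plus a closing sentinel")

    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        inputs.extend(tokens[cursor:start])  # keep uncorrupted text
        inputs.append(sentinel)              # mark where the span was removed
        targets.append(sentinel)             # same sentinel introduces the span
        targets.extend(tokens[start:end])    # followed by the missing tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(sentinels[len(spans)])    # closing sentinel (assumed convention)
    return inputs, targets


if __name__ == "__main__":
    words = "Thank you for inviting me to your party last week .".split()
    corrupted, target = span_corrupt(words, [(2, 4), (8, 9)])
    print(" ".join(corrupted))  # Thank you <X> me to your party <Y> week .
    print(" ".join(target))     # <X> for inviting <Y> last <Z>
```

The target is much shorter than the input, since it contains only the dropped spans and their sentinels; the model is then trained on this pair with the same maximum-likelihood loss used for every other task.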

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers