Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, et al.

2020 · JMLR

Problem

Framing

Transfer learning for NLP had fragmented across incompatible objectives, architectures, and task formats, which confounded comparisons between methods. T5 addresses this by casting every task as text-to-text and scaling a single encoder-decoder recipe, reaching GLUE $89.7$ and SuperGLUE $89.3$.

Currently Used Methods

Foundational

@vaswaniAttentionAllNeed2017 — Transformer sequence transduction with self-attention.
- Limitation in context: no unified transfer-learning objective or task format.
@radfordGPT2018 — decoder-only language-model pre-training for generative transfer.
- Limitation in context: awkward for classification, regression, and span extraction.
@devlinBERT2018 — masked denoising pre-training with bidirectional context.
- Limitation in context: objective is not naturally text generation.
@petersELMo2018 — contextual representations from language-model pre-training.
- Limitation in context: feature extraction underuses end-to-end fine-tuning.
XLNet: Generalized Autoregressive Pretraining for Language Understanding — permutation language modeling for bidirectional context.
- Limitation in context: objective changes alone do not unify task formatting.

Proposed Method

Architecture

T5 uses a standard encoder-decoder Transformer with shared input/output embeddings, relative position embeddings, ReLU feed-forward blocks, and dropout $0.1$. The baseline uses 12 encoder layers, 12 decoder layers, $d_{\mathrm{model}}=768$, $d_{\mathrm{ff}}=3072$, and 12 heads.
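As a sanity check on the baseline configuration, the parameter count can be estimated from the listed dimensions. This is a rough sketch under stated assumptions: projections have no bias terms, the inner attention width equals $d_{\mathrm{model}}$, the vocabulary is taken as 32,000 entries, and layer-norm and position-bias parameters are ignored as negligible. The function names are illustrative only.

```python
# Rough parameter estimate for the baseline encoder-decoder configuration.
# Assumptions (not stated in the notes): bias-free projections, attention
# inner width equal to d_model, layer-norm and position-bias terms ignored.
D_MODEL, D_FF, N_LAYERS, VOCAB = 768, 3072, 12, 32_000


def attention_params(d_model: int) -> int:
    # Query, key, value, and output projections, each d_model x d_model.
    return 4 * d_model * d_model


def feed_forward_params(d_model: int, d_ff: int) -> int:
    # Two dense layers: d_model -> d_ff and d_ff -> d_model.
    return 2 * d_model * d_ff


def baseline_param_count() -> int:
    # Encoder layer: self-attention + feed-forward.
    encoder_layer = attention_params(D_MODEL) + feed_forward_params(D_MODEL, D_FF)
    # Decoder layer: self-attention + cross-attention + feed-forward.
    decoder_layer = 2 * attention_params(D_MODEL) + feed_forward_params(D_MODEL, D_FF)
    # One embedding matrix, shared between input and output.
    embeddings = VOCAB * D_MODEL
    return N_LAYERS * (encoder_layer + decoder_layer) + embeddings


print(f"{baseline_param_count() / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes out a little above 220M, in line with the quoted baseline size.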

Text-to-text framework (figure): task-prefixed inputs for translation, acceptability judgment, semantic similarity, and summarization all pass through one T5 model and come out as text.
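A minimal sketch of how the four tasks named above can be rendered as prefixed input strings. The prefixes follow the examples in the paper's overview figure; the task keys, the `to_text_to_text` helper, and the field names are hypothetical and exist only for illustration.

```python
# Illustrative task-prefix templates. The dictionary keys and field names are
# made up for this sketch; only the prefix strings mirror the paper's examples.
TEMPLATES = {
    "translate_en_de": "translate English to German: {text}",
    "cola": "cola sentence: {sentence}",
    "stsb": "stsb sentence1: {sentence1} sentence2: {sentence2}",
    "summarize": "summarize: {text}",
}


def to_text_to_text(task: str, **fields: str) -> str:
    """Render one example as the plain-text input fed to the model."""
    return TEMPLATES[task].format(**fields)


print(to_text_to_text("translate_en_de", text="That is good."))
print(to_text_to_text("cola", sentence="The course is jumping well."))
```

The target side is plain text as well: a translation, a label word such as "acceptable", or a similarity score written out as a string.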

Loss / Objective

All tasks use teacher-forced maximum likelihood over target text.

$$\mathcal{L}(\theta) = - \sum_{t=1}^{|\mathbf{y}|} \log p_{\theta}(y_t \mid \mathbf{x}, y_{<t})$$
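The objective can be made concrete with a toy computation. The sketch below assumes the model's per-step distributions (conditioned on the input and the gold prefix, as in teacher forcing) are already available as plain lists; `sequence_nll` and the toy numbers are illustrative, not part of the paper.

```python
import math


def sequence_nll(step_probs, target_ids):
    """Negative log-likelihood of a target sequence under teacher forcing.

    step_probs[t] is assumed to be the model's distribution over the
    vocabulary at step t, given the input x and the gold prefix y_<t.
    """
    return -sum(math.log(step_probs[t][y]) for t, y in enumerate(target_ids))


# Toy example: vocabulary of 3 tokens, target sequence of length 2.
probs = [
    [0.5, 0.25, 0.25],  # distribution at t=1
    [0.1, 0.8, 0.1],    # distribution at t=2
]
loss = sequence_nll(probs, target_ids=[0, 1])
print(round(loss, 4))  # -(log 0.5 + log 0.8)
```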

Algorithm

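Pre-training uses span corruption. A corruption function $C$ replaces each selected span of $\mathbf{x}$ with a unique sentinel token to give the input $\tilde{\mathbf{x}}$, and the target $\mathbf{y}^{\star}$ concatenates the dropped spans, each preceded by its sentinel.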

$$\tilde{\mathbf{x}} = C(\mathbf{x}), \qquad \mathbf{y}^{\star} = S(\mathbf{x}, C)$$

$$p_{\theta}(\mathbf{y}^{\star} \mid \tilde{\mathbf{x}}) = \prod_{t=1}^{|\mathbf{y}^{\star}|} p_{\theta}(y_t^{\star} \mid \tilde{\mathbf{x}}, y_{<t}^{\star})$$
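A small sketch of the corruption step on whitespace tokens, using the paper's running example sentence. Here the spans are passed in explicitly rather than sampled, the sentinel names `<X>`, `<Y>`, `<Z>` are placeholders, and `corrupt_spans` is a hypothetical helper. The closing sentinel at the end of the target follows the paper's description of the format.

```python
SENTINELS = ["<X>", "<Y>", "<Z>"]  # placeholder sentinel names for this sketch


def corrupt_spans(tokens, spans):
    """Build (input, target) token lists for span corruption.

    spans: sorted, non-overlapping (start, end) half-open index pairs.
    Each span is replaced by one sentinel in the input; the target lists
    each sentinel followed by the tokens it replaced, then a final sentinel.
    """
    if len(spans) >= len(SENTINELS):
        raise ValueError("not enough sentinel tokens for this many spans")
    inputs, targets = [], []
    cursor = 0
    for sentinel, (start, end) in zip(SENTINELS, spans):
        inputs.extend(tokens[cursor:start])  # keep uncorrupted text
        inputs.append(sentinel)              # mark the dropped span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the dropped span itself
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(SENTINELS[len(spans)])    # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(" ".join(corrupted))  # Thank you <X> me to your party <Y> week .
print(" ".join(target))     # <X> for inviting <Y> last <Z>
```

Because the target holds only the dropped spans, it is much shorter than the input, which keeps the decoder side of pre-training cheap.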

Training Procedure

Baseline: 12 encoder layers, 12 decoder layers, $d_{\mathrm{model}}=768$, $d_{\mathrm{ff}}=3072$, 12 heads.
Baseline size: $\approx 220$M parameters.
Dropout: $0.1$.
Vocabulary: 32k SentencePiece tokens.
Optimizer: AdaFactor.
Schedule: inverse square root, $10^4$ warmup steps.
Baseline pre-training: length 512, batch $2^{16}$ tokens, $2^{19}$ steps.
Final pre-training: $10^6$ steps, batch $2^{11}$ sequences of length 512.
Final denoising: corrupt 15% of tokens, mean span length 3.
Model sizes: 60M, 220M, 770M, 2.8B, 11B.
Decoding for CNN/DM and WMT: beam width 4, length penalty $\alpha = 0.6$.
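The schedule and the token budgets implied by the settings above can be worked out directly. The sketch assumes the inverse-square-root schedule takes the form $1/\sqrt{\max(n, k)}$ with $k = 10^4$, so the rate stays constant during warmup and decays afterwards; the function and variable names are illustrative.

```python
import math


def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    # Assumed form: constant 1/sqrt(k) for the first k steps, then 1/sqrt(step).
    return 1.0 / math.sqrt(max(step, warmup_steps))


# Tokens seen during pre-training, from the listed batch sizes and step counts.
baseline_tokens = 2**16 * 2**19        # tokens per batch x steps
final_tokens = 10**6 * 2**11 * 512     # steps x sequences per batch x length

print(inverse_sqrt_lr(1), inverse_sqrt_lr(40_000))
print(f"baseline: {baseline_tokens:.3e} tokens")
print(f"final:    {final_tokens:.3e} tokens")
```

The baseline budget is $2^{35} \approx 3.4 \times 10^{10}$ tokens, while the final runs see roughly $10^{12}$ tokens, about 30 times more.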

Evaluation

Datasets

Pre-training: C4.
Benchmarks: GLUE, SuperGLUE, CNN/Daily Mail, SQuAD, WMT En-De, WMT En-Fr, WMT En-Ro.

Metrics

GLUE: benchmark average and task scores.
SuperGLUE: benchmark average.
CNN/Daily Mail: ROUGE-2.
SQuAD: F1.
WMT: BLEU.
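For reference, a simplified sketch of the token-overlap F1 used for extractive QA. The official SQuAD scorer also strips punctuation and articles and takes the maximum over several reference answers; this version only lowercases and splits on whitespace, and `token_f1` is an illustrative name.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer string."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the text-to-text transformer", "text-to-text transformer")
print(round(score, 2))  # precision 2/3, recall 1
```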

Headline results

GLUE test: $89.7$.
SuperGLUE test: $89.3$.
CNN/Daily Mail test: ROUGE-2 $21.55$.
SQuAD test: F1 $90.54$.
WMT En-De / En-Fr / En-Ro test: BLEU $32.3 / 44.6 / 41.0$.

Ablations

Objective: span corruption slightly beats i.i.d. denoising and uses shorter targets.
Data quality: C4 beats less filtered Common Crawl variants on average.
Data repetition: repeating small unlabeled corpora hurts transfer.
Scale: increasing model size helps more than only training longer or only using larger batches.

Method Strengths and Weaknesses

Strengths

One interface covers translation, QA, summarization, classification, and regression.
Encoder-decoder transfer beats strong encoder-only and decoder-only alternatives in the study.
Ablations isolate objective, data, architecture, and scale effects.
T5-11B reaches $89.7$ GLUE and $89.3$ SuperGLUE.

Weaknesses

Best results depend on 11B parameters and about $10^{12}$ pre-training tokens.
Study centers on English tasks despite web-scale pre-training.
Unified text outputs can add decoding overhead for simple classification.
Contribution is recipe synthesis and scaling, not a new backbone.

Suggestions from the authors

Develop pre-training objectives more efficient than span denoising.
Extend the framework to multilingual transfer.
Reduce fine-tuning and inference cost for large models.
Use domain-specific unlabeled data without harmful repetition.
