LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen

2021 · ICLR

Problem

Framing

Full fine-tuning of 100B-scale language models produces a separate, full-size copy of nearly all weights for every task, which makes storage and optimization expensive. LoRA addresses this cost by freezing the backbone and learning low-rank updates inside Transformer projections; it matches or beats full fine-tuning with up to $10{,}000\times$ fewer trainable parameters.

Currently Used Methods

Direct antecedents

@brownGPT3_2020 — 175B autoregressive LM that makes adaptation operationally costly.
- Limitation in context: standard tuning still duplicates the full checkpoint.
@devlinBERT2018 — pretrain-then-fine-tune paradigm for downstream NLP.
- Limitation in context: full updates remain memory-heavy at large scale.
@vaswaniAttentionAllNeed2017 — Transformer with dense attention projection matrices.
- Limitation in context: offers no parameter-efficient adaptation rule.
"Prefix-Tuning: Optimizing Continuous Prompts for Generation" — tunes continuous prompts instead of backbone weights.
- Limitation in context: consumes part of the available sequence length, and performance degrades as the prompt budget grows.
"Parameter-Efficient Transfer Learning for NLP" — inserts adapter bottlenecks between Transformer blocks.
- Limitation in context: added layers keep inference slower than merged weight updates.

Proposed Method

Architecture

LoRA freezes each pretrained matrix $W_0 \in \mathbb{R}^{d \times k}$ and learns a low-rank update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d,k)$. The paper applies LoRA mainly to self-attention projections, especially $W_q$ and $W_v$, and merges $BA$ into $W_0$ at inference.
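To make the size of the update concrete, the sketch below counts trainable parameters for a dense update versus a rank-$r$ factorization. The helper names are hypothetical, and the GPT-3 shape used here (96 layers, model width 12288) is taken from the GPT-3 paper, not from this summary, so treat it as an assumption.

```python
def full_update_params(d: int, k: int) -> int:
    """Parameters in a dense update Delta W of shape (d, k)."""
    return d * k


def lora_update_params(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


# Assumed GPT-3 175B shape (not stated in this summary).
N_LAYERS = 96
D_MODEL = 12288
MATRICES_PER_LAYER = 2  # only W_q and W_v are adapted


def gpt3_lora_params(r: int) -> int:
    """Total LoRA parameters when W_q and W_v get rank-r updates in every layer."""
    per_matrix = lora_update_params(D_MODEL, D_MODEL, r)
    return N_LAYERS * MATRICES_PER_LAYER * per_matrix


print(gpt3_lora_params(1))  # 4718592
print(gpt3_lora_params(8))  # 37748736
print(full_update_params(D_MODEL, D_MODEL))  # one dense 12288 x 12288 update
```

Under these assumptions, $r=1$ and $r=8$ give roughly 4.7M and 37.7M trainable parameters, which lines up with the two LoRA rows reported in Table 4.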

Results page showing GPT-3 175B adaptation comparisons: a table over WikiSQL, MNLI-m, and SAMSum, plus scaling plots of validation accuracy versus trainable parameters.

Loss / Objective

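The downstream objective remains the standard conditional language-modeling objective, now optimized over the small LoRA parameter set $\Theta$ instead of the full weights. Here $\Phi_0$ denotes the frozen pretrained weights, $\Delta \Phi(\Theta)$ the task-specific increment encoded by $\Theta$, and $\mathcal{Z}$ the set of context-target pairs $(x, y)$.

$$\max_{\Theta} \sum_{(x,y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log p_{\Phi_0 + \Delta \Phi(\Theta)}(y_t \mid x, y_{<t})$$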
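The inner sum of the objective is simply the log-likelihood of the target tokens. A minimal NumPy sketch of that sum for one $(x, y)$ pair follows; the function name and the assumption that the model already produced one row of logits per target position are illustrative, not part of the paper.

```python
import numpy as np


def sequence_log_likelihood(logits: np.ndarray, targets: np.ndarray) -> float:
    """Sum over t of log p(y_t | x, y_<t).

    logits:  array of shape (T, V), one row of vocabulary scores per target step.
    targets: integer array of shape (T,), the gold token ids y_1..y_T.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of each gold token and sum over time steps.
    return float(log_probs[np.arange(len(targets)), targets].sum())


# Toy check: uniform logits over a 4-token vocabulary, 3 target tokens.
toy_logits = np.zeros((3, 4))
toy_targets = np.array([0, 2, 1])
toy_ll = sequence_log_likelihood(toy_logits, toy_targets)
```

Training maximizes this quantity summed over all pairs in $\mathcal{Z}$, with gradients flowing only into $\Theta$.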

Algorithm

Each adapted layer adds a low-rank residual to the frozen linear map; in practice the residual term $\Delta W x$ is additionally scaled by $\alpha / r$.

$$h = W_0 x + \Delta W x = W_0 x + BAx$$
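A minimal NumPy sketch of this forward pass is shown below, including the $\alpha / r$ scaling listed under Training Procedure. The function name, shapes, and random values are hypothetical and chosen only for illustration.

```python
import numpy as np


def lora_forward(x, W0, A, B, alpha):
    """Compute h = W0 x + (alpha / r) * B A x without forming the d x k product BA."""
    r = A.shape[0]
    low_rank = B @ (A @ x)  # project down to r dims, then back up to d dims
    return W0 @ x + (alpha / r) * low_rank


rng = np.random.default_rng(0)
d, k, r = 6, 5, 2
W0 = rng.normal(size=(d, k))  # frozen pretrained weight
A = rng.normal(size=(r, k))   # trainable down-projection
B = rng.normal(size=(d, r))   # trainable up-projection
x = rng.normal(size=(k,))

h = lora_forward(x, W0, A, B, alpha=4.0)
```

Evaluating `B @ (A @ x)` keeps the extra cost proportional to $r(d + k)$ per input vector instead of $dk$.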

Training Procedure

Initialize $A$ with random Gaussian values.
Initialize $B$ to zero, so $\Delta W = BA$ is zero at the start of training.
Scale $\Delta W x$ by $\alpha / r$, where $\alpha$ is a constant.
Freeze pretrained weights $W_0$.
GPT-3: AdamW, batch size $128$, $2$ epochs, $250{,}000$ warmup tokens.
GPT-2: AdamW, linear schedule, $5$ epochs.
GLUE: AdamW, linear schedule, warmup ratio $0.06$ or $0.1$.
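The initialization and freezing steps above can be sketched on a toy squared-error problem. This is an illustration under stated assumptions, not the paper's training code: plain SGD stands in for AdamW, the Gaussian standard deviation of 0.01 and the learning rate are arbitrary choices, and the gradients are written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 4, 3, 2, 2.0
scale = alpha / r

W0 = rng.normal(size=(d, k))             # pretrained weight, kept frozen
W0_before = W0.copy()
A = rng.normal(scale=0.01, size=(r, k))  # Gaussian init (std is an assumption)
B = np.zeros((d, r))                     # zero init, so BA = 0 at step 0

x = rng.normal(size=(k,))
y = rng.normal(size=(d,))

# Forward pass at initialization: identical to the frozen model.
h_init = W0 @ x + scale * (B @ (A @ x))

# Hand-derived gradients of 0.5 * ||h - y||^2 with respect to B and A.
err = h_init - y
grad_B = scale * np.outer(err, A @ x)    # shape (d, r)
grad_A = scale * np.outer(B.T @ err, x)  # shape (r, k); zero while B is zero

# One plain SGD step (stand-in for AdamW); W0 receives no update.
lr = 0.1
B = B - lr * grad_B
A = A - lr * grad_A
```

Because $B$ starts at zero, the first step moves only $B$; $A$ begins to receive gradient once $B$ is nonzero.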

Evaluation

Datasets

GLUE with RoBERTa base/large and DeBERTa XXL.
E2E, WebNLG, DART with GPT-2.
WikiSQL, MultiNLI-m, SAMSum with GPT-3 175B.

Metrics

Accuracy for GLUE, WikiSQL, MultiNLI.
ROUGE-1/2/L for SAMSum.
BLEU, NIST, METEOR, ROUGE-L, CIDEr for E2E.
BLEU, METEOR, and TER for WebNLG and DART.
Inference latency of a forward pass for GPT-2 medium.

Headline results

GLUE, RoBERTa base: average $87.2$ with $0.3$M trainable parameters.
GPT-3 175B, WikiSQL: $73.4\%$ with $4.7$M LoRA parameters; $74.0\%$ with $37.7$M.
GPT-3 175B, MultiNLI-m: $91.7\%$ with LoRA vs. $89.5\%$ full fine-tuning.
GPT-3 175B, SAMSum: $53.8/29.8/45.9$ ROUGE-1/2/L vs. $52.0/28.0/44.5$ full fine-tuning.
GPT-2 medium latency: LoRA matches fine-tuning latency; adapters add overhead.

Table 4: Performance of different adaptation methods on GPT-3 175B

Model & Method	# Trainable Parameters	WikiSQL Acc. (%)	MNLI-m Acc. (%)	SAMSum R1/R2/RL
GPT-3 (FT)	175,255.8M	73.8	89.5	52.0/28.0/44.5
GPT-3 (BitFit)	14.2M	71.3	91.0	51.3/27.4/43.5
GPT-3 (PreEmbed)	3.2M	63.1	88.6	48.3/24.2/40.5
GPT-3 (PreLayer)	20.2M	70.1	89.5	50.8/27.3/43.5
GPT-3 (Adapter$^H$)	7.1M	71.9	89.8	53.0/28.9/44.8
GPT-3 (Adapter$^H$)	40.1M	73.2	91.5	53.2/29.0/45.1
GPT-3 (LoRA)	4.7M	73.4	91.7	53.8/29.8/45.9
GPT-3 (LoRA)	37.7M	74.0	91.6	53.4/29.2/45.1
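As a quick arithmetic check on the table, the sketch below computes how many times fewer trainable parameters each LoRA row uses compared with full fine-tuning. The variable names are hypothetical; the numbers are copied from Table 4.

```python
# Trainable parameters in millions, copied from Table 4.
FULL_FT_PARAMS_M = 175_255.8
lora_rows_m = {"LoRA (4.7M)": 4.7, "LoRA (37.7M)": 37.7}

# Reduction factor = full fine-tuning parameters / LoRA parameters.
reduction = {name: FULL_FT_PARAMS_M / p for name, p in lora_rows_m.items()}

for name, ratio in reduction.items():
    print(f"{name}: about {ratio:,.0f}x fewer trainable parameters")
```

The two rows work out to reductions of roughly 37,000 and 4,600 times, which bracket the $10{,}000\times$ figure quoted in the framing.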

Ablations

Weight choice: adapting $W_q$ and $W_v$ together beats adapting only $W_q$ or only $W_k$ at a fixed parameter budget.
Rank $r$: even $r=1$ stays competitive on WikiSQL and MultiNLI.
Parameter scaling: LoRA stays strong as trainable parameters increase; prefix methods deteriorate.
Correlation study: $\Delta W$ amplifies useful directions already present but underemphasized in $W$.
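One way to quantify the correlation finding is to project $W$ onto the top-$r$ singular directions of $\Delta W$ and compare Frobenius norms. The sketch below follows that idea as a reading of the paper's analysis; the function name and the toy matrices are hypothetical, and no real model weights are involved.

```python
import numpy as np


def amplification_factor(W: np.ndarray, delta_W: np.ndarray, r: int) -> float:
    """Return ||delta_W||_F / ||U_r^T W V_r||_F.

    U_r and V_r hold the top-r left and right singular vectors of delta_W, so the
    denominator measures how strongly W already expresses those directions.
    """
    U, _, Vt = np.linalg.svd(delta_W, full_matrices=False)
    U_r, Vt_r = U[:, :r], Vt[:r, :]
    projected = U_r.T @ W @ Vt_r.T  # shape (r, r)
    return float(np.linalg.norm(delta_W) / np.linalg.norm(projected))


# Toy case: W is the identity and delta_W boosts a single direction by 3.
W_toy = np.eye(4)
delta_toy = np.zeros((4, 4))
delta_toy[0, 0] = 3.0
toy_factor = amplification_factor(W_toy, delta_toy, r=1)
```

A large factor means $\Delta W$ strongly boosts directions that carry little weight in $W$, which is the pattern the ablation describes.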

Method Strengths and Weaknesses

Strengths

Matches or beats full fine-tuning on GPT-3 175B tasks.
Cuts trainable parameters by up to $10{,}000\times$.
Merges into base weights, so inference latency stays unchanged.
Very small ranks already work on multiple tasks.
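The unchanged-latency claim rests on folding the update into the frozen weight before deployment. A minimal NumPy sketch of merging and un-merging follows; shapes and values are hypothetical, and the $\alpha / r$ scaling is included as listed under Training Procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 6, 5, 2, 4.0
scale = alpha / r

W0 = rng.normal(size=(d, k))  # frozen pretrained weight
A = rng.normal(size=(r, k))   # trained LoRA factors (random stand-ins here)
B = rng.normal(size=(d, r))
x = rng.normal(size=(k,))

# Training-time path: two extra small matrix products per adapted layer.
h_unmerged = W0 @ x + scale * (B @ (A @ x))

# Deployment path: fold the update into one dense matrix of the original shape.
W_merged = W0 + scale * (B @ A)
h_merged = W_merged @ x

# Subtracting the same update recovers the base weight, e.g. to swap tasks.
W_restored = W_merged - scale * (B @ A)
```

After merging, the layer is a single matrix multiply with the same shape as $W_0$, so no extra layers sit on the inference path.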

Weaknesses

Weight-matrix selection is mostly heuristic.
Main evidence centers on attention projections, not all dense layers.
Small-rank success is shown only on tasks with a modest shift from the pretraining distribution.
Benefits depend on useful directions already existing in pretrained weights.

Suggestions from the authors

Combine LoRA with other efficient adaptation methods.
Explain how pretrained features transform during LoRA adaptation.
Find principled rules for selecting target weight matrices.
Test whether pretrained weights themselves are rank-deficient.
