LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen

2021 · ICLR


Problem

Framing

Full fine-tuning of 100B-scale language models produces a near-complete copy of the weights for every task, making storage and optimization expensive. LoRA addresses this cost by freezing the backbone and learning low-rank updates inside Transformer projections, matching or beating full fine-tuning with up to $10{,}000\times$ fewer trainable parameters.

Currently Used Methods

Direct antecedents

Proposed Method

Architecture

LoRA freezes each pretrained matrix $W_0 \in \mathbb{R}^{d \times k}$ and learns a low-rank update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d,k)$. The paper applies LoRA mainly to self-attention projections, especially $W_q$ and $W_v$, and merges $BA$ into $W_0$ at inference.
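To make the parameter savings concrete, the sketch below counts trainable parameters for a single $d \times k$ matrix and for a whole model. The configuration is an assumption, not something stated in this summary: a hidden size of 12288, 96 layers, and rank $r = 1$ on $W_q$ and $W_v$ only, chosen because it lands on the 4.7M figure reported for GPT-3 175B. The function names are illustrative.

```python
def full_params(d, k):
    """Trainable parameters when the whole d x k matrix is tuned."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters for B (d x r) plus A (r x k)."""
    return r * (d + k)


# Assumed configuration (not given in this summary).
d_model = 12288          # assumed hidden size, so W_q and W_v are d_model x d_model
n_layers = 96            # assumed number of Transformer layers
adapted_per_layer = 2    # W_q and W_v
rank = 1                 # assumed LoRA rank

per_matrix_full = full_params(d_model, d_model)
per_matrix_lora = lora_params(d_model, d_model, rank)
total_lora = n_layers * adapted_per_layer * per_matrix_lora

print(f"one matrix, full tuning: {per_matrix_full:,}")
print(f"one matrix, LoRA r={rank}: {per_matrix_lora:,}")
print(f"whole model, LoRA:       {total_lora:,}")
```

Because the LoRA count grows as $r(d + k)$ rather than $dk$, doubling the rank doubles the trainable parameters while the frozen matrix size is unchanged.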

Results page showing GPT-3 175B adaptation comparisons: a table over WikiSQL, MNLI-m, and SAMSum, plus scaling plots of validation accuracy versus trainable parameters.

Loss / Objective

The downstream objective stays the standard conditional language-model objective under the LoRA reparameterization.

$$\max_{\Theta} \sum_{(x,y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log p_{\Phi_0 + \Delta \Phi(\Theta)}\left(y_t \mid x, y_{<t}\right)$$
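The double sum can be read as "add up the log-probability of every target token, over every example". A minimal sketch follows, assuming the per-token probabilities $p(y_t \mid x, y_{<t})$ have already been produced by some model; the function names and the toy numbers are hypothetical.

```python
import math


def sequence_log_likelihood(token_probs):
    """Inner sum: log p(y_t | x, y_<t) over the target tokens of one example."""
    return sum(math.log(p) for p in token_probs)


def dataset_objective(dataset_token_probs):
    """Outer sum over all (x, y) pairs in the dataset Z."""
    return sum(sequence_log_likelihood(probs) for probs in dataset_token_probs)


# Toy data: two examples, with two and one target tokens respectively.
toy_probs = [[0.5, 0.25], [0.125]]
objective = dataset_objective(toy_probs)
print(objective)
```

Under LoRA only $\Theta$ (the entries of $A$ and $B$) would be updated to increase this value; $\Phi_0$ stays fixed.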

Algorithm

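Each adapted layer adds a low-rank residual to the frozen linear map. The paper additionally scales the update $\Delta W x$ by a constant factor $\alpha / r$, which the equation below omits.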

$$h = W_0 x + \Delta W x = W_0 x + BAx$$
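The forward pass and the inference-time merge can be sketched in a few lines. This is a toy, pure-Python illustration under stated assumptions: the helper names (`matvec`, `matmul`, `lora_forward`, `merge`) and the tiny dimensions are hypothetical, all matrices are random, and the $\alpha / r$ scaling is left out to match the equation as written.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(P, Q):
    """Multiply matrix P (n x m) by matrix Q (m x p)."""
    cols = list(zip(*Q))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in P]


def lora_forward(W0, B, A, x):
    """h = W0 x + B (A x). W0 is frozen; only B and A would be trained."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


def merge(W0, B, A):
    """Fold the low-rank update into the frozen weight: W = W0 + B A."""
    BA = matmul(B, A)
    return [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]


rng = random.Random(0)
d, k, r = 4, 3, 2  # toy sizes with r < min(d, k)
W0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]
A = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(r)]
x = [rng.gauss(0, 1) for _ in range(k)]

h_adapter = lora_forward(W0, B, A, x)   # two-branch form used during training
h_merged = matvec(merge(W0, B, A), x)   # single matrix used at inference
max_gap = max(abs(a - b) for a, b in zip(h_adapter, h_merged))
print(max_gap)
```

The two outputs agree up to floating-point rounding, which is why a merged model needs no extra matrix multiply at inference compared with the original layer.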

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 4: Performance of different adaptation methods on GPT-3 175B

| Model & Method | # Trainable Parameters | WikiSQL Acc. (%) | MNLI-m Acc. (%) | SAMSum R1/R2/RL |
|---|---|---|---|---|
| GPT-3 (FT) | 175,255.8M | 73.8 | 89.5 | 52.0/28.0/44.5 |
| GPT-3 (BitFit) | 14.2M | 71.3 | 91.0 | 51.3/27.4/43.5 |
| GPT-3 (PreEmbed) | 3.2M | 63.1 | 88.6 | 48.3/24.2/40.5 |
| GPT-3 (PreLayer) | 20.2M | 70.1 | 89.5 | 50.8/27.3/43.5 |
| GPT-3 (Adapter$^{H}$) | 7.1M | 71.9 | 89.8 | 53.0/28.9/44.8 |
| GPT-3 (Adapter$^{H}$) | 40.1M | 73.2 | 91.5 | 53.2/29.0/45.1 |
| GPT-3 (LoRA) | 4.7M | 73.4 | 91.7 | 53.8/29.8/45.9 |
| GPT-3 (LoRA) | 37.7M | 74.0 | 91.6 | 53.4/29.2/45.1 |
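As a quick reading aid for Table 4, the snippet below computes how many times fewer trainable parameters each LoRA row uses than full fine-tuning. The numbers are copied from the table (in millions); the dictionary keys are just labels chosen for this sketch.

```python
# Trainable parameters in millions, taken from Table 4.
trainable_millions = {
    "FT": 175255.8,
    "LoRA (4.7M)": 4.7,
    "LoRA (37.7M)": 37.7,
}

full_tuning = trainable_millions["FT"]
reduction = {
    name: full_tuning / count
    for name, count in trainable_millions.items()
    if name != "FT"
}

for name, factor in reduction.items():
    print(f"{name}: about {factor:,.0f}x fewer trainable parameters than FT")
```

Both LoRA rows sit three to four orders of magnitude below full fine-tuning in trainable parameters while scoring within about a point of it, or above it, on every reported metric.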

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers