QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni

2023 · NeurIPS


Problem

Framing

Finetuning 33B–65B LLMs is memory-bound: regular 16-bit finetuning of LLaMA-65B needs more than 780 GB of GPU memory, and quantization methods built for inference break down during training. QLoRA closes this gap by freezing a 4-bit quantized base model and backpropagating through its dequantized weights into LoRA adapters, which fits 65B finetuning in under 48 GB without degrading the reported 16-bit task performance.

Currently Used Methods

Foundational

@huLoRA2021 — low-rank adapters for parameter-efficient finetuning.
- Limitation in context: base weights stay at higher precision, so memory still grows with full model size.
LLM.int8() — 8-bit quantization that preserves LLM inference quality.
- Limitation in context: designed for inference; training through quantized weights breaks down.
SmoothQuant — activation-weight rescaling for low-bit inference.
- Limitation in context: does not provide stable 4-bit finetuning.
Full 16-bit finetuning — updates all pretrained parameters directly.
- Limitation in context: 65B-scale memory exceeds single-GPU budgets.
Query/key-only LoRA placement — sparse adapter insertion in attention blocks.
- Limitation in context: leaves measurable accuracy gaps versus full finetuning.

Proposed Method

Architecture

QLoRA stores the pretrained transformer in 4-bit NF4, dequantizes weights to BF16 for compute, and trains only LoRA adapters. It adds double-quantized scaling constants and paged optimizers, and places adapters in all layers to recover full-finetuning accuracy.

Verified figure: memory-layout comparison of full finetuning, LoRA, and QLoRA; QLoRA uses a 4-bit transformer, LoRA adapters, and CPU paging for optimizer-state spikes.

Loss / Objective

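The trainable map is the LoRA-augmented frozen linear layer: $\mathbf{W}$ is the frozen base weight, $\mathbf{L}_1$ and $\mathbf{L}_2$ are the trainable low-rank adapter matrices, and $s$ is a scalar scaling factor.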

$$\mathbf{Y} = \mathbf{X}\mathbf{W} + s\,\mathbf{X}\mathbf{L}_1\mathbf{L}_2$$
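The layer equation above can be sketched in plain Python on toy matrices. The sizes, the value of `s`, and the zero initialization of `l2` (borrowed from common LoRA practice so the adapter starts as a no-op) are assumptions of this sketch, not settings from the paper.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w, l1, l2, s):
    """Compute Y = XW + s * X L1 L2, with W frozen and L1, L2 trainable."""
    base = matmul(x, w)                  # frozen path
    update = matmul(matmul(x, l1), l2)   # low-rank path through rank r
    return [[bv + s * uv for bv, uv in zip(brow, urow)]
            for brow, urow in zip(base, update)]

random.seed(0)
b, h, o, r = 2, 8, 6, 2   # batch, input dim, output dim, LoRA rank (toy sizes)
x = [[random.gauss(0, 1) for _ in range(h)] for _ in range(b)]
w = [[random.gauss(0, 1) for _ in range(o)] for _ in range(h)]
l1 = [[random.gauss(0, 1) for _ in range(r)] for _ in range(h)]
l2 = [[0.0] * o for _ in range(r)]   # assumption: zero init, adapter starts as a no-op

y = lora_forward(x, w, l1, l2, s=0.5)
full_params = h * o          # parameters in the frozen weight
lora_params = r * (h + o)    # trainable parameters in the adapter
```

With `l2` at zero the output equals the frozen path `matmul(x, w)`, and only `r * (h + o)` adapter parameters would receive gradients.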

Quantization Rule

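NF4 sets the levels of a $k$-bit data type (here $k = 4$) from quantiles of the standard normal distribution, where $Q_{\mathcal{N}}$ is its quantile function. The forward pass then computes with frozen weights recovered by double dequantization: the quantized constants $c_2$ are first dequantized with the FP32 constants $c_1$, and the result rescales the NF4 weights.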

$$q_i = \frac{1}{2}\left(Q_{\mathcal{N}}\!\left(\frac{i}{2^k+1}\right) + Q_{\mathcal{N}}\!\left(\frac{i+1}{2^k+1}\right)\right)$$
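A minimal sketch of the quantile rule above, using the standard library's `statistics.NormalDist`. The index range is an assumption: $i = 0$ would hit $Q_{\mathcal{N}}(0) = -\infty$, so the sketch uses $i = 1, \dots, 2^k - 1$ and rescales to $[-1, 1]$. It illustrates the formula only and does not reproduce the released NF4 table.

```python
from statistics import NormalDist

def quantile_levels(k):
    """Midpoints of adjacent standard-normal quantiles, rescaled to [-1, 1]."""
    q = NormalDist().inv_cdf   # quantile function Q_N of N(0, 1)
    n = 2 ** k + 1
    # Assumption of this sketch: start at i = 1, because Q_N(0) is -infinity.
    raw = [0.5 * (q(i / n) + q((i + 1) / n)) for i in range(1, 2 ** k)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]

levels = quantile_levels(4)
```

The resulting 15 levels are symmetric around zero and denser near zero than near the ends, which is the property that suits roughly normal weight distributions.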

$$\mathbf{Y}^{\mathrm{BF16}} = \mathbf{X}^{\mathrm{BF16}}\,\operatorname{doubleDequant}\!\left(c_1^{\mathrm{FP32}}, c_2^{k\text{-bit}}, \mathbf{W}^{\mathrm{NF4}}\right) + \mathbf{X}^{\mathrm{BF16}}\mathbf{L}_1^{\mathrm{BF16}}\mathbf{L}_2^{\mathrm{BF16}}$$
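The storage format behind `doubleDequant` can be sketched as two rounds of block-wise absmax quantization. This is a simplified illustration under loud assumptions: uniform codebooks stand in for both the NF4 table and the 8-bit float format, the constants are not mean-centered before the second round, and all names are hypothetical.

```python
import random

W_BLOCK = 64    # weights per first-level constant
C_BLOCK = 256   # first-level constants per second-level constant
LEVELS4 = [-1 + 2 * j / 15 for j in range(16)]     # stand-in 4-bit codebook, NOT the NF4 table
LEVELS8 = [-1 + 2 * j / 255 for j in range(256)]   # stand-in 8-bit codebook, NOT FP8

def quantize(values, block, levels):
    """Block-wise absmax quantization: nearest-level codes plus one constant per block."""
    codes, consts = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        c = max(abs(v) for v in chunk) or 1.0   # guard against an all-zero block
        consts.append(c)
        codes += [min(range(len(levels)), key=lambda j: abs(levels[j] - v / c))
                  for v in chunk]
    return codes, consts

def dequant(consts, codes, block, levels):
    """Map each code back to its level and rescale by the block's constant."""
    return [levels[code] * consts[i // block] for i, code in enumerate(codes)]

def double_dequant(c1, c2_codes, w_codes):
    """dequant(dequant(c1, c2), W): recover the constants first, then the weights."""
    c_first = dequant(c1, c2_codes, C_BLOCK, LEVELS8)
    return dequant(c_first, w_codes, W_BLOCK, LEVELS4)

random.seed(0)
w = [random.gauss(0, 0.02) for _ in range(8 * W_BLOCK)]
w_codes, c_first = quantize(w, W_BLOCK, LEVELS4)      # 4-bit codes and per-block constants
c2_codes, c1 = quantize(c_first, C_BLOCK, LEVELS8)    # constants quantized a second time
w_hat = double_dequant(c1, c2_codes, w_codes)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

In the sketch, `w_hat` plays the role of the dequantized weight that multiplies the input; the adapter term is added separately and is the only part that trains.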

Training Procedure

Base weights: frozen 4-bit NF4.
Quantization constants: FP8 under double quantization.
Weight block size: 64.
Quantization-constant block size: 256.
LoRA dropout search: $\{0, 0.05, 0.1\}$.
LoRA rank search: $\{8, 16, 32, 64, 128, 256\}$.
LoRA layer search: key+query, all attention, all FFN, all layers, attention+FFN outputs.
Guanaco 7B example: batch size 16, learning rate $2 \times 10^{-4}$.
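The block sizes listed above determine how many extra bits per parameter the quantization constants cost. A small worked calculation, assuming 32-bit constants without double quantization and 8-bit constants plus 32-bit second-level constants with it (the function name is hypothetical):

```python
def constant_bits_per_param(weight_block=64, const_block=256, double_quant=True):
    """Extra bits per parameter spent on quantization constants."""
    if not double_quant:
        return 32 / weight_block   # one 32-bit constant per weight block
    # One 8-bit constant per weight block, plus one 32-bit constant
    # per block of const_block first-level constants.
    return 8 / weight_block + 32 / (weight_block * const_block)

plain = constant_bits_per_param(double_quant=False)   # 0.5
dq = constant_bits_per_param()                        # 0.126953125
saved = plain - dq                                    # 0.373046875
```

The saving of roughly 0.37 bits per parameter matches the figure quoted for double quantization in these notes.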

Evaluation

Datasets

GLUE with RoBERTa-large.
Super-NaturalInstructions with T5-80M to T5-11B.
5-shot MMLU after LLaMA finetuning on FLAN v2 and Alpaca.
Vicuna prompts for chatbot tournaments.
OASST1 validation prompts for chatbot evaluation.

Metrics

Accuracy on GLUE.
ROUGE-L on Super-NaturalInstructions.
5-shot accuracy on MMLU.
Elo from GPT-4 and human pairwise judgments.
GPU memory footprint in GB.
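For the Elo metric, a minimal sketch of a rating tournament over pairwise judgments, using the standard Elo update. The starting rating of 1000, the factor `k=32`, the shuffled match order, and the toy match list are assumptions of this sketch; the system names are hypothetical.

```python
import random

def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def run_tournament(matches, k=32, start=1000.0, seed=0):
    """Apply Elo updates over matches given as (system_a, system_b, score_a).

    score_a is 1.0 for a win by A, 0.0 for a loss, 0.5 for a tie.
    """
    rng = random.Random(seed)
    order = list(matches)
    rng.shuffle(order)   # match order affects Elo, so it is randomized per seed
    ratings = {}
    for a, b, score_a in order:
        ra, rb = ratings.get(a, start), ratings.get(b, start)
        ea = expected_score(ra, rb)
        ratings[a] = ra + k * (score_a - ea)
        ratings[b] = rb - k * (score_a - ea)   # zero-sum update for B
    return ratings

# Hypothetical judgments between three systems.
matches = [("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 0.5), ("C", "A", 0.0)]
ratings = run_tournament(matches)
```

Because the result depends on match order, averaging `run_tournament` over many seeds gives a more stable ranking than a single pass.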

Headline results

LLaMA-65B finetuning: memory drops from $>780$ GB to $<48$ GB.
Vicuna benchmark: Guanaco 65B reaches $99.3\%$ of ChatGPT.
Vicuna benchmark: Guanaco 33B reaches $97.8\%$ of ChatGPT.
GPT-4 Elo tournament: Guanaco 65B scores $1022 \pm 1$; ChatGPT scores $966 \pm 1$.
MMLU datatype study: NF4 + DQ matches BF16; FP4 trails by about 1 point.
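To put the memory figures in context, a back-of-the-envelope estimate of weight storage alone. It assumes a nominal 65e9 parameters and decimal gigabytes, and covers only the base weights; adapters, gradients, optimizer state, and activations are not counted.

```python
def weight_gb(params, bits_per_param):
    """Storage for the base weights alone, in decimal GB."""
    return params * bits_per_param / 8 / 1e9

params = 65e9                            # assumption: nominal parameter count
bf16 = weight_gb(params, 16)             # 130.0
nf4 = weight_gb(params, 4)               # 32.5
nf4_dq = weight_gb(params, 4 + 0.127)    # about 33.5, including constants under double quantization
```

The gap between about 33.5 GB of 4-bit weights and the 48 GB budget is what remains for everything else in training.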

Table 1: Mean zero-shot accuracy over tasks and model scales for 4-bit LLaMA variants. Data types compared: Float, NF4, NF4 + DQ.

Ablations

Data type: NF4 outperforms FP4 and tracks BF16 across MMLU settings.
Double quantization: saves about $0.37$ bits per parameter with no reported loss.
Adapter placement: all-layer LoRA is needed to match 16-bit finetuning.
Dataset choice: OASST1 9k beats a 450k FLAN v2 subset for chatbot quality.

Method Strengths and Weaknesses

Strengths

Finetunes 65B models on one 48 GB GPU.
Matches reported 16-bit performance in the GLUE, Super-NaturalInstructions (T5), and MMLU studies.
NF4 improves 4-bit accuracy over FP4 baselines.
All-layer adapters remove major LoRA accuracy gaps.

Weaknesses

No direct 33B or 65B comparison against full 16-bit finetuning.
Chatbot rankings rely on noisy GPT-4 and human preference judgments.
GPT-4 and human rankings disagree at both system and example levels.
Guanaco inherits bias and failure modes from the base model.

Suggestions from the authors

Test whether QLoRA matches full 16-bit finetuning at 33B and 65B.
Study 3-bit bases and alternative PEFT methods beyond LoRA.
Improve benchmark reliability for chatbot evaluation.
Analyze failure cases and inherited biases more systematically.

