FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu

2022 · NeurIPS

Problem

Framing

Standard exact attention is IO-bound: it materializes the $N \times N$ score matrix in HBM, so memory traffic, not arithmetic, dominates runtime and memory use at long sequence lengths. FlashAttention avoids this by tiling the computation into SRAM-resident blocks and using an online softmax, which preserves exact attention while reducing HBM accesses from $\Theta(Nd + N^2)$ to $\Theta(N^2 d^2 M^{-1})$, where $N$ is the sequence length, $d$ the head dimension, and $M$ the SRAM size.
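To make the two access counts concrete, the sketch below evaluates both leading-order expressions with constants dropped. The function names and the sample values (head dimension 64, an SRAM budget of 100,000 elements) are illustrative assumptions, not figures from the paper, and the resulting ratio is only an order-of-magnitude indication.

```python
def standard_hbm_accesses(n, d):
    """Leading-order HBM accesses of standard attention: N*d + N^2 (constants dropped)."""
    return n * d + n * n


def flash_hbm_accesses(n, d, m):
    """Leading-order HBM accesses of FlashAttention: N^2 * d^2 / M (constants dropped)."""
    return n * n * d * d / m


# Illustrative values only (assumed, not from the paper):
# head dimension 64 and an SRAM budget of 100,000 elements.
d, m = 64, 100_000
for n in (1024, 4096, 16384):
    ratio = standard_hbm_accesses(n, d) / flash_hbm_accesses(n, d, m)
    print(f"N={n:6d}  standard/flash access ratio ~ {ratio:.1f}")
```

The ratio simplifies to $M(N + d) / (N d^2)$, so under these assumptions the advantage holds whenever $d^2$ is well below $M$.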

Currently Used Methods

Foundational

@vaswaniAttentionAllNeed2017: exact scaled dot-product attention over dense $QK^\top$.
- Limitation in context: writes the full score matrix to HBM, causing quadratic memory traffic.

Proposed Method

Architecture

FlashAttention preserves exact self-attention and only changes execution order. It tiles $Q$, $K$, and $V$ into SRAM blocks, fuses the score, masking, softmax, dropout, and value-accumulation steps into one kernel, and writes only output tiles (plus the per-row softmax statistics) to HBM.

Verified architecture figure: GPU memory hierarchy, FlashAttention tiling over Q/K/V blocks in SRAM, and a small GPT-2 attention speedup comparison against PyTorch.

Loss / Objective

The operator is unchanged from standard exact attention:

$$\mathbf{O} = \operatorname{softmax}(\mathbf{Q}\mathbf{K}^\top)\mathbf{V}.$$
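Written out step by step, the operator above factors into three stages. The intermediate names $\mathbf{S}$ and $\mathbf{P}$ follow common usage, and the softmax is taken row-wise; a standard implementation stores both intermediates in full, which is where the $N \times N$ memory traffic comes from.

```latex
\mathbf{S} = \mathbf{Q}\mathbf{K}^\top \in \mathbb{R}^{N \times N}, \qquad
\mathbf{P} = \operatorname{softmax}(\mathbf{S}) \in \mathbb{R}^{N \times N}, \qquad
\mathbf{O} = \mathbf{P}\mathbf{V} \in \mathbb{R}^{N \times d}.
```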

Algorithm

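Its key step is an online softmax merge across key-value tiles, so the full $N \times N$ score matrix is never materialized. For query block $i$ and key-value block $j$, let $\mathbf{m}_i$ and $\mathbf{l}_i$ be the running row-wise maximum and normalizer, and let $\tilde{\mathbf{m}}_{ij}$, $\tilde{\mathbf{l}}_{ij}$, and $\tilde{\mathbf{P}}_{ij}$ be the row maximum, row sum, and unnormalized exponentiated scores of the current tile. The merge is:

$$\mathbf{m}_i^{\mathrm{new}} = \max\big(\mathbf{m}_i, \tilde{\mathbf{m}}_{ij}\big), \qquad \mathbf{l}_i^{\mathrm{new}} = e^{\mathbf{m}_i-\mathbf{m}_i^{\mathrm{new}}}\mathbf{l}_i + e^{\tilde{\mathbf{m}}_{ij}-\mathbf{m}_i^{\mathrm{new}}}\tilde{\mathbf{l}}_{ij},$$

$$\mathbf{O}_i^{\mathrm{new}} = \operatorname{diag}(\mathbf{l}_i^{\mathrm{new}})^{-1}\left(\operatorname{diag}(\mathbf{l}_i)e^{\mathbf{m}_i-\mathbf{m}_i^{\mathrm{new}}}\mathbf{O}_i + e^{\tilde{\mathbf{m}}_{ij}-\mathbf{m}_i^{\mathrm{new}}}\tilde{\mathbf{P}}_{ij}\mathbf{V}_j\right).$$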
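The merge rule can be checked numerically with a small pure-Python sketch. This is a simplified illustration rather than the paper's kernel: it handles one query row at a time (effectively a row block of size 1), loops over queries on the outside, omits masking and dropout, and runs on Python lists instead of GPU memory. The names `naive_attention`, `tiled_attention`, and `block_cols` are hypothetical.

```python
import math


def naive_attention(q, k, v):
    """Reference: softmax(q k^T) v with every score row held in full."""
    out = []
    for q_row in q:
        s = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        peak = max(s)
        p = [math.exp(x - peak) for x in s]
        total = sum(p)
        out.append([sum(w * v_row[c] for w, v_row in zip(p, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block_cols=2):
    """Same result, but visiting keys/values one tile of block_cols rows at a time."""
    d_v = len(v[0])
    out = []
    for q_row in q:
        # Running max m_i, running normalizer l_i, normalized partial output O_i.
        m_i, l_i, o_i = -math.inf, 0.0, [0.0] * d_v
        for start in range(0, len(k), block_cols):
            k_tile = k[start:start + block_cols]
            v_tile = v[start:start + block_cols]
            # Scores of this query row against the current tile only.
            s = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in k_tile]
            m_tile = max(s)                              # tile row maximum
            p_tile = [math.exp(x - m_tile) for x in s]   # unnormalized tile weights
            l_tile = sum(p_tile)                         # tile row sum
            m_new = max(m_i, m_tile)
            alpha = math.exp(m_i - m_new)     # rescales the old statistics
            beta = math.exp(m_tile - m_new)   # rescales the new tile
            l_new = alpha * l_i + beta * l_tile
            pv = [sum(p * v_row[c] for p, v_row in zip(p_tile, v_tile))
                  for c in range(d_v)]
            # O_new = (l_i * alpha * O_i + beta * P_tile V_tile) / l_new
            o_i = [(l_i * alpha * o + beta * x) / l_new for o, x in zip(o_i, pv)]
            m_i, l_i = m_new, l_new
        out.append(o_i)
    return out


q = [[1.0, 0.0], [0.5, -0.5], [2.0, 1.0]]
k = [[0.3, 0.1], [1.0, -1.0], [0.0, 2.0], [-0.7, 0.4], [1.5, 1.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
tiled = tiled_attention(q, k, v, block_cols=2)
exact = naive_attention(q, k, v)
max_err = max(abs(a - b) for r1, r2 in zip(tiled, exact) for a, b in zip(r1, r2))
print(f"max abs difference: {max_err:.2e}")
```

Only one tile of scores exists at any time, yet the result agrees with the reference up to floating-point rounding, which is the sense in which the method stays exact.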

Training Procedure

SRAM budget in analysis: $M$ bytes.
Tile sizes: $B_c = \left\lceil \frac{M}{4d} \right\rceil$, $B_r = \min\left(\left\lceil \frac{M}{4d} \right\rceil, d\right)$.
BERT-large: batch size $448$, LAMB, learning rate $3.75 \times 10^{-3}$, at most $7100$ steps.
GPT-2: effective batch size $512$, AdamW, learning rate $6 \times 10^{-4}$ (small) and $1.5 \times 10^{-4}$ (medium).
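The tile-size formulas for $B_c$ and $B_r$ listed above can be evaluated directly. The sketch below is a plain transcription of those two expressions; the function name `tile_sizes` and the sample SRAM budget of 49,152 are assumptions chosen for illustration, with $M$ taken in whatever unit the formula's $4d$ term is measured in.

```python
import math


def tile_sizes(sram_size, head_dim):
    """Return (B_c, B_r) from B_c = ceil(M / (4d)) and B_r = min(B_c, d)."""
    b_c = math.ceil(sram_size / (4 * head_dim))
    b_r = min(b_c, head_dim)
    return b_c, b_r


# Hypothetical budget: M = 49,152 with head dimension d = 64.
b_c, b_r = tile_sizes(49_152, 64)
print(f"B_c = {b_c}, B_r = {b_r}")
```

The factor of 4 reflects that the query, key, value, and output tiles all have to share the same SRAM budget, and capping $B_r$ at $d$ keeps the per-tile score block from outgrowing it.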

Evaluation

Datasets

Wikipedia for BERT-large pretraining.
OpenWebText for GPT-2 small and medium.
Long Range Arena: ListOps, Text, Retrieval, Image, Pathfinder.
MIMIC-III long-document classification.
ECtHR long-document classification.
Path-X and Path-256 long-context reasoning.

Metrics

Training time.
Speedup over baseline implementations.
OpenWebText perplexity.
Accuracy on LRA and Path tasks.
Micro-$F_1$ on long-document classification.
Attention runtime and memory footprint.

Headline results

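BERT-large: training time $17.4 \pm 1.4$ min vs $20.0 \pm 1.5$ min for the Nvidia MLPerf 1.1 implementation, $15\%$ faster.
GPT-2 small: ppl $18.2$, $2.7$ days, $3.5\times$ faster than HuggingFace.
GPT-2 medium: ppl $14.3$, $6.9$ days, $3.0\times$ faster than HuggingFace.
GPT-2 small, $4\mathrm{K}$ context: ppl $17.5$, still $1.3\times$ faster than Megatron-LM at $1\mathrm{K}$ context.
Path-X / Path-256: $61.4\%$ / $63.1\%$ accuracy at sequence lengths $16\mathrm{K}$ / $64\mathrm{K}$; the Path-256 result uses block-sparse FlashAttention.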

Verified results plot: bar chart of FlashAttention speedup on T4 across sequence lengths, with largest gains when masking and dropout are fused.

Table 3: Long-Range Arena accuracy and speedup

Models	ListOps	Text	Retrieval	Image	Pathfinder	Avg	Speedup
Transformer	36.0	63.6	81.6	42.3	72.7	59.3	-
FlashAttention	37.6	63.9	81.4	43.5	72.7	59.8	2.4×
Block-sparse FlashAttention	37.0	63.0	81.3	43.6	73.3	59.6	2.8×
Linformer [84]	35.6	55.9	77.7	37.8	67.6	54.9	2.5×
Linear Attention [50]	38.8	63.2	80.7	42.6	72.5	59.6	2.3×
Performer [12]	36.8	63.6	82.2	42.1	69.9	58.9	1.8×
Local Attention [80]	36.1	60.2	76.7	40.6	66.6	56.0	1.7×
Reformer [51]	36.5	63.8	78.5	39.6	69.4	57.6	1.3×
Smyrf [19]	36.1	64.1	79.0	39.6	70.5	57.9	1.7×

Ablations

Sequence length sweep: exact FlashAttention stays faster than standard attention at sequence lengths up to at least $2\mathrm{K}$.
Kernel fusion: the measured speedup is largest when masking and dropout are fused into the kernel.
Longer GPT-2 context: going from $1\mathrm{K}$ to $4\mathrm{K}$ improves perplexity from $18.2$ to $17.5$.
Long-document context: MIMIC-III micro-$F_1$ rises from $52.8$ at sequence length $512$ to $57.1$ at $16\mathrm{K}$.

Method Strengths and Weaknesses

Strengths

Exact attention semantics, not an approximation.
Additional memory is $O(N)$ beyond inputs and outputs.
Delivers $3.5\times$ GPT-2 small training speedup at matched perplexity.
Enables the first better-than-chance Transformer results on Path-X and Path-256.

Weaknesses

Dense compute remains quadratic in sequence length.
Requires custom CUDA kernels per attention variant.
Gains depend on GPU SRAM size and memory hierarchy.
Block-sparse extension needs a chosen sparsity pattern.

Suggestions from the authors

Compile high-level attention code into IO-aware CUDA kernels.
Extend IO-aware optimization beyond attention.
Analyze optimal attention execution across multiple GPUs.
Generalize block-sparse kernels to broader structured patterns.
