BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li

2023 · ICML

Problem

Framing

End-to-end vision-language pre-training scales poorly once both the vision tower and the LLM are large. BLIP-2 closes this gap with a lightweight Q-Former that queries frozen image features and emits LLM-compatible tokens. It reports 65.0% zero-shot VQAv2 accuracy with 188M trainable parameters.

Currently Used Methods

Direct antecedents

Proposed Method

Architecture

BLIP-2 has three parts: a frozen image encoder, a Q-Former, and a frozen LLM. The Q-Former uses 32 learned queries, a hidden width of 768, BERT-base initialization, and cross-attention inserted every other block. A linear projection maps the query outputs into the embedding space of the chosen LLM.
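The numbers above can be collected into a small layout sketch. This is only an illustration: the class name `QFormerLayout`, the 12-block depth (standard for BERT-base), the choice to start counting blocks at 0, and the example LLM width of 2560 are assumptions, not details stated in this note.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class QFormerLayout:
    """Hypothetical container for the Q-Former sizes quoted in the text."""
    num_queries: int = 32           # learned queries (from the text)
    hidden_size: int = 768          # Q-Former hidden width (from the text)
    num_layers: int = 12            # assumption: BERT-base depth
    cross_attention_every: int = 2  # "every other block" (from the text)

    def cross_attention_layers(self) -> List[int]:
        # Assumption: block indices start at 0, so even-indexed blocks
        # carry the cross-attention to the frozen image features.
        return [i for i in range(self.num_layers)
                if i % self.cross_attention_every == 0]

    def projection_shape(self, llm_hidden_size: int) -> Tuple[int, int]:
        # Weight shape of the linear layer from Q-Former width to LLM width.
        return (self.hidden_size, llm_hidden_size)

    def soft_prompt_shape(self, llm_hidden_size: int) -> Tuple[int, int]:
        # One projected vector per query is handed to the LLM.
        return (self.num_queries, llm_hidden_size)


layout = QFormerLayout()
example_llm_width = 2560  # hypothetical LLM embedding width
print(layout.cross_attention_layers())
print(layout.projection_shape(example_llm_width))
print(layout.soft_prompt_shape(example_llm_width))
```

Whatever the LLM width, the connector always emits 32 vectors per image, so the visual prefix has a fixed length regardless of image resolution.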

Architecture diagram: a frozen image encoder feeds the Q-Former, whose learned queries pass through alternating self-attention and cross-attention blocks; the three stage-1 objectives each use a different query-text attention mask.
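The query-text masks mentioned above can be sketched as boolean matrices. The rules below follow our reading of the paper (contrastive: queries and text do not see each other; matching: everything sees everything; generation: queries see only queries, text sees all queries plus earlier text). The function name `build_mask` and the short objective keys are hypothetical.

```python
from typing import List


def build_mask(objective: str, num_queries: int, num_text: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Positions 0..num_queries-1 are queries; the rest are text tokens.
    """
    n = num_queries + num_text
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i_is_query = i < num_queries
            j_is_query = j < num_queries
            if objective == "itm":
                # Bidirectional: queries and text attend to each other freely.
                allowed = True
            elif objective == "itc":
                # Unimodal: attention stays within the same modality.
                allowed = i_is_query == j_is_query
            elif objective == "itg":
                if i_is_query:
                    # Queries never look at text.
                    allowed = j_is_query
                else:
                    # Text sees all queries and text up to its own position.
                    allowed = j_is_query or j <= i
            else:
                raise ValueError(f"unknown objective: {objective}")
            mask[i][j] = allowed
    return mask


# Tiny example: 2 queries followed by 3 text tokens.
for name in ("itc", "itm", "itg"):
    print(name)
    for row in build_mask(name, num_queries=2, num_text=3):
        print("".join("1" if v else "0" for v in row))
```

Under the generation mask the text can reach the image only through the queries, which is what pushes the queries to carry the visual information the text needs.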

Loss / Objective

Stage 1 learns query features with three losses: image-text contrastive (ITC), image-text matching (ITM), and image-grounded text generation (ITG). Stage 2 trains with a language modeling loss only, conditioned on the projected query outputs.

$$\mathcal{L}_{\mathrm{stage1}} = \mathcal{L}_{\mathrm{ITC}} + \mathcal{L}_{\mathrm{ITM}} + \mathcal{L}_{\mathrm{ITG}}$$

$$\mathcal{L}_{\mathrm{LM}} = - \sum_{t=1}^{T} \log p_{\theta}\!\left(y_t \mid y_{<t}, \mathbf{z}\right)$$
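As a small numeric illustration of the two objectives, the sketch below sums three stage-1 loss values with equal weight and computes the language modeling loss as a negative log-likelihood over per-token probabilities. The function names and the input numbers are made up for the example.

```python
import math
from typing import Sequence


def stage1_loss(itc: float, itm: float, itg: float) -> float:
    # Unweighted sum of the three stage-1 terms, as in the formula above.
    return itc + itm + itg


def lm_loss(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood: -sum_t log p(y_t | y_<t, z).

    token_probs[t] is the probability the model gave to the correct
    token at step t, already conditioned on the visual prefix.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
print(stage1_loss(1.0, 0.5, 2.0))
print(lm_loss([0.5, 0.25]))  # equals log(2) + log(4) = log(8)
```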

Algorithm

The connector first queries frozen vision features, then projects them into the frozen LLM token space.

$$\mathbf{z} = Q_{\phi}\!\left(\mathbf{q}, E_{\mathrm{img}}(\mathbf{x})\right), \qquad \mathbf{h}_0 = W\mathbf{z}, \qquad p_{\theta}(\mathbf{y}\mid \mathbf{x}) = \prod_{t=1}^{T} p_{\theta}\!\left(y_t \mid y_{<t}, \mathbf{h}_0\right)$$
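The data flow of the connector can be traced with a toy, dependency-free sketch. It stands in for the Q-Former with a single unparameterized cross-attention step (no learned key, query, or value projections, no self-attention, no feed-forward layers), so it shows shapes and ordering only, not the real model. All function names and the toy inputs are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain row-by-column product; zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, image_feats: Matrix) -> Matrix:
    """Toy stand-in for z = Q(q, E_img(x)): queries attend over patches."""
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*image_feats)]  # (d, num_patches)
    scores = matmul(queries, keys_t)                   # (num_queries, num_patches)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, image_feats)                # (num_queries, d)


def connector(queries: Matrix, image_feats: Matrix, w: Matrix) -> Matrix:
    z = cross_attend(queries, image_feats)
    # Row-vector convention: z @ w plays the role of h_0 = W z.
    return matmul(z, w)                                # (num_queries, llm_dim)


# Toy sizes: 2 queries, 3 image patches, width 2, LLM width 4.
q = [[0.1, 0.2], [0.3, -0.1]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
h0 = connector(q, feats, w)
print(len(h0), len(h0[0]))
```

In the full model, `h0` would be passed to the frozen LLM as a prefix, and only the connector parameters would receive gradients.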

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers

No vault papers identified as further work yet.