Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue

2022 · NeurIPS

Problem

Framing

Existing vision-language models either only score image-text compatibility, which limits them to tasks such as classification and retrieval, or need task-specific fine-tuning. Flamingo closes this gap by inserting lightweight visual conditioning into a frozen LM, so one prompted model handles interleaved images, videos, and text. It reports the best zero-shot and few-shot results on all 16 evaluated benchmarks.

Currently Used Methods

Foundational

Proposed Method

Architecture

Flamingo freezes a vision encoder and a causal LM, then adds a Perceiver Resampler plus interleaved GATED XATTN-DENSE blocks. Each image or video is compressed to a fixed latent set, and each text token attends only to the most recent visual input.
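The compression of a variable number of visual features into a fixed latent set can be illustrated with a single cross-attention step. This is a minimal sketch, not the model's implementation: it omits the learned query/key/value projections, multiple heads, feed-forward layers, and stacking, and the function name `resample_once` is hypothetical. It assumes the latents attend to the visual features concatenated with the latents themselves.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def resample_once(features, latents):
    """One projection-free cross-attention step (illustrative only).

    features: N visual feature vectors (N varies per image or video).
    latents:  R latent query vectors (R is fixed).
    Returns R updated latents, whatever the value of N.
    """
    # Assumption: keys and values are the features followed by the latents.
    keys = features + latents
    dim = len(latents[0])
    updated = []
    for query in latents:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        attended = [
            sum(w * key[j] for w, key in zip(weights, keys))
            for j in range(dim)
        ]
        # Residual connection around the attention output.
        updated.append([q + a for q, a in zip(query, attended)])
    return updated
```

Calling `resample_once` with 5 or with 50 feature vectors and the same 3 latents returns 3 vectors in both cases, which is the property that keeps the cost of the downstream cross-attention independent of the input resolution or frame count.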

Architecture diagram: two frozen vision encoders feed Perceiver Resamplers, whose latents condition inserted GATED XATTN-DENSE blocks inside a frozen language model over interleaved image-text prompts.
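The rule that each text token attends only to the most recent visual input can be expressed as a cross-attention mask. The sketch below is illustrative: the function name `last_image_mask`, the boolean marker encoding, and the choice to let the marker position itself see its own image are assumptions, not details taken from the paper.

```python
def last_image_mask(is_image_marker, num_latents):
    """Build a cross-attention mask for an interleaved sequence.

    is_image_marker: one flag per sequence position, truthy where an image
                     (or video) is inserted.
    num_latents:     number of latent tokens produced per image.
    Returns a list of rows, one per position; row[t][c] is True when
    position t may attend to latent column c.
    """
    # image_index[t] = how many images appear at or before position t
    # (0 means no image has been seen yet).
    image_index = []
    seen = 0
    for flag in is_image_marker:
        if flag:
            seen += 1
        image_index.append(seen)

    # Column c belongs to image number (c // num_latents) + 1.
    num_columns = seen * num_latents
    return [
        [idx == (col // num_latents) + 1 for col in range(num_columns)]
        for idx in image_index
    ]
```

For a sequence laid out as image, text, text, image, text with two latents per image, the first three rows allow only the first two columns and the last two rows allow only the last two. A position that precedes every image gets an all-False row, so it receives no visual conditioning.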

Loss / Objective

Training is autoregressive next-token prediction over interleaved visual-text sequences.

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{\ell=1}^{L} p\left(y_{\ell} \mid y_{<\ell}, x_{\leq \ell}\right)$$
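Taking the negative logarithm of this factorization gives a sum of per-token terms, which is the quantity minimized during training. A minimal sketch follows, assuming the per-step distributions have already been computed by some model; the function name `sequence_nll` and the list-based layout are hypothetical.

```python
import math


def sequence_nll(step_probs, targets):
    """Negative log-likelihood of a token sequence.

    step_probs[l] is a list over the vocabulary holding
    p(. | y_<l, x_<=l), the model's distribution at step l.
    targets[l] is the index of the observed token y_l.
    """
    return -sum(
        math.log(step_probs[step][token])
        for step, token in enumerate(targets)
    )
```

Exponentiating the negated result recovers the product form: with step distributions `[[0.5, 0.5], [0.25, 0.75]]` and targets `[0, 1]`, the sequence probability is 0.5 × 0.75.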

Algorithm

Visual features enter the frozen LM through gated cross-attention and gated feed-forward residuals.

$$\mathbf{y}' = \mathbf{y} + \tanh(\boldsymbol{\alpha}) \odot \operatorname{XAttn}(\mathbf{y}, \mathbf{x})$$

$$\mathbf{y}'' = \mathbf{y}' + \tanh(\boldsymbol{\beta}) \odot \operatorname{FFW}(\mathbf{y}')$$
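The two gated residual updates can be sketched on a single token vector. This is an illustration under stated assumptions rather than the model's code: the gates are treated as one scalar per layer, the cross-attention output is passed in precomputed, and `gated_block` and its parameter names are hypothetical.

```python
import math


def gated_block(y, xattn_out, ffw, alpha, beta):
    """Apply y' = y + tanh(alpha) * XAttn(y, x), then y'' = y' + tanh(beta) * FFW(y').

    y:         token representation as a list of floats.
    xattn_out: precomputed XAttn(y, x), same length as y.
    ffw:       callable mapping a vector to a vector of the same length.
    alpha, beta: scalar gate parameters (assumed one per layer).
    """
    gate_attn = math.tanh(alpha)
    gate_ffw = math.tanh(beta)
    y_prime = [yi + gate_attn * ai for yi, ai in zip(y, xattn_out)]
    y_second = [yi + gate_ffw * fi for yi, fi in zip(y_prime, ffw(y_prime))]
    return y_second
```

With `alpha = beta = 0`, both gates are zero and the block returns `y` unchanged, so a frozen LM with freshly inserted blocks behaves exactly as it did before. The paper initializes the gates at zero for this reason, and visual information is mixed in gradually as the gates move away from zero during training.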

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Results plot: left, bar chart of Flamingo-80B 32-shot performance versus prior zero/few-shot state of the art across 15 listed benchmarks; right, aggregated performance rising with model size and shot count.

Table 3: Flamingo-3B ablations on five DEV benchmarks with 4 shots

| Ablated setting | Flamingo-3B original value | Changed value | Param. count ↓ | Step time ↓ | COCO CIDEr ↑ | OKVQA top1 ↑ | VQAv2 top1 ↑ | MSVDQA top1 ↑ | VATEX CIDEr ↑ | Overall score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Flamingo-3B model | | | 3.2B | 1.74s | 86.5 | 42.1 | 55.8 | 36.3 | 53.4 | 70.7 |
| (i) Training data | All data | w/o Video-Text pairs | 3.2B | 1.42s | 84.2 | 43.0 | 53.9 | 34.5 | 46.0 | 67.3 |
| | | w/o Image-Text pairs | 3.2B | 0.95s | 66.3 | 39.2 | 51.6 | 32.0 | 41.6 | 60.9 |
| | | Image-Text pairs → LAION | 3.2B | 1.74s | 79.5 | 41.4 | 53.5 | 33.9 | 47.6 | 66.4 |
| | | w/o M3W | 3.2B | 1.02s | 54.1 | 36.5 | 52.7 | 31.4 | 23.5 | 53.4 |
| (ii) Optimisation | Accumulation | Round Robin | 3.2B | 1.68s | 76.1 | 39.8 | 52.1 | 33.2 | 40.8 | 62.9 |
| (iii) Tanh gating | ✓ | ✗ | 3.2B | 1.74s | 78.4 | 40.5 | 52.9 | 35.9 | 47.5 | 66.5 |
| (iv) Cross-attention architecture | GATED XATTN-DENSE | VANILLA XATTN | 2.4B | 1.16s | 80.6 | 41.5 | 53.4 | 32.9 | 50.7 | 66.9 |
| | | GRAFTING | 3.3B | 1.74s | 79.2 | 36.1 | 50.8 | 32.2 | 47.8 | 63.1 |
| (v) Cross-attention frequency | Every | Single in middle | 2.0B | 0.87s | 71.5 | 38.1 | 50.2 | 29.1 | 42.3 | 59.8 |
| | | Every 4th | 2.3B | 1.02s | 82.3 | 42.7 | 55.1 | 34.6 | 50.8 | 68.8 |
| | | Every 2nd | 2.6B | 1.24s | 83.7 | 41.0 | 55.8 | 34.5 | 49.7 | 68.2 |
| (vi) Resampler | Perceiver | MLP | 3.2B | 1.85s | 78.6 | 42.2 | 54.7 | 35.2 | 44.7 | 66.6 |
| | | Transformer | 3.2B | 1.81s | 83.2 | 41.7 | 55.6 | 31.5 | 48.3 | 66.7 |
| (vii) Vision encoder | NFNet-F6 | CLIP ViT-L/14 | 3.1B | 1.58s | 76.5 | 41.6 | 53.4 | 33.2 | 44.5 | 64.9 |
| | | NFNet-F0 | 2.9B | 1.45s | 73.8 | 40.5 | 52.8 | 31.1 | 42.9 | 62.7 |
| (viii) Freezing LM | ✓ | ✗ (random init) | 3.2B | 2.42s | 74.8 | 31.5 | 45.6 | 26.9 | 50.1 | 57.8 |
| | | ✗ (pretrained) | 3.2B | 2.42s | 81.2 | 33.7 | 47.4 | 31.0 | 53.9 | 62.7 |

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers