Zero-Shot Text-to-Image Generation

Aditya Ramesh, Mikhail Pavlov

2021 · ICML

Zero-Shot Text-to-Image Generation

Problem

Framing

Text-to-image models still depended on dataset-specific architectures, auxiliary losses, or extra supervision. This paper replaces that stack with a single autoregressive transformer over caption and image tokens, trained at web scale. At 12B parameters on 250M pairs, zero-shot MS-COCO samples beat DF-GAN in 90% of human pairwise votes.

Currently Used Methods

Foundational

@radfordCLIP2021 — contrastive language-image pretraining for aligned cross-modal embeddings.
- Limitation in context: alignment and reranking do not synthesize pixels.
@vaswaniAttentionAllNeed2017 — scalable self-attention for long autoregressive sequence modeling.
- Limitation in context: no discrete image tokenizer or joint text-image prior.
Neural Discrete Representation Learning — discrete latent codes compress images into symbol sequences.
- Limitation in context: compression alone does not give text-conditioned generation.
Generating Images from Captions with Attention — caption-conditioned recurrent image synthesis.
- Limitation in context: fidelity and scale lag far behind web-scale generation.
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis — strong caption-conditioned GAN baseline on MS-COCO.
- Limitation in context: remains domain-specific and not a unified zero-shot generator.

Proposed Method

Architecture

The model has two stages. A dVAE compresses each $256 \times 256$ image into a $32 \times 32=1024$ token grid with vocabulary size 8192. A 12B decoder-only sparse transformer autoregressively models up to 256 BPE text tokens followed by image tokens, with broadcast row and column embeddings for image positions.

Sample grid: four verified behaviors on one page—concept composition, anthropomorphized animals, rendered text in neon signs, and simple image-to-image translation with a cat.

Loss / Objective

Training uses a two-stage variational bound over images $\mathbf{x}$ , captions $\mathbf{y}$ , and image latents $\mathbf{z}$ .

\ln p_{\theta,\psi}(\mathbf{x}, \mathbf{y}) \geq \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z} \mid \mathbf{x})} \left[ \ln p_{\theta}(\mathbf{x} \mid \mathbf{y}, \mathbf{z}) - \beta \, D_{\mathrm{KL}}\!\left(q_{\phi}(\mathbf{y}, \mathbf{z} \mid \mathbf{x}), \, p_{\psi}(\mathbf{y}, \mathbf{z})\right) \right]

Algorithm

Sampling factorizes the joint prior into caption generation followed by conditional image-token generation.

p_{\psi}(\mathbf{y}, \mathbf{z}) = p_{\psi}(\mathbf{y})\, p_{\psi}(\mathbf{z} \mid \mathbf{y})

Training Procedure

Image resolution: $256 \times 256$ .
dVAE latent grid: $32 \times 32 = 1024$ tokens.
Image codebook size: 8192.
Max caption length: 256 BPE tokens.
Transformer: 64 layers, 62 heads, $d_{\mathrm{model}}=3968$ .
Optimizer: AdamW, $\beta_1=0.9$ , $\beta_2=0.999$ , $\epsilon=10^{-8}$ .
Weight decay: $10^{-4}$ .
EMA decay: 0.999.
dVAE KL weight: 0 to 6.6 over 5k updates.
dVAE learning rate: $10^{-4}$ to $1.25 \times 10^{-6}$ over 1.2M updates.
Caption tokenization: 10% BPE dropout.

Evaluation

Datasets

250M internet image-text pairs for training.
MS-COCO captions for zero-shot evaluation.
CUB captions for zero-shot evaluation.

Metrics

Human pairwise preference votes.
FID.
Inception Score.
Sensitivity to CLIP reranking sample count.
Overlap-filtered FID on CUB.

Headline results

MS-COCO zero-shot: 90% human preference over DF-GAN in best-of-five voting.
MS-COCO: FID and IS reported as functions of Gaussian blur radius.
CUB zero-shot: FID and IS reported as functions of blur radius.
MS-COCO reranking: best-of-512 CLIP selection used for qualitative comparisons.
CUB overlap audit: validation set contains about 21% training-image overlap.

Ablations

Reranking sample count: larger candidate pools improve MS-COCO scores.
Blur radius: reported FID and IS shift materially on both benchmarks.
Validation overlap: removing overlapping CUB images changes FID.
Temperature reduction: human study excludes it for a cleaner comparison.

Reconstruction comparison: top row original images, bottom row dVAE reconstructions; global structure survives while text and fine detail blur or disappear.

Method Strengths and Weaknesses

Strengths

One joint autoregressive prior unifies caption conditioning and image generation.
Scale alone reaches strong zero-shot quality: 12B parameters, 250M pairs.
Human raters prefer MS-COCO samples over DF-GAN 90% of the time.
The model shows emergent text rendering and simple image-to-image translation.

Weaknesses

dVAE compression visibly removes fine detail before transformer modeling.
Best showcased samples rely on CLIP reranking over 512 candidates.
Quantitative image quality depends strongly on evaluation blur settings.
CUB validation overlap muddies clean zero-shot FID interpretation.

Suggestions from the authors

Scale model size, dataset size, and compute further.
Improve mixed-precision stability for billion-parameter generative training.
Reduce train-validation overlap in evaluation datasets.
Improve text rendering and image-to-image translation fidelity.

Zero-Shot Text-to-Image Generation

Zero-Shot Text-to-Image Generation

Problem

Framing

Currently Used Methods

Foundational

Proposed Method

Architecture

Loss / Objective

Algorithm

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers