Zero-Shot Text-to-Image Generation

Aditya Ramesh, Mikhail Pavlov

2021 · ICML

Problem

Framing

Prior text-to-image models depended on dataset-specific architectures, auxiliary losses, or extra supervision. This paper replaces that stack with a single autoregressive transformer over caption and image tokens, trained at web scale. With 12B parameters trained on 250M image-text pairs, the model's zero-shot MS-COCO samples were judged more realistic than DF-GAN's in 90% of human pairwise comparisons.

Currently Used Methods

Foundational

Proposed Method

Architecture

The model has two stages. A dVAE compresses each $256 \times 256$ image into a $32 \times 32 = 1024$ token grid with vocabulary size 8192. A 12B decoder-only sparse transformer autoregressively models up to 256 BPE text tokens followed by image tokens, with broadcast row and column embeddings for image positions.
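The token budget described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the helper names `image_position_ids` and `build_sequence` and the padding id `PAD_ID` are hypothetical, and the padding scheme in particular is an assumption made only so that the sequence has a fixed length.

```python
# Constants taken from the architecture description above.
IMAGE_SIDE = 256          # input images are 256x256
CHANNELS = 3              # assumption: RGB input
GRID_SIDE = 32            # dVAE output grid is 32x32
IMAGE_VOCAB = 8192        # dVAE codebook size
MAX_TEXT_TOKENS = 256     # maximum number of BPE caption tokens
PAD_ID = 0                # hypothetical padding id, for illustration only

N_IMAGE_TOKENS = GRID_SIDE * GRID_SIDE              # 1024 image tokens
SEQ_LEN = MAX_TEXT_TOKENS + N_IMAGE_TOKENS          # 1280 tokens in total
# Pixel values per image token: how much the dVAE shortens the context.
COMPRESSION = (IMAGE_SIDE * IMAGE_SIDE * CHANNELS) // N_IMAGE_TOKENS


def image_position_ids(grid_side=GRID_SIDE):
    """Return the (row, column) index of each image token in raster order.

    A row embedding and a column embedding looked up with these indices
    would be shared (broadcast) across every token in that row or column.
    """
    return [(i // grid_side, i % grid_side) for i in range(grid_side * grid_side)]


def build_sequence(text_tokens, image_tokens):
    """Concatenate caption tokens and image tokens into one stream."""
    if len(text_tokens) > MAX_TEXT_TOKENS:
        raise ValueError("caption exceeds the text token budget")
    if len(image_tokens) != N_IMAGE_TOKENS:
        raise ValueError("expected one token per grid cell")
    # Right-pad short captions so image tokens always start at the same offset.
    padded = list(text_tokens) + [PAD_ID] * (MAX_TEXT_TOKENS - len(text_tokens))
    return padded + list(image_tokens)


if __name__ == "__main__":
    seq = build_sequence([17, 42, 7], [0] * N_IMAGE_TOKENS)
    print(len(seq), COMPRESSION, image_position_ids()[33])
```

Under these assumptions the transformer sees 1280 tokens per example, and each image token stands in for 192 pixel values.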

Sample grid: four behaviors shown on one page: concept composition, anthropomorphized animals, rendered text in neon signs, and simple image-to-image translation with a cat.

Loss / Objective

Training uses a two-stage variational bound over images $\mathbf{x}$, captions $\mathbf{y}$, and image latents $\mathbf{z}$.

$$\ln p_{\theta,\psi}(\mathbf{x}, \mathbf{y}) \geq \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z} \mid \mathbf{x})} \left[ \ln p_{\theta}(\mathbf{x} \mid \mathbf{y}, \mathbf{z}) - \beta \, D_{\mathrm{KL}}\!\left(q_{\phi}(\mathbf{y}, \mathbf{z} \mid \mathbf{x}), \, p_{\psi}(\mathbf{y}, \mathbf{z})\right) \right]$$
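To make the KL term in the bound concrete, here is a small numerical sketch for a single latent position. It assumes a uniform categorical prior over the K codebook entries (an assumption of this sketch, not stated in the summary above), in which case the KL divergence reduces to ln K minus the entropy of the encoder distribution. The function names `kl_to_uniform`, `entropy`, and `bound_term` are illustrative only.

```python
import math


def entropy(q):
    """Shannon entropy of a categorical distribution q, in nats."""
    return -sum(p * math.log(p) for p in q if p > 0)


def kl_to_uniform(q):
    """KL(q || uniform over len(q) categories), in nats."""
    k = len(q)
    # Each term is p * ln(p / (1/k)) = p * ln(p * k); zero-mass terms vanish.
    return sum(p * math.log(p * k) for p in q if p > 0)


def bound_term(log_likelihood, kl, beta):
    """One-sample estimate of the bracketed term: ln p(x | y, z) - beta * KL."""
    return log_likelihood - beta * kl


if __name__ == "__main__":
    q = [0.5, 0.25, 0.25, 0.0]          # toy encoder distribution, K = 4
    kl = kl_to_uniform(q)
    # For a uniform prior the two expressions below agree.
    print(kl, math.log(len(q)) - entropy(q))
    # Hypothetical reconstruction log-likelihood of -10 nats, beta = 1.
    print(bound_term(-10.0, kl, beta=1.0))
```

A one-hot encoder distribution pays the full ln K per position, while a uniform one pays nothing, so a larger $\beta$ pushes the encoder toward spreading mass across the codebook.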

Algorithm

Sampling factorizes the joint prior into caption generation followed by conditional image-token generation.

$$p_{\psi}(\mathbf{y}, \mathbf{z}) = p_{\psi}(\mathbf{y})\, p_{\psi}(\mathbf{z} \mid \mathbf{y})$$
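The conditional factor $p_{\psi}(\mathbf{z} \mid \mathbf{y})$ can be sampled one image token at a time. The loop below is a minimal sketch under stated assumptions: `next_token_logits` is a hypothetical callable standing in for the transformer, and `toy_logits` is a placeholder that returns uniform logits so the example runs without a trained model. It is not the authors' sampling code.

```python
import math
import random


def sample_categorical(logits, rng):
    """Draw one index from softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_image_tokens(caption_tokens, next_token_logits,
                        n_image_tokens=1024, seed=0):
    """Sample image tokens autoregressively, conditioned on a fixed caption.

    next_token_logits(caption_tokens, image_tokens_so_far) must return one
    logit per entry of the image vocabulary.
    """
    rng = random.Random(seed)
    image_tokens = []
    for _ in range(n_image_tokens):
        logits = next_token_logits(caption_tokens, image_tokens)
        image_tokens.append(sample_categorical(logits, rng))
    return image_tokens


def toy_logits(caption_tokens, image_tokens, vocab=8192):
    """Hypothetical stand-in for the transformer: uniform over the vocabulary."""
    return [0.0] * vocab


if __name__ == "__main__":
    tokens = sample_image_tokens([17, 42, 7], toy_logits, n_image_tokens=16)
    print(len(tokens), min(tokens) >= 0, max(tokens) < 8192)
```

In a real pipeline the sampled 1024 tokens would be reshaped to the $32 \times 32$ grid and passed to the dVAE decoder to produce pixels.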

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Reconstruction comparison: top row original images, bottom row dVAE reconstructions; global structure survives while text and fine detail blur or disappear.

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers