Zero-Shot Text-to-Image Generation
Zero-Shot Text-to-Image Generation
Problem
Framing
Text-to-image models still depended on dataset-specific architectures, auxiliary losses, or extra supervision. This paper replaces that stack with a single autoregressive transformer over caption and image tokens, trained at web scale. At 12B parameters on 250M pairs, zero-shot MS-COCO samples beat DF-GAN in 90% of human pairwise votes.
Currently Used Methods
Foundational
- @radfordCLIP2021 — contrastive language-image pretraining for aligned cross-modal embeddings.
- Limitation in context: alignment and reranking do not synthesize pixels.
- @vaswaniAttentionAllNeed2017 — scalable self-attention for long autoregressive sequence modeling.
- Limitation in context: no discrete image tokenizer or joint text-image prior.
- Neural Discrete Representation Learning — discrete latent codes compress images into symbol sequences.
- Limitation in context: compression alone does not give text-conditioned generation.
- Generating Images from Captions with Attention — caption-conditioned recurrent image synthesis.
- Limitation in context: fidelity and scale lag far behind web-scale generation.
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis — strong caption-conditioned GAN baseline on MS-COCO.
- Limitation in context: remains domain-specific and not a unified zero-shot generator.
Proposed Method
Architecture
The model has two stages. A dVAE compresses each image into a token grid with vocabulary size 8192. A 12B decoder-only sparse transformer autoregressively models up to 256 BPE text tokens followed by image tokens, with broadcast row and column embeddings for image positions.

Loss / Objective
Training uses a two-stage variational bound over images , captions , and image latents .
Algorithm
Sampling factorizes the joint prior into caption generation followed by conditional image-token generation.
Training Procedure
- Image resolution: .
- dVAE latent grid: tokens.
- Image codebook size: 8192.
- Max caption length: 256 BPE tokens.
- Transformer: 64 layers, 62 heads, .
- Optimizer: AdamW, , , .
- Weight decay: .
- EMA decay: 0.999.
- dVAE KL weight: 0 to 6.6 over 5k updates.
- dVAE learning rate: to over 1.2M updates.
- Caption tokenization: 10% BPE dropout.
Evaluation
Datasets
- 250M internet image-text pairs for training.
- MS-COCO captions for zero-shot evaluation.
- CUB captions for zero-shot evaluation.
Metrics
- Human pairwise preference votes.
- FID.
- Inception Score.
- Sensitivity to CLIP reranking sample count.
- Overlap-filtered FID on CUB.
Headline results
- MS-COCO zero-shot: 90% human preference over DF-GAN in best-of-five voting.
- MS-COCO: FID and IS reported as functions of Gaussian blur radius.
- CUB zero-shot: FID and IS reported as functions of blur radius.
- MS-COCO reranking: best-of-512 CLIP selection used for qualitative comparisons.
- CUB overlap audit: validation set contains about 21% training-image overlap.
Ablations
- Reranking sample count: larger candidate pools improve MS-COCO scores.
- Blur radius: reported FID and IS shift materially on both benchmarks.
- Validation overlap: removing overlapping CUB images changes FID.
- Temperature reduction: human study excludes it for a cleaner comparison.

Method Strengths and Weaknesses
Strengths
- One joint autoregressive prior unifies caption conditioning and image generation.
- Scale alone reaches strong zero-shot quality: 12B parameters, 250M pairs.
- Human raters prefer MS-COCO samples over DF-GAN 90% of the time.
- The model shows emergent text rendering and simple image-to-image translation.
Weaknesses
- dVAE compression visibly removes fine detail before transformer modeling.
- Best showcased samples rely on CLIP reranking over 512 candidates.
- Quantitative image quality depends strongly on evaluation blur settings.
- CUB validation overlap muddies clean zero-shot FID interpretation.
Suggestions from the authors
- Scale model size, dataset size, and compute further.
- Improve mixed-precision stability for billion-parameter generative training.
- Reduce train-validation overlap in evaluation datasets.
- Improve text rendering and image-to-image translation fidelity.
Links
Prior Papers
- @radfordCLIP2021 — supplies the contrastive image-text model used for reranking generated samples.
- @vaswaniAttentionAllNeed2017 — provides the transformer backbone scaled here to joint text-image autoregression.
Further Papers
- @rombachLatentDiffusion2022 — replaces expensive pixel-space autoregression with latent diffusion for text-to-image synthesis.
- @alayracFlamingo2022 — extends large-scale multimodal modeling with stronger few-shot visual-language conditioning.
- @liBLIP2_2023 — advances modular vision-language pretraining on top of large multimodal representation pipelines.