Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Shane Gu

2022 · NeurIPS

Problem

Framing

Zero-shot prompting of large language models still fails on multi-step reasoning, and prior fixes depend on task-specific exemplars in the prompt. The paper shows that a single trigger phrase, "Let's think step by step", elicits useful reasoning traces without any exemplars, raising text-davinci-002 accuracy on MultiArith from 17.7% to 78.7% and on GSM8K from 10.4% to 40.7%.

Currently Used Methods

Foundational

@brownGPT3_2020 — few-shot in-context prompting with task examples.
- Limitation in context: zero-shot reasoning remains weak on multi-step tasks.
@weiCoT2022 — few-shot chain-of-thought exemplars elicit intermediate reasoning.
- Limitation in context: requires hand-written, task-specific reasoning demonstrations.
@ouyangInstructGPT2022 — instruction-tuned GPT-3 strengthens zero-shot instruction following.
- Limitation in context: instruction tuning alone does not unlock stable reasoning traces.
@kaplanScalingLaws2020 — larger models improve predictably on many language tasks.
- Limitation in context: reasoning gains stay flat without explicit stepwise prompting.

Proposed Method

Architecture

The method keeps the LLM frozen and changes only inference prompts. It uses two stages: first generate a rationale with the trigger phrase, then extract the final answer with a short answer cue.

Prompting comparison: four panels contrast Few-shot, Few-shot-CoT, Zero-shot, and Zero-shot-CoT prompts on the same arithmetic question; the added trigger phrase induces a correct step-by-step rationale.

Loss / Objective

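The paper does not optimize model parameters. Inference draws from the pretrained conditional distribution $p_{\theta}$ directly: the rationale $\mathbf{z}$ is generated from the reasoning prompt $\mathbf{x}_0$, and the answer $\hat{y}$ is generated from $\mathbf{x}_0$ concatenated with $\mathbf{z}$ and the answer-extraction cue $\mathbf{a}$.

$$\mathbf{z} \sim p_{\theta}(\cdot \mid \mathbf{x}_0), \qquad \hat{y} \sim p_{\theta}(\cdot \mid [\mathbf{x}_0; \mathbf{z}; \mathbf{a}])$$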

Algorithm

Zero-shot-CoT forms a reasoning prompt, then an answer-extraction prompt.

$$\mathbf{x}_0 = [\text{Q: } \mathbf{x} \text{ A: } t], \qquad t = \text{"Let's think step by step"}$$

$$\mathbf{x}_{\mathrm{ans}} = [\mathbf{x}_0; \mathbf{z}; \mathbf{a}]$$
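The two prompts can be sketched in Python as follows. This is a minimal illustration, not the authors' code: `generate` is a hypothetical stand-in for any text-completion call (prompt in, continuation out), and the default cue string is an assumed placeholder for the answer-extraction cue.

```python
from typing import Callable, Tuple

TRIGGER = "Let's think step by step"


def build_reasoning_prompt(question: str, trigger: str = TRIGGER) -> str:
    # x_0 = [Q: x  A: t]
    return f"Q: {question}\nA: {trigger}"


def build_answer_prompt(reasoning_prompt: str, rationale: str, cue: str) -> str:
    # x_ans = [x_0; z; a]
    return f"{reasoning_prompt} {rationale.strip()}\n{cue}"


def zero_shot_cot(
    question: str,
    generate: Callable[[str], str],  # hypothetical completion function
    cue: str = "Therefore, the answer is",  # assumed cue wording
) -> Tuple[str, str]:
    # Pass 1: elicit the rationale z from the reasoning prompt.
    x0 = build_reasoning_prompt(question)
    rationale = generate(x0)
    # Pass 2: append the rationale and the cue, then read off the answer.
    x_ans = build_answer_prompt(x0, rationale, cue)
    answer = generate(x_ans)
    return rationale, answer.strip()
```

Passing the model call in as a function keeps the sketch independent of any particular API; the same frozen model serves both passes.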

Training Procedure

No finetuning.
No gradient updates.
Main model: text-davinci-002.
Additional models: InstructGPT, GPT-3, PaLM.
Two inference passes per example.
Answer-extraction cue varies with answer format.
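A format-dependent cue, plus a simple parser for the second-pass output, might look like the sketch below. The cue wordings, the format keys, and the first-match parsing rule are assumptions made for illustration; the paper's exact strings and cleansing rules may differ.

```python
import re
from typing import Optional

# Illustrative cue strings keyed by answer format (assumed wording).
ANSWER_CUES = {
    "number": "Therefore, the answer (arabic numerals) is",
    "choice": "Therefore, among A through E, the answer is",
    "yes_no": "Therefore, the answer (Yes or No) is",
}


def pick_cue(answer_format: str) -> str:
    """Look up the answer-extraction cue for a given answer format."""
    return ANSWER_CUES[answer_format]


def parse_answer(text: str, answer_format: str) -> Optional[str]:
    """Return the first span that fits the expected format, or None."""
    if answer_format == "number":
        match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
        # Drop thousands separators so "1,234" compares equal to "1234".
        return match.group(0).replace(",", "") if match else None
    if answer_format == "choice":
        match = re.search(r"\b([A-E])\b", text)
        return match.group(1) if match else None
    if answer_format == "yes_no":
        match = re.search(r"\b(yes|no)\b", text, flags=re.IGNORECASE)
        return match.group(1).lower() if match else None
    raise ValueError(f"unknown answer format: {answer_format}")
```

For example, `parse_answer(" 1,234 apples.", "number")` yields `"1234"`. Taking the first match is a deliberately crude rule and will misread outputs that restate the question before answering.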

Evaluation

Datasets

Arithmetic: SingleEq, AddSub, MultiArith, GSM8K, AQUA-RAT, SVAMP.
Commonsense: CommonsenseQA, StrategyQA.
Symbolic: Last Letter, Coin Flip.
Logical: Date Understanding, Tracking Shuffled Objects.

Metrics

Accuracy on every benchmark.
Model-scale accuracy curves for MultiArith and GSM8K.
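Accuracy here is the share of questions whose parsed prediction matches the reference. A minimal sketch, assuming exact string match after parsing (the matching rule is an assumption, not a detail taken from the paper):

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy in percent."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```

With three of four predictions correct, `accuracy(["3", "5", "7", "9"], ["3", "5", "8", "9"])` returns `75.0`.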

Headline results

MultiArith, text-davinci-002: 78.7%.
GSM8K, text-davinci-002: 40.7%.
MultiArith, zero-shot baseline: 17.7%.
GSM8K, zero-shot baseline: 10.4%.
PaLM 540B: GSM8K 12.5% $\rightarrow$ 43.0%, MultiArith 25.5% $\rightarrow$ 66.1%.

Ablations

Prompt template: instructive templates stay strong; misleading templates can collapse accuracy.
Example mismatch: Few-shot-CoT degrades when exemplar task and answer format mismatch.
Model scale: gains appear mainly in larger models.
Commonsense tasks: little or no improvement versus zero-shot.

Results table

Table 5: Few-shot-CoT robustness to cross-task exemplars

Method	AQUA-RAT (%)	MultiArith (%)
Zero-shot	22.4	17.7
Few-shot-CoT $\dagger$	31.9	27.0
Zero-shot-CoT	33.5	78.7
Few-shot-CoT	39.0	88.2

$\dagger$ CommonsenseQA samples are used as exemplars.
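The gaps in Table 5 can be read off with a few subtractions. The snippet below only restates the table's numbers; the dictionary layout and the helper name `point_gap` are illustrative choices.

```python
# Accuracy (%) copied from Table 5.
TABLE_5 = {
    "Zero-shot": {"AQUA-RAT": 22.4, "MultiArith": 17.7},
    "Few-shot-CoT (CommonsenseQA exemplars)": {"AQUA-RAT": 31.9, "MultiArith": 27.0},
    "Zero-shot-CoT": {"AQUA-RAT": 33.5, "MultiArith": 78.7},
    "Few-shot-CoT": {"AQUA-RAT": 39.0, "MultiArith": 88.2},
}


def point_gap(method_a: str, method_b: str, task: str) -> float:
    """Accuracy of method_a minus method_b on a task, in percentage points."""
    return round(TABLE_5[method_a][task] - TABLE_5[method_b][task], 1)


# Cost of mismatched exemplars for Few-shot-CoT.
mismatch_cost = {
    task: point_gap("Few-shot-CoT", "Few-shot-CoT (CommonsenseQA exemplars)", task)
    for task in ("AQUA-RAT", "MultiArith")
}

# Gain from adding the trigger phrase to plain zero-shot prompting.
trigger_gain = {
    task: point_gap("Zero-shot-CoT", "Zero-shot", task)
    for task in ("AQUA-RAT", "MultiArith")
}
```

On MultiArith, mismatched exemplars cost Few-shot-CoT 61.2 points, while the trigger phrase alone adds 61.0 points over plain zero-shot; on AQUA-RAT the corresponding figures are 7.1 and 11.1.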

Method Strengths and Weaknesses

Strengths

No training cost; gains come entirely from prompt changes.
One trigger transfers across arithmetic, symbolic, and logical tasks.
Large arithmetic gains: MultiArith improves by 61.0 points.
Similar gains appear on a second model family, PaLM 540B.

Weaknesses

Still trails Few-shot-CoT with curated, task-matched exemplars on arithmetic tasks (Table 5: MultiArith 78.7% vs. 88.2%, AQUA-RAT 33.5% vs. 39.0%).
Needs two decoding passes per question.
Gains are weak on CommonsenseQA and StrategyQA.
Rationales can be fluent yet logically wrong or overgenerate options.

Suggestions from the authors

Analyze why a single trigger phrase elicits multi-step reasoning.
Study broader zero-shot reasoning abilities hidden in pretrained LLMs.
Improve robustness across tasks and answer formats.
Explore higher-level, multi-task cognitive capabilities beyond benchmark prompting.
