Large Language Models are Zero-Shot Reasoners
Large Language Models are Zero-Shot Reasoners
Problem
Framing
Zero-shot prompting in large LLMs still fails on multi-step reasoning unless the prompt includes task-specific exemplars. The paper shows that one trigger phrase, \"\text{Let's think step by step}\", elicits useful reasoning traces without exemplars, raising MultiArith accuracy from 17.7% to 78.7% and GSM8K from 10.4% to 40.7%.
Currently Used Methods
Foundational
- @brownGPT3_2020 — few-shot in-context prompting with task examples.
- Limitation in context: zero-shot reasoning remains weak on multi-step tasks.
- @weiCoT2022 — few-shot chain-of-thought exemplars elicit intermediate reasoning.
- Limitation in context: requires hand-written, task-specific reasoning demonstrations.
- @ouyangInstructGPT2022 — instruction-tuned GPT-3 strengthens zero-shot instruction following.
- Limitation in context: instruction tuning alone does not unlock stable reasoning traces.
- @kaplanScalingLaws2020 — larger models improve predictably on many language tasks.
- Limitation in context: reasoning gains stay flat without explicit stepwise prompting.
Proposed Method
Architecture
The method keeps the LLM frozen and changes only inference prompts. It uses two stages: first generate a rationale with the trigger phrase, then extract the final answer with a short answer cue.

Loss / Objective
The paper does not optimize model parameters. Inference uses the pretrained conditional distribution directly.
Algorithm
Zero-shot-CoT forms a reasoning prompt, then an answer-extraction prompt.
\mathbf{x}_0 = [\text{Q: } \mathbf{x} \text{ A: } t], \qquad t = \"\text{Let's think step by step}\"Training Procedure
- No finetuning.
- No gradient updates.
- Main model: text-davinci-002.
- Additional models: InstructGPT, GPT-3, PaLM.
- Two inference passes per example.
- Answer-extraction cue varies with answer format.
Evaluation
Datasets
- Arithmetic: SingleEq, AddSub, MultiArith, GSM8K, AQUA-RAT, SVAMP.
- Commonsense: CommonSenseQA, StrategyQA.
- Symbolic: Last Letter, Coin Flip.
- Logical: Date Understanding, Tracking Shuffled Objects.
Metrics
- Accuracy on every benchmark.
- Model-scale accuracy curves for MultiArith and GSM8K.
Headline results
- MultiArith, text-davinci-002: 78.7%.
- GSM8K, text-davinci-002: 40.7%.
- MultiArith, zero-shot baseline: 17.7%.
- GSM8K, zero-shot baseline: 10.4%.
- PaLM 540B: GSM8K 12.5% 43.0%, MultiArith 25.5% 69.5%.
Ablations
- Prompt template: instructive templates stay strong; misleading templates can collapse accuracy.
- Example mismatch: Few-shot-CoT degrades when exemplar task and answer format mismatch.
- Model scale: gains appear mainly in larger models.
- Commonsense tasks: little or no improvement versus zero-shot.
Results table
Table 5: Few-shot-CoT robustness to cross-task exemplars
| Method | AQUA-RAT | MultiArith |
|---|---|---|
| Zero-shot | 22.4 | 17.7 |
| Few-shot-CoT | 31.9 | 27.0 |
| Zero-shot-CoT | 33.5 | 78.7 |
| Few-shot-CoT | 39.0 | 88.2 |
CommonsenseQA samples are used as exemplars.
Method Strengths and Weaknesses
Strengths
- No training cost; gains come entirely from prompt changes.
- One trigger transfers across arithmetic, symbolic, and logical tasks.
- Large arithmetic gains: MultiArith improves by 61.0 points.
- Similar gains appear on a second model family, PaLM 540B.
Weaknesses
- Still trails few-shot CoT with curated exemplars on key arithmetic tasks.
- Needs two decoding passes per question.
- Gains are weak on CommonSenseQA and StrategyQA.
- Rationales can be fluent yet logically wrong or overgenerate options.
Suggestions from the authors
- Analyze why a single trigger phrase elicits multi-step reasoning.
- Study broader zero-shot reasoning abilities hidden in pretrained LLMs.
- Improve robustness across tasks and answer formats.
- Explore higher-level, multi-task cognitive capabilities beyond benchmark prompting.
Links
Prior Papers
- @brownGPT3_2020 — establishes the few-shot and zero-shot prompting baseline that this paper strengthens for reasoning.
- @weiCoT2022 — introduces chain-of-thought exemplars, the direct method that Zero-shot-CoT removes.
- @ouyangInstructGPT2022 — provides the instruction-tuned text-davinci models used for the main experiments.
Further Papers
No vault papers identified as further work yet.