Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Shane Gu

2022 · NeurIPS


Problem

Framing

Large language models still fail on multi-step reasoning under standard zero-shot prompting; strong results have required prompts with task-specific exemplars. The paper shows that a single trigger phrase, "Let's think step by step", elicits useful reasoning traces without any exemplars, raising MultiArith accuracy from 17.7% to 78.7% and GSM8K accuracy from 10.4% to 40.7%.

Currently Used Methods

Foundational

Proposed Method

Architecture

The method keeps the LLM frozen and changes only inference prompts. It uses two stages: first generate a rationale with the trigger phrase, then extract the final answer with a short answer cue.

Prompting comparison: four panels contrast few-shot, few-shot-CoT, zero-shot, and Zero-shot-CoT on the same arithmetic question; the added trigger phrase induces a correct step-by-step rationale.
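
The four prompt formats in the comparison can be sketched as plain string templates. This is a minimal illustration, not the paper's code: the function names, the exemplar, the test question, and the "The answer is" cue used for the non-CoT formats are all assumptions made for the example.

```python
from typing import List, Tuple

TRIGGER = "Let's think step by step."

# (question, rationale, answer) triples. The exemplar below is hypothetical.
Exemplar = Tuple[str, str, str]


def zero_shot(question: str) -> str:
    """Question followed by a direct answer cue (cue wording is an assumption)."""
    return f"Q: {question}\nA: The answer is"


def zero_shot_cot(question: str) -> str:
    """Question followed only by the trigger phrase; no exemplars."""
    return f"Q: {question}\nA: {TRIGGER}"


def few_shot(exemplars: List[Exemplar], question: str) -> str:
    """Exemplars show question/answer pairs without rationales."""
    shots = "\n\n".join(f"Q: {q}\nA: The answer is {a}." for q, _, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"


def few_shot_cot(exemplars: List[Exemplar], question: str) -> str:
    """Exemplars include a hand-written rationale before each answer."""
    shots = "\n\n".join(f"Q: {q}\nA: {r} The answer is {a}." for q, r, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"


EXEMPLARS: List[Exemplar] = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples. 3 + 2 = 5.",
        "5",
    )
]
QUESTION = "A box holds 16 balls and half of them are red. How many red balls are there?"

if __name__ == "__main__":
    for name, prompt in [
        ("few-shot", few_shot(EXEMPLARS, QUESTION)),
        ("few-shot-CoT", few_shot_cot(EXEMPLARS, QUESTION)),
        ("zero-shot", zero_shot(QUESTION)),
        ("Zero-shot-CoT", zero_shot_cot(QUESTION)),
    ]:
        print(f"--- {name} ---\n{prompt}\n")
```

Only the Zero-shot-CoT template is task-agnostic: it needs neither exemplars nor a hand-written rationale, which is the contrast the four panels draw.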

Loss / Objective

The paper does not optimize model parameters. Inference uses the pretrained conditional distribution directly.

\mathbf{z} \sim p_{\theta}(\cdot \mid \mathbf{x}_0), \qquad \hat{y} \sim p_{\theta}(\cdot \mid [\mathbf{x}_0; \mathbf{z}; \mathbf{a}])

Algorithm

Zero-shot-CoT forms a reasoning prompt, then an answer-extraction prompt.

\mathbf{x}_0 = [\text{Q: } \mathbf{x} \text{ A: } t], \qquad t = \text{"Let's think step by step"}

\mathbf{x}_{\mathrm{ans}} = [\mathbf{x}_0; \mathbf{z}; \mathbf{a}]
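
The two prompts can be chained in a few lines. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `generate` stands in for any frozen text-completion model, `fake_generate` is a hypothetical stub so the example runs offline, the answer cue wording is modelled on the paper's arithmetic setup, and the regex-based answer parsing is a simplification.

```python
import re
from typing import Callable, Optional, Tuple

TRIGGER = "Let's think step by step."
# Answer cue a; wording assumed here for arithmetic tasks.
ANSWER_CUE = "Therefore, the answer (arabic numerals) is"


def build_reasoning_prompt(question: str) -> str:
    """x_0 = [Q: x  A: t]"""
    return f"Q: {question}\nA: {TRIGGER}"


def build_answer_prompt(x0: str, rationale: str, cue: str = ANSWER_CUE) -> str:
    """x_ans = [x_0; z; a]"""
    return f"{x0} {rationale}\n{cue}"


def zero_shot_cot(
    question: str, generate: Callable[[str], str]
) -> Tuple[str, Optional[str]]:
    """Run both stages with a frozen model and return (rationale, answer)."""
    x0 = build_reasoning_prompt(question)
    z = generate(x0).strip()                      # stage 1: rationale
    x_ans = build_answer_prompt(x0, z)
    raw = generate(x_ans)                         # stage 2: answer extraction
    # Simplified parsing (assumption): take the first number in the completion.
    match = re.search(r"-?\d+(?:\.\d+)?", raw.replace(",", ""))
    return z, (match.group(0) if match else None)


def fake_generate(prompt: str) -> str:
    """Hypothetical stub model: canned outputs keyed on the prompt ending."""
    if prompt.rstrip().endswith(ANSWER_CUE):
        return " 8."
    return "There are 16 balls. Half of them are red, so 16 / 2 = 8."


if __name__ == "__main__":
    rationale, answer = zero_shot_cot(
        "A box holds 16 balls and half of them are red. How many red balls are there?",
        fake_generate,
    )
    print(rationale)
    print(answer)
```

Because the model is only ever called through `generate`, swapping in a real completion API changes nothing else in the pipeline, which mirrors the prompt-only nature of the method.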

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Results table

Table 5: Few-shot-CoT robustness to cross-task exemplars

| Method | AQUA-RAT | MultiArith |
| --- | --- | --- |
| Zero-shot | 22.4 | 17.7 |
| Few-shot-CoT † | 31.9 | 27.0 |
| Zero-shot-CoT | 33.5 | 78.7 |
| Few-shot-CoT | 39.0 | 88.2 |

† CommonsenseQA samples are used as exemplars.
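
The gaps in Table 5 are easier to compare as differences in accuracy points. The snippet below is a small worked calculation using only the numbers in the table; the dictionary keys and the `delta` helper are names chosen for this example.

```python
# Accuracy (%) from Table 5. "cross-task" marks the dagger row, where
# CommonsenseQA samples are used as exemplars.
RESULTS = {
    "Zero-shot": {"AQUA-RAT": 22.4, "MultiArith": 17.7},
    "Few-shot-CoT (cross-task)": {"AQUA-RAT": 31.9, "MultiArith": 27.0},
    "Zero-shot-CoT": {"AQUA-RAT": 33.5, "MultiArith": 78.7},
    "Few-shot-CoT": {"AQUA-RAT": 39.0, "MultiArith": 88.2},
}


def delta(method_a: str, method_b: str, task: str) -> float:
    """Accuracy of method_a minus method_b on a task, in percentage points."""
    return round(RESULTS[method_a][task] - RESULTS[method_b][task], 1)


if __name__ == "__main__":
    for task in ("AQUA-RAT", "MultiArith"):
        print(task)
        print("  Zero-shot-CoT vs Zero-shot:",
              delta("Zero-shot-CoT", "Zero-shot", task))
        print("  Few-shot-CoT vs cross-task Few-shot-CoT:",
              delta("Few-shot-CoT", "Few-shot-CoT (cross-task)", task))
        print("  Zero-shot-CoT vs cross-task Few-shot-CoT:",
              delta("Zero-shot-CoT", "Few-shot-CoT (cross-task)", task))
```

On MultiArith, replacing task-matched exemplars with CommonsenseQA exemplars costs Few-shot-CoT 61.2 points, while Zero-shot-CoT, which uses no exemplars, stays 51.7 points above the cross-task variant. On AQUA-RAT the corresponding gaps are 7.1 and 1.6 points.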

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers

No vault papers identified as further work yet.