Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang

2022 · NeurIPS

Problem

Framing

Standard few-shot prompting underuses large language models on multi-step reasoning. The paper narrows this gap by adding natural-language intermediate reasoning steps to the few-shot exemplars. On GSM8K, PaLM 540B rises from 18% accuracy with standard prompting to 57% with chain-of-thought (CoT) prompting.

Currently Used Methods

Foundational

@brownGPT3_2020 — few-shot prompting with input–output exemplars.
- Limitation in context: weak on multi-step reasoning, even at large scale.
Training from scratch for natural-language rationales — models learn to emit intermediate reasoning traces.
- Limitation in context: needs rationale supervision, not prompt-only adaptation.
Finetuned GPT-3 with verifier — supervised math reasoning plus learned checking.
- Limitation in context: requires labeled solutions and extra task-specific machinery.
Neuro-symbolic reasoning methods — use formal intermediate representations for reasoning.
- Limitation in context: less flexible than plain-language prompting on off-the-shelf LMs.

Proposed Method

Architecture

The method changes the prompt, not the model. Each exemplar is a triple $\langle \text{input}, \text{chain of thought}, \text{output} \rangle$, and decoding stays autoregressive. The accompanying figure contrasts the two prompt formats side by side: direct answer prediction versus rationale-then-answer prompting.

Prompt-format figure: standard prompting answers directly, while chain-of-thought prompting inserts highlighted intermediate reasoning before the final answer.
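A minimal sketch of how the two prompt formats could be assembled from exemplar triples. The `Exemplar` type, the function names, and the `Q:` / `A:` / "The answer is" templates are assumptions made for illustration; the exact prompt text used in the paper may differ.

```python
from typing import List, NamedTuple


class Exemplar(NamedTuple):
    """One <input, chain of thought, output> triple (field names are illustrative)."""
    question: str
    chain_of_thought: str
    answer: str


def format_exemplar(ex: Exemplar, use_cot: bool) -> str:
    # Standard prompting shows only the answer; CoT prompting places the
    # rationale before the final answer.
    if use_cot:
        return f"Q: {ex.question}\nA: {ex.chain_of_thought} The answer is {ex.answer}."
    return f"Q: {ex.question}\nA: The answer is {ex.answer}."


def build_prompt(exemplars: List[Exemplar], test_question: str, use_cot: bool = True) -> str:
    # Exemplars are concatenated, then the unanswered test question is appended
    # so the model continues from the trailing "A:".
    shots = [format_exemplar(ex, use_cot) for ex in exemplars]
    return "\n\n".join(shots + [f"Q: {test_question}\nA:"])


# Hypothetical exemplar, used only to show the two formats.
demo = Exemplar("What is 2 + 3?", "2 plus 3 is 5.", "5")
cot_prompt = build_prompt([demo], "What is 1 + 1?")
standard_prompt = build_prompt([demo], "What is 1 + 1?", use_cot=False)
```

Only the exemplar text differs between `cot_prompt` and `standard_prompt`; the model and the decoding procedure are the same in both cases.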

Loss / Objective

The paper keeps the pretrained next-token objective over the prompted sequence.

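$$p_{\theta}(\mathbf{y}, \mathbf{r} \mid \mathbf{x}, \mathcal{E}) = \prod_{t=1}^{T} p_{\theta}\!\left(s_t \mid s_{<t}, \mathbf{x}, \mathcal{E}\right)$$

Here $\mathbf{x}$ is the test input, $\mathcal{E}$ the exemplar set, $\mathbf{r}$ the rationale, $\mathbf{y}$ the answer, and $s_{1:T}$ the token sequence formed by $\mathbf{r}$ followed by $\mathbf{y}$.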
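As a small numerical illustration of the factorization, the joint probability of a generated rationale-plus-answer sequence is the product of its per-token conditional probabilities, usually accumulated in log space. The function name and the probabilities below are made up for illustration and do not come from the paper.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Sum of log p(s_t | s_<t, x, E) over the generated tokens."""
    for p in token_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("each token probability must lie in (0, 1]")
    return sum(math.log(p) for p in token_probs)


# Hypothetical conditional probabilities for a three-token continuation.
log_p = sequence_log_prob([0.9, 0.8, 0.5])
joint_p = math.exp(log_p)  # equals 0.9 * 0.8 * 0.5 up to floating-point error
```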

Sampling Rule

Inference generates a rationale followed by an answer under the CoT prompt.

$$(\hat{\mathbf{r}}, \hat{\mathbf{y}}) = \arg\max_{\mathbf{r},\mathbf{y}}\; p_{\theta}(\mathbf{y}, \mathbf{r} \mid \mathbf{x}, \mathcal{E}_{\mathrm{CoT}})$$
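Greedy decoding is one common way to approximate this maximization: it takes the most probable token at each step, which does not guarantee the globally most probable sequence. The sketch below uses a toy lookup table in place of a real language model; `greedy_decode`, `toy_model`, and the token strings are all hypothetical.

```python
from typing import Callable, Dict, List

NextTokenFn = Callable[[List[str]], Dict[str, float]]


def greedy_decode(next_token_probs: NextTokenFn, prompt: List[str],
                  eos: str = "<eos>", max_new_tokens: int = 32) -> List[str]:
    """Repeatedly append the single most probable next token until eos."""
    generated: List[str] = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(prompt + generated)
        token = max(probs, key=probs.get)
        if token == eos:
            break
        generated.append(token)
    return generated


def toy_model(context: List[str]) -> Dict[str, float]:
    # Stand-in for a language model: the distribution depends only on the
    # last token of the context. The numbers are invented.
    table = {
        "A:": {"5 + 6 = 11.": 0.7, "The answer is 11.": 0.3},
        "5 + 6 = 11.": {"The answer is 11.": 0.9, "<eos>": 0.1},
        "The answer is 11.": {"<eos>": 1.0},
    }
    return table[context[-1]]


output = greedy_decode(toy_model, ["Q: 5 + 6 = ?", "A:"])
```

With this toy table the decoder emits the rationale token first and the answer token second, mirroring the rationale-then-answer order of the CoT prompt.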

Training Procedure

No parameter updates.
Eight handwritten CoT exemplars for most tasks.
Four exemplars for AQuA.
Greedy decoding for main results.
LaMDA results averaged over 5 exemplar-order seeds.
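The seed-averaging step above can be sketched as follows. The `evaluate` callback is a hypothetical placeholder for running the frozen model on a benchmark with a given exemplar order, and the seed values are arbitrary.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def average_over_orders(exemplars: Sequence[str],
                        evaluate: Callable[[List[str]], float],
                        seeds: Sequence[int] = (0, 1, 2, 3, 4)) -> float:
    """Average a score over several random orderings of the same exemplars."""
    scores = []
    for seed in seeds:
        order = list(exemplars)
        # A per-seed generator keeps each shuffle reproducible and independent
        # of the global random state.
        random.Random(seed).shuffle(order)
        scores.append(evaluate(order))
    return mean(scores)
```

No parameters are updated anywhere in this loop; only the order of the exemplars in the prompt changes between runs.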

Evaluation

Datasets

Arithmetic: GSM8K, SVAMP, ASDiv, AQuA, MAWPS.
Commonsense: CSQA, StrategyQA, Date Understanding, Sports Understanding, SayCan.
Symbolic: Last Letter Concatenation, Coin Flip.

Metrics

Accuracy (%).
Solve rate (%) in summary plots.
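A minimal sketch of how accuracy might be computed from CoT generations, assuming each generation ends with a sentence of the form "The answer is X." and that answers are compared by exact string match. Both the marker string and the matching rule are assumptions; the paper's evaluation scripts may normalize answers differently.

```python
from typing import Optional, Sequence


def extract_answer(generation: str, marker: str = "The answer is") -> Optional[str]:
    """Return the text after the last marker, or None if the marker is absent."""
    _, found, tail = generation.rpartition(marker)
    if not found:
        return None
    # Keep only the first line after the marker and drop the trailing period.
    first_line = tail.strip().split("\n", 1)[0]
    return first_line.rstrip(".").strip()


def accuracy(generations: Sequence[str], golds: Sequence[str]) -> float:
    """Percentage of generations whose extracted answer equals the gold answer."""
    if not golds or len(generations) != len(golds):
        raise ValueError("generations and golds must be non-empty and equal in length")
    correct = sum(extract_answer(g) == gold for g, gold in zip(generations, golds))
    return 100.0 * correct / len(golds)


# Hypothetical model outputs and gold answers.
gens = [
    "5 + 6 = 11. The answer is 11.",
    "3 times 3 is 9. The answer is 8.",
    "The answer is yes.",
    "I am not sure.",
]
golds = ["11", "9", "yes", "4"]
score = accuracy(gens, golds)
```

A generation without the marker yields `None` and is counted as incorrect, so a fluent rationale with no final answer earns no credit.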

Headline results

GSM8K, PaLM 540B: 57% CoT vs 18% standard.
GSM8K, summary bar chart: 57% CoT vs 55% prior best vs 33% finetuned GPT-3.
Arithmetic, PaLM 540B: new state of the art on GSM8K, SVAMP, and MAWPS.
Commonsense and symbolic tasks: gains appear only at sufficient model scale.
Symbolic OOD length generalization: CoT improves extrapolation.

Results bar chart: GSM8K solve rate for finetuned GPT-3, prior best, PaLM 540B standard prompting, and PaLM 540B chain-of-thought prompting; the CoT bar is highest at 57%.

Ablations

Model scale: CoT helps only once models are large enough, at roughly 100B parameters in the paper's experiments; smaller models do not benefit.
Equation-only prompting: beats standard prompting, but trails full natural-language CoT.
Annotator variation: different rationale writers change scores, but CoT still wins.
Exemplar wording and order: prompt engineering matters, yet CoT keeps a large margin.

Method Strengths and Weaknesses

Strengths

No finetuning; only exemplar format changes.
GSM8K gain is large: 18% to 57% on PaLM 540B.
One prompt pattern transfers across arithmetic, commonsense, and symbolic tasks.
Generated rationales expose intermediate steps for debugging.

Weaknesses

Gains are strongly scale-dependent.
Performance depends on handcrafted rationale exemplars.
Prompt wording and exemplar order still affect results.
Rationales can sound plausible without guaranteeing faithful reasoning.

Suggestions from the authors

Study why CoT gains emerge only at large model scale.
Test whether generated chains are faithful, not just answer-correlated.
Improve robustness to exemplar choice, wording, and annotator style.
Extend CoT prompting to broader reasoning tasks.
