Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang

2022 · NeurIPS

Problem

Framing

Standard few-shot prompting underuses large language models on multi-step reasoning. The paper narrows this gap by inserting natural-language intermediate steps into the few-shot exemplars. On GSM8K, PaLM 540B's solve rate rises from 18% with standard prompting to 57% with chain-of-thought prompting.

Currently Used Methods

Foundational

Proposed Method

Architecture

The method changes the prompt, not the model. Each exemplar is a triple $\langle \text{input}, \text{chain of thought}, \text{output} \rangle$, and decoding stays autoregressive. The figure contrasts the two prompt formats side by side: direct answer prediction versus rationale-then-answer prompting.

Prompt-format figure: standard prompting answers directly, while chain-of-thought prompting inserts highlighted intermediate reasoning before the final answer.
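
The two prompt formats can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Exemplar` class, the `format_standard` and `format_cot` helpers, the `Q:`/`A:` layout, and the sample arithmetic question are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One few-shot exemplar: <input, chain of thought, output>."""
    question: str
    rationale: str
    answer: str


def format_standard(exemplars: List[Exemplar], question: str) -> str:
    """Standard prompting: each exemplar shows only the final answer."""
    parts = [f"Q: {e.question}\nA: The answer is {e.answer}." for e in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def format_cot(exemplars: List[Exemplar], question: str) -> str:
    """Chain-of-thought prompting: the rationale precedes the final answer."""
    parts = [
        f"Q: {e.question}\nA: {e.rationale} The answer is {e.answer}."
        for e in exemplars
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Illustrative exemplar (assumed for this sketch).
demo = [
    Exemplar(
        question="Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
                 "How many tennis balls does he have now?",
        rationale="Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
                  "5 + 6 = 11.",
        answer="11",
    )
]
new_question = "A baker has 4 trays of 6 rolls and sells 5 rolls. How many rolls are left?"

standard_prompt = format_standard(demo, new_question)
cot_prompt = format_cot(demo, new_question)
```

Both prompts end with an open `A:` for the model to complete. The only difference is that the chain-of-thought exemplar demonstrates the intermediate steps, which is what encourages the model to produce its own rationale before answering.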

Loss / Objective

The paper keeps the pretrained next-token objective over the prompted sequence.

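$$p_{\theta}(\mathbf{y}, \mathbf{r} \mid \mathbf{x}, \mathcal{E}) = \prod_{t=1}^{T} p_{\theta}\!\left(s_t \mid s_{<t}, \mathbf{x}, \mathcal{E}\right)$$

Here $s_1, \dots, s_T$ are the tokens of the rationale $\mathbf{r}$ followed by the answer $\mathbf{y}$, $\mathbf{x}$ is the test input, and $\mathcal{E}$ is the set of few-shot exemplars.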
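
The factorization above is the chain rule over tokens, so the joint probability of a rationale and answer is the product of per-token conditionals (or, in log space, their sum). A small numeric sketch, with made-up per-token probabilities and a hypothetical helper `sequence_log_prob`:

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Sum of log p(s_t | s_<t, x, E) over all generated tokens."""
    return sum(math.log(p) for p in token_probs)


# Made-up conditionals for a 4-token rationale followed by a 1-token answer.
rationale_probs = [0.9, 0.8, 0.95, 0.7]
answer_probs = [0.6]

log_p = sequence_log_prob(rationale_probs + answer_probs)
joint_p = math.exp(log_p)  # equals the plain product of the five conditionals
```

Working in log space avoids numerical underflow when the rationale runs to many tokens.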

Sampling Rule

Inference generates a rationale followed by an answer under the CoT prompt.

$$(\hat{\mathbf{r}}, \hat{\mathbf{y}}) = \arg\max_{\mathbf{r},\mathbf{y}}\; p_{\theta}(\mathbf{y}, \mathbf{r} \mid \mathbf{x}, \mathcal{E}_{\mathrm{CoT}})$$
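
An exact argmax over all rationale-answer sequences is intractable, so in practice it is approximated token by token; greedy decoding is one common choice. The sketch below assumes a hypothetical `next_token_dist` callable standing in for the language model, a scripted `toy_model`, and a "The answer is N." output convention. None of these names come from the paper.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical model interface: maps the context so far to a next-token distribution.
NextTokenDist = Callable[[List[str]], Dict[str, float]]


def greedy_decode(next_token_dist: NextTokenDist, prefix: List[str],
                  eos: str = "<eos>", max_steps: int = 50) -> List[str]:
    """Approximate the argmax by picking the most probable token at each step."""
    tokens: List[str] = []
    for _ in range(max_steps):
        dist = next_token_dist(prefix + tokens)
        token = max(dist, key=dist.get)
        if token == eos:
            break
        tokens.append(token)
    return tokens


def extract_answer(text: str) -> Optional[str]:
    """Pull the final numeric answer out of a generated rationale, if present."""
    match = re.search(r"The answer is\s+(-?\d[\d,]*(?:\.\d+)?)", text)
    return match.group(1).replace(",", "") if match else None


# Toy stand-in for a language model: it favors the next token of a fixed script.
SCRIPT = "5 + 6 = 11. The answer is 11. <eos>".split()


def toy_model(context: List[str]) -> Dict[str, float]:
    step = len(context) - 1  # the context holds one prompt token plus generated tokens
    target = SCRIPT[min(step, len(SCRIPT) - 1)]
    return {target: 0.9, "<other>": 0.1}


generated = " ".join(greedy_decode(toy_model, ["<prompt>"]))
predicted = extract_answer(generated)
```

Only the extracted answer is compared with the reference when scoring; the rationale is generated as an intermediate output and is not graded directly in this sketch.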

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Results bar chart: GSM8K solve rate for finetuned GPT-3, prior best, PaLM 540B standard prompting, and PaLM 540B chain-of-thought prompting; the CoT bar is highest at 57%.

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers