Adam: A Method for Stochastic Optimization

Diederik P. Kingma, Jimmy Ba

2014 · ICLR

Problem

Framing

Earlier stochastic optimizers tended to offer either adaptivity to sparse gradients (AdaGrad) or robustness to non-stationary objectives (RMSProp), but not both. Adam closes that gap by tracking bias-corrected estimates of the first and second moments of the gradient, giving low-memory, per-parameter step sizes that work across convex problems and deep non-convex training.

Currently Used Methods

Foundational

Proposed Method

Architecture

Adam is an optimizer, not a network architecture. It stores two state tensors per parameter, the first moment $\mathbf{m}_t$ and the second moment $\mathbf{v}_t$, then applies a bias-corrected normalized step.

Loss / Objective

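The method minimizes the expected value of a stochastic objective using first-order gradients only. At each step it updates exponential moving averages of the gradient (first moment) and of its elementwise square (second moment).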

$$
\mathbf{g}_t = \nabla_{\boldsymbol{\theta}} f_t(\boldsymbol{\theta}_{t-1}), \qquad \mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2
$$
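
To see what these recursions compute, they can be unrolled. The sketch below assumes zero initialization ($\mathbf{m}_0 = \mathbf{0}$, as in the paper) and, for the expectation step only, the simplifying assumption that the gradient has the same mean $\mathbb{E}[\mathbf{g}]$ at every step:

$$
\mathbf{m}_t = (1-\beta_1)\sum_{i=1}^{t} \beta_1^{\,t-i}\,\mathbf{g}_i, \qquad \mathbb{E}[\mathbf{m}_t] = (1-\beta_1)\sum_{i=1}^{t} \beta_1^{\,t-i}\,\mathbb{E}[\mathbf{g}] = (1-\beta_1^t)\,\mathbb{E}[\mathbf{g}]
$$

The same argument with $\beta_2$ and $\mathbf{g}_t^2$ gives $\mathbb{E}[\mathbf{v}_t] = (1-\beta_2^t)\,\mathbb{E}[\mathbf{g}^2]$ under the analogous assumption. Both averages are therefore shrunk toward zero by a factor that matters most when $t$ is small.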

Algorithm

Because the moment estimates are initialized at zero, they are biased toward zero during the first steps. Bias correction divides out this shrinkage before the parameter update.

$$
\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}, \qquad \boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t}+\epsilon}
$$
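
The moment updates and the bias-corrected step can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function names `adam_step` and `minimize_quadratic` and the toy quadratic objective are assumptions made for the example. The default hyperparameters (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8) are the values the paper suggests.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step counter."""
    # Exponential moving averages of the gradient and its elementwise square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moment estimates.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Per-parameter normalized step.
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


def minimize_quadratic(steps=2000, alpha=0.01):
    """Hypothetical toy problem: minimize 0.5 * ||theta - target||^2."""
    target = np.array([1.0, -2.0])
    theta = np.zeros(2)
    m = np.zeros_like(theta)  # first-moment state, one entry per parameter
    v = np.zeros_like(theta)  # second-moment state, one entry per parameter
    for t in range(1, steps + 1):
        grad = theta - target  # exact gradient of the toy objective
        theta, m, v = adam_step(theta, grad, m, v, t, alpha=alpha)
    return theta, target
```

At t = 1 the corrected ratio reduces to roughly the sign of the gradient, so each parameter moves by about α regardless of the gradient's scale. This is one way to see why the step size is insensitive to gradient rescaling.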

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Results plots: logistic-regression training curves on MNIST and IMDB. Adam reaches low training cost quickly, avoiding AdaGrad's slower convergence on MNIST and the poor behavior of SGD with Nesterov momentum (SGDNesterov) on IMDB.

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

No prior vault papers identified yet.

Further Papers