Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter

2019 · ICLR

Problem

Framing

Adam is typically used with coupled $L_2$ regularization, which is not equivalent to weight decay under adaptive preconditioning. The paper fixes this by decoupling parameter shrinkage from the gradient step, yielding AdamW and SGDW, which are easier to tune and give about 15% relative test-error improvement on image classification.

Currently Used Methods

Foundational

Proposed Method

Architecture

The contribution is optimizer-side. AdamW keeps Adam's moment estimates and applies decay as a separate parameter shrinkage term. Experiments use Shake-Shake ResNets, mainly 26 2x64d and 26 2x96d.

Figure: heatmaps comparing Adam, SGDW, and AdamW across learning-rate and weight-decay settings on CIFAR-10 after 100 epochs; the right-column methods show broader low-error basins.

Loss / Objective

The key change is to keep weight decay out of the loss gradient and apply it as a separate shrinkage of the parameters.

$$
\boldsymbol{\theta}_{t+1} = (1-\lambda)\, \boldsymbol{\theta}_t - \alpha \, \nabla f_t(\boldsymbol{\theta}_t)
$$
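For plain SGD, this update can be reproduced by an $L_2$ penalty only if the penalty coefficient is rescaled by the learning rate. The following is a minimal sketch in pure Python on a toy quadratic loss; the function names and the toy loss are illustrative assumptions, not code from the paper.

```python
def sgd_decoupled_step(theta, grad, alpha, lam):
    """theta_{t+1} = (1 - lam) * theta_t - alpha * grad, elementwise."""
    return [(1.0 - lam) * th - alpha * g for th, g in zip(theta, grad)]


def sgd_l2_step(theta, grad, alpha, lam_l2):
    """SGD on f(theta) + (lam_l2 / 2) * ||theta||^2: the penalty enters the gradient."""
    return [th - alpha * (g + lam_l2 * th) for th, g in zip(theta, grad)]


def toy_grad(theta):
    # Gradient of the illustrative loss 0.5 * sum((theta_i - 1)^2).
    return [th - 1.0 for th in theta]


theta = [2.0, -3.0]
alpha, lam = 0.1, 0.01

decoupled = sgd_decoupled_step(theta, toy_grad(theta), alpha, lam)
# The two steps agree only when the L2 coefficient is set to lam / alpha.
coupled = sgd_l2_step(theta, toy_grad(theta), alpha, lam / alpha)
```

Because the match requires `lam_l2 = lam / alpha`, the best $L_2$ coefficient under SGD is tied to the learning rate, whereas the decoupled form lets the two be set separately.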

Algorithm

AdamW preserves Adam's adaptive step and decouples shrinkage from the gradient term.

$$
\begin{aligned}
\mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \, \nabla f_t(\boldsymbol{\theta}_{t-1}), \\
\mathbf{v}_t &= \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) \, \nabla f_t(\boldsymbol{\theta}_{t-1})^{\odot 2}, \\
\hat{\mathbf{m}}_t &= \mathbf{m}_t / (1-\beta_1^t), \qquad \hat{\mathbf{v}}_t = \mathbf{v}_t / (1-\beta_2^t), \\
\boldsymbol{\theta}_t &= \boldsymbol{\theta}_{t-1} - \alpha_t \left( \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t}+\epsilon} + \lambda \, \boldsymbol{\theta}_{t-1} \right)
\end{aligned}
$$
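The update above can be sketched in a few lines of pure Python. This is an illustrative single-step implementation on lists of floats, not the authors' code; the function name `adam_step`, its default hyperparameters, and the `decoupled` flag (which switches to Adam with an $L_2$ penalty for comparison) are assumptions made for this example.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, lam=0.0, decoupled=True):
    """One Adam update on lists of floats; returns (theta, m, v).

    decoupled=True  -> AdamW: lam * theta is added outside the adaptive ratio.
    decoupled=False -> Adam with L2: lam * theta is folded into the gradient.
    """
    if not decoupled:
        grad = [g + lam * th for g, th in zip(grad, theta)]
    # Exponential moving averages of the gradient and its elementwise square.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    new_theta = []
    for th, mi, vi in zip(theta, m, v):
        m_hat = mi / (1 - beta1 ** t)  # bias correction, t starts at 1
        v_hat = vi / (1 - beta2 ** t)
        update = m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            update += lam * th
        new_theta.append(th - alpha * update)
    return new_theta, m, v


# Toy comparison with a zero loss gradient, so only the decay term acts.
w_adamw, _, _ = adam_step([1.0], [0.0], [0.0], [0.0], t=1,
                          alpha=0.1, lam=0.1, decoupled=True)
w_l2, _, _ = adam_step([1.0], [0.0], [0.0], [0.0], t=1,
                       alpha=0.1, lam=0.1, decoupled=False)
```

In this toy call the decoupled variant shrinks the weight by the factor `1 - alpha * lam`. The $L_2$ variant moves it by roughly `alpha` regardless of `lam`, because on the first step the penalty gradient is normalized by its own magnitude.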

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Six heatmaps of CIFAR-10 test error after 100 epochs: top row Adam, bottom row AdamW, each under fixed, step-drop, and cosine schedules. AdamW has a broader low-error region, especially with cosine annealing.
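The three schedule families named above can be expressed as multipliers on the base step size. The sketch below is a hedged illustration in pure Python: the function names are hypothetical, and the drop epochs and drop factor are placeholder assumptions rather than values confirmed by this note.

```python
import math


def fixed_multiplier(epoch, total_epochs):
    # Constant learning rate: the multiplier never changes.
    return 1.0


def step_drop_multiplier(epoch, total_epochs, drop_epochs=(30, 60, 80), factor=0.1):
    # Multiply by `factor` once for every drop epoch already reached.
    # drop_epochs and factor are illustrative defaults (assumption).
    return factor ** sum(epoch >= e for e in drop_epochs)


def cosine_multiplier(epoch, total_epochs):
    # Decays smoothly from 1 at epoch 0 to 0 at total_epochs.
    return 0.5 + 0.5 * math.cos(math.pi * epoch / total_epochs)


# Multipliers at a few checkpoints of a 100-epoch run.
schedule_table = {
    name: [fn(e, 100) for e in (0, 30, 50, 100)]
    for name, fn in (("fixed", fixed_multiplier),
                     ("step", step_drop_multiplier),
                     ("cosine", cosine_multiplier))
}
```

With a decoupled optimizer the multiplier would scale the whole step, so the decay term shrinks along with the gradient term as training proceeds.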

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers

No vault papers identified as further work yet.