This article covers:

- SGD (Stochastic Gradient Descent)
- Mini-batch SGD
- The Adam optimizer
- The AdamW optimizer
- How these optimizers relate to and differ from one another
- Computational intuition, the mathematical formulas, and practical experience
# Comprehensive Understanding of Deep Learning Optimizers: SGD, Mini-batch SGD, Adam, AdamW
In deep learning training, the optimizer is the core component that controls how model parameters are updated.
Different optimizers can greatly affect the model's learning speed, stability, and final performance.
This article will delve into:

- What is SGD?
- Why is mini-batch the standard in deep learning?
- How does Adam combine Momentum and RMSProp?
- Why is AdamW currently the most mainstream optimizer for large models?
(One point to keep in mind throughout: the gradient used in SGD and mini-batch SGD is the average of the per-sample gradients within the current batch.)
## 1. SGD: Stochastic Gradient Descent

### 1.1 Basic Idea
Standard Gradient Descent (Full-batch Gradient Descent) computes the gradient over the entire dataset for every update:

$$
\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L(x_i, y_i; \theta_t)
$$
The drawbacks are very obvious:

- Each update requires traversing the entire dataset (too slow)
- Not feasible for large datasets

Thus, we introduce Stochastic Gradient Descent (SGD).
The idea of SGD:
Update parameters using the gradient of only one sample at a time, rather than all data.
Update method (with sample $i$ drawn at random at each step):

$$
\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta L(x_i, y_i; \theta_t)
$$
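As a concrete illustration, here is a minimal NumPy sketch of single-sample SGD fitting a one-parameter linear model (the data, learning rate, and epoch count are made up purely for illustration):

```python
import numpy as np

# Toy data: y ≈ 3x + noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

w, lr = 0.0, 0.01                       # single weight and learning rate
for epoch in range(5):
    for i in rng.permutation(len(X)):   # visit samples one at a time, in random order
        pred = w * X[i, 0]
        grad = (pred - y[i]) * X[i, 0]  # d/dw of 0.5 * (pred - y)^2
        w -= lr * grad                  # SGD update: w <- w - lr * grad
print(w)  # should end up close to 3.0
```

Note how each update uses exactly one sample, which is precisely what makes the trajectory noisy.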
### 1.2 Advantages of SGD

- Extremely fast updates (no need to traverse the entire dataset for each step)
- High noise, which may help the model escape local optima
### 1.3 Disadvantages of SGD

- Excessive noise, severe gradient jitter
- Looking at only one sample per update makes the optimization trajectory very unstable
- Not well suited to deep networks and large models

Therefore, what is far more commonly used now is Mini-batch SGD.
## 2. Mini-batch SGD: The Default Method in Modern Deep Learning
In actual training, we do not use batch=1 (SGD), nor do we use full batch, but rather:
Use a small batch of samples (batch_size = 32/64/128...) to estimate the gradient each time.
Mini-batch gradient (for a batch $B_t$ with $|B_t|$ samples):

$$
g_t = \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta L(x_i, y_i; \theta_t)
$$

Parameter update:

$$
\theta_{t+1} = \theta_t - \eta \cdot g_t
$$
### 2.1 Why is Mini-batch the Standard Choice?

Mini-batch is a practical compromise between the two extremes:
| Method | Noise | Speed | Stability | Practical Use |
|---|---|---|---|---|
| SGD (batch=1) | Too much noise | Fast | Poor | Rarely used |
| Full-batch GD | No noise | Slow | Good | Not feasible |
| Mini-batch SGD | Moderate noise | Fast | Stable | Most commonly used |
Advantages of mini-batch:

- Fully utilizes GPU parallelism
- Has a moderate amount of gradient noise, which helps generalization
- Training is stable, and updates are fast
Thus, in deep learning, when mentioning "SGD," it usually refers to:
Mini-batch SGD, rather than the original SGD with batch=1.
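To make this concrete, here is a minimal PyTorch sketch of a mini-batch SGD training loop (the model, data, and hyperparameters are placeholders, not recommendations):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model (illustrative only)
X = torch.randn(1024, 20)
y = torch.randn(1024, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # plain mini-batch SGD
loss_fn = nn.MSELoss()

for epoch in range(3):
    for xb, yb in loader:              # one mini-batch per update
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)  # loss averaged over the batch
        loss.backward()                # gradient = batch-average gradient g_t
        optimizer.step()               # theta <- theta - lr * g_t
```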
## 3. Adam: Adaptive Moment Estimation Optimizer

Adam can be seen as a combination of the following two ideas:

- Momentum: smooths the gradient direction
- RMSProp: a per-dimension adaptive learning rate
### 3.1 Core Idea of Adam

Adam maintains two running statistics for each parameter:

- First moment (momentum): an exponential moving average of the gradients

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
$$

Key points:

- $g_t$: gradient of the current batch
- $m_{t-1}$: the "compressed memory" of all previous batch gradients
Why is $\beta_1$ set so large (typically around 0.9)?
1️⃣ Because the mini-batch gradient itself is "very noisy."
Recall that:

- Each $g_t$ is just the gradient of one mini-batch
- The gradient direction can fluctuate between different batches

If we trust the current batch too much:

- Parameters will "jump left and right"
- Convergence will be slow or even unstable
So we need to:
"Listen less to the current batch and more to the historical trend."
This is the fundamental reason for setting β₁ to be large.
2️⃣ What does β₁ = 0.9 actually mean?

With $\beta_1 = 0.9$, the first-moment update becomes:

$$
m_t = 0.9\, m_{t-1} + 0.1\, g_t
$$

Explanation:

- The gradient of the current batch accounts for only 10%
- The historical trend accounts for 90%
But note:
This 90% is not "the last batch," but rather the comprehensive result of all historical batches.
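Unrolling the recursion (assuming $m_0 = 0$, as in the standard Adam initialization) makes this explicit: every past gradient is still present, just with an exponentially decaying weight.

$$
m_t = 0.1\, g_t + 0.1 \cdot 0.9\, g_{t-1} + 0.1 \cdot 0.9^2\, g_{t-2} + \cdots = 0.1 \sum_{k=0}^{t-1} 0.9^{k}\, g_{t-k}
$$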
3️⃣ What is the "memory length"?
Exponential averaging has an intuitive "effective memory length" of roughly:

$$
\text{effective memory length} \approx \frac{1}{1 - \beta_1}
$$
So:

- β₁ = 0.9 → remembers about 10 batches
- β₁ = 0.99 → remembers about 100 batches
- β₁ = 0.5 → remembers about 2 batches
This also explains:
The larger β₁ is, the longer the memory, and the smoother the updates;
The smaller β₁ is, the faster the response, but noisier.
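A quick numerical sketch of this trade-off (the "gradients" here are synthetic values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = 1.0 + 0.5 * rng.normal(size=200)   # noisy gradients fluctuating around 1.0

def ema(values, beta):
    """Exponential moving average: m_t = beta * m_{t-1} + (1 - beta) * g_t, with m_0 = 0."""
    m, out = 0.0, []
    for g in values:
        m = beta * m + (1 - beta) * g
        out.append(m)
    return np.array(out)

long_memory = ema(grads, beta=0.9)    # ~10-step memory: smooth, tracks the trend
short_memory = ema(grads, beta=0.5)   # ~2-step memory: reacts fast, stays noisy
print(long_memory[-5:].round(3))
print(short_memory[-5:].round(3))
```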
- Second moment: an exponential moving average of squared gradients

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
$$

Then perform bias correction (because $m_0 = v_0 = 0$ biases the estimates toward zero early in training):

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

Final update:

$$
\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$
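Putting the formulas together, here is a minimal NumPy sketch of one Adam step (a didactic re-implementation, not the source of any library optimizer):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; theta, grad, m, v are arrays of the same shape, t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: m and v are running state carried across steps
theta = np.zeros(3)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 4):
    grad = np.array([0.1, -0.2, 0.3])         # placeholder gradient for this batch
    theta, m, v = adam_step(theta, grad, m, v, t)
```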
### 3.2 Intuitive Understanding of Adam

- $\hat{m}_t$: smooths the gradient, making the optimization direction more stable
- $\hat{v}_t$: measures the magnitude of the gradient; large gradients → small step size, small gradients → large step size
- Each parameter dimension gets its own adaptive learning rate
### 3.3 Advantages of Adam

- Fast convergence (especially in the early stages of training)
- Relatively insensitive to the learning rate
- Adapts to the gradient scale, making it well suited to deep networks
- The workhorse for training Transformers and other large models in NLP
### 3.4 Disadvantages of Adam

- Generalization is sometimes not as good as SGD + Momentum
- The original Adam's L2 regularization is not strictly equivalent to weight decay

The second point is what leads to AdamW.
## 4. AdamW: The Standard Optimizer for Large Models such as Transformers
AdamW = Adam + Correct Weight Decay
In the original Adam, L2 regularization is implemented by adding $\lambda \theta$ to the gradient, so the decay term gets rescaled by the adaptive learning rate and becomes coupled with the moment estimates; parameters with a large gradient history end up being regularized less, which hurts performance.

AdamW instead applies weight decay directly to the parameters (decoupled weight decay):

$$
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)
$$

Thus:

- Weight decay only shrinks the parameter values and no longer distorts the gradient moment estimates
- The method is more theoretically sound and more stable in practice
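In most frameworks, switching to decoupled weight decay is a one-line change; for example, in PyTorch (learning rate and decay values below are placeholders):

```python
import torch
from torch import nn

model = nn.Linear(20, 1)  # placeholder model

# Adam + weight_decay: the decay term is added to the gradient (coupled L2 regularization)
adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.01)

# AdamW: the decay is applied directly to the weights, as in the update rule above
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```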
### AdamW is the default optimizer for the following models:

- BERT
- GPT-2 / GPT-3 / GPT-4
- PaLM / LLaMA
- Megatron-LM
- Stable Diffusion
Almost all modern large models are trained using AdamW.
## 5. Summary Comparison of SGD vs Mini-batch SGD vs Adam vs AdamW

| Optimizer | Adaptive LR | Smoothed Direction (Momentum) | Decoupled Weight Decay | Convergence Speed | Generalization | Typical Use |
|---|---|---|---|---|---|---|
| SGD | No | No | No | Slow | Strong (image tasks) | CNN, small models |
| SGD+Momentum | No | Yes | No | Moderate | Very strong | Most traditional tasks |
| Mini-batch SGD | No | No | No | Moderate | Strong | All deep learning |
| Adam | Yes | Yes | No | Fast | Average | NLP, large models |
| AdamW | Yes | Yes | Yes | Fast and Stable | Strong | Standard for Transformers and large models |
## 6. Final Summary (One-sentence Memory Version)

- SGD: updates parameters using one sample at a time, but is too noisy
- Mini-batch SGD: updates using the average gradient of a batch of samples; the standard for training
- Adam: momentum + adaptive learning rate; fast and stable training
- AdamW: fixes Adam's weight decay; the default optimizer for Transformers/GPT