This article covers:

- SGD (Stochastic Gradient Descent)
- Mini-batch SGD
- The Adam optimizer
- The AdamW optimizer
- How these optimizers relate to and differ from one another
- Computational intuition, the mathematical formulas, and practical experience
# Comprehensive Understanding of Deep Learning Optimizers: SGD, Mini-batch SGD, Adam, AdamW
In deep learning training, the optimizer is the core component that controls how model parameters are updated.
Different optimizers can greatly affect the model's learning speed, stability, and final performance.
This article will delve into:

- What is SGD?
- Why is mini-batch the standard in deep learning?
- How does Adam combine Momentum and RMSProp?
- Why is AdamW currently the most mainstream optimizer for large models?
(One point to keep in mind throughout: the gradient used in SGD and mini-batch SGD is the average of the per-sample gradients within the current batch.)
## 1. SGD: Stochastic Gradient Descent

### 1.1 Basic Idea
Standard Gradient Descent (Full-batch Gradient Descent) computes the gradient over the entire dataset for every update:

$$
\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L(x_i, y_i; \theta_t)
$$
The drawbacks are very obvious:

- Each update requires traversing the entire dataset (too slow)
- Not feasible for large datasets

Thus, we introduce Stochastic Gradient Descent (SGD).
The idea of SGD:
Update parameters using the gradient of only one sample at a time, rather than all data.
Update method (with sample $i$ drawn at random at each step):

$$
\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta L(x_i, y_i; \theta_t)
$$
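As a concrete illustration, here is a minimal NumPy sketch of single-sample SGD fitting a one-parameter linear model (the data, learning rate, and epoch count are made up purely for illustration):

```python
import numpy as np

# Toy data: y ≈ 3x + noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

w, lr = 0.0, 0.01                       # single weight and learning rate
for epoch in range(5):
    for i in rng.permutation(len(X)):   # visit samples one at a time, in random order
        pred = w * X[i, 0]
        grad = (pred - y[i]) * X[i, 0]  # d/dw of 0.5 * (pred - y)^2
        w -= lr * grad                  # SGD update: w <- w - lr * grad
print(w)  # should end up close to 3.0
```

Note how each update uses exactly one sample, which is precisely what makes the trajectory noisy.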
### 1.2 Advantages of SGD

- Extremely fast updates (no need to traverse the entire dataset for each step)
- High noise, which may help the model escape local optima
### 1.3 Disadvantages of SGD

- Excessive noise, severe gradient jitter
- Looking at only one sample per update makes the optimization trajectory very unstable
- Not well suited to deep networks and large models

Therefore, what is far more commonly used now is Mini-batch SGD.
## 2. Mini-batch SGD: The Default Method in Modern Deep Learning
In actual training, we do not use batch=1 (SGD), nor do we use full batch, but rather:
Use a small batch of samples (batch_size = 32/64/128...) to estimate the gradient each time.
Mini-batch gradient (for a batch $B_t$ with $|B_t|$ samples):

$$
g_t = \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta L(x_i, y_i; \theta_t)
$$

Parameter update:

$$
\theta_{t+1} = \theta_t - \eta \cdot g_t
$$
### 2.1 Why is Mini-batch the Standard Choice?

Mini-batch is a practical compromise between the two extremes:
| Method | Noise | Speed | Stability | Practical Use |
|---|---|---|---|---|
| SGD (batch=1) | Too much noise | Fast | Poor | Rarely used |
| Full-batch GD | No noise | Slow | Good | Not feasible |
| Mini-batch SGD | Moderate noise | Fast | Stable | Most commonly used |
Advantages of mini-batch:

- Fully utilizes GPU parallelism
- Has a moderate amount of gradient noise, which helps generalization
- Training is stable, and updates are fast
Thus, in deep learning, when mentioning "SGD," it usually refers to:
Mini-batch SGD, rather than the original SGD with batch=1.
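To make this concrete, here is a minimal PyTorch sketch of a mini-batch SGD training loop (the model, data, and hyperparameters are placeholders, not recommendations):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model (illustrative only)
X = torch.randn(1024, 20)
y = torch.randn(1024, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # plain mini-batch SGD
loss_fn = nn.MSELoss()

for epoch in range(3):
    for xb, yb in loader:              # one mini-batch per update
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)  # loss averaged over the batch
        loss.backward()                # gradient = batch-average gradient g_t
        optimizer.step()               # theta <- theta - lr * g_t
```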
## 3. Adam: Adaptive Moment Estimation Optimizer

Adam can be seen as a combination of the following two ideas:

- Momentum: smooths the gradient direction
- RMSProp: a per-dimension adaptive learning rate
### 3.1 Core Idea of Adam

Adam maintains two running statistics for each parameter:

- First moment (momentum): an exponential moving average of the gradients

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
$$

Key points:

- $g_t$: gradient of the current batch
- $m_{t-1}$: the "compressed memory" of all previous batch gradients
Why is $\beta_1$ set so large (typically around 0.9)?
1️⃣ Because the mini-batch gradient itself is "very noisy."
Recall that:

- Each $g_t$ is just the gradient of one mini-batch
- The gradient direction can fluctuate between different batches

If we trust the current batch too much:

- Parameters will "jump left and right"
- Convergence will be slow or even unstable
So we need to:
"Listen less to the current batch and more to the historical trend."
This is the fundamental reason for setting β₁ to be large.
2️⃣ What does β₁ = 0.9 actually mean?

With $\beta_1 = 0.9$, the first-moment update becomes:

$$
m_t = 0.9\, m_{t-1} + 0.1\, g_t
$$

Explanation:

- The gradient of the current batch accounts for only 10%
- The historical trend accounts for 90%
But note:
This 90% is not "the last batch," but rather the comprehensive result of all historical batches.
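Unrolling the recursion (assuming $m_0 = 0$, as in the standard Adam initialization) makes this explicit: every past gradient is still present, just with an exponentially decaying weight.

$$
m_t = 0.1\, g_t + 0.1 \cdot 0.9\, g_{t-1} + 0.1 \cdot 0.9^2\, g_{t-2} + \cdots = 0.1 \sum_{k=0}^{t-1} 0.9^{k}\, g_{t-k}
$$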
3️⃣ What is the "memory length"?
Exponential averaging has an intuitive "effective memory length" of roughly:

$$
\text{effective memory length} \approx \frac{1}{1 - \beta_1}
$$
So:

- β₁ = 0.9 → remembers about 10 batches
- β₁ = 0.99 → remembers about 100 batches
- β₁ = 0.5 → remembers about 2 batches
This also explains:
The larger β₁ is, the longer the memory, and the smoother the updates;
The smaller β₁ is, the faster the response, but noisier.
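A quick numerical sketch of this trade-off (the "gradients" here are synthetic values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = 1.0 + 0.5 * rng.normal(size=200)   # noisy gradients fluctuating around 1.0

def ema(values, beta):
    """Exponential moving average: m_t = beta * m_{t-1} + (1 - beta) * g_t, with m_0 = 0."""
    m, out = 0.0, []
    for g in values:
        m = beta * m + (1 - beta) * g
        out.append(m)
    return np.array(out)

long_memory = ema(grads, beta=0.9)    # ~10-step memory: smooth, tracks the trend
short_memory = ema(grads, beta=0.5)   # ~2-step memory: reacts fast, stays noisy
print(long_memory[-5:].round(3))
print(short_memory[-5:].round(3))
```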
- Second moment: an exponential moving average of squared gradients

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
$$

Then perform bias correction (because $m_0 = v_0 = 0$ biases the estimates toward zero early in training):

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

Final update:

$$
\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$
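Putting the formulas together, here is a minimal NumPy sketch of one Adam step (a didactic re-implementation, not the source of any library optimizer):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; theta, grad, m, v are arrays of the same shape, t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: m and v are running state carried across steps
theta = np.zeros(3)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 4):
    grad = np.array([0.1, -0.2, 0.3])         # placeholder gradient for this batch
    theta, m, v = adam_step(theta, grad, m, v, t)
```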
### 3.2 Intuitive Understanding of Adam

- $\hat{m}_t$: smooths the gradient, making the optimization direction more stable
- $\hat{v}_t$: measures the magnitude of the gradient; large gradients → small step size, small gradients → large step size
- Each parameter dimension gets its own adaptive learning rate
### 3.3 Advantages of Adam

- Fast convergence (especially in the early stages of training)
- Relatively insensitive to the learning rate
- Adapts to the gradient scale, making it well suited to deep networks
- The workhorse for training Transformers and other large models in NLP
### 3.4 Disadvantages of Adam

- Generalization is sometimes not as good as SGD + Momentum
- The original Adam's L2 regularization is not strictly equivalent to weight decay

The second point is what leads to AdamW.
## 4. AdamW: The Standard Optimizer for Large Models such as Transformers
AdamW = Adam + Correct Weight Decay
In the original Adam, L2 regularization is implemented by adding $\lambda \theta$ to the gradient, so the decay term gets rescaled by the adaptive learning rate and becomes coupled with the moment estimates; parameters with a large gradient history end up being regularized less, which hurts performance.

AdamW instead applies weight decay directly to the parameters (decoupled weight decay):

$$
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)
$$

Thus:

- Weight decay only shrinks the parameter values and no longer distorts the gradient moment estimates
- The method is more theoretically sound and more stable in practice
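In most frameworks, switching to decoupled weight decay is a one-line change; for example, in PyTorch (learning rate and decay values below are placeholders):

```python
import torch
from torch import nn

model = nn.Linear(20, 1)  # placeholder model

# Adam + weight_decay: the decay term is added to the gradient (coupled L2 regularization)
adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.01)

# AdamW: the decay is applied directly to the weights, as in the update rule above
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```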
### AdamW is the default optimizer for the following models:

- BERT
- GPT-2 / GPT-3 / GPT-4
- PaLM / LLaMA
- Megatron-LM
- Stable Diffusion
Almost all modern large models are trained using AdamW.
## 5. Summary Comparison of SGD vs Mini-batch SGD vs Adam vs AdamW

| Optimizer | Adaptive LR | Smoothed Direction (Momentum) | Decoupled Weight Decay | Convergence Speed | Generalization | Typical Use |
|---|---|---|---|---|---|---|
| SGD | No | No | No | Slow | Strong (image tasks) | CNN, small models |
| SGD+Momentum | No | Yes | No | Moderate | Very strong | Most traditional tasks |
| Mini-batch SGD | No | No | No | Moderate | Strong | All deep learning |
| Adam | Yes | Yes | No | Fast | Average | NLP, large models |
| AdamW | Yes | Yes | Yes | Fast and Stable | Strong | Standard for Transformers and large models |
## 6. Final Summary (One-sentence Memory Version)

- SGD: updates parameters using one sample at a time, but is too noisy
- Mini-batch SGD: updates using the average gradient of a batch of samples; the standard for training
- Adam: momentum + adaptive learning rate; fast and stable training
- AdamW: fixes Adam's weight decay; the default optimizer for Transformers/GPT