
PyTorch Learning Record

Recently I've been watching Li Mu's hands-on deep learning course; here I record the important points so the knowledge doesn't just drift by, and to consolidate it.

1. PyTorch's backward() must be called on a scalar#

import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)
y.backward()

# By default, PyTorch accumulates gradients, so previous values need to be cleared
x.grad.zero_()  # clear the previous gradient of x
y = x * x  # here y is not a scalar; it's a vector
print(y)
# Calling 'backward' on a non-scalar requires passing a 'gradient' argument
# (the weighting vector in a vector-Jacobian product); this is
# equivalent to y.backward(gradient=torch.ones(len(x)))
y.sum().backward()  # y.sum() converts the vector to a scalar before differentiating
print(x.grad)

🔹 Key Point: PyTorch's backward() must be called on a scalar

In PyTorch, the design of .backward() is based on the chain rule, primarily targeting the gradient of a scalar (single output value) with respect to the input.

If you call y.backward(), here y must be a scalar because:

$$\nabla_x y = \frac{\partial y}{\partial x}$$

  • $\nabla_x y$ is called the gradient of $y$ with respect to $x$

  • $\frac{\partial y}{\partial x}$ is the partial derivative of $y$ with respect to $x$

But what it specifically is depends on the dimensions of $y$ and $x$.


🔸 Case 1: y is a scalar, x is a vector

Assume:

  • $x \in \mathbb{R}^n$

  • $y \in \mathbb{R}$

Then:

$$\nabla_x y = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n$$

That is:

A vector of length $n$ representing the partial derivatives of $y$ with respect to each component of $x$.

This is the most commonly used case in machine learning, such as the gradient of loss with respect to weights.


🔸 Case 2: y is a vector, x is a vector

Assume:

  • $x \in \mathbb{R}^n$

  • $y \in \mathbb{R}^m$

Then:

$$\frac{\partial y}{\partial x} = J \in \mathbb{R}^{m \times n}$$

This matrix $J$ is called the Jacobian matrix, and its $(i, j)$ element is:

$$J_{ij} = \frac{\partial y_i}{\partial x_j}$$
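To make Case 2 concrete, here is a minimal sketch of what PyTorch computes when you do pass a gradient argument to backward() on a vector output: not the full Jacobian, but the vector-Jacobian product $v^\top J$ (the values in v are arbitrary illustration):

import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x                                   # y is a vector; its Jacobian is diag(2x)
v = torch.tensor([1.0, 0.5, 0.25, 0.125])   # weighting vector (arbitrary example values)
y.backward(gradient=v)                      # computes v^T J, not the full Jacobian
print(x.grad)                               # tensor([0.0000, 1.0000, 1.0000, 0.7500]) = v * 2x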

2. Jacobian Matrix#

The Jacobian matrix is:

The matrix of partial derivatives of a vector function with respect to an input vector.

Assume:

  • Input: $x \in \mathbb{R}^n$

  • Output: $y = f(x) \in \mathbb{R}^m$

Jacobian:

$$J(x) = \frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$

Intuitively:

  • The $i$-th row: the partial derivatives of $y_i$ with respect to all $x_j$.

  • Describes the linear influence of small changes in input variables on each output variable.
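If you need the full Jacobian itself rather than a vector-Jacobian product, torch.autograd.functional.jacobian can build it; a small sketch (the function f is just an illustrative example, and materializing $J$ gets expensive for large $m$, $n$):

import torch
from torch.autograd.functional import jacobian

def f(x):
    # f: R^3 -> R^2, so the Jacobian is 2 x 3
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)
print(J)
# tensor([[2., 1., 0.],
#         [0., 1., 6.]])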

3. Hessian Matrix#

The Hessian matrix is:

The matrix of second-order partial derivatives of a scalar function with respect to an input vector.

Assume:

  • Input: $x \in \mathbb{R}^n$

  • Output: $y = f(x) \in \mathbb{R}$

Hessian:

$$H(x) = \frac{\partial^2 y}{\partial x^2} = \begin{bmatrix} \frac{\partial^2 y}{\partial x_1^2} & \frac{\partial^2 y}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 y}{\partial x_2 \partial x_1} & \frac{\partial^2 y}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \in \mathbb{R}^{n \times n}$$

However, when $y$ is a vector, the gradient is actually the Jacobian matrix, not a single gradient vector:

$$J_{ij} = \frac{\partial y_i}{\partial x_j}$$
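Analogously, torch.autograd.functional.hessian builds the full Hessian of a scalar-valued function; a small sketch (f is illustrative):

import torch
from torch.autograd.functional import hessian

def f(x):
    # scalar function f: R^2 -> R
    return x[0] ** 2 + 3 * x[0] * x[1]

x = torch.tensor([1.0, 2.0])
H = hessian(f, x)
print(H)
# tensor([[2., 3.],
#         [3., 0.]])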

4. The difference between dtype and type in PyTorch#

Let's start with the conclusion:

dtype is the numerical type of the elements inside a tensor, while type() is the full PyTorch type name of the tensor object itself (including device information).

They focus on completely different aspects.


📍 1️⃣ What is dtype?

  • Refers to the storage type of each element in the tensor.

  • Examples:

    • torch.float32 → single-precision floating point

    • torch.int64 → 64-bit integer

    • torch.bool → boolean type

See the code:

x = torch.tensor([1, 2, 3], dtype=torch.float32)
print(x.dtype)  # Output: torch.float32

It only tells you "what format this tensor's elements are stored in."


📍 2️⃣ What is type?

  • Refers to the full name of the tensor object in PyTorch, including data type and device.

  • Examples:

    • torch.FloatTensor → float32 CPU tensor

    • torch.cuda.FloatTensor → float32 GPU tensor

    • torch.IntTensor → int32 CPU tensor

See the code:

x = torch.tensor([1, 2, 3])
print(x.type())  # Output: torch.LongTensor (integer literals default to int64)

It tells you "the complete type name of this tensor object in the PyTorch system."


📍 ⚠ Key Differences

| Comparison Item | dtype | type |
| --- | --- | --- |
| Focus | The element's numerical type | The complete PyTorch type of the tensor object (including device information) |
| Example | torch.float32, torch.int64 | torch.FloatTensor, torch.cuda.FloatTensor |
| Main Use | Precision, storage, and computation | Distinguishing tensor categories; debugging/checking |
| Change Method | .to(dtype), .float() | .type() (changes the entire type object) |
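A quick sketch putting the two side by side (the printed outputs assume the default int64 dtype for Python integer literals):

import torch

x = torch.tensor([1, 2, 3])
print(x.dtype)            # torch.int64
print(x.type())           # torch.LongTensor

y = x.to(torch.float32)   # changes only the dtype
print(y.dtype, y.type())  # torch.float32 torch.FloatTensor

if torch.cuda.is_available():
    z = y.cuda()
    print(z.type())       # torch.cuda.FloatTensor (the device is part of type())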

5. Soft Label vs Hard Label#

Typically, when training classification models, the labels used are hard labels.
For example, for a 3-class task, the true labels are:

Class A → [1, 0, 0]
Class B → [0, 1, 0]
Class C → [0, 0, 1]

These labels are completely one-hot encoded, recognizing only right or wrong.


A soft label, by contrast, is:

Each class corresponds to a probability, rather than a rigid 0 or 1.

For example:

For a certain image, soft label → [0.7, 0.2, 0.1]

This indicates:

  • There is a 70% probability it is class A

  • 20% probability it is class B

  • 10% probability it is class C

In other words, the labels also "acknowledge fuzziness," rather than being all or nothing.
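As a concrete sketch: recent PyTorch versions (1.10+) let F.cross_entropy take class probabilities directly as the target, so a soft label can be used where a class index would normally go (the logits and probabilities below are illustrative):

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw model outputs for one sample

hard = torch.tensor([0])                   # hard label: a class index
soft = torch.tensor([[0.7, 0.2, 0.1]])     # soft label: class probabilities

print(F.cross_entropy(logits, hard))       # cross-entropy against a one-hot target
print(F.cross_entropy(logits, soft))       # cross-entropy against the soft target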


6. Softmax Regression vs Logistic Regression#

📦 Common Points

Both are essentially classification models
✅ Both use a linear function + activation (sigmoid or softmax)
✅ Both use cross-entropy as the loss function

But they are used for different tasks.


🌟 Main Differences


| Item | Logistic Regression | Softmax Regression |
| --- | --- | --- |
| Task Type | Binary classification | Multi-class classification |
| Output Layer Activation | sigmoid (single output value in 0~1) | softmax (vector output, one probability per class) |
| Output Dimension | 1D | C-dimensional (C = number of classes) |
| Target Label | 0 or 1 | one-hot encoding, e.g., [0,0,1,0] |
| Decision Method | Output > 0.5 classified as positive | The class with the highest probability is the prediction |


📊 Logistic Regression Details

  • Assume you have:

    • Input features $x$

    • Weights $w$ and bias $b$

  • Model calculation:
    $$\hat{y} = \sigma(w^T x + b)$$
    where $\sigma$ is the sigmoid function.

Output $\hat{y}$ is a probability value (0 to 1), representing the probability of the positive class.

Loss function:
$$\text{Binary Cross Entropy} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]$$


📊 Softmax Regression Details

  • Assume you have:

    • Input features $x$

    • Weight matrix $W$ (shape: num_features × num_classes)

    • Bias $b$

  • Model calculation:
    $$\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
    where $z = W^T x + b$

Output $\hat{y}$ is a vector of length C, with each element being the probability of the corresponding class.

Loss function:
$$\text{Cross Entropy} = -\sum_i y_i \log(\hat{y}_i)$$
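A small numeric sketch of the two formulas above, computing the softmax probabilities and the cross-entropy by hand (the logits are illustrative):

import torch

z = torch.tensor([2.0, 1.0, 0.1])           # logits z = W^T x + b for C = 3 classes
y_hat = torch.exp(z) / torch.exp(z).sum()   # softmax
print(y_hat)                                # tensor([0.6590, 0.2424, 0.0986])

y = torch.tensor([1.0, 0.0, 0.0])           # one-hot label: the true class is class 0
loss = -(y * torch.log(y_hat)).sum()        # cross-entropy
print(loss)                                 # tensor(0.4170)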


🧠 Why can't logistic regression be directly used for multi-class?

Because sigmoid only outputs one value, while multi-class requires outputting probabilities for multiple classes, and these probabilities must satisfy:

The sum equals 1, mutually exclusive.

This is the purpose of the softmax design.


In Summary

Logistic Regression ≈ 2-class special case of softmax
Softmax Regression = Multi-class generalization of logistic regression


7. Likelihood and Probability#

🌟 What is the likelihood function?

In simple terms:

The likelihood function is the probability of observing the data given the model parameters.

You can think of it as:
The model assumes a certain parameter → Under this parameter, how likely is it to generate the batch of data we currently have.


📊 Difference from Probability?

Many people easily confuse:
✅ Probability: Given parameters, calculate the likelihood of an event.
✅ Likelihood: Given data, see which parameters are more likely to generate this data.

Although the mathematical formulas look the same, their uses are reversed.



🏗️ Mathematical Expression

Assume:

  • Data: $x_1, x_2, \dots, x_n$

  • Parameter: $\theta$

Probability:

$$P(x \mid \theta)$$

Likelihood Function:

$$L(\theta \mid x) = P(x \mid \theta)$$

The difference lies in:

  • Probability: $\theta$ is fixed, looking at $x$.

  • Likelihood: $x$ is fixed, looking at $\theta$.



🌟 A Simple Example

Assume you have a coin, flipped 10 times, resulting in 7 heads.
We want to estimate the probability of heads $p$.


Model Assumption
Number of coin flips: 10
Probability of heads: $p$
Event: 7 heads


Likelihood Function

$$L(p) = C \cdot p^7 (1-p)^3$$

Where:

  • $C$ is a combinatorial number (a fixed value that does not affect maximization).

  • The core is:

    Given $p$, the probability of generating this data (7 heads and 3 tails).


🌟 Maximum Likelihood Estimation (MLE)

Typically, we look for:

The parameter that maximizes the likelihood function.

In this example:

  • Maximize $p^7 (1-p)^3$

  • Setting the derivative of the log-likelihood $7 \log p + 3 \log(1-p)$ to zero gives $\frac{7}{p} = \frac{3}{1-p}$, so the optimum is $p^* = 7/10 = 0.7$

This is the maximum likelihood estimation.
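A quick numeric check of this example, as a sketch: evaluate the likelihood on a grid of $p$ values and confirm that the maximum sits at 0.7.

import torch

p = torch.linspace(0.01, 0.99, 99)
likelihood = p ** 7 * (1 - p) ** 3   # the constant C is omitted; it does not affect the argmax
print(p[torch.argmax(likelihood)])   # tensor(0.7000)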



🔥 In Machine Learning

Many training processes in machine learning are actually:

Using maximum likelihood to fit parameters.

For example:

  • Regression models → Maximum likelihood of Gaussian distribution

  • Classification models → Maximum likelihood under softmax

  • Neural networks → Cross-entropy loss, which is actually derived from maximum likelihood



In Summary

Likelihood function = The probability of observing the current data given parameters (viewed as a function of parameters).

Maximizing likelihood means finding the parameters that most likely generate the data.

8. Perceptron#

The perceptron is a very basic linear model for binary classification, and it can be seen as the earliest form of neural network (a single layer, no hidden layers).

Its main characteristics and points are:

Basic Form:
A perceptron uses a weight vector w and a bias b to perform a linear combination on the input feature vector x, then passes it through a sign function to determine whether the output is +1 or -1.

Formula:
  f(x) = sign(w·x + b)

Goal:
Find a set of w, b that allows all samples to be separated by a hyperplane (i.e., linearly separable).

Training Algorithm:

  • Initialize weights and biases (usually to zero or small values).

  • For each misclassified sample, update the weights:
      w ← w + η * y * x
      b ← b + η * y
    Here η is the learning rate, y is the true label (+1 or -1).

  • Continue iterating until all samples are correctly classified (or reach the maximum number of iterations).
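Below is a minimal sketch of this training algorithm (the toy dataset and the train_perceptron name are illustrative; labels are assumed to be +1/-1):

import torch

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    # X: (n_samples, n_features), y: +1/-1 labels
    w = torch.zeros(X.shape[1])
    b = torch.zeros(1)
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (torch.dot(w, xi) + b) <= 0:  # misclassified (or on the boundary)
                w += lr * yi * xi                 # w <- w + eta * y * x
                b += lr * yi                      # b <- b + eta * y
                errors += 1
        if errors == 0:                           # all samples correctly classified
            break
    return w, b

# Toy linearly separable data (illustrative)
X = torch.tensor([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = torch.tensor([1.0, 1.0, -1.0, -1.0])
w, b = train_perceptron(X, y)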

Limitations (hard limitations):

  • Can only handle linearly separable problems. It is completely powerless against non-linear data (like the XOR problem).

  • No probability output, just hard classification.

  • Easily affected by noise and outliers.

Significance and Historical Position:
Although deep learning has long surpassed the perceptron, it is the starting point of neural network development. Minsky and Papert's 1969 book "Perceptrons" pointed out that it cannot solve the XOR problem, which contributed directly to the first AI winter. It wasn't until multi-layer networks (MLPs) and the backpropagation algorithm emerged that this limitation was overcome.

9. Multi-layer Perceptron#

Binary classification problem formula

[figures: single-hidden-layer MLP formulas for the binary and multi-class cases]

The difference between the binary classification formula and the multi-class one is that the shape of $w_2$ becomes $m \times k$ (where $k$ is the number of classes).

Each hidden layer requires an activation function (non-linear function); without an activation function, it is equivalent to a large linear function.
The output layer may not require an activation function.

The difference between multi-class problems in multi-layer perceptrons and softmax regression is that there are additional hidden layers; everything else remains the same.

Code

import torch
from torch import nn

# Build the model, the hidden layer contains 256 hidden units and uses the ReLU activation function
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

Code Explanation
nn.Sequential() is used to stack a series of sub-modules (layers, activation functions, etc.) in order, automatically organizing the forward propagation. In other words, it is an ordered container that wraps the desired network modules in sequence.
nn.Flatten() flattens the input multi-dimensional tensor into a one-dimensional vector.
For MNIST, the input image shape is [batch_size, 1, 28, 28] (grayscale), and after flattening, it becomes [batch_size, 784], making it convenient to connect to the fully connected layer.

10. Activation Functions#

Sigmoid Function#

Sigmoid Activation Function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Output Range:
(0, 1) — maps any real number to between 0 and 1.

Graph Features:

[figure: sigmoid curve]

  • S-shaped curve (hence called sigmoid, sigmoid = S-shaped).

  • Symmetrical around (0, 0.5).

  • When $x$ is very large or very small, the gradient approaches 0 — gradient vanishing.

Advantages:

  • Can be interpreted as a probability (especially suitable for the last layer of binary classification).

  • Smooth, continuous, and differentiable.

Disadvantages (fatal):

  • Gradient Vanishing: When $x$ is very large or very small, the derivative approaches 0, making it nearly impossible to update weights during backpropagation.

  • Output not centered at 0: Because all outputs are positive, downstream weight gradients share the same sign, producing zig-zag updates and slower convergence.

  • Easily Saturated: Inputs that are too large or too small will be flattened.

Current Mainstream Practice:

  • Except for the last layer of binary classification, hidden layers almost never use sigmoid, but rather use ReLU or more advanced variants.

  • If it is binary classification, use sigmoid for the last layer and binary cross-entropy for the loss function.

Tanh Function#

Tanh (Hyperbolic Tangent) Activation Function is defined as:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Output Range:
(-1, 1) — compresses any real number to between -1 and 1.

Graph Features:

[figure: tanh curve]

  • S-shaped curve (similar to sigmoid, but symmetrical).

  • Symmetrical around (0, 0) (this is better than sigmoid, as the output mean is closer to 0, aiding optimization convergence).

  • When $x$ is very large or very small, gradient vanishing can also occur.

Advantages:

  • Compared to sigmoid, the output is centered at 0, which is more friendly for gradient updates.

  • Smooth, continuous, and differentiable.

Disadvantages (core issue):

  • Like sigmoid, it is prone to saturation → gradient vanishing.

Current Mainstream Practice:

  • In some models that require symmetric output (like RNNs), tanh is still useful.

  • However, for hidden layers of deep neural networks, the modern mainstream is still to use ReLU and its improved versions.

ReLU Function#

ReLU (Rectified Linear Unit) Activation Function is defined as:

$$\text{ReLU}(x) = \max(0, x)$$

Output Range:
$[0, +\infty)$

Graph Features:

[figure: ReLU curve]

  • When $x < 0$, the output is 0.

  • When $x \geq 0$, the output is $x$.
    Simple, direct: a piecewise linear function.

Advantages:

  • Simple computation and fast convergence: unlike sigmoid and tanh, it needs no expensive exponential calculations.

  • Not prone to gradient vanishing (because the gradient in the positive interval is always 1).

  • Sparse activation (many neurons output 0, which helps simplify the model).

Disadvantages:

  • Dying ReLU Problem: If a neuron gets stuck in the negative interval during training (outputting 0 continuously), it may no longer update (because the gradient in the negative interval is 0).

  • Asymmetrical with respect to input values (only retains positive values).

Modern Improved Versions:

  • Leaky ReLU: Gives a small slope in the negative interval.

  • Parametric ReLU (PReLU): Makes the slope in the negative interval learnable.

  • ELU, GELU, Swish: Further improvements.
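All of these are available as PyTorch modules; a quick sketch that evaluates each variant on the same inputs:

import torch
from torch import nn

x = torch.linspace(-3, 3, 7)

# Each module below differs mainly in how it handles the negative interval
for act in [nn.ReLU(), nn.LeakyReLU(negative_slope=0.01), nn.PReLU(), nn.ELU(), nn.GELU()]:
    print(act.__class__.__name__, act(x))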

11. Model Complexity#

[figure: model complexity]

12. Regularization#

In machine learning, regularization methods are a set of techniques used to prevent model overfitting and improve model generalization ability. It introduces some "penalties" or "constraints" during the model learning process to limit the model's complexity, ensuring that the model does not fit every detail (especially noise) in the training data too "perfectly," thus being able to better adapt to new, unseen data.
Used only during training, not needed during prediction.

What is regularization?

In one sentence: Regularization actively limits the model's degrees of freedom by adding "constraints/penalties" to the loss function, forcing it to learn simple and well-generalized patterns rather than memorizing the noise in the training set.

Mathematically, it is often written as

$$\min_{\theta}\; \mathcal{L}(\theta;\,X,y)\;+\;\lambda\,\Omega(\theta)$$

  • $\mathcal{L}$: original empirical loss (cross-entropy, MSE, etc.)

  • $\Omega$: regularization term (the larger it is, the more "complex" the model)

  • $\lambda$: regularization strength; $\lambda=0$ degenerates to no regularization, and $\lambda\to\infty$ compresses the model to the simplest form

Core Functions

| Goal | Explanation |
| --- | --- |
| Suppress overfitting | Reduce variance, improve robustness to unseen samples |
| Numerical stability | Avoid weight explosion or matrix singularity |
| Interpretability | Sparsity or structural constraints make features/sub-networks easier to read |
| Prevent multicollinearity | Still provide a unique solution under high-dimensional multicollinear features |

L1 Regularization (Lasso Regression)#

✅ 1. What is L1 Regularization?

One-sentence definition:

L1 regularization adds the sum of the absolute values of all weights as a penalty term to the loss function.

Its goal is:

To make some unimportant weights automatically become 0, thus simplifying the model.


2. Mathematical Form (briefly)

The ordinary loss function is:

$$\text{Loss} = \text{Model Error (e.g., cross-entropy)}$$

With L1 regularization, it becomes:

$$\text{Loss}_{\text{total}} = \text{Original Error} + \lambda \sum |w_i|$$

Where:

  • $w_i$ is each weight parameter

  • $\lambda$ is a hyperparameter controlling the penalty strength (the larger, the "harsher")


3. The Key Feature of L1: Making Some Weights Become 0 (Sparsity)

This is the key difference between L1 and L2: the L1 penalty pushes every weight toward 0 by a constant amount regardless of its size, so small weights get driven exactly to 0, whereas L2 shrinks weights proportionally and almost never zeroes them.

So if you want the model to automatically select important features and discard irrelevant ones, L1 regularization is the ideal choice.


4. An Example (imagine)

You are working on a house price prediction task with 100 input features, but only 5 are useful.

If you use L1 regularization, after training, the model may retain only the weights of these 5 features, while the other 95 become 0.
This is equivalent to the model automatically performing feature selection.
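Unlike L2, which PyTorch exposes directly as the optimizer's weight_decay (see the next section), L1 has no built-in switch, so the penalty is usually added to the loss by hand. A minimal sketch (the model shape, the lambda value, and the training_step helper are illustrative assumptions):

import torch
from torch import nn

model = nn.Linear(100, 1)   # e.g., 100 house-price features -> 1 predicted price
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lambda_l1 = 1e-3            # penalty strength (illustrative value)

def training_step(X, y):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())  # sum of |w_i|
    (loss + lambda_l1 * l1_penalty).backward()
    optimizer.step()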


In Summary

L1 Regularization = Penalizing the absolute values of weights → Encouraging some parameters to become 0 → Automatically selecting important features.

L2 Regularization (Weight Decay) (Ridge Regression)#

One-sentence Definition

L2 Regularization adds a penalty term of the square of model parameters to the loss function,
used to suppress large weights and avoid overfitting.


Mathematical Form

Assuming the original loss function of the model is:

$$\text{Loss} = L(\hat{y}, y)$$

With L2 regularization, it becomes:

$$\text{Loss}_{\text{total}} = L(\hat{y}, y) + \lambda \sum w_i^2$$

Where:

  • $\lambda$: regularization strength (a hyperparameter), typically $10^{-2}, 10^{-3}, 10^{-4}$

  • $\sum w_i^2$: the sum of the squares of all weights (the squared L2 norm)



Practical Effect (Intuition)

Without L2:#

  • The model may wildly amplify a certain weight, leading to excellent fitting of the training data but poor generalization.

With L2 Regularization:#

  • The model becomes "more conservative," not easily enlarging parameters

  • The model is more stable against fluctuations in input (stronger generalization ability)


How to Use in PyTorch?

L2 Regularization = Weight Decay, so in PyTorch, you just need to add it in the optimizer:

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

This weight_decay parameter is exactly $\lambda$; by default it applies an L2 penalty.


How to Use in sklearn?

from sklearn.linear_model import Ridge  # Ridge is linear regression with L2
model = Ridge(alpha=1.0)

In Summary

L2 Regularization = Penalizing the square of weights, compressing but retaining all parameters, enhancing model generalization ability.

Dropout#

One-sentence Definition

Dropout is a regularization method that randomly masks neurons,
making neural networks more "robust" during training and preventing overfitting.


Motivation Behind It

Deep neural networks are prone to overfitting, especially when:

  • The number of layers is deep

  • The training data is small

  • There are many parameters, and the model is complex

The reason is:

The model learns to "rely on certain combinations of neurons" to memorize the training data → Generalization ability deteriorates.

Dropout breaks this dependency
During training, it randomly masks (sets to 0) some neurons, forcing the network not to rely solely on a small group of neurons working together, but rather to have redundancy.


How It Works?

During Training:

For a hidden layer output vector $\mathbf h = (h_1,\dots,h_n)$, the training phase applies:

$$\tilde{\mathbf h}= \frac{\mathbf m}{p}\odot \mathbf h,\qquad m_i \sim \text{Bernoulli}(p)$$

The retention probability $p$ is set by you, typically between 0.5 and 0.9.

  • $\mathbf m$: the random mask that decides whether to "retain" the $i$-th unit.

  • $\odot$: element-wise multiplication.

  • $1/p$: the rescaling factor used in inverted dropout; it ensures $\mathbb E[\tilde h_i] = h_i$, keeping activation scales consistent between training and inference.

So from a mathematical perspective, it is a form of multiplicative binary noise injection, rather than physically deleting neurons.

Random Mask = A tensor filled with 0/1 values that determines which units "turn off" temporarily and which "work normally" during this forward/backward propagation round.
In Dropout (or other noise regularization), it is the minimal "switch matrix" that injects multiplicative noise into the network.

Element values

$$m_{ij} \sim \text{Bernoulli}(p), \qquad m_{ij} = \begin{cases} 1 & \text{retain (probability } p\text{)} \\ 0 & \text{drop (probability } 1-p\text{)} \end{cases}$$

During Testing:

Nothing is dropped: all units are active, and no extra scaling is needed (inverted dropout already rescaled by $1/p$ during training).


Usage Location:
Between fully connected layers (Dense Layer / Linear Layer) and their subsequent activation functions (like ReLU, Sigmoid, Tanh, etc.).

  • It is more common to place it after the activation function: Dense -> Activation -> Dropout
  • This is because the purpose of Dropout is to randomly "turn off" the output of neurons. The output of the activation function represents the final output signal of the neuron, applying Dropout directly simulates the random deactivation of neurons.
  • In rare cases, it may also be placed before the activation function: Dense -> Dropout -> Activation
    Although this is less mainstream, some research and practice have shown that it can sometimes yield results. The logic is to apply Dropout to the linear combination results before the activation function.

Typically applied to one or more hidden layers (Hidden Layers).

  • For deeper networks, Dropout can be used after multiple hidden layers.
    Whether to use it on all hidden layers and what dropout rate (dropout rate/probability) to use is usually a hyperparameter that needs to be adjusted through experimentation.

It is generally not recommended to use Dropout in the output layer (Output Layer).

  • The output layer is responsible for producing the final prediction results. Using Dropout in the output layer may interfere with the final prediction output of the model, especially for classification tasks, as it may randomly drop certain class prediction signals, which is usually undesirable.

PyTorch Implementation:

import torch.nn as nn

net = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # Dropout layer: zeroes each element with probability 0.5 during training (PyTorch's p is the drop probability)
    nn.Linear(128, 10)
)

Advantages of Dropout:

| Advantage | Description |
| --- | --- |
| Suppresses overfitting | Forces the network not to rely on particular features |
| Enhances generalization | Each training pass effectively trains a different "sub-network" |
| Easy to use | Just one line of code to add |

⚠️ Cautions:

  • Dropout only takes effect during the training phase and must be turned off during testing (see the sketch after this list).

  • If the model is already small or the data volume is sufficient, Dropout may sometimes harm performance (underfitting).
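In PyTorch this on/off switch is the module mode: net.train() activates Dropout and net.eval() disables it. A small sketch:

import torch
from torch import nn

net = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(10, 2))
x = torch.randn(1, 10)

net.train()    # training mode: Dropout is active
print(net(x))  # repeated calls give different results (random mask)

net.eval()     # evaluation mode: Dropout is a no-op
print(net(x))  # deterministic output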


In Summary:

Dropout is a strategy of "randomly dropping points during training, fully activating during testing," enhancing model robustness and preventing overfitting.

Early Stopping#

One-sentence Definition

Early Stopping is a method of stopping training by monitoring validation set performance before the model starts to overfit.


🧠 Why is Early Stopping Needed?

When training a model, we often observe the following phenomenon:

| Epoch | Training Set Accuracy | Validation Set Accuracy |
| --- | --- | --- |
| 1 | 60% | 58% |
| 10 | 95% | 88% ✅ |
| 30 | 99% ✅ | 70% ❌ |

The training set improves, but the validation set worsens. This indicates:

The model is "memorizing" the training data → Overfitting has begun.

Continuing training at this point is a waste of time and may even damage the model.


✅ The Core Idea of Early Stopping

It's simple:

Monitor the performance of the validation set, and once it starts to decline, immediately stop training, retaining the best-performing model.

This way:

  • The model will not overfit

  • Training speed is faster

  • Generally does not require complex regularization terms

👉 Therefore, Early Stopping is a training-level regularization method, rather than a structural-level one.
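A minimal sketch of the loop (the patience value and the train_one_epoch/evaluate helpers are placeholders, not a real API):

import copy

def fit_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=100, patience=5):
    # Stop once validation accuracy hasn't improved for `patience` consecutive epochs
    best_acc, best_state, epochs_no_improve = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_acc = evaluate(model)           # accuracy on the validation set
        if val_acc > best_acc:
            best_acc, epochs_no_improve = val_acc, 0
            best_state = copy.deepcopy(model.state_dict())  # remember the best model
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:               # validation stopped improving
                break
    if best_state is not None:
        model.load_state_dict(best_state)   # roll back to the best-performing weights
    return model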


In Summary

Early Stopping is the simplest yet extremely effective regularization strategy: Stop immediately when the validation set stops improving.
