
PyTorch Learning Record

Recently I've been watching Li Mu's hands-on deep learning course; here I record the important points so the knowledge doesn't just drift by, and to consolidate it.

1. PyTorch's backward() must be called on a scalar#

import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)
y.backward()

# By default, PyTorch accumulates gradients, so previous values need to be cleared
x.grad.zero_()  # clear the previous gradient of x
y = x * x  # here y is not a scalar; it's a vector
print(y)
# Calling 'backward' on a non-scalar requires passing a 'gradient' argument
# (the weighting vector in a vector-Jacobian product); this is
# equivalent to y.backward(gradient=torch.ones(len(x)))
y.sum().backward()  # y.sum() converts the vector to a scalar before differentiating
print(x.grad)

🔹 Key Point: PyTorch's backward() must be called on a scalar

In PyTorch, the design of .backward() is based on the chain rule, primarily targeting the gradient of a scalar (single output value) with respect to the input.

If you call y.backward(), here y must be a scalar because:

$$\nabla_x y = \frac{\partial y}{\partial x}$$

  • $\nabla_x y$ is called the gradient of $y$ with respect to $x$

  • $\frac{\partial y}{\partial x}$ is the partial derivative of $y$ with respect to $x$

But what it specifically is depends on the dimensions of $y$ and $x$.


🔸 Case 1: y is a scalar, x is a vector

Assume:

  • $x \in \mathbb{R}^n$

  • $y \in \mathbb{R}$

Then:

$$\nabla_x y = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n$$

That is:

A vector of length $n$ representing the partial derivatives of $y$ with respect to each component of $x$.

This is the most commonly used case in machine learning, such as the gradient of loss with respect to weights.


🔸 Case 2: y is a vector, x is a vector

Assume:

  • $x \in \mathbb{R}^n$

  • $y \in \mathbb{R}^m$

Then:

$$\frac{\partial y}{\partial x} = J \in \mathbb{R}^{m \times n}$$

This matrix $J$ is called the Jacobian matrix, and its $(i, j)$ element is:

$$J_{ij} = \frac{\partial y_i}{\partial x_j}$$
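To make Case 2 concrete, here is a minimal sketch of what PyTorch computes when you do pass a gradient argument to backward() on a vector output: not the full Jacobian, but the vector-Jacobian product $v^\top J$ (the values in v are arbitrary illustration):

import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x                                   # y is a vector; its Jacobian is diag(2x)
v = torch.tensor([1.0, 0.5, 0.25, 0.125])   # weighting vector (arbitrary example values)
y.backward(gradient=v)                      # computes v^T J, not the full Jacobian
print(x.grad)                               # tensor([0.0000, 1.0000, 1.0000, 0.7500]) = v * 2x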

2. Jacobian Matrix#

The Jacobian matrix is:

The matrix of partial derivatives of a vector function with respect to an input vector.

Assume:

  • Input: $x \in \mathbb{R}^n$

  • Output: $y = f(x) \in \mathbb{R}^m$

Jacobian:

$$J(x) = \frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$

Intuitively:

  • The $i$-th row: the partial derivatives of $y_i$ with respect to all $x_j$.

  • Describes the linear influence of small changes in input variables on each output variable.
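If you need the full Jacobian itself rather than a vector-Jacobian product, torch.autograd.functional.jacobian can build it; a small sketch (the function f is just an illustrative example, and materializing $J$ gets expensive for large $m$, $n$):

import torch
from torch.autograd.functional import jacobian

def f(x):
    # f: R^3 -> R^2, so the Jacobian is 2 x 3
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)
print(J)
# tensor([[2., 1., 0.],
#         [0., 1., 6.]])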

3. Hessian Matrix#

The Hessian matrix is:

The matrix of second-order partial derivatives of a scalar function with respect to an input vector.

Assume:

  • Input: $x \in \mathbb{R}^n$

  • Output: $y = f(x) \in \mathbb{R}$

Hessian:

$$H(x) = \frac{\partial^2 y}{\partial x^2} = \begin{bmatrix} \frac{\partial^2 y}{\partial x_1^2} & \frac{\partial^2 y}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 y}{\partial x_2 \partial x_1} & \frac{\partial^2 y}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \in \mathbb{R}^{n \times n}$$

However, when $y$ is a vector, the gradient is actually the Jacobian matrix, not a single gradient vector:

$$J_{ij} = \frac{\partial y_i}{\partial x_j}$$
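Analogously, torch.autograd.functional.hessian builds the full Hessian of a scalar-valued function; a small sketch (f is illustrative):

import torch
from torch.autograd.functional import hessian

def f(x):
    # scalar function f: R^2 -> R
    return x[0] ** 2 + 3 * x[0] * x[1]

x = torch.tensor([1.0, 2.0])
H = hessian(f, x)
print(H)
# tensor([[2., 3.],
#         [3., 0.]])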

4. The difference between dtype and type in PyTorch#

Let's start with the conclusion:

dtype is the numerical type of the elements inside a tensor, while type() is the full PyTorch type name of the tensor object itself (including device information).

They focus on completely different aspects.


📍 1️⃣ What is dtype?

  • Refers to the storage type of each element in the tensor.

  • Examples:

    • torch.float32 → single-precision floating point

    • torch.int64 → 64-bit integer

    • torch.bool → boolean type

See the code:

x = torch.tensor([1, 2, 3], dtype=torch.float32)
print(x.dtype)  # Output: torch.float32

It only tells you "what format this tensor's elements are stored in."


📍 2️⃣ What is type?

  • Refers to the full name of the tensor object in PyTorch, including data type and device.

  • Examples:

    • torch.FloatTensor → float32 CPU tensor

    • torch.cuda.FloatTensor → float32 GPU tensor

    • torch.IntTensor → int32 CPU tensor

See the code:

x = torch.tensor([1, 2, 3])
print(x.type())  # Output: torch.LongTensor (integer literals default to int64)

It tells you "the complete type name of this tensor object in the PyTorch system."


📍 ⚠ Key Differences

| Comparison Item | dtype | type |
| --- | --- | --- |
| Focus | The element's numerical type | The complete PyTorch type of the tensor object (including device information) |
| Example | torch.float32, torch.int64 | torch.FloatTensor, torch.cuda.FloatTensor |
| Main Use | Precision, storage, and computation | Distinguishing tensor categories; debugging/checking |
| Change Method | .to(dtype), .float() | .type() (changes the entire type object) |
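A quick sketch putting the two side by side (the printed outputs assume the default int64 dtype for Python integer literals):

import torch

x = torch.tensor([1, 2, 3])
print(x.dtype)            # torch.int64
print(x.type())           # torch.LongTensor

y = x.to(torch.float32)   # changes only the dtype
print(y.dtype, y.type())  # torch.float32 torch.FloatTensor

if torch.cuda.is_available():
    z = y.cuda()
    print(z.type())       # torch.cuda.FloatTensor (the device is part of type())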

5. Soft Label vs Hard Label#

Typically, when training classification models, the labels used are hard labels.
For example, for a 3-class task, the true labels are:

Class A → [1, 0, 0]
Class B → [0, 1, 0]
Class C → [0, 0, 1]

These labels are completely one-hot encoded, recognizing only right or wrong.


A soft label, by contrast, is:

Each class corresponds to a probability, rather than a rigid 0 or 1.

For example:

For a certain image, soft label → [0.7, 0.2, 0.1]

This indicates:

  • There is a 70% probability it is class A

  • 20% probability it is class B

  • 10% probability it is class C

In other words, the labels also "acknowledge fuzziness," rather than being all or nothing.
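As a concrete sketch: recent PyTorch versions (1.10+) let F.cross_entropy take class probabilities directly as the target, so a soft label can be used where a class index would normally go (the logits and probabilities below are illustrative):

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw model outputs for one sample

hard = torch.tensor([0])                   # hard label: a class index
soft = torch.tensor([[0.7, 0.2, 0.1]])     # soft label: class probabilities

print(F.cross_entropy(logits, hard))       # cross-entropy against a one-hot target
print(F.cross_entropy(logits, soft))       # cross-entropy against the soft target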


6. Softmax Regression vs Logistic Regression#

📦 Common Points

Both are essentially classification models
✅ Both use a linear function + activation (sigmoid or softmax)
✅ Both use cross-entropy as the loss function

But they are used for different tasks.


🌟 Main Differences


| Item | Logistic Regression | Softmax Regression |
| --- | --- | --- |
| Task Type | Binary classification | Multi-class classification |
| Output Layer Activation | sigmoid (single output value in 0~1) | softmax (vector output, one probability per class) |
| Output Dimension | 1D | C-dimensional (C = number of classes) |
| Target Label | 0 or 1 | one-hot encoding, e.g., [0,0,1,0] |
| Decision Method | Output > 0.5 classified as positive | The class with the highest probability is the prediction |


📊 Logistic Regression Details

  • Assume you have:

    • Input features $x$

    • Weights $w$ and bias $b$

  • Model calculation:
    $$\hat{y} = \sigma(w^T x + b)$$
    where $\sigma$ is the sigmoid function.

Output $\hat{y}$ is a probability value (0 to 1), representing the probability of the positive class.

Loss function:
$$\text{Binary Cross Entropy} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]$$


📊 Softmax Regression Details

  • Assume you have:

    • Input features $x$

    • Weight matrix $W$ (shape: num_features × num_classes)

    • Bias $b$

  • Model calculation:
    $$\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
    where $z = W^T x + b$

Output $\hat{y}$ is a vector of length C, with each element being the probability of the corresponding class.

Loss function:
$$\text{Cross Entropy} = -\sum_i y_i \log(\hat{y}_i)$$
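A small numeric sketch of the two formulas above, computing the softmax probabilities and the cross-entropy by hand (the logits are illustrative):

import torch

z = torch.tensor([2.0, 1.0, 0.1])           # logits z = W^T x + b for C = 3 classes
y_hat = torch.exp(z) / torch.exp(z).sum()   # softmax
print(y_hat)                                # tensor([0.6590, 0.2424, 0.0986])

y = torch.tensor([1.0, 0.0, 0.0])           # one-hot label: the true class is class 0
loss = -(y * torch.log(y_hat)).sum()        # cross-entropy
print(loss)                                 # tensor(0.4170)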


🧠 Why can't logistic regression be directly used for multi-class?

Because sigmoid only outputs one value, while multi-class requires outputting probabilities for multiple classes, and these probabilities must satisfy:

The sum equals 1, mutually exclusive.

This is the purpose of the softmax design.


In Summary

Logistic Regression ≈ 2-class special case of softmax
Softmax Regression = Multi-class generalization of logistic regression


7. Likelihood and Probability#

🌟 What is the likelihood function?

In simple terms:

The likelihood function is the probability of observing the data given the model parameters.

You can think of it as:
The model assumes a certain parameter → Under this parameter, how likely is it to generate the batch of data we currently have.


📊 Difference from Probability?

Many people easily confuse:
✅ Probability: Given parameters, calculate the likelihood of an event.
✅ Likelihood: Given data, see which parameters are more likely to generate this data.

Although the mathematical formulas look the same, their uses are reversed.



🏗️ Mathematical Expression

Assume:

  • Data: $x_1, x_2, \dots, x_n$

  • Parameter: $\theta$

Probability:

$$P(x \mid \theta)$$

Likelihood Function:

$$L(\theta \mid x) = P(x \mid \theta)$$

The difference lies in:

  • Probability: $\theta$ is fixed, looking at $x$.

  • Likelihood: $x$ is fixed, looking at $\theta$.



🌟 A Simple Example

Assume you have a coin, flipped 10 times, resulting in 7 heads.
We want to estimate the probability of heads $p$.


Model Assumption
Number of coin flips: 10
Probability of heads: $p$
Event: 7 heads


Likelihood Function

$$L(p) = C \cdot p^7 (1-p)^3$$

Where:

  • $C$ is a combinatorial number (a fixed value that does not affect maximization).

  • The core is:

    Given $p$, the probability of generating this data (7 heads and 3 tails).


🌟 Maximum Likelihood Estimation (MLE)

Typically, we look for:

The parameter that maximizes the likelihood function.

In this example:

  • Maximize $p^7 (1-p)^3$

  • Setting the derivative of the log-likelihood $7 \log p + 3 \log(1-p)$ to zero gives $\frac{7}{p} = \frac{3}{1-p}$, so the optimum is $p^* = 7/10 = 0.7$

This is the maximum likelihood estimation.
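A quick numeric check of this example, as a sketch: evaluate the likelihood on a grid of $p$ values and confirm that the maximum sits at 0.7.

import torch

p = torch.linspace(0.01, 0.99, 99)
likelihood = p ** 7 * (1 - p) ** 3   # the constant C is omitted; it does not affect the argmax
print(p[torch.argmax(likelihood)])   # tensor(0.7000)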



🔥 In Machine Learning

Many training processes in machine learning are actually:

Using maximum likelihood to fit parameters.

For example:

  • Regression models → Maximum likelihood of Gaussian distribution

  • Classification models → Maximum likelihood under softmax

  • Neural networks → Cross-entropy loss, which is actually derived from maximum likelihood



In Summary

Likelihood function = The probability of observing the current data given parameters (viewed as a function of parameters).

Maximizing likelihood means finding the parameters that most likely generate the data.

8. Perceptron#

The perceptron is a very basic linear model for binary classification, and it can be seen as the earliest form of neural network (a single layer, no hidden layers).

Its main characteristics and points are:

Basic Form:
A perceptron uses a weight vector w and a bias b to perform a linear combination on the input feature vector x, then passes it through a sign function to determine whether the output is +1 or -1.

Formula:
  f(x) = sign(w·x + b)

Goal:
Find a set of w, b that allows all samples to be separated by a hyperplane (i.e., linearly separable).

Training Algorithm:

  • Initialize weights and biases (usually to zero or small values).

  • For each misclassified sample, update the weights:
      w ← w + η * y * x
      b ← b + η * y
    Here η is the learning rate, y is the true label (+1 or -1).

  • Continue iterating until all samples are correctly classified (or reach the maximum number of iterations).
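Below is a minimal sketch of this training algorithm (the toy dataset and the train_perceptron name are illustrative; labels are assumed to be +1/-1):

import torch

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    # X: (n_samples, n_features), y: +1/-1 labels
    w = torch.zeros(X.shape[1])
    b = torch.zeros(1)
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (torch.dot(w, xi) + b) <= 0:  # misclassified (or on the boundary)
                w += lr * yi * xi                 # w <- w + eta * y * x
                b += lr * yi                      # b <- b + eta * y
                errors += 1
        if errors == 0:                           # all samples correctly classified
            break
    return w, b

# Toy linearly separable data (illustrative)
X = torch.tensor([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = torch.tensor([1.0, 1.0, -1.0, -1.0])
w, b = train_perceptron(X, y)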

Limitations (hard limitations):

  • Can only handle linearly separable problems. It is completely powerless against non-linear data (like the XOR problem).

  • No probability output, just hard classification.

  • Easily affected by noise and outliers.

Significance and Historical Position:
Although deep learning has long surpassed the perceptron, it is the starting point of neural network development. Minsky and Papert's 1969 book "Perceptrons" pointed out that it cannot solve the XOR problem, which contributed directly to the first AI winter. It wasn't until multi-layer networks (MLPs) and the backpropagation algorithm emerged that this limitation was overcome.

9. Multi-layer Perceptron#

Binary classification problem formula

[figures: single-hidden-layer MLP formulas for the binary and multi-class cases]

The difference between the binary classification formula and the multi-class one is that the shape of $w_2$ becomes $m \times k$ (where $k$ is the number of classes).

Each hidden layer requires an activation function (non-linear function); without an activation function, it is equivalent to a large linear function.
The output layer may not require an activation function.

The difference between multi-class problems in multi-layer perceptrons and softmax regression is that there are additional hidden layers; everything else remains the same.

Code

import torch
from torch import nn

# Build the model, the hidden layer contains 256 hidden units and uses the ReLU activation function
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

Code Explanation
nn.Sequential() is used to stack a series of sub-modules (layers, activation functions, etc.) in order, automatically organizing the forward propagation. In other words, it is an ordered container that wraps the desired network modules in sequence.
nn.Flatten() flattens the input multi-dimensional tensor into a one-dimensional vector.
For MNIST, the input image shape is [batch_size, 1, 28, 28] (grayscale), and after flattening, it becomes [batch_size, 784], making it convenient to connect to the fully connected layer.

10. Activation Functions#

Sigmoid Function#

Sigmoid Activation Function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Output Range:
(0, 1) — maps any real number to between 0 and 1.

Graph Features:

[figure: sigmoid curve]

  • S-shaped curve (hence called sigmoid, sigmoid = S-shaped).

  • Symmetrical around (0, 0.5).

  • When $x$ is very large or very small, the gradient approaches 0 — gradient vanishing.

Advantages:

  • Can be interpreted as a probability (especially suitable for the last layer of binary classification).

  • Smooth, continuous, and differentiable.

Disadvantages (fatal):

  • Gradient Vanishing: When $x$ is very large or very small, the derivative approaches 0, making it nearly impossible to update weights during backpropagation.

  • Output not centered at 0: Because all outputs are positive, downstream weight gradients share the same sign, producing zig-zag updates and slower convergence.

  • Easily Saturated: Inputs that are too large or too small will be flattened.

Current Mainstream Practice:

  • Except for the last layer of binary classification, hidden layers almost never use sigmoid, but rather use ReLU or more advanced variants.

  • If it is binary classification, use sigmoid for the last layer and binary cross-entropy for the loss function.

Tanh Function#

Tanh (Hyperbolic Tangent) Activation Function is defined as:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Output Range:
(-1, 1) — compresses any real number to between -1 and 1.

Graph Features:

[figure: tanh curve]

  • S-shaped curve (similar to sigmoid, but symmetrical).

  • Symmetrical around (0, 0) (this is better than sigmoid, as the output mean is closer to 0, aiding optimization convergence).

  • When $x$ is very large or very small, gradient vanishing can also occur.

Advantages:

  • Compared to sigmoid, the output is centered at 0, which is more friendly for gradient updates.

  • Smooth, continuous, and differentiable.

Disadvantages (core issue):

  • Like sigmoid, it is prone to saturation → gradient vanishing.

Current Mainstream Practice:

  • In some models that require symmetric output (like RNNs), tanh is still useful.

  • However, for hidden layers of deep neural networks, the modern mainstream is still to use ReLU and its improved versions.

ReLU Function#

ReLU (Rectified Linear Unit) Activation Function is defined as:

$$\text{ReLU}(x) = \max(0, x)$$

Output Range:
$[0, +\infty)$

Graph Features:

[figure: ReLU curve]

  • When $x < 0$, the output is 0.

  • When $x \geq 0$, the output is $x$.
    Simple, direct: a piecewise linear function.

Advantages:

  • Simple computation and fast convergence: unlike sigmoid and tanh, it needs no expensive exponential calculations.

  • Not prone to gradient vanishing (because the gradient in the positive interval is always 1).

  • Sparse activation (many neurons output 0, which helps simplify the model).

Disadvantages:

  • Dying ReLU Problem: If a neuron gets stuck in the negative interval during training (outputting 0 continuously), it may no longer update (because the gradient in the negative interval is 0).

  • Asymmetrical with respect to input values (only retains positive values).

Modern Improved Versions:

  • Leaky ReLU: Gives a small slope in the negative interval.

  • Parametric ReLU (PReLU): Makes the slope in the negative interval learnable.

  • ELU, GELU, Swish: Further improvements.
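All of these are available as PyTorch modules; a quick sketch that evaluates each variant on the same inputs:

import torch
from torch import nn

x = torch.linspace(-3, 3, 7)

# Each module below differs mainly in how it handles the negative interval
for act in [nn.ReLU(), nn.LeakyReLU(negative_slope=0.01), nn.PReLU(), nn.ELU(), nn.GELU()]:
    print(act.__class__.__name__, act(x))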

11. Model Complexity#

[figure: model complexity]

12. Regularization#

In machine learning, regularization methods are a set of techniques used to prevent model overfitting and improve model generalization ability. It introduces some "penalties" or "constraints" during the model learning process to limit the model's complexity, ensuring that the model does not fit every detail (especially noise) in the training data too "perfectly," thus being able to better adapt to new, unseen data.
Used only during training, not needed during prediction.

What is regularization?

In one sentence: Regularization actively limits the model's degrees of freedom by adding "constraints/penalties" to the loss function, forcing it to learn simple and well-generalized patterns rather than memorizing the noise in the training set.

Mathematically, it is often written as

$$\min_{\theta}\; \mathcal{L}(\theta;\,X,y)\;+\;\lambda\,\Omega(\theta)$$

  • $\mathcal{L}$: original empirical loss (cross-entropy, MSE, etc.)

  • $\Omega$: regularization term (the larger it is, the more "complex" the model)

  • $\lambda$: regularization strength; $\lambda=0$ degenerates to no regularization, and $\lambda\to\infty$ compresses the model to the simplest form

Core Functions

| Goal | Explanation |
| --- | --- |
| Suppress overfitting | Reduce variance, improve robustness to unseen samples |
| Numerical stability | Avoid weight explosion or matrix singularity |
| Interpretability | Sparsity or structural constraints make features/sub-networks easier to read |
| Prevent multicollinearity | Still provide a unique solution under high-dimensional multicollinear features |

L1 Regularization (Lasso Regression)#

✅ 1. What is L1 Regularization?

One-sentence definition:

L1 regularization adds the sum of the absolute values of all weights as a penalty term to the loss function.

Its goal is:

To make some unimportant weights automatically become 0, thus simplifying the model.


2. Mathematical Form (briefly)

The ordinary loss function is:

$$\text{Loss} = \text{Model Error (e.g., cross-entropy)}$$

With L1 regularization, it becomes:

$$\text{Loss}_{\text{total}} = \text{Original Error} + \lambda \sum |w_i|$$

Where:

  • $w_i$ is each weight parameter

  • $\lambda$ is a hyperparameter controlling the penalty strength (the larger, the "harsher")


3. The Key Feature of L1: Making Some Weights Become 0 (Sparsity)

This is the key difference between L1 and L2: the L1 penalty pushes every weight toward 0 by a constant amount regardless of its size, so small weights get driven exactly to 0, whereas L2 shrinks weights proportionally and almost never zeroes them.

So if you want the model to automatically select important features and discard irrelevant ones, L1 regularization is the ideal choice.


4. An Example (imagine)

You are working on a house price prediction task with 100 input features, but only 5 are useful.

If you use L1 regularization, after training, the model may retain only the weights of these 5 features, while the other 95 become 0.
This is equivalent to the model automatically performing feature selection.
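Unlike L2, which PyTorch exposes directly as the optimizer's weight_decay (see the next section), L1 has no built-in switch, so the penalty is usually added to the loss by hand. A minimal sketch (the model shape, the lambda value, and the training_step helper are illustrative assumptions):

import torch
from torch import nn

model = nn.Linear(100, 1)   # e.g., 100 house-price features -> 1 predicted price
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lambda_l1 = 1e-3            # penalty strength (illustrative value)

def training_step(X, y):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())  # sum of |w_i|
    (loss + lambda_l1 * l1_penalty).backward()
    optimizer.step()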


In Summary

L1 Regularization = Penalizing the absolute values of weights → Encouraging some parameters to become 0 → Automatically selecting important features.

L2 Regularization (Weight Decay) (Ridge Regression)#

One-sentence Definition

L2 Regularization adds a penalty term of the square of model parameters to the loss function,
used to suppress large weights and avoid overfitting.


Mathematical Form

Assuming the original loss function of the model is:

$$\text{Loss} = L(\hat{y}, y)$$

With L2 regularization, it becomes:

$$\text{Loss}_{\text{total}} = L(\hat{y}, y) + \lambda \sum w_i^2$$

Where:

  • $\lambda$: regularization strength (a hyperparameter), typically $10^{-2}, 10^{-3}, 10^{-4}$

  • $\sum w_i^2$: the sum of the squares of all weights (the squared L2 norm)



Practical Effect (Intuition)

Without L2:#

  • The model may wildly amplify a certain weight, leading to excellent fitting of the training data but poor generalization.

With L2 Regularization:#

  • The model becomes "more conservative," not easily enlarging parameters

  • The model is more stable against fluctuations in input (stronger generalization ability)


How to Use in PyTorch?

L2 Regularization = Weight Decay, so in PyTorch, you just need to add it in the optimizer:

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

This weight_decay parameter is exactly $\lambda$; by default it applies an L2 penalty.


How to Use in sklearn?

from sklearn.linear_model import Ridge  # Ridge is linear regression with L2
model = Ridge(alpha=1.0)

In Summary

L2 Regularization = Penalizing the square of weights, compressing but retaining all parameters, enhancing model generalization ability.

Dropout#

One-sentence Definition

Dropout is a regularization method that randomly masks neurons,
making neural networks more "robust" during training and preventing overfitting.


Motivation Behind It

Deep neural networks are prone to overfitting, especially when:

  • The number of layers is deep

  • The training data is small

  • There are many parameters, and the model is complex

The reason is:

The model learns to "rely on certain combinations of neurons" to memorize the training data → Generalization ability deteriorates.

Dropout breaks this dependency
During training, it randomly masks (sets to 0) some neurons, forcing the network not to rely solely on a small group of neurons working together, but rather to have redundancy.


How It Works?

During Training:

For a hidden layer output vector $\mathbf h = (h_1,\dots,h_n)$, the training phase applies:

$$\tilde{\mathbf h}= \frac{\mathbf m}{p}\odot \mathbf h,\qquad m_i \sim \text{Bernoulli}(p)$$

The retention probability $p$ is set by you, typically between 0.5 and 0.9.

  • $\mathbf m$: the random mask that decides whether to "retain" the $i$-th unit.

  • $\odot$: element-wise multiplication.

  • $1/p$: the rescaling factor used in inverted dropout; it ensures $\mathbb E[\tilde h_i] = h_i$, keeping activation scales consistent between training and inference.

So from a mathematical perspective, it is a form of multiplicative binary noise injection, rather than physically deleting neurons.

Random Mask = A tensor filled with 0/1 values that determines which units "turn off" temporarily and which "work normally" during this forward/backward propagation round.
In Dropout (or other noise regularization), it is the minimal "switch matrix" that injects multiplicative noise into the network.

Element values

$$m_{ij} \sim \text{Bernoulli}(p), \qquad m_{ij} = \begin{cases} 1 & \text{retain (probability } p\text{)} \\ 0 & \text{drop (probability } 1-p\text{)} \end{cases}$$

During Testing:

Nothing is dropped: all units are active, and no extra scaling is needed (inverted dropout already rescaled by $1/p$ during training).


Usage Location:
Between fully connected layers (Dense Layer / Linear Layer) and their subsequent activation functions (like ReLU, Sigmoid, Tanh, etc.).

  • It is more common to place it after the activation function: Dense -> Activation -> Dropout
  • This is because the purpose of Dropout is to randomly "turn off" the output of neurons. The output of the activation function represents the final output signal of the neuron, applying Dropout directly simulates the random deactivation of neurons.
  • In rare cases, it may also be placed before the activation function: Dense -> Dropout -> Activation
    Although this is less mainstream, some research and practice have shown that it can sometimes yield results. The logic is to apply Dropout to the linear combination results before the activation function.

Typically applied to one or more hidden layers (Hidden Layers).

  • For deeper networks, Dropout can be used after multiple hidden layers.
    Whether to use it on all hidden layers and what dropout rate (dropout rate/probability) to use is usually a hyperparameter that needs to be adjusted through experimentation.

It is generally not recommended to use Dropout in the output layer (Output Layer).

  • The output layer is responsible for producing the final prediction results. Using Dropout in the output layer may interfere with the final prediction output of the model, especially for classification tasks, as it may randomly drop certain class prediction signals, which is usually undesirable.

PyTorch Implementation:

import torch.nn as nn

net = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # Dropout layer: zeroes each element with probability 0.5 during training (PyTorch's p is the drop probability)
    nn.Linear(128, 10)
)

Advantages of Dropout:

| Advantage | Description |
| --- | --- |
| Suppresses overfitting | Forces the network not to rely on particular features |
| Enhances generalization | Each training pass effectively trains a different "sub-network" |
| Easy to use | Just one line of code to add |

⚠️ Cautions:

  • Dropout only takes effect during the training phase and must be turned off during testing (see the sketch after this list).

  • If the model is already small or the data volume is sufficient, Dropout may sometimes harm performance (underfitting).
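In PyTorch this on/off switch is the module mode: net.train() activates Dropout and net.eval() disables it. A small sketch:

import torch
from torch import nn

net = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(10, 2))
x = torch.randn(1, 10)

net.train()    # training mode: Dropout is active
print(net(x))  # repeated calls give different results (random mask)

net.eval()     # evaluation mode: Dropout is a no-op
print(net(x))  # deterministic output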


In Summary:

Dropout is a strategy of "randomly dropping points during training, fully activating during testing," enhancing model robustness and preventing overfitting.

Early Stopping#

One-sentence Definition

Early Stopping is a method of stopping training by monitoring validation set performance before the model starts to overfit.


🧠 Why is Early Stopping Needed?

When training a model, we often observe the following phenomenon:

| Epoch | Training Set Accuracy | Validation Set Accuracy |
| --- | --- | --- |
| 1 | 60% | 58% |
| 10 | 95% | 88% ✅ |
| 30 | 99% ✅ | 70% ❌ |

The training set improves, but the validation set worsens. This indicates:

The model is "memorizing" the training data → Overfitting has begun.

Continuing training at this point is a waste of time and may even damage the model.


✅ The Core Idea of Early Stopping

It's simple:

Monitor the performance of the validation set, and once it starts to decline, immediately stop training, retaining the best-performing model.

This way:

  • The model will not overfit

  • Training speed is faster

  • Generally does not require complex regularization terms

👉 Therefore, Early Stopping is a training-level regularization method, rather than a structural-level one.
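A minimal sketch of the loop (the patience value and the train_one_epoch/evaluate helpers are placeholders, not a real API):

import copy

def fit_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=100, patience=5):
    # Stop once validation accuracy hasn't improved for `patience` consecutive epochs
    best_acc, best_state, epochs_no_improve = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_acc = evaluate(model)           # accuracy on the validation set
        if val_acc > best_acc:
            best_acc, epochs_no_improve = val_acc, 0
            best_state = copy.deepcopy(model.state_dict())  # remember the best model
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:               # validation stopped improving
                break
    if best_state is not None:
        model.load_state_dict(best_state)   # roll back to the best-performing weights
    return model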


In Summary

Early Stopping is the simplest yet extremely effective regularization strategy: Stop immediately when the validation set stops improving.
