Recently, I have been watching Li Mu's hands-on deep learning course. Here I record some of the important points, both to keep the knowledge from slipping away and to consolidate it.
1. The backward in PyTorch must be a scalar#
import torch
x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)
y.backward()
# By default, PyTorch accumulates gradients, so previous values need to be cleared
# Calling 'backward' on a non-scalar requires passing a 'gradient' argument that specifies the differential function
x.grad.zero_() # Clear the previous gradient of x
y = x * x # Here y is not a scalar; it is a vector
print(y)
# Equivalent to y.backward(torch.ones(len(x)))
y.sum().backward() # y.sum() converts the vector to a scalar for differentiation
print(x.grad) # tensor([0., 2., 4., 6.])
🔹 Key Point: The backward in PyTorch must be called on a scalar

In PyTorch, .backward() is built on the chain rule and is designed to compute the gradient of a scalar (a single output value) with respect to the inputs. If you call y.backward(), then y must be a scalar, because:

- $\nabla_x y$ is called the gradient of $y$ with respect to $x$
- $\frac{\partial y}{\partial x_i}$ is the partial derivative of $y$ with respect to the component $x_i$

What this derivative object actually is depends on the dimensions of $y$ and $x$.
🔸 Case 1: y is a scalar, x is a vector

Assume $y \in \mathbb{R}$ and $x \in \mathbb{R}^n$. Then:

$$\frac{\partial y}{\partial x} = \left[ \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \dots, \frac{\partial y}{\partial x_n} \right]$$

That is, a vector of length $n$ representing the partial derivatives of $y$ with respect to each component of $x$. This is the most common situation in machine learning, such as the gradient of the loss with respect to the weights.
🔸 Case 2: y is a vector, x is a vector

Assume $y \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$. Then:

$$\frac{\partial y}{\partial x} \in \mathbb{R}^{m \times n}$$

This matrix is called the Jacobian matrix, and its $(i, j)$-th element is $\frac{\partial y_i}{\partial x_j}$.
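What `y.backward(gradient=v)` actually computes in this case is a vector–Jacobian product $v^\top J$. A minimal sketch (the elementwise example `y = x * x` is chosen here for illustration):

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x                  # y is a vector; its Jacobian J is diagonal with entries 2*x_i
v = torch.ones_like(y)     # weighting vector v = [1, 1, 1, 1]
y.backward(gradient=v)     # computes v^T J, i.e. the gradient of sum(y)
print(x.grad)              # tensor([0., 2., 4., 6.])
```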
2. Jacobian Matrix#
The Jacobian matrix is: a matrix of partial derivatives of a vector-valued function with respect to an input vector.

Assume:

- Input: $x \in \mathbb{R}^n$
- Output: $f(x) \in \mathbb{R}^m$

Jacobian:

$$J = \frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix} \in \mathbb{R}^{m \times n}$$

Intuitively:

- The $i$-th row: the partial derivatives of $f_i$ with respect to all $x_j$.
- It describes the linear influence of small changes in the input variables on each output variable.
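PyTorch can compute a Jacobian directly. A minimal sketch, with a made-up function $f: \mathbb{R}^3 \to \mathbb{R}^2$:

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    return torch.stack([x[0] * x[1], x[1] + x[2]])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)   # shape (2, 3): rows = outputs, columns = inputs
print(J)
# tensor([[2., 1., 0.],
#         [0., 1., 1.]])
```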
3. Hessian Matrix#
The Hessian matrix is: a matrix of second-order partial derivatives of a scalar function with respect to an input vector.

Assume:

- Input: $x \in \mathbb{R}^n$
- Output: $f(x) \in \mathbb{R}$

Hessian:

$$H =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix} \in \mathbb{R}^{n \times n}$$

Note that when the output is a vector rather than a scalar, the first derivative is the Jacobian matrix, not a single gradient vector, so the Hessian above applies only to scalar-valued functions.
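A minimal sketch of computing a Hessian in PyTorch, with a made-up scalar function $f: \mathbb{R}^2 \to \mathbb{R}$:

```python
import torch
from torch.autograd.functional import hessian

def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1]

x = torch.tensor([1.0, 2.0])
H = hessian(f, x)    # shape (2, 2), symmetric
print(H)
# tensor([[2., 3.],
#         [3., 0.]])
```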
4. The difference between dtype and type in PyTorch#
First, the conclusion:

dtype is the specific numerical type of the elements in a tensor, while type is the Python type of the tensor object itself (including device information). They focus on completely different aspects.
📍 1️⃣ What is dtype?

- Refers to the storage type of each element in the tensor.
- For example:
  - torch.float32 → single-precision floating point
  - torch.int64 → 64-bit integer
  - torch.bool → boolean type

Look at the code:

x = torch.tensor([1, 2, 3], dtype=torch.float32)
print(x.dtype)  # Output: torch.float32

It only tells you "what format the elements of this tensor are stored in."
📍 2️⃣ What is type?

- Refers to the full name of the tensor object in PyTorch, including data type and device.
- For example:
  - torch.FloatTensor → float32 CPU tensor
  - torch.cuda.FloatTensor → float32 GPU tensor
  - torch.IntTensor → int32 CPU tensor

Look at the code:

x = torch.tensor([1, 2, 3])
print(x.type())  # Output: torch.LongTensor (integer tensors default to int64 on CPU)

It tells you "the complete type name of this tensor object in the PyTorch system."
📍 ⚠ Key Differences

Comparison Item | dtype | type
---|---|---
Focus | Numerical type of the elements | Complete PyTorch type of the tensor object (including device information)
Example | torch.float32, torch.int64 | torch.FloatTensor, torch.cuda.FloatTensor
Main Use | Precision, storage, computation | Distinguishing tensor categories, debugging/inspection
How to Change | .to(dtype), .float() | .type() (changes the entire type object)
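A minimal sketch contrasting the two:

```python
import torch

x = torch.tensor([1, 2, 3])
print(x.dtype)    # torch.int64         -> element storage format
print(x.type())   # 'torch.LongTensor'  -> full tensor class name (dtype + device)

y = x.float()             # change dtype to float32
print(y.dtype, y.type())  # torch.float32 'torch.FloatTensor'
```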
5. Soft Label vs Hard Label#
Typically, when training classification models, the labels used are:
✅ hard label
For example, for a 3-class task, the true labels:
Class A → [1, 0, 0]
Class B → [0, 1, 0]
Class C → [0, 0, 1]
These labels are fully one-hot encoded: they only recognize right or wrong.

A soft label is:

Each class corresponds to a probability, rather than a rigid 0 or 1.

For example, for a certain image, soft label → [0.7, 0.2, 0.1]

This indicates:

- a 70% probability it is class A
- a 20% probability it is class B
- a 10% probability it is class C

In other words, the labels "acknowledge ambiguity," rather than being all or nothing.
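A small illustration of both label styles with cross-entropy (a minimal sketch; note that passing class probabilities directly to F.cross_entropy requires PyTorch ≥ 1.10):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1]])    # one sample, 3 classes

hard = torch.tensor([0])                    # hard label: class index (one-hot internally)
soft = torch.tensor([[0.7, 0.2, 0.1]])      # soft label: a probability per class

print(F.cross_entropy(logits, hard))        # standard hard-label loss
print(F.cross_entropy(logits, soft))        # soft-label loss (probability targets)
```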
6. Softmax Regression and Logistic Regression#
📦 Common Points
✅ Both are essentially classification models
✅ Both use linear functions + activation (sigmoid or softmax)
✅ Both use cross-entropy as the loss function
But they are used for different tasks.
🌟 Main Differences
Item | Logistic Regression | Softmax Regression |
---|---|---|
Task Type | Binary classification | Multi-class classification |
Output Layer Activation Function | sigmoid (single value output 0~1) | softmax (vector output, probability for each class) |
Output Dimension | 1D | C-dimensional (C = number of classes) |
Target Label | 0 or 1 | one-hot encoding, e.g., [0,0,1,0] |
Decision Method | Output > 0.5 classified as positive class | The class with the highest probability is the predicted result |
📊 Logistic Regression Details

Suppose you have:

- Input feature $x \in \mathbb{R}^n$
- Weights $w \in \mathbb{R}^n$ and bias $b \in \mathbb{R}$

Model calculation:

$$\hat{y} = \sigma(w^\top x + b)$$

where $\sigma(z) = \dfrac{1}{1 + e^{-z}}$ is the sigmoid function.

The output is a probability value (0 to 1), representing the probability of the positive class.

Loss function (binary cross-entropy):

$$L = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$
📊 Softmax Regression Details

Suppose you have:

- Input feature $x \in \mathbb{R}^n$
- Weight matrix $W$ (shape: num_features × num_classes)
- Bias $b \in \mathbb{R}^C$

Model calculation:

$$\hat{y} = \mathrm{softmax}(W^\top x + b)$$

where

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

The output is a vector of length C, with each element being the probability of the corresponding class.

Loss function (cross-entropy):

$$L = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$
🧠 Why can't logistic regression be directly used for multi-class?

Because sigmoid outputs only one value, while multi-class classification requires a probability for each class, and these probabilities must sum to 1 and be mutually exclusive. That is exactly what softmax is designed for.
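A minimal sketch of the contrast:

```python
import torch

z = torch.tensor([2.0, -1.0, 0.5])

p_binary = torch.sigmoid(z[0])       # a single probability in (0, 1)
p_multi = torch.softmax(z, dim=0)    # C probabilities that sum to 1

print(p_binary)                # tensor(0.8808)
print(p_multi, p_multi.sum())  # the softmax outputs sum to 1
```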
✅ In Summary
Logistic regression ≈ 2-class special case of softmax
Softmax regression = multi-class generalization of logistic regression
7. Likelihood and Probability#
🌟 What is the likelihood function?
In simple terms:
The likelihood function is the probability of observing the data given model parameters.
You can think of it as:
The model assumes a certain parameter → Under this parameter, how likely it is to generate the batch of data we currently have.
📊 Difference from Probability?
Many people easily confuse:
✅ Probability: Given parameters, calculate the likelihood of an event.
✅ Likelihood: Given data, see which parameters are more likely to generate this data.
Although the mathematical formulas look the same, their uses are reversed.
🏗️ Mathematical Expression

Assume:

- Data: $D$
- Parameter: $\theta$

Probability: $P(D \mid \theta)$

Likelihood Function: $L(\theta \mid D) = P(D \mid \theta)$

The difference is:

- Probability: $\theta$ is fixed, looking at $D$.
- Likelihood: $D$ is fixed, looking at $\theta$.
🌟 A Simple Example

Assume you have a coin, tossed 10 times, resulting in 7 heads. We want to estimate the probability of heads $\theta$.

✅ Model Assumption

- Number of tosses: 10
- Probability of heads: $\theta$
- Event: 7 heads

✅ Likelihood Function

$$L(\theta) = \binom{10}{7}\, \theta^7 (1 - \theta)^3$$

Where:

- $\binom{10}{7}$ is the binomial coefficient (a fixed value that does not affect maximization).
- The core is $\theta^7 (1 - \theta)^3$: given $\theta$, the probability of generating this data (7 heads and 3 tails).
🌟 Maximum Likelihood Estimation (MLE)

Typically, we use:

$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$

to find the parameters that maximize the likelihood function.

In this example:

- Maximize $\theta^7 (1 - \theta)^3$
- Optimal $\hat{\theta} = 0.7$

This is maximum likelihood estimation.
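The same estimate can be reached numerically. A minimal sketch that maximizes the log-likelihood of this coin example by gradient descent on its negative:

```python
import torch

theta = torch.tensor(0.5, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.01)

for _ in range(500):
    opt.zero_grad()
    # negative log-likelihood of 7 heads, 3 tails (constant factor dropped)
    loss = -(7 * torch.log(theta) + 3 * torch.log(1 - theta))
    loss.backward()
    opt.step()

print(theta.item())  # ≈ 0.7, the maximum likelihood estimate
```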
🔥 In Machine Learning

Many training processes in machine learning are actually: using maximum likelihood to fit parameters. For example:

- Regression models → maximum likelihood under a Gaussian distribution
- Classification models → maximum likelihood under softmax
- Neural networks → cross-entropy loss, which is itself derived from maximum likelihood
✅ In Summary
Likelihood function = The probability of observing the current data given parameters (viewed as a function of parameters).
Maximizing likelihood means finding the parameters that are most likely to generate the data.
8. Perceptron#
Perceptron is a very basic binary classification linear model, which can be seen as the earliest form of neural networks (single-layer, no hidden layers).
Its main characteristics and key points are:
✅ Basic Form:
A perceptron uses a weight vector w and a bias b to perform a linear combination on the input feature vector x, then passes it through a sign function to decide whether the output is +1 or -1.
Formula:
f(x) = sign(w·x + b)
✅ Goal:
Find a set of w, b that allows all samples to be separated by a hyperplane (i.e., linearly separable).
✅ Training Algorithm (a minimal update loop is sketched below):

1. Initialize weights and bias (usually to zero or small values).
2. For each misclassified sample, update the weights:
   w ← w + η * y * x
   b ← b + η * y
   Here η is the learning rate, and y is the true label (+1 or -1).
3. Continue iterating until all samples are correctly classified (or the maximum number of iterations is reached).
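A minimal sketch of this update rule on a made-up linearly separable toy set:

```python
import torch

X = torch.tensor([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = torch.tensor([1.0, 1.0, -1.0, -1.0])   # labels in {+1, -1}

w = torch.zeros(2)
b = torch.tensor(0.0)
eta = 1.0                                  # learning rate

for _ in range(100):                       # maximum number of iterations
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:         # misclassified (or on the boundary)
            w += eta * yi * xi
            b += eta * yi
            errors += 1
    if errors == 0:                        # all samples correctly classified
        break

print(w, b)
```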
✅ Limitations (hard limitations):

- Can only handle linearly separable problems; it is completely powerless against non-linear data (like the XOR problem).
- No probability output, only hard classification.
- Easily affected by noise and outliers.
✅ Significance and Historical Status:
Although deep learning has long surpassed the perceptron, it is the starting point of neural network development. In 1969, Minsky and Papert pointed out in their book "Perceptrons" that it could not solve the XOR problem, which directly led to the arrival of the AI winter. It was not until the emergence of multi-layer networks (MLP) and backpropagation algorithms that this limitation was broken.
9. Multi-layer Perceptron#
Binary classification problem formula

The only difference between the binary-classification formulation and the multi-class one is the shape of the output-layer weights: for multi-class they map the hidden layer to $k$ outputs (where $k$ is the number of classes) instead of 1.

Each hidden layer requires an activation function (a non-linear function); without activation functions, the whole stack collapses into one big linear function.

The output layer may not need an activation function.

The multi-class multi-layer perceptron differs from softmax regression only in that it adds hidden layers; the rest remains the same.
Code
import torch
from torch import nn
# Build the model, the hidden layer contains 256 hidden units and uses the ReLU activation function
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
Code Explanation:

nn.Sequential() is used to stack a series of sub-modules (layers, activation functions, etc.) in order, automatically organizing the forward propagation. In other words, it is an ordered container that wraps the network modules you want, in sequence.

nn.Flatten() flattens the input multi-dimensional tensor into a one-dimensional vector. For MNIST, the input image shape is [batch_size, 1, 28, 28] (grayscale), and after flattening it becomes [batch_size, 784], which is convenient for feeding into the fully connected layer.
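A quick sanity check of the shapes (a minimal sketch; the batch size 32 is arbitrary, and `net` is the Sequential model defined above):

```python
X = torch.rand(32, 1, 28, 28)   # a batch of 32 grayscale 28×28 images
print(net(X).shape)             # torch.Size([32, 10])
```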
10. Activation Functions#
Sigmoid Function#
✅ The Sigmoid activation function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

✅ Output Range:

(0, 1) — maps any real number to between 0 and 1.

✅ Graph Features:

- S-shaped curve (hence the name: sigmoid = S-shaped).
- Symmetrical around (0, 0.5).
- When $x$ is very large or very small, the gradient approaches 0 — gradient vanishing.
✅ Advantages:

- Can be interpreted as a probability (especially suitable for the last layer of binary classification).
- Smooth, continuous, and differentiable.

✅ Disadvantages (fatal):

- Gradient vanishing: when $x$ is very large or very small, the derivative approaches 0, making it nearly impossible to update weights during backpropagation.
- Output not centered at 0: this can slow down convergence during gradient updates because the positive and negative gradients are asymmetrical.
- Easily saturated: inputs that are too large or too small are flattened.

✅ Current Mainstream Practice:

- Except for the last layer of binary classification, hidden layers almost never use sigmoid; they use ReLU or more advanced variants instead.
- For binary classification, use sigmoid in the last layer and binary cross-entropy as the loss function.
Tanh Function#
✅ The Tanh (hyperbolic tangent) activation function is defined as:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

✅ Output Range:

(-1, 1) — compresses any real number to between -1 and 1.

✅ Graph Features:

- S-shaped curve (similar to sigmoid, but centered at the origin).
- Symmetrical around (0, 0) (better than sigmoid: the output mean is closer to 0, which helps optimization converge).
- When $x$ is very large or very small, gradient vanishing can also occur.
✅ Advantages:

- Compared to sigmoid, the output is centered at 0, which is friendlier for gradient updates.
- Smooth, continuous, and differentiable.

✅ Disadvantages (core issue):

- Like sigmoid, it is prone to saturation → gradient vanishing.

✅ Current Mainstream Practice:

- In some models that require symmetric output (like RNNs), tanh is still useful.
- However, for the hidden layers of deep neural networks, the modern mainstream still uses ReLU and its improved versions.
ReLU Function#
✅ The ReLU (Rectified Linear Unit) activation function is defined as:

$$\mathrm{ReLU}(x) = \max(0, x)$$

✅ Output Range:

$[0, +\infty)$

✅ Graph Features:

- When $x < 0$, the output is 0.
- When $x \geq 0$, the output is $x$.

Simple, direct, a piecewise linear function.
✅ Advantages:

- Simple computation, fast convergence. Sigmoid and tanh both require exponentials, and exponential calculations are relatively expensive.
- Less prone to gradient vanishing (the gradient in the positive range is always 1).
- Sparse activation (many neurons output 0, which helps simplify the model).

✅ Disadvantages:

- Dying ReLU problem: if a neuron gets stuck in the negative range during training (continuously outputting 0), it may never update again (because the gradient in the negative range is 0).
- Asymmetrical with respect to the input (only retains positive values).

✅ Modern Improved Versions:

- Leaky ReLU: gives a small slope in the negative range.
- Parametric ReLU (PReLU): makes the negative-range slope learnable.
- ELU, GELU, Swish: further improvements.
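A minimal sketch comparing the three activations on the same inputs:

```python
import torch

x = torch.linspace(-3, 3, 7)
print(torch.sigmoid(x))  # in (0, 1), saturates at both ends
print(torch.tanh(x))     # in (-1, 1), zero-centered
print(torch.relu(x))     # 0 for x < 0, identity for x >= 0
```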
11. Model Complexity#
12. Regularization#
In machine learning, regularization methods are a set of techniques used to prevent overfitting and improve generalization. They introduce "penalties" or "constraints" during learning to restrict the model's complexity, so that the model does not fit every detail (especially the noise) of the training data too "perfectly" and can therefore adapt better to new, unseen data.

Regularization is only used during training and is not needed during prediction.
What is regularization?
In one sentence: Regularization actively limits the model's degrees of freedom by adding "constraints/penalties" to the loss function, forcing it to learn simple and well-generalized patterns rather than memorizing the noise in the training set.
Mathematically, it is often written as

$$\min_{\theta} \; L(\theta) + \lambda\, \Omega(\theta)$$

- $L(\theta)$: original empirical loss (cross-entropy, MSE, etc.)
- $\Omega(\theta)$: regularization term (the larger it is, the more "complex" the model)
- $\lambda$: regularization strength; $\lambda = 0$ degenerates to no regularization, $\lambda \to \infty$ compresses the model to its simplest form
Core Functions
Goal | Explanation |
---|---|
Suppress Overfitting | Reduce variance, improve robustness to unseen samples |
Numerical Stability | Avoid weight explosion or matrix singularity |
Interpretability | Sparsity or structural constraints make features/sub-networks easier to read |
Prevent Multicollinearity | Still provide a unique solution under high-dimensional multicollinear features |
L1 Regularization (Lasso Regression)#
✅ **1. What is L1 Regularization?**

One-sentence definition: L1 regularization adds the sum of the absolute values of all weights to the loss function as a penalty term. Its goal is to make some unimportant weights automatically become 0, making the model "simpler."
✅ 2. Mathematical Form (Briefly)

The ordinary loss function is $L(\theta)$. With L1 regularization, it becomes:

$$L(\theta) + \lambda \sum_i |w_i|$$

Where:

- $w_i$ is each weight parameter
- $\lambda$ is a hyperparameter controlling the penalty strength (the larger, the "harsher")
✅ 3. The Biggest Feature of L1: Making Some Weights Exactly 0 (Sparsity)

This is the key difference between L1 and L2: L1 drives unimportant weights exactly to 0, while L2 only shrinks weights toward 0 without eliminating them. So if you want the model to automatically select important features and discard junk features, L1 regularization is the natural choice.
✅ 4. An Example (Imagine)
You are doing a house price prediction task with 100 input features, but only 5 are useful.
If you use L1 regularization, after training, the model might only retain the weights of these 5 features, while the other 95 become 0.
This is equivalent to the model automatically performing feature selection.
✅ In Summary
L1 Regularization = Penalizes the absolute values of weights → Encourages some parameters to become 0 → Automatically selects important features.
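PyTorch's optimizers only build in L2 (weight_decay), so an L1 penalty is typically added to the loss by hand. A minimal sketch (the model and data here are made up for illustration):

```python
import torch

model = torch.nn.Linear(100, 1)
criterion = torch.nn.MSELoss()
lam = 1e-3  # regularization strength λ

X, y = torch.randn(32, 100), torch.randn(32, 1)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = criterion(model(X), y) + lam * l1_penalty
loss.backward()  # gradients now include the sparsity-inducing L1 term
```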
L2 Regularization (Weight Decay) (Ridge Regression)#
✅ One-sentence Definition
L2 Regularization adds a penalty term of the square of model parameters to the loss function,
used to suppress large weights and avoid overfitting.
✅ Mathematical Form

Assuming the original loss function of the model is $L(\theta)$, with L2 regularization it becomes:

$$L(\theta) + \lambda \sum_i w_i^2$$

Where:

- $\lambda$: regularization strength (a hyperparameter), typically a small value such as the 1e-4 used in the PyTorch example below
- $\sum_i w_i^2$: the sum of the squares of all weights (the squared L2 norm)
✅ Practical Effect (Intuition)

Without L2:

- The model might wildly amplify certain weights, fitting the training data excellently but generalizing poorly.

With L2 regularization:

- The model becomes "more conservative" and does not easily inflate parameters
- The model is more stable under input fluctuations (stronger generalization ability)
✅ How to Use in PyTorch?
L2 Regularization = Weight Decay, so in PyTorch, you just need to add it in the optimizer:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
This weight_decay parameter is actually $\lambda$! By default it applies L2 regularization.
✅ How to Use in sklearn?
from sklearn.linear_model import Ridge # Ridge is linear regression with L2
model = Ridge(alpha=1.0)
✅ In Summary
L2 Regularization = Penalizes the square of weights, compressing but retaining all parameters, enhancing model generalization ability.
Dropout#
✅ One-sentence Definition
Dropout is a regularization method that randomly masks neurons,
making the neural network more "robust" during training and preventing overfitting.
✅ Motivation Behind It
Deep neural networks are prone to overfitting, especially when:

- the network is deep
- the training data is small
- there are many parameters and the model is complex
The reason is:
The model learns to "rely on certain combinations of neurons" to memorize the training data → Generalization ability deteriorates.
Dropout breaks this dependency —
During training, it randomly masks (sets to 0) a portion of neurons, forcing the network not to rely solely on a small group of neurons but to have redundancy.
✅ How It Works?

During Training:

For a hidden-layer output vector $h$, apply:

$$h' = \frac{m \odot h}{p}$$

- The retention probability $p$ is set by you, usually between 0.5 and 0.9.
- $m$: a random mask whose $i$-th entry decides whether to "retain" the $i$-th unit.
- $\odot$: element-wise multiplication.
- $1/p$: the rescaling factor commonly used in inverted dropout, ensuring $\mathbb{E}[h'] = h$ and keeping activation scales consistent between training and inference.

So from a mathematical perspective, dropout is a form of multiplicative binary noise injection; it does not physically delete neurons.

Random mask = a tensor filled with 0/1 values that decides which units "turn off" temporarily and which "work normally" during this round of forward/backward propagation. In dropout (and other noise regularization), it is the minimal "switch matrix" that injects multiplicative noise into the network. Its element values are drawn independently from a Bernoulli distribution.

During Testing:

No dropping; all units are active, and the outputs are not scaled (the rescaling already happened during training).
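A minimal sketch of this mask-and-rescale computation with an explicit 0/1 mask (here p is the retention probability, matching the formula above; note that nn.Dropout's p argument is the drop probability instead):

```python
import torch

p = 0.8
h = torch.randn(5)
m = torch.bernoulli(torch.full_like(h, p))  # random 0/1 mask, P(m_i = 1) = p
h_train = m * h / p   # training: mask and rescale so E[h_train] = h
h_test = h            # testing: no masking, no extra scaling
print(m)
print(h_train)
```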
✅ Usage Location:

Between a fully connected layer (Dense / Linear layer) and its subsequent activation function (ReLU, Sigmoid, Tanh, etc.), or after the activation.

- It is more common to place it after the activation function: Dense -> Activation -> Dropout. This is because the purpose of Dropout is to randomly "turn off" neuron outputs; the activation output is the neuron's final output signal, so applying Dropout to it directly simulates the random deactivation of neurons.
- In some cases it may also be placed before the activation function: Dense -> Dropout -> Activation. Although not mainstream, some research and practice indicate this can also work; the logic is to apply Dropout to the linear combination computed by the weights before it passes through the activation.

Typically applied to one or more hidden layers.

- For deeper networks, Dropout can be used after multiple hidden layers. Whether to use it on all hidden layers, and what dropout rate (probability) to use, is a hyperparameter usually tuned by experiment.

Generally not recommended in the output layer.

- The output layer is responsible for producing the final predictions. Using Dropout there may interfere with the model's final output, especially for classification tasks, since it may randomly drop some class prediction signals, which is usually undesirable.
✅ PyTorch Implementation:
import torch.nn as nn
net = nn.Sequential(
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(p=0.5), # Dropout layer: masks 50% of neurons during training
nn.Linear(128, 10)
)
✅ Advantages of Dropout:
Advantages | Description |
---|---|
Suppresses Overfitting | Forces the network not to rely on certain features |
Enhances Generalization Ability | Each training iteration is like training different "sub-networks" |
Easy to Use | Can be added with just one line of code |
⚠️ Cautions:

- Dropout only works during training and must be turned off during testing (in PyTorch, model.eval() disables it automatically).
- If the model is already small or the data volume is sufficient, Dropout can sometimes hurt performance (underfitting).
✅ In Summary:
Dropout is a strategy of "randomly dropping points during training, fully activating during testing," enhancing model robustness and preventing overfitting.
Early Stopping#
✅ One-sentence Definition
Early Stopping is a method of stopping training by monitoring validation set performance before the model starts to overfit.
🧠 Why is Early Stopping Needed?
When training a model, we often see the following phenomenon:
Epoch | Training Accuracy | Validation Accuracy |
---|---|---|
1 | 60% | 58% |
10 | 95% | 88% ✅ |
30 | 99% ✅ | 70% ❌ |
The training set keeps improving, but the validation set keeps declining. This indicates:
The model is "memorizing" the training data → Overfitting has begun.
Continuing training at this point is a waste of time and may even damage the model.
✅ **The Core Idea of Early Stopping**

Very simple: monitor the performance of the validation set, stop training as soon as it starts to decline, and keep the best-performing model.
This way:

- The model will not overfit
- Training finishes faster
- Complex regularization terms are generally not required

👉 Therefore, Early Stopping is a training-level regularization method, rather than a structural one.
✅ In Summary
Early Stopping is the simplest yet extremely effective regularization strategy: Stop immediately when the validation set stops improving.
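A minimal sketch of the stopping loop with patience (`model`, `train_one_epoch`, and `evaluate` are hypothetical placeholders for your own training code):

```python
import torch

best_acc, patience, wait = 0.0, 5, 0
for epoch in range(100):
    train_one_epoch(model)                          # hypothetical training step
    val_acc = evaluate(model)                       # hypothetical validation metric
    if val_acc > best_acc:
        best_acc, wait = val_acc, 0
        torch.save(model.state_dict(), 'best.pt')   # keep the best model so far
    else:
        wait += 1
        if wait >= patience:
            break                                   # validation stopped improving -> stop
```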
13. Weight Initialization#
To avoid "signal explosion or vanishing," we want the mean of the activations (forward) and gradients (backward) in each layer to be 0 and their variance to be constant.

Specifically, treat the outputs and gradients of each layer as random variables:

Direction | Mean | Variance
---|---|---
Forward | $\mathbb{E}[h_i] = 0$ | $\mathrm{Var}[h_i] = a$
Backward | $\mathbb{E}\!\left[\frac{\partial L}{\partial h_i}\right] = 0$ | $\mathrm{Var}\!\left[\frac{\partial L}{\partial h_i}\right] = b$

Where $a, b$ are two constants you set (usually taken as 1 or 2), the same for all layers and all channels.
Deriving the "variance conservation" condition for one layer (fully connected layer as an example)

Let $h' = W h$, with $W \in \mathbb{R}^{n_\text{out} \times n_\text{in}}$.

Common Assumptions

- $\mathbb{E}[w_{ij}] = 0$, $\mathrm{Var}[w_{ij}] = \gamma$
- $\mathbb{E}[h_j] = 0$, $\mathrm{Var}[h_j] = a$
- $W$ and $h$ are independent, and their elements are approximately independent

Thus

$$\mathrm{Var}[h'_i] = \sum_{j=1}^{n_\text{in}} \mathrm{Var}[w_{ij} h_j] = n_\text{in}\,\gamma\,a$$

Requiring the output variance to continue to equal $a$ ⇒ $n_\text{in}\,\gamma = 1$.

The same reasoning applied backward (gradients propagate through $W^\top$) requires $n_\text{out}\,\gamma = 1$.

Considering both ends → the compromise solution

$$\gamma = \frac{2}{n_\text{in} + n_\text{out}}$$

This is Glorot/Xavier initialization.
Xavier Initialization (Glorot Initialization)#
In one sentence: It is a strategy for setting the initial values of network weights, alleviating the gradient vanishing/explosion problem in deep networks by keeping forward activations and backward gradients at each layer close in variance. Proposed by Xavier Glorot and Yoshua Bengio (2010).
Specific Formulas

Sampling Distribution | Suggested Variance | Actual Sampling Range / Standard Deviation
---|---|---
Uniform $U(-r, r)$ | $\sigma^2 = \frac{2}{n_{in} + n_{out}}$ | $r = \sqrt{\frac{6}{n_{in} + n_{out}}}$
Normal $\mathcal{N}(0, \sigma^2)$ | same as above | $\sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}$

- n_in: the input dimension received by each neuron in this layer
- n_out: the number of neurons in this layer
For convolutional layers, the fan values also include the kernel size: $n_{in} = c_{in} \cdot k_h \cdot k_w$ and $n_{out} = c_{out} \cdot k_h \cdot k_w$ (this is how PyTorch computes fan-in/fan-out).
Difference from He Initialization

Name | Suggested Variance | Applicable Activation
---|---|---
Xavier/Glorot | $\frac{2}{n_{in} + n_{out}}$ | Sigmoid, tanh, soft-sign, and other symmetric activations
He/Kaiming | $\frac{2}{n_{in}}$ | ReLU, Leaky-ReLU, GELU (one-sided activations lose half the energy, requiring larger variance)
Xavier Uniform Initialization Code#
- Single Linear Layer
import torch
import torch.nn as nn
# Create a Linear layer
linear = nn.Linear(20, 256)
# Initialize the linear layer
nn.init.xavier_uniform_(linear.weight)
# Initialize bias (usually to 0)
if linear.bias is not None:
    nn.init.zeros_(linear.bias)
- Sequential Initialization
import torch
import torch.nn as nn
# Create a container
net = nn.Sequential(nn.Linear(20, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, 1)
)
# Define initialization function
def init_weight(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Apply initialization (recursively visits every submodule)
net.apply(init_weight)
Xavier Normal Initialization Code#
- Single Linear Layer
import torch
import torch.nn as nn
# Create a Linear layer
linear = nn.Linear(20, 256)
# Initialize the linear layer
nn.init.xavier_normal_(linear.weight)
# Initialize bias (usually to 0)
if linear.bias is not None:
    nn.init.zeros_(linear.bias)
- Sequential Initialization
import torch
import torch.nn as nn
# Create a container
net = nn.Sequential(nn.Linear(20, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, 1)
)
# Define initialization function
def init_weight(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Apply initialization
net.apply(init_weight)
14. nn.Module#
You can flexibly customize the computation process by creating subclasses with nn.Module
as the parent class.
nn.Parameter#
nn.Parameter is a special Tensor that, when set as an attribute of nn.Module
, is automatically registered as a trainable parameter of the model, meaning it will appear in model.parameters()
and be updated by the optimizer during training.
A regular torch.Tensor defaults to requires_grad=False, and even if you manually set requires_grad=True, it will not be automatically registered as a parameter unless you wrap it with nn.Parameter
.
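A minimal sketch of the difference (the class name Demo is made up for illustration):

```python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(3))         # registered as a parameter
        self.t = torch.randn(3, requires_grad=True)   # NOT registered

m = Demo()
print([name for name, _ in m.named_parameters()])  # ['w'] — only the Parameter shows up
```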
Custom Layer#
class CustomLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # @ is matrix multiplication
        return x @ self.weight.T + self.bias
Custom Block#
class ResidualBlock(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.fc1 = nn.Linear(in_features, in_features)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(in_features, in_features)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return x + out  # Residual connection
Parameter Management#
# First, focus on a multi-layer perceptron with a single hidden layer
import torch
from torch import nn
net = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
X = torch.rand(size=(4, 2))
net(X)
print(net) # Print network structure
'''
Sequential(
(0): Linear(in_features=2, out_features=4, bias=True)
(1): ReLU()
(2): Linear(in_features=4, out_features=1, bias=True)
)
'''
print(net(X))
'''
tensor([[-0.1564],
[ 0.2563],
[ 0.2011],
[ 0.0006]], grad_fn=<AddmmBackward0>)
'''
print(net[2].state_dict()) # Access parameters, net[2] is the last output layer
'''
OrderedDict([('weight', tensor([[ 0.3754, -0.1346, -0.2410, -0.0513]])), ('bias', tensor([-0.1647]))])
'''
print(type(net[2].bias)) # Target parameter
# <class 'torch.nn.parameter.Parameter'>
print(net[2].bias)
# Parameter containing:
# tensor([-0.1647], requires_grad=True)
print(net[2].bias.data)
# tensor([-0.1647])
print(net[2].weight.grad == None) # No backward computation yet, so grad is None
# True
print(*[(name, param) for name, param in net[0].named_parameters()]) # Access all parameters at once
'''
('weight', Parameter containing:
tensor([[-0.4437, 0.5371],
[ 0.5344, -0.1997],
[-0.3801, -0.6202],
[-0.3033, -0.4238]], requires_grad=True)) ('bias', Parameter containing:
tensor([0.7005, 0.0617, 0.1107, 0.6609], requires_grad=True))
'''
print(*[(name, param.shape) for name, param in net.named_parameters()]) # 0 is the name of the first layer, 1 is ReLU, which has no parameters
# ('0.weight', torch.Size([4, 2])) ('0.bias', torch.Size([4])) ('2.weight', torch.Size([1, 4])) ('2.bias', torch.Size([1]))
print(net.state_dict()['2.bias'].data) # Access parameters by name
# tensor([-0.1647])
# Save the trained weights
torch.save(net.state_dict(), 'train_weight')
Built-in Initialization#
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
def init_normal(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, mean=0, std=0.01)  # The trailing underscore means m.weight is replaced in place
        nn.init.zeros_(m.bias)

net.apply(init_normal)  # Recursively applies until all layers are initialized
print(net[0].weight.data[0])
print(net[0].bias.data[0])
15. torch.device#
PyTorch defaults to CPU computation; to compute on the GPU you must explicitly select a GPU device.
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Or pick a specific card (note: torch.cuda.device is a context manager, not a device object)
device = torch.device("cuda:0")
def evaluate_accuracy_gpu(net, data_iter, device=None):
    """Calculate the model's accuracy on the dataset using GPU"""
    if isinstance(net, torch.nn.Module):
        # Ensure what is passed in is a PyTorch model
        net.eval()
        # 1. Set to evaluation mode, disabling dropout and batchnorm training behavior
        if not device:
            device = next(iter(net.parameters())).device
            # 2. Automatically detect the device where the model is located:
            #    if no device is manually passed, take it from the model's parameters (CPU or GPU)

Breaking down next(iter(net.parameters())).device:

• net.parameters() → returns all trainable parameters in the model
• iter(...) → creates a parameter iterator
• next(...) → gets the first parameter; a built-in Python function that retrieves the next element from an iterator
• .device → the device where this parameter is located (which could be 'cpu' or 'cuda:0')
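For reference, a minimal self-contained completion of this function (the course version accumulates with d2l utilities; this sketch assumes classification with class-index labels):

```python
import torch

def evaluate_accuracy_gpu(net, data_iter, device=None):
    """Calculate the model's accuracy on the dataset using GPU."""
    if isinstance(net, torch.nn.Module):
        net.eval()                # evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    correct, total = 0, 0
    with torch.no_grad():         # no gradients needed for evaluation
        for X, y in data_iter:
            X, y = X.to(device), y.to(device)
            correct += (net(X).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
```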
16. torch.utils#
The torch.utils module in PyTorch contains auxiliary functions, utility classes, and helpers that support training, data loading, model visualization, and other tasks.
Here are the most commonly used submodules of torch.utils:
torch.utils.data#
Used for data loading and processing:
• Dataset: You can inherit it to customize datasets.
• DataLoader: Loads data in batches, supporting multi-threading, shuffling, batching, etc.
• Subset: Selects a subset from an existing dataset.
• ConcatDataset: Concatenates multiple datasets.
from torch.utils import data

train_iter = data.DataLoader(mnist_train,
                             batch_size,
                             shuffle=True,
                             num_workers=get_dataloader_workers())
test_iter = data.DataLoader(mnist_test,
                            batch_size,
                            shuffle=False,
                            num_workers=get_dataloader_workers())
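A minimal sketch of a custom Dataset (the tensors and shapes here are made up for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, features, labels):
        self.features, self.labels = features, labels

    def __len__(self):
        return len(self.labels)              # number of samples

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

ds = MyDataset(torch.randn(100, 4), torch.randint(0, 2, (100,)))
loader = DataLoader(ds, batch_size=16, shuffle=True)
for X, y in loader:
    print(X.shape, y.shape)  # torch.Size([16, 4]) torch.Size([16])
    break
```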