The learning process over the past few days has gradually clarified my understanding of linear algebra. In this post, I will summarize some key new content.
When dealing with matrices and singular values, the following understanding should be established:
✅ A matrix is a spatial operator.
✅ Singular value decomposition helps you break down the essence of a matrix: rotation → stretching → rotation.
✅ The size ordering of singular values tells you: in which directions the matrix truly has strength, and which directions are ineffective.
1. Orthogonal Matrix#
The core definition of an Orthogonal Matrix
An $$n \times n$$ real matrix $$Q$$ is called an orthogonal matrix if it satisfies
$$Q^{\mathsf T}Q = QQ^{\mathsf T} = I_n.$$
Here, $$Q^{\mathsf T}$$ is the transpose of $$Q$$, and $$I_n$$ is the $$n$$-dimensional identity matrix.
The inverse is the transpose: $$Q^{-1} = Q^{\mathsf T}$$.
This simplifies calculations and ensures numerical stability.
An orthogonal matrix is a real matrix that "preserves inner products"—it rotates or flips the coordinate system but never stretches or distorts it.
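A quick numerical check in PyTorch (a minimal sketch; the 2 × 2 rotation matrix below is my own example, not from the note):

```python
import math
import torch

theta = 0.3
# A 2x2 rotation matrix is a classic orthogonal matrix
Q = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])

# Q^T Q = I  (the inverse is the transpose)
print(torch.allclose(Q.T @ Q, torch.eye(2), atol=1e-6))  # True

# Orthogonal matrices preserve lengths and inner products
v = torch.tensor([3.0, 4.0])
print(v.norm().item(), (Q @ v).norm().item())  # both 5.0
```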
2. Angle Brackets#
Here, the angle brackets
$$\langle u, v\rangle$$
are the symbol for the "inner product." In the most common case—the real vector space $$\mathbb{R}^n$$—it is equivalent to the dot product we are familiar with:
$$\langle u, v\rangle = u \cdot v = \sum_{i=1}^{n} u_i v_i.$$
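For example, in PyTorch (a tiny sketch with made-up vectors), the inner product of two real vectors is just the dot product:

```python
import torch

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([4.0, 5.0, 6.0])

# <u, v> = sum_i u_i * v_i
print(torch.dot(u, v).item())   # 32.0
print((u * v).sum().item())     # same thing, written element-wise
```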
3. Matrix Position Exchange#
- Eliminate the factor on the left
  - Suppose the equation is $$PX = B$$, with $$P$$ attached to the left of $$X$$.
  - Multiply both sides by its inverse from the left:
    $$P^{-1}PX = P^{-1}B \;\Longrightarrow\; X = P^{-1}B.$$

Note:
You must multiply from the left consistently on both sides;
do not multiply on the right instead (that would disrupt the order of multiplication).
- Eliminate the factor on the right
  - Suppose the equation is $$XP = B$$, with $$P$$ attached to the right of $$X$$.
  - Multiply both sides by its inverse from the right:
    $$XPP^{-1} = BP^{-1} \;\Longrightarrow\; X = BP^{-1}.$$
  - If $$P$$ is an orthogonal matrix, $$P^{-1} = P^{\mathsf T}$$, then $$X = BP^{\mathsf T}$$.
Why can't the order be reversed?
- Matrix multiplication is not commutative; once you multiply on the wrong side, the factor "inserts" itself in a different position, and in general $$P^{-1}B \neq BP^{-1}$$.
- The same operation must be performed on the same side of both sides of the equation for the equality to hold.
- This is essentially the same as the order of function composition or coordinate transformation: the order in which transformations are performed must be written in the corresponding positions of the product, and cannot be arbitrarily swapped.
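A small sketch of this in PyTorch, assuming the equation $$PX = B$$ used above (the matrices here are random examples of my own):

```python
import torch

torch.manual_seed(0)
P = torch.randn(3, 3)          # assumed invertible (a random matrix almost surely is)
X_true = torch.randn(3, 3)
B = P @ X_true                 # the equation  P X = B

# Eliminate P on the left: multiply both sides by P^{-1} FROM THE LEFT
X = torch.linalg.inv(P) @ B
print(torch.allclose(X, X_true, atol=1e-5))      # True

# Multiplying on the wrong side does NOT recover X
wrong = B @ torch.linalg.inv(P)
print(torch.allclose(wrong, X_true, atol=1e-5))  # False (in general)
```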
4. Similar Diagonalizable Matrix#
A similar diagonalizable matrix (commonly referred to as a "diagonalizable matrix") means:
There exists an invertible matrix $$P$$ such that
$$P^{-1}AP = \Lambda,$$
where $$\Lambda$$ is a diagonal matrix.
At this point, we say that $$A$$ can be diagonalized through a similarity transformation, or simply that $$A$$ is diagonalizable.
The "mechanical process" of diagonalization
- Find the eigenvalues: solve $$\det(A - \lambda I) = 0$$.
- Find the eigenvectors: for each $$\lambda_i$$, solve $$(A - \lambda_i I)v = 0$$.
- Assemble $$P$$: arrange the linearly independent eigenvectors as columns in matrix $$P$$.
- Obtain $$\Lambda$$: fill in the corresponding eigenvalues along the diagonal: $$\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$$.

Thus, $$A = P\Lambda P^{-1}$$.
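A hedged illustration of this mechanical process (the 2 × 2 matrix below is my own example, chosen so the eigenvalues come out real):

```python
import torch

# An example diagonalizable matrix (my own choice; eigenvalues 1 and 3)
A = torch.tensor([[2.0, 1.0],
                  [1.0, 2.0]])

# Steps 1-2: eigenvalues and eigenvectors
eigvals, eigvecs = torch.linalg.eig(A)   # complex dtype in general
P = eigvecs.real                         # Step 3: eigenvectors as columns of P
Lam = torch.diag(eigvals.real)           # Step 4: eigenvalues on the diagonal

# Verify  A = P Λ P^{-1}
print(torch.allclose(P @ Lam @ torch.linalg.inv(P), A, atol=1e-5))  # True
```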
5. Singular Value Decomposition#
| Symbol | Meaning |
|---|---|
| $$A$$ | Given real symmetric matrix ($$A^{\mathsf T} = A$$) |
| $$Q$$ | Orthogonal matrix: $$Q^{\mathsf T}Q = I$$, column vectors are orthogonal and of unit length |
| $$\Lambda$$ | Diagonal matrix: $$\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$$ |
The expression $$A = Q\Lambda Q^{\mathsf T}$$ is called orthogonal similarity diagonalization; geometrically, it means "rotate (or mirror) the coordinate system → A only retains independent stretching."
- Why can "real symmetric matrices always be orthogonally diagonalized"?
Spectral Theorem:
For any real symmetric matrix $$A$$, there exists an orthogonal matrix $$Q$$ such that $$Q^{\mathsf T}AQ$$ is diagonal, and the diagonal elements are the eigenvalues of $$A$$.
- Real eigenvalues: Symmetry ensures that all eigenvalues are real numbers.
- Orthogonal eigenvectors: If $$\lambda_i \neq \lambda_j$$, the corresponding eigenvectors must be orthogonal.
- Repeated roots can also take orthogonal bases: The same eigenvalue may correspond to multiple vectors; in this case, perform Gram–Schmidt in the subspace they span.
- Step-by-step textual analysis
| Step | Explanation |
|---|---|
| 1. Find all eigenvalues and eigenvectors of $$A$$ | Solve $$\det(A - \lambda I) = 0$$ to obtain all $$\lambda_i$$; for each $$\lambda_i$$, solve $$(A - \lambda_i I)v = 0$$ to find the eigenvectors. |
| 2. Arrange the eigenvalues in a certain order along the diagonal to obtain the diagonal matrix $$\Lambda$$ | For example, arrange them in ascending order as $$\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$$. The order does not matter, as long as it is consistent with the order of the column vectors later. |
| 3. Eigenvectors corresponding to different eigenvalues are orthogonal; for repeated eigenvalues, use Gram–Schmidt to orthogonalize and normalize | If $$\lambda_i \neq \lambda_j$$, the corresponding vectors are naturally orthogonal, no action needed. If $$\lambda$$ is repeated (geometric multiplicity > 1), first take a set of linearly independent eigenvectors, then perform Gram–Schmidt within that subspace to make them orthogonal and normalize them (adjust lengths to 1). |
| 4. Arrange the processed eigenvectors as columns, following the order of the eigenvalues on the diagonal, to obtain the orthogonal matrix $$Q$$ | At this point $$Q^{\mathsf T}Q = I$$ and $$Q^{\mathsf T}AQ = \Lambda$$. |
- A specific small example

① Find the eigenvalues

② Find the eigenvectors

③ Normalize

④ Assemble $$Q$$ and $$\Lambda$$, then verify $$Q^{\mathsf T}AQ = \Lambda$$
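A minimal sketch of the same procedure, using a small symmetric matrix of my own choosing:

```python
import torch

# A small symmetric example matrix (eigenvalues 3 and 5)
A = torch.tensor([[4.0, 1.0],
                  [1.0, 4.0]])

# For symmetric matrices, eigh returns real eigenvalues and an orthogonal Q
eigvals, Q = torch.linalg.eigh(A)
Lam = torch.diag(eigvals)

print(torch.allclose(Q.T @ Q, torch.eye(2), atol=1e-6))  # Q is orthogonal
print(torch.allclose(Q @ Lam @ Q.T, A, atol=1e-5))       # A = Q Λ Q^T
```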
6. Determinant (det · )#
The determinant is an operation that maps an $$n \times n$$ square matrix $$A$$ to a scalar $$\det A$$.
This scalar encapsulates the most essential geometric and algebraic information of the matrix: volume scaling factor, invertibility, product of eigenvalues, etc.
Formula
| Order | Formula |
|---|---|
| 2 × 2 | $$a_{11}a_{22} - a_{12}a_{21}$$ |
| 3 × 3 | "Sarrus' rule" or expand along the first row |

Core properties (any definition must satisfy)
| Property | Explanation |
|---|---|
| Multiplicativity | $$\det(AB) = \det A \cdot \det B$$ |
| Invertibility Criterion | $$A$$ is invertible $$\iff \det A \neq 0$$ |
| Linearity in Rows/Columns | Each row (or column) enters the determinant linearly |
| Alternating | Swapping two rows (or columns) changes the sign of the determinant |
| Diagonal Product | For upper/lower triangular matrices: $$\det A = a_{11}a_{22}\cdots a_{nn}$$ |
| Product of Eigenvalues | $$\det A = \lambda_1\lambda_2\cdots\lambda_n$$ (including multiplicities) |
3×3 Hand Calculation Example
Let $$A$$ be a 3 × 3 matrix and expand along the first row.
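A sketch of the same expansion in code, using an example 3 × 3 matrix of my own (not the one from the note):

```python
import torch

# Example 3x3 matrix
A = torch.tensor([[1.0, 2.0, 3.0],
                  [0.0, 4.0, 5.0],
                  [1.0, 0.0, 6.0]])

# Expansion along the first row: a11*M11 - a12*M12 + a13*M13
det_by_hand = (1.0 * (4 * 6 - 5 * 0)
             - 2.0 * (0 * 6 - 5 * 1)
             + 3.0 * (0 * 0 - 4 * 1))
print(det_by_hand)                 # 22.0
print(torch.linalg.det(A).item())  # 22.0 (up to floating point)
```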
In summary
"Taking the determinant" means: collapsing an square matrix into a single number through a set of alternating, linear rules, and this number encodes key information about the matrix's volume scaling, direction, invertibility, and product of eigenvalues.
7. Rank of a Matrix#
What exactly is the "rank" of a matrix?
Equivalent Perspective | Intuitive Explanation |
---|---|
Linear Independence | The number of linearly independent vectors that can be selected from the rows (or columns) is the rank. |
Dimensionality of Space | The dimension of the subspace spanned by the column vectors (column space) = the dimension of the subspace spanned by the row vectors (row space) = rank. |
Full Rank Minor | The order of the largest non-zero determinant in the matrix = rank. |
Singular Values | In the SVD $$A = U\Sigma V^{\mathsf T}$$, the number of non-zero singular values = rank. |
Linear Independence
Below are three comparative cases using a 3 × 3 small matrix to make the statement "rank = how many linearly independent column (or row) vectors can be selected" clear.
| Case | Linear Relationship Among the Columns $$v_1, v_2, v_3$$ | Rank |
|---|---|---|
| 1 | All three columns lie on the same line—only 1 independent vector | 1 |
| 2 | $$v_1, v_2$$ are not collinear ⇒ they span a 2-dimensional plane; $$v_3$$ lies in this plane | 2 |
| 3 | Any two columns cannot linearly express the third ⇒ all three columns span the entire $$\mathbb R^3$$ | 3 |
How to determine "independence"?
- Manual calculation: combine the columns into a matrix and perform elimination → the number of non-zero rows is the rank.
- Concept: if there exist constants $$c_1, c_2, c_3$$, not all 0, such that $$c_1v_1 + c_2v_2 + c_3v_3 = 0$$, the vectors are dependent; otherwise, they are independent.
  - Case 1: $$2v_1 - v_2 = 0$$ → dependent
  - Case 2: only $$v_3 = v_1 + v_2$$ is dependent, while $$v_1, v_2$$ are independent
  - Case 3: any non-trivial combination ≠ 0 → all three vectors are independent

In summary: rank = how much independent information (dimension) this matrix can truly "hold."
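A quick check with `torch.linalg.matrix_rank`, using example matrices of my own chosen to match the three cases above:

```python
import torch

# Case 1: every column is a multiple of the first (2*v1 = v2, 3*v1 = v3)
A1 = torch.tensor([[1.0, 2.0, 3.0],
                   [2.0, 4.0, 6.0],
                   [3.0, 6.0, 9.0]])
# Case 2: v3 = v1 + v2
A2 = torch.tensor([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0]])
# Case 3: three independent columns
A3 = torch.eye(3)

for A in (A1, A2, A3):
    print(torch.linalg.matrix_rank(A).item())  # 1, 2, 3
```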
8. Low-Rank Approximation#
Why does truncated SVD (low-rank approximation) only require storing $$k(m+n) + k$$ numbers?

When the original matrix $$A \in \mathbb R^{m \times n}$$ is truncated to rank $$k$$, it can be written as
$$A \approx U_k \Sigma_k V_k^{\mathsf T}.$$
| Block | Shape | Number of Scalars to Save | Explanation |
|---|---|---|---|
| $$U_k$$ | $$m \times k$$ | $$mk$$ | Left singular vectors: only the first $$k$$ columns are taken |
| $$V_k$$ | $$n \times k$$ | $$nk$$ | Right singular vectors: similarly, only the first $$k$$ columns |
| $$\Sigma_k$$ | $$k \times k$$ diagonal | $$k$$ | Only the $$k$$ singular values on the diagonal are retained |
Adding the three blocks together gives
$$mk + nk + k = k(m+n) + k.$$

- $$U_k$$ and $$V_k$$: each has $$k$$ columns, with each column storing a vector of length equal to the number of rows ($$m$$ or $$n$$), i.e. $$mk + nk$$ numbers.
- $$\Sigma_k$$: it is a diagonal matrix, so only the $$k$$ diagonal elements are required—not $$k^2$$.

Therefore, using the rank-$$k$$ SVD approximation instead of storing the original matrix, the parameter count drops from $$mn$$ to $$k(m+n) + k$$.
If $$k \ll \min(m, n)$$, the saved space becomes quite considerable.
Lowering rank = reducing information dimension, low-rank storage = reducing parameter count/memory simultaneously
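A minimal sketch of the parameter count, with dimensions chosen arbitrarily by me:

```python
import torch

torch.manual_seed(0)
m, n, k = 100, 80, 10
A = torch.randn(m, n)

U, S, Vh = torch.linalg.svd(A, full_matrices=False)
U_k, S_k, Vh_k = U[:, :k], S[:k], Vh[:k, :]   # keep only the top-k pieces
A_k = U_k @ torch.diag(S_k) @ Vh_k            # rank-k approximation of A

original = m * n
compressed = U_k.numel() + S_k.numel() + Vh_k.numel()
print(original, compressed)   # 8000 vs 1810 = k*(m+n) + k
```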
9. Norms#
The "double vertical bars" $|,\cdot,|$ in linear algebra represent norms.
-
For a vector , the most commonly used is the Euclidean norm:
In the figure, represents the sum of the squares of each component of the vector . -
For a matrix , if also written as , it commonly refers to the Frobenius norm: . However, the figures here involve vectors.
In contrast, the single vertical bar typically represents absolute value (scalar) or determinant . Thus, double vertical bars denote the "length" of vectors/matrices, while single vertical bars denote the magnitude of scalars or determinants—different objects and meanings.
Common Euclidean Distance for Vectors -- 2-Norm (L2 Norm)#
```python
import torch

b = torch.tensor([3.0, 4.0])
print(b.norm())  # Outputs 5.0
```

`.norm()` is a method of PyTorch tensors (`torch.Tensor`).
Common Frobenius Norm for Matrices#
Matrices also have a "length"—the commonly used one is the Frobenius norm:
| Name | Notation | Formula (for $$A \in \mathbb R^{m \times n}$$) | Analogy with Vectors |
|---|---|---|---|
| Frobenius Norm | $$\Vert A\Vert_F$$ | $$\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij}^{2}}$$ | Like the vector 2-norm $$\Vert v\Vert = \sqrt{\sum_i v_i^2}$$ |
1. Why can it also be written as a "matrix dot product"?

The commonly used inner product in matrix space is
$$\langle A, B\rangle = \operatorname{tr}(A^{\mathsf T}B),$$
where $$\operatorname{tr}(\cdot)$$ is the trace operation (sum of diagonal elements).
Taking the inner product of $$A$$ with itself gives:
$$\langle A, A\rangle = \operatorname{tr}(A^{\mathsf T}A) = \sum_{i,j} A_{ij}^2.$$
Thus:
$$\|A\|_F = \sqrt{\operatorname{tr}(A^{\mathsf T}A)}.$$
This is the matrix version of $$\|v\| = \sqrt{v \cdot v}$$—just replacing the vector dot product with the "trace dot product."
The Frobenius norm is indeed equal to the square root of the sum of the squares of all singular values, that is:
$$\|A\|_F = \sqrt{\sum_i \sigma_i^2}.$$
Here:

- $$\|A\|_F$$ is the Frobenius norm of matrix $$A$$
- $$\sigma_i$$ are the singular values of $$A$$
Expanded explanation:
The Frobenius norm is defined as:
$$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\operatorname{tr}(A^{\mathsf T}A)}.$$
But singular value decomposition (SVD) tells us:
$$A = U\Sigma V^{\mathsf T},$$
where $$\Sigma$$ is a diagonal matrix, with the singular values $$\sigma_i$$ on the main diagonal.
Since the Frobenius norm is orthogonally invariant (orthogonal/unitary transformations do not change the norm), we can directly compute:
$$\|A\|_F^2 = \|U\Sigma V^{\mathsf T}\|_F^2 = \|\Sigma\|_F^2 = \sum_i \sigma_i^2.$$
Thus, ultimately:
$$\|A\|_F = \sqrt{\sum_i \sigma_i^2}.$$
Beware of misconceptions
Note:
✅ Not the square root of a single singular value, nor the maximum singular value
✅ It is the square root of the sum of the squares of all singular values
The spectral norm looks at "the direction that stretches the most," while the Frobenius norm sums all energies.
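A quick numerical sanity check (random matrix of my own):

```python
import torch

torch.manual_seed(0)
A = torch.randn(4, 3)

fro = A.norm()                            # Frobenius norm (default for matrices)
sigmas = torch.linalg.svdvals(A)          # all singular values

print(fro.item())
print(sigmas.pow(2).sum().sqrt().item())  # same value: sqrt(sum of sigma_i^2)
print(sigmas.max().item())                # spectral norm: NOT the same in general
```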
Spectral Norm of a Matrix#
✅ Definition of Spectral Norm
The spectral norm of a matrix $$A$$ is defined as:
$$\|A\|_2 = \max_{\|x\| = 1} \|Ax\|.$$
In simple terms, it is the maximum value to which matrix stretches a unit vector.
Singular values inherently represent the stretching transformations of the matrix.
It equals the maximum singular value of $$A$$:
$$\|A\|_2 = \sigma_{\max}(A).$$
From another perspective: the spectral norm ≈ the maximum length to which a unit vector is stretched after being input into the matrix.
✅ Its relationship with the Frobenius norm
- Frobenius Norm → looks at overall energy (sum of squares of matrix elements)
- Spectral Norm → looks at the maximum amount of stretching in a single direction
In other words:
- Frobenius is like the total "volume" of the matrix
- Spectral norm is like the "most extreme" stretching rate in a single direction
✅ Example: Why is it important?
Imagine a linear layer $$y = Wx$$ in a neural network:

- If $$\|W\|_2$$ is very large, even small perturbations in the input will be amplified, making the network prone to overfitting and sensitive to noise.
- If $$\|W\|_2$$ is moderate, the output changes remain stable under input perturbations, leading to better generalization.

Thus, modern methods (like spectral normalization) directly constrain the spectral norm of $$W$$ within a certain range during training.
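A minimal sketch of the spectral norm and the amplification bound it gives (the weight matrix is a random example of mine, not an actual trained layer):

```python
import torch

torch.manual_seed(0)
W = torch.randn(5, 5)

# Spectral norm = largest singular value
spec = torch.linalg.matrix_norm(W, ord=2)
print(torch.allclose(spec, torch.linalg.svdvals(W).max()))  # True

# It bounds how much any input perturbation can be amplified: ||W dx|| <= ||W||_2 * ||dx||
dx = torch.randn(5)
print((W @ dx).norm().item(), (spec * dx.norm()).item())    # first <= second
```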
⚠ Directly stating the drawbacks
The spectral norm is powerful, but:
- It only focuses on a single maximum direction, ignoring stretching in the other directions;
- It is more expensive to compute than the Frobenius norm (it requires a singular value decomposition rather than a simple element-wise sum of squares).
Summary Comparison
| | Euclidean Norm (2-norm, ‖v‖) | Frobenius Norm (‖A‖_F) |
|---|---|---|
| Object | Vector | Matrix |
| Definition | $$\sqrt{\sum_i v_i^2}$$ | $$\sqrt{\sum_{i,j} A_{ij}^2}$$ |
| Equivalent Expression | $$\sqrt{v^{\mathsf T}v}$$ | $$\sqrt{\operatorname{tr}(A^{\mathsf T}A)}$$ |
| Geometric Meaning | Length of the vector in n-dimensional Euclidean space | Length when viewing the matrix elements as one "long vector" |
| Unit/Scale | Has the same metric as the coordinate axes | Same for matrices; does not depend on the arrangement of rows and columns |
| Common Uses | Error measurement, regularization, distance | Weight decay, matrix approximation error, kernel methods |
| Relationship with Spectral Norm | A vector has only one singular value, equal to its 2-norm | $$\Vert A\Vert_2 \le \Vert A\Vert_F$$; equal if rank = 1 |
- Same idea, different dimensions
- The Euclidean norm is the square root of a vector's dot product with itself: $$\Vert v\Vert = \sqrt{v^{\mathsf T}v}$$.
- The Frobenius norm treats all matrix elements as one long vector and does the same; in matrix language, it can be expressed as $$\Vert A\Vert_F = \sqrt{\operatorname{tr}(A^{\mathsf T}A)}$$. This is "transpose → multiply → take trace."
- When to use which?
Scenario | Recommended Norm | Reason |
---|---|---|
Prediction error, gradient descent | Euclidean (vector residual) | Residuals are naturally column vectors |
Regularization of network weights (Dense / Conv) | Frobenius | Does not care about parameter shape, only overall magnitude |
Comparing matrix approximation quality (SVD, PCA) | Frobenius | Easily corresponds to the sum of squares of singular values |
Stability/Lipschitz bounds | Spectral Norm ($$\Vert A\Vert_2$$) | Concerned with stretching rates rather than total energy |
- Intuitive differences
- Euclidean: measures the length in a single direction;
- Frobenius: measures the total sum of each element's energy, so for matrices no particular column or row is special; all elements are treated equally.
One-sentence memory:
Euclidean Norm: The "ruler" for vectors.
Frobenius Norm: Measures the overall size of a matrix as if "flattened" with the same ruler.
10. Transpose of Matrix Multiplication#
In matrix algebra, there is a fixed "reversal order" rule for the transpose of the product of two (or more) matrices:
$$(AB)^{\mathsf T} = B^{\mathsf T}A^{\mathsf T}.$$
This means: transpose each matrix first, then reverse the order of multiplication.
This property holds for any dimension-matching real (or complex) matrices and can be extended recursively:
$$(A_1 A_2 \cdots A_k)^{\mathsf T} = A_k^{\mathsf T} \cdots A_2^{\mathsf T} A_1^{\mathsf T}.$$
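A one-line check in PyTorch (random example matrices):

```python
import torch

torch.manual_seed(0)
A = torch.randn(2, 3)
B = torch.randn(3, 4)

# (AB)^T = B^T A^T  -- transpose each factor, then reverse the order
print(torch.allclose((A @ B).T, B.T @ A.T))  # True

# Keeping the original order, A.T @ B.T, would not even have matching dimensions here
```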
xLog Editing Markdown Document Notes
- Ensure all mathematical expressions are enclosed in `$$ … $$`
- If there are single `$n \times n$` style expressions, change them to `n × n` or `$$n\times n$$`
Reference video: