Second-order Taylor expansion and CART splitting mechanism in XGBoost: A complete understanding from mathematics to intuition

This article systematically summarizes the core ideas in XGBoost: the second-order Taylor expansion, the gradient and Hessian, the CART regression tree, and split gain, focusing on three common points of confusion:

  1. What exactly is the second-order Taylor expansion doing in XGBoost?

  2. What do the linear algebra symbols (like transpose) mean in the multivariable Taylor formula?

  3. What are the essential differences between XGBoost and traditional GBDT in "split decision"?


I. Basic Idea of Taylor Expansion#

1. One-Dimensional Taylor Expansion#

Taylor expansion is used to approximate a function with a polynomial near a certain point:

$$f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac{f''(x_0)}{2!}(x-x_0)^2 + \frac{f^{(3)}(x_0)}{3!}(x-x_0)^3 + \cdots$$

  • First-order term: describes the direction and rate of change of the function (the slope).

  • Second-order term: describes the curvature of the function.

  • Higher-order terms: improve accuracy but have high computational cost.

In machine learning optimization, the expansion is typically truncated after the second-order term.
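As a quick numerical check, the sketch below (using $f(x) = e^x$ purely as an illustrative example) compares the true function with its second-order Taylor approximation around $x_0 = 0$:

```python
import math

def f(x):
    # Illustrative example: f(x) = e^x, whose derivatives are all e^x
    return math.exp(x)

def taylor2(x, x0):
    """Second-order Taylor approximation of f around x0."""
    f0 = math.exp(x0)   # f(x0)
    d1 = math.exp(x0)   # f'(x0)
    d2 = math.exp(x0)   # f''(x0)
    return f0 + d1 * (x - x0) + 0.5 * d2 * (x - x0) ** 2

x0 = 0.0
for x in (0.1, 0.5, 1.0):
    # The approximation is excellent near x0 and degrades as x moves away
    print(f"x={x}: exact={f(x):.4f}, 2nd-order approx={taylor2(x, x0):.4f}")
```

Stopping at the quadratic term is exactly the trade-off described above: enough curvature information to optimize well, without the cost of higher-order terms.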


II. Second-Order Taylor Expansion of Multivariable Functions (Laying the Groundwork for XGBoost)#

For a multivariable function $f(x)$, where $x \in \mathbb{R}^d$, the second-order Taylor expansion near the point $x_0$ is:

$$f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2}(x - x_0)^T H(x_0)(x - x_0)$$

Explanation of Each Term#

  • $\nabla f(x_0)$: gradient vector (composed of first-order partial derivatives; a column vector).

  • $H(x_0)$: Hessian matrix (composed of second-order partial derivatives; describes curvature).

  • $T$: transpose symbol, used to ensure correct dimensions for matrix multiplication.

Why is there a transpose $T$?#

The gradient and the displacement vector are both column vectors, so multiplying them directly is not valid;
transposing the gradient turns the product into an inner product (dot product), which yields a scalar:

$$\nabla f(x_0)^T (x-x_0) = \sum_{k=1}^d \frac{\partial f}{\partial x_k}(x_0)\,(x_k-x_{0,k})$$
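To make the notation concrete, here is a minimal NumPy sketch (the 2-D function and its hand-computed gradient and Hessian are illustrative assumptions) that evaluates $f(x_0) + \nabla f(x_0)^T \Delta x + \frac{1}{2}\Delta x^T H(x_0)\Delta x$:

```python
import numpy as np

def f(x):
    # Illustrative quadratic: f(x) = x0^2 + 3*x0*x1 + 2*x1^2
    return x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2

def grad(x):
    # Gradient: vector of first-order partial derivatives
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0] + 4 * x[1]])

def hess(x):
    # Hessian: matrix of second-order partial derivatives (constant here)
    return np.array([[2.0, 3.0], [3.0, 4.0]])

x0 = np.array([1.0, -1.0])
dx = np.array([0.2, 0.1])

# The transpose in the formula becomes an inner product: grad(x0) @ dx is a scalar
approx = f(x0) + grad(x0) @ dx + 0.5 * dx @ hess(x0) @ dx
print(f(x0 + dx), approx)   # identical for a quadratic f, since higher-order terms vanish
```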

III. Simplification from "Multivariable → One-Dimensional" in XGBoost#

In XGBoost, the loss function is:

$$l(y_i, \hat{y}_i)$$

For each sample:

  • There is only one independent variable: the predicted value $\hat{y}_i$.

  • Therefore:

    • Gradient $g_i$: a scalar.

    • Hessian $h_i$: a scalar.

The second-order Taylor expansion naturally degenerates to:

$$l(y_i, \hat{y}_i + \Delta) \approx l(y_i, \hat{y}_i) + g_i \Delta + \frac{1}{2} h_i \Delta^2$$

This is the common "sample-wise one-dimensional second-order expansion" in XGBoost.
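For example, with the logistic loss the per-sample gradient and Hessian are simple scalars (a minimal sketch; the squared loss $\frac{1}{2}(y_i - \hat{y}_i)^2$ works the same way with $g_i = \hat{y}_i - y_i$ and $h_i = 1$):

```python
import math

def logistic_loss(y, s):
    # l(y, s) = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(s)
    return y * math.log(1 + math.exp(-s)) + (1 - y) * math.log(1 + math.exp(s))

def grad_hess(y, s):
    """Per-sample first derivative g and second derivative h w.r.t. the score s."""
    p = 1.0 / (1.0 + math.exp(-s))  # predicted probability
    return p - y, p * (1.0 - p)

y, y_hat = 1.0, 0.3        # one sample: label and current raw prediction
g, h = grad_hess(y, y_hat)

delta = 0.2                 # a candidate adjustment from the new tree
exact  = logistic_loss(y, y_hat + delta)
approx = logistic_loss(y, y_hat) + g * delta + 0.5 * h * delta ** 2
print(g, h)                 # both scalars
print(exact, approx)        # the second-order approximation is close for small delta
```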


IV. The Role of CART Regression Trees in XGBoost#

1. What is CART#

CART (Classification And Regression Tree) is:

  • A strictly binary tree: every split produces exactly two children.

  • Each leaf outputs a constant value.

  • Can be used for classification and regression.

XGBoost uses CART regression trees, even for classification tasks.


2. Model Form of XGBoost#

$$\hat{y}_i = \sum_{t=1}^{T} f_t(x_i)$$

  • Each $f_t$: a CART regression tree.

  • Each sample ultimately falls into one leaf of each tree and receives that leaf's numerical adjustment (see the sketch below).
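A minimal sketch of this additive form, with two hand-written stub "trees" standing in for real CART regression trees (all names, thresholds, and leaf values here are illustrative):

```python
# Each stub returns a constant leaf value, like a one-split CART regression tree
def tree_1(x):
    return 0.3 if x[0] < 2.0 else -0.1

def tree_2(x):
    return 0.05 if x[1] > 1.0 else 0.2

trees = [tree_1, tree_2]

def predict(x):
    # \hat{y}_i = sum_t f_t(x_i): every tree adds its leaf value to the prediction
    return sum(f_t(x) for f_t in trees)

print(predict([1.5, 0.7]))   # 0.3 + 0.2 = 0.5
```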


V. Core Difference of XGBoost: Explicit Complexity Regularization#

1. Problems with Traditional GBDT#

In traditional GBDT during splits:

  • Only checks if loss decreases.

  • Does not consider whether the tree has become "overly complex".

  • Complexity control relies on manually set rules (like max_depth).


2. Objective Function of XGBoost#

$$\mathcal{Obj} = \sum_i l(y_i, \hat{y}_i) + \Omega(f)$$

Where the regularization term is:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2$$

Meaning:

  • $\gamma$: the structural cost incurred for adding one more leaf.

  • $\lambda$: L2 regularization on leaf weights, preventing excessively large leaf outputs.

👉 Model complexity is directly written into the objective function.
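A minimal sketch of how this penalty is computed for one tree (the weight values are illustrative):

```python
def omega(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2,
    where T is the number of leaves and w_j are the leaf weights."""
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

# Three leaves: each leaf costs gamma, and large weights are penalized via lambda
print(omega([0.4, -0.2, 0.1], gamma=1.0, lam=1.0))   # 3.0 + 0.5 * 0.21 = 3.105
```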


VI. Intuition and Formula Meaning of Split Gain#

1. Essence of Split Decision#

XGBoost asks with each split:

Is (gain from the split) − (cost of the split) positive?
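For reference, in the XGBoost paper this question becomes the split-gain formula

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

where $G$ and $H$ are the sums of the per-sample $g_i$ and $h_i$ in each child. A minimal sketch of the check (the sample values are illustrative):

```python
def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain = 0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
                     - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma"""
    GL, HL = sum(g_left), sum(h_left)
    GR, HR = sum(g_right), sum(h_right)
    children = GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam)
    parent = (GL + GR) ** 2 / (HL + HR + lam)
    return 0.5 * (children - parent) - gamma   # split only if this is positive

# Per-sample gradients/Hessians that would land in the left and right children
gain = split_gain(g_left=[-0.8, -0.6], h_left=[0.20, 0.25],
                  g_right=[0.7, 0.9], h_right=[0.21, 0.24],
                  lam=1.0, gamma=0.5)
print(gain > 0, gain)   # the split is made only when gain minus cost stays positive
```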


2. Meaning of Gain (Understandable without the Formula)#

  • Better fit for left and right child nodes → gain.

  • One more leaf → cost (γ).

  • If gain ≤ cost → do not split.


3. Why Traditional GBDT Cannot Express Gain#

Because it:

  • Lacks a mathematical definition of "split cost".

  • Can only make local judgments on "whether loss decreases".

  • Cannot unify into a gain − cost formula.


VII. A Key Intuitive Summary#

XGBoost is not more aggressive, but better at "calculating".

  • Traditional GBDT:
    As long as the error decreases, keep splitting.

  • XGBoost:
    Is the decrease in error worth this complexity?


VIII. Ultimate One-Sentence Summary#

The core innovation of XGBoost is not in the "tree", but in "writing second-order information and model complexity into the same optimization objective".
The second-order Taylor expansion provides computable gains,
the regularization term provides clear costs,
and the CART structure allows all of this to be efficiently implemented.
