Self-written
Reasons for Training Parallelism#
A single GPU has limited compute and memory; processing in parallel across multiple GPUs improves training efficiency.
Dimensions of Training Parallelism#
Data Parallelism (DP)#
For mp_size = 1, batch = 64, dp_size = 4
This is equivalent to splitting one batch into 4 parts of 16 samples each, with every part holding a complete copy of the model parameters.
To update parameters, the mean losses (gradients) of the 4 parts are averaged and then the update is applied; in effect this is still the average loss over the full batch times the learning rate.
After each batch finishes, the model parameters must stay synchronized across the parts (in practice by synchronizing gradients before the update).
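As a quick sanity check of that claim, here is a minimal single-process sketch (the numbers are made up): with four equal-sized parts, averaging the four per-part mean losses reproduces the mean loss over the full batch of 64, so the update is still the batch-average loss times the learning rate.

```python
import torch

# Minimal sketch: with equal-sized parts, the average of the 4 per-part mean
# losses equals the mean loss over the full batch of 64.
torch.manual_seed(0)
per_sample_loss = torch.rand(64)          # pretend per-sample losses of one batch

parts = per_sample_loss.chunk(4)          # dp_size = 4 parts of 16 samples each
dp_estimate = torch.stack([p.mean() for p in parts]).mean()  # average of part averages
full_batch_mean = per_sample_loss.mean()  # what a single GPU would compute

assert torch.allclose(dp_estimate, full_batch_mean)
```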
Model Parallelism (MP)#
For mp_size = 4, batch = 64, dp_size = 1
This is equivalent to every part computing on the complete batch of data while holding only a quarter of the model parameters; each part updates its own weights separately, and communication is also needed during the forward and backward computations.
In MP, there is no requirement for global parameter consistency; the parameters of each rank are inherently different, so synchronization updates are not needed.
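A minimal single-process sketch of this point (all sizes and names are illustrative): four simulated "ranks" each hold one quarter of a weight matrix and update only their own shard with their own gradient; at no step are parameters synchronized across shards.

```python
import torch

# Simulate mp_size = 4 "ranks", each holding a different quarter of W.
torch.manual_seed(0)
d_in, d_out, lr = 8, 16, 0.1
X = torch.randn(64, d_in)                                # every rank sees the full batch
shards = list(torch.randn(d_in, d_out).chunk(4, dim=1))  # W = [W0 | W1 | W2 | W3]

for step in range(3):
    for i, W_i in enumerate(shards):                     # "rank i" works on its shard only
        Y_i = X @ W_i                                    # local forward
        dY_i = torch.ones_like(Y_i)                      # pretend upstream gradient
        dW_i = X.T @ dY_i                                # local weight gradient
        shards[i] = W_i - lr * dW_i                      # local update, no parameter sync

# The shards remain different from one another by construction; consistency
# across ranks is neither needed nor enforced.
```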
MP & DP#
In practical applications, MP and DP can be combined, and the rules above still apply within each dimension.
GPT-written
Summary of Training Parallelism: DP / MP / DP+MP#
1. Why is training parallelism needed?#
The memory and computational power of a single GPU are limited, making it difficult to support large deep learning models.
By using multiple GPUs, one can:
- Share the computational load
- Expand the scale of trainable models
- Increase training speed
Parallelism in deep learning training is generally divided into two categories:
- Data Parallelism (DP)
- Model Parallelism (MP)
They can also be combined into hybrid parallelism (DP + MP).
2. Data Parallelism (DP)#
Assuming:
- dp_size = 4
- mp_size = 1
- batch = 64
Data parallelism divides one batch into multiple sub-batches:
- GPU0: 16 samples
- GPU1: 16 samples
- GPU2: 16 samples
- GPU3: 16 samples
Each GPU holds the complete model parameters.
Each GPU independently computes the gradients for its local samples:
g0 = ∇θ L0
g1 = ∇θ L1
g2 = ∇θ L2
g3 = ∇θ L3
Then, the gradients are synchronized through Allreduce(sum):
g_global = g0 + g1 + g2 + g3
Finally, it is divided by dp_size to obtain the average gradient:
g_avg = g_global / 4
Each GPU uses the same gradient to update parameters:
θ ← θ - lr * g_avg
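A minimal runnable sketch of these steps, assuming the script is launched with `torchrun --nproc_per_node=4` so that dp_size equals the world size (the model, data, and learning rate are placeholders):

```python
import torch
import torch.distributed as dist

# Minimal DP sketch: each rank takes its own 16 samples, computes local
# gradients, Allreduces them, averages by dp_size, and applies the update.
def main():
    dist.init_process_group("gloo")               # "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)                          # same seed -> identical θ on every rank
    model = torch.nn.Linear(8, 1)
    data = torch.randn(64, 8)                     # the full batch of 64
    x = data.chunk(world)[rank]                   # this rank's sub-batch of 16 samples

    loss = model(x).pow(2).mean()                 # local loss L_i
    loss.backward()                               # local gradient g_i

    lr = 0.1
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # g_global = g0 + g1 + g2 + g3
            p.grad /= world                                # g_avg = g_global / dp_size
            p -= lr * p.grad                               # θ ← θ - lr * g_avg (same on all ranks)

if __name__ == "__main__":
    main()
```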
Thus, the essence of DP is:
- Data splitting
- Gradient synchronization
- Parameter consistency
3. Model Parallelism (MP)#
Assuming:
- mp_size = 4
- dp_size = 1
- batch = 64
Model parallelism splits model parameters across different GPUs along a certain dimension.
For example, the weights of a fully connected layer:
W = [ W0 | W1 | W2 | W3 ]
Each GPU sees the complete batch of input X (64 × d_in).
Each GPU only performs its own matrix multiplication:
GPU0: Y0 = X @ W0
GPU1: Y1 = X @ W1
GPU2: Y2 = X @ W2
GPU3: Y3 = X @ W3
The complete output is concatenated:
Y = [Y0 | Y1 | Y2 | Y3]
Therefore, the forward pass must concatenate outputs through Allgather.
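A minimal sketch of this forward pass (an illustrative helper name; it assumes torch.distributed is already initialized with mp_size ranks and each rank holds its own column shard `W_shard`):

```python
import torch
import torch.distributed as dist

# Column-parallel forward: local matmul, then Allgather and concatenate.
def column_parallel_forward(X: torch.Tensor, W_shard: torch.Tensor) -> torch.Tensor:
    Y_local = X @ W_shard                                     # Y_i = X @ W_i on this rank
    gathered = [torch.empty_like(Y_local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, Y_local)                        # collect Y_0 .. Y_{mp_size-1}
    return torch.cat(gathered, dim=1)                         # Y = [Y0 | Y1 | Y2 | Y3]
```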
Backpropagation#
Each GPU computes the gradients it is responsible for:
dW_i = X^T @ dY_i
But the input gradient dX must be summed from contributions of all GPUs:
dX = dX0 + dX1 + dX2 + dX3
Thus, Allreduce(sum) is needed.
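A matching backward sketch (again an illustrative helper, same mp_size setup): the weight gradient stays local, while the input gradient is summed across ranks.

```python
import torch
import torch.distributed as dist

# Column-parallel backward: local dW_i, Allreduce(sum) for the input gradient.
def column_parallel_backward(X, W_shard, dY_local):
    dW_local = X.t() @ dY_local                   # dW_i = X^T @ dY_i, updated locally
    dX = dY_local @ W_shard.t()                   # this rank's contribution dX_i
    dist.all_reduce(dX, op=dist.ReduceOp.SUM)     # dX = dX0 + dX1 + dX2 + dX3
    return dW_local, dX
```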
Core Idea of Model Parallelism#
MP does not independently train model chunks; instead:
Multiple GPUs collaboratively complete one forward and backward pass, requiring synchronization of activations and gradients.
Must use:
- Allgather (concatenate activations)
- Allreduce (sum input gradients)
- ReduceScatter (optimization for Megatron-LM)
Characteristics of MP#
- Parameter splitting
- Complete data (each GPU sees the entire batch)
- Each GPU only updates its own parameter shard
- Parameters are inherently different, no need for consistency
- But activations/gradients must be communicated and synchronized
4. Hybrid Parallelism (DP + MP)#
In real large model training, DP and MP are usually used simultaneously.
The GPU cluster structure can be viewed as a 2D grid:
- Rows: DP groups (processing different data)
- Columns: MP groups (model splitting)
Communication patterns:
- DP dimension: gradient synchronization (Allreduce)
- MP dimension: collaborative computation (Allgather / Allreduce / ReduceScatter)
DP+MP is the core of training large models like GPT-3, PaLM, Megatron-LM, etc.
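A minimal sketch of how such a 2D grid of process groups could be built with raw torch.distributed (illustrative; frameworks like Megatron-LM and DeepSpeed construct these groups internally). It assumes world_size = dp_size * mp_size, with consecutive ranks forming one MP group (one full model replica) and ranks holding the same shard index forming one DP group.

```python
import torch.distributed as dist

# Build DP and MP process groups for a dp_size x mp_size grid of ranks.
def build_groups(dp_size: int, mp_size: int):
    rank = dist.get_rank()
    assert dist.get_world_size() == dp_size * mp_size

    # MP groups: consecutive ranks that together hold one complete model replica.
    mp_groups = [dist.new_group(list(range(d * mp_size, (d + 1) * mp_size)))
                 for d in range(dp_size)]
    # DP groups: ranks with the same shard index across all replicas
    # (same model shard, different data).
    dp_groups = [dist.new_group(list(range(m, dp_size * mp_size, mp_size)))
                 for m in range(mp_size)]

    # Every rank must create every group, but only communicates in its own two:
    my_mp_group = mp_groups[rank // mp_size]   # Allgather/Allreduce of activations and grads
    my_dp_group = dp_groups[rank % mp_size]    # Allreduce of parameter gradients
    return my_dp_group, my_mp_group
```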
5. Summary#
Data Parallelism (DP)#
- Data splitting
- Model replication
- Gradient synchronization
- Parameter consistency
Model Parallelism (MP)#
- Model splitting
- Complete data
- Activation/gradient communication
- Independent updates of parameter shards
DP + MP#
- Two-dimensional parallelism
- Foundation for large model training