Self-written
Reasons for Training Parallelism#
A single GPU has limited compute and memory; processing in parallel across multiple GPUs improves training efficiency.
Dimensions of Training Parallelism#
Data Parallelism (DP)#
For mp_size = 1, batch = 64, dp_size = 4
This is equivalent to splitting one batch into 4 parts of 16 samples each, with every part holding a complete copy of the model parameters.
To update parameters, the mean losses (gradients) of the 4 parts are averaged and then the update is applied; in effect this is still the average loss over the full batch times the learning rate.
After each batch finishes, the model parameters must stay synchronized across the parts (in practice by synchronizing gradients before the update).
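As a quick sanity check of that claim, here is a minimal single-process sketch (the numbers are made up): with four equal-sized parts, averaging the four per-part mean losses reproduces the mean loss over the full batch of 64, so the update is still the batch-average loss times the learning rate.

```python
import torch

# Minimal sketch: with equal-sized parts, the average of the 4 per-part mean
# losses equals the mean loss over the full batch of 64.
torch.manual_seed(0)
per_sample_loss = torch.rand(64)          # pretend per-sample losses of one batch

parts = per_sample_loss.chunk(4)          # dp_size = 4 parts of 16 samples each
dp_estimate = torch.stack([p.mean() for p in parts]).mean()  # average of part averages
full_batch_mean = per_sample_loss.mean()  # what a single GPU would compute

assert torch.allclose(dp_estimate, full_batch_mean)
```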
Model Parallelism (MP)#
For mp_size = 4, batch = 64, dp_size = 1
This is equivalent to every part computing on the complete batch of data while holding only a quarter of the model parameters; each part updates its own weights separately, and communication is also needed during the forward and backward computations.
In MP, there is no requirement for global parameter consistency; the parameters of each rank are inherently different, so synchronization updates are not needed.
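A minimal single-process sketch of this point (all sizes and names are illustrative): four simulated "ranks" each hold one quarter of a weight matrix and update only their own shard with their own gradient; at no step are parameters synchronized across shards.

```python
import torch

# Simulate mp_size = 4 "ranks", each holding a different quarter of W.
torch.manual_seed(0)
d_in, d_out, lr = 8, 16, 0.1
X = torch.randn(64, d_in)                                # every rank sees the full batch
shards = list(torch.randn(d_in, d_out).chunk(4, dim=1))  # W = [W0 | W1 | W2 | W3]

for step in range(3):
    for i, W_i in enumerate(shards):                     # "rank i" works on its shard only
        Y_i = X @ W_i                                    # local forward
        dY_i = torch.ones_like(Y_i)                      # pretend upstream gradient
        dW_i = X.T @ dY_i                                # local weight gradient
        shards[i] = W_i - lr * dW_i                      # local update, no parameter sync

# The shards remain different from one another by construction; consistency
# across ranks is neither needed nor enforced.
```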
MP & DP#
In practical applications, MP and DP can be combined, and the rules above still apply within each dimension.
GPT-written
Summary of Training Parallelism: DP / MP / DP+MP#
1. Why is training parallelism needed?#
The memory and computational power of a single GPU are limited, making it difficult to support large deep learning models.
By using multiple GPUs, one can:
- Share the computational load
- Expand the scale of trainable models
- Increase training speed
Parallelism in deep learning training is generally divided into two categories:
- Data Parallelism (DP)
- Model Parallelism (MP)
They can also be combined into hybrid parallelism (DP + MP).
2. Data Parallelism (DP)#
Assuming:
- dp_size = 4
- mp_size = 1
- batch = 64
Data parallelism divides one batch into multiple sub-batches:
- GPU0: 16 samples
- GPU1: 16 samples
- GPU2: 16 samples
- GPU3: 16 samples
Each GPU holds the complete model parameters.
Each GPU independently computes the gradients for its local samples:
g0 = ∇θ L0
g1 = ∇θ L1
g2 = ∇θ L2
g3 = ∇θ L3
Then, the gradients are synchronized through Allreduce(sum):
g_global = g0 + g1 + g2 + g3
Finally, it is divided by dp_size to obtain the average gradient:
g_avg = g_global / 4
Each GPU uses the same gradient to update parameters:
θ ← θ - lr * g_avg
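A minimal runnable sketch of these steps, assuming the script is launched with `torchrun --nproc_per_node=4` so that dp_size equals the world size (the model, data, and learning rate are placeholders):

```python
import torch
import torch.distributed as dist

# Minimal DP sketch: each rank takes its own 16 samples, computes local
# gradients, Allreduces them, averages by dp_size, and applies the update.
def main():
    dist.init_process_group("gloo")               # "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)                          # same seed -> identical θ on every rank
    model = torch.nn.Linear(8, 1)
    data = torch.randn(64, 8)                     # the full batch of 64
    x = data.chunk(world)[rank]                   # this rank's sub-batch of 16 samples

    loss = model(x).pow(2).mean()                 # local loss L_i
    loss.backward()                               # local gradient g_i

    lr = 0.1
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # g_global = g0 + g1 + g2 + g3
            p.grad /= world                                # g_avg = g_global / dp_size
            p -= lr * p.grad                               # θ ← θ - lr * g_avg (same on all ranks)

if __name__ == "__main__":
    main()
```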
Thus, the essence of DP is:
- Data splitting
- Gradient synchronization
- Parameter consistency
3. Model Parallelism (MP)#
Assuming:
- mp_size = 4
- dp_size = 1
- batch = 64
Model parallelism splits model parameters across different GPUs along a certain dimension.
For example, the weights of a fully connected layer:
W = [ W0 | W1 | W2 | W3 ]
Each GPU sees the complete batch of input X (64 × d_in).
Each GPU only performs its own matrix multiplication:
GPU0: Y0 = X @ W0
GPU1: Y1 = X @ W1
GPU2: Y2 = X @ W2
GPU3: Y3 = X @ W3
The complete output is concatenated:
Y = [Y0 | Y1 | Y2 | Y3]
Therefore, the forward pass must concatenate outputs through Allgather.
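A minimal sketch of this forward pass (an illustrative helper name; it assumes torch.distributed is already initialized with mp_size ranks and each rank holds its own column shard `W_shard`):

```python
import torch
import torch.distributed as dist

# Column-parallel forward: local matmul, then Allgather and concatenate.
def column_parallel_forward(X: torch.Tensor, W_shard: torch.Tensor) -> torch.Tensor:
    Y_local = X @ W_shard                                     # Y_i = X @ W_i on this rank
    gathered = [torch.empty_like(Y_local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, Y_local)                        # collect Y_0 .. Y_{mp_size-1}
    return torch.cat(gathered, dim=1)                         # Y = [Y0 | Y1 | Y2 | Y3]
```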
Backpropagation#
Each GPU computes the gradients it is responsible for:
dW_i = X^T @ dY_i
But the input gradient dX must be summed from contributions of all GPUs:
dX = dX0 + dX1 + dX2 + dX3
Thus, Allreduce(sum) is needed.
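A matching backward sketch (again an illustrative helper, same mp_size setup): the weight gradient stays local, while the input gradient is summed across ranks.

```python
import torch
import torch.distributed as dist

# Column-parallel backward: local dW_i, Allreduce(sum) for the input gradient.
def column_parallel_backward(X, W_shard, dY_local):
    dW_local = X.t() @ dY_local                   # dW_i = X^T @ dY_i, updated locally
    dX = dY_local @ W_shard.t()                   # this rank's contribution dX_i
    dist.all_reduce(dX, op=dist.ReduceOp.SUM)     # dX = dX0 + dX1 + dX2 + dX3
    return dW_local, dX
```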
Core Idea of Model Parallelism#
MP does not independently train model chunks; instead:
Multiple GPUs collaboratively complete one forward and backward pass, requiring synchronization of activations and gradients.
Must use:
- Allgather (concatenate activations)
- Allreduce (sum input gradients)
- ReduceScatter (optimization for Megatron-LM)
Characteristics of MP#
- Parameter splitting
- Complete data (each GPU sees the entire batch)
- Each GPU only updates its own parameter shard
- Parameters are inherently different, no need for consistency
- But activations/gradients must be communicated and synchronized
4. Hybrid Parallelism (DP + MP)#
In real large model training, DP and MP are usually used simultaneously.
The GPU cluster structure can be viewed as a 2D grid:
- Rows: DP groups (processing different data)
- Columns: MP groups (model splitting)
Communication patterns:
- DP dimension: gradient synchronization (Allreduce)
- MP dimension: collaborative computation (Allgather / Allreduce / ReduceScatter)
DP+MP is the core of training large models like GPT-3, PaLM, Megatron-LM, etc.
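A minimal sketch of how such a 2D grid of process groups could be built with raw torch.distributed (illustrative; frameworks like Megatron-LM and DeepSpeed construct these groups internally). It assumes world_size = dp_size * mp_size, with consecutive ranks forming one MP group (one full model replica) and ranks holding the same shard index forming one DP group.

```python
import torch.distributed as dist

# Build DP and MP process groups for a dp_size x mp_size grid of ranks.
def build_groups(dp_size: int, mp_size: int):
    rank = dist.get_rank()
    assert dist.get_world_size() == dp_size * mp_size

    # MP groups: consecutive ranks that together hold one complete model replica.
    mp_groups = [dist.new_group(list(range(d * mp_size, (d + 1) * mp_size)))
                 for d in range(dp_size)]
    # DP groups: ranks with the same shard index across all replicas
    # (same model shard, different data).
    dp_groups = [dist.new_group(list(range(m, dp_size * mp_size, mp_size)))
                 for m in range(mp_size)]

    # Every rank must create every group, but only communicates in its own two:
    my_mp_group = mp_groups[rank // mp_size]   # Allgather/Allreduce of activations and grads
    my_dp_group = dp_groups[rank % mp_size]    # Allreduce of parameter gradients
    return my_dp_group, my_mp_group
```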
5. Summary#
Data Parallelism (DP)#
- Data splitting
- Model replication
- Gradient synchronization
- Parameter consistency
Model Parallelism (MP)#
- Model splitting
- Complete data
- Activation/gradient communication
- Independent updates of parameter shards
DP + MP#
- Two-dimensional parallelism
- Foundation for large model training