Summary
Configured tensor parallelism on 4x H100 NVLink for Llama 3 405B inference, comparing TP=2 vs TP=4 in terms of throughput, latency, and NVLink bandwidth utilization. Additionally, tested pipeline parallelism PP=2 combined with TP=2.
What I Did
Deployed Llama 3 405B across 4x H100 80GB SXM5 GPUs using tensor parallelism degree 4. Compared performance metrics for TP=4 and TP=2 with PP=2 configurations. Measured all-reduce communication overhead as model size and sequence length changed. Explored pipeline parallelism PP=2 alongside TP=2 as an alternative sharding strategy.
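The weight sharding that tensor parallelism relies on can be illustrated with plain NumPy (a toy-sized sketch of column- and row-parallel linear layers, not the TensorRT-LLM implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # activations
W = rng.standard_normal((8, 8))   # weight matrix to shard across 4 "GPUs"

# Column-parallel: each GPU holds a column shard; outputs concatenate
# (this is how the attention QKV and MLP up-projections are split).
col_shards = np.split(W, 4, axis=1)
col_out = np.concatenate([x @ s for s in col_shards], axis=1)
assert np.allclose(col_out, x @ W)

# Row-parallel: each GPU holds a row shard of W plus the matching slice
# of x; the partial products must be summed -- that sum is exactly the
# NCCL all-reduce measured above.
x_parts = np.split(x, 4, axis=1)
w_parts = np.split(W, 4, axis=0)
row_out = sum(xp @ wp for xp, wp in zip(x_parts, w_parts))
assert np.allclose(row_out, x @ W)
```

Column-parallel layers need no communication on the forward pass; every row-parallel layer forces an all-reduce, which is why the overhead recurs once per attention block and once per MLP block.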
Key Technical Findings
- TP=4 outperformed TP=2 + PP=2 on per-token latency for interactive serving; the PP=2 variant was better suited to batch/offline workloads.
- All-reduce overhead scaled with sequence length and layer count, since every transformer layer incurs communication under tensor parallelism.
- NVLink bandwidth was essential for TP at this scale; PCIe-class bandwidth would have been the bottleneck.
- Per-link NVLink utilization was checked with nvidia-smi during serving.
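To put a rough number on the all-reduce traffic, a back-of-the-envelope model (assuming a ring all-reduce, two all-reduces per transformer layer, FP8 activations, and Llama 3 405B's published shape of 126 layers with hidden size 16384):

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: int, n_gpus: int) -> float:
    # A ring all-reduce moves 2*(N-1)/N of the payload over each GPU's links.
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

BATCH, SEQ, HIDDEN, LAYERS, TP = 1, 2048, 16384, 126, 4
act_bytes = BATCH * SEQ * HIDDEN * 1      # FP8 activations: 1 byte/element
per_allreduce = ring_allreduce_bytes_per_gpu(act_bytes, TP)
per_layer = 2 * per_allreduce             # attention out-proj + MLP down-proj
total_gib = LAYERS * per_layer / 2**30
print(f"{per_allreduce / 2**20:.0f} MiB per all-reduce, "
      f"{total_gib:.1f} GiB per forward pass per GPU")
# -> 48 MiB per all-reduce, 11.8 GiB per forward pass per GPU
```

This is per forward pass, so at interactive request rates the volume is large enough that link bandwidth (NVLink vs PCIe) directly sets the communication floor.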
Commands Used
trtllm-build --checkpoint_dir ./llama3-405b --tp_size 4 --pp_size 1 --gemm_plugin fp8 --output_dir ./engine-tp4
mpirun -n 4 python examples/run.py --engine_dir ./engine-tp4 --tokenizer_dir ./llama3-405b
trtllm-build --checkpoint_dir ./llama3-405b --tp_size 2 --pp_size 2 --gemm_plugin fp8 --output_dir ./engine-tp2pp2
nvidia-smi nvlink --status -i 0
Lessons Learned
- Tensor parallelism benefits from NVLink; PCIe bandwidth is a bottleneck.
- TP=4 outperforms TP=2+PP=2 for interactive serving, while PP is advantageous for batch/offline jobs.
- All-reduce cost is paid in every transformer layer, so total communication overhead grows with layer count and with activation size (batch x sequence length x hidden dimension).
- Verify the NVLink topology with nvidia-smi topo -m before designing a sharding strategy.
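The interactive-vs-batch trade-off above follows from the pipeline bubble: with p stages and m in-flight microbatches, a GPipe-style schedule idles for roughly (p-1)/(m+p-1) of each step. A first-order model (not measured on this setup):

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle ('bubble') fraction of a GPipe/1F1B-style pipeline schedule."""
    return (stages - 1) / (microbatches + stages - 1)

# Interactive serving (one request in flight -> 1 microbatch): half of a
# PP=2 pipeline sits idle, which is why TP=4 wins on latency.
print(pipeline_bubble_fraction(2, 1))   # 0.5
# Offline batch with many microbatches: the bubble shrinks toward zero,
# so PP=2 becomes attractive for throughput.
print(pipeline_bubble_fraction(2, 16))  # ~0.06
```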
Next Steps
Test expert parallelism on MoE models like Mixtral. Benchmark TP=8 on 8xH100 GPUs to find NVLink saturation point.