Summary
Configured tensor parallelism on 4x H100 NVLink for Llama 3 405B inference, comparing TP=2 vs TP=4 in terms of throughput, latency, and NVLink bandwidth utilization. Additionally, tested pipeline parallelism PP=2 combined with TP=2.
What I Did
Deployed Llama 3 405B across 4x H100 80GB SXM5 GPUs using tensor parallelism degree 4. Compared performance metrics for TP=4 and TP=2 with PP=2 configurations. Measured all-reduce communication overhead as model size and sequence length changed. Explored pipeline parallelism PP=2 alongside TP=2 as an alternative sharding strategy.
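The weight sharding that tensor parallelism relies on can be illustrated with plain NumPy (a toy-sized sketch of column- and row-parallel linear layers, not the TensorRT-LLM implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # activations
W = rng.standard_normal((8, 8))   # weight matrix to shard across 4 "GPUs"

# Column-parallel: each GPU holds a column shard; outputs concatenate
# (this is how the attention QKV and MLP up-projections are split).
col_shards = np.split(W, 4, axis=1)
col_out = np.concatenate([x @ s for s in col_shards], axis=1)
assert np.allclose(col_out, x @ W)

# Row-parallel: each GPU holds a row shard of W plus the matching slice
# of x; the partial products must be summed -- that sum is exactly the
# NCCL all-reduce measured above.
x_parts = np.split(x, 4, axis=1)
w_parts = np.split(W, 4, axis=0)
row_out = sum(xp @ wp for xp, wp in zip(x_parts, w_parts))
assert np.allclose(row_out, x @ W)
```

Column-parallel layers need no communication on the forward pass; every row-parallel layer forces an all-reduce, which is why the overhead recurs once per attention block and once per MLP block.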
Key Technical Findings
- TP=4 outperformed TP=2 + PP=2 on per-token latency for interactive serving; the PP=2 variant was better suited to batch/offline workloads.
- All-reduce overhead scaled with sequence length and layer count, since every transformer layer incurs communication under tensor parallelism.
- NVLink bandwidth was essential for TP at this scale; PCIe-class bandwidth would have been the bottleneck.
- Per-link NVLink utilization was checked with nvidia-smi during serving.
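To put a rough number on the all-reduce traffic, a back-of-the-envelope model (assuming a ring all-reduce, two all-reduces per transformer layer, FP8 activations, and Llama 3 405B's published shape of 126 layers with hidden size 16384):

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: int, n_gpus: int) -> float:
    # A ring all-reduce moves 2*(N-1)/N of the payload over each GPU's links.
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

BATCH, SEQ, HIDDEN, LAYERS, TP = 1, 2048, 16384, 126, 4
act_bytes = BATCH * SEQ * HIDDEN * 1      # FP8 activations: 1 byte/element
per_allreduce = ring_allreduce_bytes_per_gpu(act_bytes, TP)
per_layer = 2 * per_allreduce             # attention out-proj + MLP down-proj
total_gib = LAYERS * per_layer / 2**30
print(f"{per_allreduce / 2**20:.0f} MiB per all-reduce, "
      f"{total_gib:.1f} GiB per forward pass per GPU")
# -> 48 MiB per all-reduce, 11.8 GiB per forward pass per GPU
```

This is per forward pass, so at interactive request rates the volume is large enough that link bandwidth (NVLink vs PCIe) directly sets the communication floor.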
Commands Used
trtllm-build --checkpoint_dir ./llama3-405b --tp_size 4 --pp_size 1 --gemm_plugin fp8 --output_dir ./engine-tp4
mpirun -n 4 python examples/run.py --engine_dir ./engine-tp4 --tokenizer_dir ./llama3-405b
trtllm-build --checkpoint_dir ./llama3-405b --tp_size 2 --pp_size 2 --gemm_plugin fp8 --output_dir ./engine-tp2pp2
nvidia-smi nvlink --status -i 0
Lessons Learned
- Tensor parallelism benefits from NVLink; PCIe bandwidth is a bottleneck.
- TP=4 outperforms TP=2+PP=2 for interactive serving, while PP is advantageous for batch/offline jobs.
- All-reduce cost is paid in every transformer layer, so total communication overhead grows with layer count and with activation size (batch x sequence length x hidden dimension).
- Verify the NVLink topology with nvidia-smi topo -m before designing a sharding strategy.
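The interactive-vs-batch trade-off above follows from the pipeline bubble: with p stages and m in-flight microbatches, a GPipe-style schedule idles for roughly (p-1)/(m+p-1) of each step. A first-order model (not measured on this setup):

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle ('bubble') fraction of a GPipe/1F1B-style pipeline schedule."""
    return (stages - 1) / (microbatches + stages - 1)

# Interactive serving (one request in flight -> 1 microbatch): half of a
# PP=2 pipeline sits idle, which is why TP=4 wins on latency.
print(pipeline_bubble_fraction(2, 1))   # 0.5
# Offline batch with many microbatches: the bubble shrinks toward zero,
# so PP=2 becomes attractive for throughput.
print(pipeline_bubble_fraction(2, 16))  # ~0.06
```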
Next Steps
Test expert parallelism on MoE models like Mixtral. Benchmark TP=8 on 8xH100 GPUs to find NVLink saturation point.