Power Efficiency Tuning of NVIDIA GB10 for Sustained Inference

Summary

Measuring and optimizing tokens per watt on DGX Spark GB10 under sustained inference load. The study involved tuning TDP power limit and GPU clocks to find optimal efficiency operating points, comparing batch sizes, and using nvidia-smi dmon for monitoring.

Key Technical Findings

Default TDP (450W) yields 2080 tok/s with an efficiency of 4.6 tok/s/W.
Power cap settings between 350W to 400W showed the highest efficiency, with a peak at 380-400W and concurrency 96-112 achieving an efficiency of up to 4.97 tok/s/W.
FP8 inference is significantly more power efficient than FP16, providing a 64% improvement in efficiency.

Commands

nvidia-smi -i 0 -pl 400
nvidia-smi dmon -s pucvmet -d 1 -o DT > power_log.csv
python benchmark.py --concurrency 96 --duration 120 --output_len 256
python plot_efficiency.py --log power_log.csv --throughput bench_results.json
nvidia-smi -i 0 -pm 1

Lessons Learned

There is an efficiency sweet spot between 350-400W TDP for GB10 LLM inference.
FP8 improves both throughput and efficiency, being strictly better than FP16 on Blackwell.
High concurrency enhances tok/s/W as the GPU remains saturated.
Logging power and throughput concurrently is crucial for identifying peak efficiency.