Summary
Measuring and optimizing tokens per watt on DGX Spark GB10 under sustained inference load. The study involved tuning TDP power limit and GPU clocks to find optimal efficiency operating points, comparing batch sizes, and using nvidia-smi dmon for monitoring.
Key Technical Findings
- Default TDP (450W) yields 2080 tok/s with an efficiency of 4.6 tok/s/W.
- Power cap settings between 350W to 400W showed the highest efficiency, with a peak at 380-400W and concurrency 96-112 achieving an efficiency of up to 4.97 tok/s/W.
- FP8 inference is significantly more power efficient than FP16, providing a 64% improvement in efficiency.
Commands
nvidia-smi -i 0 -pl 400nvidia-smi dmon -s pucvmet -d 1 -o DT > power_log.csvpython benchmark.py --concurrency 96 --duration 120 --output_len 256python plot_efficiency.py --log power_log.csv --throughput bench_results.jsonnvidia-smi -i 0 -pm 1
Lessons Learned
- There is an efficiency sweet spot between 350-400W TDP for GB10 LLM inference.
- FP8 improves both throughput and efficiency, being strictly better than FP16 on Blackwell.
- High concurrency enhances tok/s/W as the GPU remains saturated.
- Logging power and throughput concurrently is crucial for identifying peak efficiency.