Summary
Most transformer inference workloads are memory-bandwidth-bound, not compute-bound — even on H100. But the reported 3.35 TB/s HBM3 bandwidth on H100 SXM5 is a ceiling you rarely hit. This post explains why, what the realistic ceiling is, and how to profile accurately using Nsight Compute.
What I Did
I instrumented a production LLM inference workload (70B model, BF16, TP=8) with Nsight Compute and analyzed the memory hierarchy at the kernel level. The goal was to understand why measured memory bandwidth was ~60% of theoretical peak.
Key Technical Findings
H100 SXM5 memory hierarchy
| Level | Bandwidth | Latency | Capacity |
|---|---|---|---|
| HBM3 | 3.35 TB/s | ~400 ns | 80 GB |
| L2 | ~12 TB/s | ~50 ns | 50 MB |
| L1/SMEM | ~33 TB/s | ~25 ns | 256 KB/SM |
| Registers | — | ~1 ns | 256 KB/SM |
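These numbers set up the roofline arithmetic used later in the post. As a back-of-the-envelope check, dividing an assumed dense BF16 tensor-core peak for H100 SXM5 (~989 TFLOPS, a datasheet figure, not measured here) by the 3.35 TB/s HBM3 peak gives the ridge point. A minimal sketch:

```python
# Back-of-the-envelope ridge point; both peaks are datasheet assumptions.
PEAK_BF16_TFLOPS = 989.0   # H100 SXM5 dense BF16 tensor-core peak (assumed)
PEAK_HBM_TBPS = 3.35       # HBM3 theoretical bandwidth

def ridge_point(flops_tflops: float, bw_tbps: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return flops_tflops / bw_tbps

print(ridge_point(PEAK_BF16_TFLOPS, PEAK_HBM_TBPS))  # ~295 FLOP/byte
```

Decode-time GEMVs sit far below that intensity, which is why this workload is bandwidth-bound in the first place.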
Why 3.35 TB/s is not achievable
- DRAM row buffer efficiency — HBM DRAM rows are 2 KB. Random access patterns cause frequent row buffer misses, reducing effective bandwidth to 40–70% of peak.
- ECC overhead — with ECC enabled (the default on H100 SXM), you lose ~6% of bandwidth. ECC can be disabled for some workloads.
- Bank conflicts — in BF16 attention, the access pattern to K/V tensors often creates L2 bank conflicts; we measured a 15–25% penalty in our kernels.
- Prefetch pipeline depth — CUDA async memory prefetch with a depth < 4 leaves the memory pipeline underutilized during compute phases.
Realistic bandwidth ceilings observed
| Operation | Measured BW | % of Peak |
|---|---|---|
| Matrix load (coalesced) | 2.8 TB/s | 84% |
| Attention K/V access | 1.9 TB/s | 57% |
| MLP weight load | 2.6 TB/s | 78% |
| KV cache read | 1.4 TB/s | 42% |
KV cache read is the key bottleneck — the access pattern is highly irregular (batch x head x seq, non-sequential due to paged cache), causing poor HBM row buffer utilization.
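To put the 1.4 TB/s figure in context, here is a hedged sketch of the per-token KV read cost for a llama-3-70B-like shape. The layer/head/dim values are assumptions about the model architecture, and tensor parallelism (TP=8 here) divides the per-GPU share of this traffic:

```python
def kv_read_bytes_per_token(n_layers: int = 80, n_kv_heads: int = 8,
                            head_dim: int = 128, seq_len: int = 4096,
                            dtype_bytes: int = 2) -> int:
    # K and V are each read once per generated token (hence the factor of 2).
    # Shape parameters are llama-3-70B-like assumptions (GQA, BF16).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

def kv_read_latency_ms(kv_bytes: int, bw_tbps: float = 1.4) -> float:
    # Lower bound on decode latency from KV traffic alone, at the measured
    # 1.4 TB/s KV-read bandwidth (not the 3.35 TB/s datasheet peak).
    return kv_bytes / (bw_tbps * 1e12) * 1e3

b = kv_read_bytes_per_token()   # ~1.34 GB of KV read per token at 4K context
print(kv_read_latency_ms(b))    # ~0.96 ms per token from KV reads alone
```

At the theoretical 3.35 TB/s the same read would take ~0.4 ms, which is exactly why using the measured ceiling matters for latency budgets.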
Commands Used
Collect kernel-level Nsight Compute profile
```shell
ncu \
  --target-processes all \
  --replay-mode kernel \
  --metrics \
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,\
lts__t_bytes.sum,\
dram__bytes_read.sum,\
dram__bytes_write.sum,\
sm__throughput.avg.pct_of_peak_sustained_elapsed \
  --kernel-name-base demangled \
  --kernel-name regex:flash_fwd_kernel \
  python run_inference.py \
    --model llama3-70b \
    --batch_size 32 \
  2>&1 | tee ncu_profile.txt
```
Parse bandwidth from ncu output
```python
import re

def parse_ncu_bandwidth(profile_path: str) -> list[float]:
    """Return per-kernel DRAM read totals, in bytes, from an ncu text report."""
    with open(profile_path) as f:
        content = f.read()
    # ncu's detail page prints the unit column before the value, e.g.
    #   dram__bytes_read.sum    Gbyte    1.05
    # Values may contain thousands separators, hence the comma in the pattern.
    pattern = r"dram__bytes_read\.sum\s+(\w+)\s+([\d,]+\.?\d*)"
    results = []
    for unit, val in re.findall(pattern, content):
        multiplier = {"Gbyte": 1e9, "Mbyte": 1e6, "Kbyte": 1e3, "byte": 1}.get(unit, 1)
        results.append(float(val.replace(",", "")) * multiplier)
    return results
```
Roofline analysis with PyNVML
```python
import time

import pynvml
import torch

HBM3_PEAK_TBPS = 3.35  # H100 SXM5 datasheet peak

def measure_sustained_bandwidth(func, *args, warmup=5, iters=20):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    # Warmup
    for _ in range(warmup):
        func(*args)
    torch.cuda.synchronize()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        func(*args)
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        # NVML exposes memory-controller utilization as a percentage over its
        # last sampling interval, not absolute bytes moved, so we estimate
        # bandwidth as a fraction of the datasheet peak.
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        samples.append({
            "elapsed_ms": (t1 - t0) * 1000,
            "mem_util_pct": util.memory,
            "est_bw_tbps": util.memory / 100 * HBM3_PEAK_TBPS,
        })
    return samples
```
Check L2 hit rate (critical for KV cache)
```shell
ncu \
  --metrics lts__t_sector_hit_rate.pct,lts__t_requests_srcunit_tex.sum \
  --kernel-name-base demangled \
  --kernel-name regex:fmha_v2_flash_attention \
  python run_inference.py
```
Lessons Learned
- L2 hit rate is the most important metric for attention kernels — we found a 22% L2 hit rate on KV access vs. 71% for weight loads. Improving the KV layout to match the access pattern (transposing K heads) boosted the L2 hit rate to 48% and improved attention throughput by 31%.
- The roofline model must use measured, not theoretical, bandwidth — if you use 3.35 TB/s as your memory-bandwidth ceiling, all your kernels will look compute-bound. Use your actual sustained bandwidth (~2.6 TB/s for coalesced access) as the ceiling.
- Paged KV cache has a bandwidth cost — page table lookups add irregular access patterns. Profile your paged vs. contiguous KV cache access bandwidth separately; expect 15–25% lower effective bandwidth from paging.
- ECC costs are real on SXM — disabling ECC (requires a GPU reset) gave us a 5.8% bandwidth increase, measurable in production. Worth evaluating for dedicated inference hardware where data integrity is handled at the application layer.

```shell
# Disable ECC (requires sudo; takes effect after a GPU reset)
sudo nvidia-smi --ecc-config=0
sudo nvidia-smi --gpu-reset
```

- Profiling overhead changes kernel behavior — Nsight Compute's kernel replay mode re-runs kernels, and KV cache state differs between replays. Use `--replay-mode application` for accurate memory access patterns, at the cost of longer collection time.
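The K-head transpose in the first lesson can be made concrete with row-major stride arithmetic. The shape values below (64 heads, 4096 sequence positions, head_dim 128, BF16) are illustrative assumptions, not the profiled model's exact layout:

```python
def strides_bytes(shape: tuple[int, ...], itemsize: int = 2) -> tuple[int, ...]:
    # Row-major (C-order) strides in bytes for a dense tensor.
    s, strides = itemsize, []
    for dim in reversed(shape):
        strides.append(s)
        s *= dim
    return tuple(reversed(strides))

heads, seq, dim = 64, 4096, 128  # assumed illustrative shape, BF16 (2 bytes)

# [heads, seq, dim]: for a fixed head, consecutive seq positions are
# dim * 2 = 256 bytes apart: a dense, cache-line-friendly per-head scan.
print(strides_bytes((heads, seq, dim)))   # (1048576, 256, 2)

# [seq, heads, dim]: the same per-head scan now jumps heads * dim * 2 = 16384
# bytes per step, touching a new cache region each time and wasting L2 capacity.
print(strides_bytes((seq, heads, dim)))   # (16384, 256, 2)
```

Attention reads K/V one head at a time, so putting the head dimension outermost keeps each head's sequence contiguous, which is consistent with the L2 hit rate improvement reported above.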
Next Steps
- Profile flash attention v3 vs v2 memory access patterns on H100
- Investigate SW-prefetch strategies for paged KV cache blocks
- Implement custom CUDA memcpy kernel with software pipelining to get >90% HBM efficiency for sequential access patterns