Continues from Part 1 — In Part 1, we confirmed the GPU fires, ran a matrix multiply on the NVIDIA GB10, and attached Nsight Systems from the host. Almost every profiler section said
SKIPPED: does not contain CUDA data. In Part 2, we fix that — and read the name of the exact CUDA kernel running on our silicon.
Hardware: NVIDIA GB10 (DGX Spark)
Tools: nsys inside container · CUTLASS · cuBLAS
Part: 2 of Series
Where We Left Off
In Part 1 we ran a 4096×4096 matrix multiply, timed it at ~189ms, and attached Nsight Systems to produce a .nsys-rep profile. Then nsys stats showed almost every section as SKIPPED: does not contain CUDA data.
The reason: nsys was running on the host, profiling a Docker container as a black box. OS-level events were visible. But CUDA kernel execution, memory transfers, and the GPU timeline were inside the container's namespace — invisible to a profiler sitting outside it.
Part 2 fixes that. We move nsys inside the container, get real CUDA data, and then run progressively more complex workloads — watching the GPU's kernel vocabulary grow from one kernel name to eight.
Discovery 1 — The Warmup Effect: 189ms Was a Lie
The first new script runs the same 4096×4096 matmul with 5 warmup iterations, then 10 measured runs:
Warmup: 5 runs (not timed)
Run 1: 0.003774 seconds
Run 2: 0.004333 seconds
Average: 0.003692 seconds
Best: 0.003237 seconds
| Measurement | Time | What it includes |
|---|---|---|
| Part 1 (no warmup) | 189ms | CUDA init + JIT compilation + compute |
| Part 2 (after warmup) | 3.7ms | Pure compute only |
That is a 51× difference on the same hardware, same matrix size, same operation.
Why the first run is always slow: CUDA lazy initialization
CUDA defers expensive setup until the first GPU operation is requested. When your code first hits torch.matmul(), four things happen that never happen again:
- CUDA context creation — The driver creates a private namespace on the GPU for your process (allocations, compiled kernels, execution streams). This takes 50–150ms on first call.
- cuBLAS library initialization — cuBLAS loads its internal heuristics tables, algorithm selection databases, and workspace buffers. On Blackwell this includes loading the CUTLASS Tensor Core GEMM library.
- JIT kernel compilation — CUDA ships kernels in PTX (pseudo-assembly). The actual machine code for your specific GPU is compiled JIT at first use. cuBLAS also runs autotuning to select the best kernel variant for your matrix shape.
- GPU memory warming — First accesses to a large memory region incur TLB misses and page-table walks. After warmup, the access pattern is cached and subsequent runs see near-peak memory bandwidth.
The benchmarking rule: Every GPU benchmark you read assumes warmup has happened. Always run at least 3–5 warmup iterations before recording numbers, and report the mean of stable runs.
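A minimal sketch of that warmup-then-measure pattern (shapes match the script described above; the actual script's details may differ):

```python
import time
import torch

size = 4096
a = torch.randn(size, size, device="cuda")
b = torch.randn(size, size, device="cuda")

for _ in range(5):                      # warmup: context creation, cuBLAS init, JIT, autotuning
    torch.matmul(a, b)
torch.cuda.synchronize()

times = []
for _ in range(10):                     # measured runs
    torch.cuda.synchronize()
    start = time.perf_counter()
    torch.matmul(a, b)
    torch.cuda.synchronize()            # wait for the GPU before stopping the clock
    times.append(time.perf_counter() - start)

print(f"Average: {sum(times)/len(times):.6f} s   Best: {min(times):.6f} s")
```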
We scaled to 8192×8192 to confirm steady-state timing scales with compute:
8192×8192 Average: 0.031139 seconds (~31ms)
Scale vs 4096²: ~8.4× (expected ~8×) ✓
Total work: 2·8192³ ≈ 1.1 TFLOP → effective throughput ≈ 35 TFLOPS
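Turning the measured time into throughput is plain arithmetic (a quick check using the numbers above):

```python
# 2 FLOPs (one multiply + one add) per inner-product term of an 8192^3 GEMM
flops = 2 * 8192**3            # ≈ 1.1e12 FLOPs of work
seconds = 0.031139             # measured steady-state average
print(flops / seconds / 1e12)  # ≈ 35 TFLOPS effective
```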
Discovery 2 — The Version Wall: nsys Can't Read Its Own Report
After getting a profile inside the container, we tried to analyze it from the host:
Exportation error: Report was created in Nsight Systems version
(2025.5.1.121), newer than your current version (2025.3.2.474).
Please update your Nsight Systems to the latest version.
ERROR: Database file gpu_matmul_8192_trace.sqlite does not exist.
ERROR: Database file gpu_matmul_8192_trace.sqlite does not exist.
The host has nsys 2025.3.2. The NGC container ships nsys 2025.5.1. The .nsys-rep binary format changed between versions — the older tool cannot export the newer format. The SQLite file never gets created, so every downstream report script fails.
The fix: Profile and analyze from inside the same container. The report file lives on the host via bind mount, but all nsys operations go through the container's binary.
# Enter container interactively
docker run --rm -it \
--gpus all --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v /home/$USER/projects:/workspace \
-w /workspace/ai-zero-to-gpu-lab \
nvcr.io/nvidia/pytorch:25.09-py3
# Inside container — same nsys version for both commands
nsys profile -t cuda,nvtx,osrt \
-o profiles/gpu_matmul_8192_trace \
--force-overwrite=true \
python scripts/gpu_matmul_warmup.py
nsys stats profiles/gpu_matmul_8192_trace.nsys-rep # works ✓
Step 3 — Inside the Container: Real CUDA API Data
With nsys running inside the container, the CUDA sections no longer say SKIPPED. Here is the CUDA API summary for the 8192×8192 matmul profile we just captured:
** CUDA API Summary (cuda_api_sum): **
Time (%) Total Time (ns) Num Calls Avg (ns) Name
-------- --------------- --------- ---------- ----
74.8 343,323,216 5 68,664,643 cuLibraryLoadData
13.8 63,303,744 2 31,651,872 cudaDeviceSynchronize
10.2 46,805,552 12 3,900,463 cudaMalloc
1.0 4,479,776 12 373,315 cuKernelGetFunction
0.2 882,304 4 220,576 cudaLaunchKernel
Reading each row:
- cuLibraryLoadData (74.8%, 5 calls, 68ms avg) — JIT compilation and kernel module loading. One-time cost. This is the warmup effect, visible at the CUDA API level. Five separate kernel modules loaded.
- cudaDeviceSynchronize (13.8%, 2 calls, 31ms avg) — Our torch.cuda.synchronize() calls. 31ms = actual GPU execution time for the matrix multiply. The number we benchmarked.
- cudaMalloc (10.2%, 12 calls, 3.9ms avg) — GPU memory allocation. 12 allocations for 3 matrices means cuBLAS also allocates internal workspace buffers for the GEMM algorithm.
- cudaLaunchKernel (0.2%, 4 calls, 220μs avg) — Kernel submission. Only 220 microseconds — confirming the CPU just submits the job; it does not do the work.
The implication: 74% of CUDA API time is one-time initialization. The actual compute submission (cudaLaunchKernel) is 0.2%. This is exactly why warmup skews cold measurements so dramatically.
Step 4 — The Kernel Has a Name
The GPU Kernel Summary shows what actually ran on the Streaming Multiprocessors:
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum): **
Time(%) Total(ns) Inst Avg(ns) Name
------- ---------- ---- -------- ----
97.3 72,620,672 20 3,631,033 void cutlass::Kernel2<cutlass_80_tensorop_s1688gemm_128x128_32x3_nn_align4>(T1::Params)
2.7 2,013,184 6 335,530 void at::native::distribution_elementwise_grid_stride_kernel<...>
There it is. The kernel has a name. Let's decode every token:
Kernel Name Anatomy: cutlass_80_tensorop_s1688gemm_128x128_32x3_nn_align4
| Token | Meaning |
|---|---|
| cutlass | From NVIDIA's CUTLASS library — CUDA Templates for Linear Algebra. cuBLAS uses CUTLASS internally for Tensor Core GEMMs. |
| 80 | Compute capability 8.0 (Ampere). The GB10 is SM 10.x (Blackwell) — this is a compatibility kernel. We investigate this in Part 3. |
| tensorop | Tensor Cores active. Non-Tensor Core kernels say simt here. This confirms hardware-accelerated matrix arithmetic. |
| s1688 | s = float32 (single precision). 16×8×8 = Tensor Core tile size. The hardware computes a 16×8×8 matmul per Tensor Core operation per warp. |
| gemm_128x128 | Thread block tile: each CUDA thread block processes a 128×128 output tile. |
| 32x3 | K-tile depth = 32, pipeline stages = 3. Triple-buffering in shared memory for maximum overlap between loads and compute. |
| nn | Both matrices non-transposed (row-major). When K is transposed (QKᵀ), this becomes tn. |
| align4 | 4-element (16-byte) memory alignment — enables vectorized 128-bit loads. PyTorch's allocator guarantees this. |
97.3% of GPU time is this one kernel. The remaining 2.7% is torch.randn() initializing our matrices.
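If you want to cross-check kernel names without leaving Python, torch.profiler reports the same GPU kernel names (a sketch; nsys remains the primary tool in this series):

```python
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
torch.matmul(a, b); torch.cuda.synchronize()        # warmup so init doesn't pollute the trace

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch.matmul(a, b)
    torch.cuda.synchronize()

# The cutlass_* GEMM kernel should appear at the top of this table
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```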
Step 5 — Growing the Workload: Watching New Kernels Appear
We ran a series of progressively more complex scripts, adding one operation type at a time:
| Workload | Total Time | New Kernels |
|---|---|---|
| matmul only (5 layers) | 18.5ms | cutlass_128x128_32x3_nn, distribution_elementwise |
| matmul + ReLU | 20.8ms | vectorized_elementwise_kernel (clamp_scalar) |
| linear + residual add | 43.8ms | vectorized_elementwise_kernel (CUDAFunctor_add) |
| QKV projections | 20.9ms | same GEMM, larger shapes |
| QKV + attention scores (QKᵀ) | 24.0ms | cutlass_256x64_16x4_tn (transposed!) |
| + softmax | 22.8ms | softmax_warp_forward, vectorized_elementwise (BUnaryFunctor) |
| Full transformer block | 107.7ms | cutlass_128x256_16x3_nn (FFN expand) |
The ReLU kernel
vectorized_elementwise_kernel<launch_clamp_scalar> — ReLU is clamp(x, 0, ∞). Vectorized = 128-bit loads (4 floats per thread per transaction). Runs in ~580μs — small relative to GEMM but visible.
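A two-line illustration of why a ReLU shows up as a clamp kernel — the two calls compute the same thing, and the profile above shows the ReLU path landing in the elementwise clamp kernel (that the dispatch is identical on every PyTorch build is an assumption):

```python
import torch

x = torch.randn(1024, device="cuda")
# ReLU is literally clamp(x, min=0); both produce the same result
print(torch.allclose(torch.relu(x), torch.clamp(x, min=0)))  # True
```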
The transposed GEMM for Q×Kᵀ
When computing attention scores, K must be transposed. A completely different CUTLASS kernel appears:
cutlass_80_tensorop_s1688gemm_256x64_16x4_tn_align4
256x64 instead of 128x128 (wider, shallower tile), tn instead of nn. cuBLAS automatically selects a different tile shape when the memory access pattern changes due to transposition. The optimal tile geometry depends on the matrix layout.
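A minimal sketch of the call that triggers the tn variant (shapes match the workload table; single-head for simplicity):

```python
import torch

batch, seq, hidden = 16, 512, 4096
q = torch.randn(batch, seq, hidden, device="cuda")
k = torch.randn(batch, seq, hidden, device="cuda")

# .transpose(-2, -1) flips K's layout; cuBLAS sees a transposed operand
# and selects the tn CUTLASS tile instead of nn.
scores = torch.matmul(q, k.transpose(-2, -1))   # -> (16, 512, 512)
torch.cuda.synchronize()
```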
Softmax: warp shuffles
softmax_warp_forward uses warp shuffle instructions — hardware that lets the 32 threads in a warp exchange register values directly, without going through shared memory. Each warp reduces one row of 512 attention scores with a shuffle-based tree reduction; the (int)9 in the kernel template is log₂512 = 9, the parameter that sizes the reduction to the row length. Runs in ~105μs. Negligible at seq_len=512.
Step 6 — The Full Transformer Block
Full profile: batch=16, seq=512, hidden=4096, FFN=16384.
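Roughly what the profiled workload does — a minimal sketch assuming a single-head block with these shapes (the actual script may differ in detail, but these shapes reproduce the kernel mix in the summary below):

```python
import torch
import torch.nn.functional as F

batch, seq, hidden, ffn = 16, 512, 4096, 16384
device = "cuda"

x   = torch.randn(batch, seq, hidden, device=device)
w_q = torch.randn(hidden, hidden, device=device)
w_k = torch.randn(hidden, hidden, device=device)
w_v = torch.randn(hidden, hidden, device=device)
w_o = torch.randn(hidden, hidden, device=device)
w_1 = torch.randn(hidden, ffn, device=device)      # FFN expand
w_2 = torch.randn(ffn, hidden, device=device)      # FFN contract

def block(x):
    # QKV + output projections -> 128x128 nn GEMMs
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Attention scores Q·Kᵀ -> 256x64 tn GEMM, then scale + softmax
    scores = (q @ k.transpose(-2, -1)) / (hidden ** 0.5)
    attn = F.softmax(scores, dim=-1)                # softmax_warp_forward
    x = x + (attn @ v) @ w_o                        # residual add
    # FFN: 4x expand, ReLU, contract -> 128x256 nn GEMMs dominate
    h = F.relu(x @ w_1)                             # clamp kernel
    return x + (h @ w_2)                            # residual add

for _ in range(4):      # instance counts below are consistent with ~4 passes over the block
    y = block(x)
torch.cuda.synchronize()
```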
** CUDA GPU Kernel Summary: Full Transformer Block **
Time(%) Total(ns) Inst Avg(ns) Name
------- ---------- ---- ---------- ----
58.2 242,540,960 8 30,317,620 cutlass_80_tensorop_s1688gemm_128x256_16x3_nn ← FFN expand/contract
30.3 126,089,888 20 6,304,494 cutlass_80_tensorop_s1688gemm_128x128_32x3_nn ← QKV + attn output
4.8 20,137,248 4 5,034,312 vectorized_elementwise_kernel (clamp → ReLU)
3.6 14,844,320 8 1,855,540 vectorized_elementwise_kernel (CUDAFunctor_add → residual)
1.7 6,991,552 4 1,747,888 cutlass_80_tensorop_s1688gemm_256x64_16x4_tn ← attention scores
1.2 4,910,656 7 701,522 distribution_elementwise (randn)
0.1 473,248 4 118,312 vectorized_elementwise (BUnaryFunctor → scale)
0.1 419,296 4 104,824 softmax_warp_forward
Reading this as a transformer
58.2% — FFN expansion (128x256 GEMM)
The FFN expands from hidden=4096 to FFN=16384, then contracts back. This 4× expansion makes it the single most expensive operation. In real transformer model profiles, FFN typically accounts for 55–65% of compute time. Here it's 58.2%. Exactly as expected.
30.3% — QKV projections + attention output (128x128 GEMM)
20 instances of the square GEMM: 3 QKV matrices × batch groups + attention output projections. Uses the square 128×128 tile (optimized for square matrices) rather than the tall 128×256 tile.
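The 58% / 30% split also matches the FLOP arithmetic — a quick check, assuming four square projection GEMMs per pass (Q, K, V, output) and ignoring the small attention·V batched GEMM:

```python
M = 16 * 512                               # batch*seq rows fed to each projection GEMM
ffn_flops = 2 * (2 * M * 4096 * 16384)     # FFN expand + contract
qkv_flops = 4 * (2 * M * 4096 * 4096)      # Q, K, V, output projection
print(ffn_flops / qkv_flops)               # 2.0  vs  58.2/30.3 ≈ 1.9 observed
```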
4.8% — ReLU after FFN expand
~5ms per invocation (4 invocations, ~20ms total) at FFN scale — the 16×512×16384 activation is ≈134M floats per pass. Memory-bandwidth bound: read and write the full activation. Scales linearly with activation size.
3.6% — Residual additions
a[i] + b[i] for every element. Completely memory-bandwidth bound — reads two tensors, writes one, one addition per element. No way to make this faster without changing the operation.
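The bandwidth bound is visible straight from the table — a back-of-envelope check (the ~273 GB/s peak is GB10's commonly quoted LPDDR5X bandwidth, an assumption here):

```python
elements = 16 * 512 * 4096            # one residual tensor (batch x seq x hidden)
bytes_moved = 3 * elements * 4        # read a, read b, write out (float32)
avg_kernel_s = 1_855_540 * 1e-9       # average residual-add kernel time from the table
print(bytes_moved / avg_kernel_s / 1e9)   # ≈ 217 GB/s, close to the ~273 GB/s peak
```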
1.7% — Attention scores (Q×Kᵀ)
The transposed GEMM is relatively cheap at seq_len=512. At seq_len=4096, the attention score matrix grows 64× and this term would dominate. This is why attention cost is quadratic in sequence length.
0.1% — Softmax
~105μs. Essentially free at 512 tokens. Not free at 32K tokens.
The key insight
90.2% of this transformer block is three CUTLASS GEMM kernel types. If you want to make a transformer faster, you are optimizing GEMMs. This is why FlashAttention, quantization (smaller GEMMs), and speculative decoding (different GEMM shapes) are the dominant optimization strategies in production AI inference. The profiler shows exactly why.
Sidebar: Why does (ai-zero-to-gpu-lab) appear in the prompt?
This prefix is a uv virtual environment indicator — the host shell has a Python virtual environment named ai-zero-to-gpu-lab activated. The Docker container uses its own Python (the NGC image's system interpreter) and does not inherit the host's virtual environment; the lingering prompt prefix is purely cosmetic. Normal behavior, no action needed.
What Part 2 Established
| Finding | Detail |
|---|---|
| 189ms was initialization, not compute | Steady-state 4096² GEMM = ~3.7ms. The 51× difference is CUDA context creation + cuBLAS init + JIT compilation. |
| nsys versions must match | Profiles from nsys 2025.5.1 (container) can't be analyzed by nsys 2025.3.2 (host). Profile and analyze inside the same container. |
| The kernel name encodes everything | cutlass_80_tensorop_s1688gemm_128x128_32x3_nn_align4 = CUTLASS + Tensor Cores + SM 8.0 target + float32 + 16×8×8 tile + 128×128 block + 3-stage pipeline + non-transposed + 16-byte aligned. |
| A transformer block is ~90% GEMM | Three CUTLASS GEMM variants account for 90.2% of GPU time. Softmax = 0.1%. ReLU = 4.8%. Performance is determined by GEMM efficiency. |
| cuBLAS selects kernels by shape | Square matrices → 128×128 tile. Tall matrices (FFN) → 128×256 tile. Transposed input → tn variant. Automatic, shape-dependent, heuristic-driven. |
Coming in Part 3: Inside the Kernel
Part 2 gave us kernel names and execution times. Part 3 goes inside the kernel using Nsight Compute (ncu) — the kernel-level profiler with hardware counter access.
We'll answer the SM 8.0 question: the GB10 is SM 10.x (Blackwell), but our kernel targets SM 8.0 (Ampere). Is this a compatibility fallback? Is there a native Blackwell-optimized path? How much performance are we leaving on the table?
Then we get into silicon metrics: warp occupancy, arithmetic intensity, L2 cache hit rate, Tensor Core utilization, and the roofline model — the framework that tells you whether a kernel is compute-bound or memory-bandwidth-bound, and exactly how far it is from peak hardware efficiency.
Topics: nsight compute (ncu) · SM 8.0 vs SM 10.x · warp occupancy · arithmetic intensity · tensor core utilization · L2 cache efficiency · roofline model
GPU Internals Series — Part 2 | Hardware: NVIDIA GB10 (DGX Spark) | Tools: nsys · CUTLASS · cuBLAS