A hands-on journey from a single line of PyTorch code to the silicon on an NVIDIA GB10 Spark — and the profiler that watches it all
Hardware: NVIDIA GB10 (DGX Spark) · Stack: PyTorch 2.9, Docker, nsys · Audience: Software / Systems Engineers · Part 1 of the series
The Why
What this series is really about
When you call a language model — or run any AI workload — something deeply mechanical is happening. Millions of numbers are loaded from memory, multiplied together in specific patterns, and sent back. Over and over. Billions of times per second. All of it happening inside a chip you've never directly seen execute.
If you have a background in software or cloud infrastructure, you already understand layers of abstraction. You know what a CPU thread is. You know what memory allocation means. You know what a network socket does at the OS level. But when it comes to GPU compute, most engineers stop at: "the model runs on the GPU, it's fast, done."
This series exists to break that abstraction open. Not by reading white papers — but by writing code, running it on real hardware, and watching what happens at every layer between your Python source file and the physical transistors.
// The core question we are answering
When you write torch.matmul(A, B) in Python — what is the complete sequence of events from that line to electrons moving through silicon? How do you observe it? How do you measure it? And what does that tell you about how AI models actually work?
Why not just run a model?
A model like Llama or GPT has millions of operations, dynamic shapes, attention patterns, KV caches, and sampling logic. If you're new to GPU compute internals, starting there is like learning how a car engine works by buying a Formula 1 car. You'll be drowned in complexity before you see anything useful.
Instead, we're going to use a single matrix multiplication — a 4096×4096 float32 GEMM — as our microscope slide. This one operation is, architecturally, the same thing a transformer layer does hundreds of times per forward pass. Every attention projection, every feed-forward expansion — it's all matrix multiplication. So by understanding what happens when this one operation fires, you understand the fundamental unit of all AI compute.
This is Part 1. We establish that the GPU is real, that our code runs on it, that we can time it correctly, and that we can attach a profiler to watch it. Parts that follow will go deeper — inside the CUDA kernels, inside the Tensor Cores, inside the memory hierarchy.
Hardware
The machine: NVIDIA GB10 Spark (DGX Spark)
The hardware we're running on is an NVIDIA DGX Spark, built around the GB10 Grace Blackwell superchip: a Grace CPU and a Blackwell-architecture GPU on a single package. This is the same generation of silicon that powers the massive GB200 NVLink racks in hyperscale data centers, shrunk into a compact desktop supercomputer form factor.
Two things about this machine are architecturally unusual and worth understanding before you look at any profiler output:
1. Unified memory via NVLink-C2C
On a traditional server, the CPU and GPU are separate chips connected by a PCIe bus. Moving data between them — say, loading model weights from RAM into GPU memory — costs real time and is often a performance bottleneck. You'll see tutorials obsess over "minimize CPU-GPU transfers" for exactly this reason.
The DGX Spark is different. The Grace CPU and the Blackwell GPU inside the GB10 are connected by NVLink-C2C, a high-bandwidth, low-latency die-to-die interconnect, and they share the same physical memory pool. There is no PCIe hop between them. This changes the performance profile significantly: memory transfers that would be bottlenecks on a traditional GPU setup are nearly free here. You'll see this reflected in profiler data later in the series.
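One way to feel this difference is to time a plain host-to-device copy. The sketch below is illustrative, not a measurement from this series; the size is arbitrary and the reported bandwidth is something to reproduce yourself:

python — rough host-to-device bandwidth probe (illustrative sketch)
import torch, time

x = torch.randn(1 << 26)            # 64M floats = 256 MiB, allocated on the host
torch.cuda.synchronize()

t0 = time.time()
y = x.to("cuda")                    # host-to-device copy
torch.cuda.synchronize()            # wait for the copy to actually finish
dt = time.time() - t0

gib_moved = x.nelement() * 4 / 2**30
print(f"H2D copy: {gib_moved / dt:.1f} GiB/s")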
2. Tensor Cores
Blackwell GPUs contain specialized matrix-multiply accelerator circuits called Tensor Cores. These are not general ALUs. They are dedicated hardware that computes a small matrix multiply-accumulate per clock cycle (a 4×4 tile, in the original first-generation design), work that would take 64 separate multiply-accumulate operations on regular CUDA cores. When PyTorch calls torch.matmul(), it ultimately routes through cuBLAS, which selects a kernel that can exploit Tensor Cores. We'll see exactly which kernel gets chosen when we go deeper.
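One caveat worth adding, since our benchmark uses float32: PyTorch only routes fp32 matmuls through the Tensor Cores' TF32 mode when you allow it. The two knobs below are standard PyTorch APIs; whether the selected cuBLAS kernel actually uses Tensor Cores is exactly the kind of thing the profiler will tell us later:

python — allowing TF32 Tensor Core paths for float32 matmuls
import torch

# Since PyTorch 1.12 the default is "highest": full-precision fp32 paths.
# "high" permits TF32 (reduced-precision inputs, fp32 accumulate) on Tensor Cores.
torch.set_float32_matmul_precision("high")

# The equivalent lower-level switch:
torch.backends.cuda.matmul.allow_tf32 = True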
Architecture
The full execution stack — all the way down
Before we write a single command, we need a map. Here is the complete chain of what happens when you run torch.matmul(A, B) on a GPU. Each box is a real software or hardware layer. Each arrow is a function call or hardware interface crossing a boundary.
Python: torch.matmul(A, B) (you write this)
    ↓
PyTorch C++: ATen operator dispatch system
    ↓
cuBLAS: NVIDIA's GEMM library, selects kernel
    ↓
CUDA Kernel: compiled GPU code running on SMs
    ↓
Tensor Cores: 4×4 matmul / cycle, inside the silicon
In this series, we will trace this entire path. In Part 1, we verify that execution reaches the GPU at all and attach our first profiler. In Part 2, we'll see the CUDA API calls. In Part 3, we'll see individual kernel execution on the Streaming Multiprocessors (SMs). In Part 4, we'll reach the Tensor Core metrics — warp occupancy, arithmetic intensity, memory throughput.
// one critical thing to understand now
Every layer in this stack adds latency — but not computation. Python calls C++. C++ calls CUDA runtime. CUDA runtime queues a kernel. The kernel executes on thousands of parallel cores. Understanding where time is actually spent (and why) is the entire point of profiling. The answer is almost always surprising the first time you see it.
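You don't even need nsys for a first glimpse of this boundary. PyTorch ships a built-in profiler that records both sides of the handoff; here's a minimal sketch (the kernel names it prints are whatever cuBLAS selects on your hardware):

python — a first look at the CPU/GPU boundary with torch.profiler
import torch
from torch.profiler import profile, ProfilerActivity

A = torch.randn(4096, 4096, device="cuda")
B = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.matmul(A, B)
    torch.cuda.synchronize()

# CPU rows (aten::matmul dispatch) and CUDA rows (the launched kernel) in one table
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))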
Step 1 — Environment
Running inside an NGC container — and why this matters
The first thing we do is not install anything on the host system. Instead, we pull NVIDIA's official PyTorch container from the NGC (NVIDIA GPU Cloud) registry and run our code inside it.
This isn't just convenience. NVIDIA builds and validates the entire software stack — CUDA toolkit, cuDNN, cuBLAS, PyTorch, and all their interdependencies — for specific hardware and driver versions. Running inside the NGC container means:
You are running the exact software stack NVIDIA optimized for the GB10. Not a generic pip-installed PyTorch that might have mismatched CUDA versions. Not a version of cuBLAS compiled against an older architecture. The real thing, tuned for this chip.
We pull nvcr.io/nvidia/pytorch:25.09-py3 — NVIDIA Release 25.09, shipping PyTorch 2.9.0 on Python 3.12.
terminal — run the container
docker run --rm -it \
--gpus all \
-v /home/$USER/projects:/workspace \
-w /workspace/ai-zero-to-gpu-lab \
nvcr.io/nvidia/pytorch:25.09-py3 \
python scripts/gpu_test.py
Every flag — what it does and why you can't skip it
| Flag | Explanation |
|---|---|
| --rm | Automatically removes the container filesystem when the process exits. Containers are ephemeral by design — the running container creates a thin writable layer on top of the image. Without --rm, that layer persists as a stopped container. Run this command ten times and you have ten dead containers consuming disk space. This flag keeps things clean. |
| --gpus all | This is the critical one. By default, Docker containers have zero access to the host's GPU hardware. The NVIDIA Container Toolkit (installed on the DGX) intercepts this flag and mounts the GPU device files (/dev/nvidia0, /dev/nvidiactl, etc.) and the NVIDIA driver libraries into the container's namespace. Without it, torch.cuda.is_available() returns False — your code runs on CPU and you have no idea why it feels slow. |
| -v host:container | Bind-mounts a directory from the host filesystem into the container. The container's own filesystem is completely destroyed when the container exits — any files you create inside it are gone. By mounting /home/$USER/projects into /workspace, your scripts and (critically) any profiler output files you write survive after the container exits. This is how we keep our .nsys-rep profiles. |
| -w /workspace/... | Sets the working directory inside the container. Any relative paths in your script (profiles/gpu_trace.nsys-rep, scripts/gpu_test.py) are resolved from here. Without this, the working directory defaults to / and relative paths will fail or write to wrong locations. |
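With the container up, a ten-second sanity check confirms the stack before any benchmarking. A minimal sketch using standard PyTorch introspection calls:

python — environment sanity check inside the container
import torch

print(torch.__version__)             # 2.9.0 in the 25.09 NGC image
print(torch.version.cuda)            # CUDA toolkit this PyTorch was built against
print(torch.cuda.is_available())     # False means --gpus all was missing

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name)                          # "NVIDIA GB10"
    print(props.total_memory // 2**20, "MiB")  # memory visible to CUDA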
Step 2 — The Workload
The test script: a deliberate, minimal GPU workload
The script we're running is intentionally the smallest possible real GPU computation. Not a toy, but also not complex. Here it is in full — and then we will pick apart every line:
scripts/gpu_test.py
import torch, time

# Step 1: Check if a CUDA GPU is actually available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")  # would raise on a CPU-only host if unguarded
# Step 2: Allocate two large matrices DIRECTLY on the GPU
A = torch.randn(4096, 4096, device=device)
B = torch.randn(4096, 4096, device=device)
# Step 3: Synchronize before timing — CRITICAL (explained below)
torch.cuda.synchronize()
t0 = time.time()
C = torch.matmul(A, B)
torch.cuda.synchronize() # wait for kernel to fully finish
t1 = time.time()
print(f"Time: {t1 - t0}")
print(f"Shape: {C.shape}")
Why 4096×4096?
A matrix of this size contains 4096² = ~16.7 million float32 numbers per matrix. Two matrices = ~33 million numbers = ~134MB of GPU memory before the result. More importantly, the multiply itself requires 2 × 4096³ ≈ 137 billion floating-point operations. That is enough computation to genuinely stress the GPU and produce meaningful timing numbers — not sub-millisecond noise. It also closely resembles the shape of weight matrices in mid-size transformer models.
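The arithmetic is worth doing once by hand. This short sketch reproduces the numbers above (the 0.189 s figure anticipates the measured wall time shown below):

python — the back-of-envelope numbers for a 4096×4096 fp32 GEMM
n = 4096
floats_per_matrix = n * n                       # 16,777,216 values
input_bytes = 2 * floats_per_matrix * 4         # two fp32 matrices ≈ 134 MB
flops = 2 * n**3                                # one multiply + one add per term
print(f"{input_bytes / 1e6:.0f} MB inputs, {flops / 1e9:.0f} GFLOP")

measured_s = 0.189                              # wall time measured below
print(f"{flops / measured_s / 1e9:.0f} GFLOPS effective")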
The most important detail: why synchronize() before timing?
This is where most engineers get their first GPU benchmark completely wrong. CUDA executes asynchronously. When you call torch.matmul(A, B), Python does not wait for the GPU to finish. It submits a job to the GPU's command queue and returns immediately. The GPU then executes the kernel on its own timeline, in parallel with whatever the CPU does next.
This means: if you put time.time() immediately after torch.matmul() with no synchronization, you are measuring the time to submit the job — which is microseconds. The actual computation takes hundreds of milliseconds but the CPU never waited for it.
// common benchmarking mistake
Without torch.cuda.synchronize(), timing a GPU operation in Python will tell you the kernel submission latency, not the kernel execution time. You'll measure 0.002ms and think your GPU is impossibly fast. It's not — you just measured the wrong thing. Always bracket your timing with synchronize calls.
torch.cuda.synchronize() is a CPU-side barrier: it blocks the Python thread until all pending GPU work on that device is complete. This gives you the true wall-clock time of the operation.
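For completeness: CUDA events are the other correct way to time GPU work. They timestamp on the GPU's own timeline, so Python overhead never enters the measurement. A small sketch using PyTorch's standard event API:

python — timing with CUDA events instead of wall clock (sketch)
import torch

A = torch.randn(4096, 4096, device="cuda")
B = torch.randn(4096, 4096, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()                     # timestamp enqueued in the GPU stream
C = torch.matmul(A, B)
end.record()
torch.cuda.synchronize()           # both events must have fired on the GPU
print(f"{start.elapsed_time(end):.1f} ms on the GPU timeline")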
// actual output from the GB10
Device: cuda
CUDA is available. If this said "cpu", --gpus all was missing.
GPU: NVIDIA GB10
Confirms we are on Blackwell silicon, not a VM or CPU fallback.
Time: 0.1885852813720703 seconds
~189 ms for 137 billion FLOPs. Real, synchronized, accurate.
Shape: torch.Size([4096, 4096])
Result matrix C is 4096×4096. The computation completed correctly.
~189 ms synchronized wall time · 137 B FLOPs computed · ~727 GFLOPS effective float32 throughput
Step 3 — Profiling
Nsight Systems: the tool that watches the GPU execute
Nsight Systems (nsys) is NVIDIA's system-wide profiler. To understand what it does, first understand the problem it solves.
You know your code ran in 189ms. But you have no idea what happened inside those 189ms. How long did memory allocation take? When exactly did the CUDA kernel start? How many threads were active? Was the GPU waiting on memory? Was the CPU waiting on the GPU? Was there kernel launch overhead eating into your computation time?
nsys answers all of these questions by instrumenting the NVIDIA driver and the OS scheduler. It sits below your Python code and intercepts every CUDA API call, records every kernel launch, timestamps every memory transfer, and saves it all to a structured report file. Think of it as perf stat + strace, but GPU-aware, and with nanosecond-precision timing.
First attempt — and what we got wrong
terminal — first nsys attempt
nsys profile \
-t cuda,nvtx,osrt \
--capture-range=nvtx \
--capture-range-end=stop \
-o profiles/gpu_trace \
--force-overwrite=true \
docker run ... python scripts/gpu_test.py
Breaking down the flags
| Flag | Explanation |
|---|---|
| -t cuda,nvtx,osrt | Trace sources — which categories of events to capture. cuda — CUDA API calls and kernel launches (the main event). nvtx — NVIDIA Tools Extension markers. These are annotations you manually add to your code to label regions. Think of them as "start recording this named section" / "stop recording". osrt — OS Runtime: thread creates, memory maps, signal handlers — the OS-level plumbing beneath CUDA. |
| --capture-range=nvtx | This tells nsys: don't record everything — only record while an NVTX range is active. This is useful for focusing the profile on one specific region of a long-running program. But it has a hard dependency: your code must contain NVTX range markers. If it doesn't, nsys will watch the entire process run and capture nothing. |
| --capture-range-end=stop | Stop collecting once the triggering capture range ends, instead of re-arming for later ranges (stop is one of the mode values this option accepts, alongside repeat and others). Again, this requires the code to emit NVTX ranges. |
| -o profiles/gpu_trace | Output file name (no extension). nsys writes a binary .nsys-rep report file. |
| --force-overwrite=true | Overwrite any existing file with the same name. Without this, nsys refuses to overwrite an existing report and the command fails. Essential during iterative development. |
What are NVTX markers — concretely?
NVTX (NVIDIA Tools Extension) is a C/Python API that lets you emit named timing events from inside your code. You call torch.cuda.nvtx.range_push("my_operation") at the start of a region and torch.cuda.nvtx.range_pop() at the end. The CUDA driver records the exact nanosecond timestamps of those calls and associates them with a label you chose.
python — what NVTX instrumentation looks like
import torch

A = torch.randn(4096, 4096, device="cuda")
B = torch.randn(4096, 4096, device="cuda")

torch.cuda.nvtx.range_push("matmul_4096")  # START of named region
C = torch.matmul(A, B)
torch.cuda.nvtx.range_pop()                # END of named region

# Now nsys can show: "matmul_4096 took 189ms"
# instead of just an anonymous kernel execution
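Recent PyTorch builds also expose a context-manager form of the same annotation, which keeps the push/pop pair balanced even if the body raises; worth checking against your PyTorch version:

python — context-manager form of the same annotation (sketch)
with torch.cuda.nvtx.range("matmul_4096"):
    C = torch.matmul(A, B)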
When --capture-range=nvtx is set and there are no markers in the code, nsys waits for a signal that never comes. The workload runs and completes normally, but nsys captures zero data. That's exactly what happened:
// result of first nsys run
Generated: No reports were generated
The GPU ran fine. nsys saw the process but captured nothing — it was waiting for NVTX markers that never appeared.
// the lesson here
When profiling a new codebase, always run nsys without --capture-range first to confirm it can see CUDA activity at all. Once you've verified baseline data capture, then add NVTX markers and narrow the capture window. We'll do exactly this in Part 2.
Step 4 — Profiling (Take 2)
Dropping the capture range — and getting real data
The second run removes --capture-range entirely and adds three Docker flags that NVIDIA recommends for all PyTorch workloads:
terminal — second nsys run (gpu_trace2)
nsys profile \
-t cuda,nvtx,osrt \
-o profiles/gpu_trace2 \
--force-overwrite=true \
docker run --rm --gpus all \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v /home/$USER/projects:/workspace \
-w /workspace/ai-zero-to-gpu-lab \
nvcr.io/nvidia/pytorch:25.09-py3 \
python scripts/gpu_test.py
Three new Docker flags — deeply explained
These three flags are often copy-pasted without understanding. They each address a specific hardware-level constraint.
| Flag | What it solves and why it exists |
|---|---|
| --ipc=host | IPC = Inter-Process Communication. Linux provides a shared memory facility (shm) that lets processes exchange data through a region of RAM both processes can address directly — no copying, no sockets. PyTorch's DataLoader workers use this heavily to pass batches between processes. CUDA also exposes IPC handles so that separate processes can share GPU memory in multi-process and multi-GPU setups. By default, Docker containers get their own isolated IPC namespace with a very small shared memory limit (64MB). PyTorch's default shared memory allocation often exceeds this, causing cryptic errors or silent fallback to slower copy-based transfers. --ipc=host gives the container access to the host's full IPC namespace, removing this constraint. It's in the NVIDIA official run instructions for exactly this reason. |
| --ulimit memlock=-1 | Pinned (locked) memory is the key to fast GPU transfers. Normally, the OS can swap any page of RAM to disk at any time. The GPU's DMA (Direct Memory Access) engine cannot work with swappable memory — it needs to know the physical address of a buffer and trust it won't move. "Pinning" a memory page tells the OS: never swap this, its physical address is fixed. CUDA uses pinned memory extensively for fast host-to-GPU and GPU-to-host transfers. Linux enforces a limit on how much memory a process can pin (the memlock limit). Docker containers inherit a restrictive default. Setting --ulimit memlock=-1 removes this limit entirely. Without it, CUDA silently falls back to pageable transfers — which go through an intermediate bounce buffer — and your memory bandwidth can drop by 2-10x. |
| --ulimit stack=67108864 | 64MB of stack space per thread. Linux's default per-thread stack is 8MB. CUDA kernels and their host-side launchers can have deeply recursive call patterns and large local variable arrays, especially in complex cuBLAS routines. If the stack overflows, you get a segfault — often with no useful error message, just a crash. 67108864 bytes = exactly 64MB (64 × 1024 × 1024). This is a standard NVIDIA recommendation for containers running PyTorch. You may never hit the limit on simple workloads, but on complex models with many layers and large batch sizes, this headroom matters. |
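To connect the memlock flag back to PyTorch: pinning is exactly what tensor.pin_memory() does, and those allocations are what the ulimit constrains. A hedged sketch of the pattern:

python — pinned host memory from the PyTorch side (sketch)
import torch

x_pageable = torch.randn(1 << 20)        # ordinary, swappable host memory
x_pinned = x_pageable.pin_memory()       # page-locked copy; counts against memlock

# A pinned source lets the DMA engine copy directly, and asynchronously
y = x_pinned.to("cuda", non_blocking=True)
torch.cuda.synchronize()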
// result of second nsys run — success
Status: Collecting data...
GPU run: Device: cuda | GPU: NVIDIA GB10 | Time: 0.18935s
nsys: Generating '/tmp/nsys-report-d768.qdstrm'
Written: profiles/gpu_trace2.nsys-rep ✓
Step 4b — The Report File
What is a .nsys-rep file, exactly?
When nsys finishes running, it produces a file called gpu_trace2.nsys-rep. Let's be precise about what this file is, because it's going to be the center of our work for several parts of this series.
A .nsys-rep file is a compressed binary database of time-stamped events. Think of it like a structured log file, except instead of lines of text, each record has a nanosecond-precision timestamp, a thread ID, an event type (CUDA API call, kernel launch, OS function), and associated metadata.
The file captures events at multiple levels simultaneously:
L1
OS Runtime layer (osrt)
System calls and OS-level function invocations: thread creation (pthread_create), memory mapping (mmap), signal handlers (sigaction). This is the foundation — what the OS does to bootstrap the process before any GPU work happens.
L2
CUDA API layer (cuda)
Every call into the CUDA runtime library: cudaMalloc (allocate GPU memory), cudaMemcpy (transfer data), cudaLaunchKernel (submit work to the GPU), cudaEventRecord (place timing markers). Each of these has a precise start and end time on the CPU side.
L3
GPU kernel execution layer
The actual execution of compiled CUDA code on the GPU's Streaming Multiprocessors (SMs). Each kernel record shows: which kernel ran, when it started on the GPU timeline, how long it ran, and which GPU it ran on. This is the deepest layer nsys captures.
L4
Memory transfer layer
Host-to-device (H2D) and device-to-host (D2H) memory copies, with sizes and durations. On traditional PCIe GPUs this is often where you find the real bottleneck. On the GB10 with NVLink-C2C, you'll see this layer nearly disappear.
When you run nsys stats gpu_trace2.nsys-rep, the tool converts the binary report to a SQLite database and then runs a set of Python report scripts against it, producing human-readable summary tables. The output you see in the terminal is those tables, one per event category.
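Because the report becomes an ordinary SQLite database, you can also query it yourself. The sketch below assumes the .sqlite file produced by nsys stats sits next to the report; table names vary across nsys versions, so it lists them rather than assuming any:

python — inspecting the exported SQLite database (sketch)
import sqlite3

con = sqlite3.connect("profiles/gpu_trace2.sqlite")   # created by `nsys stats`
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)   # OSRT tables here; CUDA/kernel tables only if they were captured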
Step 5 — Reading the Output
Parsing the nsys stats output — line by line
Running nsys stats profiles/gpu_trace2.nsys-rep produces a long output. Most lines say SKIPPED. One table has real data. Let's understand both.
The SKIPPED lines — and why they're not a failure
nsys stats output — skipped sections
SKIPPED: profiles/gpu_trace2.sqlite does not contain CUDA trace data.
SKIPPED: profiles/gpu_trace2.sqlite does not contain CUDA kernel data.
SKIPPED: profiles/gpu_trace2.sqlite does not contain GPU memory data.
SKIPPED: profiles/gpu_trace2.sqlite does not contain NV Tools Extension (NVTX) data.
SKIPPED: profiles/gpu_trace2.sqlite does not contain CUDA Unified Memory CPU page faults data.
This is the most interesting finding in Part 1 — and it's not a bug. Here's why these are skipped:
We ran nsys on the host and wrapped a Docker container as the process being profiled. nsys on the host can see the container's process at the OS level — it can intercept OS system calls and see thread activity. But to capture CUDA kernel data and GPU timeline events, nsys needs to be running inside the same context as the CUDA driver. The CUDA driver is inside the container's namespace, and nsys on the host cannot reach inside it via ptrace (the Linux mechanism profilers use to inspect a process's internals).
This means the profiler saw the container process as a black box. It could see OS-level calls — thread creation, memory mapping — but everything inside the container involving CUDA was invisible to it. The GPU ran fine. The profiler just couldn't see inside the fence.
// what we need to do in part 2
Run nsys inside the container, profiling the Python process directly. When nsys and the CUDA driver are in the same process namespace, it has full access to CUDA API calls, kernel timelines, and memory transfer events. That's where the real profiler data appears — and where this series gets genuinely exciting.
What the OS Runtime Summary actually shows us
nsys stats — OS Runtime Summary (osrt_sum)
** OS Runtime Summary (osrt_sum): **
Time (%) Total Time (ns) Num Calls Avg (ns) Name
-------- --------------- --------- -------- ---------
78.7 467,696 9 51,966.2 pthread_create
20.9 124,352 28 4,441.1 mmap
0.4 2,576 2 1,288.0 sigaction
This is the OS-level bootstrap of the PyTorch runtime — what happens at the OS level in the moments before any matrix is multiplied. Each row tells a story:
78%
pthread_create — 9 threads created, ~52μs each
PyTorch doesn't run on a single thread. During initialization it spins up a pool of OS threads: CUDA worker threads that own the GPU command queue, an event-pool thread for managing CUDA events, and internal dispatcher threads for async operations. The 9 creates represent this worker pool coming to life. The ~52μs per thread is normal Linux thread creation overhead — kernel stack allocation, TLS setup, scheduler registration. These threads will persist for the duration of the process and handle all GPU submissions.
21%
mmap — 28 calls, ~4.4μs each
mmap is how Linux loads libraries and creates large memory regions without copying. PyTorch uses it to load compiled CUDA modules — the .cubin files containing optimized GPU binary code for your specific architecture (GB10/Blackwell in this case) — from disk into the process's address space. It's also used for setting up shared memory regions for IPC. These 28 calls are the runtime loading its toolbox.
0.4%
sigaction — 2 calls, ~1.3μs each
PyTorch installs custom OS signal handlers during initialization. Specifically, it catches SIGSEGV (segmentation fault) and SIGBUS (bus error) — the signals the OS sends when a process accesses invalid memory. The custom handlers let PyTorch print a useful Python stack trace and GPU memory state instead of just dumping core. Without these, a GPU memory fault would produce a cryptic crash with no actionable information.
Total OS-layer overhead: ~595 microseconds. That's the cost of bootstrapping a CUDA PyTorch environment from scratch. Everything after this — memory allocation, kernel launch, GPU execution — is invisible to this profile because nsys was outside the container.
What Part 1 Established
Where we are and what we now know
Part 1 was about foundations. We did not produce a flashy flame graph or a kernel timeline. What we did do was more important: we verified every layer of the stack from Python code to physical GPU execution, and we found the exact boundary where our profiler can't yet see.
✓
The NGC container gives us the correct, optimized software stack
PyTorch 2.9 + cuBLAS + cuDNN built for GB10/Blackwell. Not a generic installation — the real thing.
✓
CUDA is live on the GB10 and the GPU executed our workload
Device confirmed as NVIDIA GB10, CUDA available, 4096×4096 matmul completed in ~189ms = ~727 GFLOPS effective throughput.
✓
GPU timing requires synchronization — we know why
CUDA is asynchronous. Without torch.cuda.synchronize(), all timing measurements are wrong. Always bracket GPU benchmarks.
✓
nsys is installed and we produced a .nsys-rep report file
We understand what the binary file contains, why most sections were SKIPPED (host-vs-container profiling boundary), and what the OS Runtime data tells us about PyTorch's bootstrap sequence.
→
The gap we found tells us exactly what to do next
Running nsys inside the container is the next step — that's where the CUDA kernel timeline, memory transfer data, and GPU execution data become visible. That's Part 2.
▶▶ Coming Next — Part 2
Inside the Container: The CUDA Timeline, Kernel Selection & Real GPU Metrics
Part 1 confirmed the GPU is real and the code runs. Part 2 moves nsys inside the Docker container — where it can reach the CUDA driver directly and capture the full GPU execution timeline.
We'll add NVTX range markers to the Python script so the profiler can label exactly when our matmul starts and ends. We'll then read the CUDA API summary to see every cudaMalloc, cudaMemcpy, and cudaLaunchKernel call with precise timings. We'll look at the GPU kernel summary and see the exact cuBLAS kernel that PyTorch selected for the GB10's Tensor Cores — its name, its duration, how many times it was called.
Then we bring in Nsight Compute (ncu) — the kernel-level deep profiler — to look inside that cuBLAS kernel. Warp occupancy. Memory bandwidth utilization. Arithmetic intensity. L2 cache hit rates. This is where the silicon physics becomes visible — and where you'll start to understand why model performance on a given GPU has the characteristics it does.
Topics: nsys inside container · NVTX annotations · CUDA API timeline · cuBLAS kernel selection · GPU memory transfers · Nsight Compute (ncu) · warp occupancy · arithmetic intensity