Latest Thinking
What Actually Happens When Your Python Calls the GPU? — Part 2 of a Series
A hands-on journey from a single line of PyTorch code to the silicon on an NVIDIA GB10 Spark — tracing a matrix multiply through the full execution stack and profiling it with Nsight Systems.
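The line in question, and the Nsight Systems invocation used to trace it, look roughly like this (shapes and the output file name are illustrative, not the post's exact setup):

```python
import torch

# The one line the post follows down the stack: a half-precision matmul
# dispatched from Python through ATen and cuBLAS to a kernel on the GPU.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

torch.cuda.synchronize()   # let setup kernels finish before the op of interest
c = a @ b                  # asynchronous launch: Python returns immediately
torch.cuda.synchronize()   # block until the matmul kernel actually completes

# Traced with Nsight Systems from the shell, e.g.:
#   nsys profile -o matmul_trace python matmul.py
```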
Evaluating Inflight Batching and Scheduler Policies in TensorRT-LLM Triton Backend
Configuring inflight batching in the TensorRT-LLM Triton backend for Llama 3 8B, and tuning scheduler policy and batching parameters to keep GPU utilization high while balancing latency and throughput under bursty loads.
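For a flavor of the knobs involved: the tensorrtllm_backend template exposes these as config.pbtxt parameters, sketched here as a Python dict of plausible starting values (parameter names follow the backend template; the values are assumptions, not the post's tuned results):

```python
# Inflight-batching knobs from the tensorrtllm_backend config.pbtxt template,
# collected as a dict for readability. Values are illustrative starting points.
inflight_batching_params = {
    "gpt_model_type": "inflight_fused_batching",   # enable inflight batching
    "batch_scheduler_policy": "max_utilization",   # or "guaranteed_no_evict"
    "kv_cache_free_gpu_mem_fraction": "0.9",       # assumption: KV-cache headroom
}
```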
AWQ vs. FP8 Quantization: Balancing Accuracy and Throughput for Llama 3 70B
Comparing AWQ INT4 and FP8 quantization methods on the Llama 3 70B model across accuracy benchmarks (MMLU, HumanEval, GSM8K, MT-Bench) and inference throughput measurements on an H100 GPU.
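For context, both quantization paths can be produced with NVIDIA ModelOpt presets along these lines (a minimal sketch; the checkpoint name and the empty calibration loop are stand-ins for the post's actual recipe):

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B")

def calib_loop(m):
    # Assumption: run a few hundred calibration samples through the model here.
    ...

# One path per run -- mtq.quantize modifies the model in place.
quantized = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=calib_loop)   # AWQ INT4
# ...or: mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calib_loop)   # FP8
```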
Running Qwen 2.5 Coder Locally with OpenCode: A Private AI Coding Assistant
A complete setup guide for running Qwen 2.5 Coder locally via Ollama and connecting it to OpenCode, creating a private, offline-capable AI coding assistant in your terminal.
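Before wiring the model into OpenCode, a quick smoke test confirms the local Ollama server is answering (assuming the default port and the qwen2.5-coder tag):

```python
import requests

# Smoke test against a local Ollama server (default: http://localhost:11434).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5-coder",   # pulled via `ollama pull qwen2.5-coder`
        "prompt": "Write a Python function that reverses a string.",
        "stream": False,            # return one JSON object, not a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```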
Speculative Decoding Latency Optimization on TensorRT-LLM with Llama 3 Models
Tuning speculative decoding in TensorRT-LLM, with Llama 3 8B as the draft model and Llama 3 70B as the target, to reduce decode latency.
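Stripped of engine details, the draft-and-verify loop at the heart of the technique looks roughly like this (a greedy simplification; TensorRT-LLM verifies all draft tokens in a single batched target pass rather than per position):

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy draft-and-verify step. `draft_next`/`target_next` are
    assumed callables mapping a token list to the next greedy token id."""
    # The small draft model proposes k tokens autoregressively (cheap).
    tokens = list(prefix)
    for _ in range(k):
        tokens.append(draft_next(tokens))

    # The large target model checks each proposal; drafts are kept while it
    # agrees. (Written per position for clarity; a real engine verifies all
    # k positions in one batched forward pass.)
    out = list(prefix)
    for i in range(len(prefix), len(tokens)):
        t = target_next(tokens[:i])
        out.append(t)           # the target's token is always valid output
        if t != tokens[i]:      # first disagreement ends the step
            break
    return out
```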
CUDA Graph Capture for Low-Latency LLM Decode on RTX 4090
Using CUDA graph capture in TensorRT-LLM to eliminate kernel launch overhead during the decode phase of Llama 3 8B on RTX 4090 — benchmarks, trade-offs, and lessons learned.
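For intuition, here is the PyTorch-level version of the same trick (TensorRT-LLM does the equivalent internally; the buffer shapes are placeholders for a decode step's static tensors):

```python
import torch

static_in = torch.randn(1, 4096, device="cuda")
weight = torch.randn(4096, 4096, device="cuda")

# Warm up on a side stream so capture sees a quiet default stream.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = static_in @ weight
torch.cuda.current_stream().wait_stream(s)

# Capture one decode-like step: kernel launches are recorded, not executed.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = static_in @ weight

# Each decode step: refresh the static input buffer in place, then replay
# the whole captured graph with a single launch.
static_in.copy_(torch.randn(1, 4096, device="cuda"))
g.replay()
torch.cuda.synchronize()
```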
End-to-End TensorRT-LLM FP8 Inference Optimization on NVIDIA GB10 Blackwell
A phased optimization of TensorRT-LLM FP8 inference on NVIDIA GB10 Blackwell, taking throughput from the baseline to 3K+ tokens/sec.
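As a feel for the baseline, TensorRT-LLM's high-level LLM API can request an FP8 engine in a few lines (the API surface shifts between releases, so treat the names and model tag below as assumptions):

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# Build and run an FP8-quantized engine via the high-level API.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",          # placeholder checkpoint
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),   # FP8 quantization
)
for out in llm.generate(["The GB10 is"], SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```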