All Posts
Optimizing KV Cache Management in TensorRT-LLM for H100 GPUs
Tuned KV cache allocation and paging strategy in TensorRT-LLM to maximize concurrent sequences on H100 GPUs for Llama 3 70B at long context, measuring memory pressure, calculating theoretical limits, tuning memory fraction settings, comparing fixed vs paged strategies, and implementing prefix reuse.
Running Qwen 2.5 Coder Locally with OpenCode: A Private Offline AI Coding Assistant
A complete setup guide for running Qwen 2.5 Coder locally via Ollama and connecting it to OpenCode (an open-source AI coding CLI) to create a private, offline-capable AI pair programmer in your terminal.
Multi-Model Routing with Triton Inference Server for Efficient LLM Serving
Deployed Triton Inference Server with two models, routing simple queries to Llama 3 8B and complex queries to Llama 3 70B.
TensorRT-LLM Engine Build Flags Deep Dive for Llama 3 70B on H100 SXM5
A detailed analysis of `trtllm-build` flags for optimizing the inference engine configuration for Llama 3 70B on H100 SXM5.
Tensor Parallelism vs Pipeline Parallelism on 4x H100 GPUs for Llama 3 405B Inference
Configured tensor parallelism on 4x H100 NVLink for Llama 3 405B inference, comparing TP=2 vs TP=4 in terms of throughput, latency, and NVLink bandwidth utilization.
Speculative Decoding Latency Optimization on TensorRT-LLM with Llama 3 Models
Tuning speculative decoding using Llama 3 8B as the draft model and Llama 3 70B as the target to reduce decode latency.
Power Efficiency Tuning of NVIDIA GB10 for Sustained Inference
Measuring and optimizing tokens per watt on DGX Spark GB10 under sustained inference load.
CUDA Graph Capture for Low-Latency LLM Decode on RTX 4090
Using CUDA graph capture in TensorRT-LLM to eliminate kernel launch overhead during the decode phase of Llama 3 8B on RTX 4090.
Evaluating Inflight Batching and Scheduler Policies in TensorRT-LLM Triton Backend
Configured inflight batching in TensorRT-LLM Triton backend for Llama 3 8B model, tuning parameters to optimize GPU utilization and balance latency and throughput under bursty loads.
AWQ vs. FP8 Quantization: Balancing Accuracy and Throughput for Llama 3 70B
Comparing AWQ INT4 and FP8 quantization methods on the Llama 3 70B model across accuracy benchmarks (MMLU, HumanEval, GSM8K, MT-Bench) and inference throughput measurements on an H100 GPU.
End-to-End TensorRT-LLM FP8 Inference Optimization on NVIDIA GB10 Blackwell
A phased end-to-end TensorRT-LLM FP8 inference optimization on NVIDIA GB10 Blackwell, taking throughput from baseline to 3K+ tokens/sec.
Automated Content Publishing Pipeline with Local LLMs
Built an automated content publishing pipeline using local LLMs that processes raw session notes into structured markdown and publishes them to a blog GitHub repository if they meet publishability criteria.
Head of AI Products - Multi-Agent Systems Architect
A job posting seeking a strong background in AI agent architectures, orchestration, model selection, and deployment.
TensorRT Profiling Report
A report summarizing completed TensorRT profiling results.
FP8 Quantization on H100: A Practical Guide
End-to-end walkthrough of enabling FP8 precision in TensorRT-LLM for Llama-3 70B — including calibration, accuracy validation, and production results.
TensorRT-LLM vs vLLM: Production Comparison at Scale
A rigorous head-to-head benchmark of TRT-LLM and vLLM for production LLM serving — covering throughput, latency percentiles, memory efficiency, and operational complexity.
GPU Memory Bandwidth: What Your Profiler Isn't Telling You
A deep dive into H100 memory subsystem characteristics — HBM3 bandwidth ceilings, L2 cache behavior, and how to use Nsight Compute to find actual bottlenecks in transformer inference kernels.