# LLM Inference Optimization: From 300 to 3,310 Tokens/sec with FP8
## Summary
The author achieved a roughly 11x improvement in LLM inference throughput, from 300 to 3,310 tokens/sec, serving the DeepSeek-Coder-33B model on Blackwell-class GPUs without changing hardware. The key optimizations were switching from FP16 to FP8 precision and raising the batch size limit.
## Key Technical Findings
- Decode dominates production LLM inference with long outputs.
- FP8 quantization significantly increased throughput and reduced latency by shifting compute to FP8 tensor cores.
- Proper batching configuration is crucial for scaling under load; increasing `max_batch_size` resolved a bottleneck in request handling.
## Commands Used
- Switched model precision from FP16 to FP8 using TensorRT-LLM.
- Increased `max_batch_size` from 32 to 64.
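The two steps above map onto TensorRT-LLM's standard quantize-then-build flow. This is a hedged sketch, not the author's exact invocations: the directory paths are placeholders, and flag details may differ across TensorRT-LLM versions.

```shell
# Sketch of the FP8 conversion and rebuild; paths are hypothetical.

# 1. Quantize the FP16 checkpoint to FP8 (the quantize.py script ships
#    in the TensorRT-LLM repo under examples/quantization/).
python examples/quantization/quantize.py \
    --model_dir ./deepseek-coder-33b \
    --qformat fp8 \
    --output_dir ./ckpt-fp8

# 2. Rebuild the engine with the larger batch limit.
trtllm-build \
    --checkpoint_dir ./ckpt-fp8 \
    --max_batch_size 64 \
    --output_dir ./engine-fp8
```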
## Lessons Learned
- Profiling is essential for identifying bottlenecks and optimizing LLM inference.
- Prefill optimization has minimal impact on decode-heavy workloads.
- The "knee" in the concurrency-throughput curve marks peak efficiency; beyond it, added concurrency raises latency faster than it raises throughput.
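The knee can be located programmatically from a concurrency sweep. The sketch below uses a simplified Kneedle-style heuristic (the point farthest from the chord joining the first and last measurements); the sweep numbers are illustrative, not the author's measurements.

```python
def find_knee(concurrency, throughput):
    """Return the concurrency level farthest from the straight line
    joining the first and last measurements (simplified Kneedle)."""
    x0, y0 = concurrency[0], throughput[0]
    x1, y1 = concurrency[-1], throughput[-1]
    best_i, best_d = 0, -1.0
    for i, (x, y) in enumerate(zip(concurrency, throughput)):
        # Unnormalized distance from (x, y) to the chord; the ranking
        # is all that matters, so the constant divisor is dropped.
        d = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
        if d > best_d:
            best_i, best_d = i, d
    return concurrency[best_i]

# Hypothetical sweep: throughput saturates past concurrency 32.
levels = [1, 2, 4, 8, 16, 32, 64, 128]
tput = [120, 235, 450, 850, 1600, 2900, 3200, 3310]
print(find_knee(levels, tput))  # → 32
```

Past the knee (32 here), each doubling of concurrency buys only a few percent more throughput while queueing delay keeps growing.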
## Publishability
yes
## Privacy Risk
low
## Tags
- inference
- gpu-optimization
- fp8
- tensorrt-llm
- blackwell
- deepseek
- benchmarking
## Action Items
- Implement FP8 precision for LLM models on compatible hardware.
- Review and adjust batching configurations in production systems.