All Posts

20 posts
ai-infra · TensorRT-LLM · KV Cache

Optimizing KV Cache Management in TensorRT-LLM for H100 GPUs

Tuned KV cache allocation and paging strategy in TensorRT-LLM to maximize concurrent sequences on H100 GPUs for Llama 3 70B at long context, measuring memory pressure, calculating theoretical limits, tuning memory fraction settings, comparing fixed vs paged strategies, and implementing prefix reuse.

Apr 20, 2026 · Read
ai-infra · AI · Local Setup

Running Qwen 2.5 Coder Locally with OpenCode: A Private Offline AI Coding Assistant

A complete setup guide for running Qwen 2.5 Coder locally via Ollama and connecting it to OpenCode (an open-source AI coding CLI) to create a private, offline-capable AI pair programmer in your terminal.

Apr 20, 2026 · Read
ai-infra · Triton Inference Server · Multi-Model Routing

Multi-Model Routing with Triton Inference Server for Efficient LLM Serving

Deployed Triton Inference Server with two models, routing simple queries to Llama 3 8B and complex queries to Llama 3 70B.

Apr 20, 2026 · Read
ai-infra · TensorRT-LLM · Llama 3 70B

TensorRT-LLM Engine Build Flags Deep Dive for Llama 3 70B on H100 SXM5

A detailed analysis of `trtllm-build` flags for tuning the inference engine configuration for Llama 3 70B on H100 SXM5.

Apr 20, 2026 · Read
gpu · Tensor Parallelism · Pipeline Parallelism

Tensor Parallelism vs Pipeline Parallelism on 4x H100 GPUs for Llama 3 405B Inference

Configured tensor parallelism on 4x H100 NVLink for Llama 3 405B inference, comparing TP=2 vs TP=4 in terms of throughput, latency, and NVLink bandwidth utilization.

Apr 20, 2026 · Read
ai-infra · Speculative Decoding · TensorRT-LLM

Speculative Decoding Latency Optimization on TensorRT-LLM with Llama 3 Models

Tuning speculative decoding using Llama 3 8B as the draft model and Llama 3 70B as the target to reduce decode latency.

Apr 20, 2026 · Read
gpu · Power Efficiency · NVIDIA Blackwell GB10

Power Efficiency Tuning of NVIDIA GB10 for Sustained Inference

Measuring and optimizing tokens per watt on DGX Spark GB10 under sustained inference load.

Apr 20, 2026 · Read
ai-infra · CUDA graphs · TensorRT-LLM

CUDA Graph Capture for Low-Latency LLM Decode on RTX 4090

Using CUDA graph capture in TensorRT-LLM to eliminate kernel launch overhead during the decode phase of Llama 3 8B on RTX 4090.

Apr 20, 2026 · Read
ai-infra · TensorRT-LLM · Triton Inference Server

Evaluating Inflight Batching and Scheduler Policies in TensorRT-LLM Triton Backend

Configured inflight batching in the TensorRT-LLM Triton backend for Llama 3 8B, tuning scheduler parameters to maximize GPU utilization and balance latency against throughput under bursty loads.

Apr 20, 2026 · Read
gpu · AWQ · FP8

AWQ vs. FP8 Quantization: Balancing Accuracy and Throughput for Llama 3 70B

Comparing AWQ INT4 and FP8 quantization methods on the Llama 3 70B model across accuracy benchmarks (MMLU, HumanEval, GSM8K, MT-Bench) and inference throughput measurements on an H100 GPU.

Apr 20, 2026 · Read
gpu · TensorRT-LLM · FP8 quantization

End-to-End TensorRT-LLM FP8 Inference Optimization on NVIDIA GB10 Blackwell

An end-to-end walkthrough of TensorRT-LLM FP8 inference optimization on NVIDIA GB10 Blackwell, progressing in phases from baseline to 3K+ tokens/sec.

Apr 20, 2026 · Read
ai-infra · local LLMs · automated publishing

Automated Content Publishing Pipeline with Local LLMs

Built an automated content publishing pipeline using local LLMs that processes raw session notes into structured markdown and publishes them to a blog GitHub repository if they meet publishability criteria.

Apr 20, 2026 · Read
cloud · AI Agent Architecture · Multi-Agent Orchestration

Head of AI Products - Multi-Agent Systems Architect

A job posting for a role requiring a strong background in AI agent architectures, orchestration, model selection, and deployment.

Apr 20, 2026 · Read
ai-infra · TensorRT · Profiling

TensorRT Profiling Report

A profiling report for TensorRT inference workloads.

Apr 20, 2026 · Read
gpu · FP8 · quantization

FP8 Quantization on H100: A Practical Guide

End-to-end walkthrough of enabling FP8 precision in TensorRT-LLM for Llama-3 70B — including calibration, accuracy validation, and production results.

Apr 15, 2026 · Read
ai-infra · TensorRT-LLM · vLLM

TensorRT-LLM vs vLLM: Production Comparison at Scale

A rigorous head-to-head benchmark of TRT-LLM and vLLM for production LLM serving — covering throughput, latency percentiles, memory efficiency, and operational complexity.

Apr 8, 2026 · Read
gpu · CUDA · profiling

GPU Memory Bandwidth: What Your Profiler Isn't Telling You

A deep dive into H100 memory subsystem characteristics — HBM3 bandwidth ceilings, L2 cache behavior, and how to use Nsight Compute to find actual bottlenecks in transformer inference kernels.

Mar 28, 2026 · Read