Latest Thinking

All posts
ai-infra · TensorRT-LLM · KV Cache

Optimizing KV Cache Management in TensorRT-LLM for H100 GPUs

Tuned KV cache allocation and paging strategy in TensorRT-LLM to maximize concurrent long-context sequences for Llama 3 70B on H100 GPUs: measuring memory pressure, calculating theoretical limits, tuning memory-fraction settings, comparing fixed vs. paged strategies, and implementing prefix reuse.

Apr 20, 2026
ai-infra · AI · Local Setup

Running Qwen 2.5 Coder Locally with OpenCode: A Private Offline AI Coding Assistant

A complete setup guide for running Qwen 2.5 Coder locally via Ollama and connecting it to OpenCode (an open-source AI coding CLI) to create a private, offline-capable AI pair programmer in your terminal.

Apr 20, 2026
ai-infra · Triton Inference Server · Multi-Model Routing

Multi-Model Routing with Triton Inference Server for Efficient LLM Serving

Deployed Triton Inference Server with two models, routing simple queries to Llama 3 8B and complex queries to Llama 3 70B.

Apr 20, 2026
ai-infra · TensorRT-LLM · Llama 3 70B

TensorRT-LLM Engine Build Flags Deep Dive for Llama 3 70B on H100 SXM5

A detailed analysis of `trtllm-build` flags for optimizing the inference engine configuration for Llama 3 70B on H100 SXM5.

Apr 20, 2026
gpu · Tensor Parallelism · Pipeline Parallelism

Tensor Parallelism vs Pipeline Parallelism on 4x H100 GPUs for Llama 3 405B Inference

Configured tensor parallelism for Llama 3 405B inference on 4x NVLink-connected H100 GPUs, comparing TP=2 vs. TP=4 in terms of throughput, latency, and NVLink bandwidth utilization.

Apr 20, 2026
ai-infra · Speculative Decoding · TensorRT-LLM

Speculative Decoding Latency Optimization on TensorRT-LLM with Llama 3 Models

Tuned speculative decoding with Llama 3 8B as the draft model and Llama 3 70B as the target to reduce decode latency.

Apr 20, 2026
gpu · Power Efficiency · NVIDIA Blackwell GB10

Power Efficiency Tuning of NVIDIA GB10 for Sustained Inference

Measured and optimized tokens per watt on DGX Spark GB10 under sustained inference load.

Apr 20, 2026