Latest Thinking
Optimizing KV Cache Management in TensorRT-LLM for H100 GPUs
Tuned KV cache allocation and paging strategy in TensorRT-LLM to maximize concurrent sequences on H100 GPUs for Llama 3 70B at long context, measuring memory pressure, calculating theoretical limits, tuning memory fraction settings, comparing fixed vs paged strategies, and implementing prefix reuse.
Running Qwen 2.5 Coder Locally with OpenCode: A Private Offline AI Coding Assistant
A complete setup guide for running Qwen 2.5 Coder locally via Ollama and connecting it to OpenCode (an open-source AI coding CLI) to create a private, offline-capable AI pair programmer in your terminal.
Multi-Model Routing with Triton Inference Server for Efficient LLM Serving
Deployed Triton Inference Server with two models, routing simple queries to Llama 3 8B and complex queries to Llama 3 70B.
TensorRT-LLM Engine Build Flags Deep Dive for Llama 3 70B on H100 SXM5
A detailed analysis of the `trtllm-build` flags used to optimize the inference engine configuration for Llama 3 70B on H100 SXM5.
Tensor Parallelism vs Pipeline Parallelism on 4x H100 GPUs for Llama 3 405B Inference
Configured tensor parallelism on 4x NVLink-connected H100 GPUs for Llama 3 405B inference, comparing TP=2 and TP=4 in terms of throughput, latency, and NVLink bandwidth utilization.
Speculative Decoding Latency Optimization on TensorRT-LLM with Llama 3 Models
Tuned speculative decoding with Llama 3 8B as the draft model and Llama 3 70B as the target to reduce decode latency.
Power Efficiency Tuning of NVIDIA GB10 for Sustained Inference
Measuring and optimizing tokens per watt on DGX Spark GB10 under sustained inference load.