Latest Thinking
Building ML Infrastructure on AWS with Kiro: Part 2 — Multi-Account Landing Zone
Building ML Infrastructure on AWS with Kiro: Part 1 — Network Foundation
Running Qwen 2.5 Coder Locally with OpenCode: A Private Offline AI Coding Assistant
A complete setup guide for running Qwen 2.5 Coder locally via Ollama and connecting it to OpenCode (an open-source AI coding CLI) to create a private, offline-capable AI pair programmer in your terminal.
Speculative Decoding Latency Optimization on TensorRT-LLM with Llama 3 Models
Tuning speculative decoding using Llama 3 8B as the draft model and Llama 3 70B as the target to reduce decode latency.
CUDA Graph Capture for Low-Latency LLM Decode on RTX 4090
Using CUDA graph capture in TensorRT-LLM to eliminate kernel launch overhead during the decode phase of Llama 3 8B on RTX 4090 — benchmarks, trade-offs, and lessons learned.
End-to-End TensorRT-LLM FP8 Inference Optimization on NVIDIA GB10 Blackwell
A phased walkthrough of end-to-end TensorRT-LLM FP8 inference optimization on NVIDIA GB10 Blackwell, taking throughput from baseline to 3K+ tokens/sec.
Automated Content Publishing Pipeline with Local LLMs
An automated content publishing pipeline built with local LLMs that converts raw session notes into structured markdown and publishes posts that meet publishability criteria to a blog's GitHub repository.