Latest Thinking

All posts
cloud

Building ML Infrastructure on AWS with Kiro: Part 2 — Multi-Account Landing Zone

Apr 28, 2026
cloud

Building ML Infrastructure on AWS with Kiro: Part 1 — Network Foundation

Apr 28, 2026
ai-infra · AI · Local Setup

Running Qwen 2.5 Coder Locally with OpenCode: A Private Offline AI Coding Assistant

A complete setup guide for running Qwen 2.5 Coder locally via Ollama and connecting it to OpenCode (an open-source AI coding CLI) to create a private, offline-capable AI pair programmer in your terminal.

ai-infra · Speculative Decoding · TensorRT-LLM

Speculative Decoding Latency Optimization on TensorRT-LLM with Llama 3 Models

Tuning speculative decoding using Llama 3 8B as the draft model and Llama 3 70B as the target to reduce decode latency.

gpu · CUDA · TensorRT-LLM

CUDA Graph Capture for Low-Latency LLM Decode on RTX 4090

Using CUDA graph capture in TensorRT-LLM to eliminate kernel launch overhead during the decode phase of Llama 3 8B on RTX 4090 — benchmarks, trade-offs, and lessons learned.

gpu · TensorRT-LLM · FP8 quantization

End-to-End TensorRT-LLM FP8 Inference Optimization on NVIDIA GB10 Blackwell

An end-to-end walkthrough of TensorRT-LLM FP8 inference optimization on the NVIDIA GB10 Blackwell, taking throughput from baseline to 3K+ tokens/sec through a series of tuning phases.

ai-infra · local LLMs · automated publishing

Automated Content Publishing Pipeline with Local LLMs

An automated content publishing pipeline built on local LLMs that processes raw session notes into structured Markdown and publishes notes that meet publishability criteria to a blog's GitHub repository.
