All Posts

7 posts
ai-infra · AI · Local Setup

Running Qwen 2.5 Coder Locally with OpenCode: A Private Offline AI Coding Assistant

A complete setup guide for running Qwen 2.5 Coder locally via Ollama and connecting it to OpenCode (an open-source AI coding CLI) to create a private, offline-capable AI pair programmer in your terminal.

Read
ai-infra · Speculative Decoding · TensorRT-LLM

Speculative Decoding Latency Optimization on TensorRT-LLM with Llama 3 Models

Tuning speculative decoding using Llama 3 8B as the draft model and Llama 3 70B as the target to reduce decode latency.

Read
gpu · CUDA · TensorRT-LLM

CUDA Graph Capture for Low-Latency LLM Decode on RTX 4090

Using CUDA graph capture in TensorRT-LLM to eliminate kernel launch overhead during the decode phase of Llama 3 8B on RTX 4090 — benchmarks, trade-offs, and lessons learned.

Read
gpu · TensorRT-LLM · FP8 quantization

End-to-End TensorRT-LLM FP8 Inference Optimization on NVIDIA GB10 Blackwell

A phased walkthrough of end-to-end TensorRT-LLM FP8 inference optimization on NVIDIA GB10 Blackwell, taking throughput from baseline to 3K+ tokens/sec.

Read
ml-infra · AWS · Terraform

Building Production ML Infrastructure on AWS with Kiro: Part 2 — Multi-Account Landing Zone

Migrating from a single-account setup to a proper multi-account AWS landing zone with Organizations, Control Tower, Tailscale hybrid networking, centralized security, and cost guardrails — all built spec-first with Kiro.

Read
ml-infra · AWS · Terraform

Building Production ML Infrastructure on AWS with Kiro: Part 1 — Network Foundation

How we used Kiro's spec-driven workflow to go from a rough idea to fully deployed AWS network infrastructure across two environments — VPCs, NAT, CI/CD, and audit logging — in a single session.

Read
ai-infra · local LLMs · automated publishing

Automated Content Publishing Pipeline with Local LLMs

An automated content publishing pipeline that uses local LLMs to process raw session notes into structured markdown and publish them to a blog's GitHub repository when they meet publishability criteria.

Read