Building ML Infrastructure on AWS with Kiro: Part 1 — Network Foundation

Part 1 of a series on building production-grade infrastructure for deploying custom Random Forest and LSTM models on AWS.

The Mission

At Mica Mirai, we're building infrastructure to serve custom machine learning models — a Random Forest for structured predictions and an LSTM for time-series forecasting. Before any model touches a GPU, you need a network that's secure, observable, and ready for what comes next.

This post covers how we used Kiro, an AI-powered IDE, to go from a rough idea to fully deployed AWS network infrastructure across two environments in a single session. No console clicking. No copy-pasting Terraform snippets from blog posts. Just a conversation with an AI that writes production-grade IaC.

What's coming in Part 2: We'll migrate to a multi-account AWS landing zone with Organizations, Control Tower, SCPs, Tailscale hybrid networking to on-prem AI resources, and the full Well-Architected Framework foundation a startup needs before scaling to GPU compute.

Why Start with the Network?

It's tempting to jump straight to Kubernetes and GPUs. But every production ML system sits on top of networking decisions that are painful to change later — CIDR ranges, NAT egress strategy, subnet topology, audit logging. Get these wrong and you're re-architecting under pressure when the models are already in production.

Our requirements:

Two isolated environments — dev for iteration, prod for serving. No shared resources between them.
EKS-ready subnets — tagged for Kubernetes load balancer discovery from day one.
Cost-conscious dev — a NAT Instance instead of managed NAT Gateways saves ~$64/month in a non-production environment.
Security baseline from the start — private subnets with no IGW route, encrypted state and audit buckets, no static IAM credentials anywhere.
Full CI/CD — plan on PRs, auto-apply dev on merge, manual approval gate for prod.

The Kiro Workflow

Spec-Driven Development

Kiro's approach is spec-driven. Instead of jumping into code, we started with a structured conversation that produced three documents:

Requirements — 27 formal requirements covering everything from S3 state bucket encryption to CloudTrail multi-region logging. Each requirement has acceptance criteria written in a SHALL/SHALL NOT format that leaves no room for ambiguity.
Design — A technical design document with architecture diagrams, CIDR allocation tables, component inventories per environment, and key design decisions (why NAT Instance in dev, why per-AZ NAT Gateways in prod, why partial backend configuration).
Tasks — A 20-task implementation plan with sub-tasks, each referencing specific requirements for traceability. Tasks are ordered by dependency: scaffold first, then bootstrap, then network resources, then CI/CD.

This isn't just documentation for documentation's sake. The spec became the contract between us and Kiro. When we said "implement task 9," Kiro knew exactly what EIP resources to create, which naming convention to follow, and which requirements to satisfy.

Three Phases of Infrastructure

The design breaks the work into three phases:

Phase 0 — Bootstrap (manual, once)
  └─ S3 state bucket + DynamoDB lock table + IAM OIDC provider/role

Phase 1 — Network (automated via CI/CD)
  └─ VPC, subnets, IGW, NAT, route tables, flow logs, CloudTrail

Phase 2 — CI/CD Pipeline
  └─ GitHub Actions: plan on PR, apply on merge, approval gate for prod

Phase 0 has to run manually because the remote state backend doesn't exist yet — it's the chicken-and-egg of Terraform. After that, everything goes through the pipeline.
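The "partial backend configuration" mentioned in the design can be sketched roughly as follows. This is a minimal illustration, not the project's actual file: the bucket, key, and lock table names are placeholders, since the post does not state them.

```hcl
# terraform/network/backend.tf (sketch)
# Partial configuration: nothing environment-specific is hard-coded here.
terraform {
  backend "s3" {
    encrypt = true # state objects are encrypted at rest
  }
}

# The remaining settings are supplied when each environment is initialized.
# All values below are placeholders, not the project's real names:
#
#   terraform init \
#     -backend-config="bucket=example-tfstate-bucket" \
#     -backend-config="key=network/dev/terraform.tfstate" \
#     -backend-config="region=us-east-1" \
#     -backend-config="dynamodb_table=example-tf-locks"
```

Keeping the block partial lets one set of Terraform files serve both environments. Pointing at the other environment's state is then a matter of re-running terraform init -reconfigure with a different key.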

What We Built

Network Topology

dev environment — optimized for cost:

VPC: 10.1.0.0/16 (us-east-1)
├── Public Subnet A  (10.1.0.0/24, us-east-1a)  ← NAT Instance + EIP
├── Public Subnet B  (10.1.1.0/24, us-east-1b)
├── Private Subnet A (10.1.10.0/24, us-east-1a) ← future EKS nodes
├── Private Subnet B (10.1.11.0/24, us-east-1b) ← future EKS nodes
├── 1 shared private route table → NAT Instance ENI
└── Internet Gateway

prod environment — optimized for availability:

VPC: 10.2.0.0/16 (us-east-1)
├── Public Subnet A  (10.2.0.0/24, us-east-1a)  ← NAT Gateway A + EIP
├── Public Subnet B  (10.2.1.0/24, us-east-1b)  ← NAT Gateway B + EIP
├── Private Subnet A (10.2.10.0/24, us-east-1a) ← future EKS nodes
├── Private Subnet B (10.2.11.0/24, us-east-1b) ← future EKS nodes
├── 2 per-AZ private route tables → respective NAT Gateways
└── Internet Gateway
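Both layouts follow the same pattern: /24 subnets carved from a /16, with public subnets at offsets 0 and 1 and private subnets at offsets 10 and 11. One way to express that is with cidrsubnet(). The sketch below assumes that approach; the real module may simply hard-code the CIDRs, and the variable and local names are illustrative.

```hcl
# Sketch: derive the /24 subnets from the VPC CIDR instead of listing them.
variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR, e.g. 10.1.0.0/16 for dev or 10.2.0.0/16 for prod"
}

locals {
  azs = ["us-east-1a", "us-east-1b"]

  # Adding 8 bits turns the /16 into /24s; netnum 0 and 1 are the public subnets.
  public_subnet_cidrs = {
    for i, az in local.azs : az => cidrsubnet(var.vpc_cidr, 8, i)
  }

  # netnum 10 and 11 are the private subnets.
  private_subnet_cidrs = {
    for i, az in local.azs : az => cidrsubnet(var.vpc_cidr, 8, i + 10)
  }
}

resource "aws_vpc" "core" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true # assumption: not stated in the post
}

resource "aws_subnet" "public" {
  for_each          = local.public_subnet_cidrs
  vpc_id            = aws_vpc.core.id
  availability_zone = each.key
  cidr_block        = each.value
}

resource "aws_subnet" "private" {
  for_each          = local.private_subnet_cidrs
  vpc_id            = aws_vpc.core.id
  availability_zone = each.key
  cidr_block        = each.value
}
```

With vpc_cidr set to 10.1.0.0/16, the expressions evaluate to 10.1.0.0/24 and 10.1.1.0/24 for the public subnets and 10.1.10.0/24 and 10.1.11.0/24 for the private ones, matching the dev tree above.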

The private subnets are where our EKS node groups will live in Part 2. They already carry the kubernetes.io/role/internal-elb tag, which Kubernetes load balancer controllers use to discover subnets for internal load balancers, so the subnets need no Terraform changes when we add the cluster.
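As a rough sketch, the discovery tags can live in their own locals and be merged into each subnet's tags. The private-subnet tag is the one named above. The public-subnet counterpart is an assumption added for completeness, since the post does not say whether it is set.

```hcl
locals {
  # Tag named in the text: marks subnets as candidates for internal load balancers.
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = "1"
  }

  # Assumption: the counterpart tag for internet-facing load balancers.
  public_subnet_tags = {
    "kubernetes.io/role/elb" = "1"
  }
}

# Usage inside a subnet resource (illustrative):
#
#   tags = merge(local.common_tags, local.private_subnet_tags, {
#     Name = "${local.name_prefix}-use1a-subnet-private"
#   })
```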

Resource Inventory

Resource	dev	prod
VPC	1	1
Subnets	4	4
Internet Gateway	1	1
Elastic IPs	1	2
NAT Instance (EC2)	1	—
NAT Gateways	—	2
Route Tables	2	3
Route Table Associations	4	4
VPC Flow Log	1	1
CloudWatch Log Group	1	1
IAM Role (Flow Logs)	1	1
CloudTrail Trail	1	1
CloudTrail S3 Bucket	1	1
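The dev and prod columns differ mainly in the NAT rows, and both environments share one set of Terraform files. One plausible way to get both shapes from the same code is to drive resource counts from the environment name. This is a sketch under that assumption; the variable names are hypothetical, and the dev default route through the NAT Instance is not shown here.

```hcl
variable "environment" {
  type = string # "dev" or "prod"
}

variable "vpc_id" {
  type = string
}

variable "public_subnet_ids" {
  type        = list(string)
  description = "Public subnet IDs ordered by AZ: [us-east-1a, us-east-1b]"
}

locals {
  is_prod           = var.environment == "prod"
  nat_gateway_count = local.is_prod ? 2 : 0 # per-AZ gateways in prod only
  private_rt_count  = local.is_prod ? 2 : 1 # per-AZ tables in prod, one shared in dev
}

# Two EIPs in prod (one per gateway), one in dev (for the NAT Instance).
resource "aws_eip" "nat" {
  count  = local.is_prod ? 2 : 1
  domain = "vpc" # AWS provider v5 syntax; older versions used vpc = true
}

resource "aws_nat_gateway" "core" {
  count         = local.nat_gateway_count
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = var.public_subnet_ids[count.index]
}

resource "aws_route_table" "private" {
  count  = local.private_rt_count
  vpc_id = var.vpc_id
}

# In prod, each private route table sends internet-bound traffic to its own AZ's gateway.
resource "aws_route" "private_default_via_nat_gateway" {
  count                  = local.nat_gateway_count
  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.core[count.index].id
}
```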

Naming Convention

Every resource follows MM-{env}-{region}-{az}-{resource}-{purpose}, with the {az} segment left out for resources that aren't tied to a single Availability Zone:

MM-dev-use1-vpc-core
MM-dev-use1-use1a-subnet-public
MM-prod-use1-use1b-nat-core
MM-dev-use1-rt-private

This makes resources identifiable at a glance in the AWS console, without opening the rest of their tags.
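A naming helper along these lines could produce those names. local.name_prefix appears in the project's own code; the other names here (region_code, az_codes, and the output) are illustrative assumptions.

```hcl
variable "environment" {
  type = string # "dev" or "prod"
}

locals {
  region_code = "use1"
  name_prefix = "MM-${var.environment}-${local.region_code}"

  # Short AZ codes used in the {az} segment.
  az_codes = {
    "us-east-1a" = "use1a"
    "us-east-1b" = "use1b"
  }

  # Regional resources skip the {az} segment.
  vpc_name        = "${local.name_prefix}-vpc-core"
  private_rt_name = "${local.name_prefix}-rt-private"

  # AZ-scoped resources include it.
  public_subnet_names = {
    for az, code in local.az_codes :
    az => "${local.name_prefix}-${code}-subnet-public"
  }
}

output "example_names" {
  value = {
    vpc            = local.vpc_name
    private_rt     = local.private_rt_name
    public_subnets = local.public_subnet_names
  }
}
```

With environment set to dev, this yields MM-dev-use1-vpc-core, MM-dev-use1-rt-private, and MM-dev-use1-use1a-subnet-public, in line with the examples above.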

Tagging

Twelve mandatory tags on every resource, enforced through a single local.common_tags block:

tags = merge(local.common_tags, {
  Name = "${local.name_prefix}-vpc-core"
})

Adding a new mandatory tag means editing one file. Every resource picks it up on the next apply.
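For context, the locals behind that merge() call might look like the sketch below. Only the Environment tag is confirmed by this post. The other keys are hypothetical stand-ins, and the full set of twelve is not reproduced here.

```hcl
variable "environment" {
  type = string # "dev" or "prod"
}

locals {
  name_prefix = "MM-${var.environment}-use1"

  common_tags = {
    Environment = var.environment   # confirmed: the CI guard checks this tag
    ManagedBy   = "terraform"       # hypothetical
    Project     = "example-project" # hypothetical
  }
}

resource "aws_vpc" "core" {
  cidr_block = "10.1.0.0/16"

  # merge() gives later arguments precedence, so the per-resource Name
  # is layered on top of the shared tags.
  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-vpc-core"
  })
}
```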

The CI/CD Pipeline

The GitHub Actions workflow handles the full lifecycle:

PR opened/updated
  → terraform init → fmt check → validate → plan
  → post plan as PR comment
  → environment cross-contamination guard (checks Environment tags match target)

Merge to main
  → apply dev (automatic)
  → idempotency check (re-plan must show 0 changes)
  → apply prod (requires manual approval via GitHub Environment)
  → idempotency check

Authentication uses OIDC — GitHub Actions assumes an IAM role via federated identity. No AWS access keys stored anywhere.

The cross-contamination guard is a jq filter that scans the plan JSON for any resource with an Environment tag that doesn't match the target environment. If dev resources somehow appear in a prod plan, the pipeline stops.

Lessons from the Build

1. The Amazon Linux NAT AMI is Gone

The classic amzn-ami-vpc-nat-* AMIs that everyone references in NAT Instance tutorials have been fully deprecated. Our first CI/CD run failed because the AMI data source returned zero results.

The fix: use Amazon Linux 2023 with a user data script that enables IP forwarding and configures an iptables masquerade rule. It's a few lines of bash, and with the source/destination check disabled on the instance, it gives the same NAT behavior as the old NAT AMI.
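A sketch of what such a NAT Instance can look like in Terraform follows. This is not the project's actual configuration: the instance type, variable names, AMI name pattern, and sysctl file name are assumptions, and the security group is expected to already allow traffic from the private subnets.

```hcl
variable "public_subnet_id" {
  type = string
}

variable "private_route_table_id" {
  type = string
}

variable "nat_security_group_id" {
  type = string
}

# Latest standard Amazon Linux 2023 AMI (x86_64).
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-2023.*-x86_64"]
  }
}

resource "aws_instance" "nat" {
  ami                    = data.aws_ami.al2023.id
  instance_type          = "t3.micro" # assumption
  subnet_id              = var.public_subnet_id
  vpc_security_group_ids = [var.nat_security_group_id]
  source_dest_check      = false # required so the instance can forward others' traffic

  user_data = <<-EOF
    #!/bin/bash
    set -euo pipefail
    # Install and start the iptables service so rules can be saved.
    dnf install -y iptables-services
    systemctl enable --now iptables
    # Enable IPv4 forwarding now and on reboot.
    echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/90-nat.conf
    sysctl -p /etc/sysctl.d/90-nat.conf
    # Masquerade traffic leaving via the primary interface (name varies by instance type).
    IFACE=$(ip -o -4 route show to default | awk '{print $5}' | head -n1)
    iptables -t nat -A POSTROUTING -o "$IFACE" -j MASQUERADE
    iptables -F FORWARD
    service iptables save
  EOF
}

resource "aws_eip" "nat" {
  domain   = "vpc"
  instance = aws_instance.nat.id
}

# The shared private route table sends internet-bound traffic to the instance's ENI.
resource "aws_route" "private_default" {
  route_table_id         = var.private_route_table_id
  destination_cidr_block = "0.0.0.0/0"
  network_interface_id   = aws_instance.nat.primary_network_interface_id
}
```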

2. IAM Permissions Need Iteration

The Terraform AWS provider makes API calls you don't expect. Creating an S3 bucket requires not just s3:CreateBucket but also s3:GetBucketTagging, s3:GetBucketCORS, s3:GetBucketObjectLockConfiguration, and a dozen other Get* calls that the provider uses to read back the resource state.

We started with a scoped IAM policy and had to expand it twice before the apply succeeded. The lesson: either use a broad s3:Get* wildcard or be prepared to iterate on permissions as Terraform discovers what it needs.
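The wildcard option could look something like this. It is a sketch only: the bucket ARN pattern and policy name are placeholders, and the list of write actions depends on which bucket settings your configuration actually manages.

```hcl
data "aws_iam_policy_document" "terraform_s3" {
  # Broad read access so the provider can read back every bucket sub-resource.
  statement {
    sid       = "ReadBackBucketConfiguration"
    actions   = ["s3:Get*", "s3:ListBucket"]
    resources = ["arn:aws:s3:::example-mm-*"] # placeholder bucket name pattern
  }

  # Write access stays explicit and narrow.
  statement {
    sid = "ManageBuckets"
    actions = [
      "s3:CreateBucket",
      "s3:PutBucketTagging",
      "s3:PutBucketVersioning",
      "s3:PutEncryptionConfiguration",
      "s3:PutBucketPublicAccessBlock",
      "s3:PutBucketPolicy",
    ]
    resources = ["arn:aws:s3:::example-mm-*"]
  }
}

resource "aws_iam_policy" "terraform_s3" {
  name   = "example-terraform-s3" # placeholder
  policy = data.aws_iam_policy_document.terraform_s3.json
}
```

The trade-off is that s3:Get* grants more read access than the provider strictly needs, in exchange for not rediscovering missing permissions one failed apply at a time.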

3. OIDC Trust Policy Scope Matters

Our initial trust policy restricted role assumption to the main branch only (a sub claim of repo:org/repo:ref:refs/heads/main). This is correct for apply jobs, but the plan job runs on pull requests, where the token carries a different sub claim, so it could not assume the role. We broadened the condition to repo:org/repo:* to allow PR workflows to authenticate.

In a production setup, you'd want separate roles — a read-only role for plan (broad trust) and a write role for apply (main-only trust). For this project, a single role with broad trust was the pragmatic choice.
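The two-role variant might be sketched like this. The repository name, role names, and GitHub Environment name are placeholders. The sketch also assumes the OIDC provider already exists from the bootstrap phase, and it omits the permission policies each role would need.

```hcl
# Look up the GitHub OIDC provider created during bootstrap.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

locals {
  repo = "example-org/example-repo" # placeholder
}

# Plan role: broad trust, so any branch or pull request in the repo can assume it.
data "aws_iam_policy_document" "plan_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:${local.repo}:*"]
    }
  }
}

# Apply role: only runs on main, or jobs bound to the prod GitHub Environment.
# Jobs that declare an environment present an environment-based sub claim.
data "aws_iam_policy_document" "apply_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        "repo:${local.repo}:ref:refs/heads/main",
        "repo:${local.repo}:environment:prod", # placeholder environment name
      ]
    }
  }
}

resource "aws_iam_role" "plan" {
  name               = "example-terraform-plan"
  assume_role_policy = data.aws_iam_policy_document.plan_trust.json
}

resource "aws_iam_role" "apply" {
  name               = "example-terraform-apply"
  assume_role_policy = data.aws_iam_policy_document.apply_trust.json
}
```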

4. Spec-Driven Development Pays Off for IaC

The spec workflow forced us to think through naming conventions, CIDR allocation, and environment isolation before writing any Terraform. When implementation started, there were no "what should I name this?" or "which CIDR should I use?" decisions left — they were all in the design doc.

This matters more for infrastructure than application code. A renamed variable is a refactor. A changed CIDR block is a destroy-and-recreate.

Repository Structure

.
├── .github/workflows/terraform.yaml    # CI/CD pipeline
├── .kiro/specs/mm-aws-network-infra/   # Spec documents
│   ├── requirements.md
│   ├── design.md
│   └── tasks.md
├── docs/post-apply-checklist.md        # Verification runbook
├── specs/
│   ├── phase0-state.yaml               # Bootstrap intent
│   └── phase1-network.yaml             # Network intent
└── terraform/
    ├── bootstrap/                      # Phase 0 (local state)
    │   ├── main.tf
    │   ├── variables.tf
    │   ├── outputs.tf
    │   └── providers.tf
    └── network/                        # Phase 1 (S3 remote state)
        ├── backend.tf
        ├── locals.tf                   # common_tags + naming helpers
        ├── vpc.tf
        ├── subnets.tf
        ├── igw.tf
        ├── eip.tf
        ├── nat_gateway.tf              # prod only
        ├── nat_instance.tf             # dev only
        ├── route_tables.tf
        ├── flow_logs.tf
        ├── cloudtrail.tf
        └── envs/
            ├── dev.tfvars
            └── prod.tfvars

What's Next: Part 2 — Multi-Account Landing Zone

The single-account setup got us running fast, but it won't scale. In Part 2, we're migrating to a proper multi-account architecture:

AWS Organizations and Control Tower — three accounts (Management, Dev, Prod) with centralized governance
Service Control Policies — region restrictions, GPU instance size caps, encryption enforcement, no static IAM credentials
Tailscale hybrid networking — connecting on-prem NVIDIA GPU servers to AWS private subnets without traditional VPN hardware
Centralized security — organization-wide CloudTrail, GuardDuty, Security Hub, and AWS Config
Cost controls — billing alerts, budgets, anomaly detection, and right-sizing defaults for a startup budget
Cross-region DR readiness — backup replication to us-west-2, DR VPC pre-provisioned for prod
VPC private endpoints — keeping AWS API traffic off the public internet
AI model security — CMK-encrypted model artifact buckets, scoped IAM for GPU nodes, inference endpoint guardrails

The network foundation from Part 1 carries forward — same CIDRs, same naming conventions, same tagging. We're just putting proper account boundaries and governance around it.

Try It Yourself

The entire infrastructure was built in a single Kiro session, from rough idea to deployed resources in both environments. The spec-driven workflow meant the core design decisions (naming, CIDR allocation, environment isolation) held through implementation; what we iterated on were details like IAM permissions and the NAT AMI. The task list gave us a clear path from empty repo to running infrastructure.

If you're building ML infrastructure and want to skip the weeks of Terraform trial-and-error, give Kiro a try. It won't replace your judgment on architecture decisions, but it will write the Terraform so you can focus on the decisions that matter.

This is Part 1 of the Mica Mirai ML Infrastructure series. Part 2 covers the multi-account landing zone, Well-Architected Framework adoption, and hybrid networking with on-prem AI resources.