Building ML Infrastructure on AWS with Kiro: Part 1 — Network Foundation
Part 1 of a series on building production-grade infrastructure for deploying custom Random Forest and LSTM models on AWS.
The Mission
At MikaMirAI, we're building infrastructure to serve custom machine learning models — a Random Forest for structured predictions and an LSTM for time-series forecasting. Before any model touches a GPU, you need a network that's secure, observable, and ready for what comes next.
This post covers how we used Kiro, an AI-powered IDE, to go from a rough idea to fully deployed AWS network infrastructure across two environments in a single session. No console clicking. No copy-pasting Terraform snippets from blog posts. Just a conversation with an AI that writes production-grade IaC.
What's coming in Part 2: We'll migrate to a multi-account AWS landing zone with Organizations, Control Tower, SCPs, Tailscale hybrid networking to on-prem AI resources, and the full Well-Architected Framework foundation a startup needs before scaling to GPU compute.
Why Start with the Network?
It's tempting to jump straight to Kubernetes and GPUs. But every production ML system sits on top of networking decisions that are painful to change later — CIDR ranges, NAT egress strategy, subnet topology, audit logging. Get these wrong and you're re-architecting under pressure when the models are already in production.
Our requirements:
- Two isolated environments — dev for iteration, prod for serving. No shared resources between them.
- EKS-ready subnets — tagged for Kubernetes load balancer discovery from day one.
- Cost-conscious dev — a NAT Instance instead of managed NAT Gateways saves ~$64/month in a non-production environment.
- Security baseline from the start — private subnets with no IGW route, encrypted state and audit buckets, no static IAM credentials anywhere.
- Full CI/CD — plan on PRs, auto-apply dev on merge, manual approval gate for prod.
The Kiro Workflow
Spec-Driven Development
Kiro's approach is spec-driven. Instead of jumping into code, we started with a structured conversation that produced three documents:
- Requirements — 27 formal requirements covering everything from S3 state bucket encryption to CloudTrail multi-region logging. Each requirement has acceptance criteria written in a SHALL/SHALL NOT format that leaves no room for ambiguity.
- Design — A technical design document with architecture diagrams, CIDR allocation tables, component inventories per environment, and key design decisions (why NAT Instance in dev, why per-AZ NAT Gateways in prod, why partial backend configuration).
- Tasks — A 20-task implementation plan with sub-tasks, each referencing specific requirements for traceability. Tasks are ordered by dependency: scaffold first, then bootstrap, then network resources, then CI/CD.
This isn't just documentation for documentation's sake. The spec became the contract between us and Kiro. When we said "implement task 9," Kiro knew exactly what EIP resources to create, which naming convention to follow, and which requirements to satisfy.
Three Phases of Infrastructure
The design breaks the work into three phases:
Phase 0 — Bootstrap (manual, once)
└─ S3 state bucket + DynamoDB lock table + IAM OIDC provider/role
Phase 1 — Network (automated via CI/CD)
└─ VPC, subnets, IGW, NAT, route tables, flow logs, CloudTrail
Phase 2 — CI/CD Pipeline
└─ GitHub Actions: plan on PR, apply on merge, approval gate for prod
Phase 0 has to run manually because the remote state backend doesn't exist yet — it's the chicken-and-egg of Terraform. After that, everything goes through the pipeline.
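The "partial backend configuration" called out in the design is what makes this hand-off clean: the backend block carries only the settings that never change, and the bucket and lock table created in Phase 0 are injected at init time. A minimal sketch, assuming the names from the bootstrap phase:

```hcl
# backend.tf (sketch): values that only exist after Phase 0 are supplied
# at init time instead of being hard-coded here.
terraform {
  backend "s3" {
    key     = "network/terraform.tfstate"   # assumed state key
    region  = "us-east-1"
    encrypt = true
    # terraform init \
    #   -backend-config="bucket=mm-terraform-state" \
    #   -backend-config="dynamodb_table=MM-terraform-lock"
  }
}
```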
What We Built
Overall Architecture
graph TB
subgraph AWS["AWS Account — us-east-1"]
subgraph Bootstrap["Phase 0 — Bootstrap (local state)"]
S3State["S3: mm-terraform-state<br/>Versioned · SSE-S3 · Public blocked"]
DDB["DynamoDB: MM-terraform-lock<br/>PAY_PER_REQUEST · SSE"]
OIDC["IAM OIDC Provider<br/>token.actions.githubusercontent.com"]
GHARole["IAM Role: MM-github-actions-role<br/>Scoped Terraform permissions"]
end
subgraph DevVPC["Dev VPC — 10.1.0.0/16"]
subgraph DevAZa["us-east-1a"]
DevPubA["Public Subnet<br/>10.1.0.0/24"]
DevPrivA["Private Subnet<br/>10.1.10.0/24"]
NATInst["NAT Instance<br/>t3.micro + EIP<br/>source_dest_check=false"]
end
subgraph DevAZb["us-east-1b"]
DevPubB["Public Subnet<br/>10.1.1.0/24"]
DevPrivB["Private Subnet<br/>10.1.11.0/24"]
end
DevIGW["Internet Gateway"]
DevPrivRT["Shared Private RT<br/>0.0.0.0/0 → NAT Instance ENI"]
DevFlowLog["VPC Flow Logs → CloudWatch<br/>14-day retention"]
end
subgraph ProdVPC["Prod VPC — 10.2.0.0/16"]
subgraph ProdAZa["us-east-1a"]
ProdPubA["Public Subnet<br/>10.2.0.0/24"]
ProdPrivA["Private Subnet<br/>10.2.10.0/24"]
NATGWA["NAT Gateway A<br/>+ EIP"]
end
subgraph ProdAZb["us-east-1b"]
ProdPubB["Public Subnet<br/>10.2.1.0/24"]
ProdPrivB["Private Subnet<br/>10.2.11.0/24"]
NATGWB["NAT Gateway B<br/>+ EIP"]
end
ProdIGW["Internet Gateway"]
ProdPrivRTA["Private RT (AZ-a)<br/>0.0.0.0/0 → NAT-GW-A"]
ProdPrivRTB["Private RT (AZ-b)<br/>0.0.0.0/0 → NAT-GW-B"]
ProdFlowLog["VPC Flow Logs → CloudWatch<br/>90-day retention"]
end
end
subgraph CICD["GitHub Actions CI/CD"]
PR["PR: plan + comment"]
ApplyDev["Merge: apply dev (auto)"]
ApplyProd["Merge: apply prod (approval)"]
end
CICD -->|OIDC| GHARole
GHARole -->|assume role| S3State
GHARole -->|lock| DDB
NATInst --> DevPrivRT
DevPrivRT --> DevPrivA
DevPrivRT --> DevPrivB
NATGWA --> ProdPrivRTA
ProdPrivRTA --> ProdPrivA
NATGWB --> ProdPrivRTB
ProdPrivRTB --> ProdPrivB
Network Topology
dev environment — optimized for cost:
VPC: 10.1.0.0/16 (us-east-1)
├── Public Subnet A (10.1.0.0/24, us-east-1a) ← NAT Instance + EIP
├── Public Subnet B (10.1.1.0/24, us-east-1b)
├── Private Subnet A (10.1.10.0/24, us-east-1a) ← future EKS nodes
├── Private Subnet B (10.1.11.0/24, us-east-1b) ← future EKS nodes
├── 1 shared private route table → NAT Instance ENI
└── Internet Gateway
graph LR
Internet((Internet))
subgraph DevVPC["Dev VPC 10.1.0.0/16"]
IGW[Internet Gateway]
subgraph PubA["Public 10.1.0.0/24 — us-east-1a"]
NAT["NAT Instance<br/>t3.micro<br/>EIP: allocated<br/>source_dest_check: false"]
end
subgraph PubB["Public 10.1.1.0/24 — us-east-1b"]
PubBEmpty["(no NAT resource)"]
end
subgraph PrivA["Private 10.1.10.0/24 — us-east-1a"]
PrivAWorkload["Future EKS Nodes<br/>k8s.io/role/internal-elb: 1"]
end
subgraph PrivB["Private 10.1.11.0/24 — us-east-1b"]
PrivBWorkload["Future EKS Nodes<br/>k8s.io/role/internal-elb: 1"]
end
end
Internet <-->|public traffic| IGW
IGW -->|0.0.0.0/0| PubA
IGW -->|0.0.0.0/0| PubB
PrivAWorkload -->|0.0.0.0/0 via ENI| NAT
PrivBWorkload -->|0.0.0.0/0 via ENI| NAT
NAT -->|outbound| IGW
FlowLogs["CloudWatch Logs<br/>MM-dev-use1-logs-flow<br/>14-day retention"]
CT["CloudTrail<br/>MM-dev-use1-cloudtrail-core<br/>→ S3 (90-day lifecycle)"]
DevVPC -.->|ALL traffic metadata| FlowLogs
DevVPC -.->|API audit| CT
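One detail worth calling out from the dev diagram: with an EC2-based NAT, the private default route targets the instance's network interface rather than a gateway. A minimal sketch, resource names assumed:

```hcl
# Sketch: shared private route table for dev, defaulting to the NAT
# instance's primary ENI. Resource names here are assumptions.
resource "aws_route_table" "dev_private" {
  vpc_id = aws_vpc.dev.id
  tags   = merge(local.common_tags, { Name = "${local.name_prefix}-rt-private" })
}

resource "aws_route" "dev_private_default" {
  route_table_id         = aws_route_table.dev_private.id
  destination_cidr_block = "0.0.0.0/0"
  network_interface_id   = aws_instance.nat.primary_network_interface_id
}
```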
prod environment — optimized for availability:
VPC: 10.2.0.0/16 (us-east-1)
├── Public Subnet A (10.2.0.0/24, us-east-1a) ← NAT Gateway A + EIP
├── Public Subnet B (10.2.1.0/24, us-east-1b) ← NAT Gateway B + EIP
├── Private Subnet A (10.2.10.0/24, us-east-1a) ← future EKS nodes
├── Private Subnet B (10.2.11.0/24, us-east-1b) ← future EKS nodes
├── 2 per-AZ private route tables → respective NAT Gateways
└── Internet Gateway
graph LR
Internet((Internet))
subgraph ProdVPC["Prod VPC 10.2.0.0/16"]
IGW[Internet Gateway]
subgraph PubA["Public 10.2.0.0/24 — us-east-1a"]
NATGWA["NAT Gateway A<br/>+ EIP"]
end
subgraph PubB["Public 10.2.1.0/24 — us-east-1b"]
NATGWB["NAT Gateway B<br/>+ EIP"]
end
subgraph PrivA["Private 10.2.10.0/24 — us-east-1a"]
PrivAWorkload["Future EKS Nodes<br/>k8s.io/role/internal-elb: 1"]
end
subgraph PrivB["Private 10.2.11.0/24 — us-east-1b"]
PrivBWorkload["Future EKS Nodes<br/>k8s.io/role/internal-elb: 1"]
end
end
Internet <-->|public traffic| IGW
IGW -->|0.0.0.0/0| PubA
IGW -->|0.0.0.0/0| PubB
PrivAWorkload -->|0.0.0.0/0| NATGWA
PrivBWorkload -->|0.0.0.0/0| NATGWB
NATGWA -->|outbound| IGW
NATGWB -->|outbound| IGW
FlowLogs["CloudWatch Logs<br/>MM-prod-use1-logs-flow<br/>90-day retention"]
CT["CloudTrail<br/>MM-prod-use1-cloudtrail-core<br/>→ S3 (365-day lifecycle)"]
ProdVPC -.->|ALL traffic metadata| FlowLogs
ProdVPC -.->|API audit| CT
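On the prod side, each AZ gets its own NAT Gateway and private route table, so the loss of one AZ doesn't take down egress in the other. A sketch of one AZ's pair, resource names assumed:

```hcl
# Sketch: per-AZ NAT Gateway and private default route for us-east-1a
# (resource names are assumptions, not the repo's).
resource "aws_nat_gateway" "prod_a" {
  allocation_id = aws_eip.prod_nat_a.allocation_id
  subnet_id     = aws_subnet.prod_public_a.id
  tags          = merge(local.common_tags, { Name = "${local.name_prefix}-use1a-nat-core" })
}

resource "aws_route" "prod_private_a_default" {
  route_table_id         = aws_route_table.prod_private_a.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.prod_a.id
}
```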
The private subnets are where our EKS node groups will live in Part 2. They already carry the kubernetes.io/role/internal-elb tag so EKS will discover them automatically — no Terraform changes needed when we add the cluster.
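A sketch of how that looks at subnet creation, with resource names assumed:

```hcl
# Sketch: private subnet pre-tagged for EKS internal load balancer discovery.
resource "aws_subnet" "dev_private_a" {
  vpc_id            = aws_vpc.dev.id
  cidr_block        = "10.1.10.0/24"
  availability_zone = "us-east-1a"

  tags = merge(local.common_tags, {
    Name                              = "${local.name_prefix}-use1a-subnet-private"
    "kubernetes.io/role/internal-elb" = "1"
  })
}
```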
CIDR Allocation
graph TB
subgraph CIDRs["IP Address Allocation"]
subgraph Dev["Dev — 10.1.0.0/16"]
D1["10.1.0.0/24 — Public A (us-east-1a)"]
D2["10.1.1.0/24 — Public B (us-east-1b)"]
D3["10.1.10.0/24 — Private A (us-east-1a)"]
D4["10.1.11.0/24 — Private B (us-east-1b)"]
D5["10.1.20.0/24+ — Reserved (future)"]
end
subgraph Prod["Prod — 10.2.0.0/16"]
P1["10.2.0.0/24 — Public A (us-east-1a)"]
P2["10.2.1.0/24 — Public B (us-east-1b)"]
P3["10.2.10.0/24 — Private A (us-east-1a)"]
P4["10.2.11.0/24 — Private B (us-east-1b)"]
P5["10.2.20.0/24+ — Reserved (future)"]
end
end
style Dev fill:#e6ffe6,stroke:#009900
style Prod fill:#fff0e6,stroke:#cc6600
Resource Inventory
| Resource | dev | prod |
|---|---|---|
| VPC | 1 | 1 |
| Subnets | 4 | 4 |
| Internet Gateway | 1 | 1 |
| Elastic IPs | 1 | 2 |
| NAT Instance (EC2) | 1 | — |
| NAT Gateways | — | 2 |
| Route Tables | 2 | 3 |
| Route Table Associations | 4 | 4 |
| VPC Flow Log | 1 | 1 |
| CloudWatch Log Group | 1 | 1 |
| IAM Role (Flow Logs) | 1 | 1 |
| CloudTrail Trail | 1 | 1 |
| CloudTrail S3 Bucket | 1 | 1 |
Naming Convention
Every resource follows MM-{env}-{region}-{az}-{resource}-{purpose}, with the {az} segment included only for AZ-scoped resources such as subnets and NAT devices:
MM-dev-use1-vpc-core
MM-dev-use1-use1a-subnet-public
MM-prod-use1-use1b-nat-core
MM-dev-use1-rt-private
This makes resources instantly identifiable in the AWS console without checking tags.
Tagging
Twelve mandatory tags on every resource, enforced through a single local.common_tags block:
tags = merge(local.common_tags, {
  Name = "${local.name_prefix}-vpc-core"
})
Adding a new mandatory tag means editing one file. Every resource picks it up on the next apply.
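A minimal sketch of that pattern (tag keys shown here are illustrative; the real set has twelve):

```hcl
# locals.tf (sketch): single source of truth for naming and mandatory tags.
locals {
  name_prefix = "MM-${var.environment}-use1"

  common_tags = {
    Environment = var.environment   # "dev" or "prod"
    Project     = "mm-aws-network-infra"
    ManagedBy   = "terraform"
    CostCenter  = "ml-platform"     # illustrative key; actual set has twelve
  }
}
```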
The CI/CD Pipeline
The GitHub Actions workflow handles the full lifecycle:
flowchart LR
A[Developer] -->|push branch| B[Pull Request]
B -->|trigger| C{Plan Job}
C -->|terraform init| D[fmt check]
D --> E[validate]
E --> F[plan dev.tfvars]
F --> G[env tag guard]
G --> H[Post plan as PR comment]
B -->|merge to main| I{Apply Dev}
I -->|OIDC auth| J[terraform apply dev.tfvars]
J --> K[Idempotency check]
K -->|success| L{Apply Prod}
L -->|manual approval| M[terraform apply prod.tfvars]
M --> N[Idempotency check]
style C fill:#f9f,stroke:#333
style I fill:#9f9,stroke:#333
style L fill:#ff9,stroke:#333
Authentication uses OIDC — GitHub Actions assumes an IAM role via federated identity. No AWS access keys stored anywhere.
The cross-contamination guard is a jq filter that scans the plan JSON for any resource with an Environment tag that doesn't match the target environment. If dev resources somehow appear in a prod plan, the pipeline stops.
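A hedged sketch of what such a guard can look like for a prod plan (the repo's actual filter may differ in detail):

```bash
# Sketch: fail the job if any planned resource carries a non-prod Environment tag.
terraform show -json tfplan > plan.json
MISMATCHES=$(jq --arg env "prod" '
  [.resource_changes[]?
   | .change.after.tags.Environment // empty
   | select(. != $env)] | length' plan.json)
if [ "$MISMATCHES" -gt 0 ]; then
  echo "::error::$MISMATCHES resources carry a non-prod Environment tag"
  exit 1
fi
```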
Lessons from the Build
1. The Amazon Linux NAT AMI is Gone
The classic amzn-ami-vpc-nat-* AMIs that everyone references in NAT Instance tutorials have been fully deprecated. Our first CI/CD run failed because the AMI data source returned zero results.
The fix: use Amazon Linux 2023 with a user data script that enables IP forwarding and configures iptables masquerade. It's a few lines of bash, and it works identically to the old NAT AMI.
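A hedged sketch of what that replacement can look like on Amazon Linux 2023 (the AMI data source, resource names, and package name are assumptions; the repo's exact script may differ):

```hcl
# Sketch: AL2023 NAT instance whose user data enables forwarding and NAT.
resource "aws_instance" "nat" {
  ami               = data.aws_ami.al2023.id      # assumed AL2023 AMI lookup
  instance_type     = "t3.micro"
  subnet_id         = aws_subnet.dev_public_a.id  # assumed subnet resource
  source_dest_check = false

  user_data = <<-EOF
    #!/bin/bash
    set -euxo pipefail
    # Forward packets between the private subnets and the internet
    sysctl -w net.ipv4.ip_forward=1
    echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-nat.conf
    # AL2023 ships nftables-backed iptables; package name may vary
    dnf install -y iptables-nft
    # Masquerade outbound traffic through the primary interface
    IFACE=$(ip route show default | awk '{print $5}')
    iptables -t nat -A POSTROUTING -o "$IFACE" -j MASQUERADE
  EOF

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-use1a-nat-core"
  })
}
```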
2. IAM Permissions Need Iteration
The Terraform AWS provider makes API calls you don't expect. Creating an S3 bucket requires not just s3:CreateBucket but also s3:GetBucketTagging, s3:GetBucketCORS, s3:GetBucketObjectLockConfiguration, and a dozen other Get* calls that the provider uses to read back the resource state.
We started with a scoped IAM policy and had to expand it twice before the apply succeeded. The lesson: either use a broad s3:Get* wildcard or be prepared to iterate on permissions as Terraform discovers what it needs.
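A sketch of the broader approach, with the bucket ARN pattern as an assumption:

```hcl
# Sketch: let Terraform make the read-back calls the AWS provider issues
# after creating a bucket. The ARN pattern here is an assumption.
data "aws_iam_policy_document" "s3_manage" {
  statement {
    sid = "S3ManageProjectBuckets"
    actions = [
      "s3:CreateBucket",
      "s3:PutBucket*",
      "s3:PutObject",
      "s3:Get*",   # covers GetBucketTagging, GetBucketCORS, object lock, etc.
      "s3:List*",
    ]
    resources = ["arn:aws:s3:::mm-*", "arn:aws:s3:::mm-*/*"]
  }
}
```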
3. OIDC Trust Policy Scope Matters
Our initial trust policy restricted role assumption to refs/heads/main only. This is correct for apply jobs, but it means the plan job can't run on PR branches. We broadened the condition to repo:org/repo:* to allow PR workflows to authenticate.
In a production setup, you'd want separate roles — a read-only role for plan (broad trust) and a write role for apply (main-only trust). For this project, a single role with broad trust was the pragmatic choice.
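For reference, the scoping lives in a single subject condition on the trust policy. A sketch of the broadened variant (the org and repo names are placeholders):

```hcl
# Sketch: trust policy allowing any ref in the repo to assume the role.
data "aws_iam_policy_document" "github_oidc_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      # "repo:<org>/<repo>:ref:refs/heads/main" would lock this to main only
      values   = ["repo:<org>/<repo>:*"]
    }
  }
}
```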
4. Spec-Driven Development Pays Off for IaC
The spec workflow forced us to think through naming conventions, CIDR allocation, and environment isolation before writing any Terraform. When implementation started, there were no "what should I name this?" or "which CIDR should I use?" decisions left — they were all in the design doc.
This matters more for infrastructure than application code. A renamed variable is a refactor. A changed CIDR block is a destroy-and-recreate.
Repository Structure
.
├── .github/workflows/terraform.yaml # CI/CD pipeline
├── .kiro/specs/mm-aws-network-infra/ # Spec documents
│ ├── requirements.md
│ ├── design.md
│ └── tasks.md
├── docs/post-apply-checklist.md # Verification runbook
├── specs/
│ ├── phase0-state.yaml # Bootstrap intent
│ └── phase1-network.yaml # Network intent
└── terraform/
├── bootstrap/ # Phase 0 (local state)
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ └── providers.tf
└── network/ # Phase 1 (S3 remote state)
├── backend.tf
├── locals.tf # common_tags + naming helpers
├── vpc.tf
├── subnets.tf
├── igw.tf
├── eip.tf
├── nat_gateway.tf # prod only
├── nat_instance.tf # dev only
├── route_tables.tf
├── flow_logs.tf
├── cloudtrail.tf
└── envs/
├── dev.tfvars
└── prod.tfvars
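To make the per-environment split concrete, a sketch of what dev.tfvars might carry (variable names are assumptions; the CIDRs and retention values come from the design above):

```hcl
# envs/dev.tfvars (sketch; variable names assumed)
environment          = "dev"
vpc_cidr             = "10.1.0.0/16"
public_subnet_cidrs  = ["10.1.0.0/24", "10.1.1.0/24"]
private_subnet_cidrs = ["10.1.10.0/24", "10.1.11.0/24"]
nat_strategy         = "instance"   # "gateway" in prod.tfvars
flow_log_retention   = 14
```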
What's Next: Part 2 — Multi-Account Landing Zone
The single-account setup got us running fast, but it won't scale. In Part 2, we're migrating to a proper multi-account architecture:
- AWS Organizations and Control Tower — three accounts (Management, Dev, Prod) with centralized governance
- Service Control Policies — region restrictions, GPU instance size caps, encryption enforcement, no static IAM credentials
- Tailscale hybrid networking — connecting on-prem NVIDIA GPU servers to AWS private subnets without traditional VPN hardware
- Centralized security — organization-wide CloudTrail, GuardDuty, Security Hub, and AWS Config
- Cost controls — billing alerts, budgets, anomaly detection, and right-sizing defaults for a startup budget
- Cross-region DR readiness — backup replication to us-west-2, DR VPC pre-provisioned for prod
The network foundation from Part 1 carries forward — same CIDRs, same naming conventions, same tagging. We're just putting proper account boundaries and governance around it.
Try It Yourself
The entire infrastructure was built in a single Kiro session — from rough idea to deployed resources in both environments. The spec-driven workflow meant we never had to backtrack on a design decision, and the task list gave us a clear path from empty repo to running infrastructure.
If you're building ML infrastructure and want to skip the weeks of Terraform trial-and-error, give Kiro a try. It won't replace your judgment on architecture decisions, but it will write the Terraform so you can focus on the decisions that matter.
This is Part 1 of the MikaMirAI ML Infrastructure series. Part 2 covers the multi-account landing zone, Well-Architected Framework adoption, and hybrid networking with on-prem AI resources.