Tags: ml-infra, AWS, Terraform, Kiro, ML Infrastructure, IaC, GitHub Actions, Networking

Building Production ML Infrastructure on AWS with Kiro: Part 1 — Network Foundation

How we used Kiro's spec-driven workflow to go from a rough idea to fully deployed AWS network infrastructure across two environments — VPCs, NAT, CI/CD, and audit logging — in a single session.


Part 1 of a series on building production-grade infrastructure for deploying custom Random Forest and LSTM models on AWS.


The Mission

At MikaMirAI, we're building infrastructure to serve custom machine learning models — a Random Forest for structured predictions and an LSTM for time-series forecasting. Before any model touches a GPU, you need a network that's secure, observable, and ready for what comes next.

This post covers how we used Kiro, an AI-powered IDE, to go from a rough idea to fully deployed AWS network infrastructure across two environments in a single session. No console clicking. No copy-pasting Terraform snippets from blog posts. Just a conversation with an AI that writes production-grade IaC.

What's coming in Part 2: We'll migrate to a multi-account AWS landing zone with Organizations, Control Tower, SCPs, Tailscale hybrid networking to on-prem AI resources, and the full Well-Architected Framework foundation a startup needs before scaling to GPU compute.


Why Start with the Network?

It's tempting to jump straight to Kubernetes and GPUs. But every production ML system sits on top of networking decisions that are painful to change later — CIDR ranges, NAT egress strategy, subnet topology, audit logging. Get these wrong and you're re-architecting under pressure when the models are already in production.

Our requirements:

  • Two isolated environments — dev for iteration, prod for serving. No shared resources between them.
  • EKS-ready subnets — tagged for Kubernetes load balancer discovery from day one.
  • Cost-conscious dev — a NAT Instance instead of managed NAT Gateways saves ~$64/month in a non-production environment.
  • Security baseline from the start — private subnets with no IGW route, encrypted state and audit buckets, no static IAM credentials anywhere.
  • Full CI/CD — plan on PRs, auto-apply dev on merge, manual approval gate for prod.

The Kiro Workflow

Spec-Driven Development

Kiro's approach is spec-driven. Instead of jumping into code, we started with a structured conversation that produced three documents:

  1. Requirements — 27 formal requirements covering everything from S3 state bucket encryption to CloudTrail multi-region logging. Each requirement has acceptance criteria written in a SHALL/SHALL NOT format that leaves no room for ambiguity.

  2. Design — A technical design document with architecture diagrams, CIDR allocation tables, component inventories per environment, and key design decisions (why NAT Instance in dev, why per-AZ NAT Gateways in prod, why partial backend configuration).

  3. Tasks — A 20-task implementation plan with sub-tasks, each referencing specific requirements for traceability. Tasks are ordered by dependency: scaffold first, then bootstrap, then network resources, then CI/CD.

This isn't just documentation for documentation's sake. The spec became the contract between us and Kiro. When we said "implement task 9," Kiro knew exactly what EIP resources to create, which naming convention to follow, and which requirements to satisfy.
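To give a flavor of the format, a hypothetical acceptance criterion (illustrative only, not copied from our spec) might read:

```text
WHEN the Terraform state bucket is created
THEN the bucket SHALL have versioning enabled
AND the bucket SHALL have server-side encryption (SSE-S3) enabled
AND the bucket SHALL NOT allow any form of public access
```

Criteria in this shape are trivially checkable against a plan output, which is what makes them useful as a contract.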

Three Phases of Infrastructure

The design breaks the work into three phases:

Phase 0 — Bootstrap (manual, once)
  └─ S3 state bucket + DynamoDB lock table + IAM OIDC provider/role

Phase 1 — Network (automated via CI/CD)
  └─ VPC, subnets, IGW, NAT, route tables, flow logs, CloudTrail

Phase 2 — CI/CD Pipeline
  └─ GitHub Actions: plan on PR, apply on merge, approval gate for prod

Phase 0 has to run manually because the remote state backend doesn't exist yet — Terraform's classic chicken-and-egg problem. After that, everything goes through the pipeline.
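The partial backend configuration mentioned in the design doc can be sketched like this (a minimal illustration; the actual key layout may differ):

```hcl
# Backend block left deliberately empty: bucket, key, region, and
# dynamodb_table are supplied at init time, so the same code serves
# both environments.
terraform {
  backend "s3" {}
}
```

At init time the pipeline would pass something like `terraform init -backend-config="bucket=mm-terraform-state" -backend-config="dynamodb_table=MM-terraform-lock" -backend-config="key=network/dev.tfstate"`, with the key varying per environment.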


What We Built

Overall Architecture

graph TB
    subgraph AWS["AWS Account — us-east-1"]
        subgraph Bootstrap["Phase 0 — Bootstrap (local state)"]
            S3State["S3: mm-terraform-state<br/>Versioned · SSE-S3 · Public blocked"]
            DDB["DynamoDB: MM-terraform-lock<br/>PAY_PER_REQUEST · SSE"]
            OIDC["IAM OIDC Provider<br/>token.actions.githubusercontent.com"]
            GHARole["IAM Role: MM-github-actions-role<br/>Scoped Terraform permissions"]
        end

        subgraph DevVPC["Dev VPC — 10.1.0.0/16"]
            subgraph DevAZa["us-east-1a"]
                DevPubA["Public Subnet<br/>10.1.0.0/24"]
                DevPrivA["Private Subnet<br/>10.1.10.0/24"]
                NATInst["NAT Instance<br/>t3.micro + EIP<br/>source_dest_check=false"]
            end
            subgraph DevAZb["us-east-1b"]
                DevPubB["Public Subnet<br/>10.1.1.0/24"]
                DevPrivB["Private Subnet<br/>10.1.11.0/24"]
            end
            DevIGW["Internet Gateway"]
            DevPrivRT["Shared Private RT<br/>0.0.0.0/0 → NAT Instance ENI"]
            DevFlowLog["VPC Flow Logs → CloudWatch<br/>14-day retention"]
        end

        subgraph ProdVPC["Prod VPC — 10.2.0.0/16"]
            subgraph ProdAZa["us-east-1a"]
                ProdPubA["Public Subnet<br/>10.2.0.0/24"]
                ProdPrivA["Private Subnet<br/>10.2.10.0/24"]
                NATGWA["NAT Gateway A<br/>+ EIP"]
            end
            subgraph ProdAZb["us-east-1b"]
                ProdPubB["Public Subnet<br/>10.2.1.0/24"]
                ProdPrivB["Private Subnet<br/>10.2.11.0/24"]
                NATGWB["NAT Gateway B<br/>+ EIP"]
            end
            ProdIGW["Internet Gateway"]
            ProdPrivRTA["Private RT (AZ-a)<br/>0.0.0.0/0 → NAT-GW-A"]
            ProdPrivRTB["Private RT (AZ-b)<br/>0.0.0.0/0 → NAT-GW-B"]
            ProdFlowLog["VPC Flow Logs → CloudWatch<br/>90-day retention"]
        end
    end

    subgraph CICD["GitHub Actions CI/CD"]
        PR["PR: plan + comment"]
        ApplyDev["Merge: apply dev (auto)"]
        ApplyProd["Merge: apply prod (approval)"]
    end

    CICD -->|OIDC| GHARole
    GHARole -->|assume role| S3State
    GHARole -->|lock| DDB

    NATInst --> DevPrivRT
    DevPrivRT --> DevPrivA
    DevPrivRT --> DevPrivB

    NATGWA --> ProdPrivRTA
    ProdPrivRTA --> ProdPrivA
    NATGWB --> ProdPrivRTB
    ProdPrivRTB --> ProdPrivB

Network Topology

dev environment — optimized for cost:

VPC: 10.1.0.0/16 (us-east-1)
├── Public Subnet A  (10.1.0.0/24, us-east-1a)  ← NAT Instance + EIP
├── Public Subnet B  (10.1.1.0/24, us-east-1b)
├── Private Subnet A (10.1.10.0/24, us-east-1a) ← future EKS nodes
├── Private Subnet B (10.1.11.0/24, us-east-1b) ← future EKS nodes
├── 1 shared private route table → NAT Instance ENI
└── Internet Gateway

graph LR
    Internet((Internet))

    subgraph DevVPC["Dev VPC 10.1.0.0/16"]
        IGW[Internet Gateway]

        subgraph PubA["Public 10.1.0.0/24 — us-east-1a"]
            NAT["NAT Instance<br/>t3.micro<br/>EIP: allocated<br/>source_dest_check: false"]
        end

        subgraph PubB["Public 10.1.1.0/24 — us-east-1b"]
            PubBEmpty["(no NAT resource)"]
        end

        subgraph PrivA["Private 10.1.10.0/24 — us-east-1a"]
            PrivAWorkload["Future EKS Nodes<br/>k8s.io/role/internal-elb: 1"]
        end

        subgraph PrivB["Private 10.1.11.0/24 — us-east-1b"]
            PrivBWorkload["Future EKS Nodes<br/>k8s.io/role/internal-elb: 1"]
        end
    end

    Internet <-->|public traffic| IGW
    IGW -->|0.0.0.0/0| PubA
    IGW -->|0.0.0.0/0| PubB
    PrivAWorkload -->|0.0.0.0/0 via ENI| NAT
    PrivBWorkload -->|0.0.0.0/0 via ENI| NAT
    NAT -->|outbound| IGW

    FlowLogs["CloudWatch Logs<br/>MM-dev-use1-logs-flow<br/>14-day retention"]
    CT["CloudTrail<br/>MM-dev-use1-cloudtrail-core<br/>→ S3 (90-day lifecycle)"]

    DevVPC -.->|ALL traffic metadata| FlowLogs
    DevVPC -.->|API audit| CT
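In Terraform, the shared private route pointing at the NAT Instance's ENI might look something like the following (resource names are assumptions for illustration, not our actual code):

```hcl
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.core.id
  tags   = { Name = "MM-dev-use1-rt-private" }
}

# The default route egresses through the NAT Instance's ENI rather than
# a gateway — which is why source_dest_check must be disabled on the
# instance.
resource "aws_route" "private_egress" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  network_interface_id   = aws_instance.nat.primary_network_interface_id
}
```

Routing to an ENI (instead of a NAT Gateway ID) is the one structural difference between the dev and prod route tables.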

prod environment — optimized for availability:

VPC: 10.2.0.0/16 (us-east-1)
├── Public Subnet A  (10.2.0.0/24, us-east-1a)  ← NAT Gateway A + EIP
├── Public Subnet B  (10.2.1.0/24, us-east-1b)  ← NAT Gateway B + EIP
├── Private Subnet A (10.2.10.0/24, us-east-1a) ← future EKS nodes
├── Private Subnet B (10.2.11.0/24, us-east-1b) ← future EKS nodes
├── 2 per-AZ private route tables → respective NAT Gateways
└── Internet Gateway

graph LR
    Internet((Internet))

    subgraph ProdVPC["Prod VPC 10.2.0.0/16"]
        IGW[Internet Gateway]

        subgraph PubA["Public 10.2.0.0/24 — us-east-1a"]
            NATGWA["NAT Gateway A<br/>+ EIP"]
        end

        subgraph PubB["Public 10.2.1.0/24 — us-east-1b"]
            NATGWB["NAT Gateway B<br/>+ EIP"]
        end

        subgraph PrivA["Private 10.2.10.0/24 — us-east-1a"]
            PrivAWorkload["Future EKS Nodes<br/>k8s.io/role/internal-elb: 1"]
        end

        subgraph PrivB["Private 10.2.11.0/24 — us-east-1b"]
            PrivBWorkload["Future EKS Nodes<br/>k8s.io/role/internal-elb: 1"]
        end
    end

    Internet <-->|public traffic| IGW
    IGW -->|0.0.0.0/0| PubA
    IGW -->|0.0.0.0/0| PubB
    PrivAWorkload -->|0.0.0.0/0| NATGWA
    PrivBWorkload -->|0.0.0.0/0| NATGWB
    NATGWA -->|outbound| IGW
    NATGWB -->|outbound| IGW

    FlowLogs["CloudWatch Logs<br/>MM-prod-use1-logs-flow<br/>90-day retention"]
    CT["CloudTrail<br/>MM-prod-use1-cloudtrail-core<br/>→ S3 (365-day lifecycle)"]

    ProdVPC -.->|ALL traffic metadata| FlowLogs
    ProdVPC -.->|API audit| CT

The private subnets are where our EKS node groups will live in Part 2. They already carry the kubernetes.io/role/internal-elb tag so EKS will discover them automatically — no Terraform changes needed when we add the cluster.
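As a sketch, a private subnet carrying that tag might be declared like this (names and references illustrative):

```hcl
# Private subnet pre-tagged for EKS internal load balancer discovery.
resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.core.id
  cidr_block        = "10.1.10.0/24"
  availability_zone = "us-east-1a"

  tags = {
    Name                              = "MM-dev-use1-use1a-subnet-private"
    "kubernetes.io/role/internal-elb" = "1"
  }
}
```

Public subnets get the corresponding `kubernetes.io/role/elb` tag so internet-facing load balancers can be placed automatically as well.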

CIDR Allocation

graph TB
    subgraph CIDRs["IP Address Allocation"]
        subgraph Dev["Dev — 10.1.0.0/16"]
            D1["10.1.0.0/24 — Public A (us-east-1a)"]
            D2["10.1.1.0/24 — Public B (us-east-1b)"]
            D3["10.1.10.0/24 — Private A (us-east-1a)"]
            D4["10.1.11.0/24 — Private B (us-east-1b)"]
            D5["10.1.20.0/24+ — Reserved (future)"]
        end

        subgraph Prod["Prod — 10.2.0.0/16"]
            P1["10.2.0.0/24 — Public A (us-east-1a)"]
            P2["10.2.1.0/24 — Public B (us-east-1b)"]
            P3["10.2.10.0/24 — Private A (us-east-1a)"]
            P4["10.2.11.0/24 — Private B (us-east-1b)"]
            P5["10.2.20.0/24+ — Reserved (future)"]
        end
    end

    style Dev fill:#e6ffe6,stroke:#009900
    style Prod fill:#fff0e6,stroke:#cc6600

Resource Inventory

Resource                   dev   prod
VPC                          1     1
Subnets                      4     4
Internet Gateway             1     1
Elastic IPs                  1     2
NAT Instance (EC2)           1     —
NAT Gateways                 —     2
Route Tables                 2     3
Route Table Associations     4     4
VPC Flow Log                 1     1
CloudWatch Log Group         1     1
IAM Role (Flow Logs)         1     1
CloudTrail Trail             1     1
CloudTrail S3 Bucket         1     1

Naming Convention

Every resource follows MM-{env}-{region}-{resource}-{purpose}, with an additional {az} segment for AZ-scoped resources like subnets and NAT:

MM-dev-use1-vpc-core
MM-dev-use1-use1a-subnet-public
MM-prod-use1-use1b-nat-core
MM-dev-use1-rt-private

This makes resources instantly identifiable in the AWS console without checking tags.

Tagging

Twelve mandatory tags on every resource, enforced through a single local.common_tags block:

tags = merge(local.common_tags, {
  Name = "${local.name_prefix}-vpc-core"
})

Adding a new mandatory tag means editing one file. Every resource picks it up on the next apply.
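That single file boils down to a locals block along these lines (an abbreviated sketch — the real tag set has twelve entries, and the variable names are assumptions):

```hcl
locals {
  # Shared prefix used by every resource Name tag, e.g. "MM-dev-use1"
  name_prefix = "MM-${var.environment}-${var.region_short}"

  # Mandatory tags merged into every resource; only a few shown here.
  common_tags = {
    Project     = "mm-aws-network-infra"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```

Because every resource merges `local.common_tags`, the tag policy lives in exactly one place.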


The CI/CD Pipeline

The GitHub Actions workflow handles the full lifecycle:

flowchart LR
    A[Developer] -->|push branch| B[Pull Request]
    B -->|trigger| C{Plan Job}
    C -->|terraform init| D[fmt check]
    D --> E[validate]
    E --> F[plan dev.tfvars]
    F --> G[env tag guard]
    G --> H[Post plan as PR comment]

    B -->|merge to main| I{Apply Dev}
    I -->|OIDC auth| J[terraform apply dev.tfvars]
    J --> K[Idempotency check]
    K -->|success| L{Apply Prod}
    L -->|manual approval| M[terraform apply prod.tfvars]
    M --> N[Idempotency check]

    style C fill:#f9f,stroke:#333
    style I fill:#9f9,stroke:#333
    style L fill:#ff9,stroke:#333

Authentication uses OIDC — GitHub Actions assumes an IAM role via federated identity. No AWS access keys stored anywhere.
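The relevant workflow fragment looks roughly like this (role ARN and account ID are placeholders):

```yaml
permissions:
  id-token: write   # required for the job to request an OIDC token
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/MM-github-actions-role
      aws-region: us-east-1
```

The `id-token: write` permission is the piece most people miss; without it the action cannot mint the federated token.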

The cross-contamination guard is a jq filter that scans the plan JSON for any resource with an Environment tag that doesn't match the target environment. If dev resources somehow appear in a prod plan, the pipeline stops.
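A minimal sketch of such a guard (the variable names, plan file path, and exact filter are assumptions, not our pipeline verbatim):

```shell
# Guard against cross-environment contamination: fail if any planned
# resource carries an Environment tag that differs from the target.
TARGET_ENV="dev"

# Stand-in for the output of `terraform show -json tfplan`
cat > plan.json <<'EOF'
{
  "resource_changes": [
    {
      "address": "aws_vpc.core",
      "change": { "after": { "tags": { "Environment": "dev" } } }
    }
  ]
}
EOF

# Count resources whose Environment tag exists but mismatches the target
MISMATCHES=$(jq -r --arg env "$TARGET_ENV" '
  [ .resource_changes[]
    | select(.change.after.tags.Environment? != null)
    | select(.change.after.tags.Environment != $env)
  ] | length' plan.json)

if [ "$MISMATCHES" -gt 0 ]; then
  echo "Cross-environment resources detected: $MISMATCHES"
  exit 1
fi
echo "Environment tag guard passed"
```

A non-zero count fails the job before apply ever runs, which is exactly the failure mode you want for a tagging mistake.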


Lessons from the Build

1. The Amazon Linux NAT AMI is Gone

The classic amzn-ami-vpc-nat-* AMIs that everyone references in NAT Instance tutorials have been fully deprecated. Our first CI/CD run failed because the AMI data source returned zero results.

The fix: use Amazon Linux 2023 with a user data script that enables IP forwarding and configures iptables masquerade. It's a few lines of bash, and it works identically to the old NAT AMI.
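The user data amounts to something like this (an illustrative boot-time script, not our exact one; the interface name `ens5` and VPC CIDR are assumptions that must match your instance and network):

```bash
#!/bin/bash
# Runs at instance boot via cloud-init, as root.

# Enable IPv4 forwarding now and persist it across reboots
echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-nat.conf
sysctl -w net.ipv4.ip_forward=1

# Masquerade VPC-originated traffic out the primary interface
dnf install -y iptables
iptables -t nat -A POSTROUTING -o ens5 -s 10.1.0.0/16 -j MASQUERADE
```

That is the entire job the old NAT AMI did for you: forward packets and rewrite their source address.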

2. IAM Permissions Need Iteration

The Terraform AWS provider makes API calls you don't expect. Creating an S3 bucket requires not just s3:CreateBucket but also s3:GetBucketTagging, s3:GetBucketCORS, s3:GetBucketObjectLockConfiguration, and a dozen other Get* calls that the provider uses to read back the resource state.

We started with a scoped IAM policy and had to expand it twice before the apply succeeded. The lesson: either use a broad s3:Get* wildcard or be prepared to iterate on permissions as Terraform discovers what it needs.
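The broad-wildcard option looks like this as a policy statement (Sid and resource ARN pattern are illustrative):

```json
{
  "Sid": "TerraformS3Manage",
  "Effect": "Allow",
  "Action": [
    "s3:CreateBucket",
    "s3:PutBucketTagging",
    "s3:PutBucketVersioning",
    "s3:Get*"
  ],
  "Resource": "arn:aws:s3:::MM-*"
}
```

`s3:Get*` trades least-privilege purity for not chasing a dozen read-back actions; scoping the `Resource` to your naming prefix keeps the blast radius small.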

3. OIDC Trust Policy Scope Matters

Our initial trust policy restricted role assumption to refs/heads/main only. This is correct for apply jobs, but it means the plan job can't run on PR branches. We broadened the condition to repo:org/repo:* to allow PR workflows to authenticate.

In a production setup, you'd want separate roles — a read-only role for plan (broad trust) and a write role for apply (main-only trust). For this project, a single role with broad trust was the pragmatic choice.
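The broadened trust policy statement is roughly this shape (account ID is a placeholder; the `sub` pattern mirrors the one above):

```json
{
  "Effect": "Allow",
  "Principal": {
    "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
  },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringEquals": {
      "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
    },
    "StringLike": {
      "token.actions.githubusercontent.com:sub": "repo:org/repo:*"
    }
  }
}
```

Splitting into two roles just means two copies of this statement with different `sub` patterns: `repo:org/repo:*` for plan, `repo:org/repo:ref:refs/heads/main` for apply.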

4. Spec-Driven Development Pays Off for IaC

The spec workflow forced us to think through naming conventions, CIDR allocation, and environment isolation before writing any Terraform. When implementation started, there were no "what should I name this?" or "which CIDR should I use?" decisions left — they were all in the design doc.

This matters more for infrastructure than application code. A renamed variable is a refactor. A changed CIDR block is a destroy-and-recreate.


Repository Structure

.
├── .github/workflows/terraform.yaml    # CI/CD pipeline
├── .kiro/specs/mm-aws-network-infra/   # Spec documents
│   ├── requirements.md
│   ├── design.md
│   └── tasks.md
├── docs/post-apply-checklist.md        # Verification runbook
├── specs/
│   ├── phase0-state.yaml               # Bootstrap intent
│   └── phase1-network.yaml             # Network intent
└── terraform/
    ├── bootstrap/                      # Phase 0 (local state)
    │   ├── main.tf
    │   ├── variables.tf
    │   ├── outputs.tf
    │   └── providers.tf
    └── network/                        # Phase 1 (S3 remote state)
        ├── backend.tf
        ├── locals.tf                   # common_tags + naming helpers
        ├── vpc.tf
        ├── subnets.tf
        ├── igw.tf
        ├── eip.tf
        ├── nat_gateway.tf              # prod only
        ├── nat_instance.tf             # dev only
        ├── route_tables.tf
        ├── flow_logs.tf
        ├── cloudtrail.tf
        └── envs/
            ├── dev.tfvars
            └── prod.tfvars

What's Next: Part 2 — Multi-Account Landing Zone

The single-account setup got us running fast, but it won't scale. In Part 2, we're migrating to a proper multi-account architecture:

  • AWS Organizations and Control Tower — three accounts (Management, Dev, Prod) with centralized governance
  • Service Control Policies — region restrictions, GPU instance size caps, encryption enforcement, no static IAM credentials
  • Tailscale hybrid networking — connecting on-prem NVIDIA GPU servers to AWS private subnets without traditional VPN hardware
  • Centralized security — organization-wide CloudTrail, GuardDuty, Security Hub, and AWS Config
  • Cost controls — billing alerts, budgets, anomaly detection, and right-sizing defaults for a startup budget
  • Cross-region DR readiness — backup replication to us-west-2, DR VPC pre-provisioned for prod

The network foundation from Part 1 carries forward — same CIDRs, same naming conventions, same tagging. We're just putting proper account boundaries and governance around it.


Try It Yourself

The entire infrastructure was built in a single Kiro session — from rough idea to deployed resources in both environments. The spec-driven workflow meant we never had to backtrack on a design decision, and the task list gave us a clear path from empty repo to running infrastructure.

If you're building ML infrastructure and want to skip the weeks of Terraform trial-and-error, give Kiro a try. It won't replace your judgment on architecture decisions, but it will write the Terraform so you can focus on the decisions that matter.


This is Part 1 of the MikaMirAI ML Infrastructure series. Part 2 covers the multi-account landing zone, Well-Architected Framework adoption, and hybrid networking with on-prem AI resources.