Tags: ml-infra · AWS · Terraform · Kiro · ML Infrastructure · Organizations · Control Tower · Tailscale · IaC

Building Production ML Infrastructure on AWS with Kiro: Part 2 — Multi-Account Landing Zone

Migrating from a single-account setup to a proper multi-account AWS landing zone with Organizations, Control Tower, Tailscale hybrid networking, centralized security, and cost guardrails — all built spec-first with Kiro.



Part 2 of a series on building production-grade infrastructure for deploying custom Random Forest and LSTM models on AWS.


Where We Left Off

In Part 1, we built the network foundation — VPCs, subnets, NAT egress, route tables, flow logs, CloudTrail, and a CI/CD pipeline — all in a single AWS account using Kiro's spec-driven workflow. Two environments deployed, both passing idempotency checks, in one session.

That got us running. But a single AWS account with two VPCs is a prototype topology, not a production one. When you're building infrastructure to serve ML models — especially with on-prem GPU resources in the mix — you need proper account boundaries, centralized governance, and cost controls that catch problems before they drain your runway.

Part 2 is about growing up without over-engineering. We're a small startup. Every dollar matters. But so does not getting breached, not losing data, and not waking up to a $10,000 bill because someone launched a p3.16xlarge in the wrong account.


Motivation

Why Multi-Account Now?

The single-account approach from Part 1 has real limitations:

  • Blast radius — a misconfigured IAM policy in dev can affect prod resources in the same account
  • Billing visibility — you can't see dev vs prod costs without meticulous tagging discipline
  • Compliance — auditors want to see environment isolation at the account level, not just the VPC level
  • Credential scope — a compromised dev credential has access to prod infrastructure

AWS Organizations with Control Tower gives us account-level isolation with centralized governance. Three accounts (Management, Dev, Prod) is the minimum viable multi-account setup for a startup.
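
A minimal sketch of how that looks in Terraform: the Organization, a Workloads OU, and the two member accounts. In practice Control Tower's Account Factory provisions the member accounts, so the raw Organizations resources below are illustrative, and the account names and email addresses are placeholders.

```hcl
# Sketch only: Organization + Workloads OU + member accounts.
# Control Tower's Account Factory normally creates the accounts;
# names and emails here are placeholders.
resource "aws_organizations_organization" "this" {
  feature_set = "ALL"

  aws_service_access_principals = [
    "cloudtrail.amazonaws.com",
    "config.amazonaws.com",
    "guardduty.amazonaws.com",
    "sso.amazonaws.com",
  ]

  enabled_policy_types = ["SERVICE_CONTROL_POLICY"]
}

resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads"
  parent_id = aws_organizations_organization.this.roots[0].id
}

resource "aws_organizations_account" "dev" {
  name      = "dev"
  email     = "aws-dev@example.com" # placeholder
  parent_id = aws_organizations_organizational_unit.workloads.id
}

resource "aws_organizations_account" "prod" {
  name      = "prod"
  email     = "aws-prod@example.com" # placeholder
  parent_id = aws_organizations_organizational_unit.workloads.id
}
```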

Why Tailscale?

We have on-prem NVIDIA GPU servers for model training. The traditional AWS options (Site-to-Site VPN or Direct Connect) start at roughly $36–73/month and require static public IPs or dedicated hardware. Tailscale gives us WireGuard-based mesh networking for $0 (the free tier covers a team our size) with no VPN appliances to manage. A t3.micro subnet router in each VPC advertises routes to the Tailscale network, so our on-prem GPUs can reach AWS private subnets as if they were on the same network.

Why Well-Architected Now?

It's tempting to defer governance until you're bigger. But the cost of retrofitting security, reliability, and cost controls into an existing architecture is 10x the cost of building them in from the start. The Well-Architected Framework gives us a checklist, not a burden — and most of the controls (SCPs, GuardDuty, Config rules, billing alerts) are either free or pennies/month at startup scale.


Desired Outcome

By the end of Part 2, we want:

  1. Three isolated AWS accounts — Management (governance only), Dev (development workloads), Prod (production workloads) — with Control Tower guardrails enforced
  2. Centralized security — one CloudTrail trail for all accounts, GuardDuty threat detection everywhere, Security Hub aggregating findings in one dashboard
  3. Hybrid connectivity — on-prem AI resources reaching AWS private subnets via Tailscale, no VPN hardware, no public endpoints
  4. Cost guardrails — billing alerts at $100/$200/$500 per account, anomaly detection, GPU instance size caps via SCP
  5. DR readiness — cross-region backup replication for prod, a pre-provisioned DR VPC in us-west-2, documented failover runbook
  6. Private AWS API access — VPC endpoints for S3, ECR, STS, CloudWatch, Secrets Manager, and EKS so traffic never leaves the AWS backbone
  7. AI model security — encrypted model artifact storage, scoped IAM for GPU nodes, inference endpoint security groups
  8. Single sign-on — IAM Identity Center with role-based access (platform engineers, developers, auditors) across all accounts
  9. Everything as code — Terraform modules for each concern, multi-account CI/CD pipeline, consistent naming and tagging from Part 1

What We're Building

Account Structure

graph TB
    subgraph Org["AWS Organization"]
        MgmtAcct["Management Account<br/>━━━━━━━━━━━━━━━━━━<br/>• AWS Organizations<br/>• Control Tower<br/>• IAM Identity Center (SSO)<br/>• Consolidated Billing<br/>• Org CloudTrail<br/>• Config Aggregator<br/>• GuardDuty Administrator<br/>• Security Hub<br/>• Cost Anomaly Detection<br/>• AWS Budgets<br/>━━━━━━━━━━━━━━━━━━<br/>No workloads"]

        subgraph WorkloadsOU["OU: Workloads"]
            DevAcct["Dev Account<br/>━━━━━━━━━━━━━━━━━━<br/>• VPC 10.1.0.0/16 (us-east-1)<br/>• NAT Instance (t3.micro)<br/>• Tailscale Subnet Router<br/>• VPC Endpoints<br/>• AWS Backup (7d/30d)<br/>• GuardDuty member<br/>• Model Artifact S3 (CMK)<br/>• Budget: $100/mo"]

            ProdAcct["Prod Account<br/>━━━━━━━━━━━━━━━━━━<br/>• VPC 10.2.0.0/16 (us-east-1)<br/>• VPC 10.3.0.0/16 (us-west-2 DR)<br/>• 2x NAT Gateway<br/>• Tailscale Subnet Router<br/>• VPC Endpoints<br/>• AWS Backup (30d/90d + DR)<br/>• GuardDuty member<br/>• Model Artifact S3 (CMK)<br/>• Budget: $500/mo"]
        end

        subgraph SecurityOU["OU: Security (reserved)"]
            Future["Future Security Account"]
        end
    end

    subgraph SCPs["Service Control Policies"]
        SCP1["Deny regions outside us-east-1, us-west-2"]
        SCP2["Deny CloudTrail deletion"]
        SCP3["Deny GuardDuty disable"]
        SCP4["Deny IAM user creation"]
        SCP5["Deny unencrypted S3"]
        SCP6["Deny GPU larger than p3.2xlarge / g5.xlarge"]
        SCP7["Deny state bucket modification"]
    end

    SCPs -->|attached to| WorkloadsOU

    style MgmtAcct fill:#1a2a4a,stroke:#4a7abf
    style DevAcct fill:#1a3a1a,stroke:#4abf4a
    style ProdAcct fill:#3a2a1a,stroke:#bf7a4a
    style Future fill:#2a2a2a,stroke:#666,stroke-dasharray: 5 5

The Management Account runs no workloads — only Organizations, Control Tower, IAM Identity Center, consolidated billing, and centralized security services. This is a hard boundary enforced by SCP.

Service Control Policies

SCPs are the guardrails that prevent expensive mistakes:

| SCP | What it prevents |
| --- | --- |
| Region restriction | Using any region except us-east-1 and us-west-2 |
| CloudTrail protection | Deleting or modifying audit trails in member accounts |
| GuardDuty protection | Disabling threat detection in member accounts |
| No static credentials | Creating IAM users with passwords or access keys |
| Encryption enforcement | Creating S3 buckets without server-side encryption |
| GPU instance cap | Launching instances larger than p3.2xlarge or g5.xlarge |
| State protection | Modifying Terraform state resources outside CI/CD |

The GPU instance cap is startup-specific — it prevents a p3.16xlarge ($24.48/hr) from running when a p3.2xlarge ($3.06/hr) is what we actually need.
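
Here's roughly what that cap looks like as an SCP, sketched in Terraform. The GPU families in the patterns and the attachment to the Workloads OU are assumptions based on the table above; adjust them to the families you actually run.

```hcl
# Sketch: deny RunInstances for GPU types other than the approved sizes.
# The deny fires only when both conditions match: the type is in a GPU
# family AND it is not one of the allowed sizes.
data "aws_iam_policy_document" "gpu_cap" {
  statement {
    sid       = "DenyLargeGpuInstances"
    effect    = "Deny"
    actions   = ["ec2:RunInstances"]
    resources = ["arn:aws:ec2:*:*:instance/*"]

    condition {
      test     = "StringLike"
      variable = "ec2:InstanceType"
      values   = ["p3.*", "p4d.*", "g5.*"] # GPU families we care about (assumption)
    }

    condition {
      test     = "StringNotEquals"
      variable = "ec2:InstanceType"
      values   = ["p3.2xlarge", "g5.xlarge"] # the approved sizes
    }
  }
}

resource "aws_organizations_policy" "gpu_cap" {
  name    = "deny-large-gpu-instances"
  type    = "SERVICE_CONTROL_POLICY"
  content = data.aws_iam_policy_document.gpu_cap.json
}

resource "aws_organizations_policy_attachment" "gpu_cap_workloads" {
  policy_id = aws_organizations_policy.gpu_cap.id
  target_id = aws_organizations_organizational_unit.workloads.id
}
```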

Centralized Security

graph TB
    subgraph DevAcct["Dev Account"]
        DevGD["GuardDuty<br/>(member)"]
        DevConfig["AWS Config<br/>(recording)"]
        DevCT["CloudTrail<br/>(org trail member)"]
    end

    subgraph ProdAcct["Prod Account"]
        ProdGD["GuardDuty<br/>(member)"]
        ProdConfig["AWS Config<br/>(recording)"]
        ProdCT["CloudTrail<br/>(org trail member)"]
    end

    subgraph MgmtAcct["Management Account"]
        OrgTrail["Organization CloudTrail<br/>→ S3 (SSE, public blocked)<br/>Log validation enabled<br/>Multi-region"]
        ConfigAgg["Config Aggregator<br/>Rules: encryption, SSH,<br/>flow logs, IAM hygiene"]
        GDAdmin["GuardDuty Administrator<br/>S3 + EKS protection<br/>us-east-1 + us-west-2"]
        SecHub["Security Hub<br/>AWS Foundational Security<br/>Best Practices"]
        SNS["SNS: Platform Alerts<br/>→ Email + integrations"]
    end

    DevGD -->|findings| GDAdmin
    ProdGD -->|findings| GDAdmin
    DevConfig -->|compliance| ConfigAgg
    ProdConfig -->|compliance| ConfigAgg
    DevCT -->|logs| OrgTrail
    ProdCT -->|logs| OrgTrail

    GDAdmin -->|HIGH/CRITICAL| SecHub
    ConfigAgg -->|non-compliant| SecHub
    GDAdmin -->|HIGH/CRITICAL| SNS
    SecHub -->|CRITICAL| SNS
    ConfigAgg -->|non-compliant| SNS

    style MgmtAcct fill:#1a2a4a,stroke:#4a7abf
    style DevAcct fill:#1a3a1a,stroke:#4abf4a
    style ProdAcct fill:#3a2a1a,stroke:#bf7a4a

Per-account CloudTrail trails from Part 1 get disabled — the organization trail captures everything in one place, cutting duplicate storage costs.
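
The organization trail itself is only a few lines of Terraform in the management account. A sketch, assuming the log bucket (SSE enabled, public access blocked) and its CloudTrail bucket policy are defined alongside it:

```hcl
# Sketch: one multi-region organization trail in the management account.
# The log bucket resource and its CloudTrail bucket policy are assumed
# to be defined elsewhere in the security-services module.
resource "aws_cloudtrail" "org" {
  name                          = "org-trail"
  s3_bucket_name                = aws_s3_bucket.org_trail_logs.id
  is_organization_trail         = true
  is_multi_region_trail         = true
  enable_log_file_validation    = true
  include_global_service_events = true
}
```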

Tailscale Hybrid Networking

graph LR
    subgraph OnPrem["On-Prem AI Lab"]
        GPU1["GPU Server 1<br/>NVIDIA A100<br/>Tailscale client"]
        GPU2["GPU Server 2<br/>NVIDIA A100<br/>Tailscale client"]
    end

    subgraph TailscaleCP["Tailscale Coordination"]
        Coord["Tailscale Control Plane<br/>ACLs enforce:<br/>• Private subnets only<br/>• Ports 443, 50051 only"]
    end

    subgraph DevVPC["Dev VPC 10.1.0.0/16"]
        DevTSR["Tailscale Subnet Router<br/>t3.micro · Private Subnet<br/>ASG min=1 max=1<br/>Auth: Secrets Manager<br/>Advertises: 10.1.0.0/16"]
        DevNAT["NAT Instance<br/>(outbound for TS registration)"]
        DevWorkload["Future: EKS Pods<br/>Model Inference<br/>:443 :50051"]
    end

    subgraph ProdVPC["Prod VPC 10.2.0.0/16"]
        ProdTSR["Tailscale Subnet Router<br/>t3.micro · Private Subnet<br/>ASG min=1 max=1<br/>Auth: Secrets Manager<br/>Advertises: 10.2.0.0/16"]
        ProdNATGW["NAT Gateway<br/>(outbound for TS registration)"]
        ProdWorkload["Future: EKS Pods<br/>Model Inference<br/>:443 :50051"]
    end

    GPU1 <-->|WireGuard UDP 41641| Coord
    GPU2 <-->|WireGuard UDP 41641| Coord
    DevTSR <-->|WireGuard UDP 41641| Coord
    ProdTSR <-->|WireGuard UDP 41641| Coord

    GPU1 -.->|10.1.x.x:443| DevWorkload
    GPU1 -.->|10.2.x.x:50051| ProdWorkload
    GPU2 -.->|10.1.x.x:50051| DevWorkload

    DevTSR -->|outbound registration| DevNAT
    ProdTSR -->|outbound registration| ProdNATGW

    style OnPrem fill:#3a1a1a,stroke:#bf4a4a
    style TailscaleCP fill:#1a1a3a,stroke:#4a4abf

Each subnet router advertises its VPC CIDR to the Tailscale network. On-prem GPUs see AWS private IPs as directly routable. Auth keys live in Secrets Manager. If a router dies, the ASG replaces it automatically.

Tailscale ACLs restrict on-prem access to private subnet CIDRs on ports 443 (HTTPS) and 50051 (gRPC) — just enough for model inference, nothing more.
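
To make the router concrete, here's a sketch of the launch template and single-instance ASG. The AMI, subnet, instance profile, and Secrets Manager secret name are placeholders for values the networking module would define, and the instance role needs permission to read that secret.

```hcl
# Sketch: Tailscale subnet router on a t3.micro in a private subnet.
# AMI, subnet, instance profile, and secret name are placeholders.
resource "aws_launch_template" "tailscale_router" {
  name_prefix   = "tailscale-router-"
  image_id      = data.aws_ami.al2023.id # placeholder AMI lookup
  instance_type = "t3.micro"

  iam_instance_profile {
    name = aws_iam_instance_profile.tailscale_router.name
  }

  user_data = base64encode(<<-EOT
    #!/bin/bash
    set -euo pipefail
    # Forward traffic between the tailnet and the VPC CIDR
    echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-tailscale.conf
    sysctl -p /etc/sysctl.d/99-tailscale.conf
    curl -fsSL https://tailscale.com/install.sh | sh
    # Auth key lives in Secrets Manager; secret name is a placeholder
    AUTH_KEY=$(aws secretsmanager get-secret-value \
      --secret-id tailscale/auth-key --query SecretString --output text)
    tailscale up --authkey "$AUTH_KEY" --advertise-routes=10.1.0.0/16
  EOT
  )
}

resource "aws_autoscaling_group" "tailscale_router" {
  name                = "tailscale-router"
  min_size            = 1
  max_size            = 1
  desired_capacity    = 1
  vpc_zone_identifier = [aws_subnet.private_a.id] # placeholder subnet

  launch_template {
    id      = aws_launch_template.tailscale_router.id
    version = "$Latest"
  }
}
```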

VPC Private Endpoints

| Endpoint Type | Service | Why |
| --- | --- | --- |
| Gateway (free) | S3 | Model artifact downloads, state bucket access |
| Gateway (free) | DynamoDB | State locking |
| Interface | ECR (api + dkr) | Container image pulls without NAT |
| Interface | STS | IAM role assumption for pods |
| Interface | CloudWatch Logs | Log delivery from private subnets |
| Interface | Secrets Manager | Tailscale auth keys, app secrets |
| Interface | EKS, EKS-Auth | Future private EKS API server access |

Gateway endpoints are free. Interface endpoints cost ~$7.20/month each, so we're selective — only services that see high traffic or handle sensitive data.
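
A sketch of the two endpoint flavors in Terraform: one free gateway endpoint attached to the private route table, and one interface endpoint with private DNS. The VPC, subnets, route table, and shared endpoint security group are placeholders from the networking module.

```hcl
# Sketch: gateway endpoint (free) vs interface endpoint (roughly $7.20/mo).
# VPC, route table, subnets, and security group references are placeholders.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.this.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.this.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```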

Cost Controls

graph TB
    subgraph MgmtAcct["Management Account"]
        Budgets["AWS Budgets"]
        Anomaly["Cost Anomaly Detection"]
        SNS["SNS: Cost Alerts"]

        BudgetDev["Dev Budget: $100/mo<br/>Alerts: 50% · 80% · 100%<br/>Forecast: 90%"]
        BudgetMgmt["Mgmt Budget: $200/mo<br/>Alerts: 50% · 80% · 100%<br/>Forecast: 90%"]
        BudgetProd["Prod Budget: $500/mo<br/>Alerts: 50% · 80% · 100%<br/>Forecast: 90%"]

        AnomalyRule["Anomaly Monitor<br/>By service + by account<br/>Threshold: $10 above baseline"]
    end

    subgraph SCPs["SCP Cost Guards"]
        GPUCap["GPU Instance Cap<br/>Max: p3.2xlarge / g5.xlarge"]
        RegionLock["Region Lock<br/>Only: us-east-1, us-west-2"]
    end

    subgraph DevAcct["Dev Account — Cost Optimizations"]
        NATInst["NAT Instance t3.micro<br/>vs NAT Gateway ($32/mo savings)"]
        ShortLogs["Short log retention<br/>Flow: 14d · CloudTrail: 90d"]
        GWEndpoints["Gateway Endpoints (free)<br/>S3 · DynamoDB"]
        ScheduledStop["Tailscale Router<br/>Scheduled stop 8PM-8AM"]
        PayPerReq["DynamoDB PAY_PER_REQUEST"]
    end

    Budgets --> BudgetDev
    Budgets --> BudgetMgmt
    Budgets --> BudgetProd
    Anomaly --> AnomalyRule

    BudgetDev -->|threshold breach| SNS
    BudgetMgmt -->|threshold breach| SNS
    BudgetProd -->|threshold breach| SNS
    AnomalyRule -->|anomaly detected| SNS

    SCPs -->|enforced on| DevAcct

    style MgmtAcct fill:#1a2a4a,stroke:#4a7abf
    style DevAcct fill:#1a3a1a,stroke:#4abf4a
    style SCPs fill:#3a1a1a,stroke:#bf4a4a
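
A sketch of one of those budgets in Terraform: the dev account's $100/month budget with actual-spend alerts at 50, 80, and 100 percent and a forecast alert at 90 percent. The SNS topic and the account ID are placeholders.

```hcl
# Sketch: dev account budget with 50/80/100% actual alerts + 90% forecast.
# SNS topic and account ID are placeholders.
resource "aws_budgets_budget" "dev" {
  name         = "dev-monthly"
  budget_type  = "COST"
  limit_amount = "100"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "LinkedAccount"
    values = ["222222222222"] # dev account ID (placeholder)
  }

  dynamic "notification" {
    for_each = [50, 80, 100]
    content {
      comparison_operator       = "GREATER_THAN"
      notification_type         = "ACTUAL"
      threshold                 = notification.value
      threshold_type            = "PERCENTAGE"
      subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
    }
  }

  notification {
    comparison_operator       = "GREATER_THAN"
    notification_type         = "FORECASTED"
    threshold                 = 90
    threshold_type            = "PERCENTAGE"
    subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
  }
}
```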

Backup and DR

graph LR
    subgraph DevAcct["Dev Account — us-east-1"]
        DevVault["Backup Vault<br/>AWS-managed KMS"]
        DevDaily["Daily Backup<br/>Retain: 7 days"]
        DevWeekly["Weekly Backup<br/>Retain: 30 days"]
        DevResources["Tagged Resources<br/>BackupPolicy: daily"]
    end

    subgraph ProdAcct["Prod Account"]
        subgraph ProdEast["us-east-1 (Primary)"]
            ProdVault["Backup Vault<br/>AWS-managed KMS"]
            ProdDaily["Daily Backup<br/>Retain: 30 days"]
            ProdWeekly["Weekly Backup<br/>Retain: 90 days"]
            ProdResources["Tagged Resources<br/>BackupPolicy: daily"]
            ProdVPC["Prod VPC<br/>10.2.0.0/16"]
        end

        subgraph ProdWest["us-west-2 (DR)"]
            DRVault["DR Backup Vault<br/>Cross-region copy"]
            DRVPC["DR VPC<br/>10.3.0.0/16<br/>Same subnet topology"]
            DRRunbook["DR Runbook<br/>Failover · DNS · Restore"]
        end
    end

    DevResources -->|backup| DevDaily
    DevResources -->|backup| DevWeekly
    DevDaily --> DevVault
    DevWeekly --> DevVault

    ProdResources -->|backup| ProdDaily
    ProdResources -->|backup| ProdWeekly
    ProdDaily --> ProdVault
    ProdWeekly --> ProdVault

    ProdVault -->|cross-region replication| DRVault
    ProdVPC -.->|failover ready| DRVPC

    style DevAcct fill:#1a3a1a,stroke:#4abf4a
    style ProdEast fill:#3a2a1a,stroke:#bf7a4a
    style ProdWest fill:#3a2a1a,stroke:#bf7a4a,stroke-dasharray: 5 5

Backup selection is tag-based — tag a resource with BackupPolicy = "daily" and it's automatically included. No manual snapshot management.
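
A sketch of that tag-based selection in Terraform, using the dev retention numbers; the vault and the backup IAM role are assumed to be defined elsewhere in the backup module.

```hcl
# Sketch: daily plan + tag-based selection (dev retention shown).
# The prod plan would add a copy_action to the us-west-2 DR vault.
resource "aws_backup_plan" "daily" {
  name = "daily"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 5 * * ? *)" # 05:00 UTC every day

    lifecycle {
      delete_after = 7 # dev: 7 days; prod: 30 days
    }
  }
}

resource "aws_backup_selection" "daily_by_tag" {
  name         = "daily-by-tag"
  plan_id      = aws_backup_plan.daily.id
  iam_role_arn = aws_iam_role.backup.arn

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "BackupPolicy"
    value = "daily"
  }
}
```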

IAM Identity Center

| Group | Permission Set | Accounts |
| --- | --- | --- |
| PlatformEngineers | AdministratorAccess | All three |
| Developers | DeveloperAccess (no IAM/Orgs/billing) | Dev only |
| Auditors | ReadOnlyAccess | All three |

No static IAM credentials anywhere. Everyone authenticates through the SSO portal. Sessions expire after 8 hours.
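
A sketch of one group's wiring in Terraform: the Developers permission set with an 8-hour session, assigned to the Dev account. The managed policy shown here is only illustrative; the real DeveloperAccess set would attach a custom policy that excludes IAM, Organizations, and billing. Group and account IDs are placeholders.

```hcl
# Sketch: Developers permission set (8h sessions) assigned to the Dev account.
# The attached policy is illustrative; group and account IDs are placeholders.
data "aws_ssoadmin_instances" "this" {}

resource "aws_ssoadmin_permission_set" "developer" {
  name             = "DeveloperAccess"
  instance_arn     = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  session_duration = "PT8H"
}

resource "aws_ssoadmin_managed_policy_attachment" "developer" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.developer.arn
  managed_policy_arn = "arn:aws:iam::aws:policy/PowerUserAccess" # illustrative stand-in
}

resource "aws_ssoadmin_account_assignment" "developers_dev" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.developer.arn
  principal_id       = "0000-placeholder-group-id" # Developers group in the Identity Store
  principal_type     = "GROUP"
  target_id          = "222222222222" # dev account ID (placeholder)
  target_type        = "AWS_ACCOUNT"
}
```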


Full Architecture

graph TB
    subgraph OnPrem["On-Prem AI Lab"]
        GPU["NVIDIA GPU Servers<br/>Model Training"]
    end

    subgraph GitHub["GitHub"]
        GHA["GitHub Actions<br/>OIDC Authentication"]
    end

    subgraph MgmtAcct["Management Account"]
        OrgTrail["Org CloudTrail → S3"]
        ConfigAgg["Config Aggregator"]
        GDAdmin["GuardDuty Admin"]
        SecHub["Security Hub"]
        SSO["IAM Identity Center<br/>PlatformEngineers | Developers | Auditors"]
        Budgets["AWS Budgets<br/>$100 dev | $200 mgmt | $500 prod"]
        Anomaly["Cost Anomaly Detection"]
        SNSMgmt["SNS: Platform Alerts"]
        StateBucket["S3: Terraform State<br/>Cross-region replication → us-west-2"]
        LockTable["DynamoDB: State Lock"]
    end

    subgraph DevAcct["Dev Account — us-east-1"]
        subgraph DevVPC["VPC 10.1.0.0/16"]
            DevPriv["Private Subnets<br/>10.1.10.0/24 · 10.1.11.0/24"]
            DevNAT["NAT Instance t3.micro + EIP"]
            DevTS["Tailscale Subnet Router<br/>Advertises 10.1.0.0/16"]
            DevVPCE["VPC Endpoints<br/>S3 · DynamoDB · ECR · STS · Logs"]
        end
        DevModels["S3: Model Artifacts (CMK)"]
    end

    subgraph ProdAcct["Prod Account — us-east-1"]
        subgraph ProdVPC["VPC 10.2.0.0/16"]
            ProdPriv["Private Subnets<br/>10.2.10.0/24 · 10.2.11.0/24"]
            ProdNAT["2x NAT Gateway + 2 EIPs"]
            ProdTS["Tailscale Subnet Router<br/>Advertises 10.2.0.0/16"]
            ProdVPCE["VPC Endpoints<br/>S3 · DynamoDB · ECR · STS · Logs"]
        end
        subgraph DRVPC["DR VPC 10.3.0.0/16 — us-west-2"]
            DRPriv["Private Subnets (pre-provisioned)"]
        end
        ProdModels["S3: Model Artifacts (CMK)"]
    end

    GPU <-->|Tailscale WireGuard| DevTS
    GPU <-->|Tailscale WireGuard| ProdTS

    GHA -->|OIDC| MgmtAcct
    GHA -->|OIDC| DevAcct
    GHA -->|OIDC| ProdAcct

    DevAcct -.->|findings| GDAdmin
    ProdAcct -.->|findings| GDAdmin
    DevAcct -.->|config| ConfigAgg
    ProdAcct -.->|config| ConfigAgg
    GDAdmin -.->|HIGH/CRITICAL| SNSMgmt
    Budgets -.->|threshold alerts| SNSMgmt
    Anomaly -.->|anomaly alerts| SNSMgmt

    style MgmtAcct fill:#1a2a4a,stroke:#4a7abf
    style DevAcct fill:#1a3a1a,stroke:#4abf4a
    style ProdAcct fill:#3a2a1a,stroke:#bf7a4a
    style OnPrem fill:#3a1a1a,stroke:#bf4a4a
    style DRVPC fill:#3a2a1a,stroke:#bf7a4a,stroke-dasharray: 5 5

Benefits

For the Engineering Team

  • One portal, all accounts — SSO eliminates credential juggling and reduces the risk of leaked access keys
  • Guardrails, not gates — SCPs prevent dangerous actions without requiring approval workflows that slow down development
  • On-prem GPU access — Tailscale connects the AI lab to AWS without VPN hardware or networking expertise

For the Business

  • Cost visibility — per-account budgets and anomaly detection catch spending problems in hours, not at month-end
  • Audit readiness — centralized CloudTrail, Config, and Security Hub provide the compliance evidence investors and customers ask for
  • DR capability — cross-region backups and a pre-provisioned DR VPC mean production can survive a regional outage

For the ML Platform

  • Model security — encrypted artifact storage with scoped IAM means models are protected from unauthorized access and exfiltration
  • Private connectivity — VPC endpoints and Tailscale keep model traffic off the public internet
  • GPU cost control — SCP-enforced instance caps prevent accidental $24/hr GPU bills

Migration from Part 1

The existing single-account infrastructure doesn't get thrown away. The migration strategy:

  1. Create the Organization and member accounts via Control Tower
  2. Use Terraform state move operations to transfer VPC resources to the new accounts
  3. Preserve existing CIDRs (10.1.0.0/16 → Dev Account, 10.2.0.0/16 → Prod Account)
  4. Replace per-account CloudTrail trails with the organization trail
  5. Update the CI/CD pipeline for multi-account OIDC authentication
  6. Zero downtime throughout — no resources destroyed and recreated

Terraform Module Structure

terraform/
├── organization/          # AWS Organizations, Control Tower, SCPs
├── account-baseline/      # Per-account baseline (Config, GuardDuty enrollment)
├── security-services/     # Security Hub, org CloudTrail, Config aggregator
├── networking/            # VPC endpoints, Tailscale subnet routers
├── backup/                # AWS Backup plans, vaults, cross-region replication
├── cost-management/       # Budgets, anomaly detection, Cost Explorer
└── identity/              # IAM Identity Center, permission sets, groups

Each module has its own state file keyed by {module}/{account}/{env}/terraform.tfstate. A failure in one module doesn't affect others.
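
Each module's backend block follows that key convention. A sketch for the networking module in the dev account, with placeholder bucket and lock-table names:

```hcl
# Sketch: backend for the networking module in the dev account.
# Bucket and table names are placeholders for the management-account
# state resources; the key follows {module}/{account}/{env}.
terraform {
  backend "s3" {
    bucket         = "mikamirai-terraform-state" # placeholder
    key            = "networking/dev/dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock" # placeholder
    encrypt        = true
  }
}
```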

Multi-Account CI/CD Pipeline

flowchart TB
    Dev[Developer] -->|push branch| PR[Pull Request]
    PR -->|trigger| Plan

    subgraph Plan["Plan Job (PR)"]
        P1[checkout] --> P2[setup terraform]
        P2 --> P3[OIDC auth → each account]
        P3 --> P4[validate + fmt all modules]
        P4 --> P5[plan all affected modules]
        P5 --> P6[post plans as PR comment]
    end

    PR -->|merge to main| Apply

    subgraph Apply["Apply Jobs (sequential)"]
        A1["Apply: Management Account<br/>organization · security-services<br/>cost-management · identity"]
        A1 -->|success| A2["Apply: Dev Account<br/>account-baseline · networking · backup"]
        A2 -->|success| A3{"Apply: Prod Account<br/>Manual Approval Required"}
        A3 -->|approved| A4["Apply: Prod Account<br/>account-baseline · networking · backup"]
    end

    subgraph Auth["OIDC Authentication"]
        R1["IAM Role: Management"]
        R2["IAM Role: Dev"]
        R3["IAM Role: Prod"]
    end

    A1 -.->|assume| R1
    A2 -.->|assume| R2
    A4 -.->|assume| R3

    style A3 fill:#5a4a00,stroke:#ff9
    style Plan fill:#1a1a2a,stroke:#666
    style Apply fill:#1a2a1a,stroke:#666
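
The OIDC trust behind those role assumptions is a small amount of Terraform repeated per account. A sketch with a placeholder repository path and role name:

```hcl
# Sketch: GitHub OIDC provider + deploy role trust policy for one account.
# Repository path, role name, and thumbprint are placeholders; verify them
# before use.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"] # verify current value
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:mikamirai/ml-infra:*"] # placeholder repo; tighten to branches/environments
    }
  }
}

resource "aws_iam_role" "github_terraform" {
  name               = "github-actions-terraform"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```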

What's Next: Part 3 — EKS with NVIDIA GPU Nodes

With the multi-account foundation in place, Part 3 will deploy the compute layer:

  • EKS clusters in Dev and Prod accounts on the private subnets from Part 1
  • NVIDIA GPU node groups (p3.2xlarge / g5.xlarge) for Random Forest and LSTM model inference
  • GPU device plugin and NVIDIA container runtime
  • Private API server endpoint via the EKS VPC endpoints provisioned in Part 2
  • Model serving pipeline — pull artifacts from the encrypted S3 bucket, serve via gRPC on port 50051

The subnets are tagged. The endpoints are in place. The security groups are defined. The model artifact buckets are encrypted. Part 3 is where the models start running.


This is Part 2 of the MikaMirAI ML Infrastructure series. Part 1 covered the network foundation. Part 3 will cover EKS deployment with NVIDIA GPU support for serving Random Forest and LSTM models.