technical-reference · aws · landing-zone · organizations · control-tower · sso

Building ML Infrastructure on AWS with Kiro: Part 2 — Multi-Account Landing Zone

Deploying the MM AWS multi-account landing zone: Control Tower setup, OU hierarchy, SCP guardrails, and account factory automation.

October 18, 2025·8 min read

Part 2 of a series on building production-grade infrastructure for deploying custom Random Forest and LSTM models on AWS.


Where We Left Off

In Part 1, we built the network foundation — VPCs, subnets, NAT egress, route tables, flow logs, CloudTrail, and a CI/CD pipeline — all in a single AWS account using Kiro's spec-driven workflow. Two environments deployed, both passing idempotency checks, in one session.

That got us running. But a single AWS account with two VPCs is a prototype topology, not a production one. When you're building infrastructure to serve ML models — especially with on-prem GPU resources in the mix — you need proper account boundaries, centralized governance, and cost controls that catch problems before they drain your runway.

Part 2 is about growing up without over-engineering. We're a small startup. Every dollar matters. But so does not getting breached, not losing data, and not waking up to a $10,000 bill because someone launched a p3.16xlarge in the wrong account.


Motivation

Why Multi-Account Now?

The single-account approach from Part 1 has real limitations:

  • Blast radius — a misconfigured IAM policy in dev can affect prod resources in the same account
  • Billing visibility — you can't see dev vs prod costs without meticulous tagging discipline
  • Compliance — auditors want to see environment isolation at the account level, not just the VPC level
  • Credential scope — a compromised dev credential has access to prod infrastructure

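AWS Organizations with Control Tower gives us account-level isolation with centralized governance. Three accounts (Management, Dev, Prod) are the minimum viable multi-account setup for a startup. Note that a standard Control Tower landing zone also provisions its own shared Log Archive and Audit accounts alongside them.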

Why Tailscale?

We have on-prem NVIDIA GPU servers for model training. Traditional AWS hybrid connectivity (Site-to-Site VPN or Direct Connect) costs $36-73/month minimum and typically requires a static public IP or dedicated hardware. Tailscale gives us WireGuard-based mesh networking for $0 (the free tier covers small teams) with no VPN appliances to manage. A t3.micro subnet router in each VPC advertises that VPC's routes to the Tailscale network, so our on-prem GPU servers can reach AWS private subnets as if they were on the same network.

Why Well-Architected Now?

It's tempting to defer governance until you're bigger. But retrofitting security, reliability, and cost controls into an existing architecture usually costs far more than building them in from the start. The Well-Architected Framework gives us a checklist, not a burden, and most of the controls (SCPs, GuardDuty, Config rules, billing alerts) are either free or pennies/month at startup scale.


Desired Outcome

By the end of Part 2, we want:

  1. Three isolated AWS accounts — Management (governance only), Dev (development workloads), Prod (production workloads) — with Control Tower guardrails enforced
  2. Centralized security — one CloudTrail trail for all accounts, GuardDuty threat detection everywhere, Security Hub aggregating findings in one dashboard
  3. Hybrid connectivity — on-prem AI resources reaching AWS private subnets via Tailscale, no VPN hardware, no public endpoints
  4. Cost guardrails — monthly budgets of $100 (dev), $200 (management), and $500 (prod) with threshold alerts, anomaly detection, GPU instance size caps via SCP
  5. DR readiness — cross-region backup replication for prod, a pre-provisioned DR VPC in us-west-2, documented failover runbook
  6. Private AWS API access — VPC endpoints for S3, DynamoDB, ECR, STS, CloudWatch Logs, Secrets Manager, and EKS so that traffic to those services stays on the AWS network
  7. AI model security — encrypted model artifact storage, scoped IAM for GPU nodes, inference endpoint security groups
  8. Single sign-on — IAM Identity Center with role-based access (platform engineers, developers, auditors) across all accounts
  9. Everything as code — Terraform modules for each concern, multi-account CI/CD pipeline, consistent naming and tagging from Part 1

What We're Building

Account Structure

AWS Organization (Management Account)
├── OU: Workloads
│   ├── Dev Account    (10.1.0.0/16, us-east-1)
│   └── Prod Account   (10.2.0.0/16, us-east-1)
│                       (10.3.0.0/16, us-west-2 — DR)
└── OU: Security (reserved for future)

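The Management Account runs no workloads — only Organizations, Control Tower, IAM Identity Center, consolidated billing, and centralized security services. SCPs do not apply to the management account itself, so this boundary has to be held by restricting who can access that account, not by an SCP.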
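The CIDR plan in the account tree above only works for Tailscale routing and any future peering if the ranges never overlap. Here is a minimal sketch of that check using Python's standard `ipaddress` module. The dictionary labels are our own naming for illustration; the CIDRs are the ones from the tree.

```python
import ipaddress
from itertools import combinations

# CIDRs copied from the account structure above; the labels are illustrative.
VPC_CIDRS = {
    "dev (us-east-1)": "10.1.0.0/16",
    "prod (us-east-1)": "10.2.0.0/16",
    "prod-dr (us-west-2)": "10.3.0.0/16",
}


def find_overlaps(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


if __name__ == "__main__":
    overlaps = find_overlaps(VPC_CIDRS)
    print("overlapping pairs:", overlaps or "none")
```

A check like this is cheap to run in CI before any new VPC or DR range is added.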

Service Control Policies

SCPs are the guardrails that prevent expensive mistakes:

SCP                     What it prevents
Region restriction      Using any region except us-east-1 and us-west-2
CloudTrail protection   Deleting or modifying audit trails in member accounts
GuardDuty protection    Disabling threat detection in member accounts
No static credentials   Creating IAM users with passwords or access keys
Encryption enforcement  Creating S3 buckets without server-side encryption
GPU instance cap        Launching instances larger than p3.2xlarge or g5.xlarge
State protection        Modifying Terraform state resources outside CI/CD

The GPU instance cap is startup-specific — it prevents a p3.16xlarge ($24.48/hr) from running when a p3.2xlarge ($3.06/hr) is what we actually need.
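To make the GPU cap concrete, here is one way such an SCP could be written, built as a Python dictionary so it can be serialized to JSON. This is a sketch, not our exact policy. It assumes "GPU instance" means the EC2 P and G families, and it implements the cap as an allowlist, which also blocks smaller GPU types that are not listed. The `Sid` and the helper `would_be_denied` are illustrative names.

```python
import json
from fnmatch import fnmatchcase

# Largest GPU sizes the article allows.
ALLOWED_GPU_TYPES = ["p3.2xlarge", "g5.xlarge"]
# Assumption: "GPU instance" means the EC2 P and G families.
GPU_FAMILY_PATTERNS = ["p*", "g*"]

gpu_cap_scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOversizedGpuInstances",  # illustrative name
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # Both operators must match for the Deny to apply:
            # the type is in a GPU family AND is not on the allowlist.
            "Condition": {
                "StringLike": {"ec2:InstanceType": GPU_FAMILY_PATTERNS},
                "StringNotEquals": {"ec2:InstanceType": ALLOWED_GPU_TYPES},
            },
        }
    ],
}


def would_be_denied(instance_type: str) -> bool:
    """Local approximation of the condition logic, for sanity checks only."""
    in_gpu_family = any(fnmatchcase(instance_type, p) for p in GPU_FAMILY_PATTERNS)
    return in_gpu_family and instance_type not in ALLOWED_GPU_TYPES


if __name__ == "__main__":
    print(json.dumps(gpu_cap_scp, indent=2))
```

The local helper only mirrors the condition logic for quick tests. The real evaluation happens in IAM when the SCP is attached to the Workloads OU.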

Centralized Logging and Security

Management Account
├── Organization CloudTrail → S3 bucket (replaces per-account trails)
├── AWS Config Aggregator   → compliance data from all accounts
├── GuardDuty Administrator → findings from all accounts
├── Security Hub            → aggregated security posture
└── SNS Topic               → alerts for HIGH/CRITICAL findings

Per-account CloudTrail trails from Part 1 get disabled — the organization trail captures everything in one place, cutting duplicate storage costs.
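The SNS alerting path needs something to filter findings by severity. One common approach, assumed here rather than confirmed by our setup, is an EventBridge rule on Security Hub findings that targets the SNS topic. The sketch below builds that event pattern and adds a small local matcher, `should_alert` (a hypothetical helper), for testing sample events.

```python
import json

ALERT_SEVERITIES = ["HIGH", "CRITICAL"]
DETAIL_TYPE = "Security Hub Findings - Imported"

# EventBridge event pattern for findings that Security Hub imports.
finding_alert_pattern = {
    "source": ["aws.securityhub"],
    "detail-type": [DETAIL_TYPE],
    "detail": {"findings": {"Severity": {"Label": ALERT_SEVERITIES}}},
}


def should_alert(event: dict) -> bool:
    """Rough local equivalent of the pattern: any finding at an alerting severity."""
    if event.get("source") != "aws.securityhub":
        return False
    if event.get("detail-type") != DETAIL_TYPE:
        return False
    findings = event.get("detail", {}).get("findings", [])
    return any(
        finding.get("Severity", {}).get("Label") in ALERT_SEVERITIES
        for finding in findings
    )


if __name__ == "__main__":
    print(json.dumps(finding_alert_pattern, indent=2))
```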

Tailscale Hybrid Networking

On-Prem AI Lab                          AWS
┌─────────────────┐                     ┌──────────────────────────┐
│ NVIDIA GPU       │                     │ Dev VPC (10.1.0.0/16)    │
│ Servers          │◄── Tailscale ──────►│ ┌──────────────────────┐ │
│ (training)       │    mesh network     │ │ Tailscale Subnet     │ │
│                  │                     │ │ Router (t3.micro)    │ │
│                  │                     │ │ ASG min=1 max=1      │ │
└─────────────────┘                     │ └──────────────────────┘ │
                                        └──────────────────────────┘
                                        ┌──────────────────────────┐
                                        │ Prod VPC (10.2.0.0/16)   │
                                   ────►│ ┌──────────────────────┐ │
                                        │ │ Tailscale Subnet     │ │
                                        │ │ Router (t3.micro)    │ │
                                        │ │ ASG min=1 max=1      │ │
                                        │ └──────────────────────┘ │
                                        └──────────────────────────┘

Each subnet router advertises its VPC CIDR to the Tailscale network. On-prem GPUs see AWS private IPs as directly routable. Auth keys live in Secrets Manager. If a router dies, the ASG replaces it automatically.

Tailscale ACLs restrict on-prem access to private subnet CIDRs on ports 443 (HTTPS) and 50051 (gRPC) — just enough for model inference, nothing more.
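As a rough illustration, the ACL rule described above could be generated like this. The tag name `tag:onprem-gpu` is hypothetical, and because the exact private subnet ranges are not listed here, the whole-VPC CIDRs stand in for them. Tailscale ACLs are deny-by-default, so anything not listed in an `accept` rule is blocked.

```python
import json

# Assumption: whole-VPC CIDRs stand in for the private subnet ranges.
AWS_CIDRS = ["10.1.0.0/16", "10.2.0.0/16"]
INFERENCE_PORTS = [443, 50051]  # HTTPS and gRPC
ONPREM_TAG = "tag:onprem-gpu"   # hypothetical tag for the on-prem GPU servers


def build_acl_policy(cidrs: list[str], ports: list[int], src_tag: str) -> dict:
    """Build a Tailscale policy that lets src_tag reach cidrs on the given ports only."""
    port_spec = ",".join(str(port) for port in ports)
    return {
        "tagOwners": {src_tag: ["autogroup:admin"]},
        "acls": [
            {
                "action": "accept",
                "src": [src_tag],
                # Destination entries use the "host:ports" form.
                "dst": [f"{cidr}:{port_spec}" for cidr in cidrs],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_acl_policy(AWS_CIDRS, INFERENCE_PORTS, ONPREM_TAG), indent=2))
```

Generating the policy from the same CIDR list that Terraform uses keeps the ACL from drifting when a range changes.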

VPC Private Endpoints

Endpoint Type   Service          Why
Gateway (free)  S3               Model artifact downloads, state bucket access
Gateway (free)  DynamoDB         State locking
Interface       ECR (api + dkr)  Container image pulls without NAT
Interface       STS              IAM role assumption for pods
Interface       CloudWatch Logs  Log delivery from private subnets
Interface       Secrets Manager  Tailscale auth keys, app secrets
Interface       EKS, EKS-Auth    Future private EKS API server access

Gateway endpoints are free. Interface endpoints cost ~$7.20/month each per Availability Zone, plus per-GB data processing charges, so we're selective: only services that see high traffic or handle sensitive data get one.
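The endpoint bill is easy to estimate up front. The sketch below assumes a list price of $0.01 per endpoint per Availability Zone per hour and a 720-hour month, which is what yields the ~$7.20 figure. It ignores data processing charges, and the AZ count is a parameter because it depends on how many subnets each endpoint is attached to.

```python
HOURLY_RATE_PER_AZ = 0.01  # USD; assumed list price per interface endpoint per AZ
HOURS_PER_MONTH = 720      # 30-day month, matching the ~$7.20 figure

# Interface endpoints from the table above (ECR counts twice: api and dkr).
INTERFACE_ENDPOINTS = [
    "ecr.api",
    "ecr.dkr",
    "sts",
    "logs",
    "secretsmanager",
    "eks",
    "eks-auth",
]


def monthly_endpoint_cost(endpoint_count: int, az_count: int = 1) -> float:
    """Fixed monthly cost in USD for interface endpoints, excluding data processing."""
    return round(endpoint_count * az_count * HOURLY_RATE_PER_AZ * HOURS_PER_MONTH, 2)


if __name__ == "__main__":
    count = len(INTERFACE_ENDPOINTS)
    print(f"{count} endpoints, 1 AZ: ${monthly_endpoint_cost(count):.2f}/month per VPC")
    print(f"{count} endpoints, 2 AZs: ${monthly_endpoint_cost(count, 2):.2f}/month per VPC")
```

Under these assumptions the seven interface endpoints come to about $50 per VPC per month in a single AZ, and double that across two AZs.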

Cost Controls

Control            Detail
Monthly budgets    $100 (dev), $200 (mgmt), $500 (prod)
Alert thresholds   50%, 80%, 100% of budget
Forecast alerts    Trigger at 90% of projected spend
Anomaly detection  Alert on $10+ above baseline
GPU SCP            Block instances larger than p3.2xlarge / g5.xlarge
Scheduled stops    Dev Tailscale routers off outside business hours
NAT Instance       t3.micro in dev instead of managed NAT Gateway
Free-tier KMS      AWS-managed keys everywhere except model artifacts
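A little arithmetic shows why the budgets and the GPU cap belong together. The sketch below turns the budgets and thresholds from the table into dollar amounts, then works out how many hours a single GPU instance would take to consume a budget, using the hourly rates quoted earlier. The function names are ours, for illustration.

```python
MONTHLY_BUDGETS_USD = {"dev": 100, "mgmt": 200, "prod": 500}
ALERT_THRESHOLDS_PCT = [50, 80, 100]
# On-demand hourly rates quoted earlier in this post.
HOURLY_RATE_USD = {"p3.2xlarge": 3.06, "p3.16xlarge": 24.48}


def alert_amounts(budget_usd: float) -> list[float]:
    """Dollar amounts at which each threshold alert fires."""
    return [budget_usd * pct / 100 for pct in ALERT_THRESHOLDS_PCT]


def hours_to_exhaust(budget_usd: float, instance_type: str) -> float:
    """Hours one instance of the given type can run before spending the whole budget."""
    return budget_usd / HOURLY_RATE_USD[instance_type]


if __name__ == "__main__":
    for account, budget in MONTHLY_BUDGETS_USD.items():
        print(account, alert_amounts(budget))
    for instance_type in HOURLY_RATE_USD:
        hours = hours_to_exhaust(MONTHLY_BUDGETS_USD["prod"], instance_type)
        print(f"{instance_type}: prod budget gone in {hours:.1f} hours")
```

At these rates a single p3.16xlarge would use up the entire prod budget in about 20 hours, while a p3.2xlarge takes roughly 163 hours.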

Backup and DR

                          Dev               Prod
Daily backups             7-day retention   30-day retention
Weekly backups            30-day retention  90-day retention
Cross-region replication  No (cost)         Yes → us-west-2
DR VPC                    No                Pre-provisioned (10.3.0.0/16)
DR runbook                No                Documented

Backup selection is tag-based — tag a resource with BackupPolicy = "daily" and it's automatically included. No manual snapshot management.
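Here is a sketch of what tag-based selection looks like as data. The selection document follows the shape that AWS Backup's CreateBackupSelection API takes, with a `ListOfTags` condition on the `BackupPolicy` tag. The selection name, the role ARN placeholder, and the use of "weekly" as a second tag value are assumptions for illustration.

```python
# Retention values from the table above.
RETENTION_DAYS = {
    "dev": {"daily": 7, "weekly": 30},
    "prod": {"daily": 30, "weekly": 90},
}


def backup_selection(policy: str, role_arn: str) -> dict:
    """Selection document matching resources tagged BackupPolicy = <policy>."""
    return {
        "SelectionName": f"tagged-{policy}",  # illustrative naming scheme
        "IamRoleArn": role_arn,
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "BackupPolicy",
                "ConditionValue": policy,
            }
        ],
    }


def is_selected(resource_tags: dict[str, str], policy: str) -> bool:
    """Local check mirroring the tag condition, useful in unit tests."""
    return resource_tags.get("BackupPolicy") == policy


if __name__ == "__main__":
    # Placeholder ARN; substitute the role your backup plan actually uses.
    selection = backup_selection("daily", "arn:aws:iam::123456789012:role/example-backup-role")
    print(selection)
    print(is_selected({"BackupPolicy": "daily", "Environment": "prod"}, "daily"))
```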

IAM Identity Center

Group              Permission Set                         Accounts
PlatformEngineers  AdministratorAccess                    All three
Developers         DeveloperAccess (no IAM/Orgs/billing)  Dev only
Auditors           ReadOnlyAccess                         All three

No static IAM credentials anywhere. Everyone authenticates through the SSO portal. Sessions expire after 8 hours.
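The access table expands into a flat list of (group, permission set, account) assignments, which is the unit IAM Identity Center actually works in. A minimal sketch, assuming short account labels of our own choosing and the 8-hour session expressed as an ISO 8601 duration:

```python
ACCOUNTS = ["management", "dev", "prod"]  # illustrative labels, not account IDs
SESSION_DURATION = "PT8H"  # 8 hours as an ISO 8601 duration

# Group -> (permission set, accounts), taken from the table above.
GROUP_ACCESS = {
    "PlatformEngineers": ("AdministratorAccess", ACCOUNTS),
    "Developers": ("DeveloperAccess", ["dev"]),
    "Auditors": ("ReadOnlyAccess", ACCOUNTS),
}


def expand_assignments(group_access: dict) -> list[tuple[str, str, str]]:
    """Flatten the table into one (group, permission_set, account) tuple per assignment."""
    return [
        (group, permission_set, account)
        for group, (permission_set, accounts) in group_access.items()
        for account in accounts
    ]


if __name__ == "__main__":
    for assignment in expand_assignments(GROUP_ACCESS):
        print(assignment)
```

Seven assignments come out of the three rows, which is a useful number to compare against what is actually deployed.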


Benefits

For the Engineering Team

  • One portal, all accounts — SSO eliminates credential juggling and reduces the risk of leaked access keys
  • Guardrails, not gates — SCPs prevent dangerous actions without requiring approval workflows that slow down development
  • On-prem GPU access — Tailscale connects the AI lab to AWS without VPN hardware or networking expertise

For the Business

  • Cost visibility — per-account budgets and anomaly detection catch spending problems in hours, not at month-end
  • Audit readiness — centralized CloudTrail, Config, and Security Hub provide the compliance evidence investors and customers ask for
  • DR capability — cross-region backups and a pre-provisioned DR VPC mean production can survive a regional outage

For the ML Platform

  • Model security — encrypted artifact storage with scoped IAM means models are protected from unauthorized access and exfiltration
  • Private connectivity — VPC endpoints and Tailscale keep model traffic off the public internet
  • GPU cost control — SCP-enforced instance caps prevent accidental $24/hr GPU bills

Migration from Part 1

The existing single-account infrastructure doesn't get thrown away. The migration strategy:

  1. Create the Organization and member accounts via Control Tower
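  2. Use Terraform state move operations to re-key the VPC resources' state under the new per-account layout. A state move changes only what Terraform tracks; AWS cannot transfer a VPC between accounts, so this applies to resources in an existing account that is enrolled into the Organization, and anything that must live in a newly created account has to be rebuilt there
  3. Preserve existing CIDRs (10.1.0.0/16 → Dev Account, 10.2.0.0/16 → Prod Account)
  4. Replace per-account CloudTrail trails with the organization trail
  5. Update the CI/CD pipeline for multi-account OIDC authentication
  6. Aim for zero downtime throughout — resources that stay in an enrolled account are never destroyed and recreated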

Terraform Module Structure

terraform/
├── organization/          # AWS Organizations, Control Tower, SCPs
├── account-baseline/      # Per-account baseline (Config, GuardDuty enrollment)
├── security-services/     # Security Hub, org CloudTrail, Config aggregator
├── networking/            # VPC endpoints, Tailscale subnet routers
├── backup/                # AWS Backup plans, vaults, cross-region replication
├── cost-management/       # Budgets, anomaly detection, Cost Explorer
└── identity/              # IAM Identity Center, permission sets, groups

Each module has its own state file keyed by {module}/{account}/{env}/terraform.tfstate. A failure in one module doesn't affect others.
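The state key convention is simple enough to encode in a helper so that pipelines and humans build paths the same way. A small sketch follows; the function name and the validation rules are our own additions, not part of Terraform.

```python
def state_key(module: str, account: str, env: str) -> str:
    """Build the state object key as {module}/{account}/{env}/terraform.tfstate."""
    for name, value in (("module", module), ("account", account), ("env", env)):
        # Reject empty parts and stray slashes so keys cannot collide or nest.
        if not value or "/" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return f"{module}/{account}/{env}/terraform.tfstate"


if __name__ == "__main__":
    print(state_key("networking", "dev", "dev"))
    print(state_key("backup", "prod", "prod"))
```

Because every module, account, and environment combination maps to a distinct key, a corrupted or locked state in one place leaves the others untouched.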


What's Next: Part 3 — EKS with NVIDIA GPU Nodes

With the multi-account foundation in place, Part 3 will deploy the compute layer:

  • EKS clusters in Dev and Prod accounts on the private subnets from Part 1
  • NVIDIA GPU node groups (p3.2xlarge / g5.xlarge) for Random Forest and LSTM model inference
  • GPU device plugin and NVIDIA container runtime
  • Private API server endpoint via the EKS VPC endpoints provisioned in Part 2
  • Model serving pipeline — pull artifacts from the encrypted S3 bucket, serve via gRPC on port 50051

The subnets are tagged. The endpoints are in place. The security groups are defined. The model artifact buckets are encrypted. Part 3 is where the models start running.


This is Part 2 of the Mica Mirai ML Infrastructure series. Part 1 covered the network foundation. Part 3 will cover EKS deployment with NVIDIA GPU support for serving Random Forest and LSTM models.