Building ML Infrastructure on AWS with Kiro: Part 2 — Multi-Account Landing Zone

Part 2 of a series on building production-grade infrastructure for deploying custom Random Forest and LSTM models on AWS.

Where We Left Off

In Part 1, we built the network foundation — VPCs, subnets, NAT egress, route tables, flow logs, CloudTrail, and a CI/CD pipeline — all in a single AWS account using Kiro's spec-driven workflow. Two environments deployed, both passing idempotency checks, in one session.

That got us running. But a single AWS account with two VPCs is a prototype topology, not a production one. When you're building infrastructure to serve ML models — especially with on-prem GPU resources in the mix — you need proper account boundaries, centralized governance, and cost controls that catch problems before they drain your runway.

Part 2 is about growing up without over-engineering. We're a small startup. Every dollar matters. But so does not getting breached, not losing data, and not waking up to a $10,000 bill because someone launched a p3.16xlarge in the wrong account.

Motivation

Why Multi-Account Now?

The single-account approach from Part 1 has real limitations:

Blast radius — a misconfigured IAM policy in dev can affect prod resources in the same account
Billing visibility — you can't see dev vs prod costs without meticulous tagging discipline
Compliance — auditors want to see environment isolation at the account level, not just the VPC level
Credential scope — a compromised dev credential has access to prod infrastructure

AWS Organizations with Control Tower gives us account-level isolation with centralized governance. Three accounts (Management, Dev, Prod) is the minimum viable multi-account setup for a startup.

Why Tailscale?

We have on-prem NVIDIA GPU servers for model training. Traditional AWS hybrid connectivity (Site-to-Site VPN or Direct Connect) costs $36-73/month minimum and requires static public IPs or dedicated hardware. Tailscale gives us WireGuard-based mesh networking for $0 (the free tier covers small teams) with no VPN appliances to manage. A t3.micro subnet router in each VPC advertises that VPC's routes to the Tailscale network, so our on-prem GPU servers can reach AWS private subnets as if they were on the same network.

Why Well-Architected Now?

It's tempting to defer governance until you're bigger. But retrofitting security, reliability, and cost controls into an existing architecture usually costs far more than building them in from the start. The Well-Architected Framework gives us a checklist, not a burden, and most of the controls (SCPs, GuardDuty, Config rules, billing alerts) are either free or cost little at startup scale.

Desired Outcome

By the end of Part 2, we want:

Three isolated AWS accounts — Management (governance only), Dev (development workloads), Prod (production workloads) — with Control Tower guardrails enforced
Centralized security — one CloudTrail trail for all accounts, GuardDuty threat detection everywhere, Security Hub aggregating findings in one dashboard
Hybrid connectivity — on-prem AI resources reaching AWS private subnets via Tailscale, no VPN hardware, no public endpoints
Cost guardrails — monthly budgets with alerts at $100 (dev), $200 (management), and $500 (prod), anomaly detection, GPU instance size caps via SCP
DR readiness — cross-region backup replication for prod, a pre-provisioned DR VPC in us-west-2, documented failover runbook
Private AWS API access — VPC endpoints for S3, DynamoDB, ECR, STS, CloudWatch Logs, Secrets Manager, and EKS so API traffic stays on the AWS network instead of crossing the public internet
AI model security — encrypted model artifact storage, scoped IAM for GPU nodes, inference endpoint security groups
Single sign-on — IAM Identity Center with role-based access (platform engineers, developers, auditors) across all accounts
Everything as code — Terraform modules for each concern, multi-account CI/CD pipeline, consistent naming and tagging from Part 1

What We're Building

Account Structure

AWS Organization (Management Account)
├── OU: Workloads
│   ├── Dev Account    (10.1.0.0/16, us-east-1)
│   └── Prod Account   (10.2.0.0/16, us-east-1)
│                       (10.3.0.0/16, us-west-2 — DR)
└── OU: Security (reserved for future)

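The Management Account runs no workloads, only Organizations, Control Tower, IAM Identity Center, consolidated billing, and centralized security services. SCPs do not apply to the management account itself, so this boundary has to be held by tightly restricting who can access that account rather than by an SCP.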

Service Control Policies

SCPs are the guardrails that prevent expensive mistakes:

SCP	What it prevents
Region restriction	Using any region except us-east-1 and us-west-2
CloudTrail protection	Deleting or modifying audit trails in member accounts
GuardDuty protection	Disabling threat detection in member accounts
No static credentials	Creating IAM users with passwords or access keys
Encryption enforcement	Creating S3 buckets without server-side encryption
GPU instance cap	Launching instances larger than p3.2xlarge or g5.xlarge
State protection	Modifying Terraform state resources outside CI/CD
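
As a rough illustration of the region restriction row, the following Python sketch builds the kind of policy JSON such an SCP typically contains. The exemption list is an assumption for illustration only: global services are served from a single region and must stay reachable, and a real landing zone needs a longer list. The function and constant names are hypothetical and not taken from the series' Terraform.

```python
import json

# Regions from the table above.
ALLOWED_REGIONS = ["us-east-1", "us-west-2"]

# Assumption: a short, incomplete list of global-service actions that must
# not be blocked by the region condition.
GLOBAL_SERVICE_ACTIONS = [
    "iam:*",
    "organizations:*",
    "sts:*",
    "route53:*",
    "cloudfront:*",
    "budgets:*",
    "ce:*",
    "support:*",
]


def region_restriction_scp(allowed_regions, exempt_actions):
    """Build an SCP that denies every non-exempt action outside the allowed regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                # NotAction: the deny covers everything except these actions.
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": list(allowed_regions)
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = region_restriction_scp(ALLOWED_REGIONS, GLOBAL_SERVICE_ACTIONS)
    print(json.dumps(policy, indent=2))
```

The resulting JSON could be passed to whatever Terraform resource or API call attaches the policy to the Workloads OU.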

The GPU instance cap is startup-specific — it prevents a p3.16xlarge ($24.48/hr) from running when a p3.2xlarge ($3.06/hr) is what we actually need.
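
One way the cap could be expressed is a deny on ec2:RunInstances that fires when the instance type belongs to a capped GPU family and is not one of the two allowed sizes. The sketch below is a minimal version under stated assumptions: the list of capped families is ours (other GPU families would need to be added), and would_deny is a hypothetical local helper that only mimics the two conditions so the logic can be checked without touching AWS.

```python
from fnmatch import fnmatchcase

# Allowed sizes from the SCP table above.
ALLOWED_GPU_TYPES = ["p3.2xlarge", "g5.xlarge"]

# Assumption: the families the cap applies to. Extend for other GPU families.
CAPPED_FAMILIES = ["p3.*", "p3dn.*", "g5.*"]


def gpu_cap_scp(capped_families, allowed_types):
    """Build an SCP denying launches in capped families except allowed sizes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOversizedGpuInstances",
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                # Both operators must match for the deny to apply:
                # the type is in a capped family AND is not an allowed size.
                "Condition": {
                    "StringLike": {"ec2:InstanceType": list(capped_families)},
                    "StringNotEquals": {"ec2:InstanceType": list(allowed_types)},
                },
            }
        ],
    }


def would_deny(instance_type, capped_families, allowed_types):
    """Local approximation of the two conditions, for reasoning and tests only."""
    in_capped_family = any(fnmatchcase(instance_type, p) for p in capped_families)
    return in_capped_family and instance_type not in allowed_types


if __name__ == "__main__":
    for itype in ["p3.2xlarge", "p3.16xlarge", "g5.xlarge", "g5.2xlarge", "t3.micro"]:
        verdict = "deny" if would_deny(itype, CAPPED_FAMILIES, ALLOWED_GPU_TYPES) else "allow"
        print(f"{itype}: {verdict}")
```

At the prices quoted above, the cap is the difference between $3.06 and $24.48 per hour, a factor of eight.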

Centralized Logging and Security

Management Account
├── Organization CloudTrail → S3 bucket (replaces per-account trails)
├── AWS Config Aggregator   → compliance data from all accounts
├── GuardDuty Administrator → findings from all accounts
├── Security Hub            → aggregated security posture
└── SNS Topic               → alerts for HIGH/CRITICAL findings

Per-account CloudTrail trails from Part 1 get disabled — the organization trail captures everything in one place, cutting duplicate storage costs.

Tailscale Hybrid Networking

On-Prem AI Lab                          AWS
┌─────────────────┐                     ┌──────────────────────────┐
│ NVIDIA GPU       │                     │ Dev VPC (10.1.0.0/16)    │
│ Servers          │◄── Tailscale ──────►│ ┌──────────────────────┐ │
│ (training)       │    mesh network     │ │ Tailscale Subnet     │ │
│                  │                     │ │ Router (t3.micro)    │ │
│                  │                     │ │ ASG min=1 max=1      │ │
└─────────────────┘                     │ └──────────────────────┘ │
                                        └──────────────────────────┘
                                        ┌──────────────────────────┐
                                        │ Prod VPC (10.2.0.0/16)   │
                                   ────►│ ┌──────────────────────┐ │
                                        │ │ Tailscale Subnet     │ │
                                        │ │ Router (t3.micro)    │ │
                                        │ │ ASG min=1 max=1      │ │
                                        │ └──────────────────────┘ │
                                        └──────────────────────────┘

Each subnet router advertises its VPC CIDR to the Tailscale network. On-prem GPUs see AWS private IPs as directly routable. Auth keys live in Secrets Manager. If a router dies, the ASG replaces it automatically.
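
To make the router's boot sequence concrete, here is a hedged Python sketch that renders a user-data script for the launch template. It assumes an AMI with the AWS CLI installed and an instance role allowed to read the secret; the secret name and the function name are hypothetical. Advertised routes also have to be approved on the Tailscale side (in the admin console or through auto-approvers) before on-prem machines can use them.

```python
def render_router_user_data(vpc_cidr: str, secret_id: str, region: str) -> str:
    """Return a bash user-data script for a Tailscale subnet router (sketch)."""
    return f"""#!/bin/bash
set -euo pipefail

# A subnet router must forward packets between the tailnet and the VPC.
echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-tailscale.conf
sysctl -p /etc/sysctl.d/99-tailscale.conf

# Install Tailscale with the vendor's install script.
curl -fsSL https://tailscale.com/install.sh | sh

# Fetch the auth key at boot so it is never baked into the launch template.
AUTH_KEY=$(aws secretsmanager get-secret-value \\
  --region {region} --secret-id {secret_id} \\
  --query SecretString --output text)

# Join the tailnet and advertise this VPC's CIDR.
tailscale up --auth-key="$AUTH_KEY" --advertise-routes={vpc_cidr}
"""


if __name__ == "__main__":
    # Hypothetical secret name; the Dev VPC CIDR comes from the diagram above.
    print(render_router_user_data("10.1.0.0/16", "tailscale/dev/auth-key", "us-east-1"))
```

Because the script is rebuilt from the same inputs every time, a replacement instance launched by the ASG comes up with the same routes as the one it replaces.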

Tailscale ACLs restrict on-prem access to private subnet CIDRs on ports 443 (HTTPS) and 50051 (gRPC) — just enough for model inference, nothing more.
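
A policy along these lines could be generated as follows. This is a sketch, not our actual policy file: the tag name and the private subnet CIDRs are hypothetical placeholders, and the function name is ours. Tailscale ACLs are deny-by-default, so a single accept rule is enough to describe the allowed traffic.

```python
import json


def onprem_acl(private_cidrs, ports=(443, 50051), source_tag="tag:onprem-gpu"):
    """Build a Tailscale policy allowing tagged on-prem nodes to reach the
    given CIDRs on the given ports only."""
    port_spec = ",".join(str(p) for p in ports)
    return {
        # Who is allowed to apply the tag to devices.
        "tagOwners": {source_tag: ["autogroup:admin"]},
        "acls": [
            {
                "action": "accept",
                "src": [source_tag],
                # Each destination is written as "<cidr>:<ports>".
                "dst": [f"{cidr}:{port_spec}" for cidr in private_cidrs],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical private subnets inside the Dev and Prod VPCs.
    policy = onprem_acl(["10.1.10.0/24", "10.2.10.0/24"])
    print(json.dumps(policy, indent=2))
```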

VPC Private Endpoints

Endpoint Type	Service	Why
Gateway (free)	S3	Model artifact downloads, state bucket access
Gateway (free)	DynamoDB	State locking
Interface	ECR (api + dkr)	Container image pulls without NAT
Interface	STS	IAM role assumption for pods
Interface	CloudWatch Logs	Log delivery from private subnets
Interface	Secrets Manager	Tailscale auth keys, app secrets
Interface	EKS, EKS-Auth	Private access to the EKS management API and Pod Identity credentials (for Part 3)

Gateway endpoints are free. Interface endpoints cost ~$7.20/month each per Availability Zone, plus a per-GB data processing charge, so we're selective and only add them for services that see high traffic or handle sensitive data.
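
The arithmetic is worth spelling out, because the hourly charge applies per endpoint per Availability Zone and per VPC. The sketch below counts the interface endpoints from the table and multiplies by the ~$7.20 figure; the short endpoint names and the function name are ours, and data processing charges are left out.

```python
# Interface endpoints from the table above (ECR counts as two).
INTERFACE_ENDPOINTS = [
    "ecr.api",
    "ecr.dkr",
    "sts",
    "logs",
    "secretsmanager",
    "eks",
    "eks-auth",
]

# Approximate fixed monthly charge per endpoint per AZ, as quoted in the text.
MONTHLY_COST_PER_ENDPOINT_AZ = 7.20


def monthly_endpoint_cost(endpoints, az_count=1, unit_cost=MONTHLY_COST_PER_ENDPOINT_AZ):
    """Fixed monthly cost for one VPC, excluding per-GB data processing."""
    return round(len(endpoints) * az_count * unit_cost, 2)


if __name__ == "__main__":
    for azs in (1, 2):
        cost = monthly_endpoint_cost(INTERFACE_ENDPOINTS, az_count=azs)
        print(f"{len(INTERFACE_ENDPOINTS)} endpoints x {azs} AZ(s): ${cost:.2f}/month per VPC")
```

Seven endpoints in a single AZ come to about $50/month per VPC, and the figure doubles with a second AZ, which is why the list stays short.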

Cost Controls

Control	Detail
Monthly budgets	$100 (dev), $200 (mgmt), $500 (prod)
Alert thresholds	50%, 80%, 100% of budget
Forecast alerts	Trigger when forecasted spend reaches 90% of budget
Anomaly detection	Alert on $10+ above baseline
GPU SCP	Block instances larger than p3.2xlarge / g5.xlarge
Scheduled stops	Dev Tailscale routers off outside business hours
NAT Instance	t3.micro in dev instead of managed NAT Gateway
Low-cost KMS	AWS managed keys (no monthly key fee) everywhere except model artifacts
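
To see what the budget rows mean in dollars, this small sketch expands each monthly budget into the spend level at which each alert fires. The dictionary keys and function name are ours; the amounts and percentages come from the table.

```python
# Monthly budgets and alert thresholds from the table above.
BUDGETS = {"dev": 100, "mgmt": 200, "prod": 500}
THRESHOLDS_PCT = (50, 80, 100)


def alert_amounts(budgets, thresholds_pct):
    """Map each account to the dollar amounts at which its alerts fire."""
    return {
        account: [limit * pct / 100 for pct in thresholds_pct]
        for account, limit in budgets.items()
    }


def alerts_fired(account, actual_spend, budgets=BUDGETS, thresholds_pct=THRESHOLDS_PCT):
    """Return the threshold percentages already crossed by actual spend."""
    limit = budgets[account]
    return [pct for pct in thresholds_pct if actual_spend >= limit * pct / 100]


if __name__ == "__main__":
    for account, amounts in alert_amounts(BUDGETS, THRESHOLDS_PCT).items():
        print(account, amounts)
```

For example, a dev account that has spent $85 has already crossed its 50% and 80% alerts but not the 100% one.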

Backup and DR

	Dev	Prod
Daily backups	7-day retention	30-day retention
Weekly backups	30-day retention	90-day retention
Cross-region replication	No (cost)	Yes → us-west-2
DR VPC	No	Pre-provisioned (10.3.0.0/16)
DR runbook	No	Documented

Backup selection is tag-based — tag a resource with BackupPolicy = "daily" and it's automatically included. No manual snapshot management.
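
The selection logic can be sketched in a few lines. The dictionary below follows the shape of an AWS Backup selection that matches on tags; the role ARN, account ID, and helper names are hypothetical placeholders. The is_selected helper only mirrors the matching rule locally (a resource is included when it matches at least one listed tag condition) so the behavior is easy to reason about.

```python
def backup_selection(policy_value, role_arn):
    """Build a tag-based backup selection for resources tagged BackupPolicy=<value>."""
    return {
        "SelectionName": f"tagged-{policy_value}",
        "IamRoleArn": role_arn,
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "BackupPolicy",
                "ConditionValue": policy_value,
            }
        ],
    }


def is_selected(resource_tags, selection):
    """Local approximation: true when any tag condition matches exactly."""
    return any(
        resource_tags.get(cond["ConditionKey"]) == cond["ConditionValue"]
        for cond in selection["ListOfTags"]
    )


if __name__ == "__main__":
    # Hypothetical role ARN with a placeholder account ID.
    selection = backup_selection("daily", "arn:aws:iam::111122223333:role/backup-role")
    print(is_selected({"BackupPolicy": "daily", "Environment": "prod"}, selection))
    print(is_selected({"Environment": "dev"}, selection))
```

Tag values are compared exactly, so "Daily" would not match "daily".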

IAM Identity Center

Group	Permission Set	Accounts
PlatformEngineers	AdministratorAccess	All three
Developers	DeveloperAccess (no IAM/Orgs/billing)	Dev only
Auditors	ReadOnlyAccess	All three

No static IAM credentials anywhere. Everyone authenticates through the SSO portal. Sessions expire after 8 hours.
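
The table above expands into a flat list of (group, permission set, account) assignments, which is the shape most tooling wants. The sketch below does that expansion and formats the 8-hour session limit as an ISO 8601 duration, the format permission sets use for session length. The account labels and function names are ours.

```python
ALL_ACCOUNTS = ("management", "dev", "prod")

# Group -> (permission set, accounts), copied from the table above.
ASSIGNMENTS = {
    "PlatformEngineers": ("AdministratorAccess", ALL_ACCOUNTS),
    "Developers": ("DeveloperAccess", ("dev",)),
    "Auditors": ("ReadOnlyAccess", ALL_ACCOUNTS),
}


def session_duration(hours):
    """Format a whole number of hours as an ISO 8601 duration, e.g. PT8H."""
    if hours <= 0:
        raise ValueError("session length must be positive")
    return f"PT{hours}H"


def expand_assignments(assignments):
    """Flatten the table into one (group, permission_set, account) tuple per row."""
    return [
        (group, permission_set, account)
        for group, (permission_set, accounts) in assignments.items()
        for account in accounts
    ]


if __name__ == "__main__":
    print("session duration:", session_duration(8))
    for row in expand_assignments(ASSIGNMENTS):
        print(row)
```

Three groups across three accounts produce seven assignments, since Developers only get the Dev account.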

Benefits

For the Engineering Team

One portal, all accounts — SSO eliminates credential juggling and reduces the risk of leaked access keys
Guardrails, not gates — SCPs prevent dangerous actions without requiring approval workflows that slow down development
On-prem GPU access — Tailscale connects the AI lab to AWS without VPN hardware or networking expertise

For the Business

Cost visibility — per-account budgets and anomaly detection catch spending problems in hours, not at month-end
Audit readiness — centralized CloudTrail, Config, and Security Hub provide the compliance evidence investors and customers ask for
DR capability — cross-region backups and a pre-provisioned DR VPC mean production can survive a regional outage

For the ML Platform

Model security — encrypted artifact storage with scoped IAM means models are protected from unauthorized access and exfiltration
Private connectivity — VPC endpoints and Tailscale keep model traffic off the public internet
GPU cost control — SCP-enforced instance caps prevent accidental $24/hr GPU bills

Migration from Part 1

The existing single-account infrastructure doesn't get thrown away. The migration strategy:

Create the Organization and member accounts via Control Tower
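Use Terraform state move operations to re-home the VPC resources in the new per-account state files. These operations only rewrite Terraform state: AWS cannot transfer a VPC between accounts, so a VPC stays untouched only if the account it already lives in becomes its target account, and any other VPC has to be built in its new account from the same module
Preserve existing CIDRs (10.1.0.0/16 → Dev Account, 10.2.0.0/16 → Prod Account)
Replace per-account CloudTrail trails with the organization trail
Update the CI/CD pipeline for multi-account OIDC authentication
Zero downtime throughout — replacements come up and take traffic before anything is retired, with no resources destroyed and recreated in place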

Terraform Module Structure

terraform/
├── organization/          # AWS Organizations, Control Tower, SCPs
├── account-baseline/      # Per-account baseline (Config, GuardDuty enrollment)
├── security-services/     # Security Hub, org CloudTrail, Config aggregator
├── networking/            # VPC endpoints, Tailscale subnet routers
├── backup/                # AWS Backup plans, vaults, cross-region replication
├── cost-management/       # Budgets, anomaly detection, Cost Explorer
└── identity/              # IAM Identity Center, permission sets, groups

Each module has its own state file keyed by {module}/{account}/{env}/terraform.tfstate. A failure in one module doesn't affect others.
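
The key scheme is simple enough to capture in a helper. This is a sketch: the module names come from the directory tree above, while the account and environment values in the usage example are hypothetical.

```python
# Module directories from the tree above.
MODULES = {
    "organization",
    "account-baseline",
    "security-services",
    "networking",
    "backup",
    "cost-management",
    "identity",
}


def state_key(module, account, env):
    """Return the state object key for one module in one account and environment."""
    if module not in MODULES:
        raise ValueError(f"unknown module: {module}")
    return f"{module}/{account}/{env}/terraform.tfstate"


if __name__ == "__main__":
    # Hypothetical account/env pairs, including the DR environment in prod.
    for account, env in [("dev", "dev"), ("prod", "prod"), ("prod", "dr")]:
        print(state_key("networking", account, env))
```

Since no two module, account, and environment combinations share a key, a corrupted or locked state for one of them leaves the rest untouched.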

What's Next: Part 3 — EKS with NVIDIA GPU Nodes

With the multi-account foundation in place, Part 3 will deploy the compute layer:

EKS clusters in Dev and Prod accounts on the private subnets from Part 1
NVIDIA GPU node groups (p3.2xlarge / g5.xlarge) for Random Forest and LSTM model inference
GPU device plugin and NVIDIA container runtime
Private API server endpoint (cluster private endpoint access), with EKS management API calls routed through the EKS VPC endpoints provisioned in Part 2
Model serving pipeline — pull artifacts from the encrypted S3 bucket, serve via gRPC on port 50051

The subnets are tagged. The endpoints are in place. The security groups are defined. The model artifact buckets are encrypted. Part 3 is where the models start running.

This is Part 2 of the Mica Mirai ML Infrastructure series. Part 1 covered the network foundation. Part 3 will cover EKS deployment with NVIDIA GPU support for serving Random Forest and LSTM models.