Building ML Infrastructure on AWS with Kiro: Part 2 — Multi-Account Landing Zone
Part 2 of a series on building production-grade infrastructure for deploying custom Random Forest and LSTM models on AWS.
Where We Left Off
In Part 1, we built the network foundation — VPCs, subnets, NAT egress, route tables, flow logs, CloudTrail, and a CI/CD pipeline — all in a single AWS account using Kiro's spec-driven workflow. Two environments deployed, both passing idempotency checks, in one session.
That got us running. But a single AWS account with two VPCs is a prototype topology, not a production one. When you're building infrastructure to serve ML models — especially with on-prem GPU resources in the mix — you need proper account boundaries, centralized governance, and cost controls that catch problems before they drain your runway.
Part 2 is about growing up without over-engineering. We're a small startup. Every dollar matters. But so does not getting breached, not losing data, and not waking up to a $10,000 bill because someone launched a p3.16xlarge in the wrong account.
Motivation
Why Multi-Account Now?
The single-account approach from Part 1 has real limitations:
- Blast radius — a misconfigured IAM policy in dev can affect prod resources in the same account
- Billing visibility — you can't see dev vs prod costs without meticulous tagging discipline
- Compliance — auditors want to see environment isolation at the account level, not just the VPC level
- Credential scope — a compromised dev credential has access to prod infrastructure
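AWS Organizations with Control Tower gives us account-level isolation with centralized governance. Three accounts (Management, Dev, Prod) is the minimum viable multi-account setup for a startup. Note that a Control Tower landing zone by default also provisions its own shared Log Archive and Audit accounts alongside these three.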
Why Tailscale?
We have on-prem NVIDIA GPU servers for model training. Traditional AWS hybrid connectivity (Site-to-Site VPN or Direct Connect) costs $36-73/month minimum and requires static public IPs or dedicated hardware. Tailscale gives us WireGuard-based mesh networking for $0 (free tier covers small teams) with no VPN appliances to manage. A t3.micro subnet router in each VPC advertises that VPC's routes to the Tailscale network, and our on-prem GPUs can reach AWS private subnets as if they were on the same network.
Why Well-Architected Now?
It's tempting to defer governance until you're bigger. But retrofitting security, reliability, and cost controls into an existing architecture typically costs far more than building them in from the start. The Well-Architected Framework gives us a checklist, not a burden, and most of the controls (SCPs, GuardDuty, Config rules, billing alerts) are either free or pennies/month at startup scale.
Desired Outcome
By the end of Part 2, we want:
- Three isolated AWS accounts — Management (governance only), Dev (development workloads), Prod (production workloads) — with Control Tower guardrails enforced
- Centralized security — one CloudTrail trail for all accounts, GuardDuty threat detection everywhere, Security Hub aggregating findings in one dashboard
- Hybrid connectivity — on-prem AI resources reaching AWS private subnets via Tailscale, no VPN hardware, no public endpoints
- Cost guardrails: monthly budgets of $100 (Dev), $200 (Management), and $500 (Prod) with threshold alerts, anomaly detection, GPU instance size caps via SCP
- DR readiness — cross-region backup replication for prod, a pre-provisioned DR VPC in us-west-2, documented failover runbook
- Private AWS API access: VPC endpoints for S3, DynamoDB, ECR, STS, CloudWatch Logs, Secrets Manager, and EKS so traffic to those services stays on the AWS backbone
- AI model security — encrypted model artifact storage, scoped IAM for GPU nodes, inference endpoint security groups
- Single sign-on — IAM Identity Center with role-based access (platform engineers, developers, auditors) across all accounts
- Everything as code — Terraform modules for each concern, multi-account CI/CD pipeline, consistent naming and tagging from Part 1
What We're Building
Account Structure
AWS Organization (Management Account)
├── OU: Workloads
│   ├── Dev Account  (10.1.0.0/16, us-east-1)
│   └── Prod Account (10.2.0.0/16, us-east-1)
│                    (10.3.0.0/16, us-west-2, DR)
└── OU: Security (reserved for future)
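The Management Account runs no workloads, only Organizations, Control Tower, IAM Identity Center, consolidated billing, and centralized security services. SCPs do not apply to the management account itself, so this boundary has to be enforced with narrowly scoped permission sets and team convention rather than by SCP.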
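Because each Tailscale subnet router will advertise its VPC CIDR, the address plan in the tree above only works if the ranges never overlap. A minimal sketch of that sanity check, using Python's standard `ipaddress` module; the dictionary keys and the `find_overlaps` helper are illustrative names, not part of the Terraform code:

```python
import ipaddress
from itertools import combinations

# CIDR plan copied from the account tree; the keys are illustrative labels.
CIDR_PLAN = {
    "dev-us-east-1": "10.1.0.0/16",
    "prod-us-east-1": "10.2.0.0/16",
    "prod-dr-us-west-2": "10.3.0.0/16",
}


def find_overlaps(plan):
    """Return sorted pairs of plan entries whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


if __name__ == "__main__":
    print(find_overlaps(CIDR_PLAN) or "no overlapping CIDRs")
```

A check like this could run in CI before any `terraform plan`, so a conflicting range is caught before a route is ever advertised.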
Service Control Policies
SCPs are the guardrails that prevent expensive mistakes:
| SCP | What it prevents |
|---|---|
| Region restriction | Using any region except us-east-1 and us-west-2 |
| CloudTrail protection | Deleting or modifying audit trails in member accounts |
| GuardDuty protection | Disabling threat detection in member accounts |
| No static credentials | Creating IAM users with passwords or access keys |
| Encryption enforcement | Creating S3 buckets without server-side encryption |
| GPU instance cap | Launching instances larger than p3.2xlarge or g5.xlarge |
| State protection | Modifying Terraform state resources outside CI/CD |
The GPU instance cap is startup-specific — it prevents a p3.16xlarge ($24.48/hr) from running when a p3.2xlarge ($3.06/hr) is what we actually need.
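To make two of these guardrails concrete, here is a rough sketch that builds the region restriction and GPU cap as SCP JSON. It is one plausible shape, not the exact policy used in this series. The family patterns, the list of exempted global services, and the `gpu_cap_denies` helper are assumptions for illustration; the helper only approximates the condition locally and is not IAM's policy evaluator.

```python
import json
from fnmatch import fnmatchcase

ALLOWED_REGIONS = ["us-east-1", "us-west-2"]
# Assumption: only the p3 and g5 families are capped here. Other GPU
# families would need their own patterns.
GPU_FAMILY_PATTERNS = ["p3*", "g5*"]
ALLOWED_GPU_TYPES = ["p3.2xlarge", "g5.xlarge"]
# Assumption: an illustrative list of global services that must stay
# reachable regardless of region. Adjust to the services actually in use.
GLOBAL_SERVICE_ACTIONS = ["iam:*", "organizations:*", "sts:*", "support:*"]


def region_restriction_statement():
    """Deny every regional action outside the approved regions."""
    return {
        "Sid": "DenyOutsideApprovedRegions",
        "Effect": "Deny",
        "NotAction": GLOBAL_SERVICE_ACTIONS,
        "Resource": "*",
        "Condition": {"StringNotEquals": {"aws:RequestedRegion": ALLOWED_REGIONS}},
    }


def gpu_cap_statement():
    """Deny GPU-family launches unless the type is on the allow list."""
    return {
        "Sid": "DenyOversizedGpuInstances",
        "Effect": "Deny",
        "Action": "ec2:RunInstances",
        "Resource": "arn:aws:ec2:*:*:instance/*",
        "Condition": {
            # Both operators must match for the Deny to apply.
            "StringLike": {"ec2:InstanceType": GPU_FAMILY_PATTERNS},
            "StringNotEquals": {"ec2:InstanceType": ALLOWED_GPU_TYPES},
        },
    }


def build_scp():
    return {
        "Version": "2012-10-17",
        "Statement": [region_restriction_statement(), gpu_cap_statement()],
    }


def gpu_cap_denies(instance_type):
    """Local approximation of the GPU condition, for sanity checks only."""
    in_family = any(fnmatchcase(instance_type, p) for p in GPU_FAMILY_PATTERNS)
    return in_family and instance_type not in ALLOWED_GPU_TYPES


if __name__ == "__main__":
    print(json.dumps(build_scp(), indent=2))
```

With these assumptions, a `p3.16xlarge` launch is denied while `p3.2xlarge`, `g5.xlarge`, and non-GPU types such as `t3.micro` pass through untouched.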
Centralized Logging and Security
Management Account
├── Organization CloudTrail → S3 bucket (replaces per-account trails)
├── AWS Config Aggregator → compliance data from all accounts
├── GuardDuty Administrator → findings from all accounts
├── Security Hub → aggregated security posture
└── SNS Topic → alerts for HIGH/CRITICAL findings
Per-account CloudTrail trails from Part 1 get disabled — the organization trail captures everything in one place, cutting duplicate storage costs.
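One way to feed the SNS topic is an EventBridge rule that matches Security Hub findings by severity label. The text above only says that HIGH and CRITICAL findings trigger alerts, so routing through EventBridge is an assumption here, and `should_alert` is a hypothetical helper that mirrors the filter locally:

```python
import json

ALERT_SEVERITIES = ["HIGH", "CRITICAL"]


def finding_alert_pattern():
    """EventBridge event pattern for imported Security Hub findings."""
    return {
        "source": ["aws.securityhub"],
        "detail-type": ["Security Hub Findings - Imported"],
        "detail": {"findings": {"Severity": {"Label": ALERT_SEVERITIES}}},
    }


def should_alert(finding):
    """Mirror the pattern's severity filter for a single finding dict."""
    label = finding.get("Severity", {}).get("Label")
    return label in ALERT_SEVERITIES


if __name__ == "__main__":
    print(json.dumps(finding_alert_pattern(), indent=2))
```

Keeping the severity list in one place means the rule and any local tests stay in step if MEDIUM findings are ever added to the alert path.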
Tailscale Hybrid Networking
On-Prem AI Lab                          AWS
┌─────────────────┐                     ┌──────────────────────────┐
│  NVIDIA GPU     │                     │ Dev VPC (10.1.0.0/16)    │
│  Servers        │◄── Tailscale ──────►│ ┌──────────────────────┐ │
│  (training)     │    mesh network     │ │ Tailscale Subnet     │ │
│                 │                     │ │ Router (t3.micro)    │ │
│                 │                     │ │ ASG min=1 max=1      │ │
└─────────────────┘                     │ └──────────────────────┘ │
                                        └──────────────────────────┘
                                        ┌──────────────────────────┐
                                        │ Prod VPC (10.2.0.0/16)   │
                                   ────►│ ┌──────────────────────┐ │
                                        │ │ Tailscale Subnet     │ │
                                        │ │ Router (t3.micro)    │ │
                                        │ │ ASG min=1 max=1      │ │
                                        │ └──────────────────────┘ │
                                        └──────────────────────────┘
Each subnet router advertises its VPC CIDR to the Tailscale network. On-prem GPUs see AWS private IPs as directly routable. Auth keys live in Secrets Manager. If a router dies, the ASG replaces it automatically.
Tailscale ACLs restrict on-prem access to private subnet CIDRs on ports 443 (HTTPS) and 50051 (gRPC) — just enough for model inference, nothing more.
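As a sketch of what that ACL could look like, the snippet below generates a Tailscale policy document that lets tagged on-prem machines reach private subnets on the two inference ports only. The subnet CIDRs and the `tag:onprem-gpu` tag are hypothetical placeholders (the real subnets come from Part 1), and the check that each subnet sits inside its VPC is an extra safeguard of our own:

```python
import ipaddress
import json

VPC_CIDRS = {"dev": "10.1.0.0/16", "prod": "10.2.0.0/16"}
# Hypothetical private subnet CIDRs; substitute the ones created in Part 1.
PRIVATE_SUBNETS = {
    "dev": ["10.1.10.0/24", "10.1.11.0/24"],
    "prod": ["10.2.10.0/24", "10.2.11.0/24"],
}
INFERENCE_PORTS = [443, 50051]  # HTTPS and gRPC
ONPREM_TAG = "tag:onprem-gpu"   # hypothetical tag for the on-prem GPU hosts


def build_acl_policy():
    """Build a Tailscale policy allowing on-prem hosts to reach inference ports."""
    ports = ",".join(str(p) for p in INFERENCE_PORTS)
    destinations = []
    for env, subnets in PRIVATE_SUBNETS.items():
        vpc = ipaddress.ip_network(VPC_CIDRS[env])
        for cidr in subnets:
            if not ipaddress.ip_network(cidr).subnet_of(vpc):
                raise ValueError(f"{cidr} is outside the {env} VPC {vpc}")
            destinations.append(f"{cidr}:{ports}")
    return {
        "tagOwners": {ONPREM_TAG: ["autogroup:admin"]},
        "acls": [{"action": "accept", "src": [ONPREM_TAG], "dst": destinations}],
    }


if __name__ == "__main__":
    print(json.dumps(build_acl_policy(), indent=2))
```

Since Tailscale ACLs are deny-by-default, a single `accept` rule like this is the whole allow list: anything not named in `dst` stays unreachable from the lab.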
VPC Private Endpoints
| Endpoint Type | Service | Why |
|---|---|---|
| Gateway (free) | S3 | Model artifact downloads, state bucket access |
| Gateway (free) | DynamoDB | State locking |
| Interface | ECR (api + dkr) | Container image pulls without NAT |
| Interface | STS | IAM role assumption for pods |
| Interface | CloudWatch Logs | Log delivery from private subnets |
| Interface | Secrets Manager | Tailscale auth keys, app secrets |
| Interface | EKS, EKS-Auth | Future private EKS API server access |
Gateway endpoints are free. Interface endpoints cost ~$7.20/month each per Availability Zone ($0.01/hour over a 720-hour month), plus per-GB data processing charges, so we're selective: only services that see high traffic or handle sensitive data.
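The interface endpoint bill scales with endpoints, Availability Zones, and VPCs, which is worth working through before committing to the list. A back-of-the-envelope sketch, assuming the $0.01 per endpoint-hour rate behind the ~$7.20 figure and ignoring data processing charges; check current pricing before relying on the numbers:

```python
# Interface endpoint services from the table above (short service names).
INTERFACE_ENDPOINT_SERVICES = [
    "ecr.api", "ecr.dkr", "sts", "logs", "secretsmanager", "eks", "eks-auth",
]
HOURLY_RATE_CENTS = 1   # assumed: $0.01 per endpoint per AZ-hour
HOURS_PER_MONTH = 720   # 30-day month, matching the ~$7.20/month figure


def service_name(region, short_name):
    """Full VPC endpoint service name for a region."""
    return f"com.amazonaws.{region}.{short_name}"


def monthly_cost_usd(endpoint_count, az_count=1, vpc_count=1):
    """Fixed hourly charges only; data processing is billed separately."""
    cents = endpoint_count * az_count * vpc_count * HOURLY_RATE_CENTS * HOURS_PER_MONTH
    return cents / 100


if __name__ == "__main__":
    n = len(INTERFACE_ENDPOINT_SERVICES)
    print(f"{n} endpoints, 1 AZ, 1 VPC: ${monthly_cost_usd(n):.2f}/month")
    print(f"{n} endpoints, 2 AZs, 2 VPCs: ${monthly_cost_usd(n, 2, 2):.2f}/month")
```

Under these assumptions the seven endpoints come to about $50/month in a single AZ of one VPC, and roughly $200/month across two AZs in both Dev and Prod, which is why the list stays short.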
Cost Controls
| Control | Detail |
|---|---|
| Monthly budgets | $100 (dev), $200 (mgmt), $500 (prod) |
| Alert thresholds | 50%, 80%, 100% of budget |
| Forecast alerts | Trigger when forecasted spend reaches 90% of budget |
| Anomaly detection | Alert when spend runs $10 or more above the expected baseline |
| GPU SCP | Block instances larger than p3.2xlarge / g5.xlarge |
| Scheduled stops | Dev Tailscale routers off outside business hours |
| NAT Instance | t3.micro in dev instead of managed NAT Gateway |
| Free-tier KMS | AWS-managed keys everywhere except model artifacts |
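To see what those thresholds mean in dollars, and how quickly an oversized GPU instance would burn through a budget, here is a small worked calculation. The budget and threshold values come from the table; the function names are illustrative:

```python
MONTHLY_BUDGETS_USD = {"dev": 100, "mgmt": 200, "prod": 500}
ALERT_THRESHOLDS_PCT = [50, 80, 100]


def alert_amounts(budget_usd):
    """Dollar amounts at which each threshold alert fires."""
    return [budget_usd * pct / 100 for pct in ALERT_THRESHOLDS_PCT]


def hours_until_budget_spent(budget_usd, hourly_rate_usd):
    """Hours a single resource at this hourly rate takes to consume the budget."""
    return budget_usd / hourly_rate_usd


if __name__ == "__main__":
    for account, budget in MONTHLY_BUDGETS_USD.items():
        print(account, alert_amounts(budget))
    # On-demand rates quoted earlier in this post.
    print(f"p3.16xlarge vs dev budget: {hours_until_budget_spent(100, 24.48):.1f} h")
    print(f"p3.2xlarge vs dev budget: {hours_until_budget_spent(100, 3.06):.1f} h")
```

At $24.48/hr, a forgotten p3.16xlarge would use up the entire Dev budget in about four hours, so the budget alerts alone would arrive too late. That is the case for enforcing the cap with an SCP rather than relying on alerts.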
Backup and DR
| | Dev | Prod |
|---|---|---|
| Daily backups | 7-day retention | 30-day retention |
| Weekly backups | 30-day retention | 90-day retention |
| Cross-region replication | No (cost) | Yes → us-west-2 |
| DR VPC | No | Pre-provisioned (10.3.0.0/16) |
| DR runbook | No | Documented |
Backup selection is tag-based — tag a resource with BackupPolicy = "daily" and it's automatically included. No manual snapshot management.
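The selection logic is simple enough to model in a few lines. The sketch below mimics tag-based selection and the retention table locally; it does not call AWS Backup, and the resource identifiers are made up for illustration:

```python
BACKUP_TAG_KEY = "BackupPolicy"
# Retention in days, from the table above.
RETENTION_DAYS = {
    "dev": {"daily": 7, "weekly": 30},
    "prod": {"daily": 30, "weekly": 90},
}


def select_for_backup(resources, policy):
    """Return ids of resources whose BackupPolicy tag equals the given policy."""
    return [
        r["id"]
        for r in resources
        if r.get("tags", {}).get(BACKUP_TAG_KEY) == policy
    ]


# Made-up inventory for illustration only.
EXAMPLE_RESOURCES = [
    {"id": "volume-a", "tags": {"BackupPolicy": "daily"}},
    {"id": "volume-b", "tags": {"Environment": "dev"}},
    {"id": "database-c", "tags": {"BackupPolicy": "daily"}},
]

if __name__ == "__main__":
    print(select_for_backup(EXAMPLE_RESOURCES, "daily"))
    print(RETENTION_DAYS["prod"]["daily"], "days of daily retention in prod")
```

The untagged resource is silently skipped, which is the flip side of tag-based selection: a missing tag means no backup, so tagging needs its own compliance check.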
IAM Identity Center
| Group | Permission Set | Accounts |
|---|---|---|
| PlatformEngineers | AdministratorAccess | All three |
| Developers | DeveloperAccess (no IAM/Orgs/billing) | Dev only |
| Auditors | ReadOnlyAccess | All three |
No static IAM credentials anywhere. Everyone authenticates through the SSO portal. Sessions expire after 8 hours.
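The table expands into one assignment per group, permission set, and account. A short sketch of that expansion, with account labels as placeholders for real account IDs and the 8-hour session written as the ISO 8601 duration `PT8H`:

```python
ACCOUNTS = ["management", "dev", "prod"]  # placeholder labels, not account IDs
SESSION_DURATION = "PT8H"  # ISO 8601 duration for the 8-hour session limit

# group -> (permission set, accounts it is assigned in)
GROUP_ACCESS = {
    "PlatformEngineers": ("AdministratorAccess", ACCOUNTS),
    "Developers": ("DeveloperAccess", ["dev"]),
    "Auditors": ("ReadOnlyAccess", ACCOUNTS),
}


def expand_assignments(group_access):
    """Flatten the access table into (group, permission_set, account) tuples."""
    return [
        (group, permission_set, account)
        for group, (permission_set, accounts) in group_access.items()
        for account in accounts
    ]


if __name__ == "__main__":
    for assignment in expand_assignments(GROUP_ACCESS):
        print(assignment)
```

Three groups across three accounts come to seven assignments in total, small enough to review by eye whenever the table changes.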
Benefits
For the Engineering Team
- One portal, all accounts — SSO eliminates credential juggling and reduces the risk of leaked access keys
- Guardrails, not gates — SCPs prevent dangerous actions without requiring approval workflows that slow down development
- On-prem GPU access — Tailscale connects the AI lab to AWS without VPN hardware or networking expertise
For the Business
- Cost visibility — per-account budgets and anomaly detection catch spending problems in hours, not at month-end
- Audit readiness — centralized CloudTrail, Config, and Security Hub provide the compliance evidence investors and customers ask for
- DR capability — cross-region backups and a pre-provisioned DR VPC mean production can survive a regional outage
For the ML Platform
- Model security — encrypted artifact storage with scoped IAM means models are protected from unauthorized access and exfiltration
- Private connectivity — VPC endpoints and Tailscale keep model traffic off the public internet
- GPU cost control — SCP-enforced instance caps prevent accidental $24/hr GPU bills
Migration from Part 1
The existing single-account infrastructure doesn't get thrown away. The migration strategy:
- Create the Organization and member accounts via Control Tower
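- Use Terraform state move operations to re-home the Part 1 state into the new per-module, per-account layout. A state move only changes what Terraform tracks; AWS cannot transfer a VPC between accounts, so any network that has to live in a different account is rebuilt there from the same modules
- Preserve existing CIDRs (10.1.0.0/16 → Dev Account, 10.2.0.0/16 → Prod Account)
- Replace per-account CloudTrail trails with the organization trail
- Update the CI/CD pipeline for multi-account OIDC authentication
- Zero downtime throughout: the original environment keeps running until its replacement is verified and traffic is cut over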
Terraform Module Structure
terraform/
├── organization/ # AWS Organizations, Control Tower, SCPs
├── account-baseline/ # Per-account baseline (Config, GuardDuty enrollment)
├── security-services/ # Security Hub, org CloudTrail, Config aggregator
├── networking/ # VPC endpoints, Tailscale subnet routers
├── backup/ # AWS Backup plans, vaults, cross-region replication
├── cost-management/ # Budgets, anomaly detection, Cost Explorer
└── identity/ # IAM Identity Center, permission sets, groups
Each module has its own state file keyed by {module}/{account}/{env}/terraform.tfstate. A failure in one module doesn't affect others.
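The key scheme is easy to get subtly wrong by hand, so it helps to generate it. A minimal sketch of a key builder following the `{module}/{account}/{env}/terraform.tfstate` pattern; the validation rules are our own assumption, not something Terraform enforces:

```python
MODULES = [
    "organization", "account-baseline", "security-services",
    "networking", "backup", "cost-management", "identity",
]


def state_key(module, account, env):
    """Build the state object key for one module in one account and environment."""
    for name, value in (("module", module), ("account", account), ("env", env)):
        if not value or "/" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return f"{module}/{account}/{env}/terraform.tfstate"


if __name__ == "__main__":
    for module in MODULES:
        print(state_key(module, "dev", "dev"))
```

Because every combination maps to a distinct key, two pipelines working on different modules never contend for the same state file or lock.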
What's Next: Part 3 — EKS with NVIDIA GPU Nodes
With the multi-account foundation in place, Part 3 will deploy the compute layer:
- EKS clusters in Dev and Prod accounts on the private subnets from Part 1
- NVIDIA GPU node groups (p3.2xlarge / g5.xlarge) for Random Forest and LSTM model inference
- GPU device plugin and NVIDIA container runtime
- Private API server endpoint via the EKS VPC endpoints provisioned in Part 2
- Model serving pipeline — pull artifacts from the encrypted S3 bucket, serve via gRPC on port 50051
The subnets are tagged. The endpoints are in place. The security groups are defined. The model artifact buckets are encrypted. Part 3 is where the models start running.
This is Part 2 of the Mica Mirai ML Infrastructure series. Part 1 covered the network foundation. Part 3 will cover EKS deployment with NVIDIA GPU support for serving Random Forest and LSTM models.