Building ML Infrastructure on AWS with Kiro: Part 2 — Multi-Account Landing Zone
Part 2 of a series on building production-grade infrastructure for deploying custom Random Forest and LSTM models on AWS.
Where We Left Off
In Part 1, we built the network foundation — VPCs, subnets, NAT egress, route tables, flow logs, CloudTrail, and a CI/CD pipeline — all in a single AWS account using Kiro's spec-driven workflow. Two environments deployed, both passing idempotency checks, in one session.
That got us running. But a single AWS account with two VPCs is a prototype topology, not a production one. When you're building infrastructure to serve ML models — especially with on-prem GPU resources in the mix — you need proper account boundaries, centralized governance, and cost controls that catch problems before they drain your runway.
Part 2 is about growing up without over-engineering. We're a small startup. Every dollar matters. But so does not getting breached, not losing data, and not waking up to a $10,000 bill because someone launched a p3.16xlarge in the wrong account.
Motivation
Why Multi-Account Now?
The single-account approach from Part 1 has real limitations:
- Blast radius — a misconfigured IAM policy in dev can affect prod resources in the same account
- Billing visibility — you can't see dev vs prod costs without meticulous tagging discipline
- Compliance — auditors want to see environment isolation at the account level, not just the VPC level
- Credential scope — a compromised dev credential has access to prod infrastructure
AWS Organizations with Control Tower gives us account-level isolation with centralized governance. Three accounts (Management, Dev, Prod) is the minimum viable multi-account setup for a startup.
Why Tailscale?
We have on-prem NVIDIA GPU servers for model training. Traditional AWS VPN (Site-to-Site VPN or Direct Connect) costs $36-73/month minimum and requires static public IPs or dedicated hardware. Tailscale gives us WireGuard-based mesh networking for $0 (free tier covers small teams) with no VPN appliances to manage. A t3.micro subnet router in each VPC advertises routes to the Tailscale network, and our on-prem GPUs can reach AWS private subnets as if they were on the same network.
Why Well-Architected Now?
It's tempting to defer governance until you're bigger. But the cost of retrofitting security, reliability, and cost controls into an existing architecture is 10x the cost of building them in from the start. The Well-Architected Framework gives us a checklist, not a burden — and most of the controls (SCPs, GuardDuty, Config rules, billing alerts) are either free or pennies/month at startup scale.
Desired Outcome
By the end of Part 2, we want:
- Three isolated AWS accounts — Management (governance only), Dev (development workloads), Prod (production workloads) — with Control Tower guardrails enforced
- Centralized security — one CloudTrail trail for all accounts, GuardDuty threat detection everywhere, Security Hub aggregating findings in one dashboard
- Hybrid connectivity — on-prem AI resources reaching AWS private subnets via Tailscale, no VPN hardware, no public endpoints
- Cost guardrails — billing alerts at $100/$200/$500 per account, anomaly detection, GPU instance size caps via SCP
- DR readiness — cross-region backup replication for prod, a pre-provisioned DR VPC in us-west-2, documented failover runbook
- Private AWS API access — VPC endpoints for S3, ECR, STS, CloudWatch, Secrets Manager, and EKS so traffic never leaves the AWS backbone
- AI model security — encrypted model artifact storage, scoped IAM for GPU nodes, inference endpoint security groups
- Single sign-on — IAM Identity Center with role-based access (platform engineers, developers, auditors) across all accounts
- Everything as code — Terraform modules for each concern, multi-account CI/CD pipeline, consistent naming and tagging from Part 1
What We're Building
Account Structure
graph TB
subgraph Org["AWS Organization"]
MgmtAcct["Management Account<br/>━━━━━━━━━━━━━━━━━━<br/>• AWS Organizations<br/>• Control Tower<br/>• IAM Identity Center (SSO)<br/>• Consolidated Billing<br/>• Org CloudTrail<br/>• Config Aggregator<br/>• GuardDuty Administrator<br/>• Security Hub<br/>• Cost Anomaly Detection<br/>• AWS Budgets<br/>━━━━━━━━━━━━━━━━━━<br/>No workloads"]
subgraph WorkloadsOU["OU: Workloads"]
DevAcct["Dev Account<br/>━━━━━━━━━━━━━━━━━━<br/>• VPC 10.1.0.0/16 (us-east-1)<br/>• NAT Instance (t3.micro)<br/>• Tailscale Subnet Router<br/>• VPC Endpoints<br/>• AWS Backup (7d/30d)<br/>• GuardDuty member<br/>• Model Artifact S3 (CMK)<br/>• Budget: $100/mo"]
ProdAcct["Prod Account<br/>━━━━━━━━━━━━━━━━━━<br/>• VPC 10.2.0.0/16 (us-east-1)<br/>• VPC 10.3.0.0/16 (us-west-2 DR)<br/>• 2x NAT Gateway<br/>• Tailscale Subnet Router<br/>• VPC Endpoints<br/>• AWS Backup (30d/90d + DR)<br/>• GuardDuty member<br/>• Model Artifact S3 (CMK)<br/>• Budget: $500/mo"]
end
subgraph SecurityOU["OU: Security (reserved)"]
Future["Future Security Account"]
end
end
subgraph SCPs["Service Control Policies"]
SCP1["Deny regions outside us-east-1, us-west-2"]
SCP2["Deny CloudTrail deletion"]
SCP3["Deny GuardDuty disable"]
SCP4["Deny IAM user creation"]
SCP5["Deny unencrypted S3"]
SCP6["Deny GPU larger than p3.2xlarge / g5.xlarge"]
SCP7["Deny state bucket modification"]
end
SCPs -->|attached to| WorkloadsOU
style MgmtAcct fill:#1a2a4a,stroke:#4a7abf
style DevAcct fill:#1a3a1a,stroke:#4abf4a
style ProdAcct fill:#3a2a1a,stroke:#bf7a4a
style Future fill:#2a2a2a,stroke:#666,stroke-dasharray: 5 5
The Management Account runs no workloads — only Organizations, Control Tower, IAM Identity Center, consolidated billing, and centralized security services. SCPs enforce this kind of boundary on the workload OUs, but they don't apply to the management account itself, which is all the more reason to keep it workload-free.
Service Control Policies
SCPs are the guardrails that prevent expensive mistakes:
| SCP | What it prevents |
|---|---|
| Region restriction | Using any region except us-east-1 and us-west-2 |
| CloudTrail protection | Deleting or modifying audit trails in member accounts |
| GuardDuty protection | Disabling threat detection in member accounts |
| No static credentials | Creating IAM users with passwords or access keys |
| Encryption enforcement | Creating S3 buckets without server-side encryption |
| GPU instance cap | Launching instances larger than p3.2xlarge or g5.xlarge |
| State protection | Modifying Terraform state resources outside CI/CD |
The GPU instance cap is startup-specific — it prevents a p3.16xlarge ($24.48/hr) from running when a p3.2xlarge ($3.06/hr) is what we actually need.
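Here's roughly what that guardrail looks like in the organization module. This is a sketch rather than the exact policy from the repo: the policy name and OU reference are placeholders, and the instance-type lists are illustrative. The two condition blocks AND together, so the deny only fires when the requested type is a GPU family and isn't one of the approved sizes.

```hcl
resource "aws_organizations_policy" "gpu_instance_cap" {
  name        = "deny-large-gpu-instances" # placeholder name
  description = "Cap accelerated-compute launches at p3.2xlarge / g5.xlarge"
  type        = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyLargeGpuInstances"
      Effect   = "Deny"
      Action   = "ec2:RunInstances"
      Resource = "arn:aws:ec2:*:*:instance/*"
      Condition = {
        # Both blocks must match: a GPU family AND not an approved size
        StringLike      = { "ec2:InstanceType" = ["p*", "g*"] }
        StringNotEquals = { "ec2:InstanceType" = ["p3.2xlarge", "g5.xlarge"] }
      }
    }]
  })
}

# Attached to the Workloads OU, never to the management account
resource "aws_organizations_policy_attachment" "gpu_cap" {
  policy_id = aws_organizations_policy.gpu_instance_cap.id
  target_id = aws_organizations_organizational_unit.workloads.id # OU defined elsewhere
}
```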
Centralized Security
graph TB
subgraph DevAcct["Dev Account"]
DevGD["GuardDuty<br/>(member)"]
DevConfig["AWS Config<br/>(recording)"]
DevCT["CloudTrail<br/>(org trail member)"]
end
subgraph ProdAcct["Prod Account"]
ProdGD["GuardDuty<br/>(member)"]
ProdConfig["AWS Config<br/>(recording)"]
ProdCT["CloudTrail<br/>(org trail member)"]
end
subgraph MgmtAcct["Management Account"]
OrgTrail["Organization CloudTrail<br/>→ S3 (SSE, public blocked)<br/>Log validation enabled<br/>Multi-region"]
ConfigAgg["Config Aggregator<br/>Rules: encryption, SSH,<br/>flow logs, IAM hygiene"]
GDAdmin["GuardDuty Administrator<br/>S3 + EKS protection<br/>us-east-1 + us-west-2"]
SecHub["Security Hub<br/>AWS Foundational Security<br/>Best Practices"]
SNS["SNS: Platform Alerts<br/>→ Email + integrations"]
end
DevGD -->|findings| GDAdmin
ProdGD -->|findings| GDAdmin
DevConfig -->|compliance| ConfigAgg
ProdConfig -->|compliance| ConfigAgg
DevCT -->|logs| OrgTrail
ProdCT -->|logs| OrgTrail
GDAdmin -->|HIGH/CRITICAL| SecHub
ConfigAgg -->|non-compliant| SecHub
GDAdmin -->|HIGH/CRITICAL| SNS
SecHub -->|CRITICAL| SNS
ConfigAgg -->|non-compliant| SNS
style MgmtAcct fill:#1a2a4a,stroke:#4a7abf
style DevAcct fill:#1a3a1a,stroke:#4abf4a
style ProdAcct fill:#3a2a1a,stroke:#bf7a4a
Per-account CloudTrail trails from Part 1 get disabled — the organization trail captures everything in one place, cutting duplicate storage costs.
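A minimal sketch of the organization trail, assuming the log bucket (and its CloudTrail delivery bucket policy) is defined elsewhere in the security-services module and that trusted access for CloudTrail is already enabled on the Organization:

```hcl
resource "aws_cloudtrail" "org" {
  name                          = "org-trail"                     # placeholder name
  s3_bucket_name                = aws_s3_bucket.org_trail_logs.id # log bucket defined elsewhere
  is_organization_trail         = true                            # requires CloudTrail trusted access in Organizations
  is_multi_region_trail         = true
  include_global_service_events = true
  enable_log_file_validation    = true
}
```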
Tailscale Hybrid Networking
graph LR
subgraph OnPrem["On-Prem AI Lab"]
GPU1["GPU Server 1<br/>NVIDIA A100<br/>Tailscale client"]
GPU2["GPU Server 2<br/>NVIDIA A100<br/>Tailscale client"]
end
subgraph TailscaleCP["Tailscale Coordination"]
Coord["Tailscale Control Plane<br/>ACLs enforce:<br/>• Private subnets only<br/>• Ports 443, 50051 only"]
end
subgraph DevVPC["Dev VPC 10.1.0.0/16"]
DevTSR["Tailscale Subnet Router<br/>t3.micro · Private Subnet<br/>ASG min=1 max=1<br/>Auth: Secrets Manager<br/>Advertises: 10.1.0.0/16"]
DevNAT["NAT Instance<br/>(outbound for TS registration)"]
DevWorkload["Future: EKS Pods<br/>Model Inference<br/>:443 :50051"]
end
subgraph ProdVPC["Prod VPC 10.2.0.0/16"]
ProdTSR["Tailscale Subnet Router<br/>t3.micro · Private Subnet<br/>ASG min=1 max=1<br/>Auth: Secrets Manager<br/>Advertises: 10.2.0.0/16"]
ProdNATGW["NAT Gateway<br/>(outbound for TS registration)"]
ProdWorkload["Future: EKS Pods<br/>Model Inference<br/>:443 :50051"]
end
GPU1 <-->|WireGuard UDP 41641| Coord
GPU2 <-->|WireGuard UDP 41641| Coord
DevTSR <-->|WireGuard UDP 41641| Coord
ProdTSR <-->|WireGuard UDP 41641| Coord
GPU1 -.->|10.1.x.x:443| DevWorkload
GPU1 -.->|10.2.x.x:50051| ProdWorkload
GPU2 -.->|10.1.x.x:50051| DevWorkload
DevTSR -->|outbound registration| DevNAT
ProdTSR -->|outbound registration| ProdNATGW
style OnPrem fill:#3a1a1a,stroke:#bf4a4a
style TailscaleCP fill:#1a1a3a,stroke:#4a4abf
Each subnet router advertises its VPC CIDR to the Tailscale network. On-prem GPUs see AWS private IPs as directly routable. Auth keys live in Secrets Manager. If a router dies, the ASG replaces it automatically.
Tailscale ACLs restrict on-prem access to private subnet CIDRs on ports 443 (HTTPS) and 50051 (gRPC) — just enough for model inference, nothing more.
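A sketch of the Dev subnet router under those constraints: the AMI lookup, instance profile, and secret name are placeholders, and the advertised route still has to be approved in the Tailscale admin console (or auto-approved via the ACL's autoApprovers) before traffic flows.

```hcl
resource "aws_launch_template" "tailscale_router" {
  name_prefix   = "tailscale-router-"
  image_id      = data.aws_ami.al2023.id # AMI lookup assumed elsewhere
  instance_type = "t3.micro"

  iam_instance_profile {
    name = aws_iam_instance_profile.tailscale.name # grants secretsmanager:GetSecretValue on the auth key
  }

  user_data = base64encode(<<-EOT
    #!/bin/bash
    set -euo pipefail
    # Enable IP forwarding so this node can route for the VPC CIDR
    echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-tailscale.conf
    sysctl -p /etc/sysctl.d/99-tailscale.conf
    curl -fsSL https://tailscale.com/install.sh | sh
    # Secret name is a placeholder
    AUTH_KEY=$(aws secretsmanager get-secret-value --region us-east-1 \
      --secret-id tailscale/auth-key --query SecretString --output text)
    tailscale up --authkey="$AUTH_KEY" --advertise-routes=10.1.0.0/16 --hostname=dev-subnet-router
  EOT
  )
}

resource "aws_autoscaling_group" "tailscale_router" {
  name_prefix         = "tailscale-router-"
  min_size            = 1
  max_size            = 1
  desired_capacity    = 1
  vpc_zone_identifier = var.private_subnet_ids # assumed variable
  health_check_type   = "EC2"

  launch_template {
    id      = aws_launch_template.tailscale_router.id
    version = "$Latest"
  }
}
```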
VPC Private Endpoints
| Endpoint Type | Service | Why |
|---|---|---|
| Gateway (free) | S3 | Model artifact downloads, state bucket access |
| Gateway (free) | DynamoDB | State locking |
| Interface | ECR (api + dkr) | Container image pulls without NAT |
| Interface | STS | IAM role assumption for pods |
| Interface | CloudWatch Logs | Log delivery from private subnets |
| Interface | Secrets Manager | Tailscale auth keys, app secrets |
| Interface | EKS, EKS-Auth | Future private EKS API server access |
Gateway endpoints are free. Interface endpoints cost roughly $7.30/month per AZ (plus a small data processing charge), so we're selective — only services that see high traffic or handle sensitive data.
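Two representative endpoints from the networking module, as a sketch; the variable names and the shared endpoint security group are assumptions, and the remaining interface endpoints follow the same shape as the ECR one.

```hcl
# Gateway endpoint for S3: free, attached to the private route tables
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id # assumed variable
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# Interface endpoint for the ECR API: billed per AZ-hour plus data processing
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id] # allows 443 from the VPC CIDR
  private_dns_enabled = true
}
```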
Cost Controls
graph TB
subgraph MgmtAcct["Management Account"]
Budgets["AWS Budgets"]
Anomaly["Cost Anomaly Detection"]
SNS["SNS: Cost Alerts"]
BudgetDev["Dev Budget: $100/mo<br/>Alerts: 50% · 80% · 100%<br/>Forecast: 90%"]
BudgetMgmt["Mgmt Budget: $200/mo<br/>Alerts: 50% · 80% · 100%<br/>Forecast: 90%"]
BudgetProd["Prod Budget: $500/mo<br/>Alerts: 50% · 80% · 100%<br/>Forecast: 90%"]
AnomalyRule["Anomaly Monitor<br/>By service + by account<br/>Threshold: $10 above baseline"]
end
subgraph SCPs["SCP Cost Guards"]
GPUCap["GPU Instance Cap<br/>Max: p3.2xlarge / g5.xlarge"]
RegionLock["Region Lock<br/>Only: us-east-1, us-west-2"]
end
subgraph DevAcct["Dev Account — Cost Optimizations"]
NATInst["NAT Instance t3.micro<br/>vs NAT Gateway ($32/mo savings)"]
ShortLogs["Short log retention<br/>Flow: 14d · CloudTrail: 90d"]
GWEndpoints["Gateway Endpoints (free)<br/>S3 · DynamoDB"]
ScheduledStop["Tailscale Router<br/>Scheduled stop 8PM-8AM"]
PayPerReq["DynamoDB PAY_PER_REQUEST"]
end
Budgets --> BudgetDev
Budgets --> BudgetMgmt
Budgets --> BudgetProd
Anomaly --> AnomalyRule
BudgetDev -->|threshold breach| SNS
BudgetMgmt -->|threshold breach| SNS
BudgetProd -->|threshold breach| SNS
AnomalyRule -->|anomaly detected| SNS
SCPs -->|enforced on| DevAcct
style MgmtAcct fill:#1a2a4a,stroke:#4a7abf
style DevAcct fill:#1a3a1a,stroke:#4abf4a
style SCPs fill:#3a1a1a,stroke:#bf4a4a
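Each budget is only a few lines of Terraform. A sketch of the Dev budget, with the SNS topic and account-ID variable assumed; the 50% and 100% alerts repeat the pattern of the 80% notification shown here.

```hcl
resource "aws_budgets_budget" "dev_monthly" {
  name         = "dev-account-monthly" # placeholder name
  budget_type  = "COST"
  limit_amount = "100"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "LinkedAccount"
    values = [var.dev_account_id] # assumed variable
  }

  # 80% of actual spend; the 50% and 100% alerts differ only in threshold
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
  }

  # Forecast-based alert fires before the money is actually spent
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 90
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
  }
}
```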
Backup and DR
graph LR
subgraph DevAcct["Dev Account — us-east-1"]
DevVault["Backup Vault<br/>AWS-managed KMS"]
DevDaily["Daily Backup<br/>Retain: 7 days"]
DevWeekly["Weekly Backup<br/>Retain: 30 days"]
DevResources["Tagged Resources<br/>BackupPolicy: daily"]
end
subgraph ProdAcct["Prod Account"]
subgraph ProdEast["us-east-1 (Primary)"]
ProdVault["Backup Vault<br/>AWS-managed KMS"]
ProdDaily["Daily Backup<br/>Retain: 30 days"]
ProdWeekly["Weekly Backup<br/>Retain: 90 days"]
ProdResources["Tagged Resources<br/>BackupPolicy: daily"]
ProdVPC["Prod VPC<br/>10.2.0.0/16"]
end
subgraph ProdWest["us-west-2 (DR)"]
DRVault["DR Backup Vault<br/>Cross-region copy"]
DRVPC["DR VPC<br/>10.3.0.0/16<br/>Same subnet topology"]
DRRunbook["DR Runbook<br/>Failover · DNS · Restore"]
end
end
DevResources -->|backup| DevDaily
DevResources -->|backup| DevWeekly
DevDaily --> DevVault
DevWeekly --> DevVault
ProdResources -->|backup| ProdDaily
ProdResources -->|backup| ProdWeekly
ProdDaily --> ProdVault
ProdWeekly --> ProdVault
ProdVault -->|cross-region replication| DRVault
ProdVPC -.->|failover ready| DRVPC
style DevAcct fill:#1a3a1a,stroke:#4abf4a
style ProdEast fill:#3a2a1a,stroke:#bf7a4a
style ProdWest fill:#3a2a1a,stroke:#bf7a4a,stroke-dasharray: 5 5
Backup selection is tag-based — tag a resource with BackupPolicy = "daily" and it's automatically included. No manual snapshot management.
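A sketch of the prod plan under that tagging scheme, assuming the vaults and the backup IAM role are defined alongside it; the retention values map to the diagram above.

```hcl
resource "aws_backup_plan" "prod" {
  name = "prod-backup" # placeholder name

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.prod.name
    schedule          = "cron(0 5 * * ? *)" # 05:00 UTC daily

    lifecycle {
      delete_after = 30 # days, per the prod daily retention above
    }

    # Cross-region copy into the us-west-2 DR vault
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn # vault created via a us-west-2 provider alias

      lifecycle {
        delete_after = 30
      }
    }
  }
}

# Tag-based selection: anything tagged BackupPolicy = "daily" is picked up automatically
resource "aws_backup_selection" "prod_daily" {
  name         = "prod-daily-selection"
  plan_id      = aws_backup_plan.prod.id
  iam_role_arn = aws_iam_role.backup.arn # role with the AWSBackupServiceRolePolicyForBackup managed policy

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "BackupPolicy"
    value = "daily"
  }
}
```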
IAM Identity Center
| Group | Permission Set | Accounts |
|---|---|---|
| PlatformEngineers | AdministratorAccess | All three |
| Developers | DeveloperAccess (no IAM/Orgs/billing) | Dev only |
| Auditors | ReadOnlyAccess | All three |
No static IAM credentials anywhere. Everyone authenticates through the SSO portal. Sessions expire after 8 hours.
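A sketch of the Auditors permission set and its assignment across accounts; the group ID and account-ID list are assumed variables.

```hcl
data "aws_ssoadmin_instances" "this" {}

resource "aws_ssoadmin_permission_set" "auditors" {
  name             = "ReadOnlyAccess"
  instance_arn     = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  session_duration = "PT8H" # 8-hour sessions, ISO 8601 duration
}

resource "aws_ssoadmin_managed_policy_attachment" "auditors_readonly" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.auditors.arn
  managed_policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# Assign the Auditors group to all three accounts
resource "aws_ssoadmin_account_assignment" "auditors" {
  for_each = toset(var.all_account_ids) # assumed list of the three account IDs

  instance_arn       = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.auditors.arn
  principal_id       = var.auditors_group_id # Identity Store group ID, assumed
  principal_type     = "GROUP"
  target_id          = each.value
  target_type        = "AWS_ACCOUNT"
}
```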
Full Architecture
graph TB
subgraph OnPrem["On-Prem AI Lab"]
GPU["NVIDIA GPU Servers<br/>Model Training"]
end
subgraph GitHub["GitHub"]
GHA["GitHub Actions<br/>OIDC Authentication"]
end
subgraph MgmtAcct["Management Account"]
OrgTrail["Org CloudTrail → S3"]
ConfigAgg["Config Aggregator"]
GDAdmin["GuardDuty Admin"]
SecHub["Security Hub"]
SSO["IAM Identity Center<br/>PlatformEngineers | Developers | Auditors"]
Budgets["AWS Budgets<br/>$100 dev | $200 mgmt | $500 prod"]
Anomaly["Cost Anomaly Detection"]
SNSMgmt["SNS: Platform Alerts"]
StateBucket["S3: Terraform State<br/>Cross-region replication → us-west-2"]
LockTable["DynamoDB: State Lock"]
end
subgraph DevAcct["Dev Account — us-east-1"]
subgraph DevVPC["VPC 10.1.0.0/16"]
DevPriv["Private Subnets<br/>10.1.10.0/24 · 10.1.11.0/24"]
DevNAT["NAT Instance t3.micro + EIP"]
DevTS["Tailscale Subnet Router<br/>Advertises 10.1.0.0/16"]
DevVPCE["VPC Endpoints<br/>S3 · DynamoDB · ECR · STS · Logs"]
end
DevModels["S3: Model Artifacts (CMK)"]
end
subgraph ProdAcct["Prod Account — us-east-1"]
subgraph ProdVPC["VPC 10.2.0.0/16"]
ProdPriv["Private Subnets<br/>10.2.10.0/24 · 10.2.11.0/24"]
ProdNAT["2x NAT Gateway + 2 EIPs"]
ProdTS["Tailscale Subnet Router<br/>Advertises 10.2.0.0/16"]
ProdVPCE["VPC Endpoints<br/>S3 · DynamoDB · ECR · STS · Logs"]
end
subgraph DRVPC["DR VPC 10.3.0.0/16 — us-west-2"]
DRPriv["Private Subnets (pre-provisioned)"]
end
ProdModels["S3: Model Artifacts (CMK)"]
end
GPU <-->|Tailscale WireGuard| DevTS
GPU <-->|Tailscale WireGuard| ProdTS
GHA -->|OIDC| MgmtAcct
GHA -->|OIDC| DevAcct
GHA -->|OIDC| ProdAcct
DevAcct -.->|findings| GDAdmin
ProdAcct -.->|findings| GDAdmin
DevAcct -.->|config| ConfigAgg
ProdAcct -.->|config| ConfigAgg
GDAdmin -.->|HIGH/CRITICAL| SNSMgmt
Budgets -.->|threshold alerts| SNSMgmt
Anomaly -.->|anomaly alerts| SNSMgmt
style MgmtAcct fill:#1a2a4a,stroke:#4a7abf
style DevAcct fill:#1a3a1a,stroke:#4abf4a
style ProdAcct fill:#3a2a1a,stroke:#bf7a4a
style OnPrem fill:#3a1a1a,stroke:#bf4a4a
style DRVPC fill:#3a2a1a,stroke:#bf7a4a,stroke-dasharray: 5 5
Benefits
For the Engineering Team
- One portal, all accounts — SSO eliminates credential juggling and reduces the risk of leaked access keys
- Guardrails, not gates — SCPs prevent dangerous actions without requiring approval workflows that slow down development
- On-prem GPU access — Tailscale connects the AI lab to AWS without VPN hardware or networking expertise
For the Business
- Cost visibility — per-account budgets and anomaly detection catch spending problems in hours, not at month-end
- Audit readiness — centralized CloudTrail, Config, and Security Hub provide the compliance evidence investors and customers ask for
- DR capability — cross-region backups and a pre-provisioned DR VPC mean production can survive a regional outage
For the ML Platform
- Model security — encrypted artifact storage with scoped IAM means models are protected from unauthorized access and exfiltration
- Private connectivity — VPC endpoints and Tailscale keep model traffic off the public internet
- GPU cost control — SCP-enforced instance caps prevent accidental $24/hr GPU bills
Migration from Part 1
The existing single-account infrastructure doesn't get thrown away. The migration strategy:
- Create the Organization and member accounts via Control Tower
- Use Terraform state move operations to transfer VPC resources to the new accounts
- Preserve existing CIDRs (10.1.0.0/16 → Dev Account, 10.2.0.0/16 → Prod Account)
- Replace per-account CloudTrail trails with the organization trail
- Update the CI/CD pipeline for multi-account OIDC authentication
- Zero downtime throughout — no resources destroyed and recreated
Terraform Module Structure
terraform/
├── organization/ # AWS Organizations, Control Tower, SCPs
├── account-baseline/ # Per-account baseline (Config, GuardDuty enrollment)
├── security-services/ # Security Hub, org CloudTrail, Config aggregator
├── networking/ # VPC endpoints, Tailscale subnet routers
├── backup/ # AWS Backup plans, vaults, cross-region replication
├── cost-management/ # Budgets, anomaly detection, Cost Explorer
└── identity/ # IAM Identity Center, permission sets, groups
Each module has its own state file keyed by {module}/{account}/{env}/terraform.tfstate. A failure in one module doesn't affect others.
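A backend block for one module/account pair might look like this sketch, with the bucket and lock-table names as placeholders:

```hcl
terraform {
  backend "s3" {
    bucket         = "mikamirai-terraform-state"            # placeholder bucket name
    key            = "networking/dev/dev/terraform.tfstate" # {module}/{account}/{env}, values illustrative
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"                 # placeholder lock table
    encrypt        = true
  }
}
```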
Multi-Account CI/CD Pipeline
flowchart TB
Dev[Developer] -->|push branch| PR[Pull Request]
PR -->|trigger| Plan
subgraph Plan["Plan Job (PR)"]
P1[checkout] --> P2[setup terraform]
P2 --> P3[OIDC auth → each account]
P3 --> P4[validate + fmt all modules]
P4 --> P5[plan all affected modules]
P5 --> P6[post plans as PR comment]
end
PR -->|merge to main| Apply
subgraph Apply["Apply Jobs (sequential)"]
A1["Apply: Management Account<br/>organization · security-services<br/>cost-management · identity"]
A1 -->|success| A2["Apply: Dev Account<br/>account-baseline · networking · backup"]
A2 -->|success| A3{"Apply: Prod Account<br/>Manual Approval Required"}
A3 -->|approved| A4["Apply: Prod Account<br/>account-baseline · networking · backup"]
end
subgraph Auth["OIDC Authentication"]
R1["IAM Role: Management"]
R2["IAM Role: Dev"]
R3["IAM Role: Prod"]
end
A1 -.->|assume| R1
A2 -.->|assume| R2
A4 -.->|assume| R3
style A3 fill:#5a4a00,stroke:#ff9
style Plan fill:#1a1a2a,stroke:#666
style Apply fill:#1a2a1a,stroke:#666
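On the AWS side, each account carries a GitHub OIDC provider and a deploy role scoped to the repository. A sketch for one account, with the role name and repo path as placeholders:

```hcl
# One OIDC provider per account for GitHub Actions
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"] # GitHub's published thumbprint; verify before use
}

# Deploy role the pipeline assumes in this account
resource "aws_iam_role" "github_deploy" {
  name = "github-actions-deploy" # placeholder name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = { "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com" }
        StringLike   = { "token.actions.githubusercontent.com:sub" = "repo:mikamirai/infrastructure:*" } # repo path assumed
      }
    }]
  })
}
```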
What's Next: Part 3 — EKS with NVIDIA GPU Nodes
With the multi-account foundation in place, Part 3 will deploy the compute layer:
- EKS clusters in Dev and Prod accounts on the private subnets from Part 1
- NVIDIA GPU node groups (p3.2xlarge / g5.xlarge) for Random Forest and LSTM model inference
- GPU device plugin and NVIDIA container runtime
- Private API server endpoint via the EKS VPC endpoints provisioned in Part 2
- Model serving pipeline — pull artifacts from the encrypted S3 bucket, serve via gRPC on port 50051
The subnets are tagged. The endpoints are in place. The security groups are defined. The model artifact buckets are encrypted. Part 3 is where the models start running.
This is Part 2 of the MikaMirAI ML Infrastructure series. Part 1 covered the network foundation. Part 3 will cover EKS deployment with NVIDIA GPU support for serving Random Forest and LSTM models.