
Design Document: MM AWS Network Infrastructure

AWS network infrastructure design: dual-VPC layout, subnet allocation, NAT strategy, Tailscale mesh routing, and peering topology.

November 1, 2025

Overview

This document describes the technical design for the Mica Mirai (MM) AWS Network Infrastructure project. The scope covers three phases delivered as Terraform code, version-controlled in GitHub, and deployed via GitHub Actions CI/CD:

  • Phase 0 — Bootstrap: One-time manual provisioning of the Terraform remote state backend (S3 bucket + DynamoDB lock table) and the IAM OIDC provider + role for GitHub Actions.
  • Phase 1 — Network: VPC, subnets, Internet Gateway, NAT egress resources (NAT Instance in dev, NAT Gateways in prod), route tables, VPC Flow Logs, and CloudTrail, deployed per environment (dev and prod).
  • Phase 2 — CI/CD: GitHub Actions workflow that automates terraform plan on pull requests and terraform apply on merges to main, with a mandatory approval gate before prod.

The dev environment uses a single NAT Instance instead of managed NAT Gateways to reduce cost. The prod environment uses two NAT Gateways (one per AZ) for high availability. The NAT Instance is the only EC2 compute resource provisioned — it exists solely for NAT egress and is not a workload instance. The architecture is EKS-ready: subnets carry the Kubernetes load-balancer discovery tags so a future EKS cluster can identify them automatically.

Goals

  • All infrastructure defined as code; zero manual console changes after bootstrap.
  • Two isolated environments (dev, prod) that share no network resources; only the Phase 0 state backend and GitHub Actions role are common to both.
  • Security baseline enforced from day one: private subnets have no IGW route, all state and audit buckets are encrypted and public-access-blocked, no static IAM credentials.
  • Consistent naming and tagging across every resource via a single local.common_tags locals block.
  • Cost-optimised dev environment using a NAT Instance instead of managed NAT Gateways.
  • EKS-ready subnet topology for future node groups and load balancers.

Non-Goals

  • Compute workload resources (EC2 application servers, EKS, Lambda, RDS). Note: a NAT Instance EC2 is provisioned in dev for NAT egress only.
  • Application-level security groups (beyond the VPC default).
  • DNS zones or Route 53 configuration.
  • Commitment-based cost optimization (Reserved Instances, Savings Plans); the only cost measure in scope is the dev NAT Instance.

Architecture

High-Level Topology

dev environment — single NAT Instance for cost saving:

┌─────────────────────────────────────────────────────────────────────┐
│  AWS Account  (us-east-1)   dev VPC: 10.1.0.0/16                   │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  ┌─────────────────────┐  ┌─────────────────────────────┐   │   │
│  │  │  AZ: us-east-1a     │  │  AZ: us-east-1b             │   │   │
│  │  │                     │  │                             │   │   │
│  │  │  Public Subnet /24  │  │  Public Subnet /24          │   │   │
│  │  │  ┌───────────────┐  │  │  (no NAT resource)          │   │   │
│  │  │  │ NAT Instance  │  │  │                             │   │   │
│  │  │  │ (EC2 + EIP)   │  │  │                             │   │   │
│  │  │  └───────────────┘  │  │                             │   │   │
│  │  │                     │  │                             │   │   │
│  │  │  Private Subnet /24 │  │  Private Subnet /24         │   │   │
│  │  │  (shared RT → NAT)  │  │  (shared RT → NAT)          │   │   │
│  │  └─────────────────────┘  └─────────────────────────────┘   │   │
│  │                                                              │   │
│  │  Internet Gateway (1)                                        │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

prod environment — two NAT Gateways for high availability:

┌─────────────────────────────────────────────────────────────────────┐
│  AWS Account  (us-east-1)   prod VPC: 10.2.0.0/16                  │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  ┌─────────────────────┐  ┌─────────────────────────────┐   │   │
│  │  │  AZ: us-east-1a     │  │  AZ: us-east-1b             │   │   │
│  │  │                     │  │                             │   │   │
│  │  │  Public Subnet /24  │  │  Public Subnet /24          │   │   │
│  │  │  ┌───────────────┐  │  │  ┌───────────────────────┐  │   │   │
│  │  │  │  NAT-GW (AZ-a)│  │  │  │  NAT-GW (AZ-b)        │  │   │   │
│  │  │  │  EIP          │  │  │  │  EIP                  │  │   │   │
│  │  │  └───────────────┘  │  │  └───────────────────────┘  │   │   │
│  │  │                     │  │                             │   │   │
│  │  │  Private Subnet /24 │  │  Private Subnet /24         │   │   │
│  │  │  (RT-a → NAT-GW-a)  │  │  (RT-b → NAT-GW-b)         │   │   │
│  │  └─────────────────────┘  └─────────────────────────────┘   │   │
│  │                                                              │   │
│  │  Internet Gateway (1)                                        │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Deployment Phases

Phase 0 (manual, once)
  └─ terraform/bootstrap/
       ├─ S3 state bucket
       ├─ DynamoDB lock table
       ├─ IAM OIDC provider
       └─ IAM role MM-github-actions-role

Phase 1 (automated via CI/CD, per env)
  └─ terraform/network/
       ├─ VPC
       ├─ Public subnets (x2)
       ├─ Private subnets (x2)
       ├─ Internet Gateway
       ├─ [dev]  1 EIP + 1 NAT Instance (EC2) + 1 shared private RT
       ├─ [prod] 2 EIPs + 2 NAT Gateways + 2 per-AZ private RTs
       ├─ 1 public route table + 4 route table associations
       ├─ VPC Flow Logs + CloudWatch log group
       ├─ CloudTrail trail
       └─ CloudTrail S3 bucket

Phase 2 (GitHub Actions workflow)
  └─ .github/workflows/terraform.yaml
       ├─ PR: plan + comment
       ├─ merge to main: apply dev (auto)
       └─ prod gate: manual approval → apply prod

CIDR Allocation

| Environment | VPC CIDR | Public Subnet AZ-a | Public Subnet AZ-b | Private Subnet AZ-a | Private Subnet AZ-b |
| --- | --- | --- | --- | --- | --- |
| dev | 10.1.0.0/16 | 10.1.0.0/24 | 10.1.1.0/24 | 10.1.10.0/24 | 10.1.11.0/24 |
| prod | 10.2.0.0/16 | 10.2.0.0/24 | 10.2.1.0/24 | 10.2.10.0/24 | 10.2.11.0/24 |

Public subnets occupy the .0.x and .1.x blocks; private subnets occupy .10.x and .11.x, leaving ample room for future expansion (e.g., intra-cluster subnets at .20.x).
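
Because the allocation follows a regular pattern, the subnet CIDRs could also be derived from the VPC CIDR instead of being typed by hand. The following is a minimal sketch using Terraform's built-in cidrsubnet() function. The local and output names are illustrative assumptions, var.vpc_cidr is the variable from the network module, and the design itself passes the lists explicitly through tfvars.

```hcl
# Sketch only: derive the four /24 blocks from the VPC /16.
# cidrsubnet(prefix, newbits, netnum): adding 8 bits to a /16 gives /24 subnets,
# and netnum selects the third octet.
locals {
  derived_public_subnet_cidrs  = [for i in range(2) : cidrsubnet(var.vpc_cidr, 8, i)]
  derived_private_subnet_cidrs = [for i in range(2) : cidrsubnet(var.vpc_cidr, 8, 10 + i)]
}

# With vpc_cidr = "10.1.0.0/16" (dev):
output "derived_public_subnet_cidrs" {
  value = local.derived_public_subnet_cidrs # ["10.1.0.0/24", "10.1.1.0/24"]
}

output "derived_private_subnet_cidrs" {
  value = local.derived_private_subnet_cidrs # ["10.1.10.0/24", "10.1.11.0/24"]
}
```

Keeping the explicit lists in tfvars, as the design does, makes each environment's layout reviewable at a glance; the derived form is mainly useful as a cross-check.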


Components and Interfaces

Phase 0: Bootstrap Module (terraform/bootstrap/)

Provisioned once by a platform engineer running terraform apply locally before the CI/CD pipeline exists.

| Resource | Terraform Resource Type | Purpose |
| --- | --- | --- |
| S3 State Bucket | aws_s3_bucket + sub-resources | Stores all Terraform remote state files |
| DynamoDB Lock Table | aws_dynamodb_table | Prevents concurrent Terraform runs |
| IAM OIDC Provider | aws_iam_openid_connect_provider | Federates GitHub Actions to AWS |
| IAM Role (GHA) | aws_iam_role | Assumed by GitHub Actions via OIDC |
| IAM Role Policy | aws_iam_role_policy | Grants Terraform permissions to the GHA role |

S3 State Bucket sub-resources:

  • aws_s3_bucket_versioning — enabled
  • aws_s3_bucket_server_side_encryption_configuration — SSE-S3 (AES256)
  • aws_s3_bucket_public_access_block — all four flags true
  • aws_s3_bucket_logging — access logs delivered to a logs/ prefix in the same bucket (or a dedicated access-log bucket)
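
As a rough illustration of how the first three sub-resources attach to the bucket, the sketch below uses the resource label "state", which is an assumption; var.state_bucket_name is the bootstrap variable. Access logging is left out because it also needs a delivery permission on the target bucket.

```hcl
# Sketch only: state bucket with versioning, SSE-S3, and public access blocked.
resource "aws_s3_bucket" "state" {
  bucket = var.state_bucket_name
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256" # SSE-S3
    }
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```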

DynamoDB Lock Table:

  • Partition key: LockID (String)
  • Billing mode: PAY_PER_REQUEST
  • SSE: enabled (AWS-managed key)
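
The three lock-table settings map onto a single resource. A minimal sketch, assuming the resource label "lock" (var.lock_table_name is the bootstrap variable):

```hcl
# Sketch only: DynamoDB table used by the S3 backend for state locking.
resource "aws_dynamodb_table" "lock" {
  name         = var.lock_table_name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the attribute name the S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }

  # enabled = true without kms_key_arn selects the AWS-managed key.
  server_side_encryption {
    enabled = true
  }
}
```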

IAM OIDC Provider:

  • URL: https://token.actions.githubusercontent.com
  • Audience: sts.amazonaws.com
  • Thumbprint list: current GitHub Actions OIDC thumbprint

IAM Role MM-github-actions-role:

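  • Trust policy condition: token.actions.githubusercontent.com:sub must match repo:<org>/<repo>:ref:refs/heads/main. GitHub issues a different subject for pull_request runs (repo:<org>/<repo>:pull_request) and for jobs bound to a GitHub Environment (repo:<org>/<repo>:environment:<name>), so the plan job and the prod apply job described in Phase 2 need those subjects allowed as well.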
  • Permissions: scoped to Terraform operations (S3 state read/write, DynamoDB lock, EC2/VPC/IAM/CloudTrail/CloudWatch describe and create/delete within the account)
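
The provider and trust policy could be expressed along the following lines. This is a sketch, not the project's actual bootstrap code: the resource labels and the github_oidc_thumbprints variable are assumptions introduced here, while var.github_org and var.github_repo are the bootstrap variables. The permissions policy is omitted.

```hcl
# Hypothetical input: thumbprints are passed in rather than hardcoded.
variable "github_oidc_thumbprints" {
  type = list(string)
}

resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"] # the audience
  thumbprint_list = var.github_oidc_thumbprints
}

data "aws_iam_policy_document" "gha_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Subject restricted to the main branch, as stated in the trust-policy bullet.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:${var.github_org}/${var.github_repo}:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "gha" {
  name               = "MM-github-actions-role"
  assume_role_policy = data.aws_iam_policy_document.gha_trust.json
}
```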

Phase 1: Network Module (terraform/network/)

Deployed per environment via CI/CD. All resources are parameterised through variables and a shared local.common_tags block. NAT egress resources differ by environment.

| Resource | Terraform Resource Type | dev | prod |
| --- | --- | --- | --- |
| VPC | aws_vpc | 1 | 1 |
| Public Subnet | aws_subnet | 2 | 2 |
| Private Subnet | aws_subnet | 2 | 2 |
| Internet Gateway | aws_internet_gateway | 1 | 1 |
| Elastic IP | aws_eip | 1 | 2 |
| NAT Instance | aws_instance | 1 | 0 |
| NAT Instance Security Group | aws_security_group | 1 | 0 |
| NAT Gateway | aws_nat_gateway | 0 | 2 |
| Public Route Table | aws_route_table | 1 | 1 |
| Private Route Table | aws_route_table | 1 (shared) | 2 (per-AZ) |
| Route Table Association | aws_route_table_association | 4 | 4 |
| VPC Flow Log | aws_flow_log | 1 | 1 |
| CloudWatch Log Group | aws_cloudwatch_log_group | 1 | 1 |
| IAM Role (Flow Logs) | aws_iam_role | 1 | 1 |
| CloudTrail Trail | aws_cloudtrail | 1 | 1 |
| CloudTrail S3 Bucket | aws_s3_bucket + sub-resources | 1 | 1 |
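
The three flow-log rows (aws_flow_log, aws_cloudwatch_log_group, and the IAM role) fit together roughly as sketched below. The log group name, role name, and resource labels are assumptions. aws_vpc.main, var.environment, and var.flow_log_retention_days come from the network module. The permissions mirror the broad policy shown in the AWS flow-logs documentation and could be narrowed.

```hcl
resource "aws_cloudwatch_log_group" "flow" {
  name              = "/MM/${var.environment}/vpc-flow-logs" # hypothetical name
  retention_in_days = var.flow_log_retention_days            # 14 in dev, 90 in prod
}

# Trust policy: lets the VPC Flow Logs service assume the role.
data "aws_iam_policy_document" "flow_trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["vpc-flow-logs.amazonaws.com"]
    }
  }
}

data "aws_iam_policy_document" "flow_write" {
  statement {
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "logs:DescribeLogGroups",
      "logs:DescribeLogStreams",
    ]
    resources = ["*"]
  }
}

resource "aws_iam_role" "flow" {
  name               = "MM-${var.environment}-use1-role-flowlogs" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.flow_trust.json
}

resource "aws_iam_role_policy" "flow" {
  name   = "flow-logs-write"
  role   = aws_iam_role.flow.id
  policy = data.aws_iam_policy_document.flow_write.json
}

resource "aws_flow_log" "vpc" {
  vpc_id          = aws_vpc.main.id
  traffic_type    = "ALL"
  log_destination = aws_cloudwatch_log_group.flow.arn
  iam_role_arn    = aws_iam_role.flow.arn
}
```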

NAT Instance details (dev only):

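  • AMI: latest Amazon Linux NAT AMI (amzn-ami-vpc-nat-*), looked up via a data "aws_ami" data source. This name pattern matches the legacy NAT AMI built on the first-generation Amazon Linux AMI (2018.03), not Amazon Linux 2; that AMI is past end of support, so a current Amazon Linux AMI configured for NAT at boot may be preferable.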
  • Instance type: t3.micro (sufficient for dev traffic volumes)
  • Placed in public_subnet[0] (us-east-1a)
  • source_dest_check = false — required for NAT forwarding
  • Associated with one EIP
  • Security group allows: inbound from VPC CIDR on all ports; outbound to 0.0.0.0/0
  • Private route table: single shared RT with 0.0.0.0/0 → network_interface_id of the NAT Instance's primary ENI; both private subnets associate with this RT
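
Taken together, the dev-only resources above might look like the sketch below. It assumes AWS provider v5 or later (for domain = "vpc") and that aws_vpc.main, aws_subnet.public, and aws_subnet.private are declared elsewhere in the module. The resource labels, the security group name, and the nat_ami_id variable are hypothetical; the AMI is taken as an input here to keep the example short, whereas the design resolves it with a data source. Tags are omitted for brevity.

```hcl
# Hypothetical input standing in for the data "aws_ami" lookup.
variable "nat_ami_id" {
  type = string
}

locals {
  nat_instance_count = var.environment == "dev" ? 1 : 0
}

resource "aws_security_group" "nat" {
  count  = local.nat_instance_count
  name   = "MM-${var.environment}-use1-nat-instance" # hypothetical name
  vpc_id = aws_vpc.main.id

  ingress {
    description = "All traffic from inside the VPC"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = [var.vpc_cidr]
  }

  egress {
    description = "All outbound traffic"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "nat" {
  count                  = local.nat_instance_count
  ami                    = var.nat_ami_id
  instance_type          = var.nat_instance_type
  subnet_id              = aws_subnet.public[0].id
  vpc_security_group_ids = [aws_security_group.nat[0].id]
  source_dest_check      = false # required so the instance can forward traffic
}

resource "aws_eip" "nat_instance" {
  count    = local.nat_instance_count
  domain   = "vpc"
  instance = aws_instance.nat[0].id
}

resource "aws_route_table" "private_shared" {
  count  = local.nat_instance_count
  vpc_id = aws_vpc.main.id
}

# Default route through the instance's primary ENI; no IGW route is added.
resource "aws_route" "private_via_instance" {
  count                  = local.nat_instance_count
  route_table_id         = aws_route_table.private_shared[0].id
  destination_cidr_block = "0.0.0.0/0"
  network_interface_id   = aws_instance.nat[0].primary_network_interface_id
}

# Both private subnets associate with the one shared route table.
resource "aws_route_table_association" "private_shared" {
  count          = var.environment == "dev" ? 2 : 0
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private_shared[0].id
}
```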

NAT Gateway details (prod only):

  • One per AZ, placed in the public subnet of that AZ
  • Each associated with a dedicated EIP
  • Each private subnet has its own route table pointing to the NAT Gateway in the same AZ
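
A sketch of the prod-only counterpart follows, under the same assumptions: AWS provider v5 or later, with aws_vpc.main, aws_internet_gateway.main, aws_subnet.public, and aws_subnet.private declared elsewhere. The resource labels are illustrative and tags are omitted.

```hcl
locals {
  nat_gw_count = var.environment == "prod" ? 2 : 0
}

resource "aws_eip" "nat_gw" {
  count  = local.nat_gw_count
  domain = "vpc"
}

# One gateway per AZ, each in that AZ's public subnet.
resource "aws_nat_gateway" "main" {
  count         = local.nat_gw_count
  allocation_id = aws_eip.nat_gw[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  # No attribute reference links the gateway to the IGW, so order it explicitly.
  depends_on = [aws_internet_gateway.main]
}

resource "aws_route_table" "private_per_az" {
  count  = local.nat_gw_count
  vpc_id = aws_vpc.main.id
}

# Each private route table sends egress to the NAT Gateway in its own AZ.
resource "aws_route" "private_via_nat_gw" {
  count                  = local.nat_gw_count
  route_table_id         = aws_route_table.private_per_az[count.index].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.main[count.index].id
}

resource "aws_route_table_association" "private_per_az" {
  count          = local.nat_gw_count
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private_per_az[count.index].id
}
```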

Phase 2: CI/CD Workflow (.github/workflows/terraform.yaml)

The workflow has two jobs:

  1. plan — triggered on pull requests targeting main for changes under terraform/** or specs/**:

    • Checks out code
    • Configures AWS credentials via OIDC (aws-actions/configure-aws-credentials@v4)
    • Runs terraform init, terraform validate, terraform fmt -check
    • Runs terraform plan -var-file=envs/dev.tfvars -out=tfplan
    • Posts plan output as a PR comment
  2. apply — triggered on push to main:

    • dev job: runs terraform apply with dev.tfvars automatically
    • prod job: depends on dev job success + manual approval via GitHub Environment prod; runs terraform apply with prod.tfvars

Data Models

Terraform Variable Schema

Bootstrap Variables (terraform/bootstrap/variables.tf)

variable "aws_region" {
  type        = string
  default     = "us-east-1"
  description = "AWS region for all bootstrap resources"
}

variable "state_bucket_name" {
  type        = string
  description = "Globally unique name for the Terraform state S3 bucket"
}

variable "lock_table_name" {
  type        = string
  default     = "MM-terraform-lock"
  description = "Name of the DynamoDB state lock table"
}

variable "github_org" {
  type        = string
  description = "GitHub organisation name (used in OIDC trust policy)"
}

variable "github_repo" {
  type        = string
  description = "GitHub repository name (used in OIDC trust policy)"
}

# Mandatory tagging variables
variable "environment"       { type = string }
variable "owner"             { type = string }
variable "cost_center"       { type = string }
variable "created_by"        { type = string }
variable "creation_date"     { type = string }

Network Variables (terraform/network/variables.tf)

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "environment" {
  type        = string
  description = "Deployment environment: dev or prod"
  validation {
    condition     = contains(["dev", "prod"], var.environment)
    error_message = "environment must be 'dev' or 'prod'."
  }
}

variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the VPC (10.1.0.0/16 for dev, 10.2.0.0/16 for prod)"
}

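variable "availability_zones" {
  type        = list(string)
  default     = ["us-east-1a", "us-east-1b"]
  description = "List of AZs to deploy into (exactly 2 required)"
  validation {
    condition     = length(var.availability_zones) == 2
    error_message = "Exactly two availability zones are required."
  }
}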

variable "public_subnet_cidrs" {
  type        = list(string)
  description = "List of /24 CIDR blocks for public subnets (one per AZ)"
}

variable "private_subnet_cidrs" {
  type        = list(string)
  description = "List of /24 CIDR blocks for private subnets (one per AZ)"
}

variable "flow_log_retention_days" {
  type        = number
  description = "CloudWatch log retention in days (14 for dev, 90 for prod)"
}

variable "cloudtrail_log_retention_days" {
  type        = number
  description = "S3 lifecycle expiry for CloudTrail logs (90 for dev, 365 for prod)"
}

variable "nat_instance_type" {
  type        = string
  default     = "t3.micro"
  description = "EC2 instance type for the NAT Instance (dev only)"
}

variable "state_bucket_name" {
  type        = string
  description = "Name of the S3 bucket used for Terraform remote state (from Phase 0)"
}

variable "lock_table_name" {
  type        = string
  description = "Name of the DynamoDB lock table (from Phase 0)"
}

# Mandatory tagging variables
variable "owner"             { type = string }
variable "cost_center"       { type = string }
variable "created_by"        { type = string }
variable "creation_date"     { type = string }
# Single-line blocks may hold only one argument, so variables with a
# default use the multi-line form.
variable "data_classification" {
  type    = string
  default = "internal"
}
variable "criticality"       { type = string }
variable "backup_policy" {
  type    = string
  default = "none"
}
variable "patch_group" {
  type    = string
  default = "none"
}

Environment tfvars Files

terraform/network/envs/dev.tfvars

environment                   = "dev"
vpc_cidr                      = "10.1.0.0/16"
availability_zones            = ["us-east-1a", "us-east-1b"]
public_subnet_cidrs           = ["10.1.0.0/24", "10.1.1.0/24"]
private_subnet_cidrs          = ["10.1.10.0/24", "10.1.11.0/24"]
flow_log_retention_days       = 14
cloudtrail_log_retention_days = 90
nat_instance_type             = "t3.micro"
owner                         = "platform-team"
cost_center                   = "mm-platform"
created_by                    = "terraform"
creation_date                 = "2025-01-01"
criticality                   = "low"

terraform/network/envs/prod.tfvars

environment                   = "prod"
vpc_cidr                      = "10.2.0.0/16"
availability_zones            = ["us-east-1a", "us-east-1b"]
public_subnet_cidrs           = ["10.2.0.0/24", "10.2.1.0/24"]
private_subnet_cidrs          = ["10.2.10.0/24", "10.2.11.0/24"]
flow_log_retention_days       = 90
cloudtrail_log_retention_days = 365
owner                         = "platform-team"
cost_center                   = "mm-platform"
created_by                    = "terraform"
creation_date                 = "2025-01-01"
criticality                   = "high"

Naming Convention Implementation

The naming convention MM-{env}-{region-short}-{az-short}-{resource}-{purpose} is implemented as a Terraform local:

locals {
  region_short = "use1"

  # AZ short tokens indexed by AZ name
  az_short = {
    "us-east-1a" = "use1a"
    "us-east-1b" = "use1b"
  }

  # Shared prefix for every resource name, e.g. "MM-dev-use1".
  # Terraform locals are values, not callable functions, so full names
  # are composed inline per resource from this prefix.
  name_prefix = "MM-${var.environment}-${local.region_short}"

  # Name patterns (implemented inline per resource):
  # AZ-specific:     "${local.name_prefix}-${local.az_short[az]}-${resource}-${purpose}"
  # Non-AZ-specific: "${local.name_prefix}-${resource}-${purpose}", e.g. "MM-dev-use1-vpc-core"
}
}

Mandatory Tags Implementation

locals {
  common_tags = {
    Environment      = var.environment
    Owner            = var.owner
    CostCenter       = var.cost_center
    Project          = "mm-aws-infra"
    ManagedBy        = "Terraform"
    CreatedBy        = var.created_by
    CreationDate     = var.creation_date
    DataClassification = var.data_classification
    Criticality      = var.criticality
    BackupPolicy     = var.backup_policy
    PatchGroup       = var.patch_group
  }
}

Every resource block merges this map:

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(local.common_tags, {
    Name = "MM-${var.environment}-${local.region_short}-vpc-core"
  })
}

Backend Configuration

terraform/network/backend.tf

terraform {
  backend "s3" {
    # Partial configuration: bucket, key, and dynamodb_table are supplied
    # at terraform init time via -backend-config, because backend blocks
    # cannot reference variables.
    region  = "us-east-1"
    encrypt = true
  }
}

Because Terraform does not allow variable interpolation in backend blocks, the key is parameterised using a partial backend configuration passed at terraform init time:

terraform init \
  -backend-config="key=network/dev/terraform.tfstate" \
  -backend-config="bucket=${STATE_BUCKET}" \
  -backend-config="dynamodb_table=${LOCK_TABLE}"

The CI/CD workflow sets these values from GitHub Actions environment variables or secrets.


Repository Structure

.
├── .github/
│   └── workflows/
│       └── terraform.yaml          # CI/CD pipeline (Phase 2)
├── .kiro/
│   └── specs/
│       └── mm-aws-network-infra/
│           ├── requirements.md
│           ├── design.md
│           └── tasks.md
├── docs/
│   └── post-apply-checklist.md     # Requirement 27.3 verification checklist
├── modules/                        # Reserved for future shared Terraform modules
├── specs/
│   ├── phase0-state.yaml           # Kiro spec: bootstrap (Req 19.1)
│   └── phase1-network.yaml         # Kiro spec: network (Req 19.2)
├── terraform/
│   ├── bootstrap/                  # Phase 0 — run once manually
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── providers.tf
│   │   └── terraform.tfvars        # NOT committed; supplied by engineer
│   └── network/                    # Phase 1 — deployed by CI/CD
│       ├── backend.tf
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── providers.tf
│       ├── locals.tf               # common_tags + naming helpers
│       ├── vpc.tf
│       ├── subnets.tf
│       ├── igw.tf
│       ├── eip.tf
│       ├── nat_gateway.tf
│       ├── nat_instance.tf         # NAT Instance + SG (dev only, count = env == "dev" ? 1 : 0)
│       ├── route_tables.tf
│       ├── flow_logs.tf
│       ├── cloudtrail.tf
│       └── envs/
│           ├── dev.tfvars
│           └── prod.tfvars

File Responsibilities

| File | Contents |
| --- | --- |
| bootstrap/main.tf | S3 bucket, DynamoDB table, IAM OIDC provider, IAM role |
| network/locals.tf | common_tags, region_short, az_short map, name-building expressions |
| network/vpc.tf | aws_vpc resource |
| network/subnets.tf | aws_subnet resources (public x2, private x2) with EKS tags |
| network/igw.tf | aws_internet_gateway resource |
| network/eip.tf | aws_eip resources: 1 in dev (for NAT Instance), 2 in prod (for NAT Gateways) |
| network/nat_gateway.tf | aws_nat_gateway resources (x2, prod only) with depends_on IGW |
| network/nat_instance.tf | aws_instance NAT Instance + aws_security_group (dev only); count conditional on var.environment == "dev" |
| network/route_tables.tf | Public RT; 1 shared private RT (dev) or 2 per-AZ private RTs (prod); 4 associations; routes |
| network/flow_logs.tf | aws_flow_log, aws_cloudwatch_log_group, IAM role for flow logs |
| network/cloudtrail.tf | aws_cloudtrail, aws_s3_bucket + sub-resources for CloudTrail |
| network/backend.tf | Partial S3 backend configuration |
| network/providers.tf | AWS provider pinned version |
| network/outputs.tf | VPC ID, subnet IDs, NAT GW IDs or NAT Instance ID, etc. for downstream modules |

Key Design Decisions

1. NAT Instance in dev, NAT Gateways in prod

dev uses a single EC2 NAT Instance (t3.micro, NAT AMI matching amzn-ami-vpc-nat-*) placed in public_subnet[0] (us-east-1a). Both private subnets share one route table pointing to the NAT Instance's primary ENI. This avoids the roughly $32/month hourly charge of each NAT Gateway, plus its per-GB processing fee, in the non-production environment where high availability is not required. The trade-off is that the instance is a single point of failure for dev egress.

prod uses two managed NAT Gateways — one per AZ — each with a dedicated EIP. Each private subnet has its own route table pointing to the NAT Gateway in the same AZ. This eliminates cross-AZ traffic charges and removes the single point of failure that a shared NAT Gateway would introduce.

The environment-specific resources are controlled via count = var.environment == "dev" ? 1 : 0 (NAT Instance) and count = var.environment == "prod" ? 2 : 0 (NAT Gateways), keeping a single shared Terraform configuration for both environments.

2. One NAT Gateway per AZ in prod (not shared)

See Decision 1 above for the full rationale. The per-AZ NAT Gateway topology in prod ensures that a NAT Gateway failure in one AZ does not affect egress in the other AZ, and avoids cross-AZ data transfer charges.

3. Partial Backend Configuration

Terraform does not support variable interpolation in backend {} blocks. Rather than hardcoding the bucket name and state key, the design uses a partial backend configuration: backend.tf contains only the region and encrypt flag; the bucket name, key, and DynamoDB table are passed via -backend-config flags at terraform init time. The CI/CD workflow injects these from GitHub Actions secrets/environment variables, keeping them out of source control.

4. map_public_ip_on_launch = false on Public Subnets

No compute workloads are placed in public subnets in this phase. Setting map_public_ip_on_launch = false prevents accidental public IP assignment if a resource is mistakenly launched there. NAT Gateways and the NAT Instance receive their public IPs via explicitly allocated EIPs, not this flag.

5. Separate CloudTrail Bucket per Environment

Each environment gets its own CloudTrail S3 bucket rather than a shared bucket. This keeps environment blast radius contained: a misconfigured bucket policy in dev cannot affect prod audit logs. The lifecycle retention difference (90 days dev vs 365 days prod) also makes per-environment buckets simpler to manage.
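
The per-environment retention difference reduces to one lifecycle rule driven by var.cloudtrail_log_retention_days. A minimal sketch: the cloudtrail_bucket_name variable, the rule id, and the resource labels are assumptions, and the bucket policy that CloudTrail needs in order to write to the bucket is not shown.

```hcl
# Hypothetical input: one globally unique bucket name per environment.
variable "cloudtrail_bucket_name" {
  type = string
}

resource "aws_s3_bucket" "cloudtrail" {
  bucket = var.cloudtrail_bucket_name
}

resource "aws_s3_bucket_lifecycle_configuration" "cloudtrail" {
  bucket = aws_s3_bucket.cloudtrail.id

  rule {
    id     = "expire-audit-logs"
    status = "Enabled"

    filter {} # empty filter: the rule applies to every object in the bucket

    expiration {
      days = var.cloudtrail_log_retention_days # 90 in dev, 365 in prod
    }
  }
}
```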

6. local.common_tags as the Single Source of Truth for Tags

All mandatory tags are defined once in locals.tf. Every resource block uses tags = merge(local.common_tags, { Name = "..." }). This means adding a new mandatory tag requires editing exactly one file. A terraform-compliance or checkov pre-apply lint rule can enforce that no resource block omits local.common_tags.

7. EKS Subnet Tags Applied at Creation

The kubernetes.io/role/elb tag (public subnets) and the kubernetes.io/role/internal-elb tag (private subnets) are applied at subnet creation time as part of the tags merge. This avoids later state drift when an EKS cluster is added: the subnets are already correctly tagged, so EKS will discover them without any Terraform changes.
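
A sketch of how the subnet resources might carry these tags follows. It relies on aws_vpc.main, local.common_tags, local.region_short, local.az_short, and the subnet variables defined earlier in the document. The resource labels and the "subnet-public" / "subnet-private" name tokens are assumptions.

```hcl
resource "aws_subnet" "public" {
  count                   = length(var.availability_zones)
  vpc_id                  = aws_vpc.main.id
  availability_zone       = var.availability_zones[count.index]
  cidr_block              = var.public_subnet_cidrs[count.index]
  map_public_ip_on_launch = false # see Decision 4

  tags = merge(local.common_tags, {
    Name                     = "MM-${var.environment}-${local.region_short}-${local.az_short[var.availability_zones[count.index]]}-subnet-public"
    "kubernetes.io/role/elb" = "1" # internet-facing load balancers
  })
}

resource "aws_subnet" "private" {
  count             = length(var.availability_zones)
  vpc_id            = aws_vpc.main.id
  availability_zone = var.availability_zones[count.index]
  cidr_block        = var.private_subnet_cidrs[count.index]

  tags = merge(local.common_tags, {
    Name                              = "MM-${var.environment}-${local.region_short}-${local.az_short[var.availability_zones[count.index]]}-subnet-private"
    "kubernetes.io/role/internal-elb" = "1" # internal load balancers
  })
}
```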

8. Bootstrap is Not Managed by Remote State

The bootstrap module (terraform/bootstrap/) uses local state (or a manually managed state file) because the remote state backend does not yet exist when bootstrap runs. After bootstrap completes, all subsequent modules use the S3 backend. The bootstrap state file should be stored securely by the platform engineer (e.g., in a secure local directory or imported into the S3 bucket manually after creation).


Error Handling

Terraform Init Failures

If the S3 state bucket is unreachable during terraform init, Terraform exits with a non-zero code and a descriptive error. The CI/CD workflow treats any non-zero exit as a failure and stops the pipeline. No plan or apply step runs.

State Lock Conflicts

If a concurrent Terraform run attempts to acquire the DynamoDB lock while one is held, Terraform returns a lock conflict error with the lock holder's ID and timestamp. The operator must either wait for the lock to be released or run terraform force-unlock <lock-id> after confirming the holding process is dead.

NAT Egress Provisioning Dependencies

prod NAT Gateways: depend on both the EIP allocation and the Internet Gateway being attached. The EIP dependency is implicit, through the allocation_id reference. The IGW dependency has no attribute reference to carry it, so it is declared explicitly with depends_on = [aws_internet_gateway.main] on the NAT Gateway resources. If the IGW attachment fails, the NAT Gateway creation is not attempted.

dev NAT Instance: depends on the public subnet and the EIP. The default route uses network_interface_id of the NAT Instance's primary ENI, so the route (not the route table association) cannot be created until the instance exists. Terraform orders this through the implicit resource reference in the route resource.

Plan Deviation Gate (Requirement 27)

The CI/CD workflow captures the terraform plan exit code and resource change summary. Expected resource counts differ by environment:

  • dev: 1 VPC, 4 subnets, 1 IGW, 1 EIP, 1 NAT Instance, 1 NAT Instance SG, 2 RTs, 4 RT associations, 1 flow log, 1 CloudTrail trail
  • prod: 1 VPC, 4 subnets, 1 IGW, 2 EIPs, 2 NAT GWs, 3 RTs, 4 RT associations, 1 flow log, 1 CloudTrail trail

If the plan shows a resource count outside the expected range for the target environment, the workflow posts a warning comment on the PR and requires a human reviewer to explicitly approve before apply proceeds.

Missing Required Directories (Requirement 18)

A pre-apply shell step in the CI/CD workflow checks for the existence of all required top-level directories (specs/, terraform/bootstrap/, terraform/network/, modules/, .github/workflows/, docs/). If any are absent, the step exits non-zero with a descriptive message identifying the missing directory, and the pipeline fails before terraform init is called.

Environment Cross-Contamination Guard (Requirement 23.5)

Before terraform apply, the workflow verifies that the plan does not include changes to resources tagged with a different Environment value than the target. This is implemented as a terraform show -json tfplan | jq check that scans planned resource changes for Environment tag mismatches.


Testing Strategy

This feature is Infrastructure as Code (Terraform + GitHub Actions). Property-based testing is not applicable because:

  • The code is declarative configuration, not a function with inputs and outputs.
  • Correctness is verified by Terraform's own plan/apply cycle and AWS API responses.
  • Running 100 iterations of terraform apply would be prohibitively expensive and slow.

The testing strategy uses the following complementary approaches:

1. Static Analysis (pre-apply, every PR)

| Tool | What it checks |
| --- | --- |
| terraform validate | HCL syntax and provider schema correctness |
| terraform fmt -check | Consistent formatting |
| checkov or tfsec | Security misconfigurations (public S3, unencrypted resources, missing tags) |
| terraform-compliance | Policy-as-code: every resource has local.common_tags, no IGW route on private RTs |

2. Plan Verification (pre-apply, every PR and merge)

  • terraform plan is run for dev on every PR.
  • The plan output is parsed to confirm the expected resource count (Requirement 27).
  • Any deviation triggers a required human review before apply.

3. Post-Apply Smoke Tests (after each terraform apply)

A shell script or AWS CLI check verifies:

  • VPC exists with the correct CIDR block.
  • All four subnets are in the correct AZs with correct CIDRs.
  • dev: NAT Instance is in running state, source/destination check is disabled, EIP is associated.
  • prod: Both NAT Gateways are in available state, each with a dedicated EIP.
  • Public route table has a 0.0.0.0/0 → IGW route.
  • dev: One shared private route table has a 0.0.0.0/0 → NAT Instance ENI route; no IGW route.
  • prod: Each private route table has a 0.0.0.0/0 → NAT-GW route (no IGW route).
  • VPC Flow Logs are delivering records to the CloudWatch log group.
  • CloudTrail trail is active and logging.
  • All S3 buckets have public access blocked and SSE enabled.
  • DynamoDB lock table exists with LockID partition key.

These checks are defined in docs/post-apply-checklist.md and run as a post-deploy step in the CI/CD workflow.

4. Idempotency Test

After a successful terraform apply, the workflow runs terraform plan again and asserts that the plan shows zero changes (for example with -detailed-exitcode, where exit code 0 means no changes and 2 means changes are pending). A non-zero change count after a clean apply indicates a non-idempotent resource configuration and fails the pipeline.

5. Integration Test: State Locking

A manual test (documented in docs/post-apply-checklist.md) verifies that running two concurrent terraform plan operations against the same state key results in one succeeding and one receiving a lock conflict error.

6. Security Baseline Verification

checkov is run in the CI/CD pipeline with a policy file that enforces:

  • No S3 bucket has public access enabled.
  • All S3 buckets have SSE configured.
  • All DynamoDB tables have SSE enabled.
  • No IAM user with static access keys is created.
  • No route table associated with a private subnet has a route to an IGW.
  • The NAT Instance (dev) has source_dest_check = false.