Design Document: NVIDIA AWS GPU Certification Study System

Overview

This system provides a structured 6-week study plan and resource platform for the NVIDIA Certified Professional AI Infrastructure exam, focused on AWS GPU deployment. It combines curriculum management, hands-on labs, progress tracking, blog publishing, and certification blueprint cross-checking into a cohesive learning platform.

The system is a Next.js application with local-first data storage: curriculum pages are statically generated from MDX, while progress data and blog drafts are written to local files. Candidates can work through the curriculum offline and still publish and share content. Content is organized around the NVIDIA certification blueprint objectives, with each component (labs, progress, blog) tied back to specific exam objectives.

Key Design Decisions

Next.js Static Site with MDX — Curriculum content is authored in MDX for rich interactivity while maintaining version control friendliness. This allows code blocks, diagrams, and interactive elements within study materials.
Local-first with JSON/Markdown storage — Progress data and blog drafts are stored locally in structured JSON and Markdown files, avoiding external database dependencies during study. This keeps the system portable and simple.
File-based Lab System — Labs are self-contained directories with instructions, Terraform configs, validation scripts, and expected outputs. This makes labs reproducible and version-controllable.
Blueprint-driven Architecture — The NVIDIA certification blueprint is the source of truth. All content (weeks, labs, exercises) maps back to specific blueprint objectives, enabling gap analysis and coverage tracking.
Mermaid Diagrams — Architecture and flow diagrams use Mermaid for inline rendering without external tooling.

Architecture

High-Level System Architecture

graph TB
    subgraph "Host Machine (Docker Only)"
        DC[Docker Compose]
        
        subgraph app["Container: app"]
            subgraph "Content Layer"
                CUR[Curriculum MDX Files]
                LABS[Lab Definitions]
                RES[Resource References]
                BP[Blueprint Objectives]
            end

            subgraph "Application Layer"
                APP[Next.js Application]
                PROG[Progress Tracker]
                BLOG[Blog System]
                BPC[Blueprint Checker]
                GAP[Gap Analyzer]
                COST[Cost Calculator]
            end
        end

        subgraph tools["Container: tools"]
            TF[Terraform]
            KUBECTL[kubectl]
            AWSCLI[AWS CLI]
            NVIDIASMI[nvidia-smi simulator]
        end

        subgraph monitoring["Container: monitoring"]
            PROM[Prometheus]
            GRAF[Grafana]
        end

        subgraph "Docker Volumes"
            PDATA[Progress Data - JSON]
            BDATA[Blog Posts - MDX]
            ASSETS[Uploaded Assets]
        end

        subgraph "Output Layer"
            SITE[Static Site]
            EXPORT[Blog Export]
            REPORT[Progress Reports]
        end
    end

    DC --> app
    DC --> tools
    DC --> monitoring

    CUR --> APP
    LABS --> APP
    RES --> APP
    BP --> BPC
    BP --> GAP

    APP --> PROG
    APP --> BLOG
    APP --> BPC
    APP --> GAP
    APP --> COST

    PROG --> PDATA
    BLOG --> BDATA
    BLOG --> ASSETS

    APP --> SITE
    BLOG --> EXPORT
    PROG --> REPORT

Directory Structure

nvidia-gpu-cert-study/
├── docker-compose.yml
├── Dockerfile                    # Main app container
├── Dockerfile.tools              # Lab tools container (Terraform, kubectl, AWS CLI)
├── Dockerfile.monitoring         # Monitoring stack container (Prometheus, Grafana)
├── .env.local                    # Local environment variables
├── .env.cloud.example            # Cloud environment variable template
├── content/
│   ├── curriculum/
│   │   ├── week-1-gpu-fundamentals/
│   │   ├── week-2-kubernetes-eks/
│   │   ├── week-3-slurm-hpc/
│   │   ├── week-4-monitoring-dcgm/
│   │   ├── week-5-cost-optimization/
│   │   └── week-6-troubleshooting/
│   ├── labs/
│   │   ├── lab-01-nvidia-smi-basics/
│   │   ├── lab-02-mig-configuration/
│   │   ├── lab-03-eks-gpu-cluster/
│   │   └── ...
│   ├── blueprint/
│   │   ├── objectives.json
│   │   └── mapping.json
│   └── resources/
│       └── references.json
├── src/
│   ├── components/
│   │   ├── LabRunner/
│   │   ├── ProgressTracker/
│   │   ├── BlogEditor/
│   │   ├── BlueprintChecker/
│   │   ├── GapAnalysis/
│   │   └── CostCalculator/
│   ├── lib/
│   │   ├── progress.ts
│   │   ├── blueprint.ts
│   │   ├── labs.ts
│   │   ├── blog.ts
│   │   ├── cost.ts
│   │   └── container.ts
│   └── pages/
│       ├── index.tsx
│       ├── week/[weekId].tsx
│       ├── labs/[labId].tsx
│       ├── progress.tsx
│       ├── blog/
│       ├── blueprint.tsx
│       └── gap-analysis.tsx
├── data/                         # Docker volume mount point
│   ├── progress.json
│   └── blog-posts/
├── public/
│   └── assets/                   # Docker volume mount point
├── terraform/
│   ├── modules/
│   │   ├── eks-gpu-cluster/
│   │   ├── p5-instance/
│   │   └── monitoring-stack/
│   └── labs/
├── monitoring/
│   ├── prometheus.yml
│   └── grafana/
│       └── dashboards/
└── scripts/
    ├── validate-lab.sh
    └── export-blog.ts

Technology Stack

Component	Technology	Rationale
Framework	Next.js 14 (Pages Router)	Static generation, MDX support, file-based routing
Content	MDX	Rich content with embedded components and code
Styling	Tailwind CSS	Rapid UI development, responsive design
Data Storage	Local JSON + Markdown files	No database dependency, portable, version-controllable
Diagrams	Mermaid	Inline diagrams without external tools
IaC	Terraform	Industry standard for AWS provisioning
Lab Validation	Shell scripts + Node.js	Cross-platform validation of lab outcomes
Blog Export	Markdown + HTML	Compatible with common publishing platforms
Testing	Vitest + fast-check	Unit testing with property-based testing support
Containerization	Docker + Docker Compose	All services run in containers, no host dependencies
Container Registry	Docker Hub / Amazon ECR	Cloud-compatible image distribution
Monitoring	Prometheus + Grafana (containerized)	GPU metrics visualization in isolated containers

Components and Interfaces

1. Curriculum Engine

Responsible for loading, organizing, and presenting the 6-week study plan content.

interface WeekModule {
  id: string; // "week-1", "week-2", etc.
  title: string;
  objectives: LearningObjective[];
  topics: Topic[];
  labs: string[]; // Lab IDs
  dependencies: string[]; // IDs of prerequisite weeks
  blueprintObjectives: string[]; // Blueprint objective IDs covered
}

interface LearningObjective {
  id: string;
  description: string;
  blueprintRef: string[]; // Maps to blueprint objectives
  assessmentCriteria: string;
}

interface Topic {
  id: string;
  title: string;
  content: string; // MDX file path
  difficulty: "foundational" | "intermediate" | "advanced";
  estimatedMinutes: number;
  practiceExercises: Exercise[];
}

interface Exercise {
  id: string;
  title: string;
  description: string;
  type: "command" | "configuration" | "deployment" | "analysis";
  instructions: string;
  expectedOutcome: string;
  hints: string[];
  validationScript?: string;
}
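Because `WeekModule.dependencies` lists prerequisite weeks, the Curriculum Engine has to present weeks in an order that respects those prerequisites. The following is a minimal sketch, assuming dependencies only reference week IDs in the same set; `orderWeeks` and `WeekNode` are hypothetical names, not part of the stated design. It uses Kahn's topological sort and rejects cycles and unknown IDs.

```typescript
// Minimal slice of WeekModule needed for ordering (other fields omitted).
interface WeekNode {
  id: string;
  dependencies: string[]; // IDs of prerequisite weeks
}

// Kahn's algorithm: repeatedly emit weeks whose prerequisites have all been emitted.
// Weeks that become ready at the same time keep their relative input order.
function orderWeeks(weeks: WeekNode[]): string[] {
  const ids = new Set(weeks.map((w) => w.id));
  const remaining = new Map<string, number>(); // id -> unmet dependency count
  const dependents = new Map<string, string[]>(); // id -> weeks that depend on it

  for (const w of weeks) {
    for (const dep of w.dependencies) {
      if (!ids.has(dep)) {
        throw new Error(`${w.id} depends on unknown week ${dep}`);
      }
      dependents.set(dep, [...(dependents.get(dep) ?? []), w.id]);
    }
    remaining.set(w.id, w.dependencies.length);
  }

  const ready = weeks.filter((w) => w.dependencies.length === 0).map((w) => w.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    order.push(id);
    for (const dependent of dependents.get(id) ?? []) {
      const left = remaining.get(dependent)! - 1;
      remaining.set(dependent, left);
      if (left === 0) ready.push(dependent);
    }
  }

  // Any week never emitted is part of a dependency cycle.
  if (order.length !== weeks.length) {
    throw new Error("Cycle detected in week dependencies");
  }
  return order;
}

const plan = orderWeeks([
  { id: "week-2", dependencies: ["week-1"] },
  { id: "week-1", dependencies: [] },
  { id: "week-4", dependencies: ["week-2", "week-3"] },
  { id: "week-3", dependencies: ["week-1"] },
]);
console.log(plan); // ["week-1", "week-2", "week-3", "week-4"]
```

The dependency lists in the example are illustrative; the design does not specify the actual prerequisites of each week.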

2. Lab System

Self-contained lab exercises tied to certification objectives with validation. Labs are organized into a tiered architecture that optimizes for cost discipline and repetition frequency.

type LabTier = "foundation" | "advanced" | "elite";

interface Lab {
  id: string;
  title: string;
  weekId: string;
  objectives: string[]; // Learning objective IDs
  blueprintRefs: string[]; // Blueprint objective IDs
  prerequisites: string[]; // Lab IDs that must be completed first
  difficulty: "beginner" | "intermediate" | "advanced";
  estimatedMinutes: number;
  tier: LabTier;
  recommendedInstance: string; // Cheapest instance that works (e.g., "g5.xlarge")
  estimatedHourlyCost: number; // Hourly cost in USD
  minimumGpuRequirement: number; // 1, 4, or 8
  multiGpuRequired: boolean;
  nvlinkRequired: boolean;
  instanceJustification: string; // Why this instance was chosen
  estimatedCost: CostEstimate;
  environment: LabEnvironment;
  steps: LabStep[];
  validationCheckpoints: ValidationCheckpoint[];
}

interface LabEnvironment {
  awsServices: string[];
  instanceTypes: string[];
  terraformModule?: string; // Path to Terraform config
  setupInstructions: string;
  teardownInstructions: string;
}

interface LabStep {
  order: number;
  title: string;
  instructions: string; // MDX content
  commands?: string[];
  expectedOutput?: string;
  checkpoint?: string; // Validation checkpoint ID
}

interface ValidationCheckpoint {
  id: string;
  description: string;
  validationType: "command_output" | "file_exists" | "api_response" | "manual";
  validationScript?: string;
  expectedResult: string;
}

interface CostEstimate {
  hourlyRate: number;
  estimatedDuration: number; // minutes
  totalEstimate: number;
  instanceType: string;
  notes: string;
}
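Two behaviours follow directly from the `Lab` and `CostEstimate` fields: a lab becomes available only once its `prerequisites` are complete, and `totalEstimate` is the hourly rate multiplied by the duration in hours. The sketch below assumes both; `LabSlice`, `unlockedLabs`, and `totalEstimate` are hypothetical helpers, and the prerequisite chain in the example is invented for illustration.

```typescript
// Minimal slice of the Lab interface needed for gating and cost estimates.
interface LabSlice {
  id: string;
  prerequisites: string[]; // Lab IDs that must be completed first
  hourlyRate: number; // USD per hour (CostEstimate.hourlyRate)
  estimatedDuration: number; // minutes (CostEstimate.estimatedDuration)
}

// Labs not yet completed whose prerequisites are all completed, in input order.
function unlockedLabs(labs: LabSlice[], completed: Set<string>): string[] {
  return labs
    .filter((lab) => !completed.has(lab.id))
    .filter((lab) => lab.prerequisites.every((p) => completed.has(p)))
    .map((lab) => lab.id);
}

// CostEstimate.totalEstimate derived from hourly rate and duration, rounded to cents.
function totalEstimate(lab: LabSlice): number {
  return Math.round(lab.hourlyRate * (lab.estimatedDuration / 60) * 100) / 100;
}

// Illustrative prerequisite chain; real lab prerequisites are not specified in the design.
const labs: LabSlice[] = [
  { id: "lab-01", prerequisites: [], hourlyRate: 1, estimatedDuration: 90 },
  { id: "lab-02", prerequisites: ["lab-01"], hourlyRate: 33, estimatedDuration: 120 },
  { id: "lab-03", prerequisites: ["lab-01"], hourlyRate: 1, estimatedDuration: 180 },
  { id: "lab-05", prerequisites: ["lab-02", "lab-03"], hourlyRate: 33, estimatedDuration: 120 },
];

const nextLabs = unlockedLabs(labs, new Set(["lab-01"]));
console.log(nextLabs); // ["lab-02", "lab-03"]
console.log(nextLabs.map((id) => totalEstimate(labs.find((l) => l.id === id)!))); // [66, 3]
```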

Lab Tier Architecture

The lab system uses a three-tier architecture to enforce cost discipline while ensuring learning objectives are met:

Tier	Target Instances	Hourly Cost	Lab Distribution	Use Cases
Foundation	g5.xlarge, g5.2xlarge	~$1-2/hr	70-80% of labs	nvidia-smi, CUDA basics, Docker GPU runtime, EKS GPU plugin, K8s scheduling, taints/tolerations, DCGM basics, Prometheus/Grafana, MIG concepts (theory), Terraform, GPU pod deployment
Advanced	g5.12xlarge (4x A10G), p4d.24xlarge	~$5-33/hr	10-20% of labs	Multiple GPU scheduling, inference scaling, profiling, concurrency testing, MIG practicals (A100 required)
Elite	p4d.24xlarge, p5.48xlarge	~$33-98/hr	≤10% of labs	NVLink, NCCL, distributed training, topology awareness, GPUDirect, EFA, multi-node collectives

Lab-to-Tier Assignments

Lab	Title	Tier	Instance	Hourly Cost	Justification
Lab 01	nvidia-smi basics	Foundation	g5.xlarge	$1/hr	Single-GPU nvidia-smi queries work on any GPU
Lab 02	MIG configuration	Advanced	p4d.24xlarge	$33/hr	MIG requires a MIG-capable GPU such as the A100; the A10G in G5 instances does not support MIG
Lab 03	EKS GPU cluster	Foundation	g5.xlarge (workers)	$1/hr	Device plugin, scheduling, taints work on single GPU
Lab 04	GPU pod scheduling	Foundation	g5.xlarge	$1/hr	Resource requests/limits work on single GPU
Lab 05	MIG Kubernetes	Advanced	p4d.24xlarge	$33/hr	MIG-in-K8s requires A100 for real MIG partitions
Lab 06	Slurm GPU cluster	Foundation	g5.xlarge	$1/hr	Slurm GPU scheduling works with single GPU
Lab 07	DCGM setup	Foundation	g5.xlarge	$1/hr	DCGM installation and basic metrics on any GPU
Lab 08	Prometheus/Grafana	Foundation	g5.xlarge	$1/hr	Monitoring stack setup is GPU-count agnostic
Lab 09	Cost analysis	Foundation	local/none	$0/hr	Cost analysis is a calculation exercise, no GPU needed
Lab 10	P5 deployment	Elite	p5.48xlarge	$98/hr	H100-specific features, NVLink topology
Lab 11	EFA configuration	Elite	p5.48xlarge	$98/hr	EFA/multi-node requires p5 with EFA interfaces
Lab 12	Troubleshooting	Foundation	g5.xlarge	$1/hr	Troubleshooting methodology works on any GPU
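The tier table sets target shares (Foundation 70-80%, Advanced 10-20%, Elite at most 10%), so the assignments can be checked against them mechanically. A minimal sketch, with the tier of each lab transcribed from the table above; `checkTierDistribution` is a hypothetical helper. Applied to these twelve labs (8 Foundation, 2 Advanced, 2 Elite), it reports Foundation at about 66.7% and Elite at about 16.7%, both outside their bands, while Advanced at about 16.7% is within its band.

```typescript
type LabTier = "foundation" | "advanced" | "elite";

// Tier assignments transcribed from the Lab-to-Tier table.
const assignments: Record<string, LabTier> = {
  "lab-01": "foundation", "lab-02": "advanced", "lab-03": "foundation",
  "lab-04": "foundation", "lab-05": "advanced", "lab-06": "foundation",
  "lab-07": "foundation", "lab-08": "foundation", "lab-09": "foundation",
  "lab-10": "elite", "lab-11": "elite", "lab-12": "foundation",
};

// Target share bands from the Lab Tier Architecture table, as fractions of all labs.
const targets: Record<LabTier, { min: number; max: number }> = {
  foundation: { min: 0.7, max: 0.8 },
  advanced: { min: 0.1, max: 0.2 },
  elite: { min: 0, max: 0.1 },
};

function checkTierDistribution(
  labTiers: Record<string, LabTier>,
): { tier: LabTier; share: number; withinTarget: boolean }[] {
  const tiers = Object.keys(labTiers).map((id) => labTiers[id]);
  return (Object.keys(targets) as LabTier[]).map((tier) => {
    const share = tiers.filter((t) => t === tier).length / tiers.length;
    const { min, max } = targets[tier];
    return { tier, share, withinTarget: share >= min && share <= max };
  });
}

for (const r of checkTierDistribution(assignments)) {
  console.log(`${r.tier}: ${(r.share * 100).toFixed(1)}% ${r.withinTarget ? "ok" : "outside target"}`);
}
```

Running a check like this whenever a lab is added or re-tiered keeps the cost-discipline targets from drifting silently.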

Tier-Based Workflow Recommendations

graph TD
    subgraph "Daily Practice (Foundation Tier)"
        F1[nvidia-smi drills]
        F2[K8s scheduling exercises]
        F3[DCGM/monitoring setup]
        F4[Terraform deployments]
        F5[Troubleshooting scenarios]
    end

    subgraph "Weekly Sessions (Advanced Tier)"
        A1[MIG configuration]
        A2[Multi-GPU profiling]
    end

    subgraph "Focused Sessions (Elite Tier, 2-4hr max)"
        E1[NVLink/NCCL testing]
        E2[EFA multi-node]
    end

    F1 --> A1
    F2 --> A1
    F3 --> E1
    A1 --> E1
    A2 --> E2

Recommended workflow:

Daily (1-2 hours): Run foundation-tier labs on g5.xlarge. Build muscle memory through repetition. Cost: ~$1-2/session.
Weekly (1 session): Run advanced-tier labs when foundation concepts are solid. Cost: ~$5-33/session.
Every two weeks (2-4 hour focused session): Run elite-tier labs only when specifically preparing for NVLink/NCCL/EFA topics, and terminate the instances immediately afterward. Cost: ~$66-392/session.

3. Progress Tracking System

Tracks completion status, stores evidence, and provides summary views.

interface ProgressStore {
  userId: string;
  startDate: string;
  currentWeek: number; // 1-6, the week the candidate is currently studying
  weeks: WeekProgress[];
  labs: LabProgress[];
  overallCompletion: number; // 0-100
  blueprintCoverage: BlueprintCoverage;
}

interface WeekProgress {
  weekId: string;
  status: "not_started" | "in_progress" | "completed";
  startedAt?: string;
  completedAt?: string;
  objectives: ObjectiveProgress[];
}

interface ObjectiveProgress {
  objectiveId: string;
  status: "not_started" | "in_progress" | "completed";
  completedAt?: string;
  evidence: ProgressEvidence[];
  notes: string;
}

interface LabProgress {
  labId: string;
  status: "not_started" | "in_progress" | "completed" | "failed";
  startedAt?: string;
  completedAt?: string;
  checkpointResults: CheckpointResult[];
  evidence: ProgressEvidence[];
  timeSpent: number; // minutes
}

interface ProgressEvidence {
  id: string;
  type: "screenshot" | "code" | "command_output" | "config_file" | "note";
  title: string;
  content: string; // For code/commands: inline content. For images: file path
  filePath?: string; // Path to uploaded asset
  createdAt: string;
  objectiveId?: string;
  labId?: string;
}

interface CheckpointResult {
  checkpointId: string;
  passed: boolean;
  output?: string;
  timestamp: string;
}

interface BlueprintCoverage {
  totalObjectives: number;
  covered: number;
  partial: number;
  uncovered: number;
  byCategory: CategoryCoverage[];
}

interface CategoryCoverage {
  category: string;
  objectives: { id: string; status: "covered" | "partial" | "uncovered" }[];
}
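The design gives `overallCompletion` as a 0-100 value but does not say how it is derived. One simple option, sketched here under the assumption that every week objective and every lab counts as one equally weighted unit, is the rounded percentage of units whose status is `completed`. `overallCompletion` and the slice types are hypothetical; a real tracker might weight labs by `estimatedMinutes` instead.

```typescript
type ProgressStatus = "not_started" | "in_progress" | "completed" | "failed";

// Minimal slices of WeekProgress and LabProgress used by this calculation.
interface WeekStatusSlice { objectives: { status: ProgressStatus }[] }
interface LabStatusSlice { status: ProgressStatus }

// Assumption: each week objective and each lab is one equally weighted unit.
function overallCompletion(weeks: WeekStatusSlice[], labs: LabStatusSlice[]): number {
  const units: ProgressStatus[] = [];
  for (const w of weeks) for (const o of w.objectives) units.push(o.status);
  for (const l of labs) units.push(l.status);
  if (units.length === 0) return 0;
  const done = units.filter((s) => s === "completed").length;
  return Math.round((done / units.length) * 100);
}

const pct = overallCompletion(
  [{ objectives: [{ status: "completed" }, { status: "in_progress" }, { status: "not_started" }] }],
  [{ status: "completed" }, { status: "failed" }],
);
console.log(pct); // 40
```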

4. Blog System

Template-based post authoring for documenting milestones and sharing what was learned.

interface BlogPost {
  id: string;
  title: string;
  slug: string;
  milestoneId: string; // Week or lab ID this documents
  status: "draft" | "published";
  createdAt: string;
  updatedAt: string;
  publishedAt?: string;
  content: string; // MDX content
  assets: BlogAsset[];
  tags: string[];
  objectivesCovered: string[];
  template: string; // Template used for creation
}

interface BlogAsset {
  id: string;
  type: "image" | "diagram" | "code" | "terminal_output";
  fileName: string;
  filePath: string;
  caption?: string;
  uploadedAt: string;
}

interface BlogTemplate {
  id: string;
  milestoneType: "week_completion" | "lab_completion" | "certification_prep";
  title: string;
  sections: TemplateSection[];
}

interface TemplateSection {
  heading: string;
  placeholder: string;
  required: boolean;
  prefillFrom?: string; // Data source for pre-population
}
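`BlogTemplate` and `TemplateSection.prefillFrom` suggest that a new draft is rendered section by section, with prefilled data where available and the placeholder otherwise. The sketch below assumes prefill data arrives as a flat key-value map keyed by `prefillFrom`, and that each section becomes an MDX `##` heading; `slugify`, `renderDraft`, and the `summary` key are hypothetical.

```typescript
// Copies of the TemplateSection / BlogTemplate shapes defined above.
interface TemplateSection {
  heading: string;
  placeholder: string;
  required: boolean;
  prefillFrom?: string;
}
interface BlogTemplate {
  id: string;
  milestoneType: "week_completion" | "lab_completion" | "certification_prep";
  title: string;
  sections: TemplateSection[];
}

// Lowercase, collapse runs of non-alphanumerics to "-", trim leading/trailing "-".
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Each section becomes "## heading" followed by prefilled text when
// prefill[section.prefillFrom] is non-empty, otherwise the section placeholder.
function renderDraft(template: BlogTemplate, prefill: Record<string, string>): string {
  return template.sections
    .map((s) => {
      const body = (s.prefillFrom && prefill[s.prefillFrom]) || s.placeholder;
      return `## ${s.heading}\n\n${body}`;
    })
    .join("\n\n");
}

const tpl: BlogTemplate = {
  id: "lab-completion",
  milestoneType: "lab_completion",
  title: "Lab write-up",
  sections: [
    { heading: "What I built", placeholder: "Summarize the lab outcome.", required: true, prefillFrom: "summary" },
    { heading: "Lessons learned", placeholder: "What surprised you?", required: false },
  ],
};

console.log(slugify("Week 1: GPU Fundamentals & Architecture")); // week-1-gpu-fundamentals-architecture
console.log(renderDraft(tpl, { summary: "Deployed an EKS GPU node group." }));
```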

5. Blueprint Checker

Maps certification objectives to study plan content and identifies gaps.

interface BlueprintObjective {
  id: string;
  category: "deployment_validation" | "software_installation" | "performance_testing" | "troubleshooting";
  title: string;
  description: string;
  subObjectives?: string[];
}

interface ObjectiveMapping {
  objectiveId: string;
  weekModules: string[];
  labs: string[];
  exercises: string[];
  coverageLevel: "full" | "partial" | "none";
  notes?: string;
  alternativeResources?: string[];
}

interface BlueprintReport {
  totalObjectives: number;
  fullyCovered: number;
  partiallyCovered: number;
  notCovered: number;
  mappings: ObjectiveMapping[];
  gaps: GapItem[];
}
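`BlueprintReport` references a `GapItem` type that the design does not define. The sketch below assumes a minimal hypothetical `GapItem` (objective ID, coverage level, alternative resources) and builds a report by treating any objective without a mapping entry as not covered. The example uses the dv-01 and dv-03 entries from the sample `mapping.json` in the Data Models section; `buildBlueprintReport` is a hypothetical helper.

```typescript
interface ObjectiveMapping {
  objectiveId: string;
  weekModules: string[];
  labs: string[];
  exercises: string[];
  coverageLevel: "full" | "partial" | "none";
  notes?: string;
  alternativeResources?: string[];
}

// Hypothetical shape: the design references GapItem without defining it.
interface GapItem {
  objectiveId: string;
  coverageLevel: "partial" | "none";
  alternativeResources: string[];
}

interface BlueprintReport {
  totalObjectives: number;
  fullyCovered: number;
  partiallyCovered: number;
  notCovered: number;
  mappings: ObjectiveMapping[];
  gaps: GapItem[];
}

// Objectives with no mapping entry at all are counted as not covered.
function buildBlueprintReport(objectiveIds: string[], mappings: ObjectiveMapping[]): BlueprintReport {
  const byId = new Map(mappings.map((m) => [m.objectiveId, m] as const));
  const report: BlueprintReport = {
    totalObjectives: objectiveIds.length,
    fullyCovered: 0,
    partiallyCovered: 0,
    notCovered: 0,
    mappings,
    gaps: [],
  };
  for (const id of objectiveIds) {
    const m = byId.get(id);
    const level = m ? m.coverageLevel : "none";
    if (level === "full") {
      report.fullyCovered++;
      continue;
    }
    if (level === "partial") report.partiallyCovered++;
    else report.notCovered++;
    report.gaps.push({ objectiveId: id, coverageLevel: level, alternativeResources: m?.alternativeResources ?? [] });
  }
  return report;
}

const sampleMappings: ObjectiveMapping[] = [
  { objectiveId: "dv-01", weekModules: ["week-1", "week-2"], labs: ["lab-03", "lab-05"],
    exercises: ["ex-w1-03", "ex-w2-01"], coverageLevel: "full" },
  { objectiveId: "dv-03", weekModules: [], labs: [], exercises: [], coverageLevel: "none",
    alternativeResources: ["NVIDIA LaunchPad BMC Lab", "Partner hardware lab access"] },
];
const rep = buildBlueprintReport(["dv-01", "dv-02", "dv-03", "si-01", "pt-01", "tm-01"], sampleMappings);
console.log(rep.fullyCovered, rep.partiallyCovered, rep.notCovered); // 1 0 5
```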

6. Gap Analysis System

Classifies objectives by AWS achievability and provides alternative paths.

interface GapAnalysisEntry {
  objectiveId: string;
  objectiveTitle: string;
  awsClassification: "achievable" | "partially_achievable" | "not_achievable";
  awsLimitation?: string;
  awsCapabilities?: string; // What CAN be done on AWS
  alternatives: AlternativePath[];
  recommendedPath: string;
}

interface AlternativePath {
  type: "nvidia_launchpad" | "local_hardware" | "simulation" | "virtual_lab" | "partner_lab";
  description: string;
  accessInstructions: string;
  estimatedCost: string;
  availability: string;
  url?: string;
}

interface GapAnalysisReport {
  summary: {
    achievableOnAws: number;
    partiallyAchievable: number;
    notAchievable: number;
    total: number;
  };
  entries: GapAnalysisEntry[];
  recommendedPath: RecommendedPath;
}

interface RecommendedPath {
  description: string;
  phases: PathPhase[];
}

interface PathPhase {
  order: number;
  title: string;
  platform: string;
  objectives: string[];
  estimatedDuration: string;
  estimatedCost: string;
}
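The `summary` block of `GapAnalysisReport` can be derived from the entries, and the same pass can flag objectives that are not fully achievable on AWS but list no `AlternativePath`, which would leave a candidate without a study route. A minimal sketch; `summarizeGaps` is hypothetical, and the objective IDs and classifications in the example are invented for illustration.

```typescript
type AwsClassification = "achievable" | "partially_achievable" | "not_achievable";

// Minimal slice of GapAnalysisEntry used here.
interface GapEntry {
  objectiveId: string;
  awsClassification: AwsClassification;
  alternatives: { type: string }[];
}

interface GapSummary {
  achievableOnAws: number;
  partiallyAchievable: number;
  notAchievable: number;
  total: number;
}

// Counts entries per classification and lists objectives that are not fully
// achievable on AWS yet offer no alternative path.
function summarizeGaps(entries: GapEntry[]): { summary: GapSummary; missingAlternatives: string[] } {
  const summary: GapSummary = { achievableOnAws: 0, partiallyAchievable: 0, notAchievable: 0, total: entries.length };
  const missingAlternatives: string[] = [];
  for (const e of entries) {
    if (e.awsClassification === "achievable") summary.achievableOnAws++;
    else if (e.awsClassification === "partially_achievable") summary.partiallyAchievable++;
    else summary.notAchievable++;
    if (e.awsClassification !== "achievable" && e.alternatives.length === 0) {
      missingAlternatives.push(e.objectiveId);
    }
  }
  return { summary, missingAlternatives };
}

// Hypothetical entries for illustration only.
const gapResult = summarizeGaps([
  { objectiveId: "obj-a", awsClassification: "achievable", alternatives: [] },
  { objectiveId: "obj-b", awsClassification: "not_achievable", alternatives: [{ type: "nvidia_launchpad" }] },
  { objectiveId: "obj-c", awsClassification: "partially_achievable", alternatives: [] },
]);
console.log(gapResult.summary, gapResult.missingAlternatives); // missingAlternatives: ["obj-c"]
```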

7. Cost Calculator

Estimates AWS costs for lab exercises and deployment scenarios.

interface CostCalculation {
  instanceType: string;
  region: string;
  hoursPerDay: number;
  daysPerWeek: number;
  weeks: number;
  pricingModel: "on_demand" | "spot" | "reserved_1yr" | "reserved_3yr";
  additionalServices: ServiceCost[];
  totalEstimate: number;
  breakdown: CostBreakdown;
}

interface ServiceCost {
  service: string;
  monthlyEstimate: number;
  notes: string;
}

interface CostBreakdown {
  compute: number;
  storage: number;
  networking: number;
  monitoring: number;
  total: number;
  savingsVsOnDemand?: number;
}

interface InstanceComparison {
  instances: InstanceCostProfile[];
  recommendation: string;
  rationale: string;
}

interface InstanceCostProfile {
  instanceType: string;
  gpuModel: string;
  gpuCount: number;
  onDemandHourly: number;
  spotHourly: number;
  reservedHourly: number;
  performanceScore: number; // Relative score
  costEfficiency: number; // Performance per dollar
}
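The compute portion of `CostCalculation` is the hourly rate for the chosen `pricingModel` multiplied by total hours (`hoursPerDay × daysPerWeek × weeks`), and `savingsVsOnDemand` is the difference from the on-demand total. A minimal sketch; `estimateCompute` is hypothetical, and the rates below are placeholders supplied by the caller, not live AWS prices.

```typescript
type PricingModel = "on_demand" | "spot" | "reserved_1yr" | "reserved_3yr";

// Hourly USD rates per pricing model for one instance type (caller-supplied).
type HourlyRates = Record<PricingModel, number>;

interface ComputeEstimate {
  hours: number;
  compute: number; // USD for the selected pricing model
  savingsVsOnDemand: number; // USD saved relative to on-demand for the same hours
}

function estimateCompute(
  rates: HourlyRates,
  model: PricingModel,
  hoursPerDay: number,
  daysPerWeek: number,
  weeks: number,
): ComputeEstimate {
  const hours = hoursPerDay * daysPerWeek * weeks;
  const compute = hours * rates[model];
  const onDemand = hours * rates.on_demand;
  return { hours, compute, savingsVsOnDemand: onDemand - compute };
}

// Placeholder rates loosely based on the ~$1/hr g5.xlarge figure used above.
const rates: HourlyRates = { on_demand: 1.0, spot: 0.5, reserved_1yr: 0.6, reserved_3yr: 0.4 };

// Daily foundation practice: 2 h/day, 5 days/week, 6 weeks.
const est = estimateCompute(rates, "spot", 2, 5, 6);
console.log(est); // { hours: 60, compute: 30, savingsVsOnDemand: 30 }
```

Reserved rates only apply with a 1- or 3-year commitment, so for a 6-week study plan the reserved figures are mostly useful for comparison rather than as a realistic option.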

8. Container/Deployment Layer

Manages Docker container configuration, service orchestration, and cloud migration settings.

interface DockerService {
  name: string; // "app", "tools", "monitoring"
  dockerfile: string; // Path to Dockerfile
  ports: PortMapping[];
  volumes: VolumeMapping[];
  environment: EnvironmentVariable[];
  networks: string[];
  healthcheck?: HealthCheck;
  dependsOn?: string[]; // Other service names
}

interface PortMapping {
  host: number;
  container: number;
  protocol: "tcp" | "udp";
}

interface VolumeMapping {
  hostPath: string; // Local path or named volume
  containerPath: string;
  readOnly: boolean;
  type: "bind" | "volume" | "tmpfs";
  purpose: string; // Description of what this volume stores
}

interface ContainerConfig {
  serviceName: string;
  baseImage: string;
  buildStages: BuildStage[];
  exposedPorts: number[];
  workdir: string;
  user?: string;
  entrypoint: string[];
  cmd: string[];
  labels: Record<string, string>;
}

interface BuildStage {
  name: string;
  from: string;
  commands: string[];
  copyFrom?: string; // Multi-stage build source
}

interface EnvironmentVariable {
  name: string;
  localDefault: string;
  cloudValue?: string; // Value or source in cloud deployment
  description: string;
  required: boolean;
}

interface HealthCheck {
  test: string[];
  interval: string;
  timeout: string;
  retries: number;
  startPeriod?: string;
}

interface CloudMigrationConfig {
  platform: "ecs" | "eks" | "fargate";
  environmentOverrides: EnvironmentVariable[];
  volumeReplacements: CloudVolumeMapping[];
  networkConfig: CloudNetworkConfig;
  serviceDiscovery: ServiceDiscoveryConfig;
}

interface CloudVolumeMapping {
  localVolume: string; // Name from docker-compose
  cloudStorage: "efs" | "s3" | "ebs";
  cloudPath: string;
  accessMode: "ReadWriteOnce" | "ReadWriteMany" | "ReadOnlyMany";
}

interface CloudNetworkConfig {
  vpcId?: string;
  subnetIds?: string[];
  securityGroupIds?: string[];
  serviceConnectEnabled: boolean;
}

interface ServiceDiscoveryConfig {
  namespace: string;
  services: { name: string; port: number; protocol: string }[];
}
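`EnvironmentVariable` carries both a `localDefault` and an optional `cloudValue`, so the container layer needs a resolution rule when it targets each environment. The sketch assumes one possible rule: local runs use `localDefault`, cloud runs prefer `cloudValue` and fall back to `localDefault`, and a required variable that resolves to an empty string is an error. `resolveEnvironment` and the cloud URL in the example are hypothetical.

```typescript
interface EnvironmentVariable {
  name: string;
  localDefault: string;
  cloudValue?: string;
  description: string;
  required: boolean;
}

type DeployTarget = "local" | "cloud";

// Local: localDefault. Cloud: cloudValue if set, else localDefault.
// Throws if any required variable resolves to an empty string.
function resolveEnvironment(vars: EnvironmentVariable[], target: DeployTarget): Record<string, string> {
  const env: Record<string, string> = {};
  const missing: string[] = [];
  for (const v of vars) {
    const value = target === "cloud" ? (v.cloudValue ?? v.localDefault) : v.localDefault;
    if (v.required && value === "") missing.push(v.name);
    env[v.name] = value;
  }
  if (missing.length > 0) {
    throw new Error(`Missing required variables: ${missing.join(", ")}`);
  }
  return env;
}

const vars: EnvironmentVariable[] = [
  { name: "DATA_DIR", localDefault: "/app/data", description: "Progress data directory", required: true },
  {
    name: "MONITORING_URL",
    localDefault: "http://monitoring:9090",
    cloudValue: "http://monitoring.cert-study.local:9090", // hypothetical service-discovery name
    description: "Prometheus endpoint",
    required: false,
  },
];
console.log(resolveEnvironment(vars, "cloud"));
```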

9. Terminal Simulator

Browser-based terminal simulator for practicing GPU commands against mock outputs at no cost. Foundation-tier labs use it for command familiarization before real instances are spun up.

Architecture

The simulator runs entirely in the browser; its container only serves static files, and there is no application backend:

xterm.js canvas — Renders the terminal interface in the browser
DOM overlay picker — Parameter picker (Ctrl+Space) rendered as a sibling DOM element positioned over the terminal
localStorage session store — Persists all session history, completion state, and replay data in the browser

File Structure

simulator/
├── data/
│   └── commands.json          # Single source of truth: command definitions, parameter registries, mock outputs, error variants, lab structure
├── src/
│   ├── paramPicker.js         # Ctrl+Space overlay — DOM sibling of xterm, longest-match parsing for context-aware suggestions
│   ├── mockRunner.js          # Command execution — longest-match command lookup, error mode support, output rendering
│   ├── sessionStore.js        # localStorage persistence — entry schema with outlineId/tokens/errorType fields
│   ├── navRenderer.js         # Left nav panel — lab outline with completion tracking, session history entries, click-to-replay
│   └── main.js                # Wire all modules — xterm.js terminal initialization, command execution flow, event routing
├── public/
│   └── index.html             # Two-column layout (nav + terminal), dark theme, xterm.js loaded from CDN
└── Dockerfile                 # Static file server container, port 3000, no backend dependencies

Integration Points

Foundation-tier labs reference the simulator for command familiarization before real instance spin-up
Cost calculator shows "$0/hr — simulator" as a practice option
commands.json defines 5 domains: GPU observability, fabric/topology, MIG/partitioning, workload/containers, and failure diagnosis
Each command entry includes clean output and error variants (driver mismatches, ECC errors, Fabric Manager (FM) not running, etc.)

interface CommandEntry {
  command: string;           // e.g., "nvidia-smi"
  domain: "gpu-observability" | "fabric-topology" | "mig-partitioning" | "workload-containers" | "failure-diagnosis";
  subcommands: SubcommandEntry[];
  parameters: ParameterDef[];
  mockOutputs: MockOutput[];
  errorVariants: ErrorVariant[];
  labId: string;            // Which lab this command belongs to
  exerciseIds: string[];    // Which exercises use this command
}

interface SubcommandEntry {
  name: string;
  description: string;
  parameters: ParameterDef[];
  examples: string[];
}

interface ParameterDef {
  flag: string;             // e.g., "--query-gpu"
  description: string;
  values?: string[];        // Allowed values if enumerable
  examples: string[];
}

interface MockOutput {
  input: string;            // Full command string that triggers this output
  output: string;           // Terminal output to render
  description: string;
}

interface ErrorVariant {
  input: string;
  output: string;
  errorType: string;        // e.g., "driver_mismatch", "ecc_error", "fm_not_running"
  description: string;
}

interface SessionEntry {
  id: string;
  timestamp: string;
  outlineId: string;        // Lab/exercise this belongs to
  tokens: string[];         // Parsed command tokens
  errorType?: string;       // If error mode was active
  output: string;           // What was rendered
}
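`mockRunner.js` is described as doing a longest-match command lookup with error-mode support. One way to implement that, sketched here under assumptions: tokenize the typed command on whitespace, treat each `MockOutput.input` (or `ErrorVariant.input` in error mode) as a token prefix, and pick the matching entry with the most tokens. In this sketch error mode does not fall back to clean output; `tokenize`, `isPrefix`, and `lookup` are hypothetical names, and the example outputs are mock data.

```typescript
interface MockOutput { input: string; output: string; description: string }
interface ErrorVariant { input: string; output: string; errorType: string; description: string }

const tokenize = (s: string): string[] => s.trim().split(/\s+/).filter((t) => t.length > 0);

// True when every token of `pattern` equals the token at the same position in `tokens`.
function isPrefix(pattern: string[], tokens: string[]): boolean {
  return pattern.length <= tokens.length && pattern.every((t, i) => t === tokens[i]);
}

// Returns the output whose input is the longest token prefix of the typed command.
// In error mode only variants with the given errorType are considered.
function lookup(typed: string, outputs: MockOutput[], errors: ErrorVariant[], errorType?: string): string | undefined {
  const tokens = tokenize(typed);
  const candidates: { input: string; output: string }[] = errorType
    ? errors.filter((e) => e.errorType === errorType)
    : outputs;
  let best: { len: number; output: string } | undefined;
  for (const c of candidates) {
    const pattern = tokenize(c.input);
    if (isPrefix(pattern, tokens) && (!best || pattern.length > best.len)) {
      best = { len: pattern.length, output: c.output };
    }
  }
  return best?.output;
}

const outputs: MockOutput[] = [
  { input: "nvidia-smi", output: "<default nvidia-smi table>", description: "summary view" },
  { input: "nvidia-smi -L", output: "GPU 0: NVIDIA A10G", description: "list GPUs" },
];
const errors: ErrorVariant[] = [
  { input: "nvidia-smi", output: "Failed to initialize NVML: Driver/library version mismatch",
    errorType: "driver_mismatch", description: "driver and library versions differ" },
];

console.log(lookup("nvidia-smi -L", outputs, errors)); // GPU 0: NVIDIA A10G
console.log(lookup("nvidia-smi -q", outputs, errors)); // <default nvidia-smi table>
console.log(lookup("nvidia-smi -L", outputs, errors, "driver_mismatch"));
```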

Data Models

Progress Data (progress.json)

{
  "userId": "candidate-001",
  "startDate": "2024-01-15",
  "currentWeek": 3,
  "overallCompletion": 42,
  "weeks": [
    {
      "weekId": "week-1",
      "status": "completed",
      "startedAt": "2024-01-15T08:00:00Z",
      "completedAt": "2024-01-21T18:00:00Z",
      "objectives": [
        {
          "objectiveId": "w1-obj-1",
          "status": "completed",
          "completedAt": "2024-01-16T14:00:00Z",
          "evidence": [
            {
              "id": "ev-001",
              "type": "screenshot",
              "title": "nvidia-smi output showing GPU details",
              "content": "",
              "filePath": "assets/progress/week-1/nvidia-smi-output.png",
              "createdAt": "2024-01-16T14:00:00Z"
            }
          ],
          "notes": "Completed GPU architecture review"
        }
      ]
    }
  ],
  "labs": [
    {
      "labId": "lab-01",
      "status": "completed",
      "startedAt": "2024-01-17T09:00:00Z",
      "completedAt": "2024-01-17T11:30:00Z",
      "checkpointResults": [
        {
          "checkpointId": "cp-01",
          "passed": true,
          "output": "GPU 0: NVIDIA A10G",
          "timestamp": "2024-01-17T09:30:00Z"
        }
      ],
      "evidence": [],
      "timeSpent": 150
    }
  ],
  "blueprintCoverage": {
    "totalObjectives": 45,
    "covered": 12,
    "partial": 5,
    "uncovered": 28,
    "byCategory": []
  }
}
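The sample file above has two internal invariants worth checking on load: `covered + partial + uncovered` should equal `totalObjectives` (12 + 5 + 28 = 45), and a lab's `timeSpent` should not exceed the wall-clock time between `startedAt` and `completedAt` (09:00 to 11:30 is 150 minutes). A minimal sketch; `checkProgress` and the record types are hypothetical, and the second rule assumes `timeSpent` counts active minutes within that window.

```typescript
// Minimal slices of the progress.json shape shown above.
interface LabRecord {
  labId: string;
  startedAt?: string;
  completedAt?: string;
  timeSpent: number; // minutes
}
interface CoverageRecord {
  totalObjectives: number;
  covered: number;
  partial: number;
  uncovered: number;
}

// Returns human-readable problems; an empty array means all checks passed.
function checkProgress(labs: LabRecord[], coverage: CoverageRecord): string[] {
  const problems: string[] = [];
  const sum = coverage.covered + coverage.partial + coverage.uncovered;
  if (sum !== coverage.totalObjectives) {
    problems.push(`coverage counts sum to ${sum}, expected ${coverage.totalObjectives}`);
  }
  for (const lab of labs) {
    if (!lab.startedAt || !lab.completedAt) continue; // nothing to compare yet
    const elapsed = (Date.parse(lab.completedAt) - Date.parse(lab.startedAt)) / 60000;
    if (elapsed < 0) problems.push(`${lab.labId}: completedAt precedes startedAt`);
    else if (lab.timeSpent > elapsed) problems.push(`${lab.labId}: timeSpent exceeds elapsed time`);
  }
  return problems;
}

const sampleLabs: LabRecord[] = [
  { labId: "lab-01", startedAt: "2024-01-17T09:00:00Z", completedAt: "2024-01-17T11:30:00Z", timeSpent: 150 },
];
const sampleCoverage: CoverageRecord = { totalObjectives: 45, covered: 12, partial: 5, uncovered: 28 };
console.log(checkProgress(sampleLabs, sampleCoverage)); // []
```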

Blueprint Objectives (objectives.json)

{
  "categories": [
    {
      "id": "deployment_validation",
      "title": "Deployment and Validation",
      "objectives": [
        {
          "id": "dv-01",
          "title": "Deployment event sequences",
          "description": "Understand and execute proper deployment event sequences for GPU infrastructure"
        },
        {
          "id": "dv-02",
          "title": "Network topologies for AI factories",
          "description": "Design and validate network topologies for AI factory deployments"
        },
        {
          "id": "dv-03",
          "title": "BMC/OOB/TPM configuration",
          "description": "Configure Baseboard Management Controller, Out-of-Band management, and TPM"
        }
      ]
    },
    {
      "id": "software_installation",
      "title": "Software Installation and Configuration",
      "objectives": [
        {
          "id": "si-01",
          "title": "BCM installation with HA configuration",
          "description": "Install Base Command Manager with high availability configuration"
        }
      ]
    },
    {
      "id": "performance_testing",
      "title": "Performance Testing and Validation",
      "objectives": [
        {
          "id": "pt-01",
          "title": "Single-node stress tests",
          "description": "Execute and interpret single-node GPU stress tests"
        }
      ]
    },
    {
      "id": "troubleshooting",
      "title": "Troubleshooting and Maintenance",
      "objectives": [
        {
          "id": "tm-01",
          "title": "Hardware fault identification",
          "description": "Identify hardware faults for GPUs, fans, and network cards"
        }
      ]
    }
  ]
}

Objective Mapping (mapping.json)

{
  "mappings": [
    {
      "objectiveId": "dv-01",
      "weekModules": ["week-1", "week-2"],
      "labs": ["lab-03", "lab-05"],
      "exercises": ["ex-w1-03", "ex-w2-01"],
      "coverageLevel": "full",
      "notes": "Covered through EKS deployment labs"
    },
    {
      "objectiveId": "dv-03",
      "weekModules": [],
      "labs": [],
      "exercises": [],
      "coverageLevel": "none",
      "notes": "BMC/OOB/TPM requires physical hardware access",
      "alternativeResources": [
        "NVIDIA LaunchPad BMC Lab",
        "Partner hardware lab access"
      ]
    }
  ]
}
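Because `objectives.json` and `mapping.json` are edited by hand, a lint pass can catch drift between them. The sketch assumes four hypothetical rules: each objective has exactly one mapping entry, every mapping refers to a known objective, `full` or `partial` coverage must reference some content, and `none` coverage should list alternative resources. Applied to the two excerpts above, it reports only the four objectives without an entry (dv-02, si-01, pt-01, tm-01), which is expected for partial samples. `lintMappings` is a hypothetical helper.

```typescript
interface MappingEntry {
  objectiveId: string;
  weekModules: string[];
  labs: string[];
  exercises: string[];
  coverageLevel: "full" | "partial" | "none";
  alternativeResources?: string[];
}

function lintMappings(objectiveIds: string[], mappings: MappingEntry[]): string[] {
  const issues: string[] = [];
  const known = new Set(objectiveIds);
  const seen = new Set<string>();
  for (const m of mappings) {
    const refs = m.weekModules.length + m.labs.length + m.exercises.length;
    if (seen.has(m.objectiveId)) issues.push(`${m.objectiveId}: duplicate mapping`);
    seen.add(m.objectiveId);
    if (!known.has(m.objectiveId)) issues.push(`${m.objectiveId}: not in objectives.json`);
    if (m.coverageLevel !== "none" && refs === 0) {
      issues.push(`${m.objectiveId}: marked ${m.coverageLevel} but references no content`);
    }
    if (m.coverageLevel === "none" && (m.alternativeResources ?? []).length === 0) {
      issues.push(`${m.objectiveId}: uncovered with no alternative resources`);
    }
  }
  for (const id of objectiveIds) {
    if (!seen.has(id)) issues.push(`${id}: no mapping entry`);
  }
  return issues;
}

const objectiveIds = ["dv-01", "dv-02", "dv-03", "si-01", "pt-01", "tm-01"];
const mappings: MappingEntry[] = [
  { objectiveId: "dv-01", weekModules: ["week-1", "week-2"], labs: ["lab-03", "lab-05"],
    exercises: ["ex-w1-03", "ex-w2-01"], coverageLevel: "full" },
  { objectiveId: "dv-03", weekModules: [], labs: [], exercises: [], coverageLevel: "none",
    alternativeResources: ["NVIDIA LaunchPad BMC Lab", "Partner hardware lab access"] },
];
console.log(lintMappings(objectiveIds, mappings));
```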

6-Week Curriculum Structure

gantt
    title 6-Week Study Plan
    dateFormat  YYYY-MM-DD
    section Week 1
    GPU Fundamentals & Architecture    :w1, 2024-01-15, 7d
    section Week 2
    Kubernetes & EKS Deployment        :w2, after w1, 7d
    section Week 3
    Slurm HPC Orchestration            :w3, after w2, 7d
    section Week 4
    DCGM Monitoring & Observability    :w4, after w3, 7d
    section Week 5
    Cost Optimization & AWS Scenarios  :w5, after w4, 7d
    section Week 6
    Troubleshooting & Exam Prep        :w6, after w5, 7d

Week	Focus Area	Key Topics	Labs	Blueprint Categories
1	GPU Fundamentals	CUDA cores, tensor cores, memory hierarchy, NVLink, MIG, nvidia-smi, A100/H100 architecture	nvidia-smi basics, MIG configuration, GPU diagnostics	Deployment & Validation
2	Kubernetes & EKS	Device plugin, GPU scheduling, resource limits, EKS GPU nodes, Terraform EKS, MIG in K8s	EKS GPU cluster, GPU pod scheduling, MIG K8s integration	Software Installation
3	Slurm HPC	GPU resource management, job scripts, DCGM integration, MIG in Slurm, Enroot/Pyxis	Slurm GPU cluster, job submission, container workloads	Software Installation
4	Monitoring & DCGM	DCGM installation, Prometheus/Grafana, GPU metrics, alerting, dashboards	DCGM setup, Prometheus integration, Grafana dashboards	Performance Testing
5	Cost & AWS	Instance comparison, spot/reserved, EFA networking, IAM/VPC, Terraform deployments	Cost analysis, P5 deployment, EFA configuration	Deployment & Validation
6	Troubleshooting	Systematic methodology, error codes, K8s GPU debugging, DCGM diagnostics, real-world scenarios	Troubleshooting scenarios, fault injection, exam simulation	Troubleshooting

Integration Points

graph LR
    subgraph "Content → Progress"
        A[Lab Completion] -->|Updates| B[Progress Store]
        C[Exercise Completion] -->|Updates| B
    end

    subgraph "Progress → Blueprint"
        B -->|Feeds| D[Blueprint Coverage]
        D -->|Identifies| E[Gap Analysis]
    end

    subgraph "Progress → Blog"
        B -->|Triggers| F[Milestone Notification]
        F -->|Pre-populates| G[Blog Template]
    end

    subgraph "Blueprint → Content"
        E -->|Recommends| H[Additional Resources]
        D -->|Validates| I[Curriculum Coverage]
    end

Key Integration Flows:

Lab → Progress → Blueprint: Completing a lab updates progress, which recalculates blueprint coverage
Progress → Blog: Milestone completion triggers blog template pre-population with achieved objectives
Blueprint → Gap Analysis: Uncovered objectives feed into gap analysis with AWS classification
Cost Calculator → Labs: Each lab references cost estimates for required AWS resources
Curriculum → Labs: Weekly modules reference specific labs for hands-on practice

Container Architecture

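The entire system runs inside Docker containers. The only host requirements are Docker and Docker Compose, plus AWS credentials in ~/.aws for the cloud-backed labs (mounted read-only into the tools container). This enables single-command local setup and a straightforward path to cloud deployment.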

Docker Compose Service Definitions

graph TB
    subgraph "docker-compose.yml"
        subgraph app["app service (port 3000)"]
            NEXT[Next.js 14 App]
            MDX[MDX Content Engine]
        end

        subgraph tools["tools service"]
            TF[Terraform 1.6+]
            KUB[kubectl]
            AWS[AWS CLI v2]
            HELM[Helm 3]
        end

        subgraph monitoring["monitoring service (ports 9090, 3001)"]
            PROM[Prometheus]
            GRAF[Grafana]
        end
    end

    subgraph "Volumes"
        V1[app-data: ./data]
        V2[app-assets: ./public/assets]
        V3[monitoring-data: named volume]
    end

    subgraph "Network"
        NET[cert-study-network - bridge]
    end

    app --> V1
    app --> V2
    monitoring --> V3
    app --> NET
    tools --> NET
    monitoring --> NET

Docker Compose Configuration


services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "3000:3000"
    volumes:
      - ./data:/app/data
      - ./public/assets:/app/public/assets
      - ./content:/app/content:ro
    environment:
      - NODE_ENV=${NODE_ENV:-production}
      - DATA_DIR=/app/data
      - ASSETS_DIR=/app/public/assets
      - CONTENT_DIR=/app/content
      - MONITORING_URL=http://monitoring:9090
    networks:
      - cert-study-network
    healthcheck:
      test: ["CMD", "wget", "-q", "-O", "/dev/null", "http://127.0.0.1:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  tools:
    build:
      context: .
      dockerfile: Dockerfile.tools
    volumes:
      - ./terraform:/workspace/terraform
      - ./data:/workspace/data
      - ~/.aws:/root/.aws:ro
    environment:
      - AWS_REGION=${AWS_REGION:-us-east-1}
      - AWS_PROFILE=${AWS_PROFILE:-default}
      - KUBECONFIG=/workspace/.kube/config
    networks:
      - cert-study-network
    stdin_open: true
    tty: true

  monitoring:
    build:
      context: .
      dockerfile: Dockerfile.monitoring
    ports:
      - "9090:9090"
      - "3001:3000"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards:ro
      - monitoring-data:/var/lib/prometheus
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
      - GF_SERVER_ROOT_URL=${GRAFANA_ROOT_URL:-http://localhost:3001}
    networks:
      - cert-study-network

volumes:
  monitoring-data:

networks:
  cert-study-network:
    driver: bridge

Dockerfile Specifications

Dockerfile (Main App)

FROM node:20-alpine AS base
WORKDIR /app

FROM base AS deps
COPY package.json package-lock.json ./
RUN npm ci

FROM base AS builder
COPY --from=deps /app/node_modules ./node_modules
COPY . .
# Requires output: "standalone" in next.config.js so that .next/standalone is emitted
RUN npm run build

FROM base AS runner
ENV NODE_ENV=production
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
COPY --from=builder /app/public ./public
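COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
USER nextjs
EXPOSE 3000
ENV PORT=3000
# Listen on all interfaces; otherwise server.js binds to the container's HOSTNAME
# and the localhost health check cannot reach it
ENV HOSTNAME=0.0.0.0
CMD ["node", "server.js"]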

Dockerfile.tools (Lab Tools)

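# The downloads below fetch x86_64 binaries; on ARM hosts (e.g. Apple Silicon),
# build and run this image as linux/amd64
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y \
    ca-certificates curl unzip git jq python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*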

# Terraform
ARG TERRAFORM_VERSION=1.6.6
RUN curl -fsSL https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip -o terraform.zip \
    && unzip terraform.zip -d /usr/local/bin/ && rm terraform.zip

# kubectl
RUN curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
    && install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl && rm kubectl

# AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
    && unzip awscliv2.zip && ./aws/install && rm -rf aws awscliv2.zip

# Helm
RUN curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

WORKDIR /workspace
CMD ["/bin/bash"]

Dockerfile.monitoring (Monitoring Stack)

FROM prom/prometheus:latest AS prometheus
FROM grafana/grafana:latest AS grafana

FROM ubuntu:22.04
RUN apt-get update && apt-get install -y supervisor curl && rm -rf /var/lib/apt/lists/*

COPY --from=prometheus /bin/prometheus /usr/local/bin/prometheus
COPY --from=grafana /usr/share/grafana /usr/share/grafana
COPY --from=grafana /run.sh /run-grafana.sh

COPY monitoring/supervisord.conf /etc/supervisor/conf.d/supervisord.conf
EXPOSE 9090 3000
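# -n keeps supervisord in the foreground so the container does not exit
# (redundant if supervisord.conf already sets nodaemon=true)
CMD ["/usr/bin/supervisord", "-n"]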

Volume Mount Strategy

Volume	Local Path	Container Path	Purpose
app-data (bind)	`./data`	`/app/data` (app), `/workspace/data` (tools)	Progress JSON, blog post drafts
app-assets (bind)	`./public/assets`	`/app/public/assets`	Uploaded screenshots, diagrams
content (bind)	`./content`	`/app/content`	Curriculum MDX, lab definitions (read-only)
terraform (bind)	`./terraform`	`/workspace/terraform`	IaC configs for lab exercises
AWS config (bind)	`~/.aws`	`/root/.aws`	AWS credentials for lab provisioning (tools only, read-only)
monitoring-data	Docker named volume	`/var/lib/prometheus`	Prometheus time-series data

Network Configuration

All containers communicate over a single Docker bridge network (cert-study-network):

app → monitoring: Fetches GPU metrics for dashboard display
app → tools: Triggers lab validation scripts
tools → external: AWS API calls for lab provisioning (requires AWS credentials)

No container exposes ports to the host except:

app: port 3000 (web UI)
monitoring: ports 9090 (Prometheus) and 3001 (Grafana)

Environment Variable Configuration (Local vs Cloud)

Variable	Local Default	Cloud (ECS/EKS)	Purpose
`NODE_ENV`	`development`	`production`	App mode
`DATA_DIR`	`/app/data`	`/mnt/efs/data`	Persistent data path
`ASSETS_DIR`	`/app/public/assets`	`s3://bucket/assets`	Asset storage
`CONTENT_DIR`	`/app/content`	`/app/content`	Curriculum content
`MONITORING_URL`	`http://monitoring:9090`	`http://prometheus.internal:9090`	Metrics endpoint
`AWS_REGION`	`us-east-1`	(task definition / pod spec)	AWS region
`GRAFANA_ROOT_URL`	`http://localhost:3001`	`https://grafana.example.com`	Grafana base URL

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system—essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Week Module Structural Completeness

For any WeekModule in the curriculum, it SHALL contain non-empty learning objectives, at least one topic, at least one hands-on exercise with specific tasks and expected outcomes, and at least one AWS-specific deployment scenario reference.

Validates: Requirements 1.2, 9.1

Property 2: Progressive Complexity Ordering

For any pair of consecutive weeks (week N and week N+1) in the study plan, the maximum difficulty level of topics in week N+1 SHALL be greater than or equal to the maximum difficulty level in week N, ensuring foundational concepts precede advanced topics.

Validates: Requirements 1.3

Property 3: Dependency Graph Validity

For any dependency relationship where week B depends on week A, the index of week A SHALL be strictly less than the index of week B (no cycles, valid topological ordering).

Validates: Requirements 1.4
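
Properties 2 and 3 reduce to simple checks over the ordered list of weeks. The sketch below is a minimal illustration that assumes a hypothetical WeekSummary shape (a 1-based index, one numeric difficulty per topic, and the indices of the weeks it depends on); none of these field names come from the data model.

```typescript
// Hypothetical per-week summary; the real WeekModule type may use different names.
interface WeekSummary {
  index: number; // 1-based position in the study plan
  topicDifficulties: number[]; // difficulty of each topic (Property 1 guarantees at least one)
  dependsOn: number[]; // indices of the weeks this week depends on
}

// Property 2: the maximum topic difficulty never decreases from week N to week N+1.
function isProgressivelyOrdered(weeks: WeekSummary[]): boolean {
  const sorted = weeks.slice().sort((a, b) => a.index - b.index);
  for (let i = 1; i < sorted.length; i++) {
    const prevMax = Math.max(...sorted[i - 1].topicDifficulties);
    const currMax = Math.max(...sorted[i].topicDifficulties);
    if (currMax < prevMax) return false;
  }
  return true;
}

// Property 3: each dependency must name an existing, strictly earlier week.
// Because every edge then points backwards, a cycle is impossible.
// Returns [dependentWeek, dependencyWeek] pairs that break the rule.
function dependencyViolations(weeks: WeekSummary[]): Array<[number, number]> {
  const known: { [index: number]: boolean } = {};
  for (const w of weeks) known[w.index] = true;
  const violations: Array<[number, number]> = [];
  for (const w of weeks) {
    for (const dep of w.dependsOn) {
      if (!known[dep] || dep >= w.index) violations.push([w.index, dep]);
    }
  }
  return violations;
}
```

Sorting by index first means the Property 2 check does not depend on the order in which week files happen to be loaded.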

Property 4: Resource Entry Completeness

For any resource entry in the resource guide, it SHALL have a non-empty URL string and a non-empty description string.

Validates: Requirements 10.5

Property 5: Learning Goal to Lab Coverage

For any learning objective defined in the curriculum, there SHALL exist at least one Lab whose objectives array contains that learning objective's ID.

Validates: Requirements 11.1

Property 6: Lab Structural Completeness

For any Lab object, it SHALL have: a non-empty objectives array referencing valid learning objectives, a non-empty blueprintRefs array referencing valid blueprint objective IDs, a non-empty steps array where each step has instructions, and a non-empty validationCheckpoints array where each checkpoint has a validationType and expectedResult.

Validates: Requirements 11.2, 11.3, 11.4

Property 7: AWS Lab Cost and Provisioning

For any Lab where environment.awsServices is non-empty, the environment.setupInstructions SHALL be non-empty, environment.teardownInstructions SHALL be non-empty, and estimatedCost SHALL have a positive totalEstimate value.

Validates: Requirements 11.5

Property 7a: Lab Tier Classification Validity

For any Lab in the system, its tier field SHALL be one of "foundation", "advanced", or "elite", AND the recommendedInstance, estimatedHourlyCost, minimumGpuRequirement, multiGpuRequired, nvlinkRequired, and instanceJustification fields SHALL all be populated with valid values.

Validates: Requirements 11.6, 11.11

Property 7b: Lab Tier Distribution

For the complete set of Labs in the system, the proportion of foundation-tier labs SHALL be between 70% and 80% inclusive, the proportion of advanced-tier labs SHALL be between 10% and 20% inclusive, and the proportion of elite-tier labs SHALL be at most 10%.

Validates: Requirements 11.7
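
A minimal sketch of the Property 7b check, assuming only that each lab's tier can be read as one of the three strings above. It compares counts with integer arithmetic (count × 100 against bound × total) so that exact boundary values such as 70% are not lost to floating-point rounding.

```typescript
type LabTier = "foundation" | "advanced" | "elite";

// Returns true when the tier mix satisfies Property 7b (all bounds inclusive).
function satisfiesTierDistribution(tiers: LabTier[]): boolean {
  const total = tiers.length;
  if (total === 0) return false; // an empty catalog cannot meet the foundation minimum

  const count = (t: LabTier): number => tiers.filter((x) => x === t).length;

  // lo% <= count/total <= hi%, rewritten as integer comparisons.
  const within = (t: LabTier, lo: number, hi: number): boolean =>
    count(t) * 100 >= lo * total && count(t) * 100 <= hi * total;

  return within("foundation", 70, 80) && within("advanced", 10, 20) && within("elite", 0, 10);
}
```

Small catalogs can make the property unsatisfiable: with three labs, the possible foundation shares are 0%, 33%, 67%, and 100%, none of which falls in the 70% to 80% band.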

Property 7c: Lab Tier Instance Consistency

For any Lab with tier "foundation", the recommendedInstance SHALL be "g5.xlarge", "g5.2xlarge", or "local" AND estimatedHourlyCost SHALL be at most $2/hr. For any Lab with tier "advanced", the recommendedInstance SHALL be "g5.12xlarge" or "p4d.24xlarge". For any Lab with tier "elite", the recommendedInstance SHALL be "p4d.24xlarge" or "p5.48xlarge" AND nvlinkRequired or multiGpuRequired SHALL be true.

Validates: Requirements 11.8, 11.9, 11.10

Property 7d: Lab NVLink Requirement Implies Elite Tier

For any Lab where nvlinkRequired is true, the tier SHALL be "elite" and the recommendedInstance SHALL be "p4d.24xlarge" or "p5.48xlarge".

Validates: Requirements 11.10, 11.12
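
Properties 7c and 7d can be checked per lab with a lookup table of allowed instance types. This is a sketch under the assumption that the Lab fields named above are available as plain values; the LabTierInfo interface is hypothetical and covers only the fields these two properties use.

```typescript
type LabTier = "foundation" | "advanced" | "elite";

// Hypothetical subset of the Lab fields named in Properties 7a-7d.
interface LabTierInfo {
  tier: LabTier;
  recommendedInstance: string;
  estimatedHourlyCost: number; // USD per hour
  multiGpuRequired: boolean;
  nvlinkRequired: boolean;
}

// Allowed instances per tier, taken directly from Property 7c.
const ALLOWED_INSTANCES: Record<LabTier, string[]> = {
  foundation: ["g5.xlarge", "g5.2xlarge", "local"],
  advanced: ["g5.12xlarge", "p4d.24xlarge"],
  elite: ["p4d.24xlarge", "p5.48xlarge"],
};

// Returns human-readable violations of Properties 7c and 7d; an empty array means consistent.
function tierConsistencyErrors(lab: LabTierInfo): string[] {
  const errors: string[] = [];
  if (ALLOWED_INSTANCES[lab.tier].indexOf(lab.recommendedInstance) === -1) {
    errors.push(`${lab.tier} lab cannot use ${lab.recommendedInstance}`);
  }
  if (lab.tier === "foundation" && lab.estimatedHourlyCost > 2) {
    errors.push("foundation lab exceeds $2/hr");
  }
  if (lab.tier === "elite" && !lab.nvlinkRequired && !lab.multiGpuRequired) {
    errors.push("elite lab must require NVLink or multiple GPUs");
  }
  // Property 7d: NVLink implies elite; the instance check above then restricts the instance.
  if (lab.nvlinkRequired && lab.tier !== "elite") {
    errors.push("nvlinkRequired lab must be elite tier");
  }
  return errors;
}
```

Returning a list of messages rather than a boolean lets a content-validation script report every problem in a lab definition in one pass.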

Property 8: Progress Evidence Storage Round-Trip

For any valid ProgressEvidence with type in ["screenshot", "code", "command_output", "config_file", "note"], storing it in the progress system and then retrieving it by its ID SHALL return an equivalent evidence object with all fields preserved.

Validates: Requirements 12.1, 12.2, 12.5
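
The round-trip property can be exercised without any file I/O by serializing through JSON, which is what a progress.json store would do. The sketch below is a dependency-free stand-in for a fast-check property: a seeded linear congruential generator produces 100 random evidence items. The ProgressEvidence fields other than type, objectiveId, and labId, and the JsonEvidenceStore class, are hypothetical.

```typescript
type EvidenceType = "screenshot" | "code" | "command_output" | "config_file" | "note";

// Hypothetical evidence shape; nullable ids use null (not undefined) so JSON preserves them.
interface ProgressEvidence {
  id: string;
  type: EvidenceType;
  objectiveId: string | null;
  labId: string | null;
  content: string;
  createdAt: string; // ISO-8601 timestamp
}

// In-memory stand-in for a file-backed store: the "file" is a JSON string.
class JsonEvidenceStore {
  private file = "[]";

  save(e: ProgressEvidence): void {
    const all: ProgressEvidence[] = JSON.parse(this.file);
    all.push(e);
    this.file = JSON.stringify(all);
  }

  getById(id: string): ProgressEvidence | undefined {
    const all: ProgressEvidence[] = JSON.parse(this.file);
    return all.filter((e) => e.id === id)[0];
  }
}

// Seeded LCG (Numerical Recipes constants); intermediate values stay below 2^53, so double math is exact.
function lcg(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // uniform in [0, 1)
  };
}

function pick<T>(rand: () => number, xs: T[]): T {
  return xs[Math.floor(rand() * xs.length)];
}

const TYPES: EvidenceType[] = ["screenshot", "code", "command_output", "config_file", "note"];
const SAMPLE_CONTENT = ["", "kubectl get nodes", "line 1\nline 2 with \"quotes\"", "unicode ✓ GPU 日本"];

function randomEvidence(rand: () => number, i: number): ProgressEvidence {
  return {
    id: `ev-${i}`,
    type: pick(rand, TYPES),
    objectiveId: rand() < 0.5 ? null : `obj-${Math.floor(rand() * 50)}`,
    labId: rand() < 0.5 ? null : `lab-${Math.floor(rand() * 20)}`,
    content: pick(rand, SAMPLE_CONTENT),
    createdAt: new Date(Math.floor(rand() * 2e12)).toISOString(),
  };
}

function sameEvidence(a: ProgressEvidence, b: ProgressEvidence | undefined): boolean {
  return b !== undefined && a.id === b.id && a.type === b.type &&
    a.objectiveId === b.objectiveId && a.labId === b.labId &&
    a.content === b.content && a.createdAt === b.createdAt;
}

// Property 8: store then retrieve by id; every field must survive the round trip.
function checkEvidenceRoundTrip(runs = 100, seed = 42): boolean {
  const rand = lcg(seed);
  const store = new JsonEvidenceStore();
  for (let i = 0; i < runs; i++) {
    const ev = randomEvidence(rand, i);
    store.save(ev);
    if (!sameEvidence(ev, store.getById(ev.id))) return false;
  }
  return true;
}
```

In the real test suite, the generator and loop would be replaced by fast-check arbitraries, which add shrinking of failing inputs; the fixed seed here plays the same role as fast-check's reproducible seed.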

Property 9: Overall Completion Calculation Consistency

For any ProgressStore state, the overallCompletion percentage SHALL equal (number of completed objectives + completed labs) divided by (total objectives + total labs) multiplied by 100, or 0 when there are no objectives or labs, and the summary counts (completed + in_progress + not_started) SHALL equal the total number of milestones.

Validates: Requirements 12.3, 12.6
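
A minimal sketch of the Property 9 calculation, assuming objectives and labs share one hypothetical three-valued status. It treats an empty store as 0% and applies the "cap at 100%" rule from the error-handling table defensively.

```typescript
type MilestoneStatus = "completed" | "in_progress" | "not_started";

// Hypothetical snapshot: one status per objective and per lab.
interface ProgressSnapshot {
  objectives: MilestoneStatus[];
  labs: MilestoneStatus[];
}

interface ProgressSummary {
  completed: number;
  inProgress: number;
  notStarted: number;
  overallCompletion: number; // percentage, 0-100
}

function summarize(p: ProgressSnapshot): ProgressSummary {
  const all = p.objectives.concat(p.labs);
  let completed = 0;
  let inProgress = 0;
  let notStarted = 0;
  for (const s of all) {
    if (s === "completed") completed++;
    else if (s === "in_progress") inProgress++;
    else notStarted++;
  }
  // completed <= total, so the cap only matters if upstream data is inconsistent.
  const overallCompletion = all.length === 0 ? 0 : Math.min(100, (completed / all.length) * 100);
  return { completed, inProgress, notStarted, overallCompletion };
}
```

Because every status falls into exactly one branch, completed + inProgress + notStarted always equals the number of milestones, which is the second half of the property.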

Property 10: Evidence Organization by Milestone

For any ProgressEvidence stored with a non-null objectiveId or labId, querying evidence by that objectiveId or labId SHALL return a collection containing that evidence item.

Validates: Requirements 12.4

Property 11: Blog Template Pre-Population

For any completed milestone (week or lab), generating a blog template SHALL produce a BlogPost where the content contains the milestone title, the objectivesCovered array is non-empty and references valid objectives from that milestone, and all required TemplateSection headings are present.

Validates: Requirements 13.3

Property 12: Blog Export Validity

For any BlogPost with non-empty content and at least one asset, the export function SHALL produce valid Markdown where all image references point to valid asset paths and all code blocks are properly fenced.

Validates: Requirements 13.5
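
Both conditions of Property 12 can be checked with a line scan. The sketch below approximates the CommonMark fence rules (a closing fence uses the same character, is at least as long as the opener, and carries no info string) and skips image syntax inside code blocks; it is not a full Markdown parser, and the function name is hypothetical.

```typescript
// Returns problems found in exported Markdown: unknown image targets and unclosed fences.
function exportProblems(markdown: string, assetPaths: string[]): string[] {
  const problems: string[] = [];
  let openFence: string | null = null;
  const lines = markdown.split("\n");

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    // Fence marker: up to 3 spaces of indent, then 3+ backticks or 3+ tildes.
    const fence = /^ {0,3}(`{3,}|~{3,})(.*)$/.exec(line);
    if (fence) {
      const marker = fence[1];
      if (openFence === null) {
        openFence = marker;
      } else if (
        marker.charAt(0) === openFence.charAt(0) &&
        marker.length >= openFence.length &&
        fence[2].trim() === ""
      ) {
        openFence = null;
      }
      continue;
    }
    if (openFence !== null) continue; // example Markdown inside code is not checked

    // ![alt](path "optional title"): capture the path only.
    const imageRef = /!\[[^\]]*\]\(([^)\s]+)[^)]*\)/g;
    let m: RegExpExecArray | null;
    while ((m = imageRef.exec(line)) !== null) {
      if (assetPaths.indexOf(m[1]) === -1) problems.push(`line ${i + 1}: unknown asset ${m[1]}`);
    }
  }

  if (openFence !== null) problems.push("unclosed code fence");
  return problems;
}
```

An empty result means the export satisfies the property; a non-empty one can feed the "provide raw markdown as fallback" path in the blog error handling.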

Property 13: Blueprint Objective Mapping Completeness

For any BlueprintObjective across all categories (deployment_validation, software_installation, performance_testing, troubleshooting), there SHALL exist an ObjectiveMapping entry with a valid coverageLevel classification.

Validates: Requirements 14.1, 14.2, 14.3, 14.4

Property 14: Covered Objectives Reference Content

For any ObjectiveMapping with coverageLevel "full" or "partial", at least one of the weekModules and labs arrays SHALL be non-empty, indicating which study content addresses the objective.

Validates: Requirements 14.5

Property 15: Uncovered Objectives Have Alternatives

For any ObjectiveMapping with coverageLevel "none", the alternativeResources array SHALL be non-empty, providing recommendations for addressing the gap.

Validates: Requirements 14.6
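
Properties 13 to 15 can be validated together by indexing mappings by objective ID. This is a sketch with hypothetical BlueprintObjective and ObjectiveMapping shapes; only the field names quoted in the properties come from the design. Missing mappings are reported as unmapped, matching the "Objective ID not found in mapping" handling strategy.

```typescript
type Coverage = "full" | "partial" | "none";

// Hypothetical shapes built around the field names used in Properties 13-15.
interface BlueprintObjective {
  id: string;
  category: string; // e.g. "deployment_validation"
}

interface ObjectiveMapping {
  objectiveId: string;
  coverageLevel: Coverage;
  weekModules: string[];
  labs: string[];
  alternativeResources: string[];
}

function mappingErrors(objectives: BlueprintObjective[], mappings: ObjectiveMapping[]): string[] {
  const valid: string[] = ["full", "partial", "none"];
  // Prototype-free map so ids such as "constructor" cannot collide with Object members.
  const byId: { [id: string]: ObjectiveMapping } = Object.create(null);
  for (const m of mappings) byId[m.objectiveId] = m;

  const errors: string[] = [];
  for (const o of objectives) {
    const m = byId[o.id];
    if (m === undefined || valid.indexOf(m.coverageLevel) === -1) {
      errors.push(`${o.id}: unmapped or invalid coverageLevel (Property 13)`);
      continue;
    }
    if (m.coverageLevel !== "none" && m.weekModules.length + m.labs.length === 0) {
      errors.push(`${o.id}: ${m.coverageLevel} coverage without content (Property 14)`);
    }
    if (m.coverageLevel === "none" && m.alternativeResources.length === 0) {
      errors.push(`${o.id}: uncovered with no alternatives (Property 15)`);
    }
  }
  return errors;
}
```

The runtime check on coverageLevel matters because mappings are loaded from JSON, where the TypeScript union type offers no guarantee.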

Property 16: Gap Analysis Classification Completeness

For any BlueprintObjective in the system, there SHALL exist a GapAnalysisEntry with awsClassification in ["achievable", "partially_achievable", "not_achievable"].

Validates: Requirements 15.1

Property 17: Not-Achievable Entry Completeness

For any GapAnalysisEntry with awsClassification "not_achievable", the awsLimitation field SHALL be non-empty AND the alternatives array SHALL contain at least one AlternativePath.

Validates: Requirements 15.2, 15.3

Property 18: Partially-Achievable Entry Completeness

For any GapAnalysisEntry with awsClassification "partially_achievable", the awsCapabilities field SHALL be non-empty (describing what CAN be done on AWS) AND the alternatives array SHALL be non-empty (providing paths for what cannot).

Validates: Requirements 15.4

Property 19: Alternative Path Field Completeness

For any AlternativePath in any GapAnalysisEntry's alternatives array, the accessInstructions, estimatedCost, and availability fields SHALL all be non-empty strings.

Validates: Requirements 15.6
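
Properties 16 to 19 can be checked in the same style. The sketch assumes a hypothetical GapAnalysisEntry shape carrying the fields named above, and treats whitespace-only strings as empty.

```typescript
type AwsClassification = "achievable" | "partially_achievable" | "not_achievable";

interface AlternativePath {
  accessInstructions: string;
  estimatedCost: string;
  availability: string;
}

// Hypothetical entry shape using the field names from Properties 16-19.
interface GapAnalysisEntry {
  objectiveId: string;
  awsClassification: AwsClassification;
  awsLimitation: string;
  awsCapabilities: string;
  alternatives: AlternativePath[];
}

const nonEmpty = (s: string | undefined): boolean => typeof s === "string" && s.trim().length > 0;

// Properties 17-19 for a single entry.
function gapEntryErrors(e: GapAnalysisEntry): string[] {
  const errors: string[] = [];
  if (e.awsClassification === "not_achievable") {
    if (!nonEmpty(e.awsLimitation)) errors.push("missing awsLimitation (Property 17)");
    if (e.alternatives.length === 0) errors.push("no alternatives (Property 17)");
  }
  if (e.awsClassification === "partially_achievable") {
    if (!nonEmpty(e.awsCapabilities)) errors.push("missing awsCapabilities (Property 18)");
    if (e.alternatives.length === 0) errors.push("no alternatives (Property 18)");
  }
  e.alternatives.forEach((a, i) => {
    if (!nonEmpty(a.accessInstructions) || !nonEmpty(a.estimatedCost) || !nonEmpty(a.availability)) {
      errors.push(`alternative ${i}: empty field (Property 19)`);
    }
  });
  return errors;
}

// Property 16: every blueprint objective has an entry with a valid classification.
function unclassifiedObjectives(objectiveIds: string[], entries: GapAnalysisEntry[]): string[] {
  const valid: string[] = ["achievable", "partially_achievable", "not_achievable"];
  return objectiveIds.filter(
    (id) => !entries.some((e) => e.objectiveId === id && valid.indexOf(e.awsClassification) !== -1),
  );
}
```

Property 19 is checked on every entry's alternatives, not only on entries that are required to have them, because the property applies to any AlternativePath in the system.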

Property 20: Blog Asset Type Support

For any BlogAsset with type in ["image", "diagram", "code", "terminal_output"], the blog system SHALL accept and store it with a valid filePath and preserve the type classification on retrieval.

Validates: Requirements 13.2

Property 21: Container Isolation

For any service component defined in the system, it SHALL have a corresponding Dockerfile entry and be listed in docker-compose.yml, with no runtime dependencies (Node.js, Python, Terraform, kubectl, AWS CLI) required to be installed on the host machine beyond Docker and Docker Compose.

Validates: Requirements 16.1, 16.3, 16.8

Property 22: Cloud Portability

For any environment-specific configuration value (storage paths, service URLs, credentials, feature flags) used by the application, it SHALL be sourced from environment variables or external configuration files, such that changing only environment variables and volume mounts enables cloud deployment without code modifications.

Validates: Requirements 16.5, 16.6
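
One way to satisfy Property 22 is to funnel every environment-specific value through a single loader that receives the environment as a parameter. The sketch below is hypothetical (AppConfig, loadConfig, and the choice of required variables are assumptions based on the environment table). It also fails fast with all missing names at once, as the "Cloud migration config mismatch" error strategy describes.

```typescript
// Hypothetical typed view of the variables listed in the environment table.
interface AppConfig {
  nodeEnv: string;
  dataDir: string;
  assetsDir: string; // a local path, or an s3:// URI in the cloud column
  contentDir: string;
  monitoringUrl: string;
}

type Env = { [name: string]: string | undefined };

const REQUIRED_VARS = ["DATA_DIR", "ASSETS_DIR", "CONTENT_DIR", "MONITORING_URL"];

// Reads configuration only from `env` (process.env in the app) and reports every missing
// variable in one error instead of stopping at the first.
function loadConfig(env: Env): AppConfig {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    nodeEnv: env["NODE_ENV"] || "development",
    dataDir: env["DATA_DIR"] as string,
    assetsDir: env["ASSETS_DIR"] as string,
    contentDir: env["CONTENT_DIR"] as string,
    monitoringUrl: env["MONITORING_URL"] as string,
  };
}
```

Calling loadConfig(process.env) once at startup keeps the rest of the code free of environment lookups, so moving from the local column to the ECS/EKS column changes only the Compose file or task definition. A storage layer consuming assetsDir would still need to handle the s3:// form.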

Property 23: Volume Persistence

For any data written to a Docker volume mount path (progress data, blog posts, uploaded assets), that data SHALL be retrievable after a container restart, verifying that all persistent state is stored in volume-mounted directories and not in ephemeral container filesystem layers.

Validates: Requirements 16.4

Error Handling

Content Loading Errors

Error Scenario	Handling Strategy
Missing MDX file for a week/topic	Display error message with file path, allow navigation to other content
Invalid JSON in progress.json	Attempt recovery from backup, prompt user to reset if unrecoverable
Missing lab Terraform module	Show warning, allow lab instructions to be read without provisioning
Corrupt asset file	Display placeholder with error message, log for user attention
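
The "Invalid JSON in progress.json" strategy can be sketched as a loader that tries the primary file, then a backup, and otherwise asks for a reset. The ReadFile abstraction and the backup file name are assumptions; in the app, the reader would wrap fs.readFileSync under DATA_DIR.

```typescript
// Hypothetical reader: returns file contents, or null when the file does not exist.
type ReadFile = (path: string) => string | null;

type LoadResult =
  | { status: "ok"; data: unknown; source: string }
  | { status: "needs_reset"; error: string };

function loadProgress(read: ReadFile, primary: string, backup: string): LoadResult {
  let lastError = "";
  for (const path of [primary, backup]) {
    const raw = read(path);
    if (raw === null) {
      lastError = `${path} not found`;
      continue;
    }
    try {
      return { status: "ok", data: JSON.parse(raw), source: path };
    } catch (e) {
      lastError = `${path}: ${(e as Error).message}`;
    }
  }
  // Neither copy parsed: surface the error so the UI can prompt the user to reset.
  return { status: "needs_reset", error: lastError };
}
```

The loader never writes, so a corrupt progress.json stays on disk for manual recovery, in line with the data-preservation principle below.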

Progress System Errors

Error Scenario	Handling Strategy
File write failure (disk full)	Queue writes, notify user, retry on space availability
Invalid evidence type	Reject with clear error message listing supported types
Duplicate evidence ID	Generate new unique ID, log warning
Progress calculation overflow	Cap at 100%, log inconsistency for review

Blog System Errors

Error Scenario	Handling Strategy
Asset upload exceeds size limit	Reject with size limit message, suggest compression
Template generation fails	Fall back to minimal template with just title and date
Export format error	Provide raw markdown as fallback, log formatting issue
Invalid milestone reference	Create blog without pre-population, notify user

Blueprint/Gap Analysis Errors

Error Scenario	Handling Strategy
Objective ID not found in mapping	Flag as "unmapped" in report, add to gap list
Circular dependency in objectives	Detect and break cycle, log warning
Missing alternative resources	Display "alternatives pending" status, flag for content update

Container/Deployment Errors

Error Scenario	Handling Strategy
Docker daemon not running	Display clear error message with instructions to start Docker
Port conflict (3000, 9090, 3001 already in use)	Log conflicting port, suggest alternative port mapping via env var
Volume mount permission denied	Check directory permissions, create directories if missing, log instructions
Container build failure	Display build stage that failed, suggest `docker compose build --no-cache`
Container health check failure	Retry with backoff, log container logs, suggest `docker compose logs <service>`
Cloud migration config mismatch	Validate all required env vars are set before startup, fail fast with missing var names
Named volume data corruption	Provide volume backup/restore instructions, allow fresh volume creation

General Error Principles

Graceful degradation — System remains usable even when individual components fail
Data preservation — Never lose user progress data; prefer read-only mode over data loss
Clear messaging — All errors provide actionable information to the user
Recovery paths — Each error state has a documented recovery procedure
Logging — All errors are logged with context for debugging

Testing Strategy

Property-Based Testing

This system is well-suited for property-based testing because it has clear data structures with invariants that must hold across all valid instances. The curriculum, lab, progress, and blueprint systems all have universal properties about structural completeness and data consistency.

Library: fast-check (TypeScript property-based testing library)
Configuration: minimum of 100 iterations per property test
Tag format: Feature: nvidia-aws-gpu-certification, Property {number}: {property_text}

Property tests will cover:

Structural completeness of curriculum data (Properties 1-3)
Resource and lab data integrity (Properties 4-7, 7a-7d)
Progress system round-trips and calculations (Properties 8-10)
Blog system template generation and export (Properties 11-12, 20)
Blueprint mapping completeness (Properties 13-15)
Gap analysis classification and completeness (Properties 16-19)
Container isolation, cloud portability, and volume persistence (Properties 21-23)

Unit Testing

Unit tests complement property tests by covering specific examples and edge cases:

Cost Calculator: Specific instance type calculations with known expected values
Progress percentage: Edge cases (0%, 100%, single item)
Blog template: Specific milestone types produce expected template structures
Gap Analysis: Known objectives with known AWS limitations
Dependency resolution: Specific curriculum orderings

Integration Testing

Content loading pipeline: MDX files parse correctly and render expected components
Progress persistence: Write → read cycle across application restarts
Blog export pipeline: End-to-end from template to exported markdown
Terraform validation: Lab Terraform configs pass terraform validate
Blueprint cross-reference: Full mapping produces expected coverage report
Container orchestration: All services start via docker compose up and communicate correctly
Volume persistence: Data written in containers survives docker compose down && docker compose up

Smoke Testing

Given the large number of content-completeness requirements (Requirements 2-8, 10), smoke tests verify:

All 6 week modules load without errors
All referenced labs exist and have valid structure
All external resource URLs are formatted correctly
Terraform modules pass syntax validation
Blueprint objectives JSON is valid and complete

Container Testing

Container-level tests verify Requirement 16 compliance:

Dockerfile validity: All Dockerfiles build successfully with docker build
Compose validation: docker compose config passes without errors
Service health: All services pass their health checks after startup
No host dependencies: Application functions correctly with only Docker installed (no Node.js, Python, etc. on host)
Volume mounts: Data directories are correctly mounted and writable
Network connectivity: Services can communicate over the internal bridge network
Cloud config compatibility: Container images run with cloud environment variable overrides
Tools container: Lab tools (Terraform, kubectl, AWS CLI) are available and functional inside the tools container

Test Organization

tests/
├── properties/
│   ├── curriculum.property.test.ts
│   ├── labs.property.test.ts
│   ├── progress.property.test.ts
│   ├── blog.property.test.ts
│   ├── blueprint.property.test.ts
│   ├── gap-analysis.property.test.ts
│   └── container.property.test.ts
├── unit/
│   ├── cost-calculator.test.ts
│   ├── progress-calculation.test.ts
│   ├── blog-template.test.ts
│   ├── dependency-resolver.test.ts
│   └── gap-analysis.test.ts
├── integration/
│   ├── content-loading.test.ts
│   ├── progress-persistence.test.ts
│   ├── blog-export.test.ts
│   └── container-orchestration.test.ts
├── smoke/
│   ├── content-completeness.test.ts
│   ├── terraform-validation.test.ts
│   └── docker-build.test.ts
└── container/
    ├── dockerfile-build.test.ts
    ├── compose-validation.test.ts
    ├── service-health.test.ts
    ├── volume-persistence.test.ts
    └── cloud-config.test.ts