Design Document: NVIDIA AWS GPU Certification Study System
Overview
This system provides a structured 6-week study plan and resource platform for the NVIDIA-Certified Professional: AI Infrastructure (NCP-AII) exam, focused on AWS GPU deployment. It combines curriculum management, hands-on labs, progress tracking, blog publishing, and certification blueprint cross-checking in one learning platform.
The system is designed as a static-site-based application (Next.js) with local-first data storage, enabling candidates to work through the curriculum offline while maintaining the ability to publish and share content. Content is organized around the NVIDIA certification blueprint objectives, with each component (labs, progress, blog) tied back to specific exam objectives.
Key Design Decisions
- Next.js Static Site with MDX: Curriculum content is authored in MDX for rich interactivity while maintaining version control friendliness. This allows code blocks, diagrams, and interactive elements within study materials.
- Local-first with JSON/Markdown storage: Progress data and blog drafts are stored locally in structured JSON and Markdown files, avoiding external database dependencies during study. This keeps the system portable and simple.
- File-based Lab System: Labs are self-contained directories with instructions, Terraform configs, validation scripts, and expected outputs. This makes labs reproducible and version-controllable.
- Blueprint-driven Architecture: The NVIDIA certification blueprint is the source of truth. All content (weeks, labs, exercises) maps back to specific blueprint objectives, enabling gap analysis and coverage tracking.
- Mermaid Diagrams: Architecture and flow diagrams use Mermaid for inline rendering without external tooling.
Architecture
High-Level System Architecture
graph TB
subgraph "Host Machine (Docker Only)"
DC[Docker Compose]
subgraph app ["Container: app"]
subgraph "Content Layer"
CUR[Curriculum MDX Files]
LABS[Lab Definitions]
RES[Resource References]
BP[Blueprint Objectives]
end
subgraph "Application Layer"
APP[Next.js Application]
PROG[Progress Tracker]
BLOG[Blog System]
BPC[Blueprint Checker]
GAP[Gap Analyzer]
COST[Cost Calculator]
end
end
subgraph tools ["Container: tools"]
TF[Terraform]
KUBECTL[kubectl]
AWSCLI[AWS CLI]
NVIDIASMI[nvidia-smi simulator]
end
subgraph monitoring ["Container: monitoring"]
PROM[Prometheus]
GRAF[Grafana]
end
subgraph "Docker Volumes"
PDATA[Progress Data - JSON]
BDATA[Blog Posts - MDX]
ASSETS[Uploaded Assets]
end
subgraph "Output Layer"
SITE[Static Site]
EXPORT[Blog Export]
REPORT[Progress Reports]
end
end
DC --> app
DC --> tools
DC --> monitoring
CUR --> APP
LABS --> APP
RES --> APP
BP --> BPC
BP --> GAP
APP --> PROG
APP --> BLOG
APP --> BPC
APP --> GAP
APP --> COST
PROG --> PDATA
BLOG --> BDATA
BLOG --> ASSETS
APP --> SITE
BLOG --> EXPORT
PROG --> REPORT
Directory Structure
nvidia-gpu-cert-study/
├── docker-compose.yml
├── Dockerfile                  # Main app container
├── Dockerfile.tools            # Lab tools container (Terraform, kubectl, AWS CLI)
├── Dockerfile.monitoring       # Monitoring stack container (Prometheus, Grafana)
├── .env.local                  # Local environment variables
├── .env.cloud.example          # Cloud environment variable template
├── content/
│   ├── curriculum/
│   │   ├── week-1-gpu-fundamentals/
│   │   ├── week-2-kubernetes-eks/
│   │   ├── week-3-slurm-hpc/
│   │   ├── week-4-monitoring-dcgm/
│   │   ├── week-5-cost-optimization/
│   │   └── week-6-troubleshooting/
│   ├── labs/
│   │   ├── lab-01-nvidia-smi-basics/
│   │   ├── lab-02-mig-configuration/
│   │   ├── lab-03-eks-gpu-cluster/
│   │   └── ...
│   ├── blueprint/
│   │   ├── objectives.json
│   │   └── mapping.json
│   └── resources/
│       └── references.json
├── src/
│   ├── components/
│   │   ├── LabRunner/
│   │   ├── ProgressTracker/
│   │   ├── BlogEditor/
│   │   ├── BlueprintChecker/
│   │   ├── GapAnalysis/
│   │   └── CostCalculator/
│   ├── lib/
│   │   ├── progress.ts
│   │   ├── blueprint.ts
│   │   ├── labs.ts
│   │   ├── blog.ts
│   │   ├── cost.ts
│   │   └── container.ts
│   └── pages/
│       ├── index.tsx
│       ├── week/[weekId].tsx
│       ├── labs/[labId].tsx
│       ├── progress.tsx
│       ├── blog/
│       ├── blueprint.tsx
│       └── gap-analysis.tsx
├── data/                       # Docker volume mount point
│   ├── progress.json
│   └── blog-posts/
├── public/
│   └── assets/                 # Docker volume mount point
├── terraform/
│   ├── modules/
│   │   ├── eks-gpu-cluster/
│   │   ├── p5-instance/
│   │   └── monitoring-stack/
│   └── labs/
├── monitoring/
│   ├── prometheus.yml
│   └── grafana/
│       └── dashboards/
└── scripts/
    ├── validate-lab.sh
    └── export-blog.ts
Technology Stack
| Component | Technology | Rationale |
|---|---|---|
| Framework | Next.js 14 (App Router) | Static generation, MDX support, file-based routing |
| Content | MDX | Rich content with embedded components and code |
| Styling | Tailwind CSS | Rapid UI development, responsive design |
| Data Storage | Local JSON + Markdown files | No database dependency, portable, version-controllable |
| Diagrams | Mermaid | Inline diagrams without external tools |
| IaC | Terraform | Industry standard for AWS provisioning |
| Lab Validation | Shell scripts + Node.js | Cross-platform validation of lab outcomes |
| Blog Export | Markdown + HTML | Compatible with common publishing platforms |
| Testing | Vitest + fast-check | Unit testing with property-based testing support |
| Containerization | Docker + Docker Compose | All services run in containers, no host dependencies |
| Container Registry | Docker Hub / Amazon ECR | Cloud-compatible image distribution |
| Monitoring | Prometheus + Grafana (containerized) | GPU metrics visualization in isolated containers |
Components and Interfaces
1. Curriculum Engine
Responsible for loading, organizing, and presenting the 6-week study plan content.
interface WeekModule {
id: string; // "week-1", "week-2", etc.
title: string;
objectives: LearningObjective[];
topics: Topic[];
labs: string[]; // Lab IDs
dependencies: string[]; // IDs of prerequisite weeks
blueprintObjectives: string[]; // Blueprint objective IDs covered
}
interface LearningObjective {
id: string;
description: string;
blueprintRef: string[]; // Maps to blueprint objectives
assessmentCriteria: string;
}
interface Topic {
id: string;
title: string;
content: string; // MDX file path
difficulty: "foundational" | "intermediate" | "advanced";
estimatedMinutes: number;
practiceExercises: Exercise[];
}
interface Exercise {
id: string;
title: string;
description: string;
type: "command" | "configuration" | "deployment" | "analysis";
instructions: string;
expectedOutcome: string;
hints: string[];
validationScript?: string;
}
2. Lab System
Self-contained lab exercises tied to certification objectives with validation. Labs are organized into a tiered architecture that optimizes for cost discipline and repetition frequency.
type LabTier = "foundation" | "advanced" | "elite";
interface Lab {
id: string;
title: string;
weekId: string;
objectives: string[]; // Learning objective IDs
blueprintRefs: string[]; // Blueprint objective IDs
prerequisites: string[]; // Lab IDs that must be completed first
difficulty: "beginner" | "intermediate" | "advanced";
estimatedMinutes: number;
tier: LabTier;
recommendedInstance: string; // Cheapest instance that works (e.g., "g5.xlarge")
estimatedHourlyCost: number; // Hourly cost in USD
minimumGpuRequirement: number; // 1, 4, or 8
multiGpuRequired: boolean;
nvlinkRequired: boolean;
instanceJustification: string; // Why this instance was chosen
estimatedCost: CostEstimate;
environment: LabEnvironment;
steps: LabStep[];
validationCheckpoints: ValidationCheckpoint[];
}
interface LabEnvironment {
awsServices: string[];
instanceTypes: string[];
terraformModule?: string; // Path to Terraform config
setupInstructions: string;
teardownInstructions: string;
}
interface LabStep {
order: number;
title: string;
instructions: string; // MDX content
commands?: string[];
expectedOutput?: string;
checkpoint?: string; // Validation checkpoint ID
}
interface ValidationCheckpoint {
id: string;
description: string;
validationType: "command_output" | "file_exists" | "api_response" | "manual";
validationScript?: string;
expectedResult: string;
}
interface CostEstimate {
hourlyRate: number;
estimatedDuration: number; // minutes
totalEstimate: number;
instanceType: string;
notes: string;
}
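The `prerequisites` field implies a gating rule: a lab becomes available once every lab it lists has been completed. A minimal sketch of that rule follows, assuming a trimmed-down `LabRef` shape and a hypothetical helper named `unlockedLabs`; neither is defined elsewhere in this design, and the sample prerequisite chain is illustrative only.

```typescript
// Trimmed view of Lab: only the fields the gating rule needs (assumption).
interface LabRef {
  id: string;
  prerequisites: string[]; // Lab IDs that must be completed first
}

// Hypothetical helper: labs not yet completed whose prerequisites are all done.
function unlockedLabs(labs: LabRef[], completed: ReadonlySet<string>): string[] {
  return labs
    .filter((lab) => !completed.has(lab.id))
    .filter((lab) => lab.prerequisites.every((p) => completed.has(p)))
    .map((lab) => lab.id);
}

// Illustrative prerequisite chain; the real lab definitions may differ.
const sampleLabs: LabRef[] = [
  { id: "lab-01", prerequisites: [] },
  { id: "lab-03", prerequisites: ["lab-01"] },
  { id: "lab-04", prerequisites: ["lab-03"] },
];

// With only lab-01 completed, lab-03 is the single lab that opens up.
const availableNow = unlockedLabs(sampleLabs, new Set(["lab-01"]));
```

Keeping the rule as a pure function over IDs means the same check could run in the UI and in a validation script without touching the progress file.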
Lab Tier Architecture
The lab system uses a three-tier architecture to enforce cost discipline while ensuring learning objectives are met:
| Tier | Target Instances | Hourly Cost | Lab Distribution | Use Cases |
|---|---|---|---|---|
| Foundation | g5.xlarge, g5.2xlarge | ~$1-2/hr | 70-80% of labs | nvidia-smi, CUDA basics, Docker GPU runtime, EKS GPU plugin, K8s scheduling, taints/tolerations, DCGM basics, Prometheus/Grafana, MIG concepts (theory), Terraform, GPU pod deployment |
| Advanced | g5.12xlarge (4x A10G), p4d.24xlarge | ~$5-33/hr | 10-20% of labs | Multiple GPU scheduling, inference scaling, profiling, concurrency testing, MIG practicals (A100 required) |
| Elite | p4d.24xlarge, p5.48xlarge | ~$33-98/hr | ≤10% of labs | NVLink, NCCL, distributed training, topology awareness, GPUDirect, EFA, multi-node collectives |
Lab-to-Tier Assignments
| Lab | Title | Tier | Instance | Hourly Cost | Justification |
|---|---|---|---|---|---|
| Lab 01 | nvidia-smi basics | Foundation | g5.xlarge | $1/hr | Single-GPU nvidia-smi queries work on any GPU |
| Lab 02 | MIG configuration | Advanced | p4d.24xlarge | $33/hr | MIG requires A100 GPU (not available on G5) |
| Lab 03 | EKS GPU cluster | Foundation | g5.xlarge (workers) | $1/hr | Device plugin, scheduling, taints work on single GPU |
| Lab 04 | GPU pod scheduling | Foundation | g5.xlarge | $1/hr | Resource requests/limits work on single GPU |
| Lab 05 | MIG Kubernetes | Advanced | p4d.24xlarge | $33/hr | MIG-in-K8s requires A100 for real MIG partitions |
| Lab 06 | Slurm GPU cluster | Foundation | g5.xlarge | $1/hr | Slurm GPU scheduling works with single GPU |
| Lab 07 | DCGM setup | Foundation | g5.xlarge | $1/hr | DCGM installation and basic metrics on any GPU |
| Lab 08 | Prometheus/Grafana | Foundation | g5.xlarge | $1/hr | Monitoring stack setup is GPU-count agnostic |
| Lab 09 | Cost analysis | Foundation | local/none | $0/hr | Cost analysis is a calculation exercise, no GPU needed |
| Lab 10 | P5 deployment | Elite | p5.48xlarge | $98/hr | H100-specific features, NVLink topology |
| Lab 11 | EFA configuration | Elite | p5.48xlarge | $98/hr | EFA/multi-node requires p5 with EFA interfaces |
| Lab 12 | Troubleshooting | Foundation | g5.xlarge | $1/hr | Troubleshooting methodology works on any GPU |
Tier-Based Workflow Recommendations
graph TD
subgraph "Daily Practice (Foundation Tier)"
F1[nvidia-smi drills]
F2[K8s scheduling exercises]
F3[DCGM/monitoring setup]
F4[Terraform deployments]
F5[Troubleshooting scenarios]
end
subgraph "Weekly Sessions (Advanced Tier)"
A1[MIG configuration]
A2[Multi-GPU profiling]
end
subgraph "Focused Sessions (Elite Tier, 2-4hr max)"
E1[NVLink/NCCL testing]
E2[EFA multi-node]
end
F1 --> A1
F2 --> A1
F3 --> E1
A1 --> E1
A2 --> E2
Recommended workflow:
- Daily (1-2 hours): Run foundation-tier labs on g5.xlarge. Build muscle memory through repetition. Cost: ~$1-2/session.
- Weekly (1 session): Run advanced-tier labs when foundation concepts are solid. Cost: ~$5-33/session.
- Every two weeks (2-4 hour focused session): Run elite-tier labs only when specifically preparing for NVLink/NCCL/EFA topics. Terminate instances immediately after the session. Cost: ~$66-392/session.
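The per-session figures follow from hourly rate multiplied by session length. The sketch below reproduces the elite-tier range using the approximate rates from the tier table; `sessionCostRange` is a hypothetical helper name, and real AWS prices vary by region and over time.

```typescript
// Approximate on-demand hourly rates (USD) taken from the tier table.
const HOURLY_RATE_USD = {
  "p4d.24xlarge": 33,
  "p5.48xlarge": 98,
} as const;

// Hypothetical helper: cheapest and most expensive session for given bounds.
function sessionCostRange(
  minRate: number,
  maxRate: number,
  minHours: number,
  maxHours: number,
): { low: number; high: number } {
  return { low: minRate * minHours, high: maxRate * maxHours };
}

// Elite tier: 2-4 hour sessions, from p4d.24xlarge up to p5.48xlarge.
const eliteSession = sessionCostRange(
  HOURLY_RATE_USD["p4d.24xlarge"],
  HOURLY_RATE_USD["p5.48xlarge"],
  2,
  4,
);
// eliteSession is { low: 66, high: 392 }
```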
3. Progress Tracking System
Tracks completion status, stores evidence, and provides summary views.
interface ProgressStore {
userId: string;
startDate: string;
currentWeek: number; // 1-6, as stored in progress.json
weeks: WeekProgress[];
labs: LabProgress[];
overallCompletion: number; // 0-100
blueprintCoverage: BlueprintCoverage;
}
interface WeekProgress {
weekId: string;
status: "not_started" | "in_progress" | "completed";
startedAt?: string;
completedAt?: string;
objectives: ObjectiveProgress[];
}
interface ObjectiveProgress {
objectiveId: string;
status: "not_started" | "in_progress" | "completed";
completedAt?: string;
evidence: ProgressEvidence[];
notes: string;
}
interface LabProgress {
labId: string;
status: "not_started" | "in_progress" | "completed" | "failed";
startedAt?: string;
completedAt?: string;
checkpointResults: CheckpointResult[];
evidence: ProgressEvidence[];
timeSpent: number; // minutes
}
interface ProgressEvidence {
id: string;
type: "screenshot" | "code" | "command_output" | "config_file" | "note";
title: string;
content: string; // For code/commands: inline content. For images: file path
filePath?: string; // Path to uploaded asset
createdAt: string;
objectiveId?: string;
labId?: string;
}
interface CheckpointResult {
checkpointId: string;
passed: boolean;
output?: string;
timestamp: string;
}
interface BlueprintCoverage {
totalObjectives: number;
covered: number;
partial: number;
uncovered: number;
byCategory: CategoryCoverage[];
}
interface CategoryCoverage {
category: string;
objectives: { id: string; status: "covered" | "partial" | "uncovered" }[];
}
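The interfaces store a status per objective and per week but do not say how one is derived from the other. The sketch below shows one possible roll-up, with hypothetical helpers `deriveWeekStatus` and `completionPercent`. The rules themselves (a week is completed only when all of its objectives are, and completion is the rounded share of completed items) are assumptions, not part of this design.

```typescript
type Status = "not_started" | "in_progress" | "completed";

// Assumed roll-up rule: all completed -> completed, all untouched -> not_started,
// anything in between -> in_progress. An empty week counts as not_started.
function deriveWeekStatus(objectiveStatuses: Status[]): Status {
  if (objectiveStatuses.length === 0) return "not_started";
  if (objectiveStatuses.every((s) => s === "completed")) return "completed";
  if (objectiveStatuses.every((s) => s === "not_started")) return "not_started";
  return "in_progress";
}

// Assumed definition of a 0-100 completion value: rounded share of completed items.
function completionPercent(statuses: Status[]): number {
  if (statuses.length === 0) return 0;
  const done = statuses.filter((s) => s === "completed").length;
  return Math.round((done / statuses.length) * 100);
}
```

Deriving these values on read, rather than storing them independently, would avoid the stored week status drifting out of sync with its objectives.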
4. Blog System
Intuitive content creation for milestone documentation and knowledge sharing.
interface BlogPost {
id: string;
title: string;
slug: string;
milestoneId: string; // Week or lab ID this documents
status: "draft" | "published";
createdAt: string;
updatedAt: string;
publishedAt?: string;
content: string; // MDX content
assets: BlogAsset[];
tags: string[];
objectivesCovered: string[];
template: string; // Template used for creation
}
interface BlogAsset {
id: string;
type: "image" | "diagram" | "code" | "terminal_output";
fileName: string;
filePath: string;
caption?: string;
uploadedAt: string;
}
interface BlogTemplate {
id: string;
milestoneType: "week_completion" | "lab_completion" | "certification_prep";
title: string;
sections: TemplateSection[];
}
interface TemplateSection {
heading: string;
placeholder: string;
required: boolean;
prefillFrom?: string; // Data source for pre-population
}
5. Blueprint Checker
Maps certification objectives to study plan content and identifies gaps.
interface BlueprintObjective {
id: string;
category: "deployment_validation" | "software_installation" | "performance_testing" | "troubleshooting";
title: string;
description: string;
subObjectives?: string[];
}
interface ObjectiveMapping {
objectiveId: string;
weekModules: string[];
labs: string[];
exercises: string[];
coverageLevel: "full" | "partial" | "none";
notes?: string;
alternativeResources?: string[];
}
interface BlueprintReport {
totalObjectives: number;
fullyCovered: number;
partiallyCovered: number;
notCovered: number;
mappings: ObjectiveMapping[];
gaps: GapAnalysisEntry[]; // Entry type defined under Gap Analysis System
}
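The counters in `BlueprintReport` can be derived from the mappings alone. The following sketch tallies `coverageLevel` values; `MappingRef`, `CoverageSummary`, and `summarizeMappings` are hypothetical names introduced for illustration, and the two sample entries reuse objective IDs from the blueprint data in this document.

```typescript
type CoverageLevel = "full" | "partial" | "none";

// Trimmed view of ObjectiveMapping: only what the tally needs (assumption).
interface MappingRef {
  objectiveId: string;
  coverageLevel: CoverageLevel;
}

interface CoverageSummary {
  totalObjectives: number;
  fullyCovered: number;
  partiallyCovered: number;
  notCovered: number;
  uncoveredIds: string[]; // Candidates for gap analysis
}

// Hypothetical helper: count mappings per coverage level.
function summarizeMappings(mappings: MappingRef[]): CoverageSummary {
  const summary: CoverageSummary = {
    totalObjectives: mappings.length,
    fullyCovered: 0,
    partiallyCovered: 0,
    notCovered: 0,
    uncoveredIds: [],
  };
  for (const m of mappings) {
    if (m.coverageLevel === "full") summary.fullyCovered += 1;
    else if (m.coverageLevel === "partial") summary.partiallyCovered += 1;
    else {
      summary.notCovered += 1;
      summary.uncoveredIds.push(m.objectiveId);
    }
  }
  return summary;
}

const sampleSummary = summarizeMappings([
  { objectiveId: "dv-01", coverageLevel: "full" },
  { objectiveId: "dv-03", coverageLevel: "none" },
]);
```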
6. Gap Analysis System
Classifies objectives by AWS achievability and provides alternative paths.
interface GapAnalysisEntry {
objectiveId: string;
objectiveTitle: string;
awsClassification: "achievable" | "partially_achievable" | "not_achievable";
awsLimitation?: string;
awsCapabilities?: string; // What CAN be done on AWS
alternatives: AlternativePath[];
recommendedPath: string;
}
interface AlternativePath {
type: "nvidia_launchpad" | "local_hardware" | "simulation" | "virtual_lab" | "partner_lab";
description: string;
accessInstructions: string;
estimatedCost: string;
availability: string;
url?: string;
}
interface GapAnalysisReport {
summary: {
achievableOnAws: number;
partiallyAchievable: number;
notAchievable: number;
total: number;
};
entries: GapAnalysisEntry[];
recommendedPath: RecommendedPath;
}
interface RecommendedPath {
description: string;
phases: PathPhase[];
}
interface PathPhase {
order: number;
title: string;
platform: string;
objectives: string[];
estimatedDuration: string;
estimatedCost: string;
}
7. Cost Calculator
Estimates AWS costs for lab exercises and deployment scenarios.
interface CostCalculation {
instanceType: string;
region: string;
hoursPerDay: number;
daysPerWeek: number;
weeks: number;
pricingModel: "on_demand" | "spot" | "reserved_1yr" | "reserved_3yr";
additionalServices: ServiceCost[];
totalEstimate: number;
breakdown: CostBreakdown;
}
interface ServiceCost {
service: string;
monthlyEstimate: number;
notes: string;
}
interface CostBreakdown {
compute: number;
storage: number;
networking: number;
monitoring: number;
total: number;
savingsVsOnDemand?: number;
}
interface InstanceComparison {
instances: InstanceCostProfile[];
recommendation: string;
rationale: string;
}
interface InstanceCostProfile {
instanceType: string;
gpuModel: string;
gpuCount: number;
onDemandHourly: number;
spotHourly: number;
reservedHourly: number;
performanceScore: number; // Relative score
costEfficiency: number; // Performance per dollar
}
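Two of the fields above are simple derivations: compute cost is the hourly rate multiplied out over the study schedule, and `costEfficiency` is performance per dollar. A minimal sketch, assuming hypothetical helpers `computeCostUsd` and `rankByCostEfficiency`; the hourly rates are the approximate figures from the tier table, and the performance scores are made-up placeholders, not benchmark results.

```typescript
// Trimmed view of InstanceCostProfile (assumption).
interface InstanceProfile {
  instanceType: string;
  onDemandHourly: number;   // USD per hour
  performanceScore: number; // Relative score
}

// Compute cost for a schedule: rate x hours/day x days/week x weeks.
function computeCostUsd(
  hourlyRate: number,
  hoursPerDay: number,
  daysPerWeek: number,
  weeks: number,
): number {
  return hourlyRate * hoursPerDay * daysPerWeek * weeks;
}

// Performance per dollar, sorted from most to least efficient.
function rankByCostEfficiency(
  profiles: InstanceProfile[],
): Array<InstanceProfile & { costEfficiency: number }> {
  return profiles
    .map((p) => {
      if (p.onDemandHourly <= 0) {
        throw new RangeError(`${p.instanceType}: hourly rate must be positive`);
      }
      return { ...p, costEfficiency: p.performanceScore / p.onDemandHourly };
    })
    .sort((a, b) => b.costEfficiency - a.costEfficiency);
}

// Placeholder scores for illustration only.
const ranked = rankByCostEfficiency([
  { instanceType: "p5.48xlarge", onDemandHourly: 98, performanceScore: 50 },
  { instanceType: "g5.xlarge", onDemandHourly: 1, performanceScore: 1 },
  { instanceType: "p4d.24xlarge", onDemandHourly: 33, performanceScore: 20 },
]);

// Foundation tier at ~$1/hr, 2 hours a day, 5 days a week, for 6 weeks.
const foundationPlanUsd = computeCostUsd(1, 2, 5, 6); // 60
```

The zero-cost entries (simulator, local exercises) would need to be excluded before ranking, since performance per dollar is undefined at $0/hr.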
8. Container/Deployment Layer
Manages Docker container configuration, service orchestration, and cloud migration settings.
interface DockerService {
name: string; // "app", "tools", "monitoring"
dockerfile: string; // Path to Dockerfile
ports: PortMapping[];
volumes: VolumeMapping[];
environment: EnvironmentVariable[];
networks: string[];
healthcheck?: HealthCheck;
dependsOn?: string[]; // Other service names
}
interface PortMapping {
host: number;
container: number;
protocol: "tcp" | "udp";
}
interface VolumeMapping {
hostPath: string; // Local path or named volume
containerPath: string;
readOnly: boolean;
type: "bind" | "volume" | "tmpfs";
purpose: string; // Description of what this volume stores
}
interface ContainerConfig {
serviceName: string;
baseImage: string;
buildStages: BuildStage[];
exposedPorts: number[];
workdir: string;
user?: string;
entrypoint: string[];
cmd: string[];
labels: Record<string, string>;
}
interface BuildStage {
name: string;
from: string;
commands: string[];
copyFrom?: string; // Multi-stage build source
}
interface EnvironmentVariable {
name: string;
localDefault: string;
cloudValue?: string; // Value or source in cloud deployment
description: string;
required: boolean;
}
interface HealthCheck {
test: string[];
interval: string;
timeout: string;
retries: number;
startPeriod?: string;
}
interface CloudMigrationConfig {
platform: "ecs" | "eks" | "fargate";
environmentOverrides: EnvironmentVariable[];
volumeReplacements: CloudVolumeMapping[];
networkConfig: CloudNetworkConfig;
serviceDiscovery: ServiceDiscoveryConfig;
}
interface CloudVolumeMapping {
localVolume: string; // Name from docker-compose
cloudStorage: "efs" | "s3" | "ebs";
cloudPath: string;
accessMode: "ReadWriteOnce" | "ReadWriteMany" | "ReadOnlyMany";
}
interface CloudNetworkConfig {
vpcId?: string;
subnetIds?: string[];
securityGroupIds?: string[];
serviceConnectEnabled: boolean;
}
interface ServiceDiscoveryConfig {
namespace: string;
services: { name: string; port: number; protocol: string }[];
}
9. Terminal Simulator
Browser-based terminal simulator providing zero-cost GPU command practice using mock outputs. This is the zero-cost practice layer referenced by foundation-tier labs for command familiarization before spinning up real instances.
Architecture
The simulator uses a client-only architecture with no backend server:
- xterm.js canvas — Renders the terminal interface in the browser
- DOM overlay picker — Parameter picker (Ctrl+Space) rendered as a sibling DOM element positioned over the terminal
- localStorage session store — Persists all session history, completion state, and replay data in the browser
File Structure
simulator/
├── data/
│   └── commands.json    # Single source of truth: command definitions, parameter registries, mock outputs, error variants, lab structure
├── src/
│   ├── paramPicker.js   # Ctrl+Space overlay: DOM sibling of xterm, longest-match parsing for context-aware suggestions
│   ├── mockRunner.js    # Command execution: longest-match command lookup, error mode support, output rendering
│   ├── sessionStore.js  # localStorage persistence: entry schema with outlineId/tokens/errorType fields
│   ├── navRenderer.js   # Left nav panel: lab outline with completion tracking, session history entries, click-to-replay
│   └── main.js          # Wires all modules: xterm.js terminal initialization, command execution flow, event routing
├── public/
│   └── index.html       # Two-column layout (nav + terminal), dark theme, xterm.js loaded from CDN
└── Dockerfile           # Static file server container, port 3000, no backend dependencies
Integration Points
- Foundation-tier labs reference the simulator for command familiarization before real instance spin-up
- Cost calculator shows "$0/hr — simulator" as a practice option
- commands.json defines 5 domains: GPU observability, fabric/topology, MIG/partitioning, workload/containers, and failure diagnosis
- Each command entry includes clean output and error variants (driver mismatches, ECC errors, FM not running, etc.)
interface CommandEntry {
command: string; // e.g., "nvidia-smi"
domain: "gpu-observability" | "fabric-topology" | "mig-partitioning" | "workload-containers" | "failure-diagnosis";
subcommands: SubcommandEntry[];
parameters: ParameterDef[];
mockOutputs: MockOutput[];
errorVariants: ErrorVariant[];
labId: string; // Which lab this command belongs to
exerciseIds: string[]; // Which exercises use this command
}
interface SubcommandEntry {
name: string;
description: string;
parameters: ParameterDef[];
examples: string[];
}
interface ParameterDef {
flag: string; // e.g., "--query-gpu"
description: string;
values?: string[]; // Allowed values if enumerable
examples: string[];
}
interface MockOutput {
input: string; // Full command string that triggers this output
output: string; // Terminal output to render
description: string;
}
interface ErrorVariant {
input: string;
output: string;
errorType: string; // e.g., "driver_mismatch", "ecc_error", "fm_not_running"
description: string;
}
interface SessionEntry {
id: string;
timestamp: string;
outlineId: string; // Lab/exercise this belongs to
tokens: string[]; // Parsed command tokens
errorType?: string; // If error mode was active
output: string; // What was rendered
}
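The file structure describes `mockRunner.js` as using longest-match command lookup. One way to read that is: among all mock entries whose tokens form a prefix of what the user typed, pick the one with the most tokens. The sketch below implements that reading; `findLongestMatch` and `tokenize` are hypothetical names, the outputs are placeholders rather than real command output, and the actual simulator may match differently.

```typescript
// Trimmed view of MockOutput (assumption).
interface MockOutputRef {
  input: string;  // Full command string that triggers this output
  output: string; // Terminal output to render
}

// Split a command line on whitespace, ignoring empty tokens.
function tokenize(line: string): string[] {
  return line.trim().split(/\s+/).filter((t) => t.length > 0);
}

// Return the entry whose tokens are the longest prefix of the typed tokens.
function findLongestMatch(
  typed: string,
  entries: MockOutputRef[],
): MockOutputRef | undefined {
  const typedTokens = tokenize(typed);
  let best: MockOutputRef | undefined;
  let bestLength = 0;
  for (const entry of entries) {
    const entryTokens = tokenize(entry.input);
    const isPrefix =
      entryTokens.length <= typedTokens.length &&
      entryTokens.every((token, i) => token === typedTokens[i]);
    if (isPrefix && entryTokens.length > bestLength) {
      best = entry;
      bestLength = entryTokens.length;
    }
  }
  return best;
}

// Placeholder outputs; real entries would carry full terminal text.
const mockEntries: MockOutputRef[] = [
  { input: "nvidia-smi", output: "<summary table>" },
  { input: "nvidia-smi -L", output: "<GPU list>" },
  { input: "nvidia-smi topo -m", output: "<topology matrix>" },
];
```

Under this reading, an unrecognized flag falls back to the bare `nvidia-smi` entry, and a command with no matching entry returns `undefined` so the caller can render a "command not found" message.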
Data Models
Progress Data (progress.json)
{
"userId": "candidate-001",
"startDate": "2024-01-15",
"currentWeek": 3,
"overallCompletion": 42,
"weeks": [
{
"weekId": "week-1",
"status": "completed",
"startedAt": "2024-01-15T08:00:00Z",
"completedAt": "2024-01-21T18:00:00Z",
"objectives": [
{
"objectiveId": "w1-obj-1",
"status": "completed",
"completedAt": "2024-01-16T14:00:00Z",
"evidence": [
{
"id": "ev-001",
"type": "screenshot",
"title": "nvidia-smi output showing GPU details",
"content": "",
"filePath": "assets/progress/week-1/nvidia-smi-output.png",
"createdAt": "2024-01-16T14:00:00Z"
}
],
"notes": "Completed GPU architecture review"
}
]
}
],
"labs": [
{
"labId": "lab-01",
"status": "completed",
"startedAt": "2024-01-17T09:00:00Z",
"completedAt": "2024-01-17T11:30:00Z",
"checkpointResults": [
{
"checkpointId": "cp-01",
"passed": true,
"output": "GPU 0: NVIDIA A10G",
"timestamp": "2024-01-17T09:30:00Z"
}
],
"evidence": [],
"timeSpent": 150
}
],
"blueprintCoverage": {
"totalObjectives": 45,
"covered": 12,
"partial": 5,
"uncovered": 28,
"byCategory": []
}
}
Blueprint Objectives (objectives.json)
{
"categories": [
{
"id": "deployment_validation",
"title": "Deployment and Validation",
"objectives": [
{
"id": "dv-01",
"title": "Deployment event sequences",
"description": "Understand and execute proper deployment event sequences for GPU infrastructure"
},
{
"id": "dv-02",
"title": "Network topologies for AI factories",
"description": "Design and validate network topologies for AI factory deployments"
},
{
"id": "dv-03",
"title": "BMC/OOB/TPM configuration",
"description": "Configure Baseboard Management Controller, Out-of-Band management, and TPM"
}
]
},
{
"id": "software_installation",
"title": "Software Installation and Configuration",
"objectives": [
{
"id": "si-01",
"title": "BCM installation with HA configuration",
"description": "Install Base Command Manager with high availability configuration"
}
]
},
{
"id": "performance_testing",
"title": "Performance Testing and Validation",
"objectives": [
{
"id": "pt-01",
"title": "Single-node stress tests",
"description": "Execute and interpret single-node GPU stress tests"
}
]
},
{
"id": "troubleshooting",
"title": "Troubleshooting and Maintenance",
"objectives": [
{
"id": "tm-01",
"title": "Hardware fault identification",
"description": "Identify hardware faults for GPUs, fans, and network cards"
}
]
}
]
}
Objective Mapping (mapping.json)
{
"mappings": [
{
"objectiveId": "dv-01",
"weekModules": ["week-1", "week-2"],
"labs": ["lab-03", "lab-05"],
"exercises": ["ex-w1-03", "ex-w2-01"],
"coverageLevel": "full",
"notes": "Covered through EKS deployment labs"
},
{
"objectiveId": "dv-03",
"weekModules": [],
"labs": [],
"exercises": [],
"coverageLevel": "none",
"notes": "BMC/OOB/TPM requires physical hardware access",
"alternativeResources": [
"NVIDIA LaunchPad BMC Lab",
"Partner hardware lab access"
]
}
]
}
6-Week Curriculum Structure
gantt
title 6-Week Study Plan
dateFormat YYYY-MM-DD
section Week 1
GPU Fundamentals & Architecture :w1, 2024-01-15, 7d
section Week 2
Kubernetes & EKS Deployment :w2, after w1, 7d
section Week 3
Slurm HPC Orchestration :w3, after w2, 7d
section Week 4
DCGM Monitoring & Observability :w4, after w3, 7d
section Week 5
Cost Optimization & AWS Scenarios :w5, after w4, 7d
section Week 6
Troubleshooting & Exam Prep :w6, after w5, 7d
| Week | Focus Area | Key Topics | Labs | Blueprint Categories |
|---|---|---|---|---|
| 1 | GPU Fundamentals | CUDA cores, tensor cores, memory hierarchy, NVLink, MIG, nvidia-smi, A100/H100 architecture | nvidia-smi basics, MIG configuration, GPU diagnostics | Deployment & Validation |
| 2 | Kubernetes & EKS | Device plugin, GPU scheduling, resource limits, EKS GPU nodes, Terraform EKS, MIG in K8s | EKS GPU cluster, GPU pod scheduling, MIG K8s integration | Software Installation |
| 3 | Slurm HPC | GPU resource management, job scripts, DCGM integration, MIG in Slurm, Enroot/Pyxis | Slurm GPU cluster, job submission, container workloads | Software Installation |
| 4 | Monitoring & DCGM | DCGM installation, Prometheus/Grafana, GPU metrics, alerting, dashboards | DCGM setup, Prometheus integration, Grafana dashboards | Performance Testing |
| 5 | Cost & AWS | Instance comparison, spot/reserved, EFA networking, IAM/VPC, Terraform deployments | Cost analysis, P5 deployment, EFA configuration | Deployment & Validation |
| 6 | Troubleshooting | Systematic methodology, error codes, K8s GPU debugging, DCGM diagnostics, real-world scenarios | Troubleshooting scenarios, fault injection, exam simulation | Troubleshooting |
Integration Points
graph LR
subgraph "Content → Progress"
A[Lab Completion] -->|Updates| B[Progress Store]
C[Exercise Completion] -->|Updates| B
end
subgraph "Progress → Blueprint"
B -->|Feeds| D[Blueprint Coverage]
D -->|Identifies| E[Gap Analysis]
end
subgraph "Progress → Blog"
B -->|Triggers| F[Milestone Notification]
F -->|Pre-populates| G[Blog Template]
end
subgraph "Blueprint → Content"
E -->|Recommends| H[Additional Resources]
D -->|Validates| I[Curriculum Coverage]
end
Key Integration Flows:
- Lab → Progress → Blueprint: Completing a lab updates progress, which recalculates blueprint coverage
- Progress → Blog: Milestone completion triggers blog template pre-population with achieved objectives
- Blueprint → Gap Analysis: Uncovered objectives feed into gap analysis with AWS classification
- Cost Calculator → Labs: Each lab references cost estimates for required AWS resources
- Curriculum → Labs: Weekly modules reference specific labs for hands-on practice
Container Architecture
The entire system runs inside Docker containers with no host dependencies beyond Docker and Docker Compose. This enables single-command local setup and seamless cloud migration.
Docker Compose Service Definitions
graph TB
subgraph "docker-compose.yml"
subgraph app ["app service (port 3000)"]
NEXT[Next.js 14 App]
MDX[MDX Content Engine]
end
subgraph tools ["tools service"]
TF[Terraform 1.6+]
KUB[kubectl]
AWS[AWS CLI v2]
HELM[Helm 3]
end
subgraph monitoring ["monitoring service (ports 9090, 3001)"]
PROM[Prometheus]
GRAF[Grafana]
end
end
subgraph "Volumes"
V1[app-data: ./data]
V2[app-assets: ./public/assets]
V3[monitoring-data: ./monitoring/data]
end
subgraph "Network"
NET[cert-study-network - bridge]
end
app --> V1
app --> V2
monitoring --> V3
app --> NET
tools --> NET
monitoring --> NET
Docker Compose Configuration
version: "3.9"

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "3000:3000"
    volumes:
      - ./data:/app/data
      - ./public/assets:/app/public/assets
      - ./content:/app/content:ro
    environment:
      - NODE_ENV=${NODE_ENV:-development}
      - DATA_DIR=/app/data
      - ASSETS_DIR=/app/public/assets
      - CONTENT_DIR=/app/content
      - MONITORING_URL=http://monitoring:9090
    networks:
      - cert-study-network
    healthcheck:
      # node:20-alpine ships BusyBox wget but not curl
      test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  tools:
    build:
      context: .
      dockerfile: Dockerfile.tools
    volumes:
      - ./terraform:/workspace/terraform
      - ./data:/workspace/data
      - ~/.aws:/root/.aws:ro
    environment:
      - AWS_REGION=${AWS_REGION:-us-east-1}
      - AWS_PROFILE=${AWS_PROFILE:-default}
      - KUBECONFIG=/workspace/.kube/config
    networks:
      - cert-study-network
    stdin_open: true
    tty: true

  monitoring:
    build:
      context: .
      dockerfile: Dockerfile.monitoring
    ports:
      - "9090:9090"
      - "3001:3000"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards:ro
      - monitoring-data:/var/lib/prometheus
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
      - GF_SERVER_ROOT_URL=${GRAFANA_ROOT_URL:-http://localhost:3001}
    networks:
      - cert-study-network

volumes:
  monitoring-data:

networks:
  cert-study-network:
    driver: bridge
Dockerfile Specifications
Dockerfile (Main App)
FROM node:20-alpine AS base
WORKDIR /app

# Production-only dependencies
FROM base AS deps
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Full install and build; requires output: "standalone" in next.config.js
FROM base AS builder
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Minimal runtime image running as a non-root user
FROM base AS runner
ENV NODE_ENV=production
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
COPY --from=builder /app/public ./public
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/.next/static ./.next/static
USER nextjs
EXPOSE 3000
ENV PORT=3000
# Listen on all interfaces; Docker otherwise sets HOSTNAME to the container ID
ENV HOSTNAME="0.0.0.0"
CMD ["node", "server.js"]
Dockerfile.tools (Lab Tools)
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y \
curl unzip git jq python3 python3-pip \
&& rm -rf /var/lib/apt/lists/*
# Terraform
ARG TERRAFORM_VERSION=1.6.6
RUN curl -fsSL https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip -o terraform.zip \
&& unzip terraform.zip -d /usr/local/bin/ && rm terraform.zip
# kubectl
RUN curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
&& install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl && rm kubectl
# AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
&& unzip awscliv2.zip && ./aws/install && rm -rf aws awscliv2.zip
# Helm
RUN curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
WORKDIR /workspace
CMD ["/bin/bash"]
Dockerfile.monitoring (Monitoring Stack)
FROM prom/prometheus:latest AS prometheus
FROM grafana/grafana:latest AS grafana
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y supervisor curl && rm -rf /var/lib/apt/lists/*
COPY --from=prometheus /bin/prometheus /usr/local/bin/prometheus
COPY --from=grafana /usr/share/grafana /usr/share/grafana
COPY --from=grafana /run.sh /run-grafana.sh
COPY monitoring/supervisord.conf /etc/supervisor/conf.d/supervisord.conf
EXPOSE 9090 3000
CMD ["/usr/bin/supervisord"]
Volume Mount Strategy
| Volume | Local Path | Container Path | Purpose |
|---|---|---|---|
| app-data | ./data | /app/data | Progress JSON, blog post drafts |
| app-assets | ./public/assets | /app/public/assets | Uploaded screenshots, diagrams |
| content (bind) | ./content | /app/content | Curriculum MDX, lab definitions (read-only) |
| terraform (bind) | ./terraform | /workspace/terraform | IaC configs for lab exercises |
| monitoring-data | Docker named volume | /var/lib/prometheus | Prometheus time-series data |
Network Configuration
All containers communicate over a single Docker bridge network (cert-study-network):
- app → monitoring: Fetches GPU metrics for dashboard display
- app → tools: Triggers lab validation scripts
- tools → external: AWS API calls for lab provisioning (requires AWS credentials)
No container exposes ports to the host except:
- app: port 3000 (web UI)
- monitoring: ports 9090 (Prometheus) and 3001 (Grafana)
Environment Variable Configuration (Local vs Cloud)
| Variable | Local Default | Cloud (ECS/EKS) | Purpose |
|---|---|---|---|
| NODE_ENV | development | production | App mode |
| DATA_DIR | /app/data | /mnt/efs/data | Persistent data path |
| ASSETS_DIR | /app/public/assets | s3://bucket/assets | Asset storage |
| CONTENT_DIR | /app/content | /app/content | Curriculum content |
| MONITORING_URL | http://monitoring:9090 | http://prometheus.internal:9090 | Metrics endpoint |
| AWS_REGION | us-east-1 | (from task role) | AWS region |
| GRAFANA_ROOT_URL | http://localhost:3001 | https://grafana.example.com | Grafana base URL |
Correctness Properties
A property is a characteristic or behavior that should hold true across all valid executions of a system—essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.
Property 1: Week Module Structural Completeness
For any WeekModule in the curriculum, it SHALL contain non-empty learning objectives, at least one topic, at least one hands-on exercise with specific tasks and expected outcomes, and at least one AWS-specific deployment scenario reference.
Validates: Requirements 1.2, 9.1
Property 2: Progressive Complexity Ordering
For any pair of consecutive weeks (week N and week N+1) in the study plan, the maximum difficulty level of topics in week N+1 SHALL be greater than or equal to the maximum difficulty level in week N, ensuring foundational concepts precede advanced topics.
Validates: Requirements 1.3
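As an illustration of Property 2, the check can be sketched as a function that returns the weeks violating the ordering. The `Topic` and `WeekModule` shapes below are simplified stand-ins, and the numeric `difficulty` field is an assumption made for this sketch; the real types may encode difficulty differently.

```typescript
// Simplified, hypothetical shapes; the real WeekModule type carries more fields.
interface Topic {
  title: string;
  difficulty: number; // assumed numeric scale, higher means harder
}

interface WeekModule {
  weekNumber: number;
  topics: Topic[];
}

// Highest topic difficulty in a week; -Infinity when the week has no topics.
function maxDifficulty(week: WeekModule): number {
  return week.topics.reduce((max, topic) => Math.max(max, topic.difficulty), -Infinity);
}

// Returns the week numbers whose maximum difficulty is lower than the previous week's.
function findComplexityRegressions(weeks: WeekModule[]): number[] {
  const ordered = [...weeks].sort((a, b) => a.weekNumber - b.weekNumber);
  const regressions: number[] = [];
  for (let i = 1; i < ordered.length; i++) {
    if (maxDifficulty(ordered[i]) < maxDifficulty(ordered[i - 1])) {
      regressions.push(ordered[i].weekNumber);
    }
  }
  return regressions;
}
```

An empty result means the property holds for the given study plan.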
Property 3: Dependency Graph Validity
For any dependency relationship where week B depends on week A, the index of week A SHALL be strictly less than the index of week B (no cycles, valid topological ordering).
Validates: Requirements 1.4
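Property 3 reduces to a simple index comparison: if every dependency points from a later week to a strictly earlier one, no cycle can exist. A minimal sketch follows, assuming a hypothetical `WeekDependency` shape that lists the weeks each week depends on.

```typescript
// Hypothetical shape: week B (week) depends on each week A in dependsOn.
interface WeekDependency {
  week: number;
  dependsOn: number[];
}

interface InvalidDependency {
  week: number;
  dependsOn: number;
}

// Returns every dependency edge where the prerequisite is not strictly earlier.
function findInvalidDependencies(deps: WeekDependency[]): InvalidDependency[] {
  const invalid: InvalidDependency[] = [];
  for (const dep of deps) {
    for (const prerequisite of dep.dependsOn) {
      if (prerequisite >= dep.week) {
        invalid.push({ week: dep.week, dependsOn: prerequisite });
      }
    }
  }
  return invalid;
}
```

A self-dependency (week 3 depending on week 3) is reported as invalid because the comparison is strict.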
Property 4: Resource Entry Completeness
For any resource entry in the resource guide, it SHALL have a non-empty URL string and a non-empty description string.
Validates: Requirements 10.5
Property 5: Learning Goal to Lab Coverage
For any learning objective defined in the curriculum, there SHALL exist at least one Lab whose objectives array contains that learning objective's ID.
Validates: Requirements 11.1
Property 6: Lab Structural Completeness
For any Lab object, it SHALL have: a non-empty objectives array referencing valid learning objectives, a non-empty blueprintRefs array referencing valid blueprint objective IDs, a non-empty steps array where each step has instructions, and a non-empty validationCheckpoints array where each checkpoint has a validationType and expectedResult.
Validates: Requirements 11.2, 11.3, 11.4
Property 7: AWS Lab Cost and Provisioning
For any Lab where environment.awsServices is non-empty, the environment.setupInstructions SHALL be non-empty, environment.teardownInstructions SHALL be non-empty, and estimatedCost SHALL have a positive totalEstimate value.
Validates: Requirements 11.5
Property 7a: Lab Tier Classification Validity
For any Lab in the system, its tier field SHALL be one of "foundation", "advanced", or "elite", AND the recommendedInstance, estimatedHourlyCost, minimumGpuRequirement, multiGpuRequired, nvlinkRequired, and instanceJustification fields SHALL all be populated with valid values.
Validates: Requirements 11.6, 11.11
Property 7b: Lab Tier Distribution
For the complete set of Labs in the system, the proportion of foundation-tier labs SHALL be between 70% and 80% inclusive, the proportion of advanced-tier labs SHALL be between 10% and 20% inclusive, and the proportion of elite-tier labs SHALL be at most 10%.
Validates: Requirements 11.7
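The distribution bounds in Property 7b can be checked without floating-point percentages by comparing scaled integer counts. The sketch below is illustrative only; `countTiers` and `satisfiesTierDistribution` are hypothetical helper names, and treating an empty lab set as failing the property is an assumption.

```typescript
type LabTier = "foundation" | "advanced" | "elite";

// Counts how many labs fall into each tier.
function countTiers(tiers: LabTier[]): Record<LabTier, number> {
  const counts: Record<LabTier, number> = { foundation: 0, advanced: 0, elite: 0 };
  for (const tier of tiers) {
    counts[tier] += 1;
  }
  return counts;
}

// True when foundation is 70-80%, advanced is 10-20%, and elite is at most 10%.
// Comparisons are scaled by 10 so that exact boundaries need no floating point.
function satisfiesTierDistribution(tiers: LabTier[]): boolean {
  const total = tiers.length;
  if (total === 0) {
    return false; // assumption: an empty lab set does not satisfy the property
  }
  const counts = countTiers(tiers);
  const foundationOk = counts.foundation * 10 >= total * 7 && counts.foundation * 10 <= total * 8;
  const advancedOk = counts.advanced * 10 >= total && counts.advanced * 10 <= total * 2;
  const eliteOk = counts.elite * 10 <= total;
  return foundationOk && advancedOk && eliteOk;
}
```

For example, a set of 8 foundation, 1 advanced, and 1 elite lab sits exactly on three of the boundaries and passes.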
Property 7c: Lab Tier Instance Consistency
For any Lab with tier "foundation", the recommendedInstance SHALL be "g5.xlarge", "g5.2xlarge", or "local" AND estimatedHourlyCost SHALL be at most $2/hr. For any Lab with tier "advanced", the recommendedInstance SHALL be "g5.12xlarge" or "p4d.24xlarge". For any Lab with tier "elite", the recommendedInstance SHALL be "p4d.24xlarge" or "p5.48xlarge" AND nvlinkRequired or multiGpuRequired SHALL be true.
Validates: Requirements 11.8, 11.9, 11.10
Property 7d: Lab NVLink Requirement Implies Elite Tier
For any Lab where nvlinkRequired is true, the tier SHALL be "elite" and the recommendedInstance SHALL be "p4d.24xlarge" or "p5.48xlarge".
Validates: Requirements 11.10, 11.12
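Properties 7c and 7d together constrain which instance types and flags are valid for each tier. One way to sketch the combined check is a function that returns a list of violations for a lab. `LabTierInfo` is a hypothetical subset of the Lab type, and `estimatedHourlyCost` is assumed here to be a number in USD per hour.

```typescript
type LabTier = "foundation" | "advanced" | "elite";

// Hypothetical subset of the Lab fields used by Properties 7c and 7d.
interface LabTierInfo {
  id: string;
  tier: LabTier;
  recommendedInstance: string;
  estimatedHourlyCost: number; // assumed USD per hour
  multiGpuRequired: boolean;
  nvlinkRequired: boolean;
}

// Allowed instance types per tier, as listed in Property 7c.
const ALLOWED_INSTANCES: Record<LabTier, string[]> = {
  foundation: ["g5.xlarge", "g5.2xlarge", "local"],
  advanced: ["g5.12xlarge", "p4d.24xlarge"],
  elite: ["p4d.24xlarge", "p5.48xlarge"],
};

// Returns human-readable violations; an empty array means the lab is consistent.
function tierViolations(lab: LabTierInfo): string[] {
  const violations: string[] = [];
  if (!ALLOWED_INSTANCES[lab.tier].includes(lab.recommendedInstance)) {
    violations.push(`${lab.id}: ${lab.recommendedInstance} is not allowed for tier ${lab.tier}`);
  }
  if (lab.tier === "foundation" && lab.estimatedHourlyCost > 2) {
    violations.push(`${lab.id}: foundation labs must cost at most $2/hr`);
  }
  if (lab.tier === "elite" && !(lab.nvlinkRequired || lab.multiGpuRequired)) {
    violations.push(`${lab.id}: elite labs must require NVLink or multiple GPUs`);
  }
  // Property 7d: NVLink implies elite; the instance constraint then follows from 7c.
  if (lab.nvlinkRequired && lab.tier !== "elite") {
    violations.push(`${lab.id}: nvlinkRequired implies tier elite`);
  }
  return violations;
}
```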
Property 8: Progress Evidence Storage Round-Trip
For any valid ProgressEvidence with type in ["screenshot", "code", "command_output", "config_file", "note"], storing it in the progress system and then retrieving it by its ID SHALL return an equivalent evidence object with all fields preserved.
Validates: Requirements 12.1, 12.2, 12.5
Property 9: Overall Completion Calculation Consistency
For any ProgressStore state, the overallCompletion percentage SHALL equal (number of completed objectives + completed labs) divided by (total objectives + total labs) multiplied by 100, and the summary counts (completed + in_progress + not_started) SHALL equal the total number of milestones.
Validates: Requirements 12.3, 12.6
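The calculation in Property 9 can be sketched as follows. This is an illustrative example rather than the ProgressStore implementation: it assumes that milestones are exactly the objectives plus the labs, and that an empty store reports 0% rather than dividing by zero. `Milestone`, `ProgressSummary`, and `summarizeProgress` are hypothetical names.

```typescript
type MilestoneStatus = "completed" | "in_progress" | "not_started";

// Hypothetical milestone record: one per blueprint objective or lab.
interface Milestone {
  id: string;
  kind: "objective" | "lab";
  status: MilestoneStatus;
}

interface ProgressSummary {
  completed: number;
  inProgress: number;
  notStarted: number;
  total: number;
  overallCompletion: number; // percentage in the range 0-100
}

function summarizeProgress(milestones: Milestone[]): ProgressSummary {
  const completed = milestones.filter((m) => m.status === "completed").length;
  const inProgress = milestones.filter((m) => m.status === "in_progress").length;
  const notStarted = milestones.filter((m) => m.status === "not_started").length;
  const total = milestones.length;
  // Assumption: report 0% for an empty store instead of dividing by zero.
  const overallCompletion = total === 0 ? 0 : (completed / total) * 100;
  return { completed, inProgress, notStarted, total, overallCompletion };
}
```

Because every milestone has exactly one status, the three counts always sum to `total`, which is the second half of the property.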
Property 10: Evidence Organization by Milestone
For any ProgressEvidence stored with a non-null objectiveId or labId, querying evidence by that objectiveId or labId SHALL return a collection containing that evidence item.
Validates: Requirements 12.4
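Properties 8 and 10 can be exercised against any store that supports save, lookup by ID, and lookup by milestone. The sketch below uses an in-memory `Map` as a stand-in for the JSON-file storage described in this document; `InMemoryEvidenceStore` and the reduced `ProgressEvidence` shape (including the `content` field) are assumptions made for illustration.

```typescript
type EvidenceType = "screenshot" | "code" | "command_output" | "config_file" | "note";

// Reduced, hypothetical evidence shape; the real type may have more fields.
interface ProgressEvidence {
  id: string;
  type: EvidenceType;
  content: string;
  objectiveId: string | null;
  labId: string | null;
}

// In-memory stand-in for the file-backed progress store.
class InMemoryEvidenceStore {
  private items = new Map<string, ProgressEvidence>();

  // Stores a copy so later mutation of the caller's object cannot alter stored state.
  save(evidence: ProgressEvidence): void {
    this.items.set(evidence.id, { ...evidence });
  }

  // Property 8: returns an equivalent object, or undefined when the ID is unknown.
  getById(id: string): ProgressEvidence | undefined {
    const found = this.items.get(id);
    return found ? { ...found } : undefined;
  }

  // Property 10: returns all evidence attached to the given objective or lab ID.
  findByMilestone(milestoneId: string): ProgressEvidence[] {
    return Array.from(this.items.values())
      .filter((e) => e.objectiveId === milestoneId || e.labId === milestoneId)
      .map((e) => ({ ...e }));
  }
}
```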
Property 11: Blog Template Pre-Population
For any completed milestone (week or lab), generating a blog template SHALL produce a BlogPost where the content contains the milestone title, the objectivesCovered array is non-empty and references valid objectives from that milestone, and all required TemplateSection headings are present.
Validates: Requirements 13.3
Property 12: Blog Export Validity
For any BlogPost with non-empty content and at least one asset, the export function SHALL produce valid Markdown where all image references point to valid asset paths and all code blocks are properly fenced.
Validates: Requirements 13.5
Property 13: Blueprint Objective Mapping Completeness
For any BlueprintObjective across all categories (deployment_validation, software_installation, performance_testing, troubleshooting), there SHALL exist an ObjectiveMapping entry with a valid coverageLevel classification.
Validates: Requirements 14.1, 14.2, 14.3, 14.4
Property 14: Covered Objectives Reference Content
For any ObjectiveMapping with coverageLevel "full" or "partial", the weekModules array or labs array SHALL be non-empty, indicating which study content addresses the objective.
Validates: Requirements 14.5
Property 15: Uncovered Objectives Have Alternatives
For any ObjectiveMapping with coverageLevel "none", the alternativeResources array SHALL be non-empty, providing recommendations for addressing the gap.
Validates: Requirements 14.6
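Properties 14 and 15 are two branches of the same rule: a mapping must point either at study content or at alternatives, depending on its coverage level. A minimal sketch follows, assuming the arrays hold string identifiers; the `ObjectiveMapping` shape shown is a simplified stand-in and `mappingViolation` is a hypothetical helper.

```typescript
type CoverageLevel = "full" | "partial" | "none";

// Simplified, hypothetical mapping shape; array elements are assumed to be string IDs.
interface ObjectiveMapping {
  objectiveId: string;
  coverageLevel: CoverageLevel;
  weekModules: string[];
  labs: string[];
  alternativeResources: string[];
}

// Returns a description of the violation, or null when the mapping is valid.
function mappingViolation(mapping: ObjectiveMapping): string | null {
  if (mapping.coverageLevel === "none") {
    // Property 15: uncovered objectives need at least one alternative resource.
    return mapping.alternativeResources.length > 0
      ? null
      : `${mapping.objectiveId}: coverage "none" requires alternativeResources`;
  }
  // Property 14: covered objectives must reference a week module or a lab.
  return mapping.weekModules.length > 0 || mapping.labs.length > 0
    ? null
    : `${mapping.objectiveId}: coverage "${mapping.coverageLevel}" requires weekModules or labs`;
}
```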
Property 16: Gap Analysis Classification Completeness
For any BlueprintObjective in the system, there SHALL exist a GapAnalysisEntry with awsClassification in ["achievable", "partially_achievable", "not_achievable"].
Validates: Requirements 15.1
Property 17: Not-Achievable Entry Completeness
For any GapAnalysisEntry with awsClassification "not_achievable", the awsLimitation field SHALL be non-empty AND the alternatives array SHALL contain at least one AlternativePath.
Validates: Requirements 15.2, 15.3
Property 18: Partially-Achievable Entry Completeness
For any GapAnalysisEntry with awsClassification "partially_achievable", the awsCapabilities field SHALL be non-empty (describing what CAN be done on AWS) AND the alternatives array SHALL be non-empty (providing paths for what cannot).
Validates: Requirements 15.4
Property 19: Alternative Path Field Completeness
For any AlternativePath in any GapAnalysisEntry's alternatives array, the accessInstructions, estimatedCost, and availability fields SHALL all be non-empty strings.
Validates: Requirements 15.6
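Properties 17 through 19 can be combined into a single validator over a gap analysis entry. The sketch below is illustrative: the field names come from the properties above, but the exact `GapAnalysisEntry` and `AlternativePath` shapes (for example, which fields are optional) are assumptions, and whitespace-only strings are treated as empty.

```typescript
type AwsClassification = "achievable" | "partially_achievable" | "not_achievable";

interface AlternativePath {
  accessInstructions: string;
  estimatedCost: string;
  availability: string;
}

// Hypothetical entry shape; optionality of the aws* fields is an assumption.
interface GapAnalysisEntry {
  objectiveId: string;
  awsClassification: AwsClassification;
  awsLimitation?: string;
  awsCapabilities?: string;
  alternatives: AlternativePath[];
}

// Treats undefined and whitespace-only strings as empty.
function isBlank(value: string | undefined): boolean {
  return value === undefined || value.trim() === "";
}

// Returns violations of Properties 17-19; an empty array means the entry is complete.
function gapEntryViolations(entry: GapAnalysisEntry): string[] {
  const violations: string[] = [];
  if (entry.awsClassification === "not_achievable" && isBlank(entry.awsLimitation)) {
    violations.push("awsLimitation must be non-empty");
  }
  if (entry.awsClassification === "partially_achievable" && isBlank(entry.awsCapabilities)) {
    violations.push("awsCapabilities must be non-empty");
  }
  if (entry.awsClassification !== "achievable" && entry.alternatives.length === 0) {
    violations.push("alternatives must contain at least one path");
  }
  entry.alternatives.forEach((alt, index) => {
    if (isBlank(alt.accessInstructions) || isBlank(alt.estimatedCost) || isBlank(alt.availability)) {
      violations.push(`alternatives[${index}] has an empty required field`);
    }
  });
  return violations;
}
```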
Property 20: Blog Asset Type Support
For any BlogAsset with type in ["image", "diagram", "code", "terminal_output"], the blog system SHALL accept and store it with a valid filePath and preserve the type classification on retrieval.
Validates: Requirements 13.2
Property 21: Container Isolation
For any service component defined in the system, it SHALL have a corresponding Dockerfile entry and be listed in docker-compose.yml, with no runtime dependencies (Node.js, Python, Terraform, kubectl, AWS CLI) required to be installed on the host machine beyond Docker and Docker Compose.
Validates: Requirements 16.1, 16.3, 16.8
Property 22: Cloud Portability
For any environment-specific configuration value (storage paths, service URLs, credentials, feature flags) used by the application, it SHALL be sourced from environment variables or external configuration files, such that changing only environment variables and volume mounts enables cloud deployment without code modifications.
Validates: Requirements 16.5, 16.6
Property 23: Volume Persistence
For any data written to a Docker volume mount path (progress data, blog posts, uploaded assets), that data SHALL be retrievable after a container restart, verifying that all persistent state is stored in volume-mounted directories and not in ephemeral container filesystem layers.
Validates: Requirements 16.4
Error Handling
Content Loading Errors
| Error Scenario | Handling Strategy |
|---|---|
| Missing MDX file for a week/topic | Display error message with file path, allow navigation to other content |
| Invalid JSON in progress.json | Attempt recovery from backup, prompt user to reset if unrecoverable |
| Missing lab Terraform module | Show warning, allow lab instructions to be read without provisioning |
| Corrupt asset file | Display placeholder with error message, log for user attention |
Progress System Errors
| Error Scenario | Handling Strategy |
|---|---|
| File write failure (disk full) | Queue writes, notify user, retry on space availability |
| Invalid evidence type | Reject with clear error message listing supported types |
| Duplicate evidence ID | Generate new unique ID, log warning |
| Progress calculation overflow | Cap at 100%, log inconsistency for review |
Blog System Errors
| Error Scenario | Handling Strategy |
|---|---|
| Asset upload exceeds size limit | Reject with size limit message, suggest compression |
| Template generation fails | Fall back to minimal template with just title and date |
| Export format error | Provide raw markdown as fallback, log formatting issue |
| Invalid milestone reference | Create blog without pre-population, notify user |
Blueprint/Gap Analysis Errors
| Error Scenario | Handling Strategy |
|---|---|
| Objective ID not found in mapping | Flag as "unmapped" in report, add to gap list |
| Circular dependency in objectives | Detect and break cycle, log warning |
| Missing alternative resources | Display "alternatives pending" status, flag for content update |
Container/Deployment Errors
| Error Scenario | Handling Strategy |
|---|---|
| Docker daemon not running | Display clear error message with instructions to start Docker |
| Port conflict (3000, 9090, 3001 already in use) | Log conflicting port, suggest alternative port mapping via env var |
| Volume mount permission denied | Check directory permissions, create directories if missing, log instructions |
| Container build failure | Display build stage that failed, suggest docker compose build --no-cache |
| Container health check failure | Retry with backoff, log container logs, suggest docker compose logs <service> |
| Cloud migration config mismatch | Validate all required env vars are set before startup, fail fast with missing var names |
| Named volume data corruption | Provide volume backup/restore instructions, allow fresh volume creation |
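The fail-fast strategy for cloud configuration mismatches can be sketched as a startup check that names every missing variable in a single error. Which variables count as required is an assumption here: `REQUIRED_CLOUD_VARS` is a hypothetical list, and AWS_REGION is left out because the environment table lists it as coming from the task role.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical list of variables that must be set explicitly in cloud deployments.
const REQUIRED_CLOUD_VARS: string[] = [
  "NODE_ENV",
  "DATA_DIR",
  "ASSETS_DIR",
  "CONTENT_DIR",
  "MONITORING_URL",
  "GRAFANA_ROOT_URL",
];

// Returns the names of required variables that are unset or empty.
function missingEnvVars(env: Env, required: string[]): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Throws before startup, listing every missing variable by name.
function assertCloudConfig(env: Env): void {
  const missing = missingEnvVars(env, REQUIRED_CLOUD_VARS);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}
```

At runtime this would be called once with `process.env` before the server starts, so a misconfigured deployment stops immediately instead of failing on first use.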
General Error Principles
- Graceful degradation — System remains usable even when individual components fail
- Data preservation — Never lose user progress data; prefer read-only mode over data loss
- Clear messaging — All errors provide actionable information to the user
- Recovery paths — Each error state has a documented recovery procedure
- Logging — All errors are logged with context for debugging
Testing Strategy
Property-Based Testing
This system is well-suited for property-based testing because it has clear data structures with invariants that must hold across all valid instances. The curriculum, lab, progress, and blueprint systems all have universal properties about structural completeness and data consistency.
Library: fast-check (TypeScript property-based testing library)
Configuration: Minimum 100 iterations per property test
Tag format: Feature: nvidia-aws-gpu-certification, Property {number}: {property_text}
Property tests will cover:
- Structural completeness of curriculum data (Properties 1-3)
- Resource and lab data integrity (Properties 4-7, including 7a-7d)
- Progress system round-trips and calculations (Properties 8-10)
- Blog system template generation and export (Properties 11-12, 20)
- Blueprint mapping completeness (Properties 13-15)
- Gap analysis classification and completeness (Properties 16-19)
- Container isolation, cloud portability, and volume persistence (Properties 21-23)
Unit Testing
Unit tests complement property tests by covering specific examples and edge cases:
- Cost Calculator: Specific instance type calculations with known expected values
- Progress percentage: Edge cases (0%, 100%, single item)
- Blog template: Specific milestone types produce expected template structures
- Gap Analysis: Known objectives with known AWS limitations
- Dependency resolution: Specific curriculum orderings
Integration Testing
- Content loading pipeline: MDX files parse correctly and render expected components
- Progress persistence: Write → read cycle across application restarts
- Blog export pipeline: End-to-end from template to exported markdown
- Terraform validation: Lab Terraform configs pass terraform validate
- Blueprint cross-reference: Full mapping produces expected coverage report
- Container orchestration: All services start via docker compose up and communicate correctly
- Volume persistence: Data written in containers survives docker compose down && docker compose up
Smoke Testing
Given the large number of content-completeness requirements (Requirements 2-8, 10), smoke tests verify:
- All 6 week modules load without errors
- All referenced labs exist and have valid structure
- All external resource URLs are formatted correctly
- Terraform modules pass syntax validation
- Blueprint objectives JSON is valid and complete
Container Testing
Container-level tests verify Requirement 16 compliance:
- Dockerfile validity: All Dockerfiles build successfully with docker build
- Compose validation: docker compose config passes without errors
- Service health: All services pass their health checks after startup
- No host dependencies: Application functions correctly with only Docker installed (no Node.js, Python, etc. on host)
- Volume mounts: Data directories are correctly mounted and writable
- Network connectivity: Services can communicate over the internal bridge network
- Cloud config compatibility: Container images run with cloud environment variable overrides
- Tools container: Lab tools (Terraform, kubectl, AWS CLI) are available and functional inside the tools container
Test Organization
tests/
├── properties/
│   ├── curriculum.property.test.ts
│   ├── labs.property.test.ts
│   ├── progress.property.test.ts
│   ├── blog.property.test.ts
│   ├── blueprint.property.test.ts
│   ├── gap-analysis.property.test.ts
│   └── container.property.test.ts
├── unit/
│   ├── cost-calculator.test.ts
│   ├── progress-calculation.test.ts
│   ├── blog-template.test.ts
│   └── dependency-resolver.test.ts
├── integration/
│   ├── content-loading.test.ts
│   ├── progress-persistence.test.ts
│   ├── blog-export.test.ts
│   └── container-orchestration.test.ts
├── smoke/
│   ├── content-completeness.test.ts
│   ├── terraform-validation.test.ts
│   └── docker-build.test.ts
└── container/
    ├── dockerfile-build.test.ts
    ├── compose-validation.test.ts
    ├── service-health.test.ts
    ├── volume-persistence.test.ts
    └── cloud-config.test.ts