Requirements Document

Introduction

This document outlines the requirements for a comprehensive study plan and resource guide for the NVIDIA Certified Professional AI Infrastructure exam, with a focus on AWS GPU deployment best practices. The study plan will cover a 6-week curriculum designed to make users proficient in NVIDIA GPUs and AWS deployment tools and features.

Glossary

NVIDIA GPU: Graphics Processing Unit from NVIDIA, including A100, H100, and other datacenter GPUs
AWS GPU Instances: Amazon EC2 instances with NVIDIA GPUs, including P5, P4, and G5 families
Kubernetes (K8s): Open-source container orchestration platform
EKS: Amazon Elastic Kubernetes Service, AWS's managed Kubernetes offering
DCGM: Data Center GPU Manager, NVIDIA's monitoring and management tool
MIG: Multi-Instance GPU, technology allowing a single GPU to be partitioned into multiple isolated instances
Kubernetes Device Plugin: Kubernetes framework for exposing hardware resources to the scheduler
Slurm: Simple Linux Utility for Resource Management, workload manager for HPC
nvidia-smi: NVIDIA System Management Interface, command-line utility for GPU monitoring
Prometheus: Open-source monitoring and alerting toolkit
Grafana: Open-source platform for monitoring and observability
Terraform: Infrastructure as Code tool for provisioning AWS resources
P5 Instances: AWS EC2 instances powered by up to 8 NVIDIA H100 GPUs
A100/H100: NVIDIA datacenter GPUs (Ampere and Hopper architectures)
BMC: Baseboard Management Controller, used for remote server management
OOB: Out-of-Band management, server management independent of the main OS
TPM: Trusted Platform Module, hardware security module for cryptographic operations
HGX: NVIDIA HGX GPU platform, a multi-GPU baseboard for AI and HPC
BlueField: NVIDIA DPU (Data Processing Unit) network platform for data center infrastructure
BCM: Base Command Manager, NVIDIA's cluster management software
Enroot: NVIDIA container runtime optimized for HPC environments
Pyxis: Slurm plugin enabling container support for HPC workloads
DOCA: Data Center Infrastructure on a Chip Architecture, SDK for BlueField DPUs
NGC: NVIDIA GPU Cloud, hub for GPU-optimized software and containers
HPL: High-Performance Linpack benchmark, used for measuring floating-point performance
NCCL: NVIDIA Collective Communications Library, for multi-GPU and multi-node communication
NVLink: NVIDIA high-speed GPU interconnect technology
NeMo: NVIDIA framework for building, training, and deploying AI models
ClusterKit: NVIDIA cluster validation toolkit for infrastructure assessment
EFA: Elastic Fabric Adapter, AWS high-performance networking for HPC and ML
Docker: Container platform for packaging applications with their dependencies
Docker Compose: Tool for defining and running multi-container Docker applications
ECS: Amazon Elastic Container Service, AWS managed container orchestration
Container Image: Lightweight, standalone package containing application code and dependencies
Simulator: Browser-based terminal simulator using xterm.js that renders mock GPU command outputs for zero-cost practice
xterm.js: JavaScript terminal emulator library for rendering terminal interfaces in the browser

Requirements

Requirement 1: 6-Week Study Plan with Weekly Learning Objectives

User Story: As a candidate preparing for the NVIDIA Certified Professional AI Infrastructure exam, I want a structured 6-week study plan, so that I can systematically cover all exam topics and achieve proficiency.

Acceptance Criteria

THE Study_Plan SHALL divide content into 6 weekly modules covering: GPU basics, Kubernetes, Slurm, DCGM monitoring, cost optimization, and troubleshooting
FOR EACH week, THE Study_Plan SHALL specify clear learning objectives, key concepts, hands-on practice requirements, and AWS-specific deployment scenarios
THE Study_Plan SHALL include progressive complexity, building from foundational GPU concepts to advanced orchestration and troubleshooting
WHERE weekly topics overlap, THE Study_Plan SHALL indicate dependencies and recommended completion order
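
The dependency and ordering criterion can be illustrated with a topological sort. The sketch below is only an illustration: the six module names come from the criteria above, but the prerequisite edges are assumptions made for the example, not part of this specification.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: module -> modules that should be completed first.
# Module names follow the study plan; the edges are illustrative assumptions.
PREREQUISITES = {
    "GPU basics": set(),
    "Kubernetes": {"GPU basics"},
    "Slurm": {"GPU basics"},
    "DCGM monitoring": {"GPU basics"},
    "cost optimization": {"Kubernetes", "Slurm"},
    "troubleshooting": {"Kubernetes", "Slurm", "DCGM monitoring"},
}


def completion_order(prerequisites):
    """Return one valid completion order.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(prerequisites).static_order())


order = completion_order(PREREQUISITES)
print(" -> ".join(order))
```

Several orders can satisfy the same dependencies, so the Study_Plan would still need to state which one it recommends.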

Requirement 2: GPU Fundamentals and Hardware Knowledge

User Story: As a candidate, I want to understand NVIDIA GPU architecture and hardware, so that I can make informed deployment decisions and troubleshoot hardware-related issues.

Acceptance Criteria

THE Study_Guide SHALL cover GPU architecture concepts including CUDA cores, tensor cores, memory hierarchy, and NVLink
WHEN reviewing hardware, THE Study_Guide SHALL detail A100 (Ampere), H100 (Hopper), and other relevant NVIDIA datacenter GPUs
WHERE P5 instances are referenced, THE Study_Guide SHALL explain NVIDIA H100 GPU specifications and capabilities
WHEN covering MIG, THE Study_Guide SHALL explain multi-instance GPU technology, partitioning strategies, and use cases
THE Study_Guide SHALL include hands-on exercises using nvidia-smi for GPU monitoring and diagnostics

Requirement 3: Kubernetes and EKS Deployment

User Story: As a candidate, I want to understand Kubernetes deployment with NVIDIA GPUs, so that I can orchestrate GPU workloads in production environments.

Acceptance Criteria

WHEN deploying Kubernetes, THE Study_Guide SHALL cover Kubernetes device plugin for NVIDIA GPUs
WHERE EKS is used, THE Study_Guide SHALL detail AWS-specific Kubernetes GPU deployment patterns
THE Study_Guide SHALL include Terraform configurations for provisioning EKS clusters with GPU nodes
WHILE deploying GPU workloads, THE Study_Guide SHALL cover resource requests, limits, and scheduling considerations
IF MIG is enabled, THE Study_Guide SHALL explain Kubernetes integration with MIG-enabled GPUs

Requirement 4: Slurm HPC Orchestration

User Story: As a candidate, I want to understand Slurm workload manager for GPU clusters, so that I can deploy and manage GPU workloads in HPC environments.

Acceptance Criteria

WHEN configuring Slurm, THE Study_Guide SHALL cover GPU resource management and scheduling
THE Study_Guide SHALL include Slurm job scripts for GPU workloads with proper resource specifications
WHILE managing GPU resources, THE Study_Guide SHALL explain Slurm integration with NVIDIA DCGM
WHERE MIG is used, THE Study_Guide SHALL detail Slurm configuration for MIG-enabled GPUs

Requirement 5: Monitoring and Observability with DCGM

User Story: As a candidate, I want to understand DCGM monitoring and observability, so that I can track GPU health, performance, and identify issues proactively.

Acceptance Criteria

WHEN monitoring GPUs, THE Study_Guide SHALL cover DCGM (Data Center GPU Manager) installation and configuration
THE Study_Guide SHALL include Prometheus and Grafana integration with DCGM for centralized monitoring
WHILE monitoring GPU health, THE Study_Guide SHALL cover key metrics including temperature, power, utilization, memory usage, and errors
WHERE alerts are configured, THE Study_Guide SHALL explain DCGM alerting mechanisms and notification channels
THE Study_Guide SHALL include dashboards for visualizing GPU metrics in Grafana

Requirement 6: Cost Optimization and Resource Management

User Story: As a candidate, I want to understand cost optimization strategies for GPU deployments and practice on the cheapest viable hardware, so that I can maximize repetition and learning while minimizing AWS spend.

Acceptance Criteria

WHEN evaluating instance types, THE Study_Guide SHALL compare P5, P4, and G5 instance families for cost-performance tradeoffs
WHERE cost optimization is needed, THE Study_Guide SHALL cover reserved instances, spot instances, and savings plans
WHILE managing GPU resources, THE Study_Guide SHALL explain MIG for better GPU utilization and cost efficiency
THE Study_Guide SHALL include AWS cost estimation examples for GPU workloads
FOR ALL cost optimization strategies, THE Study_Guide SHALL provide real-world examples and calculations
THE Study_Guide SHALL optimize for maximum repetition on cheap hardware rather than occasional access to expensive hardware, prioritizing learning frequency over hardware scale
THE Study_Guide SHALL recommend a daily workflow of foundation-tier labs on g5.xlarge instances (~$1-2/hr) for building muscle memory through repetition
WHEN expensive instances (p4d.24xlarge, p5.48xlarge) are required, THE Study_Guide SHALL recommend focused 2-4 hour sessions followed by immediate termination
THE Study_Guide SHALL target a cost distribution where 70-80% of lab hours are spent on foundation-tier instances, 10-20% on advanced-tier instances, and no more than 10% on elite-tier instances
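
The target distribution of lab hours can be checked mechanically. The following is a minimal sketch, assuming lab hours are logged per tier; the function names and the sample hours are hypothetical, and only the percentage bands come from the criteria above.

```python
# Target share of total lab hours per tier, as (minimum, maximum) fractions.
TIER_TARGETS = {
    "foundation": (0.70, 0.80),
    "advanced": (0.10, 0.20),
    "elite": (0.00, 0.10),
}


def tier_shares(hours_by_tier):
    """Return each tier's fraction of the total logged lab hours."""
    total = sum(hours_by_tier.get(tier, 0.0) for tier in TIER_TARGETS)
    if total <= 0:
        raise ValueError("no lab hours logged")
    return {tier: hours_by_tier.get(tier, 0.0) / total for tier in TIER_TARGETS}


def tiers_out_of_range(hours_by_tier):
    """Return the tiers whose share falls outside the target band."""
    shares = tier_shares(hours_by_tier)
    return [
        tier
        for tier, (low, high) in TIER_TARGETS.items()
        if not low <= shares[tier] <= high
    ]


# Hypothetical log: 31 h foundation, 6 h advanced, 3 h elite (40 h total).
balanced = {"foundation": 31, "advanced": 6, "elite": 3}
print(tier_shares(balanced))
print(tiers_out_of_range(balanced))
```

A log that leans too heavily on expensive instances, for example 10 foundation hours against 5 advanced and 5 elite, would be reported as out of range for all three tiers.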

Requirement 7: Troubleshooting and Debugging

User Story: As a candidate, I want to develop troubleshooting skills for GPU deployments, so that I can quickly identify and resolve issues in production environments.

Acceptance Criteria

WHEN GPU issues occur, THE Study_Guide SHALL provide systematic troubleshooting methodology
THE Study_Guide SHALL cover common GPU error codes and their meanings
WHILE debugging Kubernetes GPU deployments, THE Study_Guide SHALL explain device plugin issues, scheduling failures, and resource constraints
IF monitoring data is unavailable, THE Study_Guide SHALL provide DCGM diagnostic commands and log locations
THE Study_Guide SHALL include real-world troubleshooting scenarios with step-by-step resolution

Requirement 8: AWS-Specific Deployment Scenarios

User Story: As a candidate, I want to understand AWS-specific GPU deployment scenarios, so that I can apply best practices in real-world AWS environments.

Acceptance Criteria

WHEN deploying on AWS, THE Study_Guide SHALL cover P5 instances with NVIDIA H100 GPUs
THE Study_Guide SHALL include Terraform configurations for provisioning GPU instances on AWS
WHILE setting up networking, THE Study_Guide SHALL cover Elastic Fabric Adapter (EFA) for GPU cluster communication
WHERE security is concerned, THE Study_Guide SHALL explain IAM roles, security groups, and VPC configurations for GPU instances
FOR ALL AWS deployment scenarios, THE Study_Guide SHALL reference AWS best practices and NVIDIA recommendations

Requirement 9: Hands-On Practice Requirements

User Story: As a candidate, I want hands-on practice exercises, so that I can reinforce theoretical knowledge with practical experience.

Acceptance Criteria

FOR EACH weekly module, THE Study_Guide SHALL include hands-on exercises with specific tasks and expected outcomes
WHEN setting up environments, THE Study_Guide SHALL provide step-by-step instructions for AWS GPU instance provisioning
WHILE practicing Kubernetes deployments, THE Study_Guide SHALL include sample manifests and deployment scripts
WHERE monitoring is covered, THE Study_Guide SHALL provide DCGM, Prometheus, and Grafana configuration examples
THE Study_Guide SHALL include troubleshooting labs with simulated issues for practice

Requirement 10: Resource Guide and References

User Story: As a candidate, I want a comprehensive resource guide, so that I can access additional learning materials and documentation.

Acceptance Criteria

THE Resource_Guide SHALL include official NVIDIA documentation links for all covered tools
WHEN referencing AWS resources, THE Resource_Guide SHALL link to AWS documentation and whitepapers
WHERE community resources are available, THE Resource_Guide SHALL include relevant GitHub repositories and tutorials
THE Resource_Guide SHALL include practice exam questions and sample scenarios
FOR ALL external resources, THE Resource_Guide SHALL provide direct URLs and brief descriptions

Requirement 11: Structured Labs with Tiered Instance Architecture

User Story: As a candidate, I want structured hands-on labs organized into cost tiers with strict instance discipline, so that I can gain practical experience aligned with each certification objective while keeping costs predictable and low.

Acceptance Criteria

FOR EACH identified learning goal, THE Lab_System SHALL provide a structured hands-on lab exercise
WHEN a lab is presented, THE Lab_System SHALL specify the learning objective, prerequisites, step-by-step instructions, and expected outcomes
THE Lab_System SHALL tie each lab directly to one or more certification exam objectives
WHILE completing a lab, THE Lab_System SHALL provide validation checkpoints to confirm correct execution
WHERE AWS infrastructure is used, THE Lab_System SHALL include provisioning instructions and estimated cost for each lab exercise
THE Lab_System SHALL classify every lab into exactly one tier: "foundation" (~$1-2/hr, g5.xlarge or g5.2xlarge), "advanced" (~$5-6/hr, g5.12xlarge with 4x A10G), or "elite" (p4d.24xlarge or p5.48xlarge)
THE Lab_System SHALL ensure 70-80% of all labs are classified as foundation tier, 10-20% as advanced tier, and no more than 10% as elite tier
WHEN a lab requires only single-GPU features (nvidia-smi basics, CUDA fundamentals, Docker GPU runtime, EKS GPU plugin, Kubernetes scheduling, taints/tolerations, DCGM basics, Prometheus/Grafana, Terraform, GPU pod deployment), THE Lab_System SHALL assign it to foundation tier using g5.xlarge
WHEN a lab requires multi-GPU scheduling, inference scaling, profiling, concurrency testing, or MIG practicals (which require a MIG-capable GPU such as the A100; the A10G in G5 instances does not support MIG), THE Lab_System SHALL assign it to advanced tier using g5.12xlarge or p4d.24xlarge as appropriate
WHEN a lab requires NVLink, NCCL, distributed training, topology awareness, GPUDirect, EFA, or multi-node collectives, THE Lab_System SHALL assign it to elite tier
FOR EACH lab exercise, THE Lab_System SHALL document the tier, recommended instance type, estimated hourly cost, minimum GPU requirement, whether multi-GPU is required, whether NVLink is required, and a justification for the chosen instance
FOR EACH lab exercise, THE Lab_System SHALL select the cheapest EC2 instance type that satisfies the learning objectives of that lab, with explicit justification for why a more expensive instance cannot be avoided when one is selected
FOR ALL lab Terraform configurations, THE Lab_System SHALL avoid NAT gateways and use public subnets with internet gateways unless the lab specifically teaches private networking concepts
WHERE fixed-cost resources (NAT gateways, EKS control planes, Elastic IPs) are required, THE Lab_System SHALL prominently warn the user about always-on costs and provide teardown instructions
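
The tier assignment rules above can be sketched as a small classifier. This is a sketch under stated assumptions: the feature identifiers are hypothetical labels for the capabilities named in the criteria, and the rule that the most demanding feature decides the tier is one reading of the criteria, not something this document states explicitly.

```python
# Hypothetical feature identifiers for the capabilities listed in the criteria.
ELITE_FEATURES = {
    "nvlink", "nccl", "distributed_training", "topology_awareness",
    "gpudirect", "efa", "multi_node_collectives",
}
ADVANCED_FEATURES = {
    "multi_gpu_scheduling", "inference_scaling", "profiling",
    "concurrency_testing", "mig",
}


def classify_lab(required_features):
    """Assign exactly one tier; the most demanding feature wins (assumption)."""
    features = set(required_features)
    if features & ELITE_FEATURES:
        return "elite"
    if features & ADVANCED_FEATURES:
        return "advanced"
    # Anything that needs only single-GPU features stays on the cheapest tier.
    return "foundation"


print(classify_lab({"nvidia_smi_basics"}))    # foundation
print(classify_lab({"dcgm_basics", "mig"}))   # advanced
print(classify_lab({"mig", "nccl"}))          # elite
```

Defaulting to foundation keeps the cheapest tier as the baseline, so a lab only moves up when a listed feature forces it.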

Requirement 12: Progress Documentation System

User Story: As a candidate, I want to actively document my progress with screenshots, code, and commands, so that I can track my learning journey and demonstrate completion of milestones.

Acceptance Criteria

THE Progress_System SHALL actively document completion status for each learning objective and lab exercise
WHEN a user completes a milestone, THE Progress_System SHALL allow attachment of screenshots, code snippets, and command outputs as evidence
THE Progress_System SHALL track completion of milestones and display overall progress toward certification readiness
WHILE documenting progress, THE Progress_System SHALL organize assets by milestone and learning objective
WHERE multiple asset types are attached, THE Progress_System SHALL support images, code blocks, terminal output, and configuration files
THE Progress_System SHALL provide a summary view showing completed, in-progress, and remaining milestones
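
The summary view can be pictured as a simple count over milestone statuses. The sketch below assumes each milestone carries one of three status strings; the milestone names and the status values are hypothetical.

```python
from collections import Counter

STATUSES = ("completed", "in-progress", "remaining")


def summarize(milestones):
    """Count milestones per status and compute the completed percentage."""
    counts = Counter(milestones.values())
    summary = {status: counts.get(status, 0) for status in STATUSES}
    total = sum(summary.values())
    summary["percent_complete"] = (
        round(100 * summary["completed"] / total) if total else 0
    )
    return summary


# Hypothetical milestone names, for illustration only.
example = {
    "week1-gpu-basics": "completed",
    "week2-kubernetes": "in-progress",
    "week3-slurm": "remaining",
    "week4-dcgm": "remaining",
}
print(summarize(example))
```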

Requirement 13: Blog and Knowledge Sharing at Milestone Completion

User Story: As a candidate, I want to document and share my learning at the end of each milestone as a blog post, so that others in the open source community can benefit from my experience.

Acceptance Criteria

WHEN a milestone is completed, THE Blog_System SHALL provide an intuitive interface for creating a blog post documenting the milestone
THE Blog_System SHALL make content creation easy by supporting text entry, image upload, code block insertion, and command output formatting
WHILE creating a blog post, THE Blog_System SHALL pre-populate a template with milestone details, objectives covered, and placeholder sections
THE Blog_System SHALL allow users to upload screenshots, diagrams, and other visual assets with minimal friction
WHERE results need to be shared, THE Blog_System SHALL generate shareable content suitable for open source community platforms
THE Blog_System SHALL provide appropriate formatting and export options for publishing to common blog platforms

Requirement 14: NVIDIA Certified Infrastructure Professional Blueprint Cross-Check

User Story: As a candidate, I want the study plan to cross-reference all NVIDIA Professional examination blueprint objectives, so that I can verify complete coverage of the certification exam.

Acceptance Criteria

THE Blueprint_Checker SHALL map all Deployment and Validation objectives including: deployment event sequences, network topologies for AI factories, BMC/OOB/TPM configuration, firmware upgrades and fault detection on HGX, power and cooling validation, GPU server installation via SMI, hardware validation, cable and transceiver validation, physical GPU installation, workload hardware validation, third-party storage configuration, BlueField network platform configuration, and MIG configuration for AI and HPC
THE Blueprint_Checker SHALL map all Software Installation and Configuration objectives including: BCM installation with HA configuration, OS installation, cluster configuration with Slurm/Enroot/Pyxis, NVIDIA GPU and DOCA driver management, NVIDIA container toolkit installation, Docker GPU usage, and NGC CLI installation
THE Blueprint_Checker SHALL map all Performance Testing and Validation objectives including: single-node stress tests, HPL execution, single-node NCCL with NVLink Switch verification, cable signal quality validation, cabling verification, switch firmware and software confirmation, BlueField-3 firmware and software confirmation, transceiver firmware confirmation, ClusterKit node assessment, NCCL east-west fabric bandwidth verification, NCCL burn-in, HPL burn-in, NeMo burn-in, and storage testing
THE Blueprint_Checker SHALL map all Troubleshooting and Maintenance objectives including: hardware fault identification for GPUs, fans, and network cards, faulty component identification and replacement for cards, GPUs, and power supplies, AMD and Intel server performance optimization, and storage optimization
FOR EACH exam objective, THE Blueprint_Checker SHALL indicate which study plan module and lab exercise addresses the objective
IF an exam objective is not covered by the study plan, THE Blueprint_Checker SHALL flag the gap and recommend additional resources
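
The mapping and gap-flagging criteria amount to a lookup from objective to the modules and labs that cover it. A minimal sketch follows, assuming coverage is stored as a dictionary; the objective names are taken from the lists above, while the module and lab identifiers are hypothetical.

```python
def find_gaps(objectives, coverage):
    """Return the objectives that have no (module, lab) entry in the coverage map."""
    return [objective for objective in objectives if not coverage.get(objective)]


objectives = [
    "MIG configuration for AI and HPC",
    "HPL execution",
    "power and cooling validation",
]

# Hypothetical coverage map: objective -> list of (module, lab) pairs.
coverage = {
    "MIG configuration for AI and HPC": [("Week 1", "lab-mig-partitioning")],
    "HPL execution": [("Week 3", "lab-hpl-single-node")],
}

for gap in find_gaps(objectives, coverage):
    print(f"GAP: {gap} (recommend additional resources)")
```

An objective mapped to an empty list is treated the same as a missing one, so a placeholder entry cannot hide a gap.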

Requirement 15: Gap Analysis for AWS vs On-Premises Objectives

User Story: As a candidate using AWS infrastructure, I want a clear gap analysis identifying which certification objectives can and cannot be performed on AWS, so that I can plan alternative learning paths for objectives requiring physical hardware.

Acceptance Criteria

THE Gap_Analysis SHALL list all certification blueprint objectives and classify each as "achievable on AWS", "partially achievable on AWS", or "not achievable on AWS"
WHEN an objective is not achievable on AWS, THE Gap_Analysis SHALL provide alternative options including NVIDIA LaunchPad, local hardware access, simulation tools, or virtual lab environments
FOR EACH objective marked as not achievable on AWS, THE Gap_Analysis SHALL explain the specific limitation preventing AWS-based practice
WHERE partial coverage exists on AWS, THE Gap_Analysis SHALL describe what aspects can be practiced on AWS and what aspects require alternatives
THE Gap_Analysis SHALL provide a recommended path for completing all objectives, combining AWS-based labs with alternative resources for gap areas
WHEN alternatives are recommended, THE Gap_Analysis SHALL include access instructions, estimated costs, and availability information for each alternative option

Requirement 16: Containerization and Cloud Portability

User Story: As a candidate, I want all system components to run in containers with no local machine pollution, so that I can easily move the entire system to the cloud and maintain a clean development environment.

Acceptance Criteria

THE System SHALL run all application components (Next.js app, monitoring tools, lab environments) inside Docker containers with no direct installation on the host machine
WHEN setting up the system locally, THE System SHALL use Docker Compose to orchestrate all services with a single command
THE System SHALL NOT require any runtime dependencies (Node.js, Python, Terraform, etc.) to be installed directly on the host machine — all tools SHALL be available within containers
WHERE data persistence is needed, THE System SHALL use Docker volumes mapped to a local directory for progress data, blog posts, and uploaded assets
THE System SHALL include container images that are compatible with AWS ECS, EKS, or similar cloud container orchestration platforms for seamless cloud migration
WHEN migrating to the cloud, THE System SHALL require only configuration changes (environment variables, volume mounts to cloud storage) without code modifications
FOR EACH lab exercise that requires infrastructure tools (Terraform, kubectl, AWS CLI), THE System SHALL provide a pre-configured container with all necessary tools installed
THE System SHALL include a Dockerfile for each service component and a docker-compose.yml for local orchestration

Requirement 17: Browser-Based Terminal Simulator (SimLab)

User Story: As a candidate, I want a browser-based terminal simulator that renders mock GPU command outputs, so that I can practice command syntax and parameter usage at zero cost before spinning up real AWS instances.

Acceptance Criteria

THE Simulator SHALL be a single Docker container running one Node.js static file server — no backend API, no database, no second container
THE Simulator SHALL have ALL simulation logic running in the browser via vanilla JavaScript ES modules reading data/commands.json — no React, Vue, webpack, or bundler
THE Simulator SHALL serve files from simlab/ directory at path a100/simlab/ with structure: Dockerfile, docker-compose.yml, package.json (zero dependencies), server.js, public/index.html, src/main.js, src/paramPicker.js, src/mockRunner.js, src/sessionStore.js, src/navRenderer.js, src/styles.css, data/commands.json
THE Simulator SHALL use xterm.js 5.3.0 loaded from CDN — not bundled
THE parameter picker SHALL be a DOM overlay sibling of the xterm canvas container, triggered by Ctrl+Space, using inputBuffer as the source of truth (not xterm's internal buffer)
THE Simulator SHALL use longest-prefix command matching against commands.json mockOutputs
THE Simulator SHALL persist session history in localStorage with key format nvidialab_session_{sessionId} with entry schema containing: id, cmd, tokens, output, ok, errorType, matched, outlineId, timestamp, elapsed
THE left nav SHALL show lab exercises with status dots (green=ok, red=error, grey=not run), nested history entries, replay mode with "live ▶" button, ad hoc section, and storage gauge
THE domain sidebar SHALL show all 5 domains and their labs — switching labs loads or creates the correct session
THE Simulator SHALL pass all 15 acceptance criteria, including: docker build succeeds, app loads at localhost:3000, param picker works, session persists across refresh, replay works, export JSON works
THE Simulator SHALL NOT be integrated into the Next.js app — it is a completely separate standalone project in a100/simlab/
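
Longest-prefix command matching can be sketched as follows. The Simulator itself is specified as browser-side JavaScript, so this Python version is only a language-neutral illustration of the technique. It assumes that mockOutputs maps a command prefix to its mock output and that matching is done on whitespace-separated tokens; neither detail is fixed by this document.

```python
def match_command(input_line, mock_outputs):
    """Return the key in mock_outputs that is the longest token-wise prefix
    of input_line, or None when no key matches."""
    tokens = input_line.split()
    best_key = None
    best_length = -1
    for key in mock_outputs:
        key_tokens = key.split()
        is_prefix = tokens[:len(key_tokens)] == key_tokens
        if is_prefix and len(key_tokens) > best_length:
            best_key = key
            best_length = len(key_tokens)
    return best_key


# Hypothetical excerpt of mockOutputs; the output strings are placeholders.
mock_outputs = {
    "nvidia-smi": "<mock summary table>",
    "nvidia-smi -q": "<mock query output>",
    "dcgmi diag": "<mock diagnostic output>",
}

print(match_command("nvidia-smi -q -d MEMORY", mock_outputs))  # nvidia-smi -q
print(match_command("nvidia-smi -L", mock_outputs))            # nvidia-smi
print(match_command("kubectl get pods", mock_outputs))         # None
```

Matching on tokens rather than raw characters keeps a mistyped command such as nvidia-smith from resolving to the nvidia-smi entry.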
