Requirements Document

Requirements for the NVIDIA AWS GPU certification project: compute targets, software stack, test harness, and acceptance criteria.

November 15, 2025 · 16 min read

Introduction

This document defines the requirements for a study plan and resource guide for the NVIDIA Certified Professional AI Infrastructure exam, with a focus on deploying NVIDIA GPUs on AWS. The study plan follows a 6-week curriculum that builds proficiency with NVIDIA datacenter GPUs and with the AWS services and tools used to deploy them.

Glossary

  • NVIDIA GPU: Graphics Processing Unit from NVIDIA, including A100, H100, and other datacenter GPUs
  • AWS GPU Instances: Amazon EC2 instances with NVIDIA GPUs, including P5, P4, and G5 families
  • Kubernetes (K8s): Open-source container orchestration platform
  • EKS: Amazon Elastic Kubernetes Service, AWS's managed Kubernetes offering
  • DCGM: Data Center GPU Manager, NVIDIA's monitoring and management tool
  • MIG: Multi-Instance GPU, technology that allows a single GPU to be partitioned into multiple isolated instances
  • Kubernetes Device Plugin: Kubernetes framework for exposing hardware resources to the scheduler
  • Slurm: Workload manager and job scheduler for HPC clusters (originally an acronym for Simple Linux Utility for Resource Management)
  • nvidia-smi: NVIDIA System Management Interface, command-line utility for GPU monitoring
  • Prometheus: Open-source monitoring and alerting toolkit
  • Grafana: Open-source platform for monitoring and observability
  • Terraform: Infrastructure as Code tool for provisioning AWS resources
  • P5 Instances: AWS EC2 instances powered by up to 8 NVIDIA H100 GPUs
  • A100/H100: NVIDIA datacenter GPUs (Ampere and Hopper architectures)
  • BMC: Baseboard Management Controller, used for remote server management
  • OOB: Out-of-Band management, server management independent of the main OS
  • TPM: Trusted Platform Module, hardware security module for cryptographic operations
  • HGX: NVIDIA HGX GPU platform, a multi-GPU baseboard for AI and HPC
  • BlueField: NVIDIA DPU (Data Processing Unit) network platform for data center infrastructure
  • BCM: Base Command Manager, NVIDIA's cluster management software
  • Enroot: NVIDIA container runtime optimized for HPC environments
  • Pyxis: Slurm plugin enabling container support for HPC workloads
  • DOCA: Data Center Infrastructure on a Chip Architecture, SDK for BlueField DPUs
  • NGC: NVIDIA GPU Cloud, hub for GPU-optimized software and containers
  • HPL: High-Performance Linpack benchmark, used for measuring floating-point performance
  • NCCL: NVIDIA Collective Communications Library, for multi-GPU and multi-node communication
  • NVLink: NVIDIA high-speed GPU interconnect technology
  • NeMo: NVIDIA framework for building, training, and deploying AI models
  • ClusterKit: NVIDIA cluster validation toolkit for infrastructure assessment
  • EFA: Elastic Fabric Adapter, AWS high-performance networking for HPC and ML
  • Docker: Container platform for packaging applications with their dependencies
  • Docker Compose: Tool for defining and running multi-container Docker applications
  • ECS: Amazon Elastic Container Service, AWS managed container orchestration
  • Container Image: Lightweight, standalone package containing application code and dependencies
  • Simulator: Browser-based terminal simulator using xterm.js that renders mock GPU command outputs for zero-cost practice
  • xterm.js: JavaScript terminal emulator library for rendering terminal interfaces in the browser

Requirements

Requirement 1: 6-Week Study Plan with Weekly Learning Objectives

User Story: As a candidate preparing for the NVIDIA Certified Professional AI Infrastructure exam, I want a structured 6-week study plan, so that I can systematically cover all exam topics and achieve proficiency.

Acceptance Criteria

  1. THE Study_Plan SHALL divide content into 6 weekly modules covering: GPU basics, Kubernetes, Slurm, DCGM monitoring, cost optimization, and troubleshooting
  2. FOR EACH week, THE Study_Plan SHALL specify clear learning objectives, key concepts, hands-on practice requirements, and AWS-specific deployment scenarios
  3. THE Study_Plan SHALL include progressive complexity, building from foundational GPU concepts to advanced orchestration and troubleshooting
  4. WHERE weekly topics overlap, THE Study_Plan SHALL indicate dependencies and recommended completion order
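
One way to derive a recommended completion order from declared dependencies is a topological sort. The sketch below uses the six topics named in criterion 1; the dependency edges themselves are illustrative assumptions, not part of the requirements.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: module -> modules that should be finished first.
# The six topics come from the study plan; the edges are assumptions.
PREREQUISITES = {
    "gpu_basics": set(),
    "kubernetes": {"gpu_basics"},
    "slurm": {"gpu_basics"},
    "dcgm_monitoring": {"kubernetes", "slurm"},
    "cost_optimization": {"kubernetes"},
    "troubleshooting": {"dcgm_monitoring", "cost_optimization"},
}


def completion_order(prereqs):
    """Return one valid order in which every module follows its prerequisites."""
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(prereqs).static_order())


order = completion_order(PREREQUISITES)
print(" -> ".join(order))
```

A circular dependency between weeks surfaces as an exception rather than as a silently wrong order, which is a useful check when the plan is edited.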

Requirement 2: GPU Fundamentals and Hardware Knowledge

User Story: As a candidate, I want to understand NVIDIA GPU architecture and hardware, so that I can make informed deployment decisions and troubleshoot hardware-related issues.

Acceptance Criteria

  1. THE Study_Guide SHALL cover GPU architecture concepts including CUDA cores, tensor cores, memory hierarchy, and NVLink
  2. WHEN reviewing hardware, THE Study_Guide SHALL detail A100 (Ampere), H100 (Hopper), and other relevant NVIDIA datacenter GPUs
  3. WHERE P5 instances are referenced, THE Study_Guide SHALL explain NVIDIA H100 GPU specifications and capabilities
  4. WHEN covering MIG, THE Study_Guide SHALL explain multi-instance GPU technology, partitioning strategies, and use cases
  5. THE Study_Guide SHALL include hands-on exercises using nvidia-smi for GPU monitoring and diagnostics
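
A hands-on nvidia-smi exercise could end with parsing the tool's CSV output. The following is a minimal sketch: the query fields are standard nvidia-smi ones, but the sample rows are made-up data rather than a capture from real hardware, and the field names in FIELDS are chosen for this example.

```python
import csv
import io

# On a GPU host, the equivalent data comes from:
#   nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# SAMPLE_OUTPUT is invented data in that format, not real measurements.
SAMPLE_OUTPUT = """\
0, NVIDIA A10G, 41, 12, 1536, 23028
1, NVIDIA A10G, 78, 97, 21500, 23028
"""

# Local names for the six queried columns (assumed for this sketch).
FIELDS = ["index", "name", "temp_c", "util_pct", "mem_used_mib", "mem_total_mib"]


def parse_gpu_rows(text):
    """Turn nvidia-smi CSV rows into dictionaries with integer metrics."""
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not raw:
            continue
        row = dict(zip(FIELDS, raw))
        for key in FIELDS:
            if key != "name":
                # Real output can contain "[N/A]" for unsupported fields;
                # this sketch assumes every field is numeric.
                row[key] = int(row[key])
        rows.append(row)
    return rows


def memory_pressure(row):
    """Fraction of GPU memory currently in use."""
    return row["mem_used_mib"] / row["mem_total_mib"]


gpus = parse_gpu_rows(SAMPLE_OUTPUT)
for gpu in gpus:
    print(gpu["index"], gpu["name"], f"{memory_pressure(gpu):.0%}")
```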

Requirement 3: Kubernetes and EKS Deployment

User Story: As a candidate, I want to understand Kubernetes deployment with NVIDIA GPUs, so that I can orchestrate GPU workloads in production environments.

Acceptance Criteria

  1. WHEN deploying Kubernetes, THE Study_Guide SHALL cover the Kubernetes device plugin for NVIDIA GPUs
  2. WHERE EKS is used, THE Study_Guide SHALL detail AWS-specific Kubernetes GPU deployment patterns
  3. THE Study_Guide SHALL include Terraform configurations for provisioning EKS clusters with GPU nodes
  4. WHILE deploying GPU workloads, THE Study_Guide SHALL cover resource requests, limits, and scheduling considerations
  5. IF MIG is enabled, THE Study_Guide SHALL explain Kubernetes integration with MIG-enabled GPUs
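
To illustrate criterion 4, the sketch below builds a Pod manifest that requests whole GPUs through the nvidia.com/gpu extended resource advertised by the NVIDIA device plugin. The pod name and image reference are placeholders invented for this example; kubectl accepts the printed JSON as well as YAML.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod spec that asks the scheduler for whole GPUs.

    Extended resources such as nvidia.com/gpu are set under limits; if
    requests are also given, they must equal the limits.
    """
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": ["nvidia-smi"],
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Placeholder name and image, not taken from the requirements.
manifest = gpu_pod_manifest("gpu-smoke-test", "registry.example.com/cuda-base:latest")
print(json.dumps(manifest, indent=2))
```

If no node advertises enough free nvidia.com/gpu capacity, the pod stays Pending, which is the scheduling failure that Requirement 7 asks the guide to explain.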

Requirement 4: Slurm HPC Orchestration

User Story: As a candidate, I want to understand Slurm workload manager for GPU clusters, so that I can deploy and manage GPU workloads in HPC environments.

Acceptance Criteria

  1. WHEN configuring Slurm, THE Study_Guide SHALL cover GPU resource management and scheduling
  2. THE Study_Guide SHALL include Slurm job scripts for GPU workloads with proper resource specifications
  3. WHILE managing GPU resources, THE Study_Guide SHALL explain Slurm integration with NVIDIA DCGM
  4. WHERE MIG is used, THE Study_Guide SHALL detail Slurm configuration for MIG-enabled GPUs
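
A job script with "proper resource specifications" can be as small as the one rendered below. This is a sketch only: it assumes the cluster already exposes GPUs as a generic resource named gpu, and the job name and default command are placeholders.

```python
def render_sbatch(job_name, gpus_per_node, nodes=1, time_limit="00:10:00",
                  command="nvidia-smi"):
    """Render a minimal Slurm batch script that requests GPUs through GRES."""
    if gpus_per_node < 1 or nodes < 1:
        raise ValueError("nodes and gpus_per_node must be positive")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # --gres is counted per node, so this asks for gpus_per_node on each node.
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


script = render_sbatch("gpu-check", gpus_per_node=4)
print(script)
```

Saving the output to a file and submitting it with sbatch would run nvidia-smi on the allocated node, which gives a quick check that the job sees the GPUs it asked for.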

Requirement 5: Monitoring and Observability with DCGM

User Story: As a candidate, I want to understand DCGM monitoring and observability, so that I can track GPU health, performance, and identify issues proactively.

Acceptance Criteria

  1. WHEN monitoring GPUs, THE Study_Guide SHALL cover DCGM (Data Center GPU Manager) installation and configuration
  2. THE Study_Guide SHALL include Prometheus and Grafana integration with DCGM for centralized monitoring
  3. WHILE monitoring GPU health, THE Study_Guide SHALL cover key metrics including temperature, power, utilization, memory usage, and errors
  4. WHERE alerts are configured, THE Study_Guide SHALL explain DCGM alerting mechanisms and notification channels
  5. THE Study_Guide SHALL include dashboards for visualizing GPU metrics in Grafana
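
The metrics in criterion 3 reach Prometheus as plain-text samples. The sketch below parses a few such samples and flags threshold breaches. The metric names follow dcgm-exporter's DCGM_FI_DEV_* convention; the sample values, label set, and 85 °C threshold are assumptions made for illustration.

```python
import re

# Invented samples in Prometheus text exposition format.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-1"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-1"} 88
DCGM_FI_DEV_POWER_USAGE{gpu="0",Hostname="node-1"} 71.5
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-1"} 99
"""

LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

# Illustrative alert threshold, not an NVIDIA recommendation.
THRESHOLDS = {"DCGM_FI_DEV_GPU_TEMP": 85.0}


def parse_samples(text):
    """Return (metric, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue
        labels = dict(LABEL.findall(match["labels"]))
        samples.append((match["name"], labels, float(match["value"])))
    return samples


def breaches(samples, thresholds):
    """List (metric, gpu, value) for every sample above its threshold."""
    return [
        (name, labels.get("gpu", "?"), value)
        for name, labels, value in samples
        if name in thresholds and value > thresholds[name]
    ]


print(breaches(parse_samples(SAMPLE_METRICS), THRESHOLDS))
```

In a real deployment the same comparison would normally live in a Prometheus alerting rule; the Python form is only meant to make the logic explicit.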

Requirement 6: Cost Optimization and Resource Management

User Story: As a candidate, I want to understand cost optimization strategies for GPU deployments and practice on the cheapest viable hardware, so that I can maximize repetition and learning while minimizing AWS spend.

Acceptance Criteria

  1. WHEN evaluating instance types, THE Study_Guide SHALL compare P5, P4, and G5 instance families for cost-performance tradeoffs
  2. WHERE cost optimization is needed, THE Study_Guide SHALL cover reserved instances, spot instances, and savings plans
  3. WHILE managing GPU resources, THE Study_Guide SHALL explain MIG for better GPU utilization and cost efficiency
  4. THE Study_Guide SHALL include AWS cost estimation examples for GPU workloads
  5. FOR ALL cost optimization strategies, THE Study_Guide SHALL provide real-world examples and calculations
  6. THE Study_Guide SHALL optimize for maximum repetition on cheap hardware rather than occasional access to expensive hardware, prioritizing learning frequency over hardware scale
  7. THE Study_Guide SHALL recommend a daily workflow of foundation-tier labs on g5.xlarge instances (~$1-2/hr) for building muscle memory through repetition
  8. WHEN expensive instances (p4d.24xlarge, p5.48xlarge) are required, THE Study_Guide SHALL recommend focused 2-4 hour sessions followed by immediate termination
  9. THE Study_Guide SHALL target a cost distribution where 70-80% of lab hours are spent on foundation-tier instances, 10-20% on advanced-tier instances, and no more than 10% on elite-tier instances
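
The distribution in criterion 9 is easy to check mechanically. The sketch below takes planned lab hours per tier and reports any tier whose share falls outside the target range; the function names and the example hour counts are hypothetical.

```python
# Target share of lab hours per tier as (minimum, maximum) fractions,
# taken from the 70-80% / 10-20% / at most 10% split.
TARGET_SHARE = {
    "foundation": (0.70, 0.80),
    "advanced": (0.10, 0.20),
    "elite": (0.00, 0.10),
}
EPSILON = 1e-9  # tolerance for floating-point rounding at range boundaries


def tier_shares(hours_by_tier):
    """Return each tier's fraction of the total planned lab hours."""
    unknown = set(hours_by_tier) - set(TARGET_SHARE)
    if unknown:
        raise ValueError(f"unknown tiers: {sorted(unknown)}")
    total = sum(hours_by_tier.values())
    if total <= 0:
        raise ValueError("total lab hours must be positive")
    return {tier: hours_by_tier.get(tier, 0) / total for tier in TARGET_SHARE}


def out_of_range(hours_by_tier):
    """Return {tier: share} for tiers whose share misses the target range."""
    result = {}
    for tier, share in tier_shares(hours_by_tier).items():
        low, high = TARGET_SHARE[tier]
        if share < low - EPSILON or share > high + EPSILON:
            result[tier] = round(share, 3)
    return result


# Hypothetical plan: 45 h foundation, 10 h advanced, 5 h elite (60 h total).
print(out_of_range({"foundation": 45, "advanced": 10, "elite": 5}))
```

With the example plan the shares are 75%, about 16.7%, and about 8.3%, so nothing is reported.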

Requirement 7: Troubleshooting and Debugging

User Story: As a candidate, I want to develop troubleshooting skills for GPU deployments, so that I can quickly identify and resolve issues in production environments.

Acceptance Criteria

  1. WHEN GPU issues occur, THE Study_Guide SHALL provide a systematic troubleshooting methodology
  2. THE Study_Guide SHALL cover common GPU error codes and their meanings
  3. WHILE debugging Kubernetes GPU deployments, THE Study_Guide SHALL explain device plugin issues, scheduling failures, and resource constraints
  4. IF monitoring data is unavailable, THE Study_Guide SHALL provide DCGM diagnostic commands and log locations
  5. THE Study_Guide SHALL include real-world troubleshooting scenarios with step-by-step resolution

Requirement 8: AWS-Specific Deployment Scenarios

User Story: As a candidate, I want to understand AWS-specific GPU deployment scenarios, so that I can apply best practices in real-world AWS environments.

Acceptance Criteria

  1. WHEN deploying on AWS, THE Study_Guide SHALL cover P5 instances with NVIDIA H100 GPUs
  2. THE Study_Guide SHALL include Terraform configurations for provisioning GPU instances on AWS
  3. WHILE setting up networking, THE Study_Guide SHALL cover Elastic Fabric Adapter (EFA) for GPU cluster communication
  4. WHERE security is concerned, THE Study_Guide SHALL explain IAM roles, security groups, and VPC configurations for GPU instances
  5. FOR ALL AWS deployment scenarios, THE Study_Guide SHALL reference AWS best practices and NVIDIA recommendations

Requirement 9: Hands-On Practice Requirements

User Story: As a candidate, I want hands-on practice exercises, so that I can reinforce theoretical knowledge with practical experience.

Acceptance Criteria

  1. FOR EACH weekly module, THE Study_Guide SHALL include hands-on exercises with specific tasks and expected outcomes
  2. WHEN setting up environments, THE Study_Guide SHALL provide step-by-step instructions for AWS GPU instance provisioning
  3. WHILE practicing Kubernetes deployments, THE Study_Guide SHALL include sample manifests and deployment scripts
  4. WHERE monitoring is covered, THE Study_Guide SHALL provide DCGM, Prometheus, and Grafana configuration examples
  5. THE Study_Guide SHALL include troubleshooting labs with simulated issues for practice

Requirement 10: Resource Guide and References

User Story: As a candidate, I want a comprehensive resource guide, so that I can access additional learning materials and documentation.

Acceptance Criteria

  1. THE Resource_Guide SHALL include official NVIDIA documentation links for all covered tools
  2. WHEN referencing AWS resources, THE Resource_Guide SHALL link to AWS documentation and whitepapers
  3. WHERE community resources are available, THE Resource_Guide SHALL include relevant GitHub repositories and tutorials
  4. THE Resource_Guide SHALL include practice exam questions and sample scenarios
  5. FOR ALL external resources, THE Resource_Guide SHALL provide direct URLs and brief descriptions

Requirement 11: Structured Labs with Tiered Instance Architecture

User Story: As a candidate, I want structured hands-on labs organized into cost tiers with strict instance discipline, so that I can gain practical experience aligned with each certification objective while keeping costs predictable and low.

Acceptance Criteria

  1. FOR EACH identified learning goal, THE Lab_System SHALL provide a structured hands-on lab exercise
  2. WHEN a lab is presented, THE Lab_System SHALL specify the learning objective, prerequisites, step-by-step instructions, and expected outcomes
  3. THE Lab_System SHALL tie each lab directly to one or more certification exam objectives
  4. WHILE completing a lab, THE Lab_System SHALL provide validation checkpoints to confirm correct execution
  5. WHERE AWS infrastructure is used, THE Lab_System SHALL include provisioning instructions and estimated cost for each lab exercise
  6. THE Lab_System SHALL classify every lab into exactly one tier: "foundation" ($1-2/hr, g5.xlarge or g5.2xlarge), "advanced" ($5-6/hr, g5.12xlarge with 4x A10G), or "elite" (p4d.24xlarge or p5.48xlarge)
  7. THE Lab_System SHALL ensure 70-80% of all labs are classified as foundation tier, 10-20% as advanced tier, and no more than 10% as elite tier
  8. WHEN a lab requires only single-GPU features (nvidia-smi basics, CUDA fundamentals, Docker GPU runtime, EKS GPU plugin, Kubernetes scheduling, taints/tolerations, DCGM basics, Prometheus/Grafana, Terraform, GPU pod deployment), THE Lab_System SHALL assign it to foundation tier using g5.xlarge
  9. WHEN a lab requires multi-GPU scheduling, inference scaling, profiling, concurrency testing, or MIG practicals (which require an A100, because the A10G GPUs in G5 instances do not support MIG), THE Lab_System SHALL assign it to advanced tier using g5.12xlarge or p4d.24xlarge as appropriate
  10. WHEN a lab requires NVLink, NCCL, distributed training, topology awareness, GPUDirect, EFA, or multi-node collectives, THE Lab_System SHALL assign it to elite tier
  11. FOR EACH lab exercise, THE Lab_System SHALL document the tier, recommended instance type, estimated hourly cost, minimum GPU requirement, whether multi-GPU is required, whether NVLink is required, and a justification for the chosen instance
  12. FOR EACH lab exercise, THE Lab_System SHALL select the cheapest EC2 instance type that satisfies the learning objectives of that lab, with explicit justification for why a more expensive instance cannot be avoided when one is selected
  13. FOR ALL lab Terraform configurations, THE Lab_System SHALL avoid NAT gateways and use public subnets with internet gateways unless the lab specifically teaches private networking concepts
  14. WHERE fixed-cost resources (NAT gateways, EKS control planes, Elastic IPs) are required, THE Lab_System SHALL prominently warn the user about always-on costs and provide teardown instructions
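
The tier rules in criteria 8 to 10 can be read as a small decision function. The sketch below is one interpretation: the feature identifiers are invented labels for the features listed there, and the rule that the most demanding feature decides the tier is an assumption.

```python
# Feature labels are hypothetical identifiers for the items in criteria 9 and 10.
ELITE_FEATURES = {
    "nvlink", "nccl", "distributed_training", "topology_awareness",
    "gpudirect", "efa", "multi_node_collectives",
}
ADVANCED_FEATURES = {
    "multi_gpu_scheduling", "inference_scaling", "profiling",
    "concurrency_testing", "mig",
}


def classify_lab(features):
    """Return the single tier for a lab; the most demanding feature wins."""
    features = set(features)
    if features & ELITE_FEATURES:
        return "elite"
    if features & ADVANCED_FEATURES:
        return "advanced"
    return "foundation"


def recommended_instance(features):
    """Pick an instance type where the criteria name exactly one."""
    tier = classify_lab(features)
    if tier == "foundation":
        return "g5.xlarge"
    if tier == "advanced":
        # MIG practicals need an A100, so they move to p4d.24xlarge.
        return "p4d.24xlarge" if "mig" in set(features) else "g5.12xlarge"
    # Elite labs may use p4d.24xlarge or p5.48xlarge; the choice is left open.
    return None


print(classify_lab({"dcgm_basics"}), recommended_instance({"dcgm_basics"}))
print(classify_lab({"mig"}), recommended_instance({"mig"}))
print(classify_lab({"profiling", "efa"}))
```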

Requirement 12: Progress Documentation System

User Story: As a candidate, I want to actively document my progress with screenshots, code, and commands, so that I can track my learning journey and demonstrate completion of milestones.

Acceptance Criteria

  1. THE Progress_System SHALL actively document completion status for each learning objective and lab exercise
  2. WHEN a user completes a milestone, THE Progress_System SHALL allow attachment of screenshots, code snippets, and command outputs as evidence
  3. THE Progress_System SHALL track completion of milestones and display overall progress toward certification readiness
  4. WHILE documenting progress, THE Progress_System SHALL organize assets by milestone and learning objective
  5. WHERE multiple asset types are attached, THE Progress_System SHALL support images, code blocks, terminal output, and configuration files
  6. THE Progress_System SHALL provide a summary view showing completed, in-progress, and remaining milestones
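
The summary view in criterion 6 reduces to counting milestones by status. A minimal sketch follows, assuming each milestone is a record with a status field; the record shape and the milestone names are hypothetical.

```python
from collections import Counter

STATUSES = ("completed", "in_progress", "remaining")


def summarize(milestones):
    """Count milestones per status and compute the completed percentage."""
    counts = Counter(m["status"] for m in milestones)
    unknown = set(counts) - set(STATUSES)
    if unknown:
        raise ValueError(f"unknown status values: {sorted(unknown)}")
    summary = {status: counts.get(status, 0) for status in STATUSES}
    total = len(milestones)
    summary["percent_complete"] = (
        round(100 * summary["completed"] / total, 1) if total else 0.0
    )
    return summary


# Hypothetical milestone records.
MILESTONES = [
    {"name": "Week 1 labs", "status": "completed"},
    {"name": "Week 2 labs", "status": "completed"},
    {"name": "Week 3 labs", "status": "in_progress"},
    {"name": "Week 4 labs", "status": "remaining"},
]
print(summarize(MILESTONES))
```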

Requirement 13: Blog and Knowledge Sharing at Milestone Completion

User Story: As a candidate, I want to document and share my learning at the end of each milestone as a blog post, so that others in the open source community can benefit from my experience.

Acceptance Criteria

  1. WHEN a milestone is completed, THE Blog_System SHALL provide an intuitive interface for creating a blog post documenting the milestone
  2. THE Blog_System SHALL make content creation easy by supporting text entry, image upload, code block insertion, and command output formatting
  3. WHILE creating a blog post, THE Blog_System SHALL pre-populate a template with milestone details, objectives covered, and placeholder sections
  4. THE Blog_System SHALL allow users to upload screenshots, diagrams, and other visual assets with minimal friction
  5. WHERE results need to be shared, THE Blog_System SHALL generate shareable content suitable for open source community platforms
  6. THE Blog_System SHALL provide appropriate formatting and export options for publishing to common blog platforms

Requirement 14: NVIDIA Certified Professional AI Infrastructure Blueprint Cross-Check

User Story: As a candidate, I want the study plan to cross-reference every objective in the NVIDIA Certified Professional AI Infrastructure exam blueprint, so that I can verify complete coverage of the certification exam.

Acceptance Criteria

  1. THE Blueprint_Checker SHALL map all Deployment and Validation objectives including: deployment event sequences, network topologies for AI factories, BMC/OOB/TPM configuration, firmware upgrades and fault detection on HGX, power and cooling validation, GPU server installation via SMI, hardware validation, cable and transceiver validation, physical GPU installation, workload hardware validation, third-party storage configuration, BlueField network platform configuration, and MIG configuration for AI and HPC
  2. THE Blueprint_Checker SHALL map all Software Installation and Configuration objectives including: BCM installation with HA configuration, OS installation, cluster configuration with Slurm/Enroot/Pyxis, NVIDIA GPU and DOCA driver management, NVIDIA container toolkit installation, Docker GPU usage, and NGC CLI installation
  3. THE Blueprint_Checker SHALL map all Performance Testing and Validation objectives including: single-node stress tests, HPL execution, single-node NCCL with NVLink Switch verification, cable signal quality validation, cabling verification, switch firmware and software confirmation, BlueField-3 firmware and software confirmation, transceiver firmware confirmation, ClusterKit node assessment, NCCL east-west fabric bandwidth verification, NCCL burn-in, HPL burn-in, NeMo burn-in, and storage testing
  4. THE Blueprint_Checker SHALL map all Troubleshooting and Maintenance objectives including: hardware fault identification for GPUs, fans, and network cards, faulty component identification and replacement for cards, GPUs, and power supplies, AMD and Intel server performance optimization, and storage optimization
  5. FOR EACH exam objective, THE Blueprint_Checker SHALL indicate which study plan module and lab exercise addresses the objective
  6. IF an exam objective is not covered by the study plan, THE Blueprint_Checker SHALL flag the gap and recommend additional resources
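
Criteria 5 and 6 describe a mapping plus a gap check. The sketch below shows one possible shape: the three objective names are taken from the lists above, while the module and lab identifiers are invented for illustration.

```python
def coverage_report(objectives, coverage):
    """Split objectives into covered ones and gaps.

    coverage maps an objective to a list of (module, lab) pairs; an
    objective with no entry or an empty list counts as a gap.
    """
    covered, gaps = {}, []
    for objective in objectives:
        refs = coverage.get(objective, [])
        if refs:
            covered[objective] = refs
        else:
            gaps.append(objective)
    return covered, gaps


OBJECTIVES = ["HPL execution", "NGC CLI installation", "physical GPU installation"]

# Hypothetical module and lab identifiers.
COVERAGE = {
    "HPL execution": [("week-6", "lab-hpl-single-node")],
    "NGC CLI installation": [("week-2", "lab-ngc-cli")],
}

covered, gaps = coverage_report(OBJECTIVES, COVERAGE)
print("covered:", sorted(covered))
print("gaps:", gaps)
```

Objectives that land in the gap list are the ones that need a recommendation for additional resources.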

Requirement 15: Gap Analysis for AWS vs On-Premises Objectives

User Story: As a candidate using AWS infrastructure, I want a clear gap analysis identifying which certification objectives can and cannot be performed on AWS, so that I can plan alternative learning paths for objectives requiring physical hardware.

Acceptance Criteria

  1. THE Gap_Analysis SHALL list all certification blueprint objectives and classify each as "achievable on AWS", "partially achievable on AWS", or "not achievable on AWS"
  2. WHEN an objective is not achievable on AWS, THE Gap_Analysis SHALL provide alternative options including NVIDIA LaunchPad, local hardware access, simulation tools, or virtual lab environments
  3. FOR EACH objective marked as not achievable on AWS, THE Gap_Analysis SHALL explain the specific limitation preventing AWS-based practice
  4. WHERE partial coverage exists on AWS, THE Gap_Analysis SHALL describe what aspects can be practiced on AWS and what aspects require alternatives
  5. THE Gap_Analysis SHALL provide a recommended path for completing all objectives, combining AWS-based labs with alternative resources for gap areas
  6. WHEN alternatives are recommended, THE Gap_Analysis SHALL include access instructions, estimated costs, and availability information for each alternative option

Requirement 16: Containerization and Cloud Portability

User Story: As a candidate, I want all system components to run in containers with no local machine pollution, so that I can easily move the entire system to the cloud and maintain a clean development environment.

Acceptance Criteria

  1. THE System SHALL run all application components (Next.js app, monitoring tools, lab environments) inside Docker containers with no direct installation on the host machine
  2. WHEN setting up the system locally, THE System SHALL use Docker Compose to orchestrate all services with a single command
  3. THE System SHALL NOT require any runtime dependencies (Node.js, Python, Terraform, etc.) to be installed directly on the host machine — all tools SHALL be available within containers
  4. WHERE data persistence is needed, THE System SHALL use Docker volumes mapped to a local directory for progress data, blog posts, and uploaded assets
  5. THE System SHALL include container images that are compatible with AWS ECS, EKS, or similar cloud container orchestration platforms for seamless cloud migration
  6. WHEN migrating to the cloud, THE System SHALL require only configuration changes (environment variables, volume mounts to cloud storage) without code modifications
  7. FOR EACH lab exercise that requires infrastructure tools (Terraform, kubectl, AWS CLI), THE System SHALL provide a pre-configured container with all necessary tools installed
  8. THE System SHALL include a Dockerfile for each service component and a docker-compose.yml for local orchestration

Requirement 17: Browser-Based Terminal Simulator (SimLab)

User Story: As a candidate, I want a browser-based terminal simulator that renders mock GPU command outputs, so that I can practice command syntax and parameter usage at zero cost before spinning up real AWS instances.

Acceptance Criteria

  1. THE Simulator SHALL be a single Docker container running one Node.js static file server — no backend API, no database, no second container
  2. THE Simulator SHALL have ALL simulation logic running in the browser via vanilla JavaScript ES modules reading data/commands.json — no React, Vue, webpack, or bundler
  3. THE Simulator SHALL serve files from the simlab/ directory, located at a100/simlab/, with the following structure: Dockerfile, docker-compose.yml, package.json (zero dependencies), server.js, public/index.html, src/main.js, src/paramPicker.js, src/mockRunner.js, src/sessionStore.js, src/navRenderer.js, src/styles.css, data/commands.json
  4. THE Simulator SHALL use xterm.js 5.3.0 loaded from CDN — not bundled
  5. THE parameter picker SHALL be a DOM overlay sibling of the xterm canvas container, triggered by Ctrl+Space, using inputBuffer as the source of truth (not xterm's internal buffer)
  6. THE Simulator SHALL use longest-prefix command matching against commands.json mockOutputs
  7. THE Simulator SHALL persist session history in localStorage with key format nvidialab_session_{sessionId} with entry schema containing: id, cmd, tokens, output, ok, errorType, matched, outlineId, timestamp, elapsed
  8. THE left nav SHALL show lab exercises with status dots (green=ok, red=error, grey=not run), nested history entries, replay mode with "live ▶" button, ad hoc section, and storage gauge
  9. THE domain sidebar SHALL show all 5 domains and their labs — switching labs loads or creates the correct session
  10. THE Simulator SHALL pass all 15 acceptance criteria, including: docker build succeeds, app loads at localhost:3000, param picker works, session persists across refresh, replay works, export JSON works
  11. THE Simulator SHALL NOT be integrated into the Next.js app — it is a completely separate standalone project in a100/simlab/
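
Criteria 6 and 7 can be illustrated with a short sketch of the matching and storage-key logic. It is written in Python for brevity, whereas the Simulator itself is specified as browser-side JavaScript. Matching on whole tokens and treating mockOutputs as a flat map from command prefix to output text are both assumptions; the requirements do not define the layout of commands.json.

```python
def longest_prefix_match(tokens, mock_outputs):
    """Return (matched_prefix, output) for the longest known command prefix.

    Prefixes are tried from the full token list down to the first token, so
    "nvidia-smi -L" is preferred over "nvidia-smi" when both are defined.
    """
    for end in range(len(tokens), 0, -1):
        key = " ".join(tokens[:end])
        if key in mock_outputs:
            return key, mock_outputs[key]
    return None, None


def session_key(session_id):
    """Build the localStorage key in the nvidialab_session_{sessionId} format."""
    return f"nvidialab_session_{session_id}"


# Hypothetical mock data; real entries would come from data/commands.json.
MOCK_OUTPUTS = {
    "nvidia-smi": "<mock summary table>",
    "nvidia-smi -L": "<mock GPU list>",
}

print(longest_prefix_match("nvidia-smi -L --extra".split(), MOCK_OUTPUTS))
print(longest_prefix_match("nvidia-smi -q".split(), MOCK_OUTPUTS))
print(longest_prefix_match("ls".split(), MOCK_OUTPUTS))
print(session_key("abc123"))
```

An unmatched command returns (None, None), which a mock runner could record as an entry with ok set to false.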