NCP-AII Sample Questions Answers

Questions 4

You are preparing a Spectrum-based NVIDIA switch for integration into a production AI cluster. To confirm that all modules are running approved firmware versions, you must use the appropriate command from the switch CLI. Which step most accurately meets best practices for ensuring firmware version consistency and cluster compliance?

Options:

A.

Use the show version command to check the overall system version and confirm all modules are updated if the system version matches the documentation.

B.

Use the show interfaces status command to verify all ports are up, and proceed with integration if no interface errors are shown.

C.

Use the show asic-version command to review firmware versions for all modules, then compare these against the documented approved versions.

D.

Use the show inventory command to display component details and serial numbers before proceeding, as this output will include all firmware versions for review.

Questions 5

After a recent OS upgrade, you need to reinstall NVIDIA GPU and DOCA drivers to support both AI training and accelerated networking. What best practice ensures successful installation and full hardware capability?

Options:

A.

Download and install only the specific versions of GPU and DOCA drivers listed as compatible with the current OS and hardware.

B.

Apply legacy drivers for hardware released within the last two years to maintain maximum compatibility across versions.

C.

Install the latest available drivers directly from the NVIDIA website.

D.

Use the default drivers provided by the Linux distribution, unless an installation fails during system boot.

Questions 6

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?

Options:

A.

Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status, and reissue update commands if any firmware appears inactive afterward.

B.

Execute a single AC power cycle on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node for confirmation of all component updates.

C.

Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.

D.

Initiate a cold power cycle on the system to activate firmware for components, reset the BMC using the recommended command, and perform an AC power cycle to ensure EROT and CPLD firmware is activated.

Questions 7

A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?

Options:

A.

Run a deep learning workload to stress test the GPUs and check whether the issue persists.

B.

Check the NVIDIA System Management Interface (nvidia-smi) for GPU status and temperatures.

C.

Power-drain the DGX, restart it, and check whether the performance degradation resolves.

D.

Increase the fan speed to maximum and check whether the performance improves.

Questions 8

A system administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. What command achieves this?

Options:

A.

esxcli system module parameters set -m nvidia -p

B.

esxcli -i 0 -mig 18

C.

nvidia-smi -i 0 -mig 1

D.

mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2

Questions 9

A leaf switch shows "FW Version Mismatch" alerts for transceivers after cluster expansion. Which tool validates transceiver firmware against expected versions?

Options:

A.

flint

B.

iblinkinfo

C.

mlxconfig

D.

ethtool

Questions 10

During a 48-hour NeMo question-answering model burn-in test, GPU memory errors occur when processing large datasets. Which configuration strategy prevents Out-of-Memory (OOM) errors while maintaining processing efficiency?

Options:

A.

Set blocksize="1GB" for data loading and enable RMM asynchronous allocation.

B.

Switch from FP16 to FP32 precision for numerical stability.

C.

Disable add_filename for Parquet files to reduce metadata.

D.

Increase files_per_partition to 1000 for larger batch processing.

Questions 11

A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?

Options:

A.

Navigate to "Devices" > select a switch > "Cables" tab to see ASIC firmware and transceiver versions.

B.

Use "Topology" view to visually inspect cable icons.

C.

Run mlxlink -d lid-<LID> -m on each port manually.

D.

Export all switch logs and grep for "FW Version".

Questions 12

An AI training cluster with NVIDIA GPUs experiences prolonged data loading times during checkpoint reloading, causing GPUs to idle frequently. CPU utilization during data transfers remains high. Which solution most effectively optimizes storage-to-GPU throughput while reducing CPU overhead?

Options:

A.

Increase batch sizes to reduce the frequency of storage access.

B.

Migrate datasets to SATA SSDs with RAID 0 for higher sequential read speeds.

C.

Add more GPUs to the cluster to parallelize data loading tasks.

D.

Implement GPUDirect Storage to enable direct data transfers.

Questions 13

An infrastructure engineer in an AI factory has successfully replaced a power supply unit on an NVIDIA DGX H100. After installation, both the IN and OUT LEDs on the new power supply illuminate solid green. Which NVSM CLI command should the engineer use to quickly verify the overall system status and ensure it is operating as expected?

Options:

A.

nvsm show power

B.

nvsm show powermode

C.

nvsm show health

D.

nvsm show alerts

Questions 14

When configuring an out-of-core HPL burn-in for a 40B matrix on 8x H100 nodes, which environment variable prevents GPU out-of-memory errors while reserving space for drivers?

Options:

A.

export HPL_OOC_SAFE_SIZE=4.0

B.

export HPL_OOC_MODE=0

C.

export HPL_OOC_NUM_STREAMS=8

D.

export HPL_OOC_MAX_GPU_MEM=90

Questions 15

What is the primary purpose of running an NCCL burn-in test on a new GPU cluster?

Options:

A.

To test whether GPUs are properly detected by the operating system and have the correct drivers installed.

B.

To maximize GPU utilization for machine learning workloads and automatically tune deep learning frameworks.

C.

To detect and resolve hardware or interconnect issues before production by stressing GPU communication links.

D.

To benchmark application-specific runtime performance of AI models using real user data and production training scripts.

Questions 16

To validate bisectional bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?

Options:

A.

NCCL_TESTS_SPLIT="OR 0x7" ./all_reduce_perf -g 8

B.

Run without splits and analyze per-rack averages.

C.

NCCL_TESTS_SPLIT="MOD 2" ./all_reduce_perf -g 8

D.

NCCL_TESTS_SPLIT="DIV 8" ./all_reduce_perf -g 1
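
The split expressions in the options above decide which ranks end up in the same communicator. The following is a minimal sketch of that grouping, assuming the semantics described in the nccl-tests README, where each rank's communicator "color" is its rank combined with the value through AND, OR, MOD, or DIV. The helper names `split_color` and `group_ranks` are hypothetical, and the 16-rank layout (two nodes with 8 GPUs each) is an illustrative assumption.

```python
from collections import defaultdict


def split_color(rank: int, spec: str) -> int:
    """Return the communicator color for a rank under a split spec.

    Assumed semantics (from the nccl-tests README): color = rank <op> value.
    """
    op, value = spec.split()
    n = int(value, 0)  # base 0 accepts both "8" and "0x7"
    if op == "AND":
        return rank & n
    if op == "OR":
        return rank | n
    if op == "MOD":
        return rank % n
    if op == "DIV":
        return rank // n
    raise ValueError(f"unsupported split operation: {op}")


def group_ranks(num_ranks: int, spec: str) -> dict:
    """Group ranks 0..num_ranks-1 by color; each group is one communicator."""
    groups = defaultdict(list)
    for rank in range(num_ranks):
        groups[split_color(rank, spec)].append(rank)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical layout: ranks 0-7 on node 0, ranks 8-15 on node 1.
    for spec in ("OR 0x7", "MOD 2", "DIV 8"):
        print(spec, group_ranks(16, spec))
```

Printing the groups for a given layout makes it easy to see whether a split keeps traffic inside a node or forces it across the fabric.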

Questions 17

ClusterKit’s NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Options:

A.

Critical failure; expected is greater than 390 GB/s for HDR InfiniBand.

B.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

D.

Inconclusive; rerun with --stress=cpu to validate.
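
Reading a bandwidth figure like this one depends on units: link speeds are quoted in gigabits per second, while the test result is in gigabytes per second. The sketch below only does that conversion and compares a measured value with the raw aggregate line rate. The function names are hypothetical, the count of 8 links per node is an assumption that the question does not state, and protocol overhead is ignored.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8.0


def fraction_of_line_rate(measured_gbyte_per_s: float,
                          link_gbit_per_s: float,
                          num_links: int) -> float:
    """Measured bandwidth as a fraction of the raw aggregate line rate."""
    aggregate_gbyte_per_s = gbit_to_gbyte(link_gbit_per_s) * num_links
    return measured_gbyte_per_s / aggregate_gbyte_per_s


if __name__ == "__main__":
    # Assumption for illustration: 8 links of 400 Gb/s per node.
    print(gbit_to_gbyte(400))
    print(fraction_of_line_rate(350, 400, 8))
```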

Questions 18

An administrator needs to verify HA functionality after configuring BCM (Bright Cluster Manager). Which command confirms the active head node and failover readiness?

Options:

A.

cmsh status to check HA status and active/standby roles.

B.

nvsm show health to validate GPU status on both head nodes.

C.

systemctl restart cmdaemon to force a failover test.

D.

ping <secondary-head-node-ip> to test basic connectivity.

Questions 19

A system administrator wants to configure MIG for seven slices on an H100 GPU in an NVIDIA HGX system. Which command should be used?

Options:

A.

mig-parted

B.

nvidia-smi

C.

nvcc

D.

nvlink-config

Questions 20

ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Options:

A.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

B.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.

Critical failure; expected is > 390 GB/s for HDR InfiniBand.

D.

Inconclusive; rerun with --stress=cpu to validate.

Questions 21

An engineer must ensure that a BlueField-3 NIC firmware download matches the cluster’s PSID. Which step is critical before installation?

Options:

A.

Check that the DPU’s BMC IP is reachable by ping.

B.

Confirm that the firmware file size matches the DPU’s flash capacity.

C.

Use mstflint -d <PCI_ID> query to validate the device PSID before selecting the firmware image.

D.

Verify that the SHA256 hash of the firmware matches NVIDIA’s public ledger.

Questions 22

Which function is used to collect the cluster counters information?

Options:

A.

SM

B.

PM

C.

GM

D.

FM

Questions 23

You are responsible for ensuring interoperability between AI applications deployed across a diverse IT landscape, including an on-premises data center equipped with NVIDIA GPUs and multiple cloud platforms from different vendors. These environments need to support complex AI workflows that involve large-scale data processing, real-time analytics, and machine learning model training. To maintain consistent performance and flexibility, which strategy should you prioritize?

Options:

A.

Choose one vendor and standardize on one storage solution across all environments to simplify management and improve interoperability.

B.

Implement a multi-cloud strategy that uses only native storage solutions in each cloud platform while relying on middleware to ensure interoperability and data consistency.

C.

Ensure that all environments use compatible storage protocols and APIs, such as NFS or S3, to facilitate data exchange and integration across platforms.

D.

Focus only on increasing network bandwidth between locations to reduce latency and improve data transfer speeds.

Questions 24

A system administrator needs to install a container toolkit and successfully run the following commands:

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime docker

What step should be taken next to finish the installation?

Options:

A.

dpkg -i doca-host-repo-ubuntu<version>_amd64.deb

B.

apt-get install cuda-drivers

C.

systemctl restart docker

D.

apt-get remove nvidia-container-toolkit

Questions 25

A customer has just completed the first boot of their DGX system and is prompted to create an administrative user. What is the correct approach for setting up this user to ensure secure BMC and GRUB access?

Options:

A.

Create a unique, strong, lower-case username and password that will be used for both BMC and GRUB access, avoiding default or weak credentials.

B.

Create separate usernames for BMC and GRUB to maximize flexibility.

C.

Skip the creation of a new user and retain the default admin account for BMC and GRUB access.

D.

Use “sysadmin” as the username and a simple password for ease of management.

Questions 26

During a 72-hour HPL burn-in test on a DGX H100 cluster, one node shows a 15% performance drop after 48 hours. What are the two most likely causes and diagnostic steps?

Pick the 2 correct responses below.

Options:

A.

MPI configuration error; rerun with --cpu-affinity adjustments.

B.

Network packet loss; analyze ibdiagnet reports.

C.

Thermal throttling due to cooling issues; check nvidia-smi dmon.

D.

Memory corruption; reboot the node and reduce problem size N.

Questions 27

After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?

Options:

A.

The BCM license expired after HA configuration.

B.

Network connectivity issues between the primary and secondary head nodes.

C.

The secondary head node lacks NVIDIA GPU drivers.

D.

The cluster nodes are powered on during the HA configuration.

Questions 28

A healthcare organization is deploying an AI system to analyze patient data for predictive diagnostics. The system must comply with strict data protection regulations such as HIPAA, ensuring that sensitive information remains confidential and secure. Considering the need for robust security measures, which combination of strategies should the organization prioritize to protect against data breaches and ensure regulatory compliance?

Options:

A.

Deploy data masking to obscure sensitive data during processing and use role-based access control (RBAC) to limit data access based on user roles.

B.

Use tokenization to replace sensitive data with non-sensitive tokens and employ multi-factor authentication (MFA) for system access.

C.

Implement symmetric encryption for all data at rest and rely solely on password-based access controls.

D.

Rely on asymmetric encryption for all communications and use data deduplication to minimize storage costs without additional security measures.

Questions 29

After a firmware upgrade on a DGX H100, the administrator notices that one GPU is not detected by the system. Which troubleshooting step should be performed first to identify the root cause?

Options:

A.

Review firmware update logs and run nvsm show health to check for hardware or firmware errors on the affected GPU.

B.

Remove the GPU from the system and replace it with a new one before any diagnostics.

C.

Ignore the issue and proceed with production workloads if the other GPUs are operational.

D.

Immediately re-run the firmware upgrade on all system components.

Questions 30

You are validating the environment of an NVIDIA GPU-accelerated data center during post-deployment checks. Which one action is essential to confirm that power and cooling are sufficient for the stable operation of NVIDIA DGX H100 systems?

Options:

A.

Confirm the system fans are running at 100% under all workloads to prevent overheating.

B.

Review the system BIOS to ensure GPU overclocking is enabled for maximum performance.

C.

Use NVSM to disable unused PCIe devices to reduce overall system heat output.

D.

Verify that each DGX system is connected to redundant, properly rated PDUs and that all power supplies are reporting nominal input.

Questions 31

Refer to the output:

~$ sudo nvsm show healthinfo

Timestamp: Sat Dec 16 16:26:32 2017 -0800

Version: 17.12-5

Checks

BIOS Revision [5.11].........................

DGX Serial Number [YSY72800016]..................

Verify installed DIMM memory sticks........................Healthy

...[output truncated]

Verify Ethernet controllers...........................Healthy

Verify installed GPU's..............................Unhealthy

Checking output of 'lspci' for expected GPU's

Missing GPU at PCI address '07:00.0'

Verify installed InfiniBand controllers....................Healthy

Verify PCIe switches..................................Healthy

...[output truncated]

What insights can a system administrator gain regarding the DGX system's health?

Options:

A.

A GPU tray upgrade failed.

B.

A GPU is missing on the DGX system.

C.

A GPU driver upgrade has failed.

D.

The system has passed the hardware health check successfully.
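
A report like the excerpt in this question can also be scanned programmatically. The sketch below is one possible approach, not an NVSM feature: it works on a cleaned-up copy of the lines shown above, and the names `CHECK_RE`, `MISSING_RE`, and `summarize` are hypothetical. It assumes each check line ends with a run of dots followed by Healthy or Unhealthy, as in the excerpt.

```python
import re

# Cleaned-up copy of the check lines from the excerpt in the question.
SAMPLE = """\
Verify installed DIMM memory sticks........................Healthy
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
"""

# A check line: description, a run of dots, then the status word.
CHECK_RE = re.compile(r"^(?P<check>.+?)\.{2,}(?P<status>Healthy|Unhealthy)$")
# A detail line naming the PCI address of a GPU that was not found.
MISSING_RE = re.compile(r"Missing GPU at PCI address '(?P<addr>[0-9A-Fa-f:.]+)'")


def summarize(report: str):
    """Return ({check name: status}, [missing GPU PCI addresses])."""
    statuses = {}
    missing = []
    for raw in report.splitlines():
        line = raw.strip()
        check = CHECK_RE.match(line)
        if check:
            statuses[check.group("check")] = check.group("status")
            continue
        gone = MISSING_RE.search(line)
        if gone:
            missing.append(gone.group("addr"))
    return statuses, missing


if __name__ == "__main__":
    statuses, missing = summarize(SAMPLE)
    unhealthy = [name for name, state in statuses.items() if state == "Unhealthy"]
    print("Unhealthy checks:", unhealthy)
    print("Missing GPU addresses:", missing)
```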

Questions 32

An administrator needs to add additional GPUs to an existing server. What are the server requirements to check before installing new GPUs?

Options:

A.

Sufficient networking, water-cooled racks, adequate rack power, sufficient storage, and rack space.

B.

Sufficient storage, sufficient networking, adequate rack power, and compatible hardware.

C.

Sufficient CPU capacity, PCIe slot allocation, sufficient cooling in the data center, and rack space.

D.

Sufficient cooling in the data center, adequate rack power, compatible hardware, and PCIe slot allocation.

Questions 33

You are tasked with setting up High Availability (HA) for NVIDIA Base Command Manager (BCM) in a new GPU cluster. The cluster consists of a primary head node, a secondary head node, and several compute nodes. The requirements are automatic failover of BCM services, minimal disruption to workloads, and proper cluster health monitoring during and after installation. During your BCM HA installation and configuration process, which two of the following actions are mandatory for ensuring a robust and verified HA cluster configuration?

Pick the 2 correct responses below.

Options:

A.

Assign a floating Virtual IP address that can automatically migrate between the primary and secondary head nodes during failover.

B.

Compute nodes must be powered on and performing work to initiate synchronization of the head nodes.

C.

After configuration is complete, simulate a failover by stopping BCM services on the active head node to verify that all services are running on the secondary node with no interruption.

D.

Configure both head nodes to use independent static IP addresses for BCM services instead of relying on a shared virtual IP address.

E.

During configuration, explicitly synchronize both the configuration and state data directories from the primary to the secondary head node to ensure consistency.

Questions 34

An administrator needs to perform a comprehensive pre-production stress test on a DGX H100 system. Which command validates GPU, CPU, memory, and storage components while following NVIDIA’s recommended procedure?

Options:

A.

nvidia-smi -q | grep "GPU Stress Test"

B.

sudo nvsm stress-test --force

C.

stress --cpu $(nproc) --io $(nproc) --timeout 600

D.

./gpu_burn 60

Questions 35

You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize to develop more accurate energy-efficiency metrics?

Options:

A.

Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.

B.

Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.

C.

Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.

D.

Use watts-used as the primary measure of efficiency, as it accurately reflects the power input at any given time.

Questions 36

The system administrator plans to use Multi-Instance GPU profiles. What command should be used to verify that the GPU has this mode enabled?

Options:

A.

nvidia-mode

B.

nvidia-mig

C.

nvidia-enable

D.

nvidia-smi
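
For scripted checks, the MIG mode of each GPU can be read through the query interface of nvidia-smi. This is a minimal sketch, assuming the `mig.mode.current` query field and CSV output without a header; the helper names `parse_mig_modes` and `read_mig_modes` are hypothetical, and the sample text in the usage section is illustrative rather than captured from a real system.

```python
import subprocess

# Query the GPU index and current MIG mode, one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,mig.mode.current",
    "--format=csv,noheader",
]


def parse_mig_modes(csv_text: str) -> dict:
    """Map GPU index to the reported MIG mode string."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, mode = (field.strip() for field in line.split(","))
        modes[int(index)] = mode
    return modes


def read_mig_modes() -> dict:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_mig_modes(result.stdout)


if __name__ == "__main__":
    # Illustrative sample only; on a real host call read_mig_modes() instead.
    print(parse_mig_modes("0, Enabled\n1, Disabled\n"))
```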

Exam Code: NCP-AII
Exam Name: NVIDIA AI Infrastructure
Last Update: Jun 25, 2026
Questions: 123