- Ubuntu 22.04 or 24.04
- NVIDIA Driver
- Docker
- NVIDIA Container Toolkit
Follow this post for the installation instructions.
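Before building anything, it can help to verify that the NVIDIA driver, Docker, and the NVIDIA Container Toolkit work together by running nvidia-smi inside a throwaway CUDA container. This is only a sanity-check sketch; the CUDA base image tag below is an example, any recent tag works:

```sh
# Driver check on the host
nvidia-smi
# GPU passthrough check via the NVIDIA Container Toolkit
# (the base image tag is only an example)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```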
```sh
git clone https://github.com/j3soon/hpc-samples.git
cd hpc-samples
```

We use the nvidia/nvhpc NGC image as the base image. See the documentation for more details.
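To see exactly which nvidia/nvhpc NGC tag each image builds on, you can inspect the FROM lines of the Dockerfiles (paths follow the build commands below):

```sh
grep '^FROM' src/Dockerfile_*
```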
Build the docker images:

```sh
cd src
docker build -f Dockerfile_cuda13.0 -t j3soon/hpc-samples:nvhpc-25.9-devel-cuda13.0-ubuntu24.04 .
docker build -f Dockerfile_cuda12.9 -t j3soon/hpc-samples:nvhpc-25.7-devel-cuda12.9-ubuntu24.04 .
docker build -f Dockerfile_cuda12.4 -t j3soon/hpc-samples:nvhpc-24.5-devel-cuda12.4-ubuntu22.04 .
```

Then start a container with the image matching your CUDA version:

```sh
docker run --rm -it --gpus all -v $PWD:/app j3soon/hpc-samples:nvhpc-25.9-devel-cuda13.0-ubuntu24.04
docker run --rm -it --gpus all -v $PWD:/app j3soon/hpc-samples:nvhpc-25.7-devel-cuda12.9-ubuntu24.04
docker run --rm -it --gpus all -v $PWD:/app j3soon/hpc-samples:nvhpc-24.5-devel-cuda12.4-ubuntu22.04
```

To compile, run, and clean the built-in examples at /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples (replace 25.9 with the NVHPC version of your image, e.g., 25.7 or 24.5), you can use the following commands:

```sh
# C++ Standard Parallelism
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/stdpar/stdblas
make all
# OpenACC Examples
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/OpenACC/samples
make all
# OpenMP Examples
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/OpenMP
make all
# CUDA-Libraries Examples
# - cuBLAS
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/CUDA-Libraries/cuBLAS
make all
# - cuFFT
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/CUDA-Libraries/cuFFT
make all
# - cuRAND
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/CUDA-Libraries/cuRAND
make all
# - cuSPARSE
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/CUDA-Libraries/cuSPARSE
make all
# - thrust
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/CUDA-Libraries/thrust
make all
# MPI Examples
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/MPI
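# Open MPI refuses to run as root by default; the container runs as root,
# so explicitly allow it before launching the MPI examples.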
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
make all
# CUDA-Fortran Examples
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/CUDA-Fortran/CUDA-Fortran-Book
make all
# AutoPar Examples
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/AutoPar
make all
# F2003 Examples
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/F2003
make all
# NVLAmath Examples
cd /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/examples/NVLAmath
make all
```

NVIDIA/cuda-samples has been pre-built and included in the docker image at /workspace/cuda-samples. For example, to run the deviceQuery example, you can run the following command:

```sh
/workspace/cuda-samples/build/Samples/1_Utilities/deviceQuery/deviceQuery
```

or the p2pBandwidthLatencyTest example to test GPU-to-GPU communication:

```sh
/workspace/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest
```

See the full list of examples here.
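Since the binaries are already built, a quick way to see everything that is available (assuming the build layout shown above) is:

```sh
find /workspace/cuda-samples/build/Samples -maxdepth 3 -type f -executable | sort
```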
If you are using a custom docker image, follow the official instructions:
```sh
git clone https://github.com/NVIDIA/cuda-samples
cd cuda-samples
git checkout v13.0  # Replace with the CUDA version matching your image
mkdir build && cd build
cmake ..
make -j$(nproc)
```

You might also need to set `CUDA_PATH` and `LIBRARY_PATH` according to your environment if the build fails.
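For example, assuming a standard toolkit layout under /usr/local/cuda (an assumption; adjust to your environment), the variables can be set before re-running the build:

```sh
# Illustrative paths only; point them at your actual CUDA installation.
export CUDA_PATH=/usr/local/cuda
export LIBRARY_PATH=$CUDA_PATH/lib64:$LIBRARY_PATH
cmake .. && make -j$(nproc)
```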
NVIDIA/nccl-tests has been pre-built and included in the docker image at /workspace/nccl-tests. For example, to run the all_reduce_perf test, you can run the following command:
```sh
cd /workspace/nccl-tests
# single node 8 GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# two nodes, 16 GPUs (8 ranks per node, one GPU per rank)
mpirun -np 16 -N 8 ./build/all_reduce_perf_mpi -b 8 -e 8G -f 2 -g 1
```

or with Slurm:

```sh
# Enroot+Pyxis
srun -N 2 --ntasks-per-node=8 --mpi=pmix \
--container-image=j3soon/hpc-samples:nvhpc-24.5-devel-cuda12.4-ubuntu22.04 \
/usr/local/bin/hpcx-entrypoint.sh \
/workspace/nccl-tests/build/all_reduce_perf_mpi -b 8 -e 8G -f 2 -g 1
# Apptainer/Singularity (To be confirmed)
singularity pull docker://j3soon/hpc-samples:nvhpc-24.5-devel-cuda12.4-ubuntu22.04
singularity build --sandbox hpc-samples-cuda12/ hpc-samples_nvhpc-24.5-devel-cuda12.4-ubuntu22.04.sif
srun -N 2 --ntasks-per-node 8 --mpi=pmix --gres=gpu:8 \
singularity exec --nv hpc-samples-cuda12/ \
/usr/local/bin/hpcx-entrypoint.sh \
/workspace/nccl-tests/build/all_reduce_perf_mpi -b 8 -e 8G -f 2 -g 1
```

or with debug flags:

```sh
cd /workspace/nccl-tests
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

If you are using a custom docker image, follow the official instructions:
```sh
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests
git checkout v2.17.6  # Replace with the NCCL version matching your image
make -j$(nproc)
make -j$(nproc) MPI=1 NAME_SUFFIX=_mpi
```

You might also need to set `CUDA_HOME`, `NCCL_HOME`, and `MPI_HOME` according to your environment if the build fails.
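For example, assuming typical install locations (illustrative only; adjust to where CUDA, NCCL, and MPI live in your image), these can be exported before re-running the build:

```sh
# Illustrative paths only; adjust to your environment.
export CUDA_HOME=/usr/local/cuda
export NCCL_HOME=/usr
export MPI_HOME=/usr/local/mpi
make -j$(nproc) MPI=1 NAME_SUFFIX=_mpi
```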
NVIDIA/nvbandwidth has been pre-built and included in the docker image at /workspace/nvbandwidth. For example, to run the nvbandwidth tool, you can run the following command:
```sh
cd /workspace/nvbandwidth
./nvbandwidth
```

or in verbose mode:

```sh
./nvbandwidth -v
```

or a single test case:

```sh
./nvbandwidth -t device_to_device_memcpy_read_ce
```

or the multi-node version:

```sh
cd /workspace/nvbandwidth_mpi
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
mpirun -n 4 ./nvbandwidth -p multinode
```

If you are using a custom docker image, follow the official instructions:
```sh
git clone https://github.com/NVIDIA/nvbandwidth
cd nvbandwidth
git checkout v0.8  # Replace with the NVBandwidth version matching your image
cp -r . ../nvbandwidth_mpi
apt-get update && apt-get install -y libboost-program-options-dev
cmake .
make -j$(nproc)
cd ../nvbandwidth_mpi
cmake -DMULTINODE=1 .
make -j$(nproc)
```
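Whichever build you use, you can list the available test cases before running them (the --list flag comes from the upstream nvbandwidth documentation; verify it against your version):

```sh
cd /workspace/nvbandwidth
./nvbandwidth --list
```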
NVIDIA/CUDALibrarySamples is not yet included.
NVIDIA/compute-sanitizer-samples is not yet included.
NVIDIA/multi-gpu-programming-models is not yet included.
Use the nvidia-smi tool to query GPU status.
Check the peer-to-peer (P2P) status between the local GPUs:
```sh
nvidia-smi topo -p2p n
```

Show the topology connections and affinities matrix between the GPUs and NICs in the system:
```sh
nvidia-smi topo -m
```

Use compute-sanitizer to detect CUDA errors:
```sh
compute-sanitizer ./a.out
```

See nsight-guided-profiling.md for more details.
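For a concrete example of what compute-sanitizer reports, you can compile a deliberately buggy kernel and run it under the sanitizer. This is a minimal sketch; the file name and code are illustrative only:

```sh
cat > oob.cu <<'EOF'
#include <cstdio>

// Kernel with a missing bounds check: threads with i >= n write out of bounds.
__global__ void oob(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  data[i] = i;
}

int main() {
  const int n = 100;
  int *d_data;
  cudaMalloc(&d_data, n * sizeof(int));
  oob<<<1, 128>>>(d_data, n);  // 128 threads, but only 100 elements allocated
  cudaDeviceSynchronize();
  cudaFree(d_data);
  std::printf("done\n");
  return 0;
}
EOF
nvcc -o oob oob.cu
compute-sanitizer ./oob  # memcheck (the default tool) reports the invalid writes
```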