Status: Open
Labels: enhancement (New feature or request)
Description
Environment
• OS: Ubuntu 24.04.2 LTS
• Hardware: Dual NVIDIA RTX 4090 GPUs (24GB VRAM each), 64+ CPU cores
• Software:
a3fe: 0.33
GROMACS (compiled with CUDA)
GROMACS version: 2025.1
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NBNxM GPU setup: super-cluster 2x2x2 / cluster 8 (cluster-pair splitting on)
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.10-sse2-avx-avx2-avx2_128
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 13.3.0
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 13.3.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict SHELL:-fopenmp -O3 -DNDEBUG
BLAS library: Internal
LAPACK library: Internal
CUDA compiler: /usr/local/cuda-12.6/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2024 NVIDIA Corporation;Built on Tue_Oct_29_23:50:19_PDT_2024;Cuda compilation tools, release 12.6, V12.6.85;Build cuda_12.6.r12.6/compiler.35059454_0
CUDA compiler flags: -O3 -DNDEBUG
CUDA driver: 12.60
CUDA runtime: 12.60
• cat run_somd.sh
#!/bin/bash
#SBATCH -o somd-array-gpu-%A.%a.out
#SBATCH -n 1
#SBATCH --time 24:00:00
#SBATCH --gres=gpu:1
lam=$1
echo "lambda is: " $lam
srun somd-freenrg -C somd.cfg -l $lam -p CUDA
• a3fe script: run_a3fe.py
import a3fe as a3
calc = a3.Calculation(ensemble_size = 5)
calc.setup()
# Get optimised lambda schedule with thermodynamic speed
# of 2 kcal mol-1
calc.get_optimal_lam_vals(delta_er = 2)
# Run adaptively with a runtime constant of 0.0005 kcal**2 mol-2 ns**-1
# Note that automatic equilibration detection with the paired t-test
# method will also be carried out.
calc.run(adaptive=True, runtime_constant = 0.0005)
calc.wait()
calc.analyse()
calc.save()
Observed Behavior
• When a3fe enters the ensemble equilibration step, the GPU load drops sharply.
Check the Slurm job:
scontrol show jobs 46054
JobId=46054 JobName=ensemble_equil_bound.sh
UserId=gkxiao(997) GroupId=gkxiao(984) MCS_label=N/A
Priority=1 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:36:28 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2025-06-12T15:14:11 EligibleTime=2025-06-12T15:14:11
AccrueTime=2025-06-12T15:14:11
StartTime=2025-06-12T15:14:12 EndTime=2025-06-13T15:14:12 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-06-12T15:14:12 Scheduler=Main
Partition=batch AllocNode:Sid=master:250144
ReqNodeList=(null) ExcNodeList=(null)
NodeList=master
BatchHost=master
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=1M,node=1,billing=1,gres/gpu=1
AllocTRES=cpu=1,node=1,billing=1,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/ensemble_equil_bound.sh
WorkDir=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2
StdErr=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/somd-array-gpu-46054.4294967294.out
StdIn=/dev/null
StdOut=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/somd-array-gpu-46054.4294967294.out
Power=
TresPerNode=gres/gpu:1
Check the Slurm batch script:
cat bound/ensemble_equilibration_2/ensemble_equil_bound.sh
#!/bin/bash
#SBATCH -o somd-array-gpu-%A.%a.out
#SBATCH -n 1
#SBATCH --time 24:00:00
#SBATCH --gres=gpu:1
python -c 'from a3fe.run.system_prep import slurm_ensemble_equilibration_bound; slurm_ensemble_equilibration_bound()'
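The job above was allocated a single CPU and one GPU (`AllocTRES=cpu=1,...,gres/gpu=1`), which is useful to keep in mind when reading the CPU usage reported further down. As a minimal sketch, the allocation can be pulled out of `scontrol show job` output like this; the helper name `parse_alloc_tres` is hypothetical, and the sample text is copied from the output above.

```python
import re


def parse_alloc_tres(scontrol_output: str) -> dict:
    """Return the AllocTRES field of `scontrol show job` output as a dict."""
    match = re.search(r"AllocTRES=(\S+)", scontrol_output)
    if match is None:
        return {}
    tres = {}
    for item in match.group(1).split(","):
        # Split on the last "=" so keys such as "gres/gpu" stay intact.
        key, _, value = item.rpartition("=")
        tres[key] = value
    return tres


# Two lines copied from the scontrol output for job 46054.
sample = (
    "ReqTRES=cpu=1,mem=1M,node=1,billing=1,gres/gpu=1\n"
    "AllocTRES=cpu=1,node=1,billing=1,gres/gpu=1"
)
alloc = parse_alloc_tres(sample)
print(f"CPUs allocated: {alloc['cpu']}, GPUs allocated: {alloc['gres/gpu']}")
```

On a live system the same function could be fed the text returned by `scontrol show job <jobid>`.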
• Two gmx mdrun processes each consuming ~32 CPU cores (3245% CPU usage via top).
Tasks: 1511 total, 4 running, 1504 sleeping, 0 stopped, 3 zombie
%Cpu(s): 50.9 us, 0.1 sy, 0.0 ni, 49.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 6.1/257752.1 [|||||| ]
MiB Swap: 0.0/8192.0 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2909717 gkxiao 20 0 9953.0m 372272 145244 R 3245 0.1 6,09 /usr/local/gromacs/bin/gmx mdrun -deffnm gromacs -c /public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/gromacs_out.gro
2909711 gkxiao 20 0 9941.9m 292484 142192 R 3242 0.1 6,21 /usr/local/gromacs/bin/gmx mdrun -deffnm gromacs -c /public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_1/gromacs_out.gro
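To make the mismatch concrete, the `%CPU` column can be converted to logical CPUs. This is a rough sketch that assumes `top` is in its default Irix mode, where 100% corresponds to one fully used logical CPU. It also assumes both jobs were submitted with the same one-CPU request; only job 46054 is shown above.

```python
def cores_in_use(top_cpu_percent: float) -> float:
    """Convert a per-process %CPU value from top (Irix mode) to logical CPUs."""
    return top_cpu_percent / 100.0


# PID -> %CPU, copied from the top output above.
top_percentages = {2909717: 3245.0, 2909711: 3242.0}

# NumCPUs=1 in the scontrol output for job 46054 (assumed identical for the other job).
allocated_cpus_per_job = 1

for pid, percent in top_percentages.items():
    used = cores_in_use(percent)
    print(f"PID {pid}: ~{used:.1f} logical CPUs used, {allocated_cpus_per_job} allocated")

total_used = sum(cores_in_use(p) for p in top_percentages.values())
print(f"Total: ~{total_used:.1f} logical CPUs across both processes")
```

Each process is therefore using roughly 32 times the CPU count that Slurm allocated to its job.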
• Both GPUs sit at 1% utilization with minimal VRAM usage (392 MiB of ~24 GiB per gmx process, via nvidia-smi).
Thu Jun 12 15:20:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 D On | 00000000:01:00.0 Off | Off |
| 30% 48C P0 64W / 425W | 415MiB / 24564MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 D On | 00000000:41:00.0 Off | Off |
| 30% 53C P0 61W / 425W | 415MiB / 24564MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4636 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2909711 C /usr/local/gromacs/bin/gmx 392MiB |
| 1 N/A N/A 4636 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2909717 C /usr/local/gromacs/bin/gmx 392MiB |
+-----------------------------------------------------------------------------------------+
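For monitoring this over the course of a run, the same numbers can be collected in machine-readable form with `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`. The sketch below is illustrative only: the function names and the 10% idle threshold are assumptions, and the sample string simply restates the values from the table above.

```python
import subprocess

QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse CSV rows produced with --format=csv,noheader,nounits."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {
                "index": int(index),
                "util_percent": int(util),
                "mem_used_mib": int(used),
                "mem_total_mib": int(total),
            }
        )
    return rows


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


# Values restated from the nvidia-smi table above.
sample = "0, 1, 415, 24564\n1, 1, 415, 24564\n"
idle_gpus = [gpu["index"] for gpu in parse_gpu_csv(sample) if gpu["util_percent"] < 10]
print("GPUs below 10% utilization:", idle_gpus)
```

Calling `query_gpus()` periodically during the ensemble equilibration step would show whether utilization stays near 1% for the whole stage or only for part of it.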
Expected Outcome
Having the ensemble equilibration step make proper use of the GPUs and of the CPU allocation should:
• Raise GPU utilization to >90%.
• Reduce CPU core usage per process to <16 cores, balancing workload.
• Improve simulation throughput by 5–10× based on GROMACS benchmarks.
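One possible way to work toward these targets is to pass explicit thread and offload options to `gmx mdrun` instead of the bare `gmx mdrun -deffnm gromacs -c ...` call seen in `top`. The sketch below only builds such a command line. The flags (`-ntmpi`, `-ntomp`, `-nb`, `-pme`, `-bonded`, `-pin`, `-gpu_id`) are standard `gmx mdrun` options, but the helper `build_mdrun_command`, the thread count of 12, and the idea that a3fe could accept such a command are all assumptions made for illustration. Whether each offload option is supported depends on the simulation settings, so this is not a confirmed fix.

```python
def build_mdrun_command(deffnm="gromacs", ntomp=12, gpu_id=None):
    """Build a gmx mdrun command with explicit thread and GPU offload options.

    Hypothetical helper: the defaults are illustrative, not tuned values.
    """
    cmd = [
        "gmx", "mdrun",
        "-deffnm", deffnm,
        "-ntmpi", "1",          # one thread-MPI rank per process
        "-ntomp", str(ntomp),   # cap OpenMP threads (target: fewer than 16 cores)
        "-nb", "gpu",           # nonbonded interactions on the GPU
        "-pme", "gpu",          # PME long-range electrostatics on the GPU
        "-bonded", "gpu",       # bonded interactions on the GPU
        "-pin", "on",           # pin threads to cores
    ]
    if gpu_id is not None:
        cmd += ["-gpu_id", str(gpu_id)]  # restrict the run to one GPU
    return cmd


print(" ".join(build_mdrun_command(gpu_id=0)))
```

If something like this were adopted, the Slurm request would also need to match the thread count, for example with `#SBATCH --cpus-per-task`, since the current script requests a single CPU.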