forked from CPMD-code/CPMD
Experimental gpu bernd2 #7
Open: kloeffel wants to merge 217 commits into OpenCPMD:main from kloeffel:experimental_gpu_bernd2
…xHost and zmm-usage=high
nplist/nolist not even used, so remove them
allgather bandwidth; added some more interfaces: MPI allreduce with MPI_MIN, MPI reduce with MPI_IN_PLACE, and in-place allgatherv routines (sketched below)
Regtests are performed for both MPI and MPI+OMP with debug INTEL-XHOST-IFORT-MPI and the environment settings:
export I_MPI_CBWR=2
export MKL_CBWR=COMPATIBLE
export MKL_DYNAMIC=false
export OMP_DYNAMIC=false
export KMP_DETERMINISTIC_REDUCTION=true
rsync -a --include="*/" --include="*html" --include="*out" --exclude="*"
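The interfaces listed above map onto standard MPI patterns; the sketch below shows them with plain MPI Fortran calls rather than the CPMD wrapper layer, and all subroutine and variable names are chosen for illustration only.

```fortran
! Three independent patterns shown back to back, not a meaningful sequence.
! buf, counts and displs are placeholder names.
subroutine demo_inplace_collectives(buf, n, counts, displs, comm)
  use mpi
  implicit none
  integer, intent(in)    :: n, comm
  real(8), intent(inout) :: buf(n)              ! local slice already stored in place
  integer, intent(in)    :: counts(*), displs(*)! per-rank element counts and offsets
  integer :: me, ierr
  real(8) :: gmin, dummy(1)

  call mpi_comm_rank(comm, me, ierr)

  ! allreduce with MPI_MIN: every rank obtains the global minimum
  call mpi_allreduce(buf(1), gmin, 1, MPI_DOUBLE_PRECISION, MPI_MIN, comm, ierr)

  ! reduce with MPI_IN_PLACE: the root accumulates into its own buffer;
  ! the receive buffer is ignored on the other ranks
  if (me == 0) then
     call mpi_reduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
  else
     call mpi_reduce(buf, dummy, n, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
  end if

  ! in-place allgatherv: each rank's slice of buf is already at its final
  ! offset; the collective only fills in the slices owned by the other ranks
  call mpi_allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, &
                      buf, counts, displs, MPI_DOUBLE_PRECISION, comm, ierr)
end subroutine demo_inplace_collectives
```

The in-place variants avoid a second full-size buffer, which matters for the large wavefunction and overlap arrays these wrappers are used for.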
Distribute atoms taking into account the number of beta projectors for effective load balancing; get rid of the problematic distribution code and use dist_entity2 for the old behavior.
# Conflicts:
#   src/SOURCES
#   src/distribution_utils.mod.F90
#   src/vdw_utils.mod.F90
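The load-balancing idea can be pictured as a greedy weighted assignment; the sketch below only illustrates that idea (balance_atoms, nproj and owner are made-up names, not the actual replacement for dist_entity2).

```fortran
! Sketch: assign each atom to the process with the smallest accumulated
! projector count, so atoms with many beta projectors are spread evenly.
subroutine balance_atoms(nproj, nat, nproc, owner)
  implicit none
  integer, intent(in)  :: nat, nproc
  integer, intent(in)  :: nproj(nat)   ! number of beta projectors per atom
  integer, intent(out) :: owner(nat)   ! process index (0-based) owning each atom
  integer :: load(0:nproc-1)
  integer :: ia, ip, ipmin

  load = 0
  do ia = 1, nat
     ! pick the currently least loaded process
     ipmin = 0
     do ip = 1, nproc - 1
        if (load(ip) < load(ipmin)) ipmin = ip
     end do
     owner(ia) = ipmin
     load(ipmin) = load(ipmin) + nproj(ia)
  end do
end subroutine balance_atoms
```

Sorting the atoms by decreasing projector count before the greedy loop would usually tighten the balance further.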
Optional flags added:
- symmetrization can be disabled via the optional flag symmetrization
- the communicator used for symmetrization can be changed via the optional flag gid
- summat_parent is obsolete via the optional flag parent
- both spins can be packed into a single MPI call via the optional flag lsd
Without the optional flags, the original behavior is restored (com=allgrp, symmetrization=.true.).
Additional routine to pack and unpack a symmetric matrix (sketched below).
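Packing only one triangle halves the data volume of the subsequent sum; a minimal sketch of such a pack/unpack pair (illustrative names, not the actual CPMD routines):

```fortran
! Sketch: pack the upper triangle of a symmetric n x n matrix into a
! buffer of n*(n+1)/2 elements before the MPI sum, and restore it after.
subroutine pack_sym(a, n, buf)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: a(n, n)
  real(8), intent(out) :: buf(n*(n+1)/2)
  integer :: i, j, k
  k = 0
  do j = 1, n
     do i = 1, j
        k = k + 1
        buf(k) = a(i, j)
     end do
  end do
end subroutine pack_sym

subroutine unpack_sym(buf, n, a)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: buf(n*(n+1)/2)
  real(8), intent(out) :: a(n, n)
  integer :: i, j, k
  k = 0
  do j = 1, n
     do i = 1, j
        k = k + 1
        a(i, j) = buf(k)
        a(j, i) = buf(k)   ! restore the symmetric counterpart
     end do
  end do
end subroutine unpack_sym
```

With the lsd flag, both spin channels could be placed back to back in one such buffer and summed in a single call, as described above.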
unpack/pack sort
provide routines to build betaprojector arrays as needed in nlforce/rnlsm1/rnlsm2/spsi
rnlsm1/rnlsm2 helper routines to simplify both routines
additional arrays: fnlgam_packed / fnl_packed / dfnl_packed; they will be used later in the uspp branch
Rewritten rnlsm1/2 routines; overlapping communication/computation is possible with an autotuning algorithm or user-defined block sizes (see the sketch below).
dfnla/fnla: special pointers for the gamma-only case that ignore the first and last dimension.
Adding reshape_inplace_r6_r4 and r5_r3 for dfnla/fnla.
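The communication/computation overlap mentioned for the rewritten rnlsm1/2 can be pictured as a pipelined loop over blocks. The sketch below uses non-blocking MPI_Iallreduce as one possible way to obtain the overlap; the block size selection (autotuned or user-defined) is not shown, and all names and the local update are placeholders rather than the actual implementation.

```fortran
! Sketch of pipelined overlap: the reduction of block ib proceeds while
! block ib+1 is being computed. blksize, nblk and fnl are placeholders.
subroutine blocked_overlap(fnl, blksize, nblk, comm)
  use mpi
  implicit none
  integer, intent(in)    :: blksize, nblk, comm
  real(8), intent(inout) :: fnl(blksize, nblk)
  integer :: ib, ierr
  integer :: req(nblk)

  do ib = 1, nblk
     ! local work for this block (stand-in for the projector DGEMMs)
     fnl(:, ib) = 2.0d0 * fnl(:, ib)
     ! start the non-blocking reduction of the finished block
     call mpi_iallreduce(MPI_IN_PLACE, fnl(1, ib), blksize, &
                         MPI_DOUBLE_PRECISION, MPI_SUM, comm, req(ib), ierr)
  end do
  call mpi_waitall(nblk, req, MPI_STATUSES_IGNORE, ierr)
end subroutine blocked_overlap
```

Smaller blocks give more overlap but more latency per message, which is exactly the trade-off an autotuner or a user-defined block size has to settle.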
provide cp_grp_redist_array
redistribution of arrays distributed along the second dimension is straightforward using allgatherv instead of allreduce (see the sketch after this block)
cp_grp_redist_array_f
redistribution of arrays distributed along the first dimension; the current implementation uses a buffer and allgather. The performance of multiple broadcasts with a custom datatype should be tested to avoid the buffer; calling redist_array_r1 multiple times should also work but is probably very inefficient.
cp_grp_get_sizes now also accepts ncpw%nhg; use the part_1d routine to avoid problems with the cp_grp redistribution routines
cp_grp_split_atoms generates a custom na mapping to distribute atoms between cp_grps
cp_grp_redist_dfnl_fnl redistributes fnl and/or dfnl arrays
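Redistribution along the second dimension maps onto contiguous column blocks, which is why a single allgatherv suffices for cp_grp_redist_array; a hedged sketch with illustrative names:

```fortran
! Sketch: each rank owns ncol_local columns of a (nrow, sum(ncols)) array.
! Fortran stores columns contiguously, so a block of columns is one
! contiguous chunk and no packing buffer is needed.
subroutine redist_cols(a_local, ncol_local, a_full, nrow, ncols, displs, nranks, comm)
  use mpi
  implicit none
  integer, intent(in)  :: nrow, ncol_local, nranks, comm
  integer, intent(in)  :: ncols(nranks), displs(nranks) ! per-rank column counts/offsets
  real(8), intent(in)  :: a_local(nrow, ncol_local)
  real(8), intent(out) :: a_full(nrow, *)
  integer :: ierr
  integer :: counts_el(nranks), displs_el(nranks)

  ! convert column counts/offsets to element counts/offsets
  counts_el = nrow * ncols
  displs_el = nrow * displs
  call mpi_allgatherv(a_local, nrow*ncol_local, MPI_DOUBLE_PRECISION, &
                      a_full, counts_el, displs_el, MPI_DOUBLE_PRECISION, comm, ierr)
end subroutine redist_cols
```

Along the first dimension (cp_grp_redist_array_f) the owned rows are strided in memory, which is why that path currently needs a buffer plus allgather and why custom MPI datatypes are mentioned above as a possible alternative.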
…into configure.sh; please use the -vdw option
…irectory not in libdir
extrapolation is required, since that will be handled by the force-driver.
…nto non-orthogonal basis
…put file" VERBOSE FORCE POSITIONS VELOCITIES" instead of enabling it at compile time
…ays better than with some modified gamma value, hence removing all gamma related code
…gth 0 or negative
…last segment, the following segment was lost
… optimization, BO
port necessary data arrays to the device environment (sketched below)
… optimization, BO
enable VDW lib on GPU
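The GPU commits above do not state the programming model; as a loosely hedged illustration of keeping data arrays resident in a "device environment", here is a sketch using OpenACC data directives around a placeholder pair kernel. The directives, the names, and the kernel itself are assumptions, not the branch's actual VDW GPU code.

```fortran
! Assumption: OpenACC-style directives; the branch may use a different
! programming model. The pair term below is a placeholder, not the VDW kernel.
subroutine vdw_forces_device(tau, fion, nat)
  implicit none
  integer, intent(in)    :: nat
  real(8), intent(in)    :: tau(3, nat)    ! atomic positions
  real(8), intent(inout) :: fion(3, nat)   ! forces, updated on the device
  integer :: ia, ib
  real(8) :: d(3), r2

  ! keep the arrays resident on the device for the whole evaluation
  !$acc data copyin(tau) copy(fion)
  !$acc parallel loop private(d, r2)
  do ia = 1, nat
     do ib = 1, nat
        if (ib == ia) cycle
        d  = tau(:, ia) - tau(:, ib)
        r2 = d(1)*d(1) + d(2)*d(2) + d(3)*d(3)
        ! placeholder pair contribution
        fion(:, ia) = fion(:, ia) - d / (r2*r2 + 1.0d-12)
     end do
  end do
  !$acc end data
end subroutine vdw_forces_device
```

Keeping tau and fion inside one data region avoids host/device transfers between successive force evaluations, which is the point of porting the data arrays rather than only the kernels.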
Code used for performance benchmarks published in PSI-K B8.08
https://www.psik2025.net/program/schedule