forked from CPMD-code/CPMD
Experimental gpu bernd2 #7
Open: kloeffel wants to merge 217 commits into OpenCPMD:main from kloeffel:experimental_gpu_bernd2
…xHost and zmm-usage=high
nplist/nolist not even used, so remove them
allgather bandwidth; added some more interfaces: MPI allreduce with MPI_MIN, MPI reduce with MPI_IN_PLACE, and in-place allgatherv routines (sketched below)
Regtests are performed for both MPI and MPI+OMP with debug INTEL-XHOST-IFORT-MPI and the environment settings:
export I_MPI_CBWR=2
export MKL_CBWR=COMPATIBLE
export MKL_DYNAMIC=false
export OMP_DYNAMIC=false
export KMP_DETERMINISTIC_REDUCTION=true
rsync -a --include="*/" --include="*html" --include="*out" --exclude="*"
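The interfaces listed above map onto standard MPI patterns; the sketch below shows them with plain MPI Fortran calls rather than the CPMD wrapper layer, and all subroutine and variable names are chosen for illustration only.

```fortran
! Three independent patterns shown back to back, not a meaningful sequence.
! buf, counts and displs are placeholder names.
subroutine demo_inplace_collectives(buf, n, counts, displs, comm)
  use mpi
  implicit none
  integer, intent(in)    :: n, comm
  real(8), intent(inout) :: buf(n)              ! local slice already stored in place
  integer, intent(in)    :: counts(*), displs(*)! per-rank element counts and offsets
  integer :: me, ierr
  real(8) :: gmin, dummy(1)

  call mpi_comm_rank(comm, me, ierr)

  ! allreduce with MPI_MIN: every rank obtains the global minimum
  call mpi_allreduce(buf(1), gmin, 1, MPI_DOUBLE_PRECISION, MPI_MIN, comm, ierr)

  ! reduce with MPI_IN_PLACE: the root accumulates into its own buffer;
  ! the receive buffer is ignored on the other ranks
  if (me == 0) then
     call mpi_reduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
  else
     call mpi_reduce(buf, dummy, n, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
  end if

  ! in-place allgatherv: each rank's slice of buf is already at its final
  ! offset; the collective only fills in the slices owned by the other ranks
  call mpi_allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, &
                      buf, counts, displs, MPI_DOUBLE_PRECISION, comm, ierr)
end subroutine demo_inplace_collectives
```

The in-place variants avoid a second full-size buffer, which matters for the large wavefunction and overlap arrays these wrappers are used for.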
Distribute atoms taking into account the number of beta projectors for effective load balancing; get rid of the problematic distribution code and use dist_entity2 for the old behavior.
# Conflicts:
#   src/SOURCES
#   src/distribution_utils.mod.F90
#   src/vdw_utils.mod.F90
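The load-balancing idea can be pictured as a greedy weighted assignment; the sketch below only illustrates that idea (balance_atoms, nproj and owner are made-up names, not the actual replacement for dist_entity2).

```fortran
! Sketch: assign each atom to the process with the smallest accumulated
! projector count, so atoms with many beta projectors are spread evenly.
subroutine balance_atoms(nproj, nat, nproc, owner)
  implicit none
  integer, intent(in)  :: nat, nproc
  integer, intent(in)  :: nproj(nat)   ! number of beta projectors per atom
  integer, intent(out) :: owner(nat)   ! process index (0-based) owning each atom
  integer :: load(0:nproc-1)
  integer :: ia, ip, ipmin

  load = 0
  do ia = 1, nat
     ! pick the currently least loaded process
     ipmin = 0
     do ip = 1, nproc - 1
        if (load(ip) < load(ipmin)) ipmin = ip
     end do
     owner(ia) = ipmin
     load(ipmin) = load(ipmin) + nproj(ia)
  end do
end subroutine balance_atoms
```

Sorting the atoms by decreasing projector count before the greedy loop would usually tighten the balance further.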
Optional flags added:
- symmetrization can be disabled via the optional flag symmetrization
- the communicator used for symmetrization can be changed via the optional flag gid
- summat_parent is obsolete via the optional flag parent
- both spins can be packed into a single MPI call via the optional flag lsd
Without the optional flags, the original behavior is restored (com=allgrp, symmetrization=.true.).
Additional routine to pack and unpack a symmetric matrix (sketched below).
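Packing only one triangle halves the data volume of the subsequent sum; a minimal sketch of such a pack/unpack pair (illustrative names, not the actual CPMD routines):

```fortran
! Sketch: pack the upper triangle of a symmetric n x n matrix into a
! buffer of n*(n+1)/2 elements before the MPI sum, and restore it after.
subroutine pack_sym(a, n, buf)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: a(n, n)
  real(8), intent(out) :: buf(n*(n+1)/2)
  integer :: i, j, k
  k = 0
  do j = 1, n
     do i = 1, j
        k = k + 1
        buf(k) = a(i, j)
     end do
  end do
end subroutine pack_sym

subroutine unpack_sym(buf, n, a)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: buf(n*(n+1)/2)
  real(8), intent(out) :: a(n, n)
  integer :: i, j, k
  k = 0
  do j = 1, n
     do i = 1, j
        k = k + 1
        a(i, j) = buf(k)
        a(j, i) = buf(k)   ! restore the symmetric counterpart
     end do
  end do
end subroutine unpack_sym
```

With the lsd flag, both spin channels could be placed back to back in one such buffer and summed in a single call, as described above.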
unpack/pack sort
provide routines to build betaprojector arrays as needed in nlforce/rnlsm1/rnlsm2/spsi
rnlsm1/rnlsm2 helper routines to simplify both routines
additional arrays: fnlgam_packed / fnl_packed / dfnl_packed; they will be used later in the uspp branch
Rewritten rnlsm1/2 routines; overlapping communication/computation is possible with an autotuning algorithm or user-defined block sizes (see the sketch below).
dfnla/fnla: special pointers for the gamma-only case that ignore the first and last dimension.
Adding reshape_inplace_r6_r4 and r5_r3 for dfnla/fnla.
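The communication/computation overlap mentioned for the rewritten rnlsm1/2 can be pictured as a pipelined loop over blocks. The sketch below uses non-blocking MPI_Iallreduce as one possible way to obtain the overlap; the block size selection (autotuned or user-defined) is not shown, and all names and the local update are placeholders rather than the actual implementation.

```fortran
! Sketch of pipelined overlap: the reduction of block ib proceeds while
! block ib+1 is being computed. blksize, nblk and fnl are placeholders.
subroutine blocked_overlap(fnl, blksize, nblk, comm)
  use mpi
  implicit none
  integer, intent(in)    :: blksize, nblk, comm
  real(8), intent(inout) :: fnl(blksize, nblk)
  integer :: ib, ierr
  integer :: req(nblk)

  do ib = 1, nblk
     ! local work for this block (stand-in for the projector DGEMMs)
     fnl(:, ib) = 2.0d0 * fnl(:, ib)
     ! start the non-blocking reduction of the finished block
     call mpi_iallreduce(MPI_IN_PLACE, fnl(1, ib), blksize, &
                         MPI_DOUBLE_PRECISION, MPI_SUM, comm, req(ib), ierr)
  end do
  call mpi_waitall(nblk, req, MPI_STATUSES_IGNORE, ierr)
end subroutine blocked_overlap
```

Smaller blocks give more overlap but more latency per message, which is exactly the trade-off an autotuner or a user-defined block size has to settle.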
provide cp_grp_redist_array
redistribution of arrays distributed along the second dimension is straightforward using allgatherv instead of allreduce (see the sketch after this block)
cp_grp_redist_array_f
redistribution of arrays distributed along the first dimension; the current implementation uses a buffer and allgather. The performance of multiple broadcasts with a custom datatype should be tested to avoid the buffer; calling redist_array_r1 multiple times should also work but is probably very inefficient.
cp_grp_get_sizes now also accepts ncpw%nhg; use the part_1d routine to avoid problems with the cp_grp redistribution routines
cp_grp_split_atoms generates a custom na mapping to distribute atoms between cp_grps
cp_grp_redist_dfnl_fnl redistributes fnl and/or dfnl arrays
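Redistribution along the second dimension maps onto contiguous column blocks, which is why a single allgatherv suffices for cp_grp_redist_array; a hedged sketch with illustrative names:

```fortran
! Sketch: each rank owns ncol_local columns of a (nrow, sum(ncols)) array.
! Fortran stores columns contiguously, so a block of columns is one
! contiguous chunk and no packing buffer is needed.
subroutine redist_cols(a_local, ncol_local, a_full, nrow, ncols, displs, nranks, comm)
  use mpi
  implicit none
  integer, intent(in)  :: nrow, ncol_local, nranks, comm
  integer, intent(in)  :: ncols(nranks), displs(nranks) ! per-rank column counts/offsets
  real(8), intent(in)  :: a_local(nrow, ncol_local)
  real(8), intent(out) :: a_full(nrow, *)
  integer :: ierr
  integer :: counts_el(nranks), displs_el(nranks)

  ! convert column counts/offsets to element counts/offsets
  counts_el = nrow * ncols
  displs_el = nrow * displs
  call mpi_allgatherv(a_local, nrow*ncol_local, MPI_DOUBLE_PRECISION, &
                      a_full, counts_el, displs_el, MPI_DOUBLE_PRECISION, comm, ierr)
end subroutine redist_cols
```

Along the first dimension (cp_grp_redist_array_f) the owned rows are strided in memory, which is why that path currently needs a buffer plus allgather and why custom MPI datatypes are mentioned above as a possible alternative.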
…into configure.sh; please use the -vdw option
…irectory not in libdir
extrapolation is required, since that will be handled by the force-driver.
…nto non-orthogonal basis
…put file" VERBOSE FORCE POSITIONS VELOCITIES" instead of enabling it at compile time
…ays better than with some modified gamma value, hence removing all gamma related code
…gth 0 or negative
…last segment, the following segment was lost
… optimization, BO
port necessary data arrays to the device environment (sketched below)
… optimization, BO
enable VDW lib on GPU
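The GPU commits above do not state the programming model; as a loosely hedged illustration of keeping data arrays resident in a "device environment", here is a sketch using OpenACC data directives around a placeholder pair kernel. The directives, the names, and the kernel itself are assumptions, not the branch's actual VDW GPU code.

```fortran
! Assumption: OpenACC-style directives; the branch may use a different
! programming model. The pair term below is a placeholder, not the VDW kernel.
subroutine vdw_forces_device(tau, fion, nat)
  implicit none
  integer, intent(in)    :: nat
  real(8), intent(in)    :: tau(3, nat)    ! atomic positions
  real(8), intent(inout) :: fion(3, nat)   ! forces, updated on the device
  integer :: ia, ib
  real(8) :: d(3), r2

  ! keep the arrays resident on the device for the whole evaluation
  !$acc data copyin(tau) copy(fion)
  !$acc parallel loop private(d, r2)
  do ia = 1, nat
     do ib = 1, nat
        if (ib == ia) cycle
        d  = tau(:, ia) - tau(:, ib)
        r2 = d(1)*d(1) + d(2)*d(2) + d(3)*d(3)
        ! placeholder pair contribution
        fion(:, ia) = fion(:, ia) - d / (r2*r2 + 1.0d-12)
     end do
  end do
  !$acc end data
end subroutine vdw_forces_device
```

Keeping tau and fion inside one data region avoids host/device transfers between successive force evaluations, which is the point of porting the data arrays rather than only the kernels.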
Code used for performance benchmarks published in PSI-K B8.08
https://www.psik2025.net/program/schedule