Description
We have assumed that LSB_MCPU_HOSTS contains a list of hostname and core-count pairs, as follows:
$ echo $LSB_MCPU_HOSTS
host1 7 host2 7 host3 7 host4 7
The number of cores on each host is then computed as follows, with a trick to allow duplicate host names.
ray-integration/ray_launch_cluster.sh
Lines 64 to 75 in c63630a
echo "Num cpus per host is:" $LSB_MCPU_HOSTS
IFS=' ' read -r -a array <<< "$LSB_MCPU_HOSTS"
declare -A associative
i=0
len=${#array[@]}
while [ $i -lt $len ]
do
    key=${array[$i]}
    value=${array[$i+1]}
    associative[$key]+=$value
    i=$((i=i+2))
done
The problem is that LSB_MCPU_HOSTS is actually a list of hostname and slot-count pairs, as described in "Running parallel jobs on specific hosts". A slot may contain multiple cores, so the calculation above can produce wrong numbers.
Here is an example.
# job submitted by: bsub -n 4 -R "affinity[core(7,same=socket)]" -gpu num=1/task
$ echo $LSB_MCPU_HOSTS
host1 1 host2 3
$ cat $LSB_AFFINITY_HOSTFILE
host1 1,2,3,4,5,6,7
host2 0,2,3,4,6,7,8
host2 19,21,22,23,24,26,27
host2 28,29,37,41,48,49,50
I requested a job consisting of 4 slots, each with 7 cores and 1 GPU. As a result, 1 slot is allocated on host1 and 3 slots are allocated on host2, as described by the LSB_MCPU_HOSTS variable.
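To make the mismatch concrete, here is a minimal sketch that reads the pairs in LSB_MCPU_HOSTS as slot counts and multiplies them by a cores-per-slot value. The helper name `slots_to_cores` is hypothetical, and the value 7 is taken from the `bsub` request in this example. It assumes every slot has the same number of cores, which LSB_MCPU_HOSTS itself does not tell you.

```shell
# Hypothetical helper: interpret "host slots host slots ..." pairs and
# convert slots to cores, assuming a uniform number of cores per slot.
slots_to_cores() {
    cores_per_slot=$2
    # Unquoted on purpose: word-split the pair list into positional parameters.
    set -- $1
    while [ $# -ge 2 ]; do
        echo "$1 slots=$2 cores=$(( $2 * cores_per_slot ))"
        shift 2
    done
}

# Values from the example above; prints:
#   host1 slots=1 cores=7
#   host2 slots=3 cores=21
slots_to_cores "host1 1 host2 3" 7
```

Reading the second field of each pair as a core count would give 1 and 3 for this job instead of 7 and 21.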
The file specified by LSB_AFFINITY_HOSTFILE contains a list of slots and the core allocation for each slot. Each line of the file has the form "hostname core-list", where core-list is a comma-separated list of core IDs.
So a possible solution is to count the core IDs for each host in the $LSB_AFFINITY_HOSTFILE file.
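A minimal sketch of that counting step follows. It is not a patch for ray_launch_cluster.sh. The function name `count_cores_per_host` is hypothetical, and it assumes each line has exactly two whitespace-separated fields with no spaces inside the core list, as in the example above.

```shell
# Hypothetical helper: sum the number of core IDs per host in an
# affinity hostfile whose lines look like "<hostname> <id,id,...>".
count_cores_per_host() {
    # split() returns the number of comma-separated IDs on the line;
    # a host that appears on several lines (one per slot) is accumulated.
    awk 'NF >= 2 { cores[$1] += split($2, ids, ",") }
         END { for (h in cores) print h, cores[h] }' "$1" | sort
}

# Demo using the allocation from this issue, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
host1 1,2,3,4,5,6,7
host2 0,2,3,4,6,7,8
host2 19,21,22,23,24,26,27
host2 28,29,37,41,48,49,50
EOF

# Prints:
#   host1 7
#   host2 21
count_cores_per_host "$sample"
rm -f "$sample"
```

Inside a job, the call would be `count_cores_per_host "$LSB_AFFINITY_HOSTFILE"`. The variable is only set when the job is submitted with an affinity resource requirement, as in the example above.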