Number of cores on nodes calculated incorrectly #12

@takaomoriyama

Description

We have assumed that LSB_MCPU_HOSTS contains a list of hostname/core-count pairs, as follows:

$ echo $LSB_MCPU_HOSTS
host1 7 host2 7 host3 7 host4 7

The number of cores on each host is then computed as follows, with a trick to allow for duplicate host names:

echo "Num cpus per host is:" $LSB_MCPU_HOSTS
IFS=' ' read -r -a array <<< "$LSB_MCPU_HOSTS"
declare -A associative
i=0
len=${#array[@]}
while [ $i -lt $len ]
do
    key=${array[$i]}
    value=${array[$i+1]}
    associative[$key]+=$value
    i=$((i+2))
done

The problem is that LSB_MCPU_HOSTS is actually a list of hostname/slot-count pairs, as described in "Running parallel jobs on specific hosts". A slot may contain multiple cores, so the calculation above may produce wrong numbers.

Here is an example.

# job submitted by: bsub -n 4 -R "affinity[core(7,same=socket)]" -gpu num=1/task
$ echo $LSB_MCPU_HOSTS
host1 1 host2 3
$ cat $LSB_AFFINITY_HOSTFILE
host1 1,2,3,4,5,6,7
host2 0,2,3,4,6,7,8
host2 19,21,22,23,24,26,27
host2 28,29,37,41,48,49,50

I requested a job consisting of 4 slots, each with 7 cores and 1 GPU. As a result, 1 slot was allocated on host1 and 3 slots on host2, as the LSB_MCPU_HOSTS variable shows.
The file specified by LSB_AFFINITY_HOSTFILE contains one line per slot, giving the core allocation for that slot. Each line has the form "hostname core-list", where core-list is a comma-separated list of core IDs.
So a possible solution is to count the core IDs for each host in the $LSB_AFFINITY_HOSTFILE file.
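
The counting approach proposed above could be sketched roughly as follows. This is only an illustration, not the project's actual fix: the function name count_cores_per_host is hypothetical, and the sketch assumes every line of the file has exactly the "hostname core-list" form shown in the example, with no other fields.

```shell
#!/usr/bin/env bash
# Sketch: count allocated cores per host from an LSF affinity host file.
# Assumption: each line is "hostname core-list", where core-list is a
# comma-separated list of core IDs (one line per slot).
# count_cores_per_host is a hypothetical helper name, not part of LSF.
count_cores_per_host() {
    local hostfile=$1
    local host core_list
    local -a cores
    declare -A cores_per_host=()   # local to the function

    # The "|| [ -n "$host" ]" part handles a last line without a newline.
    while read -r host core_list || [ -n "$host" ]; do
        [ -z "$host" ] && continue                 # skip blank lines
        IFS=',' read -r -a cores <<< "$core_list"  # split the core IDs
        # Add this slot's core count to the running total for the host.
        cores_per_host[$host]=$(( ${cores_per_host[$host]:-0} + ${#cores[@]} ))
    done < "$hostfile"

    # Print one "hostname core-count" pair per line, sorted by host name.
    for host in "${!cores_per_host[@]}"; do
        echo "$host ${cores_per_host[$host]}"
    done | sort
}
```

Calling count_cores_per_host "$LSB_AFFINITY_HOSTFILE" on the example file above would print "host1 7" and "host2 21", since host2 holds three slots of 7 cores each.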
