Hi,
I have tested Flops on a dual-socket Xeon Gold 6140 system. Interestingly, the performance of, for example, Fused Multiply Add scales perfectly linearly with SIMD width when using all CPU cores:
Double-Precision - 512-bit AVX512 - Fused Multiply Add GFlops = 2405.38
Double-Precision - 256-bit FMA3 - Fused Multiply Add GFlops = 1202.5
Double-Precision - 128-bit FMA3 - Fused Multiply Add GFlops = 601.248
This should NOT be the case, because the achievable turbo frequency of the CPU differs considerably between SIMD widths when all cores are in use (see http://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/8):
128-bit FMA3 3.0 GHz
256-bit FMA3 2.6 GHz
512-bit AVX512 2.1 GHz
This difference should show up in the performance.
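As a rough sanity check, the throughput one would expect at these frequencies can be estimated. This is only a sketch: the core count (18 per socket) and the number of FMA units per core (2) are my assumptions about the Xeon Gold 6140 and are not measured here, and `peak_gflops` is just an illustrative helper name.

```python
# Assumptions (not measured in this report): 2 sockets x 18 cores,
# 2 FMA units per core, 1 FMA = 2 floating-point operations.
SOCKETS = 2
CORES_PER_SOCKET = 18
FMA_UNITS_PER_CORE = 2
FLOPS_PER_FMA = 2

# All-core turbo frequencies in GHz, as listed above.
ALL_CORE_TURBO_GHZ = {128: 3.0, 256: 2.6, 512: 2.1}


def peak_gflops(simd_bits, ghz):
    """Theoretical double-precision FMA peak for the whole machine."""
    lanes = simd_bits // 64  # doubles per SIMD register
    flops_per_cycle = FMA_UNITS_PER_CORE * lanes * FLOPS_PER_FMA
    return SOCKETS * CORES_PER_SOCKET * flops_per_cycle * ghz


expected = {bits: peak_gflops(bits, ghz) for bits, ghz in ALL_CORE_TURBO_GHZ.items()}

for bits, gflops in expected.items():
    print(f"{bits}-bit: {gflops:.1f} GFlops expected at {ALL_CORE_TURBO_GHZ[bits]} GHz")
```

Under these assumptions the estimate comes out at roughly 864, 1498 and 2419 GFlops for 128, 256 and 512 bits. Only the 512-bit measurement from the Skylake binary is close to its estimate.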
Indeed, if I use the Haswell binary instead of the Skylake binary, I get the following results:
Double-Precision - 256-bit FMA3 - Fused Multiply Add GFlops = 1492.8
Double-Precision - 128-bit FMA3 - Fused Multiply Add GFlops = 860.496
These numbers are higher than those from the Skylake binary, and the ratios between the two binaries match the frequency ratios:
1492.8/1202.5=1.24 is approx. 2.6 GHz/2.1 GHz
860.496/601.248=1.43 is approx. 3.0 GHz/2.1 GHz
This suggests that the Skylake binary runs at the AVX-512 frequency even for the smaller SIMD widths, and therefore reports results that are too low for 128-bit and 256-bit.
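The ratio comparison can be reproduced with a few lines. This is just a sketch using the numbers quoted in this report; the dictionary and function names are made up for illustration.

```python
# Measured GFlops from this report, keyed by SIMD width in bits.
skylake_binary = {256: 1202.5, 128: 601.248}
haswell_binary = {256: 1492.8, 128: 860.496}

# All-core turbo frequencies in GHz, keyed by SIMD width in bits.
turbo_ghz = {128: 3.0, 256: 2.6, 512: 2.1}


def compare(bits):
    """Return (measured speedup of the Haswell binary, clock ratio vs. AVX-512)."""
    measured = haswell_binary[bits] / skylake_binary[bits]
    clock = turbo_ghz[bits] / turbo_ghz[512]
    return measured, clock


for bits in (256, 128):
    measured, clock = compare(bits)
    print(f"{bits}-bit: measured ratio {measured:.3f}, clock ratio {clock:.3f}")
```

For both widths the two ratios agree to within about 0.01, which is consistent with the Skylake binary being held at the AVX-512 clock.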
This happens even if I remove the AVX512 tests from the code. Thus the only difference is -march=skylake-avx512.
Compilation details:
Kernel 4.11.0-1-amd64, g++ 6.3.0