Improve SGEMM, DGEMM, CGEMM and ZGEMM kernels for RISC-V ZVL128B and ZVL256B #1
+1,360 −8,177
These patches improve the performance of the kernels by making use of variable-length vectors in the RISC-V vector extensions.
OpenBLAS generally assumes that SIMD instructions operate only on power-of-two element counts, so when dealing with a tail of length 7, for example, it expects a vector operation on 4 elements, then on 2, and finally on 1. On RISC-V this is wasteful, as the whole tail could be handled in a single operation instead of three.
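To illustrate the difference, here is a minimal sketch (not code from the patch): with the RVV intrinsics the remaining element count is simply passed as the `vl` argument, so a tail of 7 single-precision elements fits in one operation, assuming an LMUL/VLEN combination whose VLMAX is at least 7 (e.g. LMUL=2 at ZVL128B).

```c
#include <riscv_vector.h>

/* Scale an n-element tail of x by alpha in a single operation, rather than
 * splitting it into power-of-two sub-operations (4 + 2 + 1 for n == 7).
 * Assumes n <= VLMAX for e32/m2 (8 elements at ZVL128B, 16 at ZVL256B). */
static inline void scale_tail_f32(float *x, float alpha, size_t n) {
    size_t vl = __riscv_vsetvl_e32m2(n);            /* vl == n when n <= VLMAX */
    vfloat32m2_t vx = __riscv_vle32_v_f32m2(x, vl);
    vx = __riscv_vfmul_vf_f32m2(vx, alpha, vl);
    __riscv_vse32_v_f32m2(x, vx, vl);
}
```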
OpenBLAS reorders the matrix layout before it reaches the kernel in order to optimise cache usage, so this packing step needs to be modified first. As only the LHS matrix is read into vector registers, only the tcopy operation needs changing: it now presents the tail as one chunk instead of breaking it down into powers of 2.
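As a rough illustration of the packing change (hypothetical code, not the actual OpenBLAS tcopy interface): the trailing rows are emitted as a single block of width m_tail per column index, rather than as separate blocks of width 4, 2, and 1.

```c
/* Toy packing of the trailing m_tail rows of an m x k tile of A (leading
 * dimension lda) into the packed buffer b, one contiguous m_tail-wide chunk
 * per k index, so the kernel can consume it with a single vl = m_tail. */
static void pack_lhs_tail(const float *a, long lda, long k, long m_tail,
                          float *b) {
    for (long j = 0; j < k; j++)
        for (long i = 0; i < m_tail; i++)
            *b++ = a[i * lda + j];
}
```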
In the kernels, we can then simply set the vector length on each instruction (passed as the last argument of the intrinsics) to the tile size for most of the computation and to the remainder for the tail.
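A hedged sketch of that pattern (the names are illustrative, not the patch's kernel): the same intrinsic sequence serves both the full tile and the tail, because the element count m is passed straight through as vl.

```c
#include <riscv_vector.h>

/* Accumulate c[0..m) += alpha * b * a[0..m), where m is either the full tile
 * size or the tail length; every intrinsic takes vl as its last argument. */
static inline void accum_column_f32(float *c, const float *a, float b,
                                    float alpha, size_t m) {
    size_t vl = __riscv_vsetvl_e32m2(m);
    vfloat32m2_t va = __riscv_vle32_v_f32m2(a, vl);
    vfloat32m2_t vc = __riscv_vle32_v_f32m2(c, vl);
    vc = __riscv_vfmacc_vf_f32m2(vc, alpha * b, va, vl);
    __riscv_vse32_v_f32m2(c, vc, vl);
}
```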
There is a complication in that vector types like vfloat32m1_t cannot be placed into arrays by the compiler (presumably because they represent actual registers, not areas of memory), so if we want the same vector operation applied to N different vectors, it has to be written out N times instead of using a for loop. This was originally done with a Python code generator that produced some very repetitive code, but for ease of experimentation I moved to using macros instead.
A RISCV_REPEAT macro is used to repeat a vector instruction N times. However, since no loops are allowed, a naive implementation (a switch on N with fall-through, reverse-ordered cases) would need at least one conditional jump, which causes a significant performance penalty in such a tight loop. Instead, I ensure that N is a compile-time constant at the call site and force-inline the code, relying on the compiler's constant propagation and dead-code elimination to remove the extraneous repetitions.
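The following is a sketch of the idea (the RISCV_REPEAT name comes from the patch, but the expansion and the helper around it are illustrative): each repetition is guarded by a comparison against N, and because N is a literal constant at every call site of a force-inlined function, the guards and the unused copies are folded away at compile time, leaving straight-line code with no branches.

```c
#include <riscv_vector.h>

/* Guarded expansion: repetition i survives only when i < N. */
#define RISCV_REPEAT(N, OP) \
    do {                    \
        if ((N) > 0) OP(0); \
        if ((N) > 1) OP(1); \
        if ((N) > 2) OP(2); \
        if ((N) > 3) OP(3); \
    } while (0)

/* Hypothetical micro-kernel step: store N accumulators back into C. The
 * accumulators are separate named variables because vector types cannot be
 * placed in arrays. */
#define STORE_ACC(i) __riscv_vse32_v_f32m2(c + (i) * ldc, acc##i, vl)

static inline __attribute__((always_inline))
void store_accs(float *c, long ldc, size_t vl, int n,
                vfloat32m2_t acc0, vfloat32m2_t acc1,
                vfloat32m2_t acc2, vfloat32m2_t acc3) {
    /* When n is a compile-time constant at the call site, the guards inside
     * RISCV_REPEAT are resolved statically and no conditional jump remains. */
    RISCV_REPEAT(n, STORE_ACC);
}
```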
With this approach, the lines of code in the kernels are dramatically reduced, and I believe the result is much more readable. I further reduced the amount of code by factoring the parts common to SGEMM & DGEMM, CGEMM & ZGEMM, and gemm_tcopy_* & zgemm_tcopy_* into separate files that are included as required.