Improve SGEMM, DGEMM, CGEMM and ZGEMM kernels for RISC-V ZVL128B and ZVL256B #1

k-yeung · 2025-12-05T21:38:17Z

These patches improve the performance of the kernels by making use of variable-length vectors in the RISC-V vector extensions.

OpenBLAS generally assumes that SIMD instructions operate only on power-of-two numbers of elements. For example, a tail of length 7 is expected to be handled with a vector operation on 4 elements, then one on 2, and finally one on 1. On RISC-V this is wasteful, as the whole tail could be handled in one operation instead of three.
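To make the cost difference concrete, here is a small counting sketch. The function names and the `max_width` parameter are made up for illustration and do not appear in OpenBLAS; the code only counts operations and performs no vector work.

```c
#include <assert.h>
#include <stddef.h>

/* Number of vector operations needed for `tail` elements when every
   operation must cover a power-of-two number of elements. `max_width` is
   the widest operation available and is assumed to be a power of two. */
size_t ops_power_of_two(size_t tail, size_t max_width)
{
    size_t ops = 0;
    for (size_t width = max_width; width > 0; width /= 2) {
        while (tail >= width) {
            tail -= width;
            ops++;
        }
    }
    return ops;
}

/* Number of vector operations when the vector length may be set to any
   value from 1 to `max_width`, as the RISC-V vector extension allows. */
size_t ops_variable_length(size_t tail, size_t max_width)
{
    assert(max_width > 0);
    return (tail + max_width - 1) / max_width;
}
```

For a tail of 7 with a maximum width of 8, the first function counts three operations (4, 2, 1) and the second counts one.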

OpenBLAS reorders the matrix layout before it reaches the kernel in order to optimise cache usage, so this needs to be modified first. As only the LHS matrix is read into vector registers, only the tcopy operation needs to be modified to present the tail in one chunk instead of breaking it down into powers of 2.
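The idea of presenting the tail in one chunk can be sketched with a simplified packing routine. This is a model only: `pack_lhs`, its parameters and the exact panel layout are assumptions made for illustration, not the actual OpenBLAS tcopy code.

```c
#include <assert.h>
#include <stddef.h>

/* Pack a column-major m x k matrix `a` (leading dimension lda) into panels
   of `unroll` rows. Within a panel, the rows belonging to one column are
   stored contiguously so that a kernel could fetch them with one vector
   load. The final m % unroll rows form a single narrower panel instead of
   being split into power-of-two pieces. Returns the number of elements
   written to `packed`. */
size_t pack_lhs(const float *a, size_t m, size_t k, size_t lda,
                size_t unroll, float *packed)
{
    assert(unroll > 0);
    size_t out = 0;
    for (size_t i0 = 0; i0 < m; i0 += unroll) {
        /* Full panels have `unroll` rows; the last one may be shorter. */
        size_t rows = (m - i0 < unroll) ? (m - i0) : unroll;
        for (size_t j = 0; j < k; j++)
            for (size_t i = 0; i < rows; i++)
                packed[out++] = a[(i0 + i) + j * lda];
    }
    return out;
}
```

With m = 3, k = 2 and unroll = 2, the routine emits one panel of two rows followed by one panel holding the single remaining row.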

In the kernels, we can then simply set the vector length on each instruction (passed as the last argument of the intrinsics) to the tile size for most of the computation, and to the remainder in the tail.
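A portable scalar model of this pattern follows. `model_vfmacc` is a stand-in written for this sketch, not an RVV intrinsic; it only copies the convention of taking the vector length as its last argument. The packed layout (panels of `tile` rows, with one shorter panel for the tail) is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar stand-in for a vector multiply-accumulate:
   acc[i] += s * v[i] for the first vl elements. */
void model_vfmacc(float *acc, float s, const float *v, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        acc[i] += s * v[i];
}

/* c (length m) += A * b, where A is m x k and stored as consecutive panels.
   Each panel holds `vl` rows for each of the k columns in turn. */
void column_update_model(const float *packed_a, const float *b, float *c,
                         size_t m, size_t k, size_t tile)
{
    assert(tile > 0);
    for (size_t i0 = 0; i0 < m; i0 += tile) {
        /* vl is the tile size for full panels and the remainder in the tail. */
        size_t vl = (m - i0 < tile) ? (m - i0) : tile;
        for (size_t p = 0; p < k; p++)
            model_vfmacc(c + i0, b[p], packed_a + p * vl, vl);
        packed_a += k * vl;  /* advance to the next panel */
    }
}
```

The same loop body serves both the full tiles and the tail; only the value of `vl` changes.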

There is a complication in that vector types like vfloat32m1_t cannot be placed into arrays by the compiler (presumably because they represent actual registers, not areas of memory), so if we want the same vector operation done N times to N different vectors, it needs to be written out N times instead of using a for loop. This was originally done using a Python code generator to autogenerate some very repetitive code, but for ease of experimentation I moved to using macros instead.

A RISCV_REPEAT macro is used to repeat a vector instruction N times. However, as no loops are allowed, this would necessitate at least one conditional jump (for example, a switch on N with reverse-ordered fallthrough cases), which causes a significant performance penalty in such a tight loop. Instead, I ensure that N is a compile-time constant at the call site and force the code to be inlined, relying on the compiler's constant propagation and dead-code elimination to remove any extraneous calls.
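The technique can be sketched as follows. This is not the PR's RISCV_REPEAT definition: `REPEAT_UP_TO_4`, `ALWAYS_INLINE` and `sum_first_n` are hypothetical names, plain floats stand in for vector registers, and the `always_inline` attribute assumes GCC or Clang.

```c
#define ALWAYS_INLINE static inline __attribute__((always_inline))

/* Apply OP(0) .. OP(N-1) for N up to 4 without a loop. When N is a
   compile-time constant, the compiler can drop the untaken branches. */
#define REPEAT_UP_TO_4(N, OP)       \
    do {                            \
        if ((N) > 0) { OP(0); }     \
        if ((N) > 1) { OP(1); }     \
        if ((N) > 2) { OP(2); }     \
        if ((N) > 3) { OP(3); }     \
    } while (0)

/* acc0..acc3 are separate named variables, mimicking vector values that
   cannot be stored in an array. */
ALWAYS_INLINE float sum_first_n(const float *x, const int n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
#define LOAD(I) acc##I = x[I]
    REPEAT_UP_TO_4(n, LOAD);
#undef LOAD
    return acc0 + acc1 + acc2 + acc3;
}

/* Call sites pass a literal N, so each wrapper is specialised on inlining. */
float sum3(const float *x) { return sum_first_n(x, 3); }
float sum4(const float *x) { return sum_first_n(x, 4); }
```

Because `n` is a literal at each call site and the helper is always inlined, the `if ((N) > i)` tests are expected to fold away at compile time, leaving straight-line code.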

With this approach, the number of lines of code in the kernel decreases dramatically, and I believe the result is much more readable. I further reduced the amount of code by factoring out the common code in SGEMM & DGEMM, CGEMM & ZGEMM, gemm_tcopy_* and zgemm_tcopy_* into separate files and including them as required.

These kernels use the same algorithm as the original ones, but improve on them in several respects:

- Common code is shared between the two kernel types and architectures, reducing the overall lines of code.
- Instead of using auto-generated code with each vector operation repeated multiple times (necessitated in part due to the vector types not being storable in arrays), macros and forced inlining are used to achieve the same effect directly for better clarity and a major reduction in lines of code.
- The tiling sizes can be modified within a certain range by modifying the unroll parameters.
- The RVV extension is not restricted to having the number of elements in a vector register being a power of two, so the tails of the matrix involving vector operations can be dealt with in a single operation rather than in decreasing powers of two, thereby improving performance. The tcopy operations need to be modified to take this into account.

…6B architectures

This applies the changes made to the SGEMM/DGEMM kernels to their complex CGEMM/ZGEMM equivalents.

The original kernels have been deleted with this commit.

…sizes for RISC-V

The unroll factors of the TRMM kernels are currently set to those of the equivalent GEMM kernels. As we are not dealing with the TRMM kernels for now, I have added extra xTRMM_UNROLL_(M|N) parameters to allow them to be set independently of the xGEMM_UNROLL_(M|N) parameters. If a new TRMM parameter is not defined, it falls back to the original behaviour of using the corresponding GEMM parameter.
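The fallback behaviour can be sketched with the preprocessor. The macro names follow the xTRMM_UNROLL_(M|N) and xGEMM_UNROLL_(M|N) pattern for the double-precision case; the numeric values and the placement of the definitions are illustrative assumptions, not copied from the patch.

```c
/* Values a target's parameter block might define (illustrative only). */
#define DGEMM_UNROLL_M 4
#define DGEMM_UNROLL_N 8
#define DTRMM_UNROLL_M 8   /* TRMM tiling pinned independently of GEMM */
/* DTRMM_UNROLL_N is deliberately left undefined here. */

/* Fallback: use the GEMM value when no TRMM-specific value is given. */
#ifndef DTRMM_UNROLL_M
#define DTRMM_UNROLL_M DGEMM_UNROLL_M
#endif
#ifndef DTRMM_UNROLL_N
#define DTRMM_UNROLL_N DGEMM_UNROLL_N
#endif

/* Compile-time checks of the resulting values. */
_Static_assert(DTRMM_UNROLL_M == 8, "explicit TRMM value is kept");
_Static_assert(DTRMM_UNROLL_N == DGEMM_UNROLL_N, "falls back to GEMM value");
```

Here the GEMM tiling is 4x8 while TRMM ends up as 8x8: the M factor comes from the explicit definition and the N factor from the fallback.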

…tecture

Testing has shown that 4x8 or 4x4 performs better than the original 8x8 tiling size for this kernel. As we do not wish to perturb the behaviour of the DTRMM kernel at this point, the DTRMM tiling size is explicitly set to the original 8x8.

If DYNAMIC_ARCH is enabled, then the various xGEMM_UNROLL_(M|N) macros expand to 'gotoblas-><parameter>', which prevents their use in macro conditionals as the value is only defined at runtime. To get around this, we use the xGEMM_UNROLL_(M|N)_DEFAULT macros instead, which should expand to a compile-time constant. The tiling factors used in the tcopy/ncopy functions are also selected by these macros at compile time, so this does not reduce functionality: changing the tiling at runtime without changing tcopy/ncopy would produce incorrect results anyway.
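A simplified model of the problem and the workaround is shown below. The struct, the `model_gotoblas` pointer, `TILE_ROWS` and the value 4 are hypothetical stand-ins for this sketch; only the DGEMM_UNROLL_M and DGEMM_UNROLL_M_DEFAULT names follow the pattern described above.

```c
/* Stand-in for the runtime parameter table used under DYNAMIC_ARCH. */
static struct model_params { int dgemm_unroll_m; } model_table = { 4 };
static struct model_params *model_gotoblas = &model_table;

/* Runtime lookup: fine in ordinary C expressions, but not usable in #if,
   because '->' is not a valid token in a preprocessor expression. */
#define DGEMM_UNROLL_M         (model_gotoblas->dgemm_unroll_m)

/* Compile-time constant: safe to use in preprocessor conditionals. */
#define DGEMM_UNROLL_M_DEFAULT 4

#if DGEMM_UNROLL_M_DEFAULT == 4
#define TILE_ROWS 4
#else
#define TILE_ROWS 8
#endif

int tile_rows_compile_time(void) { return TILE_ROWS; }
int tile_rows_runtime(void)      { return DGEMM_UNROLL_M; }
```

The kernel's tile shape is fixed through `TILE_ROWS` when the file is compiled, which matches the packing routines being selected at compile time as well.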

Kwok Cheung Yeung added 6 commits December 5, 2025 20:21

Add improved CGEMM and ZGEMM kernels for the RISC-V ZVL128B and ZVL25…

65108ac


Replace original RISC-V ZVL kernels with new kernels

320e2d9


k-yeung marked this pull request as ready for review December 5, 2025 21:38

luhenry requested a review from sergei-lewis December 15, 2025 17:39
