Skip to content

Conversation

@k-yeung
Copy link
Collaborator

@k-yeung k-yeung commented Dec 5, 2025

These patches improve the performance of the kernels by making use of variable-length vectors in the RISC-V vector extensions.

OpenBLAS generally assumes that SIMD instructions work only with a number of elements in powers of 2, so for example when dealing with a tail of length 7, it is expected that we will do a vector operation with 4 elements, then with 2, and finally 1. On RISC-V this is wasteful as this could be done in just one operation instead of three.

OpenBLAS reorders the matrix layout before it reaches the kernel in order to optimise cache usage, so this needs to be modified first. As only the LHS matrix is read into vector registers, only the tcopy operation needs to be modified to present the tail in one chunk instead of breaking it down into powers of 2.

In the kernels, we can then simply set the vector length on each instruction (which is presented as the last argument in the intrinsics) to the tile size for most of the computation and the remainder in the tail.

There is a complication in that vector types like vfloat32m1_t cannot be placed into arrays by the compiler (presumably because they represent actual registers, not areas of memory), so if we want the same vector operation done N times to N different vectors, it needs to be written out N times instead of using a for loop. This was originally done using a Python code generator to autogenerate some very repetitive code, but for ease of experimentation I moved to using macros instead.

A RISCV_REPEAT macro is used to repeat a vector instruction N times. However, as no loops are allowed, this would necessitate at least one conditional jump (if we use a switch on N with fallthrough reverse-ordered cases) which causes a significant performance penalty on such a tight loop. Instead, I ensured that N is set to a compile-time constant at the call site and force inline it, relying on the compiler's constant propagation and dead-code elimination to remove any extraneous calls.

With this approach, the lines of code in the kernel are dramatically decreased, and I believe is much more readable. I further decreased the numbers of code by factoring out the common code in SGEMM & DGEMM, CGEMM & ZGEMM, gemm_tcopy_* and zgemm_tcopy_* into separate files and including it as required.

Kwok Cheung Yeung added 6 commits December 5, 2025 20:21
These kernels use the same algorithm as the original ones, but improve on them
in several respects:

- Common code is shared between the two kernel types and architectures,
reducing the overall lines of code.

- Instead of using auto-generated code with each vector operation repeated
multiple times (necessitated in part due to the vector types not being
storable in arrays), macros and forced inlining are used to achieve the
same effect directly for better clarity and a major reduction in lines of
code.

- The tiling sizes can be modified within a certain range by modifying the
unroll parameters.

- The RVV extension is not restricted to having the number of elements in a
vector register being a power of two, so the tails of the matrix involving
vector operations can be dealt with in a single operation rather than in
decreasing powers of two, thereby improving performance.  The tcopy operations
need to be modified to take this into account.
…6B architectures

This applies the changes made to the SGEMM/DGEMM kernels to their complex
CGEMM/ZGEMM equivalents.
The original kernels have been deleted with this commit.
…sizes for RISC-V

The unroll factor of the TRMM kernels are currently set to those of the
equivalent GEMM kernels.  As we are not dealing with the TRMM kernels for now,
I have added extra xTRMM_UNROLL_(M|N) parameters to allow them to be set
independently of the xGEMM_UNROLL_(M|N) parameters.

If the new TRMM parameter is not defined, then it defaults back to the
original behaviour of using the GEMM parameters.
…tecture

Testing has shown that 4x8 or 4x4 performs better than the original 8x8 tiling
size for this kernel.

As we do not wish to perturb the behaviour of the DTRMM kernel at this point,
the DTRMM tiling size is explicitly set to the original 8x8.
If DYNAMIC_ARCH is enabled, then the various xGEMM_UNROLL_(M|N) macros
expand to 'gotoblas-><parameter>', which prevent their use in macro
conditionals as the value is only defined at runtime.

To get around this, we use the xGEMM_UNROLL_(M|N)_DEFAULT macros instead,
which should expand to a compile-time constant.  As the tiling factors
used in the tcopy/ncopy functions are selected by these at compile time
as well, it does not result in reduced functionality as changing the
tiling at runtime without changing tcopy/ncopy would result in incorrect
results anyway.
@k-yeung k-yeung marked this pull request as ready for review December 5, 2025 21:38
@luhenry luhenry requested a review from sergei-lewis December 15, 2025 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant