Problem statement
The BLAS level-1 `scal` kernel (scale a vector in place) does not yet have a CuTe (CUTLASS/CuTe-style) backend implementation in this project. The README BLAS table lists a CuTe column, but its `scal` row is currently unimplemented, which prevents a complete set of cross-backend examples for this fundamental operation.
Without a CuTe `scal` kernel:
- users cannot study how a simple BLAS-1 operation maps onto CuTe primitives and memory layouts,
- there is no CuTe baseline for performance comparison against the PyTorch and Triton `scal` implementations,
- higher-level modules that aim to mix and match backends cannot rely on a fully populated BLAS-1 set.
Proposed solution
Add a CuTe backend implementation of the `scal` kernel that follows the mathematical semantics of the Python reference and matches the public API conventions used by the other backends.
Concretely:
- Introduce a CuTe-based `scal` kernel (file and namespace layout to match existing CuTe kernels once they are introduced in the repository).
- Implement $y = \alpha y$ for 1D vectors using CuTe primitives and the recommended memory layout abstractions; see the sketch after this list.
- Align the entry-point interface with the other backends so that higher-level code can dispatch to CuTe `scal` in a uniform way.
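As a rough illustration of the second bullet, a minimal sketch of how such a kernel could look, assuming a grid-stride loop over a rank-1 CuTe tensor that wraps the raw device pointer; the name `scal_kernel` and its signature are placeholders, not the project's actual API:

```cuda
#include <cute/tensor.hpp>

// Hedged sketch (not the project's actual kernel): y = alpha * y for a
// 1D vector of length n, one element per thread with a grid-stride loop.
template <class T>
__global__ void scal_kernel(T alpha, T* y, int n) {
  using namespace cute;

  // Wrap the raw global-memory pointer in a rank-1 CuTe tensor so element
  // access below goes through CuTe's layout machinery.
  Tensor gY = make_tensor(make_gmem_ptr(y), make_layout(make_shape(n)));

  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    gY(i) = alpha * gY(i);  // y[i] <- alpha * y[i]
  }
}
```

For a purely memory-bound BLAS-1 kernel, the grid-stride loop keeps the mapping between the math and the execution explicit; tiled variants built on `local_tile`/`local_partition` could sit behind the same entry point.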
Alternatives considered
Alternatives such as skipping `scal` for CuTe or relying on other backends for performance analysis would:
- reduce the educational value of comparing CuTe against PyTorch/Triton on a simple kernel,
- leave the CuTe column in the README BLAS table incomplete,
- make it harder to build CuTe-based examples that mirror the Python/PyTorch/Triton stacks.
Implementation details
- Define the precise file path and build integration for CuTe kernels (e.g. a `cute_ops` or analogous directory, aligned with project conventions).
- Implement the `scal` kernel using CuTe constructs appropriate for vector operations, focusing on a clear mapping between the math and the low-level execution.
- Ensure numerical equivalence with the Python `scal` reference across supported dtypes.
- Integrate the CuTe `scal` implementation into any existing or planned dispatch, testing, and benchmarking infrastructure; a hedged host-side sketch follows this list.
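Building on the kernel sketch above, one way the host-side entry point and a numerical check against a plain CPU loop (standing in for the Python reference) might look; `cute_scal` and its `(alpha, y, n, stream)` signature are placeholders chosen only to mirror the other backends:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical host entry point; assumes the scal_kernel sketch above is
// visible in the same translation unit. Name and signature are placeholders.
template <class T>
void cute_scal(T alpha, T* d_y, int n, cudaStream_t stream = 0) {
  if (n <= 0) return;
  constexpr int kThreads = 256;
  int blocks = (n + kThreads - 1) / kThreads;
  scal_kernel<T><<<blocks, kThreads, 0, stream>>>(alpha, d_y, n);
}

// Minimal numerical check against a plain CPU loop standing in for the
// Python scal reference: out[i] should equal alpha * y0[i] exactly for a
// scale-only float kernel with these inputs.
int main() {
  const int n = 1 << 20;
  const float alpha = 0.5f;
  std::vector<float> h(n);
  for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i % 100);

  float* d_y = nullptr;
  cudaMalloc(&d_y, n * sizeof(float));
  cudaMemcpy(d_y, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  cute_scal(alpha, d_y, n);
  cudaDeviceSynchronize();

  std::vector<float> out(n);
  cudaMemcpy(out.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_y);

  for (int i = 0; i < n; ++i) {
    if (out[i] != alpha * h[i]) {
      std::printf("mismatch at %d\n", i);
      return 1;
    }
  }
  std::printf("CuTe scal matches the reference loop\n");
  return 0;
}
```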
Use case
The CuTe `scal` kernel will:
- demonstrate how a simple BLAS-1 operation is expressed in CuTe,
- serve as a stepping stone toward more complex CuTe-based BLAS and Transformer modules,
- provide a basis for cross-backend performance benchmarking and tuning.
Related work
- BLAS level-1 `scal` implementations in other GPU libraries.
- CuTe/CUTLASS examples for vector and memory-bound kernels.
Additional context
This issue fits into a broader effort to provide CuTe counterparts for each BLAS kernel in the README, starting from the simplest operations (`copy`, `swap`, `scal`).