
[FEATURE REQUEST] scal CuTe kernel implementation #8

@LoserCheems

Description

Problem statement

The BLAS level-1 scal kernel (scale a vector in place) does not yet have a CuTe (CUTLASS/CuTe-style) backend implementation in this project. The README BLAS table includes a CuTe column, but the scal row's CuTe entry is still unimplemented, which prevents a complete set of cross-backend examples for this fundamental operation.

Without a CuTe scal kernel:

  • users cannot study how a simple BLAS-1 operation maps onto CuTe primitives and memory layouts,
  • there is no CuTe baseline for performance comparison against PyTorch and Triton scal implementations,
  • higher-level modules that aim to mix and match backends cannot rely on a fully populated BLAS-1 set.

Proposed solution

Add a CuTe backend implementation of the scal kernel that follows the mathematical semantics of the Python reference and matches the public API conventions used for other backends.

Concretely:

  • Introduce a CuTe-based scal kernel (file and namespace layout to match existing CuTe kernels once they are introduced in the repository).
  • Implement $y = \alpha y$ for 1D vectors using CuTe primitives and its memory-layout abstractions (a minimal device-side sketch follows this list).
  • Align the entry-point interface with the other backends so that higher-level code can dispatch to CuTe scal in a uniform way.
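As a rough illustration only, the device-side kernel could look something like the sketch below. The kernel name, template parameters, and launch strategy are assumptions for illustration, not settled project conventions.

```cpp
#include <cute/tensor.hpp>

// Hypothetical CuTe-style scal kernel: y <- alpha * y for a 1-D vector of
// length n. The name `scal_kernel` and the grid-stride scheme are
// illustrative assumptions only.
template <class T>
__global__ void scal_kernel(T* y, T alpha, int n) {
  using namespace cute;

  // View the raw device pointer as a 1-D CuTe tensor with a dynamic shape.
  Tensor gY = make_tensor(make_gmem_ptr(y), make_layout(make_shape(n)));

  // Grid-stride loop: each thread scales its assigned elements in place.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    gY(i) = alpha * gY(i);
  }
}
```

For a memory-bound BLAS-1 kernel like this, the CuTe tensor/layout wrapper mainly serves the educational goal of showing how the math maps onto CuTe abstractions; vectorized or tiled variants could be layered on later.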

Alternatives considered

Alternatives such as skipping scal for CuTe or relying on other backends for performance analysis would:

  • reduce the educational value of comparing CuTe against PyTorch/Triton on a simple kernel,
  • leave the CuTe column in the README BLAS table incomplete,
  • make it harder to build CuTe-based examples that mirror the Python/PyTorch/Triton stacks.

Implementation details

  • Define the precise file path and build integration for CuTe kernels (e.g. a cute_ops or analogous directory, aligned with project conventions).
  • Implement the scal kernel using CuTe constructs appropriate for vector operations, focusing on clear mapping between math and low-level execution.
  • Ensure numerical equivalence with the Python scal reference across supported dtypes.
  • Integrate the CuTe scal implementation into any existing or planned dispatch, testing, and benchmarking infrastructure (a host-side launch and equivalence-check sketch follows this list).
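One possible shape for the host-side entry point and the equivalence check, purely as a sketch: the function name `scal`, its argument order, and the helper `scal_matches_reference` are hypothetical and should be adapted to whatever convention the other backends establish.

```cpp
#include <cuda_runtime.h>
#include <vector>
#include <cmath>

// Hypothetical host-side entry point; the name, argument order, and launch
// configuration are assumptions, not established project conventions.
template <class T>
void scal(T alpha, T* d_y, int n, cudaStream_t stream = 0) {
  constexpr int kThreads = 256;
  int blocks = (n + kThreads - 1) / kThreads;
  scal_kernel<T><<<blocks, kThreads, 0, stream>>>(d_y, alpha, n);
}

// Minimal equivalence check against a plain CPU loop, standing in for the
// Python scal reference: scale on host, scale on device, compare elementwise.
inline bool scal_matches_reference(float alpha, std::vector<float> h_y) {
  std::vector<float> ref = h_y;
  for (float& v : ref) v *= alpha;

  float* d_y = nullptr;
  cudaMalloc(&d_y, h_y.size() * sizeof(float));
  cudaMemcpy(d_y, h_y.data(), h_y.size() * sizeof(float), cudaMemcpyHostToDevice);
  scal(alpha, d_y, static_cast<int>(h_y.size()));
  cudaMemcpy(h_y.data(), d_y, h_y.size() * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_y);

  for (size_t i = 0; i < h_y.size(); ++i) {
    if (std::fabs(h_y[i] - ref[i]) > 1e-6f * std::fabs(ref[i]) + 1e-6f) return false;
  }
  return true;
}
```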

Use case

The CuTe scal kernel will:

  • demonstrate how a simple BLAS-1 operation is expressed in CuTe,
  • serve as a stepping stone toward more complex CuTe-based BLAS and Transformer modules,
  • provide a basis for cross-backend performance benchmarking and tuning.

Related work

  • BLAS level-1 scal implementations in other GPU libraries.
  • CuTe/CUTLASS examples for vector and memory-bound kernels.

Additional context

This issue fits into a broader effort to provide CuTe counterparts for each BLAS kernel in the README, starting from the simplest operations (copy, swap, scal).
