Problem statement
The BLAS level-1 `scal` kernel (scale a vector in place) does not yet have a CuTe (CUTLASS/CuTe-style) backend implementation in this project. The README BLAS table lists a CuTe column, but its `scal` row is currently unimplemented, which prevents a complete set of cross-backend examples for this fundamental operation.
Without a CuTe `scal` kernel:
- users cannot study how a simple BLAS-1 operation maps onto CuTe primitives and memory layouts,
- there is no CuTe baseline for performance comparison against the PyTorch and Triton `scal` implementations,
- higher-level modules that aim to mix and match backends cannot rely on a fully populated BLAS-1 set.
Proposed solution
Add a CuTe backend implementation of the `scal` kernel that follows the mathematical semantics of the Python reference and matches the public API conventions used by the other backends.
Concretely:
- Introduce a CuTe-based `scal` kernel (file and namespace layout to match existing CuTe kernels once they are introduced in the repository).
- Implement $y = \alpha y$ for 1D vectors using CuTe primitives and the recommended memory layout abstractions; see the sketch after this list.
- Align the entry-point interface with the other backends so that higher-level code can dispatch to CuTe `scal` in a uniform way.
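As a rough illustration of the second bullet, a minimal sketch of how such a kernel could look, assuming a grid-stride loop over a rank-1 CuTe tensor that wraps the raw device pointer; the name `scal_kernel` and its signature are placeholders, not the project's actual API:

```cuda
#include <cute/tensor.hpp>

// Hedged sketch (not the project's actual kernel): y = alpha * y for a
// 1D vector of length n, one element per thread with a grid-stride loop.
template <class T>
__global__ void scal_kernel(T alpha, T* y, int n) {
  using namespace cute;

  // Wrap the raw global-memory pointer in a rank-1 CuTe tensor so element
  // access below goes through CuTe's layout machinery.
  Tensor gY = make_tensor(make_gmem_ptr(y), make_layout(make_shape(n)));

  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    gY(i) = alpha * gY(i);  // y[i] <- alpha * y[i]
  }
}
```

For a purely memory-bound BLAS-1 kernel, the grid-stride loop keeps the mapping between the math and the execution explicit; tiled variants built on `local_tile`/`local_partition` could sit behind the same entry point.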
Alternatives considered
Alternatives such as skipping `scal` for CuTe or relying on other backends for performance analysis would:
- reduce the educational value of comparing CuTe against PyTorch/Triton on a simple kernel,
- leave the CuTe column in the README BLAS table incomplete,
- make it harder to build CuTe-based examples that mirror the Python/PyTorch/Triton stacks.
Implementation details
- Define the precise file path and build integration for CuTe kernels (e.g. a `cute_ops` or analogous directory, aligned with project conventions).
- Implement the `scal` kernel using CuTe constructs appropriate for vector operations, focusing on a clear mapping between the math and the low-level execution.
- Ensure numerical equivalence with the Python `scal` reference across supported dtypes.
- Integrate the CuTe `scal` implementation into any existing or planned dispatch, testing, and benchmarking infrastructure; a hedged host-side sketch follows this list.
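Building on the kernel sketch above, one way the host-side entry point and a numerical check against a plain CPU loop (standing in for the Python reference) might look; `cute_scal` and its `(alpha, y, n, stream)` signature are placeholders chosen only to mirror the other backends:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical host entry point; assumes the scal_kernel sketch above is
// visible in the same translation unit. Name and signature are placeholders.
template <class T>
void cute_scal(T alpha, T* d_y, int n, cudaStream_t stream = 0) {
  if (n <= 0) return;
  constexpr int kThreads = 256;
  int blocks = (n + kThreads - 1) / kThreads;
  scal_kernel<T><<<blocks, kThreads, 0, stream>>>(alpha, d_y, n);
}

// Minimal numerical check against a plain CPU loop standing in for the
// Python scal reference: out[i] should equal alpha * y0[i] exactly for a
// scale-only float kernel with these inputs.
int main() {
  const int n = 1 << 20;
  const float alpha = 0.5f;
  std::vector<float> h(n);
  for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i % 100);

  float* d_y = nullptr;
  cudaMalloc(&d_y, n * sizeof(float));
  cudaMemcpy(d_y, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  cute_scal(alpha, d_y, n);
  cudaDeviceSynchronize();

  std::vector<float> out(n);
  cudaMemcpy(out.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_y);

  for (int i = 0; i < n; ++i) {
    if (out[i] != alpha * h[i]) {
      std::printf("mismatch at %d\n", i);
      return 1;
    }
  }
  std::printf("CuTe scal matches the reference loop\n");
  return 0;
}
```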
Use case
The CuTe `scal` kernel will:
- demonstrate how a simple BLAS-1 operation is expressed in CuTe,
- serve as a stepping stone toward more complex CuTe-based BLAS and Transformer modules,
- provide a basis for cross-backend performance benchmarking and tuning.
Related work
- BLAS level-1 `scal` implementations in other GPU libraries.
- CuTe/CUTLASS examples for vector and memory-bound kernels.
Additional context
This issue fits into a broader effort to provide CuTe counterparts for each BLAS kernel in the README, starting from the simplest operations (`copy`, `swap`, `scal`).