Skip to content

Adaptive vector lengths without global state #21

@sunfishcode

Description

@sunfishcode

Context: #4, #20. Global state is problematic, but even if we make it explicit, the ability to have arbitrary length switches is hard to implement on many architectures. The following is a sketch of a design which avoids both of these, while preserving some key advantages of flexible vectors.

Similar to the explicit loop variant of @aardappel's post here, it has a vec_loop, but this proposal is lower-level, making state explicit and avoiding unnecessary restrictions, while still hiding information as needed by implementations. I haven't prototyped this, but in theory it should be implementable on RISC-V V, SVE, AVX-512, as well as simd128-style architectures.

Implementations with dynamic vectors or masking could use them. But implementations could also chose to emit multiple loops, to handle alignments or remainders. Such implementations could also chose to run some vector loops at twice or more the hardware length, making a register-pressure vs. speed tradeoff as they see fit.

New Types:

  • vec<n>: a flexible vector of element bitwidth n
  • vec_len<n>: a scalar but opaque element count, limited to n as the widest bitwidth

New Opcodes:

  • vec_loop<vl, n>: like loop but

    • vl: immediate identifying a local variable to set to the number of elements being processed within an iteration of the loop, with type vec_len<n>. Using this local outside of a vec_loop is prohibited.
    • n: immediate which declares the bitwidth of the widest element type which will be processed in the loop
    • has one operand, the total number of elements the program wants to process in the loop
  • vec_step<vl, scale>: (i32) -> (i32)

    • vl: immediate identifying a vec_len local variable
    • scale: immediate scaling factor
    • computes operand + vl * scale
  • vec_load<vl, n>, vec_store<vl, n>: like load and store` but

    • vl: immediate identifying a vec_len local variable
    • n: immediate specifying the vector element bitwidth to use
  • The arithmetic operations from simd128, with a similar vl immediate added.

Example

Add two arrays of length $n starting at $A and $B and store the result in an array starting at $C.

  (local $A i32) (local $B i32) (local $C i32) (local $n i32)
  (local $vl vec_len.32)
  ...

  local.get $n        ;; number of elements in array to process (passing zero is ok!)
  vec_loop 32 $vl     ;; start vector loop processing (at most) 32-bit elements

    (local.set $t0 (vec_load $vl 32 (local.get $A)))
    (local.set $t1 (vec_load $vl 32 (local.get $B)))
    (local.set $t2 (vec_add $vl (local.get $t0) (local.get $t1)))
    (vec_store $vl 32 (local.get $C) (local.get $t2))

    (local.set $A (vec_step $vl 4 (local.get $A)))
    (local.set $B (vec_step $vl 4 (local.get $B)))
    (local.set $C (vec_step $vl 4 (local.get $C)))
    (local.set $n (vec_step $vl -1 (local.get $n)))

    (br_if 0 (local.get $n) (local.get $n)) ;; pass the count back to the top
  end                ;; end vector loop

Nondeterminism

The length in each iteration of a vec_loop is nondeterministic. It's visible in vec_step, in the number of elements loaded and stored in vec_load and vec_store, and in any side effects of the loop. It's expensive to completely hide the hardware vector length with any flexible-vector proposal, so whether or not there should be nondeterminism is an independent question.

One thing this proposal does do is avoid having nondeterminism which has to be constant for the duration of the program. Wasm doesn't currently have any nondeterminism that behaves like that, and it would have implications for suspending a program and resuming it on another architecture, or for distributing a program over multiple architectures.

Vector subroutines

A call within a vec_loop could make sense if the callsite passes the vec_len value to the callee. Since vec_len is a type, it'd be part of the function signature, so implementations could compile the callee to be called from within a vector loop without whole-program analysis.

Nested vec_loops

Some architectures wouldn't be able to implement arbitrary nested vector parallelism, so one possible approach would be to prohibit it. Lexical nesting is easy to detect, and dynamic nesting -- a call inside a vec_loop calling a function that contains another vec_loop could be prohibited if we require calls in vec_loops to have exactly one vec_len argument, and prohibit vec_loop inside functions with a vec_len argument.

Masks, shuffles, extracts and inserts, strided loads/stores

There are ways this design could be extended to include these, but left them out to keep things simple to start with. It's mostly an independent question whether these can be implemented efficiently.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions