Adaptive vector lengths without global state

Context: #4, #20. Global state is problematic, but even if we make it explicit, the ability to have [arbitrary length switches](https://github.com/penzn/flexible-vectors/blob/416dbcb3c1ccd67b6dd5226564daff8f73453b48/proposals/flexible-vectors/FlexibleVectors.md#setting-vector-length) is hard to implement on many architectures. The following is a sketch of a design which avoids both of these, while preserving some key advantages of flexible vectors.

Similar to the explicit loop variant of [@aardappel's post here](https://github.com/WebAssembly/flexible-vectors/issues/4#issuecomment-626813462), it has a `vec_loop`, but this proposal is lower-level, making state explicit and avoiding unnecessary restrictions, while still hiding information as needed by implementations. I haven't prototyped this, but in theory it should be implementable on RISC-V V, SVE, AVX-512, as well as simd128-style architectures.

Implementations with dynamic vectors or masking could use them. But implementations could also chose to emit multiple loops, to handle alignments or remainders. Such implementations could also chose to run some vector loops at twice or more the hardware length, making a register-pressure vs. speed tradeoff as they see fit.

### New Types:
 - `vec<n>`: a flexible vector of element bitwidth `n`
 - `vec_len<n>`: a scalar but opaque element count, limited to `n` as the widest bitwidth

### New Opcodes:
 - `vec_loop<vl, n>`: like `loop` but
    - `vl`: immediate identifying a local variable to set to the number of elements being processed within an iteration of the loop, with type `vec_len<n>`. Using this local outside of a `vec_loop` is prohibited.
    - `n`: immediate which declares the bitwidth of the widest element type which will be processed in the loop
    - has one operand, the total number of elements the program wants to process in the loop

 - `vec_step<vl, scale>`: `(i32) -> (i32)`
    - `vl`: immediate identifying a `vec_len` local variable
    - `scale`: immediate scaling factor
    - computes `operand + vl * scale`

 - `vec_load<vl, n>`, `vec_store<vl, n>`: like `load` and store` but
    - `vl`: immediate identifying a `vec_len` local variable
    - `n`: immediate specifying the vector element bitwidth to use

 - The arithmetic operations from simd128, with a similar `vl` immediate added.

### Example

Add two arrays of length `$n` starting at `$A` and `$B` and store the result in an array starting at `$C`.

```
  (local $A i32) (local $B i32) (local $C i32) (local $n i32)
  (local $vl vec_len.32)
  ...

  local.get $n        ;; number of elements in array to process (passing zero is ok!)
  vec_loop 32 $vl     ;; start vector loop processing (at most) 32-bit elements

    (local.set $t0 (vec_load $vl 32 (local.get $A)))
    (local.set $t1 (vec_load $vl 32 (local.get $B)))
    (local.set $t2 (vec_add $vl (local.get $t0) (local.get $t1)))
    (vec_store $vl 32 (local.get $C) (local.get $t2))

    (local.set $A (vec_step $vl 4 (local.get $A)))
    (local.set $B (vec_step $vl 4 (local.get $B)))
    (local.set $C (vec_step $vl 4 (local.get $C)))
    (local.set $n (vec_step $vl -1 (local.get $n)))

    (br_if 0 (local.get $n) (local.get $n)) ;; pass the count back to the top
  end                ;; end vector loop
```

### Nondeterminism

The length in each iteration of a `vec_loop` is nondeterministic.  It's visible in `vec_step`, in the number of elements loaded and stored in `vec_load` and `vec_store`, and in any side effects of the loop. It's expensive to completely hide the hardware vector length with any flexible-vector proposal, so whether or not there should be nondeterminism is an independent question.

One thing this proposal does do is avoid having nondeterminism which has to be constant for the duration of the program. Wasm doesn't currently have any nondeterminism that behaves like that, and it would have implications for suspending a program and resuming it on another architecture, or for distributing a program over multiple architectures.

### Vector subroutines

A call within a `vec_loop` could make sense if the callsite passes the `vec_len` value to the callee. Since `vec_len` is a type, it'd be part of the function signature, so implementations could compile the callee to be called from within a vector loop without whole-program analysis.

### Nested `vec_loop`s

Some architectures wouldn't be able to implement arbitrary nested vector parallelism, so one possible approach would be to prohibit it. Lexical nesting is easy to detect, and dynamic nesting -- a call inside a `vec_loop` calling a function that contains another `vec_loop` could be prohibited if we require calls in `vec_loop`s to have exactly one `vec_len` argument, and prohibit `vec_loop` inside functions with a `vec_len` argument.

### Masks, shuffles, extracts and inserts, strided loads/stores

There are ways this design could be extended to include these, but left them out to keep things simple to start with. It's mostly an independent question whether these can be implemented efficiently.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adaptive vector lengths without global state #21

New Types:

New Opcodes:

Example

Nondeterminism

Vector subroutines

Nested `vec_loop`s

Masks, shuffles, extracts and inserts, strided loads/stores

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Adaptive vector lengths without global state #21

Description

New Types:

New Opcodes:

Example

Nondeterminism

Vector subroutines

Nested vec_loops

Masks, shuffles, extracts and inserts, strided loads/stores

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Nested `vec_loop`s