-
Notifications
You must be signed in to change notification settings - Fork 7
Description
Context: #4, #20. Global state is problematic, but even if we make it explicit, the ability to have arbitrary length switches is hard to implement on many architectures. The following is a sketch of a design which avoids both of these, while preserving some key advantages of flexible vectors.
Similar to the explicit loop variant of @aardappel's post here, it has a vec_loop, but this proposal is lower-level, making state explicit and avoiding unnecessary restrictions, while still hiding information as needed by implementations. I haven't prototyped this, but in theory it should be implementable on RISC-V V, SVE, AVX-512, as well as simd128-style architectures.
Implementations with dynamic vectors or masking could use them. But implementations could also chose to emit multiple loops, to handle alignments or remainders. Such implementations could also chose to run some vector loops at twice or more the hardware length, making a register-pressure vs. speed tradeoff as they see fit.
New Types:
vec<n>: a flexible vector of element bitwidthnvec_len<n>: a scalar but opaque element count, limited tonas the widest bitwidth
New Opcodes:
-
vec_loop<vl, n>: likeloopbutvl: immediate identifying a local variable to set to the number of elements being processed within an iteration of the loop, with typevec_len<n>. Using this local outside of avec_loopis prohibited.n: immediate which declares the bitwidth of the widest element type which will be processed in the loop- has one operand, the total number of elements the program wants to process in the loop
-
vec_step<vl, scale>:(i32) -> (i32)vl: immediate identifying avec_lenlocal variablescale: immediate scaling factor- computes
operand + vl * scale
-
vec_load<vl, n>,vec_store<vl, n>: likeloadand store` butvl: immediate identifying avec_lenlocal variablen: immediate specifying the vector element bitwidth to use
-
The arithmetic operations from simd128, with a similar
vlimmediate added.
Example
Add two arrays of length $n starting at $A and $B and store the result in an array starting at $C.
(local $A i32) (local $B i32) (local $C i32) (local $n i32)
(local $vl vec_len.32)
...
local.get $n ;; number of elements in array to process (passing zero is ok!)
vec_loop 32 $vl ;; start vector loop processing (at most) 32-bit elements
(local.set $t0 (vec_load $vl 32 (local.get $A)))
(local.set $t1 (vec_load $vl 32 (local.get $B)))
(local.set $t2 (vec_add $vl (local.get $t0) (local.get $t1)))
(vec_store $vl 32 (local.get $C) (local.get $t2))
(local.set $A (vec_step $vl 4 (local.get $A)))
(local.set $B (vec_step $vl 4 (local.get $B)))
(local.set $C (vec_step $vl 4 (local.get $C)))
(local.set $n (vec_step $vl -1 (local.get $n)))
(br_if 0 (local.get $n) (local.get $n)) ;; pass the count back to the top
end ;; end vector loop
Nondeterminism
The length in each iteration of a vec_loop is nondeterministic. It's visible in vec_step, in the number of elements loaded and stored in vec_load and vec_store, and in any side effects of the loop. It's expensive to completely hide the hardware vector length with any flexible-vector proposal, so whether or not there should be nondeterminism is an independent question.
One thing this proposal does do is avoid having nondeterminism which has to be constant for the duration of the program. Wasm doesn't currently have any nondeterminism that behaves like that, and it would have implications for suspending a program and resuming it on another architecture, or for distributing a program over multiple architectures.
Vector subroutines
A call within a vec_loop could make sense if the callsite passes the vec_len value to the callee. Since vec_len is a type, it'd be part of the function signature, so implementations could compile the callee to be called from within a vector loop without whole-program analysis.
Nested vec_loops
Some architectures wouldn't be able to implement arbitrary nested vector parallelism, so one possible approach would be to prohibit it. Lexical nesting is easy to detect, and dynamic nesting -- a call inside a vec_loop calling a function that contains another vec_loop could be prohibited if we require calls in vec_loops to have exactly one vec_len argument, and prohibit vec_loop inside functions with a vec_len argument.
Masks, shuffles, extracts and inserts, strided loads/stores
There are ways this design could be extended to include these, but left them out to keep things simple to start with. It's mostly an independent question whether these can be implemented efficiently.