Horizontal_add f64 - fix avx2, sse41, sse2, and neon #83

austinorr · 2025-08-26T06:41:23Z

I'm really enjoying this crate, especially the ergonomics of the unified op names in v2. I've been experimenting with the new API in a side project and I noticed that my results for S::Vf64 were double what I expected, and double what the S::Vf32 implementation returned for the same inputs. I managed to narrow it down to the horizontal_add operation. Here's a minimal reproduction of the issue:

cloned simdeez repo @ 3b78391
added the following to ./examples/f64.rs -- this untracked file is not part of this PR but made it convenient to try things.

// examples/f64.rs
use simdeez::prelude::*;
use simdeez::simd_runtime_generate;

simd_runtime_generate!(
    pub fn check_horizontal_add_f64() -> bool {
        let ones = S::Vf64::set1(1.into());
        let w = S::Vf64::WIDTH as f64;
        let resultf64 = S::Vf64::horizontal_add(ones);
        println!(
            "w: {w:.0?} ones: {ones:?} result: {resultf64} passes: {:?}",
            w == resultf64
        );
        w == resultf64
    }
);

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        println!("Checking x86_64 features:");
        println!("  SSE is supported: {:?}", is_x86_feature_detected!("sse"));
        println!(
            "  SSE2 is supported: {:?}",
            is_x86_feature_detected!("sse2")
        );
        println!("  AVX is supported: {:?}", is_x86_feature_detected!("avx"));
        println!(
            "  AVX2 is supported: {:?}",
            is_x86_feature_detected!("avx2")
        );
        println!(
            "  AVX-512F is supported: {:?}",
            is_x86_feature_detected!("avx512f")
        );
    }
    
    check_horizontal_add_f64();
    
}

I ran it with cargo run --example f64

Checking x86_64 features:
  SSE is supported: true
  SSE2 is supported: true
  AVX is supported: true
  AVX2 is supported: true
  AVX-512F is supported: false
w: 4 ones: F64x4([[1.0, 1.0, 1.0, 1.0]]) result: 8 passes: false

We fed the horizontal_add function four ones, and it returns 8!

I'm on a Dell laptop (intel i9/x86), and this is running on ubuntu via WSL2.

~/wsl-sources/simdeez$ rustc -vV
rustc 1.88.0 (6b00bc388 2025-06-23)
binary: rustc
commit-hash: 6b00bc3880198600130e1cf62b8f8a93494488cc
commit-date: 2025-06-23
host: x86_64-unknown-linux-gnu
release: 1.88.0
LLVM version: 20.1.5

I started out by expanding my test to include all the supported simd instruction sets using the new simd_unsafe_generate_all that is analogous to the v1 simd_runtime_generate which makes all the variants for us. I couldn't get the macro working as is, but it looks like there's a copy/paste typo in the 'invoking.rs' file macro(s). The first commit of this PR is just a guess at what this macro was intended to do, but it made it a lot more useful to me with the patch included here, and it didn't alter your much-appreciated safety preferences. If you choose not to pull this commit in, I do hope there will be a v2 macro that is able to generate all the variants like the v1 one, it's really useful for development and sanity checks.

After that macro patch, I can re-write the reprod to show us all the simd float results for horizontal_add.

// examples/f64.rs
use simdeez::prelude::*;
use simdeez::simd_unsafe_generate_all;

simd_unsafe_generate_all!(
    pub fn check_horizontal_add_f32() -> bool {
        let ones = S::Vf32::set1(1.0);
        let w = S::Vf32::WIDTH as f32;
        let resultf32 = S::Vf32::horizontal_add(ones);
        println!(
            "w: {w:.0?} ones: {ones:?} result: {resultf32} passes: {:?}",
            w == resultf32
        );

        w == resultf32
    }
);

simd_unsafe_generate_all!(
    pub fn check_horizontal_add_f64() -> bool {
        let ones = S::Vf64::set1(1.into());
        let w = S::Vf64::WIDTH as f64;
        let resultf64 = S::Vf64::horizontal_add(ones);
        println!(
            "w: {w:.0?} ones: {ones:?} result: {resultf64} passes: {:?}",
            w == resultf64
        );
        w == resultf64
    }
);

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        println!("Checking x86_64 features:");
        println!("  SSE is supported: {:?}", is_x86_feature_detected!("sse"));
        println!(
            "  SSE2 is supported: {:?}",
            is_x86_feature_detected!("sse2")
        );
        println!("  AVX is supported: {:?}", is_x86_feature_detected!("avx"));
        println!(
            "  AVX2 is supported: {:?}",
            is_x86_feature_detected!("avx2")
        );
        println!(
            "  AVX-512F is supported: {:?}",
            is_x86_feature_detected!("avx512f")
        );
    }
    unsafe {
        check_horizontal_add_f32_scalar();
        check_horizontal_add_f32_sse2();
        check_horizontal_add_f32_sse41();
        check_horizontal_add_f32_avx2();
        check_horizontal_add_f32();
        check_horizontal_add_f64_scalar();
        check_horizontal_add_f64_sse2();
        check_horizontal_add_f64_sse41();
        check_horizontal_add_f64_avx2();
        check_horizontal_add_f64();
    }
}

which yields:

Checking x86_64 features:
  SSE is supported: true
  SSE2 is supported: true
  AVX is supported: true
  AVX2 is supported: true
  AVX-512F is supported: false
w: 1 ones: F32x1([[1.0]]) result: 1 passes: true
w: 4 ones: F32x4([[1.0, 1.0, 1.0, 1.0]]) result: 4 passes: true
w: 4 ones: F32x4_41([[1.0, 1.0, 1.0, 1.0]]) result: 4 passes: true
w: 8 ones: F32x8([[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]) result: 8 passes: true
w: 8 ones: F32x8([[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]) result: 8 passes: true
w: 1 ones: F64x1([[1.0]]) result: 1 passes: true
w: 2 ones: F64x2([[1.0, 1.0]]) result: 4 passes: false
w: 2 ones: F64x2_41([[1.0, 1.0]]) result: 4 passes: false
w: 4 ones: F64x4([[1.0, 1.0, 1.0, 1.0]]) result: 8 passes: false
w: 4 ones: F64x4([[1.0, 1.0, 1.0, 1.0]]) result: 8 passes: false

This indicates that all the f64 simd implementations are exactly 2x the expected result.... but all the tests pass! What's going on?!

RUSTFLAGS='-C target-feature=+avx2,+fma' cargo test horizontal_add

running 40 tests
test tests::run::signed_horizontal_add_scalar_f32 ... ok
test tests::run::signed_horizontal_add_scalar_i32 ... ok
test tests::run::signed_horizontal_add_scalar_i16 ... ok
test tests::run::signed_horizontal_add_scalar_f64 ... ok
...
test tests::run::unsigned_horizontal_add_sse41_i8 ... ok
test tests::run::signed_horizontal_add_avx2_i8 ... ok
test tests::run::unsigned_horizontal_add_avx2_i8 ... ok

test result: ok. 40 passed; 0 failed; 0 ignored; 0 measured; 592 filtered out; finished in 0.16s

These tests use an 'equals' check or an 'almost equals' check deep in the test suite, snipped here:

{ //  some larger scope that has other eq checks
        // tests/numbers.rs line 189
        match precision {
            EqPrecision::Exact => self == other,
            EqPrecision::Almost { figs } => {
                let bigger = self.max(other);
                let norm_diff = (self / bigger) - (other / bigger);
                let epsilon = 10.0f32.powi(-(figs as i32));
                norm_diff < epsilon
            }
        }
}

The intent here is to scale the two values and do a relative check, but there's an issue: if self is ever set to something small -- or to zero -- and other is not (say it's mistakenly doubled), then norm_diff will always be negative. In fact, any time that other is larger than self, the norm_diff will be negative and this check will pass -- no matter the magnitude. I did a silly println! in the check closure to print each result so I could inspect them... and I found out that every test is actually 1000 tests, which is awesome, but for f64 many are 'not close' yet passing the test. Note that this check is implemented identically for the f32 tests as well.

The second commit of this PR includes a modification to this relative equality check, though I'd suggest a dev-dependency on is_close instead.

This change does two things: first an absolute difference check that's less than epsilon passes early to handle the small-self issue. Then we proceed with the normalized check, but which doesn't care about the sign so much since epsilon is always positive. FWIW, this was edited to be a minimal intervention (I'm an unknown contributor after all) and this PR is already touching way to many files. I'd gladly update this PR to use is_close for this check if adding a dev-dependency is acceptable.

Anyway, with the update to almost_eq, I'm now getting test errors for f64:

RUSTFLAGS='-C target-feature=+avx2,+fma' cargo test horizontal_add

running 40 tests
test tests::run::signed_horizontal_add_avx2_f64 ... FAILED
test tests::run::signed_horizontal_add_scalar_i16 ... ok
test tests::run::signed_horizontal_add_scalar_f64 ... ok
test tests::run::signed_horizontal_add_scalar_f32 ... ok
test tests::run::signed_horizontal_add_sse2_f64 ... FAILED
test tests::run::signed_horizontal_add_scalar_i32 ... ok
test tests::run::signed_horizontal_add_scalar_i8 ... ok
test tests::run::signed_horizontal_add_scalar_i64 ... ok
test tests::run::signed_horizontal_add_sse41_f64 ... FAILED
....
test tests::run::unsigned_horizontal_add_avx2_i8 ... ok
test tests::run::signed_horizontal_add_avx2_i8 ... ok

failures:

---- tests::run::signed_horizontal_add_avx2_f64 stdout ----

thread 'tests::run::signed_horizontal_add_avx2_f64' panicked at src/tests/lib/tester.rs:20:13:

Failed for (F64x4([[0.0, 1.0, -1.0, 0.5]]),): Failed: Expected sum to be 0.5, got 1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- tests::run::signed_horizontal_add_sse2_f64 stdout ----

thread 'tests::run::signed_horizontal_add_sse2_f64' panicked at src/tests/lib/tester.rs:20:13:

Failed for (F64x2([[0.0, 1.0]]),): Failed: Expected sum to be 1, got 2

---- tests::run::signed_horizontal_add_sse41_f64 stdout ----

thread 'tests::run::signed_horizontal_add_sse41_f64' panicked at src/tests/lib/tester.rs:20:13:

Failed for (F64x2_41([[0.0, 1.0]]),): Failed: Expected sum to be 1, got 2


failures:
    tests::run::signed_horizontal_add_avx2_f64
    tests::run::signed_horizontal_add_sse2_f64
    tests::run::signed_horizontal_add_sse41_f64

test result: FAILED. 37 passed; 3 failed; 0 ignored; 0 measured; 592 filtered out; finished in 0.16s

Note that, despite changing the almost_eq implementation significantly, the only failures are with simd implementations of f64s. Awesome.

The third commit in this PR includes adjustments to the avx2, sse2, sse41, and Neon implementations for f64 (I set up actions on my fork and also noticed that neon was affected by this issue too!), and now all the tests pass CI (including neon). The issue with the v2 implementations for f64 horizontal_add seemed to be too many instructions, as though twice as many f64s needed to be reduced for each instruction set. For this PR I did not spend time 'optimizing' these instruction sequences, my focus was on minimal intervention and correctness.

Thanks for maintaining this crate and for reviewing this PR, and apologies for the long-winded writeup!

austinorr · 2025-09-14T05:57:03Z

@arduano I know this is a long one, my apologies. The TLDR is that the tests for floats always pass if the other value is larger. This PR fixes both the tests (for all types) and the implementations (for f64).

Thanks again -- hopefully it's possible to approve the workflow to give some confidence that my suggested changes don't break your project and let me know if there's anything more I can do to support a review.

arduano · 2025-09-14T06:17:05Z

Interesting... Sorry for missing this PR earlier

Yeah looks good. I haven't had much time to contribute to this crate myself, but if you're interested in helping bring v2.0 across the finish line, we need to figure out what to do about sleef. Do we just deprecate it, or re-implement it with simdeez primitives, etc

arduano · 2025-09-14T06:20:13Z

I'll investigate the ARM test failure, looks like an issue on my end

arduano · 2025-09-14T06:46:25Z

Update from master please

austinorr · 2025-09-14T15:51:59Z

Update from master please

I’m traveling today so can’t rebase easily, but it occurs to me that we might be able to just retry the job. I think gha works on ‘as-merged’ prs rather than on ‘as-is’ prs and we might get lucky.

If not I can of course rebase and force push to trigger the tests again.

austinorr · 2025-09-14T16:23:19Z

we need to figure out what to do about sleef. Do we just deprecate it, or re-implement it with simdeez primitives, etc

This is a tough one. I have not yet used the sleef features (there’s still a lot of value in V2 without it), but from what I can see the crate we’ve used for the bindings has been abandoned. V2 might be the right release semantically to remove the feature/functionality (hopefully temporarily) until the community that relies on it can volunteer the time needed to add it back.

Planning to begin adding things back in 2.1+ also gives time to choose an approach (fork sleef-sys, roll our own, make them scalar only, etc) and get feedback from the folks who relied on it from version 1.

It’s a bummer to remove functionality, but this can be a good and clear way to ask for help, make a clean semantic break, and stay sane while supporting this project with the limited time we have.

austinorr · 2025-09-20T16:54:18Z

Update from master please

@arduano rebased, force-pushed, ready for another workflow approval

arduano

Looks good, I'll merge. I'll also look into releasing v2.0 today with the relevant mentions of temporary sleef depreciation

austinorr · 2025-09-21T14:45:39Z

Looks good, I'll merge. I'll also look into releasing v2.0 today with the relevant mentions of temporary sleef depreciation

If it’d be helpful, I’d be glad to do a review of the docs changes and announcement before the release.

arduano approved these changes Sep 14, 2025

View reviewed changes

austinorr added 3 commits September 20, 2025 09:51

fix macro fall-through calls

ac145c4

fix f32 & f64 EqPrecision::almost implementation, esp for near-zeros

235b5c6

fix avx, sse41, sse2, and neon horizontal add impl

df4fdc1

austinorr force-pushed the horizontal-add-fix-f64 branch from 8a330e5 to df4fdc1 Compare September 20, 2025 16:52

arduano approved these changes Sep 20, 2025

View reviewed changes

arduano merged commit ba62d09 into arduano:master Sep 20, 2025
4 checks passed

austinorr deleted the horizontal-add-fix-f64 branch September 20, 2025 23:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Horizontal_add f64 - fix avx2, sse41, sse2, and neon #83

Horizontal_add f64 - fix avx2, sse41, sse2, and neon #83

Uh oh!

austinorr commented Aug 26, 2025

Uh oh!

austinorr commented Sep 14, 2025

Uh oh!

arduano commented Sep 14, 2025

Uh oh!

arduano commented Sep 14, 2025

Uh oh!

arduano commented Sep 14, 2025

Uh oh!

austinorr commented Sep 14, 2025

Uh oh!

austinorr commented Sep 14, 2025

Uh oh!

austinorr commented Sep 20, 2025

Uh oh!

arduano left a comment

Uh oh!

Uh oh!

austinorr commented Sep 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Horizontal_add f64 - fix avx2, sse41, sse2, and neon #83

Horizontal_add f64 - fix avx2, sse41, sse2, and neon #83

Uh oh!

Conversation

austinorr commented Aug 26, 2025

Uh oh!

austinorr commented Sep 14, 2025

Uh oh!

arduano commented Sep 14, 2025

Uh oh!

arduano commented Sep 14, 2025

Uh oh!

arduano commented Sep 14, 2025

Uh oh!

austinorr commented Sep 14, 2025

Uh oh!

austinorr commented Sep 14, 2025

Uh oh!

austinorr commented Sep 20, 2025

Uh oh!

arduano left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

austinorr commented Sep 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants