-
Notifications
You must be signed in to change notification settings - Fork 27
Horizontal_add f64 - fix avx2, sse41, sse2, and neon #83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@arduano I know this is a long one, my apologies. The TLDR is that the tests for floats always pass if the other value is larger. This PR fixes both the tests (for all types) and the implementations (for f64). Thanks again -- hopefully it's possible to approve the workflow to give some confidence that my suggested changes don't break your project and let me know if there's anything more I can do to support a review. |
|
Interesting... Sorry for missing this PR earlier Yeah looks good. I haven't had much time to contribute to this crate myself, but if you're interested in helping bring v2.0 across the finish line, we need to figure out what to do about sleef. Do we just deprecate it, or re-implement it with simdeez primitives, etc |
|
I'll investigate the ARM test failure, looks like an issue on my end |
|
Update from master please |
I’m traveling today so can’t rebase easily, but it occurs to me that we might be able to just retry the job. I think gha works on ‘as-merged’ prs rather than on ‘as-is’ prs and we might get lucky. If not I can of course rebase and force push to trigger the tests again. |
This is a tough one. I have not yet used the sleef features (there’s still a lot of value in V2 without it), but from what I can see the crate we’ve used for the bindings has been abandoned. V2 might be the right release semantically to remove the feature/functionality (hopefully temporarily) until the community that relies on it can volunteer the time needed to add it back. Planning to begin adding things back in 2.1+ also gives time to choose an approach (fork sleef-sys, roll our own, make them scalar only, etc) and get feedback from the folks who relied on it from version 1. It’s a bummer to remove functionality, but this can be a good and clear way to ask for help, make a clean semantic break, and stay sane while supporting this project with the limited time we have. |
8a330e5 to
df4fdc1
Compare
@arduano rebased, force-pushed, ready for another workflow approval |
arduano
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I'll merge. I'll also look into releasing v2.0 today with the relevant mentions of temporary sleef depreciation
If it’d be helpful, I’d be glad to do a review of the docs changes and announcement before the release. |
I'm really enjoying this crate, especially the ergonomics of the unified op names in v2. I've been experimenting with the new API in a side project and I noticed that my results for S::Vf64 were double what I expected, and double what the S::Vf32 implementation returned for the same inputs. I managed to narrow it down to the horizontal_add operation. Here's a minimal reproduction of the issue:
cloned
simdeezrepo @ 3b78391added the following to ./examples/f64.rs -- this untracked file is not part of this PR but made it convenient to try things.
I ran it with
cargo run --example f64We fed the horizontal_add function four ones, and it returns 8!
I'm on a Dell laptop (intel i9/x86), and this is running on ubuntu via WSL2.
~/wsl-sources/simdeez$ rustc -vV rustc 1.88.0 (6b00bc388 2025-06-23) binary: rustc commit-hash: 6b00bc3880198600130e1cf62b8f8a93494488cc commit-date: 2025-06-23 host: x86_64-unknown-linux-gnu release: 1.88.0 LLVM version: 20.1.5I started out by expanding my test to include all the supported simd instruction sets using the new
simd_unsafe_generate_allthat is analogous to the v1simd_runtime_generatewhich makes all the variants for us. I couldn't get the macro working as is, but it looks like there's a copy/paste typo in the 'invoking.rs' file macro(s). The first commit of this PR is just a guess at what this macro was intended to do, but it made it a lot more useful to me with the patch included here, and it didn't alter your much-appreciated safety preferences. If you choose not to pull this commit in, I do hope there will be a v2 macro that is able to generate all the variants like the v1 one, it's really useful for development and sanity checks.After that macro patch, I can re-write the reprod to show us all the simd float results for horizontal_add.
which yields:
This indicates that all the f64 simd implementations are exactly 2x the expected result.... but all the tests pass! What's going on?!
These tests use an 'equals' check or an 'almost equals' check deep in the test suite, snipped here:
The intent here is to scale the two values and do a relative check, but there's an issue: if
selfis ever set to something small -- or to zero -- andotheris not (say it's mistakenly doubled), thennorm_diffwill always be negative. In fact, any time thatotheris larger than self, thenorm_diffwill be negative and this check will pass -- no matter the magnitude. I did a sillyprintln!in the check closure to print each result so I could inspect them... and I found out that every test is actually 1000 tests, which is awesome, but for f64 many are 'not close' yet passing the test. Note that this check is implemented identically for the f32 tests as well.The second commit of this PR includes a modification to this relative equality check, though I'd suggest a dev-dependency on
is_closeinstead.This change does two things: first an absolute difference check that's less than epsilon passes early to handle the small-self issue. Then we proceed with the normalized check, but which doesn't care about the sign so much since epsilon is always positive. FWIW, this was edited to be a minimal intervention (I'm an unknown contributor after all) and this PR is already touching way to many files. I'd gladly update this PR to use
is_closefor this check if adding a dev-dependency is acceptable.Anyway, with the update to
almost_eq, I'm now getting test errors for f64:Note that, despite changing the
almost_eqimplementation significantly, the only failures are with simd implementations of f64s. Awesome.The third commit in this PR includes adjustments to the avx2, sse2, sse41, and Neon implementations for f64 (I set up actions on my fork and also noticed that neon was affected by this issue too!), and now all the tests pass CI (including neon). The issue with the v2 implementations for f64 horizontal_add seemed to be too many instructions, as though twice as many f64s needed to be reduced for each instruction set. For this PR I did not spend time 'optimizing' these instruction sequences, my focus was on minimal intervention and correctness.
Thanks for maintaining this crate and for reviewing this PR, and apologies for the long-winded writeup!