feat: Add Geometric Sparse Attention (AETHER) #134
Summary
This PR adds `GeometricSparseAttention`, a new modular layer that enables data-dependent sparse attention using geometric upper bounds. Unlike static sparse patterns (e.g., Sliding Window, BigBird), this layer uses AETHER (Adaptive Event-driven Threshold Hybrid Entangled Rendering) logic to dynamically prune key blocks at runtime based on the Cauchy-Schwarz inequality.
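The pruning decision can be illustrated with a short, self-contained sketch (a hypothetical helper, not the layer's actual implementation): keys are grouped into blocks, each block is summarized by a centroid and radius, and a block is kept only when its geometric score bound clears a threshold.

```python
import jax.numpy as jnp

def block_prune_mask(q, keys, block_size, tau):
    """Boolean mask over key blocks: True = must attend, False = provably prunable.

    q:    (d,) query vector
    keys: (num_blocks * block_size, d) key matrix
    tau:  pruning threshold
    """
    d = keys.shape[-1]
    blocks = keys.reshape(-1, block_size, d)                             # (B, S, d)
    mu = blocks.mean(axis=1)                                             # centroids mu_B, (B, d)
    r = jnp.linalg.norm(blocks - mu[:, None, :], axis=-1).max(axis=1)    # radii r_B, (B,)
    # Cauchy-Schwarz upper bound: max_{k in B} q.k <= q.mu_B + ||q|| * r_B
    upper = mu @ q + jnp.linalg.norm(q) * r                              # (B,)
    return upper >= tau                                                  # keep blocks whose bound clears tau
```

Blocks whose bound falls below `tau` cannot contain any key scoring above the threshold, so skipping them cannot change which keys survive.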
Mathematical Guarantee
The pruning is safe because it relies on the geometric upper bound

$$\max_{k \in B} (q \cdot k) \le q \cdot \mu_B + \|q\| \cdot r_B$$

where $\mu_B$ and $r_B$ are the centroid and radius of key block $B$. If this upper bound is below the threshold $\tau$, the entire block $B$ can be skipped with mathematical certainty that no high-scoring keys exist within it.
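The bound follows from the Cauchy-Schwarz inequality: $q \cdot k = q \cdot \mu_B + q \cdot (k - \mu_B) \le q \cdot \mu_B + \|q\|\,\|k - \mu_B\| \le q \cdot \mu_B + \|q\| \cdot r_B$. A standalone numerical sanity check (not part of the PR's code):

```python
import jax
import jax.numpy as jnp

q = jax.random.normal(jax.random.PRNGKey(0), (64,))            # query
block = jax.random.normal(jax.random.PRNGKey(1), (128, 64))    # one key block B

mu = block.mean(axis=0)                                        # centroid mu_B
r = jnp.linalg.norm(block - mu, axis=-1).max()                 # radius r_B

exact = (block @ q).max()                                      # max_{k in B} q . k
bound = q @ mu + jnp.linalg.norm(q) * r                        # geometric upper bound

assert exact <= bound + 1e-4                                   # the bound always holds
```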
Key Features
- Drop-in integration: existing attention layers can be replaced via `pz.select().at_instances_of(pz.nn.Attention).apply(...)` (see the usage sketch after this list).
- Adaptive thresholding: `epsilon` and `phi` state parameters that self-tune the sparsity level based on input entropy.
- Full `NamedArray` support and Treescope visualization.
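A hedged sketch of the drop-in path from the first bullet. The import location and constructor arguments of `GeometricSparseAttention` are assumptions for illustration; the selection chain itself is the standard Penzai pattern named above.

```python
from penzai import pz

# Hypothetical import path; adjust to wherever this PR places the layer.
from penzai.nn.geometric_attention import GeometricSparseAttention

# `model` is any existing Penzai model containing pz.nn.Attention layers.
sparse_model = (
    pz.select(model)
    .at_instances_of(pz.nn.Attention)
    .apply(
        # Constructor arguments are illustrative, not the confirmed signature.
        lambda attn: GeometricSparseAttention(block_size=64)
    )
)
```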
Verification

- Added `tests/nn/geometric_attention_test.py` with 13 comprehensive tests.
- Confirmed that JAX transformations (`jit`, `vmap`) work correctly.
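For context, the kind of transform-compatibility check such a suite typically includes (an illustrative sketch, not one of the 13 tests): the bound computation is a pure function, so `jit` and `vmap` compose over it directly.

```python
import jax
import jax.numpy as jnp

def upper_bound(q, block):
    """Geometric upper bound on max_{k in B} q . k for one key block."""
    mu = block.mean(axis=0)
    r = jnp.linalg.norm(block - mu, axis=-1).max()
    return q @ mu + jnp.linalg.norm(q) * r

qs = jax.random.normal(jax.random.PRNGKey(0), (8, 64))       # batch of queries
block = jax.random.normal(jax.random.PRNGKey(1), (128, 64))  # one key block

# jit-compile a vmapped version over the query batch dimension.
batched = jax.jit(jax.vmap(upper_bound, in_axes=(0, None)))
bounds = batched(qs, block)
assert bounds.shape == (8,)
```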