(perf): Dynamic Selective Rewind & Typed-Fallback Optimization #17
Merged
Add safety-lock tests and desired-behavior (RED) tests for the dynamic selective rewind feature (use_typed=false path optimization).

Characterization tests (GREEN - lock current invariants):
- _acquire_impl! bypasses _mark_untracked! by design
- full checkpoint! eagerly saves all 8 typed pools
- parent state preserved across child scope rewind
- others-type (UInt8) sets has_others flag correctly
- empty scope round-trips cleanly

RED tests (fail until Phase 2+3 implement the feature):
- _depth_only_checkpoint! function does not exist yet
- depth-only checkpoint should not eagerly save typed pools
- lazy first-touch checkpoint on acquire! in dynamic mode
- macro use_typed=false should emit _depth_only_checkpoint!
- macro use_typed=false should emit _selective_rewind_fixed_slots!

Key finding: _transform_acquire_calls runs unconditionally (even for use_typed=false), so all acquire! → _acquire_impl! and _mark_untracked! is bypassed. Phase 3 must skip the transformation for dynamic mode.
…ive mode

Add _depth_only_checkpoint! and _selective_rewind_fixed_slots! to state.jl, and extend _mark_untracked! with lazy first-touch checkpoint in acquire.jl.

_depth_only_checkpoint! (state.jl):
- Lightweight enter: increments depth + pushes bitmask sentinels only
- Sets bit 15 in _untracked_fixed_masks as "dynamic-selective" mode flag
- Eagerly checkpoints pre-existing others entries (lazy is not feasible for non-fixed-slot types without per-type tracking)
- ~2ns vs ~540ns for full checkpoint!

_selective_rewind_fixed_slots! (state.jl):
- Rewinds only the 8 fixed-slot pools whose bits are set in mask
- Each bit maps to the same encoding as _fixed_slot_bit (bits 0-7)
- Callers must strip bit 15 (mode flag) before passing mask

_mark_untracked! (acquire.jl):
- AdaptiveArrayPool-specific override adds lazy first-touch checkpoint
- On first acquire of each fixed-slot type T in dynamic mode (bit 15 set): saves current n_active BEFORE the acquire so rewind restores parent state
- Without lazy checkpoint, Case B in _rewind_typed_pool! would restore from a stale parent checkpoint rather than the true pre-scope value
- Second and subsequent acquires of same type skip the lazy checkpoint
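The interplay of depth-only checkpoint, lazy first-touch save, and selective rewind can be sketched with a toy model. This is not the package's implementation: `ToyPool`, `depth_only_checkpoint!`, `toy_acquire!`, and `selective_rewind!` are hypothetical names, slots are plain integers 0-7 rather than element types, and the others path is omitted. It assumes every acquire happens inside at least one scope.

```julia
# Toy model only: one counter per fixed slot, not the real AdaptiveArrayPool.
mutable struct ToyPool
    n_active::Vector{Int}        # active count per fixed slot (8 slots)
    saved::Vector{Vector{Int}}   # per-slot stack of saved n_active values
    masks::Vector{UInt16}        # per-depth bitmask of touched slots
end

ToyPool() = ToyPool(zeros(Int, 8), [Int[] for _ in 1:8], UInt16[])

# Scope entry: push a mask carrying only the mode flag (bit 15); no per-slot saves.
depth_only_checkpoint!(p::ToyPool) = (push!(p.masks, 0x8000); p)

# Acquire from `slot` (0-7): save n_active on the first touch within this scope.
function toy_acquire!(p::ToyPool, slot::Int)
    bit = UInt16(1) << slot
    if (p.masks[end] & 0x8000) != 0 && (p.masks[end] & bit) == 0
        push!(p.saved[slot + 1], p.n_active[slot + 1])  # lazy first-touch save
    end
    p.masks[end] |= bit
    p.n_active[slot + 1] += 1
    return p
end

# Scope exit: restore only the slots whose bits were set at this depth.
function selective_rewind!(p::ToyPool)
    mask = pop!(p.masks) & 0x00ff   # strip the mode flag before testing slot bits
    for slot in 0:7
        if (mask & (UInt16(1) << slot)) != 0
            p.n_active[slot + 1] = pop!(p.saved[slot + 1])
        end
    end
    return p
end
```

Because the save happens before the counter is incremented, a child scope that touches a slot already used by its parent restores the parent's count on exit, which is the invariant the commit message describes.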
…tive mode

Route use_typed=false paths through _depth_only_checkpoint! + _dynamic_selective_rewind! instead of full checkpoint/rewind. Avoids ~1080ns overhead when macro cannot extract static types (local vars, similar!, eltype(arr) patterns).

Key changes:
- macros.jl: emit _depth_only_checkpoint!/_dynamic_selective_rewind! for use_typed=false; disable _transform_acquire_calls in dynamic mode so _mark_untracked! is called via public acquire! wrappers (prevents n_active leaks)
- state.jl: add _dynamic_selective_rewind! as standalone @inline function (avoids let-block boxing in finally clauses that caused 1152B allocation)

Test additions:
- test_macro_expansion.jl: GREEN assertions for dynamic path; negative guards confirming AdaptiveArrayPools.checkpoint!/rewind! are NOT emitted in use_typed=false expansions
- test_macro_internals.jl: 7 new runtime n_active cleanup tests using internal APIs directly with fresh pools (nested scopes, similar!, mixed static+dynamic types)
- test_allocation.jl: extra warmup call for test-order robustness (N-way bitarray cache state from earlier tests caused alloc2==1152 in full suite; alloc3 was 0)
- test_backend_macro_expansion.jl: update stale _selective_rewind_fixed_slots! assertion to _dynamic_selective_rewind!
Replace full checkpoint/rewind (~1080ns) in the use_typed=true fallback path with typed checkpoint + selective rewind (~N×9ns per touched type).

Key changes:
- src/acquire.jl: extend lazy-checkpoint condition from bit 15 only (0x8000) to bit 14 OR bit 15 (0xC000), enabling lazy first-touch checkpoint in typed lazy mode for extra types touched by helpers
- src/state.jl: add _typed_checkpoint_with_lazy! (typed checkpoint + set bit 14) and _typed_selective_rewind! (rewind tracked|untracked mask)
- src/macros.jl: update _generate_typed_checkpoint_call/_generate_typed_rewind_call false branches from full checkpoint!/rewind! to the new helpers
- ext/.../state.jl: CUDA parity for both new helpers using direct field access (foreach_fixed_slot has no bit-yielding variant)
- tests: RED→GREEN coverage for bit 14 semantics, P0 safety regression (parent n_active preserved for extra types), and expansion assertions

Bit encoding:
- bit 15 (0x8000): dynamic selective mode (_depth_only_checkpoint!)
- bit 14 (0x4000): typed lazy mode (_typed_checkpoint_with_lazy!)
- bits 0-7: fixed-slot type bits (_mark_untracked!)
- bits 8-13: reserved

Safety: bit 14 ensures extra types get lazy first-touch checkpoint (Case A at rewind), preventing Case B from incorrectly restoring parent n_active from the sentinel value 0.
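The bit encoding above can be written down as a few masks. The sketch below is illustrative only: the constant and function names (`DYNAMIC_MODE_BIT`, `TYPED_LAZY_BIT`, `MODE_MASK`, `FIXED_SLOT_MASK`, `lazy_mode_active`, `fixed_slot_bits`) are assumptions and do not appear in the PR; only the numeric values come from the commit message.

```julia
# Hypothetical constants mirroring the bit layout described in the commit message.
const DYNAMIC_MODE_BIT = 0x8000   # bit 15: dynamic selective mode
const TYPED_LAZY_BIT   = 0x4000   # bit 14: typed lazy mode
const MODE_MASK        = 0xc000   # bit 14 OR bit 15
const FIXED_SLOT_MASK  = 0x00ff   # bits 0-7: fixed-slot type bits

# True when either lazy mode is active, i.e. first-touch checkpoints apply.
lazy_mode_active(mask::UInt16) = (mask & MODE_MASK) != 0

# Drop the mode flags and the reserved bits 8-13, leaving only the slot bits.
fixed_slot_bits(mask::UInt16) = mask & FIXED_SLOT_MASK

# Example: typed lazy mode is on and the slot at bit 2 has been touched.
example_mask = TYPED_LAZY_BIT | (UInt16(1) << 2)
```

Testing against `0xc000` rather than `0x8000` is what extends the lazy-checkpoint condition from dynamic mode alone to both modes.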
…_lazy!

Without this, a child scope using _typed_checkpoint_with_lazy! (typed-fallback path) would skip snapshotting pre-existing pool.others entries (e.g. CPU Float16, UInt8). If a helper then re-acquired the same type, _typed_selective_rewind! would hit Case B (no checkpoint at depth) and restore the wrong sentinel value, corrupting the parent's n_active.

Fixes (src/state.jl):
- _typed_checkpoint_with_lazy! now iterates pool.others and snapshots each entry that is not already checkpointed at the current depth (avoiding a double-push for types explicitly listed in types..., e.g. Float16).
- Sets _untracked_has_others[d] = true whenever pool.others is non-empty, so _typed_selective_rewind! enters the others loop even when no helper called _mark_untracked! (e.g. when Float16 is a tracked type and _acquire_impl! bypasses the untracked recording path).

Also clarifies the isempty(types) fallback comment in _generate_typed_checkpoint_call and _generate_typed_rewind_call (src/macros.jl) to make it clear these branches exist for direct external callers (test_coverage.jl), not macro-generated code.

Tests (test/test_state.jl):
- Moved _typed_checkpoint_with_lazy! import to file-level for shared access.
- Added "Phase 5 (Issue #3): typed lazy mode preserves parent n_active for others types" to cover the UInt8 others-type parent-preservation scenario.
…s with Float16 tracking

Adds full CUDA parity for the Phase 3 dynamic-selective and Phase 5 typed-fallback optimizations, including correct handling of Float16 which is a direct struct field on CuAdaptiveArrayPool (unlike CPU where it lives in pool.others).

Dynamic-selective mode (ext/state.jl):
- _depth_only_checkpoint! for CuAdaptiveArrayPool: sets bit 15, eagerly snapshots pool.others, and relies on _mark_untracked! bit-7 for lazy Float16 tracking.
- _dynamic_selective_rewind! for CuAdaptiveArrayPool: dispatches on bits 0-7 of the untracked mask (bit 7 = Float16 on CUDA), then handles pool.others.

Typed-fallback updates (ext/state.jl):
- _typed_checkpoint_with_lazy!: now eagerly snapshots pool.others entries (same fix as CPU side; avoids Case B at rewind for pre-existing others-type acquires).
- _typed_selective_rewind!: adds depth-check fallback for Float16: since _tracked_mask_for_types(Float16)==0 and _acquire_impl! bypasses _mark_untracked!, neither tracked_mask nor untracked bit 7 is set for a tracked Float16 type. The depth check detects "Float16 was checkpointed at this depth" (by _typed_checkpoint_with_lazy! → checkpoint!(pool, Float16)) and ensures the pool is rewound, preserving the parent scope's float16.n_active.

CUDA _mark_untracked! override (ext/acquire.jl):
- Float16 on CUDA is a direct field with _fixed_slot_bit(Float16)=0. Overrides the base AbstractArrayPool _mark_untracked! to route Float16 through bit 7 (unused on CUDA; CPU uses bit 7 for the Bit type which has no GPU equivalent).
- Gives Float16 the same lazy first-touch checkpoint behavior (bit 14 OR bit 15 check) as other fixed-slot types, ensuring Case A fires at rewind and parent n_active is preserved. Genuine others types (UInt8, Int8, etc.) fall through to has_others flag.
… in selective rewind

Replace `mask & (UInt16(1) << n) != 0` with `_has_bit(mask, TypeName)` across _selective_rewind_fixed_slots! (CPU), _dynamic_selective_rewind! (CUDA), and _typed_selective_rewind! (CUDA). CUDA Float16 (bit 7 reassignment) uses `_cuda_float16_bit()` directly since _fixed_slot_bit(Float16) == 0.

Zero runtime cost: _has_bit is @inline and _fixed_slot_bit returns compile-time constants, so the compiler folds them identically to the original bit operations.
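A minimal sketch of what a type-keyed bit check can look like follows. The names `slot_bit` and `has_bit` are stand-ins for `_fixed_slot_bit` and `_has_bit`, and the slot assignment shown (Float64 at bit 0, Float32 at bit 1, Int64 at bit 2) is an assumption made for illustration; the package's actual mapping may differ.

```julia
# Hypothetical slot assignment for illustration; the real mapping may differ.
slot_bit(::Type{Float64}) = UInt16(1) << 0
slot_bit(::Type{Float32}) = UInt16(1) << 1
slot_bit(::Type{Int64})   = UInt16(1) << 2
slot_bit(::Type)          = UInt16(0)   # types without a fixed slot have no bit

# Named check replacing the raw form `mask & (UInt16(1) << n) != 0`.
@inline has_bit(mask::UInt16, ::Type{T}) where {T} = (mask & slot_bit(T)) != 0

# Example mask with the Float64 and Int64 slots touched.
touched = slot_bit(Float64) | slot_bit(Int64)
```

Since each `slot_bit` method returns a constant for a concrete type, a call such as `has_bit(touched, Float64)` gives the compiler the same information as the hand-written shift. A type whose bit is 0 never matches, which is consistent with CUDA Float16 needing its own bit function.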
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #17 +/- ##
==========================================
- Coverage 97.08% 96.46% -0.62%
==========================================
Files 9 9
Lines 1200 1273 +73
==========================================
+ Hits 1165 1228 +63
- Misses 35 45 +10
Pull request overview
This PR optimizes @with_pool scope management by introducing dynamic selective checkpoint/rewind for the use_typed=false path and a typed-lazy fallback for the use_typed=true path to avoid eager checkpoint/rewind of all fixed slots when unnecessary.
Changes:
- Add _depth_only_checkpoint! + _dynamic_selective_rewind! and update macro codegen to use them for dynamic (use_typed=false) scopes.
- Add _typed_checkpoint_with_lazy! + _typed_selective_rewind! and update typed fallback codegen to selectively rewind tracked + lazily checkpointed extra types.
- Mirror the above behavior for CUDA pools and expand/update tests to validate runtime behavior + macro expansion.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/state.jl | Implements depth-only checkpoint, dynamic selective rewind, typed-lazy checkpoint, and typed selective rewind. |
| src/acquire.jl | Extends _mark_untracked! to support lazy first-touch checkpointing for fixed-slot types in bit-14/15 modes. |
| src/macros.jl | Updates macro codegen to use dynamic selective mode when static types can’t be extracted; updates typed fallback to typed-lazy helpers. |
| src/types.jl | Adds _has_bit(mask, T) helper for clearer bitmask checks. |
| ext/AdaptiveArrayPoolsCUDAExt/state.jl | CUDA parity for dynamic selective + typed-lazy rewind logic (incl. Float16 bit handling). |
| ext/AdaptiveArrayPoolsCUDAExt/acquire.jl | CUDA _mark_untracked! override to support lazy first-touch checkpoints (incl. Float16 special-case). |
| test/test_state.jl | Adds/updates state-layer tests for dynamic selective + typed-lazy behavior and safety invariants. |
| test/test_macro_internals.jl | Adds runtime correctness tests for dynamic selective rewind + typed-lazy behavior. |
| test/test_macro_expansion.jl | Updates and adds macro-expansion assertions for new generated calls. |
| test/test_backend_macro_expansion.jl | Updates CUDA backend macro-expansion assertions for dynamic selective mode. |
| test/test_allocation.jl | Adjusts allocation test warmup assumptions to stabilize “hot path” zero-allocation checks. |
… rewind modes

- CPU/CUDA _depth_only_checkpoint!: set _untracked_has_others=true when eagerly checkpointing pre-existing others entries, so _dynamic_selective_rewind! enters the others loop and pops the checkpoint (prevents unbounded stack leak in loops)
- CPU/CUDA _mark_untracked!: add _checkpoint_depths[end] != depth guard before lazy _checkpoint_typed_pool!, preventing double-push when a tracked type is also acquired by a helper via acquire! (restores correct parent n_active on rewind)
- CUDA state.jl: import _has_bit (was used 14 times without import → UndefVarError)
- CUDA _typed_checkpoint_with_lazy!: add double-checkpoint guard and has_others flag, matching CPU version parity
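The double-push guard can be illustrated with a toy typed pool. Everything here is a simplified assumption, not the package's code: `ToyTypedPool`, `checkpoint_once!`, and `toy_rewind!` are hypothetical names, and the stacks start with a sentinel entry at depth 0 so that `depths[end]` is always defined.

```julia
# Toy typed pool: parallel stacks of saved n_active values and the depths that saved them.
mutable struct ToyTypedPool
    n_active::Int
    saved::Vector{Int}
    depths::Vector{Int}
end

ToyTypedPool() = ToyTypedPool(0, [0], [0])   # sentinel entry at depth 0

# Save n_active for `depth` unless this pool was already checkpointed at that depth.
function checkpoint_once!(tp::ToyTypedPool, depth::Int)
    if tp.depths[end] != depth      # the guard: skip a second push at the same depth
        push!(tp.saved, tp.n_active)
        push!(tp.depths, depth)
    end
    return tp
end

# Pop the entry for `depth`, if any, and restore the value saved at scope entry.
function toy_rewind!(tp::ToyTypedPool, depth::Int)
    if tp.depths[end] == depth
        pop!(tp.depths)
        tp.n_active = pop!(tp.saved)
    end
    return tp
end
```

Without the guard, a helper's lazy first-touch checkpoint at the same depth would push a second, later value of `n_active`; a single pop at rewind would then restore that later value instead of the parent's.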
mgyoo86 referenced this pull request Feb 18, 2026
Summary
This PR introduces dynamic selective rewind (a major optimization for the `use_typed=false` macro path) and a typed-fallback selective rewind for the `use_typed=true` path when helper functions touch extra types beyond the statically known set.

Instead of checkpointing and rewinding all 8 typed pools on every scope entry/exit, only the pools actually touched during execution are saved and restored, reducing overhead from ~1080ns (full checkpoint/rewind) to ~N×9ns for N touched types.

What Changed
Core feature - Dynamic selective mode (Phases 1–3):
- `_depth_only_checkpoint!`: lightweight scope entry that defers per-type snapshots
- `_mark_untracked!`: lazy first-touch checkpoint triggered on first `acquire!` of each type
- `_dynamic_selective_rewind!`: rewinds only the pools whose bitmask bits were set
- `use_typed=false` now emits selective checkpoint/rewind instead of full `checkpoint!`/`rewind!`

Typed-fallback optimization (Phase 5):
- `_typed_checkpoint_with_lazy!`: combines typed checkpoint for known types + bit 14 flag for lazy extras
- `_typed_selective_rewind!`: rewinds tracked types + any lazily-checkpointed untracked types
- Replaces full `checkpoint!`/`rewind!` in the `use_typed=true` fallback path

Bug fix - others-type snapshot (standalone fix):
- `_typed_checkpoint_with_lazy!` now eagerly snapshots pre-existing `pool.others` entries to prevent parent `n_active` corruption when helpers re-acquire the same others-type

CUDA parity:
- Both optimizations are mirrored for `CuAdaptiveArrayPool`, including Float16 handling (bit 7 reassignment since CPU's Bit type has no GPU equivalent)

Code quality:
- `_has_bit(mask, T)` helper replaces raw bitmask literals across all selective rewind paths