(perf): Dynamic Selective Rewind & Typed-Fallback Optimization #17

Merged
mgyoo86 merged 8 commits into master from perf/more_optimized
Feb 18, 2026
Conversation

mgyoo86 (Collaborator) commented Feb 18, 2026

Summary

This PR introduces dynamic selective rewind — a major optimization for the use_typed=false macro path — and a typed-fallback selective rewind for the use_typed=true path when helper functions touch extra types beyond the statically known set.

Instead of checkpointing and rewinding all 8 typed pools on every scope entry/exit, only the pools actually touched during execution are saved and restored, reducing overhead from ~1080ns (full checkpoint/rewind) to roughly 9ns per touched type.

What Changed

Core feature — Dynamic selective mode (Phases 1–3):

  • _depth_only_checkpoint!: lightweight scope entry that defers per-type snapshots
  • _mark_untracked!: lazy first-touch checkpoint triggered on first acquire! of each type
  • _dynamic_selective_rewind!: rewinds only the pools whose bitmask bits were set
  • Macro integration: use_typed=false now emits selective checkpoint/rewind instead of full checkpoint!/rewind!

Typed-fallback optimization (Phase 5):

  • _typed_checkpoint_with_lazy!: combines typed checkpoint for known types + bit 14 flag for lazy extras
  • _typed_selective_rewind!: rewinds tracked types + any lazily-checkpointed untracked types
  • Replaces full checkpoint!/rewind! in the use_typed=true fallback path

Bug fix — others-type snapshot (standalone fix):

  • _typed_checkpoint_with_lazy! now eagerly snapshots pre-existing pool.others entries to prevent parent n_active corruption when helpers re-acquire the same others-type

CUDA parity:

  • Full mirror of both modes for CuAdaptiveArrayPool, including Float16 handling (bit 7 reassignment since CPU's Bit type has no GPU equivalent)

Code quality:

  • _has_bit(mask, T) helper replaces raw bitmask literals across all selective rewind paths

Add safety-lock tests and desired-behavior (RED) tests for the
dynamic selective rewind feature (use_typed=false path optimization).

Characterization tests (GREEN - lock current invariants):
- _acquire_impl! bypasses _mark_untracked! by design
- full checkpoint! eagerly saves all 8 typed pools
- parent state preserved across child scope rewind
- others-type (UInt8) sets has_others flag correctly
- empty scope round-trips cleanly

RED tests (fail until Phase 2+3 implement the feature):
- _depth_only_checkpoint! function does not exist yet
- depth-only checkpoint should not eagerly save typed pools
- lazy first-touch checkpoint on acquire! in dynamic mode
- macro use_typed=false should emit _depth_only_checkpoint!
- macro use_typed=false should emit _selective_rewind_fixed_slots!

Key finding: _transform_acquire_calls runs unconditionally (even for
use_typed=false), so every acquire! call is rewritten to _acquire_impl!
and _mark_untracked! is bypassed. Phase 3 must skip this transformation
in dynamic mode.
…ive mode

Add _depth_only_checkpoint! and _selective_rewind_fixed_slots! to state.jl,
and extend _mark_untracked! with lazy first-touch checkpoint in acquire.jl.

_depth_only_checkpoint! (state.jl):
- Lightweight enter: increments depth + pushes bitmask sentinels only
- Sets bit 15 in _untracked_fixed_masks as "dynamic-selective" mode flag
- Eagerly checkpoints pre-existing others entries (lazy is not feasible
  for non-fixed-slot types without per-type tracking)
- ~2ns vs ~540ns for full checkpoint!

_selective_rewind_fixed_slots! (state.jl):
- Rewinds only the 8 fixed-slot pools whose bits are set in mask
- Each bit maps to the same encoding as _fixed_slot_bit (bits 0-7)
- Callers must strip bit 15 (mode flag) before passing mask
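The selective-rewind idea above can be sketched in a few lines. This is a minimal stand-in model, not the package's real code: `MiniPool`, its `n_active`/`saved` vectors, and `selective_rewind!` are illustrative simplifications of the actual per-type pool internals.

```julia
# Minimal sketch (assumed field names) of rewinding only the fixed-slot
# pools whose bits are set in the mask. Bits 8-15 (mode flags) are
# stripped before iterating, as the PR requires of callers.
mutable struct MiniPool
    n_active::Vector{Int}   # one slot per fixed-slot type (bits 0-7)
    saved::Vector{Int}      # snapshot taken at first touch
end

function selective_rewind!(pool::MiniPool, mask::UInt16)
    mask &= 0x00FF                          # strip mode flags first
    for bit in 0:7
        if mask & (UInt16(1) << bit) != 0   # only touched pools
            pool.n_active[bit + 1] = pool.saved[bit + 1]
        end
    end
end

pool = MiniPool(fill(5, 8), fill(0, 8))
selective_rewind!(pool, 0x8005)  # bits 0 and 2 touched; bit 15 is mode flag
pool.n_active                    # slots 1 and 3 restored to 0, rest stay 5
```

An untouched mask leaves every pool alone, which is where the ~2ns-per-scope figure comes from: the rewind degenerates to a masked loop over zero set bits.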

_mark_untracked! (acquire.jl):
- AdaptiveArrayPool-specific override adds lazy first-touch checkpoint
- On first acquire of each fixed-slot type T in dynamic mode (bit 15 set):
  saves current n_active BEFORE the acquire so rewind restores parent state
- Without lazy checkpoint, Case B in _rewind_typed_pool! would restore from
  a stale parent checkpoint rather than the true pre-scope value
- Second and subsequent acquires of same type skip the lazy checkpoint
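The lazy first-touch rule can be modeled as follows. This is a hedged sketch: `mark_untracked!` and its flat `masks`/`saved`/`n_active` vectors are stand-ins for the real `_mark_untracked!`, `_untracked_fixed_masks`, and per-type checkpoint stacks.

```julia
# First touch of a type in the current dynamic scope saves n_active
# BEFORE the acquire; later touches of the same type are no-ops.
function mark_untracked!(masks::Vector{UInt16}, saved::Vector{Int},
                         n_active::Vector{Int}, bit::Int)
    (masks[end] & 0x8000) == 0 && return    # not in dynamic mode (bit 15)
    typebit = UInt16(1) << bit
    if masks[end] & typebit == 0            # first touch only
        saved[bit + 1] = n_active[bit + 1]  # pre-scope value
        masks[end] |= typebit
    end
end

masks = UInt16[0x8000]; saved = zeros(Int, 8); n = fill(3, 8)
mark_untracked!(masks, saved, n, 2)  # first touch of bit 2: saves 3
n[3] = 7                             # scope body acquires more arrays
mark_untracked!(masks, saved, n, 2)  # second touch: skipped
saved[3]                             # still 3, the true pre-scope value
```

Snapshotting before the acquire is the crux: it is what lets the rewind restore the parent's count instead of a stale checkpoint, the Case B failure described above.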
…tive mode

Route use_typed=false paths through _depth_only_checkpoint! + _dynamic_selective_rewind!
instead of full checkpoint/rewind. Avoids ~1080ns overhead when macro cannot extract
static types (local vars, similar!, eltype(arr) patterns).

Key changes:
- macros.jl: emit _depth_only_checkpoint!/_dynamic_selective_rewind! for use_typed=false;
  disable _transform_acquire_calls in dynamic mode so _mark_untracked! is called via
  public acquire! wrappers (prevents n_active leaks)
- state.jl: add _dynamic_selective_rewind! as standalone @inline function (avoids
  let-block boxing in finally clauses that caused 1152B allocation)

Test additions:
- test_macro_expansion.jl: GREEN assertions for dynamic path; negative guards confirming
  AdaptiveArrayPools.checkpoint!/rewind! are NOT emitted in use_typed=false expansions
- test_macro_internals.jl: 7 new runtime n_active cleanup tests using internal APIs
  directly with fresh pools (nested scopes, similar!, mixed static+dynamic types)
- test_allocation.jl: extra warmup call for test-order robustness (N-way bitarray
  cache state from earlier tests caused alloc2==1152 in full suite; alloc3 was 0)
- test_backend_macro_expansion.jl: update stale _selective_rewind_fixed_slots! assertion
  to _dynamic_selective_rewind!
Replace full checkpoint/rewind (~1080ns) in the use_typed=true fallback
path with typed checkpoint + selective rewind (~N×9ns per touched type).

Key changes:
- src/acquire.jl: extend lazy-checkpoint condition from bit 15 only
  (0x8000) to bit 14 OR bit 15 (0xC000), enabling lazy first-touch
  checkpoint in typed lazy mode for extra types touched by helpers
- src/state.jl: add _typed_checkpoint_with_lazy! (typed checkpoint +
  set bit 14) and _typed_selective_rewind! (rewind tracked|untracked mask)
- src/macros.jl: update _generate_typed_checkpoint_call/_generate_typed_rewind_call
  false branches from full checkpoint!/rewind! to the new helpers
- ext/.../state.jl: CUDA parity for both new helpers using direct field
  access (foreach_fixed_slot has no bit-yielding variant)
- tests: RED→GREEN coverage for bit 14 semantics, P0 safety regression
  (parent n_active preserved for extra types), and expansion assertions

Bit encoding:
  bit 15 (0x8000): dynamic selective mode (_depth_only_checkpoint!)
  bit 14 (0x4000): typed lazy mode (_typed_checkpoint_with_lazy!)
  bits 0-7: fixed-slot type bits (_mark_untracked!)
  bits 8-13: reserved

Safety: bit 14 ensures extra types get lazy first-touch checkpoint
(Case A at rewind), preventing Case B from incorrectly restoring
parent n_active from the sentinel value 0.
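The bit encoding above, written out as constants plus the combined bit-14-OR-bit-15 check that acquire.jl uses. The constant names are illustrative, not the package's actual identifiers; the bit values match the encoding documented in this commit.

```julia
const DYNAMIC_MODE = UInt16(0x8000)             # bit 15: dynamic selective mode
const TYPED_LAZY   = UInt16(0x4000)             # bit 14: typed lazy mode
const LAZY_MODES   = DYNAMIC_MODE | TYPED_LAZY  # 0xC000: either lazy mode
const FIXED_SLOTS  = UInt16(0x00FF)             # bits 0-7: fixed-slot types

# The extended lazy-checkpoint condition: fire on bit 14 OR bit 15.
lazy_active(mask::UInt16) = (mask & LAZY_MODES) != 0

lazy_active(0x8001)   # true: dynamic mode, bit 0 touched
lazy_active(0x4000)   # true: typed lazy mode
lazy_active(0x0003)   # false: type bits set, but no mode flag
```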
…_lazy!

Without this, a child scope using _typed_checkpoint_with_lazy! (typed-fallback path)
would skip snapshotting pre-existing pool.others entries (e.g. CPU Float16, UInt8).
If a helper then re-acquired the same type, _typed_selective_rewind! would hit Case B
(no checkpoint at depth) and restore the wrong sentinel value, corrupting the parent's
n_active.

Fixes (src/state.jl):
- _typed_checkpoint_with_lazy! now iterates pool.others and snapshots each entry that
  is not already checkpointed at the current depth (avoiding a double-push for types
  explicitly listed in types..., e.g. Float16).
- Sets _untracked_has_others[d] = true whenever pool.others is non-empty, so
  _typed_selective_rewind! enters the others loop even when no helper called
  _mark_untracked! (e.g. when Float16 is a tracked type and _acquire_impl! bypasses
  the untracked recording path).

Also clarifies the isempty(types) fallback comment in _generate_typed_checkpoint_call
and _generate_typed_rewind_call (src/macros.jl) to make it clear these branches exist
for direct external callers (test_coverage.jl), not macro-generated code.

Tests (test/test_state.jl):
- Moved _typed_checkpoint_with_lazy! import to file-level for shared access.
- Added "Phase 5 (Issue #3): typed lazy mode preserves parent n_active for others types"
  to cover the UInt8 others-type parent-preservation scenario.
…s with Float16 tracking

Adds full CUDA parity for the Phase 3 dynamic-selective and Phase 5 typed-fallback
optimizations, including correct handling of Float16 which is a direct struct field
on CuAdaptiveArrayPool (unlike CPU where it lives in pool.others).

Dynamic-selective mode (ext/state.jl):
- _depth_only_checkpoint! for CuAdaptiveArrayPool: sets bit 15, eagerly snapshots
  pool.others, and relies on _mark_untracked! bit-7 for lazy Float16 tracking.
- _dynamic_selective_rewind! for CuAdaptiveArrayPool: dispatches on bits 0-7 of
  the untracked mask (bit 7 = Float16 on CUDA), then handles pool.others.

Typed-fallback updates (ext/state.jl):
- _typed_checkpoint_with_lazy!: now eagerly snapshots pool.others entries (same fix
  as CPU side — avoids Case B at rewind for pre-existing others-type acquires).
- _typed_selective_rewind!: adds depth-check fallback for Float16:
  since _tracked_mask_for_types(Float16)==0 and _acquire_impl! bypasses
  _mark_untracked!, neither tracked_mask nor untracked bit 7 is set for a tracked
  Float16 type. The depth check detects "Float16 was checkpointed at this depth"
  (by _typed_checkpoint_with_lazy! → checkpoint!(pool, Float16)) and ensures the
  pool is rewound, preserving the parent scope's float16.n_active.

CUDA _mark_untracked! override (ext/acquire.jl):
- Float16 on CUDA is a direct field with _fixed_slot_bit(Float16)=0. Overrides the
  base AbstractArrayPool _mark_untracked! to route Float16 through bit 7 (unused on
  CUDA; CPU uses bit 7 for the Bit type which has no GPU equivalent).
- Gives Float16 the same lazy first-touch checkpoint behavior (bit 14 OR bit 15 check)
  as other fixed-slot types, ensuring Case A fires at rewind and parent n_active is
  preserved. Genuine others types (UInt8, Int8, etc.) fall through to has_others flag.
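The Float16 rerouting can be sketched as a dispatch override. This is an assumed shape, not the extension's real method table: `mask_bit` is a hypothetical name, and only the two bit values (`_fixed_slot_bit(Float16) == 0`, CUDA reassignment to bit 7) come from the PR.

```julia
# On CUDA, Float16 is a direct struct field whose natural slot bit (0)
# would collide with another type's bit, so it is routed through bit 7,
# which the CPU reserves for its Bit type (absent on GPU).
_fixed_slot_bit(::Type{Float16}) = 0   # per the PR: direct-field slot 0
_fixed_slot_bit(::Type{Float32}) = 1   # illustrative mapping
_cuda_float16_bit() = 7                # reassigned bit on CUDA

mask_bit(::Type{Float16}) = _cuda_float16_bit()   # CUDA override
mask_bit(::Type{T}) where {T} = _fixed_slot_bit(T)  # everything else

mask = UInt16(1) << mask_bit(Float16)  # 0x0080, not 0x0001
```

Julia's method specificity makes the `Float16` method win over the generic fallback, which is the same mechanism the extension's `_mark_untracked!` override relies on.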
… in selective rewind

Replace `mask & (UInt16(1) << n) != 0` with `_has_bit(mask, TypeName)` across
_selective_rewind_fixed_slots! (CPU), _dynamic_selective_rewind! (CUDA), and
_typed_selective_rewind! (CUDA). CUDA Float16 (bit 7 reassignment) uses
`_cuda_float16_bit()` directly since _fixed_slot_bit(Float16) == 0.

Zero runtime cost: _has_bit is @inline and _fixed_slot_bit returns compile-time
constants, so the compiler folds them identically to the original bit operations.
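A minimal sketch of the `_has_bit` pattern, assuming the `_fixed_slot_bit` signature described in this PR (the type-to-bit mapping below is illustrative):

```julia
# With an @inline helper and a compile-time-constant bit per type, the
# generic check folds to the same machine code as a hand-written literal.
_fixed_slot_bit(::Type{Float64}) = 0   # illustrative mapping
_fixed_slot_bit(::Type{Float32}) = 1

@inline _has_bit(mask::UInt16, ::Type{T}) where {T} =
    mask & (UInt16(1) << _fixed_slot_bit(T)) != 0

mask = UInt16(0b0000_0001)
_has_bit(mask, Float64)   # true
_has_bit(mask, Float32)   # false
```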
codecov bot commented Feb 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.46%. Comparing base (a2fb750) to head (831f8a1).
⚠️ Report is 9 commits behind head on master.

Additional details and impacted files


@@            Coverage Diff             @@
##           master      #17      +/-   ##
==========================================
- Coverage   97.08%   96.46%   -0.62%     
==========================================
  Files           9        9              
  Lines        1200     1273      +73     
==========================================
+ Hits         1165     1228      +63     
- Misses         35       45      +10     
Files with missing lines   Coverage            Δ
src/acquire.jl             95.23% <100.00%>    -4.77% ⬇️
src/macros.jl              92.78% <100.00%>    +0.07% ⬆️
src/state.jl               98.70% <100.00%>    -1.30% ⬇️
src/types.jl               100.00% <100.00%>   ø

Copilot AI left a comment
Pull request overview

This PR optimizes @with_pool scope management by introducing dynamic selective checkpoint/rewind for the use_typed=false path and a typed-lazy fallback for the use_typed=true path to avoid eager checkpoint/rewind of all fixed slots when unnecessary.

Changes:

  • Add _depth_only_checkpoint! + _dynamic_selective_rewind! and update macro codegen to use them for dynamic (use_typed=false) scopes.
  • Add _typed_checkpoint_with_lazy! + _typed_selective_rewind! and update typed fallback codegen to selectively rewind tracked + lazily checkpointed extra types.
  • Mirror the above behavior for CUDA pools and expand/update tests to validate runtime behavior + macro expansion.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Summary per file:

  • src/state.jl: Implements depth-only checkpoint, dynamic selective rewind, typed-lazy checkpoint, and typed selective rewind.
  • src/acquire.jl: Extends _mark_untracked! to support lazy first-touch checkpointing for fixed-slot types in bit-14/15 modes.
  • src/macros.jl: Updates macro codegen to use dynamic selective mode when static types can't be extracted; updates typed fallback to typed-lazy helpers.
  • src/types.jl: Adds _has_bit(mask, T) helper for clearer bitmask checks.
  • ext/AdaptiveArrayPoolsCUDAExt/state.jl: CUDA parity for dynamic selective + typed-lazy rewind logic (incl. Float16 bit handling).
  • ext/AdaptiveArrayPoolsCUDAExt/acquire.jl: CUDA _mark_untracked! override to support lazy first-touch checkpoints (incl. Float16 special-case).
  • test/test_state.jl: Adds/updates state-layer tests for dynamic selective + typed-lazy behavior and safety invariants.
  • test/test_macro_internals.jl: Adds runtime correctness tests for dynamic selective rewind + typed-lazy behavior.
  • test/test_macro_expansion.jl: Updates and adds macro-expansion assertions for new generated calls.
  • test/test_backend_macro_expansion.jl: Updates CUDA backend macro-expansion assertions for dynamic selective mode.
  • test/test_allocation.jl: Adjusts allocation test warmup assumptions to stabilize "hot path" zero-allocation checks.


… rewind modes

- CPU/CUDA _depth_only_checkpoint!: set _untracked_has_others=true when eagerly
  checkpointing pre-existing others entries, so _dynamic_selective_rewind! enters
  the others loop and pops the checkpoint (prevents unbounded stack leak in loops)
- CPU/CUDA _mark_untracked!: add _checkpoint_depths[end] != depth guard before
  lazy _checkpoint_typed_pool!, preventing double-push when a tracked type is also
  acquired by a helper via acquire! (restores correct parent n_active on rewind)
- CUDA state.jl: import _has_bit (was used 14 times without import → UndefVarError)
- CUDA _typed_checkpoint_with_lazy!: add double-checkpoint guard and has_others
  flag, matching CPU version parity
mgyoo86 merged commit 59d002e into master Feb 18, 2026
10 checks passed
mgyoo86 deleted the perf/more_optimized branch February 18, 2026 20:10