Merged
30 changes: 15 additions & 15 deletions docs/src/architecture/macro-internals.md
@@ -115,24 +115,24 @@ end
# If only checkpoint!(pool, Int64), Float64 arrays won't be rewound!
```

-### The Solution: Bitmask-Based Untracked Tracking
+### The Solution: Bitmask-Based Type Touch Tracking

-Every `acquire!` call (and convenience functions) marks itself as "untracked" with type-specific bitmask information:
+Every `acquire!` call (and convenience functions) records the type touch with type-specific bitmask information:

```julia
# Public API (called from user code outside macro)
@inline function acquire!(pool, ::Type{T}, n::Int) where {T}
-_mark_untracked!(pool, T) # ← Sets type-specific bitmask!
+_record_type_touch!(pool, T) # ← Records type-specific bitmask!
_acquire_impl!(pool, T, n)
end

-# Macro-transformed calls skip the marking
+# Macro-transformed calls skip the recording
# (because macro already knows about them)
-_acquire_impl!(pool, T, n) # ← No marking
+_acquire_impl!(pool, T, n) # ← No recording
```

Each fixed-slot type maps to a bit in a `UInt16` bitmask via `_fixed_slot_bit(T)`.
-Non-fixed-slot types set a separate `_untracked_has_others` flag.
+Non-fixed-slot types set a separate `_touched_has_others` flag.
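
A minimal sketch of the bit-or-flag split described above. The names `demo_slot_bit`, `DemoTouchState`, and `demo_record_touch!` are hypothetical stand-ins, not the package's internals. Only the Float64 bit (`0x0001`) is taken from the flow diagram; the other bit positions are assumed for illustration.

```julia
# Hypothetical stand-in for `_fixed_slot_bit`: one bit per fixed-slot type.
demo_slot_bit(::Type{Float64}) = UInt16(1) << 0   # 0x0001, as in the flow diagram
demo_slot_bit(::Type{Float32}) = UInt16(1) << 1   # assumed position
demo_slot_bit(::Type{Int64})   = UInt16(1) << 2   # assumed position
demo_slot_bit(::Type)          = UInt16(0)        # fallback: not a fixed-slot type

# Hypothetical per-scope state: a bitmask plus a catch-all flag.
mutable struct DemoTouchState
    mask::UInt16
    has_others::Bool
end

# Record that type T was touched: set its bit, or raise the "others" flag.
function demo_record_touch!(s::DemoTouchState, ::Type{T}) where {T}
    b = demo_slot_bit(T)
    if b == UInt16(0)
        s.has_others = true
    else
        s.mask |= b
    end
    return nothing
end

s = DemoTouchState(UInt16(0), false)
demo_record_touch!(s, Float64)   # sets bit 0
demo_record_touch!(s, Int64)     # sets bit 2
demo_record_touch!(s, UInt8)     # no fixed slot: sets has_others
```

After these three calls the mask holds the Float64 and Int64 bits, and the flag records that some other type was touched without saying which one.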

### Flow Diagram

@@ -144,7 +144,7 @@ Non-fixed-slot types set a separate `_untracked_has_others` flag.
│ A = _acquire_impl!(...) (macro-transformed, no mark)
│ B = helper!(pool)
│ └─► zeros!(pool, Float64, N)
-│ └─► _mark_untracked!(pool, Float64)
+│ └─► _record_type_touch!(pool, Float64)
│ masks[2] |= 0x0001 (Float64 bit) ←───┐
│ │
│ ... more code ... │
@@ -161,9 +161,9 @@ end

### Why This Works

-1. **Macro-tracked calls**: Transformed to `_acquire_impl!` → no bitmask mark → typed path
-2. **Untracked calls**: Use public API → sets type-specific bitmask → subset check at rewind
-3. **Subset optimization**: If untracked types are a subset of tracked types, the typed path is still safe
+1. **Macro-tracked calls**: Transformed to `_acquire_impl!` → no bitmask touch → typed path
+2. **External calls**: Use public API → records type-specific bitmask → subset check at rewind
+3. **Subset optimization**: If touched types are a subset of tracked types, the typed path is still safe
4. **Result**: Always safe, with finer-grained optimization than a single boolean flag
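
The subset check in item 3 can be sketched as a single bitwise test. This is an illustration under stated assumptions, not the package's `_can_use_typed_path`: the helper name `demo_can_use_typed_path` is hypothetical, the bit positions other than Float64 are assumed, and treating any "others" touch as forcing the full path is also an assumption.

```julia
# Hypothetical helper. `tracked` holds bits for types the macro saw at compile time;
# `touched` holds bits recorded at runtime by public-API calls.
function demo_can_use_typed_path(tracked::UInt16, touched::UInt16, has_others::Bool)
    # Assumption: a touch on any non-fixed-slot type rules out the typed path.
    has_others && return false
    # Subset test: no touched bit may fall outside the tracked set.
    return (touched & ~tracked) == 0
end

f64 = 0x0001   # Float64 bit, as in the flow diagram
f32 = 0x0002   # assumed position
i64 = 0x0004   # assumed position

tracked = f64 | i64
subset_ok  = demo_can_use_typed_path(tracked, f64, false)   # touched ⊆ tracked
extra_type = demo_can_use_typed_path(tracked, f32, false)   # Float32 was not tracked
with_other = demo_can_use_typed_path(tracked, f64, true)    # an "others" type was touched
```

Only the first call permits the typed path. The other two fall back to the full path, which is what keeps the scheme safe when a helper function acquires a type the macro never saw.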

## Nested `@with_pool` Handling
@@ -191,14 +191,14 @@ end depth: 2 → 1, bitmask checked
struct AdaptiveArrayPool
# ... type pools ...
_current_depth::Int # Current scope depth (1 = global)
-_untracked_fixed_masks::Vector{UInt16} # Per-depth: which fixed slots untracked
-_untracked_has_others::Vector{Bool} # Per-depth: any non-fixed-slot untracked
+_touched_type_masks::Vector{UInt16} # Per-depth: which fixed slots were touched
+_touched_has_others::Vector{Bool} # Per-depth: any non-fixed-slot type touched
end

# Initialized with sentinel:
_current_depth = 1 # Global scope
-_untracked_fixed_masks = [UInt16(0)] # Sentinel for depth=1
-_untracked_has_others = [false] # Sentinel for depth=1
+_touched_type_masks = [UInt16(0)] # Sentinel for depth=1
+_touched_has_others = [false] # Sentinel for depth=1
```
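
A rough sketch of how the per-depth vectors behave as a stack, assuming each scope entry pushes a fresh entry and each exit pops it, as the CUDA `checkpoint!` and `rewind!` methods in this change do. `DemoDepthState`, `demo_enter!`, and `demo_exit!` are hypothetical names, and the typed-pool checkpointing is left out.

```julia
# Hypothetical reduced state: depth counter plus the two per-depth vectors.
mutable struct DemoDepthState
    depth::Int
    masks::Vector{UInt16}
    has_others::Vector{Bool}
end

# Start at depth 1 with one sentinel entry in each vector.
DemoDepthState() = DemoDepthState(1, [UInt16(0)], [false])

# Scope entry: go one level deeper and push clean tracking state.
function demo_enter!(s::DemoDepthState)
    s.depth += 1
    push!(s.masks, UInt16(0))
    push!(s.has_others, false)
    return nothing
end

# Scope exit: drop the tracking state of the scope being left.
function demo_exit!(s::DemoDepthState)
    pop!(s.masks)
    pop!(s.has_others)
    s.depth -= 1
    return nothing
end

s = DemoDepthState()
demo_enter!(s)                 # depth 2
demo_enter!(s)                 # depth 3
s.masks[s.depth] |= 0x0001     # a touch recorded in the inner scope
demo_exit!(s)                  # back to depth 2; the inner touch is discarded
```

The vectors always have exactly `depth` entries, so indexing with the current depth is in bounds, and a touch recorded in an inner scope does not leak into its parent.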

## Performance Impact
@@ -256,7 +256,7 @@ end
| `_extract_acquire_types(expr, pool_name)` | AST walk to find types |
| `_filter_static_types(types, local_vars)` | Filter out locally-defined types |
| `_transform_acquire_calls(expr, pool_name)` | Replace `acquire!` → `_acquire_impl!` |
-| `_mark_untracked!(pool, T)` | Set type-specific bitmask for current depth |
+| `_record_type_touch!(pool, T)` | Record type touch in bitmask for current depth |
| `_can_use_typed_path(pool, mask)` | Bitmask subset check for typed vs full path |
| `_tracked_mask_for_types(T...)` | Compile-time bitmask for tracked types |
| `_generate_typed_checkpoint_call(pool, types)` | Generate bitmask-aware checkpoint |
21 changes: 11 additions & 10 deletions ext/AdaptiveArrayPoolsCUDAExt/acquire.jl
@@ -28,7 +28,8 @@
# ==============================================================================

using AdaptiveArrayPools: get_view!, get_nd_view!, get_nd_array!, allocate_vector, safe_prod,
-_mark_untracked!, _fixed_slot_bit, _checkpoint_typed_pool!
+_record_type_touch!, _fixed_slot_bit, _checkpoint_typed_pool!,
+_MODE_BITS_MASK

"""
get_view!(tp::CuTypedPool{T}, n::Int) -> CuVector{T}
@@ -165,44 +166,44 @@ Used by `unsafe_acquire!` - same zero-allocation behavior as `acquire!`.
end

# ==============================================================================
-# CUDA _mark_untracked! override (Issue #2 / #2a fix)
+# CUDA _record_type_touch! override (Issue #2 / #2a fix)
# ==============================================================================
# Float16 on CUDA: direct struct field with _fixed_slot_bit(Float16)=0.
# We track Float16 via bit 7 (CUDA reassignment; CPU uses bit 7 for Bit type, absent on GPU).
# This gives Float16 lazy first-touch checkpointing in bit-14 (typed lazy) and bit-15 (dynamic)
# modes, ensuring Case A (not Case B) fires at rewind and parent n_active is preserved.

-@inline function AdaptiveArrayPools._mark_untracked!(pool::CuAdaptiveArrayPool, ::Type{T}) where {T}
+@inline function AdaptiveArrayPools._record_type_touch!(pool::CuAdaptiveArrayPool, ::Type{T}) where {T}
depth = pool._current_depth
b = _fixed_slot_bit(T)
if b == UInt16(0)
if T === Float16
# Float16: CUDA direct field tracked via bit 7 (not in pool.others dict).
b16 = UInt16(1) << 7
-current_mask = @inbounds pool._untracked_fixed_masks[depth]
+current_mask = @inbounds pool._touched_type_masks[depth]
# Lazy first-touch checkpoint: bit 14 (typed lazy) OR bit 15 (dynamic), first touch only.
# Guard: skip if already checkpointed at this depth (prevents double-push).
-if (current_mask & 0xC000) != 0 && (current_mask & b16) == 0
+if (current_mask & _MODE_BITS_MASK) != 0 && (current_mask & b16) == 0
if @inbounds(pool.float16._checkpoint_depths[end]) != depth
_checkpoint_typed_pool!(pool.float16, depth)
end
end
-@inbounds pool._untracked_fixed_masks[depth] = current_mask | b16
+@inbounds pool._touched_type_masks[depth] = current_mask | b16
else
# Genuine others type (UInt8, Int8, etc.) — eagerly snapshotted at scope entry.
-@inbounds pool._untracked_has_others[depth] = true
+@inbounds pool._touched_has_others[depth] = true
end
else
-current_mask = @inbounds pool._untracked_fixed_masks[depth]
+current_mask = @inbounds pool._touched_type_masks[depth]
# Lazy first-touch checkpoint for fixed-slot types in bit 14/15 modes.
# Guard: skip if already checkpointed at this depth (prevents double-push).
-if (current_mask & 0xC000) != 0 && (current_mask & b) == 0
+if (current_mask & _MODE_BITS_MASK) != 0 && (current_mask & b) == 0
tp = AdaptiveArrayPools.get_typed_pool!(pool, T)
if @inbounds(tp._checkpoint_depths[end]) != depth
_checkpoint_typed_pool!(tp, depth)
end
end
-@inbounds pool._untracked_fixed_masks[depth] = current_mask | b
+@inbounds pool._touched_type_masks[depth] = current_mask | b
end
nothing
end
105 changes: 53 additions & 52 deletions ext/AdaptiveArrayPoolsCUDAExt/state.jl
@@ -6,7 +6,8 @@
# AbstractTypedPool, so they work for CuTypedPool automatically.

using AdaptiveArrayPools: checkpoint!, rewind!, reset!,
-_checkpoint_typed_pool!, _rewind_typed_pool!, _has_bit
+_checkpoint_typed_pool!, _rewind_typed_pool!, _has_bit,
+_LAZY_MODE_BIT, _TYPED_LAZY_BIT, _TYPE_BITS_MASK

# ==============================================================================
# GPU Fixed Slot Iteration
@@ -31,10 +32,10 @@ end
# ==============================================================================

function AdaptiveArrayPools.checkpoint!(pool::CuAdaptiveArrayPool)
-# Increment depth and initialize untracked bitmask state
+# Increment depth and initialize type-touch tracking state
pool._current_depth += 1
-push!(pool._untracked_fixed_masks, UInt16(0))
-push!(pool._untracked_has_others, false)
+push!(pool._touched_type_masks, UInt16(0))
+push!(pool._touched_has_others, false)
depth = pool._current_depth

# Fixed slots - zero allocation via @generated iteration
@@ -53,8 +54,8 @@ end
# Type-specific checkpoint (single type)
@inline function AdaptiveArrayPools.checkpoint!(pool::CuAdaptiveArrayPool, ::Type{T}) where {T}
pool._current_depth += 1
-push!(pool._untracked_fixed_masks, UInt16(0))
-push!(pool._untracked_has_others, false)
+push!(pool._touched_type_masks, UInt16(0))
+push!(pool._touched_has_others, false)
_checkpoint_typed_pool!(AdaptiveArrayPools.get_typed_pool!(pool, T), pool._current_depth)
nothing
end
@@ -72,8 +73,8 @@ end
checkpoint_exprs = [:(_checkpoint_typed_pool!(AdaptiveArrayPools.get_typed_pool!(pool, types[$i]), pool._current_depth)) for i in unique_indices]
quote
pool._current_depth += 1
-push!(pool._untracked_fixed_masks, UInt16(0))
-push!(pool._untracked_has_others, false)
+push!(pool._touched_type_masks, UInt16(0))
+push!(pool._touched_has_others, false)
$(checkpoint_exprs...)
nothing
end
@@ -102,8 +103,8 @@ function AdaptiveArrayPools.rewind!(pool::CuAdaptiveArrayPool)
_rewind_typed_pool!(tp, cur_depth)
end

-pop!(pool._untracked_fixed_masks)
-pop!(pool._untracked_has_others)
+pop!(pool._touched_type_masks)
+pop!(pool._touched_has_others)
pool._current_depth -= 1

return nothing
@@ -116,8 +117,8 @@ end
return nothing
end
_rewind_typed_pool!(AdaptiveArrayPools.get_typed_pool!(pool, T), pool._current_depth)
-pop!(pool._untracked_fixed_masks)
-pop!(pool._untracked_has_others)
+pop!(pool._touched_type_masks)
+pop!(pool._touched_has_others)
pool._current_depth -= 1
nothing
end
@@ -140,17 +141,17 @@ end
return nothing
end
$(rewind_exprs...)
-pop!(pool._untracked_fixed_masks)
-pop!(pool._untracked_has_others)
+pop!(pool._touched_type_masks)
+pop!(pool._touched_has_others)
pool._current_depth -= 1
nothing
end
end

# ==============================================================================
-# Dynamic-Selective Mode for CuAdaptiveArrayPool (use_typed=false path)
+# Lazy Mode for CuAdaptiveArrayPool (use_typed=false path)
# ==============================================================================
-# Mirrors CPU _depth_only_checkpoint! / _dynamic_selective_rewind! in src/state.jl.
+# Mirrors CPU _lazy_checkpoint! / _lazy_rewind! in src/state.jl.
#
# Float16 on CUDA: direct struct field (not in pool.others dict), but _fixed_slot_bit(Float16)=0.
# We reassign Float16 to bit 7 (unused on CUDA; CPU uses bit 7 for Bit type which has no GPU equivalent).
@@ -160,25 +161,25 @@ end
# Bit 7 on CUDA is reserved for Float16 (CPU uses it for Bit; Bit type does not exist on GPU).
@inline _cuda_float16_bit() = UInt16(1) << 7

-@inline function AdaptiveArrayPools._depth_only_checkpoint!(pool::CuAdaptiveArrayPool)
+@inline function AdaptiveArrayPools._lazy_checkpoint!(pool::CuAdaptiveArrayPool)
pool._current_depth += 1
-push!(pool._untracked_fixed_masks, UInt16(0x8000)) # bit 15: dynamic-selective mode
-push!(pool._untracked_has_others, false)
+push!(pool._touched_type_masks, _LAZY_MODE_BIT) # lazy mode flag
+push!(pool._touched_has_others, false)
depth = pool._current_depth
-# Eagerly checkpoint pre-existing others entries — same as CPU _depth_only_checkpoint!.
+# Eagerly checkpoint pre-existing others entries — same as CPU _lazy_checkpoint!.
# New types created during the scope start at n_active=0 (sentinel covers them, Case B safe).
# Pre-existing types need their count saved now so Case A fires correctly at rewind.
for p in values(pool.others)
_checkpoint_typed_pool!(p, depth)
-@inbounds pool._untracked_has_others[depth] = true
+@inbounds pool._touched_has_others[depth] = true
end
-# Float16 uses lazy first-touch via bit 7 in _mark_untracked! — no eager checkpoint needed.
+# Float16 uses lazy first-touch via bit 7 in _record_type_touch! — no eager checkpoint needed.
nothing
end

-@inline function AdaptiveArrayPools._dynamic_selective_rewind!(pool::CuAdaptiveArrayPool)
+@inline function AdaptiveArrayPools._lazy_rewind!(pool::CuAdaptiveArrayPool)
d = pool._current_depth
-mask = @inbounds(pool._untracked_fixed_masks[d]) & UInt16(0x00FF)
+mask = @inbounds(pool._touched_type_masks[d]) & _TYPE_BITS_MASK
_has_bit(mask, Float64) && _rewind_typed_pool!(pool.float64, d)
_has_bit(mask, Float32) && _rewind_typed_pool!(pool.float32, d)
_has_bit(mask, Int64) && _rewind_typed_pool!(pool.int64, d)
@@ -188,13 +189,13 @@ end
_has_bit(mask, Bool) && _rewind_typed_pool!(pool.bool, d)
# Bit 7: Float16 (CUDA reassignment — _fixed_slot_bit(Float16)==0, must use explicit bit check)
mask & _cuda_float16_bit() != 0 && _rewind_typed_pool!(pool.float16, d)
-if @inbounds(pool._untracked_has_others[d])
+if @inbounds(pool._touched_has_others[d])
for tp in values(pool.others)
_rewind_typed_pool!(tp, d)
end
end
-pop!(pool._untracked_fixed_masks)
-pop!(pool._untracked_has_others)
+pop!(pool._touched_type_masks)
+pop!(pool._touched_has_others)
pool._current_depth -= 1
nothing
end
@@ -203,57 +204,57 @@ end
# Typed-Fallback Helpers for CuAdaptiveArrayPool (Phase 5 parity)
# ==============================================================================

-# _typed_checkpoint_with_lazy!: typed checkpoint + set bit 14 for lazy extra-type tracking.
+# _typed_lazy_checkpoint!: typed checkpoint + set bit 14 for lazy extra-type tracking.
# Also eagerly snapshots pre-existing others entries (mirrors CPU fix for Issue #3).
-@inline function AdaptiveArrayPools._typed_checkpoint_with_lazy!(pool::CuAdaptiveArrayPool, types::Type...)
+@inline function AdaptiveArrayPools._typed_lazy_checkpoint!(pool::CuAdaptiveArrayPool, types::Type...)
checkpoint!(pool, types...)
d = pool._current_depth
-@inbounds pool._untracked_fixed_masks[d] |= UInt16(0x4000) # set bit 14
-# Eagerly snapshot pre-existing others entries — same reasoning as _depth_only_checkpoint!.
+@inbounds pool._touched_type_masks[d] |= _TYPED_LAZY_BIT
+# Eagerly snapshot pre-existing others entries — same reasoning as _lazy_checkpoint!.
# Skip re-snapshot for entries already checkpointed at d by checkpoint!(pool, types...)
# (e.g. Float16 in types... was just checkpointed above — avoid double-push).
for p in values(pool.others)
if @inbounds(p._checkpoint_depths[end]) != d
_checkpoint_typed_pool!(p, d)
end
-@inbounds pool._untracked_has_others[d] = true
+@inbounds pool._touched_has_others[d] = true
end
-# Float16 uses lazy first-touch via bit 7 in _mark_untracked! — no eager checkpoint needed.
+# Float16 uses lazy first-touch via bit 7 in _record_type_touch! — no eager checkpoint needed.
nothing
end

-# _typed_selective_rewind!: selective rewind of (tracked | untracked) mask.
+# _typed_lazy_rewind!: selective rewind of (tracked | touched) mask.
# Uses direct field access with bit checks — foreach_fixed_slot is single-argument (no bit yield).
-# Bit 7: Float16 (CUDA-specific; lazy-checkpointed on first touch by _mark_untracked!).
+# Bit 7: Float16 (CUDA-specific; lazy-checkpointed on first touch by _record_type_touch!).
# has_others: genuine others types (UInt8, Int8, etc.) — eagerly checkpointed at scope entry.
-@inline function AdaptiveArrayPools._typed_selective_rewind!(pool::CuAdaptiveArrayPool, tracked_mask::UInt16)
+@inline function AdaptiveArrayPools._typed_lazy_rewind!(pool::CuAdaptiveArrayPool, tracked_mask::UInt16)
d = pool._current_depth
-untracked = @inbounds(pool._untracked_fixed_masks[d]) & UInt16(0x00FF)
-combined = tracked_mask | untracked
+touched = @inbounds(pool._touched_type_masks[d]) & _TYPE_BITS_MASK
+combined = tracked_mask | touched
_has_bit(combined, Float64) && _rewind_typed_pool!(pool.float64, d)
_has_bit(combined, Float32) && _rewind_typed_pool!(pool.float32, d)
_has_bit(combined, Int64) && _rewind_typed_pool!(pool.int64, d)
_has_bit(combined, Int32) && _rewind_typed_pool!(pool.int32, d)
_has_bit(combined, ComplexF64) && _rewind_typed_pool!(pool.complexf64, d)
_has_bit(combined, ComplexF32) && _rewind_typed_pool!(pool.complexf32, d)
_has_bit(combined, Bool) && _rewind_typed_pool!(pool.bool, d)
-# Float16: bit 7 is set by _mark_untracked! on first untracked touch (lazy first-touch).
-# Also rewind when Float16 was a *tracked* type in the macro: _typed_checkpoint_with_lazy!
+# Float16: bit 7 is set by _record_type_touch! on first touch (lazy first-touch).
+# Also rewind when Float16 was a *tracked* type in the macro: _typed_lazy_checkpoint!
# calls checkpoint!(pool, Float16) which pushes a checkpoint at depth d, but _acquire_impl!
-# (macro transform) bypasses _mark_untracked!, leaving bit 7 = 0.
+# (macro transform) bypasses _record_type_touch!, leaving bit 7 = 0.
# _tracked_mask_for_types(Float16) == 0 (since _fixed_slot_bit(Float16) == 0), so
# tracked_mask carries no bit for Float16 either.
# Solution: check _checkpoint_depths to detect "Float16 was checkpointed at this depth".
if combined & _cuda_float16_bit() != 0 || @inbounds(pool.float16._checkpoint_depths[end]) == d
_rewind_typed_pool!(pool.float16, d)
end
-if @inbounds(pool._untracked_has_others[d])
+if @inbounds(pool._touched_has_others[d])
for tp in values(pool.others)
_rewind_typed_pool!(tp, d)
end
end
-pop!(pool._untracked_fixed_masks)
-pop!(pool._untracked_has_others)
+pop!(pool._touched_type_masks)
+pop!(pool._touched_has_others)
pool._current_depth -= 1
nothing
end
@@ -275,10 +276,10 @@ function AdaptiveArrayPools.reset!(pool::CuAdaptiveArrayPool)

# Reset depth and bitmask sentinel state
pool._current_depth = 1
-empty!(pool._untracked_fixed_masks)
-push!(pool._untracked_fixed_masks, UInt16(0)) # Sentinel: no bits set
-empty!(pool._untracked_has_others)
-push!(pool._untracked_has_others, false) # Sentinel: no others
+empty!(pool._touched_type_masks)
+push!(pool._touched_type_masks, UInt16(0)) # Sentinel: no bits set
+empty!(pool._touched_has_others)
+push!(pool._touched_has_others, false) # Sentinel: no others

return pool
end
@@ -334,10 +335,10 @@ function Base.empty!(pool::CuAdaptiveArrayPool)

# Reset depth and bitmask sentinel state
pool._current_depth = 1
-empty!(pool._untracked_fixed_masks)
-push!(pool._untracked_fixed_masks, UInt16(0)) # Sentinel: no bits set
-empty!(pool._untracked_has_others)
-push!(pool._untracked_has_others, false) # Sentinel: no others
+empty!(pool._touched_type_masks)
+push!(pool._touched_type_masks, UInt16(0)) # Sentinel: no bits set
+empty!(pool._touched_has_others)
+push!(pool._touched_has_others, false) # Sentinel: no others

return pool
end