Replace expression tree pipeline execution with prewired invokers #7631

Open
danielmarbach wants to merge 3 commits into master from pipeline-wired

Conversation

@danielmarbach (Contributor) commented Feb 22, 2026

This PR aligns pipeline execution with a prewired, immutable continuation chain, keeps state on the context bag, and improves AOT/trimming friendliness, all while minimizing broader architectural change and delivering significant performance improvements across all execution paths.

Summary

The previous trampoline model proposal (see #7625) executed the pipeline as a flat loop-like progression over prebuilt parts, tracking position via a mutable frame on the context. This new approach prewires the entire continuation chain at pipeline build time, eliminating frame manipulation entirely.

Key change: Instead of computing "what's next" at runtime via frame index advancement, each behavior receives a prewired next delegate that directly invokes the subsequent behavior. The chain is built once at pipeline construction and executed millions of times with zero allocation, building on key findings from the trampoline investigation.
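From a behavior author's point of view, nothing changes; only what `next` is under the hood does. A minimal self-contained sketch (the `IExampleContext` stand-in and the simplified `IBehavior<TIn, TOut>` shape below are illustrative, not the shipped types):

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-ins so the sketch compiles on its own; the real
// IBehavior<TIn, TOut> contract in the codebase is unchanged by this PR.
interface IExampleContext { }

interface IBehavior<TInContext, TOutContext>
{
    Task Invoke(TInContext context, Func<TOutContext, Task> next);
}

sealed class LoggingBehavior : IBehavior<IExampleContext, IExampleContext>
{
    public async Task Invoke(IExampleContext context, Func<IExampleContext, Task> next)
    {
        // work before downstream behaviors run
        await next(context); // direct call into the prewired continuation chain
        // work after downstream behaviors have completed
    }
}
```

The key point is that `next` is the same delegate instance on every invocation: it was wired at build time to the following node, so no index or frame lookup happens here.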

Alignment with expression-tree model: Like the current expression-tree-based model (the "Current" baseline in benchmarks), this approach pre-composes the continuation chain at build time. However, it achieves this through static generic type instantiation rather than runtime expression compilation, delivering:

  • Same allocation-free success path
  • Same exception handling simplicity
  • Same thread-safety model: pipeline is immutable, per-invocation state lives on ContextBag
  • 25-29% faster execution
  • ~330x faster warmup (no expression-tree compilation)
  • Full AOT/trimming compatibility (no runtime codegen)

Architecture

Prewired Continuation Chain

Pipeline Construction Time:
┌─────────────────────────────────────────────────────────────────────────────┐
│  PipelineInvoker.Build(steps)                                               │
│                                                                             │
│  for i = steps.Count - 1 → 0:                                               │
│      next = CreateInvokerNode(step[i], i, next)                             │
│                                                                             │
│  Result: Linked chain of InvokerNode<TIn,TOut> instances                    │
│  Each node holds: behaviorIndex + prewired next delegate                    │
└─────────────────────────────────────────────────────────────────────────────┘

Execution Time:
┌─────────────────────────────────────────────────────────────────────────────┐
│  context.Extensions.Invoker(context)                                        │
│       ↓                                                                     │
│  InvokerNode<TIn,TOut>.Invoke(context)                                      │
│       ↓                                                                     │
│  behavior.Invoke(context, next)    ← next is prewired delegate              │
│       ↓                                                                     │
│  await next(context)              ← direct call to next InvokerNode         │
│       ↓                                                                     │
│  ... continues down the chain                                               │
└─────────────────────────────────────────────────────────────────────────────┘

Node Structure

abstract class InvokerNode
{
    public abstract Task Invoke(IBehaviorContext context);
}

sealed class InvokerNode<TInContext, TOutContext>(int behaviorIndex, Func<TOutContext, Task> next) : InvokerNode
    where TInContext : class, IBehaviorContext
    where TOutContext : class, IBehaviorContext
{
    public override Task Invoke(IBehaviorContext context)
    {
        var behavior = context.Extensions.GetBehavior<TInContext, TOutContext>(behaviorIndex);
        return behavior.Invoke(Unsafe.As<TInContext>(context), next);
    }
}

Critical allocation optimization:

  • next is cached at construction time, not per-invocation
  • next.Invoke is a bound method-group delegate (no closure)
  • Static generic cache for terminal next: CompletedNextCache<TOutContext>.Next

Comparison: Trampoline vs Prewired

| Aspect | Trampoline (Previous) | Prewired (Current) |
| --- | --- | --- |
| Runtime state | Mutable frame (Index, RangeEnd) on context | No runtime state |
| "Next" computation | AdvanceFrame() + dispatch by index | Direct delegate call |
| Allocation per invocation | None (success path) | None |
| Allocation on exception | Frame save/restore + state machine overhead | Minimal (exception only) |
| Stage transitions | SetFrame() to adjust range | Pre-linked continuation |
| Concurrency risk | Frame mutation in concurrent next calls | Immutable chain, safer |

Benefits

Performance

Success path (allocation-free, fastest):

  • Up to 29% faster than expression-based baseline
  • Consistent across pipeline depths

Exception path (async throw):

  • Same allocation profile as baseline (~1.3 KB)
  • Parity with baseline timing

Warmup (first invocation):

  • ~330x faster than expression-based baseline
  • 50% less allocation

Design Simplicity

  • No mutable frame state: Eliminates InitFrame, AdvanceFrame, SetFrame, Frame property
  • No restore logic: Exception handling is natural async propagation, no frame save/restore
  • Immutable by construction: Each node is immutable, thread-safe by design
  • Cleaner concurrency story: No shared mutable state to corrupt

AOT/Trimming Friendly

  • No runtime code generation
  • No expression trees
  • No reflection on hot path
  • Static generic type instantiation at build time

Benchmarks

https://github.com/danielmarbach/MicroBenchmarks and branches starting with bare-metal

Results.zip

BenchmarkDotNet v0.15.8, macOS Tahoe 26.3 (25D125) [Darwin 25.3.0]
Apple M3 Max, 1 CPU, 14 logical and 14 physical cores
.NET SDK 10.0.101
  [Host]     : .NET 10.0.1 (10.0.1, 10.0.125.57005), Arm64 RyuJIT armv8.0-a
  DefaultJob : .NET 10.0.1 (10.0.1, 10.0.125.57005), Arm64 RyuJIT armv8.0-a

Execution

| Method | PipelineDepth | Mean | Error | StdDev | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wired | 10 | 20.94 ns | 0.202 ns | 0.189 ns | 0.75 | - | NA |
| Current | 10 | 28.01 ns | 0.469 ns | 0.439 ns | 1.00 | - | NA |
| Wired | 20 | 43.05 ns | 0.046 ns | 0.036 ns | 0.75 | - | NA |
| Current | 20 | 57.45 ns | 0.351 ns | 0.311 ns | 1.00 | - | NA |
| Wired | 40 | 91.66 ns | 1.022 ns | 0.956 ns | 0.71 | - | NA |
| Current | 40 | 129.26 ns | 0.703 ns | 0.587 ns | 1.00 | - | NA |

Throwing

| Method | PipelineDepth | Mean | Error | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Current | 10 | 6.102 μs | 0.1432 μs | 0.4130 μs | 6.033 μs | 1.00 | 0.1602 | 1.31 KB | 1.00 |
| Wired | 10 | 6.426 μs | 0.2527 μs | 0.7252 μs | 6.285 μs | 1.06 | 0.1602 | 1.31 KB | 1.00 |
| Wired | 20 | 6.441 μs | 0.2101 μs | 0.6162 μs | 6.240 μs | 0.92 | 0.1526 | 1.31 KB | 1.00 |
| Current | 20 | 7.114 μs | 0.3204 μs | 0.9397 μs | 6.982 μs | 1.02 | 0.1526 | 1.31 KB | 1.00 |
| Wired | 40 | 5.646 μs | 0.1093 μs | 0.1258 μs | 5.597 μs | 0.97 | 0.1602 | 1.3 KB | 1.00 |
| Current | 40 | 5.803 μs | 0.1141 μs | 0.2225 μs | 5.753 μs | 1.00 | 0.1602 | 1.31 KB | 1.00 |

Warmup

| Method | PipelineDepth | Mean | Error | StdDev | Ratio | Gen0 | Gen1 | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wired | 10 | 2.021 μs | 0.0396 μs | 0.0407 μs | 0.003 | 2.5368 | 0.0305 | 20.73 KB | 0.44 |
| Current | 10 | 672.160 μs | 7.0983 μs | 6.2925 μs | 1.000 | 4.8828 | 1.9531 | 46.85 KB | 1.00 |
| Wired | 20 | 4.067 μs | 0.0799 μs | 0.1196 μs | 0.003 | 5.0430 | 0.0916 | 41.2 KB | 0.47 |
| Current | 20 | 1,369.096 μs | 27.2932 μs | 45.6008 μs | 1.001 | 9.7656 | 3.9063 | 88.34 KB | 1.00 |
| Wired | 40 | 7.911 μs | 0.1577 μs | 0.2211 μs | 0.003 | 10.0403 | 0.2899 | 82.12 KB | 0.48 |
| Current | 40 | 2,746.676 μs | 54.2674 μs | 105.8443 μs | 1.001 | 19.5313 | 7.8125 | 171.74 KB | 1.00 |

Extended Comparison (MediumRun)

Additional benchmark with sync exception, replay, and trampoline comparison:

Job=MediumRun  IterationCount=15  LaunchCount=2  WarmupCount=10

Success Path Comparison

| Method | PipelineDepth | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- |
| Prewired_Success | 10 | 17.61 ns | 0.76 | - |
| Current_Success | 10 | 23.16 ns | 1.00 | - |
| Trampo_Success | 10 | 28.38 ns | 1.23 | - |
| Prewired_Success | 40 | 91.07 ns | 0.75 | - |
| Current_Success | 40 | 121.73 ns | 1.00 | - |
| Trampo_Success | 40 | 118.21 ns | 0.97 | - |

Async Exception Comparison

| Method | PipelineDepth | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- |
| Prewired_Exception | 10 | 6,051.69 ns | 0.98 | 1337 B |
| Current_Exception | 10 | 6,158.12 ns | 1.00 | 1339 B |
| Trampo_Exception | 10 | 38,927.61 ns | 6.33 | 12987 B |
| Prewired_Exception | 40 | 5,802.32 ns | 0.98 | 1336 B |
| Current_Exception | 40 | 5,918.23 ns | 1.00 | 1339 B |
| Trampo_Exception | 40 | 129,951.86 ns | 21.99 | 129507 B |

Sync Exception Comparison

| Method | PipelineDepth | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- |
| Prewired_Exception_Sync | 10 | 2,338.83 ns | 0.91 | 392 B |
| Current_Exception_Sync | 10 | 2,563.78 ns | 1.00 | 600 B |
| Trampo_Exception_Sync | 10 | 23,015.87 ns | 8.98 | 1624 B |
| Prewired_Exception_Sync | 40 | 2,450.91 ns | 0.92 | 392 B |
| Current_Exception_Sync | 40 | 2,653.10 ns | 1.00 | 600 B |
| Trampo_Exception_Sync | 40 | 93,409.16 ns | 35.22 | 5496 B |

Replay Comparison (Multiple next calls)

| Method | PipelineDepth | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- |
| Trampo_Replay | 10 | 38.53 ns | 0.49 | - |
| Prewired_Replay | 10 | 57.12 ns | 0.73 | - |
| Current_Replay | 10 | 78.54 ns | 1.00 | - |
| Trampo_Replay | 40 | 131.87 ns | 0.35 | - |
| Prewired_Replay | 40 | 280.09 ns | 0.75 | - |
| Current_Replay | 40 | 375.26 ns | 1.00 | - |

Concurrency Improvement

The prewired approach reduces concurrency risk compared to the trampoline:

| Scenario | Trampoline Risk | Prewired Risk |
| --- | --- | --- |
| Sequential next calls | Safe | Safe |
| Concurrent `await Task.WhenAll(next(ctx), next(ctx))` | Frame corruption possible | No engine state to corrupt |

Note: Concurrent next calls are still not a supported contract (downstream behaviors may not be thread-safe for shared context mutation), but the pipeline engine itself no longer has mutable traversal state.

Implementation Details

Build-Time Composition

// PipelineInvoker.cs
public static Func<IBehaviorContext, Task> Build(IReadOnlyList<RegisterStep> steps)
{
    if (steps.Count == 0)
        return CompletedRoot;

    InvokerNode? next = null;
    for (var i = steps.Count - 1; i >= 0; i--)
    {
        next = CreateInvokerNode(steps[i], i, next);
    }

    return next!.Invoke;
}

Reverse composition: Build from end to start so each node can reference the already-built continuation.

Delegate Caching

static Func<TOutContext, Task> CreateNext<TOutContext>(InvokerNode? next)
    where TOutContext : class, IBehaviorContext =>
    next is null ? CompletedNextCache<TOutContext>.Next : next.Invoke;

static class CompletedNextCache<TOutContext> where TOutContext : class, IBehaviorContext
{
    public static readonly Func<TOutContext, Task> Next = _ => Task.CompletedTask;
}

Key optimization: next.Invoke is a bound method-group delegate, not a closure. This is critical because the earlier implementation had context => next.Invoke(context), which allocated per invocation.
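To make the difference concrete, here is a self-contained sketch (the `Node<TContext>` and `Caller<TContext>` types are hypothetical stand-ins, not the shipped code):

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical continuation node for illustration.
sealed class Node<TContext>
{
    public Task Invoke(TContext context) => Task.CompletedTask;
}

sealed class Caller<TContext>
{
    readonly Node<TContext> next;
    readonly Func<TContext, Task> cachedNext;

    public Caller(Node<TContext> next)
    {
        this.next = next;
        // Bound method-group delegate: allocated once, here at construction,
        // and reused for every subsequent pipeline execution.
        cachedNext = next.Invoke;
    }

    public Task HotPathBefore(TContext context)
    {
        // Earlier implementation pattern: constructing the lambda on the hot
        // path allocates a fresh delegate instance on every invocation.
        Func<TContext, Task> nextDelegate = c => next.Invoke(c);
        return nextDelegate(context);
    }

    public Task HotPathAfter(TContext context) =>
        // Current implementation pattern: reuse the cached delegate; the
        // success path stays allocation-free.
        cachedNext(context);
}
```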

Context Integration

// Pipeline.cs
public Task Invoke(TContext context)
{
    context.Extensions.Initialize(behaviors, invoker);
    return context.Extensions.Invoker(context);
}

// ContextBag.cs
internal void Initialize(IBehavior[] withBehaviors, Func<IBehaviorContext, Task> withInvoker)
{
    behaviors = withBehaviors;
    Invoker = withInvoker;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal IBehavior<TInContext, TOutContext> GetBehavior<TInContext, TOutContext>(int index)
    where TInContext : class, IBehaviorContext
    where TOutContext : class, IBehaviorContext =>
    Unsafe.As<IBehavior<TInContext, TOutContext>>(Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(behaviors), index))!;

Behaviors and invoker are set on the context once per invocation, enabling indexed access without closures.

Pipeline Extensibility Constraints

The core pipeline architecture constrains extensibility in specific ways:

What IS Possible via Public API

  • Replace existing stage connectors using PipelineSettings.Replace() (StageConnector<TFrom,TTo> is public)
  • Replace existing stage fork connectors (StageForkConnector is public)
  • Register behaviors within existing stages using built-in context types
  • Implement custom behaviors targeting supported context transitions

What is NOT Possible

  • Create new custom stages (StageConnector<ICustomContext, IAnotherContext>)
  • Create new custom stage forks
  • Any pipeline topology changes beyond replacing existing connectors

Why Custom Stages Are Blocked

  1. BehaviorContext is internal: Without inheriting from it, custom context implementations cannot participate in the pipeline's ContextBag hierarchy.

  2. Factory methods are hardcoded: ConnectorContextExtensions provides public factory methods only for built-in context types (CreateRoutingContext(), CreateDispatchContext(), etc.). No generic factory exists for custom contexts.

  3. Context implementations are internal: All concrete context types (IncomingPhysicalMessageContext, DispatchContext, etc.) inherit from BehaviorContext and are internal.

  4. Invoker node creation is closed: PipelineInvoker.Factory.cs contains switch expressions for all known stage transitions. Adding a new stage requires modifying this internal code.

Design rationale: These constraints ensure the prewired chain can be fully constructed at build time with known type mappings, enabling the performance characteristics demonstrated above.
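For illustration, the closed factory might look roughly like this (a hypothetical sketch only; the actual stage identifiers, matching logic, and the `IsBehavior` helper below are assumptions, not the code in PipelineInvoker.Factory.cs):

```csharp
// Hypothetical sketch of the closed factory: one switch arm per known stage
// transition, each instantiating a concrete InvokerNode<TIn, TOut>. Adding a
// new stage requires adding an arm here, which is why custom stages cannot
// be introduced from outside the core.
static InvokerNode CreateInvokerNode(RegisterStep step, int index, InvokerNode? next) =>
    step switch
    {
        // IsBehavior<TIn, TOut>(step) is an assumed helper that checks the
        // registered behavior's context transition.
        var s when IsBehavior<IIncomingPhysicalMessageContext, IIncomingLogicalMessageContext>(s) =>
            new InvokerNode<IIncomingPhysicalMessageContext, IIncomingLogicalMessageContext>(
                index, CreateNext<IIncomingLogicalMessageContext>(next)),
        // ... one arm per supported stage transition ...
        _ => throw new InvalidOperationException($"Unsupported stage transition: {step}")
    };
```

Because every arm is a static generic instantiation, the full set of node types is visible to the trimmer and AOT compiler at build time.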

Alternatives Considered

Trampoline Model (Benchmarked, Rejected)

The trampoline model was implemented and benchmarked as a predecessor to this approach. See PR #7625 for full details.

How it worked:

  • Pipeline executed as a flat loop-like progression over prebuilt parts
  • Mutable frame on context (Index, RangeEnd) tracked current position
  • Each next() call yielded to PipelineRunner.Next() which advanced frame and dispatched
  • Stage connectors adjusted frame to jump into child ranges

Why rejected:

  • Exception paths showed severe degradation (6-22x slower than baseline, 10-97x more allocation)
  • Sync exceptions were particularly problematic due to frame save/restore overhead
  • Mutable frame state created concurrency risk if next was called concurrently
  • Frame manipulation added complexity to exception handling (save/restore patterns)

Where it won:

  • Replay path was fastest (sequential next calls benefited from shared iterator)
  • Warmup was comparable to prewired approach

Decision: The prewired approach provides better overall characteristics. It improves success-path performance, simplifies exception handling, and enhances concurrency safety while accepting a modest trade-off on replay performance.

Delegate-Factory Approach (Benchmarked, Rejected)

Instead of abstract InvokerNode with virtual dispatch, tried a delegate-factory approach:

sealed class BehaviorInvoker<TIn, TOut>
{
    readonly Func<IBehaviorContext, Task> next;
    readonly Func<TOut, Task> nextDelegate; // Cached to avoid allocation
    
    public Task Invoke(IBehaviorContext context) => ...;
}

Result: ~2x slower on success/replay paths than node-based virtual dispatch. Extra delegate indirection and adapter layers cost more than predicted. Benchmark data confirmed that virtual dispatch on sealed generic types with known shapes is highly optimizable by the JIT.

Source-Generated Invokers

Would likely work, but rejected to avoid:

  • Generator complexity and maintenance
  • Build-time coupling
  • AOT/trimming achieved without source gen

Runtime Codegen/Expression Trees

Rejected for AOT/trimming constraints.

Migration Notes

  • No behavior interface changes: IBehavior<TIn, TOut> remains unchanged
  • No public API changes: Behavior registration and replacement work identically
  • Diagnostics preserved: PipelineStepDiagnostics.PrettyPrint() output unchanged
  • Tests pass: All existing pipeline tests pass, including multi-next replay semantics

Files Changed

  • Pipeline/PipelineInvoker.cs: Core invoker node structure and build logic
  • Pipeline/PipelineInvoker.Factory.cs: Type mapping for known stage transitions
  • Pipeline/Pipeline.cs: Updated to use the prewired invoker
  • Extensibility/ContextBag.cs: Removed frame state and added Invoker property
  • Pipeline/PipelineFrame.cs: Deleted (no longer needed)
  • Pipeline/PipelineRunner.cs: Simplified to a single Start() method

Sequence Diagram

sequenceDiagram
    participant Pipeline as Pipeline.Invoke()
    participant Context as ContextBag
    participant Invoker as InvokerNode.Invoke()
    participant Behavior as behavior.Invoke()
    participant Next as prewired next delegate
    
    Note over Pipeline: Build time: invoker chain created
    Note over Context: Behaviors[] + invoker set once
    
    Pipeline->>Context: Initialize(behaviors, invoker)
    Pipeline->>Context: Invoker(context)
    Context->>Invoker: rootInvoker.Invoke(context)
    
    loop Each behavior in chain
        Invoker->>Behavior: behavior.Invoke(context, next)
        Note right of Behavior: next is prewired delegate
        Behavior->>Next: await next(context)
        Next->>Invoker: nextNode.Invoke(context)
    end
    
    Note over Invoker: Chain completes naturally

The prewired chain replaces the trampoline loop entirely. Each next call is a direct virtual dispatch to the next InvokerNode, not a yield-and-resume pattern.

@danielmarbach danielmarbach changed the title Use a simple delegate chain similar to what previously the expression tree did Replace expression tree pipeline execution with prewired invokers Feb 22, 2026
@danielmarbach danielmarbach marked this pull request as ready for review February 22, 2026 21:43
@tmasternak (Member) left a comment:

Good Stuff!

@SzymonPobiega (Member) left a comment:

So, my understanding is that:

  • Pipeline forks are not represented in the model because they are implemented simply as invoking another pipeline -- there is no change in their implementation from the current solution
  • BuildPipelineFor is called for each pipeline to compile the ordered list of behaviors, connectors and a terminator for each pipeline
  • Build iterates over that list in the reverse order, creating an invoker node using CreateInvokerNode as a factory method
  • The structure of the pipeline, which is currently represented as the IL code of the generated expression tree, is in the new model represented as links between each invoker node and its successor, captured in the next delegates. The "secret sauce" that makes this a viable solution is the behavior downcast "table"

Assuming that at least some of my understanding is correct, I hereby approve this PR.

@danielmarbach (Contributor, Author):

@SzymonPobiega yes absolutely correct.

Co-authored-by: Tomasz Masternak <tomasz.masternak@particular.net>
@danielmarbach (Contributor, Author):

I ran the latest proposal from @tmasternak on my M4. Since it is different hardware, the numbers cannot be compared directly, but it seems to deliver a saving of ~7 nanoseconds per pipeline invocation regardless of hardware. This translates to a 25-32% improvement from the cast change alone. Of course these are synthetic benchmarks, but still a good baseline.

| Method | PipelineDepth | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- |
| Prewired_Success | 10 | 11.08 ns | 0.61 | - |
| Current_Success | 10 | 18.12 ns | 1.00 | - |
| Trampo_Success | 10 | 26.03 ns | 1.44 | - |
| Prewired_Success | 20 | 26.19 ns | 0.71 | - |
| Current_Success | 20 | 36.95 ns | 1.00 | - |
| Trampo_Success | 20 | 50.83 ns | 1.38 | - |
| Prewired_Success | 40 | 59.11 ns | 0.73 | - |
| Current_Success | 40 | 80.95 ns | 1.00 | - |
| Trampo_Success | 40 | 99.41 ns | 1.23 | - |

Improvement: 27-39% faster than Current, 56-73% faster than Trampoline

Async Exception

| Method | PipelineDepth | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- |
| Prewired_Exception | 10 | 4,740.32 ns | 1.01 | 1336 B |
| Current_Exception | 10 | 4,705.66 ns | 1.00 | 1337 B |
| Trampo_Exception | 10 | 30,341.34 ns | 6.45 | 12991 B |
| Prewired_Exception | 20 | 4,655.57 ns | 0.99 | 1336 B |
| Current_Exception | 20 | 4,727.29 ns | 1.00 | 1336 B |
| Trampo_Exception | 20 | 59,908.93 ns | 12.68 | 38936 B |
| Prewired_Exception | 40 | 4,875.63 ns | 1.01 | 1336 B |
| Current_Exception | 40 | 4,841.44 ns | 1.00 | 1337 B |
| Trampo_Exception | 40 | 102,432.39 ns | 21.16 | 129649 B |

Prewired matches Current (~1.0x), Trampoline is 6-21x slower

Sync Exception

| Method | PipelineDepth | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- |
| Prewired_Exception_Sync | 10 | 1,733.52 ns | 0.88 | 392 B |
| Current_Exception_Sync | 10 | 1,977.99 ns | 1.00 | 600 B |
| Trampo_Exception_Sync | 10 | 17,932.47 ns | 9.07 | 1624 B |
| Prewired_Exception_Sync | 20 | 1,797.82 ns | 0.89 | 392 B |
| Current_Exception_Sync | 20 | 2,015.61 ns | 1.00 | 600 B |
| Trampo_Exception_Sync | 20 | 36,234.52 ns | 17.98 | 2928 B |
| Prewired_Exception_Sync | 40 | 1,828.73 ns | 0.89 | 392 B |
| Current_Exception_Sync | 40 | 2,058.15 ns | 1.00 | 600 B |
| Trampo_Exception_Sync | 40 | 74,153.48 ns | 36.05 | 5496 B |

Prewired is 11-12% faster than Current, 35x faster than Trampoline, 35% less allocation than Current

Replay (Multiple next calls)

| Method | PipelineDepth | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- |
| Trampo_Replay | 10 | 35.21 ns | 0.59 | - |
| Prewired_Replay | 10 | 40.66 ns | 0.68 | - |
| Current_Replay | 10 | 59.70 ns | 1.00 | - |
| Trampo_Replay | 20 | 59.21 ns | 0.49 | - |
| Prewired_Replay | 20 | 87.88 ns | 0.73 | - |
| Current_Replay | 20 | 120.17 ns | 1.00 | - |
| Trampo_Replay | 40 | 113.91 ns | 0.44 | - |
| Prewired_Replay | 40 | 196.49 ns | 0.76 | - |
| Current_Replay | 40 | 258.29 ns | 1.00 | - |

Trampoline fastest for replay, Prewired 24-27% faster than Current


Performance Improvement Difference

Before vs After (Hardware Change: M3 Max → M4 Max)

| Scenario | Depth | Original (M3) | New (M4) | Current (M3) | Current (M4) | Wired Improvement |
| --- | --- | --- | --- | --- | --- | --- |
| Execution | 10 | 20.94 ns | 14.41 ns | 28.01 ns | 21.27 ns | 31% → 32% |
| Execution | 20 | 43.05 ns | 30.35 ns | 57.45 ns | 41.83 ns | 25% → 27% |
| Execution | 40 | 91.66 ns | 64.48 ns | 129.26 ns | 86.45 ns | 29% → 25% |

| Scenario | Depth | Original (M3) | New (M4) | Current (M3) | Current (M4) | Improvement |
| --- | --- | --- | --- | --- | --- | --- |
| Async Exception | 10 | 6.426 μs | 4.749 μs | 6.102 μs | 4.836 μs | -2% → 2% faster |
| Async Exception | 40 | 5.646 μs | 4.739 μs | 5.803 μs | 4.861 μs | -3% → 2% faster |
| Sync Exception | 10 | 2,338.83 ns | 1,733.52 ns | 2,563.78 ns | 1,977.99 ns | 9% → 12% faster |
| Sync Exception | 40 | 2,450.91 ns | 1,828.73 ns | 2,653.10 ns | 2,058.15 ns | 8% → 11% faster |

| Scenario | Depth | Original (M3) | New (M4) | Current (M3) | Current (M4) | Improvement |
| --- | --- | --- | --- | --- | --- | --- |
| Warmup | 10 | 2.021 μs | 1.924 μs | 672.160 μs | 543.887 μs | 333x → 283x faster |
| Warmup | 20 | 4.067 μs | 3.594 μs | 1,369.096 μs | 1,064.967 μs | 337x → 296x faster |
| Warmup | 40 | 7.911 μs | 6.895 μs | 2,746.676 μs | 2,107.101 μs | 347x → 306x faster |

Summary

The new hardware (M4 Max) shows ~30-40% better absolute performance across all scenarios. The relative improvement ratios remain consistent between the two hardware platforms, confirming the wired approach delivers consistent performance gains regardless of hardware.

| Scenario | Wired/Prewired | Current | Improvement |
| --- | --- | --- | --- |
| Execution (40 deep) | 64.48 ns | 86.45 ns | 25% faster |
| Async Exception (40 deep) | 4.739 μs | 4.861 μs | 2% faster |
| Sync Exception (40 deep) | 1,828.73 ns | 2,058.15 ns | 11% faster, 35% less alloc |
| Warmup (40 deep) | 6.895 μs | 2,107.101 μs | 306x faster, 49% less alloc |
| Success (40 deep) | 59.11 ns | 80.95 ns | 27% faster |

@andreasohlund andreasohlund added this to the 10.2.0 milestone Feb 24, 2026