Replace expression tree pipeline execution with prewired invokers #7631
danielmarbach wants to merge 3 commits into master
Conversation
SzymonPobiega left a comment
So, my understanding is that:
- Pipeline forks are not represented in the model because they are implemented simply as invoking another pipeline -- there is no change in their implementation from the current solution
- `BuildPipelineFor` is called for each pipeline to compile the ordered list of behaviors, connectors, and a terminator for each pipeline
- `Build` iterates over that list in reverse order, creating an invoker node using `CreatedInvokerNode` as a factory method
- The structure of the pipeline, which is currently represented as the IL code of the generated expression tree, is in the new model represented as links between an invoker node and its successor, captured in the `next` delegates. The "secret sauce" that makes it a viable solution is the behavior downcast "table"

Assuming that at least some of my understanding is correct, I hereby approve this PR.
@SzymonPobiega yes absolutely correct.
Co-authored-by: Tomasz Masternak <tomasz.masternak@particular.net>
I ran the latest proposal from @tmasternak on my M4. Since it is different hardware the numbers cannot be directly compared, but it seems to deliver a saving of ~7 nanoseconds per pipeline invocation regardless of hardware. This translates to a 25-32% improvement from the cast change alone. Of course these are synthetic benchmarks, but still a good baseline.
Improvement: 27-39% faster than Current, 56-73% faster than Trampoline

Async Exception: Prewired matches Current (~1.0x), Trampoline is 6-21x slower

Sync Exception: Prewired is 11-12% faster than Current, 35x faster than Trampoline, 35% less allocation than Current

Replay (Multiple `next` calls)
| Method | PipelineDepth | Mean | Ratio | Allocated |
|---|---|---|---|---|
| Trampo_Replay | 10 | 35.21 ns | 0.59 | - |
| Prewired_Replay | 10 | 40.66 ns | 0.68 | - |
| Current_Replay | 10 | 59.70 ns | 1.00 | - |
| Trampo_Replay | 20 | 59.21 ns | 0.49 | - |
| Prewired_Replay | 20 | 87.88 ns | 0.73 | - |
| Current_Replay | 20 | 120.17 ns | 1.00 | - |
| Trampo_Replay | 40 | 113.91 ns | 0.44 | - |
| Prewired_Replay | 40 | 196.49 ns | 0.76 | - |
| Current_Replay | 40 | 258.29 ns | 1.00 | - |
Trampoline fastest for replay, Prewired 24-27% faster than Current
Performance Improvement Difference
Before vs After (Hardware Change: M3 Max → M4 Max)
| Scenario | Depth | Original (M3) | New (M4) | Current (M3) | Current (M4) | Wired Improvement |
|---|---|---|---|---|---|---|
| Execution | 10 | 20.94 ns | 14.41 ns | 28.01 ns | 21.27 ns | 31% → 32% |
| Execution | 20 | 43.05 ns | 30.35 ns | 57.45 ns | 41.83 ns | 25% → 27% |
| Execution | 40 | 91.66 ns | 64.48 ns | 129.26 ns | 86.45 ns | 29% → 25% |
| Scenario | Depth | Original (M3) | New (M4) | Current (M3) | Current (M4) | Improvement |
|---|---|---|---|---|---|---|
| Async Exception | 10 | 6.426 μs | 4.749 μs | 6.102 μs | 4.836 μs | -2% → 2% faster |
| Async Exception | 40 | 5.646 μs | 4.739 μs | 5.803 μs | 4.861 μs | -3% → 2% faster |
| Sync Exception | 10 | 2,338.83 ns | 1,733.52 ns | 2,563.78 ns | 1,977.99 ns | 9% → 12% faster |
| Sync Exception | 40 | 2,450.91 ns | 1,828.73 ns | 2,653.10 ns | 2,058.15 ns | 8% → 11% faster |
| Scenario | Depth | Original (M3) | New (M4) | Current (M3) | Current (M4) | Improvement |
|---|---|---|---|---|---|---|
| Warmup | 10 | 2.021 μs | 1.924 μs | 672.160 μs | 543.887 μs | 333x → 283x faster |
| Warmup | 20 | 4.067 μs | 3.594 μs | 1,369.096 μs | 1,064.967 μs | 337x → 296x faster |
| Warmup | 40 | 7.911 μs | 6.895 μs | 2,746.676 μs | 2,107.101 μs | 347x → 306x faster |
Summary
The new hardware (M4 Max) shows ~30-40% better absolute performance across all scenarios. The relative improvement ratios remain consistent between the two hardware platforms, confirming the wired approach delivers consistent performance gains regardless of hardware.
| Scenario | Wired/Prewired | Current | Improvement |
|---|---|---|---|
| Execution (40 deep) | 64.48 ns | 86.45 ns | 25% faster |
| Async Exception (40 deep) | 4.739 μs | 4.861 μs | 2% faster |
| Sync Exception (40 deep) | 1,828.73 ns | 2,058.15 ns | 11% faster, 35% less alloc |
| Warmup (40 deep) | 6.895 μs | 2,107.101 μs | 306x faster, 49% less alloc |
| Success (40 deep) | 59.11 ns | 80.95 ns | 27% faster |
This PR moves pipeline execution to a prewired, immutable continuation chain, keeps state on the context bag, and improves AOT/trimming friendliness while minimizing broader architectural change and delivering significant performance improvements across all execution paths.
Summary
The previous trampoline model proposal (see #7625) executed the pipeline as a flat loop-like progression over prebuilt parts, tracking position via a mutable frame on the context. This new approach prewires the entire continuation chain at pipeline build time, eliminating frame manipulation entirely.
Key change: Instead of computing "what's next" at runtime via frame index advancement, each behavior receives a prewired `next` delegate that directly invokes the subsequent behavior. The chain is built once at pipeline construction and executed millions of times with zero allocation, using key findings from the trampoline investigations.

Alignment with expression-tree model: Like the current expression-tree-based model (the "Current" baseline in benchmarks), this approach pre-composes the continuation chain at build time. However, it achieves this through static generic type instantiation rather than runtime expression compilation.
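The prewiring idea can be sketched in a few lines. This is an illustrative Python analogue of the pattern, not the C# implementation; the names `build_pipeline` and `make_node` are hypothetical:

```python
def make_node(behavior, next_delegate):
    # Created once at build time; invoking the pipeline afterwards
    # allocates no new delegates.
    def invoke(context):
        return behavior(context, next_delegate)
    return invoke

def build_pipeline(behaviors):
    # Terminal continuation marks the end of the chain.
    next_delegate = lambda context: context
    # Compose in reverse so each behavior captures its successor once.
    for behavior in reversed(behaviors):
        next_delegate = make_node(behavior, next_delegate)
    return next_delegate

# Two toy behaviors that record the traversal order.
def logging(context, next):
    context.append("log")
    return next(context)

def retry(context, next):
    context.append("retry")
    return next(context)

pipeline = build_pipeline([logging, retry])
print(pipeline([]))  # -> ['log', 'retry']
```

Calling `build_pipeline` once and reusing the returned delegate mirrors the "built once, executed millions of times" property described above.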
Architecture
Prewired Continuation Chain
Node Structure
Critical allocation optimization:
- `next` is cached at construction time, not per-invocation
- `next.Invoke` is a bound method-group delegate (no closure)
- Terminal `next`: `CompletedNextCache<TOutContext>.Next`

Comparison: Trampoline vs Prewired
- Trampoline: `AdvanceFrame()` + dispatch by index
- Prewired: direct `next` calls

Benefits
Performance
Success path (allocation-free, fastest):
Exception path (async throw):
Warmup (first invocation):
Design Simplicity
- Removed: `InitFrame`, `AdvanceFrame`, `SetFrame`, `Frame` property

AOT/Trimming Friendly
Benchmarks
https://github.com/danielmarbach/MicroBenchmarks and branches starting with `bare-metal`

Results.zip
Execution
Throwing
Warmup
Extended Comparison (MediumRun)
Additional benchmark with sync exception, replay, and trampoline comparison:
Success Path Comparison
Async Exception Comparison
Sync Exception Comparison
Replay Comparison (Multiple `next` calls)

Concurrency Improvement
The prewired approach reduces concurrency risk compared to the trampoline: there is no shared frame to corrupt when a behavior issues concurrent `next` calls, e.g. `await Task.WhenAll(next(ctx), next(ctx))`.

Note: Concurrent `next` calls are still not a supported contract (downstream behaviors may not be thread-safe for shared context mutation), but the pipeline engine itself no longer has mutable traversal state.

Implementation Details
Build-Time Composition
Reverse composition: Build from end to start so each node can reference the already-built continuation.
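The reverse composition can be sketched as follows. This is a Python analogue under assumed names; `InvokerNode` and `build` stand in for the real C# types:

```python
class InvokerNode:
    def __init__(self, behavior, next_node):
        self.behavior = behavior
        # Capture the already-built successor once at construction.
        self.next = next_node.invoke if next_node else (lambda ctx: ctx)

    def invoke(self, context):
        return self.behavior(context, self.next)

def build(parts):
    node = None
    for behavior in reversed(parts):  # walk from the terminator back to the start
        node = InvokerNode(behavior, node)
    return node

root = build([
    lambda ctx, next: next(ctx + 1),   # first behavior
    lambda ctx, next: next(ctx * 2),   # second behavior (runs after the first)
])
print(root.invoke(3))  # (3 + 1) * 2 = 8
```

Because each node only needs its successor, a single reverse pass suffices to wire the entire chain.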
Delegate Caching
Key optimization: `next.Invoke` is a bound method-group delegate, not a closure. This is critical because the earlier implementation had `context => next.Invoke(context)`, which allocated per invocation.
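The difference is easy to demonstrate with a Python analogue, where accessing a method on an object also builds a fresh bound-method object on every access unless it is cached:

```python
class Terminal:
    def invoke(self, context):
        return context

class Node:
    def __init__(self, behavior, next_node):
        self.behavior = behavior
        # Bind the successor's invoke exactly once at construction -- the
        # analogue of caching the method-group delegate in C#.
        self.next = next_node.invoke

    def invoke(self, context):
        # No per-call wrapper like `context => next.Invoke(context)` is created.
        return self.behavior(context, self.next)

terminal = Terminal()
print(terminal.invoke is terminal.invoke)  # False: a new bound method per access

node = Node(lambda ctx, next: next(ctx), terminal)
print(node.next is node.next)              # True: cached once, reused forever
```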
Behaviors and invoker are set on the context once per invocation, enabling indexed access without closures.
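A rough sketch of that wiring, with hypothetical names (the real context type is the extensible `ContextBag`):

```python
class ContextBag:
    # Simplified stand-in for the pipeline's context bag.
    def __init__(self):
        self.behaviors = None
        self.invoker = None

def invoke_pipeline(context, behaviors, root_invoker):
    # Set once per invocation; nodes can then index into context.behaviors
    # instead of closing over them.
    context.behaviors = behaviors
    context.invoker = root_invoker
    return context.invoker(context)

ctx = ContextBag()
result = invoke_pipeline(ctx, ["validate", "route"], lambda c: len(c.behaviors))
print(result)  # 2
```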
Pipeline Extensibility Constraints
The core pipeline architecture constrains extensibility in specific ways:
What IS Possible via Public API
- `PipelineSettings.Replace()`
- Stage connectors (`StageConnector<TFrom,TTo>` is public)
- Stage fork connectors (`StageForkConnector` is public)

What is NOT Possible
- Custom stages with custom context types (e.g. `StageConnector<ICustomContext, IAnotherContext>`)

Why Custom Stages Are Blocked
- `BehaviorContext` is internal: without inheriting from it, custom context implementations cannot participate in the pipeline's `ContextBag` hierarchy.
- Factory methods are hardcoded: `ConnectorContextExtensions` provides public factory methods only for built-in context types (`CreateRoutingContext()`, `CreateDispatchContext()`, etc.). No generic factory exists for custom contexts.
- Context implementations are internal: all concrete context types (`IncomingPhysicalMessageContext`, `DispatchContext`, etc.) inherit from `BehaviorContext` and are internal.
- Invoker node creation is closed: `PipelineInvoker.Factory.cs` contains switch expressions for all known stage transitions. Adding a new stage requires modifying this internal code.

Design rationale: These constraints ensure the prewired chain can be fully constructed at build time with known type mappings, enabling the performance characteristics demonstrated above.
Alternatives Considered
Trampoline Model (Benchmarked, Rejected)
The trampoline model was implemented and benchmarked as a predecessor to this approach. See PR #7625 for full details.
How it worked:
- A mutable frame (`Index`, `RangeEnd`) tracked the current position
- Each `next()` call yielded to `PipelineRunner.Next()`, which advanced the frame and dispatched

Why rejected:
- The mutable frame could be corrupted when `next` was called concurrently

Where it won:
- Replay scenarios (multiple `next` calls benefited from the shared iterator)

Decision: The prewired approach provides better overall characteristics. It improves success-path performance, simplifies exception handling, and enhances concurrency safety while accepting a modest trade-off on replay performance.
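For contrast, the trampoline's dispatch can be sketched like this (a hypothetical Python shape, not the actual C# code). The shared mutable `frame.index` is exactly the state that made concurrent `next` calls unsafe:

```python
class Frame:
    def __init__(self):
        self.index = 0  # mutable traversal state shared by every next() call

def run_trampoline(behaviors, context, frame):
    def next(ctx):
        # Every next() advances the shared frame and dispatches by index.
        frame.index += 1
        if frame.index < len(behaviors):
            return behaviors[frame.index](ctx, next)
        return ctx
    return behaviors[0](context, next)

order = []
def first(ctx, next):
    order.append("first")
    return next(ctx)

def second(ctx, next):
    order.append("second")
    return next(ctx)

run_trampoline([first, second], {}, Frame())
print(order)  # ['first', 'second']
```

Two `next()` calls racing on the same `Frame` would both bump `index`, skipping behaviors; the prewired chain has no such shared counter to corrupt.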
Delegate-Factory Approach (Benchmarked, Rejected)
Instead of abstract
InvokerNodewith virtual dispatch, tried a delegate-factory approach:Result: ~2x slower on success/replay paths than node-based virtual dispatch. Extra delegate indirection and adapter layers cost more than predicted. Benchmark data confirmed that virtual dispatch on sealed generic types with known shapes is highly optimizable by the JIT.
Source-Generated Invokers
Would likely work, but rejected to avoid:
Runtime Codegen/Expression Trees
Rejected for AOT/trimming constraints.
Migration Notes
- `IBehavior<TIn, TOut>` remains unchanged
- `PipelineStepDiagnostics.PrettyPrint()` output unchanged
- `next` replay semantics preserved

Files Changed
- `Pipeline/PipelineInvoker.cs`: Core invoker node structure and build logic
- `Pipeline/PipelineInvoker.Factory.cs`: Type mapping for known stage transitions
- `Pipeline/Pipeline.cs`: Updated to use the prewired invoker
- `Extensibility/ContextBag.cs`: Removed frame state and added `Invoker` property
- `Pipeline/PipelineFrame.cs`: Deleted (no longer needed)
- `Pipeline/PipelineRunner.cs`: Simplified to a single `Start()` method

Sequence Diagram
```mermaid
sequenceDiagram
    participant Pipeline as Pipeline.Invoke()
    participant Context as ContextBag
    participant Invoker as InvokerNode.Invoke()
    participant Behavior as behavior.Invoke()
    participant Next as prewired next delegate
    Note over Pipeline: Build time: invoker chain created
    Note over Context: Behaviors[] + invoker set once
    Pipeline->>Context: Initialize(behaviors, invoker)
    Pipeline->>Context: Invoker(context)
    Context->>Invoker: rootInvoker.Invoke(context)
    loop Each behavior in chain
        Invoker->>Behavior: behavior.Invoke(context, next)
        Note right of Behavior: next is prewired delegate
        Behavior->>Next: await next(context)
        Next->>Invoker: nextNode.Invoke(context)
    end
    Note over Invoker: Chain completes naturally
```

The prewired chain replaces the trampoline loop entirely. Each `next` call is a direct virtual dispatch to the next `InvokerNode`, not a yield-and-resume pattern.