New blog: AI Dynamo on AKS series part 2 #5558
Pull request overview
This pull request adds Part 2 of the "AI Dynamo on AKS" blog series, focusing on NVIDIA Dynamo's Planner Profiler and SLO-based Planner for optimizing multi-node LLM inference. The PR also updates the Part 1 post's tags to establish series consistency.
Changes:
- New blog post introducing Dynamo Planner tools for automated performance tuning and dynamic scaling
- Added "Dynamo series" tag to Part 1 for series linking
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| website/blog/2026-01-22-dynamo-on-aks-part-2/index.md | New Part 2 blog post covering the Dynamo Planner Profiler and SLO-based Planner with real-world airline app scenario |
| website/blog/2025-10-24-dynamo-on-aks/index.md | Updated tags to include "Dynamo series" for series consistency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
website/blog/2026-01-22-dynamo-on-aks-part-2/index.md:102
- The acronym "AIC" is defined here but never used again in the document. Consider removing the acronym since it's only mentioned once, or if it will be referenced later, ensure it's used consistently.
> * **Hardware Simulation**: Using the **AI Configurator (AIC)** mode, the
allyford left a comment:
Some nits, overall lgtm!
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
> <!-- truncate -->
>
> ## The Challenge: Balancing the "Rate Matching" Equation
>
> Disaggregated serving separates the prefill and decode phases of inference
> across distinct GPU nodes. This allows each phase to be independently
> optimized with custom GPU counts and model parallelism configurations.
>
> 
Copilot (AI) commented on Jan 20, 2026:
According to the blog post content guidelines, a hero image should be placed immediately after the truncate marker. The current structure has the truncate marker on line 26, but the first image appears on line 34 after heading content. Consider adding a hero image right after the truncate marker to follow the recommended pattern.
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.
> static optimization becomes a dynamic, SLO-aware reality on AKS, capable of
> weathering the unpredictable traffic spikes of production environments.
>
> Ultimately, this suite transforms your inference stack from a series of
Copilot (AI) commented on Jan 20, 2026:
Missing comma after introductory phrase. The sentence should read: "Ultimately, this suite transforms your inference stack..."
> Now, you can try this yourself by running the NVIDIA Dynamo Planner Profiler
> to capture burst and request behavior, then using the SLO-based Planner to
> translate latency targets into placement and scaling decisions on your AKS
> cluster. Setting it up in this order - profile under stress, define SLOs,
Copilot (AI) commented on Jan 20, 2026:
The dash separating items in this sentence should be an em dash (—) without spaces, or use a colon or period to separate the two independent clauses. Current usage with a hyphen is incorrect punctuation.
Suggested change:
- cluster. Setting it up in this order - profile under stress, define SLOs,
+ cluster. Setting it up in this order: profile under stress, define SLOs,
> these peaks, the underlying system requires the precise orchestration
> offered by a disaggregated architecture.
>
> To build a truly efficient disaggregated AI inference system you
Copilot (AI) commented on Jan 20, 2026:
Missing comma after introductory phrase. The sentence should read: "To build a truly efficient disaggregated AI inference system, you need to transition from manual..."
Suggested change:
- To build a truly efficient disaggregated AI inference system you
+ To build a truly efficient disaggregated AI inference system, you
> configuration that maximizes "Goodput", the maximum throughput
> achievable while staying strictly within your latency bounds.
>
> Ultimately the app developers and AI engineers reduce their time
Copilot (AI) commented on Jan 20, 2026:
Missing comma after introductory phrase. The sentence should read: "Ultimately, the app developers and AI engineers reduce their time..."
Suggested change:
- Ultimately the app developers and AI engineers reduce their time
+ Ultimately, the app developers and AI engineers reduce their time
|  | ||
|
|
||
| One of the main challenges in disaggregated serving is **rate matching**: | ||
| determining the right GPU allocation between prefill and decode stages to |
can you add more context on what the prefill and decode stages are and their importance before diving into how to solve these issues?
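For quick context (standard LLM-serving background, not text from the post): prefill runs the entire prompt through the model in one parallel pass to build the KV cache and produce the first token, so it is compute-bound and sets time-to-first-token (TTFT); decode then generates output tokens one at a time against that cache, so it is memory-bandwidth-bound and sets inter-token latency. A minimal Python sketch of the split, where `model.forward` and `model.forward_step` are hypothetical stand-ins rather than Dynamo's API:

```python
# Illustrative sketch only -- not Dynamo code. `model.forward` and
# `model.forward_step` are hypothetical stand-ins.

def prefill(model, prompt_tokens):
    # One parallel pass over the whole prompt: compute-bound.
    # Builds the KV cache and emits the first token, so it sets TTFT.
    kv_cache, first_token = model.forward(prompt_tokens)
    return kv_cache, first_token

def decode(model, kv_cache, first_token, max_new_tokens):
    # Autoregressive loop: one token per step, rereading the KV cache
    # each time. Memory-bandwidth-bound; it sets inter-token latency.
    tokens = [first_token]
    for _ in range(max_new_tokens - 1):
        kv_cache, next_token = model.forward_step(tokens[-1], kv_cache)
        tokens.append(next_token)
    return tokens
```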
> One of the main challenges in disaggregated serving is **rate matching**:
> determining the right GPU allocation between prefill and decode stages to
> meet a specific Service Level Objective (SLO). If you miscalculate the GPU
> ratio between these stages, you face two "silent killers" of performance:
is there a quick example on how the gpu ratio is calculated without the AI dynamo solution to show the benefit of AI dynamo?
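For illustration, a back-of-envelope version of the manual calculation, with made-up throughput numbers rather than measured Dynamo results:

```python
# Back-of-envelope rate matching with made-up numbers -- not measured data.
PREFILL_TOK_PER_S = 10_000  # prompt tokens/s one prefill worker sustains
DECODE_TOK_PER_S = 1_000    # output tokens/s one decode worker sustains

def prefill_to_decode_ratio(prompt_len: int, output_len: int) -> float:
    # Seconds of each stage consumed per request; their ratio is the
    # prefill:decode worker split that keeps both pools equally busy.
    prefill_s = prompt_len / PREFILL_TOK_PER_S
    decode_s = output_len / DECODE_TOK_PER_S
    return prefill_s / decode_s

print(prefill_to_decode_ratio(2_000, 200))  # 1.0 -> 1:1 worker split
print(prefill_to_decode_ratio(8_000, 200))  # 4.0 -> 4:1 worker split
```

The right ratio swings from 1:1 to 4:1 just by changing the prompt/response mix, which is what makes a static hand calculation fragile.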
> rerouting during flight delays. This use case is a 'stress test' for
> inference: it is subject to massive, sudden bursts in traffic and highly
> variable request patterns, such as a mix of short status queries and
> long-context itinerary processing. To prevent latency spikes during
Not very familiar with long-context itinerary processing. An example I could think of is when a flight gets cancelled and all the passengers on that flight are rebooking at the same time.
> profiler can simulate performance in just 20–30 seconds
> based on pre-measured performance data, allowing for rapid
> iteration before you ever touch a physical GPU.
> * **Resulting Recommendation**: The output is a highly tuned
It's a bit hard for me to follow how the Dynamo Planner Profiler automatically does this and the added benefits. Is there an example of what this looks like?
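For a rough sense of what the profiler does conceptually, a toy SLO-constrained sweep; the numbers and the selection rule here are hypothetical, not Dynamo's actual code or output format:

```python
# Toy SLO-constrained sweep -- hypothetical numbers, not Dynamo output.
candidates = [
    # (tensor_parallel, ttft_s, itl_s, tokens_per_s_per_gpu)
    (1, 0.9, 0.015, 480),
    (2, 0.5, 0.011, 310),
    (4, 0.3, 0.008, 190),
]
TTFT_SLO, ITL_SLO = 0.6, 0.012  # example latency targets

# "Goodput": keep only configs that respect both latency bounds,
# then pick the one with the highest per-GPU throughput.
feasible = [c for c in candidates if c[1] <= TTFT_SLO and c[2] <= ITL_SLO]
tp, ttft, itl, tput = max(feasible, key=lambda c: c[3])
print(f"recommend TP={tp}: {tput} tok/s/GPU within SLO")
```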
> Using the performance bounds identified earlier by the profiler, the Planner
> proactively scales the number of prefill and decode workers up or down. For
> example, if a *sudden burst of long-context itinerary queries* floods the
> system, the Planner detects the spike in the prefill queue and shifts available
> GPU resources to the prefill pool *before* your TTFT violates its SLO.
>
> Now, you can try this yourself by running the NVIDIA Dynamo Planner Profiler
> to capture burst and request behavior, then using the SLO-based Planner to
> translate latency targets into placement and scaling decisions on your AKS
> cluster. Setting it up in this order - profile under stress, define SLOs,
> and let the planner orchestrate your disaggregated inference system to
> handle sudden traffic spikes without latency spikes.
More detailed example could help here as well
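As one way to make this concrete, an illustrative queue-driven scaling rule under assumed numbers; this is not the Planner's real algorithm, just the shape of the decision it automates:

```python
import math

# Illustrative scaling rule only -- not the Planner's real algorithm.
def desired_prefill_workers(queued_prompt_tokens: int,
                            tokens_per_s_per_worker: float,
                            ttft_budget_s: float) -> int:
    # Enough workers that the queued prompt tokens drain within the
    # TTFT budget; the decode pool would get the remaining GPUs.
    capacity_per_worker = tokens_per_s_per_worker * ttft_budget_s
    return max(1, math.ceil(queued_prompt_tokens / capacity_per_worker))

# A burst of long-context itinerary queries: 600k queued prompt tokens,
# workers that each sustain 100k tokens/s, and a 0.5 s TTFT budget.
print(desired_prefill_workers(600_000, 100_000, 0.5))  # -> 12
```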
No description provided.