
Conversation

@sdesai345 (Contributor)

No description provided.

Copilot AI left a comment

Pull request overview

This pull request adds Part 2 of the "AI Dynamo on AKS" blog series, focusing on NVIDIA Dynamo's Planner Profiler and SLO-based Planner for optimizing multi-node LLM inference. The PR also updates the Part 1 post's tags to establish series consistency.

Changes:

  • New blog post introducing Dynamo Planner tools for automated performance tuning and dynamic scaling
  • Added "Dynamo series" tag to Part 1 for series linking

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 8 comments.

Files changed:

  • website/blog/2026-01-22-dynamo-on-aks-part-2/index.md: New Part 2 blog post covering the Dynamo Planner Profiler and SLO-based Planner with a real-world airline app scenario
  • website/blog/2025-10-24-dynamo-on-aks/index.md: Updated tags to include "Dynamo series" for series consistency

Copilot AI review requested due to automatic review settings January 16, 2026 21:27
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

website/blog/2026-01-22-dynamo-on-aks-part-2/index.md:102

  • The acronym "AIC" is defined here but never used again in the document. Consider removing the acronym since it's only mentioned once, or if it will be referenced later, ensure it's used consistently.
* **Hardware Simulation**: Using the **AI Configurator (AIC)** mode, the

@allyford left a comment

Some nits, overall lgtm!

Copilot AI review requested due to automatic review settings January 20, 2026 15:16
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings January 20, 2026 15:31
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

Comment on lines +26 to +34
<!-- truncate -->

## The Challenge: Balancing the "Rate Matching" Equation

Disaggregated serving separates the prefill and decode phases of inference
across distinct GPU nodes. This allows each phase to be independently
optimized with custom GPU counts and model parallelism configurations.

![Disaggregated serving with Dynamo](./disag-serving-with-dynamo.png)
Copilot AI Jan 20, 2026

According to the blog post content guidelines, a hero image should be placed immediately after the truncate marker. The current structure has the truncate marker on line 26, but the first image appears on line 34 after heading content. Consider adding a hero image right after the truncate marker to follow the recommended pattern.

Copilot generated this review using guidance from repository custom instructions.
Copilot AI review requested due to automatic review settings January 20, 2026 16:48
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

@Azure deleted a comment from Copilot AI Jan 20, 2026
@Azure deleted a comment from Copilot AI Jan 20, 2026
Copilot AI review requested due to automatic review settings January 20, 2026 18:29
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

static optimization becomes a dynamic, SLO-aware reality on AKS, capable of
weathering the unpredictable traffic spikes of production environments.

Ultimately, this suite transforms your inference stack from a series of
Copilot AI Jan 20, 2026

Missing comma after introductory phrase. The sentence should read: "Ultimately, this suite transforms your inference stack..."

Now, you can try this yourself by running the NVIDIA Dynamo Planner Profiler
to capture burst and request behavior, then using the SLO-based Planner to
translate latency targets into placement and scaling decisions on your AKS
cluster. Setting it up in this order - profile under stress, define SLOs,
Copilot AI Jan 20, 2026

The dash separating items in this sentence should be an em dash (—) without spaces, or use a colon or period to separate the two independent clauses. Current usage with a hyphen is incorrect punctuation.

Suggested change:
- cluster. Setting it up in this order - profile under stress, define SLOs,
+ cluster. Setting it up in this order: profile under stress, define SLOs,

these peaks, the underlying system requires the precise orchestration
offered by a disaggregated architecture.

To build a truly efficient disaggregated AI inference system you
Copilot AI Jan 20, 2026

Missing comma after introductory phrase. The sentence should read: "To build a truly efficient disaggregated AI inference system, you need to transition from manual..."

Suggested change:
- To build a truly efficient disaggregated AI inference system you
+ To build a truly efficient disaggregated AI inference system, you

configuration that maximizes "Goodput", the maximum throughput
achievable while staying strictly within your latency bounds.

Ultimately the app developers and AI engineers reduce their time
Copilot AI Jan 20, 2026

Missing comma after introductory phrase. The sentence should read: "Ultimately, the app developers and AI engineers reduce their time..."

Suggested change:
- Ultimately the app developers and AI engineers reduce their time
+ Ultimately, the app developers and AI engineers reduce their time

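As context for the "Goodput" definition quoted above, it can be read as a constrained maximization (the notation here is illustrative, not taken from the post):

$$\text{Goodput} = \max_{c \,\in\, \text{configs}} \; \text{Throughput}(c) \quad \text{subject to} \quad \text{Latency}_i(c) \le \text{SLO}_i \;\; \forall i$$

where $c$ ranges over prefill/decode GPU allocations and parallelism settings, and the constraints are the latency SLOs such as TTFT.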
![Disaggregated serving with Dynamo](./disag-serving-with-dynamo.png)

One of the main challenges in disaggregated serving is **rate matching**:
determining the right GPU allocation between prefill and decode stages to
Contributor

can you add more context on what the prefill and decode stages are and their importance before diving into how to solve these issues?

One of the main challenges in disaggregated serving is **rate matching**:
determining the right GPU allocation between prefill and decode stages to
meet a specific Service Level Objective (SLO). If you miscalculate the GPU
ratio between these stages, you face two "silent killers" of performance:
Contributor

is there a quick example on how the gpu ratio is calculated without the AI dynamo solution to show the benefit of AI dynamo?
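For illustration, a back-of-envelope version of that calculation might look like the sketch below. Every number is a hypothetical assumption, not a measured value; this is the manual math the Planner is meant to replace:

```python
import math

# Hypothetical workload (assumed values, for illustration only)
req_per_s = 50          # request arrival rate
input_tokens = 2_000    # average prompt length
output_tokens = 300     # average generated length

# Hypothetical per-GPU throughput (would come from benchmarking)
prefill_tok_per_gpu_s = 20_000   # prompt tokens processed per GPU per second
decode_tok_per_gpu_s = 4_000     # output tokens generated per GPU per second

# Rate matching: each stage must keep up with the offered token rate
prefill_gpus = math.ceil(req_per_s * input_tokens / prefill_tok_per_gpu_s)  # 5
decode_gpus = math.ceil(req_per_s * output_tokens / decode_tok_per_gpu_s)   # 4

print(f"prefill:decode GPU ratio = {prefill_gpus}:{decode_gpus}")
# The catch: this static ratio goes stale as soon as the input/output
# length mix shifts, which is exactly what the Planner automates.
```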

rerouting during flight delays. This use case is a 'stress test' for
inference: it is subject to massive, sudden bursts in traffic and highly
variable request patterns, such as a mix of short status queries and
long-context itinerary processing. To prevent latency spikes during
Contributor

Not very familiar with long-context itinerary processing. An example I could think of is when a flight gets cancelled and all the passengers on that flight are rebooking at the same time.

profiler can simulate performance in just 20–30 seconds
based on pre-measured performance data, allowing for rapid
iteration before you ever touch a physical GPU.
* **Resulting Recommendation**: The output is a highly tuned
Contributor

It's a bit hard for me to follow how the Dynamo Planner Profiler automatically does this and what the added benefits are. Is there an example of what this looks like?
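One way the post could make this concrete is to show the shape of a profiler recommendation. A hypothetical illustration (field names and values invented here, not Dynamo's actual output format):

```python
# Hypothetical shape of a profiler sweep result (illustrative only;
# not the actual Dynamo Planner Profiler output format).
recommendation = {
    "prefill": {"workers": 2, "tensor_parallel": 4},  # assumed best prefill config
    "decode":  {"workers": 3, "tensor_parallel": 2},  # assumed best decode config
    "predicted_ttft_ms": 180,   # vs. e.g. a 250 ms TTFT SLO
    "predicted_itl_ms": 38,     # vs. e.g. a 50 ms inter-token latency SLO
    "goodput_req_per_s": 42,    # max load sustainable while both SLOs hold
}
```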

Comment on lines +128 to +139
Using the performance bounds identified earlier by the profiler, the Planner
proactively scales the number of prefill and decode workers up or down. For
example, if a *sudden burst of long-context itinerary queries* floods the
system, the Planner detects the spike in the prefill queue and shifts available
GPU resources to the prefill pool *before* your TTFT violates its SLO.

Now, you can try this yourself by running the NVIDIA Dynamo Planner Profiler
to capture burst and request behavior, then using the SLO-based Planner to
translate latency targets into placement and scaling decisions on your AKS
cluster. Setting it up in this order - profile under stress, define SLOs,
and let the planner orchestrate your disaggregated inference system to
handle sudden traffic spikes without latency spikes.
Contributor

A more detailed example could help here as well.
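For instance, a minimal sketch of the queue-driven scaling decision described in the excerpt (illustrative only, not the actual Planner implementation; the throughput constant is an assumed value that would come from a prior profiler run):

```python
import math

# Assumed profiled prefill throughput per GPU (tokens/second)
PREFILL_TOK_PER_GPU_S = 20_000

def prefill_gpus_needed(queued_prompt_tokens: int, drain_window_s: float) -> int:
    """GPUs required to drain the prefill queue within the window,
    i.e. before queued requests blow through the TTFT SLO."""
    rate_needed = queued_prompt_tokens / drain_window_s  # tokens/s to clear the queue
    return max(1, math.ceil(rate_needed / PREFILL_TOK_PER_GPU_S))

# A burst of long-context itinerary queries queues 600k prompt tokens while
# the TTFT SLO leaves ~2 s of headroom: scale the prefill pool to 15 GPUs
# *before* TTFT degrades.
print(prefill_gpus_needed(600_000, 2.0))  # -> 15
```

In this sketch the trigger is queue depth rather than observed latency, matching the excerpt's point that the Planner shifts GPUs to the prefill pool before TTFT violates its SLO, then scales back down once the burst drains.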
