
Conversation

@colinmixonn
Contributor

This blog post introduces AKS Configurable Scheduler Profiles, highlighting their benefits for optimizing resource utilization and improving scheduling strategies for web-distributed and AI workloads. It covers configuration examples for GPU utilization, pod distribution across topology domains, and memory-optimized scheduling.

Added a new tag for Scheduler with relevant details.
Updated blog post on AKS Configurable Scheduler Profiles to improve clarity and correctness, including sections on GPU utilization, pod distribution, and memory-optimized scheduling.
Corrected typos and improved clarity in the blog post about AKS Configurable Scheduler Profiles.
Updated the blog to clarify the objectives of configuring AKS Configurable Scheduler Profiles, improved section titles, and ensured consistency in terminology.
Clarified the objectives and improved the wording in the blog post about AKS Configurable Scheduler Profiles.
@colinmixonn colinmixonn marked this pull request as ready for review December 11, 2025 17:52
@colinmixonn colinmixonn requested review from a team, circy9, Copilot, qpetraroia and seanmck and removed request for Copilot December 11, 2025 17:52
@colinmixonn colinmixonn requested a review from palma21 as a code owner December 11, 2025 17:52
Copilot AI review requested due to automatic review settings December 11, 2025 18:00
Contributor

Copilot AI left a comment


Pull request overview

This pull request adds a new blog post announcing the preview of AKS Configurable Scheduler Profiles, a feature that enables fine-grained control over pod scheduling strategies to optimize resource utilization and improve workload performance.

Key Changes

  • Introduces a new "scheduler" tag to categorize blog posts related to pod placement and scheduling optimization
  • Adds comprehensive blog post covering three main scheduling use cases: GPU bin-packing for AI workloads, pod distribution across topology domains for resilience, and memory-optimized scheduling with PVC-aware placement
  • Provides YAML configuration examples and best practices for implementing custom scheduler profiles

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 20 comments.

| File | Description |
| --- | --- |
| website/blog/tags.yml | Adds new "scheduler" tag for categorizing posts about pod placement and scheduling techniques |
| website/blog/2025-12-16-aks-config-scheduler-profiles-preview/index.md | New blog post introducing AKS Configurable Scheduler Profiles with configuration examples for GPU utilization, topology distribution, and memory-optimized scheduling |


## AKS Configurable Scheduler Profiles

A scheduler profile is a set of one or more in-tree scheduling plugins and configurations that dictate how to schedule a pod. Previously, the scheduler configuration wasn't accessible to users. Starting with Kubernetes version 1.33, you can configure and set a scheduler profile for the AKS scheduler on your cluster.
Contributor

@Fei-Guo Fei-Guo Dec 11, 2025


We need to explain the technology here. We introduce a CRD to allow users to configure the scheduler profile, and there is a controller that syncs the user-specified configuration to the actual k8s scheduler deployment, which is not visible to the user. Otherwise, no one would understand the "SchedulerConfiguration" resource shown in the yamls below.

Contributor Author


Great callout. Added an explanation. Please let me know what you think @Fei-Guo
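For readers following this thread, here is a minimal sketch of what such a resource might look like, assuming the CRD-plus-sync-controller design described above. The apiVersion and spec layout are illustrative guesses, not the published AKS schema; only the inner profile follows the upstream KubeSchedulerConfiguration format.

```yaml
# Hypothetical sketch: apiVersion and field layout are assumptions, not the
# published AKS schema. A managed controller would sync this resource to the
# actual scheduler deployment, which stays invisible to the user.
apiVersion: aks.azure.com/v1alpha1   # assumed group/version
kind: SchedulerConfiguration
metadata:
  name: gpu-binpack
spec:
  profiles:
  - schedulerName: gpu-binpack       # pods opt in via spec.schedulerName
    plugins:
      score:
        enabled:
        - name: NodeResourcesFit
          weight: 3
```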

colinmixonn and others added 6 commits December 11, 2025 11:14
…index.md

Co-authored-by: Diego Casati <diego.casati@gmail.com>
…index.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…index.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…index.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…index.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Updated the blog post to clarify the benefits of AKS Configurable Scheduler Profiles, including user feedback on increased resiliency and reduced operational overhead. Adjusted language for improved clarity and conciseness throughout the article.
Clarified the explanation of the AKS default scheduler and its resource allocation strategy, emphasizing the benefits of using NodeResourceFit for GPU utilization.
@colinmixonn colinmixonn requested a review from Fei-Guo December 12, 2025 23:09
@sabbour sabbour requested a review from Copilot December 15, 2025 20:25
- best-practices
---

Thoughtful scheduling strategies can resolve pervasive challenges across web-distributed workloads and AI workloads, like resiliency and resource utilization. But the default scheduler was primarily designed for general-purpose workloads and out-of-the-box pod scheduling, which can be restrictive if you need more fine-grained control, since it applies a set of criteria in a fixed priority order. The scheduler selects the optimal node for newly created pod(s) based on several criteria, including (but not limited to):
Contributor


What's "web-distributed"?

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


Out of the available nodes, the scheduler then filters out nodes that don't meet the requirements to identify the optimal node for the pod(s). Today, the AKS default scheduler lacks the flexibility for users to change which criteria are prioritized or ignored in the scheduling cycle on a per-workload basis. This means the default scheduling criteria, and their fixed priority order, are not suitable for workloads that demand co-locating pods with their persistent volumes for increased data locality, optimizing GPU utilization for machine learning, or enforcing strict zone-level distribution for microservices. This rigidity often forces users to either accept suboptimal placement or manage a separate custom scheduler, both of which increase operational complexity.

**[AKS Configurable Scheduler Profiles][concepts-scheduler-configuration] reduces operational complexity by providing extensibility and control.** Now, customers can define their own scheduling logic by selecting specific policies, altering parameter weights, changing policy priority, adding additional policy parameters, and changing the policy evaluation point (e.g., PreFilter, Filter, Score) without deploying a second scheduler. On AKS, customers have mentioned that AKS Configurable Scheduler Profiles allow them to increase resiliency without the operational overhead of YAML wrangling, or to reduce cluster costs without adopting a secondary scheduler. Additionally, our AI and HPC customers have batch workloads that have benefited from improved bin-packing and increased GPU utilization.
Contributor


Keep this before <!-- truncate --> as this is what will show in the blog index.

- scheduler
- best-practices
---

Contributor


I think we should start the problem statement here with something like:

The default Kubernetes scheduler was primarily designed to run general-purpose workloads while fulfilling requirements like resources (CPU, memory), node affinity, pod affinity, and spread. Scheduling AI workloads presents a different set of requirements and challenges such as <...>. With the introduction of AKS configurable scheduler profiles, you can now define your own scheduling logic by <..> without <..>.

Then use the <!-- truncate -->.

Comment on lines +15 to +32

1. Resource requirements (CPU, memory)
2. Node affinity/anti-affinity
3. Pod affinity/anti-affinity
4. Taints and tolerations

Out of the available nodes, the scheduler then filters out nodes that don't meet the requirements to identify the optimal node for the pod(s). Today, the AKS default scheduler lacks the flexibility for users to change which criteria are prioritized or ignored in the scheduling cycle on a per-workload basis. This means the default scheduling criteria, and their fixed priority order, are not suitable for workloads that demand co-locating pods with their persistent volumes for increased data locality, optimizing GPU utilization for machine learning, or enforcing strict zone-level distribution for microservices. This rigidity often forces users to either accept suboptimal placement or manage a separate custom scheduler, both of which increase operational complexity.

**[AKS Configurable Scheduler Profiles][concepts-scheduler-configuration] reduces operational complexity by providing extensibility and control.** Now, customers can define their own scheduling logic by selecting specific policies, altering parameter weights, changing policy priority, adding additional policy parameters, and changing the policy evaluation point (e.g., PreFilter, Filter, Score) without deploying a second scheduler. On AKS, customers have mentioned that AKS Configurable Scheduler Profiles allow them to increase resiliency without the operational overhead of YAML wrangling, or to reduce cluster costs without adopting a secondary scheduler. Additionally, our AI and HPC customers have batch workloads that have benefited from improved bin-packing and increased GPU utilization.

In this blog you will learn how to configure AKS Configurable Scheduler Profiles for three workload objectives:

1. [Increase GPU utilization by bin packing GPU-backed nodes](#increase-gpu-utilization-by-bin-packing-gpu-backed-nodes)
2. [Increase resilience by distributing pods across topology domains](#increase-resilience-by-distributing-pods-across-topology-domains)
3. [Optimize data locality with memory and PVC-aware scheduling](#optimize-data-locality-with-memory-and-pvc-aware-scheduling)

Lastly, you will find [best practices](#best-practices-and-configuration-considerations) to help you consider individual plugin configurations, your custom scheduler configuration, and your Deployment design holistically.

Contributor


After addressing my other comment on the opening problem statement, I think a lot of this content can move below <!-- truncate --> with a header of How the default Kubernetes scheduler works. Then, you can follow that by the AKS configurable scheduler profiles section.

Comment on lines +47 to +49
:::note
Adjust VM SKUs in `NodeAffinity`, shift utilization curves or weights, and use the right zones for your cluster(s) in the configurations below.
:::
Contributor


Why is this a note?

Contributor Author


Just an explicit out for the reader; how would you like it formatted? @sabbour


The AKS default scheduler scores nodes for workload placement based on a _LeastAllocated_ strategy, to spread pods across the nodes in a cluster. However, this behavior can result in inefficient resource utilization, as nodes with higher allocation are not favored. You can use `NodeResourcesFit` to control how pods are assigned to nodes based on available resources (CPU, GPU, memory, etc.), including favoring nodes with high resource utilization, within the set configuration.

For example, by scheduling pending jobs on nodes with higher relative GPU utilization, users can reduce costs and increase GPU utilization while maintaining performance.
Contributor


I think this needs to show actual differences between the utilization you get with the default scheduler versus this custom plugin, including numbers and graphs if possible.
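Pending those numbers, here is a hedged sketch of the `NodeResourcesFit` arguments that produce the bin-packing behavior described above. The profile body follows the upstream KubeSchedulerConfiguration format; the weights and shape points are illustrative, not recommendations.

```yaml
profiles:
- schedulerName: gpu-binpack
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: RequestedToCapacityRatio
        resources:
        - name: nvidia.com/gpu
          weight: 3                  # GPU fill dominates the score
        - name: cpu
          weight: 1
        requestedToCapacityRatio:
          shape:
          - utilization: 0
            score: 0                 # empty nodes score lowest
          - utilization: 100
            score: 10                # nearly full nodes score highest
```

The rising shape is what inverts the default _LeastAllocated_ spreading into packing.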

weight: 1
```
### Increase resilience by distributing pods across topology domains
Contributor


An actual example of before and after would help here.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Comment on lines +117 to +122
- maxSkew: 2
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
Contributor


We're not explaining why those values are better than the defaults.
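For comparison, the upstream kube-scheduler's built-in defaults (when no constraints are configured) are, as far as I can tell from upstream docs, maxSkew 3 on hostname and maxSkew 5 on zone, both `ScheduleAnyway`; the values above are stricter because zone skew is capped at 2 and actually enforced. A hedged sketch of how they would be wired in as cluster-level defaults:

```yaml
pluginConfig:
- name: PodTopologySpread
  args:
    defaultingType: List                 # use the constraints below, not system defaults
    defaultConstraints:
    - maxSkew: 2
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule   # hard requirement across zones
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway  # best-effort across nodes
```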

Comment on lines +160 to +163
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values: [westus3-1, westus3-2, westus3-3]
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the zone have to do with memory optimization?

Co-authored-by: Ahmed Sabbour <103856+sabbour@users.noreply.github.com>

For example, combine the `VolumeBinding` and `VolumeZone` plugins with `NodeAffinity`, and `NodeResourcesFit` with `RequestedToCapacityRatio`, to influence pod placement on [Azure memory-optimized SKUs][memory-optimized-vm], while ensuring PVCs bind quickly in the target zone to minimize cross‑zone traffic and latency.

**This scheduler configuration ensures workloads needing large memory footprints are placed on nodes that provide sufficient RAM and maintain proximity to their volumes, enabling fast, zone‑aligned PVC binding for optimal data locality.**
Contributor


Where is this example configuration coming from? We're not really explaining how we came up with it either.

Contributor Author


What do you mean we aren't explaining how we came up with it? I wanted to highlight the storage-based plugins for a scheduler configuration, so I piecemealed other configs together. @sabbour
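For what it's worth, here is a hedged sketch of how those storage-focused plugins could be pieced together; the bind timeout and weights are illustrative assumptions, not recommendations from the post.

```yaml
profiles:
- schedulerName: memory-optimized
  pluginConfig:
  - name: VolumeBinding
    args:
      bindTimeoutSeconds: 600   # how long to wait for PVC binding before retrying
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated
        resources:
        - name: memory
          weight: 2             # memory counts twice as much as CPU at score time
        - name: cpu
          weight: 1
```

`VolumeZone` takes no arguments; it simply filters out nodes whose zone conflicts with the volume's zone labels.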


As a reminder, there are many parameters the scheduler considers across the [scheduling cycle][scheduling-framework/#interfaces] before a pod is placed on a node, all of which impact how a pod is assigned. This section is meant to help you consider individual plugin configurations, your custom scheduler configuration, and your Deployment design holistically.

1. Ensure the intended deployment is assigned to the _correct_ scheduler profile.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the examples above showed how to assign this.
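For reference, assignment happens through the pod template's `schedulerName` field, which must match a profile's `schedulerName`; a minimal sketch with illustrative names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-job
spec:
  replicas: 2
  selector:
    matchLabels:
      app: training-job
  template:
    metadata:
      labels:
        app: training-job
    spec:
      schedulerName: gpu-binpack                     # must match the profile's schedulerName
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # placeholder image
```

Pods that omit `schedulerName` keep using the default profile.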

As a reminder, there are many parameters the scheduler considers across the [scheduling cycle][scheduling-framework/#interfaces] before a pod is placed on a node, all of which impact how a pod is assigned. This section is meant to help you consider individual plugin configurations, your custom scheduler configuration, and your Deployment design holistically.

1. Ensure the intended deployment is assigned to the _correct_ scheduler profile.
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this sentence mean?


1. Ensure the intended deployment is assigned to the _correct_ scheduler profile.
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.
3. Ensure there are enough nodes in each zone to accommodate your deployment replicas and ensure your AKS node pool spans the right availability zones. If not, pods may remain in a pending state.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth linking to autoscaling docs?

1. Ensure the intended deployment is assigned to the _correct_ scheduler profile.
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.
3. Ensure there are enough nodes in each zone to accommodate your deployment replicas and ensure your AKS node pool spans the right availability zones. If not, pods may remain in a pending state.
4. Use namespaces to separate workloads, which improves your ability to validate or troubleshoot.
Contributor


That isn't relevant to configurable schedulers

score: 0
```

## Best Practices and Configuration Considerations
Contributor


I think this entire section (or a selection of it) should be in docs, not here.

2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.
3. Ensure there are enough nodes in each zone to accommodate your deployment replicas and ensure your AKS node pool spans the right availability zones. If not, pods may remain in a pending state.
4. Use namespaces to separate workloads, which improves your ability to validate or troubleshoot.
5. Assign `priorityClassName` for workloads that should preempt others; this is critical if you use the DefaultPreemption plugin.
Contributor


We haven't mentioned DefaultPreemption plugin before in this doc
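Agreed it appears without introduction. For reference, the plugin acts on standard PriorityClass objects; a minimal sketch with an illustrative name and value:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-batch
value: 100000                 # higher-value pods may preempt lower-value ones
globalDefault: false
description: "Jobs allowed to preempt best-effort workloads"
```

Workloads then reference it with `priorityClassName: critical-batch` in the pod spec.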

Comment on lines +215 to +219
6. If you use the `ImageLocality` plugin, use DaemonSets or node pre-pulling for latency-sensitive images; otherwise the benefit may be minimal.
7. If your cluster is large, a low `PercentageOfNodesToScore` speeds scheduling by reducing the number of nodes scored, _but_ it may reduce optimal placement.
8. If you enable a plugin in the `plugins:multipoint` section but do not define it in `pluginConfig`, AKS uses the default configuration for that plugin.
9. For `NodeResourcesFit`, the ratio matters more than absolute values. For example, CPU:Memory:Storage = 3:1:2 means CPU is 3x more influential than memory, and storage is 2x more influential than memory, in the scoring phase.
10. Pair `PodTopologySpread` with Pod Disruption Budget's (PDB) and multi‑replica strategies for HA during upgrades.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While these are potentially good best practices, I think they're better suited to docs. They look too random here.

Contributor Author


Out of curiosity, what makes them random for the blog? I am happy to move them to the overview page of the documentation, so we don't crowd the configuration section @sabbour

Contributor


The blog post is about an announcement, and it starts at a high level with what scheduling is and what this feature is trying to accomplish, then near the end (here) is talking about best practices. I'm not sure the same reader, assumed here to be someone who's not very familiar with this feature, would understand what those best practices are or how to apply them from a blog post. And when they do need to apply them, I'd rather they find them in docs or in a dedicated post on best practices.
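If any of these do stay, item 7 is the easiest to make concrete: the knob is a single field in the standard configuration format. A hedged sketch with an illustrative value:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50   # on large clusters, score only half the feasible nodes
```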

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.


### Optimize data locality with Memory and PVC-Aware Scheduling

Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaim's_ (PVC) can bind to _PersistentVolume's_ (PV). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.

Copilot AI Dec 15, 2025


According to the Kubernetes terminology guidelines, "PersistentVolumeClaim" should be written as one word when referring to the Kubernetes resource type, not with an apostrophe for possession. Use "PersistentVolumeClaims" (plural) instead of "PersistentVolumeClaim's" (possessive).

Suggested change
Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaim's_ (PVC) can bind to _PersistentVolume's_ (PV). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.
Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaims_ (PVC) can bind to _PersistentVolumes_ (PV). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.


### Optimize data locality with Memory and PVC-Aware Scheduling

Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaim's_ (PVC) can bind to _PersistentVolume's_ (PV). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.

Copilot AI Dec 15, 2025


According to the Kubernetes terminology guidelines, "PersistentVolume" should be written as one word when referring to the Kubernetes resource type, not with an apostrophe for possession. Use "PersistentVolumes" (plural) instead of "PersistentVolume's" (possessive).

Suggested change
Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaim's_ (PVC) can bind to _PersistentVolume's_ (PV). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.
Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaims_ (PVCs) can bind to _PersistentVolumes_ (PVs). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.

As a reminder, there are many parameters the scheduler considers across the [scheduling cycle][scheduling-framework/#interfaces] before a pod is placed on a node, all of which impact how a pod is assigned. This section is meant to help you consider individual plugin configurations, your custom scheduler configuration, and your Deployment design holistically.

1. Ensure the intended deployment is assigned to the _correct_ scheduler profile.
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.

Copilot AI Dec 15, 2025


According to the Kubernetes terminology guidelines, use "PersistentVolumeClaims" (plural) instead of "PersistentVolumeClaim's" (possessive with apostrophe).

Suggested change
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaims. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.

7. If your cluster is large, a low `PercentageOfNodesToScore` speeds scheduling by reducing the number of nodes scored, _but_ it may reduce optimal placement.
8. If you enable a plugin in the `plugins:multipoint` section but do not define it in `pluginConfig`, AKS uses the default configuration for that plugin.
9. For `NodeResourcesFit`, the ratio matters more than absolute values. For example, CPU:Memory:Storage = 3:1:2 means CPU is 3x more influential than memory, and storage is 2x more influential than memory, in the scoring phase.
10. Pair `PodTopologySpread` with Pod Disruption Budget's (PDB) and multi‑replica strategies for HA during upgrades.

Copilot AI Dec 15, 2025


According to the Microsoft Style Guide, use "Pod Disruption Budgets" (plural) instead of "Pod Disruption Budget's" (possessive with apostrophe). The apostrophe is unnecessary here.

Suggested change
10. Pair `PodTopologySpread` with Pod Disruption Budget's (PDB) and multi‑replica strategies for HA during upgrades.
10. Pair `PodTopologySpread` with Pod Disruption Budgets (PDBs) and multi‑replica strategies for HA during upgrades.

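For item 10 above, a minimal sketch of the PDB half of that pairing, with illustrative names and counts:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2         # keep at least 2 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: web
```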
### Increase resilience by distributing pods across topology domains
If you don't configure any cluster-level default constraints for pod topology spreading, then the default kube-scheduler can unevenly spread pods across zones and hosts unless constraints are specified in each deployment, every time. Use `PodTopologySpread` to control how pods are distributed across failure domains to ensure high availability and fault tolerance in the event of zone or node failures, without overprovisioning.

Copilot AI Dec 15, 2025


According to the Microsoft Style Guide, use a contraction for "can not" - it should be "can't" instead of the more formal "cannot". The style guide emphasizes using common contractions to maintain a conversational, friendly tone.

Contributor

@kaarthis kaarthis left a comment


Lots of food for thought.

- best-practices
---

Thoughtful scheduling strategies can resolve pervasive challenges across web-distributed workloads and AI workloads, like resiliency and resource utilization. But the default scheduler was primarily designed for general-purpose workloads and out-of-the-box pod scheduling, which can be restrictive if you need more fine-grained control, since it applies a set of criteria in a fixed priority order. The scheduler selects the optimal node for newly created pod(s) based on several criteria, including (but not limited to):
Contributor


What are the pervasive challenges? Always write the blog post with a view to the layman reader. It's good to elaborate on the problems, as that makes things easy for folks to follow, e.g., resiliency: one or two bullets; resource utilization: a, b, c.

3. Pod affinity/anti-affinity
4. Taints and tolerations

Out of the available nodes, the scheduler then filters out nodes that don't meet the requirements to identify the optimal node for the pod(s). Today, the AKS default scheduler lacks the flexibility for users to change which criteria are prioritized or ignored in the scheduling cycle on a per-workload basis. This means the default scheduling criteria, and their fixed priority order, are not suitable for workloads that demand co-locating pods with their persistent volumes for increased data locality, optimizing GPU utilization for machine learning, or enforcing strict zone-level distribution for microservices. This rigidity often forces users to either accept suboptimal placement or manage a separate custom scheduler, both of which increase operational complexity.
Contributor


You can start with how the AKS default scheduler differs, if at all, from the upstream Kubernetes scheduler.

:::

### Increase GPU Utilization by Bin Packing GPU-backed Nodes

Contributor


Give tangible use cases/scenarios for when increased GPU utilization is advantageous, in terms of verticals/industries; even better, name examples, e.g., gaming, AI workloads for weather, etc.

- performance
- scheduler
- best-practices
---
Contributor


What customers want to know:

What business problem does this solve? (cost, performance, reliability)
What is new vs. what existed before?
Which types of customers benefit most?
How does this simplify operations compared to Karpenter, Volcano, custom schedulers, etc.?


<!-- truncate -->

## AKS Configurable Scheduler Profiles
Contributor


Do you have a visual explanation, e.g., a conceptual diagram, that helps before delving into this?


For example, by scheduling pending jobs on nodes with higher relative GPU utilization, users can reduce costs and increase GPU utilization while maintaining performance.

**This scheduler configuration maximizes GPU efficiency for larger batch jobs by consolidating smaller jobs onto fewer nodes and lowering the operational cost of underutilized resources without sacrificing performance.**
Contributor


Is this configuration easy to change? Can I quickly roll back? Any safe practices, like testing in staging, etc.?

### Increase resilience by distributing pods across topology domains
If you don't configure any cluster-level default constraints for pod topology spreading, then the default kube-scheduler can unevenly spread pods across zones and hosts unless constraints are specified in each deployment, every time. Use `PodTopologySpread` to control how pods are distributed across failure domains to ensure high availability and fault tolerance in the event of zone or node failures, without overprovisioning.
Contributor


Add a "when to use the default scheduler and when NOT to use it" section as a table, answering these:
When customers should stick with the default scheduler
When this should not be used (e.g., mission-critical schedulers without careful tuning)
When Kueue or custom controllers are preferable

- name: ImageLocality
- name: NodeResourcesFit
- name: NodeResourcesBalancedAllocation
pluginConfig:
Contributor


What each plugin does in this configuration
Why weights were chosen (e.g., weight 3 GPU vs weight 1 CPU)
What happens if weights are misconfigured
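Agreed this deserves inline explanation. A hedged, annotated version of the kind of thing the post could show; the weights are illustrative, and an oversized weight doesn't fail validation, it simply lets one signal drown out the others:

```yaml
plugins:
  multiPoint:
    enabled:
    - name: ImageLocality                    # prefer nodes that already cache the image
      weight: 1
    - name: NodeResourcesFit                 # the main packing signal
      weight: 3                              # 3x the influence of the other scorers
    - name: NodeResourcesBalancedAllocation  # avoid lopsided CPU-vs-memory usage
      weight: 1
```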

whenUnsatisfiable: ScheduleAnyway
```

### Optimize data locality with Memory and PVC-Aware Scheduling
Contributor


As an operator focused on optimization: how does this reduce cloud cost?
Does it reduce the number of GPU nodes?
Does it reduce cross‑zone data egress?

8. If you enable a plugin in the `plugins:multipoint` section but do not define it in `pluginConfig`, AKS uses the default configuration for that plugin.
9. For `NodeResourcesFit`, the ratio matters more than absolute values. For example, CPU:Memory:Storage = 3:1:2 means CPU is 3x more influential than memory, and storage is 2x more influential than memory, in the scoring phase.
10. Pair `PodTopologySpread` with Pod Disruption Budget's (PDB) and multi‑replica strategies for HA during upgrades.

Contributor


Missing FAQ section. Any troubleshooting guidance from recent problems? Please end strongly with a clear call to action and conclusion.
