
Conversation

@colinmixonn
Contributor

This blog post introduces AKS Configurable Scheduler Profiles, highlighting their benefits for optimizing resource utilization and improving scheduling strategies for web-distributed and AI workloads. It covers configuration examples for GPU utilization, pod distribution across topology domains, and memory-optimized scheduling.

Added a new tag for Scheduler with relevant details.
Updated blog post on AKS Configurable Scheduler Profiles to improve clarity and correctness, including sections on GPU utilization, pod distribution, and memory-optimized scheduling.
Corrected typos and improved clarity in the blog post about AKS Configurable Scheduler Profiles.
Updated the blog to clarify the objectives of configuring AKS Configurable Scheduler Profiles, improved section titles, and ensured consistency in terminology.
Clarified the objectives and improved the wording in the blog post about AKS Configurable Scheduler Profiles.
@colinmixonn colinmixonn marked this pull request as ready for review December 11, 2025 17:52
@colinmixonn colinmixonn requested review from a team, circy9, Copilot, qpetraroia and seanmck and removed request for Copilot December 11, 2025 17:52
@colinmixonn colinmixonn requested a review from palma21 as a code owner December 11, 2025 17:52
Copilot AI review requested due to automatic review settings December 11, 2025 18:00
Contributor

Copilot AI left a comment


Pull request overview

This pull request adds a new blog post announcing the preview of AKS Configurable Scheduler Profiles, a feature that enables fine-grained control over pod scheduling strategies to optimize resource utilization and improve workload performance.

Key Changes

  • Introduces a new "scheduler" tag to categorize blog posts related to pod placement and scheduling optimization
  • Adds comprehensive blog post covering three main scheduling use cases: GPU bin-packing for AI workloads, pod distribution across topology domains for resilience, and memory-optimized scheduling with PVC-aware placement
  • Provides YAML configuration examples and best practices for implementing custom scheduler profiles

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 20 comments.

| File | Description |
| --- | --- |
| website/blog/tags.yml | Adds new "scheduler" tag for categorizing posts about pod placement and scheduling techniques |
| website/blog/2025-12-16-aks-config-scheduler-profiles-preview/index.md | New blog post introducing AKS Configurable Scheduler Profiles with configuration examples for GPU utilization, topology distribution, and memory-optimized scheduling |


## AKS Configurable Scheduler Profiles

A scheduler profile is a set of one or more in-tree scheduling plugins and configurations that dictate how to schedule a pod. Previously, the scheduler configuration wasn't accessible to users. Starting with Kubernetes version 1.33, you can configure and set a scheduler profile for the AKS scheduler on your cluster.
Contributor

@Fei-Guo Fei-Guo Dec 11, 2025


We need to explain the technology here. We introduce a CRD to allow users to configure the scheduler profile, and there is a controller that syncs the user-specified configuration to the actual k8s scheduler deployment, which is not visible to the user. Otherwise, no one would understand the "SchedulerConfiguration" resource shown in the yamls below.

Contributor Author


Great callout. Added an explanation. Please let me know what you think @Fei-Guo
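For readers following this thread, here is a minimal sketch of what such a resource might look like, assuming the CRD-plus-sync-controller design described above. The apiVersion and spec layout are illustrative guesses, not the published AKS schema; only the inner profile follows the upstream KubeSchedulerConfiguration format.

```yaml
# Hypothetical sketch: apiVersion and field layout are assumptions, not the
# published AKS schema. A managed controller would sync this resource to the
# actual scheduler deployment, which stays invisible to the user.
apiVersion: aks.azure.com/v1alpha1   # assumed group/version
kind: SchedulerConfiguration
metadata:
  name: gpu-binpack
spec:
  profiles:
  - schedulerName: gpu-binpack       # pods opt in via spec.schedulerName
    plugins:
      score:
        enabled:
        - name: NodeResourcesFit
          weight: 3
```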

colinmixonn and others added 6 commits December 11, 2025 11:14
…index.md

Co-authored-by: Diego Casati <diego.casati@gmail.com>
…index.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…index.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…index.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…index.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Updated the blog post to clarify the benefits of AKS Configurable Scheduler Profiles, including user feedback on increased resiliency and reduced operational overhead. Adjusted language for improved clarity and conciseness throughout the article.
Clarified the explanation of the AKS default scheduler and its resource allocation strategy, emphasizing the benefits of using NodeResourceFit for GPU utilization.
@colinmixonn colinmixonn requested a review from Fei-Guo December 12, 2025 23:09
@sabbour sabbour requested a review from Copilot December 15, 2025 20:25
- best-practices
---

Thoughtful scheduling strategies can resolve pervasive challenges across web-distributed workloads and AI workloads, like resiliency and resource utilization. But the default scheduler was primarily designed for general-purpose workloads and out-of-the-box pod scheduling, which can be restrictive if you need more fine-grained control, since it applies a set of criteria in a fixed priority order. The scheduler selects the optimal node for newly created pod(s) based on several criteria, including (but not limited to):
Contributor


What's "web-distributed"?

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


Out of the available nodes, the scheduler then filters out nodes that don't meet the requirements to identify the optimal node for the pod(s). Today, the AKS default scheduler lacks the flexibility for users to change which criteria are prioritized or ignored in the scheduling cycle on a per-workload basis. This means the default scheduling criteria, and their fixed priority order, are not suitable for workloads that demand co-locating pods with their persistent volumes for increased data locality, optimizing GPU utilization for machine learning, or enforcing strict zone-level distribution for microservices. This rigidity often forces users to either accept suboptimal placement or manage a separate custom scheduler, both of which increase operational complexity.

**[AKS Configurable Scheduler Profiles][concepts-scheduler-configuration] reduces operational complexity by providing extensibility and control.** Now, customers can define their own scheduling logic by selecting specific policies, altering parameter weights, changing policy priority, adding additional policy parameters, and changing the policy evaluation point (e.g., PreFilter, Filter, Score) without deploying a second scheduler. On AKS, customers have mentioned that AKS Configurable Scheduler Profiles allow them to increase resiliency without the operational overhead of YAML wrangling, or to reduce cluster costs without adopting a secondary scheduler. Additionally, our AI and HPC customers have batch workloads that have benefited from improved bin-packing and increased GPU utilization.
Contributor


Keep this before <!-- truncate --> as this is what will show in the blog index.

- scheduler
- best-practices
---

Contributor


I think we should start the problem statement here with something like:

The default Kubernetes scheduler was primarily designed to run general-purpose workloads while fulfilling requirements like resources (CPU, memory), node affinity, pod affinity, and spread. Scheduling AI workloads presents a different set of requirements and challenges such as <...>. With the introduction of AKS configurable scheduler profiles, you can now define your own scheduling logic by <..> without <..>.

Then use the <!-- truncate -->.

Comment on lines +15 to +32

1. Resource requirements (CPU, memory)
2. Node affinity/anti-affinity
3. Pod affinity/anti-affinity
4. Taints and tolerations

Out of the available nodes, the scheduler then filters out nodes that don't meet the requirements to identify the optimal node for the pod(s). Today, the AKS default scheduler lacks the flexibility for users to change which criteria are prioritized or ignored in the scheduling cycle on a per-workload basis. This means the default scheduling criteria, and their fixed priority order, are not suitable for workloads that demand co-locating pods with their persistent volumes for increased data locality, optimizing GPU utilization for machine learning, or enforcing strict zone-level distribution for microservices. This rigidity often forces users to either accept suboptimal placement or manage a separate custom scheduler, both of which increase operational complexity.

**[AKS Configurable Scheduler Profiles][concepts-scheduler-configuration] reduces operational complexity by providing extensibility and control.** Now, customers can define their own scheduling logic by selecting specific policies, altering parameter weights, changing policy priority, adding additional policy parameters, and changing the policy evaluation point (e.g., PreFilter, Filter, Score) without deploying a second scheduler. On AKS, customers have mentioned that AKS Configurable Scheduler Profiles allow them to increase resiliency without the operational overhead of YAML wrangling, or to reduce cluster costs without adopting a secondary scheduler. Additionally, our AI and HPC customers have batch workloads that have benefited from improved bin-packing and increased GPU utilization.

In this blog you will learn how to configure AKS Configurable Scheduler Profiles for three workload objectives:

1. [Increase GPU utilization by bin packing GPU-backed nodes](#increase-gpu-utilization-by-bin-packing-gpu-backed-nodes)
2. [Increase resilience by distributing pods across topology domains](#increase-resilience-by-distributing-pods-across-topology-domains)
3. [Optimize data locality with memory and PVC-aware scheduling](#optimize-data-locality-with-memory-and-pvc-aware-scheduling)

Lastly, you will find [best practices](#best-practices-and-configuration-considerations) to help you consider individual plugin configurations, your custom scheduler configuration, and your Deployment design holistically.

Contributor


After addressing my other comment on the opening problem statement, I think a lot of this content can move below <!-- truncate --> with a header of How the default Kubernetes scheduler works. Then, you can follow that by the AKS configurable scheduler profiles section.

Comment on lines +47 to +49
:::note
Adjust VM SKUs in `NodeAffinity`, shift utilization curves or weights, and use the right zones for your cluster(s) in the configurations below.
:::
Contributor


Why is this a note?

Contributor Author


Just an explicit out for the reader; how would you like it formatted? @sabbour


The AKS default scheduler scores nodes for workload placement based on a _LeastAllocated_ strategy, to spread pods across the nodes in a cluster. However, this behavior can result in inefficient resource utilization, as nodes with higher allocation are not favored. You can use `NodeResourcesFit` to control how pods are assigned to nodes based on available resources (CPU, GPU, memory, etc.), including favoring nodes with high resource utilization, within the set configuration.

For example, by scheduling pending jobs on nodes with higher relative GPU utilization, users can reduce costs and increase GPU utilization while maintaining performance.
Contributor


I think this needs to show actual differences between the utilization you get with the default scheduler versus this custom plugin, including numbers and graphs if possible.
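Pending those numbers, here is a hedged sketch of the `NodeResourcesFit` arguments that produce the bin-packing behavior described above. The profile body follows the upstream KubeSchedulerConfiguration format; the weights and shape points are illustrative, not recommendations.

```yaml
profiles:
- schedulerName: gpu-binpack
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: RequestedToCapacityRatio
        resources:
        - name: nvidia.com/gpu
          weight: 3                  # GPU fill dominates the score
        - name: cpu
          weight: 1
        requestedToCapacityRatio:
          shape:
          - utilization: 0
            score: 0                 # empty nodes score lowest
          - utilization: 100
            score: 10                # nearly full nodes score highest
```

The rising shape is what inverts the default _LeastAllocated_ spreading into packing.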

weight: 1
```
### Increase resilience by distributing pods across topology domains
Contributor


An actual example of before and after would help here.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Comment on lines +117 to +122
- maxSkew: 2
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
Contributor


We're not explaining why those values are better than the defaults.
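For comparison, the upstream kube-scheduler's built-in defaults (when no constraints are configured) are, as far as I can tell from upstream docs, maxSkew 3 on hostname and maxSkew 5 on zone, both `ScheduleAnyway`; the values above are stricter because zone skew is capped at 2 and actually enforced. A hedged sketch of how they would be wired in as cluster-level defaults:

```yaml
pluginConfig:
- name: PodTopologySpread
  args:
    defaultingType: List                 # use the constraints below, not system defaults
    defaultConstraints:
    - maxSkew: 2
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule   # hard requirement across zones
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway  # best-effort across nodes
```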

Comment on lines +160 to +163
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values: [westus3-1, westus3-2, westus3-3]
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the zone have to do with memory optimization?

Co-authored-by: Ahmed Sabbour <103856+sabbour@users.noreply.github.com>

For example, combine the `VolumeBinding` and `VolumeZone` plugins with `NodeAffinity`, and `NodeResourcesFit` with `RequestedToCapacityRatio`, to influence pod placement on [Azure memory-optimized SKUs][memory-optimized-vm], while ensuring PVCs bind quickly in the target zone to minimize cross‑zone traffic and latency.

**This scheduler configuration ensures workloads needing large memory footprints are placed on nodes that provide sufficient RAM and maintain proximity to their volumes, enabling fast, zone‑aligned PVC binding for optimal data locality.**
Contributor


Where is this example configuration coming from? We're not really explaining how we came up with it either.

Contributor Author


What do you mean we aren't explaining how we came up with it? I wanted to highlight the storage-based plugins for a scheduler configuration, so I piecemealed other configs together. @sabbour
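For what it's worth, here is a hedged sketch of how those storage-focused plugins could be pieced together; the bind timeout and weights are illustrative assumptions, not recommendations from the post.

```yaml
profiles:
- schedulerName: memory-optimized
  pluginConfig:
  - name: VolumeBinding
    args:
      bindTimeoutSeconds: 600   # how long to wait for PVC binding before retrying
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated
        resources:
        - name: memory
          weight: 2             # memory counts twice as much as CPU at score time
        - name: cpu
          weight: 1
```

`VolumeZone` takes no arguments; it simply filters out nodes whose zone conflicts with the volume's zone labels.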


As a reminder, there are many parameters the scheduler considers across the [scheduling cycle][scheduling-framework/#interfaces] before a pod is placed on a node, all of which impact how a pod is assigned. This section is meant to help you consider individual plugin configurations, your custom scheduler configuration, and your Deployment design holistically.

1. Ensure the intended deployment is assigned to the _correct_ scheduler profile.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the examples above showed how to assign this.
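For reference, assignment happens through the pod template's `schedulerName` field, which must match a profile's `schedulerName`; a minimal sketch with illustrative names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-job
spec:
  replicas: 2
  selector:
    matchLabels:
      app: training-job
  template:
    metadata:
      labels:
        app: training-job
    spec:
      schedulerName: gpu-binpack                     # must match the profile's schedulerName
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # placeholder image
```

Pods that omit `schedulerName` keep using the default profile.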

As a reminder, there are many parameters the scheduler considers across the [scheduling cycle][scheduling-framework/#interfaces] before a pod is placed on a node, all of which impact how a pod is assigned. This section is meant to help you consider individual plugin configurations, your custom scheduler configuration, and your Deployment design holistically.

1. Ensure the intended deployment is assigned to the _correct_ scheduler profile.
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this sentence mean?


1. Ensure the intended deployment is assigned to the _correct_ scheduler profile.
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.
3. Ensure there are enough nodes in each zone to accommodate your deployment replicas and ensure your AKS node pool spans the right availability zones. If not, pods may remain in a pending state.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth linking to autoscaling docs?

1. Ensure the intended deployment is assigned to the _correct_ scheduler profile.
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.
3. Ensure there are enough nodes in each zone to accommodate your deployment replicas and ensure your AKS node pool spans the right availability zones. If not, pods may remain in a pending state.
4. Use namespaces to separate workloads, which improves your ability to validate or troubleshoot.
Contributor


That isn't relevant to configurable schedulers

score: 0
```

## Best Practices and Configuration Considerations
Contributor


I think this entire section (or a selection of it) should be in docs, not here.

2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.
3. Ensure there are enough nodes in each zone to accommodate your deployment replicas and ensure your AKS node pool spans the right availability zones. If not, pods may remain in a pending state.
4. Use namespaces to separate workloads, which improves your ability to validate or troubleshoot.
5. Assign `priorityClassName` for workloads that should preempt others; this is critical if you use the DefaultPreemption plugin.
Contributor


We haven't mentioned DefaultPreemption plugin before in this doc
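Agreed it appears without introduction. For reference, the plugin acts on standard PriorityClass objects; a minimal sketch with an illustrative name and value:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-batch
value: 100000                 # higher-value pods may preempt lower-value ones
globalDefault: false
description: "Jobs allowed to preempt best-effort workloads"
```

Workloads then reference it with `priorityClassName: critical-batch` in the pod spec.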

Comment on lines +215 to +219
6. If you use the `ImageLocality` plugin, use DaemonSets or node pre-pulling for latency-sensitive images; otherwise the benefit may be minimal.
7. If your cluster is large, a low `PercentageOfNodesToScore` speeds scheduling by reducing the number of nodes scored, _but_ it may reduce optimal placement.
8. If you enable a plugin in the `plugins:multipoint` section but do not define it in `pluginConfig`, AKS uses the default configuration for that plugin.
9. For `NodeResourcesFit`, the ratio matters more than absolute values. For example, CPU:Memory:Storage = 3:1:2 means CPU is 3x more influential than memory, and storage is 2x more influential than memory, in the scoring phase.
10. Pair `PodTopologySpread` with Pod Disruption Budget's (PDB) and multi‑replica strategies for HA during upgrades.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While these are potentially good best practices, I think they're better suited to docs. They look too random here.

Contributor Author


Out of curiosity, what makes them random for the blog? I am happy to move them to the overview page of the documentation, so we don't crowd the configuration section @sabbour

Contributor


The blog post is about an announcement, and it starts at a high level with what scheduling is and what this feature is trying to accomplish, then near the end (here) is talking about best practices. I'm not sure the same reader, assumed here to be someone who's not very familiar with this feature, would understand what those best practices are or how to apply them from a blog post. And when they do need to apply them, I'd rather they find them in docs or in a dedicated post on best practices.
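If any of these do stay, item 7 is the easiest to make concrete: the knob is a single field in the standard configuration format. A hedged sketch with an illustrative value:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50   # on large clusters, score only half the feasible nodes
```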

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.


### Optimize data locality with Memory and PVC-Aware Scheduling

Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaim's_ (PVC) can bind to _PersistentVolume's_ (PV). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.

Copilot AI Dec 15, 2025


According to the Kubernetes terminology guidelines, "PersistentVolumeClaim" should be written as one word when referring to the Kubernetes resource type, not with an apostrophe for possession. Use "PersistentVolumeClaims" (plural) instead of "PersistentVolumeClaim's" (possessive).

Suggested change
Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaim's_ (PVC) can bind to _PersistentVolume's_ (PV). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.
Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaims_ (PVC) can bind to _PersistentVolumes_ (PV). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.


### Optimize data locality with Memory and PVC-Aware Scheduling

Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaim's_ (PVC) can bind to _PersistentVolume's_ (PV). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.

Copilot AI Dec 15, 2025


According to the Kubernetes terminology guidelines, "PersistentVolume" should be written as one word when referring to the Kubernetes resource type, not with an apostrophe for possession. Use "PersistentVolumes" (plural) instead of "PersistentVolume's" (possessive).

Suggested change
Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaim's_ (PVC) can bind to _PersistentVolume's_ (PV). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.
Use `VolumeBinding` to ensure pods are placed on nodes where _PersistentVolumeClaims_ (PVCs) can bind to _PersistentVolumes_ (PVs). `VolumeZone` validates that nodes and volumes satisfy zonal requirements to avoid cross-zone storage access.

As a reminder, there are many parameters the scheduler considers across the [scheduling cycle][scheduling-framework/#interfaces] before a pod is placed on a node, all of which impact how a pod is assigned. This section is meant to help you consider individual plugin configurations, your custom scheduler configuration, and your Deployment design holistically.

1. Ensure the intended deployment is assigned to the _correct_ scheduler profile.
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.

Copilot AI Dec 15, 2025


According to the Kubernetes terminology guidelines, use "PersistentVolumeClaims" (plural) instead of "PersistentVolumeClaim's" (possessive with apostrophe).

Suggested change
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaim's. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.
2. Ensure the custom scheduler profile complements the implementation of Deployments, StorageClasses, and PersistentVolumeClaims. Misalignment can lead to pending pods and degraded workload performance, even when the scheduler is functioning as expected.

7. If your cluster is large, a low `PercentageOfNodesToScore` speeds scheduling by reducing the number of nodes scored, _but_ it may reduce optimal placement.
8. If you enable a plugin in the `plugins:multipoint` section but do not define it in `pluginConfig`, AKS uses the default configuration for that plugin.
9. For `NodeResourcesFit`, the ratio matters more than absolute values. For example, CPU:Memory:Storage = 3:1:2 means CPU is 3x more influential than memory, and storage is 2x more influential than memory, in the scoring phase.
10. Pair `PodTopologySpread` with Pod Disruption Budget's (PDB) and multi‑replica strategies for HA during upgrades.

Copilot AI Dec 15, 2025


According to the Microsoft Style Guide, use "Pod Disruption Budgets" (plural) instead of "Pod Disruption Budget's" (possessive with apostrophe). The apostrophe is unnecessary here.

Suggested change
10. Pair `PodTopologySpread` with Pod Disruption Budget's (PDB) and multi‑replica strategies for HA during upgrades.
10. Pair `PodTopologySpread` with Pod Disruption Budgets (PDBs) and multi‑replica strategies for HA during upgrades.

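For item 10 above, a minimal sketch of the PDB half of that pairing, with illustrative names and counts:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2         # keep at least 2 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: web
```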
### Increase resilience by distributing pods across topology domains
If you don't configure any cluster-level default constraints for pod topology spreading, then the default kube-scheduler can unevenly spread pods across zones and hosts unless constraints are specified in each deployment, every time. Use `PodTopologySpread` to control how pods are distributed across failure domains to ensure high availability and fault tolerance in the event of zone or node failures, without overprovisioning.

Copilot AI Dec 15, 2025


According to the Microsoft Style Guide, use a contraction for "can not" - it should be "can't" instead of the more formal "cannot". The style guide emphasizes using common contractions to maintain a conversational, friendly tone.

Contributor

@kaarthis kaarthis left a comment


Lots of food for thought.

- best-practices
---

Thoughtful scheduling strategies can resolve pervasive challenges across web-distributed workloads and AI workloads, like resiliency and resource utilization. But the default scheduler was primarily designed for general-purpose workloads and out-of-the-box pod scheduling, which can be restrictive if you need more fine-grained control, since it applies a set of criteria in a fixed priority order. The scheduler selects the optimal node for newly created pod(s) based on several criteria, including (but not limited to):
Contributor


What are the pervasive challenges? Always write the blog post with a view to the layman reader. It's good to elaborate on the problems, as that makes things easy for folks to follow, e.g., resiliency: one or two bullets; resource utilization: a, b, c.

3. Pod affinity/anti-affinity
4. Taints and tolerations

Out of the available nodes, the scheduler then filters out nodes that don't meet the requirements to identify the optimal node for the pod(s). Today, the AKS default scheduler lacks the flexibility for users to change which criteria are prioritized or ignored in the scheduling cycle on a per-workload basis. This means the default scheduling criteria, and their fixed priority order, are not suitable for workloads that demand co-locating pods with their persistent volumes for increased data locality, optimizing GPU utilization for machine learning, or enforcing strict zone-level distribution for microservices. This rigidity often forces users to either accept suboptimal placement or manage a separate custom scheduler, both of which increase operational complexity.
Contributor


You can start with how the AKS default scheduler differs, if at all, from the upstream Kubernetes scheduler.

:::

### Increase GPU Utilization by Bin Packing GPU-backed Nodes

Contributor


Give tangible use cases/scenarios for when increased GPU utilization is advantageous, in terms of verticals/industries; even better, name examples, e.g., gaming, AI workloads for weather, etc.

- performance
- scheduler
- best-practices
---
Contributor


What customers want to know:

What business problem does this solve? (cost, performance, reliability)
What is new vs. what existed before?
Which types of customers benefit most?
How does this simplify operations compared to Karpenter, Volcano, custom schedulers, etc.?


<!-- truncate -->

## AKS Configurable Scheduler Profiles
Contributor


Do you have a visual explanation, e.g., a conceptual diagram, that helps before delving into this?


For example, by scheduling pending jobs on nodes with higher relative GPU utilization, users can reduce costs and increase GPU utilization while maintaining performance.

**This scheduler configuration maximizes GPU efficiency for larger batch jobs by consolidating smaller jobs onto fewer nodes and lowering the operational cost of underutilized resources without sacrificing performance.**
Contributor


Is this configuration easy to change? Can I quickly roll back? Any safe practices, like testing in staging, etc.?

### Increase resilience by distributing pods across topology domains
If you don't configure any cluster-level default constraints for pod topology spreading, then the default kube-scheduler can unevenly spread pods across zones and hosts unless constraints are specified in each deployment, every time. Use `PodTopologySpread` to control how pods are distributed across failure domains to ensure high availability and fault tolerance in the event of zone or node failures, without overprovisioning.
Contributor


Add a "when to use the default scheduler and when NOT to use it" section as a table, answering these:
When customers should stick with the default scheduler
When this should not be used (e.g., mission-critical schedulers without careful tuning)
When Kueue or custom controllers are preferable

- name: ImageLocality
- name: NodeResourcesFit
- name: NodeResourcesBalancedAllocation
pluginConfig:
Contributor


What each plugin does in this configuration
Why weights were chosen (e.g., weight 3 GPU vs weight 1 CPU)
What happens if weights are misconfigured
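Agreed this deserves inline explanation. A hedged, annotated version of the kind of thing the post could show; the weights are illustrative, and an oversized weight doesn't fail validation, it simply lets one signal drown out the others:

```yaml
plugins:
  multiPoint:
    enabled:
    - name: ImageLocality                    # prefer nodes that already cache the image
      weight: 1
    - name: NodeResourcesFit                 # the main packing signal
      weight: 3                              # 3x the influence of the other scorers
    - name: NodeResourcesBalancedAllocation  # avoid lopsided CPU-vs-memory usage
      weight: 1
```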

whenUnsatisfiable: ScheduleAnyway
```

### Optimize data locality with Memory and PVC-Aware Scheduling
Contributor


As an operator focused on optimization: how does this reduce cloud cost?
Does it reduce the number of GPU nodes?
Does it reduce cross‑zone data egress?

8. If you enable a plugin in the `plugins:multipoint` section but do not define it in `pluginConfig`, AKS uses the default configuration for that plugin.
9. For `NodeResourcesFit`, the ratio matters more than absolute values. For example, CPU:Memory:Storage = 3:1:2 means CPU is 3x more influential than memory, and storage is 2x more influential than memory, in the scoring phase.
10. Pair `PodTopologySpread` with Pod Disruption Budget's (PDB) and multi‑replica strategies for HA during upgrades.

Contributor


Missing FAQ section. Any troubleshooting guidance from recent problems? Please end strongly with a clear call to action and conclusion.
