fix: Filter topology domains by NodePool compatibility with pod requirements #2671
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: moko-poi. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Hi @moko-poi. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from bd53bcd to 3874c99.
Pull Request Test Coverage Report for Build 20517483264. 💛 - Coveralls
This doesn't seem correct to me. There is a mechanism to restrict topology domains, but it's via node selector or required node affinity (see https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#interaction-with-node-affinity-and-node-selectors). The scheduler isn't going to be aware of this special logic. Taking your example:
Your pods are part of a deployment targeting NodePool A. Assume other nodes exist in the cluster for other reasons in all zones, so the scheduler is aware that those zones exist. If Karpenter launches nodes for a required spread over 1a/1b, the scheduler will refuse to schedule them if it can't also schedule enough Pods to 1c. There's nothing in the deployment spec to inform the scheduler of your new constraint.
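To make that concern concrete, here is a minimal sketch of the kind of pod template the scenario describes, written with the upstream corev1 Go types (the app label, container name, and image are placeholders, not taken from this PR). Nothing in the spec restricts the pods to NodePool A, so the kube-scheduler counts every zone it can see when evaluating the required spread.

```go
package scenario

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podTemplate builds a pod template with a required zone spread but no
// nodeSelector or required nodeAffinity: the scheduler has no way to know
// the pods are meant only for NodePool A's zones (1a/1b), so it still
// expects the spread to be satisfiable in 1c as well.
func podTemplate() corev1.PodTemplateSpec {
	return corev1.PodTemplateSpec{
		ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "example"}},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "app", Image: "registry.k8s.io/pause:3.9"}},
			TopologySpreadConstraints: []corev1.TopologySpreadConstraint{{
				MaxSkew:           1,
				TopologyKey:       corev1.LabelTopologyZone, // "topology.kubernetes.io/zone"
				WhenUnsatisfiable: corev1.DoNotSchedule,
				LabelSelector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "example"}},
			}},
			// Intentionally no NodeSelector / NodeAffinity here.
		},
	}
}
```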
Thank you for the detailed feedback, @tzneal. You raise a critical point about the Kubernetes scheduler interaction that I need to think through more carefully. Let me make sure I understand your concern correctly: even if Karpenter filters domains based on NodePool compatibility, the Kubernetes scheduler will independently evaluate topology spread constraints against all domains visible from existing nodes in the cluster. This could create a mismatch where Karpenter provisions capacity only in the filtered zones while the scheduler still expects the spread to be satisfied across every zone it can see.
My assumption was that when a pod has a nodeSelector or required nodeAffinity restricting it to a specific NodePool, the scheduler would skip non-matching nodes when calculating skew, but I'd like to confirm that.
I'm happy to test this in a real cluster scenario with two NodePools covering different zone sets and a deployment that combines a nodeSelector with a required topology spread constraint. This would help verify whether my implementation creates scheduler conflicts. Would that be a valuable next step, or do you already know this approach won't work based on the scheduler's design? I appreciate your guidance on the right direction here.
Testing would be ideal; it's quite possible I'm misremembering how this worked.
As an additional note, the Kubernetes documentation describes how topology domains are derived when nodeSelector or required nodeAffinity is used: "The scheduler will skip the non-matching nodes from the skew calculations if the incoming Pod has spec.nodeSelector or spec.affinity.nodeAffinity defined."
My interpretation is that the scheduler forms topology domains only from nodes that match the pod's selector or required affinity, and zones containing only non-matching nodes are not included in the skew evaluation. I will verify this behavior in a real cluster and share the results so we can confirm how the scheduler handles this case in practice.
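To illustrate the interaction that documentation passage describes, the sketch below adds a nodeSelector pinning the pod to a single NodePool (the NodePool name "nodepool-a" is a placeholder; `karpenter.sh/nodepool` is Karpenter's standard NodePool label). With this selector in place, nodes that don't match it are skipped in the scheduler's skew calculation, so zones backed only by non-matching nodes stop counting as domains for this pod.

```go
package scenario

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pinnedPodSpec uses the same spread constraint as before, plus a nodeSelector
// restricting the pod to one NodePool. Per the documented interaction, the
// scheduler then evaluates skew only over nodes matching this selector.
func pinnedPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		Containers:   []corev1.Container{{Name: "app", Image: "registry.k8s.io/pause:3.9"}},
		NodeSelector: map[string]string{"karpenter.sh/nodepool": "nodepool-a"}, // placeholder NodePool name
		TopologySpreadConstraints: []corev1.TopologySpreadConstraint{{
			MaxSkew:           1,
			TopologyKey:       corev1.LabelTopologyZone,
			WhenUnsatisfiable: corev1.DoNotSchedule,
			LabelSelector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "example"}},
		}},
	}
}
```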
@tzneal @engedaam @tallaxes Thank you for your feedback. I've completed testing with KWOK (Kubernetes WithOut Kubelet) to verify the actual behavior before and after this fix, using a deployment restricted to one NodePool with a required zone topology spread constraint (maxSkew=1).

Results:

Before fix (main branch):
- Nodes provisioned: 2
- Pod distribution: all 3 new pods concentrated in zone-a, failing to satisfy the topology spread constraint.

After fix (this PR):
- Nodes provisioned: 3
- Pod distribution: skew of 2-1, which satisfies maxSkew=1 ✅

Findings: the fix correctly addresses the issue. This aligns with the Kubernetes documentation you referenced: "The scheduler will skip the non-matching nodes from the skew calculations if the incoming Pod has spec.nodeSelector or spec.affinity.nodeAffinity defined." The fix ensures Karpenter's provisioning logic matches the scheduler's domain filtering behavior.
Force-pushed from 3874c99 to ec50744.
Fixes #2227
Description
This PR fixes an issue where topology spread constraints incorrectly evaluated all cluster zones instead of only zones available in NodePools that match the pod's nodeSelector or nodeAffinity requirements.
Problem:
When a pod specified a nodeSelector or nodeAffinity targeting a NodePool with limited zones, Karpenter's topology spread evaluation considered all zones across all NodePools in the cluster. This caused unnecessary scheduling failures when the topology spread constraint couldn't be satisfied within the zones offered by the pod's targeted NodePool.
For example: NodePool A offers zones a and b, NodePool B offers zone c, and a deployment's pods target NodePool A while using a required zone topology spread constraint.
Before this fix, Karpenter would try to spread across all 3 zones (a, b, c), even though the pod can only schedule to NodePool A, which has 2 zones. This would cause the pod to fail scheduling when a 3-pod distribution couldn't be satisfied.
Solution:
Track NodePool requirements (labels + requirements) with each topology domain and filter domains based on compatibility with pod requirements during topology spread evaluation. This ensures that only zones from NodePools compatible with the pod's nodeSelector/nodeAffinity are considered.
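As a rough illustration of that approach, here is a small, self-contained Go sketch. It mirrors the `DomainSource` / `TopologyDomainGroup` changes listed below only in spirit: the types, method signatures, and compatibility semantics are simplified stand-ins for this write-up, not the PR's actual implementation.

```go
package topology

// requirements is a simplified stand-in for a set of node requirements
// (label key -> allowed values), used here for both NodePools and pods.
type requirements map[string][]string

// compatible reports whether a NodePool's requirements can satisfy a pod's
// requirements: for every key the pod constrains, the NodePool must either
// leave that key unconstrained or share at least one allowed value.
func compatible(nodePool, pod requirements) bool {
	for key, podValues := range pod {
		npValues, constrained := nodePool[key]
		if !constrained {
			continue // NodePool can provide any value for this key
		}
		if !intersects(npValues, podValues) {
			return false
		}
	}
	return true
}

func intersects(a, b []string) bool {
	seen := make(map[string]struct{}, len(a))
	for _, v := range a {
		seen[v] = struct{}{}
	}
	for _, v := range b {
		if _, ok := seen[v]; ok {
			return true
		}
	}
	return false
}

// domainGroup maps a topology domain (e.g. a zone name) to the requirements
// of every NodePool that can produce nodes in that domain.
type domainGroup map[string][]requirements

// Insert records that a NodePool with the given requirements can provide the domain.
func (g domainGroup) Insert(domain string, nodePool requirements) {
	g[domain] = append(g[domain], nodePool)
}

// ForEachDomain visits only the domains that at least one NodePool compatible
// with the pod's requirements can provide.
func (g domainGroup) ForEachDomain(pod requirements, fn func(domain string)) {
	for domain, sources := range g {
		for _, np := range sources {
			if compatible(np, pod) {
				fn(domain)
				break
			}
		}
	}
}
```

For instance, if NodePool A is inserted for zones a and b and NodePool B for zone c, a pod whose requirements pin it to NodePool A would only have zones a and b visited by `ForEachDomain`, matching the behavior described above.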
Changes:
- Added a `DomainSource` struct to track NodePool requirements and taints per domain
- Updated `TopologyDomainGroup.Insert()` to accept NodePool requirements
- Updated `TopologyDomainGroup.ForEachDomain()` to filter domains by compatibility with pod requirements

How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.