Conversation

@moko-poi
Contributor

Fixes #2227

Description

This PR fixes an issue where topology spread constraints incorrectly evaluated all cluster zones instead of only zones available in NodePools that match the pod's nodeSelector or nodeAffinity requirements.

Problem:
When a pod specifies a nodeSelector or nodeAffinity targeting a specific NodePool with limited zones, Karpenter's topology spread evaluation considered all zones across all NodePools in the cluster. This caused unnecessary scheduling failures when the topology spread constraint could not be satisfied within the zones of the pod's targeted NodePool.

For example:

  • NodePool A: zones [us-east-1a, us-east-1b]
  • NodePool B: zones [us-east-1a, us-east-1b, us-east-1c]
  • Pod with nodeSelector targeting NodePool A and maxSkew=1

Before this fix, Karpenter would try to spread across all 3 zones (a, b, c), even though the pod can only schedule to NodePool A which has 2 zones. This would cause the pod to fail scheduling when a 3-pod distribution couldn't be satisfied.
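The infeasibility in this example can be checked with a small brute-force sketch (the zone names and the `canSatisfy` helper are illustrative only, not Karpenter code):

```go
package main

import "fmt"

// canSatisfy brute-forces whether n pods can be placed onto the
// schedulable zones while keeping skew (max minus min pod count over
// ALL evaluated zones, including unreachable ones) within maxSkew.
func canSatisfy(n int, schedulable, evaluated []string, maxSkew int) bool {
	var try func(i int, counts map[string]int) bool
	try = func(i int, counts map[string]int) bool {
		if i == n {
			min, max := 1<<30, 0
			for _, z := range evaluated {
				if counts[z] < min {
					min = counts[z]
				}
				if counts[z] > max {
					max = counts[z]
				}
			}
			return max-min <= maxSkew
		}
		for _, z := range schedulable {
			counts[z]++
			ok := try(i+1, counts)
			counts[z]--
			if ok {
				return true
			}
		}
		return false
	}
	return try(0, map[string]int{})
}

func main() {
	all := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	poolA := []string{"us-east-1a", "us-east-1b"}
	// Evaluated against all three zones, 3 pods confined to NodePool A
	// can never satisfy maxSkew=1: the best split is 2/1/0 (skew 2).
	fmt.Println(canSatisfy(3, poolA, all, 1)) // false
	// Evaluated only against NodePool A's zones, a 2/1 split works.
	fmt.Println(canSatisfy(3, poolA, poolA, 1)) // true
}
```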

Solution:
Track NodePool requirements (labels + requirements) with each topology domain and filter domains based on compatibility with pod requirements during topology spread evaluation. This ensures that only zones from NodePools compatible with the pod's nodeSelector/nodeAffinity are considered.

Changes:

  1. Added DomainSource struct to track NodePool requirements and taints per domain
  2. Modified TopologyDomainGroup.Insert() to accept NodePool requirements
  3. Modified TopologyDomainGroup.ForEachDomain() to filter domains by compatibility with pod requirements
  4. Updated topology building logic to pass NodePool requirements through the stack
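A minimal sketch of the idea behind these changes (the `DomainSource` name comes from the PR description, but the fields and the `ForEachDomain` signature below are simplified assumptions, not the actual implementation):

```go
package main

import "fmt"

// DomainSource pairs a topology domain (e.g. a zone) with the
// requirements of the NodePool that contributed it. The real PR
// also tracks taints; this sketch uses plain labels for brevity.
type DomainSource struct {
	Domain       string
	Requirements map[string]string
}

// Compatible reports whether a pod's nodeSelector could be satisfied
// by the NodePool that contributed this domain. Keys the NodePool
// does not constrain are treated as open (compatible).
func (d DomainSource) Compatible(podSelector map[string]string) bool {
	for k, v := range podSelector {
		if got, ok := d.Requirements[k]; ok && got != v {
			return false
		}
	}
	return true
}

// ForEachDomain visits only domains whose source NodePool is
// compatible with the pod, mirroring the filtering in this PR.
func ForEachDomain(sources []DomainSource, podSelector map[string]string, f func(domain string)) {
	for _, s := range sources {
		if s.Compatible(podSelector) {
			f(s.Domain)
		}
	}
}

func main() {
	sources := []DomainSource{
		{Domain: "us-east-1a", Requirements: map[string]string{"karpenter.sh/nodepool": "a"}},
		{Domain: "us-east-1b", Requirements: map[string]string{"karpenter.sh/nodepool": "a"}},
		{Domain: "us-east-1c", Requirements: map[string]string{"karpenter.sh/nodepool": "b"}},
	}
	var domains []string
	ForEachDomain(sources, map[string]string{"karpenter.sh/nodepool": "a"}, func(d string) {
		domains = append(domains, d)
	})
	fmt.Println(domains) // [us-east-1a us-east-1b]
}
```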

How was this change tested?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: moko-poi
Once this PR has been reviewed and has the lgtm label, please assign maciekpytel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 30, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 30, 2025
@k8s-ci-robot
Contributor

Hi @moko-poi. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 30, 2025
@moko-poi moko-poi force-pushed the fix/topology-spread-nodepool-zone-filtering branch 2 times, most recently from bd53bcd to 3874c99 on December 1, 2025 02:31
@coveralls

coveralls commented Dec 1, 2025

Pull Request Test Coverage Report for Build 20517483264

Details

  • 73 of 75 (97.33%) changed or added relevant lines in 3 files are covered.
  • 23 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 80.365%

Changes Missing Coverage:

  • pkg/controllers/provisioning/scheduling/topologydomaingroup.go: 48 of 50 changed/added lines covered (96.0%)

Files with Coverage Reduction:

  • pkg/scheduling/requirements.go: 6 new missed lines (94.48%)
  • pkg/scheduling/zz_generated.deepcopy.go: 17 new missed lines (54.48%)

Totals:

  • Change from base Build 20489292152: -0.1%
  • Covered Lines: 12017
  • Relevant Lines: 14953

💛 - Coveralls

@tzneal
Contributor

tzneal commented Dec 2, 2025

This doesn't seem correct to me. There is a mechanism to restrict topology domains, but it's via node selector or required node affinity (see https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#interaction-with-node-affinity-and-node-selectors).

The scheduler isn't going to be aware of this special logic.

Taking your example:

For example:
NodePool A: zones [us-east-1a, us-east-1b]
NodePool B: zones [us-east-1a, us-east-1b, us-east-1c]
Pod with nodeSelector targeting NodePool A and maxSkew=1

Your pods are part of a deployment targeting NodePool A. Assume other nodes exist in the cluster, for other reasons, in all zones, so the scheduler is aware that those zones exist.

If Karpenter launches nodes for a required spread over 1a/1b, the scheduler will refuse to schedule them if it can't also schedule enough Pods to 1c. There's nothing in the deployment spec to inform the scheduler of your new constraint.

@moko-poi
Contributor Author

moko-poi commented Dec 2, 2025

Thank you for the detailed feedback, @tzneal. You raise a critical point about the Kubernetes scheduler interaction that I need to think through more carefully.

Let me make sure I understand your concern correctly: Even if Karpenter filters domains based on NodePool compatibility, the Kubernetes scheduler will independently evaluate topology spread constraints against all domains visible from existing nodes in the cluster. This could create a mismatch where:

  1. Karpenter provisions nodes only in zones [a, b, c] based on NodePool constraints
  2. The scheduler sees existing nodes in zones [a, b, c, d, e, f] and expects pods to spread across all 6 zones
  3. The scheduler rejects pod placements that Karpenter considers valid

My assumption was that when a pod has nodeSelector: {nodeType: x}, the Kubernetes scheduler would only consider nodes matching that selector when calculating available topology domains, per the interaction with node selectors documentation. However, I may be misunderstanding how this works in practice.

Questions:

  1. Does the scheduler calculate topology domains globally first (from all nodes), then filter candidates? Or does it only consider nodes matching the pod's nodeSelector/nodeAffinity from the start?
  2. If the former, should users always explicitly specify zone constraints in the pod's nodeAffinity (e.g., topology.kubernetes.io/zone In [us-east-1a, us-east-1b, us-east-1c]) in addition to NodePool selectors?
  3. Would the correct solution be to document this requirement rather than changing Karpenter's behavior?

I'm happy to test this in a real cluster scenario with:

  • Multiple NodePools with different zone constraints
  • Existing nodes in various zones
  • Pods with topology spread constraints + nodeSelector

This would help verify whether my implementation creates scheduler conflicts. Would that be a valuable next step, or do you already know this approach won't work based on the scheduler's design?

I appreciate your guidance on the right direction here.

@tzneal
Contributor

tzneal commented Dec 2, 2025

Testing would be ideal; it's quite possible I'm misremembering how this worked.

@moko-poi
Contributor Author

moko-poi commented Dec 2, 2025

As an additional note, the Kubernetes documentation describes how topology domains are derived when nodeSelector or required nodeAffinity is used:

"The scheduler will skip the non-matching nodes from the skew calculations if the incoming Pod has spec.nodeSelector or spec.affinity.nodeAffinity defined."

https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#interaction-with-node-affinity-and-node-selectors

My interpretation is that the scheduler forms topology domains only from nodes that match the pod’s selector or required affinity, and zones containing only non-matching nodes are not included in the skew evaluation. I will verify this behavior in a real cluster and share the results so we can confirm how the scheduler handles this case in practice.
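Under that interpretation, the domain calculation would conceptually look like the following toy sketch (not the real kube-scheduler code): zones enter the skew calculation only if they contain at least one node matching the incoming pod's selector.

```go
package main

import "fmt"

// Node is a simplified node with a zone and labels.
type Node struct {
	Zone   string
	Labels map[string]string
}

func matches(n Node, selector map[string]string) bool {
	for k, v := range selector {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// domainsForPod returns the zones that participate in the skew
// calculation: only zones containing at least one node that matches
// the pod's nodeSelector, per the documented scheduler behavior.
func domainsForPod(nodes []Node, selector map[string]string) map[string]bool {
	domains := map[string]bool{}
	for _, n := range nodes {
		if matches(n, selector) {
			domains[n.Zone] = true
		}
	}
	return domains
}

func main() {
	nodes := []Node{
		{Zone: "us-east-1a", Labels: map[string]string{"nodeType": "x"}},
		{Zone: "us-east-1b", Labels: map[string]string{"nodeType": "x"}},
		{Zone: "us-east-1c", Labels: map[string]string{"nodeType": "y"}},
	}
	// Zone c has only non-matching nodes, so it drops out.
	fmt.Println(domainsForPod(nodes, map[string]string{"nodeType": "x"}))
}
```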

@moko-poi
Contributor Author

@tzneal @engedaam @tallaxes Thank you for your feedback. I've completed testing with KWOK (Kubernetes Without Kubelet) to verify the actual behavior comparing before and after this fix.

Test Setup

Environment:

  • NodePool A: zones [test-zone-a, test-zone-b] only
  • NodePool B: zones [test-zone-c] only
  • Test scenario:
    • Step 1: Deploy 1 existing pod to zone-c (using NodePool B)
    • Step 2: Deploy 3 new pods targeting zones a & b via nodeAffinity (using NodePool A)
    • Constraint: maxSkew=1 across topology.kubernetes.io/zone

Results

Before Fix (main branch)

Nodes provisioned: 2
├─ test-zone-a: 1 node (nodepool-a)
├─ test-zone-b: 0 nodes ← Problem!
└─ test-zone-c: 1 node (nodepool-b)

Pod distribution:
├─ test-zone-a: 3 new-pods ← Violates maxSkew=1!
├─ test-zone-b: 0 new-pods
└─ test-zone-c: 1 existing-pod

Karpenter logs (main branch):
{"message":"computed new nodeclaim(s) to fit pod(s)","nodeclaims":2,"pods":4}
{"message":"launched nodeclaim","zone":"test-zone-a"}
// Zone-b node was NOT created

Issue: All 3 new pods concentrated in zone-a, failing to satisfy the topology spread constraint properly.

After Fix (this PR)

Nodes provisioned: 3
├─ test-zone-a: 1 node (nodepool-a)
├─ test-zone-b: 1 node (nodepool-a) ← Fixed!
└─ test-zone-c: 1 node (nodepool-b)

Pod distribution:
├─ test-zone-a: 2 new-pods
├─ test-zone-b: 1 new-pod ← Properly distributed!
└─ test-zone-c: 1 existing-pod

Skew: 2-1 (satisfies maxSkew=1) ✅

Karpenter logs (this PR):
{"message":"computed new nodeclaim(s) to fit pod(s)","nodeclaims":2,"pods":3}
{"message":"launched nodeclaim","zone":"test-zone-a"}
{"message":"launched nodeclaim","zone":"test-zone-b"} ← Now creates zone-b node!

Findings

The fix correctly addresses the issue:

  1. Before: Karpenter evaluated topology domains from all NodePools (including zone-c from NodePool B), causing suboptimal node placement
  2. After: Karpenter only considers zones from NodePools compatible with pod requirements (zones a & b from NodePool A)
  3. Result: Topology spread constraint is properly satisfied with 2-1 distribution across available zones

This aligns with the Kubernetes documentation you referenced: "The scheduler will skip the non-matching nodes from the skew calculations if the incoming Pod has spec.nodeSelector or spec.affinity.nodeAffinity defined."

The fix ensures Karpenter's provisioning logic matches the scheduler's domain filtering behavior.
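For reference, the skew numbers above can be checked with a trivial helper (illustrative only, not Karpenter or scheduler code):

```go
package main

import "fmt"

// skew returns the max minus min pod count across topology domains.
func skew(counts map[string]int) int {
	first := true
	var min, max int
	for _, c := range counts {
		if first {
			min, max = c, c
			first = false
			continue
		}
		if c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	return max - min
}

func main() {
	// Post-fix distribution from the KWOK test: 2/1/1 across zones a/b/c.
	fmt.Println(skew(map[string]int{"zone-a": 2, "zone-b": 1, "zone-c": 1})) // 1, satisfies maxSkew=1
	// Pre-fix distribution: 3/0/1 (all new pods landed in zone-a).
	fmt.Println(skew(map[string]int{"zone-a": 3, "zone-b": 0, "zone-c": 1})) // 3, violates maxSkew=1
}
```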

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 25, 2025
@moko-poi moko-poi force-pushed the fix/topology-spread-nodepool-zone-filtering branch from 3874c99 to ec50744 on December 25, 2025 07:00
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 25, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 26, 2025

Development

Successfully merging this pull request may close these issues.

All AZ domains included in topology spread constraints evaluation even when nodepool restricts to specific AZs
