Conversation

@moko-poi
Contributor

Fixes #2227

Description

This PR fixes an issue where topology spread constraints incorrectly evaluated all cluster zones instead of only zones available in NodePools that match the pod's nodeSelector or nodeAffinity requirements.

Problem:
When a pod specifies a nodeSelector or nodeAffinity targeting a specific NodePool with limited zones, Karpenter's topology spread evaluation considered all zones across all NodePools in the cluster. This caused unnecessary scheduling failures when the topology spread constraint could not be satisfied within the zones of the pod's targeted NodePool.

For example:

  • NodePool A: zones [us-east-1a, us-east-1b]
  • NodePool B: zones [us-east-1a, us-east-1b, us-east-1c]
  • Pod with nodeSelector targeting NodePool A and maxSkew=1

Before this fix, Karpenter would try to spread across all 3 zones (a, b, c), even though the pod can only schedule to NodePool A which has 2 zones. This would cause the pod to fail scheduling when a 3-pod distribution couldn't be satisfied.
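The infeasibility in this example can be checked with a small brute-force sketch (the zone names and the `canSatisfy` helper are illustrative only, not Karpenter code):

```go
package main

import "fmt"

// canSatisfy brute-forces whether n pods can be placed onto the
// schedulable zones while keeping skew (max minus min pod count over
// ALL evaluated zones, including unreachable ones) within maxSkew.
func canSatisfy(n int, schedulable, evaluated []string, maxSkew int) bool {
	var try func(i int, counts map[string]int) bool
	try = func(i int, counts map[string]int) bool {
		if i == n {
			min, max := 1<<30, 0
			for _, z := range evaluated {
				if counts[z] < min {
					min = counts[z]
				}
				if counts[z] > max {
					max = counts[z]
				}
			}
			return max-min <= maxSkew
		}
		for _, z := range schedulable {
			counts[z]++
			ok := try(i+1, counts)
			counts[z]--
			if ok {
				return true
			}
		}
		return false
	}
	return try(0, map[string]int{})
}

func main() {
	all := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	poolA := []string{"us-east-1a", "us-east-1b"}
	// Evaluated against all three zones, 3 pods confined to NodePool A
	// can never satisfy maxSkew=1: the best split is 2/1/0 (skew 2).
	fmt.Println(canSatisfy(3, poolA, all, 1)) // false
	// Evaluated only against NodePool A's zones, a 2/1 split works.
	fmt.Println(canSatisfy(3, poolA, poolA, 1)) // true
}
```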

Solution:
Track NodePool requirements (labels + requirements) with each topology domain and filter domains based on compatibility with pod requirements during topology spread evaluation. This ensures that only zones from NodePools compatible with the pod's nodeSelector/nodeAffinity are considered.

Changes:

  1. Added DomainSource struct to track NodePool requirements and taints per domain
  2. Modified TopologyDomainGroup.Insert() to accept NodePool requirements
  3. Modified TopologyDomainGroup.ForEachDomain() to filter domains by compatibility with pod requirements
  4. Updated topology building logic to pass NodePool requirements through the stack
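A minimal sketch of the idea behind these changes (the `DomainSource` name comes from the PR description, but the fields and the `ForEachDomain` signature below are simplified assumptions, not the actual implementation):

```go
package main

import "fmt"

// DomainSource pairs a topology domain (e.g. a zone) with the
// requirements of the NodePool that contributed it. The real PR
// also tracks taints; this sketch uses plain labels for brevity.
type DomainSource struct {
	Domain       string
	Requirements map[string]string
}

// Compatible reports whether a pod's nodeSelector could be satisfied
// by the NodePool that contributed this domain. Keys the NodePool
// does not constrain are treated as open (compatible).
func (d DomainSource) Compatible(podSelector map[string]string) bool {
	for k, v := range podSelector {
		if got, ok := d.Requirements[k]; ok && got != v {
			return false
		}
	}
	return true
}

// ForEachDomain visits only domains whose source NodePool is
// compatible with the pod, mirroring the filtering in this PR.
func ForEachDomain(sources []DomainSource, podSelector map[string]string, f func(domain string)) {
	for _, s := range sources {
		if s.Compatible(podSelector) {
			f(s.Domain)
		}
	}
}

func main() {
	sources := []DomainSource{
		{Domain: "us-east-1a", Requirements: map[string]string{"karpenter.sh/nodepool": "a"}},
		{Domain: "us-east-1b", Requirements: map[string]string{"karpenter.sh/nodepool": "a"}},
		{Domain: "us-east-1c", Requirements: map[string]string{"karpenter.sh/nodepool": "b"}},
	}
	var domains []string
	ForEachDomain(sources, map[string]string{"karpenter.sh/nodepool": "a"}, func(d string) {
		domains = append(domains, d)
	})
	fmt.Println(domains) // [us-east-1a us-east-1b]
}
```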

How was this change tested?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: moko-poi
Once this PR has been reviewed and has the lgtm label, please assign maciekpytel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 30, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 30, 2025
@k8s-ci-robot
Contributor

Hi @moko-poi. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 30, 2025
@moko-poi moko-poi force-pushed the fix/topology-spread-nodepool-zone-filtering branch 2 times, most recently from bd53bcd to 3874c99 on December 1, 2025 02:31
@coveralls

coveralls commented Dec 1, 2025

Pull Request Test Coverage Report for Build 20517483264

Details

  • 73 of 75 (97.33%) changed or added relevant lines in 3 files are covered.
  • 23 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 80.365%

Changes Missing Coverage:

  • pkg/controllers/provisioning/scheduling/topologydomaingroup.go: 48 of 50 changed/added lines covered (96.0%)

Files with Coverage Reduction:

  • pkg/scheduling/requirements.go: 6 new missed lines (94.48%)
  • pkg/scheduling/zz_generated.deepcopy.go: 17 new missed lines (54.48%)

Totals:

  • Change from base Build 20489292152: -0.1%
  • Covered Lines: 12017
  • Relevant Lines: 14953

💛 - Coveralls

@tzneal
Contributor

tzneal commented Dec 2, 2025

This doesn't seem correct to me. There is a mechanism to restrict topology domains, but it's via node selector or required node affinity (see https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#interaction-with-node-affinity-and-node-selectors).

The scheduler isn't going to be aware of this special logic.

Taking your example:

For example:
NodePool A: zones [us-east-1a, us-east-1b]
NodePool B: zones [us-east-1a, us-east-1b, us-east-1c]
Pod with nodeSelector targeting NodePool A and maxSkew=1

Your pods are part of a deployment targeting NodePool A. Assume other nodes exist in the cluster, for other reasons, in all zones, so the scheduler is aware that those zones exist.

If Karpenter launches nodes for a required spread over 1a/1b, the scheduler will refuse to schedule them if it can't also schedule enough Pods to 1c. There's nothing in the deployment spec to inform the scheduler of your new constraint.

@moko-poi
Contributor Author

moko-poi commented Dec 2, 2025

Thank you for the detailed feedback, @tzneal. You raise a critical point about the Kubernetes scheduler interaction that I need to think through more carefully.

Let me make sure I understand your concern correctly: Even if Karpenter filters domains based on NodePool compatibility, the Kubernetes scheduler will independently evaluate topology spread constraints against all domains visible from existing nodes in the cluster. This could create a mismatch where:

  1. Karpenter provisions nodes only in zones [a, b, c] based on NodePool constraints
  2. The scheduler sees existing nodes in zones [a, b, c, d, e, f] and expects pods to spread across all 6 zones
  3. The scheduler rejects pod placements that Karpenter considers valid

My assumption was that when a pod has nodeSelector: {nodeType: x}, the Kubernetes scheduler would only consider nodes matching that selector when calculating available topology domains, per the interaction with node selectors documentation. However, I may be misunderstanding how this works in practice.

Questions:

  1. Does the scheduler calculate topology domains globally first (from all nodes), then filter candidates? Or does it only consider nodes matching the pod's nodeSelector/nodeAffinity from the start?
  2. If the former, should users always explicitly specify zone constraints in the pod's nodeAffinity (e.g., topology.kubernetes.io/zone In [us-east-1a, us-east-1b, us-east-1c]) in addition to NodePool selectors?
  3. Would the correct solution be to document this requirement rather than changing Karpenter's behavior?

I'm happy to test this in a real cluster scenario with:

  • Multiple NodePools with different zone constraints
  • Existing nodes in various zones
  • Pods with topology spread constraints + nodeSelector

This would help verify whether my implementation creates scheduler conflicts. Would that be a valuable next step, or do you already know this approach won't work based on the scheduler's design?

I appreciate your guidance on the right direction here.

@tzneal
Contributor

tzneal commented Dec 2, 2025

Testing would be ideal; it's quite possible I'm misremembering how this worked.

@moko-poi
Contributor Author

moko-poi commented Dec 2, 2025

As an additional note, the Kubernetes documentation describes how topology domains are derived when nodeSelector or required nodeAffinity is used:

"The scheduler will skip the non-matching nodes from the skew calculations if the incoming Pod has spec.nodeSelector or spec.affinity.nodeAffinity defined."

https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#interaction-with-node-affinity-and-node-selectors

My interpretation is that the scheduler forms topology domains only from nodes that match the pod’s selector or required affinity, and zones containing only non-matching nodes are not included in the skew evaluation. I will verify this behavior in a real cluster and share the results so we can confirm how the scheduler handles this case in practice.
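Under that interpretation, the domain calculation would conceptually look like the following toy sketch (not the real kube-scheduler code): zones enter the skew calculation only if they contain at least one node matching the incoming pod's selector.

```go
package main

import "fmt"

// Node is a simplified node with a zone and labels.
type Node struct {
	Zone   string
	Labels map[string]string
}

func matches(n Node, selector map[string]string) bool {
	for k, v := range selector {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// domainsForPod returns the zones that participate in the skew
// calculation: only zones containing at least one node that matches
// the pod's nodeSelector, per the documented scheduler behavior.
func domainsForPod(nodes []Node, selector map[string]string) map[string]bool {
	domains := map[string]bool{}
	for _, n := range nodes {
		if matches(n, selector) {
			domains[n.Zone] = true
		}
	}
	return domains
}

func main() {
	nodes := []Node{
		{Zone: "us-east-1a", Labels: map[string]string{"nodeType": "x"}},
		{Zone: "us-east-1b", Labels: map[string]string{"nodeType": "x"}},
		{Zone: "us-east-1c", Labels: map[string]string{"nodeType": "y"}},
	}
	// Zone c has only non-matching nodes, so it drops out.
	fmt.Println(domainsForPod(nodes, map[string]string{"nodeType": "x"}))
}
```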

@moko-poi
Contributor Author

@tzneal @engedaam @tallaxes Thank you for your feedback. I've completed testing with KWOK (Kubernetes Without Kubelet) to verify the actual behavior comparing before and after this fix.

Test Setup

Environment:

  • NodePool A: zones [test-zone-a, test-zone-b] only
  • NodePool B: zones [test-zone-c] only
  • Test scenario:
    • Step 1: Deploy 1 existing pod to zone-c (using NodePool B)
    • Step 2: Deploy 3 new pods targeting zones a & b via nodeAffinity (using NodePool A)
    • Constraint: maxSkew=1 across topology.kubernetes.io/zone

Results

Before Fix (main branch)

Nodes provisioned: 2
├─ test-zone-a: 1 node (nodepool-a)
├─ test-zone-b: 0 nodes ← Problem!
└─ test-zone-c: 1 node (nodepool-b)

Pod distribution:
├─ test-zone-a: 3 new-pods ← Violates maxSkew=1!
├─ test-zone-b: 0 new-pods
└─ test-zone-c: 1 existing-pod

Karpenter logs (main branch):
{"message":"computed new nodeclaim(s) to fit pod(s)","nodeclaims":2,"pods":4}
{"message":"launched nodeclaim","zone":"test-zone-a"}
// Zone-b node was NOT created

Issue: All 3 new pods concentrated in zone-a, failing to satisfy the topology spread constraint properly.

After Fix (this PR)

Nodes provisioned: 3
├─ test-zone-a: 1 node (nodepool-a)
├─ test-zone-b: 1 node (nodepool-a) ← Fixed!
└─ test-zone-c: 1 node (nodepool-b)

Pod distribution:
├─ test-zone-a: 2 new-pods
├─ test-zone-b: 1 new-pod ← Properly distributed!
└─ test-zone-c: 1 existing-pod

Skew: 2-1 (satisfies maxSkew=1) ✅

Karpenter logs (this PR):
{"message":"computed new nodeclaim(s) to fit pod(s)","nodeclaims":2,"pods":3}
{"message":"launched nodeclaim","zone":"test-zone-a"}
{"message":"launched nodeclaim","zone":"test-zone-b"} ← Now creates zone-b node!

Findings

The fix correctly addresses the issue:

  1. Before: Karpenter evaluated topology domains from all NodePools (including zone-c from NodePool B), causing suboptimal node placement
  2. After: Karpenter only considers zones from NodePools compatible with pod requirements (zones a & b from NodePool A)
  3. Result: Topology spread constraint is properly satisfied with 2-1 distribution across available zones

This aligns with the Kubernetes documentation you referenced: "The scheduler will skip the non-matching nodes from the skew calculations if the incoming Pod has spec.nodeSelector or spec.affinity.nodeAffinity defined."

The fix ensures Karpenter's provisioning logic matches the scheduler's domain filtering behavior.
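For reference, the skew numbers above can be checked with a trivial helper (illustrative only, not Karpenter or scheduler code):

```go
package main

import "fmt"

// skew returns the max minus min pod count across topology domains.
func skew(counts map[string]int) int {
	first := true
	var min, max int
	for _, c := range counts {
		if first {
			min, max = c, c
			first = false
			continue
		}
		if c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	return max - min
}

func main() {
	// Post-fix distribution from the KWOK test: 2/1/1 across zones a/b/c.
	fmt.Println(skew(map[string]int{"zone-a": 2, "zone-b": 1, "zone-c": 1})) // 1, satisfies maxSkew=1
	// Pre-fix distribution: 3/0/1 (all new pods landed in zone-a).
	fmt.Println(skew(map[string]int{"zone-a": 3, "zone-b": 0, "zone-c": 1})) // 3, violates maxSkew=1
}
```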

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 25, 2025
@moko-poi moko-poi force-pushed the fix/topology-spread-nodepool-zone-filtering branch from 3874c99 to ec50744 on December 25, 2025 07:00
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 25, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 26, 2025

Development

Successfully merging this pull request may close these issues.

All AZ domains included in topology spread constraints evaluation even when nodepool restricts to specific AZs
