Add NFD image compatibility scheduler proposal. #2403
base: master
✅ Deploy Preview for kubernetes-sigs-nfd ready!
Hi @Xunli-Yang. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Pull request overview
This PR adds KEP-2403, which proposes a compatibility scheduler plugin for Node Feature Discovery (NFD). Building on KEP-1845 (which established node compatibility validation), this proposal introduces automated scheduling capabilities to ensure pods are scheduled on nodes compatible with their container image requirements.
Key changes:
- Introduces three alternative solution designs for implementing image compatibility scheduling
- Proposes an `ImageCompatibilityPlugin` that leverages `NodeFeatureGroup` CRs to filter compatible nodes
- Presents performance tradeoffs from basic validation (Solution 1) to optimized large-scale approaches (Solutions 2 and 3)
Reviewed changes
Copilot reviewed 1 out of 4 changed files in this pull request and generated 26 comments.
| File | Description |
|---|---|
| enhancements/2403-nfd-image-compatibility-scheduler/README.md | Complete KEP document proposing three solutions for image compatibility scheduling with detailed workflows, merits/demerits analysis, and test plans |
| enhancements/2403-nfd-image-compatibility-scheduler/solution1.png | Architectural diagram illustrating the basic NodeFeatureGroup check approach |
| enhancements/2403-nfd-image-compatibility-scheduler/solution2.png | Architectural diagram showing the SQLite database caching solution |
| enhancements/2403-nfd-image-compatibility-scheduler/solution3.png | Architectural diagram depicting the node pre-grouping optimization strategy |
1. **CR Creation and Update (PreFilter Phase):** When a pod with specific image requirements enters the scheduling queue, the scheduler plugin fetches the attached OCI Artifact. It extracts the compatibility metadata (e.g., required kernel features) and **immediately creates a new `NodeFeatureGroup` CR**. This CR's specification defines the dynamic compatibility rules.

The `update NodeFeatureGroup` operation evaluates **all nodes in the cluster** against the CR's specification rules and updates the CR's `status` field with the list of nodes that satisfy the compatibility demands.

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureGroup
metadata:
  name: node-feature-group-example
spec:
  featureGroupRules:
    - name: "kernel version"
      matchFeatures:
        - feature: kernel.version
          matchExpressions:
            major: {op: In, value: ["6"]}
status:
  nodes:
    - name: node-1
    - name: node-2
    - name: node-3
```

2. **Node Filtering (Filter Phase):** In the scheduler's final filter phase, the plugin retrieves the dynamically created `NodeFeatureGroup` CR and filters the candidate nodes, ensuring that only nodes listed in the CR's `status` are considered compatible.
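
The two phases quoted above can be sketched in a few lines of Python. This is only an illustration of the described flow, not the plugin's implementation: the classes `MatchExpression` and `NodeFeatureGroup`, the helper names, and the in-memory `cluster` dictionary are all hypothetical stand-ins for the real CR and node feature data, and only the `In` operator from the example CR is handled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MatchExpression:
    op: str           # only "In" is handled in this sketch
    value: list[str]


@dataclass
class NodeFeatureGroup:
    # feature name -> attribute name -> expression (hypothetical, flattened form)
    rules: dict[str, dict[str, MatchExpression]]
    status_nodes: list[str] = field(default_factory=list)


def node_matches(features, rules):
    """Return True if every expression of every rule holds for the node."""
    for feature, exprs in rules.items():
        attrs = features.get(feature, {})
        for attr, expr in exprs.items():
            if expr.op != "In":
                raise ValueError(f"unsupported op: {expr.op}")
            if attrs.get(attr) not in expr.value:
                return False
    return True


def update_status(group, cluster):
    """PreFilter-like step: evaluate all nodes and record the matching ones."""
    group.status_nodes = sorted(
        name for name, feats in cluster.items() if node_matches(feats, group.rules)
    )


def filter_nodes(candidates, group):
    """Filter-like step: keep only candidates listed in the CR status."""
    allowed = set(group.status_nodes)
    return [n for n in candidates if n in allowed]


# Hypothetical cluster: three nodes on kernel 6.x, one on 5.x.
cluster = {
    "node-1": {"kernel.version": {"major": "6"}},
    "node-2": {"kernel.version": {"major": "6"}},
    "node-3": {"kernel.version": {"major": "6"}},
    "node-4": {"kernel.version": {"major": "5"}},
}
group = NodeFeatureGroup(
    rules={"kernel.version": {"major": MatchExpression("In", ["6"])}}
)
update_status(group, cluster)
compatible = filter_nodes(["node-4", "node-2", "node-1"], group)
print(compatible)
```

Note that `update_status` walks every node in the cluster for each new CR, which is the per-pod cost the proposal's Solutions 2 and 3 aim to reduce.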
Copilot AI, Dec 27, 2025
Missing critical information about lifecycle management. The proposal mentions creating NodeFeatureGroup CRs dynamically during scheduling but doesn't address cleanup. When and how are these ephemeral CRs deleted? Without proper cleanup, they could accumulate and cause resource exhaustion. This is particularly important for Solution 1 and potentially Solution 3, which create CRs per scheduling request.
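
To make the concern concrete, here is one possible cleanup policy, sketched in Python. It is not part of the proposal: the `EphemeralCR` record, the `pod_scheduled` flag, and the TTL value are assumptions chosen for illustration. The sketch deletes a CR once the pod that triggered it has been scheduled, or once the CR has outlived a time-to-live.

```python
from dataclasses import dataclass


@dataclass
class EphemeralCR:
    name: str
    created_at: float    # seconds since epoch
    pod_scheduled: bool  # True once the pod that triggered the CR is bound


def select_for_deletion(crs, now, ttl_seconds):
    """Return names of CRs considered safe to delete under this sketch's policy."""
    return [
        cr.name
        for cr in crs
        if cr.pod_scheduled or now - cr.created_at > ttl_seconds
    ]


# Hypothetical CR names and timestamps.
crs = [
    EphemeralCR("nfg-pod-a", created_at=100.0, pod_scheduled=True),
    EphemeralCR("nfg-pod-b", created_at=100.0, pod_scheduled=False),
    EphemeralCR("nfg-pod-c", created_at=390.0, pod_scheduled=False),
]
stale = select_for_deletion(crs, now=400.0, ttl_seconds=120.0)
print(stale)
```

Whatever policy is chosen, the proposal would still need to state who runs the sweep and how failed scheduling attempts are handled.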
The process involves three main phases:

1. **Initial Cluster Grouping:** In the cluster preparation stage, the administrator should divide the cluster nodes into several groups by `NodeFeatureGroup`. Multiple `NodeFeatureGroup` Custom Resources (CRs) are created declaratively, each defining a grouping rule. Their status is populated with all matching nodes, completing the pre-grouping setup.
Copilot AI, Dec 27, 2025
Missing important implementation detail. The proposal mentions that "administrator should divide the cluster nodes into several groups by NodeFeatureGroup" but doesn't provide guidance on how to determine appropriate grouping rules or how many groups are optimal. Additionally, it doesn't address what happens when new nodes are added to the cluster - how are they assigned to groups? These are critical considerations for the practical implementation of this solution.
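
The new-node question can be illustrated with a small Python sketch. It assumes, hypothetically, that each pre-created group carries a simple rule mapping a feature key to its allowed values, and that a joining node is added to every group whose rule it satisfies. The group names, feature keys, and helper functions are invented for the example and do not come from the proposal.

```python
def matches(node_features, rule):
    """True if the node has an allowed value for every feature the rule names."""
    return all(node_features.get(k) in allowed for k, allowed in rule.items())


def assign_node(node_name, node_features, group_rules, membership):
    """Add a node to the member list of every group whose rule it satisfies."""
    joined = []
    for group, rule in group_rules.items():
        if matches(node_features, rule):
            membership.setdefault(group, []).append(node_name)
            joined.append(group)
    return joined


# Hypothetical pre-grouping by kernel major version.
group_rules = {
    "kernel-6": {"kernel.version.major": ["6"]},
    "kernel-5": {"kernel.version.major": ["5"]},
}
membership = {"kernel-6": ["node-1", "node-2"], "kernel-5": ["node-4"]}

joined = assign_node("node-5", {"kernel.version.major": "6"}, group_rules, membership)
unmatched = assign_node("node-6", {"kernel.version.major": "4"}, group_rules, membership)
print(joined, unmatched)
```

The second call returns an empty list: a node that fits no group is invisible to group-based filtering, which is the kind of gap the comment asks the proposal to address.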
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Xunli-Yang

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files: Approvers can indicate their approval by writing
Co-authored-by: Joe Huang <[email protected]>
Add KEP: NFD image compatibility scheduler proposal.
What does the proposal do?
This proposal builds on the first phase of the KEP-1845 proposal, which completed node compatibility validation, and introduces a compatibility scheduler plugin. The plugin automatically analyzes the compatibility requirements of container images, filters suitable nodes for scheduling, and ensures that containers run on compatible nodes.
Special notes for reviewer:
Based on the discussions in the node-feature-discovery Slack channel, this proposal presents three solutions and seeks consensus on the implementation direction.
Co-authored-by: @ChaoyiHuang