@Xunli-Yang

Add KEP: NFD image compatibility scheduler proposal.

What does the proposal do?
This proposal builds on the first phase of KEP-1845, which completed node compatibility validation, and introduces a compatibility scheduling plugin. The plugin automatically analyzes the compatibility requirements of container images, filters suitable nodes for scheduling, and ensures that containers run on compatible nodes.

Special notes for reviewer:
Based on discussions in the node-feature-discovery Slack channel, this proposal presents three solutions and seeks consensus on the implementation direction.
Co-authored-by: @ChaoyiHuang

@netlify

netlify bot commented Dec 27, 2025

Deploy Preview for kubernetes-sigs-nfd ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | f6ebba8 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-nfd/deploys/695487f811e4f1000850980d |
| 😎 Deploy Preview | https://deploy-preview-2403--kubernetes-sigs-nfd.netlify.app |

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 27, 2025
@k8s-ci-robot
Contributor

Hi @Xunli-Yang. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 27, 2025
@ArangoGutierrez ArangoGutierrez self-assigned this Dec 27, 2025

Copilot AI left a comment

Pull request overview

This PR adds KEP-2403, which proposes a compatibility scheduler plugin for Node Feature Discovery (NFD). Building on KEP-1845 (which established node compatibility validation), this proposal introduces automated scheduling capabilities to ensure pods are scheduled on nodes compatible with their container image requirements.

Key changes:

  • Introduces three alternative solution designs for implementing image compatibility scheduling
  • Proposes an ImageCompatibilityPlugin that leverages NodeFeatureGroup CRs to filter compatible nodes
  • Presents performance tradeoffs from basic validation (Solution 1) to optimized large-scale approaches (Solutions 2 and 3)

Reviewed changes

Copilot reviewed 1 out of 4 changed files in this pull request and generated 26 comments.

| File | Description |
| --- | --- |
| enhancements/2403-nfd-image-compatibility-scheduler/README.md | Complete KEP document proposing three solutions for image compatibility scheduling with detailed workflows, merits/demerits analysis, and test plans |
| enhancements/2403-nfd-image-compatibility-scheduler/solution1.png | Architectural diagram illustrating the basic NodeFeatureGroup check approach |
| enhancements/2403-nfd-image-compatibility-scheduler/solution2.png | Architectural diagram showing the SQLite database caching solution |
| enhancements/2403-nfd-image-compatibility-scheduler/solution3.png | Architectural diagram depicting the node pre-grouping optimization strategy |

Comment on lines 28 to 51
1. **CR Creation and Update(Prefilter Phase):** When a pod with specific image requirements enters the scheduling queue, scheduler plugin fetches the attached OCI Artifact. It extracts the compatibility metadata (e.g., required kernel features) and **instantly creates a new `NodeFeatureGroup` CR**. This CR's specification defines the dynamic compatibility rules.

The `update NodeFeatureGroup` operation evaluates **all nodes in the cluster** against the CR's specification rules and updates the CR's `status` field with the list of nodes that satisfy the compatibility demands.

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureGroup
metadata:
  name: node-feature-group-example
spec:
  featureGroupRules:
    - name: "kernel version"
      matchFeatures:
        - feature: kernel.version
          matchExpressions:
            major: {op: In, value: ["6"]}
status:
  nodes:
    - name: node-1
    - name: node-2
    - name: node-3
```

2. **Node Filtering (Filter Phase):** In the scheduler's final filter phase, retrieve the dynamically created `NodeFeatureGroup` CR and filters the candidate nodes, ensuring that only nodes listed in the CR's `status` are considered compatible.
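The filter step quoted above can be illustrated with a minimal Go sketch. It is not the plugin code from the KEP: `NodeFeatureGroupStatus`, `compatibleSet`, and `filterNodes` are hypothetical names, and the CR status is reduced to a plain list of node names, on the assumption that filtering amounts to a membership check against that list.

```go
package main

import "fmt"

// NodeFeatureGroupStatus is a hypothetical, simplified stand-in for the
// status of a NodeFeatureGroup CR: only the names of matching nodes.
type NodeFeatureGroupStatus struct {
	Nodes []string
}

// compatibleSet turns the status node list into a set for O(1) lookups.
func compatibleSet(status NodeFeatureGroupStatus) map[string]struct{} {
	set := make(map[string]struct{}, len(status.Nodes))
	for _, name := range status.Nodes {
		set[name] = struct{}{}
	}
	return set
}

// filterNodes keeps only the candidates listed in the CR status,
// preserving the order of the candidate list.
func filterNodes(candidates []string, status NodeFeatureGroupStatus) []string {
	set := compatibleSet(status)
	kept := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if _, ok := set[c]; ok {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	status := NodeFeatureGroupStatus{Nodes: []string{"node-1", "node-2", "node-3"}}
	// node-4 is not in the status list, so it is filtered out.
	fmt.Println(filterNodes([]string{"node-2", "node-4", "node-3"}, status)) // prints [node-2 node-3]
}
```

In a real scheduler plugin the check would run once per candidate node, so building the set once per scheduling cycle rather than per node is the design choice this sketch is meant to highlight.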

Copilot AI Dec 27, 2025

Missing critical information about lifecycle management. The proposal mentions creating NodeFeatureGroup CRs dynamically during scheduling but doesn't address cleanup. When and how are these ephemeral CRs deleted? Without proper cleanup, they could accumulate and cause resource exhaustion. This is particularly important for Solution 1 and potentially Solution 3, which create CRs per scheduling request.

Copilot uses AI. Check for mistakes.

The process involves three main phases:

1. **Initial Cluster Grouping:** In the cluster preparation stage, administrator should divide the cluster nodes into several groups by `NodeFeatureGroup`. Multiple `NodeFeatureGroup` Custom Resources (CRs) are created declaratively, each defining a grouping rule. Their status is populated with all matching nodes, completing the pre-grouping setup.

Copilot AI Dec 27, 2025

Missing important implementation detail. The proposal mentions that "administrator should divide the cluster nodes into several groups by NodeFeatureGroup" but doesn't provide guidance on how to determine appropriate grouping rules or how many groups are optimal. Additionally, it doesn't address what happens when new nodes are added to the cluster - how are they assigned to groups? These are critical considerations for the practical implementation of this solution.
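To make the new-node question concrete, here is a minimal Go sketch of one way membership could be recomputed, assuming each group's rule is evaluated against the new node's feature attributes. The `groupRule` type is a hypothetical simplification that supports only an `In`-style match (as in `major: {op: In, value: ["6"]}`); it is not the NFD rule engine.

```go
package main

import "fmt"

// groupRule is a hypothetical, simplified grouping rule: every listed
// attribute must be present on the node with one of the allowed values.
type groupRule struct {
	Name    string
	Allowed map[string][]string
}

// matches reports whether the node attributes satisfy the rule.
func matches(rule groupRule, attrs map[string]string) bool {
	for key, allowed := range rule.Allowed {
		val, ok := attrs[key]
		if !ok {
			return false
		}
		found := false
		for _, a := range allowed {
			if a == val {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// groupsForNode returns the names of all groups a node belongs to.
func groupsForNode(rules []groupRule, attrs map[string]string) []string {
	var out []string
	for _, r := range rules {
		if matches(r, attrs) {
			out = append(out, r.Name)
		}
	}
	return out
}

func main() {
	rules := []groupRule{
		{Name: "kernel-6", Allowed: map[string][]string{"kernel.version.major": {"6"}}},
		{Name: "kernel-5", Allowed: map[string][]string{"kernel.version.major": {"5"}}},
	}
	newNode := map[string]string{"kernel.version.major": "6"}
	fmt.Println(groupsForNode(rules, newNode))
}
```

Under this assumption a node that matches no rule ends up in no group, so the proposal would still need to say how such nodes are treated during scheduling.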

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Xunli-Yang
Once this PR has been reviewed and has the lgtm label, please ask for approval from arangogutierrez. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Lohon0

Lohon0 commented Jan 3, 2026

Hi

4 participants