Possible KEP for gang scheduling #1111
✅ Deploy Preview for kubernetes-sigs-jobset canceled.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kannon92
Pull request overview
This PR introduces a Kubernetes Enhancement Proposal (KEP) for adding gang scheduling support to JobSet. Gang scheduling ensures that groups of pods are scheduled atomically, preventing resource deadlock in distributed AI/ML training and HPC workloads.
Key Changes:
- Adds KEP-969 metadata file defining the proposal status, milestones, and feature gates
- Introduces comprehensive KEP documentation describing three gang scheduling modes: JobSetAsGang, JobSetGangPerReplicatedJob, and JobSetWorkloadTemplate
- Proposes API design integrating with Kubernetes Workload API for scheduler interoperability
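
To make the difference between the first two modes concrete, here is a minimal sketch of how pods might be grouped under each mode. Only the three mode names come from the PR overview; the `ReplicatedJob`, `Gang`, and `gangsFor` names, the `PodsPerJob` field, and the gang naming scheme are hypothetical stand-ins for illustration, not the KEP's API.

```go
package main

import "fmt"

// GangPolicyMode values mirror the mode names listed in the PR overview.
type GangPolicyMode string

const (
	JobSetAsGang               GangPolicyMode = "JobSetAsGang"
	JobSetGangPerReplicatedJob GangPolicyMode = "JobSetGangPerReplicatedJob"
	JobSetWorkloadTemplate     GangPolicyMode = "JobSetWorkloadTemplate"
)

// ReplicatedJob is a hypothetical, simplified stand-in for a JobSet replicated job.
type ReplicatedJob struct {
	Name       string
	Replicas   int // number of Jobs
	PodsPerJob int // pods each Job runs
}

// Gang is a hypothetical group of pods that must be scheduled all-or-nothing.
type Gang struct {
	Name     string
	MinCount int
}

// gangsFor returns the gangs implied by a mode (assumed semantics, not the KEP's).
func gangsFor(mode GangPolicyMode, jobSetName string, rjs []ReplicatedJob) ([]Gang, error) {
	switch mode {
	case JobSetAsGang:
		// One gang covering every pod in the JobSet.
		total := 0
		for _, rj := range rjs {
			total += rj.Replicas * rj.PodsPerJob
		}
		return []Gang{{Name: jobSetName, MinCount: total}}, nil
	case JobSetGangPerReplicatedJob:
		// One gang per ReplicatedJob.
		gangs := make([]Gang, 0, len(rjs))
		for _, rj := range rjs {
			gangs = append(gangs, Gang{
				Name:     jobSetName + "-" + rj.Name,
				MinCount: rj.Replicas * rj.PodsPerJob,
			})
		}
		return gangs, nil
	case JobSetWorkloadTemplate:
		// Grouping is defined by the user-supplied Workload spec, not derived here.
		return nil, fmt.Errorf("mode %q: grouping comes from the user-supplied Workload spec", mode)
	default:
		return nil, fmt.Errorf("unknown gang policy mode %q", mode)
	}
}

func main() {
	rjs := []ReplicatedJob{
		{Name: "driver", Replicas: 1, PodsPerJob: 1},
		{Name: "workers", Replicas: 2, PodsPerJob: 4},
	}
	for _, mode := range []GangPolicyMode{JobSetAsGang, JobSetGangPerReplicatedJob} {
		gangs, err := gangsFor(mode, "train", rjs)
		fmt.Println(mode, gangs, err)
	}
}
```

With this sample input, the first mode yields a single gang of nine pods, while the second yields a one-pod gang for the driver and an eight-pod gang for the workers.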
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| keps/969-GangScheduling/kep.yaml | Defines KEP metadata including authors, status, milestones, and feature gate configuration |
| keps/969-GangScheduling/README.md | Comprehensive KEP document covering motivation, API proposal, implementation details, test plans, and graduation criteria |
andreyvelich
left a comment
keps/969-GangScheduling/README.md
Outdated
- **JobSetAsGang**: The entire JobSet is treated as a single gang (all-or-nothing scheduling)
- **JobSetGangPerReplicatedJob**: Each ReplicatedJob within the JobSet is treated as a separate gang
- **JobSetWorkloadTemplate**: Advanced mode allowing users to provide custom Workload specifications for fine-grained control
If we plan to introduce Workload template, why do we really need to have other policies? Is it just to simplify UX?
I am wondering if documentation which explains how to configure Workload API is sufficient?
Yeah, it's simpler to have this handled for the user.
For Job, I don't necessarily see a reason to have WorkloadTemplate (plus the Workload API is v1alpha1 and the Job API is v1).
For JobSet, I figured we can use this API for more advanced features, but maybe the quick options are easier to support for simple use cases (JobSetAsGang or JobSetGangPerReplicatedJob).
metadata:
  name: multi-phase-training
spec:
  gangPolicy:
Given that the Workload API will be integrated with TAS and DRA in the future, how are we planning to integrate it into the JobSet API?
- KEP-5710: Add initial KEP docs for workload-aware preemption kubernetes/enhancements#5711
- KEP-5732: Topology-aware workload scheduling kubernetes/enhancements#5733
If we want to support templates via the JobSet API, do we really need to introduce the gangPolicy API?
One idea could be to introduce the spec.workloadTemplate API directly, similar to spec.volumeClaimTemplates in StatefulSet.
With that, we can gradually support more Workload features in the JobSet controller.
@ahg-g @Edwinhr716 Do we have any updates from the LWS API on how the Workload API will be supported?
That was sorta my thought on Workload Template for more advanced features.
Same for supporting DependsOn with GangScheduling.
Initial plan on the LWS side is to have an API field when the user wants the controller to create the workload template. Though as new features are added, it does make more sense to add the workload template itself to the spec.
cc @helayoty
It would be nice to keep the API consistent between JobSet and LWS, and introducing WorkloadTemplate makes sense to me.
Hmm okay.
For Job I don't think we can have Workload Template so I was going to keep the enum structure there.
Updated the KEP with this.
I'll wait for the author to get back to me before I close. I created this as there was some discussion on JobSet in a Google doc and I wanted to see what options we could have. I also have a parallel KEP for Job.
Sounds good. I have the Job KEP opened up. I'll let you review this, but we can close this out. I wanted to draft a potential API just to see what gaps we could have with the current implementation.
// and link it to the JobSet for gang scheduling.
// The Workload's podSets define how pods should be grouped for gang scheduling.
// +optional
WorkloadTemplate *schedulingv1alpha1.WorkloadSpec `json:"workloadTemplate,omitempty"`
I think, if we are going ahead with workloadTemplate approach, we should move it directly under JobSet spec, and we should allow users to set Workload object metadata like labels and annotations.
The API might look as follows:
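type JobSetSpec struct {
	// WorkloadTemplate describes the Workload created for this JobSet.
	// +optional
	WorkloadTemplate *WorkloadTemplate `json:"workloadTemplate,omitempty"`
}

type WorkloadTemplate struct {
	// Metadata (labels, annotations) to set on the Workload object.
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// Spec is the Workload spec; the field is exported so that it is serialized.
	Spec *schedulingv1alpha1.WorkloadSpec `json:"spec,omitempty"`
}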
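
As a rough illustration of what a controller could do with such a template, the sketch below builds a Workload object from a template's metadata and spec. It is self-contained, so `ObjectMeta`, `WorkloadSpec`, and `Workload` here are simplified hypothetical stand-ins rather than the real `metav1` and `schedulingv1alpha1` types; `buildWorkload` and the choice to name the Workload after the JobSet are likewise assumptions.

```go
package main

import "fmt"

// ObjectMeta is a hypothetical, trimmed-down stand-in for metav1.ObjectMeta.
type ObjectMeta struct {
	Name        string
	Labels      map[string]string
	Annotations map[string]string
}

// WorkloadSpec is a hypothetical stand-in for schedulingv1alpha1.WorkloadSpec.
type WorkloadSpec struct {
	PodSets []string // placeholder for the real pod grouping definition
}

// WorkloadTemplate mirrors the shape suggested in the review comment.
type WorkloadTemplate struct {
	ObjectMeta
	Spec *WorkloadSpec
}

// Workload is a hypothetical stand-in for the object the controller would create.
type Workload struct {
	ObjectMeta
	Spec WorkloadSpec
}

// copyMap returns a shallow copy so the Workload does not alias the template.
func copyMap(in map[string]string) map[string]string {
	if in == nil {
		return nil
	}
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// buildWorkload materializes a Workload from the template, or returns nil
// when no template (or no spec) was provided.
func buildWorkload(jobSetName string, tmpl *WorkloadTemplate) *Workload {
	if tmpl == nil || tmpl.Spec == nil {
		return nil
	}
	return &Workload{
		ObjectMeta: ObjectMeta{
			Name:        jobSetName, // assumption: Workload is named after the JobSet
			Labels:      copyMap(tmpl.Labels),
			Annotations: copyMap(tmpl.Annotations),
		},
		Spec: WorkloadSpec{PodSets: append([]string(nil), tmpl.Spec.PodSets...)},
	}
}

func main() {
	tmpl := &WorkloadTemplate{
		ObjectMeta: ObjectMeta{Labels: map[string]string{"team": "ml"}},
		Spec:       &WorkloadSpec{PodSets: []string{"driver", "workers"}},
	}
	w := buildWorkload("multi-phase-training", tmpl)
	fmt.Println(w.Name, w.Labels, w.Spec.PodSets)
}
```

Copying the labels and pod sets keeps later edits to the generated Workload from leaking back into the JobSet's template.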
/hold
What type of PR is this?
/kind documentation
What this PR does / why we need it:
KEP to demonstrate how we could use workload API
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?