Conversation

@kannon92
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it:

KEP to demonstrate how we could use the Workload API.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Copilot AI review requested due to automatic review settings December 22, 2025 17:21
@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Dec 22, 2025
@netlify

netlify bot commented Dec 22, 2025

Deploy Preview for kubernetes-sigs-jobset canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 8bc93ca |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-jobset/deploys/695be723eccf1000089b1488 |

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 22, 2025
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a Kubernetes Enhancement Proposal (KEP) for adding gang scheduling support to JobSet. Gang scheduling ensures that groups of pods are scheduled atomically, preventing resource deadlock in distributed AI/ML training and HPC workloads.
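The resource deadlock mentioned in the overview can be illustrated with a toy model. This is only a sketch under stated assumptions: the `job` type, the round-robin placement in `schedulePodByPod`, and the single integer `capacity` are hypothetical simplifications and do not reflect how kube-scheduler or the Workload API actually work.

```go
package main

import "fmt"

// job is a hypothetical workload that only makes progress once all its pods run.
type job struct {
	name string
	pods int
}

// schedulePodByPod places one pod per job in turn until capacity runs out,
// mimicking a scheduler with no gang awareness (an assumed simplification).
// It returns the number of pods placed per job.
func schedulePodByPod(capacity int, jobs []job) map[string]int {
	placed := make(map[string]int)
	for capacity > 0 {
		progressed := false
		for _, j := range jobs {
			if capacity > 0 && placed[j.name] < j.pods {
				placed[j.name]++
				capacity--
				progressed = true
			}
		}
		if !progressed {
			break // every job is fully placed
		}
	}
	return placed
}

// scheduleGang admits a job only if all of its pods fit at once.
func scheduleGang(capacity int, jobs []job) map[string]int {
	placed := make(map[string]int)
	for _, j := range jobs {
		if j.pods <= capacity {
			placed[j.name] = j.pods
			capacity -= j.pods
		}
	}
	return placed
}

func main() {
	jobs := []job{{"a", 3}, {"b", 3}}
	// Each job holds 2 of its 3 pods, so neither can start.
	fmt.Println(schedulePodByPod(4, jobs)) // map[a:2 b:2]
	// Job a is placed in full; job b waits without holding resources.
	fmt.Println(scheduleGang(4, jobs)) // map[a:3]
}
```

With four slots and two three-pod jobs, pod-by-pod placement leaves both jobs partially scheduled and blocked, while all-or-nothing admission lets one job run to completion first.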

Key Changes:

  • Adds KEP-969 metadata file defining the proposal status, milestones, and feature gates
  • Introduces comprehensive KEP documentation describing three gang scheduling modes: JobSetAsGang, JobSetGangPerReplicatedJob, and JobSetWorkloadTemplate
  • Proposes API design integrating with Kubernetes Workload API for scheduler interoperability

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| keps/969-GangScheduling/kep.yaml | Defines KEP metadata including authors, status, milestones, and feature gate configuration |
| keps/969-GangScheduling/README.md | Comprehensive KEP document covering motivation, API proposal, implementation details, test plans, and graduation criteria |

@kannon92
Contributor Author

cc @andreyvelich

Member

@andreyvelich andreyvelich left a comment

Thanks for this @kannon92!
Shall we close this KEP: #1068 to continue discussion in this PR?

cc @astefanutti @tenzen-y @GiuseppeTT @imreddy13

- **JobSetAsGang**: The entire JobSet is treated as a single gang (all-or-nothing scheduling)
- **JobSetGangPerReplicatedJob**: Each ReplicatedJob within the JobSet is treated as a separate gang
- **JobSetWorkloadTemplate**: Advanced mode allowing users to provide custom Workload specifications for fine-grained control
Member

If we plan to introduce a Workload template, why do we really need the other policies? Is it just to simplify UX?
I am wondering whether documentation that explains how to configure the Workload API would be sufficient.

Contributor Author

Yeah, it's simpler to have this handled for the user.

For Job, I don't necessarily see a reason to have WorkloadTemplate (plus the Workload API is v1alpha1 and the Job API is v1).

For JobSet, I figured we can use this API for more advanced features, but maybe the quick options (JobSetAsGang or JobSetGangPerReplicatedJob) are easier to support for simple use cases.
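To make the trade-off between the "quick" modes and the template mode concrete, here is a minimal Go sketch. Only the three mode names come from the KEP text quoted in this thread; the `GangPolicyMode` type name, the `ReplicatedJob` and `Gang` stand-in types, the `MinCount` field, and the `<jobset>-<replicatedJob>` naming are assumptions made for illustration, not the KEP's actual API.

```go
package main

import "fmt"

// GangPolicyMode is a hypothetical enum type; only its three values
// below are taken from the KEP text.
type GangPolicyMode string

const (
	JobSetAsGang               GangPolicyMode = "JobSetAsGang"
	JobSetGangPerReplicatedJob GangPolicyMode = "JobSetGangPerReplicatedJob"
	JobSetWorkloadTemplate     GangPolicyMode = "JobSetWorkloadTemplate"
)

// ReplicatedJob is a simplified stand-in: a name plus its total pod count.
type ReplicatedJob struct {
	Name string
	Pods int
}

// Gang is a hypothetical group of pods that must be scheduled all-or-nothing.
type Gang struct {
	Name     string
	MinCount int
}

// gangsFor derives gangs for the two quick modes. For the template mode
// the grouping would come from the user-supplied Workload spec instead,
// so this sketch returns an error there.
func gangsFor(jobSetName string, mode GangPolicyMode, rjobs []ReplicatedJob) ([]Gang, error) {
	switch mode {
	case JobSetAsGang:
		// One gang covering every pod in the JobSet.
		total := 0
		for _, rj := range rjobs {
			total += rj.Pods
		}
		return []Gang{{Name: jobSetName, MinCount: total}}, nil
	case JobSetGangPerReplicatedJob:
		// One gang per ReplicatedJob; the naming scheme is an assumption.
		gangs := make([]Gang, 0, len(rjobs))
		for _, rj := range rjobs {
			gangs = append(gangs, Gang{Name: jobSetName + "-" + rj.Name, MinCount: rj.Pods})
		}
		return gangs, nil
	case JobSetWorkloadTemplate:
		return nil, fmt.Errorf("mode %q takes its grouping from the user-supplied template", mode)
	default:
		return nil, fmt.Errorf("unknown gang policy mode %q", mode)
	}
}

func main() {
	rjobs := []ReplicatedJob{{Name: "driver", Pods: 1}, {Name: "workers", Pods: 4}}
	for _, mode := range []GangPolicyMode{JobSetAsGang, JobSetGangPerReplicatedJob} {
		gangs, err := gangsFor("multi-phase-training", mode, rjobs)
		fmt.Println(mode, gangs, err)
	}
}
```

The two quick modes need no user input beyond the mode itself, which is the UX simplification discussed above; anything more fine-grained would fall to the template.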

metadata:
  name: multi-phase-training
spec:
  gangPolicy:
Member

Given that the Workload API will be integrated with TAS and DRA in the future, how are we planning to integrate it into the JobSet API?

If we want to support templates via the JobSet API, do we really need to introduce the gangPolicy API?

One idea could be to introduce the spec.workloadTemplate API directly, similar to spec.volumeClaimTemplates in StatefulSet.

With that, we can gradually support more Workload features in the JobSet controller.

@ahg-g @Edwinhr716 Do we have any updates from the LWS API on how the Workload API will be supported?

cc @macsko @wojtek-t @erictune @44past4 @johnbelamaric

Contributor Author

That was sorta my thought on Workload Template for more advanced features.

Same for supporting DependsOn with GangScheduling.

The initial plan on the LWS side is to have an API field for when the user wants the controller to create the workload template. As new features are added, though, it does make more sense to add the workload template itself to the spec.

cc @helayoty

Member

It would be nice to keep the API consistent between JobSet and LWS, and introducing WorkloadTemplate makes sense to me.

Contributor Author

Hmm okay.

For Job I don't think we can have Workload Template so I was going to keep the enum structure there.

Contributor Author

Updated the KEP with this.

@kannon92
Contributor Author

> Thanks for this @kannon92! Shall we close this KEP: #1068 to continue discussion in this PR?
>
> cc @astefanutti @tenzen-y @GiuseppeTT @imreddy13

I'll wait for the author to get back to me before I close.

I created this as there was some discussion on JobSet in a google doc and I wanted to see what options we could have.

I also have a parallel KEP for Job.

@k8s-ci-robot
Contributor

@kannon92: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-jobset-test-e2e-main-1-34 | 8bc93ca | link | true | /test pull-jobset-test-e2e-main-1-34 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@imreddy13
Contributor

imreddy13 commented Jan 5, 2026

I haven't read this in detail yet (back from vacation) but I had a KEP for this a few months ago: #1068

We were actually planning to add it for 1.36 when Workload API will be beta after discussing with Wojtekt/Eric that 1.35 might be too early.

@ahg-g is separately designing for LWS too

@kannon92
Contributor Author

kannon92 commented Jan 5, 2026

> I haven't read this in detail yet (back from vacation) but I had a KEP for this a few months ago: #1068
>
> We were actually planning to add it for 1.36 when Workload API will be beta after discussing with Wojtekt/Eric that 1.35 might be too early.
>
> @ahg-g is separately designing for LWS too

Sounds good.

I have the Job KEP opened up.

I'll let you review this but we can close this out.

I wanted to draft a potential API just to see what gaps we could have with the current implementation.

// and link it to the JobSet for gang scheduling.
// The Workload's podSets define how pods should be grouped for gang scheduling.
// +optional
WorkloadTemplate *schedulingv1alpha1.WorkloadSpec `json:"workloadTemplate,omitempty"`
Member

I think that if we are going ahead with the workloadTemplate approach, we should move it directly under the JobSet spec, and we should allow users to set Workload object metadata such as labels and annotations.
The API might look as follows:

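type JobSetSpec struct {
	// WorkloadTemplate sits directly under the JobSet spec.
	// +optional
	WorkloadTemplate *WorkloadTemplate `json:"workloadTemplate,omitempty"`
}

type WorkloadTemplate struct {
	// Metadata (for example labels and annotations) for the Workload object.
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Spec is exported so that it is included in JSON serialization.
	Spec *schedulingv1alpha1.WorkloadSpec `json:"spec,omitempty"`
}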
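
As a rough illustration of how a controller might consume such a template, the sketch below copies the template's metadata and spec into a generated object. Everything here is an assumption for illustration: `ObjectMeta`, `WorkloadSpec`, and `Workload` are simplified local stand-ins (not the real `metav1` or `schedulingv1alpha1` types), the `Groups` field is a placeholder for whatever grouping the real spec carries, and naming the Workload after the JobSet is a guess rather than something the KEP states.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectMeta is a simplified stand-in for metav1.ObjectMeta.
type ObjectMeta struct {
	Name        string            `json:"name,omitempty"`
	Labels      map[string]string `json:"labels,omitempty"`
	Annotations map[string]string `json:"annotations,omitempty"`
}

// WorkloadSpec is a simplified stand-in for schedulingv1alpha1.WorkloadSpec;
// Groups is a placeholder field, not the real API.
type WorkloadSpec struct {
	Groups []string `json:"groups,omitempty"`
}

type WorkloadTemplate struct {
	ObjectMeta `json:"metadata,omitempty"`
	Spec       *WorkloadSpec `json:"spec,omitempty"`
}

type JobSetSpec struct {
	WorkloadTemplate *WorkloadTemplate `json:"workloadTemplate,omitempty"`
}

// Workload is a hypothetical stand-in for the object the controller creates.
type Workload struct {
	ObjectMeta `json:"metadata"`
	Spec       WorkloadSpec `json:"spec"`
}

// copyMap returns an independent copy so later edits to the template
// do not leak into the generated object.
func copyMap(in map[string]string) map[string]string {
	if in == nil {
		return nil
	}
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// buildWorkload materializes a Workload from the template, or returns nil
// when no template (or no template spec) is set.
func buildWorkload(jobSetName string, spec JobSetSpec) *Workload {
	tmpl := spec.WorkloadTemplate
	if tmpl == nil || tmpl.Spec == nil {
		return nil
	}
	w := &Workload{Spec: WorkloadSpec{Groups: append([]string(nil), tmpl.Spec.Groups...)}}
	w.Name = jobSetName // assumed naming convention
	w.Labels = copyMap(tmpl.Labels)
	w.Annotations = copyMap(tmpl.Annotations)
	return w
}

func main() {
	spec := JobSetSpec{WorkloadTemplate: &WorkloadTemplate{
		ObjectMeta: ObjectMeta{Labels: map[string]string{"team": "ml"}},
		Spec:       &WorkloadSpec{Groups: []string{"workers"}},
	}}
	out, err := json.MarshalIndent(buildWorkload("multi-phase-training", spec), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the embedded metadata struct carries an explicit `json:"metadata"` tag, `encoding/json` serializes it as a nested `metadata` object rather than flattening its fields, which matches the shape users would write in YAML.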

@kannon92
Contributor Author

kannon92 commented Jan 7, 2026

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 7, 2026

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/documentation Categorizes issue or PR as related to documentation. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

5 participants