KEP-4671: Introduce Workload Scheduling Cycle, extend basic policy with desiredCount #5730
base: master
Conversation
macsko
commented
Dec 10, 2025
- One-line PR description: Update KEP-4671 for beta in v1.36. Introduce Workload Scheduling Cycle phase.
- Issue link: Gang Scheduling Support in Kubernetes #4671
- Other comments: This PR shows an initial idea of the new workload scheduling phase. Further changes related to graduation to beta will be added gradually.
Alternative 1 (Modify sorting logic):
Modify the sorting logic within the existing `PriorityQueue` to put all pods
We would need to modify it once we have workload priority, right? Otherwise we can keep the current sorting algorithm and pull all gang members from the queue once we encounter the first member of the gang (and switch to workload scheduling), right?
Right, we could keep the current algorithm, but since the gang members can be in any internal queue, removing them from these queues would require some effort (running PreEnqueue, registering metrics, filling in internal structures used for queuing hints, etc.). In fact, this effort could instead go toward properly implementing queueing with a more future-proof alternative.
If we decide that, when there is no workload priority, priority = min(pod priorities) (as Wojtek suggested), then this approach becomes a big problem.
Yes, if we choose that priority calculation, then this alternative is just a bad idea
I'd also not modify the queue sorting, but "remove" PodGroup pods from it when the PodGroup itself is unschedulable, by simply making them unschedulable.
We can't "simply" make them unschedulable. What about event handling? What should make the PodGroup schedulable after failure? What to do with the observability (metrics, logs, etc.) that are aligned to the pod-by-pod queueing now? It's hard to say whether we could easily do that or not. I need to analyze the code deeper and come back with some thoughts.
Anyway, the modification of the queue sorting will be needed when the pod group priority will be introduced by the workload-aware preemption KEP. We should consider that when designing the queueing part of this KEP.
erictune
left a comment
This looks great overall.
I have one proposal to consider renaming part of the API, and some clarifying questions.
// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
The term PodGroup has become ambiguous. It can mean:
1. a list item in Workload.podGroups
2. a specific set of pods having the same PodGroupReplicaKey
These are often the same thing but not always.
If we want to be clear what we mean, we can either:
- Always call (1) a PodGroup and always call (2) a PodGroupReplica, including in the case when podGroupReplicaKey is nil
- Or, rename PodGroup to PodGroup Template, and call (2) a PodGroup
This may seem pedantic, but when reviewing this KEP, I felt that current double meaning made the text imprecise in a number of places. We do have a chance to fix the naming with v1beta1, but I don't think we are supposed to rename when we go to GA.
My personal preference: when talking about (2), I would just call that either PodGroupReplica or alternatively (which is what I was using) PodGroup instance. With that, I wouldn't rename it - but these are my personal preferences.
If we rename it to PodGroupReplica, are we planning to introduce PodSubGroup under it, as we discussed here: #5711 (comment) ? And what about PodSubGroupReplica?
Agree with andreyvelich, PodSubGroupReplica and PodSetReplica are confusing.
This would be less confusing:
- 1 Workload object in the API server names exactly one group of pods at runtime.
- 1 Workload.podGroupTemplates[i] is a template for N independent PodGroups which are distinguished by PodGroupReplicaKey
Later, if PodSubGroup is added:
- 1 Workload.podGroupTemplate[i].podSubGroup[j] names exactly one PodSubGroup within a single group of pods (PodGroup) defined as above.
Later, if PodSet is added under PodSubGroup:
- 1 Workload.podGroupTemplate[i].podSubGroup[j].podSet[k] names exactly one PodSet within PodSubGroup, defined as above.
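For illustration, a minimal Go sketch of how that naming could look as API types; every type and field name below is hypothetical, following the proposal above rather than the merged API:

```go
// Hypothetical shape of the proposal above: templates live in the Workload,
// and "PodGroup" is reserved for the runtime set of pods sharing a
// PodGroupReplicaKey. Illustrative only; not the merged API.
package workloadapi

type Workload struct {
	Name string
	// Each entry is a template; N independent PodGroups may exist per
	// template at runtime, distinguished by their PodGroupReplicaKey.
	PodGroupTemplates []PodGroupTemplate
}

type PodGroupTemplate struct {
	Name string
	// Later, if sub-groups and pod sets are added, they would nest here:
	// PodSubGroups []PodSubGroupTemplate
}
```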
This needs a longer discussion than belongs in the review comments of this PR.
Suggestion to KEP author:
- Keep Workload Scheduling Cycle proposal in this PR.
- Keep Basic policy enhancements in this PR.
- Remove advancement to beta.
Let's all as reviewers approve the above. And we can take a week to discuss the "PodGroup as separate object" proposal, which may affect the alpha/beta status of Workload.
Agree with @johnbelamaric and @erictune that we should separate discussion whether we need separate object for PodGroup.
Remove advancement to beta.
Is my understanding correct that we will create another KEP for beta graduation in that case?
I'm going back and forth with whether we should separate PodGroup into a separate object (in fact my original counter-proposal was going that way though for different reasons https://docs.google.com/document/d/13UkLjVMj_edMh7biqVU6SVyNNTIfGAT35p6-pNsF5AY/edit?resourcekey=0-dqUEiwiXWICLwAg6Tqkupw&tab=t.9le0fmf90j3w#heading=h.vf43rjyfidc6 )
I agree it's the last moment to change that before going to Beta.
But I also agree that pretty much neither of the proposed changes (basicPolicy, workload scheduling cycle, scheduling algorithm, ...) depend on this decision and can proceed independently.
So I'm heavily +1 on Eric's suggestion above to focus this PR on those changes but still leave the KEP in Alpha after this PR.
And have a separate PR that will be focused on the API itself (it can't be a separate KEP, but it can be a separate PR and discussion).
Also - we should probably start a dedicated document for that to clearly describe Pros/Cons of both options to make the more data-driven decision.
@macsko - I'll be OOO for the next 2 weeks; would you be able to start such a doc? I'm happy to contribute once I'm back.
Okay, so I'll remove the beta graduation part from this PR and the discussion about API can be moved to a new document
So I went ahead and started a doc: https://docs.google.com/document/d/1zVdNyMGuSi861Uw16LAKXzKkBgZaICOWdPRQB9YAwTk/edit?resourcekey=0-bD8cjW_B6ZfOpSGgDrU6Mg&tab=t.0#heading=h.c4vrtnmf9f4o
It's only a starter and requires a lot of work so I would appreciate all contributions, especially given I will be OOO until Jan 7th
// WorkloadSpec describes a workload in a portable way that scheduler and related
// tools can understand.
// WorkloadMaxPodGroups is the maximum number of pod groups per Workload.
This is the number of items in the list PodGroups
FWIW - these updates are based on what was merged in the code.
I'm happy for these to be updated, but for these specifically maybe we should open a dedicated PR in k/k.
Agree with @wojtek-t. We can defer API description discussions to the k/k PR with the API graduation (when it is created).
ControllerRef *TypedLocalObjectReference
// PodGroups is the list of pod groups that make up the Workload.
// The maximum number of pod groups is 8. This field is immutable.
But the maximum number of PodGroup Replicas is not limited.
That's right. It would even be hard to enforce such a limitation in the current form of replication, since it's based solely on pods' workloadRefs.
I was suggesting making this clearer in the code comment.
Splitting PodGroup into its own resource would make it a lot clearer that there is a distinction between the template and the instance, and that the limit applies to defined templates, not instances.
When the scheduler pops a Pod from the active queue, it checks if that Pod
belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
initiates the Workload Scheduling Cycle.
If a pod belongs to an already scheduled PodGroup, it is not clear what to do. We could:
- Hold it back, on the assumption that it is a replacement for an already scheduled pod. When `minCount` additional pods show up, then handle all those at once.
- Treat it as if it is a "gang scale up", and try to place it too (on a best-effort basis). If we do this, does it go through the Workload cycle, or just the normal pod-at-a-time path?
I am thinking about races that could happen when a workload is failing and getting pods replaced. And I am thinking about the case when a workload wants to tolerate a small number of failures by having actual count > minCount.
Taking a step back from the implementation, we should think what we really need to do conceptually in that case.
If the pod is part of PodGroup, then what we conceptually want is to schedule that in the context of its PodGroup instance. I think there are effectively three cases here:
- this PodGroup instance doesn't satisfy its minCount now, but with this pod it will satisfy it
- this PodGroup instance doesn't satisfy its minCount now and with that pod it won't satisfy it either
- this PodGroup instance already satisfies its minCount even without this pod
It's worth noting at this point that, with topology-aware scheduling introduced, a PodGroup instance could have TAS requirements too, which matters when thinking about what we should do with it.
The last point in my opinion means that we effectively should always go through the Workload cycle, because we always want to consider it in the context of the whole workload.
The primary question is whether we want to kick this workload cycle off with an individual pod or wait for more, and if so, based on what criteria.
I think that the exact criteria will be evolving in the future (I can definitely imagine it depending on the "preemption unit" that we're introducing in KEP-5710). So for now, I wouldn't try to settle on the final solution.
I would suggest starting with:
- always go through the Workload cycle (to keep the whole podGroup instance in mind as context)
- for now, always schedule the individual pod for the sake of making progress here (we will most probably adjust it in the future, but it will be a pretty localized change and I think it's fine)
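A rough Go sketch of that starting point; every name here is hypothetical and only meant to show the suggested control flow:

```go
package schedulingsketch

// Minimal stand-ins for the real types; illustrative only.
type Pod struct {
	Name        string
	WorkloadRef *WorkloadRef // proposed Pod field referencing its Workload/PodGroup
}

type WorkloadRef struct{ Workload, PodGroup string }

type PodGroupInstance struct {
	Key           string
	MinCount      int
	ScheduledPods int
}

// routePoppedPod sketches the suggested starting point: any pod with a
// workloadRef is handled in a workload scheduling cycle, in the context of its
// PodGroup instance, even when that instance already satisfies minCount.
func routePoppedPod(
	pod *Pod,
	instanceOf func(*Pod) *PodGroupInstance,
	runWorkloadCycle func(*PodGroupInstance, []*Pod),
	runPodByPod func(*Pod),
) {
	if pod.WorkloadRef == nil {
		runPodByPod(pod) // ordinary pod: standard pod-by-pod cycle
		return
	}
	// Whether the instance is below, at, or above minCount (the three cases
	// above), schedule the individual pod within the workload cycle for now.
	runWorkloadCycle(instanceOf(pod), []*Pod{pod})
}
```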
It seems that the easiest way is to do what @wojtek-t described, i.e., go best effort (as for the basic policy), so take as many pods as we have available, and try to schedule them in the workload cycle. Only pods that have passed the workload cycle will be able to move on to the pod-by-pod cycle. In the future, we can try to create more intelligent grouping, but for now, let's focus on delivering a good enough implementation.
What is the plan if PodGroup has been scheduled with TAS constraints based on minCount, and a new Pod comes in that is part of the PodGroup but doesn't fit in the TAS constraints?
That's a good point. I suppose, the new Pod should go through a workload scheduling cycle and the TAS algorithm there should take the scheduled pods from a gang into consideration. If the pod doesn't fit, it will remain unschedulable until something changes in the cluster.
They are then effectively "reserved" on those nodes in the
scheduler's internal cache. Pods are then pushed to the
active queue (restoring their original timestamps to ensure fairness)
to pass through the standard scheduling and binding cycle,
IIUC, there are some cases where a nominated pod fails to pass through the standard scheduling cycle:
- differences in the algorithm between workload cycle and standard cycle
- refreshed snapshot changes node eligibility.
- higher priority pod jumps ahead in active queue
Yes, these are the cases, but we are okay when they occur, as long as the new placement is still valid. If it isn't anymore, scheduling of the gang will fail (at WaitOnPermit) and the gang will be retried.
There should be no differences between workload and standard cycle in terms of pods feasibility
wojtek-t
left a comment
Added a few comments, but they are pretty localized - overall this is a great proposal, pretty aligned with how I was thinking about it too.
Can report dedicated logs and metrics with less confusion to the user.
* *Cons:* Significant and non-trivial architectural change to the scheduling queue
and `scheduleOne` loop.
<<[/UNRESOLVED]>>
I would value opinions from people more hands-on with the scheduler code recently more than my own, but on paper the third alternative seems preferable to me:
- Alternative 1 - it will be extremely hard to reason about it if pods sit in different queues (backoff, active, ...) independently, and the evolution point is extremely important imho - we know we will be evolving it a lot and preparing for that is super important.
- Alternative 2 - I share the concerns about corner cases (pod deletion is exactly what I have in mind). Once we get to rescheduling cases (new pods appear but they should trigger removing and recreating some of the existing pods), it will get even more complicated with even harder corner cases. Given that we know that rescheduling is super important, I'm also reluctant about it.
- Alternative 3 - It's clearly the cleanest option. The primary drawback of needing non-trivial changes is imho justified, given we know that we will be evolving it significantly in the future.
So I have quite a strong preference towards (3) at this point, but I'm happy to be challenged by implementation-specific counter-arguments.
That was my thought as well. I think the effort involved in adding workload queuing will be comparable to modifying the code for the previous alternatives, but maybe I don't see some significant drawbacks.
For the third alternative, do we want to introduce only the queue for PodGroups? If so, then we have the same problem of pulling pods from the backoff and unschedulable queues, right? Or do we mean by "workload queue" a structure that will hold all the pods related to workload scheduling?
Let's say that we add a Pod that is part of a group but does not make us meet the minCount. Should it land in the unschedulable queue as it does right now? Or in some additional structure?
If the pod group failed scheduling, where do we put its pods? We need to have them somewhere, so we can check them against cluster events in case some event makes a pod schedulable and thus potentially makes the whole pod group schedulable.
I think one way might be to introduce both:
- a queue for pod groups (let's say `podGroupsQ`). This will contain the pod groups that can be popped for workload scheduling.
- a data structure for unscheduled pods (let's say `workloadPods`) that have to wait for workload scheduling to finish before proceeding with their pod-by-pod cycles.
I imagine the pod transitions (for pods that have a workloadRef) would be:
pod is created -> when PreEnqueue passes for a pod, add it to workloadPods, otherwise add it to unschedulablePods -> [for a gang pod group] if >= minCount pods for the group are in workloadPods, the group is added to podGroupsQ -> the pod group is processed and a workload scheduling cycle is executed
When the workload cycle finishes successfully:
the pod gets the NNN and is moved to the activeQ -> the pod is popped and goes through its pod-by-pod scheduling cycle -> ...
When the workload cycle fails:
the pod is moved to unschedulablePods, where it waits for a cluster change to happen -> when the change happens for this pod or another one from its pod group, the pod(s) are moved to workloadPods -> the workload manager detects the group should be retried and adds the group to podGroupsQ -> processing continues as previously
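A small Go sketch of those two structures and the gang transition into `podGroupsQ` (the names `podGroupsQ` and `workloadPods` come from the comment above; everything else is hypothetical):

```go
package main

import "fmt"

// Illustrative only: the two structures sketched above and the transition
// "when >= minCount pods of a gang group are in workloadPods, the group
// becomes poppable for a workload scheduling cycle".
type workloadQueues struct {
	workloadPods map[string][]string // group key -> pods waiting for the workload cycle
	podGroupsQ   []string            // group keys ready to be popped for a workload cycle
	minCount     map[string]int      // gang minCount per group key
}

// addPod records a pod that passed PreEnqueue; once its group gathers
// minCount pods, the group is enqueued exactly once.
func (q *workloadQueues) addPod(groupKey, podName string) {
	q.workloadPods[groupKey] = append(q.workloadPods[groupKey], podName)
	if len(q.workloadPods[groupKey]) == q.minCount[groupKey] {
		q.podGroupsQ = append(q.podGroupsQ, groupKey)
	}
}

// popGroup returns the next group to run a workload scheduling cycle for.
func (q *workloadQueues) popGroup() (string, []string, bool) {
	if len(q.podGroupsQ) == 0 {
		return "", nil, false
	}
	key := q.podGroupsQ[0]
	q.podGroupsQ = q.podGroupsQ[1:]
	return key, q.workloadPods[key], true
}

func main() {
	q := &workloadQueues{
		workloadPods: map[string][]string{},
		minCount:     map[string]int{"job-a/workers": 2},
	}
	q.addPod("job-a/workers", "worker-0")
	q.addPod("job-a/workers", "worker-1") // the group reaches minCount here
	key, pods, _ := q.popGroup()
	fmt.Println(key, pods) // job-a/workers [worker-0 worker-1]
}
```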
Again, alternative 3 is the obvious choice IMO if PodGroup is a separate resource.
In Koordinator, I used Alternative 2. The desired effect is that, in ActiveQ/BackoffQ, a PodGroup has only one item, and this item can carry all queuing attributes (Priority, LastScheduleTime, BackoffTime, etc.). Whether this item is a representative pod or a real PodGroup is less important.
Another important point is that when a representative Pod or PodGroup is dequeued, its sibling Pods are also dequeued. In Koordinator, we implemented our own NextPod (koordinator-sh/koordinator#2417) to achieve this.
Regarding this KEP, I feel that a fusion of Alternative 2 and Alternative 3 might be a better solution:
- Use a QueuedPodGroupInfo that aggregates the queuing attributes of all member Pods to flow PodGroups between ActiveQ, BackoffQ, and UnschedulableQ, allowing the previous QueueHint mechanism to seamlessly integrate with PodGroups.
- Use a separate Map to store the Pods belonging to QueuedPodGroupInfo for easy indexing.
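A minimal sketch of what such a fused queue item could carry (field names are hypothetical, loosely following the description above, not Koordinator's actual types):

```go
package queuesketch

import "time"

// QueuedPodGroupInfo sketches a single queue item per PodGroup that aggregates
// the queuing attributes of its member pods, so the existing
// ActiveQ/BackoffQ/UnschedulableQ machinery and QueueHints can operate on it.
type QueuedPodGroupInfo struct {
	GroupKey       string
	Priority       int32     // e.g. min of member priorities, per the discussion above
	Timestamp      time.Time // e.g. earliest member enqueue time, to preserve fairness
	Attempts       int
	BackoffExpires time.Time
}

// memberIndex is the separate map mentioned above, keeping the pods themselves
// outside the queue item for easy indexing.
type memberIndex map[string][]string // group key -> member pod names

// aggregate recomputes the group-level attributes from its members
// (assumes non-empty inputs).
func aggregate(priorities []int32, enqueued []time.Time) (minPrio int32, earliest time.Time) {
	minPrio, earliest = priorities[0], enqueued[0]
	for _, p := range priorities[1:] {
		if p < minPrio {
			minPrio = p
		}
	}
	for _, t := range enqueued[1:] {
		if t.Before(earliest) {
			earliest = t
		}
	}
	return minPrio, earliest
}
```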
Consider that we should have PodGroups without pods at some point in time and we should be able to schedule them somehow.
However, before we get there, I'd consider scheduling a PodGroup simply whenever we encounter the first pod that refers to it and the PodGroup hasn't been scheduled yet. It's similar to option 1, but without modifying the sorting function.
I think it would be the simplest implementation, as all pods stay in the active queue unless an attempt to schedule a PodGroup they belong to makes the PodGroup unschedulable. Obviously the question is when to reconsider an unschedulable PodGroup. At the beginning I'd make it periodically schedulable again (after some timeout), without defining a PodGroup queue yet.
I'm not sure yet how Workload-aware preemption may interact with it.
It's similar to option 1 but without modifying sorting function.
Added it as an "Alternative 0" to the proposal.
* If preemption is successful, the pod is nominated on the selected node.
* If preemption fails, the pod is considered unscheduled for this cycle.
The phase can effectively stop once `minCount` pods have a placement,
From the POV of optimizing locally, we should continue scheduling all pods (and not stop after minCount) - what changes is that the unschedulability of further pods shouldn't make the whole group unschedulable.
Given that we have minCount at the whole PodGroup level, the local optimization is actually what we should probably do anyway (it would be a different story if we would have a separate minCount per signature, but it's not the case).
If the number of pods in the PodGroup is larger than minCount, there is also a question of whether we should stop at the first pod scheduling failure, or maybe only when the number of failures is larger than the difference between the size of the PodGroup and minCount.
My intuition is that if we encounter a pod scheduling failure, we should probably skip scheduling of the remaining pods in the given sub-group, treating them as unschedulable, but if there are more sub-groups and there is still a chance that we will be able to schedule minCount pods, we should probably try scheduling pods from the subsequent sub-groups.
My intuition is that if we encounter a pod scheduling failure we should probably skip scheduling of the remaining pods in a given sub-group treating them as unschedulable but if there are more sub-groups and if there is still a chance that we will be able to schedule minCount of pods we should probably try scheudling pods from the sebsequent sub-groups.
I agree and that was the idea. Added it explicitly to the KEP
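For concreteness, a small Go sketch of that loop (a "sub-group" here is a set of pods sharing an equivalence signature; all names are hypothetical):

```go
package gangsketch

// scheduleGroup sketches the behavior described above: on the first failure
// within a sub-group, skip its remaining pods, but keep trying subsequent
// sub-groups as long as reaching minCount is still possible.
type pod struct{ name string }

func scheduleGroup(subGroups [][]pod, minCount int, fits func(pod) bool) (placed int, ok bool) {
	// Upper bound on how many pods could still end up placed.
	upperBound := 0
	for _, sg := range subGroups {
		upperBound += len(sg)
	}
	for _, sg := range subGroups {
		for i, p := range sg {
			if fits(p) {
				placed++
				continue
			}
			// First failure in this sub-group: treat its remaining pods
			// (including this one) as unschedulable and move on.
			upperBound -= len(sg) - i
			break
		}
		if upperBound < minCount {
			// Even if every untried pod fit, minCount is unreachable,
			// so the whole group is unschedulable in this cycle.
			return placed, false
		}
	}
	return placed, placed >= minCount
}
```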
sanposhiho
left a comment
/assign
// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
@thockin - I would appreciate your feedback from the API approver perspective.
the standard pod-by-pod scheduling cycle.
When the scheduler pops a Pod from the active queue, it checks if that Pod
belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
As discussed in the Basic policy section above, I actually think that all pods that belong to workloads should go through this phase (whether they form gangs or just are from basic policy).
I acknowledge that for basic it will be kind of best-effort (if more pods were created we will get all of them, if we only observed 1 it will be just one), but that better opens doors for future extensions once we have pod templates etc. defined in Workload.
+1
I'd say that Workload scheduling is a phase that schedules PodGroups. Depending on what policy type the group has, the logic is different.
In the case of the Basic policy, the group becomes scheduled unconditionally; still, pods belonging to that group cannot be scheduled until the PG itself is scheduled.
Removed "Gang" from this sentence. I think we are aligned that the basic-policy pods should be scheduled by this phase
*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
I suggest resolving that (and potentially mentioning the risk of non-trivial changes in the Risks section).
2. The scheduler nominates the victims for preemption and the gang Pod
for scheduling on their place. This way, the gang can be attempted
without making any intermediate disruptions to the cluster.
* If the quorum is met, the scheduler continues scheduling the gang Pods pod-by-pod.
This should be adjusted once we settle down on the details in #5711
Currently these are not aligned :)
[It's a comment for myself too]
johnbelamaric
left a comment
I really think we should take the opportunity this release to shift to top-level PodGroup instances, with the templates living in workload. I believe this clears up a number of things and will make building on top of PodGroups much easier for the rest of the project.
// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
Embedding PodGroups in Workload as done today makes it much harder for clients to track lifecycle events - creation, deletion, resize, etc. of individual PodGroup replicas. And given that (AIUI) individual replica keys are not all stored in the workload, it may not even be possible to track PodGroup replica creation and deletion today without some sort of watch on Pods and inferring them from that.
I don't think it's too late to pull PodGroup out into its own top-level resource, but it is our last chance to do so. I think it's the better design and we should take this opportunity. With that update, clients can use all the standard API machinery rather than something that calculates deltas from changes in workload objects and/or watching Pods.
In that case Workload may contain PodGroupTemplates but the actual instances of PodGroups would be separate objects with a reference back to the workload and template. So we clearly separate lifecycle of the policy configuration and instances of groups based on that policy configuration.
This then would probably need to change from a WorkloadRef to a PodGroup resource name.
Barring that, I prefer what Eric suggests above.
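A rough Go sketch of that split (PodGroup as a hypothetical top-level resource referencing its Workload and template; none of these types exist today):

```go
package splitsketch

// Illustrative only: templates stay in the Workload, while each PodGroup
// instance is its own top-level object, so clients can watch creation,
// deletion, and resize with standard API machinery.
type WorkloadSpec struct {
	PodGroupTemplates []PodGroupTemplate // policy configuration only, no instances
}

type PodGroupTemplate struct {
	Name     string
	MinCount int32
}

// PodGroupSpec belongs to the hypothetical top-level instance object.
type PodGroupSpec struct {
	WorkloadName string // reference back to the owning Workload
	TemplateName string // template this instance was created from
	ReplicaKey   string // distinguishes instances created from the same template
}

type PodGroupStatus struct {
	ScheduledCount int32 // example status field clients might watch
}

// Pods would then reference the PodGroup object by name instead of carrying a
// WorkloadRef, as noted above.
type PodWorkloadReference struct {
	PodGroupName string
}
```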
*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
+1
Complex true workload controllers should implement any orchestration themselves. In other words, if there really are, say, dependencies between PodGroups, I think we should leave that complexity in the upper layer controller, and the scheduler just deal in PodGroups. So the controller would wait to create the second PodGroup until after it made sure the first one got scheduled. This does mean we don't have full workload atomic scheduling, only PodGroup atomic scheduling. But that level of complexity probably needs something more like Reservation.
end-to-end Pod scheduling flow, it is planned to place this new phase *before*
the standard pod-by-pod scheduling cycle.
When the scheduler pops a Pod from the active queue, it checks if that Pod
If PodGroup is a separate resource, we can watch for unscheduled PodGroups, and defer scheduling any Pod that references a PodGroup until that PodGroup has been scheduled. I think that's a cleaner design than going through the Pod indirection.
So, are we making a new PodGroup API that the API server would see? Let's clarify what we mean.
Note that the implementation of this specific logic might follow in a Beta stage
of this API field.
#### Delayed Preemption
The deletion of Victim takes some time due to resource release and other reasons. During this period, how can the binding process of the Preemptor be blocked to prevent the kubelet from rejecting the Pod due to resource over-provisioning on the node?
The exact design of delayed preemption is proposed in #5711. Let's move this discussion there
# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha
stage: beta
I don't completely follow everything in this proposal but promoting to beta and introducing two new feature gates raises a lot of eyebrows to me.
And now we have 4 separate feature gates for this feature...
What are the implications of these feature gates in relation to each other?
i.e. if GenericWorkload is disabled and GangScheduling enabled, what do we expect to happen?
Will any of these graduate at a different time than the other feature gates?
I don't completely follow everything in this proposal but promoting to beta and introducing two new feature gates raises a lot of eyebrows to me.
One of the gates (WorkloadBasicPolicyDesiredCount) will start in alpha, but the change is small enough that creating a new KEP for it would be an overhead, so we decided to put it in this KEP.
And now we have 4 separate feature gates for this feature..
Tentatively removed one feature gate from the list. I think the workload scheduling cycle could be covered using the gang scheduling gate.
What are the implications on these features gates in relation to each other?
GangScheduling requires GenericWorkload to work, as the latter defines the API that enables the gang scheduling. GenericWorkload itself can be enabled and used to express the "true" workload without a need for it to be gang scheduled by the kube-scheduler.
GenericWorkload is disabled and GangScheduling enabled what do we expect to happen?
The GangScheduling gate requires GenericWorkload to be enabled (this is enforced by the feature gate validation).
Honestly I would really like to see some k8s workloads adopting WAS. We have #5547 but that seems decoupled from this KEP. But if we promote the API to beta then we are going to discourage breaking API changes. Would it be possible to make beta promotion contingent on at least a few workload APIs proving out this design? I don't want to get into a situation where this API gets promoted to beta/GA, workload authors then figure out issues with the API, and we have to revisit it but can no longer break it.
*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
+1
IMO we need to schedule PodGroups themselves, as they define scheduling constraints at the level of a group of pods, which means pods cannot be scheduled individually.
If we introduce groups of groups, then this phase would have to schedule such a group as a whole, since individual PodGroups could no longer be scheduled individually. Whenever we add them, extending the workload scheduling phase would be incremental.
If we ever had cross-workload scheduling constraints, I bet we'd need to schedule such workloads in one cycle as well, so a single workload may not necessarily be a boundary for this phase.
scheduler's internal cache. Pods are then pushed to the
active queue (restoring their original timestamps to ensure fairness)
to pass through the standard scheduling and binding cycle,
which will respect the nomination.
Can you elaborate on what "respect" means? Do you mean "consider" like currently, or something more? The alternative that I see is to consider the nomination as required; inability to follow the nomination would make the PodGroup unschedulable.
However, as the first implementation I'd stick to the current logic and use the word "consider". That means that Pods could pick a different node, but only within the constraint selected for the PodGroup. In other words, pod-by-pod scheduling cannot reschedule the PodGroup itself (assuming scheduling a PodGroup has "an effect", for instance picking a specific topology option).
Checking validity of the PodGroup level constraint may not be that easy and may require checking the state of other pods in the PodGroup or checking the state of DRA allocations. In the context of topology-aware workload scheduling I believe that it would be much easier to consider the nomination to be a hard requirement.
I meant "consider" here, but @44past4 point is valid. If we want to have a nominate semantic here, PodGroup-level plugins such as TAS would need to provide Filter extension points that will be able to verify the correctness of the new chosen node. Or, be able to provide the placement to the pod-by-pod cycle that will limit that phase to the topology/assignment chosen.
To provide a TAS Filter extension, we would need to store information about the selected placement (like the name of the Node label and its value) for the PodGroup. For this we would probably need to use the Workload status. This is doable, but it is not in the scope of the current TAS KEP proposal.
TAS could check in PreFilter where the pod group's previous pods were scheduled and, based on that, reject or allow nodes in Filter. Obviously, it won't be very efficient, but maybe it's sufficiently good for now.
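A minimal sketch of that PreFilter/Filter idea, under the assumption that the group's placement can be recovered from the nodes its scheduled members already landed on (all names hypothetical, not the actual TAS plugin):

```go
package tassketch

// Simplified: PreFilter records the topology domain already chosen by the
// scheduled members of the pod group; Filter then rejects nodes outside it.
type nodeInfo struct {
	Name   string
	Labels map[string]string
}

type groupState struct {
	topologyKey   string
	allowedDomain string // empty: no member scheduled yet, every domain allowed
}

// preFilter derives the allowed domain from the nodes that already host
// members of the pod group.
func preFilter(topologyKey string, memberNodes []nodeInfo) groupState {
	s := groupState{topologyKey: topologyKey}
	for _, n := range memberNodes {
		if d, ok := n.Labels[topologyKey]; ok {
			s.allowedDomain = d
			break
		}
	}
	return s
}

// filter allows only candidate nodes in the domain picked for the group.
func filter(s groupState, candidate nodeInfo) bool {
	return s.allowedDomain == "" || candidate.Labels[s.topologyKey] == s.allowedDomain
}
```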
reducing overall scheduling time.
* If a pod fits, it is tentatively nominated.
* If a pod cannot fit, the scheduler tries preemption by running
I wouldn't use the PostFilter phase in the workload scheduling phase, but rather "wait" for the proper workload preemption feature. Note that once we have pods that are batched (same signature), Opportunistic Batching does not have a way to find feasible-nodes-after-a-preemption-attempt anyway.
But, if workload-aware preemption is not there (or not enabled), we need to provide a way for gangs to perform preemptions. Otherwise, what is the point of having delayed preemption in the beta graduation criteria for gang scheduling?
Opportunistic batching just optimizes the default scheduling algorithm. I don't think preemption would be a big problem anyway because subsequent pod from the homogeneous sub-group can generate new placements for the rest of the batch.
* If `schedulableCount >= minCount`, the cycle succeeds. Pods are pushed
to the active queue and will soon attempt to be scheduled on their
nominated nodes in their own, pod-by-pod cycles. If a pod selects a
I don't see any strong reason why PostFilter should not run in pod-by-pod scheduling if we allow nominated pod to be changed. Yes, it's indeed an open question whether we want to allow changing nomination itself, but if we do, then we should allow to run PostFilter as well.
I think that allowing such disruptions in pod-by-pod scheduling would make workload scheduling hard to reason about. The reason we'll use the workload-aware preemption is to enable efficient and effective preemption. Otherwise, each pod in the group could preempt some pods or even workloads based on its own needs rather than the needs of the entire pod group. I don't see many advantages to allowing PostFilter to run in pod-by-pod scheduling.
Would need to inject the workload priority into each of the Pods
or somehow apply the lowest pod's priority to the rest of the group.
Alternative 2 (Store a gang representative):
I guess that one more option which would fall somewhere in between Alternative 2 and Alternative 3 could be changing the implementation of activeQ to operate on groups of Pods and treat single pods as groups of pods containing single pod only. Does this option make sense? Have you considered it?
optional. In the `Beta` timeframe, we may opportunistically apply this cycle to
`Basic` pod groups to leverage the batching performance benefits, but the
"all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to
schedule as many pods from such PodGroup as possible.
It is not clear what will happen in case of a pod scheduling failure here. Do we stop at the first failure or do we continue? If we continue, do we continue from the subsequent pod, or maybe from the subsequent sub-group?
Extended this part. The algorithm will continue from a pod from a subsequent sub-group.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: macsko
The full list of commands accepted by this bot can be found here. The pull request process is described here.
/hold
* If the group (i.e., at least `minCount` Pods) can be placed,
these Pods have the `.status.nominatedNodeName` set.
They are then effectively "reserved" on those nodes in the
scheduler's internal cache. Pods are then pushed to the
While reserving Pods using NominatedNodes eliminates the need to consider Pod dequeueing order, too many NominatedNodes can still degrade scheduling performance. A mechanism is needed to ensure that Pods can be dequeued as quickly as possible after a PodGroup resource is nominated.
What part of the scheduling performance are you mentioning? I think NominatedNodeNames shouldn't degrade it visibly.
However, it's still worth considering, so thanks for bringing up this topic. According to the
"Pods are then pushed to the active queue (restoring their original timestamps to ensure fairness) to pass through the standard scheduling and binding cycle, which will consider the nomination."
part, the pods will be put into the scheduling queue with their original timestamps, which should ensure they are scheduled pretty quickly. We would still like to allow high priority pods to jump in, as it's better to interrupt a not-yet-bound pod group rather than a running one.
I wouldn't make any changes to the way pod-by-pod scheduling works (including queue order) as I don't see any reason why the current behavior needs a change (including the order in which the pods are processed). I can imagine some other KEP could introduce some optimization or redefine pods priority, but in this KEP we should guarantee consistency and correctness, no matter in which order Pods are processed.
The key part of this KEP is to have an ability to propose a workload placement atomically and "reserve" necessary resources for it, which should guarantee the rest of the scheduling process follows the plan (unless cluster state changes in the meantime). In case of workload scheduling these guarantees are more important than the performance of pod-by-pod scheduling itself (which won't become worse anyway) as they solve other fundamental problems like scheduling deadlocks and enable multi-workload simulations (simulate scheduling only on resource nomination level).
In the current implementation, member Pods of a PodGroup reserve resources through NominatedNodes. If other Pods are scheduled before these Nominated member Pods, the Filter process will need to perform additional AddPod and Filter operations, which is the main performance issue I just mentioned.
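For readers less familiar with that code path, a simplified sketch of where the extra work comes from; it mirrors the general shape of the scheduler's "filter with nominated pods" step but is not the real implementation:

```go
package nominatedsketch

// Each pod that is merely *nominated* to a node still has to be temporarily
// added to that node's state whenever some other pod is filtered against it,
// which is the AddPod + extra Filter cost mentioned above.
type podRes struct{ cpuMilli int64 }
type nodeState struct{ freeCPUMilli int64 }

func fits(p podRes, n nodeState) bool { return p.cpuMilli <= n.freeCPUMilli }

// filterWithNominated evaluates an incoming pod against a node twice: once as
// if all pods nominated to the node were already running there, and once
// without them.
func filterWithNominated(incoming podRes, node nodeState, nominated []podRes) bool {
	withNominated := node
	for _, np := range nominated {
		withNominated.freeCPUMilli -= np.cpuMilli // per-nominated-pod "AddPod" work
	}
	if !fits(incoming, withNominated) {
		return false
	}
	return fits(incoming, node)
}
```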
opportunistic batching itself will provide significant improvements.
Future features like Topology Aware Scheduling can further improve other subsets of use cases.
#### Interaction with Basic Policy
IIUC, the all-or-nothing semantics of minCount are skipped with WSC enabled for the Basic policy, and desiredCount is only a hint for feasibility checks.
Given that, is it in-scope (maybe in the future) to support using both minCount and desiredCount together to express a more elastic gang behavior (e.g. minCount as a hard lower bound, desiredCount as the target batch size)?
the all-or-nothing semantics of `minCount` are skipped with WSC enabled for the Basic policy, and `desiredCount` is only a hint for feasibility checks.
Right, minCount is a parameter for the gang policy, i.e., it cannot be used with the basic policy, so the all-or-nothing semantics are not enforced. desiredCount is planned to be added only for the basic policy, as such a number will be useful for controllers to better describe their "true" workloads, and for the scheduler to potentially make better placement decisions.
Given that, is it in scope (maybe in the future) to support using both `minCount` and `desiredCount` together to express a more elastic gang behavior (e.g. `minCount` as a hard lower bound, `desiredCount` as the target batch size)?
Adding desiredCount to the gang policy can be considered, but not for the upcoming release. In fact, having this value would be useful while making placement decisions for the pod group.
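To make the split concrete, a Go-style sketch of the two policies as discussed above (hypothetical field layout; `desiredCount` is proposed only on the basic policy for now):

```go
package policysketch

// Illustrative layout, not the merged API.
type PodGroupPolicy struct {
	Basic *BasicPolicy // exactly one of the policies is set
	Gang  *GangPolicy
}

type BasicPolicy struct {
	// DesiredCount is a hint about how many pods the controller intends to
	// create; the scheduler may use it for better placement decisions but
	// never blocks pods on it.
	DesiredCount *int32
}

type GangPolicy struct {
	// MinCount keeps its all-or-nothing semantics: fewer than MinCount
	// schedulable pods means the whole group stays unscheduled.
	MinCount int32
	// A gang-level desiredCount (elastic gangs) is mentioned above as a
	// possible future extension, not targeted at the upcoming release.
}
```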
<<[/UNRESOLVED]>>
#### Scheduling Algorithm
Hmm, the first big missing point: what if pods are schedulable if we change the order of scheduling them?
Like let's say pod#1 is scheduled first and pod#2 is unschedulable. But, pod#2 might be schedulable if pod#1 is scheduled onto a different node.
This would mean pessimistically trying all combinations in the algorithm, which is not desirable from a computational point of view. This KEP is expected to deliver the foundation for future algorithms adapted to a workload-aware scheduling architecture. Although the basic algorithm may be suboptimal and even cause the scenario you described, it still eliminates deadlocks and livelocks in gang scheduling and is no worse than scheduling without WAS. We could try shuffling pods with the same priority in each workload scheduling cycle and perform periodic retries, but I don't think that's that important at this moment.
Same as in the TAS proposal, we should not compromise the correctness of scheduling. This is about correctness.
which is not desirable from a computational point of view.
I know we cannot literally try all combinations. But I do believe there's a better way than shuffling pods, at least. A JobSet can have more than a thousand pods, and I cannot say shuffling is where we can compromise. In some sense, shuffling is also not desirable from a computational point of view, because it is essentially the same as "literally trying all combinations in a random order until we find a place".
...and, assuming the workload contains many similar pods, shuffling is actually worse than literally trying all combinations within a single cycle until a place is found, because opportunistic batching could potentially work in the latter case (within the cache TTL).
Left a note for a rough idea before going to bed: We might be able to introduce a "swapping" process, very similar to preemption.
Let's say a workload has 5 groups defined by a signature (all with the same priority), and group#1 and group#2 are schedulable, while 1 pod in group#3 cannot find a place (even with normal preemption). Then we can let the pod in group#3 try to preempt pod(s) in group#1 and group#2 to explore the possibility of "swapping" with equal-priority pods. For example, if pods in group#1 are listed as preemption candidates, we'll check the cached node list for group#1 again and see if there's any spare node to which we can move group#1's pod. If there is, we can schedule the pod in group#3 there and schedule the victim pod from group#1 somewhere else. (Otherwise, we can say the entire workload is unschedulable.) After that, group#2 has to be rescheduled again, because a change in group#1's placements might impact it.
The worst case is that group#3, then group#4, then group#5 select group#1 as a victim again and again, because we then need to restart from group#2's scheduling over and over each time. We'd also need to track which combinations we already tried for each group so that we don't repeat the same calculation.
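A toy sketch of the swapping idea, under heavy simplifications (one slot per pod, an `allowed` set standing in for affinity; all names here are made up, this is not a real algorithm):

```go
package main

import "fmt"

// pod needs one slot and may be restricted to a subset of nodes.
type pod struct {
	name    string
	allowed map[string]bool // nil means "any node"
}

type node struct {
	name string
	free int
}

func fits(p pod, n *node) bool {
	return n.free > 0 && (p.allowed == nil || p.allowed[n.name])
}

// place puts the pod on the first node it fits on.
func place(p pod, nodes []*node, placed map[string]*node) bool {
	for _, n := range nodes {
		if fits(p, n) {
			n.free--
			placed[p.name] = n
			return true
		}
	}
	return false
}

// swap tries to free room for a stuck pod by relocating one already-placed
// pod (the "victim") to a different node it still fits on.
func swap(stuck pod, pods []pod, nodes []*node, placed map[string]*node) bool {
	for _, victim := range pods {
		from, ok := placed[victim.name]
		if !ok || !fits(stuck, &node{name: from.name, free: 1}) {
			continue // freeing this victim's node would not help the stuck pod
		}
		for _, to := range nodes {
			if to != from && fits(victim, to) {
				to.free--
				from.free++
				placed[victim.name] = to
				return place(stuck, nodes, placed)
			}
		}
	}
	return false
}

func main() {
	nodes := []*node{{name: "node-a", free: 1}, {name: "node-b", free: 1}}
	pods := []pod{
		{name: "g1-0"}, // fits anywhere
		{name: "g3-0", allowed: map[string]bool{"node-a": true}}, // only node-a
	}
	placed := map[string]*node{}
	for _, p := range pods {
		if !place(p, nodes, placed) && !swap(p, pods, nodes, placed) {
			fmt.Println("unschedulable:", p.name)
			return
		}
	}
	for name, n := range placed {
		fmt.Println(name, "->", n.name)
	}
}
```

Here g1-0 initially lands on node-a, which blocks g3-0; the swap moves g1-0 to node-b so that g3-0 can take node-a. The bookkeeping of already-tried combinations mentioned above is deliberately left out.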
| <<[/UNRESOLVED]>> | ||
| ``` | ||
|
|
||
| #### Scheduling Algorithm |
The second missing point: how should we support inter-pod constraints properly? (pod affinity and topology spread)
Like, let's say pods have inter pod affinity to each other and this affinity doesn't match any existing running pods outside this workload. Given the pod affinity plugin doesn't reject the first pod, it will keep returning Success to all pods. (note that NNN reserves the place, but it doesn't impact the evaluation of inter pod constraints, iirc)
But, if they actually enter the scheduling cycle and each pod is assume()ed, the pod affinity might return Unschedulable to some pods because pod affinity starts to work "properly" after assume() for some pods.
note that NNN reserves the place, but it doesn't impact the evaluation of inter pod constraints
It does impact them. If there are pods with NNNs, filters are called twice: once with these pods present and once without. Both filter runs have to pass. This means the affinity check will work correctly.
Furthermore, I think that during the workload scheduling algorithm, the pods will be assumed and reserved, not just nominated, to ensure that each plugin has the correct internal state. These will be replaced with actual nominations when the algorithm finishes.
Please also note that this algorithm is not expected to handle inter-pod constraints particularly well (i.e., much better than without WAS). Ultimately, we plan to allow such constraints to be defined at the pod group level and support them with customized algorithms.
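Very roughly, the flow I have in mind looks like the sketch below; the helpers are illustrative stand-ins, not the real cache or framework calls:

```go
package main

import "fmt"

// Illustrative only: assumePod and finish stand in for the scheduler's real
// cache/framework operations.
type podState struct {
	assumed   bool
	nominated string
}

type workloadCycle struct {
	states map[string]*podState
}

// assumePod pretends the pod is already on the node so that plugins with
// internal state (e.g. inter-pod affinity) see it during later placements.
func (c *workloadCycle) assumePod(pod, node string) {
	c.states[pod] = &podState{assumed: true}
	fmt.Println("assume", pod, "on", node)
}

// finish converts the temporary assumptions into nominations once the whole
// workload scheduling cycle has succeeded.
func (c *workloadCycle) finish(chosen map[string]string) {
	for pod, node := range chosen {
		s := c.states[pod]
		s.assumed = false  // forget the temporary assumption
		s.nominated = node // record the nominated node instead
		fmt.Println("nominate", pod, "on", node)
	}
}

func main() {
	c := &workloadCycle{states: map[string]*podState{}}
	chosen := map[string]string{"member-0": "node-1", "member-1": "node-2"}
	for pod, node := range chosen {
		c.assumePod(pod, node)
	}
	c.finish(chosen)
}
```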
If there are pods with NNNs, filters are called twice
OK, I missed it. You're right.
| as the current 10 seconds has proven to be too low in larger clusters, | ||
| so this might be the case for workloads. | ||
|
|
||
| 3. Retries |
I'd say it's a good starting point, but soon we'll need to enhance it: at least, we should consider QHints only for failed pods (i.e., pods that were unschedulable in the previous attempt).
It's not that easy, since we don't try all possible pod order combinations. For example, podA from a group can fit on either node1 or node2, but podB can only fit on node1. If podA is scheduled on node1, podB cannot be scheduled. When an event occurs that makes podA schedulable on node3, the algorithm might allow podA on node3 and podB on node1.
Honestly, QHints aren't that critical, especially in larger clusters, because the volume of events is already large enough. And because the order can matter, we could even try having some periodic retries for pod groups.
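As a toy illustration of the periodic-retry idea (made-up types, not the actual queue implementation):

```go
package main

import (
	"fmt"
	"time"
)

// requeuer is a stand-in for the queue component holding pod groups that were
// parked after a failed workload scheduling cycle.
type requeuer struct {
	unschedulableGroups []string
	activeQueue         chan string
}

// flushPeriodically moves every parked group back to the active queue after a
// fixed interval, regardless of which event (if any) made it schedulable,
// because with order-dependent placement a precise per-pod hint is hard.
func (r *requeuer) flushPeriodically(interval time.Duration, rounds int) {
	for i := 0; i < rounds; i++ {
		time.Sleep(interval)
		for _, g := range r.unschedulableGroups {
			r.activeQueue <- g
		}
		r.unschedulableGroups = nil
	}
}

func main() {
	r := &requeuer{
		unschedulableGroups: []string{"podgroup-a", "podgroup-b"},
		activeQueue:         make(chan string, 8),
	}
	go r.flushPeriodically(10*time.Millisecond, 1)
	for i := 0; i < 2; i++ {
		fmt.Println("retrying", <-r.activeQueue)
	}
}
```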
podA from a group can fit on either node1 or node2, but podB can only fit on node1. If podA is scheduled on node1, podB cannot be scheduled. When an event occurs that makes podA schedulable on node3, the algorithm might allow podA on node3 and podB on node1.
Hmm, good point. We cannot simply do my proposal above (only care about pods that failed in a previous cycle)
| * If `schedulableCount >= minCount`, the cycle succeeds. Pods are pushed | ||
| to the active queue and will soon attempt to be scheduled on their | ||
| nominated nodes in their own, pod-by-pod cycles. If a pod selects a | ||
| different node than its nomination during the individual cycle, the |
And, what if a pod turns out to be unschedulable in this pod-by-pod scheduling cycle? I believe we need to track how many pods are "actually" passing the scheduling cycle and waiting at WaitOnPermit, and how many pods are rejected in the scheduling cycle. Once {schedulableCount from the workload scheduling cycle} - {the number of pods rejected in the actual scheduling cycle} drops below minCount, we shouldn't attempt the rest of the workload's pods and should also reject all waiting pods.
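Roughly this kind of bookkeeping (hypothetical counters, just to show the condition):

```go
package main

import "fmt"

// Hypothetical bookkeeping for one pod group during the pod-by-pod phase.
type groupProgress struct {
	minCount         int // from the gang policy
	schedulableCount int // pods that passed the workload scheduling cycle
	rejectedCount    int // pods later rejected in their individual cycles
}

// shouldAbort says whether the remaining members can no longer reach
// minCount, in which case the pods waiting at WaitOnPermit should also be
// rejected instead of holding their nominated capacity.
func (g groupProgress) shouldAbort() bool {
	return g.schedulableCount-g.rejectedCount < g.minCount
}

func main() {
	g := groupProgress{minCount: 4, schedulableCount: 5, rejectedCount: 0}
	fmt.Println(g.shouldAbort()) // false: 5 - 0 >= 4

	g.rejectedCount = 2
	fmt.Println(g.shouldAbort()) // true: 5 - 2 < 4, reject the waiting pods too
}
```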
We already keep such numbers in the workload manager; we know how many pods are unscheduled, assumed and assigned (ref), so it shouldn't be a problem. We could add more info there, e.g., the number of pods that passed the workload scheduling cycle.
Nice, please add this to the KEP then, if you agree?
Once {schedulableCount from the workload scheduling cycle} - {the number of pods rejected in the actual scheduling cycle} drops below minCount, we shouldn't attempt the rest of the workload's pods and should also reject all waiting pods.
| * If a pod cannot fit, the scheduler tries preemption by running | ||
| the `PostFilter` extension point. *Note:* With workload-aware preemption | ||
| this phase will be replaced by a workload-level algorithm. | ||
| * If preemption is successful, the pod is nominated on the selected node. |
but the actual pod termination won't finish instantly. As proposed in #5711, shouldn't we regard this pod as "unschedulable" here?
Honestly, having two KEPs (this one and #5711) that both mention preemption is very confusing. I don't get how preemption eventually works, because the two point in slightly different directions.
cc @wojtek-t
@44past4's TAS proposal confused me more.
@macsko @44past4 @wojtek-t
At least for me, it is confusing that multiple KEP PRs explain the flow differently, and the overall final picture is blurred even though I understand what each KEP wants to propose. E.g., I don't understand where GeneratePlacements etc. are called in the flow explained here, what the goal for preemption is in this release, or what will be implemented by when in general. Please clarify in each KEP how the feature interacts with the other KEPs.
but the actual pod termination won't finish instantly. As proposed in #5711, shouldn't we regard this pod as "unschedulable" here?
The exact shape of the Delayed Preemption algorithm is not yet determined, but for now we should treat such a preemptor pod as schedulable in the workload scheduling algorithm, because we need to determine whether, with these preemptions, the pod group will be schedulable (≥ minCount); only if that condition is met will the preemptions be actuated.
Honestly, having two KEPs (this one and #5711) that both mention preemption is very confusing. I don't get how preemption eventually works, because the two point in slightly different directions.
I agree that it's not ideal for the KEPs to depend on each other and differ in details. I'll update this KEP to provide a clearer description of the preemption scenarios. I also believe that, with most people soon returning from vacation, we'll finally be able to align on the details of these three KEPs.
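To make the first point concrete, a small sketch of the counting (illustrative only; field names are made up):

```go
package main

import "fmt"

// In the workload scheduling cycle, a member that can only be placed after
// preempting victims still counts as schedulable; whether the preemptions are
// actually actuated is decided only once the whole group clears minCount.
type cycleResult struct {
	fitsDirectly       int // members that fit without preemption
	fitsWithPreemption int // members that fit only if victims are evicted
	minCount           int
}

func (r cycleResult) groupSchedulable() bool {
	return r.fitsDirectly+r.fitsWithPreemption >= r.minCount
}

func main() {
	r := cycleResult{fitsDirectly: 3, fitsWithPreemption: 2, minCount: 4}
	if r.groupSchedulable() {
		fmt.Println("group schedulable: actuate the planned preemptions, then bind")
	} else {
		fmt.Println("group unschedulable: do not evict anything")
	}
}
```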
| Introduce a completely separate queue for PodGroups alongside the `activeQ` for Pods. | ||
| The scheduler would pop the item (Pod or PodGroup) with the highest priority/earliest timestamp. | ||
| Pods belonging to an enqueued PodGroup won't be allowed in the `activeQ`. | ||
| * *Pros:* Clean separation of concerns. Can easily use the Workload scheduling priority. | ||
| Can report dedicated logs and metrics with less confusion to the user. | ||
| * *Cons:* Significant and non-trivial architectural change to the scheduling queue | ||
| and `scheduleOne` loop. |
I presume that we could treat single Pods as groups with a count of one. Did we cover this? I don't see it as a listed alternative.
I'm not in favor of this kind of change, which affects the default pod-by-pod scheduling flow. Ultimately, we may want to change this and treat all individual pods as groups of size one, but before we do so, we need to make sure that everything works correctly. Major modifications to the core scheduling logic have proven difficult and can cause subtle problems that lead to serious regressions.
Furthermore, we want to run a pod-by-pod cycle after the workload cycle for each of the workload's pods, so treating individual pods as one-member groups would mean unnecessary extra processing for them.
|
|
||
| More detail about scheduler changes is described in [this document](https://docs.google.com/document/d/1lMYkDuGqEoZWfE2b8vjQx0vHieOMyfmi6VHUef5-5is/edit?tab=t.0#heading=h.1p88ilpefnb). | ||
|
|
||
| #### Future plans |
We could explain this in our non-goals section (which can link to a Future Plans section later in the KEP for a detailed write-up of where we hope to get to).
It's already described in the non-goals section. Added a link to the future plans there.
| While Gang Scheduling focuses on atomic, all-or-nothing scheduling, there is a significant class | ||
| of workloads that requires best-effort optimization without the strict blocking semantics of a gang. | ||
|
|
||
| Currently, the `Basic` policy is a no-op. We propose extending the `Basic` policy |
In a KEP, "currently" should refer to the status quo ante, i.e., the situation before the KEP landed as implementable.
You can still use "currently" to mean something different, but qualify it with a phase (e.g., alpha) or a date.
Right, rephrased it.
|
|
||
| 1. Rejection | ||
|
|
||
| When the cycle fails, the scheduler rejects the entire group. |
IIUC, this type of rejection might currently only be surfaced on the Workload. A small suggestion: it would be more user-friendly to also surface this rejection information as a Pod event or condition.
Each pod will go through the standard rejection process, including updating the status and sending an event. Please note, we are also planning to introduce the workload status (kubernetes/kubernetes#135561), which will be helpful here. It is mentioned in the Scheduling Algorithm section, but I added it here as well.
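For example, a rejected member could end up with a condition like the one below via that standard path (a sketch using the core/v1 types; the exact reason and message wording here are illustrative):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Sketch of the per-pod condition each rejected gang member would get via
	// the standard rejection path; the message format is illustrative only.
	cond := v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionFalse,
		Reason:             v1.PodReasonUnschedulable,
		Message:            "pod group rejected: only 3 of minCount=4 members are schedulable",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```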