KEP-4671: Introduce Workload Scheduling Cycle, extend basic policy with desiredCount #5730
base: master
Conversation
macsko
commented
Dec 10, 2025
- One-line PR description: Update KEP-4671 for beta in v1.36. Introduce Workload Scheduling Cycle phase.
- Issue link: Gang Scheduling Support in Kubernetes #4671
- Other comments: This PR shows an initial idea of the new workload scheduling phase. Further changes related to graduation to beta will be added gradually.
Alternative 1 (Modify sorting logic):
Modify the sorting logic within the existing `PriorityQueue` to put all pods
We would need to modify it once we have workload priority, right? Otherwise we can keep the current sorting algorithm and pull all gang members from the queue once we encounter the first member of the gang (and switch to workload scheduling), right?
Right, we could keep the current algorithm, but since the gang members can be in any internal queue, removing them from these queues would require some effort (running PreEnqueue, registering metrics, filling in internal structures used for queuing hints, etc.). In fact, this effort could instead go toward properly implementing queueing with a more future-proof alternative.
If we decide that, when there is no workload priority, priority = min(pod priorities) (as Wojtek suggested), then this approach becomes a big problem.
Yes, if we choose that priority calculation, then this alternative is just a bad idea
I'd also not modify the queue sorting, but "remove" PodGroup pods from it when the PodGroup itself is unschedulable, by simply making them unschedulable.
We can't "simply" make them unschedulable. What about event handling? What should make the PodGroup schedulable after failure? What to do with the observability (metrics, logs, etc.) that are aligned to the pod-by-pod queueing now? It's hard to say whether we could easily do that or not. I need to analyze the code deeper and come back with some thoughts.
Anyway, the modification of the queue sorting will be needed when the pod group priority will be introduced by the workload-aware preemption KEP. We should consider that when designing the queueing part of this KEP.
erictune
left a comment
This looks great overall.
I have one proposal to consider renaming part of the API, and some clarifying questions.
// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
The term PodGroup has become ambiguous. It can mean:
1. a list item in Workload.podGroups
2. a specific set of pods having the same PodGroupReplicaKey
These are often the same thing but not always.
If we want to be clear what we mean, we can either:
- Always call (1) a PodGroup and always call (2) a PodGroupReplica, including in the case when podGroupReplicaKey is nil
- Or, rename PodGroup to PodGroup Template, and call (2) a PodGroup
This may seem pedantic, but when reviewing this KEP, I felt that current double meaning made the text imprecise in a number of places. We do have a chance to fix the naming with v1beta1, but I don't think we are supposed to rename when we go to GA.
My personal preference: when talking about (2), I would just call that either PodGroupReplica or alternatively (which is what I was using) PodGroup instance. With that, I wouldn't rename it - but these are my personal preferences.
If we rename it to PodGroupReplica, are we planning to introduce PodSubGroup under it, as we discussed here: #5711 (comment) ? And what about PodSubGroupReplica?
Agree with andreyvelich, PodSubGroupReplica and PodSetReplica are confusing.
This would be less confusing:
- 1 Workload object in the API server names exactly one group of pods at runtime.
- 1 Workload.podGroupTemplates[i] is a template for N independent PodGroups which are distinguished by PodGroupReplicaKey
Later, if PodSubGroup is added:
- 1 Workload.podGroupTemplate[i].podSubGroup[j] names exactly one PodSubGroup within a single group of pods (PodGroup) defined as above.
Later, if PodSet is added under PodSubGroup:
- 1 Workload.podGroupTemplate[i].podSubGroup[j].podSet[k] names exactly one PodSet within PodSubGroup, defined as above.
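For illustration, a minimal Go sketch of how that naming could look as API types; every type and field name below is hypothetical, following the proposal above rather than the merged API:

```go
// Hypothetical shape of the proposal above: templates live in the Workload,
// and "PodGroup" is reserved for the runtime set of pods sharing a
// PodGroupReplicaKey. Illustrative only; not the merged API.
package workloadapi

type Workload struct {
	Name string
	// Each entry is a template; N independent PodGroups may exist per
	// template at runtime, distinguished by their PodGroupReplicaKey.
	PodGroupTemplates []PodGroupTemplate
}

type PodGroupTemplate struct {
	Name string
	// Later, if sub-groups and pod sets are added, they would nest here:
	// PodSubGroups []PodSubGroupTemplate
}
```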
This needs a longer discussion than belongs in the review comments of this PR.
Suggestion to KEP author:
- Keep Workload Scheduling Cycle proposal in this PR.
- Keep Basic policy enhancements in this PR.
- Remove advancement to beta.
Let's all as reviewers approve the above. And we can take a week to discuss the "PodGroup as separate object" proposal, which may affect the alpha/beta status of Workload.
Agree with @johnbelamaric and @erictune that we should separate discussion whether we need separate object for PodGroup.
Remove advancement to beta.
Is my understanding correct that we will create another KEP for beta graduation in that case?
I'm going back and forth with whether we should separate PodGroup into a separate object (in fact my original counter-proposal was going that way though for different reasons https://docs.google.com/document/d/13UkLjVMj_edMh7biqVU6SVyNNTIfGAT35p6-pNsF5AY/edit?resourcekey=0-dqUEiwiXWICLwAg6Tqkupw&tab=t.9le0fmf90j3w#heading=h.vf43rjyfidc6 )
I agree it's the last moment to change that before going to Beta.
But I also agree that pretty much neither of the proposed changes (basicPolicy, workload scheduling cycle, scheduling algorithm, ...) depend on this decision and can proceed independently.
So I'm heavily +1 on Eric's suggestion above to focus this PR on those changes but still leave the KEP in Alpha after this PR.
And have a separate PR that will be focused on the API itself (it can't be a separate KEP, but it can be a separate PR and discussion).
Also - we should probably start a dedicated document for that to clearly describe Pros/Cons of both options to make the more data-driven decision.
@macsko - I'll be OOO for the next 2 weeks; would you be able to start such a doc? I'm happy to contribute once I'm back.
Okay, so I'll remove the beta graduation part from this PR and the discussion about API can be moved to a new document
So I went ahead and started a doc: https://docs.google.com/document/d/1zVdNyMGuSi861Uw16LAKXzKkBgZaICOWdPRQB9YAwTk/edit?resourcekey=0-bD8cjW_B6ZfOpSGgDrU6Mg&tab=t.0#heading=h.c4vrtnmf9f4o
It's only a starter and requires a lot of work so I would appreciate all contributions, especially given I will be OOO until Jan 7th
// WorkloadSpec describes a workload in a portable way that scheduler and related
// tools can understand.
// WorkloadMaxPodGroups is the maximum number of pod groups per Workload.
This is the number of items in the list PodGroups
FWIW - these updates are based on what was merged in the code.
I'm happy for these to be updated, but for these specifically maybe we should open a dedicated PR in k/k.
Agree with @wojtek-t. We can defer API description discussions to the k/k PR with the API graduation (when it is created).
ControllerRef *TypedLocalObjectReference
// PodGroups is the list of pod groups that make up the Workload.
// The maximum number of pod groups is 8. This field is immutable.
But the maximum number of PodGroup Replicas is not limited.
That's right. It would even be hard to enforce such a limitation in the current form of replication, since it's based solely on pods' workloadRefs.
I was suggesting making this clearer in the code comment.
Splitting PodGroup into its own resource would make it a lot clearer that there is a distinction between the template and the instance, and that the limit applies to defined templates, not instances.
When the scheduler pops a Pod from the active queue, it checks if that Pod
belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
initiates the Workload Scheduling Cycle.
If a pod belongs to an already scheduled PodGroup, it is not clear what to do. We could:
- Hold it back, on the assumption that it is a replacement for an already scheduled pod. When `minCount` additional pods show up, then handle all those at once.
- Treat it as if it is a "gang scale up", and try to place it too (on a best-effort basis). If we do this, does it go through the Workload cycle, or just the normal pod-at-a-time path?
I am thinking about races that could happen when a workload is failing and getting pods replaced. And I am thinking about the case when a workload wants to tolerate a small number of failures by having actual count > minCount.
Taking a step back from the implementation, we should think what we really need to do conceptually in that case.
If the pod is part of PodGroup, then what we conceptually want is to schedule that in the context of its PodGroup instance. I think there are effectively three cases here:
- this PodGroup instance doesn't satisfy its minCount now, but with this pod it will satisfy it
- this PodGroup instance doesn't satisfy its minCount now and with that pod it won't satisfy it either
- this PodGroup instance already satisfies its minCount even without this pod
It's worth noting at this point that, with topology-aware scheduling introduced, a PodGroup instance could have TAS requirements too, which matters when thinking about what we should do with it.
The last point in my opinion means that we effectively should always go through the Workload cycle, because we always want to consider it in the context of the whole workload.
The primary question is whether we want to kick this workload cycle off with an individual pod or wait for more, and if so, based on what criteria.
I think that the exact criteria will be evolving in the future (I can definitely imagine it depending on the "preemption unit" that we're introducing in KEP-5710). So for now, I wouldn't try to settle on the final solution.
I would suggest starting with:
- always go through the Workload cycle (to keep the whole podGroup instance in mind as context)
- for now, always schedule the individual pod for the sake of making progress here (we will most probably adjust it in the future, but it will be a pretty localized change and I think it's fine)
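A rough Go sketch of that starting point; every name here is hypothetical and only meant to show the suggested control flow:

```go
package schedulingsketch

// Minimal stand-ins for the real types; illustrative only.
type Pod struct {
	Name        string
	WorkloadRef *WorkloadRef // proposed Pod field referencing its Workload/PodGroup
}

type WorkloadRef struct{ Workload, PodGroup string }

type PodGroupInstance struct {
	Key           string
	MinCount      int
	ScheduledPods int
}

// routePoppedPod sketches the suggested starting point: any pod with a
// workloadRef is handled in a workload scheduling cycle, in the context of its
// PodGroup instance, even when that instance already satisfies minCount.
func routePoppedPod(
	pod *Pod,
	instanceOf func(*Pod) *PodGroupInstance,
	runWorkloadCycle func(*PodGroupInstance, []*Pod),
	runPodByPod func(*Pod),
) {
	if pod.WorkloadRef == nil {
		runPodByPod(pod) // ordinary pod: standard pod-by-pod cycle
		return
	}
	// Whether the instance is below, at, or above minCount (the three cases
	// above), schedule the individual pod within the workload cycle for now.
	runWorkloadCycle(instanceOf(pod), []*Pod{pod})
}
```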
It seems that the easiest way is to do what @wojtek-t described, i.e., go best effort (as for the basic policy), so take as many pods as we have available, and try to schedule them in the workload cycle. Only pods that have passed the workload cycle will be able to move on to the pod-by-pod cycle. In the future, we can try to create more intelligent grouping, but for now, let's focus on delivering a good enough implementation.
What is the plan if PodGroup has been scheduled with TAS constraints based on minCount, and a new Pod comes in that is part of the PodGroup but doesn't fit in the TAS constraints?
That's a good point. I suppose, the new Pod should go through a workload scheduling cycle and the TAS algorithm there should take the scheduled pods from a gang into consideration. If the pod doesn't fit, it will remain unschedulable until something changes in the cluster.
They are then effectively "reserved" on those nodes in the
scheduler's internal cache. Pods are then pushed to the
active queue (restoring their original timestamps to ensure fairness)
to pass through the standard scheduling and binding cycle,
IIUC, there are some cases where a nominated pod fails to pass through the standard scheduling cycle:
- differences in the algorithm between workload cycle and standard cycle
- refreshed snapshot changes node eligibility.
- higher priority pod jumps ahead in active queue
Yes, these are the cases, but we are okay when they occur, as long as the new placement is still valid. If it isn't anymore, scheduling of the gang will fail (at WaitOnPermit) and the gang will be retried.
There should be no differences between workload and standard cycle in terms of pods feasibility
wojtek-t
left a comment
Added a few comments, but they are pretty localized - overall this is a great proposal, pretty aligned with how I was thinking about it too.
Can report dedicated logs and metrics with less confusion to the user.
* *Cons:* Significant and non-trivial architectural change to the scheduling queue
and `scheduleOne` loop.
<<[/UNRESOLVED]>>
I would value opinions from people more hands-on with the scheduler code recently more than my own, but on paper the third alternative seems preferable to me:
- Alternative 1 - it will be extremely hard to reason about it if pods sit in different queues (backoff, active, ...) independently, and the evolution point is extremely important imho - we know we will be evolving it a lot and preparing for that is super important.
- Alternative 2 - I share the concerns about corner cases (pod deletion is exactly what I have in mind). Once we get to rescheduling cases (new pods appear but they should trigger removing and recreating some of the existing pods), it will get even more complicated with even harder corner cases. Given that we know that rescheduling is super important, I'm also reluctant about it.
- Alternative 3 - It's clearly the cleanest option. The primary drawback of needing non-trivial changes is imho justified, given we know that we will be evolving it significantly in the future.
So I have quite a strong preference towards (3) at this point, but I'm happy to be challenged by implementation-specific counter-arguments.
That was my thought as well. I think the effort involved in adding workload queuing will be comparable to modifying the code for the previous alternatives, but maybe I don't see some significant drawbacks.
For the third alternative, do we want to introduce only the queue for PodGroups? If so, then we have the same problem of pulling pods from the backoff and unschedulable queues, right? Or do we mean by "workload queue" a structure that will hold all the pods related to workload scheduling?
Let's say that we add a Pod that is part of a group but does not make us meet the minCount. Should it land in the unschedulable queue as it does right now? Or in some additional structure?
If the pod group failed scheduling, where do we put its pods? We need to have them somewhere, so we can check them against cluster events in case some event makes a pod schedulable and thus potentially makes the whole pod group schedulable.
I think one way might be to introduce both:
- a queue for pod groups (let's say `podGroupsQ`). This will contain the pod groups that can be popped for workload scheduling.
- a data structure for unscheduled pods (let's say `workloadPods`) that have to wait for workload scheduling to finish before proceeding with their pod-by-pod cycles.
I imagine the pod transitions (for pods that have a workloadRef) would be:
pod is created -> when PreEnqueue passes for a pod, add it to workloadPods, otherwise add it to unschedulablePods -> [for a gang pod group] if >= minCount pods for the group are in workloadPods, the group is added to podGroupsQ -> the pod group is processed and a workload scheduling cycle is executed
When the workload cycle finishes successfully:
the pod gets the NNN and is moved to the activeQ -> the pod is popped and goes through its pod-by-pod scheduling cycle -> ...
When the workload cycle fails:
the pod is moved to unschedulablePods, where it waits for a cluster change to happen -> when the change happens for this pod or another one from its pod group, the pod(s) are moved to workloadPods -> the workload manager detects the group should be retried and adds the group to podGroupsQ -> processing continues as previously
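A small Go sketch of those two structures and the gang transition into `podGroupsQ` (the names `podGroupsQ` and `workloadPods` come from the comment above; everything else is hypothetical):

```go
package main

import "fmt"

// Illustrative only: the two structures sketched above and the transition
// "when >= minCount pods of a gang group are in workloadPods, the group
// becomes poppable for a workload scheduling cycle".
type workloadQueues struct {
	workloadPods map[string][]string // group key -> pods waiting for the workload cycle
	podGroupsQ   []string            // group keys ready to be popped for a workload cycle
	minCount     map[string]int      // gang minCount per group key
}

// addPod records a pod that passed PreEnqueue; once its group gathers
// minCount pods, the group is enqueued exactly once.
func (q *workloadQueues) addPod(groupKey, podName string) {
	q.workloadPods[groupKey] = append(q.workloadPods[groupKey], podName)
	if len(q.workloadPods[groupKey]) == q.minCount[groupKey] {
		q.podGroupsQ = append(q.podGroupsQ, groupKey)
	}
}

// popGroup returns the next group to run a workload scheduling cycle for.
func (q *workloadQueues) popGroup() (string, []string, bool) {
	if len(q.podGroupsQ) == 0 {
		return "", nil, false
	}
	key := q.podGroupsQ[0]
	q.podGroupsQ = q.podGroupsQ[1:]
	return key, q.workloadPods[key], true
}

func main() {
	q := &workloadQueues{
		workloadPods: map[string][]string{},
		minCount:     map[string]int{"job-a/workers": 2},
	}
	q.addPod("job-a/workers", "worker-0")
	q.addPod("job-a/workers", "worker-1") // the group reaches minCount here
	key, pods, _ := q.popGroup()
	fmt.Println(key, pods) // job-a/workers [worker-0 worker-1]
}
```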
Again, alternative 3 is the obvious choice IMO if PodGroup is a separate resource.
In Koordinator, I used Alternative 2. The desired effect is that, in ActiveQ/BackoffQ, a PodGroup has only one item, and this item can carry all queuing attributes (Priority, LastScheduleTime, BackoffTime, etc.). Whether this item is a representative pod or a real PodGroup is less important.
Another important point is that when a representative Pod or PodGroup is dequeued, its sibling Pods are also dequeued. In Koordinator, we implemented our own NextPod (koordinator-sh/koordinator#2417) to achieve this.
Regarding this KEP, I feel that a fusion of Alternative 2 and Alternative 3 might be a better solution:
- Use a QueuedPodGroupInfo that aggregates the queuing attributes of all member Pods to flow PodGroups between ActiveQ, BackoffQ, and UnschedulableQ, allowing the previous QueueHint mechanism to seamlessly integrate with PodGroups.
- Use a separate Map to store the Pods belonging to QueuedPodGroupInfo for easy indexing.
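A minimal sketch of what such a fused queue item could carry (field names are hypothetical, loosely following the description above, not Koordinator's actual types):

```go
package queuesketch

import "time"

// QueuedPodGroupInfo sketches a single queue item per PodGroup that aggregates
// the queuing attributes of its member pods, so the existing
// ActiveQ/BackoffQ/UnschedulableQ machinery and QueueHints can operate on it.
type QueuedPodGroupInfo struct {
	GroupKey       string
	Priority       int32     // e.g. min of member priorities, per the discussion above
	Timestamp      time.Time // e.g. earliest member enqueue time, to preserve fairness
	Attempts       int
	BackoffExpires time.Time
}

// memberIndex is the separate map mentioned above, keeping the pods themselves
// outside the queue item for easy indexing.
type memberIndex map[string][]string // group key -> member pod names

// aggregate recomputes the group-level attributes from its members
// (assumes non-empty inputs).
func aggregate(priorities []int32, enqueued []time.Time) (minPrio int32, earliest time.Time) {
	minPrio, earliest = priorities[0], enqueued[0]
	for _, p := range priorities[1:] {
		if p < minPrio {
			minPrio = p
		}
	}
	for _, t := range enqueued[1:] {
		if t.Before(earliest) {
			earliest = t
		}
	}
	return minPrio, earliest
}
```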
Consider that we should have PodGroups without pods at some point in time and we should be able to schedule them somehow.
However, before we get there, I'd consider scheduling a PodGroup simply whenever we encounter the first pod that refers to it and the PodGroup hasn't been scheduled yet. It's similar to option 1, but without modifying the sorting function.
I think it would be the simplest implementation, as all pods stay in the active queue unless an attempt to schedule a PodGroup they belong to makes the PodGroup unschedulable. Obviously the question is when to reconsider an unschedulable PodGroup. At the beginning I'd make it periodically schedulable again (after some timeout), without defining a PodGroup queue yet.
I'm not sure yet how Workload-aware preemption may interact with it.
It's similar to option 1 but without modifying sorting function.
Added it as an "Alternative 0" to the proposal.
* If preemption is successful, the pod is nominated on the selected node.
* If preemption fails, the pod is considered unscheduled for this cycle.
The phase can effectively stop once `minCount` pods have a placement,
From the POV of optimizing locally, we should continue scheduling all pods (and not stop after minCount) - what changes is that the unschedulability of further pods shouldn't make the whole group unschedulable.
Given that we have minCount at the whole PodGroup level, the local optimization is actually what we should probably do anyway (it would be a different story if we would have a separate minCount per signature, but it's not the case).
If the number of pods in the PodGroup is larger than minCount, there is also a question of whether we should stop at the first pod scheduling failure, or maybe only when the number of failures is larger than the difference between the size of the PodGroup and minCount.
My intuition is that if we encounter a pod scheduling failure, we should probably skip scheduling of the remaining pods in the given sub-group, treating them as unschedulable, but if there are more sub-groups and there is still a chance that we will be able to schedule minCount pods, we should probably try scheduling pods from the subsequent sub-groups.
My intuition is that if we encounter a pod scheduling failure we should probably skip scheduling of the remaining pods in a given sub-group treating them as unschedulable but if there are more sub-groups and if there is still a chance that we will be able to schedule minCount of pods we should probably try scheudling pods from the sebsequent sub-groups.
I agree and that was the idea. Added it explicitly to the KEP
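For concreteness, a small Go sketch of that loop (a "sub-group" here is a set of pods sharing an equivalence signature; all names are hypothetical):

```go
package gangsketch

// scheduleGroup sketches the behavior described above: on the first failure
// within a sub-group, skip its remaining pods, but keep trying subsequent
// sub-groups as long as reaching minCount is still possible.
type pod struct{ name string }

func scheduleGroup(subGroups [][]pod, minCount int, fits func(pod) bool) (placed int, ok bool) {
	// Upper bound on how many pods could still end up placed.
	upperBound := 0
	for _, sg := range subGroups {
		upperBound += len(sg)
	}
	for _, sg := range subGroups {
		for i, p := range sg {
			if fits(p) {
				placed++
				continue
			}
			// First failure in this sub-group: treat its remaining pods
			// (including this one) as unschedulable and move on.
			upperBound -= len(sg) - i
			break
		}
		if upperBound < minCount {
			// Even if every untried pod fit, minCount is unreachable,
			// so the whole group is unschedulable in this cycle.
			return placed, false
		}
	}
	return placed, placed >= minCount
}
```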
sanposhiho
left a comment
/assign
// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
@thockin - I would appreciate your feedback from the API approver perspective.
the standard pod-by-pod scheduling cycle.
When the scheduler pops a Pod from the active queue, it checks if that Pod
belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler
As discussed in the Basic policy section above, I actually think that all pods that belong to workloads should go through this phase (whether they form gangs or just are from basic policy).
I acknowledge that for basic it will be kind of best-effort (if more pods were created we will get all of them, if we only observed 1 it will be just one), but that better opens doors for future extensions once we have pod templates etc. defined in Workload.
+1
I'd say that Workload scheduling is a phase that schedules PodGroups. Depending on what policy type the group has, the logic is different.
In the case of the Basic policy, the group becomes scheduled unconditionally; still, pods belonging to that group cannot be scheduled until the PG itself is scheduled.
Removed "Gang" from this sentence. I think we are aligned that the basic-policy pods should be scheduled by this phase
*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
I suggest resolving that (and potentially mentioning the risk of non-trivial changes in the Risks section).
2. The scheduler nominates the victims for preemption and the gang Pod
for scheduling on their place. This way, the gang can be attempted
without making any intermediate disruptions to the cluster.
* If the quorum is met, the scheduler continues scheduling the gang Pods pod-by-pod.
This should be adjusted once we settle down on the details in #5711
Currently these are not aligned :)
[It's a comment for myself too]
johnbelamaric
left a comment
I really think we should take the opportunity this release to shift to top-level PodGroup instances, with the templates living in workload. I believe this clears up a number of things and will make building on top of PodGroups much easier for the rest of the project.
// +required
Name string
// PodGroup is the name of the PodGroup within the Workload that this Pod
Embedding PodGroups in Workload as done today makes it much harder for clients to track lifecycle events - creation, deletion, resize, etc. of individual PodGroup replicas. And given that (AIUI) individual replica keys are not all stored in the workload, it may not even be possible to track PodGroup replica creation and deletion today without some sort of watch on Pods and inferring them from that.
I don't think it's too late to pull PodGroup out into its own top-level resource, but it is our last chance to do so. I think it's the better design and we should take this opportunity. With that update, clients can use all the standard API machinery rather than something that calculates deltas from changes in workload objects and/or watching Pods.
In that case Workload may contain PodGroupTemplates but the actual instances of PodGroups would be separate objects with a reference back to the workload and template. So we clearly separate lifecycle of the policy configuration and instances of groups based on that policy configuration.
This then would probably need to change from a WorkloadRef to a PodGroup resource name.
Barring that, I prefer what Eric suggests above.
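A rough Go sketch of that split (PodGroup as a hypothetical top-level resource referencing its Workload and template; none of these types exist today):

```go
package splitsketch

// Illustrative only: templates stay in the Workload, while each PodGroup
// instance is its own top-level object, so clients can watch creation,
// deletion, and resize with standard API machinery.
type WorkloadSpec struct {
	PodGroupTemplates []PodGroupTemplate // policy configuration only, no instances
}

type PodGroupTemplate struct {
	Name     string
	MinCount int32
}

// PodGroupSpec belongs to the hypothetical top-level instance object.
type PodGroupSpec struct {
	WorkloadName string // reference back to the owning Workload
	TemplateName string // template this instance was created from
	ReplicaKey   string // distinguishes instances created from the same template
}

type PodGroupStatus struct {
	ScheduledCount int32 // example status field clients might watch
}

// Pods would then reference the PodGroup object by name instead of carrying a
// WorkloadRef, as noted above.
type PodWorkloadReference struct {
	PodGroupName string
}
```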
*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
+1
Complex true workload controllers should implement any orchestration themselves. In other words, if there really are, say, dependencies between PodGroups, I think we should leave that complexity in the upper layer controller, and the scheduler just deal in PodGroups. So the controller would wait to create the second PodGroup until after it made sure the first one got scheduled. This does mean we don't have full workload atomic scheduling, only PodGroup atomic scheduling. But that level of complexity probably needs something more like Reservation.
end-to-end Pod scheduling flow, it is planned to place this new phase *before*
the standard pod-by-pod scheduling cycle.
When the scheduler pops a Pod from the active queue, it checks if that Pod
If PodGroup is a separate resource, we can watch for unscheduled PodGroups, and defer scheduling any Pod that references a PodGroup until that PodGroup has been scheduled. I think that's a cleaner design than going through the Pod indirection.
So, are we making a new PodGroup API that the API server would see? Let's clarify what we mean.
Note that the implementation of this specific logic might follow in a Beta stage
of this API field.
#### Delayed Preemption
The deletion of Victim takes some time due to resource release and other reasons. During this period, how can the binding process of the Preemptor be blocked to prevent the kubelet from rejecting the Pod due to resource over-provisioning on the node?
The exact design of delayed preemption is proposed in #5711. Let's move this discussion there
# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha
stage: beta
I don't completely follow everything in this proposal but promoting to beta and introducing two new feature gates raises a lot of eyebrows to me.
And now we have 4 separate feature gates for this feature...
What are the implications of these feature gates in relation to each other?
i.e. if GenericWorkload is disabled and GangScheduling enabled, what do we expect to happen?
Will any of these graduate at a different time than the other feature gates?
I don't completely follow everything in this proposal but promoting to beta and introducing two new feature gates raises a lot of eyebrows to me.
One of the gates (WorkloadBasicPolicyDesiredCount) will start in alpha, but the change is small enough that creating a new KEP for it would be an overhead, so we decided to put it in this KEP.
And now we have 4 separate feature gates for this feature..
Tentatively removed one feature gate from the list. I think the workload scheduling cycle could be covered using the gang scheduling gate.
What are the implications on these features gates in relation to each other?
GangScheduling requires GenericWorkload to work, as the latter defines the API that enables the gang scheduling. GenericWorkload itself can be enabled and used to express the "true" workload without a need for it to be gang scheduled by the kube-scheduler.
GenericWorkload is disabled and GangScheduling enabled what do we expect to happen?
The GangScheduling gate requires GenericWorkload to be enabled (this is enforced by the feature gate validation).
Honestly I would really like to see some k8s workloads adopting WAS. We have #5547 but that seems decoupled from this KEP. But if we promote the API to beta then we are going to discourage breaking API changes. Would it be possible to make beta promotion contingent on at least a few workload APIs proving out this design? I don't want to get into a situation where this API gets promoted to beta/GA, workload authors then figure out issues with the API, and we have to revisit it but can no longer break it.
*Proposed:* Implement it on PodGroup Level for Beta. However, future migration
to the Workload Level might necessitate non-trivial changes to the phase
introduced by this KEP.
+1
IMO we need to schedule PodGroups themselves, as they define scheduling constraints at the level of a group of pods, which means pods cannot be scheduled individually.
If we introduce groups of groups, then this phase would have to schedule such a group as a whole, since individual PodGroups could no longer be scheduled individually. Whenever we add them, extending the workload scheduling phase would be incremental.
If we ever had cross-workload scheduling constraints, I bet we'd need to schedule such workloads in one cycle as well, so a single workload may not necessarily be a boundary for this phase.
scheduler's internal cache. Pods are then pushed to the
active queue (restoring their original timestamps to ensure fairness)
to pass through the standard scheduling and binding cycle,
which will respect the nomination.
Can you elaborate on what "respect" means? Do you mean "consider" like currently, or something more? The alternative that I see is to consider the nomination as required; inability to follow the nomination would make the PodGroup unschedulable.
However, as the first implementation I'd stick to the current logic and use the word "consider". That means that Pods could pick a different node, but only within the constraint selected for the PodGroup. In other words, pod-by-pod scheduling cannot reschedule the PodGroup itself (assuming scheduling a PodGroup has "an effect", for instance picking a specific topology option).
Checking validity of the PodGroup level constraint may not be that easy and may require checking the state of other pods in the PodGroup or checking the state of DRA allocations. In the context of topology-aware workload scheduling I believe that it would be much easier to consider the nomination to be a hard requirement.
I meant "consider" here, but @44past4 point is valid. If we want to have a nominate semantic here, PodGroup-level plugins such as TAS would need to provide Filter extension points that will be able to verify the correctness of the new chosen node. Or, be able to provide the placement to the pod-by-pod cycle that will limit that phase to the topology/assignment chosen.
To provide a TAS Filter extension, we would need to store information about the selected placement (like the name of the Node label and its value) for the PodGroup. For this we would probably need to use the Workload status. This is doable, but it is not in the scope of the current TAS KEP proposal.
TAS could check in PreFilter where the pod group's previous pods were scheduled and, based on that, reject or allow nodes in Filter. Obviously, it won't be very efficient, but maybe it's sufficiently good for now.
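A minimal sketch of that PreFilter/Filter idea, under the assumption that the group's placement can be recovered from the nodes its scheduled members already landed on (all names hypothetical, not the actual TAS plugin):

```go
package tassketch

// Simplified: PreFilter records the topology domain already chosen by the
// scheduled members of the pod group; Filter then rejects nodes outside it.
type nodeInfo struct {
	Name   string
	Labels map[string]string
}

type groupState struct {
	topologyKey   string
	allowedDomain string // empty: no member scheduled yet, every domain allowed
}

// preFilter derives the allowed domain from the nodes that already host
// members of the pod group.
func preFilter(topologyKey string, memberNodes []nodeInfo) groupState {
	s := groupState{topologyKey: topologyKey}
	for _, n := range memberNodes {
		if d, ok := n.Labels[topologyKey]; ok {
			s.allowedDomain = d
			break
		}
	}
	return s
}

// filter allows only candidate nodes in the domain picked for the group.
func filter(s groupState, candidate nodeInfo) bool {
	return s.allowedDomain == "" || candidate.Labels[s.topologyKey] == s.allowedDomain
}
```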
reducing overall scheduling time.
* If a pod fits, it is tentatively nominated.
* If a pod cannot fit, the scheduler tries preemption by running
I wouldn't use the PostFilter phase in the workload scheduling phase, but rather "wait" for the proper workload preemption feature. Note that once we have pods that are batched (same signature), Opportunistic Batching does not have a way to find feasible-nodes-after-a-preemption-attempt anyway.
But, if workload-aware preemption is not there (or not enabled), we need to provide a way for gangs to perform preemptions. Otherwise, what is the point of having delayed preemption in the beta graduation criteria for gang scheduling?
Opportunistic batching just optimizes the default scheduling algorithm. I don't think preemption would be a big problem anyway because subsequent pod from the homogeneous sub-group can generate new placements for the rest of the batch.
* If `schedulableCount >= minCount`, the cycle succeeds. Pods are pushed
to the active queue and will soon attempt to be scheduled on their
nominated nodes in their own, pod-by-pod cycles. If a pod selects a
I don't see any strong reason why PostFilter should not run in pod-by-pod scheduling if we allow nominated pod to be changed. Yes, it's indeed an open question whether we want to allow changing nomination itself, but if we do, then we should allow to run PostFilter as well.
I think that allowing such disruptions in pod-by-pod scheduling would make workload scheduling hard to reason about. The reason we'll use the workload-aware preemption is to enable efficient and effective preemption. Otherwise, each pod in the group could preempt some pods or even workloads based on its own needs rather than the needs of the entire pod group. I don't see many advantages to allowing PostFilter to run in pod-by-pod scheduling.
Would need to inject the workload priority into each of the Pods
or somehow apply the lowest pod's priority to the rest of the group.
Alternative 2 (Store a gang representative):
I guess that one more option which would fall somewhere in between Alternative 2 and Alternative 3 could be changing the implementation of activeQ to operate on groups of Pods and treat single pods as groups of pods containing single pod only. Does this option make sense? Have you considered it?
optional. In the `Beta` timeframe, we may opportunistically apply this cycle to
`Basic` pod groups to leverage the batching performance benefits, but the
"all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to
schedule as many pods from such PodGroup as possible.
It is not clear what will happen in case of a pod scheduling failure here. Do we stop at the first failure or do we continue? If we continue, do we continue from the subsequent pod, or maybe from the subsequent sub-group?
Extended this part. The algorithm will continue from a pod from a subsequent sub-group.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: macsko
The full list of commands accepted by this bot can be found here. The pull request process is described here.
/hold
* If the group (i.e., at least `minCount` Pods) can be placed,
these Pods have the `.status.nominatedNodeName` set.
They are then effectively "reserved" on those nodes in the
scheduler's internal cache. Pods are then pushed to the
While reserving Pods using NominatedNodes eliminates the need to consider Pod dequeueing order, too many NominatedNodes can still degrade scheduling performance. A mechanism is needed to ensure that Pods can be dequeued as quickly as possible after a PodGroup resource is nominated.
What part of the scheduling performance are you mentioning? I think NominatedNodeNames shouldn't degrade it visibly.
However, it's still worth considering, so thanks for bringing up this topic. According to the
"Pods are then pushed to the active queue (restoring their original timestamps to ensure fairness) to pass through the standard scheduling and binding cycle, which will consider the nomination."
part, the pods will be put into the scheduling queue with their original timestamps, which should ensure they are scheduled pretty quickly. We would still like to allow high priority pods to jump in, as it's better to interrupt a not-yet-bound pod group rather than a running one.
I wouldn't make any changes to the way pod-by-pod scheduling works (including queue order) as I don't see any reason why the current behavior needs a change (including the order in which the pods are processed). I can imagine some other KEP could introduce some optimization or redefine pods priority, but in this KEP we should guarantee consistency and correctness, no matter in which order Pods are processed.
The key part of this KEP is to have an ability to propose a workload placement atomically and "reserve" necessary resources for it, which should guarantee the rest of the scheduling process follows the plan (unless cluster state changes in the meantime). In case of workload scheduling these guarantees are more important than the performance of pod-by-pod scheduling itself (which won't become worse anyway) as they solve other fundamental problems like scheduling deadlocks and enable multi-workload simulations (simulate scheduling only on resource nomination level).
In the current implementation, member Pods of a PodGroup reserve resources through NominatedNodes. If other Pods are scheduled before these Nominated member Pods, the Filter process will need to perform additional AddPod and Filter operations, which is the main performance issue I just mentioned.
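For readers less familiar with that code path, a simplified sketch of where the extra work comes from; it mirrors the general shape of the scheduler's "filter with nominated pods" step but is not the real implementation:

```go
package nominatedsketch

// Each pod that is merely *nominated* to a node still has to be temporarily
// added to that node's state whenever some other pod is filtered against it,
// which is the AddPod + extra Filter cost mentioned above.
type podRes struct{ cpuMilli int64 }
type nodeState struct{ freeCPUMilli int64 }

func fits(p podRes, n nodeState) bool { return p.cpuMilli <= n.freeCPUMilli }

// filterWithNominated evaluates an incoming pod against a node twice: once as
// if all pods nominated to the node were already running there, and once
// without them.
func filterWithNominated(incoming podRes, node nodeState, nominated []podRes) bool {
	withNominated := node
	for _, np := range nominated {
		withNominated.freeCPUMilli -= np.cpuMilli // per-nominated-pod "AddPod" work
	}
	if !fits(incoming, withNominated) {
		return false
	}
	return fits(incoming, node)
}
```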
opportunistic batching itself will provide significant improvements.
Future features like Topology Aware Scheduling can further improve other subsets of use cases.
#### Interaction with Basic Policy
IIUC, the all-or-nothing semantics of minCount are skipped with WSC enabled for the Basic policy, and desiredCount is only a hint for feasibility checks.
Given that, is it in-scope (maybe in the future) to support using both minCount and desiredCount together to express a more elastic gang behavior (e.g. minCount as a hard lower bound, desiredCount as the target batch size)?
the all-or-nothing semantics of `minCount` are skipped with WSC enabled for the Basic policy, and `desiredCount` is only a hint for feasibility checks.
Right, minCount is a parameter for the gang policy, i.e., it cannot be used with the basic policy, so the all-or-nothing semantics are not enforced. desiredCount is planned to be added only for the basic policy, as such a number will be useful for controllers to better describe their "true" workloads, and for the scheduler to potentially make better placement decisions.
Given that, is it in scope (maybe in the future) to support using both `minCount` and `desiredCount` together to express a more elastic gang behavior (e.g. `minCount` as a hard lower bound, `desiredCount` as the target batch size)?
Adding desiredCount to the gang policy can be considered, but not for the upcoming release. In fact, having this value would be useful while making placement decisions for the pod group.
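To make the split concrete, a Go-style sketch of the two policies as discussed above (hypothetical field layout; `desiredCount` is proposed only on the basic policy for now):

```go
package policysketch

// Illustrative layout, not the merged API.
type PodGroupPolicy struct {
	Basic *BasicPolicy // exactly one of the policies is set
	Gang  *GangPolicy
}

type BasicPolicy struct {
	// DesiredCount is a hint about how many pods the controller intends to
	// create; the scheduler may use it for better placement decisions but
	// never blocks pods on it.
	DesiredCount *int32
}

type GangPolicy struct {
	// MinCount keeps its all-or-nothing semantics: fewer than MinCount
	// schedulable pods means the whole group stays unscheduled.
	MinCount int32
	// A gang-level desiredCount (elastic gangs) is mentioned above as a
	// possible future extension, not targeted at the upcoming release.
}
```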
<<[/UNRESOLVED]>>
#### Scheduling Algorithm
Hmm, the first big missing point: what if pods are schedulable if we change the order of scheduling them?
Like let's say pod#1 is scheduled first and pod#2 is unschedulable. But, pod#2 might be schedulable if pod#1 is scheduled onto a different node.
This would mean pessimistically trying all combinations in the algorithm, which is not desirable from a computational point of view. This KEP is expected to deliver the foundation for future algorithms adapted to a workload-aware scheduling architecture. Although the basic algorithm may be suboptimal and even cause the scenario you described, it still eliminates deadlocks and livelocks in gang scheduling and is no worse than scheduling without WAS. We could try shuffling pods with the same priority in each workload scheduling cycle and perform periodic retries, but I don't think that's that important at this moment.
Same as in the TAS proposal, we should not compromise the correctness of scheduling. This is about correctness.
which is not desirable from a computational point of view.
I know we cannot literally try all combinations. But I do believe there's a better way than shuffling pods, at least. A JobSet can have more than a thousand pods, and I cannot say shuffling is where we can compromise. In some sense, shuffling is also not desirable from a computational point of view, because it is essentially the same as "literally trying all combinations in a random order until we find a place".
...and, assuming the workload contains many similar pods, shuffling is actually worse than literally trying all combinations within a single cycle until a place is found, because opportunistic batching could potentially work in the latter case (within the cache TTL).
Left a note for a rough idea before going to bed: We might be able to introduce a "swapping" process, very similar to preemption.
Let's say a workload has 5 groups defined by a signature (all with the same priority), and group#1 and group#2 are schedulable, while 1 pod in group#3 cannot find a place (even with normal preemption). Then we can let the pod in group#3 try to preempt pod(s) in group#1 and group#2 to explore the possibility of "swapping" with equal-priority pods. For example, if pods in group#1 are listed as preemption candidates, we'll check the cached node list for group#1 again and see if there's any spare node to which we can move group#1's pod. If there is, we can schedule the pod in group#3 there and schedule the victim pod from group#1 somewhere else. (Otherwise, we can say the entire workload is unschedulable.) After that, group#2 has to be rescheduled again, because a change in group#1's placements might impact it.
The worst case is that group#3, then group#4, then group#5 select group#1 as a victim again and again, because we then need to restart from group#2's scheduling over and over each time. We'd also need to track which combinations we already tried for each group so that we don't repeat the same calculation.
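A toy sketch of the swapping idea, under heavy simplifications (one slot per pod, an `allowed` set standing in for affinity; all names here are made up, this is not a real algorithm):

```go
package main

import "fmt"

// pod needs one slot and may be restricted to a subset of nodes.
type pod struct {
	name    string
	allowed map[string]bool // nil means "any node"
}

type node struct {
	name string
	free int
}

func fits(p pod, n *node) bool {
	return n.free > 0 && (p.allowed == nil || p.allowed[n.name])
}

// place puts the pod on the first node it fits on.
func place(p pod, nodes []*node, placed map[string]*node) bool {
	for _, n := range nodes {
		if fits(p, n) {
			n.free--
			placed[p.name] = n
			return true
		}
	}
	return false
}

// swap tries to free room for a stuck pod by relocating one already-placed
// pod (the "victim") to a different node it still fits on.
func swap(stuck pod, pods []pod, nodes []*node, placed map[string]*node) bool {
	for _, victim := range pods {
		from, ok := placed[victim.name]
		if !ok || !fits(stuck, &node{name: from.name, free: 1}) {
			continue // freeing this victim's node would not help the stuck pod
		}
		for _, to := range nodes {
			if to != from && fits(victim, to) {
				to.free--
				from.free++
				placed[victim.name] = to
				return place(stuck, nodes, placed)
			}
		}
	}
	return false
}

func main() {
	nodes := []*node{{name: "node-a", free: 1}, {name: "node-b", free: 1}}
	pods := []pod{
		{name: "g1-0"}, // fits anywhere
		{name: "g3-0", allowed: map[string]bool{"node-a": true}}, // only node-a
	}
	placed := map[string]*node{}
	for _, p := range pods {
		if !place(p, nodes, placed) && !swap(p, pods, nodes, placed) {
			fmt.Println("unschedulable:", p.name)
			return
		}
	}
	for name, n := range placed {
		fmt.Println(name, "->", n.name)
	}
}
```

Here g1-0 initially lands on node-a, which blocks g3-0; the swap moves g1-0 to node-b so that g3-0 can take node-a. The bookkeeping of already-tried combinations mentioned above is deliberately left out.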
| <<[/UNRESOLVED]>> | ||
| ``` | ||
|
|
||
| #### Scheduling Algorithm |
The second missing point: how should we support inter-pod constraints properly? (pod affinity and topology spread)
Like, let's say pods have inter pod affinity to each other and this affinity doesn't match any existing running pods outside this workload. Given the pod affinity plugin doesn't reject the first pod, it will keep returning Success to all pods. (note that NNN reserves the place, but it doesn't impact the evaluation of inter pod constraints, iirc)
But, if they actually enter the scheduling cycle and each pod is assume()ed, the pod affinity might return Unschedulable to some pods because pod affinity starts to work "properly" after assume() for some pods.
note that NNN reserves the place, but it doesn't impact the evaluation of inter pod constraints
It does impact them. If there are pods with NNNs, filters are called twice: once with these pods present and once without. Both filter runs have to pass. This means the affinity check will work correctly.
Furthermore, I think that during the workload scheduling algorithm, the pods will be assumed and reserved, not just nominated, to ensure that each plugin has the correct internal state. These will be replaced with actual nominations when the algorithm finishes.
Please also note that this algorithm is not expected to handle inter-pod constraints particularly well (i.e., much better than without WAS). Ultimately, we plan to allow such constraints to be defined at the pod group level and support them with customized algorithms.
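Very roughly, the flow I have in mind looks like the sketch below; the helpers are illustrative stand-ins, not the real cache or framework calls:

```go
package main

import "fmt"

// Illustrative only: assumePod and finish stand in for the scheduler's real
// cache/framework operations.
type podState struct {
	assumed   bool
	nominated string
}

type workloadCycle struct {
	states map[string]*podState
}

// assumePod pretends the pod is already on the node so that plugins with
// internal state (e.g. inter-pod affinity) see it during later placements.
func (c *workloadCycle) assumePod(pod, node string) {
	c.states[pod] = &podState{assumed: true}
	fmt.Println("assume", pod, "on", node)
}

// finish converts the temporary assumptions into nominations once the whole
// workload scheduling cycle has succeeded.
func (c *workloadCycle) finish(chosen map[string]string) {
	for pod, node := range chosen {
		s := c.states[pod]
		s.assumed = false  // forget the temporary assumption
		s.nominated = node // record the nominated node instead
		fmt.Println("nominate", pod, "on", node)
	}
}

func main() {
	c := &workloadCycle{states: map[string]*podState{}}
	chosen := map[string]string{"member-0": "node-1", "member-1": "node-2"}
	for pod, node := range chosen {
		c.assumePod(pod, node)
	}
	c.finish(chosen)
}
```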
If there are pods with NNNs, filters are called twice
OK, I missed it. You're right.
| as the current 10 seconds has proven to be too low in larger clusters, | ||
| so this might be the case for workloads. | ||
|
|
||
| 3. Retries |
I'd say it's a good starting point, but soon we'll need to enhance it: at least, we should consider QHints only for failed pods (i.e., pods that were unschedulable in the previous attempt).
It's not that easy, since we don't try all possible pod order combinations. For example, podA from a group can fit on either node1 or node2, but podB can only fit on node1. If podA is scheduled on node1, podB cannot be scheduled. When an event occurs that makes podA schedulable on node3, the algorithm might allow podA on node3 and podB on node1.
Honestly, QHints aren't that critical, especially in larger clusters, because the volume of events is already large enough. And because the order can matter, we could even try having some periodic retries for pod groups.
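As a toy illustration of the periodic-retry idea (made-up types, not the actual queue implementation):

```go
package main

import (
	"fmt"
	"time"
)

// requeuer is a stand-in for the queue component holding pod groups that were
// parked after a failed workload scheduling cycle.
type requeuer struct {
	unschedulableGroups []string
	activeQueue         chan string
}

// flushPeriodically moves every parked group back to the active queue after a
// fixed interval, regardless of which event (if any) made it schedulable,
// because with order-dependent placement a precise per-pod hint is hard.
func (r *requeuer) flushPeriodically(interval time.Duration, rounds int) {
	for i := 0; i < rounds; i++ {
		time.Sleep(interval)
		for _, g := range r.unschedulableGroups {
			r.activeQueue <- g
		}
		r.unschedulableGroups = nil
	}
}

func main() {
	r := &requeuer{
		unschedulableGroups: []string{"podgroup-a", "podgroup-b"},
		activeQueue:         make(chan string, 8),
	}
	go r.flushPeriodically(10*time.Millisecond, 1)
	for i := 0; i < 2; i++ {
		fmt.Println("retrying", <-r.activeQueue)
	}
}
```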
podA from a group can fit on either node1 or node2, but podB can only fit on node1. If podA is scheduled on node1, podB cannot be scheduled. When an event occurs that makes podA schedulable on node3, the algorithm might allow podA on node3 and podB on node1.
Hmm, good point. We cannot simply do my proposal above (only care about pods that failed in a previous cycle)
| * If `schedulableCount >= minCount`, the cycle succeeds. Pods are pushed | ||
| to the active queue and will soon attempt to be scheduled on their | ||
| nominated nodes in their own, pod-by-pod cycles. If a pod selects a | ||
| different node than its nomination during the individual cycle, the |
And, what if a pod turns out to be unschedulable in this pod-by-pod scheduling cycle? I believe we need to track how many pods are "actually" passing the scheduling cycle and waiting at WaitOnPermit, and how many pods are rejected in the scheduling cycle. Once {schedulableCount from the workload scheduling cycle} - {the number of pods rejected in the actual scheduling cycle} drops below minCount, we shouldn't attempt the rest of the workload's pods and should also reject all waiting pods.
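Roughly this kind of bookkeeping (hypothetical counters, just to show the condition):

```go
package main

import "fmt"

// Hypothetical bookkeeping for one pod group during the pod-by-pod phase.
type groupProgress struct {
	minCount         int // from the gang policy
	schedulableCount int // pods that passed the workload scheduling cycle
	rejectedCount    int // pods later rejected in their individual cycles
}

// shouldAbort says whether the remaining members can no longer reach
// minCount, in which case the pods waiting at WaitOnPermit should also be
// rejected instead of holding their nominated capacity.
func (g groupProgress) shouldAbort() bool {
	return g.schedulableCount-g.rejectedCount < g.minCount
}

func main() {
	g := groupProgress{minCount: 4, schedulableCount: 5, rejectedCount: 0}
	fmt.Println(g.shouldAbort()) // false: 5 - 0 >= 4

	g.rejectedCount = 2
	fmt.Println(g.shouldAbort()) // true: 5 - 2 < 4, reject the waiting pods too
}
```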
We already keep such numbers in the workload manager; we know how many pods are unscheduled, assumed and assigned (ref), so it shouldn't be a problem. We could add more info there, e.g., the number of pods that passed the workload scheduling cycle.
Nice, please add this to the KEP then, if you agree?
Once {schedulableCount from the workload scheduling cycle} - {the number of pods rejected in the actual scheduling cycle} drops below minCount, we shouldn't attempt the rest of the workload's pods and should also reject all waiting pods.
| * If a pod cannot fit, the scheduler tries preemption by running | ||
| the `PostFilter` extension point. *Note:* With workload-aware preemption | ||
| this phase will be replaced by a workload-level algorithm. | ||
| * If preemption is successful, the pod is nominated on the selected node. |
but the actual pod termination won't finish instantly. As proposed in #5711, shouldn't we regard this pod as "unschedulable" here?
Honestly, having two KEPs (this one and #5711) that both mention preemption is very confusing. I don't get how preemption eventually works, because the two point in slightly different directions.
cc @wojtek-t
@44past4's TAS proposal confused me more.
@macsko @44past4 @wojtek-t
At least for me, it is confusing that multiple KEP PRs explain the flow differently, and the overall final picture is blurred even though I understand what each KEP wants to propose. E.g., I don't understand where GeneratePlacements etc. are called in the flow explained here, what the goal for preemption is in this release, or what will be implemented by when in general. Please clarify in each KEP how the feature interacts with the other KEPs.
but the actual pod termination won't finish instantly. As proposed in #5711, shouldn't we regard this pod as "unschedulable" here?
The exact shape of the Delayed Preemption algorithm is not yet determined, but for now we should treat such a preemptor pod as schedulable in the workload scheduling algorithm, because we need to determine whether, with these preemptions, the pod group will be schedulable (≥ minCount); only if that condition is met will the preemptions be actuated.
Honestly, having two KEPs (this one and #5711) that both mention preemption is very confusing. I don't get how preemption eventually works, because the two point in slightly different directions.
I agree that it's not ideal for the KEPs to depend on each other and differ in details. I'll update this KEP to provide a clearer description of the preemption scenarios. I also believe that, with most people soon returning from vacation, we'll finally be able to align on the details of these three KEPs.
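To make the first point concrete, a small sketch of the counting (illustrative only; field names are made up):

```go
package main

import "fmt"

// In the workload scheduling cycle, a member that can only be placed after
// preempting victims still counts as schedulable; whether the preemptions are
// actually actuated is decided only once the whole group clears minCount.
type cycleResult struct {
	fitsDirectly       int // members that fit without preemption
	fitsWithPreemption int // members that fit only if victims are evicted
	minCount           int
}

func (r cycleResult) groupSchedulable() bool {
	return r.fitsDirectly+r.fitsWithPreemption >= r.minCount
}

func main() {
	r := cycleResult{fitsDirectly: 3, fitsWithPreemption: 2, minCount: 4}
	if r.groupSchedulable() {
		fmt.Println("group schedulable: actuate the planned preemptions, then bind")
	} else {
		fmt.Println("group unschedulable: do not evict anything")
	}
}
```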
| Introduce a completely separate queue for PodGroups alongside the `activeQ` for Pods. | ||
| The scheduler would pop the item (Pod or PodGroup) with the highest priority/earliest timestamp. | ||
| Pods belonging to an enqueued PodGroup won't be allowed in the `activeQ`. | ||
| * *Pros:* Clean separation of concerns. Can easily use the Workload scheduling priority. | ||
| Can report dedicated logs and metrics with less confusion to the user. | ||
| * *Cons:* Significant and non-trivial architectural change to the scheduling queue | ||
| and `scheduleOne` loop. |
I presume that we could treat single Pods as groups with a count of one. Did we cover this? I don't see it as a listed alternative.
I'm not in favor of this kind of change, which affects the default pod-by-pod scheduling flow. Ultimately, we may want to change this and treat all individual pods as groups of size one, but before we do so, we need to make sure that everything works correctly. Major modifications to the core scheduling logic have proven difficult and can cause subtle problems that lead to serious regressions.
Furthermore, we want to run a pod-by-pod cycle after the workload cycle for each of the workload's pods, so treating individual pods as one-member groups would mean unnecessary extra processing for them.
|
|
||
| More detail about scheduler changes is described in [this document](https://docs.google.com/document/d/1lMYkDuGqEoZWfE2b8vjQx0vHieOMyfmi6VHUef5-5is/edit?tab=t.0#heading=h.1p88ilpefnb). | ||
|
|
||
| #### Future plans |
We could explain this in our non-goals section (which can link to a Future Plans section later in the KEP for a detailed write-up of where we hope to get to).
It's already described in the non-goals section. Added a link to the future plans there.
| While Gang Scheduling focuses on atomic, all-or-nothing scheduling, there is a significant class | ||
| of workloads that requires best-effort optimization without the strict blocking semantics of a gang. | ||
|
|
||
| Currently, the `Basic` policy is a no-op. We propose extending the `Basic` policy |
In a KEP, "currently" should refer to the status quo ante, i.e., the situation before the KEP landed as implementable.
You can still use "currently" to mean something different, but qualify it with a phase (e.g., alpha) or a date.
Right, rephrased it.
|
|
||
| 1. Rejection | ||
|
|
||
| When the cycle fails, the scheduler rejects the entire group. |
IIUC, this type of rejection might currently only be surfaced on the Workload. A small suggestion: it would be more user-friendly to also surface this rejection information as a Pod event or condition.
Each pod will go through the standard rejection process, including updating the status and sending an event. Please note, we are also planning to introduce the workload status (kubernetes/kubernetes#135561), which will be helpful here. It is mentioned in the Scheduling Algorithm section, but I added it here as well.
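For example, a rejected member could end up with a condition like the one below via that standard path (a sketch using the core/v1 types; the exact reason and message wording here are illustrative):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Sketch of the per-pod condition each rejected gang member would get via
	// the standard rejection path; the message format is illustrative only.
	cond := v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionFalse,
		Reason:             v1.PodReasonUnschedulable,
		Message:            "pod group rejected: only 3 of minCount=4 members are schedulable",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```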