Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors - CNCF Fix #57152
base: main
Changes from all commits
5b1a1ea
d86f8b8
126133e
38ac7d1
```diff
@@ -380,11 +380,12 @@ def sync(self) -> None:
                         body = {"message": e.body}

                     retries = self.task_publish_retries[key]
-                    # In case of exceeded quota or conflict errors, requeue the task as per the task_publish_max_retries
+                    # In case of exceeded quota, conflict errors, or rate limiting, requeue the task as per the task_publish_max_retries
                     message = body.get("message", "")
                     if (
                         (str(e.status) == "403" and "exceeded quota" in message)
                         or (str(e.status) == "409" and "object has been modified" in message)
+                        or str(e.status) == "429"  # Add support for rate limiting errors
                     ) and (self.task_publish_max_retries == -1 or retries < self.task_publish_max_retries):
                         self.log.warning(
                             "[Try %s of %s] Kube ApiException for Task: (%s). Reason: %r. Message: %s",
```
```diff
@@ -682,6 +683,17 @@ def adopt_launched_task(
             )
         except ApiException as e:
             self.log.info("Failed to adopt pod %s. Reason: %s", pod.metadata.name, e)
+
+            # Log detailed information for rate limiting errors (429) which can cause task history loss
+            if str(e.status) == "429":
```
Comment on lines +687 to +688

Contributor

It is good to add explicit handling for 429 here. This helps avoid losing history when K8s throttles. Can we think about whether we want to apply the same backoff/retry cadence as the 403/409 cases for consistency? Could we also consider wrapping these transient checks into a small helper later, so error handling doesn't get too scattered?
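A minimal sketch of what such a helper could look like; the name `_is_transient_api_error` and its placement are hypothetical and not part of this PR:

```python
def _is_transient_api_error(status, message: str) -> bool:
    """Return True when a Kubernetes ApiException is transient and the task is worth requeueing.

    Mirrors the condition used in sync(): quota exhaustion (403), optimistic-concurrency
    conflicts (409), and API rate limiting (429).
    """
    status = str(status)
    return (
        (status == "403" and "exceeded quota" in message)
        or (status == "409" and "object has been modified" in message)
        or status == "429"  # the API server is throttling; safe to retry later
    )
```

The call site in `sync()` would then reduce to `_is_transient_api_error(e.status, message)` combined with the existing `task_publish_max_retries` check, and `adopt_launched_task` could reuse the same predicate.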
```diff
+                self.log.warning(
+                    "Kubernetes API rate limiting (429) prevented adoption of pod %s for task %s. "
+                    "This may cause task history loss if the task was previously running. "
+                    "Consider implementing rate limiting backoff or increasing API quota.",
+                    pod.metadata.name,
+                    ti_key,
+                )
```
Comment on lines +689 to +695

Contributor

This log message is helpful, and it gives great visibility when the scheduler hits throttling. Can we also surface a shorter summary in the UI or metrics so users notice API quota pressure sooner?
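As a rough sketch of the metrics idea, a counter could be emitted next to the warning using Airflow's `Stats` facade; the metric name below is made up for illustration and is not an existing Airflow metric:

```python
from airflow.stats import Stats

# Hypothetical metric name: emitted right before the warning log in adopt_launched_task,
# so operators can alert on API quota pressure from StatsD/OTel dashboards instead of
# having to scrape scheduler logs.
Stats.incr("kubernetes_executor.pod_adoption_rate_limited")
```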
```diff
+
             return

         del tis_to_flush_by_key[ti_key]
```
HTTP 429 also means that you should throttle calls, as the backend is overloaded. Can you also consider adding handling that follows the "Retry-After" response header?
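A minimal sketch of honoring `Retry-After`, assuming the client's `ApiException` carries the response headers (the official Kubernetes Python client typically exposes them as `e.headers`); the helper name and the one-second fallback are arbitrary choices for illustration:

```python
from kubernetes.client.rest import ApiException


def retry_after_seconds(e: ApiException, default: float = 1.0) -> float:
    """Return the server-suggested delay for a 429 response, falling back to a default backoff."""
    headers = getattr(e, "headers", None) or {}
    value = headers.get("Retry-After")
    try:
        # Retry-After may also be an HTTP date; for simplicity, fall back to the default in that case.
        return float(value) if value is not None else default
    except (TypeError, ValueError):
        return default
```

The executor could then delay the requeue (or the next adoption attempt) by this amount instead of retrying immediately on the next sync loop.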