Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors - CNCF Fix #57152
Conversation
- Handle 429 errors in KubernetesExecutor task publishing retry logic
- Detect orphaned tasks and record TaskInstanceHistory in failure handler
- Add detailed logging for rate limiting scenarios
Move orphaned task detection before end_date assignment to ensure TaskInstanceHistory is recorded for tasks that become detached during scheduler restarts due to Kubernetes API 429 errors.
- Remove taskinstance part
0c59f7e to 38ac7d1
```python
self.log.warning(
    "Kubernetes API rate limiting (429) prevented adoption of pod %s for task %s. "
    "This may cause task history loss if the task was previously running. "
    "Consider implementing rate limiting backoff or increasing API quota.",
    pod.metadata.name,
    ti_key,
)
```
This log message is helpful and gives great visibility when the scheduler hits throttling. Can we also surface a shorter summary in the UI or metrics so users notice API quota pressure sooner?
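The metrics idea above could be sketched like this. Note this is a hypothetical illustration, not code from the PR: `_StatsStub` stands in for a StatsD-style client (Airflow exposes one as `airflow.stats.Stats`, whose `incr` method increments a named counter), and the metric name is made up for the example.

```python
class _StatsStub:
    """Stand-in for a StatsD-style client so this sketch runs standalone."""

    def __init__(self):
        self.counters = {}

    def incr(self, name: str, count: int = 1) -> None:
        # Increment a named counter, creating it on first use.
        self.counters[name] = self.counters.get(name, 0) + count


Stats = _StatsStub()


def record_adoption_rate_limited(pod_name: str) -> None:
    # Emit a counter alongside the warning log so dashboards and alerts
    # can notice API quota pressure without scraping scheduler logs.
    Stats.incr("kubernetes_executor.pod_adoption_rate_limited")


record_adoption_rate_limited("example-pod")
```

A counter like this can then be graphed or alerted on, which is usually faster to notice than a warning buried in scheduler logs.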
```python
# Log detailed information for rate limiting errors (429) which can cause task history loss
if str(e.status) == "429":
```
It is good to add explicit handling for 429 here; it helps avoid losing history when Kubernetes throttles. Can we think about whether we want to apply the same backoff/retry cadence as the 403/409 cases for consistency? We could also consider wrapping these transient checks into a small helper later, so error handling doesn't get too scattered.
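The helper suggested above might look like the following. This is a hypothetical sketch, not part of the PR: the function name `is_transient_k8s_error` and its plain `(status, message)` signature are invented for illustration, with the predicates copied from the diff.

```python
def is_transient_k8s_error(status: int, message: str) -> bool:
    """Return True for Kubernetes API errors that are worth retrying.

    Centralizes the transient-error checks so the retry logic reads as a
    single call instead of a growing boolean expression.
    """
    if status == 429:
        # Rate limited: always transient, the server is asking us to back off.
        return True
    if status == 403 and "exceeded quota" in message:
        # Quota exhaustion can clear up as other resources are released.
        return True
    if status == 409 and "object has been modified" in message:
        # Optimistic-concurrency conflict: safe to re-read and retry.
        return True
    return False
```

Keeping the conditions in one place also means a future case (say, 503) is a one-line change with a single comment explaining it.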
```python
if (
    (str(e.status) == "403" and "exceeded quota" in message)
    or (str(e.status) == "409" and "object has been modified" in message)
    or str(e.status) == "429"  # Add support for rate limiting errors
```
HTTP 429 also means the caller should throttle requests because the backend is overloaded. Can you also consider adding handling for the `Retry-After` response header?
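Per RFC 9110, `Retry-After` can carry either a number of seconds or an HTTP-date, so honoring it means handling both forms. A minimal sketch (hypothetical helper, not in the PR; `retry_after_seconds` and its defaults are invented for illustration):

```python
import email.utils
import time


def retry_after_seconds(headers: dict, default: float = 1.0) -> float:
    """Parse a Retry-After header into a delay in seconds.

    Handles both forms the header allows: delta-seconds ("30") and an
    HTTP-date ("Wed, 21 Oct 2025 07:28:00 GMT"). Falls back to `default`
    when the header is absent or unparseable.
    """
    value = headers.get("Retry-After")
    if value is None:
        return default
    if value.isdigit():
        # Delta-seconds form.
        return float(value)
    try:
        # HTTP-date form: compute the remaining wait relative to now.
        parsed = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    return max(0.0, parsed.timestamp() - time.time())
```

The computed delay could then feed whatever backoff the retry loop already uses, so the scheduler waits at least as long as the API server asked for.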
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
Split from PR #55159.
Description
This PR fixes issue #49517 where TaskInstanceHistory records were lost when Kubernetes API rate limiting (429 errors) prevented task adoption during scheduler restarts.
Problem
When using KubernetesExecutor or CeleryKubernetesExecutor:

- Orphaned TaskInstances could be left in the `RUNNING` state with `end_date` set to `None`, so no `TaskInstanceHistory` record was written

Solution
KubernetesExecutor: Add 429 error handling to retry logic and detailed logging for adoption failures
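The 429 retry handling described above can be illustrated with a small standalone sketch. Everything here is hypothetical scaffolding, not the PR's code: `RateLimited` stands in for a kubernetes `ApiException` with status 429, and `publish_with_retry` is an invented name for the retry wrapper.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for a kubernetes ApiException with status 429."""


def publish_with_retry(publish, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call publish(), retrying 429-style failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return publish()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Exponential backoff with jitter, so many schedulers hitting
            # the same overloaded API server do not retry in lockstep.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))


# Usage: a publish call that is throttled twice, then succeeds.
attempts = {"n": 0}


def flaky_publish():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "published"


result = publish_with_retry(flaky_publish, sleep=lambda _: None)
```

Injecting `sleep` keeps the backoff testable; in production the default `time.sleep` applies the real delay.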
Impact

Before: TaskInstanceHistory records were lost when Kubernetes API 429 errors prevented task adoption during scheduler restarts.

After: orphaned tasks are detected and their TaskInstanceHistory is recorded, with detailed logging of rate limiting scenarios.
Fixes: #49517
Related: #49244