@HsiuChuanHsu

Split from PR #55159.

Description

This PR fixes issue #49517, in which TaskInstanceHistory records were lost when Kubernetes API rate limiting (429 errors) prevented task adoption during scheduler restarts.

Problem

When using KubernetesExecutor or CeleryKubernetesExecutor:

  1. Task pods launch successfully, but the K8s API starts returning 429 errors
  2. KubernetesJobWatcher crashes, causing a scheduler restart
  3. The scheduler fails to re-adopt the running pods because the 429s continue
  4. Tasks are marked orphaned and their state is reset to None
  5. TaskInstanceHistory is not recorded because the state is no longer RUNNING
  6. The Airflow UI shows no log links for the failed attempts

Solution

KubernetesExecutor: Add 429 error handling to retry logic and detailed logging for adoption failures

Impact

Before:

Task Running → K8s API 429 → Scheduler Restart → Task Orphaned → State Reset to None → No History → Missing UI Logs

After:

Task Running → K8s API 429 → Scheduler Restart → Task Orphaned → State Reset to None → History Recorded → UI Logs Available

Fixes: #49517
Related: #49244



@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Oct 23, 2025
- Handle 429 errors in KubernetesExecutor task publishing retry logic
- Detect orphaned tasks and record TaskInstanceHistory in failure handler
- Add detailed logging for rate limiting scenarios
Move orphaned task detection before end_date assignment to ensure
TaskInstanceHistory is recorded for tasks that become detached
during scheduler restarts due to Kubernetes API 429 errors.
Comment on lines +689 to +695
self.log.warning(
    "Kubernetes API rate limiting (429) prevented adoption of pod %s for task %s. "
    "This may cause task history loss if the task was previously running. "
    "Consider implementing rate limiting backoff or increasing API quota.",
    pod.metadata.name,
    ti_key,
)

This log message is helpful and gives good visibility when the scheduler is being throttled. Can we also surface a shorter summary in the UI or in metrics, so users notice API quota pressure sooner?
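
One way to produce the shorter summary suggested here is to count 429 responses over a sliding window and report the count. The sketch below is illustrative only: `ThrottleTracker`, its method names, and the 60-second window are hypothetical and are not part of this PR or of Airflow. In practice the count would presumably be passed to whatever metrics backend the deployment already uses.

```python
import time
from collections import deque


class ThrottleTracker:
    """Hypothetical helper: counts 429 responses seen in a sliding time window."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self._clock = clock      # injectable so the tracker can be tested
        self._events = deque()   # timestamps of recorded 429 responses

    def record(self):
        """Call once for every 429 returned by the Kubernetes API."""
        self._events.append(self._clock())

    def count(self):
        """Drop timestamps older than the window and return what is left."""
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events)

    def summary(self):
        """Short one-line summary, suitable for a log line or a UI banner."""
        return (
            f"{self.count()} Kubernetes API 429 responses "
            f"in the last {int(self.window_seconds)}s"
        )
```

A single gauge derived from `count()` would probably be enough to alert on quota pressure without parsing scheduler logs.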

Comment on lines +687 to +688
# Log detailed information for rate limiting errors (429) which can cause task history loss
if str(e.status) == "429":

It is good to add explicit handling for 429 here; it helps avoid losing history when K8s throttles. Should we apply the same backoff/retry cadence as in the 403/409 cases, for consistency? We could also wrap these transient-error checks in a small helper later, so the error handling doesn't get too scattered.
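
A minimal sketch of the kind of helper this comment has in mind, assuming the three conditions from the retry check in this PR (403 with "exceeded quota", 409 with "object has been modified", and any 429). The name `is_transient_api_error` is hypothetical, and the function takes a plain status and message rather than a Kubernetes `ApiException` so that it runs without the client library installed.

```python
def is_transient_api_error(status, message=""):
    """Hypothetical helper: True when a failed Kubernetes API call is worth retrying.

    `status` may be an int or a str; the diff compares str(e.status), so the
    same normalization is applied here.
    """
    code = str(status)
    return (
        (code == "403" and "exceeded quota" in message)
        or (code == "409" and "object has been modified" in message)
        or code == "429"  # rate limiting: always treated as transient
    )
```

With a helper like this, both the publishing retry path and the adoption path could share one definition of "transient" instead of repeating the status comparisons.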

@eladkal eladkal requested a review from romsharon98 October 24, 2025 03:05
if (
    (str(e.status) == "403" and "exceeded quota" in message)
    or (str(e.status) == "409" and "object has been modified" in message)
    or str(e.status) == "429"  # Add support for rate limiting errors

HTTP 429 also means that the caller should throttle its requests because the backend is overloaded. Can you also consider honoring the "Retry-After" response header when deciding how long to wait?
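
As a rough illustration of this suggestion, the sketch below turns a `Retry-After` header value into a wait time. The header can carry either a number of seconds or an HTTP-date, so both forms are handled. The function name `retry_after_seconds` and the `default` and `cap` values are assumptions made for the example, not settings from this PR or from Airflow.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=1.0, cap=60.0):
    """Hypothetical helper: seconds to wait before retrying after a 429.

    Falls back to `default` when the header is missing or unparseable, and
    never returns more than `cap` (both values are illustrative).
    """
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay expressed as a non-negative integer of seconds.
        delay = float(value)
    else:
        # Form 2: an HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        current = now or datetime.now(timezone.utc)
        delay = (when - current).total_seconds()
    return max(0.0, min(delay, cap))
```

A retry loop could sleep for this value when the header is present and fall back to its existing backoff otherwise.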

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 10, 2026
Development

Successfully merging this pull request may close these issues.

TI history missing after Scheduler restart during K8s 429 error
