Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors - CNCF Fix #57152

HsiuChuanHsu · 2025-10-23T11:45:26Z

Split from PR #55159.

Description

This PR fixes issue #49517 where TaskInstanceHistory records were lost when Kubernetes API rate limiting (429 errors) prevented task adoption during scheduler restarts.

Problem

When using KubernetesExecutor or CeleryKubernetesExecutor:

1. Task pods launch successfully, but the K8s API starts returning 429 errors.
2. KubernetesJobWatcher crashes, causing a Scheduler restart.
3. The Scheduler fails to re-adopt running pods due to continued 429s.
4. Tasks are marked orphaned, with state reset to None.
5. TaskInstanceHistory is not recorded since state ≠ RUNNING.
6. The Airflow UI shows missing log links for failed attempts.

Solution

KubernetesExecutor: Add 429 error handling to retry logic and detailed logging for adoption failures

Impact

Before:

Task Running → K8s API 429 → Scheduler Restart → Task Orphaned → State Reset to None → No History → Missing UI Logs

After:

Task Running → K8s API 429 → Scheduler Restart → Task Orphaned → State Reset to None → History Recorded → UI Logs Available

Fixes: #49517
Related: #49244

KoviAnusha · 2025-10-24T02:48:55Z

...iders/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py

+                self.log.warning(
+                    "Kubernetes API rate limiting (429) prevented adoption of pod %s for task %s. "
+                    "This may cause task history loss if the task was previously running. "
+                    "Consider implementing rate limiting backoff or increasing API quota.",
+                    pod.metadata.name,
+                    ti_key,
+                )


This log message is helpful and gives good visibility when the scheduler hits throttling. Can we also surface a shorter summary in the UI or in metrics so users notice API quota pressure sooner?
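One way to picture the "shorter summary" idea is a small counter that records each 429 and emits a single aggregated line per time window, instead of one warning per pod. This is only a minimal sketch using the standard library: `ThrottleSummary`, its `interval` parameter, and the log wording are hypothetical and not part of this PR or of the provider. In a real change the count would more likely be sent to a metrics counter.

```python
import logging
import time
from typing import Callable

log = logging.getLogger(__name__)


class ThrottleSummary:
    """Hypothetical helper: counts 429 responses and logs one summary per interval."""

    def __init__(self, interval: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self.interval = interval
        self.clock = clock  # injectable so the window logic can be tested
        self.count = 0
        self.window_start = clock()

    def record(self) -> bool:
        """Count one 429 response; return True if a summary line was emitted."""
        self.count += 1
        now = self.clock()
        elapsed = now - self.window_start
        if elapsed < self.interval:
            return False
        log.warning("Kubernetes API returned 429 %d time(s) in the last %.0f s", self.count, elapsed)
        # Start a new window after reporting.
        self.count = 0
        self.window_start = now
        return True
```

The executor would call `record()` at the point where the 429 is detected; the per-pod warning from the diff could stay at a lower log level if it turns out to be too noisy.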

KoviAnusha · 2025-10-24T02:48:56Z

...iders/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py

+            # Log detailed information for rate limiting errors (429) which can cause task history loss
+            if str(e.status) == "429":


It is good to add explicit handling for 429 here; it helps avoid losing history when K8s throttles. Should we apply the same backoff/retry cadence as the 403/409 cases for consistency? We could also wrap these transient-error checks in a small helper later, so the error handling doesn't get too scattered.
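A helper along these lines could keep the transient-error checks in one place. It is a sketch only: the function name `is_transient_api_error` is hypothetical, and the three conditions simply mirror the ones visible in the diff hunks for `kubernetes_executor.py` (403 with "exceeded quota", 409 with "object has been modified", and any 429).

```python
def is_transient_api_error(status, message: str) -> bool:
    """Return True for API errors that are worth retrying.

    `status` may be an int or a str, so it is normalized with str()
    in the same way the executor code compares `str(e.status)`.
    """
    code = str(status)
    if code == "403":
        # Quota exhaustion can clear once other pods finish.
        return "exceeded quota" in message
    if code == "409":
        # Optimistic-concurrency conflict; a retry re-reads the object.
        return "object has been modified" in message
    # Rate limiting is always treated as transient.
    return code == "429"
```

With such a helper, the publish-retry branch and the adoption path could share one definition of "retryable" rather than repeating the status comparisons.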

jscheffl · 2025-11-13T20:23:22Z

...iders/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py

                    if (
                        (str(e.status) == "403" and "exceeded quota" in message)
                        or (str(e.status) == "409" and "object has been modified" in message)
+                        or str(e.status) == "429"  # Add support for rate limiting errors


HTTP 429 also means that you should throttle calls because the backend is overloaded. Can you also consider adding handling that honors the "Retry-After" response header?
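For reference, a rough sketch of turning a `Retry-After` value into a wait time. Per the HTTP specification the header carries either a number of seconds or an HTTP date, so both forms are handled. The function name `retry_after_seconds`, the `default` and `cap` values, and the assumption that the response headers are available as a plain mapping are all illustrative, not taken from this PR.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(
    headers: Optional[Mapping[str, str]],
    default: float = 1.0,
    cap: float = 60.0,
    now: Optional[datetime] = None,
) -> float:
    """Return how long to wait before retrying, based on Retry-After."""
    value = None
    for name, raw in (headers or {}).items():
        if name.lower() == "retry-after":  # header names are case-insensitive
            value = raw.strip()
            break
    if not value:
        return default
    if value.isdigit():
        # delay-seconds form, e.g. "Retry-After: 5"
        delay = float(value)
    else:
        # HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        current = now or datetime.now(timezone.utc)
        delay = (when - current).total_seconds()
    # Never wait a negative time, and never longer than the cap.
    return min(max(delay, 0.0), cap)
```

The cap is a design choice so that a large or malformed header value cannot stall the scheduler loop for an unbounded time.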

github-actions · 2026-01-10T00:19:08Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

HsiuChuanHsu requested review from hussein-awala and jedcunningham as code owners October 23, 2025 11:45

boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Oct 23, 2025

HsiuChuanHsu added 4 commits October 23, 2025 19:45

fix(k8s): Preserve task history during API rate limiting

5b1a1ea

- Handle 429 errors in KubernetesExecutor task publishing retry logic
- Detect orphaned tasks and record TaskInstanceHistory in failure handler
- Add detailed logging for rate limiting scenarios

fix(k8s): Update tests to reflect new 429 error retry behavior

d86f8b8

fix: Record history for orphaned tasks during K8s executor failures

126133e

Move orphaned task detection before end_date assignment to ensure TaskInstanceHistory is recorded for tasks that become detached during scheduler restarts due to Kubernetes API 429 errors.

Split original PR into cncf & taskinstance parts

38ac7d1

- Remove taskinstance part

HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error-cncf-provider branch from 0c59f7e to 38ac7d1 Compare October 23, 2025 11:45

HsiuChuanHsu mentioned this pull request Oct 23, 2025

Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors - Task Instance Fix #55159

Open

KoviAnusha reviewed Oct 24, 2025

eladkal requested a review from romsharon98 October 24, 2025 03:05

jscheffl reviewed Nov 13, 2025

github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 10, 2026
