Skip to content

Conversation

@HsiuChuanHsu
Copy link
Contributor

@HsiuChuanHsu HsiuChuanHsu commented Sep 1, 2025

Description

This PR fixes issue #49517 where TaskInstanceHistory records were lost when Kubernetes API rate limiting (429 errors) prevented task adoption during scheduler restarts.

Problem

When using KubernetesExecutor or CeleryKubernetesExecutor:

  1. Task pods launch successfully but K8s API starts returning 429 errors
  2. KubernetesJobWatcher crashes, causing Scheduler restart
  3. Scheduler fails to re-adopt running pods due to continued 429s
  4. Tasks are marked orphaned with state reset to None
  5. TaskInstanceHistory is not recorded since state ≠ RUNNING
  6. Airflow UI shows missing log links for failed attempts

Solution

KubernetesExecutor: Add 429 error handling to retry logic and detailed logging for adoption failures
TaskInstance: Detect orphaned tasks (state=None + start_date set + end_date unset ) and record TaskInstanceHistory

Impact

Before:

Task Running → K8s API 429 → Scheduler Restart → Task Orphaned → State Reset to None → 
No History → Missing UI Logs

After:

Task Running → K8s API 429 → Scheduler Restart → Task Orphaned → State Reset to None → 
History Recorded → UI Logs Available

Fixes: #49517
Related: #49244


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Sep 1, 2025
@HsiuChuanHsu HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from edbf605 to f0ab406 Compare September 1, 2025 23:07
@uranusjr
Copy link
Member

uranusjr commented Sep 2, 2025

Fix looks reasonable but tests don’t agree. This should include a test case too.

@HsiuChuanHsu HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from f0ab406 to 01962e3 Compare September 2, 2025 13:07
@eladkal eladkal requested a review from uranusjr September 2, 2025 14:34
@HsiuChuanHsu HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from 75dfd76 to 63a5ad1 Compare September 2, 2025 14:42
@HsiuChuanHsu
Copy link
Contributor Author

HsiuChuanHsu commented Sep 2, 2025

Fix looks reasonable but tests don’t agree. This should include a test case too.

When implementing unit tests for the new orphaned task detection logic in the fetch_handle_failure_context method of taskinstance.py, found a critical timing bug that kept it from working right.

ti.end_date = timezone.utcnow()

Original problem

  • The orphaned task detection relies on the condition: ti.end_date is None
  • However, ti.end_date = timezone.utcnow() was executed earlier in the method
  • This made the ti.end_date is None condition impossible to satisfy
def fetch_handle_failure_context(...):
    # ... other code ...
    ti.end_date = timezone.utcnow()  # ← Sets end_date first
    
    # ... later in the method ...
    if ti.state is None and ti.start_date is not None and ti.end_date is None:
        # ← This condition can never be True because end_date was already set above

Solution
The orphaned task detection logic was moved to execute before the end_date assignment:

def fetch_handle_failure_context(...):
    # Check for orphaned task BEFORE setting end_date
    if (
        ti.is_eligible_to_retry()
        and ti.state is None
        and ti.start_date is not None
        and ti.end_date is None  # ← Now this can actually be True
    ):
        # Handle orphaned task detection and history recording
    
    # THEN set end_date
    ti.end_date = timezone.utcnow()

@eladkal eladkal added this to the Airflow 3.1.0 milestone Sep 14, 2025
@kaxil kaxil modified the milestones: Airflow 3.1.0, Airflow 3.1.1 Sep 18, 2025
@eladkal eladkal force-pushed the fix/taskinstance-history-k8s-429-error branch from 63a5ad1 to e9dbba6 Compare October 20, 2025 12:38
@eladkal
Copy link
Contributor

eladkal commented Oct 20, 2025

@HsiuChuanHsu this PR combines changes to airflow core and k8s provider. If these changes are not coupled can you please separate? Providers and core have different release cycles

@HsiuChuanHsu
Copy link
Contributor Author

@eladkal Sure, will work on it.

@kaxil kaxil modified the milestones: Airflow 3.1.1, Airflow 3.1.2 Oct 21, 2025
@HsiuChuanHsu HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from e9dbba6 to 7d78088 Compare October 23, 2025 11:37
@HsiuChuanHsu HsiuChuanHsu changed the title Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors - Task Instance Fix Oct 23, 2025
@HsiuChuanHsu
Copy link
Contributor Author

@eladkal Just submit a new PR to split the changes from the original one here.
Thanks for the reminder!

@eladkal eladkal added area:core and removed area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Oct 30, 2025
@eladkal eladkal requested a review from kaxil October 30, 2025 08:19
@kaxil kaxil modified the milestones: Airflow 3.1.2, Airflow 3.1.3 Oct 31, 2025
@HsiuChuanHsu HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from 7d78088 to 94b9ea9 Compare November 26, 2025 15:01
- Handle 429 errors in KubernetesExecutor task publishing retry logic
- Detect orphaned tasks and record TaskInstanceHistory in failure handler
- Add detailed logging for rate limiting scenarios
Move orphaned task detection before end_date assignment to ensure
TaskInstanceHistory is recorded for tasks that become detached
during scheduler restarts due to Kubernetes API 429 errors.
@HsiuChuanHsu HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from 94b9ea9 to d9b125d Compare November 26, 2025 15:01
@potiuk potiuk modified the milestones: Airflow 3.1.5, Airflow 3.1.6 Dec 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TI history missing after Scheduler restart during K8s 429 error

6 participants