@njnu-seafish
Contributor

Purpose of the pull request

close #17758

Brief change log

Publish a TaskFatalLifecycleEvent if initializeTaskExecutionContext fails while the master is trying to dispatch a task.

Verify this pull request

This pull request is already covered by existing tests.
First, test cases have been added.
Second, we have already verified and tested this change in our production environment.

Pull Request Notice

If your pull request contains an incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md

@njnu-seafish
Contributor Author

The second version of the code has been verified to work in our actual test environment. The specific logs are as follows:

[WI-193][TI-0] - 2025-12-25 09:58:53.578 INFO [ds-workflow-eventbus-worker-5] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskStartLifecycleEvent{task=sh01}
[WI-193][TI-0] - 2025-12-25 09:58:53.578 INFO [ds-workflow-eventbus-worker-5] o.a.d.s.m.e.w.l.h.AbstractWorkflowLifecycleEventHandler:[52] - Fired workflow flow_condition_import_no_environment-20251225095821727 LifecycleEvent[WorkflowTopologyLogicalTransitionWithTaskFinishLifecycleEvent{task=sh001taskState=SUCCESS}] with state: RUNNING_EXECUTION
[WI-193][TI-0] - 2025-12-25 09:58:53.583 INFO [ds-workflow-eventbus-worker-5] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskDispatchLifecycleEvent{task=sh01}
[WI-193][TI-0] - 2025-12-25 09:58:53.583 INFO [ds-workflow-eventbus-worker-5] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task sh01 TaskStartLifecycleEvent{task=sh01} with state SUBMITTED_SUCCESS
[WI-193][TI-0] - 2025-12-25 09:58:53.585 ERROR [ds-workflow-eventbus-worker-5] o.a.d.s.m.e.t.s.TaskSubmittedStateAction:[117] - Failed to initialize task execution context, taskName: sh01
java.lang.IllegalArgumentException: Cannot find the environment: 144873539254368
at org.apache.dolphinscheduler.server.master.runner.TaskExecutionContextFactory.getEnvironmentConfigFromDB(TaskExecutionContextFactory.java:217)
at org.apache.dolphinscheduler.server.master.runner.TaskExecutionContextFactory.createTaskExecutionContext(TaskExecutionContextFactory.java:102)
at org.apache.dolphinscheduler.server.master.engine.task.runnable.TaskExecutionRunnable.initializeTaskExecutionContext(TaskExecutionRunnable.java:148)
at org.apache.dolphinscheduler.server.master.engine.task.statemachine.TaskSubmittedStateAction.onDispatchEvent(TaskSubmittedStateAction.java:115)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.TaskDispatchLifecycleEventHandler.handle(TaskDispatchLifecycleEventHandler.java:40)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.TaskDispatchLifecycleEventHandler.handle(TaskDispatchLifecycleEventHandler.java:31)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.AbstractTaskLifecycleEventHandler.handle(AbstractTaskLifecycleEventHandler.java:46)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.AbstractTaskLifecycleEventHandler.handle(AbstractTaskLifecycleEventHandler.java:32)
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.doFireSingleEvent(WorkflowEventBusFireWorker.java:158)
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.doFireSingleWorkflowEventBus(WorkflowEventBusFireWorker.java:125)
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.fireAllRegisteredEvent(WorkflowEventBusFireWorker.java:89)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
[WI-193][TI-0] - 2025-12-25 09:58:53.587 INFO [ds-workflow-eventbus-worker-5] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: TaskFatalLifecycleEvent{task=sh01, endTime=Thu Dec 25 09:58:53 GMT+08:00 2025}
[WI-193][TI-0] - 2025-12-25 09:58:53.587 ERROR [ds-workflow-eventbus-worker-5] o.a.d.s.m.e.WorkflowEventBusFireWorker:[91] - Fire event failed for WorkflowExecuteRunnable: flow_condition_import_no_environment-20251225095821727
org.apache.dolphinscheduler.server.master.engine.exceptions.WorkflowEventFireException: Failed to fire event: TaskDispatchLifecycleEvent{task=sh01}
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.doFireSingleWorkflowEventBus(WorkflowEventBusFireWorker.java:147)
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.fireAllRegisteredEvent(WorkflowEventBusFireWorker.java:89)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.dolphinscheduler.server.master.exception.TaskExecutionContextCreateException: Cannot find the environment: 144873539254368
at org.apache.dolphinscheduler.server.master.engine.task.statemachine.TaskSubmittedStateAction.onDispatchEvent(TaskSubmittedStateAction.java:118)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.TaskDispatchLifecycleEventHandler.handle(TaskDispatchLifecycleEventHandler.java:40)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.TaskDispatchLifecycleEventHandler.handle(TaskDispatchLifecycleEventHandler.java:31)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.AbstractTaskLifecycleEventHandler.handle(AbstractTaskLifecycleEventHandler.java:46)
at org.apache.dolphinscheduler.server.master.engine.task.lifecycle.handler.AbstractTaskLifecycleEventHandler.handle(AbstractTaskLifecycleEventHandler.java:32)
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.doFireSingleEvent(WorkflowEventBusFireWorker.java:158)
at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.doFireSingleWorkflowEventBus(WorkflowEventBusFireWorker.java:125)
... 8 common frames omitted
[WI-193][TI-0] - 2025-12-25 09:58:53.694 INFO [ds-workflow-eventbus-worker-10] o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: WorkflowTopologyLogicalTransitionWithTaskFinishLifecycleEvent{task=sh01taskState=FAILURE}
[WI-193][TI-0] - 2025-12-25 09:58:53.694 INFO [ds-workflow-eventbus-worker-10] o.a.d.s.m.e.t.l.h.AbstractTaskLifecycleEventHandler:[47] - Fired task sh01 TaskFatalLifecycleEvent{task=sh01, endTime=Thu Dec 25 09:58:53 GMT+08:00 2025} with state SUBMITTED_SUCCESS
[WI-193][TI-0] - 2025-12-25 09:58:53.694 INFO [ds-workflow-eventbus-worker-10] o.a.d.s.m.e.w.l.h.AbstractWorkflowLifecycleEventHandler:[47] - Begin fire workflow flow_condition_import_no_environment-20251225095821727 LifecycleEvent[WorkflowTopologyLogicalTransitionWithTaskFinishLifecycleEvent{task=sh01taskState=FAILURE}] with state: RUNNING_EXECUTION

@SbloodyS added this to the 3.4.1 milestone Dec 29, 2025
@SbloodyS added the bug label Dec 29, 2025
@AllArgsConstructor
public class TaskFatalLifecycleEvent extends AbstractTaskLifecycleEvent {

private final ITaskExecutionRunnable taskExecutionRunnable;

Check notice

Code scanning / CodeQL

Missing Override annotation Note

This method overrides AbstractTaskLifecycleEvent.getTaskExecutionRunnable; it is advisable to add an Override annotation.

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes issue #17758 by implementing proper failure handling when TaskExecutionContext initialization fails during task dispatch. Previously, when initialization failed, tasks would remain in an incomplete state rather than being marked as failed.

Key Changes:

  • Introduced TaskFatalLifecycleEvent to handle catastrophic task failures
  • Added exception handling in TaskSubmittedStateAction to catch initialization failures
  • Implemented automatic task failure marking when initialization fails, with support for retries and condition task workflows
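
The flow described by these changes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `FatalDispatchSketch`, `ContextCreateException`, `FatalEvent`, and the in-memory queue are hypothetical stand-ins for the master's real classes and event bus, and the retry and condition-task handling is left out.

```java
import java.util.ArrayDeque;
import java.util.Date;
import java.util.Queue;

// Hypothetical, simplified stand-ins; these are not the real DolphinScheduler classes.
final class FatalDispatchSketch {

    /** Unchecked exception raised when the task execution context cannot be built. */
    static final class ContextCreateException extends RuntimeException {
        ContextCreateException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Minimal fatal event: the task name plus the time the task was declared failed. */
    static final class FatalEvent {
        final String taskName;
        final Date endTime;

        FatalEvent(String taskName, Date endTime) {
            this.taskName = taskName;
            this.endTime = endTime;
        }
    }

    // In-memory FIFO queue standing in for the workflow event bus.
    private final Queue<FatalEvent> eventBus = new ArrayDeque<>();

    /** State-action step: wrap any initialization failure in the dedicated exception. */
    void onDispatch(Runnable initializeContext) {
        try {
            initializeContext.run();
        } catch (RuntimeException ex) {
            throw new ContextCreateException(ex.getMessage(), ex);
        }
    }

    /** Worker step: publish a fatal event when dispatch fails; returns true on success. */
    boolean fireDispatch(String taskName, Runnable initializeContext) {
        try {
            onDispatch(initializeContext);
            return true;
        } catch (ContextCreateException ex) {
            eventBus.add(new FatalEvent(taskName, new Date()));
            return false;
        }
    }

    Queue<FatalEvent> publishedEvents() {
        return eventBus;
    }

    public static void main(String[] args) {
        FatalDispatchSketch worker = new FatalDispatchSketch();
        boolean dispatched = worker.fireDispatch("sh01", () -> {
            throw new IllegalArgumentException("Cannot find the environment");
        });
        // prints "dispatched=false, fatalEvents=1"
        System.out.println("dispatched=" + dispatched
                + ", fatalEvents=" + worker.publishedEvents().size());
    }
}
```

Returning `false` is a simplification made for the sketch. In the logs posted earlier in this conversation, the worker publishes the fatal event and then still reports the failed dispatch as a WorkflowEventFireException.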

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| TaskExecutionContextCreateException.java | Changed exception hierarchy from MasterException to RuntimeException |
| ExceptionUtils.java | Added utility method to check for TaskExecutionContextCreateException |
| TaskSubmittedStateAction.java | Added try-catch block around task context initialization to throw TaskExecutionContextCreateException |
| TaskLifecycleEventType.java | Added new FATAL event type for catastrophic failures |
| TaskFatalLifecycleEvent.java | New event class representing fatal task failures with end time tracking |
| TaskFatalLifecycleEventHandler.java | New handler to process fatal lifecycle events |
| ITaskStateAction.java | Added onFatalEvent method interface for handling fatal events |
| AbstractTaskStateAction.java | Implemented onFatalEvent with retry logic, condition task handling, and failure chain marking |
| WorkflowEventBusFireWorker.java | Added logic to publish TaskFatalLifecycleEvent when TaskExecutionContextCreateException occurs |
| WorkflowStartTestCase.java | Added three test methods to verify fatal task handling scenarios |
| workflow_with_one_fake_task_fatal.yaml | Test configuration for single fatal task scenario |
| workflow_with_one_condition_task_with_one_fake_predecessor_fatal.yaml | Test configuration for condition task with fatal predecessor |
| workflow_with_one_forbidden_condition_task_with_one_fake_predecessor_fatal.yaml | Test configuration for forbidden condition task with fatal predecessor |
| workflow_with_one_forbidden_condition_task_with_one_fake_predecessor_failed.yaml | Updated name and description for clarity |
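
The ExceptionUtils entry mentions a check for TaskExecutionContextCreateException, but its implementation is not shown in this conversation. One common way to write such a check is to walk the cause chain, sketched below. The names `CauseChainSketch`, `hasCause`, and `ContextCreateException`, and the depth limit of 32, are hypothetical and chosen only for illustration.

```java
// Hypothetical sketch; the real ExceptionUtils method may look different.
final class CauseChainSketch {

    /** Stand-in for the dedicated context-creation exception. */
    static final class ContextCreateException extends RuntimeException {
        ContextCreateException(String message) {
            super(message);
        }
    }

    /** Returns true if the throwable or any of its causes is an instance of the given type. */
    static boolean hasCause(Throwable throwable, Class<? extends Throwable> type) {
        Throwable current = throwable;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; current != null && depth < 32; depth++) {
            if (type.isInstance(current)) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped = new IllegalStateException("Failed to fire event",
                new ContextCreateException("Cannot find the environment"));
        System.out.println(hasCause(wrapped, ContextCreateException.class)); // prints "true"
        System.out.println(hasCause(new IllegalStateException("other"),
                ContextCreateException.class)); // prints "false"
    }
}
```

Walking the chain matters when the exception reaches the worker wrapped in another exception, as in the logs above, where WorkflowEventFireException carries TaskExecutionContextCreateException as its cause.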

Comment on lines +139 to +144
AbstractTaskLifecycleEvent taskLifecycleEvent = (AbstractTaskLifecycleEvent) lifecycleEvent;
final TaskFatalLifecycleEvent taskFatalEvent = TaskFatalLifecycleEvent.builder()
.taskExecutionRunnable(taskLifecycleEvent.getTaskExecutionRunnable())
.endTime(new Date())
.build();
workflowEventBus.publish(taskFatalEvent);

Copilot AI Dec 29, 2025

Unsafe type casting without verification. While the TaskExecutionContextCreateException is currently only thrown when handling TaskDispatchLifecycleEvent (which extends AbstractTaskLifecycleEvent), it would be safer to check the type before casting to prevent potential ClassCastException if the code structure changes. Consider adding an instanceof check before the cast.

Suggested change
- AbstractTaskLifecycleEvent taskLifecycleEvent = (AbstractTaskLifecycleEvent) lifecycleEvent;
- final TaskFatalLifecycleEvent taskFatalEvent = TaskFatalLifecycleEvent.builder()
-         .taskExecutionRunnable(taskLifecycleEvent.getTaskExecutionRunnable())
-         .endTime(new Date())
-         .build();
- workflowEventBus.publish(taskFatalEvent);
+ if (lifecycleEvent instanceof AbstractTaskLifecycleEvent) {
+     AbstractTaskLifecycleEvent taskLifecycleEvent = (AbstractTaskLifecycleEvent) lifecycleEvent;
+     final TaskFatalLifecycleEvent taskFatalEvent = TaskFatalLifecycleEvent.builder()
+             .taskExecutionRunnable(taskLifecycleEvent.getTaskExecutionRunnable())
+             .endTime(new Date())
+             .build();
+     workflowEventBus.publish(taskFatalEvent);
+ } else {
+     log.warn(
+             "TaskExecutionContextCreateException occurred for non-task lifecycle event: {}",
+             lifecycleEvent.getClass().getName(),
+             ex);
+ }

} catch (Exception ex) {
log.error("Failed to initialize task execution context, taskName: {}", taskInstance.getName(), ex);
throw new TaskExecutionContextCreateException(ex.getMessage());

Copilot AI Dec 29, 2025

Unnecessary blank line after the throw statement. This extra blank line should be removed to maintain consistent code formatting.

Suggested change

code: 1
version: 1
projectCode: 1
description: This is a fake workflow with one forbidden condition task which has one predecessor fatal

Copilot AI Dec 29, 2025

The description reads awkwardly. It should be "This is a fake workflow with one forbidden condition task that has one predecessor that failed fatally" or "...with one predecessor that encountered a fatal error". The current phrasing "has one predecessor fatal" is grammatically incorrect.

Suggested change
- description: This is a fake workflow with one forbidden condition task which has one predecessor fatal
+ description: This is a fake workflow with one forbidden condition task that has one predecessor that failed fatally

code: 1
version: 1
projectCode: 1
description: This is a fake workflow with one condition task which has one predecessor fatal

Copilot AI Dec 29, 2025

The description reads awkwardly. It should be "This is a fake workflow with one condition task that has one predecessor that failed fatally" or "...with one predecessor that encountered a fatal error". The current phrasing "has one predecessor fatal" is grammatically incorrect.

Suggested change
- description: This is a fake workflow with one condition task which has one predecessor fatal
+ description: This is a fake workflow with one condition task that has one predecessor that failed fatally

@AllArgsConstructor
public class TaskFatalLifecycleEvent extends AbstractTaskLifecycleEvent {

private final ITaskExecutionRunnable taskExecutionRunnable;

Copilot AI Dec 29, 2025

This method overrides AbstractTaskLifecycleEvent.getTaskExecutionRunnable; it is advisable to add an Override annotation.
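
As a generic illustration of what this notice asks for, the sketch below puts `@Override` on an explicitly written getter. `OverrideSketch`, `AbstractEvent`, and `FatalEvent` are hypothetical stand-ins. How the getter is generated in the PR's class is not visible in this conversation, so this may not match the eventual fix.

```java
// Hypothetical stand-ins for the PR's event classes.
final class OverrideSketch {

    /** Stand-in for the abstract lifecycle event base class. */
    abstract static class AbstractEvent {
        abstract String getTaskName();
    }

    /** Stand-in for the fatal event subclass. */
    static final class FatalEvent extends AbstractEvent {
        private final String taskName;

        FatalEvent(String taskName) {
            this.taskName = taskName;
        }

        // With @Override, the compiler reports an error if this method stops
        // matching a supertype method, for example after a rename or a
        // parameter change in the base class.
        @Override
        String getTaskName() {
            return taskName;
        }
    }

    public static void main(String[] args) {
        AbstractEvent event = new FatalEvent("sh01");
        System.out.println(event.getTaskName()); // prints "sh01"
    }
}
```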

Labels

backend, bug, test

Development

Successfully merging this pull request may close these issues.

[Bug] [Master] If a task fails during initialization, it will neither be dispatched by the Master nor can it be properly killed.

2 participants