Fix(CDAP-21219): Handle CancelJob on DONE Dataproc jobs gracefully #16066
Draft · cjac wants to merge 1 commit into cdapio:develop
Conversation
This commit addresses an issue where CDAP pipelines were incorrectly
marked as FAILED when ephemeral Dataproc cluster deprovisioning
attempted to cancel a job that had already completed.
The following changes are included:
1. **RemoteExecutionTwillController:** Added a RuntimeJobStatus check
before attempting to force kill a remote process in the `complete()`
method's exception handler. This prevents sending a kill command
to jobs already in a terminal state.
2. **AbstractDataprocProvisioner:** Modified `deleteClusterWithStatus`
to specifically detect and handle the error returned by the Dataproc
API when a CancelJob request is made on a job in the DONE state.
This error is now logged as a warning and does not cause the
pipeline to be marked as FAILED.
3. **Unit Tests:** Added new unit tests for both
`RemoteExecutionTwillController` and `DataprocProvisioner` to
verify the new logic and prevent regressions.
4. **CONTRIBUTING.rst:** Updated the issues link to the current JIRA URL.
These changes ensure that the pipeline status accurately reflects the
execution result even if there are timing issues during cluster
deprovisioning.
Fixes: b/460875216
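Change 1 can be sketched as below. This is a minimal illustration, not the actual CDAP implementation: the enum values, `RemoteProcessController` interface, and `forceKillIfStillRunning` helper are simplified stand-ins for the real runtime types.

```java
// Simplified sketch of the terminal-state guard described in change 1.
// The enum values and RemoteProcessController are stand-ins for the
// real CDAP runtime types.
class TerminalStateGuard {

  enum RuntimeJobStatus { STARTING, RUNNING, COMPLETED, FAILED, STOPPED }

  interface RemoteProcessController {
    RuntimeJobStatus getStatus();
    void kill(RuntimeJobStatus expectedStatus);
  }

  static boolean isTerminal(RuntimeJobStatus status) {
    return status == RuntimeJobStatus.COMPLETED
        || status == RuntimeJobStatus.FAILED
        || status == RuntimeJobStatus.STOPPED;
  }

  /** Returns true only if a kill command was actually sent. */
  static boolean forceKillIfStillRunning(RemoteProcessController controller) {
    RuntimeJobStatus current = controller.getStatus();
    if (isTerminal(current)) {
      // The job already finished on its own; sending a CancelJob now would
      // make Dataproc return an error and the pipeline would be marked FAILED.
      return false;
    }
    controller.kill(RuntimeJobStatus.RUNNING);
    return true;
  }
}
```

The design choice is to re-read the status immediately before killing, so a job that completed during the wait loop is left alone.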
Force-pushed from f5a62b0 to 225ca01
**sahusanket** (Contributor) reviewed on Nov 25, 2025:
```java
try {
  LOG.debug("Force termination of remote process for program run {}", programRunId);
  remoteProcessController.kill(RuntimeJobStatus.RUNNING);
  RuntimeJobStatus currentStatus = remoteProcessController.getStatus();
```
The prior logic:

- While the status is RUNNING, keep checking every second.
- If this exceeds 5 seconds, throw an `IllegalStateException`.

So the moment `getStatus` is called, the 5-second check completes without any gap and execution immediately falls into the catch block for force termination. I agree that within those few milliseconds the Dataproc job status could become DONE. But with the new extra check, the gap for this error still exists, so this intermittent wrongful killing of the pipeline could still happen. My point is that there is not much time gap between the existing `getStatus() == RUNNING` check and `remoteProcessController.kill()`, and the same applies to the extra check.
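The race the reviewer describes can be sketched as follows: a status check just before `kill()` still leaves a window in which the job can finish, so the benign-error handling has to live at the cancel site itself. `Canceller` and `cancelTolerantly` are hypothetical names; only the error-message text mirrors what the diff shows Dataproc returning for CancelJob on a finished job.

```java
// Minimal sketch, assuming the Dataproc error text shown in the diff:
// even after a pre-kill status check, the job can reach DONE before the
// cancel call lands, so the cancel itself must tolerate that outcome.
// Canceller and cancelTolerantly are illustrative names only.
class CancelRace {

  static final String DONE_MSG = "is not supported in the current state: DONE";

  interface Canceller {
    void cancel() throws Exception;
  }

  /** Returns true if the job was cancelled, false if it had already finished. */
  static boolean cancelTolerantly(Canceller canceller) {
    try {
      canceller.cancel();
      return true;
    } catch (Exception e) {
      String msg = String.valueOf(e.getMessage());
      if (msg.contains(DONE_MSG)) {
        // The job raced to completion between the status check and the
        // cancel call; treat this as benign instead of failing the run.
        return false;
      }
      throw new RuntimeException(e);
    }
  }
}
```

Handling the error at the cancel site closes the time-of-check-to-time-of-use window that a separate status check cannot.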
```java
    ((DataprocRuntimeJobDetail) jobDetail).getJobId(), statusDetails));
LOG.error("Dataproc Job {}", jobDetail.getStatus(), e);
// Check if the failure is due to attempting to cancel a job already DONE
if (jobDetail.getStatus() == RuntimeJobStatus.FAILED
    && statusDetails.contains("is not supported in the current state: DONE")) {
```
This Dataproc exception seems to be covered under FAILED_PRECONDITION, and we are already handling it in DataprocRuntimeJobManager.java#L923, so this check might not work. We currently assume failure for all FAILED_PRECONDITION conditions; maybe we can add a specific check there instead.
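A narrower guard along the lines the reviewer suggests might look like this. The `Code` enum stands in for the gRPC status code, and `isBenignCancelOnDone` is a hypothetical helper, not an existing CDAP method.

```java
// Sketch of a message-specific FAILED_PRECONDITION check: only the
// "current state: DONE" CancelJob error is treated as benign, while
// every other FAILED_PRECONDITION still counts as a failure.
// The Code enum stands in for the gRPC status code.
class PreconditionClassifier {

  enum Code { OK, FAILED_PRECONDITION, NOT_FOUND, INTERNAL }

  static boolean isBenignCancelOnDone(Code code, String details) {
    return code == Code.FAILED_PRECONDITION
        && details != null
        && details.contains("is not supported in the current state: DONE");
  }
}
```

Matching on both the status code and the message keeps the existing FAILED_PRECONDITION handling intact while carving out the one case that should not fail the pipeline.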