fix: jobs report success during dependency outages #592
Open
silent-cipher wants to merge 3 commits into
Pull request overview
This PR fixes job result reporting so dependency outages (e.g., PDP subgraph down, baseline DB load failure, on-chain provider registry load failure) are surfaced as job failures instead of being recorded as successful completions, ensuring jobs_completed_total{handler_result="error"} reflects real execution failures.
Changes:
- Make `WalletSdkService.loadProviders()` return a success boolean and use it in the `providers_refresh` job to fail the job when on-chain provider loading fails.
- Update `DataRetentionService.pollDataRetention()` to throw a dependency-specific error when critical dependencies are unavailable, including on batch subgraph query failures.
- Expand `data-retention.service.spec.ts` coverage to assert job-failure behavior for dependency outages while preserving "success" for empty-but-healthy provider sets and transient per-provider processing failures.
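As a rough illustration of the first bullet, the sketch below shows a job handler that fails when `loadProviders()` returns `false`. It is not the PR's code: the `ProviderLoader` interface, the `handleProvidersRefresh` name, and the error message are hypothetical, and the only behavior taken from the description is that the boolean is now checked instead of discarded.

```typescript
// Hypothetical stand-in for WalletSdkService; only the boolean return
// value of loadProviders() is taken from the PR description.
interface ProviderLoader {
  loadProviders(): Promise<boolean>;
}

// Assumed handler shape: throwing is what marks the job run as failed.
async function handleProvidersRefresh(loader: ProviderLoader): Promise<void> {
  const loaded = await loader.loadProviders();
  if (!loaded) {
    // Previously the boolean was dropped and the job always reported success.
    throw new Error("providers_refresh: on-chain provider load failed");
  }
}
```

Throwing, rather than returning a status, keeps the handler aligned with a wrapper that derives the result label from whether the handler completed.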
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| apps/backend/src/wallet-sdk/wallet-sdk.service.ts | Return a boolean from loadProviders() so callers can treat dependency failures as job failures. |
| apps/backend/src/jobs/jobs.service.ts | Fail providers_refresh job when loadProviders() reports failure (so handler_result becomes error). |
| apps/backend/src/data-retention/data-retention.service.ts | Throw DataRetentionDependencyError for missing endpoint / baseline load failure / subgraph meta & query failures; fail after recording partial batch results. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Add/update tests to verify dependency outages now fail the job while partial/transient failures remain success. |
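The "fail after recording partial batch results" behavior in the `data-retention.service.ts` row might look roughly like the sketch below. Everything except the `DataRetentionDependencyError` name is an assumption made for illustration: the `pollBatches` function, the `Batch` type, and the `queryBatch` and `record` callbacks are hypothetical, and the real constructor signature of the error class is not shown in this PR page.

```typescript
// Error type named in the PR; this constructor shape is an assumption.
class DataRetentionDependencyError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "DataRetentionDependencyError";
  }
}

type Batch = string[]; // hypothetical: a batch of provider addresses

// queryBatch and record are hypothetical callbacks standing in for the
// subgraph query and the write of per-batch results.
async function pollBatches(
  batches: Batch[],
  queryBatch: (batch: Batch) => Promise<number[]>,
  record: (batch: Batch, results: number[]) => void,
): Promise<void> {
  let failedBatches = 0;
  for (const batch of batches) {
    let results: number[];
    try {
      results = await queryBatch(batch);
    } catch {
      failedBatches += 1; // remember the outage, keep processing other batches
      continue;
    }
    record(batch, results); // healthy batches are still recorded
  }
  if (failedBatches > 0) {
    // Fail the job once, after all healthy batches have been recorded.
    throw new DataRetentionDependencyError(
      `subgraph query failed for ${failedBatches} of ${batches.length} batches`,
    );
  }
}
```

Deferring the throw to the end of the loop means a single bad batch does not discard data from the batches that did succeed, while the run as a whole is still reported as a failure.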
Problem
`data_retention_poll` recorded `jobs_completed_total{handler_result="success"}` on every run, even when the check couldn't execute because the PDP subgraph was down. A subgraph outage silently took down mainnet + calibration data-retention checks for ~6 days with no alarm, stalling SP approval on insufficient `dataSetChallengeStatus` samples.

`providers_refresh` had the same latent bug: it discarded the boolean from `loadProviders()` and hardcoded success.

A job's success should mean the check was able to run, so a dependency outage must be recorded as a job failure.
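The rule that success means "the check was able to run" can be sketched with a wrapper that derives the result label from whether the handler throws. This is a simplified assumption about how a `recordJobExecution`-style wrapper could behave, not the repository's implementation: the signature, the in-memory `jobsCompletedTotal` map, and the `increment` helper are hypothetical stand-ins for the real metrics code.

```typescript
type HandlerResult = "success" | "error";

// In-memory stand-in for the jobs_completed_total counter (hypothetical).
const jobsCompletedTotal = new Map<string, number>();

function increment(jobType: string, result: HandlerResult): void {
  const key = `${jobType}:${result}`;
  jobsCompletedTotal.set(key, (jobsCompletedTotal.get(key) ?? 0) + 1);
}

// Assumed wrapper shape: the label depends on whether the handler throws
// and is never hardcoded by the handler itself.
async function recordJobExecution(
  jobType: string,
  handler: () => Promise<void>,
): Promise<HandlerResult> {
  try {
    await handler();
    increment(jobType, "success");
    return "success";
  } catch {
    increment(jobType, "error");
    return "error";
  }
}
```

With a wrapper like this, a handler that swallows a dependency outage and returns normally is counted as a success, which is the failure mode described above.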
Changes
- `data-retention.service.ts`: `pollDataRetention()` now fails the job (throws `DataRetentionDependencyError`) when a check dependency is unavailable - missing subgraph endpoint, baseline-load failure, subgraph meta fetch failure, or a subgraph query failure. Transient per-provider failures and an empty-but-healthy provider set stay success. On a batch query failure the poll still records healthy batches before failing, and a failure is logged once.
- `wallet-sdk.service.ts`: `loadProviders()` returns its success boolean instead of discarding it.
- `jobs.service.ts`: `handleProvidersRefreshJob` fails the job when `loadProviders()` reports failure.

These propagate through `recordJobExecution`, which records `handler_result="error"`.

For #591
This PR doesn't close it because of the remaining task below: