fix: jobs report success during dependency outages #592

Open
silent-cipher wants to merge 3 commits into main from fix/job-vs-check-failure

Conversation

@silent-cipher
Collaborator

Problem

data_retention_poll recorded jobs_completed_total{handler_result="success"} on every run, even when the check couldn't execute because the PDP subgraph was down. A subgraph outage silently took down mainnet + calibration data-retention checks for ~6 days with no alarm, stalling SP approval on insufficient dataSetChallengeStatus samples. providers_refresh had the same latent bug: it discarded the boolean from loadProviders() and hardcoded success.

A job's success should mean the check was able to run, so a dependency outage must be recorded as a job failure.

Changes

  • data-retention.service.ts: pollDataRetention() now fails the job (throws DataRetentionDependencyError) when a check dependency is unavailable: a missing subgraph endpoint, a baseline-load failure, a subgraph meta fetch failure, or a subgraph query failure. Transient per-provider failures and an empty-but-healthy provider set still count as success. On a batch query failure the poll still records the healthy batches before failing, and the failure is logged once.
  • wallet-sdk.service.ts: loadProviders() returns its success boolean instead of discarding it.
  • jobs.service.ts: handleProvidersRefreshJob fails the job when loadProviders() reports failure.
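The batch behaviour described for pollDataRetention() can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `pollBatches`, `queryBatch`, `recordSamples`, and `log` are hypothetical names, the shape of `DataRetentionDependencyError` is assumed, the callbacks are synchronous for brevity, and it assumes the poll keeps going through the remaining batches after one fails.

```typescript
// Assumed shape: the PR names this error type but does not show its fields.
class DataRetentionDependencyError extends Error {
  constructor(readonly dependency: string, message: string) {
    super(message);
    this.name = "DataRetentionDependencyError";
  }
}

// Hypothetical helper: poll every batch, record the healthy ones,
// log the failure once, then fail the job if any batch query failed.
function pollBatches(
  batches: string[][],
  queryBatch: (providers: string[]) => number, // throws when the subgraph query fails
  recordSamples: (providers: string[], samples: number) => void,
  log: (message: string) => void,
): void {
  let failedBatches = 0;
  let firstFailure: unknown;

  for (const batch of batches) {
    try {
      const samples = queryBatch(batch);
      recordSamples(batch, samples); // healthy batches are still recorded
    } catch (err) {
      failedBatches += 1;
      if (failedBatches === 1) firstFailure = err; // keep only the first cause
    }
  }

  if (failedBatches > 0) {
    const detail =
      firstFailure instanceof Error ? firstFailure.message : String(firstFailure);
    const message =
      `subgraph query failed for ${failedBatches}/${batches.length} batches: ${detail}`;
    log(message); // logged once per poll, not once per failed batch
    throw new DataRetentionDependencyError("subgraph", message);
  }
}
```

Deferring the throw until after the loop is what lets partial results survive a partial outage while the run as a whole is still reported as failed.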

These propagate through recordJobExecution, which records handler_result="error".
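As a rough illustration of that propagation, the sketch below shows a handler that throws when `loadProviders()` returns false and a wrapper that turns the throw into an `error` label. It is a simplified stand-in under stated assumptions: the real recordJobExecution signature is not shown in this PR, the `Map` stands in for the jobs_completed_total counter, the functions are synchronous for brevity, and this wrapper swallows the error where the real one may rethrow.

```typescript
type HandlerResult = "success" | "error";

// Stand-in for the jobs_completed_total counter, keyed by "jobType|handler_result".
const jobsCompletedTotal = new Map<string, number>();

// Hypothetical wrapper: run the handler and label the run by whether it threw.
function recordJobExecution(jobType: string, handler: () => void): HandlerResult {
  let result: HandlerResult = "success";
  try {
    handler();
  } catch {
    result = "error"; // any thrown error, including a dependency error, marks the job failed
  }
  const key = `${jobType}|${result}`;
  jobsCompletedTotal.set(key, (jobsCompletedTotal.get(key) ?? 0) + 1);
  return result;
}

// Hypothetical handler body: fail the job instead of discarding the boolean.
function handleProvidersRefreshJob(loadProviders: () => boolean): void {
  if (!loadProviders()) {
    throw new Error("providers_refresh: loadProviders() reported failure");
  }
}
```

With this shape, `recordJobExecution("providers_refresh", () => handleProvidersRefreshJob(() => false))` increments the `error` series rather than the `success` series.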

For #591

This PR doesn't close it because of the remaining task below:

  • The job-completion alarm fires on a sustained dependency outage, including when only error results exist.

Copilot AI review requested due to automatic review settings June 3, 2026 18:25
@FilOzzy FilOzzy added this to FOC Jun 3, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC Jun 3, 2026
@silent-cipher silent-cipher self-assigned this Jun 3, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes job result reporting so dependency outages (e.g., PDP subgraph down, baseline DB load failure, on-chain provider registry load failure) are surfaced as job failures instead of being recorded as successful completions, ensuring jobs_completed_total{handler_result="error"} reflects real execution failures.

Changes:

  • Make WalletSdkService.loadProviders() return a success boolean and use it in the providers_refresh job to fail the job when on-chain provider loading fails.
  • Update DataRetentionService.pollDataRetention() to throw a dependency-specific error when critical dependencies are unavailable, including on batch subgraph query failures.
  • Expand data-retention.service.spec.ts coverage to assert job-failure behavior for dependency outages while preserving “success” for empty-but-healthy provider sets and transient per-provider processing failures.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| apps/backend/src/wallet-sdk/wallet-sdk.service.ts | Return a boolean from loadProviders() so callers can treat dependency failures as job failures. |
| apps/backend/src/jobs/jobs.service.ts | Fail providers_refresh job when loadProviders() reports failure (so handler_result becomes error). |
| apps/backend/src/data-retention/data-retention.service.ts | Throw DataRetentionDependencyError for missing endpoint / baseline load failure / subgraph meta & query failures; fail after recording partial batch results. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Add/update tests to verify dependency outages now fail the job while partial/transient failures remain success. |

Comment thread apps/backend/src/data-retention/data-retention.service.ts Outdated
Comment thread apps/backend/src/data-retention/data-retention.service.ts
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread apps/backend/src/data-retention/data-retention.service.ts
Comment thread apps/backend/src/jobs/jobs.service.ts

Projects

Status: 📌 Triage

3 participants