docs: data set termination job design documentation#588

Open
silent-cipher wants to merge 6 commits into main from feat/data-set-deletion-job

Conversation

@silent-cipher
Collaborator

@silent-cipher silent-cipher commented Jun 1, 2026

Summary

Adds docs/data-set-termination.md, a design doc for the proposed calibration-only data_set_termination job.

Why this job is needed: Once all providers reach the MIN_NUM_DATASETS_FOR_CHECKS cap, data_set_creation stops exercising the on-chain createDataSet path. During the PDPVerifier v3.4.0 rollout, this meant dealbot missed the calibration outage entirely. The termination job restores continuous canary coverage by periodically freeing a slot so that data_set_creation must recreate it.

What the doc covers:

  • Problem context and goals
  • Three new config variables: DATASET_TERMINATIONS_PER_SP_PER_HOUR, DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS, DATA_SET_TERMINATION_MIN_INDEX (with the disable-by-config pattern via MIN_INDEX = MIN_NUM)
  • Scheduling gate: calibration-only, non-empty canary window required
  • Handler algorithm: scan slots high-to-low, skip missing/terminated/actively-tracked, terminate the first eligible slot
  • Idempotency contract for races and pg-boss retries
  • Metrics (dataSetTerminationStatus, dataSetTerminationMs) alongside the existing creation metrics that serve as the primary canary signal
  • Relationship to data_set_creation including the rate constraint (creations/hr >= terminations/hr)
  • FAQ on the full on-chain lifecycle after terminateService and why the job only needs to wait for step 1
  • Two open questions: renaming the terminated provisioning status and whether repair should move into this job

Closes #586

@FilOzzy FilOzzy added this to FOC Jun 1, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC Jun 1, 2026
@silent-cipher silent-cipher changed the base branch from main to docs/data-set-creation-design-doc June 1, 2026 18:24
@silent-cipher silent-cipher self-assigned this Jun 1, 2026
@rjan90 rjan90 moved this from 📌 Triage to ⌨️ In Progress in FOC Jun 2, 2026
Base automatically changed from docs/data-set-creation-design-doc to main June 3, 2026 06:18
Contributor

Copilot AI left a comment

Pull request overview

Adds a new design document describing a proposed calibration-only data_set_termination pg-boss job intended to periodically terminate a managed dataset slot so the existing data_set_creation job recreates it, keeping the on-chain dataset lifecycle continuously exercised as a canary.

Changes:

  • Introduces a detailed design/spec for a new data_set_termination job, including scheduling, handler algorithm, and idempotency expectations.
  • Documents proposed configuration knobs and operational constraints (calibration-only gating, canary window sizing, rate constraints vs creation).
  • Outlines observability expectations and BetterStack dashboard questions for validating the termination→creation loop.

@silent-cipher silent-cipher changed the title docs: data set deletion job design documentation docs: data set termination job design documentation Jun 3, 2026
@rjan90 rjan90 marked this pull request as ready for review June 3, 2026 14:40
@rjan90 rjan90 moved this from ⌨️ In Progress to 🔎 Awaiting review in FOC Jun 3, 2026
@BigLep
Contributor

BigLep commented Jun 3, 2026

Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.

Contributor

@BigLep BigLep left a comment

Note: I submitted this prematurely. I am still reviewing and will send another review when I'm done reading through.

I don't think we should say that this closes #586

We still need to do the implementation work and we should make sure we have visibility on this job on the internal dealbot dashboard.

@@ -0,0 +1,220 @@
# Data Set Termination Job

This doc proposes a calibration-only `data_set_termination` job that periodically terminates a dealbot managed dataset so the existing `data_set_creation` job naturally recreates it. The goal is to keep dealbot continuously exercising the on-chain `createDataSet` lifecycle instead of only creating datasets until a steady-state cap is reached.
Contributor

Hyperlink to data_set_creation docs


> **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated; if `deleteDataSet` canary coverage is required for #586, that would need a different approach.
Contributor

Suggested change
> **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated; if `deleteDataSet` canary coverage is required for #586, that would need a different approach.
> **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated.

This is a good callout, but I think we're agreeing here not to do deleteDataSet, so we don't need to say more.


## Summary

- `data_set_termination` is a calibration-only job that periodically terminates one managed dataset slot per provider.
Contributor

I think it should be calibration-only by default, but it should be possible to run in mainnet if someone enables it.

- Together with [`data_set_creation`](./data-set-creation.md), the two jobs form a bounded loop that keeps the `createDataSet` on-chain path continuously exercised as a canary.
- The job terminates **at most one dataset per invocation**; `data_set_creation` handles replenishment on its next scheduled tick.
- It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with `deal`, `retrieval`, `piece_cleanup`, `pull_check`, or `data_set_creation` for the same provider.
Contributor

Rather than list them all out (which is going to get stale as we add jobs), maybe just say

Suggested change
- It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with `deal`, `retrieval`, `piece_cleanup`, `pull_check`, or `data_set_creation` for the same provider.
- It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with other jobs for the same provider.


## Problem Context

During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared.
Contributor

Suggested change
During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared.
During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared. (See [post mortem](https://app.notion.com/p/filecoindev/2026-05-28-Hotfix-FWSS-v1-2-0-compatibility-with-PDPVerifier-v3-4-0-901dc41950c1820d99c88157214dec5d) for more info.)

- Reuse the existing `data_set_creation` job as the replenishment mechanism.
- Minimize disruption to ongoing deal and retrieval checks.
- Make termination cadence explicitly configurable so the expected create cadence can be reasoned about.
- Ensure the job cannot run on mainnet.
Contributor

Suggested change
- Ensure the job cannot run on mainnet.
- Ensure the job doesn't run on mainnet by default.

@silent-cipher
Collaborator Author

> I don't think we should say that this closes #586

I was planning to include implementation in this same PR.

Contributor

@BigLep BigLep left a comment

I'm happy to take another look 2026-06-04, but hopefully this gives enough direction to give confidence about starting implementation.

(This now concludes the review I started with #588 (review))

- default: `1` — only the baseline slot (index `0`) is protected
- slots `0` through `DATA_SET_TERMINATION_MIN_INDEX - 1` are never touched by this job
- example: `MIN_NUM_DATASETS_FOR_CHECKS = 10`, `DATA_SET_TERMINATION_MIN_INDEX = 5` → slots 0–4 are stable, slots 5–9 cycle as the canary window
- set to `MIN_NUM_DATASETS_FOR_CHECKS` to disable termination entirely — the canary window becomes empty and no schedule is created
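
The partition described in these bullets can be illustrated with a small sketch. It assumes the two settings arrive as plain non-negative integers; `computeCanaryWindow` and the `CanaryWindow` shape are hypothetical names for illustration, not part of dealbot.

```typescript
// Hypothetical helper; the names below are illustrative and not taken from the dealbot codebase.
interface CanaryWindow {
  protectedSlots: number[]; // slots this job never touches
  canarySlots: number[]; // slots eligible for the terminate -> recreate cycle
}

function computeCanaryWindow(minNumDatasets: number, terminationMinIndex: number): CanaryWindow {
  if (minNumDatasets < 0 || terminationMinIndex < 0) {
    throw new RangeError("slot counts and indexes must be non-negative");
  }
  // Clamp the boundary so that MIN_INDEX >= MIN_NUM yields an empty canary window.
  const boundary = Math.min(terminationMinIndex, minNumDatasets);
  const protectedSlots: number[] = [];
  const canarySlots: number[] = [];
  for (let slot = 0; slot < minNumDatasets; slot++) {
    if (slot < boundary) {
      protectedSlots.push(slot);
    } else {
      canarySlots.push(slot);
    }
  }
  return { protectedSlots, canarySlots };
}

// MIN_NUM_DATASETS_FOR_CHECKS = 10, DATA_SET_TERMINATION_MIN_INDEX = 5:
// slots 0-4 stay stable, slots 5-9 form the canary window.
const exampleWindow = computeCanaryWindow(10, 5);

// MIN_INDEX equal to MIN_NUM empties the window, which is the disable-by-config case.
const disabledWindow = computeCanaryWindow(10, 10);
```

An empty `canarySlots` array is the condition under which no schedule would be created in this sketch.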
Contributor

I guess this works, but maybe we use a more obvious "this is disabled" value like "-1" or even a string "DISABLED"?


The schedule is only upserted when all of the following are true:

- `NETWORK=calibration`
Contributor

I don't know if we want to enforce this in code.

Other ideas:

  • if the network is mainnet, maybe we log an info/warning (e.g., "Dataset deletion is enabled. Mainnet funds will likely drain faster as a result of ensuing repeated createDataSet fees.") or
  • if the network is mainnet, require a more explicit opt-in (e.g., "ENABLE_DATASET_DELETION=TRUE"), which by default is false for mainnet. Basically, I don't want our code to get in the way if we or others decide they want to run deletion jobs against mainnet.
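
For discussion, the two ideas in this comment could be combined into a gate along these lines. This is only a sketch: `ENABLE_DATASET_DELETION` is the flag name floated above, while `evaluateTerminationGate` and the input/decision shapes are assumptions made for illustration and do not exist in dealbot.

```typescript
// Sketch of the opt-in gate floated in this comment; not existing dealbot code.
type Network = "calibration" | "mainnet";

interface GateInput {
  network: Network;
  canaryWindowSize: number; // MIN_NUM_DATASETS_FOR_CHECKS minus DATA_SET_TERMINATION_MIN_INDEX
  enableDatasetDeletion: boolean; // hypothetical ENABLE_DATASET_DELETION flag
}

interface GateDecision {
  schedule: boolean;
  warning: string | null;
}

function evaluateTerminationGate(input: GateInput): GateDecision {
  // An empty canary window disables the job on every network.
  if (input.canaryWindowSize <= 0) {
    return { schedule: false, warning: null };
  }
  // Calibration is enabled by default.
  if (input.network === "calibration") {
    return { schedule: true, warning: null };
  }
  // Mainnet needs the explicit opt-in and gets a warning when enabled.
  if (!input.enableDatasetDeletion) {
    return { schedule: false, warning: null };
  }
  return {
    schedule: true,
    warning:
      "Dataset deletion is enabled. Mainnet funds will likely drain faster as a result of ensuing repeated createDataSet fees.",
  };
}
```

Returning the warning as data rather than logging inside the function keeps the gate easy to unit test; the scheduler would decide how to log it.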

2. Apply the same maintenance-window and SP-blocklist rules used by other SP jobs.
3. Create an `AbortController` using `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS`.
4. Read `MIN_NUM_DATASETS_FOR_CHECKS` and base dataset metadata.
5. Scan slots from `minDataSets - 1` down to `DATA_SET_TERMINATION_MIN_INDEX`. For each slot:
Contributor

Is the ordering of the slots deterministic? If so, that would mean some datasets will be long lived and some will churn. I haven't thought about this too deeply and appreciate your input, but on the surface, I would imagine doing this with random dataset selection. If we go random, then I assume we do away with DATA_SET_TERMINATION_MIN_INDEX approach.

Another way to look at my question is: why are we doing DATA_SET_TERMINATION_MIN_INDEX rather than just randomly selecting a dataset?

Collaborator Author

I thought we would keep some slots stable. We don't exercise the dataSetCreation -> terminateService cycle on them because we need deals on those slots to perform retrieval tests. If we pick a random slot from all available slots for termination, we will barely have any deals to run retrievals against.

Collaborator Author

I was thinking of making only the partition deterministic, not the selection itself. The lower-index slots (up to DATA_SET_TERMINATION_MIN_INDEX) stay stable to produce deals for retrieval testing, while the slots in [DATA_SET_TERMINATION_MIN_INDEX, MIN_NUM_DATASETS_FOR_CHECKS) can be picked randomly for the create → terminate cycle.
I'll make it more clear in the docs.

- c. Skip if `missing` — nothing to terminate.
- d. Skip if `terminated` — `data_set_creation` owns repair of these slots.
- e. Skip if `live` but has any deal row with `cleaned_up = false` — the deal job is still tracking it as active.
6. Call the termination flow on the first slot that passes all skip conditions (reaches step 5e without being skipped).
Contributor

Suggested change
6. Call the termination flow on the first slot that passes all skip conditions (reaches step 5e without being skipped).
6. Call the termination flow on the first slot that passes all skip conditions (reaches step 5e without being skipped).

7. Log the outcome and exit for this tick.
8. If no eligible slot is found after the full scan, log `skipped.no_candidate` and exit. This is expected when `data_set_creation` has not yet replenished a previously terminated slot.
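
The skip rules visible in this excerpt (steps 5c to 5e) and the one-candidate-per-tick behavior can be sketched as a pure function. The `SlotState` shape and `pickTerminationCandidate` are assumptions for illustration: in the proposed design the status would come from `getDataSetProvisioningStatus` and the deal rows, and the sketch follows the high-to-low scan as written even though the review thread is still weighing random selection inside the canary window.

```typescript
// Illustrative only; slot state would really be resolved from the SDK and the deals table.
type SlotStatus = "missing" | "terminated" | "live";

interface SlotState {
  index: number;
  status: SlotStatus;
  hasUncleanedDeals: boolean; // true if any deal row has cleaned_up = false
}

type ScanResult =
  | { outcome: "terminate"; slotIndex: number }
  | { outcome: "skipped.no_candidate" };

function findSlot(slots: SlotState[], index: number): SlotState | undefined {
  for (const slot of slots) {
    if (slot.index === index) return slot;
  }
  return undefined;
}

function pickTerminationCandidate(
  slots: SlotState[],
  minDataSets: number,
  terminationMinIndex: number,
): ScanResult {
  // Scan from the highest slot down to the first non-protected slot.
  for (let index = minDataSets - 1; index >= terminationMinIndex; index--) {
    const slot = findSlot(slots, index);
    if (slot === undefined || slot.status === "missing") continue; // nothing to terminate
    if (slot.status === "terminated") continue; // data_set_creation owns repair
    if (slot.hasUncleanedDeals) continue; // deal job still tracks this dataset
    return { outcome: "terminate", slotIndex: index }; // first eligible slot wins
  }
  return { outcome: "skipped.no_candidate" };
}
```

The function only selects a slot; the caller would perform the single state-changing termination and log the outcome.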

As with `data_set_creation`, the job performs **at most one state-changing action per invocation**.
Contributor

Suggested change
As with `data_set_creation`, the job performs **at most one state-changing action per invocation**.
As with `data_set_creation`, the job performs **at most one state-changing action per invocation**. The way to increase the dataset termination rate is to increase `DATASET_TERMINATIONS_PER_SP_PER_HOUR`.

Contributor

High level: should we maybe combine this doc with data-set-creation.md? Maybe there's a data-set-lifecycle.md doc we want?

I haven't thought too deeply about this idea, but I'm basically wondering about whether it's easier to think about the dataset lifecycle as a whole rather than spanning two documents.

2. `data_set_creation` runs next. The Synapse SDK filters the terminated dataset from metadata lookups, so the slot resolves as `missing` immediately. `data_set_creation` provisions a replacement dataset directly in this run.
3. Existing creation metrics and alerts resume acting as the canary.

**Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler should emit a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard.
Contributor

Suggested change
**Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler should emit a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard.
**Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler emits a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard.

I changed the wording because I'm assuming we'll implement it this way.
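
A minimal sketch of the startup check implied by the rate constraint, assuming both rates are already parsed into numbers. `checkRateConstraint` is a hypothetical helper name; only the two environment variable names come from the doc.

```typescript
// Hypothetical startup check; returns a warning message, or null when the constraint holds.
function checkRateConstraint(
  creationsPerSpPerHour: number,
  terminationsPerSpPerHour: number,
): string | null {
  if (creationsPerSpPerHour >= terminationsPerSpPerHour) {
    return null;
  }
  return (
    `DATASET_CREATIONS_PER_SP_PER_HOUR (${creationsPerSpPerHour}) is lower than ` +
    `DATASET_TERMINATIONS_PER_SP_PER_HOUR (${terminationsPerSpPerHour}); ` +
    "terminated slots may accumulate faster than data_set_creation can replenish them."
  );
}
```

The scheduler would call this once at startup and log the returned message at warning level when it is not `null`.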

Comment on lines +176 to +178
The `terminated` status returned by `getDataSetProvisioningStatus` means: the Synapse SDK resolved a `dataSetId` from the metadata fingerprint but liveness probes failed. This is distinct from a dataset that has `pdpEndEpoch !== 0` on-chain (which the SDK filters out entirely, causing the slot to resolve as `missing`).

The name `terminated` is already used for both the on-chain lifecycle concept and this SDK liveness-probe failure state, which causes confusion. Candidate replacements: `irrecoverable` or `missing.sp`. This rename would affect `data_set_creation`'s handler and repair path as well.
Contributor

I don't know what names are used where and the specifics, but let's not make things confusing. A "terminated" dataset already has a definition (and most recently getting updated in https://app.notion.com/p/filecoindev/Data-Set-Terminations-Clean-up-360dc41950c1801ebf0aff017a322e7 I believe).

If our terminated definition isn't in sync with other understandings of terminated, then let's not use that word.

I don't really know yet what to suggest in terms of name.

When you say "liveness probes failed", what do you mean? Basically, to help with naming: today, when getDataSetProvisioningStatus returns terminated, what is happening with the dataset? How could that have happened to the dataset? That is useful info for coming up with a new name here.

But in general: I'm in favor of fixing this and making it more clear and simpler to reason about. Overloading the term "terminated" I think moves us in the wrong direction.

Collaborator Author

@silent-cipher silent-cipher Jun 4, 2026

When getDataSetProvisioningStatus returns terminated, it means the data set is live/healthy on-chain but Curio's addPieces path reports the unrecoverable_proving_failure_epoch state, in which the SP refuses addPieces with HTTP 409.

Ref - #545


### Should `data_set_termination` absorb the repair path from `data_set_creation`?
Contributor

It's good to raise it. I don't know if there is an obvious clear-cut answer.

It sounds like the repair case still involves creating a dataset right, which is why it's reasonable to be in data_set_creation.

Maybe we just have an FAQ item for "Why doesn't data-set-termination handle repair?" For the answer we can say it's not clear cut either direction and that logic first lived in create_data_set. There hasn't been a compelling reason to change this.

Collaborator Author

@silent-cipher silent-cipher Jun 4, 2026

Repair here means the case where Curio's addPieces refuses with unrecoverable_proving_failure_epoch but the data set is still live on-chain. The repair path calls terminateService, which then actually terminates the data set on-chain. On the next tick, the data set is replenished.


`terminateService` calls `FilecoinPay.terminateRail(pdpRailId)`, which sets `endEpoch = block.number + lockupPeriod` on the PDP rail. The FWSS `railTerminated` callback fires in the same transaction, stores `info.pdpEndEpoch`, and emits `PDPPaymentsTerminated` and `ServiceTerminated`.

This is the point the termination job polls for: `pdpEndEpoch !== 0`. Once this is set, `data_set_creation` will classify the slot as `missing` and begin the replenishment sequence. The termination job's work is done here.
Contributor

> termination job polls for: `pdpEndEpoch !== 0`.

Maybe quickly say what happens after this polling when `pdpEndEpoch !== 0` is found (or you can link to the document where this is described more).

> Once this is set

I think it's not fully obvious what the "this" pronoun is referring to. To make it clear, maybe remove the pronoun and replace with the actual noun?

@BigLep
Contributor

BigLep commented Jun 3, 2026

Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.

Post GA item created: FilOzone/filecoin-services#503

Labels

None yet

Projects

Status: 🔎 Awaiting review

Development

Successfully merging this pull request may close these issues.

Periodic dataset deletion job in calibration to canary createDataSet flow

5 participants