docs: data set termination job design documentation#588
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new design document describing a proposed calibration-only data_set_termination pg-boss job intended to periodically terminate a managed dataset slot so the existing data_set_creation job recreates it, keeping the on-chain dataset lifecycle continuously exercised as a canary.
Changes:
- Introduces a detailed design/spec for a new data_set_termination job, including scheduling, handler algorithm, and idempotency expectations.
- Documents proposed configuration knobs and operational constraints (calibration-only gating, canary window sizing, rate constraints vs creation).
- Outlines observability expectations and BetterStack dashboard questions for validating the termination→creation loop.
Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.
There was a problem hiding this comment.
Note: I submitted this prematurely. I am still reviewing and will send another review when I'm done reading through.
I don't think we should say that this closes #586
We still need to do the implementation work and we should make sure we have visibility on this job on the internal dealbot dashboard.
| @@ -0,0 +1,220 @@ | |||
| # Data Set Termination Job | |||
|
|
|||
| This doc proposes a calibration-only `data_set_termination` job that periodically terminates a dealbot managed dataset so the existing `data_set_creation` job naturally recreates it. The goal is to keep dealbot continuously exercising the on-chain `createDataSet` lifecycle instead of only creating datasets until a steady-state cap is reached. | |||
There was a problem hiding this comment.
Hyperlink to data_set_creation docs
|
|
||
| This doc proposes a calibration-only `data_set_termination` job that periodically terminates a dealbot managed dataset so the existing `data_set_creation` job naturally recreates it. The goal is to keep dealbot continuously exercising the on-chain `createDataSet` lifecycle instead of only creating datasets until a steady-state cap is reached. | ||
|
|
||
| > **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated; if `deleteDataSet` canary coverage is required for #586, that would need a different approach. |
There was a problem hiding this comment.
| > **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated; if `deleteDataSet` canary coverage is required for #586, that would need a different approach. | |
| > **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated. |
This is a good callout, but I think we're agreeing here not to do deleteDataSet, so we don't need to say more.
|
|
||
| ## Summary | ||
|
|
||
| - `data_set_termination` is a calibration-only job that periodically terminates one managed dataset slot per provider. |
There was a problem hiding this comment.
I think it should be calibration-only by default, but it should be possible to run in mainnet if someone enables it.
| - `data_set_termination` is a calibration-only job that periodically terminates one managed dataset slot per provider. | ||
| - Together with [`data_set_creation`](./data-set-creation.md), the two jobs form a bounded loop that keeps the `createDataSet` on-chain path continuously exercised as a canary. | ||
| - The job terminates **at most one dataset per invocation**; `data_set_creation` handles replenishment on its next scheduled tick. | ||
| - It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with `deal`, `retrieval`, `piece_cleanup`, `pull_check`, or `data_set_creation` for the same provider. |
There was a problem hiding this comment.
Rather than list them all out (which is going to get stale as we add jobs), maybe just say
| - It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with `deal`, `retrieval`, `piece_cleanup`, `pull_check`, or `data_set_creation` for the same provider. | |
| - It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with other jobs for the same provider. |
|
|
||
| ## Problem Context | ||
|
|
||
| During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared. |
There was a problem hiding this comment.
| During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared. | |
| During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared. (See [post mortem](https://app.notion.com/p/filecoindev/2026-05-28-Hotfix-FWSS-v1-2-0-compatibility-with-PDPVerifier-v3-4-0-901dc41950c1820d99c88157214dec5d) for more info.) |
| - Reuse the existing `data_set_creation` job as the replenishment mechanism. | ||
| - Minimize disruption to ongoing deal and retrieval checks. | ||
| - Make termination cadence explicitly configurable so the expected create cadence can be reasoned about. | ||
| - Ensure the job cannot run on mainnet. |
There was a problem hiding this comment.
| - Ensure the job cannot run on mainnet. | |
| - Ensure the job doesn't run on mainnet by default. |
I was planning to include implementation in this same PR.
BigLep
left a comment
There was a problem hiding this comment.
I'm happy to take another look on 2026-06-04, but hopefully this gives enough direction to give confidence about starting implementation.
(This now concludes the review I started with #588 (review))
| - default: `1` — only the baseline slot (index `0`) is protected | ||
| - slots `0` through `DATA_SET_TERMINATION_MIN_INDEX - 1` are never touched by this job | ||
| - example: `MIN_NUM_DATASETS_FOR_CHECKS = 10`, `DATA_SET_TERMINATION_MIN_INDEX = 5` → slots 0–4 are stable, slots 5–9 cycle as the canary window | ||
| - set to `MIN_NUM_DATASETS_FOR_CHECKS` to disable termination entirely — the canary window becomes empty and no schedule is created |
There was a problem hiding this comment.
I guess this works, but maybe we use a more obvious "this is disabled" value like "-1" or even a string "DISABLED"?
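For reference, here is a minimal sketch of how the canary window falls out of the two settings as the doc describes them. The helper name `canaryWindow` is hypothetical and not part of dealbot; the parameters mirror `MIN_NUM_DATASETS_FOR_CHECKS` and `DATA_SET_TERMINATION_MIN_INDEX`.

```typescript
// Hypothetical helper: returns the slot indexes that the termination job may touch.
// Slots 0 .. terminationMinIndex - 1 are protected and never appear in the result.
function canaryWindow(minNumDataSets: number, terminationMinIndex: number): number[] {
  const slots: number[] = [];
  for (let index = terminationMinIndex; index < minNumDataSets; index++) {
    slots.push(index);
  }
  return slots;
}

// Example from the doc: 10 managed slots, first 5 kept stable.
const cycling = canaryWindow(10, 5); // slots 5 through 9
// Setting the min index equal to the slot count leaves an empty window,
// which is the "disabled" case the doc describes.
const disabled = canaryWindow(10, 10).length === 0;
console.log(cycling, disabled);
```

An explicit disabled value such as `-1` would replace the empty-window check with a direct comparison, which may be easier to spot in config.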
|
|
||
| The schedule is only upserted when all of the following are true: | ||
|
|
||
| - `NETWORK=calibration` |
There was a problem hiding this comment.
I don't know if we want to enforce this in code.
Other ideas:
- if the network is mainnet, maybe we log an info/warning (e.g., "Dataset deletion is enabled. Mainnet funds will likely drain faster as a result of ensuing repeated createDataSet fees.") or
- if the network is mainnet, require a more explicit opt-in (e.g., "ENABLE_DATASET_DELETION=TRUE"), which by default is false for mainnet. Basically I don't want our code to get in the way if we or others decide they want to run deletion jobs against mainnet.
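A rough sketch of how the two ideas above could combine, covering only the network condition and none of the other schedule preconditions. Everything here is an assumption for discussion: the `terminationGate` function and its types are hypothetical, and `explicitOptIn` stands in for a flag like the `ENABLE_DATASET_DELETION` example.

```typescript
type Network = "calibration" | "mainnet";

interface GateResult {
  enabled: boolean;
  warning?: string;
}

// Hypothetical gate: calibration is allowed by default, mainnet only with an explicit opt-in.
function terminationGate(network: Network, explicitOptIn: boolean): GateResult {
  if (network === "calibration") {
    return { enabled: true };
  }
  if (!explicitOptIn) {
    return { enabled: false };
  }
  // Mainnet with opt-in: allow it, but surface the cost implication in the logs.
  return {
    enabled: true,
    warning:
      "Dataset deletion is enabled. Mainnet funds will likely drain faster as a result of ensuing repeated createDataSet fees.",
  };
}

console.log(terminationGate("mainnet", true).warning);
```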
| 2. Apply the same maintenance-window and SP-blocklist rules used by other SP jobs. | ||
| 3. Create an `AbortController` using `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS`. | ||
| 4. Read `MIN_NUM_DATASETS_FOR_CHECKS` and base dataset metadata. | ||
| 5. Scan slots from `minDataSets - 1` down to `DATA_SET_TERMINATION_MIN_INDEX`. For each slot: |
There was a problem hiding this comment.
Is the ordering of the slots deterministic? If so, that would mean some datasets will be long lived and some will churn. I haven't thought about this too deeply and appreciate your input, but on the surface, I would imagine doing this with random dataset selection. If we go random, then I assume we do away with DATA_SET_TERMINATION_MIN_INDEX approach.
Another way to look at my question is: why are we doing DATA_SET_TERMINATION_MIN_INDEX rather than just randomly selecting a dataset?
There was a problem hiding this comment.
I thought we would keep some slots stable. We don't exercise the dataSetCreation -> terminateService cycle on them because we need their deals for retrieval tests. If we pick a random slot from all available slots for termination, we will barely have any deals left to run retrievals against.
There was a problem hiding this comment.
I was thinking of making only the partition deterministic, not the selection itself. The lower-index slots (up to DATA_SET_TERMINATION_MIN_INDEX) stay stable to produce deals for retrieval testing, while the slots in [DATA_SET_TERMINATION_MIN_INDEX, MIN_NUM_DATASETS_FOR_CHECKS) can be picked randomly for the create → terminate cycle.
I'll make it more clear in the docs.
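To make the "deterministic partition, random selection" idea concrete, here is a small sketch. The function name `pickRandomCanarySlot` and the injectable `random` parameter are assumptions for illustration; eligibility checks (missing, terminated, uncleaned deals) are left out.

```typescript
// Hypothetical picker: chooses uniformly from [terminationMinIndex, minNumDataSets).
// The lower slots stay stable so their deals remain available for retrieval checks.
function pickRandomCanarySlot(
  minNumDataSets: number,
  terminationMinIndex: number,
  random: () => number = Math.random,
): number | null {
  const windowSize = minNumDataSets - terminationMinIndex;
  if (windowSize <= 0) {
    return null; // empty canary window, nothing to pick
  }
  return terminationMinIndex + Math.floor(random() * windowSize);
}

console.log(pickRandomCanarySlot(10, 5));
```

Passing the random source as a parameter keeps the selection testable without changing the default behavior.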
| - c. Skip if `missing` — nothing to terminate. | ||
| - d. Skip if `terminated` — `data_set_creation` owns repair of these slots. | ||
| - e. Skip if `live` but has any deal row with `cleaned_up = false` — the deal job is still tracking it as active. | ||
| 6. Call the termination flow on the first slot that passes all skip conditions (reaches step 5e without being skipped). |
There was a problem hiding this comment.
| 6. Call the termination flow on the first slot that passes all skip conditions (reaches step 5e without being skipped). | |
| 6. Call the termination flow on the first slot that passes all skip conditions (reaches step 5e without being skipped). |
| 7. Log the outcome and exit for this tick. | ||
| 8. If no eligible slot is found after the full scan, log `skipped.no_candidate` and exit. This is expected when `data_set_creation` has not yet replenished a previously terminated slot. | ||
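The scan in steps 5 through 8 can be sketched as follows, following the diff text as written (top-down scan, first eligible slot wins). This is an illustrative sketch only: `SlotView` and `findTerminationCandidate` are hypothetical names, sub-steps a and b are not shown in this excerpt and are not modeled, and the real handler would read slot state from chain and database rather than from an array.

```typescript
type SlotStatus = "missing" | "terminated" | "live";

interface SlotView {
  status: SlotStatus;
  hasUncleanedDeals: boolean; // true if any deal row has cleaned_up = false
}

// Hypothetical scan: walks from the highest slot down to the protected boundary
// and returns the first slot that survives every skip condition, or null.
function findTerminationCandidate(
  slots: SlotView[],
  minDataSets: number,
  terminationMinIndex: number,
): number | null {
  for (let index = minDataSets - 1; index >= terminationMinIndex; index--) {
    const slot = slots[index];
    if (slot === undefined) continue; // no state recorded for this slot
    if (slot.status === "missing") continue; // nothing to terminate
    if (slot.status === "terminated") continue; // data_set_creation owns repair
    if (slot.hasUncleanedDeals) continue; // deal job still tracks it as active
    return index;
  }
  return null; // caller logs skipped.no_candidate
}
```

Returning a single index keeps the "at most one state-changing action per invocation" property visible in the type.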
|
|
||
| As with `data_set_creation`, the job performs **at most one state-changing action per invocation**. |
There was a problem hiding this comment.
| As with `data_set_creation`, the job performs **at most one state-changing action per invocation**. | |
| As with `data_set_creation`, the job performs **at most one state-changing action per invocation**. The way to increase the dataset termination rate is to increase `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. |
There was a problem hiding this comment.
High level: should we maybe combine this doc with data-set-creation.md? Maybe there's a data-set-lifecycle.md doc we want?
I haven't thought too deeply about this idea, but I'm basically wondering about whether it's easier to think about the dataset lifecycle as a whole rather than spanning two documents.
| 2. `data_set_creation` runs next. The Synapse SDK filters the terminated dataset from metadata lookups, so the slot resolves as `missing` immediately. `data_set_creation` provisions a replacement dataset directly in this run. | ||
| 3. Existing creation metrics and alerts resume acting as the canary. | ||
|
|
||
| **Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler should emit a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard. |
There was a problem hiding this comment.
| **Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler should emit a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard. | |
| **Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler emits a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard. |
I changed the wording because I assume we'll implement it this way.
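A sketch of what that startup check might look like, assuming both rates are already parsed as numbers. The function name `rateConstraintWarning` and the message wording are hypothetical; only the two setting names come from the doc.

```typescript
// Hypothetical startup check: returns a warning message when terminations
// outpace creations, or null when the constraint holds.
function rateConstraintWarning(
  creationsPerSpPerHour: number,
  terminationsPerSpPerHour: number,
): string | null {
  if (creationsPerSpPerHour >= terminationsPerSpPerHour) {
    return null;
  }
  return (
    `DATASET_TERMINATIONS_PER_SP_PER_HOUR (${terminationsPerSpPerHour}) exceeds ` +
    `DATASET_CREATIONS_PER_SP_PER_HOUR (${creationsPerSpPerHour}); ` +
    "missing slots will accumulate faster than they are replenished."
  );
}

const warning = rateConstraintWarning(1, 2);
if (warning !== null) {
  console.warn(warning);
}
```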
| The `terminated` status returned by `getDataSetProvisioningStatus` means: the Synapse SDK resolved a `dataSetId` from the metadata fingerprint but liveness probes failed. This is distinct from a dataset that has `pdpEndEpoch !== 0` on-chain (which the SDK filters out entirely, causing the slot to resolve as `missing`). | ||
|
|
||
| The name `terminated` is already used for both the on-chain lifecycle concept and this SDK liveness-probe failure state, which causes confusion. Candidate replacements: `irrecoverable` or `missing.sp`. This rename would affect `data_set_creation`'s handler and repair path as well. |
There was a problem hiding this comment.
I don't know what names are used where and the specifics, but let's not make things confusing. A "terminated" dataset already has a definition (most recently being updated in https://app.notion.com/p/filecoindev/Data-Set-Terminations-Clean-up-360dc41950c1801ebf0aff017a322e7, I believe).
If our terminated definition isn't in sync with other understandings of terminated, then let's not use that word.
I don't really know yet what to suggest in terms of a name.
When you say "liveness probes failed", what do you mean? Basically, to help with naming: today, when getDataSetProvisioningStatus returns terminated, what is happening with the dataset? How could that have happened to the dataset? That is useful info for coming up with a new name here.
But in general: I'm in favor of fixing this and making it clearer and simpler to reason about. Overloading the term "terminated", I think, moves us in the wrong direction.
There was a problem hiding this comment.
When getDataSetProvisioningStatus returns terminated, it means the data set is live/healthy on-chain, but Curio's addPieces path returns the unrecoverable_proving_failure_epoch state, where the SP refuses addPieces with HTTP 409.
Ref - #545
|
|
||
| The name `terminated` is already used for both the on-chain lifecycle concept and this SDK liveness-probe failure state, which causes confusion. Candidate replacements: `irrecoverable` or `missing.sp`. This rename would affect `data_set_creation`'s handler and repair path as well. | ||
|
|
||
| ### Should `data_set_termination` absorb the repair path from `data_set_creation`? |
There was a problem hiding this comment.
It's good to raise it. I don't know if there is an obvious clear-cut answer.
It sounds like the repair case still involves creating a dataset right, which is why it's reasonable to be in data_set_creation.
Maybe we just have an FAQ item for "Why doesn't data-set-termination handle repair?" For the answer we can say it's not clear-cut in either direction and that the logic first lived in data_set_creation. There hasn't been a compelling reason to change this.
There was a problem hiding this comment.
By repair I mean the case where Curio's addPieces refuses with unrecoverable_proving_failure_epoch but the data set is still live on-chain. So the repair path calls terminateService, which then actually terminates the data set on-chain. Then, on the next tick, the data set is replenished.
|
|
||
| `terminateService` calls `FilecoinPay.terminateRail(pdpRailId)`, which sets `endEpoch = block.number + lockupPeriod` on the PDP rail. The FWSS `railTerminated` callback fires in the same transaction, stores `info.pdpEndEpoch`, and emits `PDPPaymentsTerminated` and `ServiceTerminated`. | ||
|
|
||
| This is the point the termination job polls for: `pdpEndEpoch !== 0`. Once this is set, `data_set_creation` will classify the slot as `missing` and begin the replenishment sequence. The termination job's work is done here. |
There was a problem hiding this comment.
termination job polls for:
pdpEndEpoch !== 0.
Maybe quickly say what happens after this polling when `pdpEndEpoch !== 0` is found (or you can link to the document where this is described more).
Once this is set
I think it's not fully obvious what the "this" pronoun is referring to. To make it clear, maybe remove the pronoun and replace with the actual noun?
Post GA item created: FilOzone/filecoin-services#503
Summary
Adds docs/data-set-termination.md, a design doc for the proposed calibration-only data_set_termination job.
Why this job is needed: Once all providers reach the MIN_NUM_DATASETS_FOR_CHECKS cap, data_set_creation stops exercising the on-chain createDataSet path. During the PDPVerifier V3.4.0 rollout this meant dealbot missed the calibration outage entirely. The termination job restores continuous canary coverage by periodically freeing a slot so data_set_creation must recreate it.
What the doc covers:
- DATASET_TERMINATIONS_PER_SP_PER_HOUR, DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS, DATA_SET_TERMINATION_MIN_INDEX (with the disable-by-config pattern via MIN_INDEX=MIN_NUM)
- (dataSetTerminationStatus, dataSetTerminationMs) alongside the existing creation metrics that serve as the primary canary signal
- data_set_creation including the rate constraint (creations/hr >= terminations/hr)
- terminateService and why the job only needs to wait for step 1
- terminated provisioning status and whether repair should move into this job
Closes #586