docs: data set termination job design documentation by silent-cipher · Pull Request #588 · FilOzone/dealbot

silent-cipher · 2026-06-01T18:24:19Z

Summary

Adds docs/data-set-termination.md, a design doc for the proposed calibration-only data_set_termination job.

Why this job is needed: Once all providers reach the MIN_NUM_DATASETS_FOR_CHECKS cap, data_set_creation stops exercising the on-chain createDataSet path. During the PDPVerifier v3.4.0 rollout this meant dealbot missed the calibration outage entirely. The termination job restores continuous canary coverage by periodically freeing a slot so data_set_creation must recreate it.

What the doc covers:

Problem context and goals
Three new config variables: DATASET_TERMINATIONS_PER_SP_PER_HOUR, DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS, DATA_SET_TERMINATION_MIN_INDEX (with the disable-by-config pattern via MIN_INDEX = MIN_NUM)
Scheduling gate: calibration-only, non-empty canary window required
Handler algorithm: scan slots high-to-low, skip missing/terminated/actively-tracked, terminate the first eligible slot
Idempotency contract for races and pg-boss retries
Metrics (dataSetTerminationStatus, dataSetTerminationMs) alongside the existing creation metrics that serve as the primary canary signal
Relationship to data_set_creation including the rate constraint (creations/hr >= terminations/hr)
FAQ on the full on-chain lifecycle after terminateService and why the job only needs to wait for step 1
Two open questions: renaming the terminated provisioning status and whether repair should move into this job

Closes #586

Copilot

Pull request overview

Adds a new design document describing a proposed calibration-only data_set_termination pg-boss job intended to periodically terminate a managed dataset slot so the existing data_set_creation job recreates it, keeping the on-chain dataset lifecycle continuously exercised as a canary.

Changes:

Introduces a detailed design/spec for a new data_set_termination job, including scheduling, handler algorithm, and idempotency expectations.
Documents proposed configuration knobs and operational constraints (calibration-only gating, canary window sizing, rate constraints vs creation).
Outlines observability expectations and BetterStack dashboard questions for validating the termination→creation loop.

BigLep · 2026-06-03T14:57:06Z

Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.

BigLep

Note: I submitted this prematurely. I am still reviewing and will send another review when I'm done reading through.

I don't think we should say that this closes #586

We still need to do the implementation work and we should make sure we have visibility on this job on the internal dealbot dashboard.

BigLep · 2026-06-03T17:58:50Z

@@ -0,0 +1,220 @@
+# Data Set Termination Job
+
+This doc proposes a calibration-only `data_set_termination` job that periodically terminates a dealbot managed dataset so the existing `data_set_creation` job naturally recreates it. The goal is to keep dealbot continuously exercising the on-chain `createDataSet` lifecycle instead of only creating datasets until a steady-state cap is reached.


Hyperlink to data_set_creation docs

BigLep · 2026-06-03T18:01:38Z

+
+This doc proposes a calibration-only `data_set_termination` job that periodically terminates a dealbot managed dataset so the existing `data_set_creation` job naturally recreates it. The goal is to keep dealbot continuously exercising the on-chain `createDataSet` lifecycle instead of only creating datasets until a steady-state cap is reached.
+
+> **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated; if `deleteDataSet` canary coverage is required for #586, that would need a different approach.


Suggested change

> **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated; if `deleteDataSet` canary coverage is required for #586, that would need a different approach.

> **Note**: this design does **not** attempt `PDPVerifier.deleteDataSet`, which is SP-initiated.

This is a good callout, but I think we're agreeing here not to do deleteDataSet, so we don't need to say more.

BigLep · 2026-06-03T18:02:19Z

+
+## Summary
+
+- `data_set_termination` is a calibration-only job that periodically terminates one managed dataset slot per provider.


I think it should be calibration-only by default, but it should be possible to run in mainnet if someone enables it.

BigLep · 2026-06-03T18:03:19Z

+- `data_set_termination` is a calibration-only job that periodically terminates one managed dataset slot per provider.
+- Together with [`data_set_creation`](./data-set-creation.md), the two jobs form a bounded loop that keeps the `createDataSet` on-chain path continuously exercised as a canary.
+- The job terminates **at most one dataset per invocation**; `data_set_creation` handles replenishment on its next scheduled tick.
+- It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with `deal`, `retrieval`, `piece_cleanup`, `pull_check`, or `data_set_creation` for the same provider.


Rather than list them all out (which is going to get stale as we add jobs), maybe just say

Suggested change

- It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with `deal`, `retrieval`, `piece_cleanup`, `pull_check`, or `data_set_creation` for the same provider.

- It runs on the shared `sp.work` queue with `singletonKey=spAddress`, so it cannot race with other jobs for the same provider.

BigLep · 2026-06-03T18:06:00Z

+
+## Problem Context
+
+During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared.


Suggested change

During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared.

During the PDPVerifier v3.4.0 rollout, `createDataSet` broke on calibration and mainnet. Dealbot did not detect the calibration outage because providers had already reached the steady-state cap of managed datasets ([`MIN_NUM_DATASETS_FOR_CHECKS`](./environment-variables.md#min_num_datasets_for_checks)). Once the cap was reached, `data_set_creation` stopped exercising the on-chain create path, so the canary value of that job disappeared. (See [post mortem](https://app.notion.com/p/filecoindev/2026-05-28-Hotfix-FWSS-v1-2-0-compatibility-with-PDPVerifier-v3-4-0-901dc41950c1820d99c88157214dec5d) for more info.)

BigLep · 2026-06-03T18:07:01Z

+- Reuse the existing `data_set_creation` job as the replenishment mechanism.
+- Minimize disruption to ongoing deal and retrieval checks.
+- Make termination cadence explicitly configurable so the expected create cadence can be reasoned about.
+- Ensure the job cannot run on mainnet.


Suggested change

- Ensure the job cannot run on mainnet.

- Ensure the job doesn't run on mainnet by default.

silent-cipher · 2026-06-03T18:54:26Z

I don't think we should say that this closes #586

I was planning to include implementation in this same PR.

BigLep

I'm happy to take another look 2026-06-04, but hopefully this gives enough direction to give confidence about starting implementation.

(This now concludes the review I started with #588 (review))

BigLep · 2026-06-03T18:37:50Z

+  - default: `1` — only the baseline slot (index `0`) is protected
+  - slots `0` through `DATA_SET_TERMINATION_MIN_INDEX - 1` are never touched by this job
+  - example: `MIN_NUM_DATASETS_FOR_CHECKS = 10`, `DATA_SET_TERMINATION_MIN_INDEX = 5` → slots 0–4 are stable, slots 5–9 cycle as the canary window
+  - set to `MIN_NUM_DATASETS_FOR_CHECKS` to disable termination entirely — the canary window becomes empty and no schedule is created


I guess this works, but maybe we use a more obvious "this is disabled" value like "-1" or even a string "DISABLED"?
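A minimal sketch of how the canary window could be resolved if an explicit disable value were accepted alongside the `MIN_INDEX = MIN_NUM` pattern. The function name `resolveCanaryWindow`, the accepted sentinels, and the return shape are assumptions for illustration only; they are not part of the doc under review.

```typescript
// Hypothetical helper: names, sentinels, and return shape are illustrative assumptions.
type CanaryWindow =
  | { enabled: false; reason: string }
  | { enabled: true; firstSlot: number; lastSlot: number };

function resolveCanaryWindow(minNumDataSets: number, minIndexRaw: string): CanaryWindow {
  const raw = minIndexRaw.trim();
  // An explicit sentinel is easier to spot in config than "MIN_INDEX = MIN_NUM".
  if (raw === "DISABLED" || raw === "-1") {
    return { enabled: false, reason: "disabled by config" };
  }
  // Accept only non-negative integers; anything else is treated as a misconfiguration.
  if (!/^\d+$/.test(raw)) {
    return { enabled: false, reason: `invalid min index: ${minIndexRaw}` };
  }
  const minIndex = Number(raw);
  // Slots [0, minIndex) stay stable; slots [minIndex, minNumDataSets) may cycle.
  if (minIndex >= minNumDataSets) {
    return { enabled: false, reason: "empty canary window" };
  }
  return { enabled: true, firstSlot: minIndex, lastSlot: minNumDataSets - 1 };
}
```

With `MIN_NUM_DATASETS_FOR_CHECKS = 10` and `DATA_SET_TERMINATION_MIN_INDEX = 5`, this yields the window 5 to 9 described in the doc, and the existing "set it to `MIN_NUM_DATASETS_FOR_CHECKS`" pattern still disables the job through the empty-window branch.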

BigLep · 2026-06-03T18:45:04Z

+
+The schedule is only upserted when all of the following are true:
+
+- `NETWORK=calibration`


I don't know if we want to enforce this in code.

Other ideas:

if the network is mainnet, maybe we log an info/warning (e.g., "Dataset deletion is enabled. Mainnet funds will likely drain faster as a result of ensuing repeated createDataSet fees.") or

if the network is mainnet, require a more explicit opt-in (e.g., ENABLE_DATASET_DELETION=TRUE), which by default is false for mainnet. Basically, I don't want our code to get in the way if we or others decide they want to run deletion jobs against mainnet.
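The two ideas above can be combined. The sketch below assumes a boolean opt-in flag (named after the `ENABLE_DATASET_DELETION` example in this comment) whose default depends on the network, plus a warning when the job is enabled on mainnet. The type names, the function name `evaluateTerminationGate`, and the warning text are hypothetical.

```typescript
// Hypothetical scheduling gate: the flag mirrors the ENABLE_DATASET_DELETION idea above;
// all other names are assumptions for illustration.
type Network = "calibration" | "mainnet";

interface GateInput {
  network: Network;
  enableDatasetDeletion?: boolean; // undefined means "use the network default"
  canaryWindowSize: number; // number of slots eligible for termination
}

interface GateResult {
  schedule: boolean;
  warnings: string[];
}

function evaluateTerminationGate(input: GateInput): GateResult {
  const warnings: string[] = [];
  // Default: enabled on calibration, disabled on mainnet unless explicitly opted in.
  const enabled = input.enableDatasetDeletion ?? (input.network === "calibration");
  if (!enabled || input.canaryWindowSize <= 0) {
    return { schedule: false, warnings };
  }
  if (input.network === "mainnet") {
    warnings.push(
      "Dataset termination is enabled on mainnet; repeated createDataSet fees will drain funds faster.",
    );
  }
  return { schedule: true, warnings };
}
```

This keeps the job off mainnet by default without hard-coding a network check that operators cannot override.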

BigLep · 2026-06-03T19:14:56Z

+2. Apply the same maintenance-window and SP-blocklist rules used by other SP jobs.
+3. Create an `AbortController` using `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS`.
+4. Read `MIN_NUM_DATASETS_FOR_CHECKS` and base dataset metadata.
+5. Scan slots from `minDataSets - 1` down to `DATA_SET_TERMINATION_MIN_INDEX`. For each slot:


Is the ordering of the slots deterministic? If so, that would mean some datasets will be long lived and some will churn. I haven't thought about this too deeply and appreciate your input, but on the surface, I would imagine doing this with random dataset selection. If we go random, then I assume we do away with DATA_SET_TERMINATION_MIN_INDEX approach.

Another way to look at my question is: why are we doing DATA_SET_TERMINATION_MIN_INDEX rather than just randomly selecting a dataset?

I thought we would keep some slots stable. We don't exercise the dataSetCreation -> terminateService cycle on them because we need deals on them for retrieval tests. If we pick a random slot from all available slots for termination, we will barely have any deals to run retrievals against.

I was thinking of making only the partition deterministic, not the selection itself. The lower-index slots (up to DATA_SET_TERMINATION_MIN_INDEX) stay stable to produce deals for retrieval testing, while the slots in [DATA_SET_TERMINATION_MIN_INDEX, MIN_NUM_DATASETS_FOR_CHECKS) can be picked randomly for the create → terminate cycle.
I'll make it more clear in the docs.
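A minimal sketch of the "deterministic partition, random selection" idea described in this reply. It assumes each slot has already been classified with the statuses used in the doc (`live`, `missing`, `terminated`) and that "actively tracked" means at least one deal row with `cleaned_up = false`. The `Slot` shape and the function name are hypothetical.

```typescript
// Hypothetical slot model; field names are assumptions for illustration.
type SlotStatus = "live" | "missing" | "terminated";

interface Slot {
  index: number;
  status: SlotStatus;
  hasActiveDeals: boolean; // true if any deal row has cleaned_up = false
}

function pickTerminationCandidate(
  slots: Slot[],
  minIndex: number, // DATA_SET_TERMINATION_MIN_INDEX: slots below this stay stable
  random: () => number = Math.random, // injectable so the choice can be tested
): Slot | undefined {
  // Only live, untracked slots inside the canary window are eligible.
  const eligible = slots.filter(
    (slot) => slot.index >= minIndex && slot.status === "live" && !slot.hasActiveDeals,
  );
  if (eligible.length === 0) {
    return undefined; // caller would log skipped.no_candidate and exit
  }
  // Uniform random pick within the window, instead of a high-to-low scan.
  return eligible[Math.floor(random() * eligible.length)];
}
```

Slots below `minIndex` are never returned, so they keep accumulating deals for retrieval checks, while churn is spread evenly across the rest of the window.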

BigLep · 2026-06-03T19:16:08Z

+   - c. Skip if `missing` — nothing to terminate.
+   - d. Skip if `terminated` — `data_set_creation` owns repair of these slots.
+   - e. Skip if `live` but has any deal row with `cleaned_up = false` — the deal job is still tracking it as active.
+6. Call the termination flow on the first slot that passes all skip conditions (reaches step 5e without being skipped).


Suggested change

6. Call the termination flow on the first slot that passes all skip conditions (reaches step 5e without being skipped).

6. Call the termination flow on the first slot that passes all skip conditions (reaches step 5e without being skipped).

BigLep · 2026-06-03T19:18:17Z

+7. Log the outcome and exit for this tick.
+8. If no eligible slot is found after the full scan, log `skipped.no_candidate` and exit. This is expected when `data_set_creation` has not yet replenished a previously terminated slot.
+
+As with `data_set_creation`, the job performs **at most one state-changing action per invocation**.


Suggested change

As with `data_set_creation`, the job performs **at most one state-changing action per invocation**.

As with `data_set_creation`, the job performs **at most one state-changing action per invocation**. The way to increase the dataset termination rate is to increase `DATASET_TERMINATIONS_PER_SP_PER_HOUR`.
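Because each invocation terminates at most one dataset, the per-hour setting effectively fixes the spacing between invocations. The sketch below assumes the scheduler spaces runs evenly across the hour and that fractional rates are allowed; neither is confirmed by the doc, and the function name is hypothetical.

```typescript
// Hypothetical conversion from DATASET_TERMINATIONS_PER_SP_PER_HOUR to a schedule interval.
const SECONDS_PER_HOUR = 3600;

function terminationIntervalSeconds(terminationsPerSpPerHour: number): number | null {
  // Zero, negative, or non-finite rates mean "do not schedule".
  if (!Number.isFinite(terminationsPerSpPerHour) || terminationsPerSpPerHour <= 0) {
    return null;
  }
  // One termination per invocation, so the interval is the inverse of the rate.
  return Math.round(SECONDS_PER_HOUR / terminationsPerSpPerHour);
}
```

Under these assumptions a rate of `1` gives one run per provider every 3600 seconds, and a rate of `4` gives one run every 900 seconds.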

BigLep · 2026-06-03T19:59:58Z

High level: should we maybe combine this doc with data-set-creation.md? Maybe there's a data-set-lifecycle.md doc we want?

I haven't thought too deeply about this idea, but I'm basically wondering about whether it's easier to think about the dataset lifecycle as a whole rather than spanning two documents.

BigLep · 2026-06-03T20:01:43Z

+2. `data_set_creation` runs next. The Synapse SDK filters the terminated dataset from metadata lookups, so the slot resolves as `missing` immediately. `data_set_creation` provisions a replacement dataset directly in this run.
+3. Existing creation metrics and alerts resume acting as the canary.
+
+**Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler should emit a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard.


Suggested change

**Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler should emit a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard.

**Rate constraint:** `DATASET_CREATIONS_PER_SP_PER_HOUR` should be **greater than or equal to** `DATASET_TERMINATIONS_PER_SP_PER_HOUR`. If termination runs faster than creation, the missing-slot backlog accumulates and the system stops behaving like a simple steady-state canary. The scheduler emits a startup warning log when this constraint is violated so the misconfiguration is visible without a dashboard.

I changed the wording because I assume we'll implement it this way.
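A minimal sketch of the startup check, assuming both rates are already parsed as numbers. The function name and the warning wording are hypothetical; only the two variable names and the `>=` constraint come from the doc.

```typescript
// Hypothetical startup check for the rate constraint (creations/hr >= terminations/hr).
function checkRateConstraint(
  creationsPerSpPerHour: number,
  terminationsPerSpPerHour: number,
): string | null {
  if (creationsPerSpPerHour >= terminationsPerSpPerHour) {
    return null; // constraint satisfied, nothing to log
  }
  // Net slots lost per provider per hour if both jobs run at their configured rates.
  const backlogPerHour = terminationsPerSpPerHour - creationsPerSpPerHour;
  return (
    `DATASET_CREATIONS_PER_SP_PER_HOUR (${creationsPerSpPerHour}) is lower than ` +
    `DATASET_TERMINATIONS_PER_SP_PER_HOUR (${terminationsPerSpPerHour}); ` +
    `missing slots may accumulate at up to ${backlogPerHour} per provider per hour.`
  );
}
```

The scheduler would log the returned message at warning level when it is non-null. Returning the message rather than logging directly keeps the check easy to unit test.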

BigLep · 2026-06-03T20:08:03Z

+The `terminated` status returned by `getDataSetProvisioningStatus` means: the Synapse SDK resolved a `dataSetId` from the metadata fingerprint but liveness probes failed. This is distinct from a dataset that has `pdpEndEpoch !== 0` on-chain (which the SDK filters out entirely, causing the slot to resolve as `missing`).
+
+The name `terminated` is already used for both the on-chain lifecycle concept and this SDK liveness-probe failure state, which causes confusion. Candidate replacements: `irrecoverable` or `missing.sp`. This rename would affect `data_set_creation`'s handler and repair path as well.


I don't know what names are used where and the specifics, but let's not make things confusing. A "terminated" dataset already has a definition (most recently being updated in https://app.notion.com/p/filecoindev/Data-Set-Terminations-Clean-up-360dc41950c1801ebf0aff017a322e7, I believe).

If our terminated definition isn't in sync with other understandings of terminated, then let's not use that word.

I don't really know yet what to suggest in terms of a name.

When you say "liveness probes failed", what do you mean? Basically to help with naming, today when getDataSetProvisioningStatus returns terminated what is happening with the dataset? How could that have happened to the dataset? That is useful info for coming up with a new name here.

But in general: I'm in favor of fixing this and making it clearer and simpler to reason about. Overloading the term "terminated", I think, moves us in the wrong direction.

When getDataSetProvisioningStatus returns terminated, it means the data set is live/healthy on-chain but Curio's addPieces path returns the unrecoverable_proving_failure_epoch state, with the SP refusing addPieces with HTTP 409.

Ref - #545
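To make the distinction easier to reason about while a name is chosen, here is a sketch of the three cases as described in this thread. It is not the actual `getDataSetProvisioningStatus` implementation: the `SlotObservation` shape is hypothetical, and `irrecoverable` is just one of the candidate names from the doc standing in for today's `terminated`.

```typescript
// Hypothetical classifier; it models the cases discussed above, not the real SDK logic.
type ProvisioningStatus = "missing" | "live" | "irrecoverable";

interface SlotObservation {
  // null when no dataset resolves for the slot, e.g. the SDK filtered it out
  // because pdpEndEpoch !== 0 on-chain, or it was never created.
  dataSetId: number | null;
  // Result of probing the SP's addPieces path, if a probe was made.
  addPiecesHttpStatus?: number;
  addPiecesErrorCode?: string;
}

function classifySlot(observation: SlotObservation): ProvisioningStatus {
  if (observation.dataSetId === null) {
    return "missing";
  }
  // Live on-chain, but the SP refuses addPieces: the state currently called `terminated`.
  const refusedBySp =
    observation.addPiecesHttpStatus === 409 &&
    observation.addPiecesErrorCode === "unrecoverable_proving_failure_epoch";
  return refusedBySp ? "irrecoverable" : "live";
}
```

Framed this way, the on-chain meaning of "terminated" (`pdpEndEpoch !== 0`) never appears as a status of its own, since those datasets surface as `missing`.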

BigLep · 2026-06-03T20:11:58Z

+
+The name `terminated` is already used for both the on-chain lifecycle concept and this SDK liveness-probe failure state, which causes confusion. Candidate replacements: `irrecoverable` or `missing.sp`. This rename would affect `data_set_creation`'s handler and repair path as well.
+
+### Should `data_set_termination` absorb the repair path from `data_set_creation`?


It's good to raise it. I don't know if there is an obvious clear-cut answer.

It sounds like the repair case still involves creating a dataset, right? That is why it's reasonable for it to live in data_set_creation.

Maybe we just have an FAQ item for "Why doesn't data-set-termination handle repair?" For the answer we can say it's not clear cut either direction and that logic first lived in create_data_set. There hasn't been a compelling reason to change this.

Here, repair means that Curio's addPieces refuses with unrecoverable_proving_failure_epoch while the data set is still live on-chain. The repair path therefore calls terminateService, which actually terminates the data set on-chain. On the next tick, the data set is replenished.

BigLep · 2026-06-03T20:15:46Z

+
+`terminateService` calls `FilecoinPay.terminateRail(pdpRailId)`, which sets `endEpoch = block.number + lockupPeriod` on the PDP rail. The FWSS `railTerminated` callback fires in the same transaction, stores `info.pdpEndEpoch`, and emits `PDPPaymentsTerminated` and `ServiceTerminated`.
+
+This is the point the termination job polls for: `pdpEndEpoch !== 0`. Once this is set, `data_set_creation` will classify the slot as `missing` and begin the replenishment sequence. The termination job's work is done here.


termination job polls for: pdpEndEpoch !== 0.

Maybe quickly say what happens after this polling when `pdpEndEpoch !== 0` is found (or you can link to the document where this is described more).

Once this is set

I think it's not fully obvious what the "this" pronoun is referring to. To make it clear, maybe remove the pronoun and replace with the actual noun?
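For reference, the polling step under discussion could look roughly like the sketch below. `readPdpEndEpoch` is a hypothetical stand-in for whatever on-chain read the job uses, and the injected `sleep` and `signal` are assumptions made to keep the sketch testable. The `signal` of an `AbortController` created from `DATA_SET_TERMINATION_JOB_TIMEOUT_SECONDS` would satisfy the `signal` shape.

```typescript
// Hypothetical poller for the "pdpEndEpoch !== 0" condition; all names are assumptions.
interface PollOptions {
  readPdpEndEpoch: () => Promise<number>; // stand-in for the on-chain read
  sleep: (ms: number) => Promise<void>;
  intervalMs: number;
  signal: { readonly aborted: boolean }; // e.g. AbortController.signal
}

async function waitForServiceTermination(
  options: PollOptions,
): Promise<"terminated" | "aborted"> {
  while (!options.signal.aborted) {
    const pdpEndEpoch = await options.readPdpEndEpoch();
    // A non-zero pdpEndEpoch means terminateService has been recorded on-chain.
    if (pdpEndEpoch !== 0) {
      return "terminated";
    }
    await options.sleep(options.intervalMs);
  }
  // The job timeout fired before the termination was observed.
  return "aborted";
}
```

On `"terminated"` the job would record success and exit, leaving replenishment to the next `data_set_creation` tick. On `"aborted"` it would record a timeout, and a later tick can observe the same dataset again.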

BigLep · 2026-06-03T20:45:47Z

Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.

Post GA item created: FilOzone/filecoin-services#503

silent-cipher added 2 commits June 1, 2026 17:53

docs: add data-set-creation job design documentation

d71b721

docs: add data-set-deletion job design documentation

45a3e90

FilOzzy added this to FOC Jun 1, 2026

github-project-automation Bot moved this to 📌 Triage in FOC Jun 1, 2026

silent-cipher changed the base branch from main to docs/data-set-creation-design-doc June 1, 2026 18:24

docs: simplify terminated slot skip

e654660

silent-cipher self-assigned this Jun 1, 2026

rjan90 moved this from 📌 Triage to ⌨️ In Progress in FOC Jun 2, 2026

Base automatically changed from docs/data-set-creation-design-doc to main June 3, 2026 06:18

silent-cipher added 2 commits June 3, 2026 11:50

Merge branch 'main' into feat/data-set-deletion-job

c08d54f

docs: rename + more explanations

6af976b

silent-cipher requested review from BigLep and Copilot June 3, 2026 08:03

Copilot started reviewing on behalf of silent-cipher June 3, 2026 08:03

silent-cipher requested a review from SgtPooki June 3, 2026 08:04

Copilot AI reviewed Jun 3, 2026


chore: address pr comments

c537509

silent-cipher changed the title ~~docs: data set deletion job design documentation~~ docs: data set termination job design documentation Jun 3, 2026

rjan90 marked this pull request as ready for review June 3, 2026 14:40

rjan90 moved this from ⌨️ In Progress to 🔎 Awaiting review in FOC Jun 3, 2026

BigLep reviewed Jun 3, 2026



5 participants
