Skip to content

Surface address-group migration re-stake lifecycle to operators (scheduled / retrying / succeeded / failed) #301

@jorgecuesta

Description

@jorgecuesta

Context

Follow-up from the PR #278 review thread on apps/provider/src/actions/Keys.ts:259.

Today MigrateKeysToAddressGroup only console.warns when TriggerAddressGroupMigrationSchedule() fails and still returns success: true. The recurring migration schedule (every 30s) self-discovers the migrated keys via remediationHistory and re-stakes them, so a failed trigger self-heals — but the operator has no visibility into whether the re-stake actually progressed. Throwing on trigger failure would wrongly report a failed migration (the DB migration did commit), so the right fix is observability, not a hard failure.

Proposal — track and surface the re-stake lifecycle per key

Replace the opaque "fire-and-forget trigger" with an explicit, operator-visible state:

  • Scheduled — migrated, queued for the recurring schedule to re-stake.
  • Retrying (#N) — last re-stake attempt failed; the schedule will try again. The operator sees both "it's scheduled" and "attempt N failed".
  • Succeeded — re-staked on-chain.
  • Failed — exhausted the attempt limit; stops auto-retrying, surfaced prominently + provider notification. A manual "retry" resets the counter and re-queues.

This reframes the trigger as a pure optimization (an early kick): the source of truth for "did the re-stake happen" is the recurring schedule's work, not the trigger call.

Open design questions (for team triage)

  1. State storage — dedicated indexed migrationStatus + migrationAttempts columns on keys (easy to query "show me stuck keys") vs. encoding in remediationHistory. Leaning: status column for UI/queryability, remediationHistory for the per-attempt audit trail (error + timestamp).
  2. Attempt limit N + cadence — fixed 30s vs. backoff. Each attempt costs gas; too low marks Failed on a transient blip, too high alerts late.
  3. Terminal Failed handling — manual retry only, or auto-reset after a cooldown?
  4. Idempotent counter increments — must not double-count under overlapping schedule runs (now that the migration SELECT takes FOR UPDATE, ref PR fix: mark stake transactions correctly when they land more than one block after broadcast #278).

Non-goal

Not blocking PR #278 — the current console.warn + self-heal is acceptable interim behavior.

Metadata

Metadata

Assignees

Labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions