feat(walrus-service): add unique blob pending count metric for shard recovery #3390

Draft
halfprice wants to merge 1 commit into main from zhewu/shard-recover-blob-count-metric

@halfprice

Collaborator

TODO addressed

Add metrics to track recovery blob count in shard sync

Approach

The existing sync_shard_recover_sliver_pending_total gauge counts entries in pending_recover_slivers. Those entries are keyed by (SliverType, BlobId), so a blob with both primary and secondary slivers pending contributes two entries and is double-counted at the blob level. This PR adds a separate sync_shard_recover_blob_pending_total gauge that tracks unique blob IDs in the pending set, and leaves the semantics of the sliver-level metric intact. During the recovery iteration we maintain an in-memory HashMap<BlobId, usize> of remaining entries per blob and decrement the unique blob count exactly when a blob's last entry is processed.
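The per-blob bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PendingCounts`, `from_entries`, and `mark_processed` are hypothetical names, and `SliverType` and `BlobId` are simplified stand-ins for the real walrus-service types.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real walrus-service types (assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum SliverType {
    Primary,
    Secondary,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct BlobId(pub u64);

/// Hypothetical helper that tracks both views of the pending set:
/// the number of (SliverType, BlobId) entries and the number of unique blobs.
pub struct PendingCounts {
    remaining_per_blob: HashMap<BlobId, usize>,
    pending_slivers: usize,
}

impl PendingCounts {
    /// Builds both counts from one pass over the pending entries.
    pub fn from_entries(entries: &[(SliverType, BlobId)]) -> Self {
        let mut remaining_per_blob = HashMap::new();
        for (_, blob_id) in entries {
            *remaining_per_blob.entry(*blob_id).or_insert(0) += 1;
        }
        Self {
            remaining_per_blob,
            pending_slivers: entries.len(),
        }
    }

    pub fn pending_slivers(&self) -> usize {
        self.pending_slivers
    }

    pub fn pending_blobs(&self) -> usize {
        self.remaining_per_blob.len()
    }

    /// Call once per processed entry. The unique blob count drops only
    /// when the last remaining entry for that blob has been processed.
    pub fn mark_processed(&mut self, blob_id: BlobId) {
        if let Some(remaining) = self.remaining_per_blob.get_mut(&blob_id) {
            *remaining -= 1;
            self.pending_slivers -= 1;
            if *remaining == 0 {
                self.remaining_per_blob.remove(&blob_id);
            }
        }
    }
}
```

With one blob pending both sliver types and another pending only its primary sliver, this sketch starts at 3 sliver entries and 2 blobs. Processing the first blob's primary sliver lowers the sliver count to 2 but leaves the blob count at 2 until its secondary sliver is also processed.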

Changes

  • Add sync_shard_recover_blob_pending_total metric (IntGaugeVec labeled by shard) in crates/walrus-service/src/node/metrics.rs.
  • In ShardStorage::recover_missing_blobs, compute pending sliver entries and unique blob count from a single initial scan and update both metrics as the iteration progresses.
  • Update record_pending_recovery_metrics to accept both counts and set both gauges.
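To illustrate the shape of the last two changes, here is a hedged sketch of an initial scan that yields both counts, followed by a function that sets both gauges for a shard. The real code uses an IntGaugeVec labeled by shard; `ShardGauge`, `Metrics`, and `scan_pending` below are hypothetical stand-ins, the signature of `record_pending_recovery_metrics` is assumed, and the scan uses a plain `HashSet` only to show the counting.

```rust
use std::collections::{HashMap, HashSet};

// Simplified stand-ins for the real walrus-service types (assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum SliverType {
    Primary,
    Secondary,
}
pub type BlobId = u64;

/// Hypothetical stand-in for a gauge labeled by shard index.
#[derive(Default)]
pub struct ShardGauge {
    values: HashMap<u16, i64>,
}

impl ShardGauge {
    pub fn set(&mut self, shard: u16, value: i64) {
        self.values.insert(shard, value);
    }

    pub fn get(&self, shard: u16) -> Option<i64> {
        self.values.get(&shard).copied()
    }
}

/// Hypothetical container for the two pending gauges.
#[derive(Default)]
pub struct Metrics {
    pub sliver_pending: ShardGauge, // sync_shard_recover_sliver_pending_total
    pub blob_pending: ShardGauge,   // sync_shard_recover_blob_pending_total
}

/// One pass over the pending keys: returns (sliver entries, unique blobs).
pub fn scan_pending(keys: &[(SliverType, BlobId)]) -> (usize, usize) {
    let unique: HashSet<BlobId> = keys.iter().map(|(_, blob_id)| *blob_id).collect();
    (keys.len(), unique.len())
}

/// Assumed signature: accepts both counts and sets both gauges for one shard.
pub fn record_pending_recovery_metrics(
    metrics: &mut Metrics,
    shard: u16,
    pending_slivers: usize,
    pending_blobs: usize,
) {
    metrics.sliver_pending.set(shard, pending_slivers as i64);
    metrics.blob_pending.set(shard, pending_blobs as i64);
}
```

Passing both counts in one call keeps the two gauges updated together, so a dashboard reading them side by side should not see one updated without the other.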

What I did NOT do

  • Did not rename or change semantics of the existing sync_shard_recover_sliver_pending_total metric, since dashboards and alerts may already depend on it.
  • Did not change the per-blob accounting inside recover_blob (success/error/skip metrics remain sliver-typed); only the pending-set view gains a blob-level reading.

Verification

  • cargo fmt -- --config group_imports=StdExternalCrate,imports_granularity=Crate,imports_layout=HorizontalVertical
  • cargo clippy -p walrus-service --all-features --tests -- -D warnings
  • cargo test -p walrus-service --lib recovery (22 passed)
  • cargo test -p walrus-service --lib shard (86 passed, 1 long-running ignored)

https://claude.ai/code/session_016VZ7yf2vEgkXYiriv8ipgS


Generated by Claude Code

…recovery

Add a new `sync_shard_recover_blob_pending_total` gauge metric that
tracks the number of unique blob IDs pending recovery during shard
sync, distinct from the existing `sync_shard_recover_sliver_pending_total`
which counts (sliver_type, blob_id) entries. A blob can contribute
either one or two entries to the pending set, so the existing metric
overcounts blobs when both primary and secondary slivers need recovery.

https://claude.ai/code/session_016VZ7yf2vEgkXYiriv8ipgS
@github-actions

github-actions Bot commented Jun 8, 2026

Contributor

This PR is stale because it has been open 14 days with no activity. It will be closed in 7 days unless you remove the stale label, add the do-not-close label, or comment on it.

@github-actions github-actions Bot added the stale label Jun 8, 2026