feat(walrus-service): add unique blob pending count metric for shard recovery #3390
Draft
halfprice wants to merge 1 commit into
…recovery

Add a new `sync_shard_recover_blob_pending_total` gauge metric that tracks the number of unique blob IDs pending recovery during shard sync, distinct from the existing `sync_shard_recover_sliver_pending_total`, which counts `(sliver_type, blob_id)` entries. A blob can contribute either one or two entries to the pending set, so the existing metric overcounts blobs when both primary and secondary slivers need recovery.

https://claude.ai/code/session_016VZ7yf2vEgkXYiriv8ipgS
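To make the overcount concrete, here is a minimal sketch of the two counts taken over the same pending set. The `SliverType` enum, the `u32` blob ID, the `BTreeSet` container, and the `pending_counts` helper are hypothetical stand-ins for illustration, not the types or functions used in `walrus-service`.

```rust
use std::collections::{BTreeSet, HashSet};

// Hypothetical stand-ins for the real walrus types (assumptions for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
enum SliverType {
    Primary,
    Secondary,
}

type BlobId = u32;

/// Returns `(sliver_entries, unique_blobs)` for a pending set keyed by
/// `(SliverType, BlobId)`.
fn pending_counts(pending: &BTreeSet<(SliverType, BlobId)>) -> (usize, usize) {
    // Each key is one sliver entry, so the set length is the sliver-level count.
    let sliver_entries = pending.len();
    // Collapsing the keys to their blob IDs gives the blob-level count.
    let unique_blobs = pending
        .iter()
        .map(|(_, blob_id)| *blob_id)
        .collect::<HashSet<BlobId>>()
        .len();
    (sliver_entries, unique_blobs)
}
```

With blob 1 pending for both sliver types and blob 2 pending only for its primary sliver, `pending_counts` returns `(3, 2)`: three sliver entries but only two blobs.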
This PR is stale because it has been open 14 days with no activity. It will be closed in 7 days unless you remove the |
TODO addressed
Approach
The existing `sync_shard_recover_sliver_pending_total` gauge counts entries in `pending_recover_slivers`, but those entries are keyed by `(SliverType, BlobId)`, so a blob with both primary and secondary slivers pending contributes two entries and is double-counted at the blob level. This PR adds a separate `sync_shard_recover_blob_pending_total` gauge that tracks unique blob IDs in the pending set, while keeping the sliver-level metric semantically correct. During the recovery iteration we maintain an in-memory `HashMap<BlobId, usize>` of remaining entries per blob, decrementing the unique blob count exactly when a blob's last entry is processed.
Changes
- Added the `sync_shard_recover_blob_pending_total` metric (`IntGaugeVec` labeled by `shard`) in `crates/walrus-service/src/node/metrics.rs`.
- In `ShardStorage::recover_missing_blobs`, compute pending sliver entries and unique blob count from a single initial scan and update both metrics as the iteration progresses.
- Updated `record_pending_recovery_metrics` to accept both counts and set both gauges.
What I did NOT do
- Change the existing `sync_shard_recover_sliver_pending_total` metric, since dashboards and alerts may already depend on it.
- Change `recover_blob` (success/error/skip metrics remain sliver-typed); only the pending-set view gains a blob-level reading.
Verification
- `cargo fmt -- --config group_imports=StdExternalCrate,imports_granularity=Crate,imports_layout=HorizontalVertical`
- `cargo clippy -p walrus-service --all-features --tests -- -D warnings`
- `cargo test -p walrus-service --lib recovery` (22 passed)
- `cargo test -p walrus-service --lib shard` (86 passed, 1 long-running ignored)

https://claude.ai/code/session_016VZ7yf2vEgkXYiriv8ipgS
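The bookkeeping described under Approach can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in this PR: `PendingTracker`, `from_scan`, and `mark_processed` are hypothetical names, `BlobId` is simplified to `u32`, and the returned pair stands in for the two values that would be written to the sliver-level and blob-level gauges.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real walrus types (assumptions for illustration).
type BlobId = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SliverType {
    Primary,
    Secondary,
}

/// Tracks the sliver-level and blob-level pending counts during one recovery pass.
struct PendingTracker {
    /// Remaining pending entries per blob (1 or 2 in the scenario described above).
    remaining_per_blob: HashMap<BlobId, usize>,
    /// Total remaining `(SliverType, BlobId)` entries.
    pending_slivers: usize,
}

impl PendingTracker {
    /// Builds both counts from a single scan over the pending entries.
    fn from_scan(entries: &[(SliverType, BlobId)]) -> Self {
        let mut remaining_per_blob = HashMap::new();
        for (_, blob_id) in entries {
            *remaining_per_blob.entry(*blob_id).or_insert(0) += 1;
        }
        Self {
            remaining_per_blob,
            pending_slivers: entries.len(),
        }
    }

    /// Number of unique blobs that still have at least one pending entry.
    fn pending_blobs(&self) -> usize {
        self.remaining_per_blob.len()
    }

    /// Marks one entry of `blob_id` as processed and returns
    /// `(pending_slivers, pending_blobs)`. Unknown blob IDs are ignored.
    fn mark_processed(&mut self, blob_id: BlobId) -> (usize, usize) {
        let mut last_entry = false;
        if let Some(remaining) = self.remaining_per_blob.get_mut(&blob_id) {
            *remaining -= 1;
            self.pending_slivers -= 1;
            last_entry = *remaining == 0;
        }
        // The blob-level count drops only when the blob's last entry is done.
        if last_entry {
            self.remaining_per_blob.remove(&blob_id);
        }
        (self.pending_slivers, self.pending_blobs())
    }
}
```

In this sketch, the sliver-level count drops on every processed entry, while the blob-level count drops only when a blob's final entry is processed, which matches the behavior the description attributes to the two gauges.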
Generated by Claude Code