Release v3.8.2: activation eligibility, reward overflow & stability fixes #272

Open
Zyra-V21 wants to merge 11 commits into master from dev

@Zyra-V21
Collaborator

Summary

Promotes the current dev line to master as v3.8.2. Five changes across data completeness, arithmetic correctness, and indexer stability.

Data / schema

Bug fixes

Closes

Known issues (not addressed here)

Deploy notes

  • Ships two DB migrations, applied in order by golang-migrate:
    • 000036_alter_block_rewards_uint256: MODIFY COLUMN on the three reward columns of t_block_rewards (UInt64 → UInt256). This triggers a background ClickHouse mutation that rewrites those columns across all parts; on large tables it needs free disk and is not instantaneous. Monitor system.mutations for completion.
    • 000037_add_activation_eligibility_epoch: additive ADD COLUMN on t_validator_last_status, non-blocking / instant.
  • Apply migrations together with the new binary. Running the old binary against the migrated schema leaves f_activation_eligibility_epoch at its FAR_FUTURE_EPOCH default for freshly written rows.
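
One way to watch the t_block_rewards rewrite is to poll system.mutations and summarize the rows. The sketch below is illustrative only: the query uses the standard system.mutations columns, but mutationRow, summarize, and the sample rows are hypothetical, and running the query through a ClickHouse client is left out.

```go
package main

import "fmt"

// mutationQuery lists mutations recorded for t_block_rewards.
// The columns are those of ClickHouse's system.mutations table.
const mutationQuery = `
SELECT mutation_id, is_done, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE table = 't_block_rewards'`

// mutationRow mirrors one result row of mutationQuery (hypothetical type).
type mutationRow struct {
	MutationID       string
	IsDone           bool
	PartsToDo        int64
	LatestFailReason string
}

// summarize counts unfinished mutations and collects any failure reasons.
func summarize(rows []mutationRow) (pending int, failures []string) {
	for _, r := range rows {
		if !r.IsDone {
			pending++
		}
		if r.LatestFailReason != "" {
			failures = append(failures, r.MutationID+": "+r.LatestFailReason)
		}
	}
	return pending, failures
}

func main() {
	// Made-up sample rows standing in for a real query result.
	rows := []mutationRow{
		{MutationID: "mutation_1.txt", IsDone: true},
		{MutationID: "mutation_2.txt", PartsToDo: 3},
	}
	pending, failures := summarize(rows)
	fmt.Println(mutationQuery)
	fmt.Printf("pending=%d failures=%d\n", pending, len(failures))
}
```

The rewrite can be treated as complete once pending is zero and failures is empty.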

Test plan

  • Migrations 000036 and 000037 apply cleanly on a master-versioned DB
  • system.mutations for the t_block_rewards rewrite completes without error
  • Indexer keeps chain pace post-deploy (finalized epoch advancing)
  • f_activation_eligibility_epoch populated for active/exited validators
  • Block reward columns hold large values without wraparound

BitWonka and others added 11 commits April 16, 2026 03:11
…mutation pressure

The val-window service emits a lightweight DELETE on
t_validator_rewards_summary every finalized checkpoint event (~6:24 min).
On networks with ~1M validators this lowers to a ClickHouse mutation that
rewrites the in-window parts (14-55 GiB each on hoodi); each fire takes
minutes and queues up faster than it can drain, saturating one merge core
and stalling the head until an operator restart.

Make the boundary advance gate the fire: only emit DELETE once the
window's lower boundary has advanced by DELETE_CADENCE_EPOCHS since the
last successful fire (default 32 epochs ~3.4h, ~70x headroom over the
~150s mutation cost on hoodi). The first event after start always fires
to anchor the baseline; subsequent events skip until the cadence is met.

The DELETE statement and boundary calculation are unchanged - the only
observable difference is up to (cadence-1) extra epochs retained beyond
the strict window (0.16% overshoot vs. the 20250-epoch window). The
per-epoch surgical delete used by reorg recovery (DeleteStateMetrics) is
untouched. Set DELETE_CADENCE_EPOCHS=1 for legacy behaviour.
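
The gating rule described in this commit message can be sketched as follows. This is a minimal illustration, not the val-window service's code: deleteGate, ShouldFire, and MarkFired are hypothetical names, and only the rule itself (first event fires, later events fire once the lower boundary has advanced by the cadence since the last successful fire) comes from the text.

```go
package main

import "fmt"

// deleteGate decides whether a finalized-checkpoint event should emit the
// window DELETE. All names here are hypothetical, not goteth's own.
type deleteGate struct {
	cadence      uint64 // DELETE_CADENCE_EPOCHS; 1 reproduces legacy behaviour
	lastBoundary uint64 // lower boundary at the last successful fire
	anchored     bool   // false until the first fire after start
}

// ShouldFire reports whether the DELETE should run for this lower boundary.
func (g *deleteGate) ShouldFire(boundary uint64) bool {
	if !g.anchored {
		return true // the first event after start always fires
	}
	if boundary < g.lastBoundary {
		return false // guard against unsigned underflow below
	}
	return boundary-g.lastBoundary >= g.cadence
}

// MarkFired records a successful DELETE so later events measure from it.
func (g *deleteGate) MarkFired(boundary uint64) {
	g.lastBoundary = boundary
	g.anchored = true
}

func main() {
	g := &deleteGate{cadence: 32}
	for _, b := range []uint64{100, 101, 131, 132, 133} {
		fire := g.ShouldFire(b)
		if fire {
			g.MarkFired(b)
		}
		fmt.Printf("boundary=%d fire=%v\n", b, fire)
	}
}
```

Recording the boundary only after a successful DELETE means a failed fire is retried on the next event rather than deferred by a full cadence.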
…UpTo race

Each FinalizedCheckpointEvent in head mode launches a new
`go AdvanceFinalized(...)`. When a previous invocation is still running
(common when ProcessStateTransitionMetrics takes longer than the ~6:24
min finalized interval — networks with ~1M validators, or any catch-up
scenario), two goroutines race over the same StateHistory: the newer one
runs CleanUpTo at the end of its loop and evicts entries that the older
one is still blocked on inside StateHistory.Wait / BlockHistory.Wait.
The blocked goroutine then waits forever holding a processerBook slot,
and successive races leak the whole 32-slot pool, surfacing as floods
of "Waiting for too long to acquire page" warnings and a stuck head.

Observed on goteth-hoodi this morning: a single dependency state at
epoch 93105 was evicted while a ProcessStateTransitionMetrics goroutine
held a Wait on it, blocking that slot for 30+ minutes; the analyzer
stopped advancing past dbHeadEpoch 93110 even with ClickHouse healthy.

Skip overlapping invocations via TryLock. The skipped one would have
iterated a subset of the state keys the next invocation will see, and
its CleanUpTo would have been a subset of what the next one performs,
so dropping it is monotonically safe — no work is lost.

The historical-mode synchronous call site (routines.go:208) is
unaffected: head mode only starts after historical completes, so
TryLock always succeeds there.
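
The skip-if-busy pattern can be sketched with sync.Mutex.TryLock (available since Go 1.18). The finalizer type and TryAdvance method below are hypothetical stand-ins; the real AdvanceFinalized signature and its call sites are not shown in this message.

```go
package main

import (
	"fmt"
	"sync"
)

// finalizer serialises AdvanceFinalized-style work (hypothetical type).
type finalizer struct {
	mu sync.Mutex
}

// TryAdvance runs work unless a previous invocation is still in flight,
// in which case it returns false immediately instead of queueing.
func (f *finalizer) TryAdvance(work func()) bool {
	if !f.mu.TryLock() {
		return false // overlapping invocation: skip, the next one covers it
	}
	defer f.mu.Unlock()
	work()
	return true
}

func main() {
	f := &finalizer{}
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan bool, 1)

	// First invocation holds the lock until released.
	go func() {
		done <- f.TryAdvance(func() {
			close(started)
			<-release
		})
	}()

	<-started
	skipped := !f.TryAdvance(func() {}) // overlaps with the first call
	close(release)
	first := <-done
	later := f.TryAdvance(func() {}) // lock is free again
	fmt.Println(first, skipped, later)
}
```

Unlike Lock, TryLock never parks the caller, so a slow invocation cannot accumulate a backlog of waiting goroutines behind it.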
…tatus

Validators between deposit and activation can be in one of two
spec-defined sub-states: pending_initialized (eligibility epoch is
FAR_FUTURE_EPOCH) or pending_queued (eligibility epoch is set).
goteth read ActivationEligibilityEpoch from the beacon state into
local memory but never persisted it, so downstream consumers could
not split the two sub-states.

This commit:
- adds f_activation_eligibility_epoch (UInt64, default
  FAR_FUTURE_EPOCH) to t_validator_last_status via migration 000036
- extends the ValidatorLastStatus struct, ToArray, and the
  ClickHouse INSERT to carry the new field
- reads validator.ActivationEligibilityEpoch in
  processValLastStatus
- adds three invariant tests in tests/db_validator_test.py
- documents the column in docs/tables.md

Fixes #266
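
With the eligibility epoch persisted, a downstream consumer could split the two pending sub-states along these lines. This is a sketch under assumptions: pendingSubState is a hypothetical helper, FAR_FUTURE_EPOCH is taken as 2^64 - 1 per the consensus spec, and the caller is assumed to have already established that the validator is between deposit and activation.

```go
package main

import (
	"fmt"
	"math"
)

// farFutureEpoch is FAR_FUTURE_EPOCH from the consensus spec (2^64 - 1).
const farFutureEpoch uint64 = math.MaxUint64

// pendingSubState classifies a validator that is already known to be
// pending (deposited, not yet activated). Hypothetical helper.
func pendingSubState(activationEligibilityEpoch uint64) string {
	if activationEligibilityEpoch == farFutureEpoch {
		return "pending_initialized"
	}
	return "pending_queued"
}

func main() {
	fmt.Println(pendingSubState(farFutureEpoch))
	fmt.Println(pendingSubState(93105))
}
```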
* fix: block rewards overflow

* use helper for bigInt conversion

* use uint256 instead of string

* update docs

---------


Co-authored-by: Zyra-V21 <zyrav21@proton.me>
PR #259 (block-rewards overflow fix) merged into dev today and claimed
migration number 036 for alter_block_rewards_uint256. Renumber this PR's
migration to 037 to avoid the collision and keep numerical ordering
deterministic on rebase.

No content change; pure file rename.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…letes

fix: prevent goteth stalls on networks with large validator sets
@Zyra-V21 Zyra-V21 self-assigned this May 26, 2026