
fix(net): retry whitelist peers instead of exiting #586

Draft
0xsiddharthks wants to merge 3 commits into 2140-dev:master from 0xsiddharthks:siddharth/fix/whitelist-retry

@0xsiddharthks 0xsiddharthks commented Jun 10, 2026

Contributor

Whitelist-only nodes die permanently on transient peer loss: whitelist entries are consumed as they're dialed, hostnames are resolved once, and nothing ever refills the list. With the default 2h connection rotation, a single-peer node (the #566 k8s setup) hits NoReachablePeers every ~4 hours. We measured 54 deaths across 4 production nodes in 48 hours, each one forcing a full resync.

Changes:

  • The configured whitelist is now a persistent set. When the dial queue runs dry in whitelist-only mode, it's re-seeded from the whitelist (shuffled, rate-limited to once per second).
  • Hostnames are re-resolved on every cycle, so reconnections follow DNS changes.
  • Gossip mode is unchanged, and an empty whitelist still errors immediately.
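
The refill behaviour in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `WhitelistDialer`, `next_peer`, and `DialError` are hypothetical names, the xorshift generator is only a dependency-free stand-in for a real RNG, and allowing the first refill without delay is an assumption.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical error type mirroring the NoReachablePeers exit described above.
#[derive(Debug, PartialEq)]
pub enum DialError {
    NoReachablePeers,
}

/// Hypothetical stand-in for the node's whitelist state (not kyoto's real types).
pub struct WhitelistDialer {
    /// Configured entries, kept for the node's lifetime and never consumed.
    configured: Vec<String>,
    /// Entries still waiting to be dialed in the current cycle.
    queue: VecDeque<String>,
    /// Time of the last refill, used for rate limiting.
    last_refill: Option<Instant>,
    /// Minimum spacing between refills (one second in the PR description).
    min_interval: Duration,
    /// State of a tiny xorshift generator standing in for a real RNG.
    rng_state: u64,
}

impl WhitelistDialer {
    pub fn new(configured: Vec<String>, min_interval: Duration, seed: u64) -> Self {
        // The initial queue keeps the configured order; only refills are shuffled.
        let queue = configured.iter().cloned().collect();
        Self { configured, queue, last_refill: None, min_interval, rng_state: seed | 1 }
    }

    fn next_random(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// Returns `Ok(Some(entry))` to dial, or `Ok(None)` when the queue is dry
    /// and the rate limit says to wait for a later tick.
    pub fn next_peer(&mut self, now: Instant) -> Result<Option<String>, DialError> {
        // An empty whitelist still fails immediately.
        if self.configured.is_empty() {
            return Err(DialError::NoReachablePeers);
        }
        if self.queue.is_empty() {
            if let Some(last) = self.last_refill {
                if now.saturating_duration_since(last) < self.min_interval {
                    return Ok(None);
                }
            }
            // Re-seed from the persistent set and shuffle it (Fisher-Yates).
            let mut fresh = self.configured.clone();
            for i in (1..fresh.len()).rev() {
                let j = (self.next_random() % (i as u64 + 1)) as usize;
                fresh.swap(i, j);
            }
            self.queue = fresh.into();
            self.last_refill = Some(now);
        }
        Ok(self.queue.pop_front())
    }
}
```

A caller would invoke `next_peer` on each tick and treat `Ok(None)` as "nothing to dial yet" rather than as a fatal error, which is the behavioural difference from consuming the whitelist once.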

An allowlist says who to connect to, not how long to try. The new integration test forces 2-second rotations: on master the node dies on the second rotation; with this change it keeps syncing.

Whitelist entries were consumed as they were dialed and hostnames were
resolved once, so a whitelist-only node permanently exited with
NoReachablePeers after a handful of transient disconnects. The default
two hour connection rotation alone kills a single-peer node in about
four hours.

Keep the configured whitelist as a persistent set and re-seed the dial
queue from it, rate-limited to once per second, whenever it runs dry in
whitelist-only mode. Hostnames are re-resolved on every cycle, so
reconnections follow DNS changes. Gossip mode is unchanged, and an
empty whitelist still exits with NoReachablePeers.
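
As an illustration of re-resolving hostnames on every cycle, here is a minimal sketch using the standard library's blocking resolver. It is an assumption for illustration only: `resolve_cycle` is a hypothetical helper, and the node itself may resolve names differently (for example asynchronously).

```rust
use std::net::{SocketAddr, ToSocketAddrs};

/// Hypothetical helper: resolve every configured "host:port" entry afresh.
/// Calling this once per refill cycle, instead of once at startup, is what
/// lets reconnections follow DNS changes.
pub fn resolve_cycle(entries: &[String]) -> Vec<SocketAddr> {
    let mut resolved = Vec::new();
    for entry in entries {
        // `to_socket_addrs` parses IP literals directly and performs a
        // blocking lookup for hostnames.
        match entry.as_str().to_socket_addrs() {
            Ok(addrs) => resolved.extend(addrs),
            // An entry that fails to resolve is skipped for this cycle only;
            // it stays in the configured list and is retried on the next one.
            Err(_) => continue,
        }
    }
    resolved
}
```

Skipping a failed entry rather than dropping it is the point of the sketch: a temporary DNS failure should cost one cycle, not the peer.
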
Force frequent connection rotations with a two second
maximum_connection_time and assert the node keeps syncing new blocks
across several rotations. Before the retry fix this exits with
NoReachablePeers on the second rotation.
@randomlogin

Contributor

Some ideas/questions out of curiosity:

Would it make sense to randomize whitelist order on each refill?
What happens if we set required peers to 2, have whitelist of 10 peers, 9 of which do not respond and only the last one is really working, will kyoto survive? Will it find that 1 whitelisted peer and 1 random one?

Spread reconnections across the allowlist instead of always retrying
entries in the same order. Suggested in review.

0xsiddharthks commented Jun 11, 2026

Contributor Author

@randomlogin Thanks for reviewing my changes!

Here are the answers:

  • Randomizing the order
    Yes, this totally makes sense. Just added it to the PR.
    The dial queue is now shuffled on each refill, so reconnections spread across the allowlist instead of always favoring the same entry. Initial connection order is unchanged to keep the diff focused.

  • 9 dead + 1 live with required_peers = 2
    The node survives in both modes:

    • With whitelist_only

      • Each refill cycle redials the dead peers (one per tick, bounded by the handshake timeout) and reconnects to the live one.
      • One caveat: PeerMap doesn't dedupe against already-connected addresses, so to satisfy required_peers = 2 it can open a second connection to the same live peer (the same thing you'd get today by listing one peer twice). Skipping already-connected addresses when popping the queue could be a nice follow-up.
    • Without whitelist_only

      • This PR changes nothing: the whitelist is consumed once as before, so you'd get 1 whitelisted connection and the second from gossip/DNS.
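
The follow-up mentioned in the caveat above (skipping already-connected addresses when popping the queue) could look roughly like this. It is a sketch under assumptions: `pop_unconnected` is a hypothetical free function, and the real PeerMap may track its connections in a different structure than a `HashSet<SocketAddr>`.

```rust
use std::collections::{HashSet, VecDeque};
use std::net::SocketAddr;

/// Hypothetical helper: pop the next queued address that is not already
/// connected. Connected entries are discarded for this cycle; they would
/// come back on the next refill from the persistent whitelist.
pub fn pop_unconnected(
    queue: &mut VecDeque<SocketAddr>,
    connected: &HashSet<SocketAddr>,
) -> Option<SocketAddr> {
    while let Some(candidate) = queue.pop_front() {
        if !connected.contains(&candidate) {
            return Some(candidate);
        }
    }
    // Every remaining entry was already connected (or the queue was empty).
    None
}
```

One trade-off to keep in mind: with this filter, a whitelist containing a single live peer could no longer satisfy required_peers = 2 by connecting to it twice, so the node would keep redialing the dead entries instead.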
