fix(net): retry whitelist peers instead of exiting by 0xsiddharthks · Pull Request #586 · 2140-dev/kyoto

0xsiddharthks · 2026-06-10T21:44:55Z

Whitelist-only nodes die permanently on transient peer loss: whitelist entries are consumed as they're dialed, hostnames are resolved once, and nothing ever refills the list. With the default 2h connection rotation, a single-peer node (the #566 k8s setup) hits NoReachablePeers every ~4 hours. We measured 54 deaths across 4 production nodes in 48 hours, each one forcing a full resync.

Changes:

The configured whitelist is now a persistent set. When the dial queue runs dry in whitelist-only mode, it's re-seeded from the whitelist (shuffled, rate-limited to once per second).
Hostnames are re-resolved on every cycle, so reconnections follow DNS changes.
Gossip mode is unchanged, and an empty whitelist still errors immediately.
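
As a rough illustration of the refill behavior listed above, here is a minimal, std-only sketch. It is not the code in this PR: `WhitelistDialer`, `refill`, `next_peer`, and `REFILL_INTERVAL` are hypothetical names, and a small xorshift generator stands in for whatever RNG the crate actually uses. It assumes whitelist entries are `host:port` strings.

```rust
use std::collections::VecDeque;
use std::net::{SocketAddr, ToSocketAddrs};
use std::time::{Duration, Instant};

/// Assumed refill rate limit, taken from the "once per second" in the description.
const REFILL_INTERVAL: Duration = Duration::from_secs(1);

/// Hypothetical stand-in for the dial state; not kyoto's actual types.
struct WhitelistDialer {
    /// Configured "host:port" entries, kept for the lifetime of the node.
    whitelist: Vec<String>,
    /// Addresses waiting to be dialed; drained as connections are attempted.
    queue: VecDeque<SocketAddr>,
    /// Time of the last refill attempt, used for rate limiting.
    last_refill: Option<Instant>,
    /// State of a tiny xorshift generator (illustration only, not a real RNG).
    rng_state: u64,
}

impl WhitelistDialer {
    fn new(whitelist: Vec<String>, seed: u64) -> Self {
        // xorshift needs a non-zero state, so force the lowest bit on.
        Self { whitelist, queue: VecDeque::new(), last_refill: None, rng_state: seed | 1 }
    }

    fn next_random(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// Re-resolve every whitelist entry, shuffle the result, and append it to
    /// the queue. Returns false when skipped because of the rate limit.
    fn refill(&mut self, now: Instant) -> bool {
        if let Some(last) = self.last_refill {
            if now.duration_since(last) < REFILL_INTERVAL {
                return false;
            }
        }
        self.last_refill = Some(now);
        // Resolving here, rather than once at startup, lets reconnections
        // follow DNS changes. Entries that fail to resolve are skipped.
        let mut resolved: Vec<SocketAddr> = self
            .whitelist
            .iter()
            .filter_map(|entry| entry.to_socket_addrs().ok())
            .flatten()
            .collect();
        // Fisher-Yates shuffle so retries do not always favor the same entry.
        for i in (1..resolved.len()).rev() {
            let j = (self.next_random() % (i as u64 + 1)) as usize;
            resolved.swap(i, j);
        }
        self.queue.extend(resolved);
        true
    }

    /// Next address to dial. `None` means nothing is available right now
    /// (rate-limited, nothing resolved, or an empty whitelist); the caller
    /// decides whether to wait and retry or to give up.
    fn next_peer(&mut self, now: Instant) -> Option<SocketAddr> {
        if self.queue.is_empty() {
            self.refill(now);
        }
        self.queue.pop_front()
    }
}
```

In this sketch the whitelist is never drained, only the queue is, so a dry queue is a temporary condition rather than a terminal one. An empty whitelist still yields `None` on every call, which a caller could map to the existing NoReachablePeers error.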

An allowlist says who to connect to, not how long to try. The new integration test forces 2-second rotations: on master the node dies on the second rotation; with this change it keeps syncing.

Whitelist entries were consumed as they were dialed and hostnames were resolved once, so a whitelist-only node permanently exited with NoReachablePeers after a handful of transient disconnects. The default two-hour connection rotation alone kills a single-peer node in about four hours. Keep the configured whitelist as a persistent set and re-seed the dial queue from it, rate-limited to once per second, whenever it runs dry in whitelist-only mode. Hostnames are re-resolved on every cycle, so reconnections follow DNS changes. Gossip mode is unchanged, and an empty whitelist still exits with NoReachablePeers.

randomlogin · 2026-06-11T00:07:47Z

Some ideas/questions out of curiosity:

Would it make sense to randomize whitelist order on each refill?
What happens if we set required peers to 2 and have a whitelist of 10 peers, 9 of which do not respond and only the last one actually works? Will kyoto survive? Will it find that 1 whitelisted peer and 1 random one?

0xsiddharthks · 2026-06-11T03:49:39Z

@randomlogin Thanks for reviewing my changes!

Here are the answers:

Randomizing the order
Yes, this totally makes sense. Just added it to the PR.
The dial queue is now shuffled on each refill, so reconnections spread across the allowlist instead of always favoring the same entry. Initial connection order is unchanged to keep the diff focused.

9 dead + 1 live with required_peers = 2
The node survives in both modes:
With whitelist_only
- Each refill cycle redials the dead peers (one per tick, bounded by the handshake timeout) and reconnects to the live one.
- One caveat: PeerMap doesn't dedupe against already-connected addresses, so to satisfy required_peers = 2 it can open a second connection to the same live peer (the same thing you'd get today by listing one peer twice). Skipping already-connected addresses when popping the queue could be a nice follow-up.
Without whitelist_only
- This PR changes nothing: the whitelist is consumed once as before, so you'd get 1 whitelisted connection and the second from gossip/DNS.
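
The follow-up mentioned in the caveat above (skipping already-connected addresses when popping the queue) could look roughly like the sketch below. It is not part of this PR: `pop_unconnected` is a hypothetical helper, and it assumes the set of connected addresses is available as a `HashSet<SocketAddr>`, which may not match how PeerMap tracks connections.

```rust
use std::collections::{HashSet, VecDeque};
use std::net::SocketAddr;

/// Hypothetical helper: pop the next queued address that is not already
/// connected. Skipped entries are dropped from the queue; with a persistent
/// whitelist they would come back on the next refill.
fn pop_unconnected(
    queue: &mut VecDeque<SocketAddr>,
    connected: &HashSet<SocketAddr>,
) -> Option<SocketAddr> {
    while let Some(addr) = queue.pop_front() {
        if !connected.contains(&addr) {
            return Some(addr);
        }
    }
    None
}
```

One consequence worth weighing under these assumptions: in the 9 dead + 1 live case with whitelist_only, the node would hold a single connection and keep retrying the dead peers, instead of reaching required_peers = 2 by connecting to the live peer twice.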

0xsiddharthks added 2 commits June 10, 2026 17:33

test: whitelist-only node reconnects after rotation

4100d81

Force frequent connection rotations with a two second maximum_connection_time and assert the node keeps syncing new blocks across several rotations. Before the retry fix this exits with NoReachablePeers on the second rotation.

fix(net): shuffle the dial queue on refill

9955901

Spread reconnections across the allowlist instead of always retrying entries in the same order. Suggested in review.

0xsiddharthks wants to merge 3 commits into 2140-dev:master from 0xsiddharthks:siddharth/fix/whitelist-retry