fix(net): retry whitelist peers instead of exiting #586
Whitelist entries were consumed as they were dialed and hostnames were resolved once, so a whitelist-only node permanently exited with NoReachablePeers after a handful of transient disconnects. The default two-hour connection rotation alone kills a single-peer node in about four hours. Keep the configured whitelist as a persistent set and re-seed the dial queue from it, rate-limited to once per second, whenever it runs dry in whitelist-only mode. Hostnames are re-resolved on every cycle, so reconnections follow DNS changes. Gossip mode is unchanged, and an empty whitelist still exits with NoReachablePeers.
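To illustrate the refill behaviour described above, here is a minimal Python sketch. It is not the code in this PR: the class `WhitelistDialer`, its methods, the injected `resolve` and `clock` callables, and the Python `NoReachablePeers` exception are all hypothetical stand-ins, and the one-second limit is the only value taken from the description.

```python
import time
from collections import deque
from typing import Callable, Deque, Iterable, Optional, Sequence, Tuple

Address = Tuple[str, int]


class NoReachablePeers(Exception):
    """Hypothetical stand-in for the node's NoReachablePeers exit."""


class WhitelistDialer:
    """Illustrative dial queue that is re-seeded from a persistent whitelist."""

    def __init__(
        self,
        whitelist: Sequence[Address],
        resolve: Callable[[str], Iterable[str]],
        min_refill_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if not whitelist:
            # An empty whitelist can never yield a peer, so fail up front.
            raise NoReachablePeers("whitelist is empty")
        self._whitelist = tuple(whitelist)  # persistent: never consumed
        self._resolve = resolve
        self._min_refill_interval = min_refill_interval
        self._clock = clock
        self._queue: Deque[Address] = deque()
        self._last_refill: Optional[float] = None

    def next_address(self) -> Optional[Address]:
        """Return the next address to dial, or None if rate-limited."""
        if not self._queue:
            self._refill()
        return self._queue.popleft() if self._queue else None

    def _refill(self) -> None:
        now = self._clock()
        if (
            self._last_refill is not None
            and now - self._last_refill < self._min_refill_interval
        ):
            return  # refilled too recently; the caller retries later
        self._last_refill = now
        for host, port in self._whitelist:
            # Resolve on every refill so reconnections follow DNS changes.
            for ip in self._resolve(host):
                self._queue.append((ip, port))
```

A caller that receives `None` would simply wait and ask again instead of exiting, which is the behavioural change the description is after.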
Force frequent connection rotations with a two-second maximum_connection_time and assert the node keeps syncing new blocks across several rotations. Before the retry fix this exits with NoReachablePeers on the second rotation.
Some ideas/questions out of curiosity: Would it make sense to randomize whitelist order on each refill?
Spread reconnections across the allowlist instead of always retrying entries in the same order. Suggested in review.
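As a rough illustration of the randomized refill order, the sketch below shuffles a copy of the allowlist each time the queue is re-seeded. The function name `refill_order` and the optional `rng` parameter are hypothetical and are not taken from this PR.

```python
import random
from typing import List, Optional, Sequence


def refill_order(
    whitelist: Sequence[str], rng: Optional[random.Random] = None
) -> List[str]:
    """Return the whitelist entries in a fresh random order.

    The configured whitelist is left untouched, so every refill starts
    from the full set and only the dialing order changes.
    """
    rng = rng or random.Random()
    order = list(whitelist)  # copy; do not mutate the persistent set
    rng.shuffle(order)
    return order
```

Passing a seeded `random.Random` makes the order reproducible, which can help when a test needs to assert which entry is dialed first.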
@randomlogin Thanks for reviewing my changes! Here are the answers:
Whitelist-only nodes die permanently on transient peer loss: whitelist entries are consumed as they're dialed, hostnames are resolved once, and nothing ever refills the list. With the default 2h connection rotation, a single-peer node (the #566 k8s setup) hits NoReachablePeers every ~4 hours. We measured 54 deaths across 4 production nodes in 48 hours, each one forcing a full resync.
Changes:
An allowlist says who to connect to, not how long to try. The new integration test forces 2-second rotations: on master the node dies on the second rotation, with this change it keeps syncing.