fix(validator): serialize TAO source-balance check under axon_lock #456

Merged: entrius merged 1 commit into test from fix/axon-recv-tao-balance-lock on Jun 8, 2026

@anderdc (Collaborator) commented Jun 8, 2026

Root cause

The validator logs recurring "cannot call recv while another thread is already running recv or recv_streaming" errors. Root cause: a single substrate websocket call was made outside the lock that serializes that connection.

In handle_swap_reserve (allways/validator/axon_handlers.py), the source-balance check provider.get_balance(synapse.from_address) ran outside the with validator.axon_lock: block. The comment claimed the source-chain RPC is "a separate connection from substrate, so it doesn't need axon_lock."

That is true for a BTC source (Esplora/Maestro = HTTP) but false for a TAO source: the subtensor provider's get_balance calls self.subtensor.get_balance(...) on axon_subtensor — the exact websocket axon_lock exists to serialize. So every TAO->BTC reserve raced the lock-protected readers (other axon handler threads + the forward loop's bounds_cache reads on axon_subtensor), tripping the recv collision.
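
The collision described above can be reproduced in miniature. The following is a toy sketch, not the validator's code: FakeWebsocket is a hypothetical stand-in that rejects overlapping recv calls, and the events exist only to make the interleaving deterministic. One thread reads under the lock while a second caller reads without it, which is the shape of the race a TAO-sourced reserve would hit.

```python
import threading


class FakeWebsocket:
    """Hypothetical toy connection that rejects overlapping recv calls."""

    def __init__(self):
        self._busy = threading.Lock()
        self.entered = threading.Event()   # set once a recv is in progress
        self.release = threading.Event()   # lets a waiting recv finish

    def recv(self, wait=False):
        # Refuse to start if another thread is already inside recv.
        if not self._busy.acquire(blocking=False):
            raise RuntimeError(
                "cannot call recv while another thread is already running recv"
            )
        try:
            self.entered.set()
            if wait:
                self.release.wait(timeout=5)
            return "ok"
        finally:
            self._busy.release()


ws = FakeWebsocket()
axon_lock = threading.RLock()


def locked_reader():
    # Well-behaved reader: touches the connection only while holding the lock.
    with axon_lock:
        ws.recv(wait=True)


reader = threading.Thread(target=locked_reader)
reader.start()
ws.entered.wait(timeout=5)

# Unlocked caller: nothing stops it from overlapping with the reader.
try:
    ws.recv()
    collided = False
except RuntimeError:
    collided = True

ws.release.set()
reader.join()

# The same call made under the lock waits its turn and succeeds.
with axon_lock:
    serialized_result = ws.recv()
```

Holding the lock only helps if every caller on that connection takes it; a single unlocked call is enough to trip the error.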

Fix

  • Add uses_substrate: bool = False on the chain-provider base (base.py); set uses_substrate = True on the subtensor/TAO provider (subtensor.py). BTC stays at the default False.
  • Gate the balance check on the flag: serialize the TAO read under axon_lock, keep BTC's HTTP read lock-free so a ~300ms Esplora call can't stall the forward loop. (axon_lock is an RLock, so the later re-acquire in the main block is fine.)
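if provider.uses_substrate:
    # TAO source: get_balance goes through the shared axon_subtensor websocket.
    with validator.axon_lock:
        balance = provider.get_balance(synapse.from_address)
else:
    # BTC source: plain HTTP read, no need to hold axon_lock.
    balance = provider.get_balance(synapse.from_address)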
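
For readers without the repository at hand, here is a minimal self-contained sketch of the two pieces of the fix: a class-level flag on the provider base and a gate on that flag. The class names, the read_source_balance helper, the placeholder addresses, and the returned balances are all assumptions made for illustration; only uses_substrate, get_balance, and the RLock behaviour come from the description above.

```python
import threading


class ChainProvider:
    """Hypothetical stand-in for the chain-provider base class."""

    # Default: the provider does not touch the shared substrate websocket.
    uses_substrate: bool = False

    def get_balance(self, address: str) -> int:
        raise NotImplementedError


class FakeBitcoinProvider(ChainProvider):
    """Stand-in for an HTTP-backed provider; keeps the default flag."""

    def get_balance(self, address: str) -> int:
        return 50_000  # made-up value


class FakeSubtensorProvider(ChainProvider):
    """Stand-in for the substrate-backed provider; opts in to locking."""

    uses_substrate = True

    def get_balance(self, address: str) -> int:
        return 1_000_000_000  # made-up value


def read_source_balance(provider, axon_lock, address):
    # Serialize only the reads that share the substrate connection.
    if provider.uses_substrate:
        with axon_lock:
            return provider.get_balance(address)
    return provider.get_balance(address)


axon_lock = threading.RLock()

# An RLock can be re-acquired by the thread that already holds it,
# so calling the gated read from inside a locked block does not deadlock.
with axon_lock:
    tao_balance = read_source_balance(
        FakeSubtensorProvider(), axon_lock, "tao-source-address"
    )

btc_balance = read_source_balance(
    FakeBitcoinProvider(), axon_lock, "btc-source-address"
)
```

Putting the flag on the provider keeps the handler free of chain-specific checks: a new substrate-backed provider only needs to set the flag to get the locked path.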

Other handlers scanned

Scanned handle_miner_activate and handle_swap_confirm for the same pattern (any provider.* / axon_subtensor.* substrate call outside axon_lock). No additional leaks found — both wrap all their substrate work (is_hotkey_registered, read_miner_commitment, contract reads/votes, get_current_block, verify_transaction) inside with validator.axon_lock. The only leak was the reserve-time balance check.

Tests

Added TestSourceBalanceLock in tests/test_axon_handlers.py:

  • asserts SubtensorProvider.uses_substrate is True, BitcoinProvider.uses_substrate is False, base default False;
  • handler-level test that a TAO-sourced reserve holds axon_lock around get_balance, and that a BTC-sourced reserve does not.
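
One way such a handler-level check could be written is sketched below. This is not the contents of TestSourceBalanceLock: TrackingLock, RecordingProvider, and gated_balance_check are hypothetical helpers, and the real tests presumably drive handle_swap_reserve itself. The idea is to record, from inside get_balance, whether the lock was held at the moment of the call.

```python
import threading


class TrackingLock:
    """Hypothetical RLock wrapper that exposes the current hold depth."""

    def __init__(self):
        self._lock = threading.RLock()
        self.depth = 0

    def __enter__(self):
        self._lock.acquire()
        self.depth += 1  # only mutated while the lock is held
        return self

    def __exit__(self, exc_type, exc, tb):
        self.depth -= 1
        self._lock.release()
        return False


class RecordingProvider:
    """Hypothetical fake provider that notes whether the lock was held."""

    def __init__(self, lock, uses_substrate):
        self.lock = lock
        self.uses_substrate = uses_substrate
        self.lock_held_during_call = None

    def get_balance(self, address):
        self.lock_held_during_call = self.lock.depth > 0
        return 1


def gated_balance_check(provider, axon_lock, address):
    # Same gate as in the fix: lock only for substrate-backed providers.
    if provider.uses_substrate:
        with axon_lock:
            return provider.get_balance(address)
    return provider.get_balance(address)


lock = TrackingLock()
tao = RecordingProvider(lock, uses_substrate=True)
btc = RecordingProvider(lock, uses_substrate=False)

gated_balance_check(tao, lock, "tao-source-address")
gated_balance_check(btc, lock, "btc-source-address")
```

After both calls, tao.lock_held_during_call should be True and btc.lock_held_during_call should be False, mirroring the two assertions listed above.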

Full suite: 689 passed. ruff check + ruff format --check clean.

handle_swap_reserve called provider.get_balance outside axon_lock with a
comment claiming the source-chain RPC is a separate connection. That holds
for a BTC source (Esplora/Maestro HTTP) but not for a TAO source: the
subtensor provider's get_balance runs on the shared axon_subtensor websocket
that axon_lock exists to serialize. Every TAO->BTC reserve raced the
lock-protected readers, causing recurring 'cannot call recv while another
thread is already running recv' errors.

Mark substrate-backed providers with uses_substrate and gate the balance
check on it: serialize the TAO read under axon_lock, keep BTC's HTTP read
lock-free so a slow Esplora call doesn't stall the forward loop.
@entrius entrius merged commit 147e2c8 into test Jun 8, 2026
3 checks passed
@entrius entrius deleted the fix/axon-recv-tao-balance-lock branch June 8, 2026 21:44