
[Bug] gitt miner post coverage silently and permanently erodes after validator restarts — valid miners are de-scored with no signal and no recovery #1481

@JSONbored

Verified against origin/test @ 1c813f5 (2026-06-15). The two reproductions below run against the real code; the live evidence is publicly verifiable in the validator telemetry at wandb.ai/entrius-gittensor/gittensor-validators.

Summary

The documented miner workflow is "run gitt miner post once; validators store your PAT and score your merged PRs." In practice a miner's PAT coverage monotonically decays: each time a validator loses its on-disk PAT store, that validator scores the miner 0 every round thereafter, and nothing restores it. gitt miner post is a one-shot best-effort broadcast, and miners run no axon, so there is no re-broadcast and no way for a validator to re-request. The miner gets no error, no log, no signal; coverage is only re-established by manually re-running gitt miner post.

Net effect: a valid, registered, actively-contributing miner is silently and indefinitely de-scored by a growing subset of validators. This is live and systemic (receipts below), not theoretical — it has dropped my own miner (UID 29) out of eligibility twice on a multi-week cadence.

Mechanism (verbatim @ origin/test)

A validator loses its store (data/miner_pats.json, gittensor/validator/pat_storage.py:18) on any of three loss vectors: (1) a restart without a persistent ./data (ensure_pats_file() then starts empty, neurons/validator.py:62); (2) a crash/restart loop; or (3) a single failed/corrupt read: _read_file() (pat_storage.py:68) catches (json.JSONDecodeError, OSError) → returns [], and the next save_pat() upserts one entry into that [] and atomically overwrites, erasing all others. (On the standard docker-compose.vali.yml setup, with ./data:/app/data mounted and atomic writes, the store does survive a clean restart, so this is not about per-operator volume config.)

Downstream, reward.py:100 snapshots load_all_pats() per round; any miner without a stored PAT is logged No stored PAT for miner {uid} — miner must run gitt miner post (oss_contributions/inspections.py:134) and scored 0.

Nothing recovers it: gitt miner post issues a single dendrite(..., timeout=30.0) broadcast with no retry/resend (cli/miner_commands/post.py:133-142) then exits; miners serve no axon (README: "No miner neuron required"), so a validator cannot re-request; and no miner-facing signal exists (the guide implies one post suffices).

Proof 1 — gitt miner post silently reports success and never retries a missed validator

Runs the real gitt miner post (Click CliRunner); only the network/identity boundaries are stubbed — the broadcast-result handling, success flag, exit code, and retry behavior are the real code. Validator UID 30 does not respond:

REPRODUCTION: `gitt miner post` — silent partial success + no retry
  3 validators; validator UID 30 did NOT respond (status 408).
  command exit code        : 0
  reported success         : True
  accepted / total         : 2 / 3
  no_response (silent miss): 1
  broadcast attempts made  : 1
PROVEN: exited 0 and reported success=True while UID 30 got nothing, with exactly
ONE broadcast (no retry). A transient failure to any validator is a silent, un-
retried, un-surfaced coverage gap — the miner is never told they are uncovered.

So a brief blip during the one post (or a validator that is mid-restart) permanently drops the miner from that validator until a future manual re-post — and post reports success anyway.
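For contrast, here is a minimal sketch of what a non-silent broadcast could look like: it retries only the validators that have not accepted yet, and maps any remaining miss to a non-zero exit code. This is not the code in cli/miner_commands/post.py; broadcast_with_retry, exit_code, and the send callable are hypothetical names used for illustration, and a real version would add a backoff between attempts.

```python
from typing import Callable, Dict, Iterable, Set


def broadcast_with_retry(
    validator_uids: Iterable[int],
    send: Callable[[int], bool],  # hypothetical: True if the validator accepted the PAT
    max_attempts: int = 3,
) -> Dict[str, Set[int]]:
    """Send to every validator, retrying only those that have not accepted yet."""
    pending = set(validator_uids)
    accepted: Set[int] = set()
    for _ in range(max_attempts):
        if not pending:
            break
        for uid in sorted(pending):
            if send(uid):
                accepted.add(uid)
        pending -= accepted
    return {"accepted": accepted, "missed": pending}


def exit_code(result: Dict[str, Set[int]]) -> int:
    """Partial coverage is surfaced as a failure instead of a silent success."""
    return 0 if not result["missed"] else 1


if __name__ == "__main__":
    attempts: Dict[int, int] = {}

    def flaky_send(uid: int) -> bool:
        # Simulated transport: UID 30 misses the first attempt, then accepts.
        attempts[uid] = attempts.get(uid, 0) + 1
        return uid != 30 or attempts[uid] >= 2

    result = broadcast_with_retry([28, 29, 30], flaky_send)
    print(result, "exit code:", exit_code(result))
```

With the simulated transport above, UID 30 is contacted twice and the other two validators once, so a transient miss no longer turns into a standing coverage gap.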

Proof 2 — a single failed read of miner_pats.json erases every stored PAT

Runs the real pat_storage.save_pat/_read_file; only PATS_FILE is redirected to a scratch path:

[setup]   3 miners stored normally   -> file UIDs = [10, 20, 30]
[control] 4th miner, file readable   -> file UIDs = [10, 20, 30, 40]   (correct)
[trigger] file unreadable (corrupt / I/O error) before the next broadcast
[result]  one miner (UID 99) broadcasts -> file UIDs = [99]
PROVEN: PATs for UIDs [10,20,30,40] silently and permanently erased by one unrelated
broadcast after a single failed read. Those miners are scored 0 until each re-posts.

(Both scripts are self-contained and deterministic; available on request.)

Proof 3 — it is happening live (publicly verifiable)

From wandb.ai/entrius-gittensor/gittensor-validators (anonymously readable; console telemetry; display names are vali-{uid}-{version}):

  • gittensor-181 (run bbd93w0w) restarted at the release — new run created 2026-06-15T23:51Z, ~4 min after the release-20260615-234738 tag — and kept its PATs (PAT check result — UID: 29: valid). Confirms the store persists on a well-configured validator across an upgrade-restart.
  • vali-116 (run u1ropz51, running since 2026-06-13) was missing that same miner's PAT (4/7 coverage on my gitt miner check) despite no restart at this release, i.e. it was lost on an earlier restart and never recovered. In one ~110-min window it logged a second miner cycling through loss → re-post (UID: 32: no PAT stored → PAT broadcast accepted — UID: 32), so this is systemic across miners.
  • rt21-64 (UID 64) is in a restart/crash loop (10+ runs over Jun 8–12) and is unreachable for re-broadcast.

Impact

A transient/operational event (a validator restart) produces a lasting 0-score for a valid miner across every validator that lost its PAT; those validators set weights from the zeros, so it reduces real emission. Coverage only erodes (no mechanism ever restores it), the miner is never told, and it recurs as validators restart (upgrades/crashes).

Relationship to prior reports

Proposed fix

Recovery must be miner-initiated (validators cannot pull — no miner axon). Make the reference tooling keep coverage asserted instead of one-shot:

  • Add a re-assert mode to gitt miner (e.g. gitt miner post --watch / gitt miner ensure) that, on an interval, runs the existing check and re-broadcasts only to validators reporting has_pat=False / pat_valid != True. Cheap (check transmits no PAT), non-spammy (posts only where missing), and makes the existing "miner must re-post" remedy reliable instead of manual-and-silent.
  • Surgical first step if a daemon is out of scope: have gitt miner post retry validators that return no response, and surface partial coverage as a non-zero exit / explicit warning so it is not silent (Proof 1).
  • Document that PAT coverage can be lost on validator restarts and should be re-asserted.
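The target selection for the re-assert mode can be sketched as follows. The has_pat and pat_valid fields come from the check described above; the CheckResult type, validators_needing_post, reassert_once, and the check/post callables are hypothetical stand-ins, since the real CLI's data shapes are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical shape of one validator's answer to `gitt miner check`."""
    validator_uid: int
    has_pat: bool
    pat_valid: Optional[bool]  # None when the validator reported no verdict


def validators_needing_post(results: Iterable[CheckResult]) -> List[int]:
    """Select validators reporting has_pat=False or pat_valid != True."""
    return sorted(
        r.validator_uid
        for r in results
        if not r.has_pat or r.pat_valid is not True
    )


def reassert_once(
    check: Callable[[], Iterable[CheckResult]],
    post: Callable[[List[int]], None],
) -> List[int]:
    """One re-assert cycle: check everyone, post only where coverage is missing."""
    targets = validators_needing_post(check())
    if targets:
        post(targets)
    return targets
```

A watch mode would call reassert_once on an interval. Because the check transmits no PAT and the post goes only to the returned targets, a fully covered miner generates no PAT traffic at all.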

Separate, optional hardening for loss-vector 3 (Proof 2): make save_pat() fail closed when the prior read could not be parsed instead of overwriting (as #829 / #1081 proposed).
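A fail-closed write path might look like the sketch below. It is a standalone model, not the code in pat_storage.py: PatStoreUnreadable, read_entries_strict, save_entry_fail_closed, and the {"uid": ...} entry shape are assumptions made for illustration. The point is only that an unreadable file raises instead of being treated as an empty list.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import List


class PatStoreUnreadable(Exception):
    """The store exists but could not be read or parsed (hypothetical error type)."""


def read_entries_strict(path: Path) -> List[dict]:
    """Return stored entries; a missing file is empty, an unreadable one is an error."""
    if not path.exists():
        return []
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (ValueError, OSError) as exc:  # JSONDecodeError is a ValueError
        raise PatStoreUnreadable(str(path)) from exc
    if not isinstance(data, list):
        raise PatStoreUnreadable(f"{path}: expected a JSON list")
    return data


def save_entry_fail_closed(path: Path, entry: dict) -> None:
    """Upsert one entry by uid; refuse to write if the prior read failed."""
    entries = read_entries_strict(path)  # raises instead of returning []
    entries = [e for e in entries if e.get("uid") != entry["uid"]]
    entries.append(entry)
    # Atomic replace: write a temp file in the same directory, then rename over.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(entries, fh)
    os.replace(tmp_name, path)
```

With this shape, the broadcast that arrives after a corrupt read fails loudly and the file on disk is left untouched, so an operator can repair or restore it rather than discovering later that every other miner's PAT is gone.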

Suggested tests

  • gitt miner post retries a non-responding validator and reports full coverage only when actually achieved; persistent partial coverage → non-zero exit / warning (Proof 1 becomes a regression test).
  • The re-assert path posts to exactly the validators whose check returns missing/invalid and skips the valid ones.
  • save_pat() raises and leaves the existing entries intact when _read_file() cannot parse the file (Proof 2 becomes a regression test).

Footer — verification provenance

  • Live on origin/test @ 1c813f5 (2026-06-15). Code: pat_storage.py:18/23/36/68/78; neurons/validator.py:62; oss_contributions/reward.py:100, inspections.py:134; cli/miner_commands/post.py:133-142; README miner section. Deploy: docker-compose.vali.yml (./data:/app/data).
  • Telemetry (public, verifiable): wandb.ai/entrius-gittensor/gittensor-validators: gittensor-181/bbd93w0w (restarted at release, kept PATs), vali-116/u1ropz51 (lost a PAT, no recovery; UID 32 systemic), rt21-64 (crash loop).
  • Both reproductions run against current origin/test. Fix is Python-side (CLI / validator); no contract changes.
