[Draft] feat(vardiff): decline-safe vardiff champion + simulation framework + proof #2154

Open
gimballock wants to merge 6 commits into stratum-mining:main from marafoundation:vardiff/simulation-framework

gimballock commented May 13, 2026

Adds a deterministic in-process simulation framework that characterizes any
Vardiff implementation, the decline-safe champion controller it selected,
and the proof that argues both. Read the three commits in order — controller,
framework, docs — each is a coherent review unit on its own.

History note. This branch's ~99 single-author development commits were
consolidated into the three reviewable commits below. The full step-by-step
trail — including the 10 commits the docs cite as provenance (the 0.55
ceiling, AsymmetricCusum, counter-age, the regime-split corrections) — is
preserved on tag archive/pre-restructure, where every cited SHA stays
reachable. Review the three; reach for the tag only if you want the dead-ends.

What this is

A vardiff controller's only observable is its own tracking error, and that one
fact bounds what any controller can do. The framework measures it, the proof
derives the bound, and the champion is the existence proof that the safe corner
of the achievable frontier is occupiable. The headline is structural, not a
horse race: across the operating band every reasonable controller is pinned to
the same information floor (~12% best-to-worst spread), so the residual axis is
safety on a decline, not agility, and that is the one axis controllers
genuinely differ on.

This is the research/proof branch. The clean production extraction that ships
the champion is #2188 (a single flat-struct replacement of VardiffState,
no scaffolding); this branch is the framework that selected it and the proof that
argues it.

The three commits, in reading order

  1. feat(vardiff): decline-safe champion controller — the thing that
    ships. The three-stage Composed<Estimator, Boundary, UpdateRule> pipeline;
    the champion is EwmaEstimator(360) / AdaptiveSignPersist(spm6) / AcceleratingPartialRetarget, selected by minimax over share rate under a
    decline-safety constraint (the kills are the selection criterion — the
    alternatives that died on a sustained decline are why this one ships).

  2. feat(vardiff/sim): in-process simulation framework + CI regression guards
    — the apparatus that selected it. Deterministic per-tick Poisson sim, the
    metric, the grid, the decline-safety gate, and three CI-guarded regression
    tests that re-derive the selection criteria on every run. Carries the pinned
    reproduction environment (Cargo.lock + toolchain) and the frozen
    champion/classic baselines the guards assert against.

  3. docs(vardiff): the proof corpus — the proof and how it's argued.
    information-floor.md (the closed theory, each result labelled theory /
    simulation-only / hardware), THEORY.md (the derivation-and-falsification
    notebook), the two plain-language essays (why → how), the investigations
    (incl. the source-verified finding that a deployed pow2-in-loop deadband
    construction is structurally unable to ease on a decline), and the
    claim-warrant validator. The proof corpus also includes a source-verified
    survey of deployed open-source vardiff controllers and their
    decline-safety failure modes (docs/records/VARDIFF_SURVEY.md) — NOMP,
    ckpool, MiningCore — finding the failure axis is a structural property
    (idle-path + trigger-reachability), not algorithm sophistication.

Reproducibility (stated at the scope it is true)

  • Every load-bearing claim in the proof is re-runnable. Each
    simulation-result claim in information-floor.md maps, via the README
    reproduction index, to a deterministic binary that regenerates the number
    (not a frozen value that could drift from the code). The selection criteria
    are CI-guarded.
  • The "no lost work" retraction is SHA-pinned to source. It rests on the
    per-job target snapshot in src/server/extended.rs (verified; the protocol
    validates each share against the target snapshotted at job creation, which the
    vardiff path never mutates) — a protocol fact, reproduced by reading the
    pinned source, not by a sim.
  • The environment is captured (commit + Cargo.lock hash + pinned
    toolchain), so a recorded seed reproduces deterministically.
  • Not claimed: full graph-registration. Every published claim is
    registered and re-runnable; tagging the remaining exploration binaries
    (which back no proof claim beyond what the index already covers) is a typed
    standing debt the validator surfaces, not part of this PR's reproducibility
    guarantee.

Known-open items (disclosed, not surprises)

The claim validator surfaces two standing REVISIT flags by design:

  • Deployed-construction source pin — the critique's reference to the deployed
    pow2-in-loop deadband construction is currently a date + line numbers; should
    strengthen to a commit SHA (line numbers drift).
  • Naming in published voice — the deployed-construction critique points at
    public source and names no operator; naming is a deliberate, separate
    decision, not made here.

Scope honesty (from the proof): the decline response is hardware-confirmed in
direction on a sustained 50% drop (eased the safe way, no rejection spike); the
settled over-difficulty figure and the slow-moderate-decline gate remain
simulation results; the cost-model weights are a calibrated judgment.

@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 11b2560 to 88d8d1d Compare May 13, 2026 22:06

gimballock commented May 13, 2026

The code is cheap and only meant to demonstrate the feasibility, but the concept ack revolves around these points imo:

  • We can play dice with share-received events to simulate running the vardiff algorithms over arbitrary ranges of time. But we need to mock SystemTime::now() and add a way to bulk add new shares.
  • With fake time simulations we can do large scale vardiff trials of whatever metrics we want and contrast against correlated attributes like target shares-per-minute.
    • I was interested in convergence time, stable-state jitter, and convergence accuracy
    • But responsiveness to external change is also a key capability (how fast vardiff adjusts to a 50% spike/dip in hashrate)
  • With these reproducible test results compiled into a profile, we can use integration tests to lock in established performance thresholds and ratchet up the expectations if we find better algorithms.
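
The fake-time idea in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Clock` trait, `SimClock`, `ShareCounter::add_shares`, and `simulate` are hypothetical names, and the tiny LCG plus Knuth's Poisson sampler are stand-ins chosen only to keep the sketch dependency-free and deterministic for a given seed.

```rust
/// Hypothetical clock abstraction injected in place of `SystemTime::now()`.
pub trait Clock {
    fn now_secs(&self) -> u64;
}

/// Simulated clock: time only moves when the harness advances it.
pub struct SimClock {
    secs: u64,
}

impl SimClock {
    pub fn new() -> Self {
        SimClock { secs: 0 }
    }
    pub fn advance(&mut self, secs: u64) {
        self.secs += secs;
    }
}

impl Clock for SimClock {
    fn now_secs(&self) -> u64 {
        self.secs
    }
}

/// Tiny deterministic LCG (illustrative only, not a production RNG).
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }
    /// Uniform f64 in [0, 1) built from the top 53 bits of the state.
    pub fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Draw a Poisson(lambda) sample with Knuth's multiplication method.
pub fn poisson(rng: &mut Lcg, lambda: f64) -> u32 {
    let limit = (-lambda).exp();
    let mut k = 0u32;
    let mut p = 1.0;
    loop {
        p *= rng.next_f64();
        if p <= limit {
            return k;
        }
        k += 1;
    }
}

/// Hypothetical share counter with a bulk-add entry point.
pub struct ShareCounter {
    pub shares: u64,
}

impl ShareCounter {
    pub fn add_shares(&mut self, n: u32) {
        self.shares += n as u64;
    }
}

/// Simulate `minutes` one-minute ticks at `spm` expected shares per minute.
/// Returns (simulated seconds elapsed, total shares added).
pub fn simulate(seed: u64, spm: f64, minutes: u64) -> (u64, u64) {
    let mut clock = SimClock::new();
    let mut rng = Lcg::new(seed);
    let mut counter = ShareCounter { shares: 0 };
    for _ in 0..minutes {
        clock.advance(60);
        counter.add_shares(poisson(&mut rng, spm));
    }
    (clock.now_secs(), counter.shares)
}
```

Because nothing reads the wall clock, a call such as `simulate(42, 6.0, 1000)` covers 1000 simulated minutes almost instantly and returns the same pair every time for the same seed.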

Comment on lines +7 to +13
| share/min | rate | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 83.3% | 10m | 12m | 21m | 25m |
| 12 | 95.4% | 10m | 10m | 20m | 25m |
| 30 | 99.5% | 10m | 10m | 15m | 25m |
| 60 | 100.0% | 10m | 10m | 10m | 20m |
| 120 | 100.0% | 10m | 10m | 10m | 15m |

@gimballock gimballock May 13, 2026

The first row here shows the results of the convergence-time test for the default case (6 spm).
The convergence times are between 10 and 25 minutes, with total failures to converge (within quiet_window_secs of simulated time) occurring 17% of the time!

The next few rows describe the results for faster share rates. The most extreme times shrink (the p99 drops from 25m to 15m) and the total-failure cases generally disappear around 30 spm.

Comment on lines +15 to +37
## Settled accuracy (stable load, post-convergence)

`|final_hashrate / true_hashrate - 1|` at trial end. Smaller is better.

| share/min | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- |
| 6 | 0.0% | 4.9% | 23.6% | 70.3% |
| 12 | 0.0% | 0.0% | 12.3% | 26.9% |
| 30 | 0.0% | 0.0% | 0.8% | 15.6% |
| 60 | 0.0% | 0.0% | 0.0% | 3.1% |
| 120 | 0.0% | 0.0% | 0.0% | 0.0% |

## Steady-state jitter (fires per minute)

Post-convergence rate of vardiff fires. Smaller is better — ideal is zero under stable load.

| share/min | p50 | p90 | p99 | mean |
| --- | --- | --- | --- | --- |
| 6 | 0.000 | 0.200 | 0.385 | 0.059 |
| 12 | 0.000 | 0.077 | 0.217 | 0.019 |
| 30 | 0.000 | 0.000 | 0.067 | 0.002 |
| 60 | 0.000 | 0.000 | 0.000 | 0.000 |
| 120 | 0.000 | 0.000 | 0.000 | 0.000 |

These two metrics (proximity to true hashrate and post-convergence adjustments) show a similar trend: lots of undesired behavior in the extreme cases (top 10%, top 1%) of the 6 shares/min case, which is alleviated at higher share rates.
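
For readers who want to recompute the percentile columns, here is a minimal sketch of the settled-accuracy metric `|final_hashrate / true_hashrate - 1|` and a percentile over per-trial errors. The function names are hypothetical, and the nearest-rank percentile definition is an assumption; the framework may interpolate differently.

```rust
/// Settled-accuracy error for one trial: |final / true - 1|.
pub fn settled_error(final_hashrate: f64, true_hashrate: f64) -> f64 {
    (final_hashrate / true_hashrate - 1.0).abs()
}

/// Nearest-rank percentile (p in 0..=100) of a sample.
/// Sorts a copy of the input; returns None for an empty sample.
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank: the smallest value with at least p% of samples at or below it.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```

With a heavy-tailed error distribution like the 6 spm row, the p50 can sit near zero while the p99 is dominated by a handful of bad trials, which is why the tables report several percentiles rather than a mean.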

Comment on lines +39 to +60
## Reaction time to a 50% drop (step at 15 min)

| share/min | reacted | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 69.7% | 1m | 3m | 5m | 5m |
| 12 | 54.8% | 1m | 3m | 5m | 5m |
| 30 | 32.6% | 2m | 4m | 5m | 5m |
| 60 | 16.3% | 3m | 5m | 5m | 5m |
| 120 | 8.6% | 4m | 5m | 5m | 5m |

## Reaction sensitivity (P[fire within 5 min of step change])

| Δ% | 6 | 12 | 30 | 60 | 120 |
| --- | --- | --- | --- | --- | --- |
| -50% | 0.70 | 0.55 | 0.33 | 0.16 | 0.09 |
| -25% | 0.44 | 0.23 | 0.08 | 0.00 | 0.00 |
| -10% | 0.39 | 0.15 | 0.02 | 0.00 | 0.00 |
| -5% | 0.40 | 0.15 | 0.02 | 0.00 | 0.00 |
| +5% | 0.39 | 0.13 | 0.02 | 0.00 | 0.00 |
| +10% | 0.42 | 0.17 | 0.03 | 0.00 | 0.00 |
| +25% | 0.48 | 0.23 | 0.07 | 0.01 | 0.00 |
| +50% | 0.64 | 0.47 | 0.32 | 0.22 | 0.29 |

These tables show how long it takes for vardiff to respond to an unexpected change in hashrate, where the change is a proportional increase or decrease of anywhere from 5% to 50%.

The first table looks specifically at a 50% drawdown, showing that at 6 spm a full 30% of the time vardiff fails to adjust within 5 min. The next few rows show that the situation worsens at higher share rates: at 120 spm, 91% of the trials failed to adjust within 5m.

The second table shows that this effect is basically the same for hashrate changes in the opposite direction, and also that changes of lesser magnitude trigger a fire within 5 minutes less often.

Comment thread sv2/channels-sv2/sim/src/regression.rs Outdated
Comment on lines +16 to +24
//! - **Convergence rate**: `current >= baseline - 0.01`
//! - **Convergence p90**: `current <= baseline * 1.10`
//! - **Settled accuracy p50 / p90**: `current <= baseline * 1.15`
//! - **Jitter p50**: `current <= baseline + 0.02` (absolute; baseline can be near zero)
//! - **Jitter p95**: `current <= baseline * 1.25`
//! - **Reaction rate**: `current >= baseline - 0.02`
//! - **Reaction p50**: `current <= baseline * 1.20`
//! - **Sensitivity at large |Δ| (|Δ| >= 50%)**: `current >= baseline - 0.02`
//! - **Sensitivity at small |Δ| (|Δ| <= 5%)**: `current <= baseline + 0.05`

Convergence rate: the fraction of trials that converge may fall at most 0.01 (one percentage point) below the baseline
Convergence p90: the p90 convergence time (the slowest 10% of trials) must be within 10% of the baseline's
Settled accuracy: the p50 and p90 errors must be within 15% of the baseline's
Jitter p50/p95: the p50 may exceed the baseline by at most 0.02 (absolute), the p95 by at most 25%
...etc.

You see the pattern: there are lots of magic thresholds in this portion of the code that are arbitrarily chosen at this point and fair game for analysis.
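
The three threshold shapes in the doc comment above (absolute floor, absolute ceiling, multiplicative ceiling) can be sketched as a small helper. The `Tolerance` enum and `failures` function are hypothetical names for illustration and are not taken from `regression.rs`.

```rust
/// How a metric is allowed to move relative to its baseline.
#[derive(Clone, Copy, Debug)]
pub enum Tolerance {
    /// Higher is better: passes when `current >= baseline - slack`.
    AtLeastMinus(f64),
    /// Lower is better, absolute: passes when `current <= baseline + slack`.
    AtMostPlus(f64),
    /// Lower is better, multiplicative: passes when `current <= baseline * factor`.
    AtMostTimes(f64),
}

impl Tolerance {
    pub fn passes(self, baseline: f64, current: f64) -> bool {
        match self {
            Tolerance::AtLeastMinus(slack) => current >= baseline - slack,
            Tolerance::AtMostPlus(slack) => current <= baseline + slack,
            Tolerance::AtMostTimes(factor) => current <= baseline * factor,
        }
    }
}

/// Return the names of all checks that regressed.
/// Each entry is (name, tolerance, baseline value, current value).
pub fn failures(checks: &[(&'static str, Tolerance, f64, f64)]) -> Vec<&'static str> {
    let mut failed = Vec::new();
    for &(name, tol, baseline, current) in checks {
        if !tol.passes(baseline, current) {
            failed.push(name);
        }
    }
    failed
}
```

The absolute form matters for metrics such as jitter p50: with a baseline of exactly zero, a multiplicative ceiling would reject any nonzero value, which is presumably why the doc comment notes that the baseline "can be near zero".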

@gimballock gimballock force-pushed the vardiff/simulation-framework branch 2 times, most recently from 5cbed7c to 85d6f8b Compare May 17, 2026 14:48
@gimballock gimballock changed the title from "feat(vardiff): add in-process simulation framework + baseline regression tests" to "[Draft] feat(vardiff): add in-process simulation framework + baseline regression tests" May 17, 2026
@adammwest

after some optimization I got
https://github.com/adammwest/stratum/blob/feat/vardiff_kalman/sv2/channels-sv2/sim/vardiff_best.md
with the Bayesian model

| Method | Result |
| --- | --- |
| Bayesian model | Good, Best result |
| Kalman filters | Good |
| Jurik moving averages | Good but artifacts on hashrate changes |
| Thompson sampling | Bad |

gimballock commented May 18, 2026

> after some optimization I got https://github.com/adammwest/stratum/blob/feat/vardiff_kalman/sv2/channels-sv2/sim/vardiff_best.md with the Bayesian model
>
> | Method | Result |
> | --- | --- |
> | Bayesian model | Good, Best result |
> | Kalman filters | Good |
> | Jurik moving averages | Good but artifacts on hashrate changes |
> | Thompson sampling | Bad |

I'm so excited to see other people nerding out on vardiff with me! Thank you!

A couple of things I noticed in your results: the 2m convergence time is impressive, but your response to a 50% hashrate drop only succeeds in readjusting 4.4% of the time. I'm not sure how best to balance those two metrics, but probably not one at the expense of the other.

@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 2d10f57 to 414afbb Compare May 19, 2026 14:25
@gimballock gimballock force-pushed the vardiff/simulation-framework branch 4 times, most recently from 211bc98 to 2a88fde Compare May 20, 2026 15:09
@adammwest

Some learnings I had, @gimballock:

  • This task is hard; I think the current implementation is already optimized to a degree.
  • The most critical thing is the fitness. Currently there are many metrics, and how you combine all of them into a final value is what determines the goodness of any algorithm. As it's a summary, there are ways to game it.
  • Use every value in the toml file; if you don't, those values will naturally degrade.
  • There are many ways to combine many numbers, which lead to slightly better and slightly worse performance.
    One thing I did which helped was to separate fitness into 2 categories, improvement and regression. Even better is separating these per group (e.g. stable, coldstart, ...); then you can decompose the value.
  • Normalize each metric/group, otherwise the more numerous or larger numbers will be the focus.
  • There are many cases where you can get a good score, but the fitness prefers optimizing 1 variable or a set of variables at the expense of others.
  • Grid parameter sweeps usually discretize the domain, so you are bounded in improvement only by dimension range and number of queries. You need to constantly increase queries or shrink ranges, and usually you are limited by time. For this reason I prefer random-restart hill climbing, which I find is generally pretty good when you don't make assumptions about the data.
  • If you have too many parameters to optimize you can overfit, and end up just gaming the test.
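
One way to read the decomposition and normalization advice above is sketched below. All names (`Metric`, `GroupScore`, `decompose`) are hypothetical, every metric is assumed to be lower-is-better, and normalizing by the baseline magnitude and then averaging per group is only one of several reasonable choices.

```rust
use std::collections::BTreeMap;

/// One metric measured for the baseline and the candidate (lower is better).
pub struct Metric {
    pub group: &'static str,
    pub baseline: f64,
    pub candidate: f64,
}

/// Per-group averages of normalized improvement and regression.
#[derive(Debug, Default, PartialEq, Clone, Copy)]
pub struct GroupScore {
    pub improvement: f64,
    pub regression: f64,
}

/// Split relative changes into improvement vs regression, averaged per group
/// so that groups with many metrics do not dominate the summary.
pub fn decompose(metrics: &[Metric]) -> BTreeMap<&'static str, GroupScore> {
    let mut sums: BTreeMap<&'static str, (GroupScore, usize)> = BTreeMap::new();
    for m in metrics {
        // Normalize by baseline magnitude; the epsilon only guards division by zero.
        let scale = m.baseline.abs().max(1e-9);
        let rel = (m.candidate - m.baseline) / scale;
        let entry = sums.entry(m.group).or_default();
        if rel < 0.0 {
            entry.0.improvement += -rel;
        } else {
            entry.0.regression += rel;
        }
        entry.1 += 1;
    }
    sums.into_iter()
        .map(|(group, (score, n))| {
            let n = n as f64;
            let avg = GroupScore {
                improvement: score.improvement / n,
                regression: score.regression / n,
            };
            (group, avg)
        })
        .collect()
}
```

Keeping the two totals separate means a large coldstart improvement cannot cancel a stable-state regression in the summary. Metrics with a zero baseline would need an absolute scale instead of the relative one used here.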

@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 63a19d0 to a18c3a3 Compare May 21, 2026 14:28
gimballock commented May 21, 2026

Thanks for these insights @adammwest — especially the point about fitness decomposition and normalization. A lot of what you're describing matches the evolution I've gone through on this PR, so let me give a timeline of
how the approach has matured:

Phase 1: Basic metrics + simulation harness

Initially I focused on three metrics I thought were important: convergence time, jitter, and accuracy. These were evaluated via a time-compressed simulation that replays a synthetic share stream through the vardiff
algorithm. This gave us reproducible, large-scale trials (50 cells × 1000 trials) against correlated attributes like target shares-per-minute.

Phase 2: Decomposed pipeline model

I wanted to make algorithm search more systematic, so I decomposed "a vardiff algorithm" into four independent, replaceable components: estimator, statistic, boundary, and decision rule. The idea was to mix-and-match
implementations at each slot for the best composite.

This model worked well for the classic algorithm, the parametric variant, and the EWMA approach. But when I tried to embed a Bayesian model, it broke down — the components aren't truly independent. There's a sequential
data flow: the estimator needs to communicate its belief to the boundary ("should we respond?") and to the update rule. Additionally, since vardiff triggers on a timer rather than on share arrival, the decision rule needs
to call back to the estimator to update state when adjustments occur. I also dropped the "statistic" component as it wasn't pulling its weight.

The resulting three-stage pipeline (Estimator → Boundary → UpdateRule) is what's in this PR. It successfully hosts the classic algorithm, EWMA, AdaCUSUM, and could host a Bayesian approach.
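
To make the data flow concrete, here is a rough sketch of how a three-stage composition might be wired. Only the names `Composed`, `Estimator`, `Boundary`, and `UpdateRule` come from the PR description; every trait method, the `Belief` struct, and the toy `Ewma`, `Deadband`, and `FullRetarget` stages are assumptions made for illustration and will differ from the actual code.

```rust
/// Belief the estimator hands downstream (hypothetical shape).
#[derive(Clone, Copy, Debug)]
pub struct Belief {
    /// Estimated ratio of realized to target share rate (1.0 = on target).
    pub rate_ratio: f64,
}

pub trait Estimator {
    fn observe(&mut self, shares: u32, elapsed_secs: f64, target_spm: f64);
    fn belief(&self) -> Belief;
    /// Callback invoked when a retarget fires, so state can be rebased.
    fn on_fire(&mut self);
}

pub trait Boundary {
    fn should_fire(&mut self, belief: Belief) -> bool;
}

pub trait UpdateRule {
    /// Multiplier to apply to the current hashrate estimate.
    fn multiplier(&self, belief: Belief) -> f64;
}

pub struct Composed<E, B, U> {
    pub estimator: E,
    pub boundary: B,
    pub update: U,
}

impl<E: Estimator, B: Boundary, U: UpdateRule> Composed<E, B, U> {
    /// One timer tick: feed shares, then maybe return a retarget multiplier.
    pub fn tick(&mut self, shares: u32, elapsed_secs: f64, target_spm: f64) -> Option<f64> {
        self.estimator.observe(shares, elapsed_secs, target_spm);
        let belief = self.estimator.belief();
        if !self.boundary.should_fire(belief) {
            return None;
        }
        let m = self.update.multiplier(belief);
        self.estimator.on_fire();
        Some(m)
    }
}

/// Toy EWMA estimator over per-tick rate ratios.
pub struct Ewma {
    alpha: f64,
    ratio: f64,
}

impl Ewma {
    pub fn new(alpha: f64) -> Self {
        Ewma { alpha, ratio: 1.0 }
    }
}

impl Estimator for Ewma {
    fn observe(&mut self, shares: u32, elapsed_secs: f64, target_spm: f64) {
        let realized_spm = shares as f64 * 60.0 / elapsed_secs;
        let sample = realized_spm / target_spm;
        self.ratio += self.alpha * (sample - self.ratio);
    }
    fn belief(&self) -> Belief {
        Belief { rate_ratio: self.ratio }
    }
    fn on_fire(&mut self) {
        // After a retarget the new target is assumed to match the belief.
        self.ratio = 1.0;
    }
}

/// Toy symmetric deadband: fire when the ratio leaves [1 - w, 1 + w].
pub struct Deadband(pub f64);

impl Boundary for Deadband {
    fn should_fire(&mut self, belief: Belief) -> bool {
        (belief.rate_ratio - 1.0).abs() > self.0
    }
}

/// Toy update rule: jump straight to the estimated ratio.
pub struct FullRetarget;

impl UpdateRule for FullRetarget {
    fn multiplier(&self, belief: Belief) -> f64 {
        belief.rate_ratio
    }
}
```

The `on_fire` callback is the piece that reflects the sequential dependency described above: the estimator has to learn that a retarget happened, so the stages cannot be treated as fully independent slots.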

Phase 3: Aggregate fitness metric

To your point about "how you combine all metrics into a final value" — we now have a configurable aggregate metric that allows weighting across the underlying measurements. This addresses exactly the gaming concern you
raised: rather than optimizing one metric at the expense of others, we can define a weighted composite that represents our desired tradeoff. The regression baseline locks in the full vector of metrics so we catch
regressions in any dimension, not just the aggregate.

Your suggestion to separate fitness into improvement vs. regression categories per scenario group (stable, coldstart, reaction) is a good one. Currently the regression test does compare per-cell, so a coldstart regression
can't hide behind a stable-state improvement, but making this more explicit in the scoring would help.

Phase 4: Realistic operating conditions

After discussions with hardware engineers, I retuned the test scenarios to realistic share rates (2–30 spm instead of the earlier 6–120 range). The engineers confirmed that responding to partial hardware failures and
network slowdowns on an established channel is valuable functionality — even though in practice many hashrate changes currently cause miner reconnections (which resets vardiff anyway). We've been doing live testing with
physical miners on testnet4 and confirmed this pattern: when vardiff ramps difficulty too aggressively, it can interact with firmware timeout behaviors in ways that force reconnections, making reactivity testing harder
than expected.

Current direction

I've backed off from prioritizing convergence speed after seeing overcorrection in practice. The current focus is on:

  • Stability under steady-state (minimal oscillation once converged)
  • Reasonable reactivity (detect genuine changes within 2–3 retarget windows, not 1)
  • Asymmetric cost awareness — difficulty increases are more disruptive than decreases. An overshoot upward causes difficulty-too-low share rejections (wasted miner work), while an undershoot downward just means slightly
    more shares than optimal (cheap). The AsymmetricCusumBoundary encodes this: it requires stronger evidence before raising difficulty than lowering it. We can now actually measure the impact via the share-rejection metrics
    (shares_rejected_total{reason="difficulty-too-low"}) that were recently added to the pool's monitoring (sv2-apps PR Docs: Channel Factory #491).
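
The asymmetric-evidence idea can be illustrated with two one-sided CUSUM accumulators that use different decision thresholds. This is a generic textbook-style sketch, not the PR's `AsymmetricCusumBoundary`; the struct name, the field names, and the slack and threshold values used with it are all illustrative assumptions.

```rust
/// Direction of a proposed retarget.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Fire {
    RaiseDifficulty,
    LowerDifficulty,
}

/// Two one-sided CUSUM accumulators with different decision thresholds.
pub struct AsymmetricCusum {
    up: f64,    // accumulated evidence that the share rate is above target
    down: f64,  // accumulated evidence that the share rate is below target
    slack: f64, // per-sample allowance subtracted from each accumulator
    raise_threshold: f64,
    lower_threshold: f64,
}

impl AsymmetricCusum {
    pub fn new(slack: f64, raise_threshold: f64, lower_threshold: f64) -> Self {
        AsymmetricCusum { up: 0.0, down: 0.0, slack, raise_threshold, lower_threshold }
    }

    /// Feed one normalized deviation: (realized_spm - target_spm) / target_spm.
    /// Returns a fire decision and resets both accumulators when one crosses.
    pub fn update(&mut self, deviation: f64) -> Option<Fire> {
        self.up = (self.up + deviation - self.slack).max(0.0);
        self.down = (self.down - deviation - self.slack).max(0.0);
        let fire = if self.down >= self.lower_threshold {
            Some(Fire::LowerDifficulty)
        } else if self.up >= self.raise_threshold {
            Some(Fire::RaiseDifficulty)
        } else {
            None
        };
        if fire.is_some() {
            self.up = 0.0;
            self.down = 0.0;
        }
        fire
    }
}
```

With `AsymmetricCusum::new(0.1, 3.0, 1.0)`, a sustained deviation of -0.5 crosses the lowering threshold on the third sample, while a sustained +0.5 needs eight samples to cross the raising threshold, so easing difficulty requires less accumulated evidence than tightening it.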

On your point about normalization: agreed, and the per-metric tolerance budgets in the regression test (absolute slack + optional multiplicative slack) are our current mechanism for this. Open to suggestions on better
normalization approaches.

gimballock commented May 30, 2026

I've been working on a simple proxy to validate the vardiff responsiveness findings from the simulations. My first test aimed to confirm the assessment that the existing vardiff algorithm is slow to respond to hashrate changes after it has already converged.

Test Setup:

I configured an S21 miner → tproxy (vardiff disabled) → shape-proxy (controllable share rate) → SRI pool. The shape-proxy maintains a smoothed share rate based on current pool difficulty and can selectively drop shares to simulate hashrate changes on command.

Methodology:

I ran two parallel instances of the unmodified SRI pool server (main branch) and triggered a 50% share rate drop via API command. I then measured how long it took for the pool's hashrate estimate and share acceptance rate to adjust to the new rate.

Results:

The pool required approximately 40 minutes to complete the initial hashrate adjustment, while the simulation predicted that 70% of miners would fail to adjust at all and that those that did would take 5m. So while it may look like a 20× discrepancy between simulation and observed behavior, it's more of a confirmation that it takes a very long time.

I will investigate if there is any discrepancy and re-calibrate the simulation if so. The attached image shows the resulting hashrate and accepted share rate for both trials.

[Image: Screenshot 2026-05-30 at 1 33 10 PM]

The blue lines are the pool hashrate. The initial spike after startup is the convergence spike; the hashrate then settles and stays flat for not quite an hour before I activate the 50% hashrate drop, which is immediately visible in the share rate drop. This stays flat until vardiff eventually fires the adjustment, allowing the share rate to return to normal.

@gimballock

[Image: Screenshot 2026-05-30 at 2 15 21 PM]

Here you can see the previous 12h of consistent hashrate with zero rejected shares, as evidence that the 'smoothing' of the hashrate via my proxy is stable enough to test against.

@paratoxicdev

Hi, this is quite a detailed analysis and great decomposition of relevant parts.

Have you had a look at how ckpool implements this? I think he has quite naturally arrived at a very optimal state, balancing the different metrics. This is the repo, you'll have to grep through to find the vardiff implementation: https://github.com/ckolivas/ckpool

I've re-implemented his approach in Rust as well and made it a bit more configurable here: https://github.com/parasitepool/para/blob/master/src/vardiff.rs

Would be interesting to see how that algorithm performs in your benchmarks.

@gimballock

> Hi, this is quite a detailed analysis and great decomposition of relevant parts.
>
> Have you had a look at how ckpool implements this? I think he has quite naturally arrived at a very optimal state, balancing the different metrics. This is the repo, you'll have to grep through to find the vardiff implementation: https://github.com/ckolivas/ckpool
>
> I've re-implemented his approach in Rust as well and made it a bit more configurable here: https://github.com/parasitepool/para/blob/master/src/vardiff.rs
>
> Would be interesting to see how that algorithm performs in your benchmarks.

Thanks for the info, @paratoxicdev. I will add this to my investigation. I know that is an SV1-native pool, but I will see which bits of his research cross over to the SV2 context and whether it's competitive!

@gimballock

Here is a breakdown of the calibration comparisons I made between the real hashrate tests and the simulation results. It confirms the predictions, with the understanding that, with this algorithm, reaction time scales with the age of the connection. I suggest including this detail in the simulation so the metrics are more directly comparable:

Vardiff Calibration Summary: Simulation vs Real Miner Testing

  Algorithm under test: Classic Parametric vardiff (VardiffState from stratum-core, branch vardiff/parametric-thresholds)

  Hardware: Antminer S21 (~200 TH/s), testnet4
  Pool config: shares_per_minute = 6 and shares_per_minute = 20
  Tool: shape-proxy with Step{1.0→0.5, at_secs:300} and Track{1.0} reset

  ---
  Structural findings confirmed
  
  | Property | Sim prediction | Real observation | Match |
  | --- | --- | --- | --- |
  | Steady-state jitter | 0.000 fires/min | Zero fires during 30+ min stable operation (both rates) | Yes |
  | Per-fire step magnitude | Deterministic: realized_spm / target_spm ratio | -16.7% consistently (5 consecutive fires) | Yes |
  | Fire cadence after counter reset | Exactly 300s (15% threshold at delta_time≥300) | 22:39→22:44→22:49→22:54→22:59 (5-min cadence) | Yes |
  | Overshoot after staircase descent | p99 = 69% at 6 spm | ~60% overshoot → share flood → oscillation | Yes |
  | Sensitivity depends on counter age | Implicit in algorithm (accumulating counter never resets except on fire) | 5-min counter: 4.4 min reaction; 51-min counter: 51.8 min reaction | Yes |
  | Algorithm symmetric for ±50% | Sim shows similar reaction rates both directions | Confirmed: similar timescales for step-down and step-up | Yes |

  ---
  Reaction time: -50% step (hashrate halved)
  
  | Condition | Sim prediction | Real result |
  | --- | --- | --- |
  | 6 spm, fresh counter (~5 min) | 3.7% fire within 5 min; those that fire: p50=5m | Slot 3: 4.4 min, Slot 4: 8.0 min |
  | 6 spm, settled counter (~51 min) | ~96% don't fire within 5 min (tail extends indefinitely) | Slot 3: 51.8 min |
  | 20 spm, moderate counter (~27 min) | 14.2% fire within 5 min | Slot 3: 6.9 min, Slot 4: 8.4 min |

  ---
  Reaction time: +50% step (return to full rate)
  
  | Condition | Sim prediction | Real result |
  | --- | --- | --- |
  | 6 spm, 51-min counter | Deep in tail (>96% non-reactive) | Slot 3: 51.8 min |
  | 6 spm, 68-min counter | Deep in tail | Slot 4: 15.4 min |
  | 20 spm, 20-21 min counter | Higher spm improves detection | Slot 3: 5.9 min, Slot 4: 6.3 min |

  ---
  Key insight: counter age is the dominant variable
  
  The sim's test design (step at t=15min, 5 min after cold-start convergence) always tests with a fresh counter. This produces the "3.7% react within 5 min"
  figure. In reality, a pool that hasn't fired in hours has a massive accumulated counter that dilutes any step signal — explaining 40-minute to 2.5-hour
  response times observed in earlier tests.

  At 20 spm with a 20-21 min counter versus 6 spm with a 51-min counter, reaction time improves ~8x (51.8 min → 5.9 min). The counter ages differ, so the
  comparison is not like-for-like, but more shares per minute does mean the new rate accumulates statistical weight faster against the counter history.

  ---
  Simulation validity assessment
  
  The simulation is trustworthy for relative algorithm comparisons. It correctly models:
  - The threshold table mechanics (fire/no-fire decisions)
  - The deterministic staircase behavior post-fire
  - The zero-jitter steady state
  - The share-rate dependence on detection speed
  - The fundamental sensitivity-decay-over-time flaw

  Gap to address: The sim should add a "counter age" axis (test steps at t=30m, t=60m, t=120m) to characterize the tail distribution, which is where
  real-world pools spend most of their time. The current 5-minute observation window and 15-minute step timing understate the algorithm's poor real-world
  responsiveness.
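
The counter-age dilution described in the summary can be illustrated with a noise-free back-of-the-envelope model. Assume a single cumulative counter that has been on target for `age_min` minutes, after which the share rate steps to `step_factor` times its old value. After `t` more minutes the cumulative ratio is `(age + step_factor * t) / (age + t)`, so the deviation reaches `threshold` when `t = threshold * age / (|step_factor - 1| - threshold)`. The function name is hypothetical, and the model ignores Poisson noise and the timer cadence, so it only shows the scaling, not the measured times.

```rust
/// Minutes after a rate step until a cumulative counter's average rate
/// deviates from target by `threshold`, in a noise-free model.
/// Returns None when the step is too small to ever cross the threshold.
pub fn minutes_to_fire(age_min: f64, step_factor: f64, threshold: f64) -> Option<f64> {
    let step = (step_factor - 1.0).abs();
    if step <= threshold {
        return None;
    }
    Some(threshold * age_min / (step - threshold))
}
```

Under these assumptions the reaction time grows linearly with counter age: doubling `age_min` doubles the result, which matches the direction of the "counter age is the dominant variable" finding even though the idealized model omits everything stochastic.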

@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 006363a to a58132b Compare June 3, 2026 04:51

gimballock commented Jun 3, 2026

Ok here is evidence that the top algorithm (deployed side-by-side with the current vardiff) is immune to the age-dependence effect. This reproduces results predicted in the simulation but now seen in real life.

I started a mining channel against both pools and let them mine overnight.
In the morning I cut the hashrate in half for both pools at roughly the same time and waited until both pools' vardiffs finished adjusting to the new hashrate. The annotated image below shows the results from the Grafana dashboard.

[Image: Screenshot 2026-06-03 at 5 09 20 PM]

The current vardiff algorithm not only took several hours to respond, but the response it eventually made was in the wrong direction! The new algorithm, by contrast, responded in a few minutes and settled on the correct value.

gimballock pushed a commit to marafoundation/stratum that referenced this pull request Jun 10, 2026
Clean extraction of the best-performing vardiff algorithm from the
simulation framework in stratum-mining#2154, with all test scaffolding, traits, and
alternative algorithm implementations removed.

The previous VardiffState used a fixed time-dependent threshold ladder
and full retarget. This produced:

- 6.6% median settled error (p99: 30% at low SPM)
- 5–9 minute cold-start convergence (p90)
- 33% detection rate for 10% hashrate declines (thermal throttle, failing ASICs)
- 28% target overshoot during cold-start ramp (p99 at SPM 6)

The new algorithm (EWMA + adaptive boundary + accelerating partial retarget):

- Settled accuracy: <3% median error across all SPM
- Cold-start overshoot bounded to <10% (was 28%)
- Jitter: 0.03 fires/min at low SPM (was 0.06) — half the unnecessary retargets
- Small-change detection: 85% reaction to -10% steps at SPM 6 (was 33%)
- Transient disconnects recover in 1–2 fires rather than requiring a full
  cold-start ramp (20%/fire partial retarget vs old algo's 50–67% slash)
- Asymmetric cost: loosening fires 3x faster than tightening, because
  loosening is free but tightening rejects in-flight shares

Breaking: adds private fields to VardiffState (previously all-pub).
Requires channels_sv2 major version bump. Public constructor API
(new, new_with_min) and Vardiff trait interface are unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
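
The estimator stage named in the commit above (EWMA) can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes irregular tick intervals and the standard time-constant weighting `alpha = 1 - exp(-dt / tau)`, and the struct and method names are hypothetical.

```rust
/// Exponentially weighted moving average of the share rate (shares/min).
/// Hypothetical sketch; the PR's estimator may differ in form and state.
struct EwmaRate {
    tau_secs: f64,
    estimate_spm: Option<f64>,
}

impl EwmaRate {
    fn new(tau_secs: f64) -> Self {
        Self { tau_secs, estimate_spm: None }
    }

    /// Fold in `shares` observed over the last `dt_secs` seconds.
    /// The caller must pass `dt_secs > 0`.
    fn update(&mut self, shares: u32, dt_secs: f64) -> f64 {
        let observed_spm = shares as f64 * 60.0 / dt_secs;
        // Longer gaps between ticks give the new observation more weight.
        let alpha = 1.0 - (-dt_secs / self.tau_secs).exp();
        let next = match self.estimate_spm {
            None => observed_spm, // first observation seeds the estimate
            Some(prev) => prev + alpha * (observed_spm - prev),
        };
        self.estimate_spm = Some(next);
        next
    }
}
```

Unlike a cumulative counter, the weight of old observations here decays with elapsed time, so the estimate's responsiveness does not depend on how long the channel has been running.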
@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 43d973d to 55e5dd0 Compare June 20, 2026 18:56
gimballock pushed a commit to marafoundation/stratum that referenced this pull request Jun 23, 2026
Fold in the reviewer's substantive points and the external-validation data
from PR stratum-mining#2154 that the reviewer wrote without seeing:

- §1 front-matter (why the model): vardiff exists to hold every connection
  near r* despite orders-of-magnitude hashrate spread; r* bounds
  per-connection load, estimate variance, and reward variance. Poisson-
  in-log justified as forced, not chosen. States it as an evaluation (not
  control-design) model and lists what it omits (within-window steady H,
  one worker/connection, lossless retargets, continuous uncapped D).

- §8 rewritten "three questions, three views": why not one number (a
  time-integral hides when/what-kind, can't separate slow leak from
  transient) or an unconstrained radar (axis-order-dependent area;
  best-in-set rescaling) — only the fixed-reference construction and the
  trajectory's time axis survive. Rank/characterize/trust hierarchy.

- §6 share-volume correction: under-difficulty runs the connection over r*
  permanently; resource cost is linear in excess volume, hence ~linear in
  |e| near operating point — a one-sided regret_under bump, no new form.
  Pricing it can only shrink the cost-optimal offset; magnitude modest for
  headroom pools (the common case, r*≈4–6 spm), convex only under tight
  per-connection quotas (older/stressed machines). Deliverable: measure c.

- §9 NEW "Validation against real hardware": the shape-proxy tests on an
  Antminer S21 / testnet4 reproduced the classic mechanics quantitatively
  (jitter, −16.7%/fire, 300s cadence, overshoot, ±50% symmetry) and the
  counter-age dependence (5-min counter → 4.4min, 51-min → 51.8min); and
  the champion beat classic live on a halved-hashrate drop (classic: hours,
  wrong direction; champion: minutes, correct). Scopes the claim honestly:
  behaviorally validated on hardware, weights remain a calibrated judgment.
  Lists open HW tests (slow decline, multi-connection, measure c).

- Errata from the prior pass retained: Theorem/Lemma reserved for proved
  results, §5/§6 → Argument, offset recast as the asymmetry-optimal
  quantile ≈−0.67·σ_eff, "unbiased" restored in falsifier (b), accuracy
  ceiling clarified as a noise band not a bias, Σs² cadence-cap assumption,
  scalar demoted with commensurability caveat, regret=tracking-loss
  disclaimer. Sections renumbered 1–11; status table gains HW + share-
  volume rows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gimballock gimballock force-pushed the vardiff/simulation-framework branch 2 times, most recently from ac21e77 to dc70e2e Compare June 28, 2026 20:36
@gimballock gimballock changed the title [Draft] feat(vardiff): add in-process simulation framework + baseline regression tests [Draft] feat(vardiff): decline-safe vardiff champion + simulation framework + proof Jun 28, 2026
@gimballock gimballock force-pushed the vardiff/simulation-framework branch 3 times, most recently from c66784a to f3ba47b Compare June 30, 2026 05:34
gimballock pushed a commit to marafoundation/stratum that referenced this pull request Jun 30, 2026
…erified deltas, body rewrite, vnprc reply (NOT pushed)

PR stratum-mining#2188 (the clean upstream production extraction, distinct from stratum-mining#2154) was opened
before the decline-safety arc and ships the SUPERSEDED contender, not the champion.
Verified against source: stratum-mining#2188 = EWMA120/AsymCUSUM-at-10/0.2→0.4 (full_remedy-
adjacent); champion (champion_composed, composed.rs:261) = EWMA360/AdaptiveSignPersist
(seam 6)/AccelRetarget(0.2,0.6,0.05). All three stages differ.

Also verified the vnprc review thread: his June-11 objection cites the EXACT
job_id_to_target lines that became the corpus's §6(ii) no-lost-work retraction — he
was right, independently, three weeks early. Two stale justifications now sit in the
thread: vnprc killed the first (in-flight shares); the arc killed the second (the fitness sweep, my
June-12 defense) by replacing fitness-selection with the decline-safety constraint.

Prepared (held for owner timing): the champion extraction spec + cross-check, the
PR-body rewrite (360 not 120, AdaptiveSignPersist not CUSUM-at-10, selection =
decline-gate not fitness, remove the in-flight-shares framing), and a draft vnprc
reply that closes the loop with the current (stronger) justification. The force-push
to the public PR + the posted reply are the loud event — explicitly NOT done here;
they go together on the owner's go.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gimballock pushed a commit to marafoundation/stratum that referenced this pull request Jun 30, 2026
Clean extraction of the best-performing vardiff algorithm from the
simulation framework in stratum-mining#2154, with all test scaffolding, traits, and
alternative algorithm implementations removed.

The previous VardiffState used a fixed time-dependent threshold ladder
and full retarget. This produced:

- 6.6% median settled error (p99: 30% at low SPM)
- 5–9 minute cold-start convergence (p90)
- 33% detection rate for 10% hashrate declines (thermal throttle, failing ASICs)
- 28% target overshoot during cold-start ramp (p99 at SPM 6)

The new algorithm (EWMA + adaptive boundary + accelerating partial retarget):

- Settled accuracy: <3% median error across all SPM
- Cold-start overshoot bounded to <10% (was 28%)
- Jitter: 0.03 fires/min at low SPM (was 0.06) — half the unnecessary retargets
- Small-change detection: 85% reaction to -10% steps at SPM 6 (was 33%)
- Transient disconnects recover in 1–2 fires rather than requiring a full
  cold-start ramp (20%/fire partial retarget vs old algo's 50–67% slash)
- Asymmetric cost: loosening fires 3x faster than tightening, because
  loosening is free but tightening rejects in-flight shares

Breaking: adds private fields to VardiffState (previously all-pub).
Requires channels_sv2 major version bump. Public constructor API
(new, new_with_min) and Vardiff trait interface are unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gimballock pushed a commit to marafoundation/stratum that referenced this pull request Jun 30, 2026
…itness-pick contender)

This PR's flat-struct extraction shipped the mid-arc fitness-selected contender
(EWMA 120 / AsymmetricCUSUM-at-10 / retarget 0.2→0.4). The arc's final selection
criterion is decline-safety (a hard no-spiral constraint), not scalar fitness — and
the gate-passing champion differs on every stage. Replace the params with the
champion, verified parameter-for-parameter against champion_composed (the exact
constructor the decline-safety gate guard builds and validates in stratum-mining#2154):

- Estimator: EWMA 120s → 360s (the τ-safety-valley floor; shorter fails the gate).
- Boundary: seam 10 → 6; high boundary AsymmetricCUSUM → sign-persistence CUSUM
  (adds the consecutive-tick discount); tighten_multiplier 3.0 → 8.0. The stronger
  asymmetry is the selection-criterion change made concrete: a hard decline-safety
  constraint demands more tightening-reluctance than a stability-weighted fitness
  score did.
- Update: accelerating retarget eta_max 0.4 → 0.6, acceleration 0.2 → 0.05.

Doc comment rewritten: the asymmetry is dangerous-direction (decline-safety)
protection, NOT "tightening rejects in-flight shares" (which it doesn't — the pool
validates each share against the per-job target snapshotted at job creation; this
was a reviewer's correct catch, now retracted in the corpus).

Tests updated to the champion's contract (they encoded the contender's looser
single-tick behavior): the champion requires SUSTAINED evidence to tighten (8×
multiplier + slow EWMA), so asymmetry is now asserted as a sustained-evidence
property — deep loosening fires within minutes, symmetric tightening is the
deliberately-reluctant direction. Stale comments (3× threshold, eta 0.4, seam 10)
corrected; the low-spm PoissonCI test moved to spm=4 (genuinely below seam 6) so it
exercises the branch it names. 14 vardiff tests pass; fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
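
The retraction in the commit above rests on shares being validated against a per-job target snapshot rather than the live channel target. A minimal sketch of that mechanism follows. Only the name `job_id_to_target` comes from this thread; the `Channel` type, its methods, and the use of `u64` in place of a 256-bit target are simplifying assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical channel state: the live target plus a per-job snapshot.
struct Channel {
    current_target: u64,
    job_id_to_target: HashMap<u32, u64>,
}

impl Channel {
    fn new(target: u64) -> Self {
        Self { current_target: target, job_id_to_target: HashMap::new() }
    }

    /// Snapshot the target in force at the moment the job is created.
    fn create_job(&mut self, job_id: u32) {
        self.job_id_to_target.insert(job_id, self.current_target);
    }

    /// A vardiff retarget changes only the target used for future jobs.
    fn retarget(&mut self, new_target: u64) {
        self.current_target = new_target;
    }

    /// A share is valid if its hash value is at or below the target that was
    /// snapshotted for its job. Returns None for an unknown job.
    fn validate_share(&self, job_id: u32, hash_value: u64) -> Option<bool> {
        self.job_id_to_target.get(&job_id).map(|t| hash_value <= *t)
    }
}
```

In this sketch, tightening (lowering the target) after a job has been issued leaves shares for that job valid under the old target, which is why tightening carries no in-flight rejection cost.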
gimballock pushed a commit to marafoundation/stratum that referenced this pull request Jun 30, 2026
Replace the all-pub threshold-ladder VardiffState with a flat-struct adaptive EWMA
controller — the decline-safety-selected champion, the clean production extraction
of the algorithm derived in stratum-mining#2154. Single-struct replacement: no trait scaffolding,
no alternative implementations.

Three inline stages (behaviorally identical to the champion_composed reference that
passes the decline-safety gate in stratum-mining#2154):
- Estimator: EWMA, tau=360s (the τ-safety-valley floor; shorter windows fail the
  slow-decline gate at sparse rates).
- Boundary: adaptive at a seam of 6 spm — PoissonCI below (sparse-data conservatism),
  sign-persistence CUSUM at/above. The boundary protects the dangerous (tightening)
  direction two ways: an 8× asymmetric multiplier and a sign-persistence discount
  that relaxes only after consecutive same-direction ticks. This is decline-safety /
  dangerous-direction protection — NOT a "tightening rejects in-flight shares" cost
  (it doesn't: the pool validates each share against the per-job target snapshotted
  at job creation, which the vardiff path never mutates).
- Update: accelerating partial retarget, eta 0.2 → 0.6.

Selected by a decline-safety minimax (a hard no-spiral constraint), not a scalar
fitness score — which is why the asymmetry is strong (8×): a safety constraint
demands more tightening-reluctance than a stability-weighted fitness metric.

Breaking: adds private fields to VardiffState (requires channels_sv2 major bump);
shares_since_last_update now means "shares since last evaluation tick." Public
constructors (new, new_with_min) and the Vardiff trait are unchanged.

Tests: implementation-specific suite in test/classic.rs asserting the champion's
contract (sustained-evidence tightening reluctance, the asymmetry, the boundary
seam, partial-retarget damping), plus a teeth-bearing reset-cleanliness invariant
for the sign-persistence state. 15 vardiff tests; fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
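
The update stage described above (accelerating partial retarget, eta 0.2 → 0.6) might look roughly like the sketch below. The parameter values (0.2 start, 0.6 cap, 0.05 acceleration) are taken from the commit text in this thread; the additive schedule, the reset on a direction change, and all names are assumptions made for illustration.

```rust
/// Partial retarget whose step fraction grows on consecutive same-direction
/// fires. Hypothetical sketch: the schedule and reset rule are assumptions.
struct AcceleratingRetarget {
    eta_min: f64,
    eta_max: f64,
    acceleration: f64,
    consecutive: u32,
    last_sign: i8,
}

impl AcceleratingRetarget {
    fn new(eta_min: f64, eta_max: f64, acceleration: f64) -> Self {
        Self { eta_min, eta_max, acceleration, consecutive: 0, last_sign: 0 }
    }

    /// Move part of the way from the current hashrate to the estimate.
    fn fire(&mut self, current_hashrate: f64, estimated_hashrate: f64) -> f64 {
        let sign: i8 = if estimated_hashrate > current_hashrate { 1 } else { -1 };
        if sign == self.last_sign {
            self.consecutive += 1; // same direction again: take a bigger step
        } else {
            self.consecutive = 0; // direction changed: back to the small step
            self.last_sign = sign;
        }
        let eta = (self.eta_min + self.acceleration * self.consecutive as f64)
            .min(self.eta_max);
        current_hashrate + eta * (estimated_hashrate - current_hashrate)
    }
}
```

A single fire closes 20% of the gap, so one noisy estimate cannot slash the difficulty, while a sustained gap in one direction is closed progressively faster up to the 60% cap.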
@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 25b2fed to e7817b0 Compare June 30, 2026 06:52
… pipeline)

The shipped variable-difficulty controller for channels-sv2, built as a
three-stage decomposition (Estimator / Boundary / UpdateRule) composed via
Composed<E,B,U>. The champion is EwmaEstimator(360) / AdaptiveSignPersist(spm6) /
AcceleratingPartialRetarget — selected by a decline-safety minimax (the
death-spiral gate), not by scalar tracking fitness.

Includes the estimator family (EWMA, Bayesian, ckpool decay, Holt), the boundary
family (PoissonCI, SignPersistenceCusum, adaptive), the update rules
(partial/accelerating retarget), and the contrast controllers (classic SMA,
pow2-PID). Controller-layer tests live in src/vardiff/test/ and exercise the
pipeline independently of the simulation framework.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
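
The three-stage decomposition described above can be sketched with three traits and a generic composite. The names `Estimator`, `Boundary`, `UpdateRule` and `Composed` come from the commit message; every method signature, field, and stand-in stage below is a hypothetical illustration and not the PR's API.

```rust
/// Stage 1: turns raw share counts into a share-rate estimate (shares/min).
trait Estimator {
    fn observe(&mut self, shares: u32, dt_secs: f64) -> f64;
}

/// Stage 2: decides whether the estimate is far enough from target to act.
trait Boundary {
    fn should_fire(&mut self, estimated_spm: f64, target_spm: f64) -> bool;
}

/// Stage 3: maps the current and implied hashrate to the next hashrate.
trait UpdateRule {
    fn next_hashrate(&mut self, current: f64, implied: f64) -> f64;
}

struct Composed<E, B, U> {
    estimator: E,
    boundary: B,
    update: U,
    target_spm: f64,
    hashrate: f64,
}

impl<E: Estimator, B: Boundary, U: UpdateRule> Composed<E, B, U> {
    /// One evaluation tick; returns the new hashrate if a retarget fired.
    fn tick(&mut self, shares: u32, dt_secs: f64) -> Option<f64> {
        let est = self.estimator.observe(shares, dt_secs);
        if !self.boundary.should_fire(est, self.target_spm) {
            return None;
        }
        // Observed rate over target rate scales the assumed hashrate.
        let implied = self.hashrate * est / self.target_spm;
        self.hashrate = self.update.next_hashrate(self.hashrate, implied);
        Some(self.hashrate)
    }
}

// Deliberately trivial stand-in stages, for illustration only.
struct RawRate;
impl Estimator for RawRate {
    fn observe(&mut self, shares: u32, dt_secs: f64) -> f64 {
        shares as f64 * 60.0 / dt_secs
    }
}

struct RelativeBand(f64);
impl Boundary for RelativeBand {
    fn should_fire(&mut self, estimated_spm: f64, target_spm: f64) -> bool {
        ((estimated_spm - target_spm) / target_spm).abs() > self.0
    }
}

struct PartialStep(f64);
impl UpdateRule for PartialStep {
    fn next_hashrate(&mut self, current: f64, implied: f64) -> f64 {
        current + self.0 * (implied - current)
    }
}
```

Because each stage sits behind its own trait, one stage can be swapped while the other two are held fixed, which is what lets a metric change be attributed to a single stage.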
@gimballock gimballock force-pushed the vardiff/simulation-framework branch 6 times, most recently from 558c037 to cc9834f Compare June 30, 2026 08:19
…ards

Deterministic in-process framework for characterizing any Vardiff implementation.
Decomposes a controller into the three orthogonal stages and characterizes each in
isolation, so a metric change is attributable to the stage that changed. Provides
the scenario DSL, the Grid sweep harness, the Metric trait, ~75 analysis/figure
bins, and pinned baselines (champion/classic) with CI regression guards — including
the load-bearing decline-safety champion gate and the classic↔champion behavioral
equivalence test. Cargo.lock + rust-toolchain pinned for reproducibility.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
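
What "deterministic in-process" can mean in practice is sketched below: share arrivals are drawn as a Poisson process from a seeded generator, so the same seed always replays the same trajectory. This is an assumption-laden illustration and not the framework's code. The SplitMix64 generator, the function name, and the parameters are choices made for this sketch only.

```rust
/// SplitMix64: a small deterministic PRNG, so a run is reproducible from its seed.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform sample in (0, 1], safe to pass to `ln`.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
}

/// Count Poisson share arrivals at `spm` shares/minute over `minutes`.
/// The caller must pass `spm > 0`.
fn simulate_share_count(seed: u64, spm: f64, minutes: f64) -> u32 {
    let mut rng = SplitMix64(seed);
    let mut elapsed_min = 0.0;
    let mut count = 0;
    loop {
        // Exponential inter-arrival time, in minutes.
        elapsed_min += -rng.next_unit().ln() / spm;
        if elapsed_min > minutes {
            return count;
        }
        count += 1;
    }
}
```

Calling `simulate_share_count(42, 6.0, 60.0)` twice returns the same count, which is the property that makes pinned baselines and regression guards possible.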
@gimballock gimballock force-pushed the vardiff/simulation-framework branch 2 times, most recently from 92491a4 to bfca487 Compare June 30, 2026 17:27
…records, essays, figures

The full proof and reader-facing layer behind the controller and framework:

- information-floor.md — the paper: the closed information-floor theory, the
  death-spiral mechanism, the decline-safety selection criterion, and the metric it
  implies, each result labelled theory / simulation-only / hardware.
- THEORY.md — the derivation-and-falsification notebook (reasoning-in-progress).
- docs/records/ — the supporting record (architecture, findings, the per-controller
  investigations, test specs, and status registers), with README.md as the index.
- docs/essays/ — the two-part plain-language article (why / how), self-contained
  (vendored fonts + fallback stacks), byline included.
- docs/figures/ — the generated SVG figures the paper and essays embed.
- docs/claims/ — the claim-warrant validator and its fixture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gimballock gimballock force-pushed the vardiff/simulation-framework branch from bfca487 to d9d8619 Compare June 30, 2026 21:51
Eric Price and others added 3 commits June 30, 2026 19:10
…ning#2221 class)

Records the bounded result of a parallel source-audit of channels_sv2's
public input surface for the stratum-mining#2221 bug class (functions accepting
non-finite/out-of-range numeric input that produce garbage/dangerous
control values). Confirmed independent findings: A (vardiff new_with_min
NaN-disables-clamp) + B (try_vardiff divisor finiteness), both fixed on
branch fix/vardiff-reject-non-finite-hashrate; C (server set_nominal_hashrate
unvalidated store) as a defense-in-depth note. Six garbage-target caller
sites are the blast radius of stratum-mining#2221 (closed on its merge), not new findings.
Cleared the false suspects (hash_rate_from_target, client setters,
share-accounting sums) — the cleared set outnumbers the confirmed.

Sequencing decided: let stratum-mining#2221 merge first, prepare A+B ready-but-unopened,
then let maintainers' convention choose the tracking construct.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
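
The bug class recorded above (a NaN input silently disabling a clamp) follows from IEEE 754 comparison semantics: every ordered comparison against NaN is false. The sketch below shows the failure and one possible guard. It is not the code on the fix branch; the function names and the error type are hypothetical.

```rust
/// Naive floor: with a NaN `min` the comparison is false, so nothing is
/// clamped; with a NaN `hashrate` the NaN is passed straight through.
fn naive_floor(hashrate: f32, min: f32) -> f32 {
    if hashrate < min { min } else { hashrate }
}

#[derive(Debug, PartialEq)]
enum HashrateError {
    NonFinite,
    Negative,
}

/// Guarded floor: reject NaN, infinities and negative values before clamping.
fn checked_floor(hashrate: f32, min: f32) -> Result<f32, HashrateError> {
    if !hashrate.is_finite() || !min.is_finite() {
        return Err(HashrateError::NonFinite);
    }
    if hashrate < 0.0 || min < 0.0 {
        return Err(HashrateError::Negative);
    }
    Ok(if hashrate < min { min } else { hashrate })
}
```

Validating at the boundary keeps non-finite values from reaching the target computation, where they would otherwise turn into garbage control values.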
The plain-language 'why'/'how' essays (tier 1) assume no statistics; the
technical report information-floor.md (tier 3) is full rigor. This primer is
the bridge between them — a ground-up, interactive walkthrough of the
estimation theory (the precision floor, the collapse) with no prior stats
assumed, ending by pointing the reader on to the report.

Placed at the docs root beside information-floor.md (NOT in essays/, which
would mis-signal its depth). Standalone artifact: it does not yet wire into
the reading path (no essay->primer pointer added, to avoid editing the
review-branch essays), and references the report in prose rather than as a
hyperlink — both deferred until a whole-document read confirms it's ready to
advertise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the primer as the middle tier of the reading path: both essays now
point to information-floor-primer.html as the on-ramp ('the reasoning behind
the numbers, no prior statistics assumed') alongside the existing report
link, and the primer's closing now links onward to information-floor.md.
Full chain essays -> primer -> report, all relative paths verified to resolve.

Also folds in the earlier essay edit: the square-root formula (and its two
callbacks) removed from better-vardiff.html in favor of the plain
'diminishing returns' intuition — the formal law stays in the report tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>