Pre-registered from the v0.1.0 cycle (advisor review): on a non-receipt platform with unpinned stim 1.16, the plain-BP statistical-tier check once flaked at 5.7% count-delta while the Wilson CIs still overlapped — i.e. the actual statistical test passed and the count-delta heuristic failed. The count-delta bar is the wrong test for cross-platform statistical tiers.
Plan (v0.1.1, not v0.1.0 — changing the bar inside the release cycle would look like moving goalposts):
- statistical tier on platform-local shots: use the Wilson-CI overlap (already implemented in tridec.validation) or paired bootstrap as the binding test;
- keep the 5% count-delta as a receipt-environment drift sentinel only (different threat model: catching tridec regressions under the exact pins).
Filed against v0.1.0; documented in docs/benchmark.md §5.
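A minimal sketch of why the two checks can disagree, assuming the count-delta is the relative difference in failure counts at equal shot counts (the note does not spell out the definition). The function names `wilson_interval`, `wilson_overlap`, and `count_delta` are hypothetical and are not the tridec.validation API; the counts are illustrative, chosen only to reproduce a 5.7% delta, and are not measurements.

```python
import math


def wilson_interval(failures, shots, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = failures / shots
    denom = 1.0 + z * z / shots
    center = (p + z * z / (2 * shots)) / denom
    half = z * math.sqrt(p * (1 - p) / shots + z * z / (4 * shots * shots)) / denom
    return center - half, center + half


def wilson_overlap(fail_a, shots_a, fail_b, shots_b, z=1.96):
    """True if the two Wilson intervals share at least one point."""
    lo_a, hi_a = wilson_interval(fail_a, shots_a, z)
    lo_b, hi_b = wilson_interval(fail_b, shots_b, z)
    return lo_a <= hi_b and lo_b <= hi_a


def count_delta(fail_ref, fail_local):
    """Relative difference in failure counts (assumed definition)."""
    return abs(fail_local - fail_ref) / fail_ref


# Illustrative counts only: reference run vs. platform-local run.
ref_failures, local_failures, shots = 1000, 1057, 100_000

delta = count_delta(ref_failures, local_failures)
overlap = wilson_overlap(ref_failures, shots, local_failures, shots)
print(f"count-delta = {delta:.1%}, exceeds 5% bar: {delta > 0.05}")
print(f"Wilson CIs overlap: {overlap}")
```

With these counts the 5% count-delta bar trips while the intervals still overlap, which is the disagreement described above. The count-delta ignores sampling noise, so on independently sampled platform-local shots it can fire on ordinary statistical variation.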