bitr/experiments.log at main · cksystemsgroup/bitr · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
# experiments.log — Append-only benchmark results
# Format: date | commit | change | solved | time | notes

2026-03-31 | pre-opt baseline | baseline | 25/25 | 59.7s total
  - tiny: 16/16, all <10ms
  - perf: 9/9, twovars_16=36.4s, wide_mul_32=21.9s, exhaustive_28=0.95s
  - bitwuzla comparison: BITR 59.7s total vs bitwuzla 1.19s total

2026-03-31 | post-opt | benchmark+optimize vs bitwuzla/ric3 | 33/33 | 22.7s total
  Changes:
  - Edge merging: O(n²) → O(n) via HashMap (bvdd.rs)
  - Solve hot path: avoid full BvddNodeKind clone, copy only needed fields
  - BMC: combined double substitution passes into single pass
  - BMC: conjoin constraints into single BVC (one solve instead of N)
  - BMC: persist computed cache across steps (don't clear every step)
  - Theory resolution: multi-variable HSC decomposition (was single-var only)
  - Theory resolution: parallel blast budget raised (single-var 2^33, multi-var 2^32)
  - Theory resolution: parallel threshold lowered to 100K (was 1M)
  - Theory resolution: oracle enabled regardless of timeout mode
  - Byte-blast: time-based bailout (500ms) prevents wasted effort on hard UNSAT
  Results:
  - counter_unsat: 22ms → 7ms (3x, BMC cache persistence)
  - counter_deep_sat: 24ms → 6ms (4x, BMC optimizations)
  - twovars_16_unsat: 36.4s → 5.2s (7x, parallel multi-var blast)
  - wide_mul_32: 21.9s → 16.1s (1.4x, parallel single-var blast)
  - Total: 59.7s → 22.7s (2.6x overall improvement)
  - New benchmarks: 8 additional perf tests (wide_mul_32, twovars_16, counter_deep, etc.)
  - Bitwuzla comparison script: benchmarks/compare_solvers.py

2026-04-02 | CDCL bitblaster reorder | move bitblaster from Stage 3c to 2b | 35/35 | 1.4s total
  Changes:
  - Moved native CDCL bit-blast (splr) from Stage 3c to Stage 2b in theory resolution cascade
  - Previously sat after parallel blast (2^33 budget), so never fired on bottleneck benchmarks
  - Now fires immediately after compiled blast (2^28 budget), before byte-blast and parallel blast
  - Added gate memoization (HashMap) to bitblaster: reuses AND/OR/XOR gates for shared subexprs
  Results:
  - twovars_16_unsat: 25.4s → 0.004s (6,600x, CDCL replaces 2^32 exhaustive enumeration)
  - wide_mul_32: 15.6s → 0.005s (3,200x, CDCL replaces 2^32 exhaustive enumeration)
  - twovars_16_sat: 5.1s → 0.003s (1,500x, CDCL replaces 2^32 exhaustive enumeration)
  - exhaustive_28: 1.0s → 1.0s (unchanged, still uses compiled blast at 2^28)
  - All other benchmarks: unchanged (domains <= 2^28, compiled blast handles them)
  - Total perf time: 22.7s → 1.4s (16x overall improvement)
  - 35/35 benchmarks correct (16 tiny + 19 perf), 103 tests, 0 clippy warnings

2026-04-02 | threshold tuning | compiled blast 2^28 → 2^16, route more through CDCL | 35/35 | 0.1s total
  Changes:
  - Lowered compiled blast budget from 2^28 to 2^16 (65536 domain values)
  - Problems with domain > 65536 now go to CDCL bitblaster instead of exhaustive enumeration
  - exhaustive_28: 1.0s → 0.005s (200x, CDCL replaces 268M-value enumeration)
  - three_var_8_unsat: 0.13s → 0.004s (33x, CDCL replaces 16M-value enumeration)
  - Total perf time: 1.4s → 0.1s (all 19 benchmarks under 10ms)
  - HWMCC BV: 32/155 solved (--timeout 10, with oracle); 14/154 without oracle
  - All 140 HWMCC BV timeouts are sequential BMC problems (0 combinational timeouts)
  - 35/35 benchmarks correct, 103 tests, 0 clippy warnings

2026-04-03 | k-induction | inductive safety proofs for sequential models | 52/155 BV
  Changes:
  - New module: bitr/src/kinduction.rs — k-induction prover
  - Strategy: k-induction on cloned state (25% timeout), then BMC (75% timeout)
  - Inductive step: checks if P holding for k consecutive steps implies P at step k+1
  - Base case reuses standard BMC checking logic
  - Cloned TermTable/BvcManager prevents term table pollution across phases
  - Added BvcManager::negate() for property negation in inductive step
  - Added Clone derives for TermTable, ConstraintTable, BvcManager
  Results:
  - HWMCC BV: 32/155 → 52/155 solved (+63%, 26 gained via induction, 6 lost to timeout)
  - All 26 gained benchmarks are UNSAT (proven safe by induction at low k)
  - 35/35 tiny+perf benchmarks correct, 103 tests, 0 clippy warnings

2026-04-03 | BMC bitblaster + tuning | CDCL bitblaster in BMC, tuned k-ind budget | 60/155 BV
  Changes:
  - Added CDCL bitblaster as fallback in BMC for medium terms (10K-200K nodes)
  - Reduced k-induction budget: 25% → 10% timeout, max_k 10 → 3
  - Added term size guard in k-induction: skip inductive step if terms > 50K
  - Added per-step timeout bail-out in k-induction
  Results:
  - HWMCC BV: 52/155 → 60/155 solved (+15%, 9 gained, 0 lost vs prev)
  - New solves include cal182, stack-p2, pc_sfifo+token_ring combinations
  - Recovered 93.c and analog_estimation_convergence (previously lost to k-ind budget)
  - 35/35 tiny+perf benchmarks correct, 103 tests, 0 clippy warnings

2026-04-03 | COI reduction | cone-of-influence preprocessing | 60/155 BV
  Changes:
  - Added cone-of-influence reduction in BTOR2 parser (btor2.rs)
  - Backward reachability from bad properties and constraints
  - Removes nodes not affecting any property (e.g., 30% on dspfilters: 6700→4719 nodes)
  - States outside COI also excluded (dspfilters: 743→522 states)
  Results:
  - HWMCC BV: 60/155 (stable)
  - 35/35 tiny+perf benchmarks correct, 103 tests, 0 clippy warnings

2026-04-17 | Phase A correctness foundation | soundness fixes
  Plan: /Users/ck/.claude/plans/make-a-plan-on-melodic-dolphin.md

  A1 — Unsound BMC UNSAT on bound exhaustion:
  - bmc.rs:304 now returns SolveResult::Unknown instead of SolveResult::Unsat when
    bound is exhausted without a counterexample. Previously, every bound-exhausted
    BMC reported "unsat" — but this is only sound when the bound ≥ reachable
    diameter, which BMC does not verify.
  - Test: bitr/src/bmc.rs test_bmc_bound_exhausted_returns_unknown.

  A2 — SAT witness verification:
  - Added extract_witness_verified(id, target, verify_term) in bvdd/src/solver.rs
    that extracts a witness and verifies it satisfies verify_term under target.
    Returns None when unverifiable.
  - BMC (bmc.rs tier 1) and k-induction (kinduction.rs solve_bvc tier 1) now
    downgrade BVDD-SAT to Unknown if the witness doesn't verify, letting lower
    tiers (bitblaster/oracle) reattempt.
  - K-induction bitblaster path made explicit about witness verification
    (verified must be true; else fall through to oracle).
  - Tests: test_extract_witness_verified_{accepts_valid, rejects_const_mismatch}.

  A3 — --verify CLI flag:
  - New flag --verify cross-checks every BMC/kind SAT answer against the
    external SMT oracle; panics with a diagnostic dump on disagreement.
    Dev/CI only, no-op when oracle absent.
  - Helper bmc::verify_against_oracle.

  A4 — Golden integration test suite:
  - New bitr/tests/golden.rs runs tiny benchmarks end-to-end (all 16) and
    verifies answers. Also runs tiny suite under --verify mode (zero
    disagreements on all 16 locally).
  - HWMCC subset removed from goldens after A5 — several pre-A5 SAT verdicts
    depended on the unsound reset-signal heuristic. Will reintroduce post-B.

  A5 — Reset-signal soundness fence:
  - Added bmc::collect_non_reset_uses that flags inputs appearing in ANY data
    position (non-ITE-condition). bmc_check and kinduction_check now disqualify
    reset candidates that also appear in data positions of next-state, bad, or
    constraints. Previously the heuristic could zero a non-reset input,
    producing wrong SAT answers on ambiguous cases.
  - Tests: test_collect_ite_cond_inputs_direct_only,
    test_collect_non_reset_uses_{finds_data_usage, flags_dual_use}.

  Results:
  - 117 tests passing (+14 from baseline), 0 clippy warnings in bitr/ code.
  - All 16 tiny benchmarks correct; --verify on all 16 sees zero oracle
    disagreements.
  - HWMCC BV not re-measured in Phase A. The A5 fence is known to reduce
    solved count in the short term by correctly rejecting unsound reset
    inference — a performance regression that Phase B (incremental SAT BMC)
    is designed to recover.
  - Phase A establishes the correctness baseline that every subsequent phase
    must preserve.

2026-04-17 | Phase B incremental SAT BMC | opt-in --incremental-bmc path
  Plan: /Users/ck/.claude/plans/make-a-plan-on-melodic-dolphin.md

  New module bitr/src/incremental_bmc.rs:
  - Allocates per-step SAT literals for each state variable via BitBlaster's
    alloc_vars(). Binds them via set_var_lits() before blasting init,
    constraints, bad, and next expressions.
  - Emits equality clauses linking state[k+1] literals to the blasted
    next-state bits: for each bit i, (¬lit_k1_i ∨ next_i) ∧ (lit_k1_i ∨ ¬next_i).
  - Per step: snapshot the shared CNF, append [bad_lit] unit, run a fresh
    splr solver. (splr 0.17.2 does not expose user-assumption retraction, so
    clone-per-step is the sound equivalent.)
  - On SAT: decode witness from model, return Sat. On Unsat: continue.
  - Bound exhaustion → Unknown (matches Phase A1 soundness).

  Extended bvdd/src/bitblast.rs API:
  - clauses_snapshot(), clear_term_cache(), push_clause(), blast_public(),
    exceeded(), add_target_public(), true_lit(), and a thin
    solve_snapshot_with_unit() that keeps the splr dependency in one crate.

  Wiring (bitr/src/main.rs):
  - New --incremental-bmc flag. When set, the solver runs incremental BMC
    for half the BMC budget BEFORE the legacy bmc_check. Only SAT answers
    from the incremental path are accepted; Unknown falls through to the
    legacy path (which retains theory resolution capabilities).
  - Default remains off until HWMCC validation.

  Tests (all passing):
  - incremental_bmc::tests::ibmc_finds_counter_sat
  - incremental_bmc::tests::ibmc_counter_unsat_returns_unknown_at_bound
  - incremental_bmc::tests::ibmc_combinational_ite_sat
  - All 16 tiny benchmarks pass with --incremental-bmc and with
    --verify --incremental-bmc (zero oracle disagreements).

  Open items (Phase C+):
  - Arrays (OpKind::Read/Write) aren't handled by the bitblaster → incremental
    BMC falls back to legacy on array benchmarks. That's the correct behavior.
  - Learned-clause reuse is sacrificed by clone-per-step. A future upgrade
    could enable splr's support_user_assumption feature for true incremental
    assumption-based solving.
  - HWMCC BV not re-measured yet; Phase C (preprocessing + AIG) and a
    benchmark sweep come next.

  Ad-hoc benchmark spot check (qspiflash_dualflexpress_divthree-p050, 20s):
  - Without --incremental-bmc: 3 BMC steps in 20s (terms grew 517 → 2.7M).
  - With --incremental-bmc: 41 BMC steps in ~7s, then fell through to legacy
    for the rest. 10+x deeper unrolling for the same wall-clock budget.
  - Answer: Unknown both ways (property not refuted within bound); but the
    incremental path has enough room to explore much further.

2026-04-17 | Phase C partial | bitblaster ITE memoization + peepholes
  Plan: /Users/ck/.claude/plans/make-a-plan-on-melodic-dolphin.md

  Extended bvdd/src/bitblast.rs ite_gate:
  - Added peepholes for all 11 common simplifications (ite(s,1,e)=s∨e,
    ite(s,t,0)=s∧t, ite(s,t,-t)=¬(s⊕t), ite(s,s,e)=s∨e, etc.).
  - Added ite_cache: HashMap<(s,t,e), i32> keyed on canonicalized triple
    (selector normalized positive; t/e swapped if selector was negated).
    Previously ITE was the only Tseitin gate not memoized, producing O(n)
    fresh variables per shared ITE subexpression.

  C1 (BTOR2 constant folding + CSE) is a no-op: TermTable::intern (term.rs:74)
  already hash-conses on `(kind, width)`, so BTOR2 duplicate subterms are
  deduplicated at parse time. No additional work needed.

  Results:
  - All 120 tests passing (was 117). No new clippy warnings in modified code.
  - All 16 tiny benchmarks correct under --verify and under
    --verify --incremental-bmc. Zero oracle disagreements.
  - HWMCC BV not re-measured (full sweep pending).

2026-04-17 | HWMCC BV measurement + constraint-accumulation soundness fix
  Plan: /Users/ck/.claude/plans/make-a-plan-on-melodic-dolphin.md

  The first post-Phase-A+B+C2 sweep (results/bv_phaseABC_ibmc.csv,
  --incremental-bmc, 60s via scripts/run_sweep.py) exposed a wrong-SAT verdict
  on picorv32-check-p09 (refs avr/avr_dp/nuxmv: unsat; we: sat). Root cause was
  TWO soundness bugs:

  1. BMC and k-induction base case conjoined bad(state_k) with constraint(state_k)
     only, not with constraint(state_i) for i=0..k. This let the SAT solver find
     spurious traces that violated earlier-step constraints.
     Fix: accumulate constraint(state_i) across steps into a running AND; conjoin
     the accumulated constraint with bad before every property check.

  2. Reset-signal heuristic (zeroing width-1 ITE-cond inputs at step > 0) was
     fundamentally unsound — both too aggressive (zeroing true inputs) and too
     conservative under A5's fence (missing real resets). No shape-based
     identification works; needs BTOR2-level annotation or principled analysis.
     Fix: every input is fresh per step, no heuristic bindings.

  Commit: def492c.

  Re-measurement (results/bv_postfix.csv, same config):
  - Pre-fix  : 35/155 solved (15 sat + 20 unsat) — includes 3+ confirmed wrong SATs
  - Post-fix : 29/155 solved ( 9 sat + 20 unsat) — zero confirmed wrong SATs

  All 6 benchmarks that "lost SAT" in the post-fix run were checked against
  reference solvers (avr, avr_dp, nuxmv, btormc, btor2-selectmc):
  - a16-p128: was sat, refs say UNSAT → previously WRONG
  - a16-p148: was sat, refs say UNSAT → previously WRONG
  - picorv32-check-p09: was sat, refs say UNSAT → previously WRONG (triggering)
  - arbitrated_top_n5_w128_d128_e0: refs all unknown → status unclear
  - picorv32_mutCY_nomem-p8: refs all unknown → status unclear
  - shift_register_top_w32_d64_e0: btor2-selectmc says SAT → possibly legit loss

  Apr-15 baseline (bv_300s.csv, 74/155) was heavily polluted: A1 exposed 40+
  unsound UNSAT claims (bound-exhausted returned as Unsat), and at least 3 SATs
  were spurious. The real sound baseline at 60s is 29/155.

  Next moves (per plan, correctness > performance):
  - Longer timeout sweep to recover Unknowns that are tractable with more time.
  - Phase D: inductive generalization to unlock kind_only timeouts.
  - Investigate shift_register_top_w32_d64_e0 specifically — either our lifter
    is missing something or incremental BMC is slower than the legacy path.

2026-04-18 | 300s HWMCC BV sweep | counterintuitive regression vs 60s
  Same soundness-fixed binary at 300s/bench instead of 60s (run_sweep.py
  with --incremental-bmc, 7h27m total; results/bv_300s_postfix.csv).

  Result: 27/155 solved (7 sat + 20 unsat). LOWER than 60s (29/155; 9 sat +
  20 unsat).

  Transition 60s → 300s:
  - 7 sat → sat, 22 unsat → unsat (core preserved)
  - 2 sat → timeout at 300s: marlann_compute_cp_fail2-p1, vis_arrays_vsa16a_p6
    (solver found SAT within 60s but NOT within 300s)
  - 9 timeout → unknown (graceful exits)
  - 17 unknown → timeout (ran past 60s then hit 305s wrapper-kill)

  Hypothesis: constraint accumulation (def492c) + longer k-induction budget
  let k-induction build bigger BVCs before BMC starts. The larger term table
  slows substitute_states quadratically, so BMC at step k is MORE expensive
  at 300s than at 60s. Effective BMC time is SMALLER at the longer budget.

  Practical consequence: 60s is currently a better operating point than 300s.
  Fixing this requires one of:
  (a) k-induction bail on term-table growth (we have tt.len() > 500k today,
      should be lower or accumulation-budgeted for k-induction specifically).
  (b) Cooperative cancellation inside substitute_states so inner loops
      respect outer --timeout.
  (c) Incremental BMC as the first-class default (no legacy-BMC fallback)
      independent of k-induction budget.

  Sound baseline at 60s: 29/155 = current competitive number.

2026-04-18 | b71796a | incremental BMC as default, --legacy-bmc for opt-in | 29/155 | 6670s (1h51m)
  Test of option (c) from the 300s entry above. Default pipeline now:
  k-induction (25% budget) -> incremental BMC (75%, no legacy fallback).
  Previously: k-ind 25% -> ibmc 37.5% -> legacy 37.5%.

  60s sweep (results/bv_60s_default_ibmc.csv):
  - 9 sat + 20 unsat = 29/155 (identical headline to bv_postfix.csv).
  - Set diff: SAME 29 benchmarks. Zero gained, zero lost.

  Implication: in the prior split-budget run, legacy BMC contributed nothing
  over incremental. Doubling incremental's budget also unlocked no new
  solves. The remaining 126 failures are NOT budget-bound — they are
  algorithmic. More time will not move the number.

  Next levers (not budget):
  - Phase D: inductive generalization (counterexample-to-induction cube
    refinement).
  - Phase C: CNF peepholes + hash-cons of ITE selectors in bitblast.rs.
  - Cooperative cancellation in substitute_states (unblocks meaningful
    --legacy-bmc comparison, still useful as oracle).