
P1: feat(intelligence): weak-signal layer - DuckDB truth store + Neo4j GDS lens (#85) #86

Open
dzivkovi wants to merge 2 commits into main from feat/intel-graph-weak-signal-85

dzivkovi (Owner) commented Jul 2, 2026

Implements issue #85 phases 0-4: build and prove the weak-signal / commonality-detection machine on known signals. DuckDB is the truth store; the Neo4j graph is a disposable, regenerable lens.

What shipped

  • scripts/intel_graph.py (standalone, read-only against the corpus), with three subcommands. load builds the DuckDB 6-node/6-edge truth store from existing artifacts, with corpus-wide lexical grounding and provenance-bearing expresses rows, and no new Gemini pass. project writes the co-occurrence graph into Neo4j under VI_ labels only, runs seeded Leiden by default (Louvain behind --algo louvain) plus PageRank, and writes the results back to DuckDB. verify runs the issue #85 acceptance gate with quote @ video @ timestamp citations.
  • tests/test_intel_graph.py: 45 tests incl. the anti-circularity contract (taxonomy must never contribute co-occurrence edges) and a live Neo4j+GDS integration test that skips without a server.
  • CLAUDE.md architecture section (same diff, per the skill-parity rule); the intelligence extras gain neo4j>=5.
  • SPEC: docs/plans/2026-07-02-001-feat-intel-graph-weak-signal-plan.md

Gate 1 smoke test (real corpus, post-fix canonical run)

1,260 videos / 29,321 segments / 7,442 surface-term entities / 8,328 claims / 36,328 lexical mentions; load 9m15s (G: Drive I/O-bound), project + community detection + PageRank 22s.

| Known cross-vocab link | Deterministic (Leiden g=0.8, seeded) | Louvain (5 runs) | Citation |
| --- | --- | --- | --- |
| reliable agents == Ralph Wiggum loop | RECOVERED (community holds 'ralph loop' + reliability-cliff vocabulary) | 5/5 | "The technique is called Ralph Wiggum..." @ engineerprompt @ t=0 |
| context engineering == Factorio | MISSED (reported plainly; partition-fragile single-term entity) | 2/5 | evidence exists @ natebjones @ t=1281 but graph does not structurally link the vocabularies |
| Cursor -> Claude Code shift | RECOVERED (cursor's modal community holds 14/67 claude-code terms) | 5/5 | "...really move volume over to it" @ natebjones @ t=430 |

Alias recovery (answer key = taxonomy, never an input edge): mean cohesion 0.4422 vs permutation baseline 0.3418 (lift +0.10), 18/300 sets fully cohesive. Provenance: 1,093/8,328 claims presentable (13.1%); verify never surfaces an unpresentable claim.
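
The cohesion-versus-baseline comparison can be illustrated with a minimal sketch. It assumes, because the formula is not spelled out here, that a set's cohesion is the share of its members that land in the set's most common community, and that the baseline reshuffles community labels across entities. The function names, the toy data, and the round count are hypothetical and are not taken from scripts/intel_graph.py.

```python
import random
from collections import Counter


def cohesion(members, community_of):
    """Share of a set's members that fall in its most common community (assumed definition)."""
    labels = [community_of[m] for m in members if m in community_of]
    if not labels:
        return 0.0
    return Counter(labels).most_common(1)[0][1] / len(labels)


def mean_cohesion(alias_sets, community_of):
    return sum(cohesion(s, community_of) for s in alias_sets) / len(alias_sets)


def permutation_baseline(alias_sets, community_of, rounds=200, seed=0):
    """Mean cohesion after shuffling community labels across entities, averaged over rounds."""
    rng = random.Random(seed)          # seeded so the baseline is reproducible
    entities = sorted(community_of)    # fixed order, independent of dict insertion order
    labels = [community_of[e] for e in entities]
    total = 0.0
    for _ in range(rounds):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        total += mean_cohesion(alias_sets, dict(zip(entities, shuffled)))
    return total / rounds


# Hypothetical toy assignment: entity -> community id.
community_of = {
    "ralph loop": 1, "ralph wiggum": 1, "force-feed": 1, "agents": 1,
    "cursor": 2, "claude code": 2, "context engineering": 2,
    "factorio automation analogy": 3,
}
# The answer key (taxonomy alias sets) is only read here, never used to build edges.
alias_sets = [["ralph loop", "ralph wiggum", "force-feed"], ["cursor", "claude code"]]

observed = mean_cohesion(alias_sets, community_of)
baseline = permutation_baseline(alias_sets, community_of)
lift = observed - baseline
print(f"observed={observed:.4f} baseline={baseline:.4f} lift={lift:+.4f}")
```

A positive lift only says the partition groups aliases better than a random relabeling of the same community sizes; it says nothing about whether any individual pair is recovered.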

Pre-fix vs post-fix: the pre-review run scored the same 2/3 deterministic verdict; review fixes (slug-collision dedup, sibling-transcript guard) shifted substrate counts (7,429 -> 7,442 entities, 29,339 -> 29,321 segments) and the verdicts survived the rerun unchanged.

Key negative finding (part of the deliverable): GDS Louvain has no randomSeed and flips the Factorio pair run-to-run (3/3, 3/3, 2/3 observed); determinism required seeded Leiden PLUS ordering the projection inserts (GDS tie-breaks off internal node ids). Full analysis in the session observations (work/2026-07-02/, session-local).

Eval discipline (ADR-0017)

The frozen 25-query retrieval eval is untouched: this PR changes no retrieval logic (search / index / hybrid paths unmodified). The new verify gate is a separate verifier for a different question class, per the issue.

Premise-dependent claims (falsifiers)

  • "Leiden gamma=0.8 is a defensible operating point, not gate-fitting" rests on: it recovers exactly the pairs Louvain recovers in every run and still misses the fragile one. Falsifier: if gamma was tuned until P2 passed, this would be rigged - it was not, and P2 is reported MISSED.
  • "The Factorio miss is an entity-granularity problem" rests on: the only factorio surface term is the full phrase 'factorio automation analogy' so the token 'factorio' in other transcripts provides no glue. Falsifier: add salient-token entities and the pair still misses.
  • "12.9-13.1% grounding rate is expected, not a bug" rests on: concepts are extracted from mindmaps (Gemini's vocabulary), not transcripts. Falsifier: a matched-vocabulary corpus grounding equally low would indicate a matcher bug.

Review

8-reviewer /ce-code-review pass (correctness, adversarial, testing, maintainability, project-standards, performance, agent-native, learnings). Applied: slug-collision entity dedup (2 reviewers independently), --db after-subcommand parsing (empirically verified bug), empty published DATE cast crash, modal tie-degeneracy guard, taxonomy parse before table wipe (cloud-mount partial-read hazard #67), report persisted before stdout print (Windows cp1252), title-rotation sibling transcript splice guard, --force .duckdb-only unlink, set-based write-back, projection_meta lifecycle, Neo4j driver try/finally, Leiden docstring drift, CLAUDE.md em-dashes, P1 test gaps (title-rotation, permutation baseline). Deferred (Saint-Exupery filter, no behavior change in 2 weeks): citation-helper dedup, NEO4J_BATCH_SIZE constant, scaling-ceiling prefilters (documented in review artifacts), canonical-prefix tie-break for title/url (dedupe --apply is the documented precondition), ADR graduation (deliberate: roadmap says decisions graduate to ADRs as evidence accumulates).

Full test suite: 983 passed (938 pre-existing + 45 new). ruff clean.

Not merging

Per the overnight run's Gate 2(b): review the diff together with the smoke output above before merging. Unknown-signal hunting stays out of scope per the issue.

🤖 Generated with Claude Code

…j GDS lens (#85)

Adds scripts/intel_graph.py (standalone, read-only against the corpus):

- load: builds a DuckDB truth store (6-node/6-edge starter schema) from
  existing meta.json / transcripts / concepts.json / taxonomy.json, with
  corpus-wide lexical grounding (word-boundary regex over a contains
  prefilter) and provenance-bearing expresses rows. No new Gemini pass.
- project: entity co-occurrence (computed in DuckDB, taxonomy never
  contributes edges) projected into Neo4j under VI_-prefixed labels;
  community detection via seeded Leiden (deterministic; GDS Louvain has
  no randomSeed and flips a boundary pair run-to-run, kept behind
  --algo louvain) + PageRank via GDS; results written back to DuckDB.
- verify: issue #85 acceptance gate - alias-set cohesion vs a
  permutation baseline, three known cross-vocabulary pairs under a
  modal-anchored shared-community criterion with a megacommunity cap,
  every recovered link cited quote @ video @ timestamp.
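
A minimal sketch of the grounding idea in the load step, a cheap contains prefilter followed by a word-boundary regex. It assumes plain Python `re` rather than whatever engine the script actually uses; `normalize`, `find_mentions`, and the sample segments are hypothetical names and data for illustration only.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def find_mentions(term, segments):
    """Return (segment_id, char_offset) for whole-word hits of `term`."""
    needle = normalize(term)
    # Lookarounds instead of \b so terms ending in punctuation still match.
    pattern = re.compile(r"(?<!\w)" + re.escape(needle) + r"(?!\w)")
    hits = []
    for segment_id, text in segments:
        haystack = normalize(text)
        if needle not in haystack:   # cheap substring prefilter
            continue
        for match in pattern.finditer(haystack):
            hits.append((segment_id, match.start()))
    return hits


segments = [
    (1, "Cursor is where most people started."),
    (2, "A precursor to today's agents."),
    (3, "We moved from cursor to Claude Code."),
]
hits = find_mentions("cursor", segments)
print(hits)  # segment 2 passes the prefilter but fails the boundary check
```

The prefilter keeps the regex off segments that cannot match, and the boundary check is what stops 'precursor' from counting as a mention of 'cursor'.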

Real-corpus result (1,260 videos / 29,321 segments / 7,442 surface
terms): deterministic gate recovers 2/3 known pairs with citations
(Ralph Wiggum == reliable agents; Cursor -> Claude Code); the Factorio
== context engineering pair is partition-fragile and reported MISSED.
Louvain recovers it in 2 of 5 runs - the instability finding is part of
the deliverable (work/2026-07-02/ observations, session-local).

The frozen 25-query retrieval eval is untouched: no retrieval logic
changed (search/index/hybrid paths not modified).

Includes an 8-reviewer ce-code-review pass; fixes applied: slug-collision
entity dedup, --db subcommand parsing, empty published DATE cast,
modal tie-degeneracy guard, taxonomy parse before table wipe, report
persisted before stdout print (Windows cp1252), title-rotation sibling
transcript guard, --force .duckdb-only unlink, set-based write-back,
projection_meta lifecycle, driver try/finally.

Closes #85 phases 0-4 (prove-the-machine); unknown-signal hunting stays
out of scope per the issue.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

gitguardian Bot commented Jul 2, 2026


⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 18080992 | Triggered | Generic Password | b1a518c | tests/test_intel_graph.py |
| 18080992 | Triggered | Generic Password | b1a518c | tests/test_intel_graph.py |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

dzivkovi moved this from Inbox to Review in Video-Intel Jul 2, 2026

Four majors from the cross-model (GPT-5 Codex CLI) pass on PR #86:

- load_corpus wipes + rebuilds inside one DuckDB transaction; a fallible
  read mid-load rolls back to the previous store instead of leaving it
  emptied or half-rebuilt.
- project_to_neo4j invalidates community_id/pagerank/projection_meta
  BEFORE the fallible Neo4j phase; a failed projection now leaves verify
  loudly stateless instead of silently serving stale communities.
- Citation lookups are word-boundary matched (same discipline as
  grounding): a presented quote for 'cursor' can no longer be a
  'precursor' hit; quote_around centers on the boundary match.
- check_pair.resolve dedups by entity_id across patterns so overlapping
  patterns cannot let one surface term vote twice in the modal count.
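
The wipe-and-rebuild-in-one-transaction pattern from the first item can be sketched as follows. This uses the standard-library `sqlite3` module as a stand-in for DuckDB so the example runs without extra dependencies; `rebuild`, `flaky_reader`, and the `entities(name)` table are hypothetical and do not mirror the real schema or load_corpus.

```python
import sqlite3


def rebuild(con, rows):
    """Wipe and refill `entities` atomically; on any error the old rows survive."""
    con.execute("BEGIN")
    try:
        con.execute("DELETE FROM entities")
        for name in rows:   # `rows` may be a generator that raises mid-read
            con.execute("INSERT INTO entities(name) VALUES (?)", (name,))
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")   # restores the pre-wipe contents
        raise


def flaky_reader():
    """Simulates a source that fails after yielding one row."""
    yield "new-term"
    raise OSError("simulated partial read")


# isolation_level=None puts sqlite3 in autocommit mode so BEGIN/COMMIT are explicit.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE entities(name TEXT)")
con.execute("INSERT INTO entities(name) VALUES ('old-term')")

try:
    rebuild(con, flaky_reader())
except OSError:
    pass

surviving = [row[0] for row in con.execute("SELECT name FROM entities")]
print(surviving)
```

The point of the pattern is that the DELETE is only ever visible together with a complete set of INSERTs, so a failed read cannot leave the store emptied or half-rebuilt.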

5 regression tests added (988 suite-wide green). Canonical gate rerun:
verdict unchanged (2/3 recovered, same citations, Gate 1 lift +0.104).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

dzivkovi (Owner, Author) commented Jul 2, 2026

Ready for review - two-pass review summary (Gate 2b: not merging without you)

Pass 1 - /ce-code-review (8 Anthropic personas). Caught and fixed in b1a518c: slug-collision entity merge (adversarial + correctness independently), --db after-subcommand argparse rejection (empirically verified), empty published DATE-cast crash, modal tie-degeneracy, taxonomy parse ordered before the table wipe, Windows cp1252 print-before-persist, title-rotation transcript splice, unguarded --force unlink, set-based write-back, projection_meta lifecycle, driver try/finally, Leiden doc drift, em-dashes, two P1 test gaps.

Pass 2 - Codex peer review (GPT-5 Codex via Codex CLI, session 019f218e-aa34-7e31-8b5a-231d4ae8a94d). Verdict: "block on fixes" with 4 majors - all cross-layer gaps the persona pass missed, all applied in 1cf5382:

  1. Load wipe before fallible reads -> whole load now transactional (rollback restores the previous store).
  2. Failed projection served STALE communities as current -> state invalidated before the Neo4j phase; verify fails loudly instead.
  3. Citation lookup was raw substring while grounding was word-boundary -> a 'cursor' quote could be a 'precursor' hit; citations now boundary-matched.
  4. Overlapping patterns let one entity vote twice in the modal count -> deduped by entity_id.
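
Fix 4 can be illustrated with a small sketch of why overlapping patterns inflate the modal count and how keying on entity_id removes the double vote. It assumes substring pattern matching and a smallest-id tie-break; `resolve`, `modal_community`, and the sample entities are hypothetical and are not the actual check_pair code.

```python
from collections import Counter


def resolve(patterns, entities):
    """Map patterns to matching entities, keeping each entity_id exactly once."""
    matched = {}
    for pattern in patterns:
        for entity_id, name, community_id in entities:
            if pattern in name:
                matched[entity_id] = community_id   # same id overwrites, never duplicates
    return matched


def modal_community(matched):
    """Most common community among matched entities; ties go to the smaller id (assumption)."""
    counts = Counter(matched.values())
    return min(counts, key=lambda c: (-counts[c], c))


# (entity_id, surface term, community_id), hypothetical data.
entities = [
    (1, "ralph wiggum loop", 7),
    (2, "force-feed", 9),
    (3, "force-feed prompting", 9),
]
patterns = ["ralph", "wiggum", "force-feed"]

# Naive counting: entity 1 matches both 'ralph' and 'wiggum' and votes twice.
naive_votes = Counter(c for p in patterns for _, name, c in entities if p in name)
matched = resolve(patterns, entities)
deduped_votes = Counter(matched.values())

print(naive_votes, deduped_votes, modal_community(matched))
```

In the naive count the two communities tie at two votes each; after deduplication community 9 wins outright with two entities against one.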

5 regression tests added; suite 988 green; canonical gate rerun after the fixes: verdict unchanged (2/3 recovered deterministically, same citations, Gate 1 lift +0.104).

Deferred from both passes (Saint-Exupery filter): citation-helper SQL dedup, batch-size constant, quadratic scaling ceilings (documented), canonical-prefix title tie-break (dedupe --apply is the precondition), RE2-vs-re.escape residual risk, ADR graduation (deliberate, roadmap rule).

Honest residual noted in the session observations: seeded-Leiden reproducibility is conditional on Neo4j store state - unrelated VI_ node churn between runs can split/merge one boundary community (68 -> 67 observed) without changing gate verdicts or citations.

Merge verdict from the overnight run: ready for your review; the empirical case is the PR body's smoke table plus work/2026-07-02/04-graph-weak-signal-observations.md (session-local).

🤖 Generated with Claude Code

Copilot AI left a comment


Pull request overview

Implements the issue #85 “weak-signal / commonality-detection” layer by introducing a DuckDB-backed truth store plus a disposable Neo4j GDS projection, along with an acceptance-gate verifier and comprehensive tests to prove recovery of known cross-vocabulary links with citations.

Changes:

  • Added scripts/intel_graph.py implementing load (DuckDB truth store + grounding), project (Neo4j projection + Leiden/Louvain + PageRank writeback), and verify (acceptance gate report).
  • Added tests/test_intel_graph.py with unit coverage for helpers/loader/gates plus a Neo4j+GDS integration test that skips when unavailable.
  • Documented the design/spec and wired up dependencies (neo4j) and repo architecture notes.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| scripts/intel_graph.py | New standalone weak-signal pipeline (DuckDB load, Neo4j projection, acceptance-gate verify). |
| tests/test_intel_graph.py | New test suite covering helpers, loader behavior, gate logic, and optional Neo4j integration. |
| pyproject.toml | Adds neo4j>=5 to the intelligence optional dependency set. |
| docs/plans/2026-07-02-001-feat-intel-graph-weak-signal-plan.md | Adds the SPEC/plan describing goals, schema, gate, and operational constraints. |
| CLAUDE.md | Adds an architecture note documenting the new intel_graph.py utility and invariants. |


Comment thread scripts/intel_graph.py
Comment on lines +29 to +32
Usage:
    python scripts/intel_graph.py load [--output-dir DIR] [--db PATH] [--force]
    python scripts/intel_graph.py project [--db PATH] [--neo4j-uri URI] [--algo leiden|louvain] [--gamma G] [--max-df N]
    python scripts/intel_graph.py verify [--db PATH] [--report PATH]
Comment thread scripts/intel_graph.py
Comment on lines +73 to +78
    {
        "name": "reliable agents == Ralph Wiggum loop / force-feed",
        "user_patterns": ["reliab"],
        "creator_patterns": ["ralph", "wiggum", "force-feed"],
        "citation_phrases": ["ralph wiggum", "ralph loop", "force-feed"],
    },
Comment thread scripts/intel_graph.py
Comment on lines +712 to +733
    return con.execute(
        """
        WITH anchors AS (
            SELECT DISTINCT artifact_id FROM segments WHERE contains(lower(text), ?)
        ),
        terms AS (
            SELECT hc.entity_id FROM has_concept hc JOIN anchors USING (artifact_id)
            UNION ALL
            SELECT m.entity_id FROM mentions m JOIN segments s USING (segment_id)
            JOIN anchors a ON a.artifact_id = s.artifact_id
        )
        SELECT e.entity_id, e.community_id
        FROM (
            SELECT t.entity_id, count(*) AS n
            FROM terms t JOIN entities e2 ON e2.entity_id = t.entity_id
            WHERE e2.community_id IS NOT NULL
            GROUP BY 1 ORDER BY n DESC, t.entity_id LIMIT ?
        ) ranked
        JOIN entities e USING (entity_id)
        """,
        [normalize_phrase(phrase), limit],
    ).fetchall()

### Neo4j projection

- All nodes labeled `VI_Entity` (namespaced); `project` deletes only `VI_`-prefixed labels on rebuild. Runs Louvain (seeded config, concurrency 1 for reproducibility) and PageRank via GDS, writes `community_id` and `pagerank` back to DuckDB `entities`.