P1: feat(intelligence): weak-signal layer - DuckDB truth store + Neo4j GDS lens (#85) by dzivkovi · Pull Request #86 · dzivkovi/video-intel

dzivkovi · 2026-07-02T06:38:59Z

Implements issue #85 phases 0-4: build and prove the weak-signal / commonality-detection machine on known signals. DuckDB is the truth store; the Neo4j graph is a disposable, regenerable lens.

What shipped

scripts/intel_graph.py (standalone, read-only against the corpus): load (DuckDB 6-node/6-edge truth store from existing artifacts, corpus-wide lexical grounding, provenance-bearing expresses rows, no new Gemini pass), project (co-occurrence graph into Neo4j under VI_ labels only; seeded Leiden default + --algo louvain; PageRank; write-back to DuckDB), verify (the issue #85 acceptance gate with quote @ video @ timestamp citations).
tests/test_intel_graph.py: 45 tests incl. the anti-circularity contract (taxonomy must never contribute co-occurrence edges) and a live Neo4j+GDS integration test that skips without a server.
CLAUDE.md architecture section (same diff, per the skill-parity rule); the intelligence extras gain neo4j>=5.
SPEC: docs/plans/2026-07-02-001-feat-intel-graph-weak-signal-plan.md

Gate 1 smoke test (real corpus, post-fix canonical run)

1,260 videos / 29,321 segments / 7,442 surface-term entities / 8,328 claims / 36,328 lexical mentions; load 9m15s (G: Drive I/O-bound), project + community detection + PageRank 22s.

Known cross-vocab link	Deterministic (Leiden g=0.8, seeded)	Louvain (5 runs)	Citation
reliable agents == Ralph Wiggum loop	RECOVERED (community holds 'ralph loop' + reliability-cliff vocabulary)	5/5	"The technique is called Ralph Wiggum..." @ engineerprompt @ t=0
context engineering == Factorio	MISSED (reported plainly; partition-fragile single-term entity)	2/5	evidence exists @ natebjones @ t=1281 but graph does not structurally link the vocabularies
Cursor -> Claude Code shift	RECOVERED (cursor's modal community holds 14/67 claude-code terms)	5/5	"...really move volume over to it" @ natebjones @ t=430
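A minimal sketch of what a modal-anchored shared-community check could look like, to make the RECOVERED / MISSED column easier to read. Everything here is an assumption for illustration: the function names, the tie handling (a tie yields no modal community), and the `max_share` megacommunity cap are hypothetical and are not taken from scripts/intel_graph.py.

```python
from collections import Counter


def modal_community(community_of, entity_ids):
    """Return the most common community among the given entities.

    Returns None when no entity has a community or when the top two
    communities tie (assumed handling of the tie-degenerate case).
    """
    unique_ids = set(entity_ids)  # one vote per entity, even if matched twice
    votes = Counter(community_of[e] for e in unique_ids if e in community_of)
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def pair_recovered(community_of, anchor_ids, other_ids, max_share=0.5):
    """True when the other side has a term inside the anchor's modal community."""
    community_id = modal_community(community_of, anchor_ids)
    if community_id is None:
        return False
    size = sum(1 for c in community_of.values() if c == community_id)
    if size > max_share * len(community_of):
        return False  # a megacommunity would link everything to everything
    return any(community_of.get(e) == community_id for e in set(other_ids))
```

Under these assumptions, a single-term entity such as 'factorio automation analogy' is recovered only if that one node lands in the anchor's modal community, which is why such a pair is fragile to small partition changes.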

Alias recovery (answer key = taxonomy, never an input edge): mean cohesion 0.4422 vs permutation baseline 0.3418 (lift +0.10), 18/300 sets fully cohesive. Provenance: 1,093/8,328 claims presentable (13.1%); verify never surfaces an unpresentable claim.
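For readers unfamiliar with the baseline, here is a hedged sketch of how a cohesion score and its permutation baseline can be computed. The cohesion definition used below (share of an alias set that falls in its largest single community) is an assumption, as are all function names; the PR does not spell out the exact formula. The reported lift is simply observed minus baseline: 0.4422 - 0.3418 = 0.1004.

```python
import random
from collections import Counter


def set_cohesion(labels):
    """Share of a set's members that sit in its largest community (assumed definition)."""
    if not labels:
        return 0.0
    return Counter(labels).most_common(1)[0][1] / len(labels)


def mean_cohesion(alias_sets, community_of):
    """Average cohesion over all alias sets, ignoring entities without a community."""
    scores = [
        set_cohesion([community_of[e] for e in members if e in community_of])
        for members in alias_sets
    ]
    return sum(scores) / len(scores)


def permutation_baseline(alias_sets, community_of, rounds=200, seed=0):
    """Mean cohesion after shuffling community labels across entities."""
    rng = random.Random(seed)  # seeded so the baseline is reproducible
    entities = sorted(community_of)
    labels = [community_of[e] for e in entities]
    total = 0.0
    for _ in range(rounds):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        total += mean_cohesion(alias_sets, dict(zip(entities, shuffled)))
    return total / rounds
```

Shuffling labels keeps the community size distribution intact, so the baseline answers "how cohesive would these sets look if community membership were unrelated to the taxonomy".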

Pre-fix vs post-fix: the pre-review run scored the same 2/3 deterministic verdict; review fixes (slug-collision dedup, sibling-transcript guard) shifted substrate counts (7,429 -> 7,442 entities, 29,339 -> 29,321 segments) and the verdicts survived the rerun unchanged.

Key negative finding (part of the deliverable): GDS Louvain has no randomSeed and flips the Factorio pair run-to-run (3/3, 3/3, 2/3 observed); determinism required seeded Leiden PLUS ordering the projection inserts (GDS tie-breaks off internal node ids). Full analysis in the session observations (work/2026-07-02/, session-local).
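The ordering half of that finding can be sketched as follows: sort nodes and edges into a canonical order before inserting, so Neo4j assigns internal ids in the same sequence on every rebuild. This is an illustrative sketch only; the helper names, the row shapes, the seed value, and the weight property name are hypothetical. `gamma`, `randomSeed`, `concurrency`, and `relationshipWeightProperty` are GDS Leiden configuration keys, and gamma=0.8 is the operating point reported above.

```python
def ordered_projection_rows(nodes, edges):
    """Return node and edge rows in a canonical, input-order-independent order."""
    node_rows = [{"entity_id": n} for n in sorted(set(nodes))]
    # Normalize each undirected edge so (a, b) and (b, a) collapse to one row.
    canonical = sorted({(min(a, b), max(a, b), w) for a, b, w in edges})
    edge_rows = [{"source": a, "target": b, "weight": w} for a, b, w in canonical]
    return node_rows, edge_rows


def batches(rows, size=1000):
    """Yield consecutive slices so inserts happen in the sorted order."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Assumed Leiden settings; the seed value and property name are placeholders.
LEIDEN_CONFIG = {
    "gamma": 0.8,
    "randomSeed": 42,
    "concurrency": 1,
    "relationshipWeightProperty": "weight",
}
```

Sorting before batching means two loads of the same truth store produce byte-identical insert sequences, which is the precondition for the seed to matter at all.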

Eval discipline (ADR-0017)

The frozen 25-query retrieval eval is untouched: this PR changes no retrieval logic (search / index / hybrid paths unmodified). The new verify gate is a separate verifier for a different question class, per the issue.

Premise-dependent claims (falsifiers)

"Leiden gamma=0.8 is a defensible operating point, not gate-fitting" rests on: it recovers exactly the pairs Louvain recovers in every run and still misses the fragile one. Falsifier: if gamma was tuned until P2 passed, this would be rigged - it was not, and P2 is reported MISSED.
"The Factorio miss is an entity-granularity problem" rests on: the only factorio surface term is the full phrase 'factorio automation analogy' so the token 'factorio' in other transcripts provides no glue. Falsifier: add salient-token entities and the pair still misses.
"12.9-13.1% grounding rate is expected, not a bug" rests on: concepts are extracted from mindmaps (Gemini's vocabulary), not transcripts. Falsifier: a matched-vocabulary corpus grounding equally low would indicate a matcher bug.

Review

8-reviewer /ce-code-review pass (correctness, adversarial, testing, maintainability, project-standards, performance, agent-native, learnings). Applied: slug-collision entity dedup (2 reviewers independently), --db after-subcommand parsing (empirically verified bug), empty published DATE cast crash, modal tie-degeneracy guard, taxonomy parse before table wipe (cloud-mount partial-read hazard #67), report persisted before stdout print (Windows cp1252), title-rotation sibling transcript splice guard, --force .duckdb-only unlink, set-based write-back, projection_meta lifecycle, Neo4j driver try/finally, Leiden docstring drift, CLAUDE.md em-dashes, P1 test gaps (title-rotation, permutation baseline). Deferred (Saint-Exupery filter, no behavior change in 2 weeks): citation-helper dedup, NEO4J_BATCH_SIZE constant, scaling-ceiling prefilters (documented in review artifacts), canonical-prefix tie-break for title/url (dedupe --apply is the documented precondition), ADR graduation (deliberate: roadmap says decisions graduate to ADRs as evidence accumulates).
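On the --db after-subcommand item: one common way to make a shared flag parse after the subcommand is an argparse parent parser. The sketch below shows that pattern under stated assumptions; whether the fix in this PR took this route is not shown in the conversation, and the default path and the subset of flags are hypothetical.

```python
import argparse


def build_parser():
    """Build a CLI where --db is accepted after any subcommand."""
    # Shared options live on a help-less parent that each subparser inherits.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--db", default="intel.duckdb")  # hypothetical default

    parser = argparse.ArgumentParser(prog="intel_graph.py")
    sub = parser.add_subparsers(dest="command", required=True)

    load = sub.add_parser("load", parents=[common])
    load.add_argument("--force", action="store_true")

    project = sub.add_parser("project", parents=[common])
    project.add_argument("--algo", choices=["leiden", "louvain"], default="leiden")

    verify = sub.add_parser("verify", parents=[common])
    verify.add_argument("--report")
    return parser
```

With this layout, `python scripts/intel_graph.py verify --db PATH` parses because `--db` is defined on the subparser itself rather than only on the top-level parser.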

Full test suite: 983 passed (938 pre-existing + 45 new). ruff clean.

Not merging

Per the overnight run's Gate 2(b): review the diff together with the smoke output above before merging. Unknown-signal hunting stays out of scope per the issue.

🤖 Generated with Claude Code

…j GDS lens (#85)

Adds scripts/intel_graph.py (standalone, read-only against the corpus):

- load: builds a DuckDB truth store (6-node/6-edge starter schema) from existing meta.json / transcripts / concepts.json / taxonomy.json, with corpus-wide lexical grounding (word-boundary regex over a contains prefilter) and provenance-bearing expresses rows. No new Gemini pass.
- project: entity co-occurrence (computed in DuckDB, taxonomy never contributes edges) projected into Neo4j under VI_-prefixed labels; community detection via seeded Leiden (deterministic; GDS Louvain has no randomSeed and flips a boundary pair run-to-run, kept behind --algo louvain) + PageRank via GDS; results written back to DuckDB.
- verify: issue #85 acceptance gate - alias-set cohesion vs a permutation baseline, three known cross-vocabulary pairs under a modal-anchored shared-community criterion with a megacommunity cap, every recovered link cited quote @ video @ timestamp.

Real-corpus result (1,260 videos / 29,321 segments / 7,442 surface terms): deterministic gate recovers 2/3 known pairs with citations (Ralph Wiggum == reliable agents; Cursor -> Claude Code); the Factorio == context engineering pair is partition-fragile and reported MISSED. Louvain recovers it in 2 of 5 runs - the instability finding is part of the deliverable (work/2026-07-02/ observations, session-local).

The frozen 25-query retrieval eval is untouched: no retrieval logic changed (search/index/hybrid paths not modified).

Includes an 8-reviewer ce-code-review pass; fixes applied: slug-collision entity dedup, --db subcommand parsing, empty published DATE cast, modal tie-degeneracy guard, taxonomy parse before table wipe, report persisted before stdout print (Windows cp1252), title-rotation sibling transcript guard, --force .duckdb-only unlink, set-based write-back, projection_meta lifecycle, driver try/finally.

Closes #85 phases 0-4 (prove-the-machine); unknown-signal hunting stays out of scope per the issue.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
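
The grounding discipline named in the commit message (word-boundary regex over a contains prefilter) can be illustrated with a short sketch. This is not the code from scripts/intel_graph.py: the function names, the `(segment_id, text)` row shape, and the whitespace normalization are assumptions made for the example.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so term and segment compare alike."""
    return " ".join(text.lower().split())


def find_mentions(term, segments):
    """Return ids of segments that mention `term` as a whole word or phrase.

    `segments` is an iterable of (segment_id, text) pairs.
    """
    needle = normalize(term)
    # Lookarounds instead of \b so terms that start or end with punctuation still work.
    pattern = re.compile(r"(?<!\w)" + re.escape(needle) + r"(?!\w)")
    hits = []
    for segment_id, text in segments:
        haystack = normalize(text)
        if needle not in haystack:  # cheap substring prefilter
            continue
        if pattern.search(haystack):  # precise word-boundary confirmation
            hits.append(segment_id)
    return hits
```

The prefilter discards most segments with a plain substring test, and the regex then rejects embedded hits, for example 'cursor' inside 'precursor'.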

gitguardian · 2026-07-02T06:39:04Z

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

GitGuardian id	GitGuardian status	Secret	Commit	Filename
18080992	Triggered	Generic Password	`b1a518c`	tests/test_intel_graph.py
18080992	Triggered	Generic Password	`b1a518c`	tests/test_intel_graph.py

🛠 Guidelines to remediate hardcoded secrets

Understand the implications of revoking this secret by investigating where it is used in your code.
Replace and store your secret safely, following best practices.
Revoke and rotate this secret.
If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future, consider:

following these best practices for managing and storing secrets, including API keys and other credentials
installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Four majors from the cross-model (GPT-5 Codex CLI) pass on PR #86:

- load_corpus wipes + rebuilds inside one DuckDB transaction; a fallible read mid-load rolls back to the previous store instead of leaving it emptied or half-rebuilt.
- project_to_neo4j invalidates community_id/pagerank/projection_meta BEFORE the fallible Neo4j phase; a failed projection now leaves verify loudly stateless instead of silently serving stale communities.
- Citation lookups are word-boundary matched (same discipline as grounding): a presented quote for 'cursor' can no longer be a 'precursor' hit; quote_around centers on the boundary match.
- check_pair.resolve dedups by entity_id across patterns so overlapping patterns cannot let one surface term vote twice in the modal count.

5 regression tests added (988 suite-wide green). Canonical gate rerun: verdict unchanged (2/3 recovered, same citations, Gate 1 lift +0.104).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

dzivkovi · 2026-07-02T06:49:26Z

Ready for review - two-pass review summary (Gate 2b: not merging without you)

Pass 1 - /ce-code-review (8 Anthropic personas). Caught and fixed in b1a518c: slug-collision entity merge (adversarial + correctness independently), --db after-subcommand argparse rejection (empirically verified), empty published DATE-cast crash, modal tie-degeneracy, taxonomy parse ordered before the table wipe, Windows cp1252 print-before-persist, title-rotation transcript splice, unguarded --force unlink, set-based write-back, projection_meta lifecycle, driver try/finally, Leiden doc drift, em-dashes, two P1 test gaps.

Pass 2 - Codex peer review (GPT-5 Codex via Codex CLI, session 019f218e-aa34-7e31-8b5a-231d4ae8a94d). Verdict: "block on fixes" with 4 majors - all cross-layer gaps the persona pass missed, all applied in 1cf5382:

Load wipe before fallible reads -> whole load now transactional (rollback restores the previous store).
Failed projection served STALE communities as current -> state invalidated before the Neo4j phase; verify fails loudly instead.
Citation lookup was raw substring while grounding was word-boundary -> a 'cursor' quote could be a 'precursor' hit; citations now boundary-matched.
Overlapping patterns let one entity vote twice in the modal count -> deduped by entity_id.
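
The first fix in the list (wipe and rebuild inside one transaction) follows a general pattern that can be shown in a few lines. The sketch below deliberately swaps in the standard-library sqlite3 module as a stand-in for DuckDB so it runs without extra dependencies; the table layout and function name are hypothetical, and the real load_corpus is not reproduced here.

```python
import sqlite3


def rebuild_store(con, read_rows):
    """Wipe and rebuild `entities` atomically.

    `read_rows` is a callable returning an iterable of (entity_id, name);
    it may raise partway through, for example on a partial cloud-mount read.
    """
    with con:  # commits on success, rolls back if the block raises
        con.execute("DELETE FROM entities")
        for entity_id, name in read_rows():
            con.execute(
                "INSERT INTO entities (entity_id, name) VALUES (?, ?)",
                (entity_id, name),
            )
```

Because the wipe and the inserts share one transaction, a read failure mid-load restores the previous rows instead of leaving the table empty or half-filled. The exception still propagates to the caller, so the failure stays visible.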

5 regression tests added; suite 988 green; canonical gate rerun after the fixes: verdict unchanged (2/3 recovered deterministically, same citations, Gate 1 lift +0.104).

Deferred from both passes (Saint-Exupery filter): citation-helper SQL dedup, batch-size constant, quadratic scaling ceilings (documented), canonical-prefix title tie-break (dedupe --apply is the precondition), RE2-vs-re.escape residual risk, ADR graduation (deliberate, roadmap rule).

Honest residual noted in the session observations: seeded-Leiden reproducibility is conditional on Neo4j store state - unrelated VI_ node churn between runs can split/merge one boundary community (68 -> 67 observed) without changing gate verdicts or citations.

Merge verdict from the overnight run: ready for your review; the empirical case is the PR body's smoke table plus work/2026-07-02/04-graph-weak-signal-observations.md (session-local).

Copilot

Pull request overview

Implements the issue #85 “weak-signal / commonality-detection” layer by introducing a DuckDB-backed truth store plus a disposable Neo4j GDS projection, along with an acceptance-gate verifier and comprehensive tests to prove recovery of known cross-vocabulary links with citations.

Changes:

Added scripts/intel_graph.py implementing load (DuckDB truth store + grounding), project (Neo4j projection + Leiden/Louvain + PageRank writeback), and verify (acceptance gate report).
Added tests/test_intel_graph.py with unit coverage for helpers/loader/gates plus a Neo4j+GDS integration test that skips when unavailable.
Documented the design/spec and wired up dependencies (neo4j) and repo architecture notes.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File	Description
scripts/intel_graph.py	New standalone weak-signal pipeline (DuckDB load, Neo4j projection, acceptance-gate verify).
tests/test_intel_graph.py	New test suite covering helpers, loader behavior, gate logic, and optional Neo4j integration.
pyproject.toml	Adds `neo4j>=5` to the `intelligence` optional dependency set.
docs/plans/2026-07-02-001-feat-intel-graph-weak-signal-plan.md	Adds the SPEC/plan describing goals, schema, gate, and operational constraints.
CLAUDE.md	Adds an architecture note documenting the new `intel_graph.py` utility and invariants.

+Usage:
+    python scripts/intel_graph.py load    [--output-dir DIR] [--db PATH] [--force]
+    python scripts/intel_graph.py project [--db PATH] [--neo4j-uri URI] [--algo leiden|louvain] [--gamma G] [--max-df N]
+    python scripts/intel_graph.py verify  [--db PATH] [--report PATH]


+    {
+        "name": "reliable agents == Ralph Wiggum loop / force-feed",
+        "user_patterns": ["reliab"],
+        "creator_patterns": ["ralph", "wiggum", "force-feed"],
+        "citation_phrases": ["ralph wiggum", "ralph loop", "force-feed"],
+    },


+    return con.execute(
+        """
+        WITH anchors AS (
+            SELECT DISTINCT artifact_id FROM segments WHERE contains(lower(text), ?)
+        ),
+        terms AS (
+            SELECT hc.entity_id FROM has_concept hc JOIN anchors USING (artifact_id)
+            UNION ALL
+            SELECT m.entity_id FROM mentions m JOIN segments s USING (segment_id)
+            JOIN anchors a ON a.artifact_id = s.artifact_id
+        )
+        SELECT e.entity_id, e.community_id
+        FROM (
+            SELECT t.entity_id, count(*) AS n
+            FROM terms t JOIN entities e2 ON e2.entity_id = t.entity_id
+            WHERE e2.community_id IS NOT NULL
+            GROUP BY 1 ORDER BY n DESC, t.entity_id LIMIT ?
+        ) ranked
+        JOIN entities e USING (entity_id)
+        """,
+        [normalize_phrase(phrase), limit],
+    ).fetchall()


+
+### Neo4j projection
+
+- All nodes labeled `VI_Entity` (namespaced); `project` deletes only `VI_`-prefixed labels on rebuild. Runs Louvain (seeded config, concurrency 1 for reproducibility) and PageRank via GDS, writes `community_id` and `pagerank` back to DuckDB `entities`.


github-project-automation Bot added this to Video-Intel Jul 2, 2026

github-project-automation Bot moved this to Inbox in Video-Intel Jul 2, 2026

dzivkovi moved this from Inbox to Review in Video-Intel Jul 2, 2026

dzivkovi requested a review from Copilot July 2, 2026 11:22

Copilot started reviewing on behalf of dzivkovi July 2, 2026 11:23 View session

Copilot AI reviewed Jul 2, 2026

dzivkovi wants to merge 2 commits into main from feat/intel-graph-weak-signal-85
