SIGSEGV in kuzu::NodeTable::lookupPK during sustained GraphRAG indexing (v10.0.7)

## Summary

`agent-brain index` reliably crashes with `SIGSEGV` (segmentation fault) inside `kuzu::storage::NodeTable::lookupPK` during sustained GraphRAG indexing, when `graphrag.store_type: kuzu` and `graphrag.doc_extractor: langextract` are enabled. Observed three times on the same corpus, each time after 2–3 hours of indexing and when the Kuzu DB has grown past ~1–2 GB. Vector + BM25 indexing is unaffected; the crash is isolated to the Kuzu native worker.

Workaround: switching `graphrag.store_type: kuzu` → `simple` eliminates the crash (no native code on the hot path).

## Environment

| Component | Version |
|---|---|
| `agent-brain-cli` | 10.0.7 (PyPI, installed via `uv tool install --with "agent-brain-rag[graphrag-all]==10.0.7" agent-brain-cli==10.0.7`) |
| `agent-brain-rag` | 10.0.7 |
| `kuzu` | 0.11.3 |
| `llama-index-graph-stores-kuzu` | 0.9.1 |
| `langextract` | 1.5.0 |
| Python | 3.12.9 |
| OS | macOS 26.2 (25C56), arm64 (Mac14,6) |
| RAM | 96 GB (memory pressure was never high — 93% free at crash time) |

## Config (relevant subset)

```yaml
embedding:
  provider: "openai"
  model: "text-embedding-3-large"
  api_key_env: "OPENAI_API_KEY"

summarization:
  provider: "anthropic"
  model: "claude-haiku-4-5-20251001"
  api_key_env: "ANTHROPIC_API_KEY"

storage:
  backend: "chroma"

graphrag:
  enabled: true
  store_type: "kuzu"       # ← the offending setting
  use_code_metadata: true
  doc_extractor: "langextract"
  traversal_depth: 2
  max_triplets_per_chunk: 10
```

## Reproduction

1. Configure as above.
2. Index a corpus of ~230 markdown documents (~1,984 chunks at chunk_size=1024, chunk_overlap=100):
   ```bash
   agent-brain index ./corpus_dir \
     --chunk-size 1024 \
     --chunk-overlap 100 \
     --exclude-patterns "**/images/**"
   ```
3. Let it run. Vector + BM25 phase completes quickly (~1 min). GraphRAG phase runs at ~17 chunks/min via LangExtract → gpt-4o-mini.
4. After ~2 hours, when Kuzu DB has grown to ~1–2 GB, the server process gets SIGSEGV (process is killed by the kernel).
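As a rough cross-check of the figures above, the sketch below estimates how long the GraphRAG phase should take. It uses only numbers stated in this report (chunk count, throughput, and the 7200 s job timeout noted under Observed crashes) and assumes the ~17 chunks/min rate stays constant for the whole run, which has not been measured.

```python
# Figures taken from this report; the constant rate is an assumption.
TOTAL_CHUNKS = 1984
CHUNKS_PER_MIN = 17
JOB_TIMEOUT_S = 7200

graphrag_minutes = TOTAL_CHUNKS / CHUNKS_PER_MIN
graphrag_seconds = graphrag_minutes * 60

print(f"GraphRAG phase: ~{graphrag_minutes:.0f} min ({graphrag_seconds:.0f} s)")
print(f"Share of job timeout: {graphrag_seconds / JOB_TIMEOUT_S:.0%}")
```

Under that assumption a full pass needs roughly two hours, which leaves very little headroom under a 7200 s timeout.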

## Observed crashes

| Server PID | Started | Died | Lifetime | Kuzu size at death |
|---|---|---|---|---|
| 90292 | 14:31 PDT | 16:33 PDT | 2h02m | unknown (hit the 7200 s job timeout before a segfault could occur) |
| 59334 | 16:44 PDT | 19:25 PDT | 2h41m | ~2.3 GB |
| 92503 | 19:50 PDT | 22:18 PDT | 2h28m | ~1.0 GB (had been auto-recovered from snapshot) |

The crash report (Apple `.ips` format) is for an interim process `pid 88888` that segfaulted within 5 minutes of restart — likely while opening the Kuzu DB that had been left locked by the killed `pid 59334`.

## Crash report

```
"exception": {
  "type": "EXC_BAD_ACCESS",
  "signal": "SIGSEGV",
  "subtype": "KERN_INVALID_ADDRESS at 0x0000000000000008"
}
"termination": {
  "namespace": "SIGNAL",
  "indicator": "Segmentation fault: 11"
}
```

### Faulting thread backtrace (thread 15)

```
#0  _kuzu.cpython-312-darwin.so  0x663814
#1  _kuzu.cpython-312-darwin.so  0x54c704
#2  _kuzu.cpython-312-darwin.so  0x5f4854
#3  _kuzu.cpython-312-darwin.so  0x5fcb74
#4  _kuzu.cpython-312-darwin.so  0x5f8bb0
#5  _kuzu.cpython-312-darwin.so  0x647150  kuzu::storage::NodeTable::lookupPK(
                                              kuzu::transaction::Transaction const*,
                                              kuzu::common::ValueVector*,
                                              unsigned long long,
                                              unsigned long long&) const
#6  _kuzu.cpython-312-darwin.so  0x563604
#7  _kuzu.cpython-312-darwin.so  0x596cc4  kuzu::processor::PhysicalOperator::getNextTuple(
                                              kuzu::processor::ExecutionContext*)
#8  _kuzu.cpython-312-darwin.so  0x59aa58
#9  _kuzu.cpython-312-darwin.so  0x5aca08
#10 _kuzu.cpython-312-darwin.so  0x9edec   kuzu::common::TaskScheduler::runWorkerThread()
#11 _kuzu.cpython-312-darwin.so  0x9f5e0
#12 libsystem_pthread.dylib      0x6c08    _pthread_start
#13 libsystem_pthread.dylib      0x1ba8    thread_start
```

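The fault address `0x0000000000000008` is what a read of a field at offset 8 through a null base pointer produces. The faulting frame (#0) is unsymbolicated and sits five calls below `NodeTable::lookupPK` (#5), which is itself reached from `getNextTuple` on a Kuzu worker thread. This looks like an internal storage pointer that was unset or already released when it was read: possibly a use-after-free or torn read, and likely a concurrency bug between the LangExtract triplet-writer thread and Kuzu's own query worker. The trace alone does not confirm the race.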
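To illustrate why an address of `0x8` suggests a null base pointer, here is a small sketch with a hypothetical two-field record. The `Entry` layout is invented for illustration and is not taken from Kuzu's source; the point is only that reading the second 8-byte field through a null pointer touches address 0 + 8.

```python
import ctypes

class Entry(ctypes.Structure):
    # Hypothetical layout, not Kuzu's actual structure.
    _fields_ = [
        ("first", ctypes.c_uint64),   # offset 0
        ("second", ctypes.c_uint64),  # offset 8
    ]

null_base = 0  # address held by a null pointer
fault_address = null_base + Entry.second.offset

print(hex(fault_address))  # 0x8
```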

## Server log around death

Just before the kill, the log shows normal LangExtract activity — successful OpenAI calls every few seconds, periodic `graph_snapshot.write: wrote snapshot` events. No Python exception, no traceback. The last line in every case is:

```
/Users/.../python3.12/multiprocessing/resource_tracker.py:255: UserWarning:
resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```

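That warning comes from `multiprocessing`'s resource tracker, a separate helper process that cleans up named semaphores left behind when the main process exits without releasing them. It is consistent with the server dying abruptly from SIGSEGV rather than shutting down normally.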
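Because the log ends without a Python traceback, one option for future runs is Python's standard `faulthandler` module, which dumps the Python-level stack to stderr when a fatal signal arrives. The sketch below is not agent-brain code: it launches a throwaway child interpreter with `-X faulthandler`, has the child send itself SIGSEGV, and shows both the dump and the negative return code a supervising process would see on POSIX. Setting `PYTHONFAULTHANDLER=1` in the server's environment should have the same effect, though that has not been tried against agent-brain.

```python
import signal
import subprocess
import sys

# Child process: kill itself with SIGSEGV to mimic a native crash.
CHILD_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

result = subprocess.run(
    [sys.executable, "-X", "faulthandler", "-c", CHILD_CODE],
    capture_output=True,
    text=True,
)

# On POSIX a negative return code means "terminated by that signal".
died_from_segv = result.returncode == -signal.SIGSEGV
print("killed by SIGSEGV:", died_from_segv)
print(result.stderr)  # faulthandler's dump of the Python stack
```

A return-code check like this would also let a wrapper script tell a segfault apart from a job timeout, which the first row of the crash table could not.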

## v10.0.6 / v10.0.7 self-heal observed

After each crash, the restart correctly detected the Kuzu DB as corrupted (file lock held by the dead PID) and logged:

```
[WARNING] agent_brain_server.storage.graph_store:
  Kuzu graph store at .../kuzu_db appears corrupted (likely from a prior process kill
  mid-indexing): IO exception: Could not set lock on file ...
  Renaming to .corrupted-<ts> and starting fresh.

[INFO]  agent_brain_server.storage.graph_store:
  Quarantined corrupted Kuzu files: db=.../kuzu_db.corrupted-20260528T005033Z

[WARNING] agent_brain_server.storage.graph_store:
  Restored 110 triplets from snapshot snapshot-2026-05-28T00-50-33Z.json
  after recovering corrupted kuzu_db at .../kuzu_db
```

The self-heal is excellent — it preserved a 2.4 GB corrupted DB for forensics and restored the latest snapshot. But the underlying segfault was not fixed by 10.0.6/10.0.7; it just got better recovery. Each subsequent indexing run re-creates the conditions for the same crash.
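For readers unfamiliar with the quarantine step, the behavior the log describes can be approximated as follows. This is a hypothetical sketch, not agent-brain's implementation: the function name `quarantine` is invented, and only the `.corrupted-<ts>` suffix and its timestamp format are taken from the log lines above.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def quarantine(db_path: Path) -> Path:
    """Rename db_path to '<name>.corrupted-<UTC timestamp>' and return the new path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = db_path.with_name(f"{db_path.name}.corrupted-{stamp}")
    db_path.rename(target)  # the original bytes stay on disk for forensics
    return target

# Demo on a throwaway file standing in for a real database path.
with tempfile.TemporaryDirectory() as tmp:
    fake_db = Path(tmp) / "kuzu_db"
    fake_db.write_bytes(b"not a real database")
    moved = quarantine(fake_db)
    print(moved.name)
    demo_ok = moved.exists() and not fake_db.exists()
```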

## Workaround

Switch graph store to in-memory JSON:

```yaml
graphrag:
  store_type: "simple"   # was: "kuzu"
```

This routes around the native code entirely. Trade-off: graph state lives in memory + JSON snapshots, but no SIGSEGV.

## Suggestions

1. The narrowest likely fix targets a locking or refcount issue inside `NodeTable::lookupPK` (or one of its callees) when it runs on a query worker while a writer thread is mutating the same node. Either:
   - Take a read lock on the page in `lookupPK` before dereferencing the index entry, or
   - Validate the entry pointer before reading offset 0x8.

2. Consider adding a config option to bound the Kuzu DB size at which `agent-brain index` automatically pauses + checkpoints (e.g. `graphrag.kuzu_max_db_mb`) as a defense-in-depth measure for users hitting this in the wild.

3. The Kuzu version pinned in 10.0.7's extras (`kuzu==0.11.3`) is a few releases old at time of writing. It is worth checking whether Kuzu has already shipped a fix for this code path in a newer point release, in which case bumping the pin may be enough.
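Suggestion 2 could be prototyped outside agent-brain along these lines. The sketch rests on several assumptions: `graphrag.kuzu_max_db_mb` does not exist today, `db_size_mb` and `should_pause` are made-up helper names, and the code only measures on-disk size (handling both a single-file and a directory layout) without touching Kuzu itself.

```python
import tempfile
from pathlib import Path

def db_size_mb(db_path: Path) -> float:
    """Total on-disk size in MiB of a database stored as a file or a directory."""
    if db_path.is_file():
        total = db_path.stat().st_size
    else:
        total = sum(p.stat().st_size for p in db_path.rglob("*") if p.is_file())
    return total / (1024 * 1024)

def should_pause(db_path: Path, max_db_mb: float) -> bool:
    """True once the database has reached the configured size bound."""
    return db_size_mb(db_path) >= max_db_mb

# Demo with a 2 MiB stand-in file and a hypothetical 1 MiB bound.
with tempfile.TemporaryDirectory() as tmp:
    fake_db = Path(tmp) / "kuzu_db"
    fake_db.write_bytes(b"\0" * (2 * 1024 * 1024))
    size_mb = db_size_mb(fake_db)
    pause = should_pause(fake_db, max_db_mb=1)
    print(f"{size_mb:.1f} MiB, pause={pause}")
```

An indexing loop could call `should_pause` between batches and checkpoint before continuing. Whether size is the real trigger for the crash is still an open question.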

Happy to capture and share the full `.ips` file privately if it would help — it's 122 KB of JSON and contains stack traces for all threads, not just the faulting one.
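For anyone who receives the file, a minimal sketch for pulling the faulting thread out of it is below. It assumes the usual `.ips` layout of a one-line JSON header followed by a JSON body, and the key names `faultingThread`, `threads`, `frames`, `imageIndex`, `imageOffset`, `symbol` and `usedImages` are assumptions that should be checked against the actual file. The sample input is synthetic and only echoes values from the excerpt in this issue.

```python
import json

def load_ips(text: str) -> dict:
    """Parse .ips text: a one-line JSON header followed by a JSON body."""
    header_line, _, body = text.partition("\n")
    report = json.loads(body)
    report["_header"] = json.loads(header_line)
    return report

def faulting_frames(report: dict) -> list[str]:
    """Format the faulting thread's frames as '#n image offset symbol'."""
    images = report.get("usedImages", [])
    thread = report["threads"][report["faultingThread"]]
    lines = []
    for i, frame in enumerate(thread.get("frames", [])):
        idx = frame.get("imageIndex", -1)
        image = images[idx].get("name", "?") if 0 <= idx < len(images) else "?"
        symbol = frame.get("symbol", "")
        offset = hex(frame.get("imageOffset", 0))
        lines.append(f"#{i} {image} {offset} {symbol}".rstrip())
    return lines

# Synthetic two-part report standing in for a real .ips file.
SAMPLE = "\n".join([
    json.dumps({"note": "synthetic header"}),
    json.dumps({
        "exception": {"type": "EXC_BAD_ACCESS", "signal": "SIGSEGV"},
        "faultingThread": 0,
        "usedImages": [{"name": "_kuzu.cpython-312-darwin.so"}],
        "threads": [{"frames": [{"imageIndex": 0, "imageOffset": 0x663814}]}],
    }),
])

report = load_ips(SAMPLE)
frames = faulting_frames(report)
print(report["exception"]["signal"], frames)
```

For a real report, pass the file's contents to `load_ips` instead of `SAMPLE`.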

