SIGSEGV in kuzu::NodeTable::lookupPK during sustained GraphRAG indexing (v10.0.7) #178

@RichardHightower

Summary

agent-brain index reliably crashes with SIGSEGV (segmentation fault) inside kuzu::storage::NodeTable::lookupPK during sustained GraphRAG indexing, when graphrag.store_type: kuzu and graphrag.doc_extractor: langextract are enabled. Observed three times on the same corpus, each time after 2–3 hours of indexing and when the Kuzu DB has grown past ~1–2 GB. Vector + BM25 indexing is unaffected; the crash is isolated to the Kuzu native worker.

Workaround: switching graphrag.store_type from kuzu to simple eliminates the crash (no native code on the hot path).

Environment

| Component | Version |
| --- | --- |
| agent-brain-cli | 10.0.7 (PyPI, installed via uv tool install --with "agent-brain-rag[graphrag-all]==10.0.7" agent-brain-cli==10.0.7) |
| agent-brain-rag | 10.0.7 |
| kuzu | 0.11.3 |
| llama-index-graph-stores-kuzu | 0.9.1 |
| langextract | 1.5.0 |
| Python | 3.12.9 |
| OS | macOS 26.2 (25C56), arm64 (Mac14,6) |
| RAM | 96 GB (memory pressure was never high; 93% free at crash time) |

Config (relevant subset)

embedding:
  provider: "openai"
  model: "text-embedding-3-large"
  api_key_env: "OPENAI_API_KEY"

summarization:
  provider: "anthropic"
  model: "claude-haiku-4-5-20251001"
  api_key_env: "ANTHROPIC_API_KEY"

storage:
  backend: "chroma"

graphrag:
  enabled: true
  store_type: "kuzu"       # ← the offending setting
  use_code_metadata: true
  doc_extractor: "langextract"
  traversal_depth: 2
  max_triplets_per_chunk: 10

Reproduction

  1. Configure as above.
  2. Index a corpus of ~230 markdown documents (~1,984 chunks at chunk_size=1024, chunk_overlap=100):
    agent-brain index ./corpus_dir \
      --chunk-size 1024 \
      --chunk-overlap 100 \
      --exclude-patterns "**/images/**"
  3. Let it run. Vector + BM25 phase completes quickly (~1 min). GraphRAG phase runs at ~17 chunks/min via LangExtract → gpt-4o-mini.
  4. After ~2 hours, when Kuzu DB has grown to ~1–2 GB, the server process gets SIGSEGV (process is killed by the kernel).
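As a rough cross-check of the timings in the steps above, the numbers reported there can be combined into an estimate of how long the GraphRAG phase should take. This is only back-of-the-envelope arithmetic: it assumes throughput stays constant at ~17 chunks/min and that the 7200s job timeout mentioned in the crash table covers the whole job.

```python
# Figures taken from this report; constant throughput is an assumption.
total_chunks = 1984      # chunks produced at chunk_size=1024, chunk_overlap=100
chunks_per_min = 17      # observed GraphRAG rate via LangExtract
job_timeout_s = 7200     # job timeout that ended the first run

est_minutes = total_chunks / chunks_per_min
headroom_s = job_timeout_s - est_minutes * 60

print(f"estimated GraphRAG phase: {est_minutes:.0f} min")
print(f"headroom before job timeout: {headroom_s:.0f} s")
```

Under those assumptions a full pass needs almost the entire timeout window, so any slowdown in extraction would push a run past 7200s even without a segfault.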

Observed crashes

| Server PID | Started | Died | Lifetime | Kuzu size at death |
| --- | --- | --- | --- | --- |
| 90292 | 14:31 PDT | 16:33 PDT | 2h02m | unknown (hit 7200s job timeout before segfault opportunity) |
| 59334 | 16:44 PDT | 19:25 PDT | 2h41m | ~2.3 GB |
| 92503 | 19:50 PDT | 22:18 PDT | 2h28m | ~1.0 GB (had been auto-recovered from snapshot) |

The crash report (Apple .ips format) is for an interim process pid 88888 that segfaulted within 5 minutes of restart — likely while opening the Kuzu DB that had been left locked by the killed pid 59334.

Crash report

"exception": {
  "type": "EXC_BAD_ACCESS",
  "signal": "SIGSEGV",
  "subtype": "KERN_INVALID_ADDRESS at 0x0000000000000008"
}
"termination": {
  "namespace": "SIGNAL",
  "indicator": "Segmentation fault: 11"
}

Faulting thread backtrace (thread 15)

#0  _kuzu.cpython-312-darwin.so  0x663814
#1  _kuzu.cpython-312-darwin.so  0x54c704
#2  _kuzu.cpython-312-darwin.so  0x5f4854
#3  _kuzu.cpython-312-darwin.so  0x5fcb74
#4  _kuzu.cpython-312-darwin.so  0x5f8bb0
#5  _kuzu.cpython-312-darwin.so  0x647150  kuzu::storage::NodeTable::lookupPK(
                                              kuzu::transaction::Transaction const*,
                                              kuzu::common::ValueVector*,
                                              unsigned long long,
                                              unsigned long long&) const
#6  _kuzu.cpython-312-darwin.so  0x563604
#7  _kuzu.cpython-312-darwin.so  0x596cc4  kuzu::processor::PhysicalOperator::getNextTuple(
                                              kuzu::processor::ExecutionContext*)
#8  _kuzu.cpython-312-darwin.so  0x59aa58
#9  _kuzu.cpython-312-darwin.so  0x5aca08
#10 _kuzu.cpython-312-darwin.so  0x9edec   kuzu::common::TaskScheduler::runWorkerThread()
#11 _kuzu.cpython-312-darwin.so  0x9f5e0
#12 libsystem_pthread.dylib      0x6c08    _pthread_start
#13 libsystem_pthread.dylib      0x1ba8    thread_start

The fault address (0x0000000000000008) is an access at offset 8 from a null pointer, inside NodeTable::lookupPK during a getNextTuple call. That pattern looks like a use-after-free or a torn read of an internal storage pointer, likely a concurrency bug between the LangExtract triplet-writer thread and Kuzu's own query worker.
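One way to test the concurrency hypothesis from the Python side would be to force every graph-store call through a single lock and see whether the crash still occurs. The sketch below is a generic proxy, not agent-brain's or Kuzu's API: `SerializedStore`, `FakeGraphStore`, and `upsert_triplet` are hypothetical names used only for illustration. It also cannot serialize Kuzu's internal worker threads within a single query, so it is a diagnostic experiment rather than a fix.

```python
import threading
import time
from typing import Any


class SerializedStore:
    """Proxy that runs every method call on `inner` under one shared lock."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner
        self._lock = threading.RLock()

    def __getattr__(self, name: str) -> Any:
        # Only invoked for names not found on the proxy itself.
        attr = getattr(self._inner, name)
        if not callable(attr):
            return attr

        def locked(*args: Any, **kwargs: Any) -> Any:
            with self._lock:
                return attr(*args, **kwargs)

        return locked


class FakeGraphStore:
    """Hypothetical stand-in that records how many callers overlap."""

    def __init__(self) -> None:
        self.active = 0
        self.max_active = 0

    def upsert_triplet(self, subj: str, rel: str, obj: str) -> None:
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.001)  # widen the window in which callers could overlap
        self.active -= 1


def run_demo(n_threads: int = 8) -> int:
    """Call the wrapped store from several threads; return peak overlap."""
    store = SerializedStore(FakeGraphStore())
    threads = [
        threading.Thread(target=store.upsert_triplet, args=("a", "rel", "b"))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.max_active
```

If the segfault disappears with all Python-side callers serialized, that points at concurrent access from the application; if it persists, the race is more likely internal to Kuzu's own task scheduler.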

Server log around death

Just before the kill, the log shows normal LangExtract activity — successful OpenAI calls every few seconds, periodic graph_snapshot.write: wrote snapshot events. No Python exception, no traceback. The last line in every case is:

/Users/.../python3.12/multiprocessing/resource_tracker.py:255: UserWarning:
resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

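That warning comes from multiprocessing's resource_tracker, a separate helper process that cleans up named semaphores left behind when the main process dies without unlinking them. It is consistent with the server being killed by SIGSEGV rather than exiting normally.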
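Because the log contains no Python traceback, it may help to turn on the standard-library faulthandler module, which dumps the Python stack of every thread when the process receives SIGSEGV. The sketch below assumes there is some place to run code early in server startup; whether agent-brain exposes such a hook is not known here, and `enable_crash_tracebacks` and the log path are hypothetical. Setting the environment variable PYTHONFAULTHANDLER=1 before launching the server achieves the same effect without code changes, writing to stderr.

```python
import faulthandler


def enable_crash_tracebacks(path: str):
    """Dump Python tracebacks for all threads to `path` on a fatal signal.

    The returned file object must stay open for the life of the process,
    because faulthandler writes to its file descriptor at crash time.
    """
    log = open(path, "a")
    faulthandler.enable(file=log, all_threads=True)
    return log
```

faulthandler reports Python frames only, so it complements the .ips native backtrace by showing which Python call had entered Kuzu when the fault occurred.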

v10.0.6 / v10.0.7 self-heal observed

After each crash, the restart correctly detected the Kuzu DB as corrupted (file lock held by the dead PID), quarantined it, and restored from the latest snapshot:

[WARNING] agent_brain_server.storage.graph_store:
  Kuzu graph store at .../kuzu_db appears corrupted (likely from a prior process kill
  mid-indexing): IO exception: Could not set lock on file ...
  Renaming to .corrupted-<ts> and starting fresh.

[INFO]  agent_brain_server.storage.graph_store:
  Quarantined corrupted Kuzu files: db=.../kuzu_db.corrupted-20260528T005033Z

[WARNING] agent_brain_server.storage.graph_store:
  Restored 110 triplets from snapshot snapshot-2026-05-28T00-50-33Z.json
  after recovering corrupted kuzu_db at .../kuzu_db

The self-heal is excellent — it preserved a 2.4 GB corrupted DB for forensics and restored the latest snapshot. But the underlying segfault was not fixed by 10.0.6/10.0.7; it just got better recovery. Each subsequent indexing run re-creates the conditions for the same crash.

Workaround

Switch graph store to in-memory JSON:

graphrag:
  store_type: "simple"   # was: "kuzu"

This routes around the native code entirely. Trade-off: graph state lives in memory + JSON snapshots, but no SIGSEGV.

Suggestions

  1. The narrowest likely fix is a lock/refcount issue inside NodeTable::lookupPK when called from a query worker while another writer thread is mutating the same node. Either:

    • Take a read lock on the page in lookupPK before dereferencing the index entry, or
    • Validate the entry pointer before reading offset 0x8.
  2. Consider adding a config option to bound the Kuzu DB size at which agent-brain index automatically pauses + checkpoints (e.g. graphrag.kuzu_max_db_mb) as a defense-in-depth measure for users hitting this in the wild.

  3. The Kuzu version pinned in 10.0.7's extras (kuzu==0.11.3) is a few releases old at time of writing — worth checking whether Kuzu themselves have shipped a fix for this code path in a newer point release before adopting upstream.
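The size guard proposed in suggestion 2 could be sketched roughly as follows. Nothing here exists in agent-brain today: `db_size_mb`, `should_pause`, and the `max_db_mb` threshold (standing in for the proposed graphrag.kuzu_max_db_mb option) are hypothetical. The helper handles both a single-file and a directory layout because the on-disk shape of kuzu_db is not confirmed in this report.

```python
from pathlib import Path


def db_size_mb(db_path: Path) -> float:
    """Return the on-disk size of a database file or directory in MiB."""
    if db_path.is_file():
        total = db_path.stat().st_size
    elif db_path.is_dir():
        total = sum(p.stat().st_size for p in db_path.rglob("*") if p.is_file())
    else:
        total = 0  # path does not exist yet
    return total / (1024 * 1024)


def should_pause(db_path: Path, max_db_mb: float) -> bool:
    """True once the database has reached the configured size bound."""
    return db_size_mb(db_path) >= max_db_mb
```

An indexing loop could call `should_pause` between batches and, when it returns True, stop issuing writes and take a snapshot before continuing. Given the sizes observed at death (~1.0 GB and ~2.3 GB), a threshold below 1 GB would be needed to trigger before the crashes seen here.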

Happy to capture and share the full .ips file privately if it would help — it's 122 KB of JSON and contains stack traces for all threads, not just the faulting one.
