Skip to content

Prisma SQLite engine query timeout supersedes PRAGMA busy_timeout — HAPPIER_SQLITE_BUSY_TIMEOUT_MS effectively dead under contention (self-hosted light) #176

@danljungstrom

Description

@danljungstrom

Summary

On the self-hosted light variant (SQLite backend), Prisma's engine raises PrismaClientKnownRequestError: Socket timeout (the database failed to respond to a query within the configured timeout) after ~5 seconds when SQLite is under sustained write contention. This happens regardless of PRAGMA busy_timeout — including busy_timeout = 30000 (30 s), which is set by applySqliteRuntimePragmas (apps/server/sources/storage/prisma.ts) and verified to apply to the connection where the PRAGMA is issued.

In effect, HAPPIER_SQLITE_BUSY_TIMEOUT_MS does not protect writes from being aborted: any caller queued behind SQLite's writer lock for more than ~5 s gets Socket timeout, even though the lock would have released within busy_timeout.

This produces user-visible disconnects: the daemon's POST /v1/machines (and any other write-path control call) returns 500 / times out while session-presence flushes monopolise the writer slot. See companion issue #175 on the sessionCache flush amplifier — these two fixes are independent and stack.

Affected component

  • Variant: self-hosted light only (Cloud / main.ts uses MySQL, unaffected)
  • File: apps/server/sources/storage/prisma.tsapplySqliteRuntimePragmas, getPrismaClient construction
  • Datasource: apps/server/prisma/sqlite/schema.prisma
  • Verified against dev HEAD 379c9941 (2026-05-15) — Prisma client construction is unchanged: new PrismaClient() without a driver adapter

Root cause

Two independent layers govern how long a write can wait for the SQLite writer lock:

  1. SQLite's busy_timeout PRAGMA — set per connection. Tells SQLite "if the writer lock is held, retry for up to N ms before returning SQLITE_BUSY."
  2. Prisma engine's per-query IPC timeout — set by the engine binary's internal client-server protocol, not by application code. Default for SQLite is roughly 5 s and is not configurable via the public Prisma API (the ?socket_timeout= URL parameter is Postgres-only).

When the engine timeout (~5 s) is shorter than busy_timeout (30 s), the engine fails the query before busy_timeout has a chance to succeed. The PRAGMA value is effectively dead.

Additionally, applySqliteRuntimePragmas is invoked once via client.$queryRawUnsafe('PRAGMA …') at startup, on whichever pool connection the call lands on. Prisma's SQLite engine binary does not expose a per-connection setup hook, so subsequent pool connections start fresh with the SQLite defaults (busy_timeout = 0, synchronous = FULL). The PRAGMAs only stick on connections that happen to share the same engine-side state — not reliably across the pool.

This second issue means that even within the 5 s engine window, some writes get instant SQLITE_BUSY failures because their connection never received the busy_timeout PRAGMA.

Reproduction

  1. Self-host light variant, SQLite, with HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000 in stack env (verified in /proc/<server-pid>/environ)
  2. Drive enough concurrent writes that any single one queues for more than ~5 s (≥6 active sessions doing presence flushes is sufficient — see companion issue)
  3. Observe in stack log:
[server] [07:16:40.608] ERROR: Error updating session: PrismaClientKnownRequestError:
[server] Invalid `db.session.updateMany()` invocation in
[server]   …/sessionCache.ts:338:38
[server] Socket timeout (the database failed to respond to a query within the configured timeout).
[server] stack: "PrismaClientKnownRequestError: Socket timeout (...)
[server]     at Qr.transaction (…/sqlite-client/runtime/library.js:115:11521)
[server]     at async Proxy._transactionWithCallback (…/sqlite-client/runtime/library.js:134:8044)
[server]     at async inTx (…/storage/inTx.ts:59:76)
[server]     at async Object.<anonymous> (…/api/routes/machines/machinesRoutes.ts:297:27)"
  1. Time-to-failure for any failing query is ~5 s, never approaching the 30 s busy_timeout configured

Evidence

  • applySqliteRuntimePragmas is called once at startup; the source comment documents the intent of these PRAGMAs but the engine timeout overrides the wait window in practice
  • Stack process.env confirmed correct via /proc/<pid>/environ for all four levels of the process tree (hstack.mjs, run.mjs, tsx wrapper, server child) — HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000 is present
  • WAL is healthy: wal_checkpoint(TRUNCATE) returns clean, integrity check passes, freelist 0
  • Identical symptom across multiple incidents (*.corrupt-bak, *.preswap, *.pre-vacuum siblings in server-light/) suggests the issue compounds over time, occasionally corrupting the DB outright

Proposed remediation

Two complementary fixes, in priority order.

Fix A — Switch to @prisma/adapter-better-sqlite3 with per-connection PRAGMA hook (structural)

Prisma supports driver adapters that let the application own the connection. With better-sqlite3, a connection-setup hook can apply busy_timeout, synchronous, and journal_mode on every connection deterministically.

Sketch:

import { PrismaBetterSQLite3 } from '@prisma/adapter-better-sqlite3';
import Database from 'better-sqlite3';

function createSqliteAdapter(env: NodeJS.ProcessEnv) {
    const pragmas = resolveSqliteRuntimePragmasFromEnv(env);
    return new PrismaBetterSQLite3({
        url: env.DATABASE_URL!,
        afterConnect: (db: Database.Database) => {
            db.pragma(`journal_mode = ${pragmas.journalMode}`);
            db.pragma(`synchronous = ${pragmas.synchronous}`);
            db.pragma(`busy_timeout = ${pragmas.busyTimeoutMs}`);
        },
    });
}

const adapter = createSqliteAdapter(process.env);
_db = new PrismaClient({ adapter });

This also bypasses the Prisma engine IPC entirely (better-sqlite3 is in-process), so the ~5 s engine timeout disappears. SQLite's busy_timeout becomes the only wait-time governor — which is what HAPPIER_SQLITE_BUSY_TIMEOUT_MS is documented to do.

Effort: medium-large. New dependency (@prisma/adapter-better-sqlite3, better-sqlite3). Refactor prisma.ts client construction. Re-validate transaction semantics. 1–2 days incl. tests + smoke pass against full server boot path.

Fix B — Wrap writes in a transient-retry helper (band-aid, ships fast)

For any P2024 Socket timeout, retry the operation up to N times with small backoff. This does not fix the root cause but eliminates the user-visible cascade.

export async function withSqliteRetry<T>(
    op: () => Promise<T>,
    { attempts = 5, backoffMs = 250 } = {},
): Promise<T> {
    let lastErr: unknown;
    for (let i = 0; i < attempts; i++) {
        try {
            return await op();
        } catch (e: any) {
            if (e?.code !== 'P2024') throw e;
            lastErr = e;
            await new Promise(r => setTimeout(r, backoffMs * (i + 1)));
        }
    }
    throw lastErr;
}

Apply at strategic call sites (presence flush, machine registration, usage report). Eliminates the daemon's machine-registration retry storm.

Effort: small-medium. ~30-line helper + call-site adoption. Few hours.

Fix C — Document the trade-off

If neither A nor B is adopted, HAPPIER_SQLITE_BUSY_TIMEOUT_MS should be documented as effectively capped at the Prisma engine timeout (~5 s) for the SQLite engine binary path. Today the env var name and description suggest it controls the actual wait window, which is misleading.

Effort estimate

  • Fix A (structural): medium-large — 1–2 days
  • Fix B (retry wrapper): small-medium — 4–6 h
  • Fix C (docs only): small — 30 min

Recommended: ship B first (immediate operator relief), then A to close the structural gap, then deprecate the engine-IPC code path.

Investigation done

  • Confirmed env var reaches all four levels of the server process tree (hstack.mjs, run.mjs, tsx, server child) via /proc/<pid>/environ
  • Read applySqliteRuntimePragmas and resolveSqliteRuntimePragmasFromEnv — values resolve correctly (defaults: WAL, NORMAL, 5000 ms; user-set: 30000 ms)
  • Captured full stack trace pointing at sqlite-client/runtime/library.js:115:11521 (the engine IPC layer, not SQLite proper) — confirms the timeout is engine-side, not SQLite-side
  • Time-to-failure is consistently ~5 s across multiple captured errors, matching documented Prisma engine timeout, not the configured 30 s busy_timeout
  • Verified the per-connection PRAGMA limitation against Prisma engine semantics — $queryRawUnsafe runs on one connection, not all pool connections
  • Reviewed all dev HEAD commits since v0.2.0 touching prisma / storage — none address either layer of this problem
  • Confirmed Cloud variant (main.ts / MySQL) is structurally immune (different driver, no per-process writer lock)

Related

Companion issue: #175 — SessionCache flush amplifies contention by issuing N sequential per-session writes per tick, producing the queue depth that triggers the 5 s engine timeout. Either fix here removes the cascade; both together remove the structural risk.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions