Prisma SQLite engine query timeout supersedes PRAGMA busy_timeout — HAPPIER_SQLITE_BUSY_TIMEOUT_MS effectively dead under contention (self-hosted light)

## Summary

On the self-hosted **light** variant (SQLite backend), Prisma's engine raises `PrismaClientKnownRequestError: Socket timeout (the database failed to respond to a query within the configured timeout)` after ~5 seconds when SQLite is under sustained write contention. This happens **regardless of `PRAGMA busy_timeout`** — including `busy_timeout = 30000` (30 s), which is set by `applySqliteRuntimePragmas` (`apps/server/sources/storage/prisma.ts`) and verified to apply to the connection where the PRAGMA is issued.

In effect, `HAPPIER_SQLITE_BUSY_TIMEOUT_MS` does not protect writes from being aborted: any caller queued behind SQLite's writer lock for more than ~5 s gets `Socket timeout`, even though the lock would have released within `busy_timeout`.

This produces user-visible disconnects: the daemon's `POST /v1/machines` (and any other write-path control call) returns 500 / times out while session-presence flushes monopolise the writer slot. See companion issue #175 on the sessionCache flush amplifier — these two fixes are independent and stack.

## Affected component

- Variant: self-hosted light only (Cloud / `main.ts` uses MySQL, unaffected)
- File: `apps/server/sources/storage/prisma.ts` — `applySqliteRuntimePragmas`, `getPrismaClient` construction
- Datasource: `apps/server/prisma/sqlite/schema.prisma`
- Verified against `dev` HEAD `379c9941` (2026-05-15) — Prisma client construction is unchanged: `new PrismaClient()` without a driver adapter

## Root cause

Two independent layers govern how long a write can wait for the SQLite writer lock:

1. **SQLite's `busy_timeout` PRAGMA** — set per *connection*. Tells SQLite "if the writer lock is held, retry for up to N ms before returning `SQLITE_BUSY`."
2. **Prisma engine's per-query IPC timeout** — set by the engine binary's internal client-server protocol, not by application code. Default for SQLite is roughly 5 s and is not configurable via the public Prisma API (the `?socket_timeout=` URL parameter is Postgres-only).

When the engine timeout (~5 s) is shorter than `busy_timeout` (30 s), the engine fails the query before `busy_timeout` has a chance to succeed. The PRAGMA value is effectively dead.

Additionally, `applySqliteRuntimePragmas` is invoked **once** via `client.$queryRawUnsafe('PRAGMA …')` at startup, on whichever pool connection the call lands on. Prisma's SQLite engine binary does not expose a per-connection setup hook, so subsequent pool connections start fresh with the SQLite defaults (`busy_timeout = 0`, `synchronous = FULL`). The PRAGMAs only stick on connections that happen to share the same engine-side state — not reliably across the pool.

This second issue means that even within the 5 s engine window, some writes get instant `SQLITE_BUSY` failures because their connection never received the `busy_timeout` PRAGMA.

## Reproduction

1. Self-host light variant, SQLite, with `HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000` in stack env (verified in `/proc/<server-pid>/environ`)
2. Drive enough concurrent writes that any single one queues for more than ~5 s (≥6 active sessions doing presence flushes is sufficient — see companion issue)
3. Observe in stack log:

```
[server] [07:16:40.608] ERROR: Error updating session: PrismaClientKnownRequestError:
[server] Invalid `db.session.updateMany()` invocation in
[server]   …/sessionCache.ts:338:38
[server] Socket timeout (the database failed to respond to a query within the configured timeout).
```

```
[server] stack: "PrismaClientKnownRequestError: Socket timeout (...)
[server]     at Qr.transaction (…/sqlite-client/runtime/library.js:115:11521)
[server]     at async Proxy._transactionWithCallback (…/sqlite-client/runtime/library.js:134:8044)
[server]     at async inTx (…/storage/inTx.ts:59:76)
[server]     at async Object.<anonymous> (…/api/routes/machines/machinesRoutes.ts:297:27)"
```

4. Time-to-failure for any failing query is ~5 s, never approaching the 30 s `busy_timeout` configured

## Evidence

- `applySqliteRuntimePragmas` is called once at startup; the source comment documents the intent of these PRAGMAs but the engine timeout overrides the wait window in practice
- Stack `process.env` confirmed correct via `/proc/<pid>/environ` for all four levels of the process tree (`hstack.mjs`, `run.mjs`, `tsx` wrapper, server child) — `HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000` is present
- WAL is healthy: `wal_checkpoint(TRUNCATE)` returns clean, integrity check passes, freelist 0
- Identical symptom across multiple incidents (`*.corrupt-bak`, `*.preswap`, `*.pre-vacuum` siblings in `server-light/`) suggests the issue compounds over time, occasionally corrupting the DB outright

## Proposed remediation

Two complementary fixes, in priority order.

### Fix A — Switch to `@prisma/adapter-better-sqlite3` with per-connection PRAGMA hook (structural)

Prisma supports driver adapters that let the application own the connection. With `better-sqlite3`, a connection-setup hook can apply `busy_timeout`, `synchronous`, and `journal_mode` on **every** connection deterministically.

Sketch:

```ts
import { PrismaBetterSQLite3 } from '@prisma/adapter-better-sqlite3';
import Database from 'better-sqlite3';

function createSqliteAdapter(env: NodeJS.ProcessEnv) {
    const pragmas = resolveSqliteRuntimePragmasFromEnv(env);
    return new PrismaBetterSQLite3({
        url: env.DATABASE_URL!,
        afterConnect: (db: Database.Database) => {
            db.pragma(`journal_mode = ${pragmas.journalMode}`);
            db.pragma(`synchronous = ${pragmas.synchronous}`);
            db.pragma(`busy_timeout = ${pragmas.busyTimeoutMs}`);
        },
    });
}

const adapter = createSqliteAdapter(process.env);
_db = new PrismaClient({ adapter });
```

This also bypasses the Prisma engine IPC entirely (better-sqlite3 is in-process), so the ~5 s engine timeout disappears. SQLite's `busy_timeout` becomes the only wait-time governor — which is what `HAPPIER_SQLITE_BUSY_TIMEOUT_MS` is documented to do.

**Effort: medium-large.** New dependency (`@prisma/adapter-better-sqlite3`, `better-sqlite3`). Refactor `prisma.ts` client construction. Re-validate transaction semantics. 1–2 days incl. tests + smoke pass against full server boot path.

### Fix B — Wrap writes in a transient-retry helper (band-aid, ships fast)

For any `P2024 Socket timeout`, retry the operation up to N times with small backoff. This does not fix the root cause but eliminates the user-visible cascade.

```ts
export async function withSqliteRetry<T>(
    op: () => Promise<T>,
    { attempts = 5, backoffMs = 250 } = {},
): Promise<T> {
    let lastErr: unknown;
    for (let i = 0; i < attempts; i++) {
        try {
            return await op();
        } catch (e: any) {
            if (e?.code !== 'P2024') throw e;
            lastErr = e;
            await new Promise(r => setTimeout(r, backoffMs * (i + 1)));
        }
    }
    throw lastErr;
}
```

Apply at strategic call sites (presence flush, machine registration, usage report). Eliminates the daemon's machine-registration retry storm.

**Effort: small-medium.** ~30-line helper + call-site adoption. Few hours.

### Fix C — Document the trade-off

If neither A nor B is adopted, `HAPPIER_SQLITE_BUSY_TIMEOUT_MS` should be documented as effectively capped at the Prisma engine timeout (~5 s) for the SQLite engine binary path. Today the env var name and description suggest it controls the actual wait window, which is misleading.

## Effort estimate

- Fix A (structural): **medium-large** — 1–2 days
- Fix B (retry wrapper): **small-medium** — 4–6 h
- Fix C (docs only): **small** — 30 min

Recommended: ship B first (immediate operator relief), then A to close the structural gap, then deprecate the engine-IPC code path.

## Investigation done

- Confirmed env var reaches all four levels of the server process tree (`hstack.mjs`, `run.mjs`, `tsx`, server child) via `/proc/<pid>/environ`
- Read `applySqliteRuntimePragmas` and `resolveSqliteRuntimePragmasFromEnv` — values resolve correctly (defaults: `WAL`, `NORMAL`, 5000 ms; user-set: 30000 ms)
- Captured full stack trace pointing at `sqlite-client/runtime/library.js:115:11521` (the engine IPC layer, not SQLite proper) — confirms the timeout is engine-side, not SQLite-side
- Time-to-failure is consistently ~5 s across multiple captured errors, matching documented Prisma engine timeout, not the configured 30 s `busy_timeout`
- Verified the per-connection PRAGMA limitation against Prisma engine semantics — `$queryRawUnsafe` runs on one connection, not all pool connections
- Reviewed all `dev` HEAD commits since v0.2.0 touching prisma / storage — none address either layer of this problem
- Confirmed Cloud variant (`main.ts` / MySQL) is structurally immune (different driver, no per-process writer lock)

## Related

Companion issue: #175 — SessionCache flush amplifies contention by issuing N sequential per-session writes per tick, producing the queue depth that triggers the 5 s engine timeout. Either fix here removes the cascade; both together remove the structural risk.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Prisma SQLite engine query timeout supersedes PRAGMA busy_timeout — HAPPIER_SQLITE_BUSY_TIMEOUT_MS effectively dead under contention (self-hosted light) #176

Summary

Affected component

Root cause

Reproduction

Evidence

Proposed remediation

Fix A — Switch to `@prisma/adapter-better-sqlite3` with per-connection PRAGMA hook (structural)

Fix B — Wrap writes in a transient-retry helper (band-aid, ships fast)

Fix C — Document the trade-off

Effort estimate

Investigation done

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Prisma SQLite engine query timeout supersedes PRAGMA busy_timeout — HAPPIER_SQLITE_BUSY_TIMEOUT_MS effectively dead under contention (self-hosted light) #176

Description

Summary

Affected component

Root cause

Reproduction

Evidence

Proposed remediation

Fix A — Switch to @prisma/adapter-better-sqlite3 with per-connection PRAGMA hook (structural)

Fix B — Wrap writes in a transient-retry helper (band-aid, ships fast)

Fix C — Document the trade-off

Effort estimate

Investigation done

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Fix A — Switch to `@prisma/adapter-better-sqlite3` with per-connection PRAGMA hook (structural)