Summary
On the self-hosted light variant (SQLite backend), Prisma's engine raises PrismaClientKnownRequestError: Socket timeout (the database failed to respond to a query within the configured timeout) after ~5 seconds when SQLite is under sustained write contention. This happens regardless of PRAGMA busy_timeout — including busy_timeout = 30000 (30 s), which is set by applySqliteRuntimePragmas (apps/server/sources/storage/prisma.ts) and verified to apply to the connection where the PRAGMA is issued.
In effect, HAPPIER_SQLITE_BUSY_TIMEOUT_MS does not protect writes from being aborted: any caller queued behind SQLite's writer lock for more than ~5 s gets Socket timeout, even though the lock would have released within busy_timeout.
This produces user-visible disconnects: the daemon's POST /v1/machines (and any other write-path control call) returns 500 / times out while session-presence flushes monopolise the writer slot. See companion issue #175 on the sessionCache flush amplifier — these two fixes are independent and stack.
Affected component
- Variant: self-hosted light only (Cloud /
main.ts uses MySQL, unaffected)
- File:
apps/server/sources/storage/prisma.ts — applySqliteRuntimePragmas, getPrismaClient construction
- Datasource:
apps/server/prisma/sqlite/schema.prisma
- Verified against
dev HEAD 379c9941 (2026-05-15) — Prisma client construction is unchanged: new PrismaClient() without a driver adapter
Root cause
Two independent layers govern how long a write can wait for the SQLite writer lock:
- SQLite's
busy_timeout PRAGMA — set per connection. Tells SQLite "if the writer lock is held, retry for up to N ms before returning SQLITE_BUSY."
- Prisma engine's per-query IPC timeout — set by the engine binary's internal client-server protocol, not by application code. Default for SQLite is roughly 5 s and is not configurable via the public Prisma API (the
?socket_timeout= URL parameter is Postgres-only).
When the engine timeout (~5 s) is shorter than busy_timeout (30 s), the engine fails the query before busy_timeout has a chance to succeed. The PRAGMA value is effectively dead.
Additionally, applySqliteRuntimePragmas is invoked once via client.$queryRawUnsafe('PRAGMA …') at startup, on whichever pool connection the call lands on. Prisma's SQLite engine binary does not expose a per-connection setup hook, so subsequent pool connections start fresh with the SQLite defaults (busy_timeout = 0, synchronous = FULL). The PRAGMAs only stick on connections that happen to share the same engine-side state — not reliably across the pool.
This second issue means that even within the 5 s engine window, some writes get instant SQLITE_BUSY failures because their connection never received the busy_timeout PRAGMA.
Reproduction
- Self-host light variant, SQLite, with
HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000 in stack env (verified in /proc/<server-pid>/environ)
- Drive enough concurrent writes that any single one queues for more than ~5 s (≥6 active sessions doing presence flushes is sufficient — see companion issue)
- Observe in stack log:
[server] [07:16:40.608] ERROR: Error updating session: PrismaClientKnownRequestError:
[server] Invalid `db.session.updateMany()` invocation in
[server] …/sessionCache.ts:338:38
[server] Socket timeout (the database failed to respond to a query within the configured timeout).
[server] stack: "PrismaClientKnownRequestError: Socket timeout (...)
[server] at Qr.transaction (…/sqlite-client/runtime/library.js:115:11521)
[server] at async Proxy._transactionWithCallback (…/sqlite-client/runtime/library.js:134:8044)
[server] at async inTx (…/storage/inTx.ts:59:76)
[server] at async Object.<anonymous> (…/api/routes/machines/machinesRoutes.ts:297:27)"
- Time-to-failure for any failing query is ~5 s, never approaching the 30 s
busy_timeout configured
Evidence
applySqliteRuntimePragmas is called once at startup; the source comment documents the intent of these PRAGMAs but the engine timeout overrides the wait window in practice
- Stack
process.env confirmed correct via /proc/<pid>/environ for all four levels of the process tree (hstack.mjs, run.mjs, tsx wrapper, server child) — HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000 is present
- WAL is healthy:
wal_checkpoint(TRUNCATE) returns clean, integrity check passes, freelist 0
- Identical symptom across multiple incidents (
*.corrupt-bak, *.preswap, *.pre-vacuum siblings in server-light/) suggests the issue compounds over time, occasionally corrupting the DB outright
Proposed remediation
Two complementary fixes, in priority order.
Fix A — Switch to @prisma/adapter-better-sqlite3 with per-connection PRAGMA hook (structural)
Prisma supports driver adapters that let the application own the connection. With better-sqlite3, a connection-setup hook can apply busy_timeout, synchronous, and journal_mode on every connection deterministically.
Sketch:
import { PrismaBetterSQLite3 } from '@prisma/adapter-better-sqlite3';
import Database from 'better-sqlite3';
function createSqliteAdapter(env: NodeJS.ProcessEnv) {
const pragmas = resolveSqliteRuntimePragmasFromEnv(env);
return new PrismaBetterSQLite3({
url: env.DATABASE_URL!,
afterConnect: (db: Database.Database) => {
db.pragma(`journal_mode = ${pragmas.journalMode}`);
db.pragma(`synchronous = ${pragmas.synchronous}`);
db.pragma(`busy_timeout = ${pragmas.busyTimeoutMs}`);
},
});
}
const adapter = createSqliteAdapter(process.env);
_db = new PrismaClient({ adapter });
This also bypasses the Prisma engine IPC entirely (better-sqlite3 is in-process), so the ~5 s engine timeout disappears. SQLite's busy_timeout becomes the only wait-time governor — which is what HAPPIER_SQLITE_BUSY_TIMEOUT_MS is documented to do.
Effort: medium-large. New dependency (@prisma/adapter-better-sqlite3, better-sqlite3). Refactor prisma.ts client construction. Re-validate transaction semantics. 1–2 days incl. tests + smoke pass against full server boot path.
Fix B — Wrap writes in a transient-retry helper (band-aid, ships fast)
For any P2024 Socket timeout, retry the operation up to N times with small backoff. This does not fix the root cause but eliminates the user-visible cascade.
export async function withSqliteRetry<T>(
op: () => Promise<T>,
{ attempts = 5, backoffMs = 250 } = {},
): Promise<T> {
let lastErr: unknown;
for (let i = 0; i < attempts; i++) {
try {
return await op();
} catch (e: any) {
if (e?.code !== 'P2024') throw e;
lastErr = e;
await new Promise(r => setTimeout(r, backoffMs * (i + 1)));
}
}
throw lastErr;
}
Apply at strategic call sites (presence flush, machine registration, usage report). Eliminates the daemon's machine-registration retry storm.
Effort: small-medium. ~30-line helper + call-site adoption. Few hours.
Fix C — Document the trade-off
If neither A nor B is adopted, HAPPIER_SQLITE_BUSY_TIMEOUT_MS should be documented as effectively capped at the Prisma engine timeout (~5 s) for the SQLite engine binary path. Today the env var name and description suggest it controls the actual wait window, which is misleading.
Effort estimate
- Fix A (structural): medium-large — 1–2 days
- Fix B (retry wrapper): small-medium — 4–6 h
- Fix C (docs only): small — 30 min
Recommended: ship B first (immediate operator relief), then A to close the structural gap, then deprecate the engine-IPC code path.
Investigation done
- Confirmed env var reaches all four levels of the server process tree (
hstack.mjs, run.mjs, tsx, server child) via /proc/<pid>/environ
- Read
applySqliteRuntimePragmas and resolveSqliteRuntimePragmasFromEnv — values resolve correctly (defaults: WAL, NORMAL, 5000 ms; user-set: 30000 ms)
- Captured full stack trace pointing at
sqlite-client/runtime/library.js:115:11521 (the engine IPC layer, not SQLite proper) — confirms the timeout is engine-side, not SQLite-side
- Time-to-failure is consistently ~5 s across multiple captured errors, matching documented Prisma engine timeout, not the configured 30 s
busy_timeout
- Verified the per-connection PRAGMA limitation against Prisma engine semantics —
$queryRawUnsafe runs on one connection, not all pool connections
- Reviewed all
dev HEAD commits since v0.2.0 touching prisma / storage — none address either layer of this problem
- Confirmed Cloud variant (
main.ts / MySQL) is structurally immune (different driver, no per-process writer lock)
Related
Companion issue: #175 — SessionCache flush amplifies contention by issuing N sequential per-session writes per tick, producing the queue depth that triggers the 5 s engine timeout. Either fix here removes the cascade; both together remove the structural risk.
Summary
On the self-hosted light variant (SQLite backend), Prisma's engine raises
PrismaClientKnownRequestError: Socket timeout (the database failed to respond to a query within the configured timeout)after ~5 seconds when SQLite is under sustained write contention. This happens regardless ofPRAGMA busy_timeout— includingbusy_timeout = 30000(30 s), which is set byapplySqliteRuntimePragmas(apps/server/sources/storage/prisma.ts) and verified to apply to the connection where the PRAGMA is issued.In effect,
HAPPIER_SQLITE_BUSY_TIMEOUT_MSdoes not protect writes from being aborted: any caller queued behind SQLite's writer lock for more than ~5 s getsSocket timeout, even though the lock would have released withinbusy_timeout.This produces user-visible disconnects: the daemon's
POST /v1/machines(and any other write-path control call) returns 500 / times out while session-presence flushes monopolise the writer slot. See companion issue #175 on the sessionCache flush amplifier — these two fixes are independent and stack.Affected component
main.tsuses MySQL, unaffected)apps/server/sources/storage/prisma.ts—applySqliteRuntimePragmas,getPrismaClientconstructionapps/server/prisma/sqlite/schema.prismadevHEAD379c9941(2026-05-15) — Prisma client construction is unchanged:new PrismaClient()without a driver adapterRoot cause
Two independent layers govern how long a write can wait for the SQLite writer lock:
busy_timeoutPRAGMA — set per connection. Tells SQLite "if the writer lock is held, retry for up to N ms before returningSQLITE_BUSY."?socket_timeout=URL parameter is Postgres-only).When the engine timeout (~5 s) is shorter than
busy_timeout(30 s), the engine fails the query beforebusy_timeouthas a chance to succeed. The PRAGMA value is effectively dead.Additionally,
applySqliteRuntimePragmasis invoked once viaclient.$queryRawUnsafe('PRAGMA …')at startup, on whichever pool connection the call lands on. Prisma's SQLite engine binary does not expose a per-connection setup hook, so subsequent pool connections start fresh with the SQLite defaults (busy_timeout = 0,synchronous = FULL). The PRAGMAs only stick on connections that happen to share the same engine-side state — not reliably across the pool.This second issue means that even within the 5 s engine window, some writes get instant
SQLITE_BUSYfailures because their connection never received thebusy_timeoutPRAGMA.Reproduction
HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000in stack env (verified in/proc/<server-pid>/environ)busy_timeoutconfiguredEvidence
applySqliteRuntimePragmasis called once at startup; the source comment documents the intent of these PRAGMAs but the engine timeout overrides the wait window in practiceprocess.envconfirmed correct via/proc/<pid>/environfor all four levels of the process tree (hstack.mjs,run.mjs,tsxwrapper, server child) —HAPPIER_SQLITE_BUSY_TIMEOUT_MS=30000is presentwal_checkpoint(TRUNCATE)returns clean, integrity check passes, freelist 0*.corrupt-bak,*.preswap,*.pre-vacuumsiblings inserver-light/) suggests the issue compounds over time, occasionally corrupting the DB outrightProposed remediation
Two complementary fixes, in priority order.
Fix A — Switch to
@prisma/adapter-better-sqlite3with per-connection PRAGMA hook (structural)Prisma supports driver adapters that let the application own the connection. With
better-sqlite3, a connection-setup hook can applybusy_timeout,synchronous, andjournal_modeon every connection deterministically.Sketch:
This also bypasses the Prisma engine IPC entirely (better-sqlite3 is in-process), so the ~5 s engine timeout disappears. SQLite's
busy_timeoutbecomes the only wait-time governor — which is whatHAPPIER_SQLITE_BUSY_TIMEOUT_MSis documented to do.Effort: medium-large. New dependency (
@prisma/adapter-better-sqlite3,better-sqlite3). Refactorprisma.tsclient construction. Re-validate transaction semantics. 1–2 days incl. tests + smoke pass against full server boot path.Fix B — Wrap writes in a transient-retry helper (band-aid, ships fast)
For any
P2024 Socket timeout, retry the operation up to N times with small backoff. This does not fix the root cause but eliminates the user-visible cascade.Apply at strategic call sites (presence flush, machine registration, usage report). Eliminates the daemon's machine-registration retry storm.
Effort: small-medium. ~30-line helper + call-site adoption. Few hours.
Fix C — Document the trade-off
If neither A nor B is adopted,
HAPPIER_SQLITE_BUSY_TIMEOUT_MSshould be documented as effectively capped at the Prisma engine timeout (~5 s) for the SQLite engine binary path. Today the env var name and description suggest it controls the actual wait window, which is misleading.Effort estimate
Recommended: ship B first (immediate operator relief), then A to close the structural gap, then deprecate the engine-IPC code path.
Investigation done
hstack.mjs,run.mjs,tsx, server child) via/proc/<pid>/environapplySqliteRuntimePragmasandresolveSqliteRuntimePragmasFromEnv— values resolve correctly (defaults:WAL,NORMAL, 5000 ms; user-set: 30000 ms)sqlite-client/runtime/library.js:115:11521(the engine IPC layer, not SQLite proper) — confirms the timeout is engine-side, not SQLite-sidebusy_timeout$queryRawUnsaferuns on one connection, not all pool connectionsdevHEAD commits since v0.2.0 touching prisma / storage — none address either layer of this problemmain.ts/ MySQL) is structurally immune (different driver, no per-process writer lock)Related
Companion issue: #175 — SessionCache flush amplifies contention by issuing N sequential per-session writes per tick, producing the queue depth that triggers the 5 s engine timeout. Either fix here removes the cascade; both together remove the structural risk.