Avoid staging local KV cache payloads #353

Open
leblancfg wants to merge 1 commit into antirez:main from leblancfg:kv-streaming-write

leblancfg commented Jun 7, 2026

Problem

Disk KV saves currently write the session payload twice for local sessions:

  1. serialize the DS4 payload to a staged temp file;
    ds4_session_stage_payload(session, &staged, ...)
  2. copy that staged payload into the final .kv.tmp... cache file;
    ds4_session_write_staged_payload(&staged, fp, ...)
  3. rename the final temp file into place.
    rename(tmp, path)

Current path: session -> staged payload tmp -> final .kv.tmp -> rename.
This PR: session -> final .kv.tmp -> rename.

For long contexts, this payload can be multi-GiB. The extra staged-file copy adds latency to cold/continued/session saves and also raises peak temporary disk usage.

In this PR

When the engine can predict the payload size with ds4_session_payload_bytes(), write the payload directly into the final temporary cache file.

The cache file is still atomic from the reader's point of view: it is written as .tmp... and renamed only after the full header, text, payload and trailer are written successfully. If the direct write fails, the temp file is unlinked as before.
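
The write-then-rename pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `write_file_atomically`, `body_writer_fn`, the two demo writers and the `.tmp.<pid>` suffix are all hypothetical names chosen for the example, and the real temp-file naming is not shown here.

```c
#include <stdio.h>
#include <unistd.h>   /* unlink(), getpid() (POSIX) */

/* Hypothetical callback: writes the whole file body to fp, returns 0 on success. */
typedef int (*body_writer_fn)(FILE *fp, void *ctx);

/* Write to "<path>.tmp.<pid>" (an assumed naming scheme) and rename it over
 * path only after the body was written, flushed and closed without error.
 * On any failure the temp file is unlinked and path is left untouched. */
static int write_file_atomically(const char *path, body_writer_fn write_body, void *ctx) {
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp.%ld", path, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;

    FILE *fp = fopen(tmp, "wb");
    if (!fp) return -1;

    int ok = write_body(fp, ctx) == 0;
    if (fflush(fp) != 0) ok = 0;
    if (fclose(fp) != 0) ok = 0;
    if (ok && rename(tmp, path) != 0) ok = 0;

    if (!ok) {
        unlink(tmp);   /* readers never see a partial file under the final name */
        return -1;
    }
    return 0;
}

/* Demo body standing in for "header, text, payload, trailer". */
static int write_demo_body(FILE *fp, void *ctx) {
    const char *s = ctx;
    return fputs(s, fp) == EOF ? -1 : 0;
}

/* Demo body that fails midway, as a direct payload write might. */
static int write_failing_body(FILE *fp, void *ctx) {
    (void)ctx;
    fputs("partial", fp);
    return -1;
}
```

Calling `write_file_atomically(path, write_demo_body, "first")` and then `write_file_atomically(path, write_failing_body, NULL)` should leave `path` holding `first`: the failed attempt only ever touched the temp file. The sketch does not call `fsync()`, so it says nothing about durability across a crash, only about what concurrent readers can observe.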

Unknown-size payloads still use the old staged path. In practice this preserves the existing distributed-session behavior, since distributed payload size is not predicted by ds4_session_payload_bytes().

Implementation details:

  • add ds4_session_save_payload_counted(), a small wrapper around the existing ds4_session_save_payload() serializer;
  • keep the payload format in one implementation instead of duplicating serialization code;
  • server and agent save paths use direct writes only when the expected payload size is known;
  • after direct write, verify that the measured bytes written match the expected payload size;
  • no loader or cache format change.
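
To illustrate the "count while writing, then compare against the predicted size" idea, here is a small self-contained sketch. It does not reproduce `ds4_session_save_payload_counted()` or its signature; `counted_file`, `counted_write`, `toy_save_payload`, `toy_payload_bytes` and `save_payload_direct` are hypothetical stand-ins, and the toy payload is just a length-prefixed array.

```c
#include <stdint.h>
#include <stdio.h>

/* A FILE plus a running count of bytes successfully written through it. */
typedef struct {
    FILE *fp;
    uint64_t bytes;
} counted_file;

/* Write len bytes and add them to the counter; returns 0 on success. */
static int counted_write(counted_file *cf, const void *buf, size_t len) {
    if (len != 0 && fwrite(buf, 1, len, cf->fp) != len) return -1;
    cf->bytes += len;
    return 0;
}

/* Toy serializer standing in for the real one: a count followed by the values.
 * There is a single implementation of the format; callers only add counting. */
static int toy_save_payload(counted_file *cf, const uint32_t *vals, uint32_t count) {
    if (counted_write(cf, &count, sizeof(count)) != 0) return -1;
    return counted_write(cf, vals, (size_t)count * sizeof(vals[0]));
}

/* Predicted size of the toy payload, computed without serializing it. */
static uint64_t toy_payload_bytes(uint32_t count) {
    return sizeof(uint32_t) + (uint64_t)count * sizeof(uint32_t);
}

/* Direct save: serialize straight into fp, then verify that the measured
 * byte count matches the prediction. A mismatch is reported as failure so
 * the caller can discard the temp file. */
static int save_payload_direct(FILE *fp, const uint32_t *vals, uint32_t count) {
    counted_file cf = { fp, 0 };
    if (toy_save_payload(&cf, vals, count) != 0) return -1;
    return cf.bytes == toy_payload_bytes(count) ? 0 : -1;
}
```

The point of the check is that the predicted size is typically already recorded elsewhere in the file, so a serializer that silently wrote more or fewer bytes than predicted would otherwise produce a file the loader cannot trust.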

Benchmark

I used a long-context server run because that is where disk KV saves are large enough for the extra copy to matter. The benchmark does not try to show faster prefill; it only measures the save path around normal long-context checkpoints.

Machine: Apple M2 Max, 64 GiB unified memory, Metal SSD streaming, 32GB routed expert cache.

Command:

./ds4-server \
  -m ./ds4flash.gguf \
  --ssd-streaming \
  --ssd-streaming-cache-experts 32GB \
  --ctx 524288 \
  --kv-disk-dir "$CACHE" \
  --kv-disk-space-mb 200000 \
  --kv-cache-min-tokens 512 \
  --kv-cache-cold-max-tokens 520192 \
  --kv-cache-continued-interval-tokens 10000 \
  --kv-cache-boundary-trim-tokens 32 \
  --kv-cache-boundary-align-tokens 2048

Request:

  • prompt: first 405,181 bytes of speed-bench/promessi_sposi.txt;
  • prompt tokens: 128,190;
  • model=deepseek-chat, thinking=false, temperature=0, max_tokens=1;
  • fresh KV cache dir for each run.

For main, I used a temporary timing-only log around ds4_session_stage_payload() so the old path can be measured as serialize to temp + copy temp to cache. That instrumentation is not part of this PR.

Per-checkpoint save times, excluding shutdown:

| tokens  | saved KV file | old: serialize temp | old: copy temp | old total | new direct | delta      |
|---------|---------------|---------------------|----------------|-----------|------------|------------|
| 20,480  | 291.77 MiB    | 142.0 ms            | 76.3 ms        | 218.3 ms  | 170.6 ms   | -47.7 ms   |
| 40,960  | 560.66 MiB    | 267.7 ms            | 144.2 ms       | 411.9 ms  | 289.0 ms   | -122.9 ms  |
| 61,440  | 829.55 MiB    | 392.3 ms            | 221.8 ms       | 614.1 ms  | 378.5 ms   | -235.6 ms  |
| 81,920  | 1098.44 MiB   | 526.4 ms            | 441.3 ms       | 967.7 ms  | 515.9 ms   | -451.8 ms  |
| 102,400 | 1367.33 MiB   | 616.5 ms            | 955.7 ms       | 1572.2 ms | 602.6 ms   | -969.6 ms  |
| 122,880 | 1636.22 MiB   | 740.3 ms            | 1049.0 ms      | 1789.3 ms | 705.0 ms   | -1084.3 ms |
| 126,976 | 1690.00 MiB   | 728.4 ms            | 1014.2 ms      | 1742.6 ms | 716.7 ms   | -1025.9 ms |
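
For anyone who wants to turn these rows into ratios, the arithmetic is simple enough to sketch. The helper names below are made up for this example and are not part of the benchmark tooling.

```c
#include <stdio.h>

/* How many times faster the direct path is: old total time / new time. */
static double save_speedup(double old_total_ms, double new_direct_ms) {
    return old_total_ms / new_direct_ms;
}

/* Effective save rate in MiB/s for a file of size_mib written in ms milliseconds. */
static double save_rate_mib_s(double size_mib, double ms) {
    return size_mib / (ms / 1000.0);
}

/* Print one derived row, e.g. print_derived(126976, 1690.00, 1742.6, 716.7). */
static void print_derived(long tokens, double size_mib, double old_ms, double new_ms) {
    printf("%ld tokens: %.2fx faster, %.0f MiB/s direct\n",
           tokens, save_speedup(old_ms, new_ms), save_rate_mib_s(size_mib, new_ms));
}
```

With the numbers above, the ratio works out to about 1.28x at 20,480 tokens and about 2.43x at 126,976 tokens, which matches the table: the copy step grows much faster than the serialize step once the file passes roughly 1 GiB.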

Tests

make clean
make -j4
git diff --check
./ds4_test --server
./ds4-eval --self-test-extractors
make q4k-dot-test
make cpu

Additional live cache check on the direct-save branch, reusing the 128k-token benchmark cache:

  • cache hit restored 126,976 tokens from disk
  • only the 1,214 prompt suffix tokens were prefilled (128,190 prompt tokens minus 126,976 cached)
  • reported usage: cached_tokens=126976, cache_write_tokens=1214
  • cache load log: load=355.5 ms for the 1690.00 MiB (about 1.65 GiB) cold checkpoint

leblancfg marked this pull request as ready for review June 8, 2026 01:45