ds4.c is a DeepSeek V4 Flash specific inference engine. It is not a generic
GGUF runner. The goal is a small, readable, high-performance C codebase with
Objective-C only where Metal requires it and Metal kernels under metal/.
- Keep the production path as whole-model Metal graph inference.
- Keep model loading mmap-backed; do not eagerly copy the full GGUF.
- Keep the CPU backend CPU-only and use it only as reference/debug code.
- Preserve correctness before speed. Do not keep a faster path with unexplained attention, KV cache, or logits drift.
- Make long local agent sessions practical through live KV reuse and disk KV checkpoints.
- Comment important inference code where the model mechanics, cache lifetime, memory policy, or API orchestration are not obvious from the local code.
- Prefer comments beside the implementation over separate design documents.
- Keep comments instructive and compact: explain why a shape, ordering, cache boundary, or memory choice exists.
- Keep public APIs narrow. CLI/server code should not know tensor internals.
- Do not add permanent semantic variants behind flags. Diagnostic switches are fine when they validate the one release path.
- Do not introduce C++.
- Avoid large CPU inference runs on macOS; the CPU path has previously exposed kernel VM failures with very large mappings.
- Do not run multiple huge model processes concurrently. The instance lock is intentional.
- Prefer short Metal smoke tests for build verification.
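
As an illustration of the mmap-backed loading rule above, here is a minimal sketch of mapping a model file read-only instead of copying it into heap memory. The names `mapped_file`, `map_file_readonly`, `unmap_file`, and `has_gguf_magic` are hypothetical and are not taken from ds4.c; the actual loader may be structured differently. The only format fact assumed is that a GGUF file begins with the four bytes `GGUF`.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper type, not part of ds4.c: a read-only view of a file. */
typedef struct {
    const uint8_t *data;
    size_t size;
} mapped_file;

/* Map the whole file read-only. Pages are faulted in on first access,
 * so nothing is copied eagerly. Returns 0 on success, -1 on failure. */
static int map_file_readonly(const char *path, mapped_file *out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* The mapping remains valid after the descriptor is closed. */
    if (p == MAP_FAILED) return -1;

    out->data = (const uint8_t *)p;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Release the mapping and reset the view. */
static void unmap_file(mapped_file *m) {
    if (m->data) munmap((void *)m->data, m->size);
    m->data = NULL;
    m->size = 0;
}

/* Check the 4-byte GGUF magic at the start of the mapping. */
static int has_gguf_magic(const mapped_file *m) {
    return m->size >= 4 && memcmp(m->data, "GGUF", 4) == 0;
}
```

Tensor data can then be addressed as offsets into `data`, which lets the kernel page weights in on demand rather than requiring a second full copy in memory.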
- ds4.c: model loading, tokenizer, CPU reference code, Metal graph scheduling, sessions, disk-cache payload serialization.
- ds4_cli.c: command line, linenoise REPL, interactive transcript handling.
- ds4_server.c: OpenAI/Anthropic compatible HTTP API, worker queue, streaming, tool-call mapping, disk KV cache policy.
- ds4_metal.m: Objective-C Metal runtime and kernel wrappers.
- metal/*.metal: compute kernels.
- tests/: unit and live integration tests.
- misc/: ignored notes, experiments, and old planning material.
Use make for build validation. Use make test for unit/regression tests when a
model and Metal are available. Use live server tests only when intentionally
testing the API surface.
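
Because test runs load the full model, the instance lock mentioned earlier matters here too. The following is a minimal sketch of one common way to implement a single-instance lock with `flock`. It is an assumption for illustration only: the function names and the lock file path are hypothetical, and ds4.c may implement its lock differently.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper, not part of ds4.c. Tries to take an exclusive,
 * non-blocking advisory lock on lock_path. Returns a descriptor that
 * holds the lock, or -1 if the lock is already held or open() fails. */
static int acquire_instance_lock(const char *lock_path) {
    int fd = open(lock_path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd); /* Another open file description holds the lock. */
        return -1;
    }
    return fd;
}

/* Drop the lock and close the descriptor. The kernel also releases the
 * lock automatically if the process exits or crashes. */
static void release_instance_lock(int fd) {
    if (fd >= 0) {
        flock(fd, LOCK_UN);
        close(fd);
    }
}
```

A process would call `acquire_instance_lock` before mapping the model and exit early on `-1`, so a second huge model process never starts while the first is running.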