6 changes: 6 additions & 0 deletions .gitignore
@@ -39,3 +39,9 @@ Thumbs.db
 # Benchmark results (keep structure, ignore large outputs)
 benchmarks/results/*.json
 !benchmarks/results/.gitkeep
+
+# OpenCode (user-specific local config at project root)
+/opencode.json
+
+# Build artifacts
+remote-build.log
50 changes: 29 additions & 21 deletions AGENTS.md
@@ -37,31 +37,40 @@ No API key is required by default. If your client demands one, any non-empty str

 ### OpenCode

-[OpenCode](https://opencode.ai) connects directly via OpenAI-compatible providers.
+[OpenCode](https://opencode.ai) connects via the `@ai-sdk/openai-compatible` provider.
+
+> **Important:** Use `@ai-sdk/openai-compatible`, not `@ai-sdk/openai`. The latter crashes on
+> models that emit `<think>` tokens (see [Troubleshooting](TROUBLESHOOTING.md#opencode-text-part-msg-not-found)).

 ```json
-// ~/.config/opencode/config.json
+// opencode.json (project root or ~/.config/opencode/opencode.json)
 {
+  "$schema": "https://opencode.ai/config.json",
+  "model": "foundry/Qwen3.5-9B-UD-Q4_K_XL.gguf",
   "provider": {
     "foundry": {
+      "npm": "@ai-sdk/openai-compatible",
       "name": "Foundry",
-      "type": "openai",
-      "url": "http://localhost:8080/v1",
+      "options": {
+        "baseURL": "http://localhost:8080/v1",
+        "apiKey": "sk-local"
+      },
       "models": {
-        "qwen": {
-          "id": "qwen3.5-9b",
-          "name": "Qwen 3.5 9B"
-        },
-        "qwen-coder": {
-          "id": "qwen3-coder-30b-a3b",
-          "name": "Qwen 3 Coder 30B A3B"
+        "Qwen3.5-9B-UD-Q4_K_XL.gguf": {
+          "name": "Qwen 3.5 9B",
+          "limit": {
+            "context": 262144,
+            "output": 32768
+          }
         }
       }
     }
   }
 }
 ```
+
+The model ID must match what `/v1/models` returns (check with `curl http://localhost:8080/v1/models`).
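
The model ID check can also be scripted. The following is a minimal sketch, assuming the server answers `/v1/models` with the usual OpenAI-style list body (`{"data": [{"id": ...}]}`) and that the OpenCode config has already been parsed into a dict (the `//` comment line in the example above means plain `json.load` would reject that file as shown). The helper names `served_model_ids`, `unknown_models`, and `fetch_models` are hypothetical and not part of Foundry or OpenCode.

```python
from __future__ import annotations

import json
import urllib.request


def served_model_ids(models_response: dict) -> set[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body (assumed shape)."""
    return {entry["id"] for entry in models_response.get("data", [])}


def unknown_models(config: dict, provider: str, served: set[str]) -> list[str]:
    """Return configured model IDs for `provider` that the server does not list.

    Checks both the keys under provider.<provider>.models and the top-level
    "model" default when it carries the "<provider>/" prefix.
    """
    configured = set(config.get("provider", {}).get(provider, {}).get("models", {}))
    default = config.get("model", "")
    prefix = provider + "/"
    if default.startswith(prefix):
        configured.add(default[len(prefix):])
    return sorted(configured - served)


def fetch_models(base_url: str = "http://localhost:8080/v1") -> dict:
    """GET {base_url}/models and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return json.load(resp)


# Example use against a running server (hypothetical, not executed here):
#   missing = unknown_models(config, "foundry", served_model_ids(fetch_models()))
#   if missing:
#       print("Not served:", ", ".join(missing))
```

An empty result means every configured ID is present on the server. The comparison is exact and case-sensitive, which matches the requirement that the ID be identical to what the server reports.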

 ### Cursor

 Settings > Models > OpenAI API Base:
@@ -493,7 +502,7 @@ for chunk in stream:
     print(chunk.choices[0].delta.content, end="", flush=True)
 ```

-The Docker Compose configuration includes TCP tuning (BBR congestion control, busy polling) for minimal streaming latency. Time-to-first-token is typically ~50-200 ms depending on prompt length.
+The host tuning script (`sudo ./scripts/host-setup.sh`) configures BBR congestion control and busy polling for minimal streaming latency. Time-to-first-token is typically ~50-200 ms depending on prompt length.

 ## Multi-GPU Agent Routing

@@ -539,6 +548,10 @@ def get_client(agent_id: int) -> OpenAI:

 ## Troubleshooting

+For a comprehensive troubleshooting guide, see [TROUBLESHOOTING.md](TROUBLESHOOTING.md).
+
+Quick reference for the most common issues:
+
 ### Model loads on CPU instead of GPU

 If you see `no devices with dedicated memory found` in the logs, the CUDA backend failed to load. Check:
@@ -553,6 +566,10 @@ If you see `no devices with dedicated memory found` in the logs, the CUDA backen
 2. Check VRAM: `nvidia-smi` -- if VRAM is full, reduce context with `FOUNDRY_CTX_LENGTH`
 3. Check if all slots are occupied: `curl http://localhost:8080/metrics | grep slots`

+### OpenCode: "text part msg_... not found"
+
+Use `@ai-sdk/openai-compatible` (not `@ai-sdk/openai`) and ensure the server has `--reasoning-format none`. See [TROUBLESHOOTING.md](TROUBLESHOOTING.md#opencode-text-part-msg-not-found) for details.
+
 ### Connection refused

 1. Container might still be loading the model. Check `docker logs <container>` for progress.
@@ -571,12 +588,3 @@ docker run --gpus all -p 8080:8080 \
 ```

 For GPUs with less than 16 GB VRAM, use Qwen3.5-9B (only 5.66 GB model weight). For 16+ GB, Qwen3-Coder-30B-A3B's MoE expert offloading can spill inactive experts to CPU.
-
-### Inconsistent response speeds
-
-If response speed varies between requests, check for:
-
-1. **Prompt cache misses**: First message in a conversation is always slower (prompt processing)
-2. **Concurrent slot contention**: Other agents may be using slots simultaneously
-3. **GPU thermal throttling**: Check `nvidia-smi -q -d PERFORMANCE` for throttle reasons
-4. **CPU interrupt interference**: Pin GPU IRQs to dedicated cores (see README Host Tuning section)
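
The first item in the list above (slower first message) can be told apart from the other causes by timing the first streamed token of each request. The following is a rough sketch, not a benchmark harness: `time_to_first_token` is a hypothetical helper that accepts any iterable of text deltas, so with the OpenAI client used elsewhere in this guide it could be fed `(chunk.choices[0].delta.content for chunk in stream)`.

```python
import time
from typing import Iterable, Optional, Tuple


def time_to_first_token(chunks: Iterable[Optional[str]]) -> Tuple[Optional[float], str]:
    """Consume a stream of text deltas and time the first non-empty one.

    Returns (seconds until the first non-empty delta, full concatenated text).
    The first element is None if the stream never produced any text.
    """
    start = time.perf_counter()
    first: Optional[float] = None
    parts = []
    for text in chunks:
        if not text:
            continue  # skip None or empty deltas (e.g. role-only chunks)
        if first is None:
            first = time.perf_counter() - start
        parts.append(text)
    return first, "".join(parts)
```

The timer starts when the helper is called, so time spent creating the stream beforehand is not included unless the request is issued lazily inside the generator passed in. Comparing the value for the first and later messages of one conversation gives a quick indication of whether prompt processing dominates.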