Context
After #144 / #146 landed the persistent hed-lsp client, the per-request "Initializing annotation workflow..." gap dropped from 20-60s to ~10-12s. The remaining time is dominated by the LLM-based keyword extraction inside _semantic_preprocess_node (src/agents/workflow.py), not by the LSP call itself.
Measurement in prod (docker exec hedit python ...):
- HedLspClient.spawn_stdio + initialize: 0.52 s
- One batched hed/suggest with 5 queries: 0.46 s
- Same call cached: 0.01 s
So LSP is fine. The slow piece is self.feedback_llm.ainvoke(...) in _extract_keywords. In prod that LLM is wired to the evaluation model (qwen/qwen3.6-35b-a3b @ wandb), which takes ~7-10 s for an "extract 5 nouns" task. Even claude-haiku-4.5 takes ~7-9 s when extended thinking is on by default.
A keyword extraction call with claude-haiku-4.5 + thinking: disabled + max_tokens=200 runs in ~1 s with identical-quality output.
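A minimal sketch of the fast-path settings and a way to time them. The helper names (keyword_llm_kwargs, timed) are hypothetical, and the {"type": "disabled"} shape for thinking follows the Anthropic Messages API; how these kwargs actually reach the model depends on the client wrapper used in src/agents/workflow.py, which is not shown here.

```python
import asyncio
import time


def keyword_llm_kwargs(model="claude-haiku-4.5", max_tokens=200):
    """Request settings for the keyword-extraction call (hypothetical helper).

    Assumption: the client forwards `thinking` and `max_tokens` to the provider
    unchanged. Adjust the key names to whatever the wrapper expects.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,           # short output: a handful of keywords
        "thinking": {"type": "disabled"},   # skip extended thinking for this task
    }


async def timed(awaitable):
    """Await something and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await awaitable
    return result, time.perf_counter() - start
```

Wrapping the existing call as `await timed(self.feedback_llm.ainvoke(...))` before and after the change should be enough to confirm the ~7-10 s to ~1 s drop in a given deployment.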
Sub-issues
- B. Use fast LLM (thinking disabled) for keyword extraction. Single-PR fix.
- C. Run semantic_preprocess in parallel with the first annotate LLM call so the perceived pre-annotate window goes to ~0. Folds hints in on retry if they arrive in time. Separate PR after B lands.
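One possible shape for C, as a sketch only: the function and its preprocess / annotate / validate parameters are hypothetical stand-ins, not the actual workflow nodes. Preprocessing starts as a background task, the first annotate attempt runs without hints, and later attempts pick up the hints if the task has finished by then.

```python
import asyncio


async def annotate_with_background_hints(description, preprocess, annotate,
                                         validate, max_attempts=3):
    """Run preprocess concurrently with annotate; use hints only on retries.

    preprocess(description) -> awaitable returning hints
    annotate(description, hints) -> awaitable returning an annotation
    validate(annotation) -> bool
    (All three are hypothetical callables for illustration.)
    """
    hints_task = asyncio.create_task(preprocess(description))
    hints = None
    annotation = None
    try:
        for _ in range(max_attempts):
            # Fold hints in only if the background task finished cleanly.
            if (hints is None and hints_task.done()
                    and not hints_task.cancelled()
                    and hints_task.exception() is None):
                hints = hints_task.result()
            annotation = await annotate(description, hints)
            if validate(annotation):
                return annotation
        return annotation
    finally:
        # Hints that never arrived in time are no longer needed.
        if not hints_task.done():
            hints_task.cancel()
```

The first attempt never waits on preprocessing, so the pre-annotate window is bounded by task creation. The cost is that a first attempt that validates never benefits from hints.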
Acceptance