fix(worker): warn when worker_id unset inside a container (#2359) #2366
I checked the changed surface locally:
cd hindsight-api-slim
uv run pytest tests/test_container_detection.py
uv run ruff check hindsight_api/utils.py hindsight_api/api/http.py tests/test_container_detection.py
uv run ruff format --check hindsight_api/utils.py hindsight_api/api/http.py tests/test_container_detection.py
git diff --check origin/main...review/pr-2366-worker-warning
Result: The warning path looks right for the API-embedded worker.
One non-blocking coverage note: the standalone
I would either mirror the warning in
The current CI failures I looked at appear unrelated to this Python change: Rust CLI build jobs are failing on generated client signatures expecting
Default worker_id falls back to socket.gethostname(), which inside Docker/Kubernetes is the random container hostname and changes on every container recreation. recover_own_tasks() only reclaims tasks whose worker_id matches the current worker, so tasks left in 'processing' under the old hostname are never recovered, and consolidation and other async ops can get stuck forever.
Add detect_container_runtime() and log a prominent warning at worker start when HINDSIGHT_API_WORKER_ID is unset and a container runtime is detected, pointing operators to set a stable worker id.
Problem
Several users (e.g. #2359) report consolidation/async tasks getting permanently stuck in processing, never claimed despite available worker slots.
Root cause: worker_id defaults to socket.gethostname(), which inside Docker/Kubernetes is the random container hostname, so it changes every time the container is recreated. recover_own_tasks() only reclaims tasks whose worker_id matches the current worker, so a task left in processing under a now-dead container's hostname is never recovered. New requests then dedupe onto the stuck op, and consolidation wedges indefinitely.
The real fix is to set a stable HINDSIGHT_API_WORKER_ID, but the failure mode is silent today.
Change
Add detect_container_runtime() (docker/kubernetes/None) via KUBERNETES_SERVICE_HOST, /.dockerenv, and /proc/1/cgroup.
At worker start, when HINDSIGHT_API_WORKER_ID is unset and a container runtime is detected, log a prominent warning explaining the instability and how to fix it.
No behavior change to id resolution; non-containerized runs are unaffected.
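For readers without the diff open, the detection and warning described above could look roughly like the sketch below. This is not the code in hindsight_api/utils.py. The check order, the injectable parameters, the cgroup marker strings ("kubepods", "docker"), the helper name warn_if_unstable_worker_id, and the warning wording are all assumptions made for illustration.

```python
import logging
import os
from pathlib import Path
from typing import Mapping, Optional

logger = logging.getLogger(__name__)


def detect_container_runtime(
    environ: Optional[Mapping[str, str]] = None,
    dockerenv: Path = Path("/.dockerenv"),
    cgroup: Path = Path("/proc/1/cgroup"),
) -> Optional[str]:
    """Return "kubernetes", "docker", or None. Illustrative sketch only."""
    env = os.environ if environ is None else environ
    # Kubernetes sets this variable in pod containers.
    if env.get("KUBERNETES_SERVICE_HOST"):
        return "kubernetes"
    # Docker creates this marker file at the container root.
    if dockerenv.exists():
        return "docker"
    # Fall back to PID 1's cgroup membership (marker strings are assumed).
    try:
        text = cgroup.read_text()
    except OSError:
        return None
    if "kubepods" in text:
        return "kubernetes"
    if "docker" in text:
        return "docker"
    return None


def warn_if_unstable_worker_id(
    environ: Optional[Mapping[str, str]] = None,
    runtime: Optional[str] = None,
) -> bool:
    """Log a warning and return True when the worker id would be unstable.

    Hypothetical helper; it never changes how the id is resolved.
    """
    env = os.environ if environ is None else environ
    if env.get("HINDSIGHT_API_WORKER_ID") or runtime is None:
        return False
    logger.warning(
        "HINDSIGHT_API_WORKER_ID is unset inside a %s container; the worker id "
        "falls back to the container hostname, which changes on recreation. "
        "Set HINDSIGHT_API_WORKER_ID to a stable value.",
        runtime,
    )
    return True
```

Passing the environment and paths as parameters keeps each detection path testable without running inside a real container.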
Notes
This is a guardrail, not a recovery mechanism — an already-stuck op (keyed to a dead container id) still won't self-heal. A follow-up orphan-recovery path can address that separately.
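To make the failure mode concrete, here is a toy model of id-matched recovery. It is a sketch, not the project's recover_own_tasks(): the Task fields, the "pending" status, and the example hostnames are assumptions chosen only to show why a task claimed under a dead container's hostname is skipped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    id: int
    status: str
    worker_id: str


def recover_own_tasks(tasks: List[Task], worker_id: str) -> List[Task]:
    """Toy stand-in: requeue only 'processing' tasks claimed by this worker id."""
    recovered = []
    for task in tasks:
        if task.status == "processing" and task.worker_id == worker_id:
            task.status = "pending"  # assumed status name for a requeued task
            recovered.append(task)
    return recovered


# Task claimed by a container whose hostname no longer exists.
stuck = [Task(id=1, status="processing", worker_id="3f9c2a1b7d10")]
# The recreated container has a new hostname, so nothing matches.
print(recover_own_tasks(stuck, "a81e04c6f2b9"))

# With a stable id, the restarted worker finds and requeues its own task.
stable = [Task(id=2, status="processing", worker_id="worker-1")]
print([t.id for t in recover_own_tasks(stable, "worker-1")])
```

In this model the first task stays in processing no matter how many times the new container runs recovery, which matches the behavior described in the note above.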
Tests
tests/test_container_detection.py covers all three detection paths plus the negative case.