fix(worker): warn when worker_id unset inside a container (#2359) by nicoloboschi · Pull Request #2366 · vectorize-io/hindsight

nicoloboschi · 2026-06-23T10:55:19Z

Problem

Several users (e.g. #2359) report consolidation/async tasks getting permanently stuck in processing, never claimed despite available worker slots.

Root cause: worker_id defaults to socket.gethostname(), which inside Docker/Kubernetes is the random container hostname — it changes every time the container is recreated. recover_own_tasks() only reclaims tasks whose worker_id matches the current worker, so a task left in processing under a now-dead container's hostname is never recovered. New requests then dedupe onto the stuck op, and consolidation wedges indefinitely.

The real fix is to set a stable HINDSIGHT_API_WORKER_ID, but the failure mode is silent today.

Change

Add detect_container_runtime() (docker / kubernetes / None) via KUBERNETES_SERVICE_HOST, /.dockerenv, and /proc/1/cgroup.
At worker start, when HINDSIGHT_API_WORKER_ID is unset and a container runtime is detected, log a prominent warning explaining the instability and how to fix it.

No behavior change to id resolution; non-containerized runs are unaffected.

Notes

This is a guardrail, not a recovery mechanism — an already-stuck op (keyed to a dead container id) still won't self-heal. A follow-up orphan-recovery path can address that separately.

Tests

tests/test_container_detection.py covers all three detection paths plus the negative case.

koriyoshi2041 · 2026-06-23T11:48:05Z

I checked the changed surface locally:

cd hindsight-api-slim
uv run pytest tests/test_container_detection.py
uv run ruff check hindsight_api/utils.py hindsight_api/api/http.py tests/test_container_detection.py
uv run ruff format --check hindsight_api/utils.py hindsight_api/api/http.py tests/test_container_detection.py
git diff --check origin/main...review/pr-2366-worker-warning

Result: 4 passed; Ruff/format/diff whitespace all clean.

The warning path looks right for the API-embedded worker. One non-blocking coverage note: the standalone hindsight-worker CLI still defaults --worker-id to config.worker_id or socket.gethostname() in hindsight_api/worker/main.py, but it does not call detect_container_runtime(). The docs recommend the production split of HINDSIGHT_API_WORKER_ENABLED=false plus separate hindsight-worker processes, so plain Docker / non-Helm deployments could still miss the new startup warning on the path where they are most likely to run dedicated workers. Helm is already fine because worker-statefulset.yaml sets HINDSIGHT_API_WORKER_ID from the pod name.

I would either mirror the warning in worker/main.py too, or track that as the next small follow-up. Not a reason to block the guardrail here.

The current CI failures I looked at appear unrelated to this Python change: Rust CLI build jobs are failing on generated client signatures expecting Option<u64> while hindsight-cli/src/api.rs passes Option<i64>, and shard 2 has a broad RuntimeError: function '_has_torch_function' already has a docstring import failure.

Default worker_id falls back to socket.gethostname(), which inside Docker/ Kubernetes is the random container hostname and changes on every container recreation. recover_own_tasks() only reclaims tasks whose worker_id matches the current worker, so tasks left in 'processing' under the old hostname are never recovered — consolidation and other async ops can get stuck forever. Add detect_container_runtime() and log a prominent warning at worker start when HINDSIGHT_API_WORKER_ID is unset and a container runtime is detected, pointing operators to set a stable worker id.

nicoloboschi force-pushed the fix/warn-unstable-worker-id-docker-2359 branch 2 times, most recently from a18bbd0 to c820bf0 Compare June 23, 2026 11:06

nicoloboschi force-pushed the fix/warn-unstable-worker-id-docker-2359 branch from c820bf0 to 0e204b3 Compare June 23, 2026 12:12

nicoloboschi force-pushed the fix/warn-unstable-worker-id-docker-2359 branch from 0e204b3 to 4d4a679 Compare June 23, 2026 12:30

nicoloboschi merged commit 680305a into main Jun 23, 2026
289 of 294 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(worker): warn when worker_id unset inside a container (#2359)#2366

fix(worker): warn when worker_id unset inside a container (#2359)#2366
nicoloboschi merged 1 commit into
mainfrom
fix/warn-unstable-worker-id-docker-2359

nicoloboschi commented Jun 23, 2026

Uh oh!

koriyoshi2041 commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nicoloboschi commented Jun 23, 2026

Problem

Change

Notes

Tests

Uh oh!

koriyoshi2041 commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants