Skip to content

fix(worker): warn when worker_id unset inside a container (#2359)#2366

Merged
nicoloboschi merged 1 commit into
mainfrom
fix/warn-unstable-worker-id-docker-2359
Jun 23, 2026
Merged

fix(worker): warn when worker_id unset inside a container (#2359)#2366
nicoloboschi merged 1 commit into
mainfrom
fix/warn-unstable-worker-id-docker-2359

Conversation

@nicoloboschi

Copy link
Copy Markdown
Collaborator

Problem

Several users (e.g. #2359) report consolidation/async tasks getting permanently stuck in processing, never claimed despite available worker slots.

Root cause: worker_id defaults to socket.gethostname(), which inside Docker/Kubernetes is the random container hostname — it changes every time the container is recreated. recover_own_tasks() only reclaims tasks whose worker_id matches the current worker, so a task left in processing under a now-dead container's hostname is never recovered. New requests then dedupe onto the stuck op, and consolidation wedges indefinitely.

The real fix is to set a stable HINDSIGHT_API_WORKER_ID, but the failure mode is silent today.

Change

  • Add detect_container_runtime() (docker / kubernetes / None) via KUBERNETES_SERVICE_HOST, /.dockerenv, and /proc/1/cgroup.
  • At worker start, when HINDSIGHT_API_WORKER_ID is unset and a container runtime is detected, log a prominent warning explaining the instability and how to fix it.

No behavior change to id resolution; non-containerized runs are unaffected.

Notes

This is a guardrail, not a recovery mechanism — an already-stuck op (keyed to a dead container id) still won't self-heal. A follow-up orphan-recovery path can address that separately.

Tests

tests/test_container_detection.py covers all three detection paths plus the negative case.

@nicoloboschi nicoloboschi force-pushed the fix/warn-unstable-worker-id-docker-2359 branch 2 times, most recently from a18bbd0 to c820bf0 Compare June 23, 2026 11:06
@koriyoshi2041

Copy link
Copy Markdown
Contributor

I checked the changed surface locally:

cd hindsight-api-slim
uv run pytest tests/test_container_detection.py
uv run ruff check hindsight_api/utils.py hindsight_api/api/http.py tests/test_container_detection.py
uv run ruff format --check hindsight_api/utils.py hindsight_api/api/http.py tests/test_container_detection.py
git diff --check origin/main...review/pr-2366-worker-warning

Result: 4 passed; Ruff/format/diff whitespace all clean.

The warning path looks right for the API-embedded worker. One non-blocking coverage note: the standalone hindsight-worker CLI still defaults --worker-id to config.worker_id or socket.gethostname() in hindsight_api/worker/main.py, but it does not call detect_container_runtime(). The docs recommend the production split of HINDSIGHT_API_WORKER_ENABLED=false plus separate hindsight-worker processes, so plain Docker / non-Helm deployments could still miss the new startup warning on the path where they are most likely to run dedicated workers. Helm is already fine because worker-statefulset.yaml sets HINDSIGHT_API_WORKER_ID from the pod name.

I would either mirror the warning in worker/main.py too, or track that as the next small follow-up. Not a reason to block the guardrail here.

The current CI failures I looked at appear unrelated to this Python change: Rust CLI build jobs are failing on generated client signatures expecting Option<u64> while hindsight-cli/src/api.rs passes Option<i64>, and shard 2 has a broad RuntimeError: function '_has_torch_function' already has a docstring import failure.

@nicoloboschi nicoloboschi force-pushed the fix/warn-unstable-worker-id-docker-2359 branch from c820bf0 to 0e204b3 Compare June 23, 2026 12:12
Default worker_id falls back to socket.gethostname(), which inside Docker/
Kubernetes is the random container hostname and changes on every container
recreation. recover_own_tasks() only reclaims tasks whose worker_id matches
the current worker, so tasks left in 'processing' under the old hostname are
never recovered — consolidation and other async ops can get stuck forever.

Add detect_container_runtime() and log a prominent warning at worker start
when HINDSIGHT_API_WORKER_ID is unset and a container runtime is detected,
pointing operators to set a stable worker id.
@nicoloboschi nicoloboschi force-pushed the fix/warn-unstable-worker-id-docker-2359 branch from 0e204b3 to 4d4a679 Compare June 23, 2026 12:30
@nicoloboschi nicoloboschi merged commit 680305a into main Jun 23, 2026
289 of 294 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants