Update task status after the task fails #500

Open
MoeexT wants to merge 2 commits into main from develop/ray

@MoeexT MoeexT commented Jun 4, 2026

close: #486

MoeexT added 2 commits June 4, 2026 17:23
When the Ray head or worker pods are deleted during task execution:
- The task stays stuck in RUNNING indefinitely and the frontend never updates.

Changes:
1. job_task_scheduler.py: Add a connection failure counter (5 retries)
   and stall detection (no log progress for 120s marks the task FAILED).
2. operator_runtime.py: Add a GET /api/task/{id}/status endpoint
   that exposes the RayJobScheduler task status to backend-python.
3. cleaning_task_scheduler.py: Add a background polling loop that
   queries the runtime status every 2s and updates the database on
   terminal states (completed/failed/cancelled).
4. operator_runtime.py: get_from_cfg() now supports a default fallback value.
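
For reviewers, the failure counter and stall detection described in change 1 could look roughly like the sketch below. This is not the code from the diff: `TaskWatchdog`, `observe`, and the injectable `clock` are hypothetical names chosen for illustration. It also assumes that the counter tracks consecutive failures (reset on any successful poll) and that the fifth consecutive failure marks the task FAILED. Only the two thresholds (5 and 120s) come from the description above.

```python
import time
from typing import Callable, Optional

MAX_CONNECTION_FAILURES = 5     # "5 retries" from the PR description
STALL_TIMEOUT_SECONDS = 120.0   # "120s no log progress" from the PR description


class TaskWatchdog:
    """Hypothetical helper that decides when a RUNNING task should become FAILED."""

    def __init__(
        self,
        max_failures: int = MAX_CONNECTION_FAILURES,
        stall_timeout: float = STALL_TIMEOUT_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.stall_timeout = stall_timeout
        self.clock = clock
        self.failures = 0                # consecutive connection failures (assumption)
        self.last_log_size = 0           # largest log size seen so far
        self.last_progress_at = clock()  # time of the last observed log growth

    def observe(self, log_size: Optional[int]) -> str:
        """Record one poll; log_size=None means the cluster was unreachable."""
        now = self.clock()
        if log_size is None:
            self.failures += 1
            if self.failures >= self.max_failures:
                return "FAILED"
        else:
            # A successful poll resets the failure counter (assumption).
            self.failures = 0
            if log_size > self.last_log_size:
                self.last_log_size = log_size
                self.last_progress_at = now
        # Stall detection: no log growth for stall_timeout seconds.
        if now - self.last_progress_at >= self.stall_timeout:
            return "FAILED"
        return "RUNNING"
```

In this sketch the scheduler would call `observe()` once per poll with the current log size, or with `None` when the Ray cluster cannot be reached, and would write the terminal state as soon as it returns `"FAILED"`. Passing the clock in as a parameter lets both thresholds be tested without sleeping.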

Development

Successfully merging this pull request may close these issues.

Deleting the ray-cluster-worker and ray-cluster-head containers while a task is running causes the container task to get stuck