Add developer doc for job lifecycle events #4935

Open
dejanzele wants to merge 3 commits into armadaproject:master from dejanzele:docs/job-lifecycle-events

@dejanzele commented May 28, 2026

Member

Adds a developer reference for the events and state transitions across a job run's lifecycle.

The doc covers the multi-cluster topology and the two transports (Pulsar and the gRPC lease stream) that carry events between the control plane and executors, the job-level and run-level state machines, and the internal proto event vocabulary alongside its mapping to the external API event vocabulary. It then walks through step-by-step flows for the four terminal cases: succeeded, failed (both organic terminal-phase and executor-issue-handler paths), preempted, and cancelled.

The preempt section documents the current double-emission behavior. Each preempted run produces two JobPreemptedEvent messages on the external stream and an overwrite in the job_run_errors row that replaces the scheduler's preemption description with the executor's generic "Run preempted" text. Operators integrating with the event stream need to know about this.
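
For consumers of the external stream, one way to tolerate the double emission is to drop repeated JobPreemptedEvent messages per run. The following is a minimal sketch, not Armada code: the `event` type and its `Kind`, `JobID`, and `RunID` fields are hypothetical stand-ins for whatever the real event payload exposes, and keeping the first message rather than the second is an assumption.

```go
package main

import "fmt"

// event is a hypothetical, simplified stand-in for an external API event.
// The real payload shape is not shown in this PR; these fields are assumptions.
type event struct {
	Kind  string // e.g. "JobPreemptedEvent"
	JobID string
	RunID string
}

// dedupePreempted keeps the first JobPreemptedEvent seen for each (job, run)
// pair and drops later repeats. All other events pass through unchanged.
func dedupePreempted(events []event) []event {
	seen := make(map[string]struct{})
	out := make([]event, 0, len(events))
	for _, e := range events {
		if e.Kind == "JobPreemptedEvent" {
			key := e.JobID + "/" + e.RunID
			if _, dup := seen[key]; dup {
				continue // second emission for the same run
			}
			seen[key] = struct{}{}
		}
		out = append(out, e)
	}
	return out
}

func main() {
	stream := []event{
		{Kind: "JobRunningEvent", JobID: "job-1", RunID: "run-1"},
		{Kind: "JobPreemptedEvent", JobID: "job-1", RunID: "run-1"},
		{Kind: "JobPreemptedEvent", JobID: "job-1", RunID: "run-1"},
	}
	for _, e := range dedupePreempted(stream) {
		fmt.Println(e.Kind, e.JobID, e.RunID)
	}
}
```

Keying on the run rather than the job means a job that is preempted again on a later run still yields one event per run.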

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

greptile-apps Bot commented May 28, 2026

Contributor

Greptile Summary

This PR adds a new developer reference document (docs/developer/job-lifecycle-events.md) covering the job-run lifecycle in Armada — topology, transport mechanics, job/run state machines, internal and external event vocabularies, and step-by-step sequence diagrams for the four terminal flows (succeeded, failed, preempted, cancelled). No production code is changed.

  • State machines and event vocabulary are documented with Mermaid diagrams and transition tables, cross-referencing jobstates.go and the Pulsar event proto schema.
  • Preemption double-emission — a known bug where each preemption produces two external JobPreemptedEvent messages and overwrites the job_run_errors row with generic executor content — is explicitly called out in a "Known issues" sub-section, which is valuable for operators integrating with the event stream.

Confidence Score: 5/5

This PR adds only documentation — no production code is changed, so there is no runtime risk from merging.

All changes are in a single new Markdown file. The doc is well-researched and cross-checks cleanly against the conversion layer, scheduler ingester ignore list, and Lookout state enums. The one factual inaccuracy found (the note overstating which fields are populated in the external JobFailedEvent) is a minor description issue and carries no operational risk.

No files require special attention; the single changed Markdown file is safe to merge as-is.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/developer/job-lifecycle-events.md | New developer reference doc covering job lifecycle events; broadly accurate but the conversion table note for JobErrors → JobFailedEvent overstates which fields are populated for non-PodError reasons. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Internal["Internal Pulsar Events"]
        JRL[JobRunLeased]
        JRA[JobRunAssigned]
        JRR[JobRunRunning]
        JRS[JobRunSucceeded]
        JRRE[JobRunErrors]
        JRPR[JobRunPreempted]
        JRC[JobRunCancelled]
        JS[JobSucceeded]
        JE[JobErrors]
        CJ[CancelledJob]
    end
    subgraph External["External API Events"]
        JLE[JobLeasedEvent]
        JPE[JobPendingEvent]
        JRuE[JobRunningEvent]
        JLRE[JobLeaseReturnedEvent]
        JLEE[JobLeaseExpiredEvent]
        JPRE[JobPreemptedEvent]
        JSE[JobSucceededEvent]
        JFE[JobFailedEvent]
        JCE[JobCancelledEvent]
        IGN[ignored]
    end
    JRL -->|JobLeasedEvent| JLE
    JRA -->|JobPendingEvent| JPE
    JRR -->|JobRunningEvent| JRuE
    JRS --> IGN
    JRRE -->|PodLeaseReturned| JLRE
    JRRE -->|LeaseExpired| JLEE
    JRRE -->|all other reasons| IGN
    JRPR -->|JobPreemptedEvent| JPRE
    JRC --> IGN
    JS -->|JobSucceededEvent| JSE
    JE -->|Terminal=true, any reason| JFE
    CJ -->|JobCancelledEvent| JCE
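
The mapping in the flowchart above can be read as a lookup function. The sketch below transcribes only the edges drawn in the chart; it is not the Armada conversion layer. The function name `externalEvent`, the string-based event names, and the `reason` and `terminal` parameters are illustrative assumptions, and treating a non-terminal JobErrors as ignored is also an assumption because the chart only shows the `Terminal=true` edge.

```go
package main

import "fmt"

// externalEvent returns the external API event name for an internal event,
// following the flowchart. The second result is false when the internal
// event is ignored. reason is consulted only for JobRunErrors and terminal
// only for JobErrors. This is a hypothetical helper, not Armada's API.
func externalEvent(internal, reason string, terminal bool) (string, bool) {
	switch internal {
	case "JobRunLeased":
		return "JobLeasedEvent", true
	case "JobRunAssigned":
		return "JobPendingEvent", true
	case "JobRunRunning":
		return "JobRunningEvent", true
	case "JobRunErrors":
		switch reason {
		case "PodLeaseReturned":
			return "JobLeaseReturnedEvent", true
		case "LeaseExpired":
			return "JobLeaseExpiredEvent", true
		}
		return "", false // all other reasons are ignored
	case "JobRunPreempted":
		return "JobPreemptedEvent", true
	case "JobSucceeded":
		return "JobSucceededEvent", true
	case "JobErrors":
		if terminal {
			return "JobFailedEvent", true
		}
		return "", false // assumption: the chart shows no non-terminal edge
	case "CancelledJob":
		return "JobCancelledEvent", true
	}
	// JobRunSucceeded and JobRunCancelled have no external counterpart.
	return "", false
}

func main() {
	name, ok := externalEvent("JobRunErrors", "LeaseExpired", false)
	fmt.Println(name, ok)
}
```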

Reviews (12): Last reviewed commit: "Correct MaxRunsExceeded handler, podIssu..."

Comment thread docs/developer/job-lifecycle-events.md
Comment thread docs/developer/job-lifecycle-events.md
@dejanzele dejanzele force-pushed the docs/job-lifecycle-events branch 8 times, most recently from 44bcedb to 7903765 Compare May 29, 2026 10:49
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the docs/job-lifecycle-events branch from 7903765 to 09a154e Compare May 29, 2026 11:11
…nversion note in job-lifecycle doc

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>