
Categorize internal failures with a static failure category#4972

Open
dejanzele wants to merge 3 commits into armadaproject:master from dejanzele:categorize-internal-errors


@dejanzele dejanzele commented Jun 19, 2026

Member

Armada-generated job-run failures previously carried no failure_category/failure_subcategory. Only operator-classified pod failures (via the executor categorizer) were tagged, so any dashboard attributing "why did this run end" had a large unlabeled remainder for everything Armada itself decided.

This change stamps the failures Armada itself authors with a static, code-owned category internal plus a subcategory naming the cause. The boundary is the source of the error. internal covers errors Armada produces from its own logic, where it writes a fixed, self-authored message. Errors where Armada merely relays dynamic external content (the kubelet, the K8s API, the scheduler) are left to the operator categorizer, which attributes them into the operator's own categories (infra, user_error, and so on) using its configured default when no rule matches. An operator never has to match Armada's internal error strings, and the top-level value reads as a triage signal: internal means the failure was Armada's own machinery, not the workload.

Stamped internal:

  • Scheduler: lease expiry, max-runs-exceeded, job rejection.
  • Executor: failed job creation, reconciliation pod-missing, and the Armada-detected structural pod issues it authors a fixed message for (stuck-terminating, externally-deleted, active-deadline, issue-handler-error).

For the structural pod issues the categorization is deterministic: the categorizer is not consulted, so a configured rule cannot override internal. The only fallback anywhere is the categorizer's own defaultCategory, which applies to the external causes below.
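
The deterministic routing described above can be sketched roughly as follows. This is an illustration, not the PR's code: the enum values reuse pod issue names that appear in this conversation, while `internalSubcategory`, `categorize`, the `classify` callback, and the literal strings are assumptions made for the example.

```go
package main

import "fmt"

// podIssueType is a stand-in for the executor's pod issue enum (hypothetical layout).
type podIssueType int

const (
	UnableToSchedule podIssueType = iota
	StuckStartingUp
	FailedStartingUp
	StuckTerminating
	ExternallyDeleted
	ErrorDuringIssueHandling
	ActiveDeadlineExceeded
)

// categoryInternal is the assumed value of the static category.
const categoryInternal = "internal"

// internalSubcategory returns the subcategory for structural issues and
// ok=false for issues whose cause is external.
func internalSubcategory(t podIssueType) (string, bool) {
	switch t {
	case StuckTerminating:
		return "stuck-terminating", true
	case ExternallyDeleted:
		return "externally-deleted", true
	case ErrorDuringIssueHandling:
		return "issue-handler-error", true
	case ActiveDeadlineExceeded:
		return "active-deadline", true
	default:
		return "", false
	}
}

// categorize consults the operator classifier only for non-structural issues,
// so a configured rule can never override the internal category.
func categorize(t podIssueType, classify func() (string, string)) (string, string) {
	if sub, ok := internalSubcategory(t); ok {
		return categoryInternal, sub
	}
	return classify()
}

func main() {
	// operator stands in for the operator categorizer (rules or defaultCategory).
	operator := func() (string, string) { return "infra", "image-pull" }

	cat, sub := categorize(StuckTerminating, operator)
	fmt.Println(cat, sub)

	cat, sub = categorize(StuckStartingUp, operator)
	fmt.Println(cat, sub)
}
```

In this sketch the classifier callback is never invoked for the four structural types, which is what makes their categorization independent of operator configuration.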

Not stamped internal:

  • Stuck-starting-up and unschedulable pod issues (image pull, scheduling) run through the operator categorizer, which attributes their external cause.
  • Lease returns and failed pod submissions wrap the kubelet or K8s API message. These paths do not run the categorizer yet, so they currently carry no category. Wiring the categorizer into them is follow-up work (see below).
  • Preemption is left uncategorized. It is a scheduling action rather than a failure, is metered separately, and does not flow through the failed-run path.

The subcategory vocabulary lives in internal/common/errormatch (a dependency-free leaf already holding the sibling condition constants), so both the executor and the scheduler stamp from one shared set without an import cycle.
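
As a rough sketch of what stamping from one shared set could look like, the example below inlines a few constants in a single file so it compiles on its own. `CategoryInternal` and `SubcategoryJobCreationFailed` are names mentioned in this PR; `SubcategoryLeaseExpired`, the `Error` struct, and the `stampInternal` helper are assumptions for illustration, and the string values are taken from the flowchart in this conversation.

```go
package main

import "fmt"

// In the PR these would live in a dependency-free leaf package
// (internal/common/errormatch); they are inlined here for a one-file sketch.
const (
	CategoryInternal             = "internal"
	SubcategoryLeaseExpired      = "lease-expired"       // assumed constant name
	SubcategoryJobCreationFailed = "job-creation-failed" // name appears in the PR
)

// Error is a hypothetical stand-in for the proto message carrying the
// existing FailureCategory/FailureSubcategory fields.
type Error struct {
	Message            string
	FailureCategory    string
	FailureSubcategory string
}

// stampInternal is a hypothetical helper that sets the static category and
// a code-owned subcategory on an error.
func stampInternal(e *Error, subcategory string) *Error {
	e.FailureCategory = CategoryInternal
	e.FailureSubcategory = subcategory
	return e
}

func main() {
	// A scheduler-side and an executor-side construction site using the same vocabulary.
	schedulerErr := stampInternal(&Error{Message: "lease expired"}, SubcategoryLeaseExpired)
	executorErr := stampInternal(&Error{Message: "failed to create job"}, SubcategoryJobCreationFailed)
	fmt.Printf("%+v\n%+v\n", *schedulerErr, *executorErr)
}
```

Because the constants have no dependencies of their own, either component can import them without pulling in the other.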

No proto change, no migration. The change populates the existing Error.FailureCategory/Error.FailureSubcategory proto fields. One observable metric effect: the executor failure counter armada_executor_job_failure_category_total now reports internal for the structural pod issues, which previously emitted no category (the counter is a no-op on an empty category). This is intentional.
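
The metric effect can be illustrated with a toy counter. This is only a sketch under stated assumptions: `failureCounter` is an in-memory stand-in invented for the example, not the executor's real metric code; the one behaviour it borrows from the description is that an empty category records nothing.

```go
package main

import "fmt"

// failureCounter is a toy in-memory stand-in for a counter labelled by category.
type failureCounter struct {
	counts map[string]int
}

func newFailureCounter() *failureCounter {
	return &failureCounter{counts: map[string]int{}}
}

// Record increments the series for category and does nothing when it is empty.
func (c *failureCounter) Record(category string) {
	if category == "" {
		return
	}
	c.counts[category]++
}

func main() {
	c := newFailureCounter()
	c.Record("")         // before: a structural pod issue carried no category
	c.Record("internal") // after: the same kind of issue is stamped internal
	fmt.Println(c.counts)
}
```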

Where the stamps are visible today:

  • Lease expiry, failed job creation, reconciliation pod-missing, and the structural pod issues are JobRunErrors and reach the Lookout job_run.failure_category / failure_subcategory columns.
  • Max-runs-exceeded and job-rejected are JobErrors (not JobRunErrors), so they do not reach the Lookout job_run columns. The api event conversion only copies these fields on the PodError arm, so they do not reach the api stream either. Their stamps are not observable anywhere today. They are set for construction-site consistency and to be ready when JobErrors persistence or the api conversion is extended.

Follow-up (separate PR): an onPodIssue structured matcher so operators can categorize the external Armada-detected causes (stuck-starting-up, unschedulable, and optionally the structural ones) into their own categories without regexing Armada's messages, plus running the categorizer on the lease-return and submission paths.


greptile-apps Bot commented Jun 19, 2026

Contributor

Greptile Summary

This PR stamps Armada-generated terminal errors with a static internal failure category and a code-owned subcategory, so dashboards and Lookout can attribute "why did this run end" without a large unlabeled remainder. The boundary is clear: errors Armada itself authors get internal; errors that relay external content (kubelet, K8s API, scheduler) are left to the operator categorizer.

  • New constants in errormatch: CategoryInternal plus nine subcategories, all guarded by a length-bound test against the Lookout job_run.failure_subcategory varchar(63) column.
  • Executor changes: CreateMinimalJobFailedEvent gains two new parameters; handleNonRetryableJobIssue routes structural pod issues (StuckTerminating, ExternallyDeleted, ErrorDuringIssueHandling, ActiveDeadlineExceeded) directly to the internal category and bypasses the classifier; reconciliation pod-missing and job-creation failures are also stamped.
  • Scheduler changes: LeaseExpired, MaxRunsExceeded, and JobRejected errors are stamped at their construction sites; preemption is deliberately left uncategorized.
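
A length-bound guard of the kind mentioned in the first bullet might look like the sketch below. The limit of 63 comes from the varchar(63) column; the `tooLong` helper and the `internalSubcategories` slice are assumptions for illustration, with values copied from the flowchart in this summary.

```go
package main

import "fmt"

// maxSubcategoryLen mirrors the varchar(63) column bound.
const maxSubcategoryLen = 63

// internalSubcategories lists the subcategory strings shown in the flowchart.
var internalSubcategories = []string{
	"lease-expired",
	"max-runs-exceeded",
	"job-rejected",
	"job-creation-failed",
	"pod-missing",
	"stuck-terminating",
	"externally-deleted",
	"issue-handler-error",
	"active-deadline",
}

// tooLong returns the values whose byte length exceeds limit.
// For ASCII-only values the byte length equals the character count.
func tooLong(values []string, limit int) []string {
	var out []string
	for _, v := range values {
		if len(v) > limit {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	if bad := tooLong(internalSubcategories, maxSubcategoryLen); len(bad) > 0 {
		fmt.Println("too long:", bad)
		return
	}
	fmt.Println("all subcategories fit")
}
```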

Confidence Score: 5/5

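Purely additive change that populates existing proto fields; no schema or proto changes, and the only metric effect is that armada_executor_job_failure_category_total now reports internal for structural pod issues; all callers of the modified function have been updated.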

The change touches only error-construction sites, adds no new control flow, and is fully covered by updated and new tests. The internal/external routing in handleNonRetryableJobIssue is exhaustive over all seven podIssueType values, and the classifier is still consulted for the three types left to operators. No pre-existing callers of CreateMinimalJobFailedEvent were missed.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| internal/common/errormatch/types.go | Adds CategoryInternal constant and nine subcategory constants; all values are short, doc-commented, and guarded by the companion length-bound test. |
| internal/executor/reporter/event.go | CreateMinimalJobFailedEvent gains two new parameters (failureCategory, failureSubcategory); all callers have been updated; preemption path correctly keeps empty strings. |
| internal/executor/service/job_requester.go | sendFailedEvent now passes CategoryInternal/SubcategoryJobCreationFailed to CreateMinimalJobFailedEvent; straightforward two-line addition. |
| internal/executor/service/pod_issue_handler.go | handleNonRetryableJobIssue routes structural issues to the new internalSubcategoryForPodIssueType helper (bypassing the classifier); all seven podIssueType values are covered with a default fallback; reconciliation handler stamped with internal/pod-missing. |
| internal/scheduler/scheduler.go | Three scheduler-generated terminal errors (LeaseExpired, MaxRunsExceeded, JobRejected) are stamped with internal category; changes are isolated to error-construction sites. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Run ends] --> B{Source of failure?}

    B -->|Armada logic| C[Internal path]
    B -->|External content relayed| D[Operator categorizer path]

    C --> C1[Scheduler]
    C --> C2[Executor]

    C1 --> C1a["LeaseExpired → internal/lease-expired"]
    C1 --> C1b["MaxRunsExceeded → internal/max-runs-exceeded"]
    C1 --> C1c["JobRejected → internal/job-rejected"]

    C2 --> C2a["JobCreationFailed → internal/job-creation-failed"]
    C2 --> C2b["PodMissing → internal/pod-missing"]
    C2 --> C2c["StuckTerminating → internal/stuck-terminating"]
    C2 --> C2d["ExternallyDeleted → internal/externally-deleted"]
    C2 --> C2e["ErrorDuringIssueHandling → internal/issue-handler-error"]
    C2 --> C2f["ActiveDeadlineExceeded → internal/active-deadline"]

    D --> D1["StuckStartingUp → classifier rules / defaultCategory"]
    D --> D2["UnableToSchedule → classifier rules / defaultCategory"]
    D --> D3["FailedStartingUp → classifier rules / defaultCategory"]
    D --> D4["LeaseReturned / FailedPodSubmission → uncategorized NULL"]

    E[Preemption] -->|scheduling action, not a failure| F[Uncategorized NULL — deliberate]

Reviews (7): Last reviewed commit: "Categorize Armada-authored scheduler fai..."

Comment on lines +511 to +519
	var failureSubcategory string
	switch issue.RunIssue.PodIssue.Type {
	case UnableToSchedule:
		failureSubcategory = errormatch.SubcategoryUnschedulable
	case StuckStartingUp:
		failureSubcategory = errormatch.SubcategoryStuckStartingUp
	case FailedStartingUp:
		failureSubcategory = errormatch.SubcategoryFailedStartingUp
	}

Contributor


P2 Non-exhaustive switch leaves subcategory silently empty

The switch covers the three retryable types that exist today (UnableToSchedule, StuckStartingUp, FailedStartingUp), but has no default branch. If a future retryable podIssueType is added without updating this switch, the event will carry FailureCategory = "internal" with an empty FailureSubcategory = "" — a category/subcategory mismatch that is hard to catch because the code compiles and runs without error. Adding a default: failureSubcategory = "unknown" (or logging an unexpected-type warning) would make such a gap visible immediately.
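
One way the suggested default branch could look is sketched below. It is not the PR's code: the subcategory strings, the `subcategoryFor` function, and the `futureIssueType` value are hypothetical, chosen only to show how an unhandled type would surface as "unknown" instead of an empty string.

```go
package main

import "fmt"

// podIssueType is a stand-in for the executor's pod issue enum (hypothetical layout).
type podIssueType int

const (
	UnableToSchedule podIssueType = iota
	StuckStartingUp
	FailedStartingUp
	futureIssueType // hypothetical type added later without updating the switch
)

// subcategoryFor maps a pod issue type to a subcategory. The default branch
// makes an unhandled type visible rather than leaving the value empty.
func subcategoryFor(t podIssueType) string {
	switch t {
	case UnableToSchedule:
		return "unschedulable" // assumed value
	case StuckStartingUp:
		return "stuck-starting-up" // assumed value
	case FailedStartingUp:
		return "failed-starting-up" // assumed value
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(subcategoryFor(StuckStartingUp))
	fmt.Println(subcategoryFor(futureIssueType))
}
```

Logging a warning in the default branch, as the comment also suggests, would serve the same purpose without introducing a new subcategory value.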

@dejanzele dejanzele force-pushed the categorize-internal-errors branch 4 times, most recently from 0af53eb to 4cbb8a5 Compare June 22, 2026 13:07

datadog-armadaproject Bot commented Jun 22, 2026


Pipelines

⚠️ Warnings

🚦 2 Pipeline jobs failed

CI | All jobs succeeded

CI | test / Golang Integration Tests

Commit SHA: 4cbb8a5

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the categorize-internal-errors branch from 4cbb8a5 to b6b3cfc Compare June 22, 2026 21:27