Source scheduler failure metric labels from error categorization, remove TrackedErrorRegexes by dejanzele · Pull Request #4980 · armadaproject/armada

dejanzele · 2026-06-25T12:55:49Z

What Armada exposes now

armada_scheduler_job_error_classification_by_queue and _by_node now label failures with the semantic category from error categorization, read off the Error proto (FailureCategory/FailureSubcategory) instead of a regex match against the message. The metric names and label sets are unchanged. Only the label values change.

armada_scheduler_job_error_classification_by_queue{queue="analytics", category="user_error", subcategory=""} 2
armada_scheduler_job_error_classification_by_queue{queue="analytics", category="internal",   subcategory="lease-expired"} 1
armada_scheduler_job_error_classification_by_node{node="worker-1", cluster="c1", category="user_error", subcategory=""} 2

category was the Error.Reason type (podError, leaseExpired, ...) and is now the semantic category, so dashboards filtering on the old values need updating. subcategory was the first matching regex (empty in practice) and is now FailureSubcategory.
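To make the label change concrete, here is a minimal sketch contrasting the two derivations. The types below are hypothetical stand-ins for the armadaevents.Error proto, not the generated code, and the pairing of leaseExpired with internal/lease-expired is only an illustration based on the sample lines above.

```go
package main

import "fmt"

// Hypothetical stand-ins for the proto messages; only the shape matters here.
type PodError struct{}
type LeaseExpired struct{}

type Error struct {
	Reason             any // stand-in for the proto oneof (PodError, LeaseExpired, ...)
	FailureCategory    string
	FailureSubcategory string
}

// oldCategory mimics the previous behaviour: the label is the Reason type name.
func oldCategory(e *Error) string {
	switch e.Reason.(type) {
	case *PodError:
		return "podError"
	case *LeaseExpired:
		return "leaseExpired"
	default:
		return ""
	}
}

// newLabels mimics the new behaviour: both labels are read straight off the error.
func newLabels(e *Error) (category, subcategory string) {
	return e.FailureCategory, e.FailureSubcategory
}

func main() {
	e := &Error{
		Reason:             &LeaseExpired{},
		FailureCategory:    "internal",
		FailureSubcategory: "lease-expired",
	}
	cat, sub := newLabels(e)
	fmt.Printf("old: category=%q subcategory=%q\n", oldCategory(e), "")
	fmt.Printf("new: category=%q subcategory=%q\n", cat, sub)
}
```

A dashboard query that matched category="leaseExpired" would, under this sketch, need to match the new category and subcategory pair instead.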

This replaces trackedErrorRegexes, and the PR removes everything that supported it: the scheduler.metrics.trackedErrorRegexes config key and the errorTypeAndMessageFromError / errorRegexes plumbing in metrics.New. The key was never set in-repo, so a deployment that still sets it just logs an "unused key" warning.

Validation

End to end on a Helm-deployed stack: failing jobs classified by the executor (user_error, oom) and a killed-executor run (internal/lease-expired) all landed on the metric with the expected labels and counts. max-runs-exceeded and job-rejected are job-level rather than run-level errors, so they do not surface in this metric.

The metric names and label sets are unchanged, so the existing PrometheusRule keeps evaluating as before. Pre-existing bugs in that file are fixed separately in #4983.

greptile-apps · 2026-06-25T13:02:38Z

Greptile Summary

This PR replaces regex-based error classification with direct reads from FailureCategory/FailureSubcategory proto fields on armadaevents.Error, removing TrackedErrorRegexes from the config and the associated compilation/matching logic in metrics.New and jobStateMetrics.

metrics.New is now infallible (returns *Metrics directly instead of (*Metrics, error)), and newJobStateMetrics drops the errorRegexes parameter entirely.
Label values for armada_scheduler_job_error_classification_by_queue/node change semantically (e.g., podError → user_error, empty subcategory → oom); dashboards filtering on the old values will need updating.
TestCategoriseErrors is updated to exercise the new proto-sourced labels (infrastructure/oom) end-to-end.
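The constructor change in the first point can be sketched as follows. This is a simplified illustration, not the code in metrics.go: Metrics, newWithRegexes and newInfallible are hypothetical names, and the only claim is the general shape (regex compilation can fail, so the old constructor needed an error return; with nothing to compile, the new one does not).

```go
package main

import (
	"fmt"
	"regexp"
)

// Metrics is a hypothetical stand-in for the scheduler's metrics struct.
type Metrics struct {
	errorRegexes []*regexp.Regexp // only used by the old shape
}

// newWithRegexes mirrors the old shape: compiling configured patterns can fail,
// so the constructor has to return an error alongside the value.
func newWithRegexes(patterns []string) (*Metrics, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("compiling %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return &Metrics{errorRegexes: compiled}, nil
}

// newInfallible mirrors the new shape: nothing to compile, nothing to fail.
func newInfallible() *Metrics {
	return &Metrics{}
}

func main() {
	if _, err := newWithRegexes([]string{"("}); err != nil {
		fmt.Println("old constructor failed:", err)
	}
	m := newInfallible()
	fmt.Println("new constructor returned non-nil:", m != nil)
}
```

Call sites shrink accordingly: there is no error to check or propagate after construction.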

Confidence Score: 5/5

Safe to merge — the change is a straightforward removal of dead regex machinery in favour of reading two proto fields that are already populated by the executor.

The only observable side-effect is a change in metric label values (e.g. podError to user_error), which is explicitly documented. The path from failure detection to counter increment is now a single getter call on a nil-safe proto receiver, and call sites in schedulerapp.go and tests are consistently updated. No new error paths are introduced.
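The "nil-safe proto receiver" point refers to the getter style that protoc-gen-go generates, where a getter called on a nil message returns the zero value instead of panicking. A minimal sketch, using a hand-written stand-in for the generated type (the real armadaevents.Error has more fields and is generated, not written like this):

```go
package main

import "fmt"

// Error is a hypothetical stand-in for the generated armadaevents.Error type.
type Error struct {
	FailureCategory    string
	FailureSubcategory string
}

// GetFailureCategory follows the generated-getter convention: safe on a nil receiver.
func (e *Error) GetFailureCategory() string {
	if e == nil {
		return ""
	}
	return e.FailureCategory
}

// GetFailureSubcategory follows the same convention.
func (e *Error) GetFailureSubcategory() string {
	if e == nil {
		return ""
	}
	return e.FailureSubcategory
}

func main() {
	// A missing error yields empty labels rather than a nil-pointer panic.
	var missing *Error
	fmt.Printf("missing: %q %q\n", missing.GetFailureCategory(), missing.GetFailureSubcategory())

	oom := &Error{FailureCategory: "infrastructure", FailureSubcategory: "oom"}
	fmt.Printf("present: %q %q\n", oom.GetFailureCategory(), oom.GetFailureSubcategory())
}
```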

No files require special attention.

Important Files Changed

Filename	Overview
internal/scheduler/metrics/state_metrics.go	Removes errorRegexes field and failedCategoryAndSubCategoryFromJob/errorTypeAndMessageFromError helpers; calls proto getters directly at the failure-recording site. Clean simplification with no logic regressions.
internal/scheduler/metrics/metrics.go	New() is now infallible — drops regex compilation and the returned error; signature and call sites are updated consistently.
internal/scheduler/configuration/configuration.go	Removes TrackedErrorRegexes field from MetricsConfig; deployments that still set this key will get an unused-key warning, not an error.
internal/scheduler/metrics/state_metrics_test.go	Test signatures updated and TestCategoriseErrors rewritten to use FailureCategory/FailureSubcategory proto fields; TestReset and TestDisable retain pre-existing (harmless) use of state label values for the error-counter slots.
internal/scheduler/schedulerapp.go	Call site updated to match new infallible New() signature; TrackedErrorRegexes argument removed. Clean change.
internal/scheduler/scheduler_test.go	Package-level schedulerMetrics variable updated to match the new infallible New() signature.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Job Run Fails] --> B[ReportStateTransitions]
    B --> C{jst.Failed?}
    C -- Yes --> D[Lookup jobRunError by run ID]
    D --> E[Read FailureCategory and FailureSubcategory from proto]
    E --> F[Increment jobErrorsByQueue counter]
    E --> G[Increment jobErrorsByNode counter]

    subgraph OLD ["Removed - TrackedErrorRegexes path"]
        H[errorTypeAndMessageFromError] --> I[Loop over compiled regexes]
        I --> J[Return type and first matched regex string]
    end

    style OLD fill:#ffcccc,stroke:#cc0000
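The upper path of the flowchart can be sketched in a few lines of Go. Everything here is hypothetical (the type names, the map-based counters, and the choice to record empty labels when no error is found for a run); it only illustrates the sequence of failed transition, lookup by run ID, label read, and two counter increments.

```go
package main

import "fmt"

// runError is a stand-in for the per-run error carrying the two failure fields.
type runError struct{ category, subcategory string }

// transition is a stand-in for one reported job state transition.
type transition struct {
	runID, queue, node, cluster string
	failed                      bool
}

type queueKey struct{ queue, category, subcategory string }
type nodeKey struct{ node, cluster, category, subcategory string }

// counters models the two labelled counters as plain maps.
type counters struct {
	byQueue map[queueKey]int
	byNode  map[nodeKey]int
}

// report walks the transitions and, for each failed run, looks up its error by
// run ID, reads the labels, and increments both counters.
func (c *counters) report(ts []transition, errs map[string]*runError) {
	for _, t := range ts {
		if !t.failed {
			continue
		}
		var cat, sub string
		if e := errs[t.runID]; e != nil { // assumption: a missing error gives empty labels
			cat, sub = e.category, e.subcategory
		}
		c.byQueue[queueKey{t.queue, cat, sub}]++
		c.byNode[nodeKey{t.node, t.cluster, cat, sub}]++
	}
}

func main() {
	c := &counters{byQueue: map[queueKey]int{}, byNode: map[nodeKey]int{}}
	errs := map[string]*runError{
		"run-1": {category: "user_error"},
		"run-2": {category: "user_error"},
		"run-3": {category: "internal", subcategory: "lease-expired"},
	}
	ts := []transition{
		{runID: "run-1", queue: "analytics", node: "worker-1", cluster: "c1", failed: true},
		{runID: "run-2", queue: "analytics", node: "worker-1", cluster: "c1", failed: true},
		{runID: "run-3", queue: "analytics", node: "worker-2", cluster: "c1", failed: true},
		{runID: "run-4", queue: "analytics", node: "worker-2", cluster: "c1", failed: false},
	}
	c.report(ts, errs)
	for k, n := range c.byQueue {
		fmt.Printf("by_queue{queue=%q, category=%q, subcategory=%q} %d\n", k.queue, k.category, k.subcategory, n)
	}
}
```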

Reviews (11): Last reviewed commit: "Merge branch 'master' into rewire-schedu..."

datadog-armadaproject · 2026-06-25T13:17:08Z

⚠️ Warnings

🚦 1 Pipeline job failed

CI | test / Golang Integration Tests

🔗 Commit SHA: 91177c3

mergify · 2026-06-26T13:56:02Z

Tick the box to add this pull request to the merge queue (same as @mergifyio queue).

Queue this pull request

dejanzele force-pushed the rewire-scheduler-error-metric branch from 5047645 to 5eeb998 Compare June 25, 2026 12:58

greptile-apps Bot reviewed Jun 25, 2026

Comment thread internal/scheduler/metrics/state_metrics.go Outdated

dejanzele force-pushed the rewire-scheduler-error-metric branch from 5eeb998 to 91177c3 Compare June 25, 2026 13:06

dejanzele force-pushed the rewire-scheduler-error-metric branch 7 times, most recently from 4e2715f to b0dd7f9 Compare June 26, 2026 12:14

Source scheduler failure metric labels from Error.FailureCategory/FailureSubcategory and remove TrackedErrorRegexes (3af6b9f)
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>

dejanzele force-pushed the rewire-scheduler-error-metric branch from b0dd7f9 to 3af6b9f Compare June 26, 2026 13:01

JamesMurkin approved these changes Jun 26, 2026

dejanzele enabled auto-merge (squash) June 26, 2026 15:18

Merge branch 'master' into rewire-scheduler-error-metric

0b33010

dejanzele merged commit 1eface2 into armadaproject:master Jun 26, 2026
17 checks passed

dejanzele merged 2 commits into armadaproject:master from dejanzele:rewire-scheduler-error-metric
