Source scheduler failure metric labels from error categorization, remove TrackedErrorRegexes #4980
Greptile Summary
This PR replaces regex-based error classification with direct reads from
Confidence Score: 5/5
Safe to merge. The change is a straightforward removal of dead regex machinery in favour of reading two proto fields that are already populated by the executor. The only observable side-effect is a change in metric label values (e.g. podError to user_error), which is explicitly documented. The path from failure detection to counter increment is now a single getter call on a nil-safe proto receiver, and call sites in schedulerapp.go and tests are consistently updated. No new error paths are introduced. No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Job Run Fails] --> B[ReportStateTransitions]
B --> C{jst.Failed?}
C -- Yes --> D[Lookup jobRunError by run ID]
D --> E[Read FailureCategory and FailureSubcategory from proto]
E --> F[Increment jobErrorsByQueue counter]
E --> G[Increment jobErrorsByNode counter]
subgraph OLD ["Removed - TrackedErrorRegexes path"]
H[errorTypeAndMessageFromError] --> I[Loop over compiled regexes]
I --> J[Return type and first matched regex string]
end
style OLD fill:#ffcccc,stroke:#cc0000
…lureSubcategory and remove TrackedErrorRegexes Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
What Armada exposes now
armada_scheduler_job_error_classification_by_queue and _by_node now label failures with the semantic category from error categorization, read off the Error proto (FailureCategory / FailureSubcategory) instead of a regex match against the message. The metric names and label sets are unchanged. Only the label values change. category was the Error.Reason type (podError, leaseExpired, ...) and is now the semantic category, so dashboards filtering on the old values need updating. subcategory was the first matching regex (empty in practice) and is now FailureSubcategory.
This replaces trackedErrorRegexes, which is removed along with it: the scheduler.metrics.trackedErrorRegexes config and the errorTypeAndMessageFromError / errorRegexes plumbing in metrics.New. It was never set in-repo, so a deployment still setting it just logs an "unused key" warning.
Validation
End to end on a Helm-deployed stack: failing jobs classified by the executor (user_error, oom) and a killed-executor run (internal/lease-expired) all landed on the metric with the expected labels and counts. max-runs-exceeded and job-rejected are job-level rather than run-level errors, so they do not surface in this metric.
The metric names and label sets are unchanged, so the existing PrometheusRule keeps evaluating as before. Pre-existing bugs in that file are fixed separately in #4983.