Categorize internal failures with a static failure category#4972
Open
dejanzele wants to merge 3 commits into
Open
Categorize internal failures with a static failure category#4972dejanzele wants to merge 3 commits into
dejanzele wants to merge 3 commits into
Conversation
Contributor
Comment on lines
+511
to
+519
| var failureSubcategory string | ||
| switch issue.RunIssue.PodIssue.Type { | ||
| case UnableToSchedule: | ||
| failureSubcategory = errormatch.SubcategoryUnschedulable | ||
| case StuckStartingUp: | ||
| failureSubcategory = errormatch.SubcategoryStuckStartingUp | ||
| case FailedStartingUp: | ||
| failureSubcategory = errormatch.SubcategoryFailedStartingUp | ||
| } |
Contributor
There was a problem hiding this comment.
Non-exhaustive switch leaves subcategory silently empty
The switch covers the three retryable types that exist today (UnableToSchedule, StuckStartingUp, FailedStartingUp), but has no default branch. If a future retryable podIssueType is added without updating this switch, the event will carry FailureCategory = "internal" with an empty FailureSubcategory = "" — a category/subcategory mismatch that is hard to catch because the code compiles and runs without error. Adding a default: failureSubcategory = "unknown" (or logging an unexpected-type warning) would make such a gap visible immediately.
0af53eb to
4cbb8a5
Compare
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
4cbb8a5 to
b6b3cfc
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Armada-generated job-run failures previously carried no
failure_category/failure_subcategory. Only operator-classified pod failures (via the executor categorizer) were tagged, so any dashboard attributing "why did this run end" had a large unlabeled remainder for everything Armada itself decided.This change stamps the failures Armada itself authors with a static, code-owned category
internalplus a subcategory naming the cause. The boundary is the source of the error.internalcovers errors Armada produces from its own logic, where it writes a fixed, self-authored message. Errors where Armada merely relays dynamic external content (the kubelet, the K8s API, the scheduler) are left to the operator categorizer, which attributes them into the operator's own categories (infra, user_error, and so on) using its configured default when no rule matches. An operator never has to match Armada's internal error strings, and the top-level value reads as a triage signal:internalmeans the failure was Armada's own machinery, not the workload.Stamped
internal:For the structural pod issues the categorization is deterministic: the categorizer is not consulted, so a configured rule cannot override
internal. The only fallback anywhere is the categorizer's own defaultCategory, which applies to the external causes below.Not stamped
internal:The subcategory vocabulary lives in
internal/common/errormatch(a dependency-free leaf already holding the sibling condition constants), so both the executor and the scheduler stamp from one shared set without an import cycle.No proto change, no migration. The change populates the existing
Error.FailureCategory/Error.FailureSubcategoryproto fields. One observable metric effect: the executor failure counterarmada_executor_job_failure_category_totalnow reportsinternalfor the structural pod issues, which previously emitted no category (the counter is a no-op on an empty category). This is intentional.Where the stamps are visible today:
JobRunErrorsand reach the Lookoutjob_run.failure_category/failure_subcategorycolumns.JobErrors(notJobRunErrors), so they do not reach the Lookoutjob_runcolumns. The api event conversion only copies these fields on thePodErrorarm, so they do not reach the api stream either. Their stamps are not observable anywhere today. They are set for construction-site consistency and to be ready whenJobErrorspersistence or the api conversion is extended.Follow-up (separate PR): an
onPodIssuestructured matcher so operators can categorize the external Armada-detected causes (stuck-starting-up, unschedulable, and optionally the structural ones) into their own categories without regexing Armada's messages, plus running the categorizer on the lease-return and submission paths.