Version of Eraser
v1.3.1
Expected Behavior
I have multiple clusters running Eraser with v1.3.1, and we've set our success ratio pretty low (80%, down from 95%) because we couldn't get Eraser to mark the ImageJob as successful. Looking at the logs, it seems there's a bug in the success-ratio math that causes Eraser to report 0% success when one or two pods fail in an unusual way.
{"level":"info","ts":1755535148.5651102,"logger":"controller","msg":"Marking job as failed","process":"imagejob-controller","success ratio":0.8,"actual ratio":0}
In reality, the job had 272 successful nodes and one node where the pod ended up in an outOfCpu state. We've seen the same thing on other clusters where nodes were under memory pressure instead of CPU pressure.
Expected behavior: the ImageJob is marked as successful (272/273 ≈ 99.6%, well above the 80% threshold) and the pods are cleaned up (we have .runtimeConfig.manager.imageJob.cleanup.delayOnSuccess set to 0s).
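For reference, here's a minimal sketch of the arithmetic I'd expect, using the numbers from the run above (this is a hypothetical illustration, not Eraser's actual controller code):

```go
package main

import "fmt"

func main() {
	// Numbers from the ImageJob above: 272 nodes succeeded, 1 pod was
	// rejected with an outOfCpu status.
	succeeded := 272
	failed := 1
	total := succeeded + failed

	actualRatio := float64(succeeded) / float64(total)
	successRatio := 0.80 // our configured runtimeConfig.manager.imageJob.successRatio

	fmt.Printf("actual ratio: %.3f\n", actualRatio) // prints 0.996
	if actualRatio >= successRatio {
		fmt.Println("expected: ImageJob marked successful")
	} else {
		fmt.Println("ImageJob marked failed")
	}
}
```

Instead, the controller logs an actual ratio of 0, as if the 272 successes weren't counted at all.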
Actual Behavior
Actual behavior: the ImageJob fails with a 0% success ratio and the pods aren't cleaned up (we have .runtimeConfig.manager.imageJob.cleanup.delayOnFailure set to 5h).
Steps To Reproduce
K8s v1.32.6
Eraser helm chart v1.3.1
helm values:
runtimeConfig:
  manager:
    nodeFilter:
      type: exclude
      selectors:
        - eraser.sh/exclude-node # exclude nodes with this label
    scheduling:
      repeatInterval: "6h" # default is 24h
    imageJob:
      successRatio: 0.80 # 80% success ratio for image jobs to be considered 'successful'. Needs to be lower than 100% to account for cpu/memory pressure that causes the job to fail occasionally.
      cleanup:
        delayOnSuccess: "0s" # clean up pods immediately after success
        delayOnFailure: "5h" # keep the pods around for 5 hours after failure to allow for investigation
Then put a node under enough CPU/memory pressure that an imagejob pod fails with outOfCpu or outOfMemory.
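If you need to reproduce the outOfCpu failure on demand, one hypothetical approach (not how we hit it, which was ordinary production load) is to pin a filler pod directly to a node and request nearly all of its free allocatable CPU, so the kubelet rejects the next imagejob pod pinned to that node. The node name, namespace, and CPU figure below are assumptions you'd need to adjust:

```go
// Hypothetical repro sketch: create a filler pod pinned to one node (bypassing
// the scheduler) that requests nearly all of the node's free allocatable CPU.
// Once it's admitted, the next eraser imagejob pod pinned to that node should
// be rejected by the kubelet with reason OutOfcpu.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	filler := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cpu-filler", Namespace: "default"},
		Spec: corev1.PodSpec{
			// Setting nodeName directly bypasses the scheduler's resource checks.
			NodeName:      "worker-node-1", // assumption: replace with a real node name
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "filler",
				Image: "registry.k8s.io/pause:3.9",
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						// Assumption: just under the node's free allocatable CPU,
						// so this pod is admitted but leaves ~nothing for others.
						corev1.ResourceCPU: resource.MustParse("3900m"),
					},
				},
			}},
		},
	}

	if _, err := client.CoreV1().Pods("default").Create(context.TODO(), filler, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("filler pod created; wait for the next eraser run and check the imagejob pod's status reason")
}
```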
Are you willing to submit PRs to contribute to this bug fix?