Skip to content

Introduce deadline for init containers#4960

Open
nikola-jokic wants to merge 6 commits into
masterfrom
nikola-jokic/deadline-for-init-containers
Open

Introduce deadline for init containers#4960
nikola-jokic wants to merge 6 commits into
masterfrom
nikola-jokic/deadline-for-init-containers

Conversation

@nikola-jokic

Copy link
Copy Markdown
Contributor

What type of PR is this?

Enhancement

What this PR does / why we need it

Init containers could take time that is counted in the startup time of main containers. This change includes init containers so that checks related to the main container are based on the time where the last transition time occurred, including the init container state change.

The deadline doesn't include side-car containers (init containers with restart policy always).

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces a configurable deadlineForInitContainers that measures elapsed time from when the first init container starts running, failing the pod if all regular (non-sidecar) init containers haven't completed within the window. It also moves the LastStatusChange call and clock dependency into PodChecks itself, and expands LastStatusChange to include init container status transitions.

  • New init container deadline: getInitContainerState tracks the earliest start time across all non-sidecar init containers and fires ActionFail if they haven't collectively completed within deadlineForInitContainers; sidecar init containers (restartPolicy: Always) are excluded.
  • GetAction signature change: timeInState is no longer passed by the caller; the clock and LastStatusChange call are now encapsulated inside PodChecks, requiring a clock.Clock at construction time.
  • LastStatusChange broadened: now considers InitContainerStatuses alongside ContainerStatuses, so the bad-node and node-assignment deadline timers reset on init container transitions.

Confidence Score: 4/5

The runtime logic is sound, but one newly added test contains a message string that doesn't match what the production code emits, meaning that test will fail on CI.

The core init container deadline logic, sidecar exclusion, and LastStatusChange broadening all look correct. The one concrete defect is in pod_checks_test.go: Test_GetAction_InitContainerDeadlineUsesFirstInitStart asserts "is running" but pod_checks.go emits "is still running", so assert.Contains will return false and the test will fail. The rest of the changes (clock injection, config field, handler simplification) are straightforward.

internal/executor/podchecks/pod_checks_test.go — the message assertion in the first init container deadline test needs to match the production string.

Important Files Changed

Filename Overview
internal/executor/podchecks/pod_checks.go Core logic for init container deadline check added; GetAction now internalises clock and LastStatusChange call; getInitContainerState correctly skips sidecars and distinguishes first/current/last containers.
internal/executor/podchecks/pod_checks_test.go New init container deadline tests added; one test (Test_GetAction_InitContainerDeadlineUsesFirstInitStart) contains a message string mismatch ("is running" vs code's "is still running") that will cause it to fail.
internal/executor/util/pod_util.go LastStatusChange now iterates over both InitContainerStatuses and ContainerStatuses so the timeInState baseline reflects init container transitions.
internal/executor/service/pod_issue_handler.go LastStatusChange error handling moved into GetAction; detectPodIssues simplified with equivalent behavior (ActionWait on error).
internal/executor/service/pod_issue_handler_test.go Updated makePendingPodChecker to pass RealClock to NewPodChecks to match the new signature; no logic changes.
internal/executor/configuration/podchecks/types.go Adds DeadlineForInitContainers duration field to Checks config struct.
config/executor/config.yaml Adds deadlineForInitContainers: 5m to the default pending pod checks config.
internal/executor/application.go Passes clock.RealClock{} to NewPodChecks to satisfy the updated constructor signature.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GetAction called] --> B[LastStatusChange pod]
    B -->|error| C[return ActionWait]
    B -->|ok| D{deadlineForInitContainers > 0?}
    D -->|no| H
    D -->|yes| E[getInitContainerState]
    E --> F{first != nil AND last == nil AND elapsed > deadline?}
    F -->|no| H[compute timeInState]
    F -->|yes| G[return ActionFail / PodStartupIssue]
    H --> I{timeInState > deadlineForNodeAssignment AND no node?}
    I -->|yes| J[return ActionRetry / NoNodeAssigned]
    I -->|no| K{timeInState > deadlineForUpdates AND bad node?}
    K -->|yes| L[return ActionRetry / NoStatusUpdates]
    K -->|no| M[eventChecks + containerStateChecks]
    M --> N[return maxAction]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[GetAction called] --> B[LastStatusChange pod]
    B -->|error| C[return ActionWait]
    B -->|ok| D{deadlineForInitContainers > 0?}
    D -->|no| H
    D -->|yes| E[getInitContainerState]
    E --> F{first != nil AND last == nil AND elapsed > deadline?}
    F -->|no| H[compute timeInState]
    F -->|yes| G[return ActionFail / PodStartupIssue]
    H --> I{timeInState > deadlineForNodeAssignment AND no node?}
    I -->|yes| J[return ActionRetry / NoNodeAssigned]
    I -->|no| K{timeInState > deadlineForUpdates AND bad node?}
    K -->|yes| L[return ActionRetry / NoStatusUpdates]
    K -->|no| M[eventChecks + containerStateChecks]
    M --> N[return maxAction]
Loading

Reviews (6): Last reviewed commit: "wip" | Re-trigger Greptile

Comment thread internal/executor/podchecks/pod_checks.go Outdated
Comment on lines 91 to 95

func Test_GetAction_BadNodeButUnderTimeLimit(t *testing.T) {
podChecks := podChecksWithMocks(ActionWait, ActionWait)
result, cause, message := podChecks.GetAction(createBasicPod(true), []*v1.Event{}, 10*time.Second)
result, cause, message := podChecks.GetAction(createBasicPodInStateFor(true, 10*time.Second), []*v1.Event{})
assert.Equal(t, result, ActionWait)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing test for init container in Waiting/CrashLoopBackoff state

There is test coverage for Waiting (before first start), Running, and Terminated (exit 0 and multi-container partial success), but no test for the state where an init container previously ran with a non-zero exit code and is now in Waiting state (CrashLoopBackoff). That scenario exercises the LastState gap described in the sibling comment on pod_checks.go, and a test would confirm whether the deadline is expected to fire or not during the backoff window.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Comment thread internal/executor/podchecks/pod_checks_test.go Outdated
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
@datadog-armadaproject

datadog-armadaproject Bot commented Jun 22, 2026

Copy link
Copy Markdown

Pipelines

⚠️ Warnings

🚦 2 Pipeline jobs failed

CI | All jobs succeeded   View in Datadog   GitHub Actions

CI | test / Golang Unit Tests   View in Datadog   GitHub Actions

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: ff14288 | Docs | Give us feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant