Skip to content

Fix flaky TestEventServer_GetJobSetEvents context budget#4925

Merged
dejanzele merged 2 commits into
armadaproject:masterfrom
dejanzele:fix-event-test-shared-ctx
May 22, 2026
Merged

Fix flaky TestEventServer_GetJobSetEvents context budget#4925
dejanzele merged 2 commits into
armadaproject:masterfrom
dejanzele:fix-event-test-shared-ctx

Conversation

@dejanzele

@dejanzele dejanzele commented May 21, 2026

Copy link
Copy Markdown
Member

TestEventServer_GetJobSetEvents_ErrorIfMissing (4 subtests) and TestEventServer_GetJobSetEvents_Permissions (3 subtests) share a single 5s armadacontext.WithTimeout across all their subtests. Each subtest spins up a fresh lookout DB and runs every migration via withEventServer and WithLookoutDb, so the per-subtest cost is not trivial.

On a slow CI runner the cumulative cost blew the 5s budget. We hit this on PR #4920 where 4 subtests of _ErrorIfMissing took 5.74s combined and the last one died with context deadline exceeded inside PostgresQueueRepository.upsertQueue.

@greptile-apps

greptile-apps Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes flaky tests in TestEventServer_GetJobSetEvents_ErrorIfMissing and TestEventServer_GetJobSetEvents_Permissions by moving the armadacontext.WithTimeout call inside each t.Run block, giving every subtest its own independent 5-second deadline instead of sharing a single one across all subtests.

  • The shared outer context is removed; each t.Run closure now creates and defers its own ctx/cancel, so slow CI runners can no longer exhaust the budget before later subtests start.
  • Both test functions are simultaneously refactored to table-driven style, extracting a reusable jobRunAssignedEvent() helper and eliminating ~100 lines of repetitive setup code.

Confidence Score: 5/5

This change only touches test code; no production logic is affected and the fix is structurally sound.

Each subtest now creates its own independent 5-second context inside its t.Run closure, directly addressing the cumulative-budget exhaustion described in the PR. The table-driven refactor reduces duplication without introducing new test logic. The defer cancel() is correctly scoped to the closure, loop-variable capture is safe (subtests are sequential), and no logic regressions were found across the four _ErrorIfMissing and three _Permissions cases.

No files require special attention.

Important Files Changed

Filename Overview
internal/server/event/event_test.go Per-subtest context isolation correctly implemented; table-driven refactor is clean with no logic regressions found.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["TestEventServer_GetJobSetEvents_ErrorIfMissing"] --> B["for name, tc := range tests"]
    B --> C["t.Run(name, ...)"]
    C --> D["ctx, cancel := WithTimeout(5s)  ← NEW per-subtest"]
    D --> E["defer cancel()"]
    E --> F["withEventServer(ctx, t, ...)"]
    F --> G{"tc.publishEvent?"}
    G -- yes --> H["reportPulsarEvent(ctx, ...)"]
    G -- no --> I["GetJobSetEvents(...)"]
    H --> I
    I --> J{"tc.expectErrorCode == OK?"}
    J -- yes --> K["require.NoError"]
    J -- no --> L["assert gRPC error code"]
    K --> M["assert len(stream.sendMessages)"]
    L --> M
Loading

Reviews (8): Last reviewed commit: "Merge branch 'master' into fix-event-tes..." | Re-trigger Greptile

@dejanzele dejanzele force-pushed the fix-event-test-shared-ctx branch 2 times, most recently from 4d0ef99 to 6e298b5 Compare May 21, 2026 12:01
Comment thread internal/server/event/event_test.go Outdated
@dejanzele dejanzele force-pushed the fix-event-test-shared-ctx branch from 6e298b5 to 554ac0d Compare May 21, 2026 17:56
d2burkhalter
d2burkhalter previously approved these changes May 21, 2026
@dejanzele dejanzele force-pushed the fix-event-test-shared-ctx branch 2 times, most recently from e336956 to f0afe52 Compare May 22, 2026 08:47
@dejanzele

Copy link
Copy Markdown
Member Author

@greptileai

The two tests shared a single 5s armadacontext.WithTimeout across all their
subtests. Each subtest spins up a fresh lookout DB and runs every migration
via withEventServer and WithLookoutDb, so the per-subtest cost is not
trivial. Under CI load on a slow runner the cumulative cost blew the
budget, surfacing as context deadline exceeded inside CreateQueue.

Two changes here:
- Each subtest now gets its own 5s budget rather than sharing one. Adding
  new cases no longer tightens the envelope for existing ones.
- Both tests converted to map-based table-driven, with the shared
  event-publishing boilerplate extracted into a jobRunAssignedEvent helper.

Net -108 lines, same coverage.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the fix-event-test-shared-ctx branch from f0afe52 to aa1da3a Compare May 22, 2026 09:11
@dejanzele dejanzele enabled auto-merge (squash) May 22, 2026 10:50
@dejanzele dejanzele merged commit 61f6ce1 into armadaproject:master May 22, 2026
17 checks passed
nikola-jokic pushed a commit that referenced this pull request Jun 4, 2026
`TestEventServer_GetJobSetEvents_ErrorIfMissing` (4 subtests) and
`TestEventServer_GetJobSetEvents_Permissions` (3 subtests) share a
single 5s `armadacontext.WithTimeout` across all their subtests. Each
subtest spins up a fresh lookout DB and runs every migration via
`withEventServer` and `WithLookoutDb`, so the per-subtest cost is not
trivial.

On a slow CI runner the cumulative cost blew the 5s budget. We hit this
on PR #4920 where 4 subtests of `_ErrorIfMissing` took 5.74s combined
and the last one died with `context deadline exceeded` inside
`PostgresQueueRepository.upsertQueue`.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants