feat(scheduler): Extract short job penalty service #4969

Open
tgucks wants to merge 6 commits into master from extract-short-job-penalty-svc

Conversation

@tgucks tgucks commented Jun 16, 2026

Collaborator

What type of PR is this?

Feature/refactor

What this PR does / why we need it

Extracts short-job-penalty tracking out of the scheduling hot path and the jobDb retention logic into a dedicated, self-contained ShortJobPenalty service.

Previously the penalty was recomputed every scheduling cycle by scanning all terminal jobs held in the jobDb. To make those jobs available for the scan, terminal short jobs were deliberately kept in the jobDb while their penalty was active, which in turn required a periodic full GC to clean them up. This PR makes ShortJobPenalty own its own state:

  • Terminal jobs are reported to the service once each via ReportFinishedJob at each point where a job can go terminal. This records the job's resources keyed by (pool, queue).
  • Penalties are snapshotted once per scheduling cycle via Snapshot() and read back per-pool from that immutable ShortJobPenaltySnapshot via GetPenaltiesForPool, replacing the inline per-job ShouldApplyPenalty accumulation in calculateJobSchedulingInfo.
  • Entries expire automatically via a deadline-ordered min-heap (runStart + cutoff[pool]), with a derived per-(pool, queue) running total cache kept in sync as entries are added and expired. Access is guarded by a mutex.

This removes the need for the periodic full GC of terminal jobs from the jobDb. They're deleted as they go terminal now.
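
The expiry mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `tracker`, `report`, `totals`, and the single `cpu` counter are hypothetical stand-ins for ShortJobPenalty, ReportFinishedJob, Snapshot, and the real resource list, and dedup of repeated reports is left out.

```go
package main

import (
	"container/heap"
	"fmt"
	"sync"
	"time"
)

// entry is one reported job; deadline is runStart + cutoff for its pool.
type entry struct {
	pool, queue string
	cpu         int64 // stand-in for the real resource list
	deadline    time.Time
}

// entryHeap is a min-heap ordered by deadline.
type entryHeap []entry

func (h entryHeap) Len() int           { return len(h) }
func (h entryHeap) Less(i, j int) bool { return h[i].deadline.Before(h[j].deadline) }
func (h entryHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *entryHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *entryHeap) Pop() any {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

type key struct{ pool, queue string }

// tracker is a hypothetical, simplified stand-in for ShortJobPenalty.
type tracker struct {
	mu      sync.Mutex
	cutoffs map[string]time.Duration
	entries entryHeap
	sums    map[key]int64 // derived cache; entries are the source of truth
}

// report records a finished job if now is still before its penalty deadline.
func (t *tracker) report(pool, queue string, cpu int64, runStart, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	deadline := runStart.Add(t.cutoffs[pool])
	if !now.Before(deadline) {
		return
	}
	heap.Push(&t.entries, entry{pool: pool, queue: queue, cpu: cpu, deadline: deadline})
	t.sums[key{pool, queue}] += cpu
}

// totals pops every entry whose deadline has passed, then copies the sums.
func (t *tracker) totals(now time.Time) map[key]int64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	for t.entries.Len() > 0 && !t.entries[0].deadline.After(now) {
		e := heap.Pop(&t.entries).(entry)
		k := key{e.pool, e.queue}
		if t.sums[k] -= e.cpu; t.sums[k] == 0 {
			delete(t.sums, k)
		}
	}
	out := make(map[key]int64, len(t.sums))
	for k, v := range t.sums {
		out[k] = v
	}
	return out
}

// example reports two jobs and reads the totals before and after one expires.
func example() (before, after int64) {
	t0 := time.Date(2026, 6, 16, 12, 0, 0, 0, time.UTC)
	tr := &tracker{
		cutoffs: map[string]time.Duration{"pool-a": 5 * time.Minute},
		sums:    map[key]int64{},
	}
	tr.report("pool-a", "queue-1", 4, t0, t0.Add(time.Minute))
	tr.report("pool-a", "queue-1", 2, t0.Add(2*time.Minute), t0.Add(3*time.Minute))
	k := key{"pool-a", "queue-1"}
	before = tr.totals(t0.Add(4 * time.Minute))[k] // both entries active
	after = tr.totals(t0.Add(6 * time.Minute))[k]  // first entry expired at t0+5m
	return before, after
}

func main() {
	before, after := example()
	fmt.Println(before, after) // prints "6 2"
}
```

Because the heap is ordered by deadline, each read only pops the entries that have actually lapsed instead of rescanning every tracked job.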

Special notes for your reviewer

  • ShortJobPenalty is now stateful and should be concurrency-safe (sync.Mutex); the entries are the source of truth and sums is a derived cache. A Snapshot() taken once per cycle gives every pool a consistent point-in-time view.
  • syncState's signature changed (dropped the fullJobGc bool); call sites in cycle and initialize were updated accordingly.
  • Tests in short_job_penalty_test.go were substantially expanded to cover reporting, dedup, expiry, and per-pool reads.
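
To illustrate why a once-per-cycle snapshot gives every pool the same point-in-time view, here is a small sketch. The `service` and `snapshot` types, the `add` and `penaltiesForPool` methods, and the `int64` totals are assumptions made for the example; the PR's actual types are ShortJobPenalty and ShortJobPenaltySnapshot, whose fields are not shown on this page.

```go
package main

import (
	"fmt"
	"sync"
)

// service is a hypothetical stand-in for the stateful, mutex-guarded service.
type service struct {
	mu   sync.Mutex
	sums map[string]map[string]int64 // pool -> queue -> penalised total
}

// snapshot is a point-in-time copy; reading it needs no lock.
type snapshot struct {
	byPool map[string]map[string]int64
}

// add accumulates a value for (pool, queue) under the lock.
func (s *service) add(pool, queue string, v int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.sums[pool] == nil {
		s.sums[pool] = map[string]int64{}
	}
	s.sums[pool][queue] += v
}

// Snapshot deep-copies the sums so later writes cannot affect the result.
func (s *service) Snapshot() snapshot {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]map[string]int64, len(s.sums))
	for pool, queues := range s.sums {
		cp := make(map[string]int64, len(queues))
		for q, v := range queues {
			cp[q] = v
		}
		out[pool] = cp
	}
	return snapshot{byPool: out}
}

// penaltiesForPool returns the totals for one pool; callers treat it as read-only.
func (s snapshot) penaltiesForPool(pool string) map[string]int64 {
	return s.byPool[pool]
}

// example takes a snapshot, then lets ten concurrent writers mutate the service.
func example() (old, fresh int64) {
	svc := &service{sums: map[string]map[string]int64{}}
	svc.add("pool-a", "queue-1", 4)
	snap := svc.Snapshot()

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			svc.add("pool-a", "queue-1", 1)
		}()
	}
	wg.Wait()

	old = snap.penaltiesForPool("pool-a")["queue-1"]
	fresh = svc.Snapshot().penaltiesForPool("pool-a")["queue-1"]
	return old, fresh
}

func main() {
	old, fresh := example()
	fmt.Println(old, fresh) // prints "4 14"
}
```

The earlier snapshot keeps reporting 4 even after the concurrent writes land, which is the property the scheduling cycle relies on.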

greptile-apps Bot commented Jun 16, 2026

Contributor

Greptile Summary

This PR extracts short-job-penalty tracking from the scheduling hot path into a dedicated, mutex-protected ShortJobPenalty service that owns its own state via a deadline-ordered min-heap and a derived (pool, queue) running-total cache. Terminal jobs are now deleted from the jobDb immediately, and penalties are registered via ReportFinishedJob at three explicit termination sites.

  • ShortJobPenalty now accumulates entries on ReportFinishedJob, expires them lazily on each Snapshot() or write call, and exposes an immutable ShortJobPenaltySnapshot consumed once per scheduling cycle.
  • jobdb.go drops the terminalJobs immutable set and all associated goroutine logic, reducing the Upsert fan-out from 7 to 6 goroutines.
  • syncState loses the fullJobGc bool parameter and its periodic full-GC path; terminal jobs are now always deleted on the cycle they arrive.

Confidence Score: 3/5

Safe to merge for steady-state operation; scheduler restarts within the short-job cutoff window will silently discard all in-flight penalties.

On every scheduler restart, terminal jobs loaded by FetchInitialJobs arrive at ReconcileDifferences already fully terminal. They are deleted from the in-memory jobDb without ReportFinishedJob being called, because generateUpdateMessagesFromJob early-returns for any job where InTerminalState() is true, and initialise discards the returned job slice entirely. Any queue that ran short jobs within the cutoff window before the restart enters the next scheduling cycle without those penalties applied.

internal/scheduler/scheduler.go — the syncState terminal-deletion loop and the initialise call site; adding ReportFinishedJob calls there would close the restart gap

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/scheduler/scheduling/short_job_penalty.go | Core refactored service: mutex-guarded heap + sums cache with expiry; logic is sound, but the caller-must-hold-lock contract on shouldApplyPenalty is undocumented (flagged in prior review thread) |
| internal/scheduler/scheduling/short_job_penalty_types.go | New types file: ShortJobPenalty struct, ShortJobPenaltySnapshot, and entryHeap min-heap, all correct; heap interface compile-time assertion is a nice touch |
| internal/scheduler/scheduler.go | ReportFinishedJob is called from three new sites but missed for externally-completed jobs whose terminal state is already set by ReconcileDifferences; initialise path loses all in-flight penalties on restart |
| internal/scheduler/jobdb/jobdb.go | Removes the terminalJobs immutable set and all associated tracking; clean, no issues |
| internal/scheduler/scheduling/scheduling_algo.go | Snapshot taken once per Schedule() call and threaded through runPoolSchedulingRound/newFairSchedulingAlgoContext/calculateJobSchedulingInfo; shortJobPenaltyByQueue now sourced from snapshot instead of inline accumulation |
| internal/scheduler/scheduler_test.go | New integration test covers the three terminalisation sites; missing coverage for the initialise/restart path where terminal jobs are loaded from DB without ReportFinishedJob being called |
| internal/scheduler/jobdb/jobdb_test.go | Removes terminal-job tracking tests consistent with the feature removal; no issues |
| internal/scheduler/scheduling/short_job_penalty_test.go | Substantially expanded with dedup, expiry, partial expiry, per-pool isolation, and concurrent-safety tests; the succeeded-job test seeds a non-terminal job state that differs from production ReconcileDifferences output |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Executor
    participant DB
    participant syncState
    participant generateUpdateMessages
    participant ShortJobPenalty
    participant Schedule

    Executor->>DB: "run.succeeded = true"
    syncState->>DB: FetchJobUpdates (run update, job.succeeded still false)
    syncState->>syncState: "ReconcileDifferences - job non-terminal (run=succeeded, job=running)"
    syncState-->>generateUpdateMessages: jobDbJobs (non-terminal job with succeeded run)
    generateUpdateMessages->>generateUpdateMessages: lastRun.Succeeded() → WithSucceeded(true)
    generateUpdateMessages->>ShortJobPenalty: ReportFinishedJob ✓
    generateUpdateMessages->>DB: publish Succeeded event

    Note over DB: Pulsar consumer marks job.succeeded=true

    syncState->>DB: "next cycle FetchJobUpdates (job.succeeded=true now)"
    syncState->>syncState: job already terminal in jobDb - no-op change
    syncState->>syncState: BatchDelete terminal job

    Note over ShortJobPenalty: On RESTART
    syncState->>DB: "FetchInitialJobs (job.succeeded=true AND run.succeeded=true)"
    syncState->>syncState: "ReconcileDifferences → job.WithSucceeded(true) → InTerminalState=true"
    syncState->>syncState: BatchDelete without ReportFinishedJob ✗
    ShortJobPenalty-->>Schedule: no penalty for recently-completed short jobs

Comments Outside Diff (1)

  1. internal/scheduler/scheduler.go, line 863-866 (link)

    P1 ReportFinishedJob is never called for externally-completed jobs

    generateUpdateMessagesFromJob early-returns nil for any job that is already in a terminal state, silently skipping the ReportFinishedJob call. In practice, when syncState fetches a job whose jobs.succeeded/failed/cancelled DB column is already true, ReconcileDifferences → reconcileJobDifferences calls job.WithSucceeded(true) (line 186 of reconciliation.go), producing a job with InTerminalState() == true. That job flows into generateUpdateMessages and hits this early return, so the penalty is never registered.

    The code path through ReportFinishedJob that the tests cover (seeding a job with only run.Succeeded=true, not job.Succeeded=true) does not match what ReconcileDifferences produces when the DB row reflects both the run and the job as terminal, which is the common case once the job-status-updater has committed. The fix is to call s.shortJobPenalty.ReportFinishedJob(j) in syncState's terminal-deletion loop, before txn.BatchDelete, so every job deleted there is reported regardless of which DB column was updated first.
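
    The suggested fix can be sketched as below. All types here (`job`, `penaltyService`, `txn`) and the `deleteTerminal` helper are hypothetical stand-ins rather than the scheduler's real types, and the sketch assumes ReportFinishedJob is idempotent per job ID, which the PR's dedup tests suggest but this page does not show.

```go
package main

import "fmt"

// job, penaltyService and txn are hypothetical stand-ins for illustration.
type job struct {
	id       string
	terminal bool
}

type penaltyService struct {
	seen     map[string]bool
	reported []string
}

// ReportFinishedJob ignores a job that has already been reported.
func (p *penaltyService) ReportFinishedJob(j job) {
	if p.seen[j.id] {
		return
	}
	p.seen[j.id] = true
	p.reported = append(p.reported, j.id)
}

type txn struct{ jobs map[string]job }

// BatchDelete removes the given job IDs from the in-memory store.
func (t *txn) BatchDelete(ids []string) {
	for _, id := range ids {
		delete(t.jobs, id)
	}
}

// deleteTerminal reports every terminal job before removing it, so the
// penalty is registered no matter which code path made the job terminal.
func deleteTerminal(t *txn, p *penaltyService, updated []job) {
	var ids []string
	for _, j := range updated {
		if !j.terminal {
			continue
		}
		p.ReportFinishedJob(j)
		ids = append(ids, j.id)
	}
	t.BatchDelete(ids)
}

// example models a restart: job-1 was already reported via the update-message
// path, job-2 arrives from the DB already terminal, job-3 is still running.
func example() (reported []string, remaining int) {
	p := &penaltyService{seen: map[string]bool{}}
	jobs := []job{{"job-1", true}, {"job-2", true}, {"job-3", false}}
	t := &txn{jobs: map[string]job{}}
	for _, j := range jobs {
		t.jobs[j.id] = j
	}
	p.ReportFinishedJob(jobs[0]) // earlier report; must not be double counted
	deleteTerminal(t, p, jobs)
	return p.reported, len(t.jobs)
}

func main() {
	reported, remaining := example()
	fmt.Println(reported, remaining) // prints "[job-1 job-2] 1"
}
```

    Reporting at the deletion site makes the call independent of which DB column was updated first, at the cost of relying on dedup to avoid double counting.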

Reviews (4): Last reviewed commit: "Rename types for clarity"

Comment thread internal/scheduler/scheduler.go Outdated
Comment on lines 57 to +81
@@ -49,5 +77,117 @@ func (sjp *ShortJobPenalty) ShouldApplyPenalty(job *jobdb.Job) bool {
 		return false
 	}
 
-	return sjp.now.Sub(*jobStart) < sjp.cutoffDurationByPool[jobRun.Pool()]
+	return sjp.now.Sub(*jobStart) < sjp.cutoffs[jobRun.Pool()]
 }


P2 Undocumented mutex precondition on shouldApplyPenalty

shouldApplyPenalty reads sjp.now and sjp.cutoffs directly without acquiring sjp.mu. This is currently safe because every call site (ReportFinishedJob) already holds the lock before dispatching here. However, the function carries an implicit contract — it must be called with the mutex held — which isn't documented. A future internal caller that invokes it without the lock would silently introduce a data race that only surfaces under -race and only when SetNow is called concurrently. Adding a brief comment (e.g. // caller must hold sjp.mu) makes the contract explicit and avoids a subtle footgun.
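
One way to make that contract visible is a doc comment plus the common Go `...Locked` naming convention. The sketch below uses a hypothetical `penalty` type and method names; only the idea of reading `now` and `cutoffs` under a mutex comes from the comment above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// penalty is a hypothetical, cut-down stand-in for ShortJobPenalty.
type penalty struct {
	mu      sync.Mutex
	now     time.Time
	cutoffs map[string]time.Duration
}

// SetNow updates the clock under the lock.
func (p *penalty) SetNow(t time.Time) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.now = t
}

// shouldApplyLocked reports whether a run that started at runStart is still
// inside the pool's cutoff window.
//
// Caller must hold p.mu.
func (p *penalty) shouldApplyLocked(pool string, runStart time.Time) bool {
	return p.now.Sub(runStart) < p.cutoffs[pool]
}

// ShouldApply is the lock-taking entry point for external callers.
func (p *penalty) ShouldApply(pool string, runStart time.Time) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.shouldApplyLocked(pool, runStart)
}

// example checks the same run inside and outside a five-minute cutoff.
func example() (inside, outside bool) {
	t0 := time.Date(2026, 6, 16, 12, 0, 0, 0, time.UTC)
	p := &penalty{cutoffs: map[string]time.Duration{"pool-a": 5 * time.Minute}}
	p.SetNow(t0.Add(3 * time.Minute))
	inside = p.ShouldApply("pool-a", t0)
	p.SetNow(t0.Add(6 * time.Minute))
	outside = p.ShouldApply("pool-a", t0)
	return inside, outside
}

func main() {
	inside, outside := example()
	fmt.Println(inside, outside) // prints "true false"
}
```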

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +182 to 193
func (sjp *ShortJobPenalty) subtractFromSums(pool, queue string, resources internaltypes.ResourceList) {
	queueSums := sjp.sums[pool]
	remaining := queueSums[queue].Subtract(resources)
	if remaining.AllZero() {
		delete(queueSums, queue)
		if len(queueSums) == 0 {
			delete(sjp.sums, pool)
		}
		return
	}
	queueSums[queue] = remaining
}


P2 Nil map write in subtractFromSums is unreachable but unguarded

queueSums := sjp.sums[pool] returns nil if the pool key was previously deleted from sums (which subtractFromSums itself does on line 188). If a heap entry were ever present without a matching sums entry — an invariant violation — the code would reach queueSums[queue] = remaining on a nil map and panic. The invariant is maintained correctly today (every heap.Push is paired with an addToSums call), but adding a nil guard or an assertion would make the invariant self-documenting and prevent a silent panic if the invariant is ever accidentally broken.
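
A guarded variant might look like the sketch below. It swaps the real resource list for a plain `int64` and returns an error on an invariant violation; both choices, and the `sums` type and `subtract` name, are assumptions for illustration. In Go, reading from a nil map is safe but assigning into one panics, which is why the lookup is checked before any write.

```go
package main

import "fmt"

// sums maps pool -> queue -> running total (int64 stands in for resources).
type sums map[string]map[string]int64

// subtract removes amount from (pool, queue) and prunes empty maps. If no
// entry exists it returns an error rather than writing into a nil map.
func (s sums) subtract(pool, queue string, amount int64) error {
	queueSums, ok := s[pool]
	if !ok {
		return fmt.Errorf("no sums for pool %q: heap and cache out of sync", pool)
	}
	current, ok := queueSums[queue]
	if !ok {
		return fmt.Errorf("no sums for queue %q in pool %q", queue, pool)
	}
	remaining := current - amount
	if remaining == 0 {
		delete(queueSums, queue)
		if len(queueSums) == 0 {
			delete(s, pool)
		}
		return nil
	}
	queueSums[queue] = remaining
	return nil
}

// example subtracts a matching entry, then repeats the call after the pool
// has been pruned to show the guarded path.
func example() (firstOK, secondFailed bool, poolsLeft int) {
	s := sums{"pool-a": {"queue-1": 4}}
	firstOK = s.subtract("pool-a", "queue-1", 4) == nil
	secondFailed = s.subtract("pool-a", "queue-1", 1) != nil
	return firstOK, secondFailed, len(s)
}

func main() {
	fmt.Println(example()) // prints "true true 0"
}
```

Whether to return an error, log, or panic with a clear message on the violated invariant is a design choice for the authors; the point is only that the failure becomes explicit.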

@tgucks tgucks force-pushed the extract-short-job-penalty-svc branch from f39438e to 640e870 Compare June 17, 2026 13:00
tgucks added 4 commits June 17, 2026 15:12
Signed-off-by: Trey Guckian <24757349+tgucks@users.noreply.github.com>
Signed-off-by: Trey Guckian <24757349+tgucks@users.noreply.github.com>
Signed-off-by: Trey Guckian <24757349+tgucks@users.noreply.github.com>
Signed-off-by: Trey Guckian <24757349+tgucks@users.noreply.github.com>
@tgucks tgucks force-pushed the extract-short-job-penalty-svc branch from 640e870 to cb5940b Compare June 17, 2026 20:18
tgucks added 2 commits June 18, 2026 09:45
Signed-off-by: Trey Guckian <24757349+tgucks@users.noreply.github.com>
Signed-off-by: Trey Guckian <24757349+tgucks@users.noreply.github.com>