OPCT-405: pkg/retrieve: fix hang and add context/signal handling#220
Conversation
- Add signal.NotifyContext for SIGTERM/SIGINT in NewCmdRetrieve - Thread context.Context through retrieve path - Add progressReader/progressWriter with progress logging every 30s - Add formatBytes() for human-readable byte counts - Add comprehensive unit tests - Fix goroutine leak with sync.Once in progressReader
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughSummary by CodeRabbit
WalkthroughThe retrieval pipeline now supports signal-triggered cancellation and download progress tracking. A signal context is created from SIGTERM/SIGINT in the command entrypoint, flows through the retry orchestration and into the actual retrieval functions, allowing early termination. Progress tracking primitives ( ChangesRetrieval Pipeline with Context Cancellation and Progress Logging
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
We may need to validate this change in a live cluster (we don't have presubmit job set here) as it touches retrieve which depends on aggregator server. /hold |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/retrieve/retrieve.go (1)
127-185:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftCancellation currently stops at the pod download.
retrieveResultsacceptsctx, but afterdownloadFromPod(ctx)the scan/copy/extract path ignores it. A SIGTERM/SIGINT that arrives whilecleaner.ScanPatchTarGzipReaderFor,io.Copy, orsonobuoyclient.UntarAllis running will still keep the process alive until those local phases finish, which leaves the command vulnerable to the same “won’t exit promptly” failure mode on large archives or slow disks.As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/retrieve/retrieve.go` around lines 127 - 185, The function retrieveResults currently ignores ctx after downloadFromPod, so long-running operations (cleaner.ScanPatchTarGzipReaderFor, io.Copy to scannedFile, and sonobuoyclient.UntarAll) can't be cancelled; update the flow to respect ctx by propagating cancellation: if cleaner.ScanPatchTarGzipReaderFor can accept ctx, call it with ctx, otherwise wrap scannedReader with a context-aware reader (e.g., use an io.Pipe and start a goroutine that copies from the original reader to the pipe while watching ctx.Done() and closing the writer on cancel), perform io.Copy to scannedFile in a goroutine or via a context-aware copy that aborts on ctx.Done(), and run sonobuoyclient.UntarAll in a goroutine so you can select on ctx.Done() to close scannedIn and abort extraction; ensure all deferred closes/removals still run and return ctx.Err() when cancellation occurs so retrieveResults cancels promptly.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/retrieve/retrieve.go`:
- Around line 44-46: Change RunE in NewCmdRetrieve to derive the signal-aware
context from cmd.Context() (use signal.NotifyContext(cmd.Context(), ...))
instead of context.Background(), and pass that ctx into retrieveResults; inside
retrieveResults ensure the same ctx is used not only for downloadFromPod but
also for the local phases by (a) calling cleaner.ScanPatchTarGzipReaderFor with
a context-aware variant or wrapping checks for ctx.Done while reading/writing,
(b) making the file write/io.Copy abort when ctx is cancelled (check ctx.Done
and return), and (c) invoking sonobuoyclient.UntarAll and the subsequent rename
with cancellation support (either by using context-aware versions or checking
ctx between steps and returning early). Update function signatures
(retrieveResults and any helper calls) to accept the ctx where needed so
cancellation/deadline from cmd.Context() propagates through downloadFromPod,
cleaner.ScanPatchTarGzipReaderFor, io.Copy, sonobuoyclient.UntarAll and rename
operations.
---
Outside diff comments:
In `@pkg/retrieve/retrieve.go`:
- Around line 127-185: The function retrieveResults currently ignores ctx after
downloadFromPod, so long-running operations (cleaner.ScanPatchTarGzipReaderFor,
io.Copy to scannedFile, and sonobuoyclient.UntarAll) can't be cancelled; update
the flow to respect ctx by propagating cancellation: if
cleaner.ScanPatchTarGzipReaderFor can accept ctx, call it with ctx, otherwise
wrap scannedReader with a context-aware reader (e.g., use an io.Pipe and start a
goroutine that copies from the original reader to the pipe while watching
ctx.Done() and closing the writer on cancel), perform io.Copy to scannedFile in
a goroutine or via a context-aware copy that aborts on ctx.Done(), and run
sonobuoyclient.UntarAll in a goroutine so you can select on ctx.Done() to close
scannedIn and abort extraction; ensure all deferred closes/removals still run
and return ctx.Err() when cancellation occurs so retrieveResults cancels
promptly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: ef48508d-904a-44f5-bfd8-870e4641d117
📒 Files selected for processing (2)
pkg/retrieve/retrieve.gopkg/retrieve/retrieve_test.go
| RunE: func(cmd *cobra.Command, args []string) error { | ||
| ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT) | ||
| defer cancel() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the file
ls -la pkg/retrieve/retrieve.go
# Show the RunE block around the claimed lines
nl -ba pkg/retrieve/retrieve.go | sed -n '1,120p'
# Find and print retrieveResultsRetry / retrieveResults / downloadFromPod sections
rg -n "func (retrieveResultsRetry|retrieveResults|downloadFromPod)\\b" pkg/retrieve/retrieve.go
nl -ba pkg/retrieve/retrieve.go | sed -n '120,260p'
# Check for any other context.Background() usage in this file
rg -n "context\\.Background\\(" pkg/retrieve/retrieve.go
# Check where ctx is used (call sites) in retrieveResults and surrounding helpers
rg -n "\\bctx\\b" pkg/retrieve/retrieve.goRepository: redhat-openshift-ecosystem/opct
Length of output: 211
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="pkg/retrieve/retrieve.go"
echo "== File: $FILE =="
wc -l "$FILE"
echo "== RunE context block (approx lines 35-70) =="
sed -n '35,80p' "$FILE"
echo "== Function headers (retrieveResultsRetry / retrieveResults / downloadFromPod) =="
rg -n "func (retrieveResultsRetry|retrieveResults|downloadFromPod)\\b" "$FILE"
echo "== retrieveResultsRetry body (approx lines 80-190) =="
sed -n '80,200p' "$FILE"
echo "== retrieveResults body (if different) (approx lines 200-320) =="
sed -n '200,340p' "$FILE"
echo "== downloadFromPod body (approx lines 340-520) =="
sed -n '340,520p' "$FILE"
echo "== Any context.Background usage in this file =="
rg -n "context\\.Background\\(" "$FILE" || true
echo "== Any ctx usage in this file (to spot gaps) =="
rg -n "\\bctx\\b" "$FILE" | head -n 120Repository: redhat-openshift-ecosystem/opct
Length of output: 11877
Derive signal context from cmd.Context() and propagate ctx through local scan/redact + extract phases.
NewCmdRetrieve’sRunEroots the signal-aware context fromcontext.Background(), discarding any cancellation/deadline supplied via Cobra/ExecuteContext; base it oncmd.Context()instead.retrieveResultsusesctxonly fordownloadFromPod; the scan/redact (cleaner.ScanPatchTarGzipReaderFor), file writing (io.Copy), and extraction (sonobuoyclient.UntarAll+ rename) run without anyctxchecks, so termination signals won’t reliably interrupt long local phases.
Suggested fix
- ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
+ baseCtx := cmd.Context()
+ if baseCtx == nil {
+ baseCtx = context.Background()
+ }
+ ctx, cancel := signal.NotifyContext(baseCtx, syscall.SIGTERM, syscall.SIGINT)
defer cancel()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| RunE: func(cmd *cobra.Command, args []string) error { | |
| ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT) | |
| defer cancel() | |
| RunE: func(cmd *cobra.Command, args []string) error { | |
| baseCtx := cmd.Context() | |
| if baseCtx == nil { | |
| baseCtx = context.Background() | |
| } | |
| ctx, cancel := signal.NotifyContext(baseCtx, syscall.SIGTERM, syscall.SIGINT) | |
| defer cancel() |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/retrieve/retrieve.go` around lines 44 - 46, Change RunE in NewCmdRetrieve
to derive the signal-aware context from cmd.Context() (use
signal.NotifyContext(cmd.Context(), ...)) instead of context.Background(), and
pass that ctx into retrieveResults; inside retrieveResults ensure the same ctx
is used not only for downloadFromPod but also for the local phases by (a)
calling cleaner.ScanPatchTarGzipReaderFor with a context-aware variant or
wrapping checks for ctx.Done while reading/writing, (b) making the file
write/io.Copy abort when ctx is cancelled (check ctx.Done and return), and (c)
invoking sonobuoyclient.UntarAll and the subsequent rename with cancellation
support (either by using context-aware versions or checking ctx between steps
and returning early). Update function signatures (retrieveResults and any helper
calls) to accept the ctx where needed so cancellation/deadline from
cmd.Context() propagates through downloadFromPod,
cleaner.ScanPatchTarGzipReaderFor, io.Copy, sonobuoyclient.UntarAll and rename
operations.
Source: Coding guidelines
|
Changes from this PR is producing transient results and instability on retrieve in Current: PR: Considering this is lower priority than next and observed only on CI, I am holding to review later existing priorities. cc @bshaw7 |
Summary
Fix
opct retrievehanging after successful result collection by adding signal handling and context propagation throughout the retrieve path. Rebased on top of #219.Supersedes #214 (rebased onto current
mainafter #219 was merged).Problem
The periodic job
periodic-ci-redhat-openshift-ecosystem-opct-main-4.18-platform-external-vsphere-upgrade(build 2057492027129466880) failed becauseopct retrievehung in theopct-conformance-test-resultsstep. The conformance tests completed successfully and the results archive was correctly written to GCS, but the retrieve process did not exit.When Prow sent SIGTERM after the 30m step timeout, the Go process had no signal handler and ignored it. Prow then waited the full 30m grace period before force-killing the container (exit code 127).
Changes
signal.NotifyContextfor SIGTERM/SIGINT inNewCmdRetrieve, threading the context through the retrieve pathcontext.Contextparameter toretrieveResultsRetryandretrieveResultsso cancellation propagates correctlyprogressReaderandprogressWriterstructs that wrapio.Reader/io.Writerand log bytes transferred every 30s, withsync.Once-protected close to prevent goroutine leaks and double-close panicsformatBytes()helper for progress log messagesTesting
go build ./pkg/retrieve/— PASSgo vet ./pkg/retrieve/— PASSgo test -v ./pkg/retrieve/— 24/24 PASS (8.1s)References