OPCT-405: pkg/retrieve: fix hang and add context/signal handling #220

Open
redhat-chai-bot wants to merge 1 commit into redhat-openshift-ecosystem:main from redhat-chai-bot:OPCT-405-pkg-retrieve-fix-hang-in-writeResultsToDirectory-by-adding-context-and-signal-handling

Conversation

@redhat-chai-bot

Summary

Fix opct retrieve hanging after successful result collection by adding signal handling and context propagation throughout the retrieve path. Rebased on top of #219.

Supersedes #214 (rebased onto current main after #219 was merged).

Problem

The periodic job periodic-ci-redhat-openshift-ecosystem-opct-main-4.18-platform-external-vsphere-upgrade (build 2057492027129466880) failed because opct retrieve hung in the opct-conformance-test-results step. The conformance tests completed successfully and the results archive was correctly written to GCS, but the retrieve process did not exit.

When Prow sent SIGTERM after the 30m step timeout, the Go process had no signal handler and ignored it. Prow then waited the full 30m grace period before force-killing the container (exit code 127).

Changes

  • Signal handling: Add signal.NotifyContext for SIGTERM/SIGINT in NewCmdRetrieve, threading the context through the retrieve path
  • Context propagation: Add context.Context parameter to retrieveResultsRetry and retrieveResults so cancellation propagates correctly
  • Progress monitoring: Add progressReader and progressWriter structs that wrap io.Reader/io.Writer and log bytes transferred every 30s, with sync.Once-protected close to prevent goroutine leaks and double-close panics
  • Human-readable formatting: Add formatBytes() helper for progress log messages
  • Comprehensive unit tests: 24 tests covering progressReader, progressWriter, retrieveResultsRetry, and formatBytes

Testing

  • go build ./pkg/retrieve/ — PASS
  • go vet ./pkg/retrieve/ — PASS
  • go test -v ./pkg/retrieve/ — 24/24 PASS (8.1s)

References

- Add signal.NotifyContext for SIGTERM/SIGINT in NewCmdRetrieve
- Thread context.Context through retrieve path
- Add progressReader/progressWriter with progress logging every 30s
- Add formatBytes() for human-readable byte counts
- Add comprehensive unit tests
- Fix goroutine leak with sync.Once in progressReader
@openshift-ci

openshift-ci Bot commented Jun 10, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bshaw7 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from mtulio and vr4manta June 10, 2026 14:16
@coderabbitai

coderabbitai Bot commented Jun 10, 2026


📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Downloads now display periodic progress updates showing bytes transferred.
    • Graceful cancellation support added via SIGTERM/SIGINT signals for interrupted operations.
  • Tests

    • Comprehensive unit test coverage added for progress tracking and retrieval logic.

Walkthrough

The retrieval pipeline now supports signal-triggered cancellation and download progress tracking. A signal context is created from SIGTERM/SIGINT in the command entrypoint, flows through the retry orchestration and into the actual retrieval functions, allowing early termination. Progress tracking primitives (progressReader, progressWriter, formatBytes) wrap I/O streams and provide periodic logging with human-readable byte formatting. Comprehensive tests validate context cancellation semantics, retry behavior, progress tracking concurrency safety, and binary-unit conversions.

Changes

Retrieval Pipeline with Context Cancellation and Progress Logging

| Layer / File(s) | Summary |
|---|---|
| Signal context and retry orchestration (pkg/retrieve/retrieve.go) | Imports for signal handling and atomic operations added. NewCmdRetrieve establishes a signal-aware context for SIGTERM/SIGINT and passes it to retrieveResultsRetry. Retry loop now accepts context, checks for cancellation before/within attempts, and uses select + time.After for cancellation-aware retry waits. Package-level retrieveFunc injected for test overrides. |
| Context propagation through retrieval (pkg/retrieve/retrieve.go) | retrieveResults and downloadFromPod signatures updated to accept ctx. Download streams pod exec output via exec.StreamWithContext(ctx, ...) so context cancellation interrupts the streaming operation. |
| Progress tracking infrastructure (pkg/retrieve/retrieve.go) | progressReader and progressWriter types added with atomic byte counters, periodic logging goroutines, and safe shutdown via sync.Once/channel signaling. formatBytes converts byte counts to human-readable binary units. Download wraps temp file writer with progressWriter to log write progress periodically. |
| Test suite (pkg/retrieve/retrieve_test.go) | Comprehensive tests for progressReader (byte counting, EOF semantics, done-channel closure, concurrent safety, stutter-read handling), progressWriter (byte counting, goroutine shutdown, concurrent safety), retrieveResultsRetry (context cancellation behavior, retry exhaustion/minimum attempts, cancellation-aware retry waits), and formatBytes (binary-unit boundaries). Test helper implementations (failingReader, multiEOFReader, slowReader, chunkedReader, stutteringReader) drive the scenarios. |

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main changes: fixing a hang and adding context/signal handling to the retrieve package. |
| Description check | ✅ Passed | The description is well-detailed and directly related to the changeset, explaining the problem, changes made, and testing results. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@mtulio

mtulio commented Jun 10, 2026

Collaborator

We may need to validate this change in a live cluster (we don't have a presubmit job set up here), as it touches retrieve, which depends on the aggregator server.

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 10, 2026
@mtulio mtulio added the kind/bug Categorizes issue or PR as related to a bug. label Jun 10, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/retrieve/retrieve.go (1)

127-185: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Cancellation currently stops at the pod download.

retrieveResults accepts ctx, but after downloadFromPod(ctx) the scan/copy/extract path ignores it. A SIGTERM/SIGINT that arrives while cleaner.ScanPatchTarGzipReaderFor, io.Copy, or sonobuoyclient.UntarAll is running will still keep the process alive until those local phases finish, which leaves the command vulnerable to the same “won’t exit promptly” failure mode on large archives or slow disks.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/retrieve/retrieve.go` around lines 127 - 185, The function
retrieveResults currently ignores ctx after downloadFromPod, so long-running
operations (cleaner.ScanPatchTarGzipReaderFor, io.Copy to scannedFile, and
sonobuoyclient.UntarAll) can't be cancelled; update the flow to respect ctx by
propagating cancellation: if cleaner.ScanPatchTarGzipReaderFor can accept ctx,
call it with ctx, otherwise wrap scannedReader with a context-aware reader
(e.g., use an io.Pipe and start a goroutine that copies from the original reader
to the pipe while watching ctx.Done() and closing the writer on cancel), perform
io.Copy to scannedFile in a goroutine or via a context-aware copy that aborts on
ctx.Done(), and run sonobuoyclient.UntarAll in a goroutine so you can select on
ctx.Done() to close scannedIn and abort extraction; ensure all deferred
closes/removals still run and return ctx.Err() when cancellation occurs so
retrieveResults cancels promptly.

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: ef48508d-904a-44f5-bfd8-870e4641d117

📥 Commits

Reviewing files that changed from the base of the PR and between 274c457 and d6618f3.

📒 Files selected for processing (2)
  • pkg/retrieve/retrieve.go
  • pkg/retrieve/retrieve_test.go

Comment thread pkg/retrieve/retrieve.go
Comment on lines 44 to +46
RunE: func(cmd *cobra.Command, args []string) error {
ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
defer cancel()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the file
ls -la pkg/retrieve/retrieve.go

# Show the RunE block around the claimed lines
nl -ba pkg/retrieve/retrieve.go | sed -n '1,120p'

# Find and print retrieveResultsRetry / retrieveResults / downloadFromPod sections
rg -n "func (retrieveResultsRetry|retrieveResults|downloadFromPod)\\b" pkg/retrieve/retrieve.go
nl -ba pkg/retrieve/retrieve.go | sed -n '120,260p'

# Check for any other context.Background() usage in this file
rg -n "context\\.Background\\(" pkg/retrieve/retrieve.go

# Check where ctx is used (call sites) in retrieveResults and surrounding helpers
rg -n "\\bctx\\b" pkg/retrieve/retrieve.go

Repository: redhat-openshift-ecosystem/opct

Length of output: 211


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="pkg/retrieve/retrieve.go"
echo "== File: $FILE =="
wc -l "$FILE"

echo "== RunE context block (approx lines 35-70) =="
sed -n '35,80p' "$FILE"

echo "== Function headers (retrieveResultsRetry / retrieveResults / downloadFromPod) =="
rg -n "func (retrieveResultsRetry|retrieveResults|downloadFromPod)\\b" "$FILE"

echo "== retrieveResultsRetry body (approx lines 80-190) =="
sed -n '80,200p' "$FILE"

echo "== retrieveResults body (if different) (approx lines 200-320) =="
sed -n '200,340p' "$FILE"

echo "== downloadFromPod body (approx lines 340-520) =="
sed -n '340,520p' "$FILE"

echo "== Any context.Background usage in this file =="
rg -n "context\\.Background\\(" "$FILE" || true

echo "== Any ctx usage in this file (to spot gaps) =="
rg -n "\\bctx\\b" "$FILE" | head -n 120

Repository: redhat-openshift-ecosystem/opct

Length of output: 11877


Derive signal context from cmd.Context() and propagate ctx through local scan/redact + extract phases.

  • NewCmdRetrieve’s RunE roots the signal-aware context from context.Background(), discarding any cancellation/deadline supplied via Cobra/ExecuteContext; base it on cmd.Context() instead.
  • retrieveResults uses ctx only for downloadFromPod; the scan/redact (cleaner.ScanPatchTarGzipReaderFor), file writing (io.Copy), and extraction (sonobuoyclient.UntarAll + rename) run without any ctx checks, so termination signals won’t reliably interrupt long local phases.
Suggested fix
-			ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
+			baseCtx := cmd.Context()
+			if baseCtx == nil {
+				baseCtx = context.Background()
+			}
+			ctx, cancel := signal.NotifyContext(baseCtx, syscall.SIGTERM, syscall.SIGINT)
 			defer cancel()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-		RunE: func(cmd *cobra.Command, args []string) error {
-			ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
-			defer cancel()
+		RunE: func(cmd *cobra.Command, args []string) error {
+			baseCtx := cmd.Context()
+			if baseCtx == nil {
+				baseCtx = context.Background()
+			}
+			ctx, cancel := signal.NotifyContext(baseCtx, syscall.SIGTERM, syscall.SIGINT)
+			defer cancel()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/retrieve/retrieve.go` around lines 44 - 46, Change RunE in NewCmdRetrieve
to derive the signal-aware context from cmd.Context() (use
signal.NotifyContext(cmd.Context(), ...)) instead of context.Background(), and
pass that ctx into retrieveResults; inside retrieveResults ensure the same ctx
is used not only for downloadFromPod but also for the local phases by (a)
calling cleaner.ScanPatchTarGzipReaderFor with a context-aware variant or
wrapping checks for ctx.Done while reading/writing, (b) making the file
write/io.Copy abort when ctx is cancelled (check ctx.Done and return), and (c)
invoking sonobuoyclient.UntarAll and the subsequent rename with cancellation
support (either by using context-aware versions or checking ctx between steps
and returning early). Update function signatures (retrieveResults and any helper
calls) to accept the ctx where needed so cancellation/deadline from
cmd.Context() propagates through downloadFromPod,
cleaner.ScanPatchTarGzipReaderFor, io.Copy, sonobuoyclient.UntarAll and rename
operations.

Source: Coding guidelines

@mtulio

mtulio commented Jun 11, 2026

Collaborator

The changes in this PR are producing transient errors and instability in retrieve compared with main.

Current:

$ ./build/opct-linux-amd64 retrieve
INFO[2026-06-11T03:19:55-03:00] Collecting results...                        
INFO[2026-06-11T03:19:56-03:00] Downloading archive from aggregator server... 
INFO[2026-06-11T03:20:10-03:00] Downloaded 169.0 MB in 14s                   
INFO[2026-06-11T03:20:10-03:00] Scanning archive for sensitive data...       
INFO[2026-06-11T03:20:30-03:00] Results saved to /home/to/opct/opct_202606102158_d0c49692-6371-410f-bc6f-35acbe0fefd8.tar.gz 
INFO[2026-06-11T03:20:30-03:00] Run 'opct report -s ./report <archive>.tar.gz' to review the validation results.

PR:

$ ./build/opct-linux-amd64 retrieve
INFO[2026-06-11T03:51:32-03:00] Collecting results...                        
INFO[2026-06-11T03:51:33-03:00] Downloading archive from aggregator server... 
E0611 03:51:55.201911 2003195 v2.go:150] "Unhandled Error" err="next reader: read tcp 192.168.10.55:52162->44.209.144.43:6443: read: connection reset by peer" logger="UnhandledError"
ERRO[2026-06-11T03:51:55-03:00] error retrieving results from aggregator server: error streaming results from pod: error reading from error stream: next reader: read tcp 192.168.10.55:52162->44.209.144.43:6443: read: connection reset by peer 
WARN[2026-06-11T03:51:55-03:00] Retrying retrieval 9 more times after 2 sec  
INFO[2026-06-11T03:51:58-03:00] Downloading archive from aggregator server... 
E0611 03:51:59.998954 2003195 websocket.go:514] Websocket Ping failed: set tcp 192.168.10.55:52162: use of closed network connection
INFO[2026-06-11T03:52:17-03:00] Downloaded 169.0 MiB in 20s                  
INFO[2026-06-11T03:52:17-03:00] Scanning archive for sensitive data...       
INFO[2026-06-11T03:52:37-03:00] Results saved to /home/to/opct/opct_202606102158_d0c49692-6371-410f-bc6f-35acbe0fefd8.tar.gz 
INFO[2026-06-11T03:52:37-03:00] Run 'opct report -s ./report <archive>.tar.gz' to review the validation results. 

Considering this is lower priority than the next items and the hang was observed only in CI, I am putting this PR on hold and will review it later, after the existing priorities.

cc @bshaw7
/hold


Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug.


2 participants