
HYPERFLEET-532 | docs: E2E test run strategy and resource management spike#84

Open
yasun1 wants to merge 2 commits into openshift-hyperfleet:main from yasun1:HYPERFLEET-532

Conversation

@yasun1
Contributor

@yasun1 yasun1 commented Jan 30, 2026

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive E2E test run strategy: scope, goals, core design principles; a test-run model and lifecycle; roles for fixture adapters and suites; deployment and namespace strategy with naming conventions and cross-namespace topology; component ownership, resource isolation, and race-condition prevention; test-suite organization, state management, cleanup/retention and reconciliation policies; observability and debugging guidance; open questions and a phased MVP → post‑MVP implementation plan.

@coderabbitai
Contributor

coderabbitai bot commented Jan 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the `reviews.auto_review.auto_pause_after_reviewed_commits` setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a new Spike Report Markdown at hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md defining an end-to-end test-run strategy for HyperFleet. The document covers problem scope, goals, design principles, an E2E Test Run model (definition, identification, lifecycle), fixture-adapter concepts, deployment and component lifecycle, isolation and race-condition prevention, suite organization and state management, cleanup and retention policies, observability, open questions, and a phased implementation plan (Phase 1: adapter-landing-zone MVP; Phase 2: Fixture Adapter; Phase 3+: enhancements).

Sequence Diagram(s)

sequenceDiagram
  participant CI as CI / Test Runner
  participant Controller as Run Controller
  participant Adapter as Fixture Adapter
  participant K8s as Kubernetes API
  participant Core as Core Suite
  participant AdapterSuite as Adapter Suite

  CI->>Controller: Create TestRun CR (labels, retention)
  Controller->>K8s: Create Namespace (one-per-run)
  Controller->>Adapter: Request fixture provisioning (fixture refs)
  Adapter->>K8s: Provision adapter resources in run-namespace
  Adapter->>AdapterSuite: Deploy & trigger Adapter Suite
  Controller->>Core: Trigger Core Suite (after adapter readiness)
  Core->>K8s: Validate topology across namespaces
  AdapterSuite->>Controller: Report status (ready/passed/failed)
  Core->>Controller: Report status (passed/failed)
  Controller->>K8s: Begin teardown per retention policy (or retain)
  Controller->>Adapter: Notify cleanup
  Adapter->>K8s: Delete adapter resources
  Controller->>K8s: Delete Namespace (if not retained)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • xueli181114
  • rh-amarin
  • ciaranRoche
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately describes the main change: addition of a comprehensive spike report document on E2E test run strategy and resource management.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md`:
- Around line 650-654: Add a new subsection titled "Cleanup Failure Handling"
under Section 8.5 (currently referenced as 8.6) that specifies: test suites must
implement timeout-based forced cleanup with a default 60s, cleanup failures must
be detected and logged (but should not mark the test as failed—the test result
should reflect validation only), reconciler/orphan-detection should surface
failed cleanups, and the Test Framework must perform final
namespace/infrastructure deletion regardless of suite cleanup status; include
these four bullet points verbatim and ensure the subsection name is "Cleanup
Failure Handling".
- Around line 323-336: Update the "Component Lifecycle Ownership" table row for
"Broker Resources" and expand Section 6.2 to explicitly state the cleanup
responsibilities and sequence: declare that adapter-created subscriptions are
adapter-owned and must be removed by the adapter on removal (e.g., during its
BeforeEach/AfterAll hooks), sentinel-owned topics are removed by the Sentinel
during Test Run teardown, and the cleanup order is adapters-first then Sentinel
to avoid races; also require idempotent cleanup operations and a simple
coordination mechanism (e.g., atomic marker or lock) to avoid conflicts when
hot-plugging adapters as described in Section 8.2.2.
- Around line 309-311: The markdown code block showing the namespace example
currently has no language specifier; update the fenced block around the example
string "e2e-{TEST_RUN_ID}" to include a language (e.g., change the opening fence
from ``` to ```text) so the snippet is marked as plain text for rendering and
syntax highlighting.
- Around line 581-613: The fenced code block showing the Ginkgo test structure
should include a language specifier so syntax highlighting is enabled; update
the opening fence of the block that begins with Describe("DNS Adapter Suite",
func() { to use ```go instead of ```; leave the rest of the block unchanged so
the Ginkgo/Gomega test symbols (Describe, BeforeAll, It, AfterAll, BeforeEach,
AfterEach) are highlighted correctly.
- Around line 486-494: The fenced code block showing the example flow lacks a
language tag; update the block surrounding the numbered steps (the
triple-backtick block containing "1. Test creates Cluster via API...") to
include a language specifier such as ```text (or ```markdown) so the block is
explicitly marked and rendered correctly; ensure the opening fence becomes
```text and leave the content unchanged.
- Around line 139-149: The fenced lifecycle diagram block that currently starts
with ``` and contains the steps "Create Namespace → Deploy Infrastructure →
Infrastructure Ready → Execute Test Suites → Cleanup" should include a language
specifier (e.g., change the opening fence to ```text or ```mermaid) so the
diagram renders with proper formatting; update the code block opening fence
accordingly while leaving the block content unchanged.
- Around line 565-655: Add a new Section 8.5 between the existing "### 8.4 Test
Organization Guidelines" and "### 8.6 State Management and Suite Independence"
headers to restore contiguous numbering; create a brief header "### 8.5 [Title]"
(e.g., "Adapter Lifecycle Policies" or "Test Isolation Patterns") and include
2–4 sentences summarizing the intended content (or a placeholder note) so the
document numbering is corrected and readers aren’t confused by the gap; update
any cross-references if present.
🧹 Nitpick comments (3)
hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md (3)

716-720: Consider reconciler frequency for active development environments.

The 30-minute reconciler frequency may be too infrequent for developers running tests locally in rapid succession. In resource-constrained development clusters, this could lead to:

  • Quota exhaustion while waiting for cleanup
  • Developer friction (manual cleanup required)
  • Reduced parallel test capacity

Consider either:

  • Shortening the default frequency for development (e.g., 5-10 minutes)
  • Making frequency configurable per environment
  • Adding immediate cleanup for passed tests (10-minute retention might still be too long)
⚙️ Proposed enhancement

Update the reconciler description:

 - Runs periodically (frequency configurable, typically 30 minutes)
+  - Default: 30 minutes (CI), 5 minutes (local development)
+  - Configurable via environment variable: E2E_RECONCILER_FREQUENCY

674-678: Reconsider default 2-hour TTL for development environments.

The 2-hour default TTL may be excessive for local development scenarios where:

  • Developers iterate rapidly (multiple test runs per hour)
  • Local clusters have limited resources
  • Immediate feedback on quota issues is preferred

While 2 hours provides safety for CI interruptions, it could cause resource exhaustion in active development. Consider a shorter default (e.g., 30-60 minutes) or making it environment-specific.

⏱️ Proposed refinement

Update the default retention section:

-- **Default TTL**: 2 hours from creation
+- **Default TTL**: 
+  - CI context (`ci=yes`): 2 hours from creation
+  - Local context (`ci=no`): 1 hour from creation

684-696: Consider shortening passed test retention.

The 10-minute retention for passed tests may still be too long in high-velocity development environments. Since passed tests have "minimal debugging value" (as stated), consider:

  • Reducing to 2-5 minutes (still prevents race conditions)
  • Immediate deletion by E2E flow after test completion
  • Making this configurable per environment

This would improve resource turnover and reduce quota pressure.

Comment on lines 650 to 654
**Cleanup Responsibility**:

- Test cases and suites clean their own state (adapters, test data)
- Infrastructure cleanup handled by Test Framework (see Section 9 for retention policy)

Contributor


⚠️ Potential issue | 🟠 Major

Address test cleanup failure scenarios.

Section 8.6 states "Test cases and suites clean their own state (adapters, test data)" but doesn't address what happens if cleanup fails. Failed cleanup could:

  • Leave orphaned adapters consuming resources
  • Pollute the namespace for subsequent test suites
  • Interfere with the Test Framework's final teardown

Consider adding guidance for:

  • Cleanup failure detection and logging
  • Whether cleanup failures should fail the test
  • Fallback mechanisms (e.g., force-delete with timeout)
🛡️ Proposed addition

Add a subsection under 8.5 (currently 8.6):

**Cleanup Failure Handling**:
- Test suites implement timeout-based forced cleanup (default: 60s)
- Cleanup failures are logged but don't fail the test (test result reflects actual validation)
- Failed cleanups are detected by reconciler via orphaned resources
- Test Framework performs final namespace deletion regardless of suite cleanup status
🤖 Prompt for AI Agents
In `@hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md` around lines
650-654, add a new subsection titled "Cleanup Failure Handling" under Section 8.5
(currently referenced as 8.6) that specifies: test suites must implement
timeout-based forced cleanup with a default 60s, cleanup failures must be
detected and logged (but should not mark the test as failed—the test result
should reflect validation only), reconciler/orphan-detection should surface
failed cleanups, and the Test Framework must perform final
namespace/infrastructure deletion regardless of suite cleanup status; include
these four bullet points verbatim and ensure the subsection name is "Cleanup
Failure Handling".

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md`:
- Around line 351-356: Change the requirement text for "All E2E test namespaces"
so it says namespaces must "include the following labels" (or "must include at
least these labels") rather than "must carry exactly three labels"; keep the
three listed labels (`ci`, `test-run-id`, `managed-by`) and their allowed values
but remove the constraint that no other labels may be present to allow
platform-injected labels.

Comment on lines 351 to 356
All E2E test namespaces must carry exactly three labels:

1. **`ci`**: Execution context (`yes` | `no`)
2. **`test-run-id`**: Test Run identifier
3. **`managed-by`**: Ownership marker (`e2e-test-framework`)
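
As an illustration of the label set under discussion, a hypothetical Go helper might build it like this. The helper name and the run ID shown are made up; per the review comment that follows, platform-injected labels may also be present, so this set should be treated as a minimum rather than an exact match.

```go
package main

import "fmt"

// requiredLabels builds the minimum label set the report mandates for E2E
// namespaces. Callers should treat this as "at least these labels", not
// "exactly these", since cluster policies may inject additional labels.
func requiredLabels(testRunID string, ci bool) map[string]string {
	ciValue := "no"
	if ci {
		ciValue = "yes"
	}
	return map[string]string{
		"ci":          ciValue,
		"test-run-id": testRunID,
		"managed-by":  "e2e-test-framework",
	}
}

func main() {
	// "run-20260130-abc" is a made-up Test Run identifier for illustration.
	for k, v := range requiredLabels("run-20260130-abc", true) {
		fmt.Println(k, "=", v)
	}
}
```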

Contributor


⚠️ Potential issue | 🟡 Minor

Avoid “exactly three labels” to allow platform-required labels.

Kubernetes or cluster policies may inject additional labels; require “must include” instead of “exactly three” to prevent conflicts.

✅ Suggested edit
-All E2E test namespaces must carry exactly three labels:
+All E2E test namespaces must carry at least the following three labels:
🤖 Prompt for AI Agents
In `@hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md` around lines
351-356, change the requirement text for "All E2E test namespaces" so it says
namespaces must "include the following labels" (or "must include at least these
labels") rather than "must carry exactly three labels"; keep the three listed
labels (`ci`, `test-run-id`, `managed-by`) and their allowed values but remove
the constraint that no other labels may be present to allow platform-injected
labels.


**Problem**: Core Suite needs to validate HyperFleet framework behavior (event flow, status aggregation, error handling). Functional adapters have external dependencies (cloud APIs, GCP projects) and cannot provide the controlled, repeatable scenarios needed for framework testing. What type of adapter should Core Suite use?

**Decision**: Fixture Adapter in dedicated repository

I would suggest using "test Adapter". Also, it is not a static test adapter; its configuration can be customized.


Test suites represent **validation focus**, not environment configurations.

#### 8.2.1 Core Suite

Could you elaborate on why we need this concept? Are there attributes in a test case that can indicate which cases belong to the core suite?

Contributor Author

@yasun1 yasun1 Feb 10, 2026


The fundamental distinction is:

  • Core Suite: Uses fake adapters that are already deployed (doesn't manage adapter lifecycle)
  • Adapter Suite: Manages fake adapter lifecycle (deploy/remove during tests by hot-plugging)

In test cases, we could use the words "baseline" and "adapter" in the case title to distinguish them.
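
The title-keyword convention yasun1 describes could be applied mechanically. The sketch below is purely illustrative (`suiteFor` is a hypothetical helper, not part of the report): "adapter" titles go to the Adapter Suite, which manages fake-adapter lifecycle, and "baseline" titles go to the Core Suite, which uses pre-deployed fake adapters.

```go
package main

import (
	"fmt"
	"strings"
)

// suiteFor classifies a test case by the keyword in its title. "adapter" is
// checked first, so a title containing both keywords lands in the Adapter
// Suite; anything with neither keyword is flagged for manual triage.
func suiteFor(caseTitle string) string {
	title := strings.ToLower(caseTitle)
	switch {
	case strings.Contains(title, "adapter"):
		return "adapter-suite"
	case strings.Contains(title, "baseline"):
		return "core-suite"
	default:
		return "unclassified"
	}
}

func main() {
	fmt.Println(suiteFor("baseline: cluster status aggregation"))
	fmt.Println(suiteFor("adapter: hot-plug DNS adapter lifecycle"))
}
```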

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md`:
- Around line 225-235: In the markdown block showing the repository tree in
hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md, add a language
specifier to the fenced code block header (change ``` to ```text) so the tree is
rendered with the "text" language; update the opening fence immediately before
the "e2e-testing/" line in that block to ```text.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md`:
- Line 923: The checklist line "[ ] Add test result-based retention updates
(passed: 10min, failed: 24h/6h)" uses "10min" which is inconsistent with other
duration formats; update that text to use the same format as the rest of the
document (e.g., "10 minutes" or, if you choose compact notation consistently,
"10m") and make the failed durations ("24h/6h") match the chosen style (e.g.,
"24 hours / 6 hours" or "24h / 6h") so the entire entry follows the project's
standardized duration format.
🧹 Nitpick comments (1)
hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md (1)

650-654: Consider clarifying cleanup failure handling.

The cleanup responsibility section mentions that test cases and suites clean their own state, but doesn't address what happens when cleanup operations fail. This could leave implementers uncertain about:

  • Whether cleanup failures should fail the test or just be logged
  • Timeout strategies for hung cleanup operations
  • Fallback mechanisms when graceful cleanup is blocked

While Section 9 covers infrastructure retention policies, adding brief guidance here on suite-level cleanup failures would improve implementation clarity.

💡 Possible addition

Consider adding a brief note:

**Cleanup Failure Handling**:
- Cleanup operations should implement timeouts to prevent indefinite hangs
- Cleanup failures are logged but do not affect test pass/fail status
- Orphaned resources are handled by the reconciler (Section 9.3)


**HYPERFLEET-XXX: Retention Policy**
- [ ] Implement namespace retention annotation logic
- [ ] Add test result-based retention updates (passed: 10min, failed: 24h/6h)
Contributor


⚠️ Potential issue | 🟡 Minor

Standardize time duration format.

The duration "10min" is inconsistent with the format used elsewhere in the document (e.g., "2 hours", "10 minutes", "24 hours").

✏️ Suggested fix
-- [ ] Add test result-based retention updates (passed: 10min, failed: 24h/6h)
+- [ ] Add test result-based retention updates (passed: 10 minutes, failed: 24h/6h)

Alternatively, use "10m" if adopting compact notation throughout, but ensure consistency.

🧰 Tools
🪛 LanguageTool

[grammar] ~923-~923: Ensure spelling is correct
Context: ...result-based retention updates (passed: 10min, failed: 24h/6h) - [ ] Configure defaul...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
In `@hyperfleet/e2e-testing/e2e-run-strategy-spike-report.md` at line 923, the
checklist line "[ ] Add test result-based retention updates (passed: 10min,
failed: 24h/6h)" uses "10min" which is inconsistent with other duration formats;
update that text to use the same format as the rest of the document (e.g., "10
minutes" or, if you choose compact notation consistently, "10m") and make the
failed durations ("24h/6h") match the chosen style (e.g., "24 hours / 6 hours"
or "24h / 6h") so the entire entry follows the project's standardized duration
format.
