Skip to content

Feat/workload recon 03 consolidated#39

Open
echiugoog wants to merge 20 commits intonstogner:mainfrom
echiugoog:feat/workload_recon_03_consolidated
Open

Feat/workload recon 03 consolidated#39
echiugoog wants to merge 20 commits intonstogner:mainfrom
echiugoog:feat/workload_recon_03_consolidated

Conversation

@echiugoog
Copy link
Copy Markdown
Contributor

No description provided.

- Bump Go to 1.25.7 in Dockerfile and Makefile
- Update Makefile to generate RBAC manifests
- Update dependencies in go.mod and go.sum
- Add Slices custom resource definition
- Split aggregator into modular components: events, poller, and report
- Add NodePoller to decouple API polling from main aggregation
- Implement in-memory observed state store for events
- Move GCSClient interface to internal/gcsclient
- Add WorkloadReconciler to replace JobSet and Node reconcilers
- Support JobSet and Slice objects in WorkloadReconciler
- Remove obsolete JobSet and Node reconcilers
- Update Manager to register WorkloadReconciler
- Update RBAC rules for the new reconciler
- Add GCS and GKE latency metrics
- Add Clone method to Report utilizing maps.Copy
- Refactor EventRecords to track state changes
- Refactor integration tests to use the new WorkloadReconciler
- Remove obsolete sleeps and improve test polling
- Add architectural documentation, including the new CurrentObservedStore
- Allow dynamic naming of WorkloadReconciler
- Prevent conflicts during sequential integration tests
@nstogner nstogner marked this pull request as ready for review March 25, 2026 15:06
…tly from EventLog

This change simplifies the main aggregation loop. Instead of pre-populating the report
with the latest observed states in the Aggregator and passing them down, the
SummaryProducer now receives the EventLog and fetches both the historical events and
current observed states directly to compute the final summaries.
Replaced several 'TODO' comments above function signatures with proper descriptive
documentation for the WorkloadReconciler and k8sutils. Also updated getOwnerStatus
and getOwnerActiveStatus to use named return variables for better clarity.
- Corrected the architecture diagram to show the Summary Producer pulling directly
  from the Current Observed Store.
- Added the Pod Reconciler to the State Observers section and clarified its limited
  role in scheduling discovery without persisting to the Event Log.
- Grouped State Observers into 'Pollers' and 'Reconcilers'.
- Fixed a truncated label '(Pers)' to '(Persist)' in the ASCII diagram.
Removed the in-memory map `sliceOwnerMap` along with its explicit locking and sync
methods (`loadSliceOwnerMap`, `saveSliceOwnerMap`).
The `WorkloadReconciler` now directly queries the `ConfigMap` utilizing the
`controller-runtime` cache mechanisms, simplifying the state synchronization and
making the component largely stateless.

test(manager): add unit test to verify Config unmarshaling

Added a test to ensure that the `Config` struct, specifically the nested
`types.NamespacedName` for `SliceOwnerMapConfigMapRef` and `ReportConfigMapRef`,
is correctly deserialized from JSON and correctly overrides the default values
configured in `MustConfigure()`.

refactor(controller): clean up config map variable names and redundant init

- Removed redundant default setting for SliceOwnerMapConfigMapRef in Init()
  as it is already handled properly in manager/manager.go (MustConfigure).
- Renamed 'cmToSave' to a more generic 'configMap' in the Update block.

test(integration): supply explicit SliceOwnerMapConfigMapRef namespace

The integration tests instantiate the controller environment using a custom
test configuration object. The recent refactoring to use a ConfigMap directly
exposed a test environment gap where an empty namespace was passed for
the ConfigMap reference.

Because Kubernetes does not allow creating ConfigMaps with an empty namespace,
the WorkloadReconciler was unable to save the slice-owner-map. This resulted
in silent test failures for slice recovery/interruption validation because
the WorkloadReconciler lost track of deleted slices without the ConfigMap.

This commit updates the  to include a proper namespace and name for
the SliceOwnerMapConfigMapRef.
…nges

Replaced the concurrent errgroup logic in WorkloadReconciler's Reconcile method
with a single batched call `AppendStateChanges` on the EventLog.

This ensures that:
- The controller can retry the entire reconciliation on failure without entering
  a partial failure state.
- A single combined error is returned correctly back to the controller-runtime.
…onciler namespace

Checks the  environment variable (where the reconciler is typically
deployed) and logs an informational warning during controller setup if the
 points to a different namespace. This serves as a useful
indicator for potential configuration or RBAC issues where the reconciler is attempting
to access state outside of its own isolated namespace scope.
…ming

- Replaced hardcoded json strings with shared constants.
- Extracted monolithic Reconcile logic into processJobSets, processLWS, and processSlices.
- Renamed 'uid' to 'jobsetUid' and 'ownerMap' to 'sliceOwnerMap' for clarity.
Updated the high-level data flow diagram in arch.md to accurately show
GKE/Kubernetes API sources and the parallel fetching from observed vs event stores.
Also added documenting comments to key reconciliation helper functions.
To prevent a race condition where the SummaryProducer (running on an interval)
might read a partial state snapshot while the WorkloadReconciler updates multiple
resources, we've added an `UpdateBatch` method to the `CurrentObservedStore`.

`AppendStateChanges` now takes a single write-lock (`mu.Lock()`) to update the
observed state of all provided resources atomically, before proceeding to
synchronize those states iteratively with the GCS Event Store.
@echiugoog echiugoog force-pushed the feat/workload_recon_03_consolidated branch from 54c90d1 to 9c16784 Compare March 30, 2026 20:11
…l metrics

To prevent exporting empty metrics on startup, the Aggregation loop now uses an
`IsPopulated` check directly on the in-memory `CurrentObservedStore`. This avoids
complex and deadlock-prone synchronization channels between the controller and the
aggregator loops.

The aggregator waits until the essential resource keys (e.g. `jobsets.json` and
`node-pools.json`) have been written to the observed state cache at least once by
the `WorkloadReconciler` and `NodePoller`. A grace period timeout is included to
ensure the application eventually proceeds if a resource legitimately does not
exist, while allowing the `readyz` probe to pass immediately so Kubernetes does
not kill the container.
@echiugoog echiugoog force-pushed the feat/workload_recon_03_consolidated branch from 9c16784 to fca1880 Compare March 31, 2026 20:03
- To prevent exporting artificially empty metrics on startup, the Aggregation
  loop now uses an `IsPopulated` check on the in-memory `CurrentObservedStore`.
  It waits indefinitely until essential resource keys (e.g., `jobsets.json`)
  are written at least once by the `WorkloadReconciler` and `NodePoller`.
- The `readyz` probe is explicitly unblocked prior to this check, ensuring
  Kubernetes accurately identifies the pod as "Running" while metrics are paused.
- Modernized concurrent loops (Polling, Aggregation, Metrics Server) using
  Go 1.25's `sync.WaitGroup.Go()` for cleaner syntax and leak-proof execution.
- Ensured graceful shutdown on SIGTERM by propagating context cancellation
  and explicitly waiting for all WaitGroup goroutines to finish.
- Fixed inverted error logic in the metrics server's `ListenAndServe` block
  where `context.Canceled` was incorrectly negated.
- Added a 5-second context timeout to `metricsServer.Shutdown()` to prevent
  indefinite hangs from active client connections during teardown.
@echiugoog echiugoog force-pushed the feat/workload_recon_03_consolidated branch from fca1880 to 237b084 Compare April 1, 2026 18:50
- Added example configurations for SliceOwnerMapConfigMapRef and
  ReportConfigMapRef to hack/dev-config.json.
- Added a new unit test TestConfigUnmarshalBothConfigMaps in
  internal/manager/manager_test.go to explicitly verify unmarshaling
  of both ConfigMap references from JSON.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants