Open
Conversation
- Bump Go to 1.25.7 in Dockerfile and Makefile - Update Makefile to generate RBAC manifests - Update dependencies in go.mod and go.sum
- Add Slices custom resource definition
- Split aggregator into modular components: events, poller, and report - Add NodePoller to decouple API polling from main aggregation - Implement in-memory observed state store for events - Move GCSClient interface to internal/gcsclient
- Add WorkloadReconciler to replace JobSet and Node reconcilers - Support JobSet and Slice objects in WorkloadReconciler - Remove obsolete JobSet and Node reconcilers - Update Manager to register WorkloadReconciler - Update RBAC rules for the new reconciler
- Add GCS and GKE latency metrics - Add Clone method to Report utilizing maps.Copy - Refactor EventRecords to track state changes
- Refactor integration tests to use the new WorkloadReconciler - Remove obsolete sleeps and improve test polling - Add architectural documentation, including the new CurrentObservedStore
- Allow dynamic naming of WorkloadReconciler - Prevent conflicts during sequential integration tests
nstogner
reviewed
Mar 25, 2026
…tly from EventLog This change simplifies the main aggregation loop. Instead of pre-populating the report with the latest observed states in the Aggregator and passing them down, the SummaryProducer now receives the EventLog and fetches both the historical events and current observed states directly to compute the final summaries.
Replaced several 'TODO' comments above function signatures with proper descriptive documentation for the WorkloadReconciler and k8sutils. Also updated getOwnerStatus and getOwnerActiveStatus to use named return variables for better clarity.
- Corrected the architecture diagram to show the Summary Producer pulling directly from the Current Observed Store. - Added the Pod Reconciler to the State Observers section and clarified its limited role in scheduling discovery without persisting to the Event Log. - Grouped State Observers into 'Pollers' and 'Reconcilers'. - Fixed a truncated label '(Pers)' to '(Persist)' in the ASCII diagram.
Removed the in-memory map `sliceOwnerMap` along with its explicit locking and sync methods (`loadSliceOwnerMap`, `saveSliceOwnerMap`). The `WorkloadReconciler` now directly queries the `ConfigMap` utilizing the `controller-runtime` cache mechanisms, simplifying the state synchronization and making the component largely stateless. test(manager): add unit test to verify Config unmarshaling Added a test to ensure that the `Config` struct, specifically the nested `types.NamespacedName` for `SliceOwnerMapConfigMapRef` and `ReportConfigMapRef`, is correctly deserialized from JSON and correctly overrides the default values configured in `MustConfigure()`. refactor(controller): clean up config map variable names and redundant init - Removed redundant default setting for SliceOwnerMapConfigMapRef in Init() as it is already handled properly in manager/manager.go (MustConfigure). - Renamed 'cmToSave' to a more generic 'configMap' in the Update block. test(integration): supply explicit SliceOwnerMapConfigMapRef namespace The integration tests instantiate the controller environment using a custom test configuration object. The recent refactoring to use a ConfigMap directly exposed a test environment gap where an empty namespace was passed for the ConfigMap reference. Because Kubernetes does not allow creating ConfigMaps with an empty namespace, the WorkloadReconciler was unable to save the slice-owner-map. This resulted in silent test failures for slice recovery/interruption validation because the WorkloadReconciler lost track of deleted slices without the ConfigMap. This commit updates the to include a proper namespace and name for the SliceOwnerMapConfigMapRef.
…nges Replaced the concurrent errgroup logic in WorkloadReconciler's Reconcile method with a single batched call `AppendStateChanges` on the EventLog. This ensures that: - The controller can retry the entire reconciliation on failure without entering a partial failure state. - A single combined error is returned correctly back to the controller-runtime.
…onciler namespace Checks the environment variable (where the reconciler is typically deployed) and logs an informational warning during controller setup if the points to a different namespace. This serves as a useful indicator for potential configuration or RBAC issues where the reconciler is attempting to access state outside of its own isolated namespace scope.
…ming - Replaced hardcoded json strings with shared constants. - Extracted monolithic Reconcile logic into processJobSets, processLWS, and processSlices. - Renamed 'uid' to 'jobsetUid' and 'ownerMap' to 'sliceOwnerMap' for clarity.
Updated the high-level data flow diagram in arch.md to accurately show GKE/Kubernetes API sources and the parallel fetching from observed vs event stores. Also added documenting comments to key reconciliation helper functions.
To prevent a race condition where the SummaryProducer (running on an interval) might read a partial state snapshot while the WorkloadReconciler updates multiple resources, we've added an `UpdateBatch` method to the `CurrentObservedStore`. `AppendStateChanges` now takes a single write-lock (`mu.Lock()`) to update the observed state of all provided resources atomically, before proceeding to synchronize those states iteratively with the GCS Event Store.
54c90d1 to
9c16784
Compare
…l metrics To prevent exporting empty metrics on startup, the Aggregation loop now uses an `IsPopulated` check directly on the in-memory `CurrentObservedStore`. This avoids complex and deadlock-prone synchronization channels between the controller and the aggregator loops. The aggregator waits until the essential resource keys (e.g. `jobsets.json` and `node-pools.json`) have been written to the observed state cache at least once by the `WorkloadReconciler` and `NodePoller`. A grace period timeout is included to ensure the application eventually proceeds if a resource legitimately does not exist, while allowing the `readyz` probe to pass immediately so Kubernetes does not kill the container.
9c16784 to
fca1880
Compare
- To prevent exporting artificially empty metrics on startup, the Aggregation loop now uses an `IsPopulated` check on the in-memory `CurrentObservedStore`. It waits indefinitely until essential resource keys (e.g., `jobsets.json`) are written at least once by the `WorkloadReconciler` and `NodePoller`. - The `readyz` probe is explicitly unblocked prior to this check, ensuring Kubernetes accurately identifies the pod as "Running" while metrics are paused. - Modernized concurrent loops (Polling, Aggregation, Metrics Server) using Go 1.25's `sync.WaitGroup.Go()` for cleaner syntax and leak-proof execution. - Ensured graceful shutdown on SIGTERM by propagating context cancellation and explicitly waiting for all WaitGroup goroutines to finish. - Fixed inverted error logic in the metrics server's `ListenAndServe` block where `context.Canceled` was incorrectly negated. - Added a 5-second context timeout to `metricsServer.Shutdown()` to prevent indefinite hangs from active client connections during teardown.
fca1880 to
237b084
Compare
- Added example configurations for SliceOwnerMapConfigMapRef and ReportConfigMapRef to hack/dev-config.json. - Added a new unit test TestConfigUnmarshalBothConfigMaps in internal/manager/manager_test.go to explicitly verify unmarshaling of both ConfigMap references from JSON.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.