This document breaks the implementation into milestones. Starting from Milestone 2, e2e tests use bink, a tool for spinning up K8s clusters backed by bootc VMs with a local registry.
Use kubebuilder to generate the initial project structure:
kubebuilder init --domain bootc.dev --repo github.com/jlebon/bootc-operator
kubebuilder create api --group bootc --version v1alpha1 --kind BootcNodePool \
--resource --controller
kubebuilder create api --group bootc --version v1alpha1 --kind BootcNode \
--resource --controller=false
BootcNodePool gets `--controller` because the Pool Reconciler lives
there. BootcNode gets `--controller=false` because the daemon is a
separate binary and the Pool Reconciler adds its own watch on
BootcNode.
BootcNodePool:
- Add `+kubebuilder:resource:scope=Cluster` marker
- Flesh out `BootcNodePoolSpec` (nodeSelector, image, rollout, disruption, pullSecretRef)
- Flesh out `BootcNodePoolStatus` (targetDigest, deployedDigest, updateAvailable, counts, conditions)
- Pool condition type/reason constants (`UpToDate`, `Degraded` with all their reasons)
BootcNode:
- Add `+kubebuilder:resource:scope=Cluster` marker
- Flesh out `BootcNodeSpec` (desiredImage, desiredImageState, pullSecretRef, pullSecretHash)
- Flesh out `BootcNodeStatus` (booted, staged, rollback using a shared `ImageInfo` struct, conditions)
- Node condition type/reason constants (`Idle` with its reasons)
- `DesiredImageState` enum type (`Staged`, `Booted`)
- envtest: verify CRDs load, objects can be created/retrieved with full spec and status, and enum validation rejects invalid values.
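As a rough illustration of the enum type and the validation it is expected to produce, the sketch below declares `DesiredImageState` with a kubebuilder enum marker. The constant names and the `Valid` helper are hypothetical and only mirror in plain Go what the CRD schema would enforce on the API server.

```go
package main

import "fmt"

// DesiredImageState says how far the daemon should take the desired image.
// The marker below asks kubebuilder to emit enum validation into the CRD.
// +kubebuilder:validation:Enum=Staged;Booted
type DesiredImageState string

// Constant names are assumptions for this sketch.
const (
	DesiredImageStateStaged DesiredImageState = "Staged"
	DesiredImageStateBooted DesiredImageState = "Booted"
)

// Valid is a hypothetical helper mirroring the CRD enum check in plain Go.
func (s DesiredImageState) Valid() bool {
	switch s {
	case DesiredImageStateStaged, DesiredImageStateBooted:
		return true
	}
	return false
}

func main() {
	for _, s := range []DesiredImageState{"Staged", "Booted", "Rebooted"} {
		fmt.Printf("%s valid=%v\n", s, s.Valid())
	}
}
```

In the envtest check, the equivalent of the `Rebooted` case would be a create request that the API server rejects.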
Set up the bink-based e2e test infrastructure used by all subsequent milestones. Each e2e test gets its own bink cluster for full isolation and potential for parallelization.
Create test/e2e/ with a helper package that provides a single entry
point:
func TestCRDSmoke(t *testing.T) {
env := e2eutil.New(t, e2eutil.Config{}) // starts cluster, deploys CRDs, registers t.Cleanup
// env.Client ready to use
// cluster torn down automatically when test ends
}
`New(t, cfg)` handles the full lifecycle transparently. The `Config`
struct maps to bink's cluster configuration (e.g. cluster name, VM
memory) and can be extended as needed by later milestones.
- Finds `bink` binary on PATH (overridable via `BINK_PATH` env var). Fails the test if not found.
- Runs `bink cluster start` with a unique cluster name per test (e.g. `e2e-<short-hash>`, overridable via `Config`)
- Waits for node1 to be Ready
- Runs `bink api expose` to generate a kubeconfig
- Applies the kustomize manifest from `config/default/` (at M2 this is just CRDs; as later milestones add the controller Deployment and daemon DaemonSet, e2e tests automatically get the full operator)
- Builds a controller-runtime client
- Registers `t.Cleanup` → `bink cluster stop --remove-data`
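The first two steps above could look roughly like the following. This is a minimal sketch using only the standard library; `clusterName` and `findBink` are hypothetical helper names, and hashing the test name is just one possible way to obtain the `<short-hash>`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"os/exec"
)

// clusterName derives a name of the form e2e-<short-hash> from the test name.
// It is stable per test name, so two concurrent runs of the same test would
// collide; a real harness might mix in a random or time-based component.
func clusterName(testName string) string {
	sum := sha256.Sum256([]byte(testName))
	return "e2e-" + hex.EncodeToString(sum[:])[:8]
}

// findBink returns BINK_PATH if it is set, otherwise looks up "bink" on PATH.
func findBink() (string, error) {
	if p := os.Getenv("BINK_PATH"); p != "" {
		return p, nil
	}
	return exec.LookPath("bink")
}

func main() {
	fmt.Println(clusterName("TestCRDSmoke"))
	if p, err := findBink(); err != nil {
		fmt.Println("bink not found:", err) // the harness would fail the test here
	} else {
		fmt.Println("using", p)
	}
}
```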
First e2e test: create a BootcNodePool and BootcNode, verify they are accepted and retrievable. Validates both the harness and the CRDs on a real cluster.
- Run the smoke test, verify it passes end-to-end.
- Verify the cluster is torn down after test completes (including on failure).
Build the controller first since it is the more complex piece and its interaction with BootcNode objects can be fully tested via envtest by simulating daemon responses (writing BootcNode status directly). This also nails down exactly what the controller writes to BootcNode.spec and expects from BootcNode.status, giving the daemon a clear contract to implement.
Validation is mostly envtest-based since the daemon doesn't exist yet and the controller's logic (membership sync, rollout state machine, status aggregation) can be fully exercised by simulating BootcNode status updates. E2e tests become more valuable in Milestone 4 when the full controller+daemon loop can be tested end-to-end.
Membership sync:
- Watch BootcNodePool, Node (with predicates: label changes, Ready condition, `spec.unschedulable` only), BootcNode (via ownerReference)
- Match nodes via `nodeSelector`, create BootcNode with ownerReference to pool
- Label nodes `bootc.dev/managed: ""`
- Handle node leaving pool (label removed or node deleted): delete BootcNode, remove `bootc.dev/managed` label, restore cordon state via `bootc.dev/was-cordoned` annotation on the BootcNode
- Conflict detection: node matches multiple pools → `Degraded/NodeConflict` on conflicting pools, skip rollout steps
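The matching and conflict-detection steps above can be sketched without any Kubernetes dependencies. The example below assumes a plain equality-based selector (a map of required labels); the real controller would likely use the Kubernetes label selector machinery instead, and `matches` and `poolsForNode` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// matches reports whether every key/value in selector is present in labels.
// An empty selector matches every node.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// poolsForNode returns the sorted names of all pools whose selector matches.
func poolsForNode(pools map[string]map[string]string, nodeLabels map[string]string) []string {
	var out []string
	for name, sel := range pools {
		if matches(sel, nodeLabels) {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	pools := map[string]map[string]string{
		"workers": {"role": "worker"},
		"gpu":     {"accel": "gpu"},
	}
	node := map[string]string{"role": "worker", "accel": "gpu"}
	if owners := poolsForNode(pools, node); len(owners) > 1 {
		// More than one match: each of these pools would be marked Degraded/NodeConflict.
		fmt.Println("NodeConflict between pools:", owners)
	}
}
```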
Validation: ✅
- envtest: create Nodes and a BootcNodePool, verify BootcNodes are created with correct ownerReference and desiredImage. Verify nodes are labeled `bootc.dev/managed`. Remove a node's matching label, verify BootcNode is deleted and label removed. Create overlapping pools, verify `Degraded/NodeConflict`.
- e2e: enhance the smoke test to deploy the controller, create a BootcNodePool with worker nodes, and verify BootcNodes appear and nodes are labeled. This exercises the e2e harness for the first time with a running operator and validates it functions in a real cluster.
Rollout state machine:
- Set `targetDigest` directly from `spec.image.ref` (digest refs only; tag resolution deferred to Milestone 5)
- Sync `desiredImage` on all owned BootcNodes; reset `desiredImageState` to `Staged` when `desiredImage` changes
- Reboot slot accounting based on `maxUnavailable`
- Drive transitions per the state table: detect Staged → cordon (record prior state in `bootc.dev/was-cordoned` on the BootcNode) → drain (using `k8s.io/kubectl/pkg/drain`, async goroutine, no timeout by default) → set `desiredImageState: Booted`
- Post-reboot: detect node is healthy (`Idle=True`, not Degraded) and K8s Node is Ready → uncordon (respecting `was-cordoned` on the BootcNode) → free slot
- `spec.rollout.paused`: block new reboot slot assignments; let in-progress staging complete
- Error handling: node Degraded → mark pool Degraded, continue rollout on other nodes; when 2+ nodes in reboot slots are unhealthy (Degraded or not Ready), stop assigning new slots (no new `desiredImageState: Booted`)
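The slot accounting rules above (the `maxUnavailable` limit, pause, and the stop condition at 2+ unhealthy nodes in slots) can be sketched as a pure function. The `nodeState` struct and `assignSlots` are hypothetical simplifications; the real reconciler would derive these fields from BootcNode and Node objects.

```go
package main

import "fmt"

// nodeState is a simplified, hypothetical view of one BootcNode plus its K8s Node.
type nodeState struct {
	Name      string
	Staged    bool // daemon reports the desired image as staged
	InSlot    bool // holds a reboot slot (desiredImageState: Booted, not yet freed)
	Unhealthy bool // Degraded, or K8s Node not Ready
}

// assignSlots returns the staged nodes that may receive a reboot slot now.
func assignSlots(nodes []nodeState, maxUnavailable int, paused bool) []string {
	used, unhealthy := 0, 0
	for _, n := range nodes {
		if n.InSlot {
			used++
			if n.Unhealthy {
				unhealthy++
			}
		}
	}
	// A paused rollout, or 2+ unhealthy nodes in slots, blocks new assignments.
	if paused || unhealthy >= 2 {
		return nil
	}
	var picked []string
	for _, n := range nodes {
		if used >= maxUnavailable {
			break
		}
		if n.Staged && !n.InSlot {
			picked = append(picked, n.Name)
			used++
		}
	}
	return picked
}

func main() {
	nodes := []nodeState{
		{Name: "node1", Staged: true},
		{Name: "node2", Staged: true},
		{Name: "node3", Staged: true},
	}
	// With maxUnavailable = 1, only one node is handed a slot per pass.
	fmt.Println(assignSlots(nodes, 1, false))
}
```

Keeping this decision free of API calls makes it easy to cover with table-driven unit tests in addition to the envtest scenarios.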
Validation (envtest):
- Simulate a 3-node rollout with `maxUnavailable: 1` by writing BootcNode status transitions (Staging → Staged → Rebooting → Idle). Verify only one node is cordoned/drained at a time.
- Verify `desiredImageState` resets to `Staged` when `desiredImage` changes mid-rollout.
- Verify pause blocks new reboot slot assignments but lets staging complete. Resume and verify rollout continues.
- Simulate `Degraded` on one node, verify rollout continues on others.
- Simulate 2 unhealthy nodes in reboot slots (e.g. post-reboot NotReady), verify the controller stops assigning new slots even when `maxUnavailable` has capacity.
Status aggregation:
- Compute `nodeCount`, `updatedCount`, `updatingCount`, `degradedCount`
- `UpToDate` condition with reasons: `AllUpdated`, `RolloutInProgress`, `Paused`
- `Degraded` condition with reasons: `NodeConflict`, `NodeDegraded`, `OK`
- Message on `UpToDate=False` includes breakdown (e.g. "5/10 updated; 2 staging, 2 staged, 1 rebooting")
- Set `deployedDigest = targetDigest` when all nodes match
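A minimal sketch of the counting and message formatting is shown below. It assumes each node has already been classified into one phase string, and that `updatingCount` is the sum of staging, staged, and rebooting nodes; both are assumptions of this example, as are the `poolStatus` and `aggregate` names.

```go
package main

import "fmt"

// poolStatus holds the aggregated counts and the UpToDate=False message.
type poolStatus struct {
	NodeCount     int
	UpdatedCount  int
	UpdatingCount int
	DegradedCount int
	Message       string
}

// aggregate counts nodes per phase. Phases used here (hypothetical):
// "Updated", "Staging", "Staged", "Rebooting", "Degraded".
func aggregate(phases []string) poolStatus {
	counts := map[string]int{}
	for _, p := range phases {
		counts[p]++
	}
	st := poolStatus{
		NodeCount:     len(phases),
		UpdatedCount:  counts["Updated"],
		UpdatingCount: counts["Staging"] + counts["Staged"] + counts["Rebooting"],
		DegradedCount: counts["Degraded"],
	}
	// Only build the breakdown while the pool is not fully updated.
	if st.UpdatedCount < st.NodeCount {
		st.Message = fmt.Sprintf("%d/%d updated; %d staging, %d staged, %d rebooting",
			st.UpdatedCount, st.NodeCount,
			counts["Staging"], counts["Staged"], counts["Rebooting"])
	}
	return st
}

func main() {
	st := aggregate([]string{"Updated", "Staging", "Staged", "Rebooting"})
	fmt.Printf("%+v\n", st)
}
```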
Validation (envtest):
- Verify `nodeCount`, `updatedCount`, `updatingCount`, `degradedCount` reflect BootcNode states accurately.
- Verify `UpToDate` condition transitions: `RolloutInProgress` during update, `Paused` when paused, `AllUpdated` when complete.
- Verify `Degraded` condition reasons: `NodeDegraded`, `NodeConflict`, and `OK` when clear.
- Verify `UpToDate=False` message includes state breakdown.
- Verify `deployedDigest = targetDigest` when all nodes match.
- DaemonSet manifest: `privileged: true`, `hostPID: true`, `nodeSelector: bootc.dev/managed: ""`, ServiceAccount + RBAC for BootcNode read/write
- Minimal daemon binary that starts and exits cleanly (no-op)
Validation (e2e): enhance the existing e2e test to include the
DaemonSet, label a node with bootc.dev/managed, and verify the daemon
pod starts on that node.
- Binary identifies its node name (downward API env var)
- Watches its single BootcNode via field selector on `metadata.name`
- Parses `bootc status --json --format-version=1` via `nsenter -m/proc/1/ns/mnt` (filtering `container` env var)
- Writes `BootcNode.status` (booted/staged/rollback fields + `Idle` condition)
- On startup: read bootc status, populate status, set `Idle=True` if `desiredImage == booted`
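The parsing step might look like the sketch below. The struct covers only the few fields the daemon needs, and the JSON layout (`status.booted.image.imageDigest` and friends) is an assumption based on the bootc host status format, which should be checked against the bootc version in use. The sample document is made up, and the sketch parses it from a constant rather than invoking `bootc` through `nsenter`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// imageStatus and bootEntry model a small, assumed subset of bootc's status JSON.
type imageStatus struct {
	Image struct {
		Image     string `json:"image"`
		Transport string `json:"transport"`
	} `json:"image"`
	ImageDigest string `json:"imageDigest"`
}

type bootEntry struct {
	Image *imageStatus `json:"image"`
}

type hostStatus struct {
	Status struct {
		Booted   *bootEntry `json:"booted"`
		Staged   *bootEntry `json:"staged"`
		Rollback *bootEntry `json:"rollback"`
	} `json:"status"`
}

// sample is a made-up document in the assumed layout.
const sample = `{"status":{"booted":{"image":{"image":{"image":"quay.io/example/os:v1","transport":"registry"},"imageDigest":"sha256:aaa"}},"staged":null,"rollback":null}}`

// parseStatus decodes the JSON printed by bootc status.
func parseStatus(data []byte) (hostStatus, error) {
	var hs hostStatus
	err := json.Unmarshal(data, &hs)
	return hs, err
}

// digestOf returns the image digest of an entry, or "" if the entry is absent.
func digestOf(e *bootEntry) string {
	if e == nil || e.Image == nil {
		return ""
	}
	return e.Image.ImageDigest
}

func main() {
	hs, err := parseStatus([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("booted:", digestOf(hs.Status.Booted))
	fmt.Println("staged present:", hs.Status.Staged != nil)
}
```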
Validation (e2e): enhance the existing e2e test to verify the
daemon populates BootcNode status from bootc status.
- Detect `spec.desiredImage != booted` → set `Idle=False reason=Staging`, run `bootc switch <desiredImage>` (no `--download-only` for now; pending upstream)
- On success → set `Idle=False reason=Staged`
- On error → set `Idle=True`, `Degraded=True reason=Error`
- Detect `spec.desiredImageState == Booted` + staged matches desired → set `Idle=False reason=Rebooting`, run `bootc switch --apply` and reboot
- Handle re-stage: if `staged.imageDigest != desiredImage` in Staged state, go back to Staging
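The decision logic in the list above, excluding the error path, can be captured as a small pure function. This sketch treats `desiredImage` and the booted/staged digests as directly comparable strings, as the list does; `nextReason` is a hypothetical name, and it returns the `Idle` condition reason the daemon would report next, with an empty string standing for `Idle=True`.

```go
package main

import "fmt"

// nextReason returns the Idle condition reason the daemon should move to.
// An empty string means nothing to do (Idle=True).
func nextReason(desiredImage, desiredState, booted, staged string) string {
	switch {
	case desiredImage == booted:
		return "" // already running the desired image
	case staged != desiredImage:
		return "Staging" // nothing staged yet, or re-stage after desiredImage changed
	case desiredState == "Booted":
		return "Rebooting" // staged matches and the controller granted a reboot slot
	default:
		return "Staged" // staged matches; wait for desiredImageState: Booted
	}
}

func main() {
	fmt.Println(nextReason("sha256:new", "Staged", "sha256:old", ""))
	fmt.Println(nextReason("sha256:new", "Staged", "sha256:old", "sha256:new"))
	fmt.Println(nextReason("sha256:new", "Booted", "sha256:old", "sha256:new"))
}
```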
Validation (e2e): enhance the existing e2e test to push an updated
bootc image to the local registry, set desiredImage to the new image,
and verify the daemon stages it and reports Idle=False reason=Staged.
Then set desiredImageState: Booted, verify the node reboots into the
new image and the daemon reports Idle=True.
- fsnotify watch on `/proc/1/root/ostree/bootc` for CHMOD events
- Fallback: also try `/proc/1/root/sysroot/state/deploy` (for composefs where `/ostree/bootc` doesn't exist)
- Polling fallback every ~5 minutes
- On event: re-read bootc status, update BootcNode.status if changed
Validation (e2e): add a test that triggers an external bootc status change on the host (e.g. via SSH) and verifies the daemon detects it and updates BootcNode.status within a bounded time window (shorter than the polling interval, proving fsnotify is working).
- Registry client (e.g. `go-containerregistry`) to resolve tags to digests
- On pool reconcile: if `image.ref` is a tag and enough time has elapsed since last resolution, query registry
- Store resolved digest in `status.targetDigest`, track resolution time
- Set `updateAvailable = (targetDigest != deployedDigest)`
- Schedule next resolution via `RequeueAfter`
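The gating around registry queries could be sketched as follows, without the registry client itself. The helper names and the 5-minute interval are hypothetical, and the digest check only recognizes `@sha256:` references, which is a simplification; a real implementation would more likely parse the reference with the registry library.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// isDigestRef reports whether ref pins a digest (simplified: sha256 only).
func isDigestRef(ref string) bool {
	return strings.Contains(ref, "@sha256:")
}

// needsResolve reports whether a tag reference is due for another registry query.
func needsResolve(ref string, sinceLast, interval time.Duration) bool {
	return !isDigestRef(ref) && sinceLast >= interval
}

// updateAvailable mirrors updateAvailable = (targetDigest != deployedDigest).
func updateAvailable(targetDigest, deployedDigest string) bool {
	return targetDigest != deployedDigest
}

func main() {
	interval := 5 * time.Minute // hypothetical resolution interval
	fmt.Println(needsResolve("quay.io/example/os:latest", 10*time.Minute, interval))
	fmt.Println(needsResolve("quay.io/example/os@sha256:abc", 10*time.Minute, interval))
	fmt.Println(updateAvailable("sha256:new", "sha256:old"))
}
```

When `needsResolve` returns false for a tag, the remaining time until the interval elapses is a natural value for `RequeueAfter`.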
Validation (e2e): add a test that pushes two image versions to the
local registry under a tag, creates a pool referencing the tag, and
verifies the tag resolves and the initial rollout completes. Then push
a new version to the same tag, verify updateAvailable surfaces, and
the updated image rolls out to the node.
- Watch Secrets referenced by pools (`EnqueueRequestsFromMapFunc`)
- Controller copies `pullSecretRef` + content hash to `BootcNode.spec`
- Daemon: on `pullSecretHash` change, GET the Secret, write `.dockerconfigjson` to `/run/ostree/auth.json` on the host via nsenter
- Daemon ServiceAccount gets `get` on Secrets in operator namespace
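One possible way to compute the content hash is a SHA-256 over the `.dockerconfigjson` payload, sketched below. The choice of SHA-256 and the helper names are assumptions of this example; the plan only requires that the hash changes when the Secret content changes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// secretHash returns a hex-encoded SHA-256 of the pull secret payload.
func secretHash(dockerConfigJSON []byte) string {
	sum := sha256.Sum256(dockerConfigJSON)
	return hex.EncodeToString(sum[:])
}

// needsRewrite reports whether the daemon should re-fetch the Secret and
// rewrite the auth file, given the hash in spec and the last one it applied.
func needsRewrite(specHash, appliedHash string) bool {
	return specHash != "" && specHash != appliedHash
}

func main() {
	old := secretHash([]byte(`{"auths":{"registry.example":{"auth":"b2xk"}}}`))
	cur := secretHash([]byte(`{"auths":{"registry.example":{"auth":"bmV3"}}}`))
	fmt.Println("rewrite needed:", needsRewrite(cur, old))
}
```

Copying only the hash into `BootcNode.spec` keeps credentials out of the BootcNode object while still giving the daemon a change signal.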
Validation (e2e): add a test that creates a pull-secret-protected image in the registry, verifies the operator propagates the secret, and the daemon uses it for staging.
A lot of this can and should be done in parallel with the earlier milestones. Once we have a test harness and a basic E2E test, we should be unblocked on hooking up CI.
- Render a static manifest from `config/default/` kustomize for single-command install (`kubectl create -f https://...`) without needing the repo cloned
- CI pipeline running bink-based e2e tests
- README with quickstart