diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 00000000..023bb2c4
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,210 @@
+# wasp-agent Architecture
+
+## Overview
+
+wasp-agent provides a non-intrusive method to enable LimitedSwap for OpenShift
+pod containers. It replicates the functionality described in
+[KEP-2400](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap)
+without requiring changes to the kubelet API or the activation of experimental
+feature gates.
+
+## Motivation
+
+Traditional virtualization platforms (e.g., RHEV, VMware vSphere) leverage
+memory overcommitment and swap to achieve high VM density. While OpenShift
+Virtualization supports memory overcommitment, its dependency on KEP-2400 for
+swap support presents a challenge. Because KEP-2400 remains in an experimental
+state, enabling it in production OpenShift clusters is prohibited, as it risks
+cluster stability and prevents future upgrades.
+
+### Goals
+
+- Provide LimitedSwap functionality as described in KEP-2400 without requiring
+  the experimental swap feature-gate.
+- Design the solution for seamless transition to the Kubernetes-native
+  LimitedSwap implementation once it reaches GA.
+
+### Non-Goals
+
+- Unlimited swap memory usage.
+- Swap only for VM pods.
+- Protecting workloads from high swap activity.
+- Platforms other than OpenShift.
+- Single-node clusters.
+- Compact clusters (schedulable infra nodes).
+- Hosted control planes.
+
+## User Stories
+
+- As a cluster administrator I want to enable swap for VM workload pods so I
+  can achieve higher VM density on my cluster.
+- As a cluster administrator I want to keep my swap-enabled cluster
+  upgradeable.
+- As a cluster administrator I want to move my swap-enabled cluster to native
+  Kubernetes swap once it reaches GA without observing any regression related
+  to swapping.
+
+## Design
+
+To implement swap support independently of the kubelet, wasp-agent intercepts
+the pod creation phase and enforces swap limits that persist throughout the pod
+lifecycle. We evaluated the Node Resource Interface (NRI) and OCI runtime hooks
+for lifecycle integration and selected OCI hooks due to their established
+maturity and broad adoption in the ecosystem.
+
+When the OCI runtime calls the OCI hook, it passes pod information to the hook
+via stdin. However, that information lacks the pod memory requests, the pod
+priority, and the QoS classification. Filtering based on QoS classification
+is still possible at the hook level because each QoS class has a dedicated
+cgroup hierarchy in the cgroupfs. The container cgroup path is parsed from
+stdin and its QoS class is derived from the path.
+
+To fully comply with KEP-2400, a multi-phased approach is used:
+
+1. **OCI hook**: filter based on QoS class, set unlimited swap.
+2. **Controller**: watch running pods and fix the swap value if necessary.
+
+The wasp-agent controller uses an informer to read pod information from the API
+server.
+
+![OCI Hook Sequence](./images/wasp-agent-hook.png)
+
+### Direct Manipulation of cgroupfs
+
+Using the OCI hook, it is possible to write the `max` value to the
+`memory.swap.max` interface file. However, in the OpenShift platform both
+kubelet and CRI-O are configured to use the systemd cgroup driver.
+
+The initial assumption was that swap values would be set only once during the
+container lifecycle, at its creation phase and before the application starts.
+That assumption proved wrong: systemd's cgroup driver can re-apply the full
+set of values to the container cgroupfs hierarchy upon resource update requests.
+One such case is when the CPU manager static policy is enabled: resource update
+requests can arrive at any stage of the container lifecycle.
+
+### Prestart vs. Poststart
+
+To sync the swap value with systemd, the D-Bus client must be used.
+At the same time, the hook implementation is kept as simple as possible without
+introducing delays or performance issues. When the OCI runtime executes a
+resource update to a running container, it triggers the D-Bus client and systemd
+is synced with the new values. The OCI runtime needs to read an internal
+container state file for the update, and this file does not yet exist when a
+prestart hook is called. Therefore the hook runs at the **poststart** stage.
+
+### Short-Lived Containers
+
+The poststart hook timing means container PID 1 is already running when the
+hook executes. Some deployments have containers with a very short lifetime (e.g.
+a single-line init-container that copies a file). A race condition exists
+between hook execution and systemd cgroup teardown: the update fails because
+the container cgroup no longer exists.
+
+To handle this, the hook captures the return code of the update. On error, it
+checks whether the container PID still exists; if not, it exits silently. The
+assumption is that the memory footprint of short-lived containers is low enough
+that enabling swap for them is not critical.
+
+### OCI Hook Deployment
+
+For hook injection, CRI-O requires specific configuration. The wasp-agent
+controller is responsible for deploying:
+
+- **OCI hook configuration JSON**: the hook descriptor.
+- **Hook executable shell script**: the actual hook implementation.
+
+Starting with OpenShift 4.20, `runc` was replaced by `crun`. This affects the
+hook script, which uses the OCI runtime tool to update resources. For backward
+compatibility, wasp-agent detects the effective OCI runtime from the CRI-O
+configuration and renders the hook script accordingly. This dynamic detection
+rules out using the Machine Config Operator for deployment, which would require
+maintaining different hook file versions mapped to OpenShift versions.
+
+The controller also handles OCI file cleanup before termination.
+To keep this cleanup from removing files created later by other instances on
+the same node, OCI files are suffixed with the current wasp-agent pod name.
+
+### Swap Limit Fix-Up
+
+The OCI hook does not have the full picture of the pod spec. Still, swap must
+be available to the application from the moment it starts, with the swap limit
+corrected later. The wasp-agent controller handles this fix-up using client-go
+to list running pods.
+
+Since the controller action is node-local, it only lists pods running on the
+same node. The client-go client is configured with a field selector matching
+the node name.
+
+For consistency, the controller continuously relists all running pods and
+performs the fix if needed, acting as a reconciliation loop for the swap memory
+limit. The fix is applied directly to `memory.swap.max` in the cgroup.
+Interacting with the D-Bus client or OCI runtime for the update would complicate
+the implementation and risk racing with kubelet.
+
+To calculate the cgroup path, the controller uses the CRI-O CRI API to fetch
+the container host PID, then reads the path from `/proc//cgroup`.
+
+![Swap Limit Fix-Up Controller](./images/wasp-agent-controller.png)
+
+## Deployment
+
+The following Kubernetes resources are required:
+
+| Resource | Purpose |
+|---|---|
+| Namespace | Isolation for wasp-agent components |
+| DaemonSet | Deploy wasp-agent on every worker node |
+| ServiceAccount | Identity for the wasp-agent pods |
+| Role | RBAC permissions (list pods) |
+| RoleBinding | Bind the role to the service account |
+| SecurityContextConstraints | Privileged access for OCI file and cgroup operations |
+
+Since swapping is supported only on worker nodes, wasp-agent is deployed as a
+DaemonSet. The application needs list permissions on Pod resources.
+
+From a pod security perspective, wasp-agent performs the following privileged
+operations:
+
+- Install and clean OCI files.
+- Communicate with CRI-O via the gRPC unix socket.
+- Write values to the cgroupfs.
+
+wasp-agent is considered a system-critical pod because of the swap limit
+fix-up; its pod priority is adjusted accordingly.
+
+## Alternatives Considered
+
+### Kubernetes Swap (KEP-2400)
+
+At the time of this proposal, Kubernetes swap is still graduating; its usage
+requires a feature gate. Enabling it makes an OpenShift cluster
+non-upgradeable.
+
+### NRI (Node Resource Interface)
+
+NRI is a strong candidate as an alternative to OCI hooks. The OCI hook
+approach was chosen due to feature maturity and proven usage in other solutions.
+
+## Upgrade / Rollback Compatibility
+
+**Upgrade:** Installing wasp-agent takes immediate effect. Existing workload
+resource limits are updated with new swap memory limits. If swap is not
+provisioned, default behavior is preserved.
+
+**Rollback:** To opt out, workloads must be restarted to restore the previous
+behavior (no swap).
+
+### Upgrading to Kubernetes Swap GA
+
+During the upgrade, the target OpenShift version should:
+
+1. Discontinue deploying wasp-agent.
+2. Opt in for Kubernetes swap.
+
+The transition is seamless: workloads continue consuming swap memory on the
+newly upgraded nodes.
+
+## Testing
+
+Functional testing follows the same approach as KEP-2400, running on the
+OpenShift platform using OpenShift CI.
diff --git a/docs/images/wasp-agent-controller.png b/docs/images/wasp-agent-controller.png
new file mode 100644
index 00000000..b6a4f25c
Binary files /dev/null and b/docs/images/wasp-agent-controller.png differ
diff --git a/docs/images/wasp-agent-hook.png b/docs/images/wasp-agent-hook.png
new file mode 100644
index 00000000..51f3272b
Binary files /dev/null and b/docs/images/wasp-agent-hook.png differ