fix: release stale IP assignments from crashed containers in NodeSubnet mode#4350
Conversation
Pull request overview
Fixes IP leaks in NodeSubnet mode caused by pod sandbox/container restarts (new InfraContainerID / interface key) by proactively releasing stale IP assignments, preventing pool exhaustion and scheduling failures.
Changes:
- Add runtime stale-IP cleanup before allocating a new IP config (releaseStaleIPConfigsForPod).
- Add startup reconciliation cleanup to release Assigned IPs not matching the provided “active” pod key set (releaseStaleAssignedIPs).
- Add unit tests covering stale release behavior and crash-loop exhaustion prevention scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| cns/restserver/ipam.go | Adds runtime stale-IP release prior to allocation. |
| cns/restserver/internalapi.go | Adds reconciliation-time stale Assigned-IP release based on “active” pod keys. |
| cns/restserver/ipam_test.go | Adds tests for pod restart / crash-loop scenarios and no-op behavior. |
| cns/restserver/internalapi_test.go | Adds tests intended to validate NodeSubnet reconciliation stale-IP cleanup. |
```go
// Build set of active pod keys and release any stale Assigned IPs that
// don't correspond to currently running pods. This handles cases where
// CNS state was persisted with orphaned IP assignments from previous
// container restarts.
activePodKeys := make(map[string]bool, len(podInfoByIP))
for _, pi := range podInfoByIP {
	activePodKeys[pi.Key()] = true
}
service.releaseStaleAssignedIPs(activePodKeys)
```
The new reconciliation cleanup builds activePodKeys from the podInfoByIP argument and then releases any Assigned IPs whose pod key isn’t in that set. In the NodeSubnet initialization path, podInfoByIP is sourced from the CNS endpoint state store (previous saved endpoints), which may include stale entries (so leaks won’t be released) or may be incomplete (risking release of IPs for still-running pods). The comment here also says “currently running pods”, which isn’t necessarily true for that provider. Please clarify the contract for podInfoByIP (and/or wire this to a provider that enumerates actual active pods) and consider guarding the release to avoid releasing IPs when the active set is known to be unreliable.
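One way to guard the release, sketched under the assumption that an empty active set signals unreliable input (the empty-set check and log line are illustrative, not existing CNS code):

```go
// Sketch: skip the stale-IP release pass when the "active" pod set is empty,
// rather than risk freeing IPs that belong to still-running pods.
if len(activePodKeys) == 0 {
	logger.Printf("skipping stale IP release: active pod set is empty and may be unreliable")
} else {
	service.releaseStaleAssignedIPs(activePodKeys)
}
```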
```go
oldContainerID := "aaaa1111bbbb2222cccc3333dddd4444eeee5555ffff6666"
oldInterfaceID := oldContainerID[:8] + "-eth0"
oldPodInfo := cns.NewPodInfo(oldInterfaceID, oldContainerID, "mypod", "default")

req1 := cns.IPConfigsRequest{
	PodInterfaceID:   oldPodInfo.InterfaceID(),
	InfraContainerID: oldPodInfo.InfraContainerID(),
}
```
TestIPAMReleaseStaleIPOnPodRestart builds PodInfo with cns.NewPodInfo(oldInterfaceID, oldContainerID, ...), but NewPodInfo’s signature is (infraContainerID, interfaceID, ...). This swaps InfraContainerID/InterfaceID and makes the subsequent request populate PodInterfaceID and InfraContainerID with reversed values, so the test may not reflect the real restart scenario and can mask regressions. Consider passing args in the correct order (infraContainerID first, interfaceID second) and setting IPConfigsRequest fields directly to match the runtime inputs you want to simulate.
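A minimal sketch of the suggested fix, reusing the names already defined in the test and following the NewPodInfo argument order noted above (infraContainerID first, interfaceID second):

```go
// Suggested construction: arguments in signature order, request fields set
// explicitly to the runtime inputs being simulated.
oldPodInfo := cns.NewPodInfo(oldContainerID, oldInterfaceID, "mypod", "default")

req1 := cns.IPConfigsRequest{
	PodInterfaceID:   oldInterfaceID,
	InfraContainerID: oldContainerID,
}
```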
```go
// TestReconcileIPAMStateForNodeSubnetReleasesStaleIPs verifies that
// ReconcileIPAMStateForNodeSubnet releases IPs that are Assigned to pods
// that are no longer running on the node (i.e. their key is not in podInfoByIP).
func TestReconcileIPAMStateForNodeSubnetReleasesStaleIPs(t *testing.T) {
	restartService()
	setEnv(t)
	setOrchestratorTypeInternal(cns.KubernetesCRD)

	// Create an NC with 4 secondary IPs
	secondaryIPConfigs := make(map[string]cns.SecondaryIPConfig)
	ipIDs := make([]string, 4)
	for i := 0; i < 4; i++ {
		ipaddress := fmt.Sprintf("10.0.0.%d", 6+i)
		secIPConfig := newSecondaryIPConfig(ipaddress, -1)
		ipID := uuid.New().String()
		ipIDs[i] = ipID
		secondaryIPConfigs[ipID] = secIPConfig
	}
	ncReq := generateNetworkContainerRequest(secondaryIPConfigs, "nodesubnet-nc1", "-1")

	// Simulate: 2 pods are currently running, and 1 stale pod (old InfraContainerID) has leaked IPs.
	// Pod1 (running) has IP 10.0.0.6
	// Pod2 (running) has IP 10.0.0.7
	// Pod1's OLD container (stale) had IP 10.0.0.8 — this should be released.
	// 10.0.0.9 is available
	activePodInfo := map[string]cns.PodInfo{
		"10.0.0.6": cns.NewPodInfo("newcontainer1", "newcont1-eth0", "pod1", "ns1"),
		"10.0.0.7": cns.NewPodInfo("container2", "cont2-eth0", "pod2", "ns1"),
	}

	returnCode := svc.ReconcileIPAMStateForNodeSubnet(
		[]*cns.CreateNetworkContainerRequest{ncReq}, activePodInfo,
	)
	require.Equal(t, types.Success, returnCode)

	// Verify: 2 IPs should be Assigned, rest should be Available
	assignedCount := 0
	availableCount := 0
	for _, ipConfig := range svc.PodIPConfigState { //nolint:gocritic // ignore copy
		switch ipConfig.GetState() {
		case types.Assigned:
			assignedCount++
		case types.Available:
			availableCount++
		}
	}
	assert.Equal(t, 2, assignedCount, "only the 2 active pods should have Assigned IPs")
	assert.Equal(t, 2, availableCount, "remaining 2 IPs should be Available")
}
```
TestReconcileIPAMStateForNodeSubnetReleasesStaleIPs claims to cover a leaked/stale assignment (10.0.0.8) but never seeds CNS state with any pre-existing Assigned IPs for a non-active pod key before calling ReconcileIPAMStateForNodeSubnet. As written, the assertions (2 Assigned / 2 Available) would pass even without releaseStaleAssignedIPs, so the test doesn’t actually validate the new cleanup behavior. Seed an Assigned IP entry with a pod key not present in activePodInfo (or run two reconciles where the second omits a previously-assigned pod) and assert that key is removed and the IP transitions back to Available.
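A sketch of the suggested seeding step, to run before the reconcile call (the PodInfo field and SetState method on the IP config entries are assumptions about the CNS state model, not verified against this PR):

```go
// Seed a leaked assignment: mark 10.0.0.8 Assigned to pod1's OLD container
// key, which is deliberately absent from activePodInfo.
stalePod := cns.NewPodInfo("oldcontainer1", "oldcont1-eth0", "pod1", "ns1")
for id, ipConfig := range svc.PodIPConfigState { //nolint:gocritic // ignore copy
	if ipConfig.IPAddress == "10.0.0.8" {
		ipConfig.PodInfo = stalePod
		ipConfig.SetState(types.Assigned)
		svc.PodIPConfigState[id] = ipConfig
	}
}
// After ReconcileIPAMStateForNodeSubnet returns, assert that 10.0.0.8 is
// Available again and that stalePod.Key() no longer appears in the state.
```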
Reason for Change:
When a pod's sandbox is recreated (e.g., during crash-loops or failed sandbox creation), kubelet creates a new container with a new InfraContainerID. CNS uses this ID to derive the pod's interface key for IP tracking. The old IP assignment (under the old key) is never released if CNI DEL fails or is skipped for the partially-created sandbox, causing an IP leak. Over time, repeated restarts exhaust the entire IP pool, blocking all new pod scheduling on the node.
This was observed in production (ICM 755853515) where 4 system pods crash-looping on a single node leaked 32 IPs (the entire pool), leaving 0 IPs available for 12+ hours.
The SWIFT/CRD path already handles this via IPsNotInUse in the NodeNetworkConfig spec. The NodeSubnet path had no equivalent cleanup.
This PR adds stale IP cleanup at two points:
Runtime (ipam.go): Before allocating a new IP, releaseStaleIPConfigsForPod checks for IPs assigned to the same pod name+namespace but under a different interface key (previous container), and releases them.
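The runtime check might look roughly like the following. This is a sketch only: the PodInfo field on each IP config entry and the releaseIPConfigs helper are assumptions, not the PR's actual code.

```go
// releaseStaleIPConfigsForPod (sketch): before assigning a new IP, release any
// IP still Assigned to the same pod name+namespace under a different interface
// key, i.e. a previous container whose CNI DEL never ran.
func (service *HTTPRestService) releaseStaleIPConfigsForPod(podInfo cns.PodInfo) {
	for _, ipConfig := range service.PodIPConfigState { //nolint:gocritic // ignore copy
		if ipConfig.GetState() != types.Assigned || ipConfig.PodInfo == nil {
			continue
		}
		if ipConfig.PodInfo.Name() == podInfo.Name() &&
			ipConfig.PodInfo.Namespace() == podInfo.Namespace() &&
			ipConfig.PodInfo.Key() != podInfo.Key() {
			service.releaseIPConfigs(ipConfig.PodInfo) // assumed release helper
		}
	}
}
```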
Startup reconciliation (internalapi.go): After ReconcileIPAssignment, releaseStaleAssignedIPs compares all Assigned IPs against the set of currently running pods and releases any that don't match an active pod key.
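And the reconciliation pass in the same sketch style (again, helper and field names are assumed for illustration):

```go
// releaseStaleAssignedIPs (sketch): after ReconcileIPAssignment, free every
// Assigned IP whose pod key is not present in the active set.
func (service *HTTPRestService) releaseStaleAssignedIPs(activePodKeys map[string]bool) {
	for _, ipConfig := range service.PodIPConfigState { //nolint:gocritic // ignore copy
		if ipConfig.GetState() != types.Assigned || ipConfig.PodInfo == nil {
			continue
		}
		if !activePodKeys[ipConfig.PodInfo.Key()] {
			service.releaseIPConfigs(ipConfig.PodInfo) // assumed release helper
		}
	}
}
```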
Issue Fixed:
Fixes ICM 755853515 — Stuck pod due to IP allocation failure in NodeSubnet mode.
Requirements:
Notes: