Skip to content

fix: release stale IP assignments from crashed containers in NodeSubnet mode#4350

Open
bohuini wants to merge 1 commit intomasterfrom
bohui/fixipleak
Open

fix: release stale IP assignments from crashed containers in NodeSubnet mode#4350
bohuini wants to merge 1 commit intomasterfrom
bohui/fixipleak

Conversation

@bohuini
Copy link
Copy Markdown
Contributor

@bohuini bohuini commented Apr 13, 2026

…net mode

Reason for Change:
When a pod's sandbox is recreated (e.g., during crash-loops or failed sandbox creation), kubelet creates a new container with a new InfraContainerID. CNS uses this ID to derive the pod's interface key for IP tracking. The old IP assignment (under the old key) is never released if CNI DEL fails or is skipped for the partially-created sandbox, causing an IP leak. Over time, repeated restarts exhaust the entire IP pool, blocking all new pod scheduling on the node.

This was observed in production (ICM 755853515) where 4 system pods crash-looping on a single node leaked 32 IPs (the entire pool), leaving 0 IPs available for 12+ hours.

The SWIFT/CRD path already handles this via IPsNotInUse in the NodeNetworkConfig spec. The NodeSubnet path had no equivalent cleanup.

This PR adds stale IP cleanup at two points:

  1. Runtime (ipam.go): Before allocating a new IP, releaseStaleIPConfigsForPod checks for IPs assigned to the same pod name+namespace but under a different interface key (previous container), and releases them.

  2. Startup reconciliation (internalapi.go): After ReconcileIPAssignment, releaseStaleAssignedIPs compares all Assigned IPs against the set of currently running pods and releases any that don't match an active pod key.

Issue Fixed:
Fixes ICM 755853515 — Stuck pod due to IP allocation failure in NodeSubnet mode.

Requirements:

Notes:

@bohuini bohuini requested a review from a team as a code owner April 13, 2026 23:43
@bohuini bohuini requested review from Copilot and csfmomo April 13, 2026 23:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes IP leaks in NodeSubnet mode caused by pod sandbox/container restarts (new InfraContainerID / interface key) by proactively releasing stale IP assignments, preventing pool exhaustion and scheduling failures.

Changes:

  • Add runtime stale-IP cleanup before allocating a new IP config (releaseStaleIPConfigsForPod).
  • Add startup reconciliation cleanup to release Assigned IPs not matching the provided “active” pod key set (releaseStaleAssignedIPs).
  • Add unit tests covering stale release behavior and crash-loop exhaustion prevention scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
cns/restserver/ipam.go Adds runtime stale-IP release prior to allocation.
cns/restserver/internalapi.go Adds reconciliation-time stale Assigned-IP release based on “active” pod keys.
cns/restserver/ipam_test.go Adds tests for pod restart / crash-loop scenarios and no-op behavior.
cns/restserver/internalapi_test.go Adds tests intended to validate NodeSubnet reconciliation stale-IP cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +462 to +470
// Build set of active pod keys and release any stale Assigned IPs that
// don't correspond to currently running pods. This handles cases where
// CNS state was persisted with orphaned IP assignments from previous
// container restarts.
activePodKeys := make(map[string]bool, len(podInfoByIP))
for _, pi := range podInfoByIP {
activePodKeys[pi.Key()] = true
}
service.releaseStaleAssignedIPs(activePodKeys)
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new reconciliation cleanup builds activePodKeys from the podInfoByIP argument and then releases any Assigned IPs whose pod key isn’t in that set. In the NodeSubnet initialization path, podInfoByIP is sourced from the CNS endpoint state store (previous saved endpoints), which may include stale entries (so leaks won’t be released) or may be incomplete (risking release of IPs for still-running pods). The comment here also says “currently running pods”, which isn’t necessarily true for that provider. Please clarify the contract for podInfoByIP (and/or wire this to a provider that enumerates actual active pods) and consider guarding the release to avoid releasing IPs when the active set is known to be unreliable.

Copilot uses AI. Check for mistakes.
Comment on lines +2505 to +2512
oldContainerID := "aaaa1111bbbb2222cccc3333dddd4444eeee5555ffff6666"
oldInterfaceID := oldContainerID[:8] + "-eth0"
oldPodInfo := cns.NewPodInfo(oldInterfaceID, oldContainerID, "mypod", "default")

req1 := cns.IPConfigsRequest{
PodInterfaceID: oldPodInfo.InterfaceID(),
InfraContainerID: oldPodInfo.InfraContainerID(),
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestIPAMReleaseStaleIPOnPodRestart builds PodInfo with cns.NewPodInfo(oldInterfaceID, oldContainerID, ...), but NewPodInfo’s signature is (infraContainerID, interfaceID, ...). This swaps InfraContainerID/InterfaceID and makes the subsequent request populate PodInterfaceID and InfraContainerID with reversed values, so the test may not reflect the real restart scenario and can mask regressions. Consider passing args in the correct order (infraContainerID first, interfaceID second) and setting IPConfigsRequest fields directly to match the runtime inputs you want to simulate.

Copilot uses AI. Check for mistakes.
Comment on lines +1762 to +1809
// TestReconcileIPAMStateForNodeSubnetReleasesStaleIPs verifies that
// ReconcileIPAMStateForNodeSubnet releases IPs that are Assigned to pods
// that are no longer running on the node (i.e. their key is not in podInfoByIP).
func TestReconcileIPAMStateForNodeSubnetReleasesStaleIPs(t *testing.T) {
restartService()
setEnv(t)
setOrchestratorTypeInternal(cns.KubernetesCRD)

// Create an NC with 4 secondary IPs
secondaryIPConfigs := make(map[string]cns.SecondaryIPConfig)
ipIDs := make([]string, 4)
for i := 0; i < 4; i++ {
ipaddress := fmt.Sprintf("10.0.0.%d", 6+i)
secIPConfig := newSecondaryIPConfig(ipaddress, -1)
ipID := uuid.New().String()
ipIDs[i] = ipID
secondaryIPConfigs[ipID] = secIPConfig
}
ncReq := generateNetworkContainerRequest(secondaryIPConfigs, "nodesubnet-nc1", "-1")

// Simulate: 2 pods are currently running, and 1 stale pod (old InfraContainerID) has leaked IPs.
// Pod1 (running) has IP 10.0.0.6
// Pod2 (running) has IP 10.0.0.7
// Pod1's OLD container (stale) had IP 10.0.0.8 — this should be released.
// 10.0.0.9 is available
activePodInfo := map[string]cns.PodInfo{
"10.0.0.6": cns.NewPodInfo("newcontainer1", "newcont1-eth0", "pod1", "ns1"),
"10.0.0.7": cns.NewPodInfo("container2", "cont2-eth0", "pod2", "ns1"),
}

returnCode := svc.ReconcileIPAMStateForNodeSubnet(
[]*cns.CreateNetworkContainerRequest{ncReq}, activePodInfo,
)
require.Equal(t, types.Success, returnCode)

// Verify: 2 IPs should be Assigned, rest should be Available
assignedCount := 0
availableCount := 0
for _, ipConfig := range svc.PodIPConfigState { //nolint:gocritic // ignore copy
switch ipConfig.GetState() {
case types.Assigned:
assignedCount++
case types.Available:
availableCount++
}
}
assert.Equal(t, 2, assignedCount, "only the 2 active pods should have Assigned IPs")
assert.Equal(t, 2, availableCount, "remaining 2 IPs should be Available")
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestReconcileIPAMStateForNodeSubnetReleasesStaleIPs claims to cover a leaked/stale assignment (10.0.0.8) but never seeds CNS state with any pre-existing Assigned IPs for a non-active pod key before calling ReconcileIPAMStateForNodeSubnet. As written, the assertions (2 Assigned / 2 Available) would pass even without releaseStaleAssignedIPs, so the test doesn’t actually validate the new cleanup behavior. Seed an Assigned IP entry with a pod key not present in activePodInfo (or run two reconciles where the second omits a previously-assigned pod) and assert that key is removed and the IP transitions back to Available.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants