fix: release stale IP assignments from crashed containers in NodeSubnet mode#4350
Conversation
Pull request overview
Fixes IP leaks in NodeSubnet mode caused by pod sandbox/container restarts (new InfraContainerID / interface key) by proactively releasing stale IP assignments, preventing pool exhaustion and scheduling failures.
Changes:
- Add runtime stale-IP cleanup before allocating a new IP config (releaseStaleIPConfigsForPod).
- Add startup reconciliation cleanup to release Assigned IPs not matching the provided “active” pod key set (releaseStaleAssignedIPs).
- Add unit tests covering stale release behavior and crash-loop exhaustion prevention scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| cns/restserver/ipam.go | Adds runtime stale-IP release prior to allocation. |
| cns/restserver/internalapi.go | Adds reconciliation-time stale Assigned-IP release based on “active” pod keys. |
| cns/restserver/ipam_test.go | Adds tests for pod restart / crash-loop scenarios and no-op behavior. |
| cns/restserver/internalapi_test.go | Adds tests intended to validate NodeSubnet reconciliation stale-IP cleanup. |
```go
// Build set of active pod keys and release any stale Assigned IPs that
// don't correspond to currently running pods. This handles cases where
// CNS state was persisted with orphaned IP assignments from previous
// container restarts.
activePodKeys := make(map[string]bool, len(podInfoByIP))
for _, pi := range podInfoByIP {
	activePodKeys[pi.Key()] = true
}
service.releaseStaleAssignedIPs(activePodKeys)
```
The new reconciliation cleanup builds activePodKeys from the podInfoByIP argument and then releases any Assigned IPs whose pod key isn’t in that set. In the NodeSubnet initialization path, podInfoByIP is sourced from the CNS endpoint state store (previous saved endpoints), which may include stale entries (so leaks won’t be released) or may be incomplete (risking release of IPs for still-running pods). The comment here also says “currently running pods”, which isn’t necessarily true for that provider. Please clarify the contract for podInfoByIP (and/or wire this to a provider that enumerates actual active pods) and consider guarding the release to avoid releasing IPs when the active set is known to be unreliable.
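One way to guard the release, sketched under the assumption that an empty active set signals unreliable input (the empty-set check and log line are illustrative, not existing CNS code):

```go
// Sketch: skip the stale-IP release pass when the "active" pod set is empty,
// rather than risk freeing IPs that belong to still-running pods.
if len(activePodKeys) == 0 {
	logger.Printf("skipping stale IP release: active pod set is empty and may be unreliable")
} else {
	service.releaseStaleAssignedIPs(activePodKeys)
}
```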
```go
oldContainerID := "aaaa1111bbbb2222cccc3333dddd4444eeee5555ffff6666"
oldInterfaceID := oldContainerID[:8] + "-eth0"
oldPodInfo := cns.NewPodInfo(oldInterfaceID, oldContainerID, "mypod", "default")

req1 := cns.IPConfigsRequest{
	PodInterfaceID:   oldPodInfo.InterfaceID(),
	InfraContainerID: oldPodInfo.InfraContainerID(),
}
```
TestIPAMReleaseStaleIPOnPodRestart builds PodInfo with cns.NewPodInfo(oldInterfaceID, oldContainerID, ...), but NewPodInfo’s signature is (infraContainerID, interfaceID, ...). This swaps InfraContainerID/InterfaceID and makes the subsequent request populate PodInterfaceID and InfraContainerID with reversed values, so the test may not reflect the real restart scenario and can mask regressions. Consider passing args in the correct order (infraContainerID first, interfaceID second) and setting IPConfigsRequest fields directly to match the runtime inputs you want to simulate.
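A minimal sketch of the suggested fix, reusing the names already defined in the test and following the NewPodInfo argument order noted above (infraContainerID first, interfaceID second):

```go
// Suggested construction: arguments in signature order, request fields set
// explicitly to the runtime inputs being simulated.
oldPodInfo := cns.NewPodInfo(oldContainerID, oldInterfaceID, "mypod", "default")

req1 := cns.IPConfigsRequest{
	PodInterfaceID:   oldInterfaceID,
	InfraContainerID: oldContainerID,
}
```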
```go
// TestReconcileIPAMStateForNodeSubnetReleasesStaleIPs verifies that
// ReconcileIPAMStateForNodeSubnet releases IPs that are Assigned to pods
// that are no longer running on the node (i.e. their key is not in podInfoByIP).
func TestReconcileIPAMStateForNodeSubnetReleasesStaleIPs(t *testing.T) {
	restartService()
	setEnv(t)
	setOrchestratorTypeInternal(cns.KubernetesCRD)

	// Create an NC with 4 secondary IPs
	secondaryIPConfigs := make(map[string]cns.SecondaryIPConfig)
	ipIDs := make([]string, 4)
	for i := 0; i < 4; i++ {
		ipaddress := fmt.Sprintf("10.0.0.%d", 6+i)
		secIPConfig := newSecondaryIPConfig(ipaddress, -1)
		ipID := uuid.New().String()
		ipIDs[i] = ipID
		secondaryIPConfigs[ipID] = secIPConfig
	}
	ncReq := generateNetworkContainerRequest(secondaryIPConfigs, "nodesubnet-nc1", "-1")

	// Simulate: 2 pods are currently running, and 1 stale pod (old InfraContainerID) has leaked IPs.
	// Pod1 (running) has IP 10.0.0.6
	// Pod2 (running) has IP 10.0.0.7
	// Pod1's OLD container (stale) had IP 10.0.0.8 — this should be released.
	// 10.0.0.9 is available
	activePodInfo := map[string]cns.PodInfo{
		"10.0.0.6": cns.NewPodInfo("newcontainer1", "newcont1-eth0", "pod1", "ns1"),
		"10.0.0.7": cns.NewPodInfo("container2", "cont2-eth0", "pod2", "ns1"),
	}

	returnCode := svc.ReconcileIPAMStateForNodeSubnet(
		[]*cns.CreateNetworkContainerRequest{ncReq}, activePodInfo,
	)
	require.Equal(t, types.Success, returnCode)

	// Verify: 2 IPs should be Assigned, rest should be Available
	assignedCount := 0
	availableCount := 0
	for _, ipConfig := range svc.PodIPConfigState { //nolint:gocritic // ignore copy
		switch ipConfig.GetState() {
		case types.Assigned:
			assignedCount++
		case types.Available:
			availableCount++
		}
	}
	assert.Equal(t, 2, assignedCount, "only the 2 active pods should have Assigned IPs")
	assert.Equal(t, 2, availableCount, "remaining 2 IPs should be Available")
}
```
TestReconcileIPAMStateForNodeSubnetReleasesStaleIPs claims to cover a leaked/stale assignment (10.0.0.8) but never seeds CNS state with any pre-existing Assigned IPs for a non-active pod key before calling ReconcileIPAMStateForNodeSubnet. As written, the assertions (2 Assigned / 2 Available) would pass even without releaseStaleAssignedIPs, so the test doesn’t actually validate the new cleanup behavior. Seed an Assigned IP entry with a pod key not present in activePodInfo (or run two reconciles where the second omits a previously-assigned pod) and assert that key is removed and the IP transitions back to Available.
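A sketch of the suggested seeding step, to run before the reconcile call (the PodInfo field and SetState method on the IP config entries are assumptions about the CNS state model, not verified against this PR):

```go
// Seed a leaked assignment: mark 10.0.0.8 Assigned to pod1's OLD container
// key, which is deliberately absent from activePodInfo.
stalePod := cns.NewPodInfo("oldcontainer1", "oldcont1-eth0", "pod1", "ns1")
for id, ipConfig := range svc.PodIPConfigState { //nolint:gocritic // ignore copy
	if ipConfig.IPAddress == "10.0.0.8" {
		ipConfig.PodInfo = stalePod
		ipConfig.SetState(types.Assigned)
		svc.PodIPConfigState[id] = ipConfig
	}
}
// After ReconcileIPAMStateForNodeSubnet returns, assert that 10.0.0.8 is
// Available again and that stalePod.Key() no longer appears in the state.
```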
Reason for Change:
When a pod's sandbox is recreated (e.g., during crash-loops or failed sandbox creation), kubelet creates a new container with a new InfraContainerID. CNS uses this ID to derive the pod's interface key for IP tracking. The old IP assignment (under the old key) is never released if CNI DEL fails or is skipped for the partially-created sandbox, causing an IP leak. Over time, repeated restarts exhaust the entire IP pool, blocking all new pod scheduling on the node.
This was observed in production (ICM 755853515) where 4 system pods crash-looping on a single node leaked 32 IPs (the entire pool), leaving 0 IPs available for 12+ hours.
The SWIFT/CRD path already handles this via IPsNotInUse in the NodeNetworkConfig spec. The NodeSubnet path had no equivalent cleanup.
This PR adds stale IP cleanup at two points:
Runtime (ipam.go): Before allocating a new IP, releaseStaleIPConfigsForPod checks for IPs assigned to the same pod name+namespace but under a different interface key (previous container), and releases them.
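The runtime check might look roughly like the following. This is a sketch only: the PodInfo field on each IP config entry and the releaseIPConfigs helper are assumptions, not the PR's actual code.

```go
// releaseStaleIPConfigsForPod (sketch): before assigning a new IP, release any
// IP still Assigned to the same pod name+namespace under a different interface
// key, i.e. a previous container whose CNI DEL never ran.
func (service *HTTPRestService) releaseStaleIPConfigsForPod(podInfo cns.PodInfo) {
	for _, ipConfig := range service.PodIPConfigState { //nolint:gocritic // ignore copy
		if ipConfig.GetState() != types.Assigned || ipConfig.PodInfo == nil {
			continue
		}
		if ipConfig.PodInfo.Name() == podInfo.Name() &&
			ipConfig.PodInfo.Namespace() == podInfo.Namespace() &&
			ipConfig.PodInfo.Key() != podInfo.Key() {
			service.releaseIPConfigs(ipConfig.PodInfo) // assumed release helper
		}
	}
}
```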
Startup reconciliation (internalapi.go): After ReconcileIPAssignment, releaseStaleAssignedIPs compares all Assigned IPs against the set of currently running pods and releases any that don't match an active pod key.
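And the reconciliation pass in the same sketch style (again, helper and field names are assumed for illustration):

```go
// releaseStaleAssignedIPs (sketch): after ReconcileIPAssignment, free every
// Assigned IP whose pod key is not present in the active set.
func (service *HTTPRestService) releaseStaleAssignedIPs(activePodKeys map[string]bool) {
	for _, ipConfig := range service.PodIPConfigState { //nolint:gocritic // ignore copy
		if ipConfig.GetState() != types.Assigned || ipConfig.PodInfo == nil {
			continue
		}
		if !activePodKeys[ipConfig.PodInfo.Key()] {
			service.releaseIPConfigs(ipConfig.PodInfo) // assumed release helper
		}
	}
}
```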
Issue Fixed:
Fixes ICM 755853515 — Stuck pod due to IP allocation failure in NodeSubnet mode.
Requirements:
Notes: