ci: mitigate windows scenario flakiness by QxBytes · Pull Request #4345 · Azure/azure-container-networking

QxBytes · 2026-04-10T22:35:00Z

Reason for Change:

Mitigates flakiness for windows and other test stages in the acn pr pipeline.

Ignore

Extra stages created in pipeline.yaml -- these are for testing only
Hardcoded 1.7.9-0 and MCR in .pipelines/templates/windows-overlay-e2e-step-template.yaml
These MUST be reverted before merging though

All changes should only affect acn pr, but I have run load test/release test without any obvious issues.

Main Changes

Moves the k8s e2e job completely to a step template. Affects only the acn pr pipeline. The template is intended to mimic the existing behavior exactly, and multiple test runs show that it runs without issue. Affects windows and linux, all scenarios.
Moves testing for each scenario into a single job, so it can be retried as a whole. I am aware this could increase the time for a test, but the benefit is that it is much easier to continueOnError and check a single job for a warning than multiple jobs, which brings me to the next point.
Adds an e2e warning handler for the main windows tests: dualstack-overlay, overlay, and stateless (3 total). This adds a job after the main testing that checks whether the logs mention any of the known windows issues, such as null characters in the statefile (if windows hits a bsod) or the network not being found (occurs when the statefile has an hns network but it was somehow deleted from the vm). These issues have only occurred in the pipelines and are basically noise. Windows has already said byo cni for windows is not supported, so these issues are unlikely to be fixed. If these log lines are detected, the stage is marked with a warning but still passes.
Adds exec: already started to the ubuntu cniv1 linux whitelist of error messages. This appears to be a bug in the cni framework: exec.Run should never be called twice, but the upstream cni framework code seems to make that a possible outcome, which makes the test results unreliable if it occurs.
The check cni logs script now prints the matched line, and failure logs are collected even if the main testing stage succeeded with issues. Previously, collection only ran if the main testing job failed.
Modifies the dualstack test that checks the hns network to accept either an HNSNetwork struct or an array of HNSNetwork structs. Previously, if only one HNSNetwork was returned, unmarshalling into the array failed. The second network is the ext hns network, which is not necessary for container networking (just a qol dummy network).
Adds a script that collects logs from pods that are not running, including all sidecars and init containers, and prints the status of all pods on failure.
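
The warning handler and the check script described above both come down to scanning collected logs for known signatures and printing what matched. The following is a minimal sketch of that idea, not the code in this PR: scan_known_issues is a hypothetical function name, and the pattern strings in the usage comment are placeholders rather than the real log lines. The ##vso lines are Azure Pipelines logging commands, shown as one possible way to surface a warning.

```shell
#!/usr/bin/env bash
# scan_known_issues LOG_DIR PATTERN [PATTERN...]
# Hypothetical helper: searches LOG_DIR recursively for each fixed-string
# PATTERN, prints every hit as file:line:text, and returns 0 if any pattern
# matched. An empty or missing directory counts as "nothing found".
scan_known_issues() {
  local log_dir="$1"
  shift
  local found=1
  local pattern
  for pattern in "$@"; do
    # -r recurse, -n show line numbers, -F treat the pattern as a literal string
    if grep -rnF -- "$pattern" "$log_dir" 2>/dev/null; then
      found=0
    fi
  done
  return "$found"
}

# Example use in a pipeline step (patterns are placeholders):
#   if scan_known_issues ./logs 'network not found' '\x00'; then
#     echo "##vso[task.logissue type=warning]known windows issue detected"
#     echo "##vso[task.complete result=SucceededWithIssues;]"
#   fi
```

Printing the matched lines rather than only the file names makes it possible to see from the job output which known issue triggered the warning.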

Mitigating Windows BYOCNI Scenarios
In the new windows overlay template, I aim to keep all existing tests while unifying windows overlay into one template, since previously there were 3 separate templates. The stateless step template was removed because it was windows-only. The others remain because they have linux counterparts. I have added two new parts after investigating the issues with windows byocni nodes.

Windows byo cni nodes do not seem to clean up the azure-vnet.json statefile when I restart the nodes.
Usually, windowsnodereset.ps1 is called, which then calls cleanupnetwork.ps1 to remove stale statefiles, like in linux.
However, cleanupnetwork.ps1 only runs the deletion if the cni name is set to "azure" in kubeclusterconfig.json. For byocni nodes, that value is set to "none".
The first change is to patch kubeclusterconfig.json on windows nodes to use plugin name "azure" before tests run.
The second change is to modify windowsnodereset.ps1 to use *>> instead of >> so that the logs from the cleanupnetwork.ps1 script are visible in windowsnodereset.log; otherwise we have no visibility into those logs.
I have also added collection of windowsnodereset.log to the failure logs collection step.
Witnessed the following for host networking pods during create which we determined were unrelated to our components (containerd?): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: hcs::CreateComputeSystem b9257d8579494625ee895a32e4be77da62991d162a858e44bcd29c116ff510b9: The parameter is incorrect.: unknown

Issue Fixed:

See above

Requirements:

uses conventional commit messages
includes documentation
adds unit tests
relevant PR labels added

Notes:

…ror whitelist in byocni windows since c:\k\kubeclusterconfig.json is set to none instead of azure, the windows cleanup script doesn't run, so a populated azure-vnet.json is not cleaned up on restart. If the hns network is somehow deleted or never created, azure-vnet.exe thinks an hns network exists because it sees the network in azure-vnet.json, but when it tries to open it, it errors out. This is unrecoverable.

When the patch was its own task, a failure (500 error) was never retried. Now that it resides in the windows overlay tests, a failed test restarts and hopefully retries setting the cni name to azure (after the node restart).

If there is only a single hns network, the bytes give you an hns network object, not an array of hns network objects, causing unmarshal to fail. To fix this, we try to unmarshal into a single object if we cannot unmarshal into an array. The second network is the "ext" network, which I do not believe has an impact on the datapath: pod -> node and pod -> pod connectivity are fine.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Mitigates CI flakiness (especially Windows) by restructuring ACN PR pipeline E2E execution into step templates, improving log capture/diagnostics, and adding a warning-based “known issues” handler.

Changes:

Introduces reusable pipeline step templates (Windows overlay E2E + Kubernetes E2E) and migrates scenarios to single-job execution for easier retries.
Adds warning handler jobs to allow known Windows log signatures to yield “SucceededWithIssues” instead of failing the stage.
Improves diagnostics: collects additional Windows logs, prints grep matches, and gathers logs for non-running pods.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 11 comments.

File	Description
test/validate/windows_validate.go	Makes HNS network JSON parsing robust to single-object vs array output; tightens validation messaging/logging.
hack/scripts/patch-kubeclusterconfig.sh	New script to patch Windows node config/reset scripts to improve cleanup and logging visibility.
hack/scripts/k8se2e-tests.sh	Adjusts ginkgo skip expression to incorporate additional skip patterns.
hack/scripts/collect-windows-logs.sh	Collects windowsnodereset.log in addition to existing state/log artifacts.
hack/scripts/collect-bad-pod-info.sh	New diagnostics script to dump events and container logs for non-running pods.
hack/scripts/check-cni-log-contents.sh	Prints matching file list and matching lines for better triage.
.pipelines/templates/windows-overlay-e2e-step-template.yaml	New consolidated Windows overlay E2E step template (control plane, datapath, restart validation, optional wireserver tests).
.pipelines/templates/warning-handler-job-template.yaml	Improves visibility by printing file/pattern parameters before scanning logs.
.pipelines/templates/log-template.yaml	Always prints all pods and runs the new “bad pod info” collector.
.pipelines/templates/k8s-e2e-step-template.yaml	New template to download k8s e2e binaries and run suites (plus Windows kube-proxy restart).
.pipelines/singletenancy/dualstack-overlay/dualstackoverlay-e2e-step-template.yaml	Collapses steps to a single Linux-only execution path (Windows now handled by shared Windows template).
.pipelines/singletenancy/dualstack-overlay/dualstackoverlay-e2e-job-template.yaml	Uses step templates + adds Windows warning handler + collects logs on SucceededWithIssues.
.pipelines/singletenancy/cilium/cilium-e2e-job-template.yaml	Migrates from k8s-e2e job template to new k8s-e2e step template within the scenario job.
.pipelines/singletenancy/cilium-overlay/cilium-overlay-e2e-job-template.yaml	Same migration to k8s-e2e step template.
.pipelines/singletenancy/cilium-overlay-withhubble/cilium-overlay-e2e-job-template.yaml	Same migration to k8s-e2e step template.
.pipelines/singletenancy/cilium-overlay-ebpf/cilium-overlay-e2e-job-template.yaml	Same migration to k8s-e2e step template.
.pipelines/singletenancy/cilium-nodesubnet/cilium-nodesubnet-e2e-job-template.yaml	Same migration to k8s-e2e step template.
.pipelines/singletenancy/cilium-ebpf/cilium-e2e-job-template.yaml	Same migration to k8s-e2e step template.
.pipelines/singletenancy/cilium-dualstack-overlay/cilium-dualstackoverlay-e2e-job-template.yaml	Same migration to k8s-e2e step template.
.pipelines/singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-step-template.yaml	Linux steps simplified; Windows path now uses shared Windows overlay template.
.pipelines/singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-job-template.yaml	Uses step templates + Windows warning handler + collects logs on SucceededWithIssues.
.pipelines/singletenancy/azure-cni-overlay-stateless/azure-cni-overlay-stateless-e2e-step-template.yaml	Removes scenario-specific Windows stateless step template (replaced by shared Windows overlay template).
.pipelines/singletenancy/azure-cni-overlay-stateless/azure-cni-overlay-stateless-e2e-job-template.yaml	Switches to shared Windows overlay template + adds k8s-e2e steps + warning handler + log collection on SucceededWithIssues.
.pipelines/singletenancy/aks/e2e-job-template.yaml	Adds k8s-e2e step template usage; expands warning patterns; updates failure-log condition to include SucceededWithIssues.
.pipelines/singletenancy/aks-swift/e2e-job-template.yaml	Migrates to k8s-e2e step template.
.pipelines/pipeline.yaml	Adds additional Windows stages (noted as temporary in PR description) and updates cleanup config mapping.

Copilot · 2026-04-13T17:22:49Z

.pipelines/templates/windows-overlay-e2e-step-template.yaml

+        nodeList=`kubectl get node -owide | grep Windows | awk '{print $1}'`
+        for node in $nodeList; do
+            taint=`kubectl describe node $node | grep Taints | awk '{print $2}'`
+            if [ $taint == "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" ]; then


With an empty/whitespace taint value, [ $taint == ... ] errors (e.g., 'unary operator expected'). Because the test is an if condition, set -e will not abort the step, but the error is printed and the comparison is silently treated as false. Quote the variable and use a POSIX-safe comparison ([ "$taint" = "..." ]), or guard for empty/<none> before the comparison.

Suggested change

-            if [ $taint == "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" ]; then
+            if [ "$taint" = "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" ]; then
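
To illustrate why the quoted form is safer, here is a small sketch that wraps the comparison in a function. The name is_uninitialized is hypothetical and not part of the PR; the taint string is the one from the hunk above.

```shell
#!/usr/bin/env bash
# Taint string taken from the hunk above.
UNINITIALIZED_TAINT="node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule"

# is_uninitialized TAINT
# Hypothetical helper: succeeds only when TAINT equals the uninitialized taint.
# The quotes keep the test a valid three-argument expression even when TAINT
# is empty, and "=" is the POSIX string comparison operator.
is_uninitialized() {
  [ "$1" = "$UNINITIALIZED_TAINT" ]
}

# Example use inside the node loop:
#   if is_uninitialized "$taint"; then
#     echo "node $node is still uninitialized"
#   fi
```

With the quoted comparison, an empty value or a literal <none> simply compares as not equal instead of producing a malformed test.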

Copilot · 2026-04-13T17:22:50Z

.pipelines/pipeline.yaml

+    - template: singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-job-template.yaml
+      parameters:
+        name: "win_azure_overlay_e2e2"
+        displayName: Azure Overlay Windows 2
+        os: windows


PR description calls out these extra stages as 'for testing only' and says they 'MUST be reverted before merging'. As-is, they will permanently add more Windows clusters/stages to the PR pipeline (cost + runtime). Please remove these temporary stage additions from .pipelines/pipeline.yaml before merge, or gate them behind a non-default parameter/condition so they cannot run in normal PR builds.

Copilot · 2026-04-13T17:22:50Z

.pipelines/pipeline.yaml

+    - template: singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-job-template.yaml
+      parameters:
+        name: "win_azure_overlay_e2e3"
+        displayName: Azure Overlay Windows 3


PR description calls out these extra stages as 'for testing only' and says they 'MUST be reverted before merging'. As-is, they will permanently add more Windows clusters/stages to the PR pipeline (cost + runtime). Please remove these temporary stage additions from .pipelines/pipeline.yaml before merge, or gate them behind a non-default parameter/condition so they cannot run in normal PR builds.

Copilot · 2026-04-13T17:22:51Z

.pipelines/templates/windows-overlay-e2e-step-template.yaml

+        # Run control plane scale tests
+        export CNI_IMAGE_REPO=MCR
+        export CNS_IMAGE_REPO=MCR
+        sudo -E env "PATH=$PATH" make test-load SCALE_UP=32 OS_TYPE=windows CNI_TYPE=${{ parameters.cniType }} VALIDATE_STATEFILE=true INSTALL_CNS=true ${{ parameters.testLoadArgs }} CNS_VERSION=v1.7.9-0 CNI_VERSION=v1.7.9-0 CLEANUP=true


The PR description explicitly flags the hardcoded MCR repo and v1.7.9-0 version pins as temporary and says they 'MUST be reverted before merging'. Please switch these back to the normal version derivation (e.g., $(make cni-version)/$(make cns-version) or existing pipeline variables) and remove the hardcoded repo override so merges don’t silently test a fixed version unrelated to the PR.

Copilot · 2026-04-13T17:22:51Z

hack/scripts/patch-kubeclusterconfig.sh

+  for attempt in 1 2 3; do
+    echo "Attempt $attempt: Patching kubeclusterconfig.json on $pod"
+    if kubectl exec -n kube-system "$pod" -- powershell.exe -command \
+      'Get-Content "c:\k\kubeclusterconfig.json" -Raw | ConvertFrom-Json | % { $_.Cni.Name = "azure"; $_ } | ConvertTo-Json -Depth 20 | Set-Content "c:\k\kubeclusterconfig.json"'; then


In Windows PowerShell, Set-Content defaults to the system's active ANSI code page rather than UTF-8 (it is Out-File and the > redirection that default to UTF-16LE), which can unintentionally change file encoding and break consumers expecting UTF-8 (common for JSON). It's safer to specify an explicit encoding (e.g., UTF-8) when rewriting kubeclusterconfig.json (and potentially the .ps1) to avoid hard-to-diagnose downstream parsing issues.
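
The three-attempt loop in the hunk above could also be factored into a small reusable helper. This is only a sketch under assumptions: retry and patch_node are hypothetical names that do not appear in the PR, and the delay between attempts is an illustrative choice.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of the PR): runs COMMAND until it succeeds
# or MAX_ATTEMPTS is reached, sleeping DELAY_SECONDS between attempts.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    echo "Attempt $attempt of $max_attempts: $*" >&2
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  echo "Giving up after $max_attempts attempts: $*" >&2
  return 1
}

# Example (assumes a patch_node function that wraps the kubectl exec call):
#   retry 3 10 patch_node "$pod"
```

Keeping the retry inside the script means a transient kubectl exec failure does not depend on the whole job being rerun.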

Copilot · 2026-04-13T17:22:52Z

test/validate/windows_validate.go

 	}
-
-	return hnsNetworkResult, nil
+	// if unmarshal into array above fails, try umarshal into single object


Correct spelling in the comment: 'umarshal' -> 'unmarshal'.

Suggested change

-	// if unmarshal into array above fails, try umarshal into single object
+	// if unmarshal into array above fails, try unmarshal into single object

Copilot · 2026-04-13T17:22:52Z

.pipelines/templates/k8s-e2e-step-template.yaml

+        curl -L https://dl.k8s.io/$k8sVersion/kubernetes-test-linux-amd64.tar.gz -o ./kubernetes-test-linux-amd64.tar.gz
+
+        # https://github.com/kubernetes/sig-release/blob/master/release-engineering/artifacts.md#content-of-kubernetes-test-system-archtargz-on-example-of-kubernetes-test-linux-amd64targz-directories-removed-from-list
+        # explictly unzip and strip directories from ginkgo and e2e.test


Correct spelling in the comment: 'explictly' -> 'explicitly'.

Suggested change

-        # explictly unzip and strip directories from ginkgo and e2e.test
+        # explicitly unzip and strip directories from ginkgo and e2e.test

jpayne3506 · 2026-04-13T18:45:44Z

.pipelines/templates/windows-overlay-e2e-step-template.yaml

+        # Run control plane scale tests
+        export CNI_IMAGE_REPO=MCR
+        export CNS_IMAGE_REPO=MCR
+        sudo -E env "PATH=$PATH" make test-load SCALE_UP=32 OS_TYPE=windows CNI_TYPE=${{ parameters.cniType }} VALIDATE_STATEFILE=true INSTALL_CNS=true ${{ parameters.testLoadArgs }} CNS_VERSION=v1.7.9-0 CNI_VERSION=v1.7.9-0 CLEANUP=true


friendly comment to help remind to change

QxBytes added 20 commits April 9, 2026 02:47

initial draft

245145c

hardcode for testing

a90ac8f

increase time to download container image

5b1ef48

move upstream k8s to step template

f4d9688

add null characters to succeed with issues

6d74b4e

fix matching-- an empty directory now counts as the pattern not being found and exiting with an error

0038188

add whitelisted log for cni framework bug to only warn for cniv1 e2e

e023a7c

add literal \x00 in addition to the existing null character search pattern for all windows scenarios

6c934cf

duplicate stages for testing

c987c38

make cniv1 logs collect even during warning

dc1be42

increase timeout for windows again from 6m to 12m

b72daa8

try patching kubeclusterconfig cni name so cleanup script runs on byocni windows nodes

b8b1f77

print out matches

080b851

collect windows node reset logs

b3e1cdb

patch windows node reset so we see the logs

e9038e9

make windowsnodereset patch idempotent

f7572e8

add skip to basic tests k8s

16e2d22

QxBytes self-assigned this Apr 10, 2026

QxBytes added fix Fixes something. ci Infra or tooling. windows labels Apr 10, 2026

QxBytes added 3 commits April 11, 2026 01:00

collect more info on failure

9a9469a

add patch for restart scenario too

74cf7ce

improve collect info script

998498c

QxBytes marked this pull request as ready for review April 13, 2026 17:14

QxBytes requested a review from a team as a code owner April 13, 2026 17:14

QxBytes requested a review from pjohnst5 April 13, 2026 17:14

Copilot AI review requested due to automatic review settings April 13, 2026 17:14

Copilot AI reviewed Apr 13, 2026

Copilot started reviewing on behalf of QxBytes April 13, 2026 17:23 View session

jpayne3506 reviewed Apr 13, 2026

jpayne3506 approved these changes Apr 13, 2026

QxBytes wants to merge 23 commits into master from alew/consolidate-windows-tests

3 participants
