Skip to content

ci: mitigate windows scenario flakiness#4345

Open
QxBytes wants to merge 23 commits intomasterfrom
alew/consolidate-windows-tests
Open

ci: mitigate windows scenario flakiness#4345
QxBytes wants to merge 23 commits intomasterfrom
alew/consolidate-windows-tests

Conversation

@QxBytes
Copy link
Copy Markdown
Contributor

@QxBytes QxBytes commented Apr 10, 2026

Reason for Change:

Mitigates flakiness for windows and other test stages in the acn pr pipeline.

Ignore

  • Extra stages created in pipeline.yaml -- these are for testing only
  • Hardcoded 1.7.9-0 and MCR in .pipelines/templates/windows-overlay-e2e-step-template.yaml
  • These MUST be reverted before merging though

All changes should only affect acn pr, but I have run load test/release test without any obvious issues.

Main Changes

  • Moves k8s e2e job completely to a step template. Affects only the acn pr pipeline. The template intends to mimic the existing behavior completely and multiple test runs show that it runs without issue. Affects windows and linux, all scenarios.
  • Moves testing for each scenario to a single job, so it can be retried as a whole. I am aware this could increase the time for a test, but the benefit is that it is much easier to continueOnError and check a single job for a warning than multiple, which brings me to my next point
  • Add e2e warning handler for the main windows tests-- dualstack-overlay, overlay, and stateless (3 total). This adds a job after the main testing to check if the logs mention any of the known issues in windows, such as null characters in the statefile (if windows bsod), the network not being found (occurs when the statefile has an hns network but it was somehow deleted from the vm). These issues have only occurred in the pipelines and are basically noise. Windows has already said byo cni for windows is not supported so these issues are unlikely to be fixed. If these log lines are detected, the stage is marked with a warning but still passes
  • Adds exec: already started to the ubuntu cniv1 linux whitelist error messages-- this appears to be a bug with the cni framework-- exec.Run should never be called twice but the upstream cni framework code seems to make that a possible outcome making the test results unreliable if it occurs.
  • The check cni logs script now prints out the matched line, and failure logs will be collected even if the main testing stage succeeded with issues. Previously it only ran if the main testing job failed.
  • Modifies dualstack test that checks hns network to accept either an HNSNetwork struct or an array of HNSNetwork structs. Previously, if only one HNSNetwork was returned, there would be an error unmarshalling into the array. The second network is the ext hns network which is not necessary for container networking (just a qol dummy network).
  • Adds a script to collect logs from bad pods if they are not running for all sidecars and init containers, and prints out the status of all pods on failure.

Mitigating Windows BYOCNI Scenarios
In the new windows overlay template, I aim to keep all existing tests but wanted to unify windows overlay into one template since previously there were 3 separate templates. Stateless step template was removed because it was windows-only. The others remain because they have linux counterparts. I have added two new parts after investigating the issues with windows byocni nodes.

  • Windows byo cni nodes do not seem to clean up the azure-vnet.json statefile when I restart the nodes
  • Usually, windowsnodereset.ps1 will be called and then the cleanupnetwork.ps1 is called to remove stale statefiles like in linux
  • However, cleanupnetwork.ps1 only runs the deletion if the cni name is set to "azure" in kubeclusterconfig.json. For byocni nodes, that value is set to "none"
  • First change is to before tests run, patch the kubeclusterconfig.json on windows nodes to use plugin name "azure"
  • Second change is to modify windowsnodereset.ps1 to use *>> instead of >> so the logs from the cleanupnetwork.ps1 script are visible in windowsnodereset.log-- otherwise we have no visibility into those logs
  • I have also added collection of the windowsnodereset.log log to the failure logs collection step
  • Witnessed the following for host networking pods during create which we determined were unrelated to our components (containerd?): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: hcs::CreateComputeSystem b9257d8579494625ee895a32e4be77da62991d162a858e44bcd29c116ff510b9: The parameter is incorrect.: unknown

Issue Fixed:

See above

Requirements:

Notes:

QxBytes added 20 commits April 9, 2026 02:47
…ror whitelist

in byocni windows since c:\k\kubeclusterconfig.json is set to none instead of azure,
the windows cleanup script doesn't run. so if there is a populated azure-vnet.json, it
doesn't clean up on restart. if the hns network is somehow deleted or never created,
azure-vnet.exe thinks an hns network exists because it sees the network in the azure-vnet.json
but when it tries to open it, it errors out-- this is unrecoverable
when it was its own task, if it failed, it would never be retried (500 error)
if it resides in windows overlay tests, if that test failed it would restart
and hopefully retry setting the cni name to azure (after the restart node)
if there is only a single hns network, the bytes will give you an an
hns network object, not an array of hns network objects, causing unmarshal
to fail. to fix this, we try unmarshal into a single object if we cannot
unmarshal into an array

the second network is the "ext" network which I do not believe has impact
on datapath-- pod -> node, pod -> pod connectivity are fine
@QxBytes QxBytes self-assigned this Apr 10, 2026
@QxBytes QxBytes added fix Fixes something. ci Infra or tooling. windows labels Apr 10, 2026
@QxBytes QxBytes marked this pull request as ready for review April 13, 2026 17:14
@QxBytes QxBytes requested a review from a team as a code owner April 13, 2026 17:14
@QxBytes QxBytes requested a review from pjohnst5 April 13, 2026 17:14
Copilot AI review requested due to automatic review settings April 13, 2026 17:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Mitigates CI flakiness (especially Windows) by restructuring ACN PR pipeline E2E execution into step templates, improving log capture/diagnostics, and adding a warning-based “known issues” handler.

Changes:

  • Introduces reusable pipeline step templates (Windows overlay E2E + Kubernetes E2E) and migrates scenarios to single-job execution for easier retries.
  • Adds warning handler jobs to allow known Windows log signatures to yield “SucceededWithIssues” instead of failing the stage.
  • Improves diagnostics: collects additional Windows logs, prints grep matches, and gathers logs for non-running pods.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
test/validate/windows_validate.go Makes HNS network JSON parsing robust to single-object vs array output; tightens validation messaging/logging.
hack/scripts/patch-kubeclusterconfig.sh New script to patch Windows node config/reset scripts to improve cleanup and logging visibility.
hack/scripts/k8se2e-tests.sh Adjusts ginkgo skip expression to incorporate additional skip patterns.
hack/scripts/collect-windows-logs.sh Collects windowsnodereset.log in addition to existing state/log artifacts.
hack/scripts/collect-bad-pod-info.sh New diagnostics script to dump events and container logs for non-running pods.
hack/scripts/check-cni-log-contents.sh Prints matching file list and matching lines for better triage.
.pipelines/templates/windows-overlay-e2e-step-template.yaml New consolidated Windows overlay E2E step template (control plane, datapath, restart validation, optional wireserver tests).
.pipelines/templates/warning-handler-job-template.yaml Improves visibility by printing file/pattern parameters before scanning logs.
.pipelines/templates/log-template.yaml Always prints all pods and runs the new “bad pod info” collector.
.pipelines/templates/k8s-e2e-step-template.yaml New template to download k8s e2e binaries and run suites (plus Windows kube-proxy restart).
.pipelines/singletenancy/dualstack-overlay/dualstackoverlay-e2e-step-template.yaml Collapses steps to a single Linux-only execution path (Windows now handled by shared Windows template).
.pipelines/singletenancy/dualstack-overlay/dualstackoverlay-e2e-job-template.yaml Uses step templates + adds Windows warning handler + collects logs on SucceededWithIssues.
.pipelines/singletenancy/cilium/cilium-e2e-job-template.yaml Migrates from k8s-e2e job template to new k8s-e2e step template within the scenario job.
.pipelines/singletenancy/cilium-overlay/cilium-overlay-e2e-job-template.yaml Same migration to k8s-e2e step template.
.pipelines/singletenancy/cilium-overlay-withhubble/cilium-overlay-e2e-job-template.yaml Same migration to k8s-e2e step template.
.pipelines/singletenancy/cilium-overlay-ebpf/cilium-overlay-e2e-job-template.yaml Same migration to k8s-e2e step template.
.pipelines/singletenancy/cilium-nodesubnet/cilium-nodesubnet-e2e-job-template.yaml Same migration to k8s-e2e step template.
.pipelines/singletenancy/cilium-ebpf/cilium-e2e-job-template.yaml Same migration to k8s-e2e step template.
.pipelines/singletenancy/cilium-dualstack-overlay/cilium-dualstackoverlay-e2e-job-template.yaml Same migration to k8s-e2e step template.
.pipelines/singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-step-template.yaml Linux steps simplified; Windows path now uses shared Windows overlay template.
.pipelines/singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-job-template.yaml Uses step templates + Windows warning handler + collects logs on SucceededWithIssues.
.pipelines/singletenancy/azure-cni-overlay-stateless/azure-cni-overlay-stateless-e2e-step-template.yaml Removes scenario-specific Windows stateless step template (replaced by shared Windows overlay template).
.pipelines/singletenancy/azure-cni-overlay-stateless/azure-cni-overlay-stateless-e2e-job-template.yaml Switches to shared Windows overlay template + adds k8s-e2e steps + warning handler + log collection on SucceededWithIssues.
.pipelines/singletenancy/aks/e2e-job-template.yaml Adds k8s-e2e step template usage; expands warning patterns; updates failure-log condition to include SucceededWithIssues.
.pipelines/singletenancy/aks-swift/e2e-job-template.yaml Migrates to k8s-e2e step template.
.pipelines/pipeline.yaml Adds additional Windows stages (noted as temporary in PR description) and updates cleanup config mapping.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

nodeList=`kubectl get node -owide | grep Windows | awk '{print $1}'`
for node in $nodeList; do
taint=`kubectl describe node $node | grep Taints | awk '{print $2}'`
if [ $taint == "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" ]; then
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With set -e, an empty/whitespace taint value will make [ $taint == ... ] error (e.g., 'unary operator expected') and fail the entire step. Quote the variable and use a POSIX-safe comparison ([ \"$taint\" = \"...\" ]), or guard for empty/<none> before the comparison.

Suggested change
if [ $taint == "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" ]; then
if [ "$taint" = "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" ]; then

Copilot uses AI. Check for mistakes.
Comment on lines +419 to +423
- template: singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-job-template.yaml
parameters:
name: "win_azure_overlay_e2e2"
displayName: Azure Overlay Windows 2
os: windows
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description calls out these extra stages as 'for testing only' and says they 'MUST be reverted before merging'. As-is, they will permanently add more Windows clusters/stages to the PR pipeline (cost + runtime). Please remove these temporary stage additions from .pipelines/pipeline.yaml before merge, or gate them behind a non-default parameter/condition so they cannot run in normal PR builds.

Copilot uses AI. Check for mistakes.
Comment on lines +431 to +434
- template: singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-job-template.yaml
parameters:
name: "win_azure_overlay_e2e3"
displayName: Azure Overlay Windows 3
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description calls out these extra stages as 'for testing only' and says they 'MUST be reverted before merging'. As-is, they will permanently add more Windows clusters/stages to the PR pipeline (cost + runtime). Please remove these temporary stage additions from .pipelines/pipeline.yaml before merge, or gate them behind a non-default parameter/condition so they cannot run in normal PR builds.

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +73
# Run control plane scale tests
export CNI_IMAGE_REPO=MCR
export CNS_IMAGE_REPO=MCR
sudo -E env "PATH=$PATH" make test-load SCALE_UP=32 OS_TYPE=windows CNI_TYPE=${{ parameters.cniType }} VALIDATE_STATEFILE=true INSTALL_CNS=true ${{ parameters.testLoadArgs }} CNS_VERSION=v1.7.9-0 CNI_VERSION=v1.7.9-0 CLEANUP=true
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description explicitly flags the hardcoded MCR repo and v1.7.9-0 version pins as temporary and says they 'MUST be reverted before merging'. Please switch these back to the normal version derivation (e.g., $(make cni-version)/$(make cns-version) or existing pipeline variables) and remove the hardcoded repo override so merges don’t silently test a fixed version unrelated to the PR.

Copilot uses AI. Check for mistakes.
for attempt in 1 2 3; do
echo "Attempt $attempt: Patching kubeclusterconfig.json on $pod"
if kubectl exec -n kube-system "$pod" -- powershell.exe -command \
'Get-Content "c:\k\kubeclusterconfig.json" -Raw | ConvertFrom-Json | % { $_.Cni.Name = "azure"; $_ } | ConvertTo-Json -Depth 20 | Set-Content "c:\k\kubeclusterconfig.json"'; then
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PowerShell Set-Content defaults to UTF-16LE in Windows PowerShell, which can unintentionally change file encoding and break consumers expecting UTF-8 (common for JSON). It’s safer to specify an explicit encoding (e.g., UTF-8) when rewriting kubeclusterconfig.json (and potentially the .ps1) to avoid hard-to-diagnose downstream parsing issues.

Copilot uses AI. Check for mistakes.
}

return hnsNetworkResult, nil
// if unmarshal into array above fails, try umarshal into single object
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct spelling in the comment: 'umarshal' -> 'unmarshal'.

Suggested change
// if unmarshal into array above fails, try umarshal into single object
// if unmarshal into array above fails, try unmarshal into single object

Copilot uses AI. Check for mistakes.
curl -L https://dl.k8s.io/$k8sVersion/kubernetes-test-linux-amd64.tar.gz -o ./kubernetes-test-linux-amd64.tar.gz

# https://github.com/kubernetes/sig-release/blob/master/release-engineering/artifacts.md#content-of-kubernetes-test-system-archtargz-on-example-of-kubernetes-test-linux-amd64targz-directories-removed-from-list
# explictly unzip and strip directories from ginkgo and e2e.test
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct spelling in the comment: 'explictly' -> 'explicitly'.

Suggested change
# explictly unzip and strip directories from ginkgo and e2e.test
# explicitly unzip and strip directories from ginkgo and e2e.test

Copilot uses AI. Check for mistakes.
# Run control plane scale tests
export CNI_IMAGE_REPO=MCR
export CNS_IMAGE_REPO=MCR
sudo -E env "PATH=$PATH" make test-load SCALE_UP=32 OS_TYPE=windows CNI_TYPE=${{ parameters.cniType }} VALIDATE_STATEFILE=true INSTALL_CNS=true ${{ parameters.testLoadArgs }} CNS_VERSION=v1.7.9-0 CNI_VERSION=v1.7.9-0 CLEANUP=true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

friendly comment to help remind to change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Infra or tooling. fix Fixes something. windows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants