Conversation
… found and exiting with an error
…ttern for all windows scenarios
…ror whitelist in byocni windows since c:\k\kubeclusterconfig.json is set to none instead of azure, the windows cleanup script doesn't run. so if there is a populated azure-vnet.json, it doesn't clean up on restart. if the hns network is somehow deleted or never created, azure-vnet.exe thinks an hns network exists because it sees the network in the azure-vnet.json but when it tries to open it, it errors out-- this is unrecoverable
…cni windows nodes
when it was its own task, if it failed, it would never be retried (500 error) if it resides in windows overlay tests, if that test failed it would restart and hopefully retry setting the cni name to azure (after the restart node)
if there is only a single hns network, the bytes will give you an an hns network object, not an array of hns network objects, causing unmarshal to fail. to fix this, we try unmarshal into a single object if we cannot unmarshal into an array the second network is the "ext" network which I do not believe has impact on datapath-- pod -> node, pod -> pod connectivity are fine
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Mitigates CI flakiness (especially Windows) by restructuring ACN PR pipeline E2E execution into step templates, improving log capture/diagnostics, and adding a warning-based “known issues” handler.
Changes:
- Introduces reusable pipeline step templates (Windows overlay E2E + Kubernetes E2E) and migrates scenarios to single-job execution for easier retries.
- Adds warning handler jobs to allow known Windows log signatures to yield “SucceededWithIssues” instead of failing the stage.
- Improves diagnostics: collects additional Windows logs, prints grep matches, and gathers logs for non-running pods.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| test/validate/windows_validate.go | Makes HNS network JSON parsing robust to single-object vs array output; tightens validation messaging/logging. |
| hack/scripts/patch-kubeclusterconfig.sh | New script to patch Windows node config/reset scripts to improve cleanup and logging visibility. |
| hack/scripts/k8se2e-tests.sh | Adjusts ginkgo skip expression to incorporate additional skip patterns. |
| hack/scripts/collect-windows-logs.sh | Collects windowsnodereset.log in addition to existing state/log artifacts. |
| hack/scripts/collect-bad-pod-info.sh | New diagnostics script to dump events and container logs for non-running pods. |
| hack/scripts/check-cni-log-contents.sh | Prints matching file list and matching lines for better triage. |
| .pipelines/templates/windows-overlay-e2e-step-template.yaml | New consolidated Windows overlay E2E step template (control plane, datapath, restart validation, optional wireserver tests). |
| .pipelines/templates/warning-handler-job-template.yaml | Improves visibility by printing file/pattern parameters before scanning logs. |
| .pipelines/templates/log-template.yaml | Always prints all pods and runs the new “bad pod info” collector. |
| .pipelines/templates/k8s-e2e-step-template.yaml | New template to download k8s e2e binaries and run suites (plus Windows kube-proxy restart). |
| .pipelines/singletenancy/dualstack-overlay/dualstackoverlay-e2e-step-template.yaml | Collapses steps to a single Linux-only execution path (Windows now handled by shared Windows template). |
| .pipelines/singletenancy/dualstack-overlay/dualstackoverlay-e2e-job-template.yaml | Uses step templates + adds Windows warning handler + collects logs on SucceededWithIssues. |
| .pipelines/singletenancy/cilium/cilium-e2e-job-template.yaml | Migrates from k8s-e2e job template to new k8s-e2e step template within the scenario job. |
| .pipelines/singletenancy/cilium-overlay/cilium-overlay-e2e-job-template.yaml | Same migration to k8s-e2e step template. |
| .pipelines/singletenancy/cilium-overlay-withhubble/cilium-overlay-e2e-job-template.yaml | Same migration to k8s-e2e step template. |
| .pipelines/singletenancy/cilium-overlay-ebpf/cilium-overlay-e2e-job-template.yaml | Same migration to k8s-e2e step template. |
| .pipelines/singletenancy/cilium-nodesubnet/cilium-nodesubnet-e2e-job-template.yaml | Same migration to k8s-e2e step template. |
| .pipelines/singletenancy/cilium-ebpf/cilium-e2e-job-template.yaml | Same migration to k8s-e2e step template. |
| .pipelines/singletenancy/cilium-dualstack-overlay/cilium-dualstackoverlay-e2e-job-template.yaml | Same migration to k8s-e2e step template. |
| .pipelines/singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-step-template.yaml | Linux steps simplified; Windows path now uses shared Windows overlay template. |
| .pipelines/singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-job-template.yaml | Uses step templates + Windows warning handler + collects logs on SucceededWithIssues. |
| .pipelines/singletenancy/azure-cni-overlay-stateless/azure-cni-overlay-stateless-e2e-step-template.yaml | Removes scenario-specific Windows stateless step template (replaced by shared Windows overlay template). |
| .pipelines/singletenancy/azure-cni-overlay-stateless/azure-cni-overlay-stateless-e2e-job-template.yaml | Switches to shared Windows overlay template + adds k8s-e2e steps + warning handler + log collection on SucceededWithIssues. |
| .pipelines/singletenancy/aks/e2e-job-template.yaml | Adds k8s-e2e step template usage; expands warning patterns; updates failure-log condition to include SucceededWithIssues. |
| .pipelines/singletenancy/aks-swift/e2e-job-template.yaml | Migrates to k8s-e2e step template. |
| .pipelines/pipeline.yaml | Adds additional Windows stages (noted as temporary in PR description) and updates cleanup config mapping. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| nodeList=`kubectl get node -owide | grep Windows | awk '{print $1}'` | ||
| for node in $nodeList; do | ||
| taint=`kubectl describe node $node | grep Taints | awk '{print $2}'` | ||
| if [ $taint == "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" ]; then |
There was a problem hiding this comment.
With set -e, an empty/whitespace taint value will make [ $taint == ... ] error (e.g., 'unary operator expected') and fail the entire step. Quote the variable and use a POSIX-safe comparison ([ \"$taint\" = \"...\" ]), or guard for empty/<none> before the comparison.
| if [ $taint == "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" ]; then | |
| if [ "$taint" = "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" ]; then |
| - template: singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-job-template.yaml | ||
| parameters: | ||
| name: "win_azure_overlay_e2e2" | ||
| displayName: Azure Overlay Windows 2 | ||
| os: windows |
There was a problem hiding this comment.
PR description calls out these extra stages as 'for testing only' and says they 'MUST be reverted before merging'. As-is, they will permanently add more Windows clusters/stages to the PR pipeline (cost + runtime). Please remove these temporary stage additions from .pipelines/pipeline.yaml before merge, or gate them behind a non-default parameter/condition so they cannot run in normal PR builds.
| - template: singletenancy/azure-cni-overlay/azure-cni-overlay-e2e-job-template.yaml | ||
| parameters: | ||
| name: "win_azure_overlay_e2e3" | ||
| displayName: Azure Overlay Windows 3 |
There was a problem hiding this comment.
PR description calls out these extra stages as 'for testing only' and says they 'MUST be reverted before merging'. As-is, they will permanently add more Windows clusters/stages to the PR pipeline (cost + runtime). Please remove these temporary stage additions from .pipelines/pipeline.yaml before merge, or gate them behind a non-default parameter/condition so they cannot run in normal PR builds.
| # Run control plane scale tests | ||
| export CNI_IMAGE_REPO=MCR | ||
| export CNS_IMAGE_REPO=MCR | ||
| sudo -E env "PATH=$PATH" make test-load SCALE_UP=32 OS_TYPE=windows CNI_TYPE=${{ parameters.cniType }} VALIDATE_STATEFILE=true INSTALL_CNS=true ${{ parameters.testLoadArgs }} CNS_VERSION=v1.7.9-0 CNI_VERSION=v1.7.9-0 CLEANUP=true |
There was a problem hiding this comment.
The PR description explicitly flags the hardcoded MCR repo and v1.7.9-0 version pins as temporary and says they 'MUST be reverted before merging'. Please switch these back to the normal version derivation (e.g., $(make cni-version)/$(make cns-version) or existing pipeline variables) and remove the hardcoded repo override so merges don’t silently test a fixed version unrelated to the PR.
| for attempt in 1 2 3; do | ||
| echo "Attempt $attempt: Patching kubeclusterconfig.json on $pod" | ||
| if kubectl exec -n kube-system "$pod" -- powershell.exe -command \ | ||
| 'Get-Content "c:\k\kubeclusterconfig.json" -Raw | ConvertFrom-Json | % { $_.Cni.Name = "azure"; $_ } | ConvertTo-Json -Depth 20 | Set-Content "c:\k\kubeclusterconfig.json"'; then |
There was a problem hiding this comment.
PowerShell Set-Content defaults to UTF-16LE in Windows PowerShell, which can unintentionally change file encoding and break consumers expecting UTF-8 (common for JSON). It’s safer to specify an explicit encoding (e.g., UTF-8) when rewriting kubeclusterconfig.json (and potentially the .ps1) to avoid hard-to-diagnose downstream parsing issues.
| } | ||
|
|
||
| return hnsNetworkResult, nil | ||
| // if unmarshal into array above fails, try umarshal into single object |
There was a problem hiding this comment.
Correct spelling in the comment: 'umarshal' -> 'unmarshal'.
| // if unmarshal into array above fails, try umarshal into single object | |
| // if unmarshal into array above fails, try unmarshal into single object |
| curl -L https://dl.k8s.io/$k8sVersion/kubernetes-test-linux-amd64.tar.gz -o ./kubernetes-test-linux-amd64.tar.gz | ||
|
|
||
| # https://github.com/kubernetes/sig-release/blob/master/release-engineering/artifacts.md#content-of-kubernetes-test-system-archtargz-on-example-of-kubernetes-test-linux-amd64targz-directories-removed-from-list | ||
| # explictly unzip and strip directories from ginkgo and e2e.test |
There was a problem hiding this comment.
Correct spelling in the comment: 'explictly' -> 'explicitly'.
| # explictly unzip and strip directories from ginkgo and e2e.test | |
| # explicitly unzip and strip directories from ginkgo and e2e.test |
| # Run control plane scale tests | ||
| export CNI_IMAGE_REPO=MCR | ||
| export CNS_IMAGE_REPO=MCR | ||
| sudo -E env "PATH=$PATH" make test-load SCALE_UP=32 OS_TYPE=windows CNI_TYPE=${{ parameters.cniType }} VALIDATE_STATEFILE=true INSTALL_CNS=true ${{ parameters.testLoadArgs }} CNS_VERSION=v1.7.9-0 CNI_VERSION=v1.7.9-0 CLEANUP=true |
There was a problem hiding this comment.
friendly comment to help remind to change
Reason for Change:
Mitigates flakiness for windows and other test stages in the acn pr pipeline.
Ignore
All changes should only affect acn pr, but I have run load test/release test without any obvious issues.
Main Changes
Mitigating Windows BYOCNI Scenarios
In the new windows overlay template, I aim to keep all existing tests but wanted to unify windows overlay into one template since previously there were 3 separate templates. Stateless step template was removed because it was windows-only. The others remain because they have linux counterparts. I have added two new parts after investigating the issues with windows byocni nodes.
Issue Fixed:
See above
Requirements:
Notes: