Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 12 additions & 6 deletions bindata/network/ovn-kubernetes/common/008-script-lib.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -526,15 +526,21 @@ data:
cni-bin-copy

echo "I$(date "+%m%d %H:%M:%S.%N") - disable conntrack on geneve port"
iptables -t raw -A PREROUTING -p udp --dport {{.GenevePort}} -j NOTRACK
iptables -t raw -A OUTPUT -p udp --dport {{.GenevePort}} -j NOTRACK
ip6tables -t raw -A PREROUTING -p udp --dport {{.GenevePort}} -j NOTRACK
ip6tables -t raw -A OUTPUT -p udp --dport {{.GenevePort}} -j NOTRACK
iptables -t raw -D PREROUTING -p udp --dport {{.GenevePort}} -j NOTRACK 2>/dev/null || true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment explaining that this is to clean up old iptables rules and can eventually go away

iptables -t raw -D OUTPUT -p udp --dport {{.GenevePort}} -j NOTRACK 2>/dev/null || true
ip6tables -t raw -D PREROUTING -p udp --dport {{.GenevePort}} -j NOTRACK 2>/dev/null || true
ip6tables -t raw -D OUTPUT -p udp --dport {{.GenevePort}} -j NOTRACK 2>/dev/null || true
nft add table inet ovn_notrack

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment to the table, for debuggability

nft flush table inet ovn_notrack
nft 'add chain inet ovn_notrack prerouting { type filter hook prerouting priority raw; policy accept; }'
nft 'add chain inet ovn_notrack output { type filter hook output priority raw; policy accept; }'
nft add rule inet ovn_notrack prerouting udp dport {{.GenevePort}} notrack
nft add rule inet ovn_notrack output udp dport {{.GenevePort}} notrack

{{- if .OVNHybridOverlayVXLANPort}}
echo "I$(date "+%m%d %H:%M:%S.%N") - disable conntrack on hybrid overlay VXLAN port"
iptables -t raw -A PREROUTING -p udp --dport {{.OVNHybridOverlayVXLANPort}} -j NOTRACK
iptables -t raw -A OUTPUT -p udp --dport {{.OVNHybridOverlayVXLANPort}} -j NOTRACK
nft add rule inet ovn_notrack prerouting udp dport {{.OVNHybridOverlayVXLANPort}} notrack
nft add rule inet ovn_notrack output udp dport {{.OVNHybridOverlayVXLANPort}} notrack
Comment thread
coderabbitai[bot] marked this conversation as resolved.
{{- end}}

echo "I$(date "+%m%d %H:%M:%S.%N") - starting ovnkube-node"
Expand Down
45 changes: 19 additions & 26 deletions bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -486,7 +486,7 @@ spec:
- mountPath: /etc/systemd/system
name: systemd-units
readOnly: true
# for the iptables wrapper
# for the nftables wrapper

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was talking about the old iptables wrapper scripts to run the iptables binaries off the host filesystem, but we haven't used those since we stopped supporting RHEL 7... you should be able to just remove this mountpoint. (It looks like nothing else uses it.)

- mountPath: /host
name: host-slash
readOnly: true
Expand Down Expand Up @@ -597,41 +597,35 @@ spec:
export KUBECONFIG=/var/run/ovnkube-kubeconfig
{{ end }}

touch /var/run/ovn/add_iptables.sh
chmod 0755 /var/run/ovn/add_iptables.sh
cat <<'EOF' > /var/run/ovn/add_iptables.sh
touch /var/run/ovn/add_nft_icmp.sh
chmod 0755 /var/run/ovn/add_nft_icmp.sh
cat <<'EOF' > /var/run/ovn/add_nft_icmp.sh
#!/bin/sh
if [ -z "$3" ]
then
echo "Called with host address missing, ignore"
exit 0
fi
echo "Adding ICMP drop rule for '$3' "
if iptables -C CHECK_ICMP_SOURCE -p icmp -s $3 -j ICMP_ACTION
then
echo "iptables already set for $3"
else
iptables -A CHECK_ICMP_SOURCE -p icmp -s $3 -j ICMP_ACTION
fi
echo "Adding ICMP drop rule for '$3'"
nft add element inet azure_icmp icmp_sources "{ $3 }"
EOF

echo "I$(date "+%m%d %H:%M:%S.%N") - drop-icmp - start drop-icmp ${K8S_NODE}"
iptables -X CHECK_ICMP_SOURCE || true
iptables -N CHECK_ICMP_SOURCE || true
iptables -F CHECK_ICMP_SOURCE
iptables -D INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE || true
iptables -I INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE
iptables -N ICMP_ACTION || true
iptables -F ICMP_ACTION
iptables -A ICMP_ACTION -j LOG
iptables -A ICMP_ACTION -j DROP
iptables -D INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE 2>/dev/null || true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, add a comment explaining that this is cleanup and can eventually go away

iptables -F CHECK_ICMP_SOURCE 2>/dev/null || true
iptables -X CHECK_ICMP_SOURCE 2>/dev/null || true
iptables -F ICMP_ACTION 2>/dev/null || true
iptables -X ICMP_ACTION 2>/dev/null || true
nft add table inet azure_icmp

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be ip4 rather than inet since this is just (IPv4) ICMP

nft flush table inet azure_icmp
nft 'add set inet azure_icmp icmp_sources { type ipv4_addr; }'
nft 'add chain inet azure_icmp input { type filter hook input priority 0; policy accept; }'
nft add rule inet azure_icmp input icmp type destination-unreachable icmp code frag-needed ip saddr @icmp_sources counter log drop
Comment on lines +600 to +623

@coderabbitai coderabbitai Bot Jun 25, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Make the Azure nftables programming restart-safe.

azure_icmp is host-level state. On container restart, nft add table inet azure_icmp can fail because the previous table still exists, and duplicate host IP observations can make add_nft_icmp.sh return non-zero. That can terminate the long-running oc observe path.

Proposed fix
           echo "Adding ICMP drop rule for '$3'"
-          nft add element inet azure_icmp icmp_sources "{ $3 }"
+          nft add element inet azure_icmp icmp_sources "{ $3 }" 2>/dev/null || true
           EOF
 
           echo "I$(date "+%m%d %H:%M:%S.%N") - drop-icmp - start drop-icmp ${K8S_NODE}"
+          nft delete table inet azure_icmp 2>/dev/null || true
           nft add table inet azure_icmp

Also applies to: 623-623

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml` around lines 600 -
618, Make the Azure nftables setup restart-safe in the ovnkube-node init script
by handling existing host-level state gracefully. In the nft programming block
and the generated add_nft_icmp.sh helper, update the logic around nft add table
inet azure_icmp and nft add element inet azure_icmp icmp_sources so repeated
starts or duplicate IP observations do not return non-zero. Reuse the existing
add_nft_icmp.sh and azure_icmp setup symbols, and prefer idempotent checks or
ignoring “already exists” cases so the oc observe flow keeps running after
container restarts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nft add element is also idempotent for duplicate elements

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bpickard22 Good catch — you're right, the duplicate-element part of my comment doesn't apply here.

The remaining issue is just the restart path for:

nft add table inet azure_icmp

Since azure_icmp is host-level state, that can still fail if the table already exists from a previous container instance, and with set -xe the script would exit before reaching the flush.

#
ip addr show
ip route show
iptables -nvL
iptables -nvL -t nat
oc observe pods -n openshift-ovn-kubernetes --listen-addr='' -l app=ovnkube-node -a '{ .status.hostIP }' -- /var/run/ovn/add_iptables.sh
#systemd-run -qPG -- oc observe pods -n openshift-ovn-kubernetes --listen-addr='' -l app=ovnkube-node -a '{ .status.hostIP }' -- /var/run/ovn/add_iptables.sh
nft list table inet azure_icmp
Comment thread
coderabbitai[bot] marked this conversation as resolved.
oc observe pods -n openshift-ovn-kubernetes --listen-addr='' -l app=ovnkube-node -a '{ .status.hostIP }' -- /var/run/ovn/add_nft_icmp.sh
lifecycle:
preStop:
exec:
Expand All @@ -644,7 +638,6 @@ spec:
- mountPath: /etc/ovn/
name: etc-openvswitch
{{ end }}
# for the iptables wrapper
- mountPath: /host
name: host-slash
readOnly: true
Expand Down Expand Up @@ -673,7 +666,7 @@ spec:
- name: systemd-units
hostPath:
path: /etc/systemd/system
# used for iptables wrapper scripts
# used for nftables wrapper scripts
- name: host-slash
hostPath:
path: /
Expand Down
45 changes: 19 additions & 26 deletions bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -510,7 +510,7 @@ spec:
- mountPath: /etc/systemd/system
name: systemd-units
readOnly: true
# for the iptables wrapper
# for the nftables wrapper
- mountPath: /host
name: host-slash
readOnly: true
Expand Down Expand Up @@ -603,41 +603,35 @@ spec:
export KUBECONFIG=/etc/ovn/kubeconfig
{{ end }}

touch /var/run/ovn/add_iptables.sh
chmod 0755 /var/run/ovn/add_iptables.sh
cat <<'EOF' > /var/run/ovn/add_iptables.sh
touch /var/run/ovn/add_nft_icmp.sh
chmod 0755 /var/run/ovn/add_nft_icmp.sh
cat <<'EOF' > /var/run/ovn/add_nft_icmp.sh
#!/bin/sh
if [ -z "$3" ]
then
echo "Called with host address missing, ignore"
exit 0
fi
echo "Adding ICMP drop rule for '$3' "
if iptables -C CHECK_ICMP_SOURCE -p icmp -s $3 -j ICMP_ACTION
then
echo "iptables already set for $3"
else
iptables -A CHECK_ICMP_SOURCE -p icmp -s $3 -j ICMP_ACTION
fi
echo "Adding ICMP drop rule for '$3'"
nft add element inet azure_icmp icmp_sources "{ $3 }"
EOF

echo "I$(date "+%m%d %H:%M:%S.%N") - drop-icmp - start drop-icmp ${K8S_NODE}"
iptables -X CHECK_ICMP_SOURCE || true
iptables -N CHECK_ICMP_SOURCE || true
iptables -F CHECK_ICMP_SOURCE
iptables -D INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE || true
iptables -I INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE
iptables -N ICMP_ACTION || true
iptables -F ICMP_ACTION
iptables -A ICMP_ACTION -j LOG
iptables -A ICMP_ACTION -j DROP
iptables -D INPUT -p icmp --icmp-type fragmentation-needed -j CHECK_ICMP_SOURCE 2>/dev/null || true
iptables -F CHECK_ICMP_SOURCE 2>/dev/null || true
iptables -X CHECK_ICMP_SOURCE 2>/dev/null || true
iptables -F ICMP_ACTION 2>/dev/null || true
iptables -X ICMP_ACTION 2>/dev/null || true
nft add table inet azure_icmp
nft flush table inet azure_icmp
nft 'add set inet azure_icmp icmp_sources { type ipv4_addr; }'
nft 'add chain inet azure_icmp input { type filter hook input priority 0; policy accept; }'
nft add rule inet azure_icmp input icmp type destination-unreachable icmp code frag-needed ip saddr @icmp_sources counter log drop
Comment on lines +606 to +629

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Make the Azure nftables programming restart-safe.

azure_icmp persists in the host network namespace. A restarted container can fail on nft add table inet azure_icmp, and duplicate observed host IPs can make the helper return non-zero and stop oc observe.

Proposed fix
           echo "Adding ICMP drop rule for '$3'"
-          nft add element inet azure_icmp icmp_sources "{ $3 }"
+          nft add element inet azure_icmp icmp_sources "{ $3 }" 2>/dev/null || true
           EOF
 
           echo "I$(date "+%m%d %H:%M:%S.%N") - drop-icmp - start drop-icmp ${K8S_NODE}"
+          nft delete table inet azure_icmp 2>/dev/null || true
           nft add table inet azure_icmp

Also applies to: 629-629

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml` around lines
606 - 624, Make the OVN node ICMP nftables setup restart-safe in the
add_nft_icmp.sh bootstrap block: the existing `nft add table inet azure_icmp`
and related setup should tolerate a pre-existing `azure_icmp` table/objects on
container restart, and the helper should not exit non-zero when the same host IP
is observed more than once. Update the script to use idempotent nft commands or
explicit existence checks around the `azure_icmp` table/set/chain creation and
make the `nft add element inet azure_icmp icmp_sources` path ignore duplicate
elements so `oc observe` continues running.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nft add table is idempotent, and we run a flush after the add which will correctly handle the restart

#
ip addr show
ip route show
iptables -nvL
iptables -nvL -t nat
oc observe pods -n openshift-ovn-kubernetes --listen-addr='' -l app=ovnkube-node -a '{ .status.hostIP }' -- /var/run/ovn/add_iptables.sh
#systemd-run -qPG -- oc observe pods -n openshift-ovn-kubernetes --listen-addr='' -l app=ovnkube-node -a '{ .status.hostIP }' -- /var/run/ovn/add_iptables.sh
nft list table inet azure_icmp
Comment thread
coderabbitai[bot] marked this conversation as resolved.
oc observe pods -n openshift-ovn-kubernetes --listen-addr='' -l app=ovnkube-node -a '{ .status.hostIP }' -- /var/run/ovn/add_nft_icmp.sh
lifecycle:
preStop:
exec:
Expand All @@ -650,7 +644,6 @@ spec:
- mountPath: /etc/ovn/
name: etc-openvswitch
{{ end }}
# for the iptables wrapper
- mountPath: /host
name: host-slash
readOnly: true
Expand Down Expand Up @@ -679,7 +672,7 @@ spec:
- name: systemd-units
hostPath:
path: /etc/systemd/system
# used for iptables wrapper scripts
# used for nftables wrapper scripts
- name: host-slash
hostPath:
path: /
Expand Down
5 changes: 5 additions & 0 deletions pkg/util/k8s/kubeproxy.go
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,11 @@ func GenerateKubeProxyConfiguration(args map[string]operv1.ProxyArgumentList) (s
kpc.IPTables.SyncPeriod.Duration = ka.getDuration("iptables-sync-period")
kpc.IPTables.MinSyncPeriod.Duration = ka.getDuration("iptables-min-sync-period")

kpc.NFTables.MasqueradeBit = ka.getOptInt32("nftables-masquerade-bit")
kpc.NFTables.MasqueradeAll = ka.getBool("nftables-masquerade-all")
kpc.NFTables.SyncPeriod.Duration = ka.getDuration("nftables-sync-period")
kpc.NFTables.MinSyncPeriod.Duration = ka.getDuration("nftables-min-sync-period")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add handling for --conntrack-tcp-be-liberal too (with the other conntrack flags below)

also, we should pre-emptively adjust the validation for --nodeport-addresses to use getString rather than getCIDRList, because it will start supporting non-CIDR values soon. (Also, it looks like the code currently misspells that option as node-port-addresses with an extra hyphen, so you should update it to accept either that wrong form or nodeport-addresses.)


kpc.IPVS.SyncPeriod.Duration = ka.getDuration("ipvs-sync-period")
kpc.IPVS.MinSyncPeriod.Duration = ka.getDuration("ipvs-min-sync-period")
kpc.IPVS.Scheduler = ka.getString("ipvs-scheduler")
Expand Down
79 changes: 79 additions & 0 deletions pkg/util/k8s/kubeproxy_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -396,6 +396,85 @@ nodePortAddresses: null
oomScoreAdj: null
portRange: 1000+10
showHiddenMetricsForVersion: ""
winkernel:
enableDSR: false
forwardHealthCheckVip: false
networkName: ""
rootHnsEndpointName: ""
sourceVip: ""
`,
},
{
description: "nftables overrides",
overrides: map[string]operv1.ProxyArgumentList{
"proxy-mode": {"nftables"},
"nftables-masquerade-bit": {"14"},
"nftables-masquerade-all": {"true"},
"nftables-sync-period": {"30s"},
"nftables-min-sync-period": {"10s"},
},
output: `
apiVersion: kubeproxy.config.k8s.io/v1alpha1
bindAddress: 0.0.0.0
bindAddressHardFail: false
clientConnection:
acceptContentTypes: ""
burst: 0
contentType: ""
kubeconfig: ""
qps: 0
clusterCIDR: ""
configSyncPeriod: 0s
conntrack:
maxPerCore: null
min: null
tcpBeLiberal: false
tcpCloseWaitTimeout: null
tcpEstablishedTimeout: null
udpStreamTimeout: 0s
udpTimeout: 0s
detectLocal:
bridgeInterface: ""
interfaceNamePrefix: ""
detectLocalMode: ""
enableProfiling: false
healthzBindAddress: ""
hostnameOverride: ""
iptables:
localhostNodePorts: null
masqueradeAll: false
masqueradeBit: 0
minSyncPeriod: 0s
syncPeriod: 0s
ipvs:
excludeCIDRs: null
minSyncPeriod: 0s
scheduler: ""
strictARP: false
syncPeriod: 0s
tcpFinTimeout: 0s
tcpTimeout: 0s
udpTimeout: 0s
kind: KubeProxyConfiguration
logging:
flushFrequency: 0
options:
json:
infoBufferSize: "0"
text:
infoBufferSize: "0"
verbosity: 0
metricsBindAddress: 0.0.0.0:9102
mode: nftables
nftables:
masqueradeAll: true
masqueradeBit: 14
minSyncPeriod: 10s
syncPeriod: 30s
nodePortAddresses: null
oomScoreAdj: null
portRange: ""
showHiddenMetricsForVersion: ""
winkernel:
enableDSR: false
forwardHealthCheckVip: false
Expand Down