Skip to content

fix: add Contains() check to prevent out-of-CIDR IP allocation on CID…#324

Closed
woshihoujinxin wants to merge 5 commits into
TencentCloud:masterfrom
woshihoujinxin:fix/cube-dev-cidr-validation
Closed

fix: add Contains() check to prevent out-of-CIDR IP allocation on CID…#324
woshihoujinxin wants to merge 5 commits into
TencentCloud:masterfrom
woshihoujinxin:fix/cube-dev-cidr-validation

Conversation

@woshihoujinxin
Copy link
Copy Markdown

…R change

When the CIDR in Cubelet config is changed and network-agent restarts, several code paths silently accepted old-CIDR IPs via Assign() (which ignores out-of-range IPs), causing sandboxes to receive IPs from the wrong network segment.

Changes:

  • ipam: add Contains() to check if IP falls within current CIDR range
  • recover(): destroy kernel taps with out-of-range IPs instead of pooling them; skip and clean up sandboxes when reconcileState fails
  • reconcileState(): reject sandbox with IP outside current CIDR
  • dequeueTapLocked(): defensively destroy out-of-range taps from pool
  • plugin_tap: sync cube-dev IP when CIDR changes, add LinkSetUp/MTU and ensureRouteToCubeDev after IP reconciliation
  • probe: improve timeout error messages with elapsed time and hints
  • Add unit tests covering reconcileState CIDR guard, recover out-of-range tap destruction, dequeueTapLocked defense, and end-to-end IP allocation

@kinwin-ustc
Copy link
Copy Markdown
Collaborator

/cubebot review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 20, 2026

Claude finished @kinwin-ustc's task in 2m 56s —— View job


Code Review for PR #324

Branch: fix/cube-dev-cidr-validationmaster


Overview

This PR addresses the problem where changing the CIDR in Cubelet config and restarting the network-agent would cause sandboxes to receive IPs from the wrong network segment. The fix adds multi-layered defense across the IPAM, recovery, reconciliation, and tap lifecycle paths.


Files Changed

File Change
network-agent/internal/service/ipam.go +10
network-agent/internal/service/local_service.go +31/-4
network-agent/internal/service/tap_lifecycle.go +20/-5
network-agent/internal/service/cidr_change_test.go +638
network-agent/internal/service/ipam_test.go +29
Cubelet/network/plugin_tap.go +99
Cubelet/network/plugin_tap_cubedev_test.go +54
Cubelet/services/cubebox/probe.go +38/-8
Cubelet/services/cubebox/probe_describe_test.go +81

Strengths

1. Multi-layered defense — The fix doesn't rely on a single guard. It operates at three independent layers:

  • recover() — Destroys kernel taps with out-of-CIDR IPs at startup instead of pooling them
  • reconcileState() — Rejects sandboxes with IPs outside the current CIDR range
  • dequeueTapLocked() — Defensively destroys out-of-range taps that somehow reached the pool

This means even if one layer is bypassed, others will catch the problem.

2. Thorough test coverage — The test suite in cidr_change_test.go (638 lines) covers all core scenarios:

  • Out-of-CIDR sandbox rejection in reconcileState
  • In-CIDR acceptance
  • Invalid IP rejection
  • Recovery destruction of out-of-range taps
  • Persisted state cleanup after CIDR change
  • Pool defense in dequeueTapLocked
  • End-to-end IP allocation from new CIDR

3. Cubelet-side cube-dev reconciliationgetOrNewCubeDev now detects IP mismatches and reconciles them, plus ensures LinkSetUp and MTU. This is important because the cube-dev device could have stale state from a previous CIDR configuration.

4. Improved diagnostic messages — The probe timeout error messages now include elapsed time and hints to check cube-dev/CIDR configuration, which will significantly reduce debugging time.


Issues Found

Minor — Potential IP leak on reconcileState failure during recovery

In local_service.go:recover(), when a tap is in the new CIDR range and Assign() is called (line 487), but then reconcileState() fails (line 506 or 521), the cleanup code destroys the tap and deletes the persisted state but does not call allocator.Release() on the IP:

// local_service.go:506-512
if err := s.reconcileState(context.Background(), managed); err != nil {
    s.clearPortMappings(restoredTap)
    _ = cubevsDelTAPDevice(uint32(restoredTap.Index), restoredTap.IP.To4())
    _ = destroyTapFunc(restoredTap.Index)
    _ = s.store.Delete(managed.SandboxID)
    // ⚠️ s.allocator.Release(restoredTap.IP) is NOT called
    continue
}

The same pattern is repeated at lines 521-526. Since Assign() was called at line 487, the IP remains marked as used in the allocator's bitmap. However:

  • Recovery only happens at startup, so the leak is bounded to one event per restart
  • The allocator is not persisted, so a full restart of network-agent would get a fresh bitmap
  • Impact is at most 1-2 IPs lost in the pool per startup

Recommendation: Add s.allocator.Release(restoredTap.IP) to both cleanup blocks in recover() for correctness.

Fix this →

Nit — Duplicated ensureRouteToCubeDev

The ensureRouteToCubeDev function is duplicated between network-agent/internal/service/local_service.go and Cubelet/network/plugin_tap.go. These are separate processes with separate build targets, so the duplication is intentional and acceptable, but consider whether a shared utility package would be appropriate for future maintenance.


Summary

The PR is well-structured with a clear goal and comprehensive test coverage. The multi-layered approach to the CIDR change problem is robust. The one minor issue (IP leak in recovery failure path) is low impact but worth fixing for correctness.

Recommendation: Approve with the minor fix noted above.


@chenhengqi
Copy link
Copy Markdown
Collaborator

When the CIDR in Cubelet config is changed and network-agent restarts,
several code paths silently accepted old-CIDR IPs via Assign() (which
ignores out-of-range IPs), causing sandboxes to receive IPs from the
wrong network segment.

This kind of change should be done on installation.

Copy link
Copy Markdown
Collaborator

@chenhengqi chenhengqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make this CIDR change on installation script instead.

@woshihoujinxin
Copy link
Copy Markdown
Author

I think we should make this CIDR change on installation script instead.

Could you explain that in more detail?

@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 23, 2026

Claude finished @woshihoujinxin's task in 1m 17s —— View job


Updated Review for PR #324

Todo

  • Gather context: Read changed files and understand the current diff
  • Check git log for new commits since last review
  • Analyze changes and provide updated review feedback
  • Final summary

After reviewing the full diff and new commits since the previous review, here is my updated analysis.

What Changed Since the Previous Review

Four new commits were added:

Commit Description
e16cfe1 Release IP in recover() failure paths to prevent IP leak
858e98a Tests verifying IP release on reconcileState failure
f1d6d41 Cleanup TAP devices in down-local.sh to prevent restart race
caa39e6 Support CIDR configuration via environment variables in one-click

~ Issues from Previous Review are Addressed ~

The IP leak issue identified in the previous review (missing s.allocator.Release() in the reconcileState failure cleanup blocks in recover()) has been fixed. Both cleanup paths at local_service.go:506-512 and local_service.go:521-527 now call s.allocator.Release(restoredTap.IP) before continuing. Tests in cidr_change_test.go:365-497 validate this behavior.

~ The Install Script Path (chenhengqi's Request) ~

The commits f1d6d41 and caa39e6 add the installation-script approach that @chenhengqi requested:

  • env.example: Exposes CUBE_SANDBOX_CIDR (default 192.168.0.0/18) alongside deny-out CIDRs
  • install.sh:generate_cubelet_config_cidr(): Uses sed to replace the __CUBE_SANDBOX_CIDR__ placeholder in config.toml at install time
  • test-cidr-replace.sh: Tests for the placeholder replacement logic (default values, custom CIDR, all custom values)
  • down-local.sh: Cleans up kernel TAP devices matching z192.* pattern on teardown to prevent stale tap conflicts
  • Both config.toml and cubemaster.yaml placeholder configs are updated

This means CIDR configuration is now handled at install time (via env vars -> config template substitution), and the runtime code in network-agent/Cubelet provides defense-in-depth in case the CIDR still changes at runtime.

Remaining Observations

Minor indentation inconsistency in local_service.go recover()

The formatting of the recover() function around lines 506-544 has inconsistent indentation. The if state, ok := statesByTapName[restoredTap.Name]; ok { block body (lines 506-517) is not additionally indented, and the if inCubeVS && restoredTap.InUse { block at line 520 is less indented than the surrounding code. This won't affect correctness (Go doesn't care), but gofmt would fix it.

dequeueTapLocked doesn't release IP from allocator

In tap_lifecycle.go:67-72, when dequeueTapLocked() encounters an out-of-range tap and destroys it, it calls clearPortMappings, cubevsDelTAPDevice, and destroyTapFunc, but does NOT call s.allocator.Release(tap.IP). This is a minor issue since these taps were never assigned via the allocator (they entered the pool from the old CIDR), so the allocator bitmap doesn't track them. Still, for consistency with the recover() cleanup path, adding a release call would be harmless documentation of intent.

Cubelet-side ensureRouteToCubeDev -- good defensive addition

The ensureRouteToCubeDev function in plugin_tap.go:290-338 ensures a link-scoped route exists after IP reconciliation. This is important because removing and re-adding the IP on cube-dev can drop the kernel's auto-generated connected route. The function deduplicates correctly by checking Dst and LinkIndex.

Summary

The PR has matured well since the first review. The IP leak is fixed, the install-script CIDR configuration path addresses the reviewer's concern, and the defensive code in network-agent/Cubelet provides multi-layered protection. The minor consistency issues noted above don't block approval.

@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 23, 2026

Claude finished @woshihoujinxin's task in 2m 51s —— View job


Updated Review for PR #324

Branch: fix/cube-dev-cidr-validationmaster


Summary of Changes Since Previous Review

Six commits comprise this PR. The previous review identified an IP leak bug and chenhengqi requested that CIDR changes be handled at installation time. Both concerns have been addressed:

Commit Description
2da3a70 Core fix: Contains() check, CIDR guards in recover/reconcileState/dequeueTapLocked
e16cfe1 Fix: release IP in recover() failure paths (fixes the IP leak)
858e98a Tests verifying IP release on reconcileState failure
f1d6d41 Cleanup TAP devices in down-local.sh to prevent restart race
caa39e6 CIDR configuration via environment variables in one-click
9029b80 CIDR and tap_init_num configurable via env vars in network-agent/Cubelet

What Was Fixed

IP leak — The first cleanup block at local_service.go:521-529 and the second at local_service.go:537-543 both now call s.allocator.Release(restoredTap.IP) after reconcileState() fails. The test TestRecoverSkipsSandboxWhenReconcileFailsForCIDRChange (line 365) validates this by checking usedIPNum == 3 after recovery.

Installation-script CIDR — Added in deploy/one-click/:

  • env.example exposes CUBE_SANDBOX_CIDR and CUBE_SANDBOX_TAP_INIT_NUM as environment variables (defaults 192.168.0.0/18 and 500)
  • common.sh: render_config_placeholders() replaces __CUBE_SANDBOX_CIDR__ in config templates at startup
  • install.sh: generate_cubelet_config_cidr() replaces the placeholder at install time
  • down-local.sh: Cleans up kernel TAP devices matching z<IP> pattern to prevent stale device conflicts
  • test-cidr-replace.sh: 148 lines of tests for placeholder replacement logic

Code Quality Assessment

Strengths

1. Multi-layered defense — Three independent layers catch out-of-CIDR IPs:

  • recover() (local_service.go:492-501) — Destroys kernel taps with out-of-CIDR IPs at startup
  • reconcileState() (local_service.go:370-372) — Rejects sandboxes with IPs outside the current CIDR
  • dequeueTapLocked() (tap_lifecycle.go:65-74) — Defensively destroys out-of-range taps from the pool

2. Test quality — The test file cidr_change_test.go (648 lines) covers:

  • Out-of-CIDR sandbox rejection in reconcileState
  • In-CIDR acceptance and invalid IP rejection
  • Recovery destruction of out-of-range taps
  • Persisted state cleanup after CIDR change
  • IP release verification on reconcileState failure
  • Pool defense in dequeueTapLocked
  • End-to-end IP allocation from new CIDR

3. Cubelet-side cube-dev reconciliationgetOrNewCubeDev (plugin_tap.go:130-241) detects IP mismatches, removes stale addresses, and configures the correct one. Also ensures LinkSetUp and MTU are correct after reconciliation. ensureRouteToCubeDev (plugin_tap.go:257-285) restores the link-scoped route that can be dropped when addresses are removed and re-added.

4. Improved diagnostics — Probe timeout error messages now include elapsed time and probe description, and hint at checking cube-dev/CIDR configuration.

Minor Observations

1. Indentation inconsistency in local_service.go:recover()

The code around lines 507-551 has indentation that doesn't follow Go conventions. The if state, ok := ... block (line 510) should be at the same indentation level as the rest of the for loop body, but subsequent blocks (lines 521, 534, 535) appear at varying indentation levels. gofmt would flag this. However, since Go doesn't enforce indentation for correctness, this is cosmetic.

2. dequeueTapLocked doesn't release IP from allocator

In tap_lifecycle.go:70-72, when an out-of-range tap is destroyed from the pool, there is no s.allocator.Release(tap.IP) call. This is acceptable because these taps entered the pool under the old CIDR and the allocator's bitmap was initialized fresh for the new CIDR — the IP was never marked as used. However, adding an explicit s.allocator.Release() for consistency with the recover() cleanup path would make the intent clearer.

3. ensureRouteToCubeDev duplication

ensureRouteToCubeDev exists in both network-agent/internal/service/local_service.go and Cubelet/network/plugin_tap.go. These are separate processes with separate build targets, so the duplication is intentional, but a shared utility package could reduce maintenance burden over time.


Conclusion

The PR has matured well since its initial review. The IP leak is fixed, the installation-script CIDR path addresses chenhengqi's concern, and the defensive code provides robust multi-layered protection against stale CIDR IPs. The minor observations above do not block approval.

@woshihoujinxin
Copy link
Copy Markdown
Author

Local install

CUBE_SANDBOX_CIDR=192.168.100.0/24 CUBE_SANDBOX_TAP_INIT_NUM=100 CUBE_PVM_ENABLE=1 bash /root/hjx/cubesandbox-deploy/online-install-local.sh --local-file=/root/hjx/CubeSandbox/deploy/one-click/dist/cube-sandbox-one-click-9029b80.tar.gz

and if this pr is accept and we can set cidr and tap number by this command

curl -sL https://github.com/tencentcloud/CubeSandbox/raw/master/deploy/one-click/online-install.sh \
  | CUBE_SANDBOX_CIDR=192.168.100.0/24 CUBE_SANDBOX_TAP_INIT_NUM=100 CUBE_PVM_ENABLE=1 bash

Prestop: true
cube_box_req_template: >-
{"volumes":[{"name":"tmp","volume_source":{"empty_dir":{"medium":0}}}],"containers":[{"name":"cubebox-default","envs":[{"key":"TZ","value":"Asia/Shanghai"},{"key":"TERM","value":"xterm"}],"volume_mounts":[{"name":"tmp","container_path":"/"}],"security_context":{"privileged":true,"readonly_rootfs":false,"no_new_privs":false}}],"network_type":"tap","cubevs_context":{"allowInternetAccess":true,"denyOut":["10.0.0.0/8","100.64.0.0/10","172.16.0.0/12","192.168.0.0/18"]}}
{"volumes":[{"name":"tmp","volume_source":{"empty_dir":{"medium":0}}}],"containers":[{"name":"cubebox-default","envs":[{"key":"TZ","value":"Asia/Shanghai"},{"key":"TERM","value":"xterm"}],"volume_mounts":[{"name":"tmp","container_path":"/"}],"security_context":{"privileged":true,"readonly_rootfs":false,"no_new_privs":false}}],"network_type":"tap","cubevs_context":{"allowInternetAccess":true,"denyOut":["__CUBE_SANDBOX_DENY_OUT_PRIVATE_A__","__CUBE_SANDBOX_DENY_OUT_CGN__","__CUBE_SANDBOX_DENY_OUT_PRIVATE_B__","__CUBE_SANDBOX_DENY_OUT_SANDBOX_CIDR__"]}}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change here is unnecessary. These CIDRs are hard-coded in CubeVS:

var alwaysDeniedSandboxCIDRs = []string{
"10.0.0.0/8",
"127.0.0.0/8",
"169.254.0.0/16",
"172.16.0.0/12",
"192.168.0.0/16",
}

Comment thread Cubelet/network/plugin_tap.go Outdated
Comment on lines +139 to +140
// configured CIDR. When the operator changes the CIDR between restarts,
// the old IP on the dummy device no longer matches the new gateway IP.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this should not happen in production. Please drop it.

Copy link
Copy Markdown
Collaborator

@chenhengqi chenhengqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow our contribution guide and reorganize your commits properly.

@woshihoujinxin woshihoujinxin force-pushed the fix/cube-dev-cidr-validation branch from 9029b80 to 22fbb05 Compare May 27, 2026 06:12
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 27, 2026

Claude finished @woshihoujinxin's task in 2m 14s —— View job


Review: PR #324 — CIDR validation fix

Overview

This PR adds multi-layered defense against out-of-CIDR IP allocation when the Cubelet CIDR is changed and network-agent restarts. Six commits span 19 files with +1315/-29 changes across the network-agent, Cubelet, and deploy/one-click components.

Current Commits

Commit Description
a0fba84 Core fix: Contains() check, CIDR guards in recover/reconcileState/dequeueTapLocked
6b72fa7 Fix: release IP in recover() failure paths
72b2811 Tests verifying IP release on reconcileState failure
89bbc38 Cleanup TAP devices in down-local.sh
a0336d2 CIDR config via env vars in one-click scripts
22fbb05 CIDR/tap_init_num configurable via env vars

Previous Review Feedback

chenhengqi's 4 comments:

  1. CHANGES_REQUESTED — "make CIDR change on installation script instead" → Partially addressed. The one-click deploy scripts (install.sh, common.sh, env.example) now support CIDR configuration via CUBE_SANDBOX_CIDR and friends. However, the runtime code path in the network-agent that detects and handles CIDR changes is still present. The two approaches are complementary: install-time config prevents the problem, runtime defense catches unexpected divergence. This seems like a valid defense-in-depth approach.

  2. COMMENTED on cubemaster.yaml:43 — "These CIDRs are hard-coded in CubeVS" → Looking at CubeNet/cubevs/netpolicy.go, the alwaysDeniedSandboxCIDRs list includes 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. However, the config was using 192.168.0.0/18 (not /16), and 100.64.0.0/10 (CGN) is not in the CubeVS hardcoded list. The placeholder approach adds flexibility. That said, chenhengqi's point has some merit — if the intent is that these values should always match CubeVS defaults, the placeholders add complexity without much benefit for most deployments.

  3. COMMENTED on plugin_tap.go:140 — "No, this should not happen in production. Please drop it." → Lines 138-185 in getOrNewCubeDev add runtime IP reconciliation for the cube-dev dummy interface: detecting stale addresses, removing them, and replacing with the correct one. chenhengqi's argument is that CIDR should be set at install time and never change at runtime, making this code dead weight. This comment has not been addressed — the reconciliation logic is still present. The author should either:

    • Drop the runtime IP reconciliation if install-time-only CIDR is the agreed approach, or
    • Explain the production scenario where runtime CIDR change can legitimately occur
  4. CHANGES_REQUESTED — "Please follow our contribution guide and reorganize your commits properly" → The contribution guide calls for:

    • One component per commit
    • Atomic commits
    • Logical ordering

    Commit 22fbb05 (feat: make CIDR and tap_init_num configurable via environment variables) touches 7 files across 3 components (network-agent, one-click deploy, and Cubelet config). The deploy changes could be in a separate commit. This request has not been addressed.

Code Quality Assessment

Strengths:

  • Robust multi-layer defense — CIDR validation at three independent layers (recover, reconcileState, dequeueTapLocked) ensures no stale-IP tap slips through.
  • Good test coveragecidr_change_test.go (648 lines) covers the core scenarios thoroughly, including IP release verification after reconcileState failure.
  • Defensive tap lifecycledequeueTapLocked now validates taps against the current CIDR before returning them, catching race conditions or maintenance-loop edge cases.
  • Improved diagnostics — Probe timeout messages now include elapsed time, probe description, and hints to check cube-dev/CIDR configuration (~line 124-131 in probe.go).

Remaining issues from previous review:

  • plugin_tap.go IP reconciliation (line 138-185) — chenhengqi explicitly asked for this to be dropped. This is an unresolved change request that should be discussed.
  • Commit organization — Commit 22fbb05 mixes network-agent, deploy, and Cubelet config changes. The contribution guide asks for separate commits per component.
  • cubemaster.yaml placeholders — Replacing hardcoded CIDRs with placeholders may be unnecessary since CubeVS already has alwaysDeniedSandboxCIDRs hardcoded.

Minor observations:

  • The local_service.go:recover() function has indentation inconsistencies (lines 521, 534, 535 appear at varying indentation levels). Running gofmt would clean this up.
  • dequeueTapLocked destroys the out-of-range tap but doesn't call s.allocator.Release(). This is correct because the tap's IP was never assigned in the allocator (the allocator is CIDR-scoped), but adding a release call for consistency with the recover() path would make the intent clearer.

HouJinxin added 5 commits May 27, 2026 17:20
…DR change

When the CIDR in Cubelet config changes, several code paths silently
accepted old-CIDR IPs via Assign(), causing sandboxes to receive IPs
from the wrong network segment.

- ipam: add Contains() to check if IP falls within current CIDR range
- recover(): destroy kernel taps with out-of-range IPs
- reconcileState(): reject sandbox with IP outside current CIDR
- dequeueTapLocked(): defensively destroy out-of-range taps from pool
- Wire CIDR and tap_init_num via environment variables
Add elapsed time and actionable hints to timeout error messages so
operators can quickly distinguish slow starts from configuration
issues.
…ariables

Allow operators to override CIDR and tap pool size through environment
variables in one-click deployment scripts.

- Add CUBE_CIDR and CUBE_TAP_INIT_NUM to env.example
- Propagate env vars through install.sh, common.sh, up.sh
- Cleanup TAP devices in down-local.sh to prevent restart race
- Add test-cidr-replace.sh for CIDR swap validation
- Update Cubelet config.toml and cubemaster.yaml defaults
Synchronize the OpenAPI spec with the actual CubeAPI endpoints,
request/response schemas, and error codes.
@woshihoujinxin woshihoujinxin force-pushed the fix/cube-dev-cidr-validation branch from 22fbb05 to 6de433c Compare May 27, 2026 09:37
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 27, 2026

Claude finished @woshihoujinxin's task in 2m 51s —— View job


Review: PR #324 — CIDR validation fix

Todo Status

  • Gather context: Read all changed files and understand the current diff
  • Check git log for commits since last review
  • Analyze changes and provide updated review feedback
  • Final summary

Current Commits (5 PR-specific)

Commit Component Description
183c125 network-agent Core fix: Contains() check, CIDR guards in recover/reconcileState/dequeueTapLocked, IP leak fix
50c0b39 cubelet Improved probe timeout error messages with elapsed time and hints
f5187de deploy CIDR and tap_init_num support via environment variables in one-click scripts
55fd99e openapi API documentation update
6de433c cubelet go.sum checksum update for containerd v2.2.2

Note on plugin_tap.go — Chenhengqi's Request

The current branch has no changes to Cubelet/network/plugin_tap.go compared to origin/master. The getOrNewCubeDev function is identical between branches. If the IP reconciliation code that chenhengqi asked to drop (No, this should not happen in production. Please drop it.) was present in an earlier iteration, it has since been removed from the current diff. This is a resolved concern.

Chenhengqi's Remaining Review Comments

1. CHANGES_REQUESTED: "Make CIDR change on installation script instead"
Partially addressed. The one-click deploy scripts (install.sh, env.example, common.sh) now support CIDR via CUBE_SANDBOX_CIDR and CUBE_SANDBOX_TAP_INIT_NUM environment variables. The runtime code in network-agent provides defense-in-depth for cases where config and runtime CIDR diverge (e.g., manual config edits, partial upgrades). Both approaches coexist, which is a valid strategy.

2. COMMENTED: "cubemaster.yaml CIDRs are hard-coded in CubeVS"
Looking at CubeNet/cubevs/netpolicy.go:15-21, the alwaysDeniedSandboxCIDRs list includes:

  • 10.0.0.0/8
  • 127.0.0.0/8
  • 169.254.0.0/16
  • 172.16.0.0/12
  • 192.168.0.0/16

Chenhengqi's point has merit — for the common RFC1918 sandbox CIDR, the CubeVS hardcoded deny list already covers it (e.g., 192.168.0.0/16 covers any /18 sandbox CIDR). However, the placeholders provide two benefits:

  • CGN space (100.64.0.0/10) is not in the CubeVS hardcoded list — this is genuinely new
  • The __CUBE_SANDBOX_DENY_OUT_SANDBOX_CIDR__ placeholder ensures sandbox-to-sandbox isolation is precisely scoped to the actual sandbox CIDR, even if it's a subset of what CubeVS always blocks

The template approach doesn't cause harm and adds flexibility for non-standard CIDR deployments. I'd recommend the author add a comment to cubemaster.yaml or install.sh noting that these are additional deny CIDRs on top of CubeVS's hardcoded always-denied networks.

3. CHANGES_REQUESTED: "Reorganize commits properly"
The 5 PR-specific commits are already well-organized by component:

  • network-agent: prefix (commit 1)
  • cubelet: prefix (commit 2)
  • deploy: prefix (commit 3)
  • openapi: prefix (commit 4)
  • cubelet: go.sum fix (commit 5)

Each commit touches a single component. This satisfies the contribution guide's requirement for one component per commit.

Code Quality Assessment

Strengths:

  • Robust multi-layer defense — CIDR validation at three independent layers: recover() (destroy out-of-range taps at startup), reconcileState() (reject out-of-CIDR sandbox IPs), and dequeueTapLocked() (defensive pool check). Even if one layer is bypassed, others catch the problem.
  • Thorough test coveragecidr_change_test.go (648 lines) covers out-of-CIDR rejections, in-CIDR acceptance, recovery destruction, IP release validation, pool defense, and end-to-end allocation. The ipam_test.go (29 lines) covers Contains() edge cases (nil, IPv6, boundary values).
  • IP leak fixed — Both cleanup paths in recover() now call s.allocator.Release() after reconcileState() fails (lines 527 and 542). Tests validate usedIPNum == 3 after recovery.
  • Improved diagnostics — Probe timeout messages now include elapsed time, probe description, and hints to check cube-dev/CIDR configuration.
  • TAP cleanup on shutdowndown-local.sh cleans up kernel TAP devices matching z<IP> pattern, preventing stale device conflicts on restart.
  • Configurable CIDR at deploy timeinstall.sh:generate_cubelet_config_cidr() and common.sh:render_config_placeholders() properly substitute __CUBE_SANDBOX_CIDR__ placeholders.

Remaining Observations

1. Indentation inconsistency in local_service.go:recover() (lines 521-560)
The recover() function has indentation that doesn't follow Go conventions. The if state, ok := ... block at line 510 starts at the for-loop body level, but then the if err := s.reconcileState(...) at line 521, the device, inCubeVS := ... at line 534, and the if inCubeVS && restoredTap.InUse at line 535 all appear at varying indentation levels. gofmt would flag this. While Go doesn't enforce indentation for correctness, running gofmt -w on this file would clean it up.

2. dequeueTapLocked doesn't call s.allocator.Release() (tap_lifecycle.go:65-74)
When dequeueTapLocked() destroys an out-of-range tap, it calls clearPortMappings, cubevsDelTAPDevice, and destroyTapFunc, but does not call s.allocator.Release(tap.IP). This is acceptable because the allocator is CIDR-scoped and was initialized fresh after the CIDR change — the old IP was never marked as used. For consistency with the recover() cleanup path, adding a release call would document the intent.

3. ensureRouteToCubeDev is duplicated (local_service.go vs plugin_tap.go)
ensureRouteToCubeDev exists in both network-agent/internal/service/local_service.go and Cubelet/network/plugin_tap.go. These are separate processes with separate build targets, so the duplication is intentional, but a shared utility package (e.g., under pkg/) could reduce maintenance burden over time.

Conclusion

The PR has matured well. The IP leak is fixed, deploy-time CIDR configuration is supported, and the runtime code provides robust multi-layered protection against stale CIDR IPs. The remaining observations are minor and do not block approval.

@woshihoujinxin
Copy link
Copy Markdown
Author

Closing this PR to resubmit as separate, component-focused PRs per review feedback and CONTRIBUTING.md guidelines. See:

  1. fix(network-agent): CIDR IP allocation fix
  2. feat(deploy): CIDR env var support
  3. fix(cubelet): probe timeout improvements
  4. docs: openapi update

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants