Skip to content

fix(Cubelet): cleanup host-mount dirs after sandbox destroy#333

Merged
kinwin-ustc merged 2 commits into
TencentCloud:masterfrom
luzhixing12345:fix-hostmount-cleanup
May 27, 2026
Merged

fix(Cubelet): cleanup host-mount dirs after sandbox destroy#333
kinwin-ustc merged 2 commits into
TencentCloud:masterfrom
luzhixing12345:fix-hostmount-cleanup

Conversation

@luzhixing12345

Copy link
Copy Markdown
Contributor

Currently, sandbox destruction does not call cleanupHostDirVolumes, which causes host-mount related resources to remain on the host after the sandbox is removed.

As a result:

  • directories under /data/cubelet/hostdir/ are not cleaned up
  • bind mount points created for host-mount remain mounted
  • sandbox-specific directories continue to accumulate over time

The leaked mount points can be observed with:

sudo nsenter -t "$(pgrep -f '/Cubelet/bin/cubelet' | head -n1)" -m -- findmnt -n -o TARGET,SOURCE,FSTYPE,OPTIONS 

Example leaked mounts after sandbox destruction:

│ ├─/data/cubelet/hostdir/9847ba94c3b345c99222b30bf1ae0e9d/rw/hostdir-0
│ │     /dev/sda2[/tmp/rw]
│ │            ext4        rw,relatime,errors=remount-ro,stripe=16
│ └─/data/cubelet/hostdir/9847ba94c3b345c99222b30bf1ae0e9d/ro/hostdir-1
│       /dev/sda2[/tmp/ro]
│              ext4        ro,relatime,errors=remount-ro,stripe=16

This PR adds host-dir volume cleanup during sandbox destruction.

Unmount host-mount bind targets under the sandbox hostdir path during
sandbox cleanup

Signed-off-by: luzhixing12345 <luzhixing12345@163.com>
@luzhixing12345 luzhixing12345 requested a review from fslongjin as a code owner May 21, 2026 11:11
@fslongjin

Copy link
Copy Markdown
Member

/cubebot review

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown

Claude finished @fslongjin's task in 1m 8s —— View job


Code Review: fix-hostmount-cleanup

Todo

  • Gather context and understand the request
  • Review changed files and provide feedback
  • Post review results

Summary

This PR addresses a resource leak where host-mount bind mounts and directories were not cleaned up after sandbox destruction. The fix wires cleanupHostDirVolumes into the destroy path and improves its implementation significantly.

Changes Overview

Cubelet/storage/hostdir.gocleanupHostDirVolumes rewritten:

  • Signature change: func(ctx, sandboxID string)func(ctx, info *StorageInfo) error. This allows it to be called from destroy() which already has the StorageInfo available.
  • Defensive guards: nil/empty checks on info, info.SandboxID, and info.HostDirBackendInfos at the top.
  • Mount-point detection: Uses utils.IsMountPoint() before unmounting, so only actual mount points are targeted.
  • unix.MNT_DETACH: Uses lazy unmount, which is safer than a forceful unmount when there may be active file handles.
  • Post-unmount cleanup: Calls os.RemoveAll(sandboxDir) after all mounts are detached, removing the now-empty sandbox directories.
  • filepath.SkipDir after unmount: Prevents walking into an already-emptied subtree.
  • Error propagation: Returns errors upward instead of just logging them, so the caller can react.

Cubelet/storage/local.godestroy method:

  • Added l.cleanupHostDirVolumes(ctx, info) call with error joining alongside the existing destroyCubeBoxTemplateBase and destroyDefaultMediumVolumes calls.

Observations

1. Dead code is now alive

The old cleanupHostDirVolumes(ctx context.Context, sandboxID string) was defined but never called anywhere in the codebase. This PR correctly wires it into the destroy flow, making it actually execute during sandbox teardown.

2. WalkDir traversal is correct for the directory layout

The expected hostdir structure is:

/data/cubelet/hostdir/<sandboxID>/rw/hostdir-0  (bind mount)
/data/cubelet/hostdir/<sandboxID>/ro/hostdir-1  (bind mount)

The WalkDir correctly unmounts leaf directories and skips sub-trees with filepath.SkipDir, then os.RemoveAll cleans up the remaining empty parent directories. This is sound.

3. Error handling is well-considered

  • Errors from cleanupHostDirVolumes are joined with other destroy errors via errors.Join, so partial failures won't mask other cleanup issues.
  • If cleanup fails, Destroy()'s deferred deleteBackendFileInfo is skipped (since it checks err == nil), which means the sandbox metadata persists. This is actually a good retry mechanism — on next restart or explicit CleanUp call, the leftover resources will be cleaned up.
  • Potential concern: The Metadata persistence on failure means the sandbox will continue to appear in listings until the on-disk cleanup succeeds. This is probably acceptable behavior, but worth noting.

4. Edge case: file bind mounts

The WalkDir callback skips non-directory entries (!d.IsDir()). If hostdir ever contains bind mounts of files (rather than directories), those won't be explicitly unmounted before os.RemoveAll. os.RemoveAll can fail on mounted file bind mounts. However, the existing prepareHostDirVolume only creates directory bind mounts, so this is not a current issue — just something to be aware of if the volume type is extended later.

5. No test coverage

There are no new tests for cleanupHostDirVolumes. While mocking mount operations is nontrivial (requires unix syscall mocking or a test harness), the PR would benefit from at least a unit test covering the early-return paths (nil info, empty sandbox ID, no HostDirBackendInfos, non-existent directory) and perhaps an integration-level test.


Suggestions

S1 — Return early from WalkDir when all known mounts are handled (optional optimization):
Since info.HostDirBackendInfos contains the exact set of known bind paths, you could track which ones have been unmounted and return early once all are handled, rather than walking the entire tree. This is a minor optimization and not critical.

S2 — Consider using unix.UMOUNT_NOFOLLOW flag (optional safety):
For symlink-safety during unmount, unix.UMOUNT_NOFOLLOW could be used on newer kernels. Again, minor and not blocking.

Verdict

The PR is well-structured and correct. It fixes a real resource leak, properly propagates errors, and integrates cleanly into the existing destroy flow. No blocking issues found.

@luzhixing12345

Copy link
Copy Markdown
Contributor Author

Hi, may I ask when this PR might be merged?

@cubesandboxbot

cubesandboxbot Bot commented May 27, 2026

Copy link
Copy Markdown

Claude finished @kinwin-ustc's task in 38s —— View job


Status Update: PR #333

The branch was updated by merging master (most notably pulling in PR #337fix(Cubelet): use correct Cloud Hypervisor disk API endpoint). The core changes from this PR are unchanged from the previous review.

Changes (unchanged)

File Change
Cubelet/storage/hostdir.go cleanupHostDirVolumes rewritten: signature changed from (ctx, sandboxID string) to (ctx, info *StorageInfo) error; adds mount-point detection via utils.IsMountPoint, lazy unmount (unix.MNT_DETACH), post-unmount os.RemoveAll cleanup, and proper error propagation
Cubelet/storage/local.go destroy method now calls l.cleanupHostDirVolumes(ctx, info) with error joining

Current state

  • Branch: up to date with master
  • PR fix: correct — closes the host-mount resource leak described in the PR description
  • Previous review verdict: no blocking issues found

Awaiting maintainer review and merge. 👍

@kinwin-ustc kinwin-ustc merged commit 5d27a12 into TencentCloud:master May 27, 2026
7 of 8 checks passed
fslongjin pushed a commit that referenced this pull request May 27, 2026
fix cublet ci compile error introduced by PR #333

storage/local.go:854:31: cannot use result.SandboxID (variable of type string) as *StorageInfo value in argument to l.cleanupHostDirVolumes
make: *** [Makefile:80: cubelet] Error 1
make[1]: *** [Makefile:96: builder-run] Error 2
make[1]: Leaving directory '/home/runner/work/CubeSandbox/CubeSandbox'
make: *** [Makefile:140: cubelet] Error 2

Signed-off-by: Feng Jin <ronyjin@tencent.com>
Co-authored-by: Feng Jin <ronyjin@tencent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants