[internal] Add rawfile support#157

Draft
duckhawk wants to merge 7 commits into main from voleynikov-rawfile-test

Conversation


@duckhawk duckhawk commented Dec 23, 2025

Description

This PR adds support for RawFile volumes — local file-based volumes mounted as loop devices. They provide a simpler alternative to LVM for local storage when LVM is not available or not desired.

Key changes:

  1. New rawfile package (images/sds-local-volume-csi/pkg/rawfile/) — handles raw-file/loop-device management:

    • Create/delete raw files (regular or sparse)
    • Attach/detach loop devices (losetup)
    • Online resize (file + loop device + filesystem)
    • List/inspect local volumes; per-volume locking against concurrent operations
  2. CSI driver updates (images/sds-local-volume-csi/driver/):

    • CreateVolume: validates topology and pins each volume to a single node taken from accessibility_requirements (no implicit fallback to the controller's host)
    • NodeStageVolume: fetches the PV before file creation, persists .reclaim and .sparse on-disk markers, adds the storage.deckhouse.io/rawfile-pv-protection finalizer for Delete policy, then creates the file and attaches the loop device
    • NodeUnstageVolume / NodeExpandVolume: detach + clean unmount; online resize
    • DeleteVolume: defers actual deletion to the node-side cleanup goroutine (driven by on-disk markers + finalizer)
  3. LocalStorageClass CRD extension (api/v1alpha1/, crds/):

    • New rawFile spec with sparse (immutable) and nodes (mutable)
    • CEL XValidation for mutual exclusivity with lvm, immutability of backend switching and rawFile.sparse, mutability of rawFile.nodes
  4. Controller updates (images/controller/):

    • StorageClass parameters for the RawFile type (type=rawfile, rawfile-sparse, rawfile-nodes)
    • rawFile.nodes propagated to StorageClass.AllowedTopologies (key topology.sds-local-volume-csi/node) so both the scheduler (WFFC) and external-provisioner (Immediate) honor the constraint
    • Drift detection in hasSCDiff for rawFile.sparse and rawFile.nodes triggers SC recreation
  5. Webhook validation (images/webhooks/):

    • Separate validation for RawFile and LVM configurations
    • Rejects an empty backend, empty or duplicate node names, and missing Node objects (fails open on transient API errors)
  6. Cleanup mechanism (node-side):

    • On-disk .reclaim / .sparse markers stored next to each raw file
    • Background goroutine in the CSI node pod deletes orphaned RawFile volumes (respects grace period; honors ReclaimPolicy: Retain via the marker)
    • Finalizer-based PV protection (storage.deckhouse.io/rawfile-pv-protection)
  7. ModuleConfig (openapi/config-values.yaml):

    • New rawFileDefaultDataDir parameter (default: /var/lib/sds-local-volume/rawfile)
    • Wired through the CSI node DaemonSet as RAWFILE_DEFAULT_DATA_DIR and a Bidirectional-propagated hostPath mount
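The per-volume locking mentioned in the rawfile package above could look roughly like this reference-counted mutex map — a sketch only; `volumeLock` and `lockEntry` are illustrative names, not the PR's actual types:

```go
package main

import (
	"fmt"
	"sync"
)

// volumeLock serializes operations per volume ID while letting distinct
// volumes proceed in parallel. Entries are reference-counted so the map
// does not grow without bound (illustrative, not the PR's actual code).
type volumeLock struct {
	mu    sync.Mutex
	locks map[string]*lockEntry
}

type lockEntry struct {
	mu   sync.Mutex
	refs int
}

func newVolumeLock() *volumeLock {
	return &volumeLock{locks: make(map[string]*lockEntry)}
}

// Lock acquires the per-volume mutex, creating the entry on demand.
func (v *volumeLock) Lock(volumeID string) {
	v.mu.Lock()
	e, ok := v.locks[volumeID]
	if !ok {
		e = &lockEntry{}
		v.locks[volumeID] = e
	}
	e.refs++
	v.mu.Unlock()
	e.mu.Lock()
}

// Unlock releases the per-volume mutex and drops the map entry once no
// goroutine holds or waits on it.
func (v *volumeLock) Unlock(volumeID string) {
	v.mu.Lock()
	e := v.locks[volumeID]
	e.refs--
	if e.refs == 0 {
		delete(v.locks, volumeID)
	}
	v.mu.Unlock()
	e.mu.Unlock()
}

func main() {
	vl := newVolumeLock()
	vl.Lock("pvc-1")
	vl.Unlock("pvc-1")
	fmt.Println(len(vl.locks)) // entry released after last unlock
}
```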

Why do we need it, and what problem does it solve?

Problem. Some environments cannot use LVM due to:

  • Lack of available block devices for LVM
  • Administrative restrictions on LVM usage
  • Need for simpler storage provisioning without LVM overhead
  • Testing and development environments where LVM setup is impractical

Solution. RawFile volumes provide:

  • File-based storage backed by loop devices on top of any local filesystem
  • No LVM dependency
  • Simple setup — just specify a directory path (per-node optional via rawFile.nodes)
  • Full CSI compatibility (mount, expand, delete)
  • Online resize without pod restart

What is the expected result?

After applying these changes:

  1. New LocalStorageClass with RawFile:
apiVersion: storage.deckhouse.io/v1alpha1
kind: LocalStorageClass
metadata:
  name: rawfile-storage
spec:
  rawFile:
    sparse: false
    # Optional. If set, restricts provisioning to these nodes via
    # StorageClass.AllowedTopologies.
    # nodes:
    #   - name: node-1
    #   - name: node-2
  reclaimPolicy: Delete
  volumeBindingMode: WaitForFirstConsumer
  2. PVC creation works:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-rawfile-pvc
spec:
  storageClassName: rawfile-storage
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  3. Pod scheduling works — no volume node affinity conflict errors (single-node pinning via WFFC + topology label).
  4. Online resize works — edit PVC size, the filesystem expands without pod restart.
  5. Delete reclaim — files are cleaned up by the node-side cleanup loop after PV finalizer is removed.
  6. Retain reclaim — files are preserved (.reclaim=Retain marker).

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

@duckhawk duckhawk self-assigned this Dec 23, 2025
@duckhawk duckhawk changed the title [internal] test rawfile support [internal] Add rawfile support Dec 23, 2025
@duckhawk duckhawk force-pushed the voleynikov-rawfile-test branch 6 times, most recently from 3af4977 to 6319e91 on March 10, 2026 14:21
@duckhawk duckhawk force-pushed the voleynikov-rawfile-test branch from 50f7865 to e2cbef1 on April 20, 2026 08:35
duckhawk added a commit that referenced this pull request Apr 20, 2026
This change implements the remaining review items from PR #157 and
aligns the driver with the synced ADR.

driver:

- NodeStageVolume now fetches the PV BEFORE creating disk.img and
  fails with codes.Unavailable on a missing PV for a brand-new
  volume so kubelet retries cleanly. The reclaim and sparse markers
  (.reclaim, .sparse) are persisted on disk before the backing file
  is created, ensuring the orphan-cleanup goroutine and
  NodeExpandVolume can rely on them even when apiserver is
  unreachable. Re-stage of an existing volume is idempotent and
  back-fills missing markers for older volumes.
- NodeStageVolume rolls back a freshly-created backing file when
  losetup, mkfs, mount or online resize fails, so retries do not
  reuse a half-formatted image. Pre-existing files are left intact.
- NodeExpandVolume resolves the sparse flag from PV.VolumeAttributes
  with the on-disk .sparse marker as the fallback and fails closed
  with codes.FailedPrecondition when neither source is available.
- CreateVolume no longer falls back to the controller's hostID when
  no topology is provided; HA controllers MUST receive topology via
  StorageClass.allowedTopologies (Immediate) or scheduler Preferred
  (WFFC). The rawFile.nodes filter is kept as defence-in-depth.
- Removed obsolete EnsureDataDir call from controller startup; the
  data directory is created lazily on the node side.

rawfile:

- New SetSparse / GetSparse helpers persist the sparse flag in a
  .sparse marker via atomic writeMarker / readMarker helpers (also
  used for .reclaim).
- createRawFile now Chmod()s the file after fallocate so umask does
  not weaken the intended 0600 mode (parity with the truncate
  branch).
- FindLoopDevice / FindAllLoopDevices guard against an empty
  losetup output with ErrLoopDeviceNotFound; parsing is extracted
  into parseLosetupOutput for unit-testability.
- Added unit tests covering: parseLosetupOutput edge cases,
  volumeLock mutual exclusion, distinct-volume parallelism and
  reference-counted entry release, .reclaim and .sparse marker
  round-trips, and DeleteVolume marker cleanup.

driver/cleanup_test.go (new):

- Covers listRawFilePVs (driver/type filtering), cleanupOrphanedVolume
  (Retain keeps file, Delete removes file, missing marker is
  conservative), processRawFilePVs (grace period, aged Delete vs
  aged Retain orphans, live Retain PV is left alone) and small
  helpers.

webhooks/lscValidator:

- validateRawFile now checks that every node in spec.rawFile.nodes
  exists in the cluster via a node Get loop. Local validation
  (empty / duplicate names) runs first to short-circuit. The
  existence check fails open on transient apiserver errors so
  webhook outages do not block LSC edits, and is fully mocked in
  unit tests.

internal:

- Removed the unused RawFileReclaimPolicyKey constant.

pkg/utils/func.go:

- Re-grouped imports to satisfy gci ordering.

Signed-off-by: v.oleynikov <vasily.oleynikov@flant.com>
@duckhawk duckhawk force-pushed the voleynikov-rawfile-test branch from abf605b to 60079da on April 20, 2026 10:14
Introduce RawFile as an alternative to LVM — volumes backed by regular
files mounted as loop devices.

Components: API/CRD, CSI driver (controller + node), module controller,
scheduler extender, webhooks, Helm templates, documentation (EN/RU).

Key implementation details:
- Per-volume locking in rawfile.Manager (no global mutex contention)
- PV finalizer management via Patch with conflict retry
- Node filtering: LSC rawFile.nodes propagated through SC params and
  enforced in CreateVolume topology filtering
- Robust DeleteVolume: checks PV attrs, local file, LLV existence
  before choosing RawFile vs LVM path
- ValidateVolumeID for path traversal prevention
- Sparse and pre-allocated file creation (fallocate syscall → binary
  → goAllocate fallback chain)
- Cleanup goroutine for orphaned volumes and finalizer-protected PVs
Signed-off-by: v.oleynikov <vasily.oleynikov@flant.com>
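The `ValidateVolumeID` path-traversal guard mentioned in the commit above could be sketched like this — illustrative only; the driver's actual checks and signature may differ:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validateVolumeID rejects volume IDs that could escape the data
// directory when joined into a path. A volume ID must be a single,
// already-clean path component: no separators, no "." / "..".
func validateVolumeID(volumeID string) error {
	if volumeID == "" {
		return fmt.Errorf("empty volume ID")
	}
	if strings.ContainsAny(volumeID, "/\\") ||
		volumeID != filepath.Clean(volumeID) ||
		volumeID == "." || volumeID == ".." {
		return fmt.Errorf("invalid volume ID %q", volumeID)
	}
	return nil
}

func main() {
	fmt.Println(validateVolumeID("pvc-123")) // <nil>
	fmt.Println(validateVolumeID("../etc/shadow"))
}
```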
- Pin RawFile PVs to a single node in CreateVolume. RawFile volumes are
  inherently node-local (the backing file lives on one filesystem), so
  AccessibleTopology MUST advertise exactly one node. Previously, multi-node
  topology could let the scheduler place a Pod on a different node, where
  NodeStageVolume would silently create a new empty file and orphan the
  original.

- Propagate rawFile.nodes to StorageClass.allowedTopologies so the scheduler
  (WaitForFirstConsumer) and external-provisioner (Immediate) honor the
  constraint instead of relying on a post-hoc filter in CreateVolume.

- Persist PV ReclaimPolicy on disk next to disk.img and refuse to delete
  orphaned files unless the persisted policy is "Delete". This restores
  the contract for ReclaimPolicy: Retain even after the PV API object
  disappears (e.g., manual cleanup, etcd loss).

- Loosen CRD immutability for spec.rawFile: only `sparse` stays immutable
  (changing it would require recreating existing volumes); `nodes` is now
  mutable so operators can extend or shrink the eligible-node list. A new
  top-level rule still forbids switching the backend (lvm <-> rawFile).

- Extend hasSCDiff to detect rawFile.sparse / rawFile.nodes drift between
  the LSC and its StorageClass; the SC is recreated when the eligible-node
  list (parameters + allowedTopologies) goes out of sync.

- Rename volume-context key "rawfileSize" -> "local.csi.storage.deckhouse.io/rawfile-size"
  for consistency with the other RawFile keys.

Signed-off-by: v.oleynikov <vasily.oleynikov@flant.com>
- Detach every loop device attached to a backing file. losetup permits
  multiple loop devices per file; the previous DetachLoopDevice only
  detached the first hit and silently left the others attached, which
  could keep the file busy and block volume deletion. New helper
  FindAllLoopDevices returns the full list; DetachLoopDevice now iterates
  and joins per-device errors. FindLoopDevice keeps its single-device
  signature for the (more common) "primary attachment" callers.

- Broaden retry classification in addFinalizer/removeFinalizer. Previously
  only 409 Conflict was retried, so a transient 429/503/500/timeout from
  the apiserver during NodeStageVolume would surface as a hard error and
  leave the PV without our cleanup finalizer. The new patchFinalizerWithRetry
  helper retries on Conflict + ServerTimeout + Timeout + TooManyRequests
  + InternalError + ServiceUnavailable, with exponential backoff capped
  at 2s and a context-aware sleep so cancellation propagates promptly.

- Replace the per-volume Get PV burst in processRawFilePVs with a single
  LIST per cycle, indexed by PV name and pre-filtered to RawFile PVs of
  this driver. Apiserver pressure drops from O(N_local_volumes) to O(1)
  per cycle; semantics for the orphan-cleanup grace period are preserved.

Signed-off-by: v.oleynikov <vasily.oleynikov@flant.com>
The truncated header in cleanup_test.go and lscValidator_test.go
was rejected by the license linter. Restore the full Apache 2.0
boilerplate to match the rest of the repository.

Signed-off-by: v.oleynikov <vasily.oleynikov@flant.com>
- Pin TopologyKey constant on both sides of the cross-image boundary:
  add a literal-value test in controller/pkg/controller and a mirror in
  sds-local-volume-csi/internal, plus a comment on the CSI-side const so
  any rename forces both packages to be updated together.
- Add Ginkgo specs covering the controller's RawFile contract:
  AllowedTopologies propagation from rawFile.nodes (with TopologyKey),
  empty-topologies behavior when nodes is unset, and SC drift detection
  / re-creation on rawFile.nodes mutation.
- Unify Apache 2.0 copyright year on the four files added by this PR
  (2025 -> 2026) so all newly introduced files match the year of their
  initial commit.

Signed-off-by: v.oleynikov <vasily.oleynikov@flant.com>
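The literal-value pinning described above follows a simple test pattern: compare the constant against a string literal rather than against the other package's constant, so a rename on either side of the cross-image boundary fails a test. A minimal sketch (the key value is taken from the PR description):

```go
package main

import "fmt"

// topologyKey mirrors the cross-image TopologyKey constant discussed
// above; the value below is the one stated in this PR.
const topologyKey = "topology.sds-local-volume-csi/node"

func main() {
	// A literal comparison, not a reference to the peer package's
	// constant, so drift between the two copies is caught by CI.
	if topologyKey != "topology.sds-local-volume-csi/node" {
		panic("TopologyKey drifted; update both packages together")
	}
	fmt.Println("topology key pinned")
}
```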