
bug: rootless containers inside of docker error with cgroup.subtree_control #3597

@JosiahParry

We use rootless containers via libcontainer in our binary. However, when running our binary inside of a standard docker container, we cannot get libcontainer to work.

When we create a container using oci_spec::runtime::Linux::rootless(), with a builder that looks roughly like this:

    let container = ContainerBuilder::new(container_id.clone(), SyscallType::Linux)
        .validate_id()?
        .with_root_path(&state_dir)?
        .with_executor(crostini::Crostini)
        .as_init(bundle_dir)
        // false in docker
        .with_systemd(false)
        .with_detach(false)
        .with_no_pivot(true)
        .build()?;

we get this error:

    Failed to create init container 'rico-01KV99C15V0RH8FCHJSB5Z6458':
      caused by: failed to create container: received unexpected message: OtherError("cgroup error: io error: failed to open /sys/fs/cgroup/cgroup.subtree_control: Read-only file system (os error 30)"), expected: WriteMapping

Note that we're not setting any cgroup limits or LinuxResources.

With --privileged we were able to build the rootless runtime, but ideally that wouldn't be necessary.

After debugging this for a few hours with the help of Claude, I think this is related to the v2 manager in libcgroups:

    fn create_unified_cgroup(&self, pid: Pid) -> Result<(), V2ManagerError> {
        let controllers: Vec<String> = util::get_available_controllers(&self.root_path)?
            .iter()
            .map(|c| format!("+{c}"))
            .collect();

        Self::write_controllers(&self.root_path, &controllers)?;

        let mut current_path = self.root_path.clone();
        let mut components = self
            .cgroup_path
            .components()
            .filter(|c| c.ne(&RootDir))
            .peekable();
        while let Some(component) = components.next() {
            current_path = current_path.join(component);
            if !current_path.exists() {
                fs::create_dir(&current_path).wrap_create_dir(&current_path)?;
                fs::metadata(&current_path)
                    .wrap_other(&current_path)?
                    .permissions()
                    .set_mode(0o755);
            }

            // last component cannot have subtree_control enabled due to internal process constraint
            // if this were set, writing to the cgroups.procs file will fail with Erno 16 (device or resource busy)
            if components.peek().is_some() {
                Self::write_controllers(&current_path, &controllers)?;
            }
        }

        common::write_cgroup_file(self.full_path.join(CGROUP_PROCS), pid)?;
        Ok(())
    }

AI Summary

The failing write happens at the very top of create_unified_cgroup:

    let controllers = util::get_available_controllers(&self.root_path)?…;
    Self::write_controllers(&self.root_path, &controllers)?;   // writes ROOT subtree_control

Before walking down to the requested cgroup_path, it enables every available
controller on self.root_path (the cgroup v2 mount root) by writing
{root}/cgroup.subtree_control.

Inside an unprivileged Docker container, the container is delegated only its own
scope (…/docker-<id>.scope), not the mount root. So this root write hits a
read-only file system and fails with EROFS (os error 30). The subsequent
path-walk loop would write subtree_control on each ancestor it owns, which is
fine. Only the unconditional root write fails.
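To make the order of writes concrete, here is a minimal standalone sketch that lists the cgroup.subtree_control files the quoted function would touch for a given path. The helper name subtree_control_targets is hypothetical and not part of libcgroups; it only mirrors the control flow shown above and does no I/O.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper (not part of libcgroups): returns, in order, the
/// cgroup.subtree_control files that the quoted create_unified_cgroup logic
/// would write to for a given mount root and cgroup path.
fn subtree_control_targets(root: &Path, cgroup_path: &Path) -> Vec<PathBuf> {
    // The unconditional write on the mount root always comes first.
    let mut targets = vec![root.join("cgroup.subtree_control")];

    let mut current = root.to_path_buf();
    let mut components = cgroup_path
        .components()
        .filter(|c| *c != Component::RootDir)
        .peekable();

    while let Some(component) = components.next() {
        current = current.join(component);
        // The last component is skipped, as in the original loop, because the
        // leaf cgroup holds the process itself.
        if components.peek().is_some() {
            targets.push(current.join("cgroup.subtree_control"));
        }
    }
    targets
}
```

For a root of /sys/fs/cgroup and a cgroup path of /a/b, this yields /sys/fs/cgroup/cgroup.subtree_control followed by /sys/fs/cgroup/a/cgroup.subtree_control. The first entry is the one that fails with EROFS in the report above.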

This happens regardless of requested resources (LinuxResources is unset) and
regardless of cgroups_path / --cgroupns (host or private), which is
consistent with the root write being unconditional rather than driven by config.

Possible fix direction: skip enabling controllers on the mount root when the
process doesn't own it / they're already enabled in the delegated parent, or
anchor controller-enabling at the delegated root rather than the fs mount root.
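One way the "already enabled" variant of this idea could look is sketched below, using only the standard library. The function names missing_controllers and enable_missing_controllers are hypothetical and are not libcgroups APIs. The sketch assumes cgroup.controllers and cgroup.subtree_control are readable in the directory, and it would still fail if a controller really is missing and the file is read-only.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: given the contents of cgroup.controllers and
/// cgroup.subtree_control, returns "+name" entries for the controllers that
/// are available but not yet enabled for child cgroups.
fn missing_controllers(available: &str, enabled: &str) -> Vec<String> {
    let enabled: Vec<&str> = enabled.split_whitespace().collect();
    available
        .split_whitespace()
        .filter(|c| !enabled.contains(c))
        .map(|c| format!("+{c}"))
        .collect()
}

/// Hypothetical helper: opens cgroup.subtree_control for writing only when
/// at least one controller still needs to be enabled.
fn enable_missing_controllers(dir: &Path) -> io::Result<()> {
    let available = fs::read_to_string(dir.join("cgroup.controllers"))?;
    let enabled = fs::read_to_string(dir.join("cgroup.subtree_control"))?;

    let missing = missing_controllers(&available, &enabled);
    if missing.is_empty() {
        // Nothing to do, so the (possibly read-only) file is never opened
        // for writing.
        return Ok(());
    }

    let mut file = OpenOptions::new()
        .write(true)
        .open(dir.join("cgroup.subtree_control"))?;
    file.write_all(missing.join(" ").as_bytes())
}
```

Calling enable_missing_controllers on the mount root in place of the unconditional write would avoid the EROFS failure only when every available controller is already listed in the root's cgroup.subtree_control. Whether that holds inside a default Docker container has not been verified here.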

(libcgroups 0.6.0, cgroup v2, Docker default config, Debian trixie)

</details>
