Requirements for jbox command line tool

This file contains requirements for a command line tool called 'jbox' that launches a Java program in an isolated, "sandboxed" process using Linux kernel containerization features, in particular unprivileged user namespaces, pivot_root, and OverlayFS.

jbox is expected to be run as a non-root user. It does not require the CAP_SYS_ADMIN capability or any other privilege escalation mechanism, eg, having the setuid bit set on a root-owned binary; in this regard jbox is similar to podman in allowing non-privileged users to run isolated, sandboxed processes in a secure and safe way.

Detailed requirements below are grouped under level two headers. Every level two header starts with a number to identify the corresponding group of requirements. Requirements are specified in level three headers. Each level three header starts with the requirement group number, followed by a period ('.') character, followed by a number identifying the requirement inside the group. Sub-requirements are specified by level four headers. Each level four header starts with the identifier of its parent level three header, followed with a period ('.') character and followed with a number specifying the subrequirement.

1 Command line arguments

1.1 Image base directory.

A mandatory flag --image-basedir with a single directory argument must be provided to jbox. The argument specifies the path to a directory that we call the "image base" directory.

If this directory is provided as a relative path, it should be resolved to an absolute path immediately.

This directory contains the root of a minimal filesystem image with all the read-only parts necessary to run a Java Virtual Machine (JVM) that is confined to that root.

While we are calling the directory an "image", this directory contains the actual files and directories, ie, is not a tar file or a binary file with an ISO image on it, but the actual image contents in its "extracted" filesystem form.

This directory can be thought of as the result of running docker export on a fresh docker container run from a particular docker image, and then extracting the contents of the resulting tar file into the image base directory.

1.1.1 Image base directory never modified

An execution of jbox should not be able to alter the contents of the image base directory in any way; running any program through jbox should leave the image base directory unchanged.

1.1.2 Image base directory owner check

The user that is the owner of this directory should match the effective user running jbox, otherwise the tool should print an error message and exit with a unique exit value to help identify the failure.
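As an illustration of 1.1 and 1.1.2 together, here is a minimal sketch in Python. The requirements do not prescribe an implementation language, and the function name and the exit status value 10 are hypothetical; the requirements only ask that the status be unique.

```python
import os
import sys

# Hypothetical value: the requirement only asks for a unique exit status.
EXIT_IMAGE_BASEDIR_OWNER = 10


def resolve_image_basedir(arg: str) -> str:
    """Resolve --image-basedir to an absolute path and check its owner."""
    # Resolve relative paths against the current working directory right away.
    path = os.path.abspath(arg)
    # Compare the directory owner with the effective user id of this process.
    # A missing directory raises FileNotFoundError, which is not handled here.
    if os.stat(path).st_uid != os.geteuid():
        print(f"jbox: {path} is not owned by the effective user", file=sys.stderr)
        raise SystemExit(EXIT_IMAGE_BASEDIR_OWNER)
    return path
```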

1.2 Sandbox directory.

A mandatory flag --sandbox-dir with a single directory argument must be provided to jbox. The argument specifies the path to a directory that we will call the "sandbox" directory.

If this directory is provided as a relative path, it should be resolved to an absolute path immediately.

1.2.1 Sandbox directory is empty on startup.

If the sandbox directory exists on startup, it should be empty, and should be owned by the same effective user id running jbox. The directory should also have permissions for that user to read, write and execute on that directory.

If the sandbox directory does not exist on startup, its parent directory should allow the effective user running jbox to create the sandbox directory, with permissions to read, write and execute on it; in this case jbox should create the directory as such.

If none of these conditions apply, jbox should print an error message and exit with a unique exit value to identify the failure.
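The checks in 1.2.1 could be sketched as follows in Python. The function name is hypothetical, ValueError stands in for "print an error message and exit with a unique exit value", and the mode 0o700 used when creating the directory is an assumption (the requirement only asks for read, write and execute for the user).

```python
import os
import stat


def prepare_sandbox_dir(arg: str) -> str:
    """Validate an existing sandbox directory or create a missing one."""
    path = os.path.abspath(arg)
    if os.path.isdir(path):
        st = os.stat(path)
        if st.st_uid != os.geteuid():
            raise ValueError(f"{path}: not owned by the effective user")
        # The owner needs read, write and execute on the directory.
        if (st.st_mode & stat.S_IRWXU) != stat.S_IRWXU:
            raise ValueError(f"{path}: owner lacks rwx permissions")
        if os.listdir(path):
            raise ValueError(f"{path}: not empty")
    elif os.path.lexists(path):
        raise ValueError(f"{path}: exists but is not a directory")
    else:
        # Fails with an OSError if the parent does not allow creation.
        os.mkdir(path, 0o700)  # assumed mode
    return path
```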

1.3 Read-only volume directories

The optional flag --ro-volume can be provided to jbox any number of times. Each occurrence takes a single argument consisting of two directory paths separated by a colon character. The first path (before the ':' character) corresponds to a directory on the host. We will call this directory ro-vol-src for short.

If the ro-vol-src directory is provided as a relative path, it should be resolved to an absolute path immediately.

The second path (after the ':' character) corresponds to a directory inside the sandboxed process. We will call this directory ro-vol-dst for short.

The ro-vol-dst directory should always be provided as an absolute path in the resulting filesystem of the sandboxed process (it should start with the '/' character).

The contents of ro-vol-src should be available to the sandboxed JVM process in the sandboxed filesystem root under ro-vol-dst. Regardless of the permissions on the host for the ro-vol-src directory, inside the sandbox this should be available as read-only via a read-only type of mount.

It should not be possible for the sandboxed process to create any files or to change any existing files inside the ro-vol-dst directory.

It should be possible to provide values for ro-vol-src and ro-vol-dst that contain the ':' character in an unambiguous way. To do this, users can escape the ':' character with a backslash ('\') character. To signify a backslash character in one of these directories, two backslashes can be provided ('\\'). It is not legal to provide a backslash character preceding any character other than ':' or '\'; doing so should result in an error message being printed and jbox exiting with a unique error code to identify the failure.
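The escaping rule above can be sketched as a small parser. This is an illustrative Python sketch, not a prescribed implementation: the function name is hypothetical, and ValueError stands in for printing an error message and exiting with a unique error code. The same grammar is used by the --rw-volume flag in 1.4.

```python
def parse_volume_spec(spec: str) -> tuple[str, str]:
    """Split 'SRC:DST' on the unescaped ':' and undo backslash escapes."""
    parts = [[]]
    i = 0
    while i < len(spec):
        ch = spec[i]
        if ch == "\\":
            # A backslash may only precede ':' or another backslash.
            if i + 1 >= len(spec) or spec[i + 1] not in ":\\":
                raise ValueError(f"illegal escape at offset {i} in {spec!r}")
            parts[-1].append(spec[i + 1])
            i += 2
        elif ch == ":":
            parts.append([])  # unescaped colon: start the next path
            i += 1
        else:
            parts[-1].append(ch)
            i += 1
    if len(parts) != 2:
        raise ValueError(f"expected exactly one unescaped ':' in {spec!r}")
    src, dst = ("".join(p) for p in parts)
    if not src:
        raise ValueError("source directory is empty")
    if not dst.startswith("/"):
        raise ValueError("destination directory must be an absolute path")
    return src, dst
```

For example, the argument `/data/a\:b:/mnt/in` names the host directory `/data/a:b` and the sandbox directory `/mnt/in`.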

1.3.1 Read-only volume source directory owner and permissions check

For any ro-vol-src directories passed to the --ro-volume flag, the user that is the owner of the directory should match the effective user running jbox.

The permissions on the directory itself should allow read and execute for the owner.

If any of these conditions is not met, the tool should print an error message and exit with a unique exit status to help identify the failure.

1.4 Read-write volume directories

The optional flag --rw-volume can be provided to jbox any number of times. Each occurrence takes a single argument consisting of two directory paths separated by a colon character. The first path (before the ':' character) corresponds to a directory on the host. We will call this directory rw-vol-src for short.

If this directory is provided as a relative path, it should be resolved to an absolute path immediately.

The second path (after the ':' character) corresponds to a directory inside the sandboxed process. We will call this directory rw-vol-dst for short.

The rw-vol-dst directory should always be provided as an absolute path in the resulting filesystem of the sandboxed process (it should start with the '/' character).

The contents of rw-vol-src should be available to the sandboxed JVM process in the sandboxed filesystem root under rw-vol-dst.

The mount should be such that any changes to this directory or its files should be reflected on the host actual rw-vol-src directory, and such changes should persist after the jbox process exits.

It should be possible to provide values for rw-vol-src and rw-vol-dst that contain the ':' character in an unambiguous way. To do this, users can escape the ':' character with a backslash ('\') character. To signify a backslash character in one of these directories, two backslashes can be provided ('\\'). It is not legal to provide a backslash character preceding any character other than ':' or '\'; doing so should result in an error message being printed and jbox exiting with a unique error code to identify the failure.

1.4.1 Read-write volume source directory owner and permissions check

For any rw-vol-src directories passed to the --rw-volume flag, the user that is the owner of the directory should match the effective user running jbox.

The permissions on the directory itself should allow read, write and execute for the owner.

If any of these conditions is not met, the tool should print an error message and exit with a unique exit status to help identify the failure.

1.5 Environment variables

Allow zero or more occurrences of the flag --env-var with a single argument of the form "NAME=VALUE". This argument should be parsed by splitting at the first occurrence of the '=' character; everything before the first '=' character is the variable name, everything after the first '=' character is the value.
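A minimal Python sketch of this split follows. The function name is hypothetical, and rejecting an argument with no '=' or with an empty name is an assumption; the requirement does not say what to do in those cases.

```python
def parse_env_var(arg: str) -> tuple[str, str]:
    """Split 'NAME=VALUE' at the first '=' only."""
    name, sep, value = arg.partition("=")
    # Assumption: a missing '=' or an empty name is treated as an error.
    if not sep or not name:
        raise ValueError(f"expected NAME=VALUE, got {arg!r}")
    return name, value
```

Because only the first '=' is a separator, a value may itself contain '=' characters, as in `JAVA_OPTS=-Dkey=value`.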

1.6 Shared memory size

An optional --shm-size flag with a single argument can be provided for the string that will be passed to the mount command that will mount the /dev/shm filesystem in the sandbox. We call this value [shm-size]. The default value for this flag when not provided is "64m". The format for the value for this flag is an integral number followed by an optional 'k', 'm', or 'g' suffix.

1.6.1 Validate shared memory size

If provided, the value for the --shm-size flag should be validated against the expected format.
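One way to express the format from 1.6 is a regular expression, sketched here in Python. The function name is hypothetical, and treating the suffix as case-sensitive (lowercase only) is an assumption based on the suffixes listed in 1.6.

```python
import re

# An integral number followed by an optional 'k', 'm' or 'g' suffix.
_SHM_SIZE = re.compile(r"[0-9]+[kmg]?")


def validate_shm_size(value: str = "64m") -> str:
    """Return the value unchanged if it matches the expected format."""
    if not _SHM_SIZE.fullmatch(value):
        raise ValueError(f"invalid --shm-size value: {value!r}")
    return value
```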

1.7 Debug

An optional --debug flag with no arguments can be provided. If provided, the tool should print to standard output the name and arguments for every system call executed before executing it in a syntax reminiscent of the C API of the system call.

1.8 No other flags

Aside from the flags mentioned earlier in this document, no other flags should be given to the tool. If there are any, the tool should print an error message and exit with a unique exit status to help identify the failure.

1.9 Remaining arguments are the java command line to be run inside the sandbox

Any arguments remaining after the flags (which should come first) specify the command line (command and its arguments) used to execute the java program inside the sandbox.

2 Creation of an isolated sandbox process

The requirements in this group are given as specific actions and/or Linux system calls that need to be performed in the given order. When Linux system calls are specified, the arguments are given as per the C API; if coding in a different language, the appropriate conversions for that language and its libraries (eg, rust nix) should be applied.

If any operation or system call at any point below fails, the process should print an error message detailing the error and exit with a unique error code that allows the failure to be identified.

2.1 Prepare overlay directories

jbox should create new subdirectories under the sandbox directory:

  • "merged", which will be passed later as the mountpoint argument for an overlay mount
  • "upper", which will be passed later as the upper parameter for an overlay mount
  • "work", which will be passed later as the work parameter for an overlay mount

These subdirectories should be created with an owner and group matching the effective user and effective group id for the jbox process, and their permissions should match the spec rwxr-x--- (octal 750).
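A short Python sketch of this step, with a hypothetical function name. The explicit chmod is there because the mode passed to mkdir is filtered by the process umask. The sketch assumes the sandbox directory does not have the set-group-ID bit set, so new directories get the effective group id of the process.

```python
import os


def create_overlay_dirs(sandbox_dir: str) -> dict[str, str]:
    """Create merged, upper and work under sandbox_dir with mode 750."""
    paths = {}
    for name in ("merged", "upper", "work"):
        path = os.path.join(sandbox_dir, name)
        os.mkdir(path)
        # Set rwxr-x--- explicitly so the result does not depend on umask.
        os.chmod(path, 0o750)
        paths[name] = path
    return paths
```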

2.2 Calling clone system call

jbox should call the clone system call with the following arguments:

  • CLONE_NEWNS to create a new mount namespace for the child
  • CLONE_NEWUSER to create a new user namespace for the child
  • CLONE_NEWPID to isolate the rest of host machine processes from the child

After clone is called, we have two processes, the parent and the child.

2.3 Parent and child coordination

2.4 below will detail parent steps.

2.5 below will detail child steps.

Parent and child must coordinate with a pipe to ensure all steps under 2.4 in the parent are done before the child proceeds with steps in 2.5.
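The pipe coordination can be sketched as below. This Python sketch uses os.fork as a stand-in for the clone call in 2.2 (no namespaces are created), and the function name, the go-ahead byte and the child exit statuses are all hypothetical. The child blocks on the pipe until the parent either writes one byte (go ahead) or closes the pipe without writing (parent failed).

```python
import os


def run_with_handshake(parent_steps, child_steps) -> int:
    """Run parent_steps(child_pid) to completion before child_steps()."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()  # stand-in for clone() with the flags listed in 2.2
    if pid == 0:
        # Child: close the unused write end, then block on the read end.
        os.close(write_fd)
        go_ahead = os.read(read_fd, 1) == b"1"
        os.close(read_fd)
        if not go_ahead:
            os._exit(2)  # EOF without the byte: the parent gave up
        status = 1
        try:
            child_steps()
            status = 0
        finally:
            os._exit(status)
    # Parent: finish its steps, then release the child.
    os.close(read_fd)
    try:
        parent_steps(pid)
        os.write(write_fd, b"1")
    finally:
        os.close(write_fd)
    _, wait_status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(wait_status)
```

Closing the write end in a finally clause means the child never waits forever: if the parent fails before sending the byte, the child reads end-of-file and exits.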

2.4 Parent configures user mappings for the child

2.4.1 Parent disables groups in the child

Parent writes deny to /proc/[child_pid]/setgroups where [child_pid] is the process id of the child.

2.4.2 Parent maps group id in the child

Parent writes 0 [egid] 1 to /proc/[child_pid]/gid_map where [egid] is the effective group id of the parent and [child_pid] is the process id of the child.

2.4.3 Parent maps user id in the child

Parent writes 0 [euid] 1 to /proc/[child_pid]/uid_map where

  • [euid] is the effective user id of the parent
  • [child_pid] is the process id of the child.
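The three writes in 2.4.1 to 2.4.3 can be sketched as follows. The function name and the proc_root parameter are hypothetical; proc_root exists only so the sketch can be exercised against a scratch directory instead of the real /proc. The order matters: for an unprivileged process, "deny" has to be written to setgroups before gid_map can be written.

```python
import os


def write_id_maps(child_pid: int, euid: int, egid: int,
                  proc_root: str = "/proc") -> None:
    """Write setgroups, gid_map and uid_map for the child, in that order."""
    base = os.path.join(proc_root, str(child_pid))
    writes = (
        ("setgroups", "deny"),
        ("gid_map", f"0 {egid} 1"),  # child gid 0 maps to the parent's egid
        ("uid_map", f"0 {euid} 1"),  # child uid 0 maps to the parent's euid
    )
    for name, content in writes:
        # Each file receives its whole content when the file is closed.
        with open(os.path.join(base, name), "w") as f:
            f.write(content)
```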

2.5 Child configures mounts and pivots to sandbox

2.5.1 Child establishes mount privacy

Child executes mount(NULL, "/", NULL, MS_REC|MS_PRIVATE, NULL) to prevent any subsequent mounts from propagating back to the host system.

2.5.2 Child assembles the overlay

Child executes

mount("overlay", "[merged]", "overlay", 0, "lowerdir=[image-base],upperdir=[upper],workdir=[work]")

where

  • [image-base] is the image base directory argument that was provided to jbox in 1.1
  • [merged] is the "merged" directory created on 2.1
  • [upper] is the "upper" directory created on 2.1
  • [work] is the "work" directory created on 2.1

Since the [image-base], [upper] and [work] directories are each part of the string provided as the fifth ("data") argument for mount, and that string is interpreted by the kernel using the ':' character and the ',' character as separators, any ':', ',' or '\' characters in the [image-base], [upper] or [work] directory strings will need to be escaped with a preceding '\' character.
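The escaping rule stated above can be sketched in Python as follows. Both function names are hypothetical, and the sketch only builds the data string; it does not call mount.

```python
def escape_overlay_path(path: str) -> str:
    """Prefix every ':', ',' and backslash in path with a backslash."""
    out = []
    for ch in path:
        if ch in "\\:,":
            out.append("\\")
        out.append(ch)
    return "".join(out)


def overlay_mount_data(image_base: str, upper: str, work: str) -> str:
    """Build the fifth ("data") argument for the overlay mount call."""
    return "lowerdir={},upperdir={},workdir={}".format(
        escape_overlay_path(image_base),
        escape_overlay_path(upper),
        escape_overlay_path(work),
    )
```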

2.5.3 Child bind mounts read-write volumes

For each pair of [rw-vol-src] and [rw-vol-dst] directories provided as read-write mounts in 1.4

  • Child ensures that a volume mount point directory under "[merged]/[rw-vol-dst]" exists and has permissions octal 750. Any directory elements in "[merged]/[rw-vol-dst]" that do not exist should be created using the mkdir system call. Permissions for created directories should be octal 750.
  • Child executes mount("[rw-vol-src]", "[merged]/[rw-vol-dst]", NULL, MS_BIND|MS_REC, NULL) where [merged] is the "merged" directory created on 2.1

2.5.4 Child bind mounts read-only volumes

For each pair of [ro-vol-src] and [ro-vol-dst] directories provided as read-only mounts in 1.3

  • Child ensures that a volume mount point directory under "[merged]/[ro-vol-dst]" exists and has permissions octal 550. Any directory elements in "[merged]/[ro-vol-dst]" that do not exist should be created using the mkdir system call. Permissions for created directories should be octal 550.
  • Child executes mount("[ro-vol-src]", "[merged]/[ro-vol-dst]", NULL, MS_BIND|MS_REC, NULL) where [merged] is the "merged" directory created on 2.1
  • Child executes mount(NULL, "[merged]/[ro-vol-dst]", NULL, MS_BIND|MS_REMOUNT|MS_RDONLY|MS_REC, NULL) to remount as read-only, where [merged] is the "merged" directory created on 2.1

Note we don't provide MS_RDONLY in the initial bind mount call because the kernel ignores MS_RDONLY when creating a bind mount; the read-only flag only takes effect through the separate remount call.
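Creating the missing mount point directories for 2.5.3 and 2.5.4 can be sketched as below. The function name is hypothetical, and so are two design choices: new directories are first created owner-writable and only switched to the requested mode at the end (so the sketch also works when run outside the sandbox without extra privileges), and directories that already exist are left untouched.

```python
import os


def ensure_mount_point(merged: str, dst: str, mode: int) -> str:
    """Create missing components of merged/dst; return the mount point."""
    parts = [p for p in dst.split("/") if p]
    if not parts or any(p in (".", "..") for p in parts):
        raise ValueError(f"unsupported destination path: {dst!r}")
    created = []
    current = merged
    for part in parts:
        current = os.path.join(current, part)
        try:
            os.mkdir(current, 0o700)
        except FileExistsError:
            continue  # already present, for example from the image
        created.append(current)
    # Apply the requested mode deepest-first, after all mkdir calls are done.
    for path in reversed(created):
        os.chmod(path, mode)
    if not os.path.isdir(current):
        raise NotADirectoryError(current)
    return current
```

A read-only volume would call this with mode 0o550 and a read-write volume with 0o750, followed by the mount calls listed above.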

2.5.5 Child prepares /dev for the new root

Child executes:

  • mkdir("[merged]/dev", 0755) if [merged]/dev directory does not exist already where [merged] is the "merged" directory created on 2.1
  • mkdir("[merged]/dev/shm", 0755) if [merged]/dev/shm directory does not exist already where [merged] is the "merged" directory created on 2.1
  • mount("tmpfs", "[merged]/dev/shm", "tmpfs", MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REC, "mode=1755,size=[shm-size]") where [shm-size] is the shared memory size defined in 1.6 and [merged] is the "merged" directory created on 2.1
  • For each [device] in "/dev/null", "/dev/zero", "/dev/full", "/dev/random", "/dev/urandom", "/dev/tty"
    • creat("[merged]/[device]", 0666)
    • mount("[device]", "[merged]/[device]", NULL, MS_BIND, NULL) where [merged] is the "merged" directory created on 2.1

2.5.6 Child bind mounts virtual filesystems for /proc and /sys

Child executes

  • mkdir("[merged]/proc", 0555) if [merged]/proc directory does not exist already where [merged] is the "merged" directory created on 2.1
  • mount("/proc", "[merged]/proc", NULL, MS_BIND|MS_REC, NULL)
  • mkdir("[merged]/sys", 0555) if [merged]/sys directory does not exist already where [merged] is the "merged" directory created on 2.1
  • mount("/sys", "[merged]/sys", NULL, MS_BIND|MS_REC, NULL)

2.5.7 Child prepares the new root for pivot

Child executes a bind mount of the merged directory to itself, to satisfy the pivot_root requirement that the new root be a mount point: mount("[merged]", "[merged]", NULL, MS_BIND|MS_REC, NULL) where [merged] is the "merged" directory created on 2.1

2.5.8 Child prepares the old root for pivot

Child creates a new directory "[merged]/old_root" where [merged] is the "merged" directory created on 2.1

2.5.9 Child executes the pivot

Child executes pivot_root("[merged]", "[merged]/old_root") where [merged] is the "merged" directory created on 2.1, and then changes the current working directory to the new root by executing chdir("/").

2.5.10 Child detaches the old root

Child executes umount2("/old_root", MNT_DETACH)

3 Execute java program

Child uses the exec system call to run the program using the command and arguments provided in 1.9 and the environment variables provided in 1.5. The environment for exec should not inherit any environment variables existing at the point of exec, and it should only have the ones provided in 1.5.

Before exec, the child should arrange for:

  • standard input (stdin) to be redirected from /dev/null.
  • standard output (stdout) to be redirected to the file "/rw-data/logs/stdout.log". If the directory "/rw-data/logs" does not exist it should be created. If the "stdout.log" file in that directory already exists, it should be overwritten.
  • standard error (stderr) to be redirected to the file "/rw-data/logs/stderr.log". If the "stderr.log" file in that directory already exists, it should be overwritten.
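The redirection and the clean-environment exec can be sketched in Python as follows. Both function names are hypothetical, the log_dir parameter exists only so the sketch can be tried against a scratch directory, and the file mode 0o644 for the log files is an assumption.

```python
import os


def redirect_stdio(log_dir: str = "/rw-data/logs") -> None:
    """Point stdin at /dev/null and stdout/stderr at log files."""
    os.makedirs(log_dir, exist_ok=True)
    # O_TRUNC overwrites a log file left over from an earlier run.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    fds = (
        os.open("/dev/null", os.O_RDONLY),
        os.open(os.path.join(log_dir, "stdout.log"), flags, 0o644),
        os.open(os.path.join(log_dir, "stderr.log"), flags, 0o644),
    )
    for target, fd in enumerate(fds):  # targets 0, 1, 2
        os.dup2(fd, target)
    for fd in fds:
        if fd > 2:
            os.close(fd)


def exec_with_clean_env(argv: list[str], env_vars: dict[str, str]) -> None:
    """Replace the process; the new environment is exactly env_vars."""
    # Nothing from os.environ is passed on, only the --env-var values.
    os.execve(argv[0], argv, dict(env_vars))
```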