Hey,

When recommending that folks switch their baggageclaim driver over to overlay (backed by overlayfs) after seeing that btrfs would just not work for their use cases, we’ve always told the story that for [privileged containers] (of any kind - tasks, gets, etc.) an increase in container startup time is to be expected, but we never really went deep into why that’s the case.

It turns out that there’s quite a bit of cool container tech in it, so here’s a detailed look at why that’s the case (at least for now).

In the first sections we go through the basics of how the permission system works with regard to a process interacting with a file, and how namespaces can make that look different to userspace.

As we build up our knowledge, we get closer and closer to examples that simulate how containers work, up until the point where we get to see it all in practice with Concourse.

ps.: all of the exploration below was done on Linux 5.0.0-32-generic, on Ubuntu Disco (19.04).

user and group identifiers

When executing a process, Linux associates with it a set of numeric user and group identifiers (UIDs and GIDs, respectively) that can be divided into a few classes: real, effective, saved, and filesystem.

We can check what these are for a process either by using system calls (getuid(2), geteuid(2), or the more generic getresuid(2)):

    NAME
           getresuid, getresgid - get real, effective and saved user/group IDs

    SYNOPSIS

           #define _GNU_SOURCE         /* See feature_test_macros(7) */
           #include <unistd.h>

           int getresuid(uid_t *ruid, uid_t *euid, uid_t *suid);
           int getresgid(gid_t *rgid, gid_t *egid, gid_t *sgid);

or, by asking /proc about process information:

    cat /proc/self/status | grep Uid
    Uid:    1001    1001    1001    1001
            |        |       |       |
            |        |       |   filesystem
            |        |     saved
            |     effective
          real

As this state (process credentials) is supposed to be used for performing permission checking when a process tries to access a file, it must exist “somewhere”.

Under the hood (at the kernel level), these can be seen in the data structure that holds the security context, struct cred, which is associated with the one that holds the process information (struct task_struct):

struct cred
{
        kuid_t uid; /* real UID of the task */
        kgid_t gid; /* real GID of the task */

        kuid_t suid; /* saved UID of the task */
        kgid_t sgid; /* saved GID of the task */

        kuid_t euid; /* effective UID of the task */
        kgid_t egid; /* effective GID of the task */

        kuid_t fsuid; /* UID for VFS ops */
        kgid_t fsgid; /* GID for VFS ops */

        kernel_cap_t cap_inheritable; /* caps our children can inherit */
        kernel_cap_t cap_permitted;   /* caps we're permitted */
        kernel_cap_t cap_effective;   /* caps we can actually use */
        kernel_cap_t cap_bset;        /* capability bounding set */
        kernel_cap_t cap_ambient;     /* Ambient capability set */

        // ...
}

creating users and switching between them

It turns out that, from a kernel perspective, users do not need to be created: to make a process run as a different user, just call setuid(2) with a different number. As long as you have the capability to do so, that’s all you need.

Naturally, that might not be enough for a good user experience in some cases, but for access control at the kernel level, it’s all that’s necessary.

To see that in practice, consider my current setup where from the perspective of /etc/passwd 32 users exist:

    # read from `/etc/passwd` (the text file that describes user login
    # accounts in the system), looking at the third column (where UIDs are)
    #
    $ cat /etc/passwd | wc -l
            32
            
    $ cat /etc/passwd | awk -F ':' '{print $3}'
            0
            1
            2
            ...
            109
            1000
            999
            65534
            1001
            998

Despite those 32 users, we can craft a program that leverages way more users than just that.

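#define _GNU_SOURCE /* clone(2) */
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
#define USERS_COUNT 64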

void
show_who_am_i()
{
        printf("pid=%-8d uid=%-8d euid=%-8d\n", getpid(), getuid(), geteuid());
        return;
}

static int
child(void* arg)
{
        int err;
        int* uid = (int*)arg;

        // setuid() sets the effective user ID of the calling process.
        //
        // If the calling process is privileged (more precisely: if the process
        // has the CAP_SETUID capability in its user namespace), the real UID
        // and saved  set-user-ID  are  also set.
        //
        err = setuid(10001 + *uid);
        if (err == -1) {
                perror("setuid");
                exit(1);
        }

        show_who_am_i();
        return 0;
}

int
main(int argc, char** argv)
{
        int err;
        pid_t pids[USERS_COUNT];

        show_who_am_i();

        for (int i = 0; i < USERS_COUNT; i++) {
                void* stack = malloc(STACK_SIZE);
                if (stack == NULL) {
                        perror("malloc");
                        return 1;
                }

                pids[i] = clone(child, (char*)stack + STACK_SIZE, SIGCHLD, &i);
                if (pids[i] == -1) {
                        perror("clone: ");
                        return 1;
                }
        }

        for (int i = 0; i < USERS_COUNT; i++) {
                err = waitpid(pids[i], NULL, 0);
                if (err == -1) {
                        perror("waitpid: ");
                        return 1;
                }
        }

        return 0;
}

ps.: you must have CAP_SETUID for this code to work as a non-root user.

With 64 users, we get exactly that - 64 different UIDs being shown, starting at UID 10001.

The big thing here is to recall that the mapping between user IDs and other information (name, home directory, etc.) is done entirely in user space, managed by programs like your shell, etc.

permission system

As mentioned before, when accessing a regular file (from a regular file system), a process needs to go through a set of credential checks before it can even get a handle through which to read the contents.

The overall (simplified) operation looks like the following.

    userspace:      process (w/ a set of credentials)
                      requests opening of a file
                         | 
    -----------------(syscall)--------------
                         |
    kernel:         do_sys_open
                    do_filp_open
                    generic_permission
                      acl_permission_check?
                      capable_wrt_inode_uidgid?

First, a userspace process tries to open a certain file (issuing, for instance, openat(2)):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char** argv)
{
        int fd = openat(AT_FDCWD, "./trace.bt", O_RDONLY);
        if (fd == -1) {
                perror("openat: ");
                return 1;
        }

        close(fd);

        return 0;
}

At the kernel level, do_sys_open - the top-level common function used by both open(2) and openat(2) - gets called, taking pretty much the same arguments that openat(2) was supplied with.

long
do_sys_open(int dfd, const char __user* filename, int flags, umode_t mode)
{
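        int fd; // reserved earlier via get_unused_fd_flags() (elided)

        // `tmp` (the filename copied in from userspace) and `op` (the parsed
        // open flags) are also set up earlier (elided)
        struct file* f = do_filp_open(dfd, tmp, &op);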

        if (IS_ERR(f)) {
                put_unused_fd(fd);
                fd = PTR_ERR(f);
                return fd;
        }

        fsnotify_open(f);
        fd_install(fd, f);

        return fd;
}

While do_sys_open is not all that interesting by itself, it calls do_filp_open, whose responsibility is allocating the main data structure that represents a file open for someone to operate on - struct file.

// generate the file pointer
//
struct file*
do_filp_open(int dfd, struct filename* pathname, const struct open_flags* op)
{
        // ...
        filp = path_openat(&nd, op, flags | LOOKUP_RCU);
        return filp;
}

But to get to the point where such a struct can be allocated, and then to the point where we can go back to the userspace process telling it that the file was successfully opened, we first need to go through permission checks, which happen a few functions further down the road:

    generic_permission
    do_inode_permission (inlined)
    inode_permission+58
    may_open.isra.63+94
    do_last (inlined)
    path_openat+670
    do_filp_open+147  -----^
    do_sys_open+375
    __x64_sys_openat+32
    do_syscall_64+90
    entry_SYSCALL_64_after_hwframe+68

Reaching generic_permission, we finally find what we’ve been searching for - the permission checking [1]:

/**
 * generic_permission -  check for access rights on a Posix-like filesystem
 * @inode:	inode to check access rights for
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC, ...)
 *
 * Used to check for read/write/execute permissions on a file.
 * We use "fsuid" for this, letting us set arbitrary permissions
 * for filesystem access without changing the "normal" uids which
 * are used for other things.
 */
int
generic_permission(struct inode* inode, int mask)
{
        int ret;

        /*
         * Do the basic permission checks.
         */
        ret = acl_permission_check(inode, mask);
        if (ret != -EACCES)
                return ret;

        if (S_ISDIR(inode->i_mode)) {
                /* DACs are overridable for directories */
                if (!(mask & MAY_WRITE))
                        if (capable_wrt_inode_uidgid(inode,
                                                     CAP_DAC_READ_SEARCH))
                                return 0;
                if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
                        return 0;
                return -EACCES;
        }

        /*
         * Searching includes executable on directories, else just read.
         */
        mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
        if (mask == MAY_READ)
                if (capable_wrt_inode_uidgid(inode, CAP_DAC_READ_SEARCH))
                        return 0;
        /*
         * Read/write DACs are always overridable.
         * Executable DACs are overridable when there is
         * at least one exec bit set.
         */
        if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
                if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
                        return 0;

        return -EACCES;
}

acl_permission_check is where the traditional discretionary access control (DAC) happens, verifying whether the current process' filesystem user ID matches the inode's UID, checking group membership, etc.

Taking a look at the method responsible for DAC, we can see that struct inode is where the information with regards to “who owns this file” lives:

struct inode
{
        umode_t i_mode;
        unsigned short i_opflags;
        kuid_t i_uid;
        kgid_t i_gid;
        unsigned int i_flags;

        // ...
}

/*
 * This does the basic permission checking
 */
static int
acl_permission_check(struct inode* inode, int mask)
{
        unsigned int mode = inode->i_mode;

        if (likely(uid_eq(current_fsuid(), inode->i_uid)))
                mode >>= 6;
        else {
                if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
                        int error = check_acl(inode, mask);
                        if (error != -EAGAIN)
                                return error;
                }

                if (in_group_p(inode->i_gid))
                        mode >>= 3;
        }

        /*
         * If the DACs are ok we don't need any capability check.
         */
        if ((mask & ~mode & (MAY_READ | MAY_WRITE | MAY_EXEC)) == 0)
                return 0;
        return -EACCES;
}

This makes sense as the inode is meant to contain all the information needed by the filesystem to handle a file.

capabilities

Checking for the ability to open a file involves not only discretionary access control on a per-user and per-group basis, but also verifying the capabilities that a given process effectively has.

This is the mechanism that allows a root user to get access to any file in the system, and do whatever it wants to it.

As we saw in the previous section, if we relied solely on UIDs and GIDs for permission checks, a root user (whose UID is 0) would not match a file owned by a non-0 UID (e.g., root with UID 0 wouldn’t be able to open a file owned by UID 1000 that only its owner may read, as uid_eq(current_fsuid(), inode->i_uid) would be false).

The next check that’s performed after DAC, capable_wrt_inode_uidgid, verifies whether the process (that has a given set of effective capabilities) is capable of performing the desired action - in this case, reading the file, a check that can be bypassed via CAP_DAC_READ_SEARCH and/or CAP_DAC_OVERRIDE.

CAP_DAC_OVERRIDE - Bypass file read, write, and execute permission checks.

CAP_DAC_READ_SEARCH - Bypass file read permission checks and directory read and execute permission checks;

from capabilities(7)

Capabilities are essentially how a root user ends up being able to read any file: although acl_permission_check fails, capable_wrt_inode_uidgid succeeds, as root has the full set of capabilities.

[1]: a given filesystem might implement its own permission checking. generic_permission is just one (common) way of doing so.

user namespaces

When Concourse runs an unprivileged container, i.e., either a step that’s not marked with privileged: true or a step from a resource type that’s not marked as privileged, the process that runs within that container may see that its UID is set to 0 (resembling a privileged process), but it turns out that when performing actual filesystem checks, that process does not have that UID.

This is possible due to the use of user namespaces.

User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs, the root directory, keys, and capabilities (see capabilities(7)).

A process’s user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.

from user_namespaces(7)

While this might sound terrifying at first when it comes to permission checking for filesystem interactions, it’s really not once we see how, under the hood, the kernel works out which host user a given user (regardless of the UID that it sees within the user namespace) maps to.

Before we get to an illustration of that in practice, let’s first see how we can get a user namespace that lets us have UID 0 within it, mapping to a non-0 outside.

We can create a program that tries to open a file in different ways, and then see how that all plays out.

A call to clone(2) or unshare(2) with the CLONE_NEWUSER flag makes the new child process (for clone(2)) or the caller (for unshare(2)) a member of the new user namespace created by the call.

After the creation of a new user namespace, the uid_map file of one of the processes in the namespace may be written to once to define the mapping of user IDs in the new user namespace.

I.e., to fully utilize its UID mapping capabilities, we need to go through two steps:

  1. create a process having the CLONE_NEWUSER bit set in the flags passed to clone(2)

  2. write to /proc/pid/uid_map the mapping between UIDs inside the user namespace and UIDs outside of it (e.g., “0 inside means 1000 outside”)

Once the write to /proc/pid/uid_map is finished, the process now has fully switched credentials.

Let’s start by creating the process with the CLONE_NEWUSER flag then.

int
main(int argc, char** argv)
{
        int err;
        pid_t child_pid;

        int flags = CLONE_NEWUSER | SIGCHLD;
        void* stack = child_stack + STACK_SIZE;

        // create the new process in a new user namespace.
        //
        child_pid = clone(child, stack, flags, argv[1]);
        if (child_pid == -1) {
                perror("clone: ");
                return 1;
        }

        err = waitpid(child_pid, NULL, 0);
        if (err == -1) {
                perror("waitpid: ");
                return 1;
        }

        return 0;
}

Now, in the child process, write to the /proc/$pid/uid_map file corresponding to itself (self), specifying which UID inside the namespace corresponds to which UID outside of it, and how many consecutive IDs the mapping covers:

#define UIDMAP_FNAME "/proc/self/uid_map"

static const char* mapping = "0 1001 1";

void
show_proc_info()
{
        printf("%-8s %-8s %-8d\n", "process", "euid", geteuid());
}

static int
child(void* arg)
{
        int fd, n;

        fd = open(UIDMAP_FNAME, O_RDWR);
        if (fd == -1) {
                perror("open " UIDMAP_FNAME);
                exit(1);
        }

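        // strlen(3) (from <string.h>) rather than sizeof: `mapping` is a
        // pointer, so sizeof would yield the pointer size, not the length.
        n = write(fd, mapping, strlen(mapping));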
        if (n == -1) {
                perror("write");
                exit(1);
        }

        close(fd);

        show_proc_info();
        return 0;
}

Once the new process has finished specifying the UID mapping inside the namespace (by writing to /proc/self/uid_map), we can see how it shows (from the inside) that its UID is 0 there, but not from the outside:

    # create the process with CLONE_NEWUSER and then show the proc
    # information (pid and effective UID)
    #
    ./userns
            process  pid      16889
            process  euid     0


    # from the host user namespace, ask the kernel what the different types
    # of user IDs look like
    #
    cat /proc/16889/status | grep Uid
            Uid:    1001    1001    1001    1001

Knowing that, let’s see how permission checking ends up happening inside the user namespace then.

ps.: you can see the full source code here: cirocosta/userns-sample

permission checking inside a user namespace

As we uncovered, despite the impression that the user has UID 0 from within the namespace, in reality it’s just 1001 under the hood. This implies that permission checking might not work the way one would expect: having UID 0 inside will not mean that you can read a file owned by UID 0 outside that namespace.

When a process accesses a file, its user and group IDs are mapped into the initial user namespace for the purpose of permission checking and assigning IDs when creating a file.

When a process retrieves file user and group IDs via stat(2), the IDs are mapped in the opposite direction, to produce values relative to the process user and group ID mappings.

Let’s see that in practice.

Consider the following directory from the perspective of the host:

    ls -n ./samples/
    -r-------- 1    0 1002 0 Nov 24 16:10 0.txt
    -r-------- 1 1000 1002 0 Nov 24 16:10 1000.txt
    -r-------- 1 1001 1002 0 Nov 24 16:10 1001.txt
     |            |    |
    only owner   UID   |
    can read          GID

Having a user namespace where the mapping is 0 1001 1 (that is, UID 0 inside the namespace corresponds to UID 1001 outside of it, for a range of a single ID), we can see that the view changes quite a bit:

                        "unknown" 
                   .--> /proc/sys/kernel/overflowuid
                   |
    -r-------- 1 65534 0 0 Nov 24 16:10 0.txt
    -r-------- 1 65534 0 0 Nov 24 16:10 1000.txt
    -r-------- 1     0 0 0 Nov 24 16:10 1001.txt
                     |
                     *--> for us, the illusion that it's owned by 0
                          --> it's actually just 1001 outside.

As we expect, given that the actual underlying UIDs only match for 1001.txt, that’s the only file that we can read:

    root $ cat ./0.txt
            Permission denied
    root $ cat ./1000.txt
            Permission denied
    root $ cat ./1001.txt
            Hi!

In order to determine permissions when an unprivileged process accesses a file, the process credentials (UID, GID) and the file credentials are in effect mapped back to what they would be in the initial user namespace and then compared to determine the permissions that the process has on the file.

from user_namespaces(7)

user namespaces capabilities

Someone who went through capabilities(7) might realize that something is off here - a process having CAP_DAC_OVERRIDE (which users inside the user namespace can have) is supposed to be able to read any file. Isn’t that a contradiction to the purpose of having user namespaces in the first place?

It turns out that it isn’t - when we look closely at the place where the check for capabilities happens (after the DAC check), we can see that it takes the current user namespace into consideration:

/**
 * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
 * @inode: The inode in question
 * @cap: The capability in question
 *
 * Return true if the current task has the given capability targeted at
 * its own user namespace and that the given inode's uid and gid are
 * mapped into the current user namespace.
 */
bool
capable_wrt_inode_uidgid(const struct inode* inode, int cap)
{
        struct user_namespace* ns = current_user_ns();

        return ns_capable(ns, cap) && privileged_wrt_inode_uidgid(ns, inode);
}

Doing so, it’s able to validate whether the capability that we’re checking can be used against a given inode (which, if the inode is owned by, let’s say, the true ‘root’ on the host, it cannot).

concourse volumes and the uid dance

Having a pipeline with a job that contains a get step that’s supposed to retrieve contents as implemented by a resource type (for instance, git-resource), the output of that action (in the case of git, a repository) gets materialized as something in the filesystem in the form of a “volume”, which can then be mounted inside a container for consumption (e.g., running the tests that a git repository has).

By default, the volumes that get produced are “unprivileged” - the files in such a volume have their UIDs corresponding to a user whose UID is not 0.

For instance, if I go into the worker’s work directory and look at an unprivileged volume, we can see that the contents there all belong to a user that’s not UID 0 from our (host) perspective:

    root@cc-1:/concourse-state/volumes/live/d805cac3-9ad3...bd5/volume# ls -ln ./etc/
         mode       uid        gid
    drwxr-xr-x 1 4294967294 4294967294  X11
    -rw-r--r-- 1 4294967294 4294967294  adduser.conf
    drwxr-xr-x 1 4294967294 4294967294  alternatives
    drwxr-xr-x 1 4294967294 4294967294  apt
    -rw-r--r-- 1 4294967294 4294967294  bash.bashrc
    drwxr-xr-x 1 4294967294 4294967294  bash_completion.d
    -rw-r--r-- 1 4294967294 4294967294  bindresvport.blacklist
    drwxr-xr-x 1 4294967294 4294967294  ca-certificates
    -rw-r--r-- 1 4294967294 4294967294  ca-certificates.conf
    drwxr-xr-x 1 4294967294 4294967294  cron.daily
    -rw-r--r-- 1 4294967294 4294967294  debconf.conf

We can see how that works when we look at the definition of an OCI-compliant container (like the ones that gdn creates when using runc under the hood):

    {
      "ociVersion": "1.0.0",
      "linux": {
        "uidMappings": [
          {                         // make 0 inside not 0 outside
            "hostID": 4294967294,
            "containerID": 0,
            "size": 1
          },
          {                         // let any others just work
            "hostID": 1,
            "containerID": 1,
            "size": 4294967293
          }
        ],

        "namespaces": [
          { "type": "network" },
          { "type": "pid" },
          { "type": "uts" },
          { "type": "ipc" },
          { "type": "mount" },
          { "type": "user" }
        ],
        "user": {                   // run this process (/tmp/gdn-init) as 0
          "uid": 0,                 // (inside the user namespace)
          "gid": 0
        },

        // ...

When we switch to a privileged container though, things change - the process is now not isolated with a user namespace anymore, and thus no uidMappings are specified: a 0 inside is a 0 outside.

However, because by default Concourse produces volumes that are expected to be used by an unprivileged container, it shifts all of those files with UID 0 to that maximum ID (4294967294), so that when an unprivileged container needs to use it, it doesn’t need to do any extra work.

In the current implementation, though, if a privileged container needs to use such a volume, the translation needs to occur before the container gets started. That means that every single file whose owner is the maximum ID ends up having to be translated to UID 0 (i.e., if you have 3000 files, we need to perform 3000 chown(2) calls to change the ownership of the files). Depending on how the filesystem handles that, it can be quite expensive.

closing thoughts

It was pretty nice to get a deeper understanding of how the whole permission checking system works, and have some of the multiuser concepts ingrained - I never really understood very well how all of that works (user namespaces, even less).

It’s exciting to see that the work that we started on getting containerd as the backend will allow us to have more control over all of these details, moving us to a place where things like vito/oci-build-task can run without extra privileges.

extra

using unshare to verify the behaviors

Although during the entire article I relied on my own tools (custom C code developed just for this), it turns out that you can test all of these things (with some limitations) using unshare(1).