Hey,

In the core of concourse, volumes provide ways for the various executions to have some form of state - be it in the form of root filesystems (acting as the storage underneath container images) or as a place for, let's say, a repository that was cloned via git-resource - see concourse/baggageclaim.

One of the implementations that we have of this abstraction uses overlayfs, a filesystem in the upstream kernel that has seen a lot of use in containers recently.

In a very concourse-agnostic way, this article goes pragmatically through some of the concepts that overlayfs is based on, and then digs a bit deeper into why things can get slow when we combine overlayfs with file ownership changes.

raison d'etre (at least in the “containers world”)

The big deal about overlayfs when it comes to containers (as I see it, at least) is its ability to present a directory tree that combines multiple other directory trees in such a way that changes to it do not affect the originating trees.

That is, providing a copy-on-write “layer” on top of the union of multiple other directory trees.

Let's dig into those two.

the union of multiple directory trees

An overlay-filesystem tries to present a filesystem which is the result over overlaying one filesystem on top of the other. […] an ‘upper’ filesystem and a ‘lower’ filesystem.

from kernel.org's overlayfs entry

What that means is that if we have, let's say, two distinct directory trees like the following:

    dir1                    dir2
    /                       /
      a                       a
      b                       c

overlayfs is able to present a third directory (say, dir3) that represents putting, say, dir1 on top of dir2 (the order matters here):

    dir3
    /
      a
      b
      c

Naturally, this implies that overlayfs needs to “break ties” when a conflict occurs (like above, where we have a in both dir1 and dir2).

When a name exists in both filesystems, the object in the ‘upper’ filesystem is visible while the object in the ‘lower’ filesystem is either hidden or, in the case of directories, merged with the ‘upper’ object.

Using that terminology, here's how we can see that playing out in practice (if you want to know more about mount in general, make sure you check out understanding mount namespaces):

    # set up the directory hierarchy necessary for getting a final merged
    # view (`./merged`) based on two trees: `./upper` and `./lower`.
    # 
    # ps.: `work` (passed as argument to `workdir`) is required for
    #      overlayfs to be able to perform atomic actions.
    #
    mkdir ./{merged,work,upper,lower}
    touch ./upper/{a,b}
    touch ./lower/{a,c}


    sudo mount \
            -t overlay \
            overlay \
            -o lowerdir=./lower,upperdir=./upper,workdir=./work \
            ./merged

    .
    ├── lower
    │   ├── a
    │   └── c
    ├── upper
    │   ├── a
    │   └── b
    ├── merged      < final view
    │   ├── a               (from upper)
    │   ├── b               (from upper)
    │   └── c               (from lower)
    └── work        < internal
        └── work

upper and lower

While both upper and lower contribute to the final merged view in pretty much the same way from a “read” perspective, from a “write” perspective they're quite different.

That's because lower layers are never written to (they only provide data), while the upper layer receives the mutations (unless the mount is read-only).

For instance, considering the example above where c comes solely from lower, and b from upper, we can try writing to each in the merged tree and see what happens.

    echo "will-persist"  > ./merged/b
    echo "wont-persist"  > ./merged/c

Both writes work:

    cat ./merged/b
            will-persist
    cat ./merged/c
            wont-persist

But in the underlying directory trees:

    cat ./upper/b
            will-persist

    cat ./lower/c
            (empty)

what about container tech?

For containers, what ends up happening is that multiple lower directory trees are used (the container image layers), and an empty upper directory is put on top of those. The result is a final view of a directory tree that contains the entire container image, but whose writes never mutate the original contents: they always go to the upper directory (the “ephemeral storage”).

For instance, let's say we have a container image that's made up of two layers:

    layer1:                 layer2:
    /etc                    /bin
      myconf.ini              my-binary

With that, a container runtime would then take those two layers as lower directories, create an empty upper dir, and mount that somewhere:

    sudo mount \
            -t overlay \
            overlay \
            -o lowerdir=/layer1:/layer2,upperdir=/upper,workdir=/work \
            /merged

And then use that /merged as the rootfs of the container.
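In the mount above the two layers don't overlap, so their order makes no visible difference. Once they do overlap, the order inside `lowerdir` matters: per the kernel's overlayfs documentation, the leftmost directory is the topmost one. Here's a minimal sketch that needs no root, as it only emulates the lookup in plain shell rather than mounting anything. It assumes a hypothetical variation of the image above where layer2 also ships its own etc/myconf.ini and is meant to sit on top of layer1 (i.e., `lowerdir=/layer2:/layer1`). The `resolve` helper, `etc/base.ini` and the temporary paths are made up for illustration; whiteouts and directory merging are ignored.

```shell
# Emulated layers in a scratch directory (nothing gets mounted).
demo=$(mktemp -d)
mkdir -p "$demo/layer1/etc" "$demo/layer2/etc" "$demo/layer2/bin" "$demo/upper"
echo "from layer1" > "$demo/layer1/etc/myconf.ini"
echo "from layer1" > "$demo/layer1/etc/base.ini"     # hypothetical extra file
echo "from layer2" > "$demo/layer2/etc/myconf.ini"   # hypothetical override
touch "$demo/layer2/bin/my-binary"

# resolve PATH: print the layer that would provide PATH in the merged view.
# Search order: upper first, then the lowerdir entries from left to right.
resolve() {
    for layer in upper layer2 layer1; do
        if [ -e "$demo/$layer/$1" ]; then
            echo "$layer"
            return 0
        fi
    done
    echo "none"
}

resolve etc/myconf.ini   # prints: layer2 (hides the copy in layer1)
resolve etc/base.ini     # prints: layer1 (nothing above provides it)
resolve bin/my-binary    # prints: layer2
```

With a real mount, the same precedence is what decides which myconf.ini shows up under /merged.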

copy-on-write

When in the previous section we did a write to a file in the merged directory that came from the readonly lower directory tree, what happened there was copy-on-write taking place.

Because we can't modify the files on that layer, overlayfs takes care of reading all of the data (and metadata) from that file, copying it up to the upper directory, and only then, presenting to our application the file descriptor that we can use for writes.

When a file in the lower filesystem is accessed in a way the requires write-access, such as opening for write access, changing some metadata etc., the file is first copied from the lower filesystem to the upper filesystem (copy_up).

from kernel.org's overlayfs entry
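To make the copy up step concrete, here's a minimal user-space sketch of the idea in plain shell. It is only an emulation under stated assumptions: `cow_append` is a hypothetical helper, the directories are plain scratch directories (no overlay mount, no root needed), and the real thing happens inside the kernel with a lot more care (atomicity via the workdir, xattrs, and so on).

```shell
# Scratch "layers": a read-only lower and an initially empty upper.
demo=$(mktemp -d)
mkdir "$demo/upper" "$demo/lower"
echo "original" > "$demo/lower/c"

# cow_append NAME DATA: append DATA to NAME without ever touching lower.
cow_append() {
    name=$1
    data=$2
    if [ ! -e "$demo/upper/$name" ] && [ -e "$demo/lower/$name" ]; then
        # "copy up": bring data and metadata (mode, timestamps) to upper
        cp -p "$demo/lower/$name" "$demo/upper/$name"
    fi
    # the write itself only ever lands in upper
    echo "$data" >> "$demo/upper/$name"
}

cow_append c "changed"

cat "$demo/lower/c"   # prints: original
cat "$demo/upper/c"   # prints: original, then changed
```

Note how the cost of the first write is proportional to the size of the whole file, not to the size of the change.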

To see this in practice, consider the following directory configuration that's a simplification of the previous case.

    upper:          lower:          merged:
    /               /               /
                      a               a (from lower)

Given that the filesystem employs copy-on-write semantics, even though the file is visible under both ./lower and ./merged (with two distinct device numbers, even), no data is actually duplicated until we decide to change it (“on write”).

More specifically, until we decide to open the file for writing:

    open("./merged/a", O_RDWR)
      |
      |
      *--> copy up to `upper`
            --> available for writes w/out change to the file under `lower`.

For instance, we can have a 1GB file in a lower dir, then see how we end up paying the price of a copy up the first time we open it as read-write.

First, let's set up the directories (just like we did before):

    # create the dirs
    #
    mkdir ./{merged,work,upper,lower}


    # write 1GB to `./lower/a`
    #
    dd if=/dev/zero of=./lower/a bs=$((1 << 12)) count=$((1 << 18))


    # mount
    #
    sudo mount \
            -t overlay \
            overlay \
            -o lowerdir=./lower,upperdir=./upper,workdir=./work \
            ./merged

Then, create a program that first opens the file as readonly (O_RDONLY), then “changes its mind”, opening it again, now as read-write (O_RDWR):

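    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* the 1GB file created at `./lower/a`, as seen through the overlay mount */
    static const char* file = "./merged/a";

    int
    main(int argc, char** argv)
    {
        int fd;

        /* opening as readonly does not trigger a copy up */
        fd = openat(AT_FDCWD, file, O_RDONLY);
        if (fd == -1) {
            perror("openat O_RDONLY");
            return 1;
        }

        if (close(fd) == -1) {
            perror("close");
            return 1;
        }

        /* opening as read-write forces the file to be copied up first */
        fd = openat(AT_FDCWD, file, O_RDWR);
        if (fd == -1) {
            perror("openat O_RDWR");
            return 1;
        }

        return close(fd);
    }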

With the use of strace, we can see the difference in time between opening the file as readonly and read-write:

    strace -T -e openat ./a.out
    openat(AT_FDCWD, "./merged/a", O_RDONLY) = 3  <0.000110>
    openat(AT_FDCWD, "./merged/a", O_RDWR)  = 3   <0.297918>
    |      .-----------------------------.    |  .-----------.
    |              arguments passed           |    time taken
    syscall                                  return

“seeing” the price of specific copy ups

It turns out that it's quite easy to “accidentally” pay the price of a copy up without even knowing - open the file with the wrong flag, and you'll force a copy up.

For instance, concourse needs to shift the ownership of all files in a container image when it has to translate that image from unprivileged to privileged, forcing every file to be copied up.

While one can initially think that there's only the cost associated with N chown(2) syscalls taking place (which is already something on its own), in the case of overlayfs that also means adding the extra disk io for cloning each of those files (see notes 1 and 2 at the end).
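To get a feeling for how much data such an ownership shift would force through copy up, one can sum the sizes of all regular files in the lower layers before doing it. Here's a rough sketch: `copy_up_estimate` is a hypothetical helper, the sample tree is made up, and it assumes GNU find (for `-printf`) and that metadata-only copy up is not in use. Hard links are counted once per name, so treat the number as an upper bound.

```shell
# A tiny stand-in for an image rootfs (hypothetical contents).
rootfs=$(mktemp -d)
mkdir -p "$rootfs/etc" "$rootfs/bin"
printf 'key=value\n' > "$rootfs/etc/myconf.ini"      # 10 bytes
head -c 4096 /dev/zero > "$rootfs/bin/my-binary"     # 4096 bytes

# copy_up_estimate DIR: count the regular files under DIR and add up their
# sizes, i.e. the data a recursive chown would have to copy up.
copy_up_estimate() {
    find "$1" -type f -printf '%s\n' |
        awk '{ n++; bytes += $1 } END { printf "%d files, %d bytes\n", n, bytes }'
}

copy_up_estimate "$rootfs"   # prints: 2 files, 4106 bytes
```

For a real image, pointing this at each lower directory gives a ballpark of the extra disk io on top of the chown(2) calls themselves.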

Using a tool like bpftrace we can see, e.g., how much time we've been spending on that very specific case: chown(2)ing.

Given that the copy up of data occurs at ovl_copy_up_data

    static int
    ovl_copy_up_data(struct path* old, struct path* new, loff_t len)
    {
            // ...
    }

we can trace that and discover the whole stack trace that leads to it:

    k:ovl_copy_up_data
    {
            printf("%s", kstack);
    }

telling us then:

    ovl_copy_up_data+1
    ovl_do_copy_up+1578
    ovl_copy_up_one+624
    ovl_copy_up_flags+167
    ovl_copy_up+16
    ovl_setattr+78
    notify_change+761
    chown_common+461
    do_fchownat+147
    __x64_sys_chown+34
    do_syscall_64+90
    entry_SYSCALL_64_after_hwframe+68

Now, given that we want to capture those copy ups that originated only from chown(2) syscalls, whose internal implementation delegates to ovl_setattr:

    static const struct inode_operations ovl_file_inode_operations = {
            .setattr        = ovl_setattr,
            .permission     = ovl_permission,
            .getattr        = ovl_getattr,
            .listxattr      = ovl_listxattr,
            .get_acl        = ovl_get_acl,
            .update_time    = ovl_update_time,
            .fiemap         = ovl_fiemap,
    };

we can set up one extra hook (on do_fchownat) that records, per thread id, the name of the file being chowned, so that we only capture the data being copied up through chown(2)s.

    #!/snap/bin/bpftrace

    BEGIN
    {
            printf("%-8s %-16s %-16s %-16s %-16s\n",
                "PID", "COMM", "NAME", "LEN(B)", "ELAPSED (ms)");
    }


    kprobe:do_fchownat
    {
            // keep track of the filename
            //
            @fname[tid] = str(arg1);
    }


    kprobe:ovl_copy_up_data
    / @fname[tid] != "" /
    {
            // keep track of the size of the file being copied up
            // as well as its start timestamp (in nanosecs)
            //
            @len[tid] = arg2;
            @start[tid] = nsecs;
    }


    kretprobe:ovl_copy_up_data
    / @len[tid] != 0 /
    {
            // once the copying has finished, display the values
            //
            printf("%-8d %-16s %-16s %-16d %d\n",
                    pid, comm,
                    @fname[tid], @len[tid],
                    (nsecs - @start[tid]) / 1000000);


            // free those spaces
            //
            delete(@len[tid]);
    }

    kretprobe:do_fchownat
    {
            // free those spaces
            //
            delete(@fname[tid]);
            delete(@start[tid]);
    }

With that running, we can see it collecting data in practice:

    PID      COMM             NAME             LEN(B)           ELAPSED (ms)
    12399    sample.out       ./merged/a       1073741824       787

1: it's possible to avoid the copying of the underlying data when chown(2)ing by using “metadata only copy up”, but we don't have that activated at the moment. Unfortunately, that only landed in 4.19, which is not yet widely deployed.

See “Overlayfs memory usage improvements” in 4.19.
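As a minimal sketch of how one might check for and opt into metadata-only copy up, assuming a 4.19+ kernel with the overlay module loaded: the sysfs path below is the overlay module's `metacopy` parameter, and `metacopy=on` is the corresponding mount option. The directory names are the ones from the earlier examples, and the mount command is only printed here, not run.

```shell
# Default for new overlay mounts, as exposed by the overlay module
# (typically Y or N); "unknown" if the parameter isn't available.
param=/sys/module/overlay/parameters/metacopy
if [ -r "$param" ]; then
    metacopy_default=$(cat "$param")
else
    metacopy_default="unknown"
fi
echo "metacopy default: $metacopy_default"

# Opting in for a single mount, regardless of the default.
opts="lowerdir=./lower,upperdir=./upper,workdir=./work,metacopy=on"
echo "sudo mount -t overlay overlay -o $opts ./merged"
```

The kernel documentation warns against using `metacopy=on` with untrusted upper or lower directories, so this is something to evaluate rather than flip on blindly.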

2: we could in the future leverage shiftfs and avoid the whole chown(2)ing altogether. Canonical has been shipping their kernels with it (mainly for LXD as far as I can tell), which is a good indication that it could come to upstream at some point soon - see “trying out shiftfs”.