Hey,
In the core of concourse, volumes provide ways for the various executions to have some form of state, be it in the form of root filesystems (acting as the storage underneath container images) or as a place for, let’s say, a repository that was cloned via git-resource (see concourse/baggageclaim).
One of the implementations that we have of such abstraction uses overlayfs, a filesystem in the upstream kernel that’s been used quite a lot when it comes to containers recently.
In a very concourse-agnostic way, in this article I try to go very pragmatically through some of the concepts that overlayfs is based on, as well as a bit deeper into why things can get slow when we tie overlayfs with file ownership changes.
raison d’etre (at least in the “containers world”)
The big deal about overlayfs when it comes to containers (as I see it, at least) is the ability to present a directory tree that consists of multiple other directory trees combined, in such a way that changes to it do not affect the originating trees.
container ------------
  .
  . /etc
  .   foo --> from a directory tree (tree1)
  .   bar --> from another directory tree (tree2)
  .
  echo "lol" > /etc/foo
    --> for the container, seems like a change to the `/etc/foo` file
    ==> for the tree that brought `/etc/foo` (tree1), no changes at all
That is, providing a copy-on-write “layer” on top of the union of multiple other directory trees.
Let’s dig into those two.
the union of multiple directory trees
An overlay-filesystem tries to present a filesystem which is the result over overlaying one filesystem on top of the other. […] an ‘upper’ filesystem and a ‘lower’ filesystem.
What that means is that if we have, let’s say, two distinct directory trees like the following:
dir1        dir2
/           /
  a           a
  b           c
overlayfs is able to present a third directory (say, dir3) that represents putting dir1 on top of dir2 (the order matters here):
dir3
/
a
b
c
Naturally, this implies that overlayfs needs to “break ties” when a conflict occurs (like above, where we have a in both dir1 and dir2).
When a name exists in both filesystems, the object in the upper filesystem is visible while the object in the lower filesystem is either hidden or, in the case of directories, merged with the upper object.
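The rule in that quote can be imitated in userspace with a few lines of shell. This is only a sketch of the lookup order, not how the kernel implements it: `resolve` is a hypothetical helper, it handles a single lower tree, and it ignores directory merging, whiteouts and opaque directories.

```shell
#!/bin/sh
# resolve NAME UPPER LOWER   (hypothetical helper, for illustration only)
# Prints the path that a merged view would expose for NAME:
# the upper entry wins, the lower one is used only as a fallback.
resolve() {
    name=$1 upper=$2 lower=$3
    if [ -e "$upper/$name" ]; then
        printf '%s\n' "$upper/$name"
    elif [ -e "$lower/$name" ]; then
        printf '%s\n' "$lower/$name"
    else
        return 1    # not present in either tree
    fi
}

# demo with the dir1 (upper) / dir2 (lower) layout from above
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/a" "$tmp/dir1/b" "$tmp/dir2/a" "$tmp/dir2/c"
for f in a b c; do
    resolve "$f" "$tmp/dir1" "$tmp/dir2"
done
```

Here a and b resolve to the copies under dir1, and only c falls through to dir2.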
Using that terminology, here’s how we can see that playing out in practice [1]:
# set up the directory hierarchy necessary for getting a final merged
# view (`./merged`) based of two trees: `./upper` and `./lower`.
#
# ps.: `work` (passed as argument to `workdir`) is required for
# overlayfs to be able to perform atomic actions.
#
# as a result, we end up with two distinct directory trees having files:
#
# .
# ├── lower
# │ ├── a
# │ └── c
# └── upper
# ├── a
# └── b
#
mkdir ./{merged,work,upper,lower}
touch ./upper/{a,b}
touch ./lower/{a,c}
# create the overlayfs mount, creating a merged view of those two
#
sudo mount \
-t overlay \
overlay \
-o lowerdir=./lower,upperdir=./upper,workdir=./work \
./merged
.
├── lower
│ ├── a
│ └── c
├── upper
│ ├── a
│ └── b
├── merged < final view
│ ├── a (from upper)
│ ├── b (from upper)
│ └── c (from lower)
└── work < internal
└── work
upper and lower
While both upper and lower contribute to the final merged view in pretty much the same way from a “read” perspective, from a “write” perspective they’re quite different.
That’s because lower layers are never written to (they only provide data), while an upper layer can receive mutations (and will, if you change the corresponding files in the merged directory) [2].
For instance, considering the example above where the file c comes solely from lower, and b from upper, we can try writing to each in the final merged tree and see what happens.
echo "will-persist" > ./merged/b
echo "wont-persist" > ./merged/c
Looking from the perspective of merged, both writes work:
cat ./merged/b
will-persist
cat ./merged/c
wont-persist
But in the underlying directory trees, we can see that only upper (the writable layer) gets a mutation - lower does not:
cat ./upper/b
will-persist
cat ./lower/c
(empty)
This is also true for removals, and new file additions.
what about container tech?
What ends up happening in that case is that multiple lower directory trees are utilized (the container image layers), and an empty upper directory is put on top of those - this ends up creating a final view of a directory tree that contains the entire container image mounted, but whose writes do not ever mutate the original contents: they end up always going to the upper directory (the “ephemeral storage”).
For instance, let’s say we have a container image that’s made up of two layers:
layer1:             layer2:
/etc                /bin
  myconf.ini          my-binary
With that, a container runtime would then take those two layers as lower directories, create an empty upper dir, and mount that somewhere:
sudo mount \
-t overlay \
overlay \
-o lowerdir=/layer1:/layer2,upperdir=/upper,workdir=/work \
/merged
And then use that /merged as the rootfs of the container.
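With more than a couple of layers, that `-o` string is usually assembled programmatically. A minimal sketch, assuming a hypothetical helper named `overlay_opts` that receives the layers topmost first (overlayfs stacks the `lowerdir` list with the leftmost entry on top and the rightmost at the bottom):

```shell
#!/bin/sh
# overlay_opts UPPER WORK LAYER...   (hypothetical helper)
# Prints the -o string for an overlay mount. Layers are given topmost
# first, which is also the order `lowerdir` expects (leftmost on top).
overlay_opts() {
    upper=$1 work=$2
    shift 2
    lower=
    for layer in "$@"; do
        lower=${lower:+$lower:}$layer    # join with ':' without a leading one
    done
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
}

overlay_opts /upper /work /layer1 /layer2
# the result would then be handed to mount, e.g.:
#   sudo mount -t overlay overlay -o "$(overlay_opts ...)" /merged
```

The sketch does no escaping, so it assumes layer paths without `:` or `,` in them.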
copy-on-write
When in the previous section we did a write to a file in the merged directory that came from the readonly lower directory tree, what happened there was copy-on-write taking place.
Because we can’t modify the files on that layer, overlayfs takes care of reading all of the data (and metadata) from that file, copying it up to the upper directory, and only then presenting to our application the file descriptor that we can use for writes.
When a file in the lower filesystem is accessed in a way the requires write-access, such as opening for write access, changing some metadata etc., the file is first copied from the lower filesystem to the upper filesystem (copy_up).
To see this in practice, consider the following directory configuration that’s a simplification of the previous case.
upper:      lower:      merged:
/           /           /
              a           a (from lower)
Given that the filesystem employs copy-on-write semantics, despite us having the file visible under both ./lower and ./merged (even with two distinct device numbers), in the end there’s no duplication until we decide to change it (“on write”).
More specifically, until we decide to open the file for writing:
open("./merged/a", O_RDWR)
  |
  |
  *--> copy up to `upper`
   --> available for writes w/out change to the file under `lower`.
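To make the sequence concrete, here is a userspace imitation of it using plain directories (no overlayfs involved). It is a sketch under assumptions: `copy_up` is a hypothetical helper that only handles top-level regular files, and staging the copy in a work directory before renaming it into place is meant to illustrate why a workdir on the same filesystem as upper is needed for atomicity.

```shell
#!/bin/sh
# copy_up NAME UPPER LOWER WORK   (hypothetical helper)
# Imitates a copy up for a top-level file NAME: stage a full copy in WORK,
# then rename it into UPPER so it never shows up there half-written.
copy_up() {
    name=$1 upper=$2 lower=$3 work=$4
    [ -e "$upper/$name" ] && return 0    # already lives in upper
    [ -f "$lower/$name" ] || return 1    # nothing to copy
    cp -p "$lower/$name" "$work/$name.tmp" &&
        mv "$work/$name.tmp" "$upper/$name"
}

tmp=$(mktemp -d)
mkdir "$tmp/upper" "$tmp/lower" "$tmp/work"
echo "original" > "$tmp/lower/a"

# "opening for write": copy up first, then write to the copy in upper
copy_up a "$tmp/upper" "$tmp/lower" "$tmp/work"
echo "changed" > "$tmp/upper/a"

cat "$tmp/lower/a"    # lower still holds its original content
cat "$tmp/upper/a"
```

Note that the whole file gets copied regardless of how little is written afterwards, which is where the cost comes from.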
For instance, we can have a 1GB file in a lower dir, then see how we end up paying the price of a copy up whenever we try to open it as read-write.
First, let’s set up the directories (just like we did before):
# create the dirs
#
mkdir ./{merged,work,upper,lower}
# write 1GB to `./lower/a`
#
dd if=/dev/zero of=./lower/a bs=$((1 << 12)) count=$((1 << 18))
# mount
#
sudo mount \
-t overlay \
overlay \
-o lowerdir=./lower,upperdir=./upper,workdir=./work \
./merged
Then, create a program that first opens the file as readonly (O_RDONLY), then “changes its mind”, opening it again, now as read-write (O_RDWR):
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* the 1GB file created above, seen through the overlay mount */
static const char* file = "./merged/a";

int
main(int argc, char** argv)
{
        int fd;

        /* read-only open: no copy up needed */
        fd = openat(AT_FDCWD, file, O_RDONLY);
        if (fd == -1) {
                perror("openat O_RDONLY");
                return 1;
        }

        if (close(fd) == -1) {
                perror("close");
                return 1;
        }

        /* read-write open: forces the copy up to `upper` */
        fd = openat(AT_FDCWD, file, O_RDWR);
        if (fd == -1) {
                perror("openat O_RDWR");
                return 1;
        }

        return close(fd);
}
With the use of strace, we can see the difference in time between opening the file as readonly and read-write:
strace -T -e openat ./a.out
openat(AT_FDCWD, "./merged/a", O_RDONLY) = 3 <0.000110>
openat(AT_FDCWD, "./merged/a", O_RDWR) = 3 <0.297918>
  |    .----------------------------.    | .--------.
  |           arguments passed           | time taken
syscall                               return
“seeing” the price of specific copy ups
It turns out that it’s quite easy to “accidentally” pay the price of a copy up without even knowing - open the file with the wrong flag, and you’ll force a copy up.
For instance, it’s been the case that concourse needs to shift the ownership of all files in a container image if it sees that it needs to translate that image from unprivileged to privileged, forcing every file to be copied up.
While one can initially think that there’s only the cost associated with N chown(2) syscalls taking place (which is already something on its own), in the case of overlayfs, that also means adding the extra disk io for cloning each of those files [3][4].
Using a tool like bpftrace we can see, e.g., how much time we’ve been spending on that very specific case: chown(2)ing.
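Before tracing anything, a rough upper bound on that extra io can be had by summing the sizes of the regular files that would be chowned. A minimal sketch, assuming GNU find (for `-printf`), that no file has been copied up yet, and that metadata-only copy up is off; `copy_up_estimate` is a hypothetical helper name.

```shell
#!/bin/sh
# copy_up_estimate DIR   (hypothetical helper, relies on GNU find's -printf)
# Counts regular files under DIR and sums their sizes, i.e. roughly what a
# recursive chown would have to copy up if none of them were in upper yet.
copy_up_estimate() {
    find "$1" -type f -printf '%s\n' |
        awk '{ n++; bytes += $1 } END { printf "%d files, %.0f bytes\n", n, bytes }'
}

# demo on a throwaway tree standing in for an image layer
layer=$(mktemp -d)
mkdir "$layer/etc"
printf '0123456789' > "$layer/etc/myconf.ini"    # 10 bytes
printf 'abcde' > "$layer/my-binary"              # 5 bytes
copy_up_estimate "$layer"
```

Pointing it at each lower directory of an image gives an idea of how much data is about to be duplicated into upper.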
Given that the copy up of data occurs at ovl_copy_up_data:
static int
ovl_copy_up_data(struct path* old, struct path* new, loff_t len)
{
        // ...
}
we can trace that and discover the whole stack trace that leads to it:
k:ovl_copy_up_data
{
printf("%s", kstack);
}
telling us then:
ovl_copy_up_data+1
ovl_do_copy_up+1578
ovl_copy_up_one+624
ovl_copy_up_flags+167
ovl_copy_up+16
ovl_setattr+78
notify_change+761
chown_common+461
do_fchownat+147
__x64_sys_chown+34
do_syscall_64+90
entry_SYSCALL_64_after_hwframe+68
Now, given that we want to capture those copy ups that originated only from chown(2) syscalls, whose internal implementation delegates to ovl_setattr:
static const struct inode_operations ovl_file_inode_operations = {
.setattr = ovl_setattr,
.permission = ovl_permission,
.getattr = ovl_getattr,
.listxattr = ovl_listxattr,
.get_acl = ovl_get_acl,
.update_time = ovl_update_time,
.fiemap = ovl_fiemap,
};
we can set up one extra hook that records the thread id that’s going through do_fchownat (which, as the stack trace above shows, is what ends up reaching setattr), so that we only capture the data being copied up through chown(2)s.
#!/snap/bin/bpftrace

BEGIN
{
        printf("%-8s %-16s %-16s %-16s %-16s\n",
               "PID", "COMM", "NAME", "LEN(B)", "ELAPSED (ms)");
}

kprobe:do_fchownat
{
        // keep track of the filename
        //
        @fname[tid] = str(arg1);
}

kprobe:ovl_copy_up_data
/ @fname[tid] != "" /
{
        // keep track of the size of the file being copied up
        // as well as its start timestamp (in nanosecs)
        //
        @len[tid] = arg2;
        @start[tid] = nsecs;
}

kretprobe:ovl_copy_up_data
/ @len[tid] != 0 /
{
        // once the copying finished, display the values
        //
        printf("%-8d %-16s %-16s %-16d %d\n",
               pid, comm,
               @fname[tid], @len[tid],
               (nsecs - @start[tid]) / 1000000);

        // free those spaces
        //
        delete(@len[tid]);
}

kretprobe:do_fchownat
{
        // free those spaces
        //
        delete(@fname[tid]);
        delete(@start[tid]);
}
That running, we can see it collecting data in practice:
PID COMM NAME LEN(B) ELAPSED (ms)
12399 sample.out ./merged/a 1073741824 787
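To put a line like that in perspective, the two numeric columns can be turned into a throughput figure. A small sketch, with `copy_rate` being a hypothetical helper that takes the LEN(B) and ELAPSED (ms) values as arguments:

```shell
#!/bin/sh
# copy_rate LEN_BYTES ELAPSED_MS   (hypothetical helper)
# Prints the copy up throughput in MiB/s for one line of the output above.
copy_rate() {
    awk -v len="$1" -v ms="$2" 'BEGIN {
        # small files can round down to 0ms, so there is no rate to compute
        if (ms <= 0) { print "n/a"; exit }
        printf "%.0f MiB/s\n", (len / 1048576) / (ms / 1000)
    }'
}

copy_rate 1073741824 787
```

For the sample above, that is 1024 MiB copied in 0.787s, all of it spent inside a single chown(2) call.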
overlaying overlay trees
What if … we wanted to either share layers that are already in use by other overlay trees, or, perhaps, use the final merged view of an overlay as the lowerdir of our fresh new overlay mount?
Let’s look at these three cases.
using an overlay mount as a lower directory tree
In this scenario, we have a set of directory trees overlaid, forming a final merged view (e.g., a container root filesystem), and we want to have a copy-on-write of that live container root filesystem.
mkdir -p ./vol1/{merged,work,upper,lower}
touch ./vol1/upper/{a,b}
touch ./vol1/lower/{a,c}
sudo mount \
-t overlay \
overlay \
-o lowerdir=./vol1/lower,upperdir=./vol1/upper,workdir=./vol1/work \
./vol1/merged
mkdir -p ./vol2/{merged,work,upper}
sudo mount \
-t overlay \
overlay-overlay \
-o lowerdir=./vol1/merged,upperdir=./vol2/upper,workdir=./vol2/work \
./vol2/merged
What we end up with is the following:
.
├── vol1
│ ├── lower
│ │ ├── a
│ │ └── c
│ ├── merged
│ │ ├── a
│ │ ├── b
│ │ └── c
│ ├── upper
│ │ ├── a
│ │ └── b
│ └── work
│
└── vol2
├── merged
│ ├── a
│ ├── b
│ └── c
├── upper
└── work
We can see how changes in vol1/merged reflect in vol2/merged:
touch vol1/merged/new
ls vol2/merged
a b c new
and that we indeed have a copy-on-write on top of that by trying to write something to that file, and seeing how the underlying file (coming from vol1/merged) does not change (i.e., we wrote to a fresh copy):
echo "heyhey" > vol2/merged/new
cat vol1/merged/new
(empty)
thus, effectively having a copy-on-write of a copy-on-write.
using another overlay’s lower directory as the lower directory tree
This is a case that’s very common with containers - when a container image gets used by multiple containers, what they’re all doing is sharing the same set of lower directories.
using another overlay’s upper directory as the lower directory tree
This is a case where we’d either get undefined behavior, or an EBUSY when performing the mount.
I don’t fully understand why it wouldn’t be supported (as, being a lowerdir, that upper would not get a write …), but that’s the current implementation (you can read more about it in kernel.org’s overlayfs entry in the ‘sharing and copying layers’ section).
mkdir -p ./vol1/{merged,work,upper,lower}
touch ./vol1/upper/{a,b}
touch ./vol1/lower/{a,c}
sudo mount \
-t overlay \
overlay \
-o lowerdir=./vol1/lower,upperdir=./vol1/upper,workdir=./vol1/work \
./vol1/merged
mkdir -p ./vol2/{merged,work,upper}
sudo mount \
-t overlay \
overlay-overlay \
-o lowerdir=./vol1/upper,upperdir=./vol2/upper,workdir=./vol2/work \
./vol2/merged
Which works on my machine, but warns me with:
[143855.222564] overlayfs:
lowerdir is in-use as upperdir/workdir of another mount,
accessing files from both mounts will result in undefined
behavior.
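One way to avoid running into that warning is to check the mount table before reusing a directory as a lowerdir. The sketch below is an assumption-laden illustration: `layer_in_use` is a hypothetical helper, it expects the directory to be spelled exactly as it appears in the mount options (so it works best when mounts are made with absolute paths), and it does not handle paths containing commas or escaped characters.

```shell
#!/bin/sh
# layer_in_use DIR [MOUNTS_FILE]   (hypothetical helper)
# Succeeds if DIR shows up as the upperdir or workdir of an overlay mount
# listed in MOUNTS_FILE (defaults to /proc/self/mounts).
layer_in_use() {
    awk -v dir="$1" '
        $3 == "overlay" {
            # field 4 holds the comma-separated mount options
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "upperdir=" dir || opts[i] == "workdir=" dir)
                    found = 1
        }
        END { exit !found }
    ' "${2:-/proc/self/mounts}"
}

# usage:
#   if layer_in_use /vol1/upper; then echo "refusing to use as lowerdir"; fi
```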
[1] if you want to know more about mount in general, make sure you check out understanding mount namespaces - there’s a section in there where I go deep into how mount works.
[2] as long as it’s not coming from a readonly mount.
[3] it’s possible to avoid the copying of the underlying data when chown(2)ing if using “metadata only copy up”, but we don’t have that activated at the moment. Unfortunately, that only landed in 4.19, which is still not all that disseminated. See “Overlayfs memory usage improvements” in 4.19.
[4] we could in the future leverage shiftfs and avoid the whole chown(2)ing altogether - Canonical has been shipping their kernels with it (mainly for LXD as far as I can tell), which is a good indication that it could come to upstream at some point soon - see “trying out shiftfs”.