Hey,
When recommending that folks switch their baggageclaim driver over to overlay (backed by overlayfs) after seeing that btrfs would just not work for their use cases, we’ve always been telling the story that for privileged containers (of any kind - tasks, gets, etc.), an increase in container startup time would be expected, but we never really went deep into why that’s the case.
It turns out that there’s quite a bit of cool container tech involved, so here’s a detailed look at why that happens (at least for now).
In the first sections we go through the basics of how the permission system works with regard to a process interacting with a file, and how namespaces can make that seemingly different for userspace.
As we build up our knowledge, we get closer and closer to examples that simulate how containers work, until we get to see it all in practice with Concourse.
ps.: all of the exploration below was done on kernel 5.0.0-32-generic on Ubuntu Disco (19.04).
Table of Contents
- user and group identifiers
- creating users and switching between them
- permission system
- capabilities
- user namespaces
- permission checking inside a user namespace
- user namespaces capabilities
- concourse volumes and the uid dance
- closing thoughts
- extra
user and group identifiers
When executing a process, Linux associates with it a set of numeric user and group identifiers (UIDs and GIDs, respectively) that can be divided into a few classes:
- real: identifies the user and group to which the process belongs. Inherited from the parent, unless changed otherwise.
- effective: used to determine the permissions granted when trying to perform various operations. It can differ from real either through certain syscalls, or set-user-ID and set-group-ID programs.
- saved: implicitly used to allow a program running with elevated privileges to temporarily run something with fewer privileges.
- filesystem: the identifiers used when performing file access checks (practically, the same as effective).
We can check what these are for a process by either using system calls (getuid(2), geteuid(2), or the more generic getresuid(2))
NAME
getresuid, getresgid - get real, effective and saved user/group IDs
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int getresuid(uid_t *ruid, uid_t *euid, uid_t *suid);
int getresgid(gid_t *rgid, gid_t *egid, gid_t *sgid);
or, by asking /proc about process information:
cat /proc/self/status | grep Uid
Uid: 1001 1001 1001 1001
| | | |
| | | filesystem
| | saved
| effective
real
As this state (process credentials) is supposed to be used for performing permission checking when a process tries to access a file, it must exist “somewhere”.
Under the hood (at the kernel level), these can be seen in the data structure that holds the security context, struct cred, which is associated with the one that holds the process information (struct task_struct).
creating users and switching between them
It turns out that from a kernel perspective, users do not need to be created: to make a process have a different user, just call setuid(2) with a different number - as long as you have the capability to do so, that’s all you need.
Naturally, that might not be enough for a good user experience in some cases, but for access control at the kernel level, it’s all that’s necessary.
To see that in practice, consider my current setup where, from the perspective of /etc/passwd, 32 users exist:
# read from `/etc/passwd` (the text file that describes user login
# accounts in the system), looking at the third column (where UIDs are)
#
$ cat /etc/passwd | wc -l
32
$ cat /etc/passwd | awk -F ':' '{print $3}'
0
1
2
...
109
1000
999
65534
1001
998
Despite those 32 users, we can craft a program that leverages way more users than just that.
ps.: such a program needs CAP_SETUID to work as a non-root user.
With 64 users, we get exactly that - 64 different UIDs being shown after uid 10001.
The big thing here is to recall that the mapping between user IDs and other information (name, home directory, etc.) is done entirely in user space, managed by programs like your shell, etc.
permission system
As mentioned before, when accessing a regular file (from a regular file system), a process needs to go through a set of credential checks before it can even get a handle to read the contents from.
The overall (simplified) operation looks like the following.
userspace: process (w/ a set of credentials)
requests opening of a file
|
-----------------(syscall)--------------
|
kernel: do_sys_open
do_filp_open
generic_permission
acl_permission_check?
capable_wrt_inode_uidgid?
First, a userspace process tries to open a certain file (issuing, for instance, openat(2)).
At the kernel level, do_sys_open - the top-level common method used by both open(2) and openat(2) - gets called, receiving pretty much the same arguments that openat(2) was supplied with.
While do_sys_open is not all that interesting by itself, it calls do_filp_open, whose responsibility is allocating the main data structure that represents a file open for someone to operate on - struct file.
But before such a struct can be allocated, and before we can get back to the userspace process telling it that the file was successfully opened, we first need to go through permission checks, which happen a few functions further down the road:
generic_permission
do_inode_permission (inlined)
inode_permission+58
may_open.isra.63+94
do_last (inlined)
path_openat+670
do_filp_open+147 -----^
do_sys_open+375
__x64_sys_openat+32
do_syscall_64+90
entry_SYSCALL_64_after_hwframe+68
Reaching generic_permission, we finally find what we’ve been searching for - the permission checking [1].
acl_permission_check is where the traditional discretionary access control (DAC) happens, verifying if the current process' filesystem user id matches the inode's uid, checking group membership, etc.
Taking a look at the method responsible for DAC, we can see that struct inode is where the information with regards to “who owns this file” lives.
This makes sense as the inode is meant to contain all the information needed by the filesystem to handle a file.
capabilities
Checking for the ability to open a file involves not only discretionary access control on a per-user and per-group basis, but also verifying the capabilities that a given process effectively has.
This is the mechanism that allows a root user to get access to any file in the system, and do whatever it wants to it.
As we saw in the previous section, if we relied solely on UIDs and GIDs for permission checks, a root user (whose UID is 0) would not match a file with a non-0 UID (e.g., root with UID 0 wouldn’t be able to open a file with UID 1000, as uid_eq(current_fsuid(), inode->i_uid) would be false).
The next check that’s performed after DAC, capable_wrt_inode_uidgid, verifies whether the process (which has a given set of effective capabilities) is capable of performing the desired action - in this case, reading the file, a check that can be bypassed via CAP_DAC_READ_SEARCH and/or CAP_DAC_OVERRIDE.
CAP_DAC_OVERRIDE - Bypass file read, write, and execute permission checks.
CAP_DAC_READ_SEARCH - Bypass file read permission checks and directory read and execute permission checks.
from capabilities(7)
Capabilities are essentially how a root user ends up being able to read any file: although acl_permission_check fails, capable_wrt_inode_uidgid succeeds, as root has the full set of capabilities.
1: a given filesystem might implement its own permission checking. generic is just one (common) way of doing so.
user namespaces
When Concourse runs an unprivileged container, i.e., either a step that’s not marked with privileged: true or a step from a resource-type that’s not marked as privileged, the process that runs within that container may see that its UID is set to 0 (resembling a privileged process), but it turns out that when performing actual filesystem checks, that process does not have that UID.
This is possible due to the use of user namespaces.
User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs, the root directory, keys, and capabilities (see capabilities(7)).
A process’s user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.
from user_namespaces(7)
While this might sound terrifying at first when it comes to permission checking for filesystem interactions, it’s really not once we see how, under the hood, the kernel works out what a user (regardless of the uid that it sees within the user namespace) actually maps to.
Before we get to an illustration of that in practice, let’s first see how we can get a user namespace that lets us have UID 0 within it, mapping to a non-0 outside.
We can create a program that tries to open a file in different ways, and then see how that all plays out.
A call to clone(2) or unshare(2) with the CLONE_NEWUSER flag makes the new child process (for clone(2)) or the caller (for unshare(2)) a member of the new user namespace created by the call.
After the creation of a new user namespace, the uid_map file of one of the processes in the namespace may be written to once to define the mapping of user IDs in the new user namespace.
I.e., to fully utilize its uid mapping capabilities, we need to go through two steps:
- create a process having the CLONE_NEWUSER bit set in the flags passed to clone(2)
- write to /proc/pid/uid_map the mapping of UIDs from outside the user namespace to inside it (e.g., “0 inside means 1000 outside”)
Once the write to /proc/pid/uid_map is finished, the process has fully switched credentials.
Let’s start by creating the process with the CLONE_NEWUSER flag.
Now, in the child process, write to the /proc/$pid/uid_map file corresponding to itself (self), specifying two pieces of information:
- that 0 inside that namespace corresponds to 1001 outside
- that there’s only one user inside that namespace that maps to the outside (thus, no UID other than 0 is allowed there).
Once the new process finishes specifying the uid mapping inside the namespace (by writing to /proc/self/uid_map), we can see how it shows (from the inside) that its UID is 0 there, but not from the outside:
# create the process with CLONE_NEWUSER and then show the proc
# information (pid and effective UID)
#
./userns
process pid 16889
process euid 0
# from the host user namespace, ask the kernel what the different types
# of user IDs look like
#
cat /proc/16889/status | grep Uid
Uid: 1001 1001 1001 1001
Knowing that, let’s see how permission checking ends up happening inside the user namespace then.
ps.: you can see the full source code here: cirocosta/userns-sample
permission checking inside a user namespace
As we uncovered, despite the impression from within the namespace that the user has UID 0, in reality it’s just 1001 under the hood. This implies that permission checks might not behave the way one would expect: having UID 0 inside does not mean that you can read a file owned by UID 0 outside that namespace.
When a process accesses a file, its user and group IDs are mapped into the initial user namespace for the purpose of permission checking and assigning IDs when creating a file.
When a process retrieves file user and group IDs via stat(2), the IDs are mapped in the opposite direction, to produce values relative to the process user and group ID mappings.
Let’s see that in practice.
Consider the following directory from the perspective of the host:
ls -n ./samples/
-r-------- 1 0 1002 0 Nov 24 16:10 0.txt
-r-------- 1 1000 1002 0 Nov 24 16:10 1000.txt
-r-------- 1 1001 1002 0 Nov 24 16:10 1001.txt
| | |
only owner UID |
can read GID
Having a user namespace where the mapping is 0 1001 1, that is:
- 0 inside mapping to 1001 outside
- allowing only a single user mapping
we can see that the view changes quite a bit:
"unknown"
.--> /proc/sys/kernel/overflowuid
|
-r-------- 1 65534 0 0 Nov 24 16:10 0.txt
-r-------- 1 65534 0 0 Nov 24 16:10 1000.txt
-r-------- 1 0 0 0 Nov 24 16:10 1001.txt
|
*--> for us, the illusion that it's owned by 0
--> it's actually just 1001 outside.
As we expect, given that the actual underlying UIDs only match for 1001.txt, that’s the only file that we can read:
root $ cat ./0.txt
Permission denied
root $ cat ./1000.txt
Permission denied
root $ cat ./1001.txt
Hi!
In order to determine permissions when an unprivileged process accesses a file, the process credentials (UID, GID) and the file credentials are in effect mapped back to what they would be in the initial user namespace and then compared to determine the permissions that the process has on the file.
from user_namespaces(7)
user namespaces capabilities
Someone who went through capabilities(7) might realize that something is off here - a process having CAP_DAC_OVERRIDE (which users inside the user namespace can have) is supposed to be able to read any file. Isn’t that a contradiction to the purpose of having user namespaces in the first place?
It turns out that it isn’t - when we look closely at the place where the check for capabilities happens (after the DAC check), we can see that it takes the current user namespace into consideration.
Doing so, it’s able to validate if the capability that we’re checking against can be used against a given inode (which, if it comes from, let’s say, a true ‘root’ in the host, it cannot).
concourse volumes and the uid dance
When a pipeline has a job with a get step that’s supposed to retrieve contents as implemented by a resource type (for instance, git-resource), the output of that action (in the case of git, a repository) gets materialized in the filesystem in the form of a “volume”, which can then be mounted inside a container for consumption (e.g., running the tests that a git repository has).
By default, the volumes that get produced are “unprivileged” - the files in such a volume are owned by a user whose UID is not 0.
For instance, if I go into the worker’s work directory and look at an unprivileged volume, we can see that the contents there all belong to a user that’s not UID 0 from our (host) perspective:
root@cc-1:/concourse-state/volumes/live/d805cac3-9ad3...bd5/volume# ls -ln ./etc/
mode uid gid
drwxr-xr-x 1 4294967294 4294967294 X11
-rw-r--r-- 1 4294967294 4294967294 adduser.conf
drwxr-xr-x 1 4294967294 4294967294 alternatives
drwxr-xr-x 1 4294967294 4294967294 apt
-rw-r--r-- 1 4294967294 4294967294 bash.bashrc
drwxr-xr-x 1 4294967294 4294967294 bash_completion.d
-rw-r--r-- 1 4294967294 4294967294 bindresvport.blacklist
drwxr-xr-x 1 4294967294 4294967294 ca-certificates
-rw-r--r-- 1 4294967294 4294967294 ca-certificates.conf
drwxr-xr-x 1 4294967294 4294967294 cron.daily
-rw-r--r-- 1 4294967294 4294967294 debconf.conf
We can see how that works when we look at the definition of an OCI-compliant container (like the ones that gdn creates when using runc under the hood):
{
"ociVersion": "1.0.0",
"linux": {
"uidMappings": [
{ // make 0 inside not 0 outside
"hostID": 4294967294,
"containerID": 0,
"size": 1
},
{ // let any others just work
"hostID": 1,
"containerID": 1,
"size": 4294967293
}
],
"namespaces": [
{ "type": "network" },
{ "type": "pid" },
{ "type": "uts" },
{ "type": "ipc" },
{ "type": "mount" },
{ "type": "user" }
],
"user": { // run this process (/tmp/gdn-init) as 0
"uid": 0, (inside)
"gid": 0
},
// ...
When we switch to a privileged container though, things change - the process is no longer isolated with a user namespace, and thus no uidMappings are specified: a 0 inside is a 0 outside.
However, because by default Concourse produces volumes that are expected to be used by an unprivileged container, it shifts all of those files with UID 0 to that maximum ID (4294967294), so that when an unprivileged container needs to use the volume, no extra work is needed.
In the current implementation, if a privileged container needs to use it, the translation needs to occur before the container gets started. That means that every single file whose owner is the maximum ID ends up having to be translated to UID 0 (i.e., if you have 3000 files, we need to perform 3000 chown(2) calls to change the ownership of the files). Depending on how the filesystem handles that, it can be quite expensive.
closing thoughts
It was pretty nice to get a deeper understanding into how the whole permissions checking system works, and have some of the multiuser concepts ingrained - I never really understood very well how all of that works (user namespaces, even less).
It’s exciting to see that the work we started on getting containerd as the backend will allow us to have more control over all of these details, moving us to a place where things like vito/oci-build-task can run without extra privileges.
extra
using unshare to verify the behaviors
Although during the entire article I relied on my own tools (custom C code developed just for this), it turns out that you can test all of these things (with some limitations) using unshare(1).