Hey,

While Docker takes some time to expose more options in service creation (for instance, limiting the maximum number of PIDs of a service), it's still important that such limits get enforced (at least with a sensible default).

Many docker options are not exposed to docker swarm mode yet - see the issue "add more options to `service create` / `service update`".
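
You can see the gap right from the CLI (a quick check - the exact help text depends on your Docker version):

# Plain `docker run` already exposes a knob for limiting the
# number of PIDs of a container ...
docker run --help | grep -i pids-limit
      --pids-limit int                 Tune container pids limit (set -1 for unlimited)

# ... while `docker service create` (swarm mode) has no equivalent
# flag, so this grep comes back empty.
docker service create --help | grep -i pids-limit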

It turns out, though, that there's a way of adding such functionality without forking SwarmKit and Docker - as long as you're fine with setting a single default: add a modified runtime to containerd (which is transparently managed by docker for you) and then make it the default runtime.

Example of interaction between docker-cli, the docker daemon, and runc

Can you run Docker with a forked runc?

That’s possible because the Docker daemon is not the process that gets executed (in the sense of performing an execve(2)) when a container is meant to run - it delegates the action to containerd, which controls a list of runtimes (runc by default). The chosen runtime (as specified in the configuration parameters) is then responsible for creating a new process with some isolation applied, and only then execveing the entrypoint of that container.

We can understand the interactions between docker, containerd, and runc without even looking at the source code - plain strace and pstree can do the job.

Without any container running, perform a ps aux | grep docker.

ps aux | grep docker
root      1002  /usr/bin/dockerd -H fd://
root      1038  docker-containerd --config /var/run/docker/containerd/containerd.toml

This shows us that we have two daemons running - the docker daemon and the docker-containerd daemon.

Given that dockerd interacts heavily with containerd all the time and the latter is never exposed to the internet, it’s a safe bet that their interface is unix-socket based.

The communication between Docker and containerd

We can check that by looking at the docker-containerd configuration (/var/run/docker/containerd/containerd.toml):

root = "/var/lib/docker/containerd/daemon"
state = "/var/run/docker/containerd/daemon"
# ...

[grpc]
  address = "/var/run/docker/containerd/docker-containerd.sock"
  uid = 0
  gid = 0

Another option is to just grab the docker-containerd PID and inspect its file descriptors:

# Retrieve the PID of `containerd`
CONTAINERD_PID="$(ps -C docker-containerd -o pid= | tr -d ' ')"

# Check which file descriptors are associated with
# that process by looking at procfs
sudo ls -lah /proc/$CONTAINERD_PID/fd
...
 6 -> socket:[17674]
 7 -> socket:[17675]
 8 -> socket:[17676]
 9 -> socket:[18597]

# That doesn't help much, aside from seeing that there are
# some sockets created.
#
# We can gather more information with `lsof` though.
#
# Given that the inodes of the sockets are known, we can filter
# the result to focus on a single socket if we want
sudo lsof -p $CONTAINERD_PID | grep 17674
CMD        FD   TYPE  NODE  NAME
docker-co  6u   unix  17674 /var/run/docker/containerd/docker-containerd.sock type=STREAM

We can see how docker delegates all the work of setting up the container to containerd by looking at the write(2)s performed by docker to containerd’s unix socket right before creating the container.

# Retrieve the PID of `dockerd`
DOCKERD_PID="$(ps -C dockerd -o pid= | tr -d ' ')"
BUFSIZE=1024

# Start strace following all forks and allowing it to
# print big lines (as big as 1024 bytes).
#
# By grepping `mycontainer` we can follow our trace of 
# `HOSTNAME=mycontainer` which should land at some point
# at containerd and in the end go through runc.
sudo strace -f \
        -e write \
        -s $BUFSIZE \
        -p $DOCKERD_PID 2>&1 | \
        grep containerd

[pid  1062] write(11, 
        "...io.containerd.runtime.v1.linux\22P\n!containerd.linux.runc.RuncOptions\22+"
        "\n\vdocker-runc\22\34/var/run/docker/runtime-runc*\340\235\1\n6types.containerd"
        ".io/opencontainers/runtime-spec/1/Spec\22\244\235\1{\"ociVersion\":\"1.0.1\",\"
>>>>>>  \"HOSTNAME=mycontainer\",\"NGINX_VERSION=1.13.9\"],\"cwd\":\"/\",\"capabilities\""
        ":{\"bounding\":[\"CAP_CHOWN\",\"CAP_DAC_OVERRIDE\",\"CA, 16393 <unfinished ...>

So we can see that docker is writing all that configuration to file descriptor 11.

Looking at /proc/$DOCKERD_PID/fd doesn’t reveal much though - it doesn’t tell us what’s at the other end of that socket (just like with plain TCP client-server communication, when you chat over unix sockets the client also creates a socket on its end):

sudo ls -lah /proc/$DOCKERD_PID/fd | grep '11 ->'
11 -> socket:[17757]

To determine what’s at the other end of that socket (the server-side unix socket that sits in the passive, listening state), we can use ss:

# Check which inodes are involved in the established
# connection between a file descriptor 11 in the system
# (we could have more processes with a connection on fd=11,
# as fd numbers are per-process, but that's ok) and another
# peer.
#
# With both inodes in hand we can use `lsof` again to inspect
# what they are
sudo ss -a --unix -p | grep fd=11
Netid  State      Recv-Q Send-Q  Local Address:Port   Peer Address:Port                
u_str  ESTAB      0      0       * 17757              * 17758           users:(("dockerd",pid=1002,fd=11))

# Check what those sockets are all about
sudo lsof -U | grep '17757\|17758'
CMD                PID   FD   TYPE ADDR                INODE  NAME
dockerd            1002  11u  unix 0xffff90edf6db6800  17757 type=STREAM
docker-containerd  1038  10u  unix 0xffff90edf6db6000  17758 /var/run/docker/containerd/docker-containerd.sock type=STREAM

Docker daemon issuing a container creation command to containerd

The instantiation of containers by containerd

In that container configuration (the blob we just saw dockerd write) there’s one line that stands out for what we’re looking for here (changing the default runtime):

containerd.linux.runc.RuncOptions
        \22+\n\vdocker-runc
                \22\34/var/run/docker/runtime-runc*\340\235\1\n6
                types.containerd.io/opencontainers/runtime-spec/1/Spec\22\244\235\1
        {\"ociVersion\":\"1.0.1\"

We can’t see it very well in the strace output, but we can inspect the configuration used by containerd with ctr, the command-line utility that helps us interact with the containerd daemon:

# Check out what the containerd directory structure looks like
tree /var/lib/docker/containerd/daemon
.
├── io.containerd.content.v1.content
│   └── ingest
├── io.containerd.metadata.v1.bolt
│   └── meta.db
├── io.containerd.runtime.v1.linux
│   └── moby                            # <<< THE NAMESPACE
│       └── ac550b5a0083269e9866...     # <<< THE CONTAINER
├── io.containerd.snapshotter.v1.btrfs
└── io.containerd.snapshotter.v1.overlayfs
    └── snapshots


# Gather the containers that have been spawned in the
# moby namespace.
sudo docker-containerd-ctr \
        --namespace moby \
        --address /var/run/docker/containerd/docker-containerd.sock \
        containers ls
CONTAINER        ..   IMAGE    RUNTIME                           
ac550b5a0083269e9..   -        io.containerd.runtime.v1.linux 

# Gather the information related to the runtime of the container.
#
# This should reveal which binary is used to create the actual
# containers.
sudo docker-containerd-ctr \
        --namespace moby  \
        --address /var/run/docker/containerd/docker-containerd.sock \
        containers info \
        ac550b5a0083269e9866e0d868e34e2fd35e1c5c6de31df00f481734d94a3ff7 | \
        jq '.Runtime'
{
  "Name": "io.containerd.runtime.v1.linux",
  "Options": {
    "type_url": "containerd.linux.runc.RuncOptions",
    "value": "Cgtkb2NrZXItcnVuYxIcL3Zhci9ydW4vZG9ja2VyL3J1bnRpbWUtcnVuYw=="
  }
}

# Decode the base64 value such that we can understand
# what's in there
echo "Cgtkb2NrZXItcnVuYxIcL3Zhci9ydW4vZG9ja2VyL3J1bnRpbWUtcnVuYw==" | base64 -d
docker-runc/var/run/docker/runtime-runc
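
The two values come out glued together because they are fields of a serialized protobuf message (the RuncOptions we saw in the strace output). If you happen to have protoc around, you can make that structure explicit:

# Decode the raw protobuf instead of just printing its bytes:
# field 1 is the runtime binary, field 2 is the runtime root.
echo "Cgtkb2NrZXItcnVuYxIcL3Zhci9ydW4vZG9ja2VyL3J1bnRpbWUtcnVuYw==" | \
        base64 -d | \
        protoc --decode_raw
1: "docker-runc"
2: "/var/run/docker/runtime-runc"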

We can now verify that containerd indeed makes use of docker-runc when it initializes a container by, again, using strace, but this time on containerd:

# Trace the execution of the running docker-containerd but
# filter the syscall tracing to only catch the `execve` calls
# such that we can see which process images are being used for
# the new processes.
sudo strace -f \
        -e execve \
        -p $CONTAINERD_PID


# A containerd-shim is created to execute `docker-runc` and keep
# track of its execution, allowing `docker-runc` to exit once the
# container's entrypoint process is running instead of having to
# stay around.
#
# The shim is also responsible for keeping the IO and performing
# some extra cleanup roles if necessary.
#
# Note.: although the shim is spawned by containerd, it can be
# re-parented to the machine's pid 1.
[pid  3749] execve("/usr/bin/docker-containerd-shim", [
        "docker-containerd-shim", 
        "-namespace", "moby", 
        "-workdir", "/var/lib/docker/containerd/daemo"..., 
        "-address", "/var/run/docker/containerd/docke"..., 
        "-containerd-binary", "/usr/bin/docker-containerd", 
        "-runtime-root", "/var/run/docker/runtime-runc"], [/* 7 vars */]) = 0


# docker-runc creates the container from the OCI bundle
[pid  3755] execve("/usr/bin/docker-runc", [
        "docker-runc", 
        "--root", "/var/run/docker/runtime-runc/mob"..., 
        "--log", "/run/docker/containerd/daemon/io"..., 
        "--log-format", "json", 
        "create", 
        "--bundle", "/var/run/docker/containerd/daemo"..., 
        "--pid-file", "/run/docker/containerd/daemon/io"..., 
        "a919d61879fac203b1f2f78ddee3903c"...], [/* 7 vars */]) = 0


# docker-runc then starts the actual container
[pid  3807] execve("/usr/bin/docker-runc", [
        "docker-runc", 
        "--root", "/var/run/docker/runtime-runc/mob"..., 
        "--log", "/run/docker/containerd/daemon/io"..., 
        "--log-format", "json", 
        "start", 
        "a919d61879fac203b1f2f78ddee3903c"...], [/* 7 vars */]) = 0


# the container entrypoint gets executed
[pid  3771] execve("/usr/sbin/nginx", [
        "nginx", 
        "-g", "daemon off;"], [/* 4 vars */] <unfinished ...>



# We can now inspect the process tree of the container's entrypoint
# to see which processes are left (and check at which point in the
# tree the namespaces got changed).
NGINX_PID=3771
sudo pstree \
        --show-pids \
        --ascii \
        --long \
        --ns-changes \
        --show-parents $NGINX_PID

systemd(1)
        ---dockerd(1002)
                ---docker-containerd(1038)
                        ---docker-containerd-shim(3749)
                                ---nginx(3771,ipc,mnt,net,pid,uts)

containerd executing a container

Regardless of what we plug in place of that runc component, as long as it’s an OCI-compliant runtime, a container can be run.
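
To get a feel for what "OCI-compliant" means in practice, we can drive such a runtime by hand, with no dockerd or containerd involved at all. This is just a sketch - it assumes a runc binary on the PATH (docker-runc works too) and uses a busybox image merely as a convenient source for a root filesystem:

# Prepare a bundle directory containing a root filesystem.
mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer
docker export $(docker create busybox) | tar -C rootfs -xf -

# Generate a default OCI runtime spec (config.json) for the bundle.
runc spec

# Run a container straight from that bundle.
sudo runc run mycontainerid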

Forking runc

runc is a regular Go project, so make sure you have what’s needed to build one (set your $GOPATH and the like accordingly).

# Retrieve the `opencontainers/runc` and have it
# under `GOPATH/src`.
#
# This way we can very quickly perform the modifications
# to the source code, compile and see if they indeed 
# work.
go get -v github.com/opencontainers/runc

# Get into the source code destination
cd $GOPATH/src/github.com/opencontainers/runc

# Here you should see `runc` already built.
./runc --help
NAME:
   runc - Open Container Initiative runtime

runc is a command line client for running applications 
packaged according to the Open Container Initiative (OCI) 
format and is a compliant implementation of the Open Container 
Initiative specification.

Although we now have runc built, this version doesn’t have seccomp support (which is activated with an extra build tag).

If you don’t have it yet, gather the dependency (on Ubuntu it’s libseccomp-dev) and then build it with seccomp as a build tag:

# Verify that libseccomp is not being opened when we
# execute ./runc (that is, it's not compiled with
# seccomp support) - nothing shows up, as it's not
# there.
sudo strace -e openat ./runc --help 2>&1 | grep seccomp

# Remove the binary without seccomp
rm ./runc

# Update the package information from all the
# configured sources
sudo apt update -y

# Install the development package of libseccomp
sudo apt install -y \
        libseccomp-dev

# Build `runc` again, but now specifying that we
# want seccomp as well as apparmor.
#
# Note.: the default `BUILDTAGS` in the Makefile
# uses only `seccomp`. Docker requires both though.
make BUILDTAGS='seccomp apparmor'

# Verify that libseccomp is being loaded:
sudo strace -e openat ./runc --help 2>&1 | grep seccomp
openat(AT_FDCWD, 
        "/lib/x86_64-linux-gnu/libseccomp.so.2", 
        O_RDONLY|O_CLOEXEC) = 3

Now that we have a fresh build of runc, we can tell Docker to use our own version instead of docker-runc (the default runtime).

Adding a new runtime to the Docker daemon configuration

Head over to the daemon configuration file (/etc/docker/daemon.json) and add a new field:

{
    "runtimes": {
        "our-runtime": {
            "path": "/usr/local/bin/our-runc"
        }
    }
}

Link the runc generated in $GOPATH/src/github.com/opencontainers/runc to /usr/local/bin/our-runc and then tell dockerd to reload (send a SIGHUP to the dockerd process):

# Link the custom runc to `/usr/local/bin/our-runc`
# such that we can update the binary with a simple
# `make` and not have to copy to `/usr/local/bin`
# all the time.
sudo ln -sf $(realpath ./runc) /usr/local/bin/our-runc

# Tell dockerd to reload
sudo kill -s SIGHUP $(pgrep dockerd)

# Check that the daemon actually got the signal to
# reload and it's really doing it.
# 
# By default, dockerd will output the full configuration
# that it loaded and will be in use now.
#
# ps.: not every configuration will be reloaded with a
#      soft-reload via SIGHUP. Check the docs.
sudo journalctl -u docker.service -f

dockerd[1002]: time="2018-0..." level=info msg="Got signal to reload configuration, reloading from: /etc/docker/daemon.json"
dockerd[1002]: time="2018-0..." level=info msg="Reloaded configuration: {\"mtu\":1500, ...
        \"runtimes\":
                {\"our-runtime\":{\"path\":\"/usr/local/bin/our-runc\"},
                \"runc\":{\"path\":\"docker-runc\"}},
        \"default-runtime\":\"runc\" ...

Parsing the configuration that dockerd showed us (in JSON), we can highlight some things:

---
# If we don't specify a runtime, the runtime named `runc` 
# from the runtimes object is used.
#
# Naturally, this can be configured
default-runtime: 'runc'
runtimes:
  our-runtime:          # our runtime got loaded
    path: '/usr/local/bin/our-runc'
  runc:                 # default `runc` runtime that docker has
    path: 'docker-runc' # $PATH resolution can be performed

# Check in `docker info` if the configuration got
# properly loaded
docker info | \
        grep -i runtime

Runtimes: our-runtime runc
Default Runtime: runc
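
Before modifying anything, it's worth a quick sanity check that the stock build we just linked in can actually run a container (the image choice here is arbitrary):

# Run a throwaway container explicitly asking for our runtime;
# if the binary or the daemon.json entry were broken, this
# would fail right away.
docker run \
        --rm \
        --runtime our-runtime \
        alpine:3.7 \
        echo 'hello from our-runc'
hello from our-runc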

Cool, with that set, let’s modify our-runc and run a container.

Modifying runc to place a default PID limit

Given that Docker swarm mode doesn’t allow us to place a limit on the number of PIDs that a container can hold, we can go directly to runc and modify that:

diff --git a/libcontainer/cgroups/fs/pids.go b/libcontainer/cgroups/fs/pids.go
index f1e37205..418a5152 100644
--- a/libcontainer/cgroups/fs/pids.go
+++ b/libcontainer/cgroups/fs/pids.go
@@ -23,6 +23,11 @@ func (s *PidsGroup) Apply(d *cgroupData) error {
        if err != nil && !cgroups.IsNotFound(err) {
                return err
        }
+
+       if d.config.PidsLimit == 0 {
+               d.config.PidsLimit = 500
+       }
+
        return nil
 }

Now, compile it with make BUILDTAGS='seccomp apparmor' and run a container.

# Run the container specifying that we want `our-runtime` (`our-runc`)
# to be used as the container runtime.
#
# Running it detached so that we can inspect it from this same
# shell right after.
docker run \
        --detach \
        --rm \
        --name c1 \
        --runtime our-runtime \
        nginx:alpine

# Gather the full ID of the container such that we can
# look at the cgroup filesystem and check if the pids
# limit has really been set.
CONTAINER_ID=$(docker inspect c1 | jq -r '.[] | .Id')

# Look at the `pids.max` file to verify that 500 is really
# set - the default value when we don't specify one (per the
# change we made).
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.max
500
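
To see the limit actually bite, we can try to go past it from inside the container (a rough check - the exact error text depends on the shell in the image, but any fork beyond the 500-PID budget fails with EAGAIN):

# Try to create more processes than the limit allows; once the
# container's pids cgroup reaches 500 tasks, further forks are
# denied ("Resource temporarily unavailable").
docker exec c1 sh -c 'for i in $(seq 1 600); do sleep 30 & done'

# While those sleeps are still alive, the task count stays capped
# at the limit we baked into our-runc.
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.current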

That’s it! We’re running a Docker container with a modified runc.

Now, if you’re willing to always run containers with that runc, set the default-runtime property in /etc/docker/daemon.json and you’re good to go.
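
A sketch of what that would look like, reusing the same paths from before:

# Make the custom runtime the default so that a plain `docker run`
# (no --runtime flag) also goes through our-runc.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "default-runtime": "our-runtime",
    "runtimes": {
        "our-runtime": {
            "path": "/usr/local/bin/our-runc"
        }
    }
}
EOF

# Reload the daemon configuration as before - both `runtimes` and
# `default-runtime` are among the settings that dockerd reloads
# on SIGHUP.
sudo kill -s SIGHUP $(pgrep dockerd)

# Verify.
docker info | grep -i 'default runtime'
Default Runtime: our-runtime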

Closing thoughts

It’s great that Docker is modular enough to allow us to perform some quick modifications like this.

Docker being a building block for PaaSes, I guess this modularity is more of a requirement than a feature.

Please let me know if you have any questions or want to point out a mistake I made.

I’m @cirowrc on Twitter,

Have a good one!

finis