Hey,
While Docker takes some time to expose more options for service creation (for instance, limiting the maximum number of PIDs of a service), it’s important that such limits can still be enforced (at least with a sensible default).
Many Docker options are not exposed to Docker swarm mode yet - see the tracking issue add more options to `service create` / `service update`.
It turns out, though, that there’s a way of adding such functionality without forking SwarmKit and Docker just for that - if you’re fine with setting a single default, you can add a modified runtime to containerd (which is transparently managed by Docker for you) and then make it the default runtime.
- Can you run Docker with a forked runc?
- The communication between Docker and ContainerD
- The instantiation of containers by ContainerD
- Forking runc
- Adding a new runtime to the Docker daemon configuration
- Modifying runc to place a default PID limit
- Closing thoughts
Can you run Docker with a forked runc?
That’s possible because the Docker daemon is not the process that gets executed (in the sense of performing an execve) when a container is meant to be run - it delegates the action to containerd, which controls a list of runtimes (runc by default) and is responsible for creating a new process (calling the runtime defined in the configuration parameters) with some isolation, and only then execveing the entrypoint of that container.
We can understand the interactions between docker, containerd, and runc without even looking at the source code - plain strace and pstree can do the job.
Without any container running, perform a ps aux | grep docker:
ps aux | grep docker
root 1002 /usr/bin/dockerd -H fd://
root 1038 docker-containerd --config /var/run/docker/containerd/containerd.toml
This shows us that we have two daemons running - the docker daemon and the docker-containerd daemon.
Given that dockerd interacts heavily with containerd all the time and the latter is never exposed to the internet, it makes sense to bet that its interface is unix-socket based.
The communication between Docker and ContainerD
We can check that by looking at the docker-containerd configuration (/var/run/docker/containerd/containerd.toml):
root = "/var/lib/docker/containerd/daemon"
state = "/var/run/docker/containerd/daemon"
# ...
[grpc]
address = "/var/run/docker/containerd/docker-containerd.sock"
uid = 0
gid = 0
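To confirm that there’s really a unix socket listening at that address, a quick check with ss works (a sketch - the inode number will naturally differ on your machine):
# List the listening (passive) unix sockets and filter
# for the address we saw in containerd.toml
sudo ss -l --unix | grep docker-containerd
u_str LISTEN 0 128 /var/run/docker/containerd/docker-containerd.sock 17674 * 0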
Another option is to just grab the docker-containerd PID and inspect its file descriptors:
# Retrieve the PID of `containerd`
CONTAINERD_PID="$(ps -C docker-containerd -o pid= | tr -d ' ')"
# Check which file descriptors are associated with
# that process by looking at procfs
sudo ls -lah /proc/$CONTAINERD_PID/fd
...
6 -> socket:[17674]
7 -> socket:[17675]
8 -> socket:[17676]
9 -> socket:[18597]
# That doesn't help much, aside from seeing that there are
# some sockets created.
#
# We can gather more information by inspecting it
# with `lsof` though.
#
# Given that the inodes of the sockets are known, we can
# filter the result to focus on a single socket if we want
sudo lsof -p $CONTAINERD_PID | grep 17674
CMD FD TYPE NODE NAME
docker-co 6u unix 17674 /var/run/docker/containerd/docker-containerd.sock type=STREAM
We can see how docker delegates all the work of setting up the container to containerd by looking at the write(2)s performed by docker to containerd’s unix socket right before creating the container.
# Retrieve the PID of `dockerd`
DOCKERD_PID="$(ps -C dockerd -o pid= | tr -d ' ')"
BUFSIZE=1024
# Start strace following all forks and allowing it to
# print big lines (as big as 1024 bytes).
#
# By grepping `mycontainer` we can follow our trace of
# `HOSTNAME=mycontainer` which should land at some point
# at containerd and in the end go through runc.
sudo strace -f \
-e write \
-s $BUFSIZE \
-p $DOCKERD_PID 2>&1 | \
grep containerd
[pid 1062] write(11,
"...io.containerd.runtime.v1.linux\22P\n!containerd.linux.runc.RuncOptions\22+"
"\n\vdocker-runc\22\34/var/run/docker/runtime-runc*\340\235\1\n6types.containerd"
".io/opencontainers/runtime-spec/1/Spec\22\244\235\1{\"ociVersion\":\"1.0.1\",\"
>>>>>> \"HOSTNAME=mycontainer\",\"NGINX_VERSION=1.13.9\"],\"cwd\":\"/\",\"capabilities\""
":{\"bounding\":[\"CAP_CHOWN\",\"CAP_DAC_OVERRIDE\",\"CA, 16393 <unfinished ...>
So we can see that docker is writing all that config to file descriptor 11.
Looking at /proc/$DOCKERD_PID/fd doesn’t reveal much though - it doesn’t tell us what the other end of that socket is (just like with plain TCP client-server communication, when you chat over unix sockets, clients also create a socket on their end):
sudo ls -lah /proc/$DOCKERD_PID/fd | grep '11 ->'
11 -> socket:[17757]
To determine the other end of that socket (the server-side unix socket, the one in a passive listening state), we can use ss:
# Check out which inodes are involved in the established
# connection that has file descriptor 11 on one end (we
# could have more processes with a connection on fd=11 as
# file descriptors are per-process, but that's ok).
#
# With both inodes in hand we can use `lsof` again to
# inspect what they are
sudo ss -a --unix -p | grep fd=11
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
u_str ESTAB 0 0 * 17757 * 17758 users:(("dockerd",pid=1002,fd=11))
# Check what those sockets are all about
sudo lsof -U | grep '17757\|17758'
CMD PID FD TYPE ADDR INODE NAME
dockerd 1002 11u unix 0xffff90edf6db6800 17757 type=STREAM
docker-containerd 1038 10u unix 0xffff90edf6db6000 17758 /var/run/docker/containerd/docker-containerd.sock type=STREAM
The instantiation of containers by ContainerD
In that {CONTAINER_CONFIG} there’s one line that stands out for what we’re looking for here (changing the default runtime):
containerd.linux.runc.RuncOptions
\22+\n\vdocker-runc
\22\34/var/run/docker/runtime-runc*\340\235\1\n6
types.containerd.io/opencontainers/runtime-spec/1/Spec\22\244\235\1
{\"ociVersion\":\"1.0.1\"
We can’t see it very well in the strace output, but we can inspect it by looking at the configuration used by containerd, making use of ctr, the command-line utility that helps us interact with the containerd daemon:
# Check out what the containerd directory structure looks like
tree /var/lib/docker/containerd/daemon
.
├── io.containerd.content.v1.content
│ └── ingest
├── io.containerd.metadata.v1.bolt
│ └── meta.db
├── io.containerd.runtime.v1.linux
│ └── moby # <<< THE NAMESPACE
│ └── ac550b5a0083269e9866... # <<< THE CONTAINER
├── io.containerd.snapshotter.v1.btrfs
└── io.containerd.snapshotter.v1.overlayfs
└── snapshots
# Gather the containers that have been spawned in the
# moby namespace.
sudo docker-containerd-ctr \
--namespace moby \
--address /var/run/docker/containerd/docker-containerd.sock \
containers ls
CONTAINER .. IMAGE RUNTIME
ac550b5a0083269e9.. - io.containerd.runtime.v1.linux
# Gather the information related to the runtime of the container.
#
# This should reveal which binary is used to create the
# actual containers.
sudo docker-containerd-ctr \
--namespace moby \
--address /var/run/docker/containerd/docker-containerd.sock \
containers info \
ac550b5a0083269e9866e0d868e34e2fd35e1c5c6de31df00f481734d94a3ff7 | \
jq '.Runtime'
{
"Name": "io.containerd.runtime.v1.linux",
"Options": {
"type_url": "containerd.linux.runc.RuncOptions",
"value": "Cgtkb2NrZXItcnVuYxIcL3Zhci9ydW4vZG9ja2VyL3J1bnRpbWUtcnVuYw=="
}
}
# Decode the base64 value such that we can understand
# what's in there
echo "Cgtkb2NrZXItcnVuYxIcL3Zhci9ydW4vZG9ja2VyL3J1bnRpbWUtcnVuYw==" | base64 -d
docker-runc/var/run/docker/runtime-runc
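Given that the value is just a raw protobuf message, we can go a bit further and decode it field by field. A sketch, assuming you have protoc available (its schema-less --decode_raw mode prints bare field numbers):
# Decode the raw protobuf payload without knowing its schema
echo "Cgtkb2NrZXItcnVuYxIcL3Zhci9ydW4vZG9ja2VyL3J1bnRpbWUtcnVuYw==" | \
base64 -d | \
protoc --decode_raw
1: "docker-runc"
2: "/var/run/docker/runtime-runc"
That matches the RuncOptions we spotted in the strace output - the runtime binary name and its root directory.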
We can now verify that containerd indeed makes use of docker-runc when it initializes a container by, again, using strace - but this time on containerd:
# Trace the execution of the running docker-containerd but
# filter the syscall tracing to only catch the `execve` calls
# such that we can see which process images are being used for
# the new processes.
sudo strace -f \
-e execve \
-p $CONTAINERD_PID
# A containerd-shim is created to execute `docker-runc` and keep
# track of the container, allowing `docker-runc` to start the
# container entrypoint process and not have to stay around
# after that.
#
# It's also responsible for keeping the IO and performing some extra
# cleanup roles if necessary.
#
# Note.: although this is spawned by containerd, it can be
# reparented to the machine's pid 1.
[pid 3749] execve("/usr/bin/docker-containerd-shim", [
"docker-containerd-shim",
"-namespace", "moby",
"-workdir", "/var/lib/docker/containerd/daemo"...,
"-address", "/var/run/docker/containerd/docke"...,
"-containerd-binary", "/usr/bin/docker-containerd",
"-runtime-root", "/var/run/docker/runtime-runc"], [/* 7 vars */]) = 0
# docker-runc starts the process of creating the OCI container bundle
[pid 3755] execve("/usr/bin/docker-runc", [
"docker-runc",
"--root", "/var/run/docker/runtime-runc/mob"...,
"--log", "/run/docker/containerd/daemon/io"...,
"--log-format", "json",
"create",
"--bundle", "/var/run/docker/containerd/daemo"...,
"--pid-file", "/run/docker/containerd/daemon/io"...,
"a919d61879fac203b1f2f78ddee3903c"...], [/* 7 vars */]) = 0
# docker-runc then starts the actual container
[pid 3807] execve("/usr/bin/docker-runc", [
"docker-runc",
"--root", "/var/run/docker/runtime-runc/mob"...,
"--log", "/run/docker/containerd/daemon/io"...,
"--log-format", "json",
"start",
"a919d61879fac203b1f2f78ddee3903c"...], [/* 7 vars */]) = 0
# the container entrypoint gets executed
[pid 3771] execve("/usr/sbin/nginx", [
"nginx",
"-g", "daemon off;"], [/* 4 vars */] <unfinished ...>
# We can now inspect the process tree of the container's entrypoint
# to see which processes are left (and check what is the first process
# when the namespaces got changed).
NGINX_PID=3771
sudo pstree \
--show-pids \
--ascii \
--long \
--ns-changes \
--show-parents $NGINX_PID
systemd(1)
---dockerd(1002)
---docker-containerd(1038)
---docker-containerd-shim(3749)
---nginx(3771,ipc,mnt,net,pid,uts)
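Another way of checking those namespace transitions (a sketch, using plain procfs - namespaces show up as symlinks whose inode numbers identify them):
# Entries with inode numbers different from our own shell's
# live in different namespaces.
sudo ls -l /proc/$NGINX_PID/ns
# Compare against our own shell's namespaces
ls -l /proc/self/ns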
Regardless of what we use as that runc component, if it’s an OCI-compliant runtime, then a container can be run.
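In fact, nothing stops us from skipping docker and containerd entirely and handing a bundle straight to runc. A minimal sketch, assuming a runc binary on $PATH and using a Docker image just to populate the root filesystem:
# Prepare an OCI bundle: a rootfs plus a config.json
mkdir -p mybundle/rootfs
docker export $(docker create nginx:alpine) | tar -C mybundle/rootfs -xf -
cd mybundle
# Generate a default config.json (spec) for the bundle
runc spec
# Run the container directly through the runtime
sudo runc run mycontainer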
Forking runc
runc is a regular Go project, so make sure you have what’s needed to build one (set your $GOPATH and the like accordingly).
# Retrieve the `opencontainers/runc` and have it
# under `GOPATH/src`.
#
# This way we can very quickly perform the modifications
# to the source code, compile and see if they indeed
# work.
go get -v github.com/opencontainers/runc
# Get into the source code destination
cd $GOPATH/src/github.com/opencontainers/runc
# Here you should see `runc` already built.
./runc --help
NAME:
runc - Open Container Initiative runtime
runc is a command line client for running applications
packaged according to the Open Container Initiative (OCI)
format and is a compliant implementation of the Open Container
Initiative specification.
Although with that we have runc already built, this version doesn’t have seccomp support (which is activated with an extra build tag). If you don’t have the dependency yet, gather it (on Ubuntu it’s libseccomp-dev) and then build runc with seccomp as a build tag:
# Verify that no libseccomp is being opened when we
# execute ./runc (that is, it's not compiled with
# support for seccomp) - nothing shows up as it's
# not there.
sudo strace -e openat ./runc --help 2>&1 | grep seccomp
# Remove the binary without seccomp
rm ./runc
# Update the package information from all the
# configured sources
sudo apt update -y
# Install the development package of libseccomp
sudo apt install -y \
libseccomp-dev
# Build `runc` again, but now specifying that we
# want seccomp as well as apparmor.
#
# Note.: the default `BUILDTAGS` in the Makefile
# uses only `seccomp`. Docker requires both though.
make BUILDTAGS='seccomp apparmor'
# Verify that libseccomp is being loaded:
sudo strace -e openat ./runc --help 2>&1 | grep seccomp
openat(AT_FDCWD,
"/lib/x86_64-linux-gnu/libseccomp.so.2",
O_RDONLY|O_CLOEXEC) = 3
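An alternative check (a sketch): the seccomp build tag pulls in libseccomp via cgo, so the dependency also shows up in the binary's dynamic linking information:
# List the dynamic dependencies of the binary
ldd ./runc | grep seccomp
libseccomp.so.2 => /lib/x86_64-linux-gnu/libseccomp.so.2 (0x...)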
Now that we have a fresh build of runc, we can tell Docker to use our own version instead of docker-runc (the default runtime).
Adding a new runtime to the Docker daemon configuration
Head over to the daemon configuration file (/etc/docker/daemon.json) and add a new field:
{
"runtimes": {
"our-runtime": {
"path": "/usr/local/bin/our-runc"
}
}
}
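In case the runtime needs extra arguments, the same object also takes a runtimeArgs list. A sketch - --debug here is just an illustrative runc flag:
{
  "runtimes": {
    "our-runtime": {
      "path": "/usr/local/bin/our-runc",
      "runtimeArgs": ["--debug"]
    }
  }
}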
Link the runc generated in $GOPATH/src/github.com/opencontainers/runc to /usr/local/bin/our-runc and then tell dockerd to reload (send a SIGHUP to the dockerd process):
# Link the custom runc to `/usr/local/bin/our-runc`
# such that we can update the binary with a simple
# `make` and not have to copy to `/usr/local/bin`
# all the time.
sudo ln -sf $(realpath ./runc) /usr/local/bin/our-runc
# Tell dockerd to reload
sudo kill -s SIGHUP $(pgrep dockerd)
# Check that the daemon actually got the signal to
# reload and it's really doing it.
#
# By default, dockerd will output the full configuration
# that it loaded and will be in use now.
#
# ps.: not every configuration will be reloaded with a
# soft-reload via SIGHUP. Check the docs.
sudo journalctl -u docker.service -f
dockerd[1002]: time="2018-0..." level=info msg="Got signal to reload configuration, reloading from: /etc/docker/daemon.json"
dockerd[1002]: time="2018-0..." level=info msg="Reloaded configuration: {\"mtu\":1500, ...
\"runtimes\":
{\"our-runtime\":{\"path\":\"/usr/local/bin/our-runc\"},
\"runc\":{\"path\":\"docker-runc\"}},
\"default-runtime\":\"runc\" ...
Parsing the configuration that dockerd showed us (in JSON), we can highlight a few things:
---
# If we don't specify a runtime, the runtime named `runc`
# from the runtimes object is used.
#
# Naturally, this can be configured
default-runtime: 'runc'
runtimes:
our-runtime: # our runtime got loaded
path: '/usr/local/bin/our-runc'
runc: # default `runc` runtime that docker has
path: 'docker-runc' # $PATH resolution can be performed
# Check in `docker info` if the configuration got
# properly loaded
docker info | \
grep -i runtime
Runtimes: our-runtime runc
Default Runtime: runc
Cool, with that set, let’s modify our-runc and run a container.
Modifying runc to place a default PID limit
Given that Docker swarm mode doesn’t allow us to place a limit on the number of PIDs that a container can hold, we can go directly to runc and modify that:
diff --git a/libcontainer/cgroups/fs/pids.go b/libcontainer/cgroups/fs/pids.go
index f1e37205..418a5152 100644
--- a/libcontainer/cgroups/fs/pids.go
+++ b/libcontainer/cgroups/fs/pids.go
@@ -23,6 +23,11 @@ func (s *PidsGroup) Apply(d *cgroupData) error {
if err != nil && !cgroups.IsNotFound(err) {
return err
}
+
+ if d.config.PidsLimit == 0 {
+ d.config.PidsLimit = 500
+ }
+
return nil
}
Now, compile it with make BUILDTAGS='seccomp apparmor' and run a container.
# Run the container specifying that we want `our-runtime`
# (`our-runc`) to be used as the container runtime.
#
# Note.: `--detach` keeps our terminal free for the
# inspection that follows.
docker run \
--detach \
--rm \
--name c1 \
--runtime our-runtime \
nginx:alpine
# Gather the full ID of the container such that we can
# look at the cgroup filesystem and check if the pids
# limit has really been set.
CONTAINER_ID=$(docker inspect c1 | jq -r '.[] | .Id')
# Look at the `pids.max` key to verify if the 500
# is really set - the default value when we don't
# specify (per the changes we did).
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.max
500
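To see the limit actually being enforced (and not just reported), we can try to go past it. A sketch - the exact error message depends on the shell shipped in the image:
# Try to create more processes than the limit allows - once
# the cgroup reaches 500 pids, fork(2) starts failing with
# EAGAIN.
docker exec c1 sh -c 'for i in $(seq 1 600); do sleep 30 & done'
sh: can't fork: Resource temporarily unavailable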
That’s it! We’re running a Docker container with a modified runc.
Now, if you’re willing to always run containers with that runc, set the default-runtime property in /etc/docker/daemon.json and you’re good to go.
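That could look something like the following (a sketch of /etc/docker/daemon.json):
{
  "default-runtime": "our-runtime",
  "runtimes": {
    "our-runtime": {
      "path": "/usr/local/bin/our-runc"
    }
  }
}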
Closing thoughts
It’s great that Docker is modular enough to allow us to perform some quick modifications like this.
Docker being this kind of building block for PaaSes, I guess this is more of a requirement than a feature.
Please let me know if you have any questions or want to point out a mistake I made.
I’m @cirowrc on Twitter.
Have a good one!
finis