Hey,

I never really got to see, under the hood, how kubelet turns Kubernetes secrets into something that a process in a node can consume - so that’s all that this article is about.

First, we go through a quick review of the architecture, explore how the values are stored in Kubernetes' datastore, and then finally, look into how the kubelet deals with secrets in a node.

ps.: all of the exploration below was done on linux 5.3 (ubuntu eoan - 19.10), kubernetes v1.17, using microk8s.

creating a secret

From the documentation:

[…] create a Secret in a file first, in json or yaml format, and then create that object. It contains two maps: data and stringData.

The data field is used to store arbitrary data, encoded using base64.

The stringData field is provided for convenience, and allows you to provide secret data as unencoded strings.

kubernetes.io: creating a secret manually

Thus, we first define the object that represents what we want to store, then let apiserver deal with it.

In this case, I’m going with the stringData convenience to not have to fill the data field with base64-encoded text (for simplicity’s sake).

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
      namespace: test
    type: Opaque
    stringData:
      foo: bar
      caz: baz
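
Those stringData entries end up stored under data as their base64 encodings. Just to make the equivalence concrete, here’s what the corresponding data values would be (plain base64, nothing Kubernetes-specific about the commands):

    echo -n 'bar' | base64    # YmFy  (what `data.foo` ends up holding)
    echo -n 'baz' | base64    # YmF6  (what `data.caz` ends up holding)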

the architecture

Once the secret gets accepted by apiserver, it has to be stored somewhere: in etcd.

    CONTROL PLANE


            etcd            (cluster datastore)
              |
            apiserver-+-.   (rest iface)
                      | |
                      | |
                      | |
    NODES             | |
                      | |
                      | |
            kubelet---' |   (node agent)
                        |
                        |
                        |
    USER                |
                        |
            kubectl-----'

That means that in the example above, we went from a secret definition that we (the user) had locally, submitted it to apiserver through kubectl, and apiserver, being the component that talks to etcd, asked it to persist the object in the etcd cluster.

Let’s see what that looks like from the perspective of etcd.

etcd - where it’s all stored

As I’m running microk8s, where all of the components end up running on the same host, we can learn how to reach etcd by inspecting how its process is currently running:

    cat /proc/$(pidof etcd)/cmdline | tr '\000' '\n'
    /snap/microk8s/1107/etcd
    --data-dir=/var/snap/microk8s/common/var/run/etcd
    --advertise-client-urls=https://10.158.0.3:12379
    --listen-client-urls=https://0.0.0.0:12379              << (!)
    --client-cert-auth                                      << (!)
    --trusted-ca-file=/var/snap/microk8s/1107/certs/ca.crt
    --cert-file=/var/snap/microk8s/1107/certs/server.crt
    --key-file=/var/snap/microk8s/1107/certs/server.key

From that output, we can already tell at least two things: etcd is configured with client-cert-auth (meaning that client certificates presented on incoming HTTPS requests will be validated against the trusted CA), and we can connect to it through any interface on port 12379 (more on etcd's transport security model).

As in microk8s the etcd process lives in the host network namespace, we should be able to reach it at 127.0.0.1 without any trouble:

    sudo ss -atlnp | grep etcd
    LISTEN     127.0.0.1:2380     users:(("etcd",pid=18093,fd=5))
    LISTEN             *:12379    users:(("etcd",pid=18093,fd=6))

However, before we proceed, we need to check where those certificates live (as they’ll be verified when connecting).

Given that apiserver needs to connect to etcd itself, we can do the same thing we did for etcd: inspect the cmdline of kube-apiserver and figure out where those certificates are.

    cat /proc/$(pidof kube-apiserver)/cmdline | tr '\000' '\n'
    /snap/microk8s/1107/kube-apiserver
    --cert-dir=/var/snap/microk8s/1107/certs
    --client-ca-file=/var/snap/microk8s/1107/certs/ca.crt
    --etcd-cafile=/var/snap/microk8s/1107/certs/ca.crt
    --etcd-certfile=/var/snap/microk8s/1107/certs/server.crt
    --etcd-keyfile=/var/snap/microk8s/1107/certs/server.key
    --etcd-servers=https://127.0.0.1:12379
    ...
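
Before wiring those into etcdctl, it’s also worth checking which hostnames and IPs the server certificate was issued for - this is why 127.0.0.1 (and not localhost) shows up in the endpoint below (see also the footnotes):

    openssl x509 -in /var/snap/microk8s/1107/certs/server.crt -noout -text \
            | grep -A1 'Subject Alternative Name'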

With those certificates, we can then use etcdctl (etcd’s helper command line tool) to communicate with etcd [1].

Having the ETCDCTL_ environment variables set, we can, e.g., retrieve all of what’s stored under / (root) [2]:

    export ETCDCTL_API=3
    export ETCDCTL_ENDPOINTS=https://127.0.0.1:12379
    export ETCDCTL_CACERT=/var/snap/microk8s/1107/certs/ca.crt
    export ETCDCTL_CERT=/var/snap/microk8s/1107/certs/server.crt
    export ETCDCTL_KEY=/var/snap/microk8s/1107/certs/server.key

    etcdctl get '/' --prefix=true -w json > ./kv.json

The output is essentially:

    [
      "count",      -- number of entries under `kvs`
      "header",     -- cluster id, member id, raft term, and revision
      "kvs"         -- the entries themselves
    ]

Having each entry as:

    {
      "key"                         -- b64 key
      "create_revision"             -- number
      "mod_revision"                -- number
      "version"                     -- number
      "value"                       -- b64 value
    }
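
As a quick sanity check on that dump (assuming the kv.json file we just generated), jq can pull the header and the count out directly:

    jq '{revision: .header.revision, count: .count}' ./kv.json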

From which we can, for instance, consume all of the keys in a readable format:

    #!/bin/bash

    set -o errexit
    set -o pipefail

    # the dump we generated earlier with `etcdctl get '/' --prefix=true -w json`
    readonly filepath=./kv.json

    for key in $(jq -r '.kvs[].key' "$filepath"); do
            printf "%s\n" "$(base64 --decode <<< "$key")"
    done
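
A small variation of that loop narrows the listing down to secret objects only (same kv.json dump assumed):

    jq -r '.kvs[].key' ./kv.json \
            | while read -r key; do base64 --decode <<< "$key"; echo; done \
            | grep '^/registry/secrets/'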

Allowing us to discover that the path to our secret (set in the beginning of the article) is the following:

    /registry/secrets/test/mysecret

Which we can use to retrieve the values [3]:

    etcdctl get /registry/secrets/test/mysecret | tail -n 2 | jq '.'
    {
      "kind": "Secret",
      "apiVersion": "v1",
      "metadata": {
        "name": "mysecret",
        "namespace": "test",
        "uid": "b8386740-4872-45e7-89f0-0d50c8450b00",
        "creationTimestamp": "2019-12-18T13:45:27Z",
        "annotations": {
          "..."
        }
      },
      "data": {             << (!)
        "caz": "YmF6",
        "foo": "YmFy"
      },
      "type": "Opaque"
    }

While it might sound surprising that we can see the contents in plain text, that’s because encryption at rest was not enabled here - it’s possible to turn it on, though; see encrypting data.
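
A quick way to check whether that’s configured on a given cluster (based on the flag that kube-apiserver takes for it) is to look for the encryption provider config in its cmdline:

    cat /proc/$(pidof kube-apiserver)/cmdline | tr '\000' '\n' \
            | grep encryption-provider-config \
            || echo "no encryption at rest configured"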

Now that we know where those bits live, let’s see how we can use them in an application.

consuming the secret

From a consumer’s perspective:

Secrets can be mounted as data volumes or be exposed as environment variables to be used by a container in a pod.

kubernetes.io: using secrets

Let’s do that with a pod definition that leverages both the environment variable form (i), as well as mounting the secret as a file through volumes (ii):

    apiVersion: v1
    kind: Pod
    metadata:
      name: mypod
      namespace: test
    spec:
      containers:
        - name: mypod
          image: busybox
          command: [ /bin/sleep, 33d ]
          env:
            - name: FOO
              valueFrom:
                secretKeyRef:               # << (i)
                  name: mysecret
                  key: foo
          volumeMounts:
            - name: foo                     # << (ii)
              mountPath: /mnt/foo
      volumes:
        - name: foo
          secret:
            secretName: mysecret
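
Assuming the definition above is saved as pod.yaml, submitting it and then checking both forms from the outside would look something like this:

    kubectl apply -f ./pod.yaml

    # (i)  the environment variable form
    kubectl -n test exec mypod -- env | grep FOO
    FOO=bar

    # (ii) the volume form - one file per key under the mountPath
    kubectl -n test exec mypod -- ls /mnt/foo
    caz
    foo
    kubectl -n test exec mypod -- cat /mnt/foo/foo
    bar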

With that submitted to apiserver, let’s dig into how kubelet deals with it once the pod gets scheduled to the node it’s responsible for.

environment variables

As at some point kubelet will hand off to the implementor of the container runtime interface (CRI) to go from a definition to an actual container, there must be a place where the translation from secret reference to environment variable happens.

    k8s.io/kubernetes/pkg/kubelet/secret.(*secretManager).GetSecret
    k8s.io/kubernetes/pkg/kubelet.(*Kubelet).makeEnvironmentVariables << (!)
    k8s.io/kubernetes/pkg/kubelet.(*Kubelet).GenerateRunContainerOptions
    k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).generateContainerConfig
    k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).startContainer
    k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).SyncPod
    k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncPod
    k8s.io/kubernetes/pkg/kubelet.(*podWorkers).managePodLoop
    k8s.io/kubernetes/pkg/kubelet.(*podWorkers).UpdatePod

Exploring the kubernetes/kubernetes codebase (kubelet, more specifically), we can see that taking place in makeEnvironmentVariables:

    for _, envVar := range container.Env {
            if envVar.ValueFrom != nil {
                    switch {
                    case envVar.ValueFrom.SecretKeyRef != nil:
                            // `name` and `key` come from envVar.ValueFrom.SecretKeyRef
                            // (error handling elided for brevity)
                            secret, ok := secrets[name]
                            if !ok {
                                    secret, err = kl.secretManager.GetSecret(pod.Namespace, name)
                                    secrets[name] = secret
                            }
                            runtimeValBytes, ok := secret.Data[key]
                    }
            }
    }
    // ...

There, we have the resolution of the secret variables as necessary. With all of that resolved, the full definition of a container is then submitted to the CRI during startContainer’s call to runtimeService.CreateContainer.

In the case of containerd, we can see that the variable gets properly replaced by using ctr to visualize the OCI bundle that got created for that container:

    "ociVersion": "1.0.1-dev",
    "process": {
        "user": {
            "uid": 0,
            "gid": 0,
            "additionalGids": [
                10
            ]
        },
        "args": [
            "/bin/sleep",
            "33d"
        ],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "HOSTNAME=mypod",
            "FOO=bar",      << (!)
            "KUBERNETES_PORT_443_TCP_PROTO=tcp",

Which pretty much means that at the process level, there’ll be no difference between a variable that’s coming from a secret and any other environment variable - they’ll all be treated the same:

    cat /proc/2882/environ | tr '\000' '\n'
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    HOSTNAME=mypod
    FOO=bar
    KUBERNETES_PORT_443_TCP_PROTO=tcp
    ...

One implication of this is that anyone who can read from that procfs entry (either someone from the host pid namespace, or a process sharing that namespace) can read those secrets (by default, the pid namespace is not shared between the containers in a pod - see share pid namespace between containers in a pod).

updates

When the value of a key that has been referenced in a pod definition changes, for environment variables, there’s no subsequent update in the processes' environment.

That’s mostly because there’s no such thing as a system call for mutating another process’s environment variables (just like there’s no syscall to retrieve them either - /proc/pid/environ only exposes the initial set).

For instance, consider the following program: it starts, waits for a signal, and, once the signal comes, changes its set of environment variables and then goes back to waiting:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    void handle_sig(int sig) {}

    void wait_a_bit() {
            if (signal(SIGINT, handle_sig) == SIG_ERR) {
                    perror("signal");
                    exit(1);
            }

            pause();
    }

    /**
     * Start the binary w/ the initial set of env vars passed from
     * our shell, then sleep, then have an env var set, then sleep again.
     */
    int main(int argc, char** argv)
    {
            wait_a_bit();

            if (setenv("foo", "bar", 0) == -1) {    // << (!)
                    perror("setenv");
                    return 1;
            }

            wait_a_bit();
            return 0;
    }

If we run this program, we can head to /proc/$pid/environ and see the env vars. Now, send a SIGINT to it, and check /proc/$pid/environ again: no changes.

    TERM1                                   TERM2

    ./main.out
                                            cat /proc/$(pidof main.out)/environ
                                                    (no `foo` set)
    kill -s SIGINT $(pidof main.out)
       -> `setenv(foo,bar)` triggers

                                            cat /proc/$(pidof main.out)/environ
                                                    (no `foo` set).

That’s because the initial set of environment variables was placed on the program’s stack and is never changed again - what setenv (from libc) ends up doing under the hood is taking that initial set and modifying a copy of it in the program’s heap, where it can make it grow and shrink however it wants (while still pointing __environ at the location where environment variables should be reachable from, according to the LSB).

    int
    __add_to_environ(const char* name,
                     const char* value,
                     const char* combined,
                     int         replace)
    {

            // ...

            char** new_environ;

            /* We allocated this space; we can extend it.  */
            new_environ = (char**)realloc(last_environ, (size + 2) * sizeof(char*));
            if (new_environ == NULL) {
                    UNLOCK;
                    return -1;
            }

            if (__environ != last_environ)
                    memcpy(
                      (char*)new_environ, (char*)__environ, size * sizeof(char*));

            // ...
    }

(see glibc’s __add_to_environ method, called by setenv under the hood)

That’s all to say that it’d be very hard for something like Kubernetes to modify those variables in a way that doesn’t break the application. [4]

volumes

While in the case of an environment variable it’s all about replacing the reference with the value in the set of environment variables that processes started in the container will inherit, in the case of volumes the contents of files in the filesystem must be populated (aside from honoring the declarative ask for mounts).

from kubernetes' point of view

Pretty much all of that work happens in secretVolumeMounter.SetUpAt, where the following sequence of steps occurs:

  1. retrieve the secret

  2. mount an empty dir backed by memory (see the sketch after this list)

  3. ensure nested “mountpoints” exist (literally, os.MkdirAll)

  4. write the secret’s keys to files in the volume

  5. set up the ownership accordingly
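
Roughly, steps 2 and 4 boil down to something we could mimic by hand from the node - a sketch only, with an illustrative pod UID and the paths that kubelet happens to use on this setup:

    # hypothetical pod uid, purely for illustration
    pod_uid=6f7e1acc-0000-0000-0000-000000000000
    voldir=/var/lib/kubelet/pods/$pod_uid/volumes/kubernetes.io~secret/foo

    mkdir -p $voldir
    mount -t tmpfs tmpfs $voldir            # step 2: memory-backed empty dir

    printf 'bar' > $voldir/foo              # step 4: one file per secret key
    printf 'baz' > $voldir/caz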

on the node

In step 2 above, a mount takes place.

Having access to the node, we can mimic a kubectl exec by joining the same namespaces that the container that we created uses:

  1. search for the process that we just created (/bin/sleep)

  2. get inside its pid and mount namespaces

     nsenter \
             --pid=/proc/$(pidof sleep)/ns/pid \
             --mount=/proc/$(pidof sleep)/ns/mnt \
             /bin/sh
    

From there, we can gather some more information about the mount:

    cat /proc/self/mountinfo | grep foo
    1248 1229 0:93 / /mnt/foo ro,relatime - tmpfs tmpfs rw
    |      |  |    | |        |           |  |       |   |
    |      |  |    | |        |           |  |       |  per superblock opts 
    |      |  |    | |        |           |  |      mount source
    |      |  |    | |        |           |  filesystem type
    |      |  |    | |        |           just a separator  
    |      |  |    | |        per-mount options
    |      |  |    | mount point 
    |      |  |    pathname of the dir in the fs forming the root of this mount 
    |      |  major:minor for files in this fs   
    |      id of the parent mount  
    unique mount id  

What this means is that from the perspective of the container, the mountpoint /mnt/foo where the secret lives is read-only.

From the perspective of the host, that’s not the case:

    ... /var/lib/kubelet/pods/6f7e1acc.../volumes/kubernetes.io~secret/foo rw,relatime
                                                                           |
                                                                           read-write
                                                                           for us!

Which makes it possible for kubelet to update the bytes of the files under that mountpoint freely.

Given that in this case there is an interface for performing updates (i.e., kubelet writing to a file, and a process in the container detecting the change), for volumes, updates to secrets do become visible to the containers that have those secrets mounted (inotify works on tmpfs).
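
One way to watch that happening (assuming inotify-tools is available where you run the watch - the busybox image above doesn’t ship it, so the host-side mountpoint is the easier spot - and assuming the original manifest is still around as secret.yaml):

    # 1. watch the volume directory from the host ($POD_UID is a placeholder)
    inotifywait -m /var/lib/kubelet/pods/$POD_UID/volumes/kubernetes.io~secret/foo &

    # 2. change `foo: bar` in the manifest from the beginning of the article
    #    and re-submit it
    kubectl apply -f ./secret.yaml

    # 3. after kubelet's next sync, events show up and the file reflects
    #    the new value (inside the container, /mnt/foo/foo changes too)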


  1. you can get etcdctl from the tarball that etcd releases: https://github.com/etcd-io/etcd/releases

  2. we’re not using localhost there, but instead, 127.0.0.1:$port, because the certificate has a list of possible names and IPs that could be used, and localhost is not one of them. To see which ones you can use in your case, check the certificate via openssl x509 -in $cert -text.

  3. apiserver might store the contents of those values either as protobuf (which will contain some values that look like gibberish) or as json. You can change that by tweaking apiserver’s --storage-media-type flag to application/json if you want to achieve the same results.

  4. while the act of writing can be hard, reading those shouldn’t be all that difficult - ptraceing the process (e.g., with gdb) would give you access to __environ, but, again, only if it’s a compliant implementation.