Hey,
I never really got to see under the hood how kubelet turns Kubernetes secrets into something that a process in a node can consume, so that’s all that this article is about.
First, we go through a quick review of the architecture, explore how the values are stored in Kubernetes' datastore, and then finally, look into how the kubelet deals with secrets in a node.
ps.: all of the exploration below was done on Linux 5.3 (Ubuntu eoan - 19.10), Kubernetes v1.17, using microk8s.
creating a secret
From the documentation:
[…] create a Secret in a file first, in json or yaml format, and then create that object. It contains two maps: data and stringData. The data field is used to store arbitrary data, encoded using base64. The stringData field is provided for convenience, and allows you to provide secret data as unencoded strings.

kubernetes.io: creating a secret manually
Thus, we first define the object that represents what we want to store, then let apiserver deal with it.

In this case, I’m going with the stringData convenience so I don’t have to fill the data field with base64-encoded text (for simplicity’s sake).
apiVersion: v1
kind: Secret
metadata:
name: mysecret
namespace: test
type: Opaque
stringData:
foo: bar
caz: baz
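For reference, here’s a minimal client-go sketch of doing the same thing programmatically instead of with kubectl apply - it assumes a recent client-go (v0.18+ signatures) and a kubeconfig at the default location, so treat it as an illustration rather than part of the setup:

package main

import (
	"context"
	"fmt"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// build a client from the local kubeconfig (assumption: ~/.kube/config)
	config, err := clientcmd.BuildConfigFromFlags("",
		filepath.Join(homedir.HomeDir(), ".kube", "config"))
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "mysecret", Namespace: "test"},
		Type:       corev1.SecretTypeOpaque,
		// stringData is a write-only convenience: apiserver merges it into
		// `data` (base64-encoded) before persisting the object
		StringData: map[string]string{"foo": "bar", "caz": "baz"},
	}

	created, err := clientset.CoreV1().Secrets("test").Create(
		context.TODO(), secret, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}

	// reading the object back shows only `data`, already base64-encoded
	fmt.Printf("data: %q\n", created.Data)
}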
the architecture
Having the secret accepted by apiserver, it got stored somewhere: in etcd.
CONTROL PLANE

              etcd                (cluster datastore)
               |
          apiserver-+-.           (rest iface)
                    | |
                    | |
                    | |
NODES               | |
                    | |
                    | |
          kubelet---' |           (node agent)
                      |
                      |
                      |
USER                  |
                      |
          kubectl-----'
That means that in the example above, we went from the definition of a secret that we (the user) had locally, which, through the use of kubectl, we submitted to apiserver, which, having contact with etcd, asked it to persist the object in the etcd cluster.

Let’s see what that looks like from the perspective of etcd.
etcd - where it’s all stored
As I’m running microk8s, where all of the components end up running on the same host, we can learn how to reach it by inspecting how the etcd process is currently running:
cat /proc/$(pidof etcd)/cmdline | tr '\000' '\n'
/snap/microk8s/1107/etcd
--data-dir=/var/snap/microk8s/common/var/run/etcd
--advertise-client-urls=https://10.158.0.3:12379
--listen-client-urls=https://0.0.0.0:12379 << (!)
--client-cert-auth << (!)
--trusted-ca-file=/var/snap/microk8s/1107/certs/ca.crt
--cert-file=/var/snap/microk8s/1107/certs/server.crt
--key-file=/var/snap/microk8s/1107/certs/server.key
From that output, we can already tell at least two things: etcd
is configured
with client-cert-auth
(meaning that all incoming HTTPS requests will be
checked against the trusted CA), and that we can connect to it through any
interface on port 12379
(more on etcd's transport security model
).
As in microk8s the etcd process lives in the host network namespace, we should be able to reach it at 127.0.0.1 without any trouble:
sudo ss -atlnp | grep etcd
LISTEN 127.0.0.1:2380 users:(("etcd",pid=18093,fd=5))
LISTEN *:12379 users:(("etcd",pid=18093,fd=6))
However, before we proceed, we need to check where those certificates live (as they’ll be verified when connecting).

Given that apiserver needs to connect to etcd, we can do the same as we did for etcd: inspect the cmdline of kube-apiserver and figure out where those certificates are.
cat /proc/$(pidof kube-apiserver)/cmdline | tr '\000' '\n'
/snap/microk8s/1107/kube-apiserver
--cert-dir=/var/snap/microk8s/1107/certs
--client-ca-file=/var/snap/microk8s/1107/certs/ca.crt
--etcd-cafile=/var/snap/microk8s/1107/certs/ca.crt
--etcd-certfile=/var/snap/microk8s/1107/certs/server.crt
--etcd-keyfile=/var/snap/microk8s/1107/certs/server.key
--etcd-servers=https://127.0.0.1:12379
...
With those certificates, we can then use etcdctl (etcd’s helper command line tool) to communicate with etcd1.

Having the ETCDCTL_ environment variables set, we can, e.g., retrieve all of what’s stored under / (root)2:
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:12379
export ETCDCTL_CACERT=/var/snap/microk8s/1107/certs/ca.crt
export ETCDCTL_CERT=/var/snap/microk8s/1107/certs/server.crt
export ETCDCTL_KEY=/var/snap/microk8s/1107/certs/server.key
etcdctl get '/' --prefix=true -w json > ./kv.json
The output is essentially:
[
"count", -- number of entries under `kvs`
"header", -- cluster id, member id, raft term, and revision
"kvs" -- the entries themselves
]
Having each entry as:
{
"key" -- b64 key
"create_revision" -- number
"mod_revision" -- number
"version" -- number
"value" -- b64 value
}
From which we can, for instance, consume all of the keys in a readable format:
#!/bin/bash

set -o errexit
set -o pipefail

# the file we dumped the `etcdctl get` output to
readonly filepath=./kv.json

for key in $(jq -r '.kvs[].key' "$filepath"); do
	printf "%s\n" "$(base64 --decode <<< "$key")"
done
Allowing us to discover that the path to our secret (set in the beginning of the article) is the following:
/registry/secrets/test/mysecret
Which we can use to retrieve the values3:
etcdctl get /registry/secrets/test/mysecret | tail -n 2 | jq '.'
{
"kind": "Secret",
"apiVersion": "v1",
"metadata": {
"name": "mysecret",
"namespace": "test",
"uid": "b8386740-4872-45e7-89f0-0d50c8450b00",
"creationTimestamp": "2019-12-18T13:45:27Z",
"annotations": {
"..."
}
},
"data": { << (!)
"caz": "YmF6",
"foo": "YmFy"
},
"type": "Opaque"
}
While it might sound surprising that we can see the contents right there (the values are just base64-encoded, not encrypted), that’s because we didn’t enable encryption at rest - it’s possible to do so, though; see encrypting data.
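For completeness, the same retrieval can be done programmatically with etcd’s Go client - a minimal sketch, assuming the v3.4-era import paths and the same microk8s certificate locations used above (the value printed is whatever apiserver stored, so it may be protobuf rather than json - see footnote 3):

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/pkg/transport"
)

func main() {
	// same client certificates that apiserver uses to talk to etcd
	tlsInfo := transport.TLSInfo{
		CertFile:      "/var/snap/microk8s/1107/certs/server.crt",
		KeyFile:       "/var/snap/microk8s/1107/certs/server.key",
		TrustedCAFile: "/var/snap/microk8s/1107/certs/ca.crt",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		panic(err)
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:12379"},
		DialTimeout: 5 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// the key under which apiserver persisted our secret
	resp, err := cli.Get(context.TODO(), "/registry/secrets/test/mysecret")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s\n", kv.Value)
	}
}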
Now that we know where those bits live, let’s see how we can use them in an application.
consuming the secret
From a consumer’s perspective:
Secrets can be mounted as data volumes or be exposed as environment variables to be used by a container in a pod.
kubernetes.io: using secrets
Let’s do that with a pod definition that leverages both the environment variable form (i), as well as mounting the secret as a file through volumes (ii):
apiVersion: v1
kind: Pod
metadata:
name: mypod
namespace: test
spec:
containers:
- name: mypod
image: busybox
command: [ /bin/sleep, 33d ]
env:
- name: FOO
valueFrom:
secretKeyRef: # << (i)
name: mysecret
key: foo
volumeMounts:
- name: foo # << (ii)
mountPath: /mnt/foo
volumes:
- name: foo
secret:
secretName: mysecret
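Once this pod is up, both forms will look completely ordinary to the application - a minimal sketch of what a consumer inside the container could do (the env var name, mount path, and key names come from the manifest above):

package main

import (
	"fmt"
	"os"
)

func main() {
	// (i) the secret key injected as a plain environment variable
	fmt.Println("FOO (env):", os.Getenv("FOO"))

	// (ii) the secret volume: one file per key under the mountPath
	b, err := os.ReadFile("/mnt/foo/foo")
	if err != nil {
		panic(err)
	}
	fmt.Println("foo (file):", string(b))
}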
With that submitted to apiserver, let’s dig deep into how kubelet deals with it once the pod gets scheduled to the node that it manages.
environment variables
As at some point kubelet
will hand off to the implementor of the container
runtime interface (CRI) to go from a definition to an actual container, there
must be a place where the translation from secret reference to environment
variable happens.
k8s.io/kubernetes/pkg/kubelet/secret.(*secretManager).GetSecret
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).makeEnvironmentVariables << (!)
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).GenerateRunContainerOptions
k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).generateContainerConfig
k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).startContainer
k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).SyncPod
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncPod
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).managePodLoop
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).UpdatePod
Exploring the kubernetes/kubernetes
codebase (kubelet
, more specifically),
we can see that taking place in makeEnvironmentVariables
:
for _, envVar := range container.Env {
	if envVar.ValueFrom != nil {
		switch {
		case envVar.ValueFrom.SecretKeyRef != nil:
			name := envVar.ValueFrom.SecretKeyRef.Name
			key := envVar.ValueFrom.SecretKeyRef.Key

			// fetch the secret (per-pod cache first, secretManager otherwise)
			secret, ok := secrets[name]
			if !ok {
				secret, err = kl.secretManager.GetSecret(pod.Namespace, name)
				secrets[name] = secret
			}

			// the actual bytes that will back the env var's value
			runtimeValBytes, ok := secret.Data[key]
			// ...
		}
	}
}
There, we have the resolution of the secret variables as necessary. With all of
that resolved, the full definition of a container is then submitted to the CRI
during startContainer
’s call to runtimeService.CreateContainer
.
In the case of containerd
, we can see that the variable gets properly replaced
by using ctr
to visualize the OCI bundle that got created for that container:
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0,
"additionalGids": [
10
]
},
"args": [
"/bin/sleep",
"33d"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=mypod",
"FOO=bar", << (!)
"KUBERNETES_PORT_443_TCP_PROTO=tcp",
Which pretty much means that at the process level, there’ll be no difference between a variable that’s coming from a secret and any other environment variable - they’ll all be treated the same:
cat /proc/2882/environ | tr '\000' '\n'
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=mypod
FOO=bar
KUBERNETES_PORT_443_TCP_PROTO=tcp
...
One implication of this is that anyone who can read from that procfs
entry
(either someone from the host pid namespace, or a process sharing that
namespace) can read those secrets (by default, the pid namespace is not shared
between the containers in a pod - see share pid namespace between containers in
a pod).
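To make the implication concrete, here’s a minimal sketch (assuming we run in the host pid namespace with enough privileges to read other processes’ /proc entries) that hunts for the secret value across every process’ environment:

package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	needle := []byte("FOO=bar") // the env var that came from `mysecret`

	entries, err := os.ReadDir("/proc")
	if err != nil {
		panic(err)
	}

	for _, entry := range entries {
		environ := filepath.Join("/proc", entry.Name(), "environ")

		// fails for non-pid entries and for processes we can't inspect
		b, err := os.ReadFile(environ)
		if err != nil {
			continue
		}

		if bytes.Contains(b, needle) {
			fmt.Println("secret found in", environ)
		}
	}
}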
updates
When the value of a key referenced in a pod definition changes, for environment variables there’s no subsequent update to the processes’ environment.

That’s mostly because there’s no such thing as a system call for mutating another process’s environment variables (just as there’s no syscall to retrieve them either, despite it being possible to see the initial set through /proc/pid/environ).
For instance, consider the following program, whose behavior is to start, wait for a signal, and then, when the signal comes, change its own set of environment variables and sleep again:
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
void handle_sig(int sig) {}
void wait_a_bit() {
if (signal(SIGINT, handle_sig) == SIG_ERR) {
perror("signal");
exit(1);
}
pause();
}
/**
* Start the binary w/ the initial set of env vars passed from
* our shell, then sleep, then have an env var set, then sleep again.
*/
int main(int argc, char** argv)
{
wait_a_bit();
if (!~setenv("foo", "bar", 0)) { // << (!)
perror("setenv");
return 1;
}
wait_a_bit();
return 0;
}
If we run this program, we can head to /proc/$pid/environ and see the env vars. Now, send a SIGINT to it, and check /proc/$pid/environ again: no changes.
TERM1                                       TERM2

./main.out
                                            cat /proc/$(pidof main.out)/environ
                                            (no `foo` set)

                                            kill -s SIGINT $(pidof main.out)
-> `setenv(foo,bar)` triggers
                                            cat /proc/$(pidof main.out)/environ
                                            (no `foo` set)
That’s because the initial set of environment variables was placed on the program’s stack and is never touched again - what setenv (from libc) ends up doing under the hood is taking that initial set and modifying a copy in the program’s heap, where it can grow and shrink it however it wants (while still always pointing __environ to the location where environment variables should be reachable from, according to the LSB).
int
__add_to_environ(const char* name,
const char* value,
const char* combined,
int replace)
{
// ...
char** new_environ;
/* We allocated this space; we can extend it. */
new_environ = (char**)realloc(last_environ, (size + 2) * sizeof(char*));
if (new_environ == NULL) {
UNLOCK;
return -1;
}
if (__environ != last_environ)
memcpy(
(char*)new_environ, (char*)__environ, size * sizeof(char*));
// ...
}
(see glibc
’s
__add_to_environ
method, called by setenv
under the hood)
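The same divergence is easy to reproduce without C - in a Go program, os.Setenv updates the copy of the environment that the runtime manages (and that children would inherit), while /proc/self/environ keeps showing only the initial set. A small sketch to illustrate the point:

package main

import (
	"bytes"
	"fmt"
	"os"
)

func main() {
	if err := os.Setenv("foo", "bar"); err != nil {
		panic(err)
	}

	// the process' own view reflects the change...
	fmt.Println("os.Getenv(foo):", os.Getenv("foo"))

	// ...but procfs still only has the set passed at exec time
	b, err := os.ReadFile("/proc/self/environ")
	if err != nil {
		panic(err)
	}
	fmt.Println("foo in /proc/self/environ?",
		bytes.Contains(b, []byte("foo=bar")))
}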
That’s all to say that it’d be very hard for something like Kubernetes to modify those variables in a way that doesn’t break the application.4
volumes
While in the case of an environment variable it’s all about replacing the reference with the value in the set of environment variables that processes started in the container will inherit, in the case of volumes the contents of files in the filesystem must actually be populated (aside from the mounts being declared).
from kubernetes' point of view
Pretty much all of that work happens in secretVolumeMounter.SetUpAt, where the following sequence of steps occurs:
1. retrieve the secret (this could’ve been already cached)
2. mount an empty dir backed by memory - it does so by “wrapping” the emptydir volume source implementation:

   k8s.io/kubernetes/pkg/volume/emptydir.(*emptyDir).setupTmpfs+0
   k8s.io/kubernetes/pkg/volume/secret.(*secretVolumeMounter).SetUpAt+1335
   k8s.io/kubernetes/pkg/volume/secret.(*secretVolumeMounter).SetUp+155

   mount("tmpfs", "/var/lib/kubelet/pods/7b8.../volumes/kubernetes.io~secret/foo", "tmpfs", ..., "") = 0

3. ensure nested “mountpoints” exist (literally, os.MkdirAll)
4. write to the files in the volume
5. set up the ownership accordingly
on the node
During step 2, we saw that a mount occurs.
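For illustration, that mount boils down to something like the sketch below (the target directory is hypothetical, standing in for the per-pod kubernetes.io~secret path, and it needs privileges to mount):

package main

import (
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// hypothetical target, standing in for the per-pod secret volume dir
	target := "/tmp/secret-volume-demo"
	if err := os.MkdirAll(target, 0o700); err != nil {
		panic(err)
	}

	// tmpfs is memory-backed, so the secret contents never hit the node's disk
	if err := unix.Mount("tmpfs", target, "tmpfs", 0, ""); err != nil {
		panic(err)
	}
}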
Having access to the node, we can mimic a kubectl exec by joining the same namespaces that the container we created uses:

- search for the process that we just created (/bin/sleep)
- get inside its pid and mount namespaces

nsenter \
  --pid=/proc/$(pidof sleep)/ns/pid \
  --mount=/proc/$(pidof sleep)/ns/mnt \
  /bin/sh
From there, we can gather some more information about the mount:
cat /proc/self/mountinfo | grep foo
1248 1229 0:93 / /mnt/foo ro,relatime - tmpfs tmpfs rw
|    |    |    | |        |           | |     |     |
|    |    |    | |        |           | |     |     per superblock opts
|    |    |    | |        |           | |     mount source
|    |    |    | |        |           | filesystem type
|    |    |    | |        |           just a separator
|    |    |    | |        per-mount options
|    |    |    | mount point
|    |    |    pathname of the dir in the fs forming the root of this mount
|    |    major:minor for files in this fs
|    id of the parent mount
unique mount id
What this means is that from the perspective of the container, the mountpoint
/mnt/foo
where the secret lives is read-only.
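We can double-check that from inside the container too - a small sketch that stats the mountpoint and tries to write under it (the write is expected to fail with EROFS):

package main

import (
	"errors"
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	var st unix.Statfs_t
	if err := unix.Statfs("/mnt/foo", &st); err != nil {
		panic(err)
	}
	// tmpfs has a well-known filesystem magic number
	fmt.Println("tmpfs?", st.Type == unix.TMPFS_MAGIC)

	// writing should be refused: the mount is read-only in the container
	err := os.WriteFile("/mnt/foo/scratch", []byte("nope"), 0o644)
	fmt.Println("write refused (EROFS)?", errors.Is(err, unix.EROFS))
}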
From the perspective of the host, that’s not the case:
... /var/lib/kubelet/pods/6f7e1acc.../volumes/kubernetes.io~secret/foo rw,relatime
                                                                       |
                                                                       read-write
                                                                       for us!
Which makes it possible for kubelet
to update the bytes of the files under
that mountpoint freely.
Given that in this case there is an interface for performing updates (i.e.,
kubelet
writing to a file and a process in a container detecting the change),
for volumes, updates to secrets are visible to the containers that have those
secrets mounted (inotify
works for tmpfs
).
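As a rough sketch of what detecting that could look like from inside the container (using the third-party fsnotify library; kubelet updates the volume by atomically swapping a ..data symlink, so it’s the directory, not the individual file, that’s worth watching):

package main

import (
	"fmt"
	"os"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		panic(err)
	}
	defer watcher.Close()

	// watch the mountpoint: individual files get replaced on update
	if err := watcher.Add("/mnt/foo"); err != nil {
		panic(err)
	}

	for event := range watcher.Events {
		if event.Op&fsnotify.Create == 0 {
			continue
		}
		// re-read the key after kubelet has projected the new contents
		b, err := os.ReadFile("/mnt/foo/foo")
		if err != nil {
			continue
		}
		fmt.Printf("%s -> foo=%s\n", event, b)
	}
}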
1. you can get etcdctl from the tarball that etcd releases: https://github.com/etcd-io/etcd/releases ↩︎

2. we’re not using localhost there, but instead, 127.0.0.1:$port, because the certificate has a list of possible names and IPs that can be used, and localhost is not one of them. To see what you can use in your case, check it via openssl x509 -in $cert -text. ↩︎

3. apiserver might store the contents of those values either as protobuf (which will contain some values that look like gibberish) or json. You can change that by tweaking apiserver’s storage-media-type flag to application/json if you want to achieve the same results. ↩︎

4. while the act of writing can be hard, reading those should not be all that hard - ptrace-ing the process (e.g., with gdb) would give you access to __environ, but, again, only if it’s a compliant implementation. ↩︎