Hey,

Despite having used Kubernetes resources for quite a long time, I never really took the time to dig deep into what they mean at the node level - this article is all about addressing that.

First, we go through the different quality of service classes that a pod can fall into, and then analyze how going over container and node memory limits affects your containers.

the different uses of resources

Each pod, once admitted, gets assigned a quality of service class based on the aggregate resource requirements of the set of containers present in that pod definition.

As the three different quality of service classes show us (guaranteed, burstable, and best effort), there are multiple ways that one can go about requesting (or not) resources and limiting (or not) them.

For the process that ends up running, there’ll be quite a few differences in what the container runtime materializes out of that definition of a reservation or set of limits. This is what makes it possible to prioritize which pods get evicted when resources get scarce from the node’s perspective.

To get started, let’s compare guaranteed and burstable.

guaranteed vs burstable

From the definition of guaranteed, a pod gets assigned that QoS class as long as:

Every Container in the Pod must have a memory limit and a memory request, and they must be the same.

Every Container in the Pod must have a CPU limit and a CPU request, and they must be the same.

kubernetes.io: Guaranteed

So, here’s an example of that:

    apiVersion: v1
    kind: Pod
    metadata:
      name: limits-and-requests
      namespace: test
    spec:
      containers:
        - name: container
          image: busybox
          command: [ /bin/sleep, 33d ]
          resources:
            limits:
              cpu: 100m
              memory: 10Mi
            requests:
              cpu: 100m
              memory: 10Mi

With every container in the pod having limits and requests set to equal values (for both memory and CPU), the pod lands in the Guaranteed quality of service class, as we can observe:

    kubectl describe pod limits-and-requests

    Name:         limits-and-requests
    Namespace:    test
    Priority:     0
    Status:       Running
    QoS Class:    Guaranteed
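
The same information is also available in a script-friendly form, as the assigned class gets recorded in the pod status under .status.qosClass:

    kubectl get pod limits-and-requests \
            -o jsonpath={.status.qosClass}

    Guaranteed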

Now, for a burstable:

The Pod does not meet the criteria for QoS class Guaranteed.

At least one Container in the Pod has a memory or CPU request.

kubernetes.io: Burstable

i.e., if we take away the memory request that we had (setting it to 0, i.e., no reservation), our pod gets assigned to the burstable class:

    apiVersion: v1
    kind: Pod
    metadata:
      name: limits
      namespace: test
    spec:
      containers:
        - name: container
          image: busybox
          command: [ /bin/sleep, 33d ]
          resources:
            limits:
              cpu: 100m
              memory: 10Mi
            requests:
              cpu: 100m
              memory: 0

What this means is that when Kubernetes schedules the pod to a node, it won’t take any memory constraint into consideration when placing it, as there’s simply no memory reservation.
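
To see what the scheduler does take into account, we can look at the node’s “Allocated resources” summary - the 100m CPU request shows up in the sums there, while this pod adds nothing to the memory column (a quick check; the grep range is just an approximation of the section’s length):

    node=$(kubectl get pod limits -o jsonpath={.spec.nodeName})
    kubectl describe node $node | grep -A 8 'Allocated resources'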

We can verify that we got to the burstable class:

    kubectl describe pod limits

    Name:         limits
    Namespace:    test
    Priority:     0
    Status:       Running
    QoS Class:    Burstable

To see how those would differ at the container runtime level, we can take a look at the difference between the OCI specs that were generated for these containers.

As I’m using microk8s for this (which uses containerd’s implementation of the container runtime interface (CRI)), we can use ctr (containerd’s CLI tool) to gather that spec:

    # use `kubectl` to get the id of the container
    #
    function container_id() {
            local name=$1

            kubectl get pod $name \
                    -o jsonpath={.status.containerStatuses[0].containerID} \
                    | cut -d '/' -f3
    }

    # use `ctr` to get the oci spec
    #
    function oci_spec () {
            local id=$1

            microk8s.ctr container info $id | jq '.Spec'
    }


    oci_spec $(container_id "limits-and-requests") > /tmp/guaranteed
    oci_spec $(container_id "limits") > /tmp/burstable

    git diff --no-index /tmp/guaranteed /tmp/burstable

Which reveals the biggest differences:

    diff --git a/guaranteed.json b/burstable.json
    index 046d16f..da7596a 100644
    --- a/guaranteed.json
    +++ b/burstable.json
    @@ -14,15 +14,15 @@
         ],
         "cwd": "/",
         "capabilities": {
    @@ -92,7 +92,7 @@
           ]
         },
    -    "oomScoreAdj": -998
    +    "oomScoreAdj": 999
       },

       "linux": {
         "resources": {
           "memory": {
             "limit": 10485760
           },
    @@ -247,25 +247,25 @@
             "period": 100000
           }
         },
    -    "cgroupsPath": "/kubepods/pod477062c0-1c.../05bef2...",
    +    "cgroupsPath": "/kubepods/burstable/podfbb122d5-ca/59...",

First, the cgroupsPath is quite different - the container gets placed under /kubepods/burstable rather than directly under /kubepods.

Second, oomScoreAdj for the initial process gets configured according to a computation that makes it a more likely target in the case of an out-of-memory (OOM) kill.

Let’s keep that in mind, see the changes going from guaranteed to besteffort, and then analyze what those details mean.

guaranteed vs best effort

For a Pod to be given a QoS class of BestEffort, the Containers in the Pod must not have any memory or CPU limits or requests.

kubernetes.io: besteffort

i.e., we can tailor a pod definition with requests and limits all set to 0, and that should get us to besteffort:

    apiVersion: v1
    kind: Pod
    metadata:
      name: nothing
      namespace: test
    spec:
      containers:
        - name: container
          image: busybox
          command: [ /bin/sleep, 33d ]
          resources:
            limits:
              cpu: 0
              memory: 0
            requests:
              cpu: 0
              memory: 0

Which essentially means that no memory or CPU constraints are taken into account when scheduling, and that no limits are applied to the containers when they run.

ps.: this DOES NOT mean that the pod will always be schedulable - checking for fitness of resources is just one of the predicates involved. See kube-scheduler#filtering for more on this.

We can verify that we got assigned besteffort:

    kubectl describe pod nothing

    Name:         nothing
    Namespace:    test
    Priority:     0
    Status:       Running
    QoS Class:    BestEffort

Now, comparing it with guaranteed:

    diff --git a/guaranteed.json b/besteffort.json
    index 046d16f..bd16d6b 100644
    --- a/guaranteed.json
    +++ b/besteffort.json
    @@ -92,7 +92,7 @@
           ]
         },
    -    "oomScoreAdj": -998
    +    "oomScoreAdj": 1000
       },
       "linux": {
         "resources": {
    @@ -239,33 +239,33 @@
             }
           ],
           "memory": {
    -        "limit": 10485760
    +        "limit": 0
           },
           "cpu": {
    -        "shares": 102,
    -        "quota": 10000,
    +        "shares": 2,
    +        "quota": 0,
             "period": 100000
           }
         },
    -    "cgroupsPath": "/kubepods/pod477062c0-1c/05bef2cca07a...",
    +    "cgroupsPath": "/kubepods/besteffort/pod31b936/1435...",

Naturally, in this case, CPU comes into play as well, since in the other cases we were passing non-zero CPU requests and limits (if we had kept the CPU configs, we’d not get to besteffort - it’d be burstable).

I’ll not focus on the CPU part, but notice how with “unlimited CPU” (what besteffort gives us when no CPU limits or reservations are set), we get pretty much no priority at all when competing for CPU (shares is set to 2). If you’d like to know more about CPU shares and quotas, make sure you check out Throttling: New Developments in Application Performance with CPU Limits.

But, aside from that, we see the same difference as before: the container is put in a different cgroup path, and the oomScoreAdj is also changed.

Let’s see those in detail, but first, a review of how OOM takes place in Linux.

oom score

When hitting high memory pressure, Linux kicks off the process of killing processes whose termination could help the system recover from that pressure, while impacting as little as possible (i.e., free as much as possible while killing the least important thing).

It does so by assigning a score (oom_score) to each and every process in the system (read /proc/$pid/oom_score to see the OOM score of a given $pid), where the higher the number, the bigger the likelihood of that process being killed when an OOM situation takes place.

This metric by itself ends up reflecting just the percentage of the system’s available memory that a given process uses. There’s no notion of “how important a process is” captured in it.
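
To get a feel for those scores, here’s one way of listing the current ones on a machine, highest (i.e., most likely to be killed) first - just a quick sketch:

    for pid in $(ls /proc | grep -E '^[0-9]+$'); do
            echo "$(cat /proc/$pid/oom_score 2>/dev/null)" \
                 "$(cat /proc/$pid/comm 2>/dev/null)" \
                 "$pid"
    done | sort -rn | head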

example

Consider the case of a system where one of the processes (PROC1) holds most of the memory to itself (95%), while all the rest hold very little:

            MEM
    
    PROC1   95%
    PROC2   1%
    PROC3   1%

when checking the OOM score assigned to each of them, we can clearly see how big the score for PROC1 is when compared to the other ones:

            MEM     OOM_SCORE
    
    PROC1   95%     907
    PROC2   1%      9
    PROC3   1%      9

and indeed, if we now create a PROC4 that tries to allocate, say, another 5%, the OOM killer gets triggered and kills PROC1 - not the new process.

    [951799.046579] Out of memory: 
            Killed process 18163 (mem-alloc) 
            total-vm:14850668kB, 
            anon-rss:14133628kB, 
            file-rss:4kB, 
            shmem-rss:0kB
    [951799.441402] oom_reaper: 
            reaped process 18163 (mem-alloc), 
            now anon-rss:0kB, 
            file-rss:0kB, 
            shmem-rss:0kB

    (that's the one we're calling PROC1 here)

However, as one can tell, it could be that PROC1 is really important, and that we never want it to be killed, even when memory pressure is super high.

That’s when adjustments to the OOM score come into play.

oomScoreAdj

The oom_score_adj file (under /proc/[pid]/oom_score_adj) can be used to adjust the “badness” heuristic used to select which process gets killed in out-of-memory conditions.

The value of /proc/<pid>/oom_score_adj is added to the “badness score” before it is used to determine which task to kill. It takes values ranging from -1000 (OOM_SCORE_ADJ_MIN) - “never kill” - to +1000 (OOM_SCORE_ADJ_MAX) - “sure kill”.

This way, if we wanted PROC1 (the one that allocates 95% of available memory) to not be killed even in the event of OOM, we could tweak that process' oom_score_adj (echo "-1000" > /proc/$(pidof PROC1)/oom_score_adj), which would lead to a much lower score:

            MEM     OOM_SCORE       OOM_SCORE_ADJ
    
    PROC1   95%     0               -1000
    PROC2   1%      9               0
    PROC3   1%      9               0

Which then, in case of further allocations, leads to processes other than PROC1 being killed.

In the case of Kubernetes, that’s the primitive it deals with when it comes to prioritizing which containers get killed in the event of a system-wide OOM.
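
That also matches the values we saw in the diffs: the guaranteed pod’s process got a large negative adjustment (-998), the besteffort one got the maximum (1000), and the burstable one got something in between that scales with how much memory the pod requests relative to the node’s capacity. A rough sketch of that last computation (my reading of the kubelet’s policy, not an authoritative formula):

    # how the kubelet appears to derive the burstable adjustment from
    # the pod's memory request vs. the node's capacity - the result is
    # then clamped so it stays strictly between the guaranteed and
    # besteffort values, hence the 999 we saw in the spec.
    memory_request=10485760      # the 10Mi from our pod spec
    memory_capacity=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)
    echo $(( 1000 - (1000 * memory_request) / memory_capacity ))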

different cgroup trees?

Among those differences, something I didn’t really expect was the two different cgroup trees formed for the burstable and best effort classes.

    "cgroupsPath": "/kubepods/pod477062c0-1c.../05bef2...",
    "cgroupsPath": "/kubepods/besteffort/pod31b936/1435...",
    "cgroupsPath": "/kubepods/burstable/podfbb122d5-ca/59...",

It turns out that with those different trees, the kubelet is able to give the guaranteed class what it needs by dynamically changing CPU share values, and, for an incompressible resource like memory, to let the operator specify an amount to be reserved for the higher classes, limiting best effort to only what’s left after burstable and guaranteed have taken their share.

This allows for fine-grained allocation of resources to a set of pods, while being able to “give the rest” to a whole class (represented by a child tree).

    /kubepods
            /pod123 (guaranteed)    cpu and mem limits well specified
                            cpu.shares              = sum(resources.requests.cpu)
                            cpu.cfs_quota_us        = sum(resources.limits.cpu)
                            memory.limit_in_bytes   = sum(resources.limits.memory)

            /burstable             all - (guaranteed + reserved)
                            cpu.shares              = max(sum(burstable pods cpu requests, 2))
                            memory.limit_in_bytes   = allocatable - sum(requests guaranteed)

                    /pod789
                            cpu.shares = sum(resources.requests.cpu)
                            if all containers set cpu:
                                    cpu.cfs_quota_us
                            if all containers set mem:
                                    memory.limit_in_bytes

            /besteffort              all - burstable
                            cpu.shares = 2
                            memory.limit_in_bytes = allocatable - (sum(requests guaranteed) + sum(requests burstable))
                    /pod999
                            cpu.shares = 2

ps.: I might’ve got some of the details above wrong, despite the overall idea holding. Make sure you check out the Kubelet pod level resource management design document.
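
Those per-class limits can be seen directly on the node by reading the memory cgroup files at each level of the tree (a quick check, assuming cgroup v1 with the cgroupfs driver, as on the microk8s node used here):

    # print the memory limit applied at each level of the tree
    # ("unlimited" shows up as a very large number)
    for cg in kubepods kubepods/burstable kubepods/besteffort; do
            echo "$cg: $(cat /sys/fs/cgroup/memory/$cg/memory.limit_in_bytes)"
    done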

Knowing that there are memory limits involved (either because a limit is supplied or because a certain amount gets reserved via QoS reservation), what happens when you get past the threshold?

And what does reaching the threshold really mean?

per-cgroup memory accounting

In order to enforce memory limits according to what has been set on memory.limit_in_bytes, the kernel must keep track of the allocations that take place, so that it can decide whether that group of processes can go forward or not.

To see that accounting taking place, we can come up with a small example and then observe which functions (inside the kernel) are involved in it.

    #include <signal.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static const ptrdiff_t len  = 1 << 25; // 32 MiB
    static const ptrdiff_t incr = 1 << 12; // 4KiB

    void handle_sig(int sig) { }

    void wait_a_bit(char* msg) {
            if (signal(SIGINT, handle_sig) == SIG_ERR) {
                    perror("signal");
                    exit(1);
            }

            printf("wait: %s\n", msg);
            pause();
    }

    int main(void) {
            char *start, *end;
            void* pb;

            pb = sbrk(0);
            if (pb == (void*)-1) {
                    perror("sbrk");
                    return 1;
            }

            start = (char*)pb;
            end   = start + len;

            wait_a_bit("next: brk");

            // "allocate" mem by increasing the program break.
            //
            if (!~brk(end)) {
                    perror("brk");
                    return 1;
            }

            wait_a_bit("next: memset");

            // "touch" the memory so that we get it really utilized - at this point,
            // we should see the faults taking place, and both RSS and active anon
            // going up.
            //
            while (start < end) {
                    memset(start, 123, incr);
                    start += incr;
            }

            wait_a_bit("next: exit");

            return 0;
    }

In the example above, we first expand the program break, giving us some extra space in the heap, and then touch that memory.
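
To have its allocations accounted for (and later limited), we can create a memory cgroup and move the process into it before it starts touching memory - a minimal sketch, assuming cgroup v1, root access, and that the program above was compiled to ./sample (the name the traces below use):

    mkdir -p /sys/fs/cgroup/memory/test
    ./sample &
    echo $! > /sys/fs/cgroup/memory/test/cgroup.procs

    # the program pauses at each stage - advance it with SIGINT
    kill -INT $!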

Having the process under a memory cgroup and then tracing the function involved in charging it (mem_cgroup_try_charge), we can see that charging only takes place at the moment we try to access the newly mapped area (i.e., we only see charges happening during memset).

    # leverage iovisor/bcc's `trace`
    #
    ./trace 'mem_cgroup_try_charge' -U -K -p $(pidof sample)

    PID     TID     COMM            FUNC
    18223   18223   sample          mem_cgroup_try_charge

            mem_cgroup_try_charge+0x1 [kernel]
            do_anonymous_page+0x139 [kernel]
            __handle_mm_fault+0x760 [kernel]
            handle_mm_fault+0xca [kernel]
            do_user_addr_fault+0x1f9 [kernel]
            __do_page_fault+0x58 [kernel]
            do_page_fault+0x2c [kernel]
            page_fault+0x34 [kernel]
            main+0x57 [sample]
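
The accounting also shows up in the cgroup’s own counters - under the same cgroup v1 assumptions as before, we can watch the charged memory grow as memset walks through the pages:

    cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
    grep -E '^(rss|active_anon) ' /sys/fs/cgroup/memory/test/memory.stat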

Naturally, at some point, we get to the limit.

per-cgroup oom

To observe an OOM for a cgroup, we can put a limit on the cgroup we created, setting memory.limit_in_bytes to something way smaller than the 32 MiB that our sample application touches.

    echo "4M" > /sys/fs/cgroup/memory/test/memory.limit_in_bytes

Tracing the mem_cgroup_out_of_memory function (where the out-of-memory handling takes place in the kernel), we can figure out how all of that happens:

    mem_cgroup_out_of_memory() {
      out_of_memory() {
        out_of_memory.part.0() {
          mem_cgroup_get_max();
          mem_cgroup_scan_tasks() {
            mem_cgroup_iter() { }
            css_task_iter_start() { }
            css_task_iter_next() { }
            oom_evaluate_task() {
              oom_badness.part.0() { }
            }
            css_task_iter_next() { }
            oom_evaluate_task() {
              oom_badness.part.0() { }
            }
            css_task_iter_next() { }
            css_task_iter_end() { }
          }
          oom_kill_process() { }
        }
      }
    }
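
In case you’d like to capture a call graph like that yourself, ftrace’s function_graph tracer can be pointed at that single function - a rough sketch, assuming tracefs is mounted at the usual location and you’re root:

    cd /sys/kernel/debug/tracing
    echo mem_cgroup_out_of_memory > set_graph_function
    echo function_graph > current_tracer
    echo 1 > tracing_on
    cat trace_pipe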

kubelet’s soft evictions

Aside from the OOM kills that happen from the kernel’s perspective, there are also pod evictions that the kubelet itself can enforce.

These are based on thresholds configured at the kubelet level.

The kubelet can proactively monitor for and prevent total starvation of a compute resource. In those cases, the kubelet can reclaim the starved resource by proactively failing one or more Pods.

a worker pod is ungracefully restarted by Kubernetes when memory consumption exceeds an internally-configured threshold.

from soft eviction thresholds
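
For illustration, these thresholds are set through kubelet flags (or the equivalent KubeletConfiguration fields) - the values below are made up:

    kubelet \
            --eviction-soft='memory.available<500Mi' \
            --eviction-soft-grace-period='memory.available=1m30s' \
            --eviction-max-pod-grace-period=30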