Hey,
Despite having used Kubernetes resources for quite a long time, I had never really taken the time to dig deep into what that actually means at the node level - this article is all about addressing that.
First, we go through the different quality of service classes that a pod can fall into, and then analyze how going over container and node memory limits affects your containers.
the different uses of resources
Each pod, once admitted, gets assigned a quality of service class based on the aggregate resource requirements of the set of containers present in that pod definition.
As the three different quality of service classes show us (guaranteed, burstable, and best effort), there are multiple ways that one can go about requesting (or not) resources and limiting (or not) them.
For the process that ends up being run, there'll be quite a few differences once the container runtime materializes that definition of a reservation or a set of limits. That's what makes it possible to prioritize which pods get evicted when resources become scarce from the node's perspective.
To get started, let’s compare guaranteed and burstable.
guaranteed vs burstable
From the definition of guaranteed, one gets assigned to such QoS as long as:
Every Container in the Pod must have a memory limit and a memory request, and they must be the same.
Every Container in the Pod must have a CPU limit and a CPU request, and they must be the same.
kubernetes.io: Guaranteed
So, here’s an example of that:
apiVersion: v1
kind: Pod
metadata:
  name: limits-and-requests
  namespace: test
spec:
  containers:
    - name: container
      image: busybox
      command: [ /bin/sleep, 33d ]
      resources:
        limits:
          cpu: 100m
          memory: 10Mi
        requests:
          cpu: 100m
          memory: 10Mi
With every container in the pod having both limits and requests set to equal values (for both memory and CPU), the pod gets assigned to the guaranteed quality of service class, as we can observe:
kubectl describe pod limits-and-requests
Name: limits-and-requests
Namespace: test
Priority: 0
Status: Running
QoS Class: Guaranteed
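Alternatively, the class is recorded directly in the pod status, so a quicker way of checking it for the same pod is:

# `.status.qosClass` carries the class computed at admission time
kubectl get pod limits-and-requests -o jsonpath='{.status.qosClass}'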
Now, for a burstable:
The Pod does not meet the criteria for QoS class Guaranteed.
At least one Container in the Pod has a memory or CPU request.
kubernetes.io: Burstable
i.e., if we take away that memory request that we had (set it to 0 - no reservation at all), we get our pod assigned to the burstable class:
apiVersion: v1
kind: Pod
metadata:
  name: limits
  namespace: test
spec:
  containers:
    - name: container
      image: busybox
      command: [ /bin/sleep, 33d ]
      resources:
        limits:
          cpu: 100m
          memory: 10Mi
        requests:
          cpu: 100m
          memory: 0
What this means is that when Kubernetes goes about scheduling the pod to a node, it'll not take any memory constraint into consideration when placing this pod, as there's simply no memory reservation.
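Requests, on the other hand, are exactly what the scheduler sums up per node when deciding placement. A quick way of seeing that accounting (my-node below is just a placeholder for an actual node name):

# the scheduler only considers requests for placement; `describe node`
# shows the per-node tally of what has been requested (and limited) so far
kubectl describe node my-node | grep -A 8 'Allocated resources'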
We can verify that we got to the burstable class:
kubectl describe pod limits
Name: limits
Namespace: test
Priority: 0
Status: Running
QoS Class: Burstable
To see how those would differ at the container runtime level, we can take a look at the difference between the OCI specs that were generated for these containers.
As I'm using microk8s for this (which uses containerd's implementation of the container runtime interface (CRI)), we can use ctr (containerd's CLI tool) to gather such a spec:
# use `kubectl` to get the id of the container
#
function container_id() {
  local name=$1

  kubectl get pod $name \
    -o jsonpath='{.status.containerStatuses[0].containerID}' \
    | cut -d '/' -f3
}

# use `ctr` to get the oci spec
#
function oci_spec () {
  local id=$1

  microk8s.ctr container info $id | jq '.Spec'
}

oci_spec $(container_id "limits-and-requests") > /tmp/guaranteed
oci_spec $(container_id "limits") > /tmp/burstable

git diff --no-index /tmp/guaranteed /tmp/burstable
Which reveals the biggest differences:
diff --git a/guaranteed.json b/burstable.json
index 046d16f..da7596a 100644
--- a/guaranteed.json
+++ b/burstable.json
@@ -14,15 +14,15 @@
],
"cwd": "/",
"capabilities": {
@@ -92,7 +92,7 @@
]
},
- "oomScoreAdj": -998
+ "oomScoreAdj": 999
},
"linux": {
"resources": {
"memory": {
"limit": 10485760
},
@@ -247,25 +247,25 @@
"period": 100000
}
},
- "cgroupsPath": "/kubepods/pod477062c0-1c.../05bef2...",
+ "cgroupsPath": "/kubepods/burstable/podfbb122d5-ca/59...",
First, cgroupsPath
gets completely different - /kubepods/burstable
(rather
than just kubepods
).
Second, oomScoreAdj for the initial process gets configured according to a computation that gives it less priority (i.e., makes it a more likely victim) in the case of an out-of-memory (OOM) kill.
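We can double-check that value from the node itself by resolving the container's host PID and reading procfs directly - a small sketch, assuming that ctr task ls lists tasks as TASK, PID, and STATUS columns (and reusing the container_id helper from before):

# hypothetical helper: map a container id to its host PID via containerd,
# then read that process' oom_score_adj straight from procfs
function oom_score_adj () {
  local id=$1

  local pid=$(microk8s.ctr task ls | awk -v id="$id" '$1 == id { print $2 }')
  cat /proc/$pid/oom_score_adj
}

oom_score_adj $(container_id "limits-and-requests")   # -998
oom_score_adj $(container_id "limits")                # 999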
Let's keep that in mind, see the changes going from guaranteed to besteffort, and then analyze what those details mean.
guaranteed vs best effort
For a Pod to be given a QoS class of BestEffort, the Containers in the Pod must not have any memory or CPU limits or requests.
kubernetes.io: BestEffort
i.e., we can tailor a pod definition with requests and limits all set to 0, and that should get us to besteffort:
apiVersion: v1
kind: Pod
metadata:
  name: nothing
  namespace: test
spec:
  containers:
    - name: container
      image: busybox
      command: [ /bin/sleep, 33d ]
      resources:
        limits:
          cpu: 0
          memory: 0
        requests:
          cpu: 0
          memory: 0
Which essentially means that no constraints based on memory or CPU should be placed when it comes to scheduling, and that at the moment of running the containers, no limits should be applied to them.
ps.: this DOES NOT mean that the pod will always be schedulable - checking for fitness of resources is just one of the predicates involved. See kube-scheduler#filtering for more on this.
We can verify that we got assigned besteffort:
kubectl describe pod nothing
Name: nothing
Namespace: test
Priority: 0
Status: Running
QoS Class: BestEffort
Now, comparing it with guaranteed:
diff --git a/guaranteed.json b/besteffort.json
index 046d16f..bd16d6b 100644
--- a/guaranteed.json
+++ b/besteffort.json
@@ -92,7 +92,7 @@
]
},
- "oomScoreAdj": -998
+ "oomScoreAdj": 1000
},
"linux": {
"resources": {
@@ -239,33 +239,33 @@
}
],
"memory": {
- "limit": 10485760
+ "limit": 0
},
"cpu": {
- "shares": 102,
- "quota": 10000,
+ "shares": 2,
+ "quota": 0,
"period": 100000
}
},
- "cgroupsPath": "/kubepods/pod477062c0-1c/05bef2cca07a...",
+ "cgroupsPath": "/kubepods/besteffort/pod31b936/1435...",
Naturally, in this case, CPU comes into play too, as in the other cases we were setting non-zero CPU requests and limits (if we had kept the CPU configs, we'd not get to besteffort - it'd be burstable).
I'll not focus on the CPU part, but notice how with "unlimited CPU" (what we got in besteffort by not setting any CPU limits and reservations), we get pretty much no priority at all when competing for CPU (shares is set to 2). If you'd like to know more about CPU shares and quotas, make sure you check out Throttling: New Developments in Application Performance with CPU Limits.
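Those numbers can also be confirmed on the node by reading the corresponding cgroup files - a sketch, assuming the cgroup v1 hierarchy mounted at /sys/fs/cgroup and substituting the full (untruncated) cgroupsPath values from the specs above:

# the cgroupsPath from the OCI spec maps directly under the cgroup mount
cat /sys/fs/cgroup/cpu/kubepods/<guaranteed-pod-dir>/<container-id>/cpu.shares   # 102
cat /sys/fs/cgroup/cpu/kubepods/besteffort/<pod-dir>/<container-id>/cpu.shares   # 2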
But, aside from that, we see the same difference as before: the container is put in a
different cgroup path, and the oomScoreAdj
is also changed.
Let’s see those in detail, but first, a review of how OOM takes place in Linux.
oom score
When hitting high memory pressure, Linux kicks off the process of killing whichever processes could help the system recover from that pressure while impacting the least (i.e., freeing as much as possible by killing the least important thing).
It does so by assigning a score (oom_score) to each and every process in the system (read /proc/$pid/oom_score and you'll see the OOM score of a given $pid), where the higher the number, the bigger the likelihood of the process being killed when an OOM situation takes place.
This metric by itself ends up reflecting just the percentage of the system's available memory that a given process uses. There's no notion of "how important a process is" captured in it.
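To see the whole picture at once, we can list every process' current score, most likely victims first (a plain procfs walk, nothing fancy):

# print each process' oom_score alongside its name, sorted from the most
# to the least likely to be picked by the OOM killer
for pid in $(ps -e -o pid=); do
  printf "%s\t%s\n" "$(cat /proc/$pid/oom_score 2>/dev/null)" \
                    "$(cat /proc/$pid/comm 2>/dev/null)"
done | sort -rn | head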
example
Consider the case of a system where one of the processes (PROC1) holds most of the memory to itself (95%), while all the rest hold very little:
MEM
PROC1 95%
PROC2 1%
PROC3 1%
When checking the OOM score assigned to each of them, we can clearly see how big the score for PROC1 is compared to the other ones:
MEM OOM_SCORE
PROC1 95% 907
PROC2 1% 9
PROC3 1% 9
And indeed, if we now create a PROC4 that tries to allocate, say, 5%, that triggers the OOM killer, which kills PROC1, and not the new one.
[951799.046579] Out of memory:
Killed process 18163 (mem-alloc)
total-vm:14850668kB,
anon-rss:14133628kB,
file-rss:4kB,
shmem-rss:0kB
[951799.441402] oom_reaper:
reaped process 18163 (mem-alloc),
now anon-rss:0kB,
file-rss:0kB,
shmem-rss:0kB
(that's the one we're calling PROC1 here)
However, as one can tell, it could be that PROC1 is really important, and that we never want it to be killed, even when memory pressure is super high.
That's when adjustments to the OOM score come into play.
oomScoreAdj
The oom_score_adj file (under /proc/[pid]/oom_score_adj) can be used to adjust the "badness" heuristic used to select which process gets killed in out-of-memory conditions.
The value of /proc/<pid>/oom_score_adj is added to the "badness score" before it is used to determine which task to kill. It ranges from -1000 (OOM_SCORE_ADJ_MIN) - "never kill" - to +1000 (OOM_SCORE_ADJ_MAX) - "sure kill".
This way, if we wanted PROC1 (the one that allocates 95% of available memory) to
not be killed even in the event of OOM, we could tweak that process'
oom_score_adj
(echo "-1000" > /proc/$(pidof PROC1)/oom_score_adj
), which
would lead to a much lower score:
MEM OOM_SCORE OOM_SCORE_ADJ
PROC1 95% 0 -1000
PROC2 1% 9 0
PROC3 1% 9 0
Which then, in case of further allocations, leads to processes other than PROC1 being killed.
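To double-check both values for a given process at any point (PROC1 being the placeholder name from the example above):

# score and adjustment sit side by side under procfs
pid=$(pidof PROC1)
cat /proc/$pid/oom_score /proc/$pid/oom_score_adj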
In the case of Kubernetes, that’s the primitive that it’s dealing with when it comes to prioritizing who should be killed in the event of a system-wide OOM:
- guaranteed gets high priority (a negative oomScoreAdj, like the -998 we saw)
- besteffort gets very low priority (a high positive oomScoreAdj, like the 1000 we saw)
different cgroup trees?
Among those differences, something I didn't really expect was the two distinct cgroup trees formed for the burstable and best effort classes.
"cgroupsPath": "/kubepods/pod477062c0-1c.../05bef2...",
"cgroupsPath": "/kubepods/besteffort/pod31b936/1435...",
"cgroupsPath": "/kubepods/burstable/podfbb122d5-ca/59...",
It turns out that with those different trees, kubelet is able to appropriately provide the guaranteed class with what it needs by dynamically changing the values of CPU shares, and, for an incompressible resource like memory, allowing the operator to specify an amount to be reserved for the higher classes, so that best effort gets limited to only what's left after burstable and guaranteed have taken their share.
This allows for fine-grained allocation of resources to a set of pods, while being able to “give the rest” to a whole class (represented by a child tree).
/kubepods
  /pod123 (guaranteed)          cpu and mem limits well specified
    cpu.shares            = sum(resources.requests.cpu)
    cpu.cfs_quota_us      = sum(resources.limits.cpu)
    memory.limit_in_bytes = sum(resources.limits.memory)

  /burstable                    all - (guaranteed + reserved)
    cpu.shares            = max(sum(burstable pods cpu requests), 2)
    memory.limit_in_bytes = allocatable - sum(requests guaranteed)

    /pod789
      cpu.shares = sum(resources.requests.cpu)
      if all containers set cpu:
        cpu.cfs_quota_us
      if all containers set mem:
        memory.limit_in_bytes

  /besteffort                   all - burstable
    cpu.shares            = 2
    memory.limit_in_bytes = allocatable - (sum(requests guaranteed) + sum(requests burstable))

    /pod999
      cpu.shares = 2
ps.: I might’ve got some of the details above wrong, despite the overall idea holding. Make sure you check out the Kubelet pod level resource management design document.
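On the node, those class-level cgroups can be inspected directly - a sketch, assuming the cgroupfs driver and the v1 hierarchy we saw in the specs (note that the class-level memory limit only gets set to something meaningful when QoS memory reservation is enabled on the kubelet):

# the per-class directories carry the knobs that kubelet manages dynamically
cat /sys/fs/cgroup/cpu/kubepods/burstable/cpu.shares
cat /sys/fs/cgroup/memory/kubepods/burstable/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/kubepods/besteffort/memory.limit_in_bytes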
Knowing that there are memory limits involved (whether because a limit was supplied or because a certain amount gets reserved via QoS reservation), what happens when you get past the threshold?
And what does reaching the threshold really mean?
per-cgroup memory accounting
In order to enforce memory limits according to what has been set on
memory.limit_in_bytes
, the kernel must keep track of the allocations that take
place, so that it can decide whether that group of processes can go forward or
not.
To see that accounting taking place, we can come up with a small example, and then observe which functions (inside the kernel) are involved.
#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static const ptrdiff_t len  = 1 << 25; // 32 MiB
static const ptrdiff_t incr = 1 << 12; // 4 KiB

void handle_sig(int sig) {}

void wait_a_bit(char* msg) {
	if (signal(SIGINT, handle_sig) == SIG_ERR) {
		perror("signal");
		exit(1);
	}

	printf("wait: %s\n", msg);
	pause();
}

int main(void) {
	char *start, *end;
	void* pb;

	pb = sbrk(0);
	if (pb == (void*)-1) {
		perror("sbrk");
		return 1;
	}

	start = (char*)pb;
	end   = start + len;

	wait_a_bit("next: brk");

	// "allocate" mem by increasing the program break.
	//
	if (brk(end) == -1) {
		perror("brk");
		return 1;
	}

	wait_a_bit("next: memset");

	// "touch" the memory so that we get it really utilized - at this point,
	// we should see the faults taking place, and both RSS and active anon
	// going up.
	//
	while (start < end) {
		memset(start, 123, incr);
		start += incr;
	}

	wait_a_bit("next: exit");
	return 0;
}
In the example above, we first expand the program break, giving us some extra space in the heap, and then touch that memory, page by page.
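To have the sample running under a memory cgroup of our own (named test here; sample.c is just the file name I'm assuming for the code above), a possible cgroup v1 setup would be:

# build the sample and start it - it blocks right away, before allocating
gcc -o sample sample.c
./sample &

# create a memory cgroup by hand and move the process into it; from this
# point on, its allocations get charged against the `test` group
mkdir -p /sys/fs/cgroup/memory/test
echo $! > /sys/fs/cgroup/memory/test/cgroup.procs

# as we advance the program (it waits for SIGINT between each step), the
# group's accounting moves along
cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes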
Having the process under a memory cgroup and then tracing the function involved with charging it (mem_cgroup_try_charge), we can see that the charging only takes place at the moment we try to access the newly mapped area (i.e., we only see charges going on during memset).
# leverage iovisor/bcc's `trace`
#
./trace 'mem_cgroup_try_charge' -U -K -p $(pidof sample)
PID TID COMM FUNC
18223 18223 sample mem_cgroup_try_charge
mem_cgroup_try_charge+0x1 [kernel]
do_anonymous_page+0x139 [kernel]
__handle_mm_fault+0x760 [kernel]
handle_mm_fault+0xca [kernel]
do_user_addr_fault+0x1f9 [kernel]
__do_page_fault+0x58 [kernel]
do_page_fault+0x2c [kernel]
page_fault+0x34 [kernel]
main+0x57 [sample]
Naturally, at some point, we get to the limit.
per-cgroup oom
To observe an OOM for a cgroup, we can put a limit on the cgroup that we created, setting memory.limit_in_bytes to something way smaller than the 32MiB that our sample application touches.
echo "4M" > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
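With that in place, letting the sample proceed all the way into the memset loop should get it killed by the cgroup-level OOM killer, which we can confirm from the kernel log (and, depending on the kernel, from the group's memory.oom_control file):

kill -INT $(pidof sample)    # advance past "next: brk"
kill -INT $(pidof sample)    # advance into the memset loop - boom

dmesg | tail
cat /sys/fs/cgroup/memory/test/memory.oom_control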
Tracing the mem_cgroup_out_of_memory function (where the out-of-memory handling takes place in the kernel), we can figure out how all of that happens:
mem_cgroup_out_of_memory() {
out_of_memory() {
out_of_memory.part.0() {
mem_cgroup_get_max();
mem_cgroup_scan_tasks() {
mem_cgroup_iter() { }
css_task_iter_start() { }
css_task_iter_next() { }
oom_evaluate_task() {
oom_badness.part.0() { }
}
css_task_iter_next() { }
oom_evaluate_task() {
oom_badness.part.0() { }
}
css_task_iter_next() { }
css_task_iter_end() { }
}
oom_kill_process() { }
}
}
}
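The call graph above is what ftrace's function_graph tracer gives us; one way of capturing it (assuming tracefs is available under /sys/kernel/debug/tracing):

# trace only the subtree rooted at mem_cgroup_out_of_memory
cd /sys/kernel/debug/tracing
echo mem_cgroup_out_of_memory > set_graph_function
echo function_graph > current_tracer
cat trace_pipe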
kubelet’s soft evictions
Aside from those evictions that take place from the kernel perspective, there are also pod evictions that the kubelet itself can enforce.
These are based on thresholds configured at the kubelet level.
The kubelet can proactively monitor for and prevent total starvation of a compute resource. In those cases, the kubelet can reclaim the starved resource by proactively failing one or more Pods.
a worker pod is ungracefully restarted by Kubernetes when memory consumption exceeds an internally-configured threshold.
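For reference, those thresholds are the kubelet's eviction settings; a sketch of how they could be set via flags (the values are purely illustrative, not recommendations):

# soft thresholds give pods a grace period before eviction; hard
# thresholds evict right away
kubelet \
  --eviction-soft=memory.available<500Mi \
  --eviction-soft-grace-period=memory.available=1m30s \
  --eviction-hard=memory.available<100Mi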