Hey,

A very common mistake when starting out with Linux containers is to assume that free and other tools like top report the container's memory limits.

Illustration of the access to procfs from a container

Here you’ll not only go through why that happens and how to get it right, but also take a look at where the Kernel looks for information when you ask it for memory statistics.

Also, if you’re curious about what the code that keeps track of per-cgroup page counters looks like, stick around until the end!

This is the third article in a series of 30 articles around procfs: A Month of /proc.

If you’d like to keep up to date with it, make sure you join the mailing list!

Running top within a container

To get a testbed for the rest of the article, consider the case of running a single container with a memory limit of 10MB in a system that has 2GB of RAM available:

# Check the amount of memory available
# outside the container (i.e., in the host)
free -h
      total   used   free   available
Mem:   1.9G   312M   385M        1.5G

# Define the total number of bytes that
# will dictate the memory limit of the
# container.
MEM_MAX="$((1024 * 1024 * 10))"


# Run a container using the ubuntu image
# as its base image, with the memory limit
# set to 10MB, and a tty as well as interactive
# support.
docker run \
        --interactive \
        --tty \
        --memory $MEM_MAX \
        ubuntu

With the container running, we can now check what top reports from inside it:

top -bn1

Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 99.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
         .----------------.
         |                |
KiB Mem :| 2040940 total, | 117612 free,   651204 used,  1272124 buff/cache
KiB Swap:|       0 total, |      0 free,        0 used.  1196972 avail Mem
         *--+-------------*
  PID USER  |   PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root  |   20   0   18508   3432   3016 S   0.0  0.2   0:00.02 bash
   12 root  |   20   0   36484   3104   2748 R   0.0  0.2   0:00.00 top
            |
            *---> Not really what we 
                  expect, that is 2GB!!

As outlined before, that's not what one would typically expect: top reports the total memory as seen by the host, not showing the 10MB limit at all.
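To get a feeling for how much this skews the picture, here is a rough sketch of the arithmetic behind the %MEM column. It assumes %MEM is simply RES divided by the MemTotal that the tool read; the helper name mem_percent is made up for illustration and is not taken from top's source.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from top's source): the percentage of
 * `total_kb` that a resident set of `res_kb` represents.
 */
static double mem_percent(unsigned long res_kb, unsigned long total_kb)
{
	/* A total of zero would make the ratio meaningless. */
	assert(total_kb > 0);

	return 100.0 * (double)res_kb / (double)total_kb;
}
```

With bash's RES of 3432 KiB from the output above, mem_percent(3432, 2040940) comes out at roughly 0.17 (shown by top as 0.2), while against the 10MB limit, mem_percent(3432, 10240) would be about 33.5.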

What about free? Same thing:

free -h
        total    used     free   available
Mem:     1.9G    612M     131M        1.2G
Swap:      0B      0B       0B

How the top and free tools gather memory statistics

If we inspect the syscalls used by both top and free, we can see that they’re making plain openat(2) and read(2) calls:

# Check which syscalls `free` uses
strace -f free
...
openat(AT_FDCWD, "/proc/meminfo", O_RDONLY) = 3

                        .-------.
                        |       v
read(3, "MemTotal:      | 2040940 kB\nMemF"..., 8191) = 1307
...                     |       
                     That is 2GB!



# Check which syscalls `top` uses
strace -f top -p 19282  -bn1
...
openat(AT_FDCWD, "/proc/meminfo", O_RDONLY) = 5
lseek(5, 0, SEEK_SET)                   = 0
read(5, "MemTotal:        2040940 kB\nMemF"..., 8191) = 1307
...                             ^
                                |
             2GB again  --------*

Looking at what those read(2) calls return, we can spot that the “problem” comes from /proc/meminfo, which free and top are just blindly trusting.
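What the tools do with that buffer can be approximated in a few lines of C. This is only a sketch of the idea, not how procps actually does it: meminfo_field_kb is a hypothetical helper that looks up a key in text laid out like /proc/meminfo.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: looks up `key` (e.g. "MemTotal") in a buffer
 * that follows the /proc/meminfo layout ("Key:   value kB\n") and
 * returns the value in kB, or -1 if the key is not present.
 */
static long meminfo_field_kb(const char *buf, const char *key)
{
	size_t klen = strlen(key);
	const char *line = buf;

	assert(klen > 0);

	while (line != NULL && *line != '\0') {
		/* A match is `key` immediately followed by ':'. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
			long kb;

			if (sscanf(line + klen + 1, "%ld", &kb) == 1)
				return kb;
			return -1;
		}

		/* Move on to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line != NULL)
			line++;
	}

	return -1;
}
```

Fed with a buffer that starts like the one in the strace output ("MemTotal:        2040940 kB\n..."), looking up "MemTotal" returns 2040940. Nothing in there knows about the container: whatever the file says is what gets reported.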

Before we go check what the Kernel is doing when reporting those values, let’s quickly remember how a container gets memory limits set.

Setting container limits

The way that Docker (ok, runc) ends up setting the container limits is via the use of cgroups.

As very well documented in the man page (see man 7 cgroups):

Control cgroups, usually referred to as cgroups, are a Linux kernel feature which allows processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored.

To see that in action, consider the following program that allocates memory in chunks of 1MB:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEGABYTE (1 << 20)
#define ALLOCATIONS 20


/**
 * alloc - a "leaky" program that just allocates
 *         a predefined amount of memory and then
 *         exits.
 */
int
main(int argc, char** argv)
{
	printf("allocating: %dMB\n", ALLOCATIONS);


	void* p;
	int   i = ALLOCATIONS;


	while (i-- > 0) {
		// Allocate 1MB (not initializing it
		// though).
		p = malloc(MEGABYTE);
		if (p == NULL) {
			perror("malloc");
			return 1;
		}

		// Explicitly initialize the area that
		// has been allocated.
		memset(p, 65, MEGABYTE);

		printf("remaining\t%d\n", i);
	}
}

We can see that, without any limits, it allocates all 20MB without problems:

# Keep allocating memory until the 20MB
# mark gets reached.
./alloc.out
allocating: 20MB
remaining	19
remaining	18
...
remaining	1
remaining	0

That changes after we put our process under a cgroup with memory limits set:

# Create our custom cgroup
mkdir /sys/fs/cgroup/memory/custom-group

# Configure the maximum amount of memory
# that all of the processes in such cgroup
# will be able to allocate
echo "$((1024 * 1024 * 10))" > \
        /sys/fs/cgroup/memory/custom-group/memory.limit_in_bytes

# Put the current process tree under such
# cgroup
echo $$ > \
        /sys/fs/cgroup/memory/custom-group/tasks

# Try to allocate the 20MB
./alloc.out
allocating: 20MB
remaining	19
remaining	18
remaining	17
remaining	16
remaining	15
remaining	14
remaining	13
remaining	12
Killed

Looking at the results from dmesg, we can see what happened:

                        our thing getting killed!
                                  .------------.
[181346.109904] alloc.out invoked | oom-killer:|
                                  *------------*
[181346.109906] alloc.out cpuset=/ mems_allowed=0
[181346.109911] CPU: 0 PID: 22074 Comm: alloc.out Not tainted 4.15.0-36-generic #39-Ubuntu
[181346.109911] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[181346.109912] Call Trace:
[181346.109918]  dump_stack+0x63/0x8b
[181346.109920]  dump_header+0x71/0x285
[181346.109923]  oom_kill_process+0x220/0x440
[181346.109924]  out_of_memory+0x2d1/0x4f0
[181346.109926]  mem_cgroup_out_of_memory+0x4b/0x80
[181346.109928]  mem_cgroup_oom_synchronize+0x2e8/0x320
[181346.109930]  ? mem_cgroup_css_online+0x40/0x40
[181346.109931]  pagefault_out_of_memory+0x36/0x7b
[181346.109934]  mm_fault_error+0x90/0x180
[181346.109935]  __do_page_fault+0x4a5/0x4d0
[181346.109937]  do_page_fault+0x2e/0xe0
[181346.109940]  ? page_fault+0x2f/0x50
[181346.109941]  page_fault+0x45/0x50

                        Killed!
...               ____________________________
                 /                            \
[181346.109950] Task in /custom-group killed as 
                a result of limit of /custom-group
[181346.109954] memory: usage 10240kB, limit 10240kB, failcnt 56
[181346.109954] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[181346.109955] kmem: usage 940kB, limit 9007199254740988kB, failcnt 0
[181346.109955] Memory cgroup stats for /custom-group: cache:0KB rss:9300KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:9248KB inactive_file:0KB active_file:0KB unevictable:0KB
[181346.109965] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[181346.110005] [21530]     0 21530     5837     1381    90112        0             0 bash
[181346.110011] [22074]     0 22074     3440     2594    69632        0             0 alloc.out
[181346.110012] Memory cgroup out of memory: Kill process 22074 (alloc.out) score 989 or sacrifice child
[181346.318942] Killed process 22074 (alloc.out) total-vm:13760kB, anon-rss:8988kB, file-rss:1388kB, shmem-rss:0kB
[181346.322003] oom_reaper: reaped process 22074 (alloc.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

So we can see pretty well that limits are being enforced.

Again, why is /proc telling us that we have 2GB of memory?

Memory limits set by cgroups are not namespaced

The reason is that the memory information exposed through /proc/meminfo is not namespaced.

Unlike other things, like listing pids from /proc, when the file_operations that procfs implements reach the point of gathering memory information, they don’t acquire a namespaced view of it.

For instance, let’s compare how procfs shows the contents of /proc/ (listing the directory entries) with how it shows /proc/meminfo.

In the case of listing /proc (see How is /proc able to list process IDs?), we can see procfs taking the namespace reference and using it:

int proc_pid_readdir(struct file *file, struct dir_context *ctx)
{
        // Takes the namespace as seen by the file
        // provided.
	struct pid_namespace *ns = file_inode(file)->i_sb->s_fs_info;

        // ...
        
        // Iterates through the next available tasks
        // (processes) as seen by the namespace that
        // we are within.
	for (iter = next_tgid(ns, iter);
	     iter.task;
	     iter.tgid += 1, iter = next_tgid(ns, iter)) {
                // ...
        }

        // ...
}

Meanwhile, in the case of reading /proc/meminfo, that doesn’t happen at all (well, as expected, it’s not about namespaces! It’s about cgroups):

static int meminfo_proc_show(struct seq_file *m, void *v)
{
	struct sysinfo i;
	// ...

        // Populate the sysinfo struct with memory-related
        // stuff
	si_meminfo(&i);

        // Add swap information
	si_swapinfo(&i);
        
        // ... start displaying

	show_val_kb(m, "MemTotal:       ", i.totalram);
	show_val_kb(m, "MemFree:        ", i.freeram);

        // ...
}

As expected, not a single reference to namespaces (or cgroups).

Also, si_meminfo, the function that fills the sysinfo struct with the global values that end up in /proc/meminfo, has no idea about cgroups either:

/**
 * The struct that holds part of the memory information
 * that ends up being displayed in the end.
 */
struct sysinfo {
	__kernel_long_t uptime;		/* Seconds since boot */
	__kernel_ulong_t loads[3];	/* 1, 5, and 15 minute load averages */
	__kernel_ulong_t totalram;	/* Total usable main memory size */
	__kernel_ulong_t freeram;	/* Available memory size */
	__kernel_ulong_t sharedram;	/* Amount of shared memory */
	__kernel_ulong_t bufferram;	/* Memory used by buffers */
	__kernel_ulong_t totalswap;	/* Total swap space size */
	__kernel_ulong_t freeswap;	/* swap space still available */
	__u16 procs;		   	/* Number of current processes */
	__u16 pad;		   	/* Explicit padding for m68k */
	__kernel_ulong_t totalhigh;	/* Total high memory size */
	__kernel_ulong_t freehigh;	/* Available high memory size */
	__u32 mem_unit;			/* Memory unit size in bytes */
	char _f[20-2*sizeof(__kernel_ulong_t)-sizeof(__u32)];	/* Padding: libc5 uses this.. */
};

/**
 * Fills the `sysinfo` struct passed as a pointer
 * with values collected from the system (globally
 * set).
 */
void si_meminfo(struct sysinfo *val)
{
	val->totalram = totalram_pages;
	val->sharedram = global_node_page_state(NR_SHMEM);
	val->freeram = global_zone_page_state(NR_FREE_PAGES);
	val->bufferram = nr_blockdev_pages();
	val->totalhigh = totalhigh_pages;
	val->freehigh = nr_free_highpages();
	val->mem_unit = PAGE_SIZE;
}

Interesting fact: totalram_pages (reported as MemTotal) can change - see this StackOverflow question: Why does MemTotal in /proc/meminfo change?.
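Given that si_meminfo only knows about global values, a tool that wants a cgroup-aware total has to combine the two sources itself. The following is a minimal sketch of one way to do that, under the assumption that the tool has already read MemTotal from /proc/meminfo and the limit from the cgroup's memory.limit_in_bytes; effective_total_kb is a hypothetical name, not an existing API.

```c
#include <assert.h>

/*
 * Hypothetical helper: picks the memory total (in kB) that a
 * cgroup-aware tool could report. `host_total_kb` would come from
 * MemTotal in /proc/meminfo and `limit_bytes` from the cgroup's
 * memory.limit_in_bytes.
 */
static unsigned long long
effective_total_kb(unsigned long long host_total_kb,
		   unsigned long long limit_bytes)
{
	unsigned long long limit_kb = limit_bytes / 1024;

	/* MemTotal is never zero on a running system. */
	assert(host_total_kb > 0);

	/* A limit at or above the host total constrains nothing
	 * (that is also what an "unlimited" cgroup looks like). */
	if (limit_kb >= host_total_kb)
		return host_total_kb;

	return limit_kb;
}
```

For the container from the beginning of the article, that would be effective_total_kb(2040940, 10 * 1024 * 1024), which gives 10240 kB instead of the host's 2GB. With a huge limit, like the 9007199254740988kB that dmesg printed for the unlimited counters, the host total wins.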

Who’s controlling the allocation of memory?

If you’re now wondering where the limit that we set in the cgroup ends up being enforced, we need to look at the path that a memory allocation takes.


   alloc.out	(our process)
      |
      |
      *--> task_struct (process descriptor)
            |
            |
   	    *--> mm_struct (memory descriptor)
   	           |
   	           |
   mem_cgroup <----*
   |
   +------> page_counter memory
   |          |
   |          *--> { atomic_long_t count, unsigned long limit }
   |
   |
   *------> page_counter swap

Within the Kernel, each process created (in our case, alloc.out) is referenced internally via a process descriptor task_struct:

struct task_struct {
	struct thread_info		thread_info;

        // ... 
 
	unsigned int			cpu;
	struct mm_struct		*mm;

	// ...
};

That process descriptor references a memory descriptor, mm, of type mm_struct:

struct mm_struct {
	struct vm_area_struct *mmap;		/* list of VMAs */
	unsigned long mmap_base;		/* base of mmap area */
	unsigned long task_size;		/* size of task vm space */

	// ...

#ifdef CONFIG_MEMCG
	struct mem_cgroup *mem_cgroup;
#endif

};

That memory descriptor references a mem_cgroup, a data structure that keeps track of the cgroup semantics for memory limiting and accounting:

struct mem_cgroup {
	struct cgroup_subsys_state css;

	/* Private memcg ID. Used to ID objects that outlive the cgroup */
	struct mem_cgroup_id id;

	/* Accounted resources */
	struct page_counter memory;
	struct page_counter swap;
        
        // ...
};

That cgroup data structure in turn references page counters (memory and swap, for instance) defined via the page_counter struct, which are responsible for keeping track of usage and providing the limiting functionality when someone tries to acquire a page:

struct page_counter {
	atomic_long_t count;
	unsigned long limit;

        // The parent CGROUP (remember, cgroups are
        // hierarchical!)
	struct page_counter *parent;
        
        // ...
};

Whenever a process needs some pages assigned to it, page_counter_try_charge walks up the cgroup memory hierarchy trying to charge the given number of pages. If the charge succeeds (the new count would not exceed the limit), the counts are updated; otherwise the charge fails, which can ultimately trigger the OOM behavior.

Using bcc to trace page_counter_try_charge, we can see how a page fault leads to mem_cgroup_try_charge calling page_counter_try_charge:
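To make the hierarchical charging more concrete, here is a userspace sketch of the idea. It is a simplification under stated assumptions, not the kernel's code: sim_counter and sim_try_charge are hypothetical names, plain integers stand in for atomics, and the check is done before adding rather than after.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's page_counter: plain integers
 * instead of atomics, and only the fields discussed above. */
struct sim_counter {
	unsigned long count;
	unsigned long limit;
	struct sim_counter *parent;
};

/*
 * Tries to charge `nr_pages` to `counter` and all of its ancestors.
 * Returns 0 on success. On failure returns -1, stores the counter
 * whose limit was hit in `*fail`, and leaves every count unchanged.
 */
static int sim_try_charge(struct sim_counter *counter,
			  unsigned long nr_pages,
			  struct sim_counter **fail)
{
	struct sim_counter *c;

	assert(counter != NULL && fail != NULL);

	for (c = counter; c != NULL; c = c->parent) {
		/* Overflow-safe form of "count + nr_pages > limit". */
		if (c->count > c->limit || nr_pages > c->limit - c->count) {
			struct sim_counter *undo;

			/* Roll back the levels that were already charged. */
			for (undo = counter; undo != c; undo = undo->parent)
				undo->count -= nr_pages;

			*fail = c;
			return -1;
		}
		c->count += nr_pages;
	}

	return 0;
}
```

In this sketch, a counter sitting at 2559 pages with a limit of 2560 refuses a request for 32 pages but accepts a request for a single page, and a refusal at any ancestor undoes the charges already made further down the hierarchy.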

25641   25641   alloc.out       page_counter_try_charge
        page_counter_try_charge+0x1 [kernel]
        mem_cgroup_try_charge+0x93 [kernel]
        handle_pte_fault+0x3e3 [kernel]
        __handle_mm_fault+0x478 [kernel]
        handle_mm_fault+0xb1 [kernel]
        __do_page_fault+0x250 [kernel]
        do_page_fault+0x2e [kernel]
        page_fault+0x45 [kernel]

Tracing a cgroup running out of memory

If we’re even more curious and decide to trace the page_counter_try_charge arguments, we can see the attempts failing when we’re within a container and try to grab more memory than we’re allowed to.

Using bpftrace, we’re able to tailor a small program that inspects the page_counter used in page_counter_try_charge and see how the usage approaches the limit over time (until we reach exhaustion and receive an OOM).

#include <linux/page_counter.h>


BEGIN
{
        printf("Tracing page_counter_try_charge... Hit Ctrl-C to end.\n");
        printf("%-8s %-6s %-16s %-10s %-10s %-10s\n",
                "TIME", "PID", "COMM", "REQUESTED", "CURRENT", "LIMIT");

        @epoch = nsecs;
}


kprobe:page_counter_try_charge
{
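        // First argument: the counter being charged. Recent bpftrace
        // versions require the `struct` keyword in casts.
        $pcounter = (struct page_counter *)arg0;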

        $limit = $pcounter->limit;
        $current = $pcounter->count.counter;
        $requested = arg1;

        printf("%-8d %-6d %-16s %-10ld %-10ld %-10ld\n",
                (nsecs - @epoch) / 1000000,
                pid,
                comm,
                $requested,
                $current,
                $limit
        );
}

Running the tracer with a shell session put into the cgroup that limits our memory, we can see it running out of pages:

sudo bpftrace ./try-charge-counter.d
Attaching 2 probes...
Tracing page_counter_try_charge... Hit Ctrl-C to end.
TIME     PID     REQUESTED  CURRENT    LIMIT
...
3301     25980   32         1288       2560
3302     25980   32         1320       2560
...
3307     25980   1          2553       2560
3307     25980   32         2554       2560
                        .--------------------.
3307     25980   1      |   2554       2560  |
3308     25980   32     |   2555       2560  |
3308     25980   1      |   2555       2560  |
3308     25980   32     |   2556       2560  |
3308     25980   1      |   2556       2560  |
3308     25980   32     |   2557       2560  |
3308     25980   1      |   2557       2560  |
3308     25980   32     |   2558       2560  |
                        *----------.---------*
                                   |
                                still possible
                                to increase the
                                number of pages 
                                   ...

3308     25980   1          2558       2560
3308     25980   32         2559       2560
3308     25980   1          2559       2560
3308     25980   32         2560       2560 * LIMIT REACHED
3308     25980   1          2560       2560 *
3308     25980   1          2560       2560 *
                             |          |
                             *-----.----*
                                   |
   Whoopsy, can't allocate  <------*
   anymore!
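The LIMIT column is expressed in pages, not bytes, which is why it reads 2560 rather than anything resembling 10MB. As a small worked example of that conversion, assuming 4 KiB pages (typical on x86-64, but an assumption here), with hypothetical helper names:

```c
#include <assert.h>
#include <limits.h>

/* Assumed page size: 4 KiB, as on typical x86-64 setups. */
#define ASSUMED_PAGE_SIZE 4096ULL

/* Number of whole pages that fit in `bytes` (rounds down). */
static unsigned long long limit_bytes_to_pages(unsigned long long bytes)
{
	return bytes / ASSUMED_PAGE_SIZE;
}

/* Size in kB of `pages` pages. */
static unsigned long long pages_to_kb(unsigned long long pages)
{
	/* Guard against overflow in the multiplication below. */
	assert(pages <= ULLONG_MAX / ASSUMED_PAGE_SIZE);

	return pages * ASSUMED_PAGE_SIZE / 1024ULL;
}
```

The 10MB written to memory.limit_in_bytes (10 * 1024 * 1024 bytes) maps to 2560 pages, the LIMIT seen in the trace, and 2560 pages map back to 10240 kB, the limit that dmesg reported.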

Closing thoughts

Although I already knew that meminfo wasn’t namespaced, it wasn’t clear to me why.

Going through the exercise of tailoring a quick program to inspect the arguments passed to page_counter_try_charge was very interesting (and easier than I thought!).

Shout out to bpftrace once again for allowing us to go deep into the Kernel with ease!

If you have any further questions, or just want to connect, let me know! I’m cirowrc on Twitter.

Have a good one!

Resources