Process resource limits under the hood

Hey,

When dealing with web servers, it’s not uncommon to face the problem of “too many files open”.

Usually, people address that by raising a limit with ulimit -n, and sometimes that doesn’t work: the setting is per-process, not system-wide.

Illustration of ulimits not working as expected

In this article, I go through how that ulimit setting works under the hood, and how you can make use of /proc to inspect other processes limits without the need for additional tools.

This is the fourth article in a series of 30 articles around procfs: A Month of /proc.


What are resource limits
How does the kernel limit the number of files open by a process?
Under the hood of the ulimit “command”
A sample ulimit implementation
Setting and getting resource limits at the Kernel level
Gathering process limits via /proc
Why expose limits through procfs if prlimit already exists
Closing thoughts
Resources

What are resource limits

Resource limits can be thought of as ceilings, enforced by the Kernel, on specific resources that a given process might end up consuming.

These limits are made up of two values:

soft: limits the number of resources that the process may consume; and
hard: a ceiling on the value to which the soft limit may be adjusted by an unprivileged process.
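To make the two values concrete, here is a minimal sketch that reads both of them for the current process through Go's syscall package. It assumes Linux, and the helper name openFileLimits is mine. Keep in mind that recent Go releases may raise the soft limit towards the hard limit at startup, so the soft value printed here can differ from what ulimit -n shows in the parent shell.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// openFileLimits returns the soft and hard RLIMIT_NOFILE values of the
// calling process, as reported by getrlimit(2).
// (Hypothetical helper, named for this example only.)
func openFileLimits() (uint64, uint64, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	// Cur is the soft limit, Max is the hard limit.
	return lim.Cur, lim.Max, nil
}

func main() {
	soft, hard, err := openFileLimits()
	if err != nil {
		fmt.Fprintf(os.Stderr, "getrlimit: %v\n", err)
		os.Exit(1)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", soft, hard)
}
```

Whatever the numbers are on your machine, the soft value should never be above the hard one.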

Throughout this article, I’ll be covering only one resource (the number of open files), but there are many other resources whose consumption can be capped via resource limits.

So, to set a quick example, let’s consider the following program that opens n files and leaves them open:

package main

import (
	"flag"
	"fmt"
	"io/ioutil"
	"os"
)

/**
 * Take as user input the number of files
 * that we want to create and keep open.
 */
var files = flag.Uint("files", 10, "number of files to open")

/**
 * open-files - opens files and leaves them open.
 *
 * Usage: ./open-files -files=<number_of_files>
 *
 */
func main() {
	flag.Parse()

	if *files == 0 {
		fmt.Fprintf(os.Stderr, "number of files must be > 0\n")
		os.Exit(1)
	}

	/**
	 * Show the PID of the current process so
	 * that we can mess with it in another
	 * terminal.
	 */
	fmt.Println("pid: ", os.Getpid())

	/**
	 * Starts the process of creating N files
	 * and leaving them open.
	 */
	for i := 0; i < int(*files); i++ {
		fmt.Println(i, "files open")

		tmpfile, err := ioutil.TempFile("", "example")
		if err != nil {
			fmt.Fprintf(os.Stderr, "failed to create and " +
                                "open tmp file: %v\n", err)
			return
		}

		defer os.Remove(tmpfile.Name())
	}
}

Running it in a shell with a high limit of open files, we can see it working properly:

./open-files
pid:  726
0 files open
1 files open
2 files open
3 files open
4 files open
5 files open
6 files open
7 files open
8 files open
9 files open

Now, if we make that limit very small (10), we can see it failing:

# Set the maximum number of open files to 10.
ulimit -n 10

# Try to open 10 files.
#
# Naturally, it should fail, given that
# the `open-files` process already holds some
# file descriptors (at least stdin, stdout and
# stderr) before opening the other 10 files.
./open-files
pid:  732
0 files open
1 files open
2 files open
3 files open
4 files open
5 files open
6 files open
failed to create and open tmp file: 
  open /tmp/example484763765: too many open files

How does the kernel limit the number of files open by a process?

In the case of open files, we can see in the Kernel code how that’s strictly enforced by following the functions invoked in the code path of an open(2) syscall:

          ^   __alloc_fd
          |   get_unused_fd_flags
(kernel)  |   do_sys_open
          |   sys_openat
          |   do_syscall_64
          |   entry_SYSCALL_64_after_hwframe
----------+--------------------------
(user)    |   open(2)

From do_syscall_64 all the way to do_sys_open, nothing particularly interesting happens when it comes to resource limits.

Things get interesting when it comes to get_unused_fd_flags, which then calls __alloc_fd with the resource limit boundaries:

/**
 * Sets up the ranges for __alloc_fd
 * so that when trying to allocate a
 * file descriptor, it performs some
 * bounds checking.
 */
int get_unused_fd_flags(unsigned flags)
{
	return __alloc_fd(
                current->files, 0, 
                rlimit(RLIMIT_NOFILE),
                flags);               
}

/*
 * Tries to allocate a file descriptor
 * for the process, marking it as busy.
 */
int __alloc_fd(struct files_struct *files,
	       unsigned start, unsigned end, unsigned flags)
{
	unsigned int fd;
	int error;
	struct fdtable *fdt;

        // ...

        // tries to gather the next available
        // file descriptor
	fd = start;
	if (fd < files->next_fd)
		fd = files->next_fd;
	if (fd < fdt->max_fds)
		fd = find_next_fd(fdt, fd);

        // Verifies if the file descriptor that
        // we have available falls within the
        // boundaries set by the open files
        // resource limit.
	error = -EMFILE;
	if (fd >= end)
		goto out;

	// ...
}

Knowing where these checks take place, we can write a small bpftrace program that targets __alloc_fd and reports what it returns.

BEGIN
{
        printf("Looking for EMFILE when opening\n");
}


// Place a return probe in the `__alloc_fd`
// so that we can verify whether an error is
// returned from it.
kretprobe:__alloc_fd /comm == "open-files"/
{
        printf("returned: %d\n", retval);
}

With the trace running in one terminal, we run open-files in another with a low limit:

# Run the trace program and then
# see the values returned.
# 
# What we expect: file descriptors being
# properly created up until `9`, then `-EMFILE`
# (-24) once the next descriptor would be `10`.
#
# #define EMFILE 24 /* Too many open files */

sudo bpftrace ./limit-check.d
Attaching 2 probes...
Looking for EMFILE when opening
returned: 3
returned: 4
returned: 5
returned: 6
returned: 7
returned: 8
returned: 9
returned: -24

Under the hood of the ulimit “command”

Now that we know how the Kernel checks the limits whenever our process tries to open a file, it’s time to discover what that ulimit -n command is all about.

The first thing one might notice is that ulimit is not really a binary that the shell is calling - it’s usually a builtin command that your shell provides.

For instance, looking at ash, the shell that busybox provides, we can see it in ash’s code:

/**
 * Implementation of all of the
 * ulimit builtin functionality.
 *
 * No `execve` calling a `/usr/bin/ulimit`
 * or something like that.
 *
 * Pure code implemented from ground up.
 */
int FAST_FUNC
shell_builtin_ulimit(char **argv)
{
        struct rlimit limit;

         // ...
        for (l = limits_tbl; 
                l != &limits_tbl[ARRAY_SIZE(limits_tbl)]; 
                l++) {

                // ...
                if (!opts)
                        opts = OPT_hard + OPT_soft;
                if (opts & OPT_hard)
                        limit.rlim_max = val;
                if (opts & OPT_soft)
                        limit.rlim_cur = val;
                if (setrlimit(l->cmd, &limit) < 0) {
                        bb_perror_msg("error setting limit");
                        return EXIT_FAILURE;
                }
        }
}

ps.: the same is valid for bash - see bash/builtins/ulimit.def.

Being a builtin or not, it doesn’t matter - at some point, it touches the Kernel with a syscall.

For doing so, three syscalls can be used (see man 2 prlimit):

setrlimit and getrlimit: respectively, set and get the resource limits of the currently running process (which its future children inherit); and
prlimit, which allows setting and getting the resource limits of an arbitrary process.
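The "future children" part is what makes a shell builtin useful: whatever the shell sets on itself is what the programs it launches start with. Here is a rough sketch of that inheritance in Go. It assumes Linux, an sh on the PATH whose ulimit builtin understands -n, and a hard limit of at least 256; childLimit is a name I picked for the example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"syscall"
)

// childLimit lowers the soft RLIMIT_NOFILE of the calling process to soft,
// asks a child shell for its own limit, and then restores the previous
// limits. (Hypothetical helper, named for this example only.)
func childLimit(soft uint64) (string, error) {
	var orig syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &orig); err != nil {
		return "", err
	}

	lowered := syscall.Rlimit{Cur: soft, Max: orig.Max}
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lowered); err != nil {
		return "", err
	}
	// Putting the soft limit back needs no privileges, because it stays
	// at or below the untouched hard limit.
	defer syscall.Setrlimit(syscall.RLIMIT_NOFILE, &orig)

	// The child starts with a copy of our limits.
	out, err := exec.Command("sh", "-c", "ulimit -n").Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	got, err := childLimit(256)
	if err != nil {
		fmt.Fprintf(os.Stderr, "childLimit: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("child reports:", got)
}
```

The child never called setrlimit itself, yet it should report the value the parent set just before spawning it.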

As we can see from the description, in the end, prlimit is what really matters.

If we look under the hood of setrlimit and getrlimit, we even discover that they’re just wrapping parts of the prlimit functionality (see kernel/sys.c):

SYSCALL_DEFINE2(getrlimit, 
        unsigned int, resource, 
        struct rlimit __user *, rlim)
{
	struct rlimit value;
	int ret;

	ret = do_prlimit(current, resource, NULL, &value);
	if (!ret)
		ret = copy_to_user(rlim, &value, sizeof(*rlim)) ? -EFAULT : 0;

	return ret;
}


SYSCALL_DEFINE2(setrlimit, 
        unsigned int, resource, 
        struct rlimit __user *, rlim)
{
	struct rlimit new_rlim;

	if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
		return -EFAULT;
	return do_prlimit(current, resource, &new_rlim, NULL);
}
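To see that equivalence from user space, here is a small sketch that calls the prlimit syscall directly with pid 0 (meaning "the calling process") and compares the result with a plain getrlimit. It assumes Linux on an architecture where Go's syscall package defines SYS_PRLIMIT64; the prlimit wrapper and limitsAgree helper are written for this example and are not part of the standard library.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// prlimit is a thin wrapper over the prlimit64 system call.
// A nil newLimit only reads, a nil old only writes, and pid 0 targets
// the calling process. (Hypothetical wrapper, written for this example.)
func prlimit(pid, resource int, newLimit, old *syscall.Rlimit) error {
	_, _, errno := syscall.Syscall6(
		syscall.SYS_PRLIMIT64,
		uintptr(pid),
		uintptr(resource),
		uintptr(unsafe.Pointer(newLimit)),
		uintptr(unsafe.Pointer(old)),
		0, 0,
	)
	if errno != 0 {
		return errno
	}
	return nil
}

// limitsAgree reports whether prlimit(0, ...) and getrlimit return the
// same soft and hard values for RLIMIT_NOFILE.
func limitsAgree() (bool, error) {
	var viaPrlimit, viaGetrlimit syscall.Rlimit
	if err := prlimit(0, syscall.RLIMIT_NOFILE, nil, &viaPrlimit); err != nil {
		return false, err
	}
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &viaGetrlimit); err != nil {
		return false, err
	}
	return viaPrlimit == viaGetrlimit, nil
}

func main() {
	same, err := limitsAgree()
	if err != nil {
		fmt.Fprintf(os.Stderr, "reading limits: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("prlimit(0) matches getrlimit:", same)
}
```

Passing a real PID instead of 0 is what turns this into the "arbitrary process" variant, subject to the usual permission checks.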

So, how does prlimit work?

A sample ulimit implementation

We can see its functionality in place by creating a quick program in C that is able to change the resource limits regarding the number of open files associated with a given process id:

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

// A hardcoded number that sets the maximum number
// of open files that the supplied PID can handle.
#define MAX_OPEN_FILES 10

// Hardcode the resource that we'll be changing.
//
// RLIMIT_NOFILE - number of files that can be
//                 opened by this process.
#define RESOURCE RLIMIT_NOFILE

/**
 * limit-open-files - sets the soft and hard limits on the
 *                    number of open files for a given PID.
 *
 * Usage: ./limit-open-files.out -p <PID> -s <soft> -h <hard>
 *
 */
int
main(int argc, char** argv)
{
	int        err = 0;

        // parse the arguments supplied via
        // flags (code omitted for brevity).
	struct cli cli = { 0 };
	cli_parse(argc, argv, &cli);

        // Initialize the data structures that
        // carry information about the process
        // resource limits.
        //
        // `old` is supposed to have its values
        // filled by the kernel at the `get` 
        // phase of `prlimit`.
	struct rlimit old = { 0 };

        // `new` is supposed to contain the values
        // that we want to change.
	struct rlimit new = {
		.rlim_cur = cli.soft,
		.rlim_max = cli.hard,
	};

	// Perform a "get-and-set" operation.
	//
	// By passing non-NULL `new` and `old` values,
	// the values supplied at `new` will be used as
	// the new values for such resource.
	//
	// The `old` will get filled with the previous
	// values.

        // int prlimit(
        //
        //   pid_t pid,      --> process ID to affect
        // 
        //   int resource,   --> resource to tweak / gather info
        // 
        //   const struct rlimit *new_limit, --> if set; new limit 
        //                                       to place.
        // 
        //   struct rlimit *old_limit   --> if non-NULL, gathers info 
        //                                  about the previous soft 
        //                                  and hard limits.
        // );
        // 
	err = prlimit(cli.pid, RLIMIT_NOFILE, &new, &old);
	if (err == -1) {
		perror("prlimit - get and set:");
		return 1;
	}

	printf("before: soft=%lld; hard=%lld\n",
	       (long long)old.rlim_cur,
	       (long long)old.rlim_max);

	// Perform a `get` operation to retrieve the
	// current values set for the resource.
	err = prlimit(cli.pid, RLIMIT_NOFILE, NULL, &old);
	if (err == -1) {
		perror("prlimit - get:");
		return 1;
	}

	printf("now:    soft=%lld; hard=%lld\n",
	       (long long)old.rlim_cur,
	       (long long)old.rlim_max);

	return 0;
}

Run it against a particular process and see its resource limits change:

# Modify the limit of open files that process
# 29871 can hold to soft=12 and hard=12.
./limit-open-files.out -p 29871 -s 12 -h 12
before: soft=1024; hard=1048576
now: soft=12; hard=12


# Trying to modify it again, we can see that 
# the process was previously set to `12` 
# (given that we had just changed to 12).
./limit-open-files.out -p 29871 -s 12 -h 12
before: soft=12; hard=12
now: soft=12; hard=12


# Try to increase the hard limit from 12 to 13.
./limit-open-files.out -p 29871 -s 12 -h 13
prlimit - get and set:: Operation not permitted

Notice how, in the end, we were not able to complete our desired operation of increasing the hard limit.

That’s because some of the operations that prlimit performs (like increasing the hard limit) require elevated privileges, which we didn’t have when running it as an unprivileged user.

Making use of a bpftrace tool called capable, we’re able to see how prlimit checks for CAP_SYS_RESOURCE when trying to raise the hard limit:

sudo capable
UID    PID    COMM             CAP  NAME                 AUDIT
1001   29882  limit-open-file  24   CAP_SYS_RESOURCE     1

Now that we’re aware of the prlimit syscall, can we see the Kernel code that sets and gets these limits?

Setting and getting resource limits at the Kernel level

Tracing down the functions invoked by prlimit, we can see that it all ends up in do_prlimit, at kernel/sys.c.

Here’s a breakdown of it, with comments of my own (and the mutual exclusion, done through locks, stripped out):

/**
 * `do_prlimit` constitutes the underlying functionality
 * of `prlimit` (which is also partly used by `getrlimit`
 * and `setrlimit`).
 */
int do_prlimit(struct task_struct *tsk, unsigned int resource,
		struct rlimit *new_rlim, struct rlimit *old_rlim)
{
	struct rlimit *rlim;
	int retval = 0;

	// Checks if the resource is a valid one (given that
        // we can provide whatever we want from the syscall
        // interface).
	if (resource >= RLIM_NLIMITS)
		return -EINVAL;

        // If we specified a non-NULL `new_rlim`, it means
        // that we're planning to set some limits on a given
        // resource.
	if (new_rlim) {
		// Check if the values even make sense.
		if (new_rlim->rlim_cur > new_rlim->rlim_max)
			return -EINVAL;

		// Check if it'd be getting above the system-wide limit
		// already set (sysctl)
		if (resource == RLIMIT_NOFILE &&
                                new_rlim->rlim_max > sysctl_nr_open)
			return -EPERM;
	}

	// Grab the current resource limits for
	// the desired resource.
	//
	// - Maybe we'll be able to see that /proc/pid/limits
	//   looks at the very same thing (tsk->signal->rlim)?
	rlim = tsk->signal->rlim + resource;

	// If we're going to set new values,
	// that is, if we're going to change limits, then ...
	if (new_rlim) {

		// If we're increasing the hard limit, make
		// sure that the user has the proper capabilities.
		if (new_rlim->rlim_max > rlim->rlim_max &&
				!capable(CAP_SYS_RESOURCE))
			retval = -EPERM;

		// Go ahead and perform the update, but first,
                // apply a security check.
		if (!retval)
			retval = security_task_setrlimit(tsk, resource, new_rlim);

		// ...
	}

	// If everything went right so far,
	// update `old_rlim` with the values of
	// what has been captured as the current
	// limits as of before updating.
	if (!retval) {
		if (old_rlim)
			*old_rlim = *rlim;
		if (new_rlim)
			*rlim = *new_rlim;
	}

	// ...

	return retval;
}
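The first sanity check in that function (a soft limit above the hard limit makes no sense) can be observed from user space without any side effects, since the request is rejected before anything is changed. A minimal sketch, assuming Linux; rejectsSoftAboveHard is a name made up for this example.

```go
package main

import (
	"fmt"
	"syscall"
)

// rejectsSoftAboveHard asks the kernel for a soft limit that exceeds the
// hard limit in the same request, and reports whether the answer was
// EINVAL. (Hypothetical helper, named for this example only.)
func rejectsSoftAboveHard() bool {
	bad := syscall.Rlimit{Cur: 2, Max: 1} // soft > hard: invalid on purpose
	err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &bad)
	return err == syscall.EINVAL
}

func main() {
	fmt.Println("kernel rejected soft > hard with EINVAL:", rejectsSoftAboveHard())
}
```

The capability check further down is harder to demonstrate in a portable way, because its outcome depends on who runs the program.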

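Cool! So now we know that the limits are stored at task->signal->rlim + <rlimit_offset>. Since the signal struct is shared by all the threads of a thread group, the limits apply to the process as a whole.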

You might’ve noticed that we didn’t explicitly look at task->signal->rlim in the section “How does the kernel limit the number of files open by a process?”.

As a reminder, this is what we saw:

/**
 * Sets up the ranges for __alloc_fd
 * so that when trying to allocate a
 * file descriptor, it performs some
 * bounds checking.
 */
int get_unused_fd_flags(unsigned flags)
{
	return __alloc_fd(
                current->files, 0, 
                rlimit(RLIMIT_NOFILE),
                flags);               
}

Looking deeper, we can inspect what rlimit(RLIMIT_NOFILE) does: under the hood, it takes the current task and reads ->signal->rlim in the next function that gets called (task_rlimit):

/**
 * Looks at the current task's resource
 * limit.
 *
 * ps.: notice how it doesn't look at the **hard**
 *      limit, but the **soft** limit.
 */
static inline unsigned long task_rlimit(const struct task_struct *tsk,
		unsigned int limit)
{
	return READ_ONCE(tsk->signal->rlim[limit].rlim_cur);
}
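The fact that only the soft limit is consulted can be checked with a small experiment: lower the soft limit, leave the hard limit alone, and open files until the kernel says no. The sketch below assumes Linux and a readable /dev/null; openUntilEMFILE is a name I made up, and it uses raw syscall.Open calls so that nothing else in the Go runtime grabs descriptors in between.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// openUntilEMFILE lowers only the soft RLIMIT_NOFILE to soft, opens
// /dev/null until the kernel answers EMFILE, then closes everything and
// restores the previous limits. It returns how many opens succeeded.
// (Hypothetical helper, named for this example only.)
func openUntilEMFILE(soft uint64) (int, error) {
	var orig syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &orig); err != nil {
		return 0, err
	}
	lowered := syscall.Rlimit{Cur: soft, Max: orig.Max} // hard limit untouched
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lowered); err != nil {
		return 0, err
	}
	defer syscall.Setrlimit(syscall.RLIMIT_NOFILE, &orig)

	var fds []int
	defer func() {
		for _, fd := range fds {
			syscall.Close(fd)
		}
	}()

	for {
		fd, err := syscall.Open("/dev/null", syscall.O_RDONLY, 0)
		if err == syscall.EMFILE {
			return len(fds), nil // stopped by the soft limit
		}
		if err != nil {
			return len(fds), err // some other, unexpected failure
		}
		fds = append(fds, fd)
	}
}

func main() {
	n, err := openUntilEMFILE(32)
	if err != nil {
		fmt.Fprintf(os.Stderr, "unexpected error: %v\n", err)
		os.Exit(1)
	}
	fmt.Printf("opened %d files before EMFILE with soft=32\n", n)
}
```

The count depends on how many descriptors the process already holds, but it should stay below the soft value even though the hard limit is much higher.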

So, now that we know how limits can be updated, as well as read, we can start imagining how /proc/<pid>/limits works under the hood.

Gathering process limits via /proc

Not only can we retrieve the limits set for a given process via the prlimit(2) syscall, we can also do so by inspecting /proc, more specifically /proc/<pid>/limits (where <pid> is the ID of the process whose limits we want to know).

For instance, we can check the limits of the current process either via ulimit -n or /proc/self/limits (/proc/self is a link to /proc/<pid> where <pid> is the current process' pid).

# Check the limit of open files
# using `ulimit` (`prlimit` under the hood).
ulimit -n
1024


# Check the limit of open files
# using the `proc` filesystem under `/proc`.
grep -E '^Limit|open files' /proc/self/limits
Limit           Soft Limit  Hard Limit   Units
Max open files  1024        1048576      files
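The same lookup is straightforward to do from a program, for our own process or for any other PID we are allowed to read. A minimal sketch in Go, assuming the row is named "Max open files" as in the output above; openFilesFromProc is a hypothetical helper written for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// openFilesFromProc returns the soft and hard columns of the
// "Max open files" row of /proc/<pid>/limits, as strings.
// pid may be a numeric PID or "self".
// (Hypothetical helper, named for this example only.)
func openFilesFromProc(pid string) (string, string, error) {
	f, err := os.Open("/proc/" + pid + "/limits")
	if err != nil {
		return "", "", err
	}
	defer f.Close()

	const prefix = "Max open files"
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		// After the row name come: soft limit, hard limit, units.
		cols := strings.Fields(strings.TrimPrefix(line, prefix))
		if len(cols) < 2 {
			return "", "", fmt.Errorf("unexpected row: %q", line)
		}
		return cols[0], cols[1], nil
	}
	if err := scanner.Err(); err != nil {
		return "", "", err
	}
	return "", "", fmt.Errorf("row %q not found", prefix)
}

func main() {
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1] // e.g. the PID printed by open-files
	}
	soft, hard, err := openFilesFromProc(pid)
	if err != nil {
		fmt.Fprintf(os.Stderr, "reading limits: %v\n", err)
		os.Exit(1)
	}
	fmt.Printf("pid=%s soft=%s hard=%s\n", pid, soft, hard)
}
```

The values are kept as strings because other rows of the file may read "unlimited" instead of a number.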

Knowing from the past articles (see What is /proc?) how procfs is able to register methods to respond to Virtual Filesystem (VFS) calls, we can start digging into how the method registered for handling calls to /proc/<pid>/limits works.

/**
 * Display limits for a process 
 */
static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
			   struct pid *pid, struct task_struct *task)
{
	unsigned int i;
	unsigned long flags;

	struct rlimit rlim[RLIM_NLIMITS];

        // Make sure we have exclusive access
        // to the task struct while reading
        // it.
	if (!lock_task_sighand(task, &flags))
		return 0;

        // Copy to our current execution all of
        // the limits that are set for the task
        // provided as argument.
	memcpy(rlim, task->signal->rlim, 
                sizeof(struct rlimit) * RLIM_NLIMITS);

        // Release the exclusive access
	unlock_task_sighand(task, &flags);

        // Print the header
        seq_printf(m, "%-25s %-20s %-20s %-10s\n",
		  "Limit", "Soft Limit", "Hard Limit", "Units");

        // Iterate over each limit and then display it.
	for (i = 0; i < RLIM_NLIMITS; i++) {
		if (rlim[i].rlim_cur == RLIM_INFINITY)
			seq_printf(m, "%-25s %-20s ",
				   lnames[i].name, "unlimited");
		else
			seq_printf(m, "%-25s %-20lu ",
				   lnames[i].name, rlim[i].rlim_cur);
                // ...
	}

	return 0;
}

That’s it!

Why expose limits through procfs if prlimit already exists

I don’t know!

While the kernel lacks a convenient interface for some of its internal state, that’s definitely not the case for limits: prlimit already covers everything that the file exposes.

Do you know why? Please let me know!

Closing thoughts

It was interesting to see how a resource limit is actually applied by the kernel and, in the end, to discover that /proc/<pid>/limits is largely redundant given the functionality that a syscall (prlimit) already provides.

If you have any further questions, or just want to connect, let me know! I’m cirowrc on Twitter.

Have a good one!

Resources

Here are some interesting books to learn more about some of the concepts highlighted here: