Hey,

Continuing with the Month of /proc, today’s blog post is about how reading /proc works (yep, the directory).

Illustration of what happens after ls is called

This article is not only about /proc, but also about reading directories in general (expect syscalls and kernel inspection).

If you’ve been curious about how listing directory entries works under the hood, this is for you!

This is the second article in a series of 30 articles around procfs: A Month of /proc.

If you’d like to keep up to date with it, make sure you join the mailing list!

Listing directory entries in Linux

Whenever you issue ls in a Linux system, three things happen:

  • /bin/ls (or the equivalent of it) is executed;
  • the given directory is opened, and
  • a syscall is issued to read the directory entries of such directory.

We can observe all three by making use of strace, a utility that traces the syscalls made by a given process:

# Create a directory somewhere
mkdir /tmp/ciro

# Run `strace` with the option of 
# tracing child processes as they are
# created.
strace -f ls /tmp/ciro/
execve("/bin/ls", ["ls", "/tmp/ciro/"], 0x7ffd091bfe30 /* 20 vars */) = 0
...
        # Open the file, passing some flags to it.
        openat(AT_FDCWD,     
                "/tmp/ciro/",
                O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3

        # Read the directory entries.
        getdents(3,             
                /* 2 entries */, 32768)     = 48

While execve is interesting in itself, we’re not going to focus on it in this article. All we need to know is that it’s what executes the given program.

Opening a directory

The next syscall is openat.

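This one is pretty much the same as open(2), which we covered in the last article (see What is /proc - section: Translating a file read to an internal kernel method), except that the path is resolved relative to a directory file descriptor (here AT_FDCWD, the current working directory). What stands out in this call is a special flag: O_DIRECTORY.

According to man 2 open: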

O_DIRECTORY: If pathname is not a directory, cause the open to fail.

In practice, that saves us an extra stat(2) call just to check whether the file is a directory before performing any directory operations on it.

If you’re curious about how that works, check out the following trace:

do_last
path_openat
do_sys_open
sys_openat
do_syscall_64

Where do_last, which finishes the file opening, performs the following check:

//                     .--> flag set if 
error = -ENOTDIR; //   |    `O_DIRECTORY` is passed
//                     |    to open(2)
//                     |
if ((nd->flags & LOOKUP_DIRECTORY) && 
        !d_can_lookup(nd->path.dentry)) {

        goto out;       // ends up returning ENOTDIR
}

If you’d like to know more about open(2), make sure you also read the last article: what is /proc.

There I cover how open works under the hood when the Virtual Filesystem interacts with procfs.

Another great resource is the book Understanding the Linux Kernel, 3rd Ed. It goes down to what the kernel does when opening a file too!

Reading directory entries from userspace

Now, once the directory has been opened (that is, the kernel holds an open file description and userspace holds the file descriptor that refers to it), we can make use of the getdents syscall.
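As a quick aside on that descriptor vs. description distinction, here's a small sketch of how it shows up in practice. The helper dup_shares_offset is a name I invented; it assumes `path` points to a regular file and checks that two descriptors created with dup(2) share a single offset, because both refer to the same open file description in the kernel.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if an offset change made through
 * one descriptor is visible through its dup()ed copy, 0 if not,
 * and -1 if any of the calls fail.
 */
int dup_shares_offset(const char *path)
{
	int fd, copy;
	off_t seen;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -1;

	copy = dup(fd);
	if (copy == -1) {
		close(fd);
		return -1;
	}

	/* Move the offset using the original descriptor... */
	if (lseek(fd, 1, SEEK_SET) == (off_t)-1) {
		close(copy);
		close(fd);
		return -1;
	}

	/* ...and query the current offset through the copy. */
	seen = lseek(copy, 0, SEEK_CUR);

	close(copy);
	close(fd);
	return seen == 1;
}
```

That shared offset is the same "file pointer" that the kernel updates as directory entries get read.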

The system call getdents() reads several linux_dirent structures from the directory referred to by the open file descriptor fd into the buffer pointed to by dirp.

Something interesting about this one is that, at the time of writing, it’s not wrapped by glibc (a getdents64() wrapper only arrived in glibc 2.30), meaning that we need to call it via syscall(2) ourselves.

I don’t really know why this syscall in specific is not wrapped! Do you? Please let me know! Reach me at cirowrc on Twitter!

Since there’s no wrapper, we call it directly with syscall, providing the arguments that it expects and a memory area for the kernel to fill (so we can retrieve the response).

Here’s how we can do it (check the comments!):

#include <fcntl.h>
#include <linux/types.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

// Total size that the buffer that we'll allocate
// in the stack can have.
#define BUF_SIZE 1024

// The directory we'll end up reading.
const char* proc_directory_path = "/proc";

/**
 * Provide the structure that matches the kernel's layout
 * for directory entries (see linux#include/linux/dirent.h).
 *
 * The first two fields are 64 bits wide on every architecture,
 * so fixed-width types are used instead of `long` and `off_t`.
 */
struct linux_dirent64 {
	unsigned long long d_ino;    /* 64-bit inode number */
	long long          d_off;    /* 64-bit offset to next structure */
	unsigned short     d_reclen; /* Size of this dirent */
	unsigned char      d_type;   /* File type */
	char               d_name[]; /* Filename (null-terminated) */
};

/**
 * Reads the directory entries from `/proc`.
 *
 * It does so by using the non-wrapped syscall
 * getdents64 until it returns 0 (no more entries).
 */
int
main(int argc, char** argv)
{
	// `buf` holds a piece of memory to be given to the
	// kernel so that it can fill it with data.
	//
	// Given that we're declaring it with `BUF_SIZE`,
	// this is allocating `BUF_SIZE * 1` bytes in the stack.
	char buf[BUF_SIZE];

	// file descriptor to hold the open file (directory)
	int fd;

	// number of bytes that the kernel wrote to the buffer
	// that we allocated in the stack (not the number of
	// entries, as entries vary in size).
	long bytes_read;

	// error code to exit.
	int err = 0;


	// Open the directory so that we're able to let the
	// kernel deal with the underlying file description.
	fd = open(proc_directory_path, O_RDONLY | O_DIRECTORY);
	if (fd == -1) {
		perror("open");
		return 1;
	}

	for (;;) {
		// Call the `getdents64` syscall passing the buffer
		// to the kernel so that it can fill it with directory
		// entries.
		bytes_read = syscall(SYS_getdents64, fd, buf, BUF_SIZE);
		if (bytes_read == -1) {
			perror("SYS_getdents64");
			err = 1;
			break;
		}

		// 0 means that we reached the end of the directory.
		if (bytes_read == 0) {
			err = 0;
			break;
		}

		struct linux_dirent64* entry;

		// Given that the kernel filled our buffer with
		// `bytes_read` bytes worth of variable-length entries,
		// walk them by advancing `d_reclen` bytes at a time.
		for (long off = 0; off < bytes_read;) {
			entry = (struct linux_dirent64*)(buf + off);

			printf("entry: %s\n", entry->d_name);
			off += entry->d_reclen;
		}
	}
	}

	// Given that we're done with reading
	// from the file, close it to free the
	// underlying structures allocated for
	// it (and make further reads fail).
	close(fd);
	return err;
}

If you’re not very familiar with some of the Linux concepts presented above, a great book that covers them is The Linux Programming Interface.

I’ve based some of the explanations on it!

With that, we’re able to read a directory with pure C.

Given that the interface is the same for any filesystem, we can swap /proc for any other path and it should work the same way: as long as that path is a directory and we have the right permissions, done!
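For comparison, most programs never issue the raw syscall: they go through opendir(3) and readdir(3), which glibc builds on top of it. Here's a minimal sketch of the same listing done that way (list_directory is just a name I picked for this example):

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: prints every entry under `path` and
 * returns how many were printed, or -1 on failure.
 */
int list_directory(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *entry;
	int count = 0;

	if (dir == NULL)
		return -1;

	for (;;) {
		/*
		 * readdir() returns NULL both at the end of the
		 * directory and on error; only the latter sets errno.
		 */
		errno = 0;
		entry = readdir(dir);
		if (entry == NULL)
			break;

		printf("entry: %s\n", entry->d_name);
		count++;
	}

	if (errno != 0)
		count = -1;

	closedir(dir);
	return count;
}
```

Running list_directory("/proc") under strace should show the same openat and getdents calls we saw for ls.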

Under the hood of getdents

With the knowledge of what happens in userspace (getdents), it’s time to look at what happens under the hood, once getdents crosses into kernel space.

Remembering that filesystems need to implement the corresponding methods from the file_operations interface, we can guess that:

  1. there is a method in such interface for listing directory entries, and
  2. such method gets called by sys_getdents at some point.

The first point can be confirmed by looking at the interface itself:

/**
 * file_operations describe an interface that
 * filesystems must implement in order to handle
 * calls from syscalls that interact with filesystems.
 */
struct file_operations {
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	// ...

        // The method for iterating over the directory
        // entries under a given directory.
	int (*iterate) (struct file *, struct dir_context *);

        // A method that is similar to `iterate` but allows
        // multiple calls to be performed simultaneously.
	int (*iterate_shared) (struct file *, struct dir_context *);
	// ...
} __randomize_layout;

Now, regarding the second point (checking where in the sys_getdents path one of those two methods gets called), we can look at the syscall implementation at fs/readdir.c.

It starts, in sys_getdents, by verifying whether the task can perform such a call; then it goes to iterate_dir, which is all about offloading the iteration over directory entries to the implementor of the file_operations interface.

Looking at the actual Linux implementation, we can see how iterate_dir makes use of the underlying filesystem's implementation of iterate_shared (file->f_op->iterate_shared):

int iterate_dir(struct file *file, struct dir_context *ctx)
{
	struct inode *inode = file_inode(file);
	bool shared = false;
	int res = -ENOTDIR;

        // making sure that either the iterate_shared or
        // iterate methods of the file_operations interface
        // are implemented
	if (file->f_op->iterate_shared)
		shared = true;
	else if (!file->f_op->iterate)
		goto out;

	// ...

        // is the directory really there?
        // if so, hand the file to the underlying
        // filesystem implementation and let it 
        // iterate over the directory entries.
	res = -ENOENT;
	if (!IS_DEADDIR(inode)) {
		ctx->pos = file->f_pos;

                // In either case, let the filesystem
                // implementation do it.
		if (shared)
			res = file->f_op->iterate_shared(file, ctx);
		else
			res = file->f_op->iterate(file, ctx);

                // update the reading offset
                // ("file pointer")
		file->f_pos = ctx->pos;

                // notify that we accessed the file
		fsnotify_access(file);
		file_accessed(file);
	}
	
        // ...
}
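If function-pointer "interfaces" in C are new to you, the dispatch that iterate_dir performs can be mimicked in userspace. This is only a toy sketch: every toy_* name below is hypothetical, and the real kernel code also deals with locking, permissions and dead directories.

```c
#include <errno.h>
#include <stddef.h>

/* Toy version of the two directory-related methods. */
struct toy_file_operations {
	int (*iterate)(void *dir);
	int (*iterate_shared)(void *dir);
};

/*
 * Prefer `iterate_shared` when the "filesystem" provides it,
 * fall back to `iterate`, and report -ENOTDIR when neither
 * method is implemented.
 */
int toy_iterate_dir(const struct toy_file_operations *ops, void *dir)
{
	if (ops->iterate_shared != NULL)
		return ops->iterate_shared(dir);

	if (ops->iterate != NULL)
		return ops->iterate(dir);

	return -ENOTDIR;
}

/* Two sample implementations a "filesystem" could register. */
int toy_exclusive_readdir(void *dir)
{
	(void)dir;
	return 1;
}

int toy_shared_readdir(void *dir)
{
	(void)dir;
	return 2;
}
```

A filesystem "registers" itself simply by filling in the struct, which is the same thing procfs does with its file_operations definitions.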

How procfs handles getdents calls

Now, to find out where procfs implements either iterate_shared or iterate, we can go over to the procfs source code (at fs/proc) and search for the method name (iterate or iterate_shared):

ubuntu@bionic:~/linux/fs/proc$ ag iterate_shared
fd.c
269:	.iterate_shared	= proc_readfd,
353:	.iterate_shared	= proc_readfdinfo,

generic.c
306:	.iterate_shared	= proc_readdir,

root.c
190:	.iterate_shared	= proc_root_readdir,

proc_net.c
184:	.iterate_shared	= proc_tgid_net_readdir,

base.c
2215:	.iterate_shared	= proc_map_files_readdir,
2587:	.iterate_shared	= proc_attr_dir_readdir,
3009:	.iterate_shared	= proc_tgid_base_readdir,
3402:	.iterate_shared	= proc_tid_base_readdir,
3612:	.iterate_shared	= proc_task_readdir,

namespaces.c
143:	.iterate_shared	= proc_ns_dir_readdir,

proc_sysctl.c
847:	.iterate_shared	= proc_sys_readdir,

To narrow that list down, we can make use of funccount from iovisor/bcc to check which of those methods get called whenever we issue a call to getdents on /proc:

root@bionic:~# funccount 'proc_*readdir'
FUNC                                    COUNT
proc_readdir                                1
proc_root_readdir                           2
proc_pid_readdir                            2

Having narrowed our scope, now we can learn about those three functions.

From the naming, we can guess that proc_root_readdir is responsible for being the first to respond to a request to list all directory entries from /proc.

That guess can be confirmed by looking at how the root proc_dir_entry is set up:

/*
 * This is the root "inode" in the /proc tree..
 */
struct proc_dir_entry proc_root = {
	.low_ino	= PROC_ROOT_INO, 
	.namelen	= 5, 
	.mode		= S_IFDIR | S_IRUGO | S_IXUGO, 
	.nlink		= 2, 
	.count		= ATOMIC_INIT(1),
	.proc_iops	= &proc_root_inode_operations, 

        // Sets the implementation of the `file_operations`
        // interface to use.
	.proc_fops	= &proc_root_operations,
	.parent		= &proc_root,
	.subdir		= RB_ROOT_CACHED,
	.name		= "/proc",
};


/*
 * The root /proc directory is special, as it has the
 * <pid> directories. Thus we don't use the generic
 * directory handling functions for that..
 */
static const struct file_operations proc_root_operations = {
	.read		 = generic_read_dir,

        // Sets the method for iterating over directory entries.
	.iterate_shared	 = proc_root_readdir,
	.llseek		= generic_file_llseek,
};

/**
 * Starts the process of listing the entries from
 * `/proc`.
 */
static int proc_root_readdir(struct file *file, struct dir_context *ctx)
{
        // Check if we're still in the context of
        // listing non-PID entries.
	if (ctx->pos < FIRST_PROCESS_ENTRY) {
		int error = proc_readdir(file, ctx);
		if (unlikely(error <= 0))
			return error;
		ctx->pos = FIRST_PROCESS_ENTRY;
	}

        // Having already read the non-pid directory
        // entries (like `/proc/meminfo`), now go 
        // list the PIDs.
	return proc_pid_readdir(file, ctx);
}
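To make the role of ctx->pos more concrete, here's a toy userspace model of that two-phase listing. Everything in it is hypothetical (the toy_* names, the three static names, the three tgids and the constant's value), and the position encoding is simplified. It only illustrates how a position lets successive calls resume where the previous one stopped.

```c
#include <stdio.h>

/* Hypothetical values for illustration; not the kernel's data. */
#define TOY_FIRST_PROCESS_ENTRY 256
#define TOY_NSTATIC 3
#define TOY_NTGIDS 3

const char *toy_static_names[TOY_NSTATIC] = { "cpuinfo", "meminfo", "uptime" };
const int toy_tgids[TOY_NTGIDS] = { 1, 42, 1337 };

/*
 * Emits at most `budget` entries starting at position `*pos`,
 * then stores the position to resume from in `*pos`.
 * Returns the number of entries emitted (0 means "done").
 */
int toy_root_readdir(long *pos, int budget)
{
	int emitted = 0;

	/* Phase 1: static entries live at positions 0..TOY_NSTATIC-1. */
	while (*pos < TOY_NSTATIC && emitted < budget) {
		printf("entry: %s\n", toy_static_names[*pos]);
		(*pos)++;
		emitted++;
	}
	if (*pos < TOY_NSTATIC)
		return emitted; /* buffer "full"; resume here next call */

	/* Static phase finished: jump to where process entries begin. */
	if (*pos < TOY_FIRST_PROCESS_ENTRY)
		*pos = TOY_FIRST_PROCESS_ENTRY;

	/* Phase 2: a position here encodes tgid + TOY_FIRST_PROCESS_ENTRY. */
	for (int i = 0; i < TOY_NTGIDS && emitted < budget; i++) {
		if (toy_tgids[i] + TOY_FIRST_PROCESS_ENTRY < *pos)
			continue; /* already returned by an earlier call */

		printf("entry: %d\n", toy_tgids[i]);
		*pos = toy_tgids[i] + TOY_FIRST_PROCESS_ENTRY + 1;
		emitted++;
	}

	return emitted;
}
```

Calling it in a loop with a small budget (say, 2) plays the role of repeated getdents calls with a small buffer: each call picks up from the stored position until 0 entries come back.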

The first part (listing non-pid entries) doesn’t reveal a lot for us: it goes through the list of entries that have been registered via their corresponding proc_dir_entry structs.

The second part though (listing process directories) is the interesting one.

Before we get there, we need to review the differences between how the kernel looks at PIDs and threads compared to userspace.

Linux and its pids

While in userspace we’re accustomed to the term pid (process identifier), the kernel distinguishes it from the tgid (thread group id).

Whenever we create a process from userspace, it receives a PID.

Now, if this process creates a thread, what we see from userspace is that the thread shares that same PID.

USERSPACE:

         (pid=123)
        my_root_proc      .--> my_root_proc (pid=123)
           |              |
           *---> clone ---+
                          |
                          *->  my_root_proc (pid=123)

When it comes to kernel space though, a pid identifies a single task (one thread of execution), so the two now differ:

KERNEL:

         (pid=123)
        my_root_proc      .--> my_root_proc (pid=123)
           |              |
           *---> clone ---+
                          |
                          *->  my_root_proc (pid=124) 
                                (new pid!)

What unites them is the notion of a tgid (thread group id). This is a property shared by every thread in the group, so that we can keep track of who started the whole group:

KERNEL:

         (pid=123,tgid=123)
        my_root_proc      .--> my_root_proc (pid=123,tgid=123)
           |              |
           *---> clone ---+
                          |
                          *->  my_root_proc (pid=124,tgid=123)
                                (new pid!)
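From userspace, one way to peek at these two kernel-side ids is /proc/self/status, which reports them as Tgid and Pid. Here's a minimal sketch (read_status_field is a helper name I made up) that extracts a numeric field from that file:

```c
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the numeric value of `field`
 * (e.g. "Tgid" or "Pid") from /proc/self/status, or -1 if the
 * file can't be read or the field isn't found.
 */
long read_status_field(const char *field)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];
	long value = -1;
	size_t n = strlen(field);

	if (f == NULL)
		return -1;

	while (fgets(line, sizeof(line), f) != NULL) {
		/* Lines look like "Tgid:\t1234". */
		if (strncmp(line, field, n) == 0 && line[n] == ':') {
			if (sscanf(line + n + 1, "%ld", &value) != 1)
				value = -1;
			break;
		}
	}

	fclose(f);
	return value;
}
```

What userspace gets from getpid() is the kernel's tgid, while the kernel's pid is what the gettid syscall returns (syscall(SYS_gettid)). In a single-threaded program both match, so read_status_field("Tgid") and read_status_field("Pid") should give the same number there.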

With that in mind, let’s proceed.

How procfs lists process IDs

The whole implementation of process listing can be found at proc_pid_readdir:

/**
 * For the /proc/ directory itself, 
 * after non-process stuff has been done.
 */
int proc_pid_readdir(struct file *file, struct dir_context *ctx)
{
	// ...

        // Given that `/proc` listings are
        // namespaced (check out `ls /proc`
        // from within a docker container),
        // we start by grabbing the PID namespace
        // associated with this procfs mount
        // (stored in its superblock).
	struct pid_namespace *ns = file_inode(file)->i_sb->s_fs_info;

	// ... do some checks ...

        // Iterate over all thread group ids 
        // (`tgid`s), capturing the task struct
        // associated with them.
	for (iter = next_tgid(ns, iter);
	     iter.task;
	     iter.tgid += 1, iter = next_tgid(ns, iter)) {
		char name[PROC_NUMBUF];
		int len;

		cond_resched();
		if (!has_pid_permissions(ns, iter.task, HIDEPID_INVISIBLE))
			continue;

                // convert the tgid to a string that we can
                // return in the dir entries
		len = snprintf(name, sizeof(name), "%d", iter.tgid);
		ctx->pos = iter.tgid + TGID_OFFSET;
		if (!proc_fill_cache(file, ctx, name, len,
				     proc_pid_instantiate, iter.task, NULL)) {
			put_task_struct(iter.task);
			return 0;
		}
	}
	ctx->pos = PID_MAX_LIMIT + TGID_OFFSET;
	return 0;
}
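Seen from userspace, the result of that loop is simply the set of numeric names under /proc. Here's a small sketch that picks them out (is_pid_name and count_process_dirs are names I invented; it assumes procfs is mounted at /proc):

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

/* Returns 1 if `name` is made only of decimal digits, 0 otherwise. */
int is_pid_name(const char *name)
{
	if (*name == '\0')
		return 0;

	for (; *name != '\0'; name++) {
		if (!isdigit((unsigned char)*name))
			return 0;
	}

	return 1;
}

/*
 * Hypothetical helper: prints the numeric entries of /proc
 * (one per thread group) and returns how many were found,
 * or -1 if /proc can't be opened.
 */
int count_process_dirs(void)
{
	DIR *dir = opendir("/proc");
	struct dirent *entry;
	int count = 0;

	if (dir == NULL)
		return -1;

	while ((entry = readdir(dir)) != NULL) {
		if (is_pid_name(entry->d_name)) {
			printf("tgid: %s\n", entry->d_name);
			count++;
		}
	}

	closedir(dir);
	return count;
}
```

Since the kernel iterates over tgids, threads that are not thread group leaders don't show up in this listing; they live under /proc/<tgid>/task instead.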

If you’re curious about how the kernel is able to fill the task_struct that next_tgid returns, then make sure you stick with the Month of /proc!

In the next articles we’ll go further into what information we can grab from such tasks.

Closing thoughts

As I had never called a raw syscall from C before, it was a great exercise to learn how that’s done.

I had never really paid attention to how a process can list the contents of a given directory. It was pretty clear to me that it involved opening a file and then issuing a certain syscall, but really learning what happens under the hood was amazing.

I’d also like to point out how helpful bcc and bpftrace are for learning about how Linux works internally. Kudos to everyone involved!

If you have any further questions or would like to drop a comment, let me know! I’m cirowrc on Twitter.

Have a good one!

Resources