Hey,

Although many people who are accustomed to Linux know that /proc exists and what some of its files can do, many lack an understanding of what goes on behind the scenes to power this filesystem (myself included, before writing this article).

If you’ve been wondering about how /proc works under the hood, stay tuned!

This is the first article in a series of 30 articles around procfs: A Month of /proc.

If you’d like to keep up to date with it, make sure you join the mailing list!

What is procfs

Procfs is a special virtual filesystem that can be mounted in your directory tree, allowing processes in userspace to read kernel information conveniently - using regular file I/O operations (like read(2) and write(2)).

           process 123: how many files does proc 321
                  |          have open?
                  | 
(userspace)       *---> ls /proc/321/fd
                         \----+------/
                              |    ^
------------------------------|----|------------
                              |    |
                      .---<---*    |
                      |            |
                     kernel        *-----------<------. 
(kernelspace)         |                               |
                      *--> list number of open file   |
                           descriptors for proc `321` |
                           in the root namespace      |
                                 |                    |
                                 *------------>-------'
                                     there you go! 
                                        

The “virtual” comes from the fact that there’s no block device (like a solid-state drive - SSD) backing the files that we can access under the place where procfs is mounted (usually /proc).

Instead, there’s just some code implementing the filesystem interface that gets called whenever you issue reads and writes against those particular locations.

For instance, when a user asks for the limits that apply to a given process, the following path gets followed under the hood:

        cat /proc/13323/limits

                
(userspace)     fd = open("/proc/13323/limits")
                n = read(fd, buf, bufsize)
                     |
---------------------|--------------
                    vfs (common interface for interacting with
                     |   any filesystem)
                     |
                     *-> who's responsible for this `/proc`
                         mount?
                         procfs! let it handle the call.
                          |
                          |
(kernelspace)             *-> hey procfs, take this `read` call
                              for `/proc/13323/limits` please!
                                 |
                   sure! <-------*
                   I'll write the response
                   to the file.
                     |
                     *---> linux/fs/proc/base.c#proc_pid_limits
                           for limit := range limits {
                                fprintf(file, limit)
                           }

Using a tracer like bcc’s trace.py, we can see the kernel stack that leads to proc_pid_limits being called:

PID     TID     COMM            FUNC
21450   21450   cat             proc_pid_limits
        proc_pid_limits+0x1 
        seq_read+0xe5 
        __vfs_read+0x1b 
        vfs_read+0x8e 
        sys_read+0x55 
        do_syscall_64+0x73 
        entry_SYSCALL_64_after_hwframe+0x3d 

Contrasting Procfs with a regular filesystem

A nice way of viewing the difference between the two is to look at how the kernel paths compare.

Let’s say we have a file /myfile.txt that lives on a disk that makes use of EXT4 as a filesystem.

If we were to read this file (making sure that it’s not cached), this is what it’d look like:

        cat /myfile.txt

                
(userspace)     fd = open("/myfile.txt")
                n = read(fd, buf, bufsize)
                     |
---------------------|--------------
                    vfs (common interface for interacting with
                     |   any filesystem)
                     |
                     *-> who's responsible for this `/` mount?
                         ext4! let it handle the call.
                          |
                          |
(kernelspace)             *-> hey ext4, take this `read` call
                              for `/myfile.txt` please!
                                 |
                   sure! <-------*
                   Oh, I know that this file exists in the disk!
                   Let me request the underlying block device driver
                   for it.
                     |
                     *---> hey whoever is in charge of /dev/sda1, 
                        please hand me the contents of my file!
                                |
          Oh, this is not ------*
          in my cache; let me ask the disk for what
          is in the blocks where this file exists.

We can see that in the case of the regular file, the read(2) call ends up getting down to the block device driver that issues the read against a real device.

Using the same tracer as before, we can check that, unlike when reading from /proc, this time the path is much longer (it goes all the way down to the actual blk_* functions that handle block devices):

# Drop the caches so that the call ends up in
# a very low-level call to the block device 
# driver.
echo "3" > /proc/sys/vm/drop_caches

# Perform a read
cat /myfile.txt

# See the trace results:
PID     TID     COMM            FUNC
28653   28653   cat             blk_start_request
        blk_start_request+0x1 
        scsi_request_fn+0xf5 
        __blk_run_queue+0x43 
        queue_unplugged+0x2a 
        blk_flush_plug_list+0x20a 
        blk_finish_plug+0x2c 
        __do_page_cache_readahead+0x1da 
        ondemand_readahead+0x11a 
        page_cache_sync_readahead+0x2e 
        generic_file_read_iter+0x7fb 
        ext4_file_read_iter+0x56 
        new_sync_read+0xe4 
        __vfs_read+0x29 
        vfs_read+0x8e 
        sys_read+0x55 
        do_syscall_64+0x73 
        entry_SYSCALL_64_after_hwframe+0x3d 

Although they look very different after vfs_read, everything feels the same for those consuming the vfs interface.

If you’d like to refresh some concepts around Linux in general (including filesystems), I recommend reading The Linux Programming Interface (chapter 12 covers /proc a bit, and chapter 14 is about filesystems!)

Reading from and writing to procfs

Besides giving you some introspection into the current state of a given process or of the system as a whole, /proc also lets you modify some of the behaviors of the system.

For instance, in the example above, we dropped the caches by performing a write(2) operation against /proc/sys/vm/drop_caches from userspace.

To summarize:

  • when it comes to read(2) operations, it can be seen as an interface to introspect kernel data structures associated with either the whole system or a particular process; and

  • when it comes to write(2), it can be used to change some kernel parameters at runtime.

Translating a file read to an internal kernel method

Coming back to the interaction between user space and kernel space, you might’ve noticed in the previous sections that there was a common thing sitting between the EXT4 filesystem and procfs: the virtual filesystem (vfs).

Figure: a read(2) being directed to ext4 or procfs via vfs

By sitting between every filesystem-related syscall and the filesystems themselves, vfs is able to present a consistent API while letting different implementations provide their functionality behind the scenes. User programs don’t need to know which filesystem is under the hood.

The way that the kernel does this translation is pretty nifty.

Here’s an overview of how a read(2) from userspace gets down to a filesystem-specific implementation of a read:

        fd = open("file", flags);
        read(fd, buf, bufsize)
           |
 ----------+-------------------------- (userspace boundary)
           |
  .-------*
  |
  *-> ksys_read(fd, buf, bufsize)
         |                        .--> performs a file lookup,
         |                        |    gathering file information
         |                        |    from a given file descriptor (per-process)
         |                        |    to get a file description (system-wide)
         |                        |
         *-> struct fd f = fdget_pos(fd) 
             vfs_read(f.file, buf, bufsize) -> performs the actual
                       |                     read utilizing the info
                       |                   from the file gathered before.
             .---------*
             |                   
             *-> f.file contains a pointer to a `file_operations` struct,
                which can be thought of as an interface that specifies file
                operations like `read`, `write`, etc
                |
                *--> depending on the mount, a specific implementation of
                     such interface is referenced there.
                     |
                     *--> f.file->f_op->read(...)
                                   ^     ^
                                   |     |
                                   |     *-- implementation
                                   |      
                                interface

Whenever a file is opened, the userspace program receives a reference to the open file - the “file descriptor”.

For instance, in the following program, a file descriptor is retrieved by opening the file, printed to stdout, and then closed:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Compile with `gcc -Wall ./main.c`
//
// ps.: assumes `/tmp/file.txt` has been created before.
int main (int argc, char** argv) {
	// open the file for reading only
	int fd = open("/tmp/file.txt", O_RDONLY);
	if (fd == -1) {
		perror("open");
		return 1;
	}

	// result: fd=3 (the lowest free descriptor when only
	// stdin, stdout and stderr are open)
	printf("fd=%d\n", fd);

	close(fd);

	return 0;
}

This file descriptor points to an open file description: a system-wide entry in the kernel that holds the implementation of the file operations interface for the filesystem where the file resides (see struct file in include/linux/fs.h), and that keeps track of other details as well.

/**
 * The file description that gets created when an
 * `open(2)` is called from userspace.
 */
struct file {
	// f_count keeps track of the number of references
        // being held for this file.
        atomic_long_t		f_count;

        // f_pos records the current file offset
        // (a.k.a. file pointer)
        loff_t	f_pos; 


        // f_op contains a pointer to an implementation
        // of the `file_operations` interface - a file
        // operation table.
	const struct file_operations *f_op;

	// ... many more
};


/**
 * Interface for vfs to interact with.
 * 
 * This is meant to be implemented by the filesystems
 * so that VFS can transparently interact with them.
 */
struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
	// ...
} __randomize_layout;

To see that a file description (struct file) gets created when an open (or openat) is issued, we can trace get_empty_filp, the function that gets called to allocate a new object.

32648   32648   a.out           get_empty_filp
        get_empty_filp+0x1      # finds an unused file structure 
                                # and returns a pointer to it
        do_filp_open+0x9b 
        do_sys_open+0x1bb 
        sys_openat+0x14 
        do_syscall_64+0x73 
        entry_SYSCALL_64_after_hwframe+0x3d 

After retrieving a file object, the kernel initializes the struct, filling in its fields as needed.

At this point the kernel has already loaded the inode related to this file into memory, and the inode holds the pointer to the file_operations implementation of the underlying filesystem. vfs can therefore set the file operations accordingly, so that whenever further file operations come in, it just follows the pointers: f.file->f_op->read(...)

Figure: what goes on behind the scenes when opening and reading a file in Linux

Now, if that file lived on an xfs filesystem, pretty much the same would happen, except that the inode would be loaded by xfs and would carry the xfs file operations, which would then be called when reading through f.file->f_op->read (read would now be an xfs read).

In the case of procfs, f.file->f_op->read is a procfs read.

If you’d like to know more about this area, make sure you get a copy of Understanding the Linux Kernel, 3rd Ed. You can get more insights into VFS from chapter 12 (The Virtual Filesystem).

Tip: To look around the Linux source code, check the Elixir Cross Referencer. It allows you to search references and find definitions around the code base across different Linux releases. Check out what linux#fs/open.c looks like.

Getting /proc in your tree

Although in most modern Linux distributions procfs is probably already mounted under /proc, it’s possible that it is not.

In that case, mounting it only requires the necessary privileges and running mount with the right type (proc):

# Mount the `proc` device under `/proc`
# with the filesystem type specified as
# `proc`.
#
# note.: you could mount procfs pretty much
#        anywhere you want and specify any
#        device name.
#
#
#         .------------> fs type
#         |    
#         |    .-------> device (dummy identifier
#         |    |         in this case - anything)
#         |    |
#         |    |     .-> location
#         |    |     |
#         |    |     |
mount -t proc proc /proc

Once the mount point is there, we can now access it:

# Search for the `meminfo` file in the
# procfs mountpoint
ls /proc | grep meminfo
meminfo

If you’d like to know more about mounting things in a directory tree, make sure you check out The Linux Programming Interface.

This is a great book to have - I consult it all the time.

Procfs after VFS

Once we have our /proc mountpoint set up, we can start looking at what happens when we interact with it.

After understanding the functionality of vfs (and how it can trigger the specific read of a given filesystem by following the file->f_op->read pointers), it’s a matter of looking at what the file operations implementation of procfs looks like.

Unlike a regular filesystem (like ext4), procfs needs to set different handlers for different files, given that reading each file ends up executing a different function in the kernel.

Figure: different procfs methods being used depending on the path accessed

Taking the example of reading from /proc/<pid>/limits (from the beginning of the article) and a different file, like /proc/<pid>/wchan, we can see how they differ:

@@ -1,4 +1,4 @@
-        proc_pid_limits+0x1 [kernel]
+        proc_pid_wchan+0x1 [kernel]
        seq_read+0xe5 [kernel]
        __vfs_read+0x1b [kernel]
        vfs_read+0x8e [kernel]
        sys_read+0x55 [kernel]
        do_syscall_64+0x73 [kernel]
        entry_SYSCALL_64_after_hwframe+0x3d [kernel]

Now, what happens inside proc_pid_limits, or what /proc/<pid>/limits is all about … that’s something for another article!

Closing thoughts

It’s very interesting how flexible VFS ends up being.

The way that it presents a consistent interface for applications, letting different filesystem implementations deal with adapting themselves to that interface, is quite elegant.

Coming from a Golang background, I found the way that the concept of an interface is applied in this kernel code (which is all in C, as you might’ve noticed) pretty neat.

In the following articles I’ll keep exploring some files under /proc, digging into what those methods are doing, so stay tuned!

If you have any questions or would like to drop some feedback for me, feel free to reach me on Twitter! I’m @cirowrc over there.

Have a good one!

Resources

Aside from regular man pages, two books were referenced in the article (and used during the research):