Hey,
Although many people who are accustomed to Linux are aware of the existence of /proc
and what some of its files can do, few understand what goes on behind the scenes to power such a filesystem (myself included before writing this article).
If you’ve been wondering about how /proc
works under the hood, stay tuned!
This is the first article in a series of 30 articles around procfs: A Month of /proc.
If you’d like to keep up to date with it, make sure you join the mailing list!
- What is procfs
- Contrasting Procfs with a regular filesystem
- Reading from and writing to procfs
- Translating a file read to an internal kernel method
- Getting /proc in your tree
- Procfs after VFS
- Closing thoughts
- Resources
What is procfs
Procfs
is a special virtual filesystem that can be mounted in your directory tree, allowing processes in userspace to read kernel information conveniently - using regular file I/O operations (like read(2)
and write(2)
).
process 123: how many files does proc 321
            | have open?
|
(userspace) *---> ls /proc/321/fd
\----+------/
| ^
------------------------------|----|------------
| |
.---<---* |
| |
kernel *-----------<------.
(kernelspace) | |
*--> list number of open file |
descriptors for proc `321` |
in the root namespace |
| |
*------------>-------'
there you go!
The “virtual” comes from the fact that there’s not really a block device (like a solid-state drive - SSD) serving the files that we access under the place where procfs is mounted (usually /proc
).
Instead, there’s just some code implementing the filesystem interface that gets called whenever you issue reads and writes against those particular locations.
For instance, when a user asks for the limits that apply to a given process, the following path gets followed under the hood:
cat /proc/13323/limits
(userspace) fd = open("/proc/13323/limits")
n = read(fd, buf, bufsize)
|
---------------------|--------------
vfs (common interface for interacting with
| any filesystem)
|
*-> who's responsible for this `/proc`
mount?
procfs! let it handle the call.
|
|
(kernelspace) *-> hey procfs, take this `read` call
for `/proc/13323/limits` please!
|
sure! <-------*
I'll write the response
to the file.
|
*---> linux/fs/proc/base.c#proc_pid_limits
for limit := range limits {
fprintf(file, limit)
}
Using a tracer like bcc’s trace.py, we can see the kernel stack when proc_pid_limits
gets called:
PID TID COMM FUNC
21450 21450 cat proc_pid_limits
proc_pid_limits+0x1
seq_read+0xe5
__vfs_read+0x1b
vfs_read+0x8e
sys_read+0x55
do_syscall_64+0x73
entry_SYSCALL_64_after_hwframe+0x3d
Contrasting Procfs with a regular filesystem
A nice way of seeing the difference between the two is to look at how the kernel paths compare.
Let’s say we have a file /myfile.txt
that lives on a disk that makes use of EXT4 as a filesystem.
If we were to read this file (making sure that it’s not cached), this is how it would look:
cat /myfile.txt
(userspace) fd = open("/myfile.txt")
n = read(fd, buf, bufsize)
|
---------------------|--------------
vfs (common interface for interacting with
| any filesystem)
|
*-> who's responsible for this `/` mount?
ext4! let it handle the call.
|
|
(kernelspace) *-> hey ext4, take this `read` call
for `/myfile.txt` please!
|
sure! <-------*
Oh, I know that this file exists in the disk!
Let me request the underlying block device driver
for it.
|
*---> hey whoever is in charge of /dev/sda1,
please hand me the contents of my file!
|
Oh, this is not ------*
in my cache; let me ask the disk for what
is in the blocks where this file exists.
We can see that in the case of the regular file, the read(2)
call ends up getting down to the block device driver that issues the read against a real device.
Using the same tracer as before, we can check that, differently from reading from /proc
, this time the path is much longer (it goes deep down to the actual blk_*
methods that handle block devices):
# Drop the caches so that the call ends up in
# a very low-level call to the block device
# driver.
echo "3" > /proc/sys/vm/drop_caches
# Perform a read
cat ./myfile.txt
# See the trace results:
PID TID COMM FUNC
28653 28653 cat blk_start_request
blk_start_request+0x1
scsi_request_fn+0xf5
__blk_run_queue+0x43
queue_unplugged+0x2a
blk_flush_plug_list+0x20a
blk_finish_plug+0x2c
__do_page_cache_readahead+0x1da
ondemand_readahead+0x11a
page_cache_sync_readahead+0x2e
generic_file_read_iter+0x7fb
ext4_file_read_iter+0x56
new_sync_read+0xe4
__vfs_read+0x29
vfs_read+0x8e
sys_read+0x55
do_syscall_64+0x73
entry_SYSCALL_64_after_hwframe+0x3d
Although they look very different after vfs_read
, everything feels the same for those consuming the vfs
interface.
If you’d like to refresh some concepts around Linux in general (including filesystems), I recommend reading The Linux Programming Interface (chapter 12 covers /proc
a bit, and chapter 14 is about filesystems!)
Reading from and writing to procfs
Not only is /proc
able to give you some introspection into the current state of a given process or of the system as a whole, it also lets you modify some of the system's behavior.
For instance, in the example above, we dropped the caches by performing a write(2)
operation against /proc/sys/vm/drop_caches
from the userspace.
To summarize:

- when it comes to read(2)
operations, procfs can be seen as an interface to introspect kernel data structures associated with either the whole system or a particular process; and
- when it comes to write(2)
, it can be used to change some kernel parameters at runtime.
Translating a file read to an internal kernel method
Coming back to the interaction between user space and kernel space, you might’ve noticed in the previous sections that there was a common layer sitting in front of both the EXT4 filesystem and procfs: the virtual filesystem (vfs
).
By having this layer sit between filesystem-related syscalls and the filesystems themselves, vfs
is able to present a consistent API while letting different implementations provide their functionality behind the scenes. User programs don't need to know which filesystem is under the hood.
The way that the Kernel does this translation is pretty nifty.
Here’s an overview of how a read(2)
from userspace gets down to a filesystem-specific implementation of a read:
fd = open("file", flags);
read(fd, buf, bufsize)
|
----------+-------------------------- (userspace boundary)
|
.-------*
|
*-> ksys_read(fd, buf, bufsize)
| .--> performs a file lookup,
| | gathering file information
| | from a given file descriptor (per-process)
| | to get a file description (system-wide)
| |
*-> struct fd f = fdget_pos(fd)
vfs_read(f.file, buf, bufsize) -> performs the actual
| read utilizing the info
| from the file gathered before.
.---------*
|
*-> f.file contains a pointer to a `file_operations` struct,
which can be thought of as an interface that specifies file
operations like `read`, `write`, etc
|
*--> depending on the mount, a specific implementation of
such interface is referenced there.
|
*--> f.file->f_op->read(...)
^ ^
| |
| *-- implementation
|
interface
Whenever a file is opened, the userspace program receives a reference to the open file - the “file descriptor”.
For instance, in the following program, a file descriptor is retrieved after opening the file, printed to stdout
, and then closed:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Compile with `gcc -Wall ./main.c`
//
// ps.: assumes `/tmp/file.txt` has been created before.
int main(int argc, char** argv) {
	int fd = open("/tmp/file.txt", O_RDONLY);
	if (fd == -1) {
		perror("open");
		return 1;
	}

	// result: fd=3
	printf("fd=%d\n", fd);

	close(fd);
	return 0;
}
This file descriptor points to an open file description, a system-wide entry in the kernel that holds a reference to the implementation of the file operations interface for the filesystem that the file resides in (see struct file
in include/linux/fs.h
), as well as keeping track of other details.
/**
* The file description that gets created when an
* `open(2)` is called from userspace.
*/
struct file {
// f_count keeps track of the number of references
// being held to this file.
atomic_long_t f_count;
// f_pos records the current file offset
// (a.k.a. file pointer)
loff_t f_pos;
// f_op contains a pointer to an implementation
// of the `file_operations` interface - a file
// operation table.
const struct file_operations *f_op;
// ... many more
}
/**
 * Interface for vfs to interact with.
 *
 * This is meant to be implemented by the filesystems
 * so that VFS can transparently interact with them.
*/
struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
// ...
} __randomize_layout;
To see that a file description (struct file
) gets created when an open
(or openat
) is issued, we can trace get_empty_filp
, the function that gets called to allocate a new object.
32648 32648 a.out get_empty_filp
get_empty_filp+0x1 # finds an unused file structure
# and returns a pointer to it
do_filp_open+0x9b
do_sys_open+0x1bb
sys_openat+0x14
do_syscall_64+0x73
entry_SYSCALL_64_after_hwframe+0x3d
After retrieving a file object, the kernel then starts initializing the struct, filling in the fields as needed.
Given that at this moment the kernel has already loaded the inode
related to this file into memory, and knowing that the inode
holds the pointer to the implementation of the file_operations
interface for the underlying filesystem, vfs
is then able to set the file operations accordingly, such that whenever further file operations come, the kernel just follows the pointers: f.file->f_op->read(...)
Now, if that file lived under an xfs
filesystem, pretty much the same would happen, except that the inode would be loaded for xfs
, which would have xfs
file operations, which would then be called when reading through f.file->f_op->read
(read
would now be an xfs
read).
In the case of procfs
, f.file->f_op->read
is a procfs
read.
If you’d like to know more about this area, make sure you get a copy of Understanding the Linux Kernel, 3rd Ed. You can get more insights into VFS from chapter 12 (The Virtual Filesystem).
Tip: To look around the Linux source code, check the Elixir Cross Referencer. It allows you to search references and find definitions around the code base across different Linux releases. Check out how linux#fs/open.c
looks.
Getting /proc
in your tree
Although in most modern Linux distributions procfs
is probably already mounted under /proc
, it’s possible that it is not.
In that case, mounting it requires only the necessary privileges and executing mount
with the right type (proc
):
# Mount the `proc` device under `/proc`
# with the filesystem type specified as
# `proc`.
#
# note.: you could mount procfs pretty much
# anywhere you want and specify any
# device name.
#
#
# .------------> fs type
# |
# | .------> device (dummy identifier
# | | in this case - anything)
# | |
# | | .-> location
# | | |
# | | |
mount -t proc proc /proc
Once the mount point is there, we can now access it:
# Search for the `meminfo` file in the
# procfs mountpoint
ls /proc | grep meminfo
meminfo
If you’d like to know more about mounting things in a directory tree, make sure you check out The Linux Programming Interface.
This is a great book to have - I find myself consulting it from time to time.
Procfs after VFS
Once we've got our /proc
mountpoint set up, we can start looking at what happens when we interact with it.
After understanding the functionality of vfs
(and how it can trigger the specific read
of a given filesystem by following the file->f_op->read
pointers), it’s a matter of looking at what the file operations implementation of procfs
looks like.
Differently from a regular filesystem (like ext4
), procfs
needs to set different handlers for different files, given that each file ends up executing a different method in the kernel.
Taking the example of reading from /proc/<pid>/limits
(from the beginning of the article) and a different file, like /proc/<pid>/wchan
, we can see how they differ:
@@ -1,4 +1,4 @@
- proc_pid_limits+0x1 [kernel]
+ proc_pid_wchan+0x1 [kernel]
seq_read+0xe5 [kernel]
__vfs_read+0x1b [kernel]
vfs_read+0x8e [kernel]
sys_read+0x55 [kernel]
do_syscall_64+0x73 [kernel]
entry_SYSCALL_64_after_hwframe+0x3d [kernel]
Now, what happens inside proc_pid_limits
, or what /proc/<pid>/limits
is all about … that’s something for another article!
Closing thoughts
It’s very interesting how flexible VFS ends up being.
It presents a consistent interface to applications while letting different filesystem implementations adapt themselves to that interface behind the scenes.
Coming from a Golang background, I found it pretty neat how the concept of an interface is applied in this kernel code (which is all in C, as you might’ve noticed).
In the following articles I’ll go on with exploring some files under /proc
, digging deep into what those methods are doing, so stay tuned!
If you have any questions or would like to drop some feedback for me, feel free to reach me on Twitter! I’m @cirowrc over there.
Have a good one!
Resources
Aside from regular man
pages, two books were referenced in the article (and used during the research):
- The Linux Programming Interface: Ch. 12 covering /proc
, and Ch. 14 on filesystems; and
- Understanding the Linux Kernel, 3rd Ed.: Ch. 12 on VFS (The Virtual Filesystem).