Hey,

The file covered in this article, /proc/<pid>/stack, is the one that motivated me to learn more about /proc and start The Month of Proc.

It’s such a useful file when you have no idea what state a given process is in. Yet, I’ve noticed that it’s not very well known by people getting started with Linux.

Illustration of how the process stack can be inspected

In this post, you’ll get to know how procfs gathers a process’ kernel stack trace, as well as get an idea of its usefulness.

This is the fifth article in a series of 30 articles around procfs: A Month of /proc.

If you’d like to keep up to date with it, make sure you join the mailing list!

A process that blocks

To kick things off, let’s start by crafting a process that blocks - a TCP server.

In its most simplistic form, we can have a single-threaded TCP server that just receives traffic on a given thread and then processes the results.

Illustration of TCP server blocking the main thread

Naturally, there are two places where we can see the server blocking: at the accept(2) phase, or at the read(2) phase - the first blocks until a client finishes the TCP handshake; the second, until data is available for reading.

Here’s what a simplistic implementation in C looks like, considering just the first blocking part (accepting):

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/**
 * server_listen takes care of setting up the
 * passive socket that receives incoming TCP
 * connections.
 *
 * Returns the file descriptor corresponding to
 * the passive socket, or -1 in the case of
 * error.
 *
 * See https://ops.tips/blog/a-tcp-server-in-c/#creating-a-socket
 * for a reference implementation.
 */
int
server_listen();

/**
 * server_accept_and_close accepts connections that
 * finish their TCP handshake through the provided
 * @listen_fd argument.
 *
 * Once the connection is `accept`ed, it gets closed
 * immediately.
 */
int
server_accept_and_close(int listen_fd)
{
	int                conn_fd;
	int                err;
	socklen_t          client_len;
	struct sockaddr_in client_addr;

	client_len = sizeof(client_addr);

	printf("Accepting connections ...\n");

	// Accept connections from the completed
	// connection queue (pops the latest conn
	// that successfully finished the 3-way
	// handshake).
	//
	// Given that the queue might be empty, it
	// waits (i.e., blocks) until there's a connection
	// there.
	err = (conn_fd = accept(
	         listen_fd, (struct sockaddr*)&client_addr, &client_len));
	if (err == -1) {
		perror("accept");
		return err;
	}

	printf("Client connected! Going to close now.\n");

	// Mark the connection as closed so that we
	// can get back to accepting new connections.
	err = close(conn_fd);
	if (err == -1) {
		perror("close");
		return err;
	}

	return 0;
}

/**
 * main execution - creates a passive socket
 * bound to a specific port that keeps receiving
 * incoming connections, accepting them and then
 * marking them as closed immediately.
 */
int
main(int argc, char** argv)
{
	int err;
	int listen_fd = server_listen();

	if (listen_fd == -1)
		return 1;

	for (;;) {
		err = server_accept_and_close(listen_fd);
		if (err == -1)
			return 1;
	}

	return 0;
}

If you’re not familiar with how a TCP server can be implemented in C, make sure you check A TCP Server in C for a complete walkthrough.

Run the server, and then see it blocking!

Viewing the process kernel stack trace

Once the server is blocked, we can jump into /proc, check out what’s going on in the Kernel, and figure out that the process is blocked on the accept syscall:

# Gather the in-kernel stack trace of
# the `accept.out` process (the simplistic
# TCP server that we're running).
cat /proc/$(pidof accept.out)/stack
[<0>] inet_csk_accept+0x246/0x380
[<0>] inet_accept+0x45/0x170
[<0>] SYSC_accept4+0xff/0x210
[<0>] SyS_accept+0x10/0x20
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff

Although it might look like a weird stack trace at first, the structure is very straightforward to reason about.

Each line represents a function in the call stack: the first part, that [<0>] thing, is the placeholder where the kernel address of the function would go, while the second part (do_syscall_64+0x73/0x130, for instance) is the symbol name with the corresponding offset.

If [<0>] looks weird (like, not an actual address at all), that’s because it’s meant to be that way.

When fs/proc/base.c#proc_pid_stack (the function that gets called by the /proc implementation of the virtual filesystem) iterates over the stack frames, we can see that it hardcodes [<0>] as the address to be displayed:

/**
 * Iterates through the kernel stack frames of
 * the current task, displaying the kernel 
 * addresses of each function, as well as
 * their symbol name and offset.
 */
static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
			  struct pid *pid, struct task_struct *task)
{
	struct stack_trace trace;
	unsigned long *entries;
	int err;
	int i;


	// Allocate some memory so that we can have
	// in our execution the whole stack of the
	// process (up to a given depth).
	//
	// The first argument to kmalloc is the size
	// of the block to be allocated.
	//
	// The second argument is the allocation flag.
	//
	// GFP_KERNEL means that the allocation is
	// performed on behalf of a process running in
	// kernel space - that is, the calling function
	// is executing a system call on behalf of a
	// process.
	//
	// Using GFP_KERNEL means that kmalloc can put the
	// current process to sleep waiting for a page when
	// called in low-memory situations.
	//
	// See https://elixir.bootlin.com/linux/v4.15/source/include/linux/gfp.h#L219
	entries = kmalloc(MAX_STACK_TRACE_DEPTH * sizeof(*entries), GFP_KERNEL);
	if (!entries)
		return -ENOMEM;


	// With the space properly allocated, we can
	// now prepare the `stack_trace` struct and
	// pass it down to the function that will
	// capture the trace for us.
	trace.nr_entries	= 0;
	trace.max_entries	= MAX_STACK_TRACE_DEPTH;
	trace.entries		= entries;
	trace.skip		= 0;


	// Make sure that we have mutual exclusion in
	// place.
	err = lock_trace(task);
	if (!err) {
		// Capture the stack trace!
		// https://elixir.bootlin.com/linux/v4.15/source/arch/x86/kernel/stacktrace.c#L69
		save_stack_trace_tsk(task, &trace);

		// Iterate over each frame captured.
		//
		// *************************
		// HERE is where the `[<0>]` is hardcoded.
		// *************************
		for (i = 0; i < trace.nr_entries; i++) {
			seq_printf(m, "[<0>] %pB\n", (void *)entries[i]);
		}
		unlock_trace(task);
	}
	kfree(entries);

	return err;
}

Looking at the file where the function is defined, we can git blame that seq_printf and see who’s to blame for putting that hardcoded [<0>] there.

Guess what - Torvalds himself made the change!

Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Nov 27 16:45:56 2017 -0800

    proc: don't report kernel addresses in /proc/<pid>/stack

    This just changes the file to report them as zero, although maybe even
    that could be removed.  I checked, and at least procps doesn't actually
    seem to parse the 'stack' file at all.

    And since the file doesn't necessarily even exist (it requires
    CONFIG_STACKTRACE), possibly other tools don't really use it either.

    That said, in case somebody parses it with tools, just having that zero
    there should keep such tools happy.

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 31934cb9dfc8..28fa85276eec 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -443,8 +443,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
                save_stack_trace_tsk(task, &trace);

                for (i = 0; i < trace.nr_entries; i++) {
-                       seq_printf(m, "[<%pK>] %pB\n",
-                                  (void *)entries[i], (void *)entries[i]);
+                       seq_printf(m, "[<0>] %pB\n", (void *)entries[i]);
                }
                unlock_trace(task);
        }

Not being a Kernel expert (at all!), I tried to understand what that %pB and %pK are all about - I had never used that kind of formatting with printf before, after all.

Looking at the docs for printk format specifiers, we can see what that very specialized formatting is all about:

The B specifier results in the symbol name with offsets and should be used when printing stack backtraces.

[The K specifier is used …] For printing kernel pointers which should be hidden from unprivileged users.

Meaning that yeah, previously you could retrieve the kernel addresses, but not anymore, for the reasons presented by Linus.
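
To make that concrete, here’s a minimal sketch (my own illustration, not code from the kernel tree - addr is just a stand-in for an address that points into some kernel function) of what each specifier produces:

// `addr` is assumed to hold an address pointing into a
// kernel function (e.g., somewhere inside do_syscall_64).

// %pK prints the raw kernel pointer - except that it
// gets censored (zeroed) for unprivileged readers.
printk("[<%pK>]\n", (void *)addr);

// %pB resolves the address to its symbol name plus the
// offset within it, e.g., "do_syscall_64+0x73/0x130".
printk("%pB\n", (void *)addr);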

When the stack does not help much

While it’s very clear why knowing the in-kernel stack trace was useful in the example above, it’s not as helpful when it comes to servers that make use of async IO (as most of the modern web servers do).

Here’s what code very similar to the TCP acceptor we wrote above in C looks like in Go:

package main

import (
	"net"
)

func main() {
	// Create the necessary underlying data
	// structures for listening on port 1337
	// on all interfaces.
	listener, err := net.Listen("tcp", ":1337")
	if err != nil {
		panic(err)
	}

	// Release all the resources when leaving.
	defer listener.Close()

	for {
		// Accept a connection from the
		// backlog of connections that
		// finalized the 3-way handshake.
		conn, err := listener.Accept()
		if err != nil {
			panic(err)
		}

		// Close it right away.
		conn.Close()
	}
}

Although in the code above we spawn no goroutines other than the main one, under the hood the Go runtime ends up setting up a single epoll instance (an event poll file descriptor) that allows it to monitor multiple file descriptors without blocking on them individually.

Illustration of the process of Epoll notifying events
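
Just to get a feel for what the runtime sets up for us, here’s a minimal C sketch of the epoll pattern (my own illustration, not the actual Go runtime code): register the descriptors you care about in a single epoll instance, then block in one place waiting for events on any of them:

#include <sys/epoll.h>

/**
 * wait_for_events blocks on a single epoll instance
 * that watches the provided listening socket, instead
 * of blocking on the socket directly.
 *
 * Returns the number of ready descriptors, or -1 in
 * the case of error.
 */
int
wait_for_events(int listen_fd)
{
	struct epoll_event ev;
	struct epoll_event events[16];
	int                epoll_fd;

	// Create the epoll instance - the single file
	// descriptor under which all the others get
	// registered.
	epoll_fd = epoll_create1(0);
	if (epoll_fd == -1)
		return -1;

	// Register interest in "readable" events on the
	// listening socket (for a passive socket, that
	// means a connection is ready to be accepted).
	ev.events  = EPOLLIN;
	ev.data.fd = listen_fd;
	if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listen_fd, &ev) == -1)
		return -1;

	// Block here (ep_poll, in kernel terms) until any
	// of the registered descriptors has an event -
	// a single wait point, no matter how many sockets
	// are being watched.
	return epoll_wait(epoll_fd, events, 16, -1);
}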

By the way, Julia Evans has a great blog post on async IO - make sure you check it out! It’s called Async IO on Linux: select, poll, and epoll.

We can notice that by looking at which syscall each thread is blocked on when our process runs:

# Display the stack of every thread that
# pertains to the `go_accept` command that
# is running.
find /proc/$(pidof go_accept)/task -name "stack" | \
        xargs -I{} /bin/sh -c 'echo {} ; cat {}'

/proc/17019/task/17019/stack
[<0>] ep_poll+0x29c/0x3a0
[<0>] SyS_epoll_pwait+0x19e/0x220
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff

/proc/17019/task/17020/stack
[<0>] futex_wait_queue_me+0xc4/0x120
[<0>] futex_wait+0x10a/0x250
[<0>] do_futex+0x325/0x500
[<0>] SyS_futex+0x13b/0x180
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff

/proc/17019/task/17021/stack
[<0>] futex_wait_queue_me+0xc4/0x120
...

/proc/17019/task/17022/stack
[<0>] futex_wait_queue_me+0xc4/0x120
...

/proc/17019/task/17023/stack
[<0>] futex_wait_queue_me+0xc4/0x120
...

/proc/17019/task/17024/stack
[<0>] futex_wait_queue_me+0xc4/0x120
...

Notice that, unlike in the C case, here I’m looking at the stack of each task under the task group identified by the PID of the go_accept command.

Given that Go will run more than one thread when it starts (so that goroutines can be scheduled across the pool of actual threads), we can take a look at the stack of each of those threads (in the end, each thread is a task, so we can take the stack of each of them).
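
If you’d rather do that walk programmatically, here’s a C sketch of what the find | xargs pipeline above does (assuming a pid supplied as a string and enough privileges to read the stack files):

#include <dirent.h>
#include <stdio.h>

/**
 * print_task_stacks walks /proc/<pid>/task, printing
 * the kernel stack of each task (thread) found under
 * the task group.
 */
void
print_task_stacks(const char *pid)
{
	char           path[256];
	char           line[512];
	struct dirent *entry;
	DIR *          dir;
	FILE *         file;

	snprintf(path, sizeof(path), "/proc/%s/task", pid);

	dir = opendir(path);
	if (dir == NULL) {
		perror("opendir");
		return;
	}

	// Each directory under /task is a task (thread)
	// that pertains to the thread group.
	while ((entry = readdir(dir)) != NULL) {
		if (entry->d_name[0] == '.')
			continue;

		snprintf(path, sizeof(path),
		         "/proc/%s/task/%s/stack",
		         pid, entry->d_name);

		file = fopen(path, "r");
		if (file == NULL)
			continue;

		printf("%s\n", path);
		while (fgets(line, sizeof(line), file) != NULL)
			fputs(line, stdout);
		printf("\n");

		fclose(file);
	}

	closedir(dir);
}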

If we use dlv, we can see why we have those five threads just waiting on a futex_wait while another one is blocked on ep_poll (the actual block on async IO):

# Attach to the currently running Go
# process with delve so that we can
# check from the userspace perspective
# what is going on with the Go runtime
dlv attach $(pidof go_accept)


# Check the thread pool
(dlv) threads
* Thread 17019 at .../sys_linux_amd64.s:671 runtime.epollwait
  Thread 17020 at .../sys_linux_amd64.s:532 runtime.futex
  Thread 17021 at .../sys_linux_amd64.s:532 runtime.futex
  Thread 17022 at .../sys_linux_amd64.s:532 runtime.futex
  Thread 17023 at .../sys_linux_amd64.s:532 runtime.futex
  Thread 17024 at .../sys_linux_amd64.s:532 runtime.futex


# Check the pool of goroutines
(dlv) goroutines
[4 goroutines]
  Goroutine 1 - ...netpoll.go:173 internal/poll.runtime_pollWait (0x427146)
  Goroutine 2 - ...proc.go:303 runtime.gopark (0x42c74b)
  Goroutine 3 - ...proc.go:303 runtime.gopark (0x42c74b)
  Goroutine 4 - ...proc.go:303 runtime.gopark (0x42c74b)


# Switch to goroutine 1
(dlv) goroutine 1


# See the userspace stack that got us
# to being parked at `epoll wait`
(dlv) stack
 0  0x000000000042c74b in runtime.gopark
    at /usr/local/go/src/runtime/proc.go:303
 1  0x0000000000427a99 in runtime.netpollblock
    at /usr/local/go/src/runtime/netpoll.go:366
 2  0x0000000000427146 in internal/poll.runtime_pollWait
    at /usr/local/go/src/runtime/netpoll.go:173
 3  0x000000000048e81a in internal/poll.(*pollDesc).wait
    at /usr/local/go/src/internal/poll/fd_poll_runtime.go:85
 4  0x000000000048e92d in internal/poll.(*pollDesc).waitRead
    at /usr/local/go/src/internal/poll/fd_poll_runtime.go:90
 5  0x000000000048fc20 in internal/poll.(*FD).Accept
    at /usr/local/go/src/internal/poll/fd_unix.go:384
 6  0x00000000004b6572 in net.(*netFD).accept
    at /usr/local/go/src/net/fd_unix.go:238
 7  0x00000000004c972e in net.(*TCPListener).accept
    at /usr/local/go/src/net/tcpsock_posix.go:139
 8  0x00000000004c86c7 in net.(*TCPListener).Accept
    at /usr/local/go/src/net/tcpsock.go:260
 9  0x00000000004d55f4 in main.main
    at /tmp/tcp/main.go:16
10  0x000000000042c367 in runtime.main
    at /usr/local/go/src/runtime/proc.go:201
11  0x0000000000456391 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1333

Having now both the userspace and kernelspace stacks, we can properly identify what’s going on with the Go application.

Closing thoughts

The conclusion? IMO, /proc/<pid>/stack (or the equivalent /proc/<pid>/task/<task_id>/stack) is great, but it only takes us so far.

In the end, we need a mix of userspace and kernel space tools that can help us understand the state a system is in.

Luckily, from time to time new tools like dlv show up, pprof improves, and even more powerful tools to inspect the Kernel emerge.

I hope this article was useful for you! Please let me know if you have any questions.

I’m @cirowrc on Twitter, and I’d love to chat!

Have a good one!