Using C to inspect Linux syscalls

Hey,

I came across a great article this week (Write yourself an strace in 70 lines of code) and decided to give it a try, going through every step carefully so that I’d understand each piece.

While writing the code, I came up with the following schematic of how everything works together:

Flow of execution of a Linux program that gets traced with PTRACE

Let’s dive into it.

Executing a new process with tracing enabled

By controlling the execution of a new process (being its parent) we can set this new process to have tracing enabled.

This is possible because of how process creation works in Linux: the parent calls fork, which produces a clone of itself (at that point there are two nearly identical processes); the child then performs some setup and switches its process image to a new one with exec.
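To make the fork-then-exec pattern concrete, here is a minimal sketch without any tracing. The helper name `spawn_and_wait` is made up for this example and is not part of the tracer; it forks, swaps the child's image with `execvp`, and has the parent collect the exit status with `waitpid`.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the original tracer): runs
 * argv[0] in a child process and returns its exit status, or -1
 * if anything went wrong along the way.
 */
int
spawn_and_wait(char* const argv[])
{
	int status;

	pid_t child = fork();
	if (child == -1) {
		return -1;
	}

	if (child == 0) {
		// Child: still a copy of the parent at this point. This
		// is the window in which a tracee would declare that it
		// wants to be traced, before switching its image.
		execvp(argv[0], argv);

		// Only reached if execvp failed.
		_exit(127);
	}

	// Parent: wait for the child to terminate.
	if (waitpid(child, &status, 0) == -1) {
		return -1;
	}

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with something like `{ "sh", "-c", "exit 3", NULL }` should hand back 3, assuming `sh` is on the PATH.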

Illustration of the Fork and exec flow for a process that wants to be traced

This means that in our main execution we end up with something along the following lines:

/**
 * Main execution - parses CLI arguments forming the runtime_t struct
 * and then starts the tracer by forking the execution and then
 * starting the main routine of the child (tracee) as well as the
 * tracing capabilities on the parent.
 */
int
main(int argc, char** argv)
{
	setbuf(stdout, NULL);

	if (argc < 2) {
		fprintf(stderr, "usage: %s prog args\n", argv[0]);
		return 1;
	}

	// Initialize the `runtime_t`, a custom structure
	// that takes care of parsing the CLI arguments.
	//
	// Check the repository to know more.
	runtime_t runtime = { 0 };
	runtime_init(&runtime, argc, argv);
	runtime_show(&runtime);

	// Create a new process such that we can spawn a new
	// execution (via `execvp`) via a process that has already
	// declared that it wants to be traced (PTRACE_TRACEME).
	//
	// In the other process (parent), we implement the tracing
	// capabilities that will inspect the child's registers and
	// interrupt its executions.
	pid_t child = fork();
	switch (child) {
		case 0:
			do_child(&runtime);
			break;
		case -1:
			perror("fork");
			fprintf(stderr, "failed to fork process\n");
			exit(1);
			break;
		default:
			do_trace(child);
			break;
	}

	runtime_destroy(&runtime);

	return 0;
}

With do_child running the child functionality and do_trace running the tracer (parent), we can see how the child enables tracing and then sends itself a SIGSTOP to stop its own execution:

/**
 * Main execution of the child process after
 * forking from the parent.
 *
 * Its `argc` and `argv` are a subset of the parent's
 * `argc` and `argv`, thus containing only the subprocess
 * arguments (e.g.: tracer myprog arg1 --> myprog arg1).
 */
int
do_child(runtime_t* runtime)
{
	int err = 0;

	// Explicitly tell that we want to get traced
	err = ptrace(PTRACE_TRACEME, 0, 0, 0);
	if (err == -1) {
		return err;
	}

	// Once traced, every signal delivered to this process
	// (except SIGKILL) stops it and gets reported to the
	// parent through waitpid, so signalling ourselves
	// notifies the parent.
	//
	// To tell the parent that we are ready, we send SIGSTOP
	// to ourselves, which also stops us until it resumes us.
	err = kill(getpid(), SIGSTOP);
	if (err == -1) {
		return err;
	}

	// ...
}

Meanwhile, the parent waits for the child to change its process state (to go from RUNNING to STOPPED or vice-versa) using waitpid.

Once the child process gets to the paused state (i.e., once the child raises SIGSTOP to itself), the parent can then resume execution of its main routine to set some ptrace options before it tells the child to wake up and continue its main routine.


/**
 * Main execution of the tracer.
 */
int
do_trace(pid_t child)
{
	int err = 0;
	int status;
	int syscall;
	int retval;

	// Wait until the child starts its main execution
	err = waitpid(child, &status, 0);
	if (err == -1) {
		return err;
	}

	// Set some options in regards to the child process.
	//
	// Here we also set ptrace option: PTRACE_O_TRACESYSGOOD.
	//
	// This option forces the kernel to set bit 7 in the signal
	// number such that we can know if that was a syscall trap
	// that we caught.
	err = ptrace(PTRACE_SETOPTIONS, child, 0, PTRACE_O_TRACESYSGOOD);
	if (err == -1) {
		return err;
	}

	// ...
}

At this moment the child is set to be traced and the parent can allow the child to resume its execution.
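The handshake described above can be tried in isolation. The following is a minimal sketch, not the tracer's actual code: `traceme_handshake` is a hypothetical helper, and the child simply exits at the point where a real tracee would call `execvp`. It returns the signal that the parent saw stopping the child, which should be SIGSTOP.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: forks a tracee that stops itself, sets
 * PTRACE_O_TRACESYSGOOD from the tracer and lets the tracee run
 * to completion. Returns the stop signal observed, or -1.
 */
int
traceme_handshake(void)
{
	int status;
	int stopsig;

	pid_t child = fork();
	if (child == -1) {
		return -1;
	}

	if (child == 0) {
		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1) {
			_exit(127);
		}
		// Stop ourselves; the parent notices through waitpid.
		kill(getpid(), SIGSTOP);
		// A real tracee would call execvp here.
		_exit(0);
	}

	// Wait for the child to stop itself.
	if (waitpid(child, &status, 0) == -1 || !WIFSTOPPED(status)) {
		return -1;
	}
	stopsig = WSTOPSIG(status);

	// The child is stopped, so this is the moment to set options
	// and then resume it without delivering the SIGSTOP again.
	if (ptrace(PTRACE_SETOPTIONS, child, NULL,
		   (void*)(long)PTRACE_O_TRACESYSGOOD) == -1 ||
	    ptrace(PTRACE_CONT, child, NULL, NULL) == -1) {
		kill(child, SIGKILL);
		waitpid(child, &status, 0);
		return -1;
	}

	// The child runs to its exit.
	if (waitpid(child, &status, 0) == -1 || !WIFEXITED(status)) {
		return -1;
	}

	return stopsig;
}
```

This sketch resumes the child with PTRACE_CONT only to keep it short; the tracer in this article resumes it with PTRACE_SYSCALL instead so that it stops again at the next syscall.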

Catching a child’s syscalls

Once the parent has finished setting the options needed to properly analyze the child’s behavior, it needs to tell the child to start running again, and then wait on the child’s PID so that it gets notified as soon as the child executes a syscall.

It does so by calling ptrace(2) with the PTRACE_SYSCALL request and then using waitpid to wait for the child’s next syscall (whenever the child enters or exits a syscall it gets stopped, and that change of process state is what the parent notices).

Parent process telling child process to resume using PTRACE_SYSCALL

Regarding code, we can wrap the logic in a function called wait_for_syscall which does what’s been described and also checks some information carried with the signal that tells the tracer what happened to the child:

/**
 * Keeps the child running until either entry to
 * or exit from a system call.
 */
int
wait_for_syscall(pid_t child)
{
	int status;
	int err = 0;

	while (1) {
		// Calling ptrace with PTRACE_SYSCALL makes
		// the tracee (child) continue its execution
		// and stop whenever there's a syscall being
		// executed (SIGTRAP is captured).
		err = ptrace(PTRACE_SYSCALL, child, 0, 0);
		if (err == -1) {
			return err;
		}

		// Wait for the tracee's next state change. When the
		// running tracee enters a ptrace-stop, its tracer is
		// notified through waitpid(2) (or one of the other
		// "wait" system calls).
		err = waitpid(child, &status, 0);
		if (err == -1) {
			return err;
		}

		// Ptrace-stopped tracees are reported as returns
		// with pid greater than 0 and WIFSTOPPED(status) true.
		//
		// -    WIFSTOPPED(status) returns true if the child
		//      process was stopped by delivery of a signal.
		//
		// -    WSTOPSIG(status) returns the number of the signal
		//      which caused the child to stop - should only be
		//      employed if WIFSTOPPED returned true.
		//
		//      by `and`ing with `0x80` we make sure that it was
		//      a stop due to the execution of a syscall (given
		//      that we set the PTRACE_O_TRACESYSGOOD option)
		if (WIFSTOPPED(status) && (WSTOPSIG(status) & 0x80)) {
			return 0;
		}

		// Check whether the child exited normally.
		if (WIFEXITED(status)) {
			return 1;
		}
	}

	return 0;
}
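The status checks inside wait_for_syscall can also be pulled into a small pure function, which makes them easier to reason about. This is only a sketch: `wait_event_t` and `classify_status` are names invented for this example, and it adds a WIFSIGNALED case (child killed by a signal) that the loop above does not look at.

```c
#include <signal.h>
#include <sys/wait.h>

// Hypothetical classification of what waitpid reported.
typedef enum {
	EVENT_SYSCALL_STOP, // stopped at a syscall entry or exit
	EVENT_SIGNAL_STOP,  // stopped because of some other signal
	EVENT_EXITED,       // terminated by calling exit
	EVENT_KILLED,       // terminated by a signal
	EVENT_OTHER,        // anything else (e.g. continued)
} wait_event_t;

// Decodes a status filled in by waitpid for a tracee that has
// PTRACE_O_TRACESYSGOOD set.
wait_event_t
classify_status(int status)
{
	if (WIFSTOPPED(status)) {
		// With PTRACE_O_TRACESYSGOOD, syscall stops show up
		// as SIGTRAP with bit 7 set, a value that no real
		// signal uses.
		if (WSTOPSIG(status) == (SIGTRAP | 0x80)) {
			return EVENT_SYSCALL_STOP;
		}
		return EVENT_SIGNAL_STOP;
	}

	if (WIFEXITED(status)) {
		return EVENT_EXITED;
	}

	if (WIFSIGNALED(status)) {
		return EVENT_KILLED;
	}

	return EVENT_OTHER;
}
```

Comparing against `SIGTRAP | 0x80` is a bit stricter than just testing bit 7, which is a design choice of this sketch rather than something the tracer needs.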

Then, to catch every syscall made in the child, just call wait_for_syscall in a loop:

	while (1) {
		err = wait_for_syscall(child);
		if (err != 0) {
			break;
		}

		syscall =
		  ptrace(PTRACE_PEEKUSER, child, sizeof(long) * ORIG_RAX);
		fprintf(stderr, "syscall(%d) = ", syscall);

		err = wait_for_syscall(child);
		if (err != 0) {
			break;
		}

		retval = ptrace(PTRACE_PEEKUSER, child, sizeof(long) * RAX);
		fprintf(stderr, "%d\n", retval);
	}

P.S.: we call wait_for_syscall twice per iteration because we get notified both on syscall entry (the start of the syscall’s execution) and on syscall exit (its completion).
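To see the entry/exit pairing in isolation, here is a minimal sketch that traces a child with a known syscall sequence and counts the stops. Everything in it is an assumption made for the example: `count_syscall_stops` is a hypothetical helper, and the child only issues one getpid syscall before exiting. The expectation is three stops: getpid entry, getpid exit, and the entry of exit_group (which never reaches its exit stop).

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: traces a child that performs a single
 * getpid syscall and then exits. Returns the number of syscall
 * stops seen by the tracer, or -1 on error.
 */
int
count_syscall_stops(void)
{
	int status;
	int stops = 0;

	pid_t child = fork();
	if (child == -1) {
		return -1;
	}

	if (child == 0) {
		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1) {
			_exit(127);
		}
		kill(getpid(), SIGSTOP);

		// The only syscalls issued after being resumed:
		syscall(SYS_getpid);
		_exit(0);
	}

	if (waitpid(child, &status, 0) == -1 || !WIFSTOPPED(status)) {
		return -1;
	}

	if (ptrace(PTRACE_SETOPTIONS, child, NULL,
		   (void*)(long)PTRACE_O_TRACESYSGOOD) == -1) {
		kill(child, SIGKILL);
		waitpid(child, &status, 0);
		return -1;
	}

	while (1) {
		// Resume until the next syscall entry or exit.
		if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) == -1 ||
		    waitpid(child, &status, 0) == -1) {
			return -1;
		}

		if (WIFEXITED(status) || WIFSIGNALED(status)) {
			break;
		}

		if (WIFSTOPPED(status) &&
		    WSTOPSIG(status) == (SIGTRAP | 0x80)) {
			stops++;
		}
	}

	return stops;
}
```

An odd count like this one is the reason the last syscall line of a trace can end up without a return value.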

That’s it!

Looking up syscalls by number

To test if this is indeed working, I created a simple program that just writes to stdout:

#include <unistd.h>

int
main()
{
	return write(1, "hello world", 11);
}

Nothing fancy, just a write(2) syscall execution.

Let’s run it then:

# Compile the code we just created
# with `-static` to produce a static binary.
gcc main.c -O2 -static -o main.out

# Compile the test case (the program that writes
# to stdout)
gcc case.c -O2 -static -o case.out

# Run the tracer with `case.out` as the tracee
./main.out ./case.out

syscall(59) = 0
syscall(12) = 31203328
syscall(12) = 31207872
syscall(158) = 0
syscall(63) = 0
syscall(89) = 47
syscall(12) = 31343040
syscall(12) = 31346688
syscall(21) = -2
syscall(1) = hello world11
syscall(231) =

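As we wanted, it prints syscall followed by the number and then, after the syscall exits, its return value. The hello world in the middle of the write line is the tracee’s own output landing on the same terminal, and exit_group shows no return value because it never returns.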

// Pick the syscall number by inspecting
// the values at the tracee's registers
// that were set at the moment that the
// user program was about to perform the 
// syscall.
syscall = ptrace(PTRACE_PEEKUSER, child, sizeof(long) * ORIG_RAX);
fprintf(stderr, "syscall(%d) = ", syscall);

wait_for_syscall(child);

// Pick the return value by inspecting
// the values at the tracee's registers
// once it returned from the syscall
retval = ptrace(PTRACE_PEEKUSER, child, sizeof(long) * RAX);
fprintf(stderr, "%d\n", retval);
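PTRACE_PEEKUSER reads one word at a time. As an alternative, PTRACE_GETREGS copies the whole register set in one call, which also gives access to the syscall arguments. The sketch below is x86-64 only and makes several assumptions of its own: `syscall_record_t`, `next_syscall_stop` and `trace_one_write` are hypothetical names, and the child writes 5 bytes to /dev/null so that the expected values are known. It keeps everything as `long` so that large return values (such as addresses) would not get truncated.

```c
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

#if !defined(__x86_64__)
#error "this sketch reads x86-64 registers"
#endif

// Hypothetical record of one traced syscall.
typedef struct {
	long nr;   // syscall number (orig_rax at entry)
	long arg0; // first argument (rdi at entry)
	long arg2; // third argument (rdx at entry)
	long ret;  // return value (rax at exit)
} syscall_record_t;

// Resumes the child until its next syscall stop. Returns 0 on a
// syscall stop, 1 if the child is gone and -1 on error.
static int
next_syscall_stop(pid_t child)
{
	int status;

	while (1) {
		if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) == -1 ||
		    waitpid(child, &status, 0) == -1) {
			return -1;
		}
		if (WIFSTOPPED(status) &&
		    WSTOPSIG(status) == (SIGTRAP | 0x80)) {
			return 0;
		}
		if (WIFEXITED(status) || WIFSIGNALED(status)) {
			return 1;
		}
	}
}

// Traces a child that writes 5 bytes to /dev/null and records the
// registers at the entry and exit of that write. Returns 0 or -1.
int
trace_one_write(syscall_record_t* out)
{
	struct user_regs_struct regs;
	int status;
	int rc = -1;

	pid_t child = fork();
	if (child == -1) {
		return -1;
	}

	if (child == 0) {
		// Open the file before stopping so that write is the
		// first syscall issued after the tracer resumes us.
		int fd = open("/dev/null", O_WRONLY);
		if (fd == -1 || ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1) {
			_exit(127);
		}
		kill(getpid(), SIGSTOP);
		_exit(write(fd, "hello", 5) == 5 ? 0 : 1);
	}

	if (waitpid(child, &status, 0) == -1 || !WIFSTOPPED(status)) {
		return -1;
	}

	if (ptrace(PTRACE_SETOPTIONS, child, NULL,
		   (void*)(long)PTRACE_O_TRACESYSGOOD) == 0 &&
	    next_syscall_stop(child) == 0 &&
	    ptrace(PTRACE_GETREGS, child, NULL, &regs) == 0) {
		// Syscall entry: number and arguments are in place.
		out->nr = (long)regs.orig_rax;
		out->arg0 = (long)regs.rdi;
		out->arg2 = (long)regs.rdx;

		if (next_syscall_stop(child) == 0 &&
		    ptrace(PTRACE_GETREGS, child, NULL, &regs) == 0) {
			// Syscall exit: rax now holds the return value.
			out->ret = (long)regs.rax;
			rc = 0;
		}
	}

	// Whatever happened, make sure the child is gone and reaped
	// (errors are ignored here on purpose).
	kill(child, SIGKILL);
	waitpid(child, &status, 0);
	return rc;
}
```

After a successful call, `nr` should be 1 (write on x86-64), `arg0` the file descriptor the child got for /dev/null, and both `arg2` and `ret` should be 5.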

The numbers in the trace output represent the actual syscalls that were performed.

We can perform a reverse lookup to determine what each number is by looking at /usr/include/x86_64-linux-gnu/asm/unistd_64.h:

#define __NR_execve 59
#define __NR_brk 12
#define __NR_arch_prctl 158
#define __NR_uname 63
#define __NR_readlink 89
#define __NR_access 21
#define __NR_write 1
#define __NR_exit_group 231

meaning that our program started with execve (to get the ./case.out program running), then did some glibc initialization, called write to print our message to stdout, and finished with exit_group.
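That reverse lookup could also live inside the tracer. Here is a minimal sketch that hardcodes only the x86-64 numbers listed above; `syscall_entry_t`, `syscall_table` and `syscall_name` are hypothetical names, and a real tool would want the full table generated from the header instead.

```c
#include <stddef.h>

// Hypothetical table entry: syscall number and its name.
typedef struct {
	long nr;
	const char* name;
} syscall_entry_t;

// Only the x86-64 syscalls that show up in the trace above.
static const syscall_entry_t syscall_table[] = {
	{ 1, "write" },        { 12, "brk" },
	{ 21, "access" },      { 59, "execve" },
	{ 63, "uname" },       { 89, "readlink" },
	{ 158, "arch_prctl" }, { 231, "exit_group" },
};

// Returns the name for a syscall number, or "unknown" if the
// number is not in the table.
const char*
syscall_name(long nr)
{
	size_t n = sizeof(syscall_table) / sizeof(syscall_table[0]);

	for (size_t i = 0; i < n; i++) {
		if (syscall_table[i].nr == nr) {
			return syscall_table[i].name;
		}
	}

	return "unknown";
}
```

With this in place, the tracer could print `syscall_name(syscall)` next to the raw number.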

Closing thoughts

It seems very useful to me to know that we can fully control the execution of the syscalls that our program performs, eventually injecting faults and tampering with results. That could come in handy in the context of chaos engineering.

One downside of the approach we took here is that we’re stopping for every single syscall, which might considerably slow down an application. To counter that, I plan to explore the use of SECCOMP together with ptrace, which would allow us to get in the middle of specific syscalls based on a filter that we provide.

I’m very grateful to Nelson, who put Write yourself an strace in 70 lines of code up. I had a lot of fun following that article and learning from it. Thanks!

Please let me know if you have any questions or spot something odd. I’d appreciate it!

Have a good one,

finis