Hey,

In a Concourse installation that we run, we've recently been seeing a particular type of workload (Kubernetes in Docker in Concourse) being quite annoying - when asked to die (e.g., when the task finished), it'd just never do so!

It turns out that the underlying issue was that runc, the runtime responsible for running the containers that Concourse creates, can't kill those that have processes that are frozen (runc#2105).

Here's some of what I learned digging through how Linux deals with signals when it comes to killing a process, and how that differs from the case when a process is frozen.

delivering the sure kill signal

If you've ever wanted to forcibly kill a process, you've most likely reached for the classic kill -s SIGKILL $pid, right? Let's create an example where you'd be interested in doing that:

// main.c - a program that blocks all signals (that can be blocked) and sleeps.
//
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main (int argc, char **argv)
{
        int c = 0;

        // represent the set of signals that we want to affect.
        //
        sigset_t mask;


        // fill the set of signals with all the signals.
        //
        c = sigfillset(&mask);
        if (c == -1) {
                fprintf(stderr,
                        "failed to fill the mask including all signals: %s\n",
                        strerror(errno));
                exit(1);
        }

        // set the set of blocked signals to the signal mask.
        //
        // as we set the mask to all signals, it essentially means that we're
        // blocking all signals from now on.
        //
        c = sigprocmask(SIG_SETMASK, &mask, NULL);
        if (c == -1) {
                fprintf(stderr,
                        "failed to change the signal mask of the calling thread: %s\n",
                        strerror(errno));
                exit(1);
        }

        // sleep for a looong time.
        //
        sleep(10000);
}

In this case, because we're blocking the delivery of signals to the process, we can ensure that a simple SIGINT from ctrl+c won't kill it - the signal will remain pending, and thus the default action (termination) will never occur.

Unless we send a SIGKILL.

kill -s SIGKILL $(pgrep main)

Looking at the generation of the signal (/usr/bin/kill), we can see that it's simply asking the kernel to send it (via kill(2)):

$ strace -f kill -s SIGKILL $pid
kill(2293, SIGKILL)                     = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

But, what's going on underneath?

Before we get into killing, let's see what being killed looks like.

being killed, under the hood

Linux delivers signals to a process whenever it's about to switch from kernel mode to user mode on behalf of that process (i.e., when the process is about to get a time slice, or when a syscall is just about to return).

We can see how that's true by running the following example:

# sleep for 33 days in the background
# 
sleep 33d &


# send a sigkill to that proc
#
kill -s SIGKILL $!

If, before that signal gets sent, we place a kprobe on do_exit, we can get the call stack all the way back to what got sleep killed - the handling of a signal:

$ bpftrace -e 'kprobe:do_exit / comm == "sleep" / { @[kstack] = count() }'
@[ 
        do_exit+1659
        do_group_exit+67
        get_signal+302
        do_signal+52
        exit_to_usermode_loop+142
        do_syscall_64+240
        entry_SYSCALL_64_after_hwframe+68
]: 1

This already tells us some things:

  1. right before the kernel returns to executing user code, it checks for pending signals

  2. before the process even gets to continue its execution, it terminates (with a do_exit).

So, first, when the kernel is about to hand the CPU back to the process, it checks whether there are any pending signals that it should deal with.

static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
{
	while (true) {
		/* We have work to do. */

		// ...

		/* deal with pending signal delivery */
		if (cached_flags & _TIF_SIGPENDING)
			do_signal(regs);

		// ...

		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
			break;
	}
}

(from https://elixir.bootlin.com/linux/v5.0.1/source/arch/x86/entry/common.c#L138)

For each signal that it finds, it then deals with it.

void do_signal(struct pt_regs *regs)
{
	struct ksignal ksig;

	if (get_signal(&ksig)) {
		/* Whee! Actually deliver the signal.  */
		handle_signal(&ksig, regs);
		return;
	}

        // ...
}

(from https://elixir.bootlin.com/linux/v5.0.1/source/arch/x86/kernel/signal.c#L812)

While working out what to do with it, the kernel finds that, in the case of our SIGKILL, the process should die, and so it does.


bool get_signal(struct ksignal *ksig)
{
	struct sighand_struct *sighand = current->sighand;
	struct signal_struct *signal = current->signal;

        // ... 

        for (;;) {

		/*
		 * Death signals, no core dump.
		 */
		do_group_exit(ksig->info.si_signo);
        }

        // ...
}

(from https://elixir.bootlin.com/linux/v5.0.1/source/kernel/signal.c#L2381)

kill under the hood

Just like we did for the process being killed, we can also figure out what the kernel code path looks like for kill(1), the binary that's leveraging the kill(2) syscall to send the SIGKILL to sleep.

$ bpftrace -e 'kprobe:send_signal { printf("%s - %s\n", comm, kstack); }'

kill -
        send_signal+1
        group_send_sig_info+65
        kill_pid_info+57
        kill_something_info+278
        __x64_sys_kill+138
        do_syscall_64+90
        entry_SYSCALL_64_after_hwframe+68

Right after the syscall is invoked, we get to the place where the kernel checks whether the signal is a valid one, lets the audit system see it, and verifies that we have the permissions to send it - not only from a user / group perspective, but also according to any security modules installed.

/*
 * send signal info to all the members of a group
 */
int group_send_sig_info(int sig, struct kernel_siginfo *info,
			struct task_struct *p, enum pid_type type)
{
	int ret;

	rcu_read_lock();
	ret = check_kill_permission(sig, info, p);
	rcu_read_unlock();

	if (!ret && sig)
		ret = do_send_sig_info(sig, info, p, type);

	return ret;
}

(from https://elixir.bootlin.com/linux/v5.0.1/source/kernel/signal.c#L1360)

static int check_kill_permission(int sig, struct kernel_siginfo *info,
				 struct task_struct *t)
{
	int error;

	if (!valid_signal(sig)) return -EINVAL;

	if (!si_fromuser(info)) return 0;

	error = audit_signal_info(sig, t); /* Let audit system see the signal */
	if (error)
		return error;

	if (!same_thread_group(current, t) && !kill_ok_by_cred(t)) {
		// ...
        }

	return security_task_kill(t, info, sig, NULL);
}

(from https://elixir.bootlin.com/linux/v5.0.1/source/kernel/signal.c#L812)

static int send_signal(int sig, struct kernel_siginfo *info, struct task_struct *t,
			enum pid_type type)
{
	int from_ancestor_ns = 0;

#ifdef CONFIG_PID_NS
	from_ancestor_ns = si_fromuser(info) &&
			   !task_pid_nr_ns(current, task_active_pid_ns(t));
#endif

	return __send_signal(sig, info, t, type, from_ancestor_ns);
}

(from https://elixir.bootlin.com/linux/v5.0.1/source/kernel/signal.c#L1194)

Once all of the checks have passed, we can then move on to the interesting part: truly sending it.

static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struct *t,
			enum pid_type type, int from_ancestor_ns)
{
	// ...

	result = TRACE_SIGNAL_DELIVERED;

	if ((sig == SIGKILL) || (t->flags & PF_KTHREAD))
		goto out_set;

        // ...

out_set:
	signalfd_notify(t, sig);
	sigaddset(&pending->signal, sig);

	// ...

	complete_signal(sig, t, type);
ret:
	trace_signal_generate(sig, info, t, type != PIDTYPE_PID, result);
	return ret;
}

(from https://elixir.bootlin.com/linux/v5.0.1/source/kernel/signal.c#L1075)

Note how, at this point, we've added the signal to the set of pending signals, part of the task struct - yeah, that exact set that gets looked at right before the process is about to run, and whose signals then get acted upon.

But we can't stop right there: we need to remember that we're not dealing with just any signal - we're sending a SIGKILL here.

As such, being a fatal signal, it also wakes up the target so that it gets scheduled to run, and acts on the signal, as soon as it can.

static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
{
	struct signal_struct *signal = p->signal;
	struct task_struct *t;

	/*
	 * Now find a thread we can wake up to take the signal off the queue.
	 *
	 * If the main thread wants the signal, it gets first crack.
	 * Probably the least surprising to the average bear.
	 */
	if (wants_signal(sig, p))
		t = p;

        // ...

	/*
	 * The signal is already in the shared-pending queue.
	 * Tell the chosen thread to wake up and dequeue it.
	 */
	signal_wake_up(t, sig == SIGKILL);
	return;
}

(from https://elixir.bootlin.com/linux/v5.0.1/source/kernel/signal.c#L973)

And, naturally, as the kernel always wants SIGKILLs to be delivered (they can be neither caught nor blocked), wants_signal returns true for SIGKILL before it even looks at whether the task is stopped or being traced:

/*
 * Test if P wants to take SIG. 
 */
static inline bool wants_signal(int sig, struct task_struct *p)
{
	if (sigismember(&p->blocked, sig))
		return false;

	if (p->flags & PF_EXITING)
		return false;

	if (sig == SIGKILL)
		return true;

	if (task_is_stopped_or_traced(p))
		return false;

	return task_curr(p) || !signal_pending(p);
}

(from https://elixir.bootlin.com/linux/v5.0.1/source/kernel/signal.c#L956)

killing a frozen process

What's interesting to note from the above is that there's a very explicit condition when it comes to having a process killed: it has to get to run, as the signal is only acted upon when the process is on its way back to user mode.

However, when a process is frozen (e.g., because it got to the FROZEN state via the freezer cgroup, or because of a hibernation triggered by a device), the task switches to the UNINTERRUPTIBLE state, where it never gets the opportunity to be scheduled to run until it's switched to another state, meaning that it can't be killed until then.

Note that this is different from the case when a SIGSTOP is sent to a process, where the process is just put into a state that is still interruptible.