Hey,

I’ve been using BPF for quite a while in many circumstances, especially when debugging or trying to understand what’s going on with a system that I maintain.

Having gotten some experience with it, I feel like now is a good time to share some of those learnings, as well as to solidify my own thoughts and understanding of BPF, especially now that support for its latest features has been rolling out as distros update.

Here you’ll find some use cases that should bring some ideas of the “why” and the “what”, as well as a bit of the “how” (what mechanisms exist to allow BPF to do what we talked about).

ps.: the examples below are all based on Ubuntu Eoan (19.10), running Linux 5.3.0.

intro

For pretty much everything we do here at Pivotal (as far as I can tell, for most teams), we’re always writing code that’s supposed to run in userspace, relying on the kernel to provide that nice abstraction of the underlying hardware (which, in our case, is almost always virtualized), as well as to manage its resources.

    USERSPACE
            
            programs ..     (us !!)

    -----------------------

    KERNELSPACE

            kernel

    -----------------------

    HARDWARE (possibly virtualized)

While it’s really challenging to write the code that we’re used to writing, we’re very privileged when it comes to the safety of running in such a place - unless we do something terribly bad, we’re very far from the possibility of “panic"ing the whole machine: the operating system protects us from messing things up.

    USERSPACE PROGRAM                       KERNEL

    "give me 1PB of RAM!"
                                            "sure thing"

    starts trying to touch all of that
    space that it "received"

                                            ooh boy, I bet you're not going
                                            that far...


                    OOM KILLER GETS TRIGGERED
                                    end of story.

With that separation in place, it turns out that for certain things we might want to do (e.g., observing what’s going on underneath, or programming the kernel to behave in a way we’d like), it’s either completely impractical to do so from userspace, or simply impossible.

For a long time, for those use cases, you either had to use some quite complicated interfaces, write a kernel module, or perhaps even patch the kernel where a module couldn’t extend it.

Nowadays, however, in many cases you can rely on a much safer way of getting things done in the kernel - using BPF.

    USERSPACE

            program
                    
                    asks kernel to attach a program to certain events


    KERNELSPACE

            kernel
                    
                    attaches the user-defined code to certain events

                    ==> on event:

                            runs the user code

what it enables

To get a sense of what’s possible with it, I tried coming up with a few examples of use cases that illustrate its applications across several domains.

The essential thing to grasp before we move on is that BPF enables us to run our own code, in kernel space, at native speed, in a safe manner.

is my configuration file being read?

Let’s imagine that we have a machine which hosts an application, and we want to temporarily modify how its name resolution works.

To do so, let’s assume we want to achieve that in the simplest way possible - by tweaking /etc/hosts, the file-based static lookup table.

    BEFORE

            curl http://example.com

                    ==> "resolver", what's the address for `example.com`?
                                    
                                    <= it's 93.184.216.34!

                    ==> thx!
                    ...


    AFTER   

            echo '127.0.0.1 example.com' >> /etc/hosts

            curl http://example.com

                    ==> "resolver", what's the address for `example.com`?

                                    <= ooh, it's 127.0.0.1, have fun!

                    ==> thx!
                    ...

But, in addition to that, let’s also imagine that we’re not really sure whether such an application even looks at /etc/hosts in the first place.

    curl ...  does it even look at `/etc/hosts`?  

At this point, putting on that systems engineer hat for a while, we can start formulating some ideas:

    - well, if `curl` cares about `/etc/hosts`, it's gotta read its
      contents


            - if it's going to read the contents of a file, it needs to
              open that file in the first place


                    - if it's going to open a file, which might sit on any
                      particular type of hardware, it has to interact with
                      the kernel

                            - if it's interacting with the kernel, it's
                              making a syscall


    curl

            fd = openat(AT_FDCWD, "/etc/hosts", 0);
            ...                      |
                                     '---> what we're looking for!

            n = read(fd, buf, 4096);
            ...

Wouldn’t it be nice if you could … tap into that syscall and run a little piece of code there? That’s pretty much what BPF lets you do.

For instance, using opensnoop we could quickly figure out that indeed, /etc/hosts is opened by it.

    # ./opensnoop
    PID    COMM               FD ERR PATH
    5874   curl               -1   2 /home/ubuntu/.curlrc
    5874   curl                3   0 /etc/nsswitch.conf
    5874   curl                3   0 /etc/host.conf
    5874   curl                3   0 /etc/resolv.conf
    5874   curl                3   0 /etc/ld.so.cache

Just to recap, the reason why that is possible is essentially due to that basic property I talked about before:

“BPF enables us to react to events, running our own code in kernel space with native speeds in a safe manner.”

Breaking that sentence down, with opensnoop we end up with a program that:

    - reacts to an event (the entry of the `openat(2)` syscall)
    - runs our own code in kernel space (collecting the pid, the command,
      and the filename)
    - sends the results back to userspace, where they get displayed

What’s interesting to note is that because we control the code that gets run there, we could even filter out the filepaths right there, from within the kernel, to only display those times when /etc/hosts is opened (something we’ll do later on with bpftrace).

but strace can do that! what’s the point?

If you’re familiar with a tool like strace, this shouldn’t have impressed you, right? strace could do that!

And yeah, that’s totally true - a traditional approach like strace would be totally fine in many cases where all we want is to check which files are being opened, and this is something to be very aware of: while BPF is great for many things, on many occasions more traditional approaches are good enough. Do what works.

    strace -e openat curl http://example.com

    [pid  1774] openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3

But, to further motivate the case for BPF, what if you couldn’t target a specific process to trace?

What if you wanted to know about every process that tries to open /etc/hosts?

across the whole machine

Supposing that we’re now interested in every program that at some point tries to open our /etc/hosts file, strace becomes unsuitable - it needs to attach to each program that’s running in the system, capture its syscalls, and run some code:

    USERSPACE
            tracer
                    - lists all processes in the system
                    - for each process:
                            - trace the process (`ptrace(2)`)


            process1
                    --> issues syscall
                            --> gets stopped, until tracer decides
                                that it's fine to run

            tracer:
                    - let me see what syscall this proc just issued
                    - ok `process1`, you're good to continue now
                            - (`ptrace(2)` ...)



            ... for each proc

Attaching a userspace program that monitors each of these processes, effectively blocking the execution of the main program while strace does its thing, is veeery expensive [1].

For BPF, however, the story is very different - because we’re not running a userspace program, there’s no need for the kernel to cross boundaries when it comes to starting the execution of our custom tracing code, or finishing it (signaling that the other proc can continue).

In the case of BPF, we’re effectively adding native code to the kernel.

    KERNELSPACE

            KERNEL

                    sys_enter_openat
                            ==> bpf program
                                    - is this the file I care about?



    ./opensnoop
    PID    COMM      FD ERR PATH
    1576   curl       3   0 /etc/hosts
    1329   wget       5   0 /etc/hosts
    17358  httpie     3   0 /etc/hosts

Because we’re able to intercept every file open in a very cheap manner, going from “tracing 1 process” to “tracing ALL processes” is fine.

However, you might still not be convinced.

but, … you know … perf can do that!

Well, that’s completely true, once again!

Through the use of perf trace, one could capture all of the syscalls that are related to opening files, and figure out what’s going on [2]:

    perf trace -e 'open*'

    316.454 ( 0.029 ms): curl/6345 openat(dfd: CWD, filename: 0xca711aae, flags: RDONLY|CLOEXEC)         = 3
    316.519 ( 0.012 ms): curl/6345 openat(dfd: CWD, filename: 0xca71bdf0, flags: RDONLY|CLOEXEC)         = 3
    316.635 ( 0.012 ms): curl/6345 openat(dfd: CWD, filename: 0xca6d94e0, flags: RDONLY|CLOEXEC)         = 3
    316.745 ( 0.011 ms): curl/6345 openat(dfd: CWD, filename: 0xca6d99d0, flags: RDONLY|CLOEXEC)         = 3

And, just like with BPF, this can be done system-wide without major performance implications (it relies in part on the same mechanisms that BPF uses).

The difference here is in the capabilities that you have once you get to tailor your own code to solve your very specific case.

Let’s see an example where having that flexibility pays off, adding containers to the mix.

observing processes trying to open files in containers

Let’s imagine that, on this machine, we now not only want to figure out which processes are opening /etc/hosts, but also want to tell the name of the container that’s doing so.

    container1 ----------------
    |
    |  curl http://example.com
    |

    container2 ----------------
    |
    |  curl http://example.com
    |

            ==>


    TS      CONTAINER       COMM
    t0      container1      curl
    t1      container2      curl
    ...

BPF starts becoming more interesting as an option when just capturing an event is not enough - whenever some state needs to be persisted for further evaluation, more needs to be known about the current context of execution, or some form of computation needs to be performed, that’s where BPF really shines.

    - state?
    - context of execution?
    - computation?

The example of figuring out the container where that openat(2) is coming from is exactly a case of capturing the context of execution.

That’s because whenever a process performs a system call, the code that the kernel executes to serve that request runs within the context of a particular task - the kernel’s representation of the userspace thread that’s making the call (there’s a small sketch of grabbing that task right after the diagram below).

    USERSPACE

            program1        (thread1)
                |
                '----.
    ---------------openat(2)-----
                     |
    KERNELSPACE      |
                     '
            do_sys_open     (ctx => thread1)
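
To make that a bit more concrete, here’s a minimal BCC-style sketch (my own, purely illustrative - `trace_open` is just a name I picked, and the attaching would still be done from userspace via BCC) of the kernel-side C that grabs the task on whose behalf the syscall is being served:

    #include <uapi/linux/ptrace.h>
    #include <linux/sched.h>

    int
    trace_open(struct pt_regs* ctx)
    {
            /* the helper hands back the `task_struct` of the thread that
             * the kernel is currently serving - i.e., the userspace thread
             * that issued the syscall we attached to.
             */
            struct task_struct* task;

            task = (struct task_struct*)bpf_get_current_task();

            bpf_trace_printk("tgid=%d\n", task->tgid);
            return 0;
    }

Everything reachable from that task (like the namespace information we’ll get to in a moment) is then available to our program.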

Given that a container is, in essence, a process within a certain group of namespaces (and a few other things), as long as we can reach the information about which namespaces that process is associated with, that’s exactly the kind of thing BPF can look at.

Putting on our systems engineer hat once again, we can start wondering:

    - to open a file, a syscall needs to be issued

            - in the kernel, the serving of the syscall runs within the
              context of the thread that called it

                    - the thread is associated with a set of namespaces

                            - one of the namespaces is the UTS namespace,
                              which records the `hostname`
                              
                                    - containers usually have their names as
                                      the hostname

                                            - let's capture the hostname!

With that formulated, the problem is solved: tap into openat(2), and then collect the hostname from the UTS namespace associated with the context.

    KERNELSPACE

            KERNEL

                    sys_enter_openat
                            ==> bpf program
                                    - is this the file I care about?
                                    - oh, btw, what's the hostname being
                                      used under this proc's UTS namespace?

By being able to execute our custom code within the kernel, capturing the context of the thread that’s executing the syscall, we’re able to observe what’s going on with a lot of freedom.

packet processing

Despite us having covered how BPF can be used to observe what’s going on with regards to interactions with the operating system, BPF was originally essentially synonymous with packet filtering (you know, BPF … Berkeley PACKET FILTER …).

By being able to put BPF right in the path of network packets, we’re able to not only filter them, but also modify their internal representation (in certain cases).

Let’s see some examples.

observing packets flowing through an interface

If you’ve ever used tcpdump, it turns out that you’ve used BPF already.

    # 'ip and tcp'   show only TCP-based ipv4 packets
    # -n             don't try converting addresses to names
    #
    tcpdump -n 'ip and tcp'

Under the hood, the packet filters that we specify there for tcpdump end up being compiled to BPF programs (in the form of BPF bytecode), which the kernel then gets to “execute” for each packet travelling through the interface that the program attaches to, without having to copy every packet to userspace.

    USERSPACE

            tcpdump

                    -> generates BPF bytecode
                    -> "installs" it in the kernel


    KERNELSPACE

            kernel
                    for each packet
                            decide what to do based on the filter
                                    --> if of interest, "send to userspace".

For instance, let’s take a look at the bytecode generated for the filtering rule above:

    tcpdump -d 'ip and tcp'
    (000) ldh      [12]
    (001) jeq      #0x800           jt 2    jf 5
    (002) ldb      [23]
    (003) jeq      #0x6             jt 4    jf 5
    (004) ret      #262144
    (005) ret      #0

Although that might seem a bit intimidating at first, it’s actually quite straightforward once we dig into what it’s doing. And by being so simple, needing so little, it’s capable of running at an incredible packets-per-second rate.

            load half-word from the packet data at the offset 12 into the
            accumulator

    (000) ldh      [12]

            check if the what's there in the accumulator matches `0x800`,
            the identifier for ipv4 (ETH_P_IP) - if not, discard

    (001) jeq      #0x800           jt 2    jf 5

            
            load byte from the packet data at the offset 23 into the
            accumulator

    (002) ldb      [23]

            check if the value in the acc matches `0x6`, the identifier for
            `tcp` (IPPROTO_TCP) - if not, discard

    (003) jeq      #0x6             jt 4    jf 5

            match! accept the packet, keeping up to 262144 bytes of it

    (004) ret      #262144

            no match - discard the packet

    (005) ret      #0
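
By the way, if you’re wondering how a filter like that gets into the kernel outside of tcpdump: `tcpdump -dd` prints those same instructions as a C array, which we can hand to the kernel ourselves through a raw socket and `SO_ATTACH_FILTER` - the “classic BPF” path. A rough sketch of doing so (mine, not part of tcpdump; it needs root to create the raw socket):

    #include <stdio.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #include <linux/filter.h>
    #include <linux/if_ether.h>

    int
    main(void)
    {
            /* the very same instructions that `tcpdump -d 'ip and tcp'` showed
             * us, in the array form that `tcpdump -dd 'ip and tcp'` emits.
             */
            struct sock_filter code[] = {
                    { 0x28, 0, 0, 0x0000000c }, /* (000) ldh [12]               */
                    { 0x15, 0, 3, 0x00000800 }, /* (001) jeq #0x800   jt 2 jf 5 */
                    { 0x30, 0, 0, 0x00000017 }, /* (002) ldb [23]               */
                    { 0x15, 0, 1, 0x00000006 }, /* (003) jeq #0x6     jt 4 jf 5 */
                    { 0x06, 0, 0, 0x00040000 }, /* (004) ret #262144            */
                    { 0x06, 0, 0, 0x00000000 }, /* (005) ret #0                 */
            };

            struct sock_fprog prog = {
                    .len    = sizeof(code) / sizeof(code[0]),
                    .filter = code,
            };

            /* a raw socket that would otherwise see every packet on the machine */
            int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
            if (fd < 0) {
                    perror("socket");
                    return 1;
            }

            /* from here on, the kernel runs the filter for each packet, and only
             * those that match (tcp over ipv4) get delivered to this socket.
             */
            if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
                    perror("setsockopt");
                    return 1;
            }

            return 0;
    }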

Naturally, if you’d try to write that by hand, you’d need to know the structure of the packet very well. If you’re wondering where those offsets ([12] and [23]) and those lengths (a half-word and a byte) come from, the struct definitions below - and the little sanity check right after them - spell it out:

    #define ETH_ALEN        6               /* Octets in one ethernet addr   */

    struct ethhdr {
            unsigned char h_dest[ETH_ALEN];   /* destination eth addr */
            unsigned char h_source[ETH_ALEN]; /* source ether addr    */
            __be16        h_proto;            /* packet type ID field    */
    } __attribute__((packed));

    struct iphdr {
            __u8 ihl : 4, version : 4;
            __u8   tos;
            __be16 tot_len;
            __be16 id;
            __be16 frag_off;
            __u8   ttl;
            __u8   protocol;

            // ...
    }
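
If you want to double check that arithmetic, a tiny userspace C snippet against those same headers does the trick - the two 6-byte MAC addresses put h_proto at offset 12, and the 14-byte ethernet header plus the 9 bytes that precede protocol within the IP header land it at offset 23:

    #include <stddef.h>
    #include <stdio.h>

    #include <linux/if_ether.h>
    #include <linux/ip.h>

    int
    main(void)
    {
            /* h_proto comes right after the two 6-byte MAC addresses */
            printf("%zu\n", offsetof(struct ethhdr, h_proto)); /* 12 */

            /* protocol sits 9 bytes into the IP header, which itself starts
             * right after the 14-byte ethernet header
             */
            printf("%zu\n",
                   sizeof(struct ethhdr) + offsetof(struct iphdr, protocol)); /* 23 */

            return 0;
    }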

filtering

Although in the previous case we were only concerned with observing packets coming through our interface of interest, in cases like firewalls, where certain packets should be dropped according to certain rules, more than just observing is needed - taking an action is also part of the job.

                    "what should I do with
                       this packet?"

    ----packet--> iface  [classifier ] ----> go on! :)
                               |
                               '
                             drop!

For those cases, BPF also has you covered.

One way that we can achieve something that resembles a firewall is by hooking into traffic control (tc), the kernel’s packet scheduling subsystem.

Aside from all of the customizations that one can perform to tweak the way network resources are distributed in a system (through the various queueing disciplines it offers), tc lets us go for the ultimate customization: using BPF, having our own code decide what to do with each packet that either comes in (ingress) or goes out (egress).

For instance, either for ingress

       packet
         |
         '
    network iface
         |
         *----- skb ---> TC Ingress  --> ACTION --> continue to the
                            + cls_bpf       |        network stack
                                            |                                                
                                            *-> drop

or egress

    network stack --> skb --> TC Egress     --> ACTION --> continue to net
                               + cls_bpf           |       iface
                                                   |
                                                   *--> drop

we have the ability to load our BPF program into cls_bpf and have our own code make the decision of what to do with the packets that are coming in / out.
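
Just to give a feel for the shape of such a program, here’s a minimal cls_bpf sketch (my own, not a full firewall - it does nothing smarter than dropping packets larger than 1500 bytes) that returns one of tc’s action codes for every packet it sees:

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>

    #ifndef __section
    #define __section(NAME) __attribute__((section(NAME), used))
    #endif

    __section("classifier") int
    cls_main(struct __sk_buff* skb)
    {
            /* drop anything bigger than 1500 bytes; let everything else
             * continue through the stack.
             */
            if (skb->len > 1500)
                    return TC_ACT_SHOT;

            return TC_ACT_OK;
    }

    char __license[] __section("license") = "GPL";

Once compiled with clang’s BPF target, it can be attached with something along the lines of `tc qdisc add dev eth0 clsact` followed by `tc filter add dev eth0 egress bpf da obj prog.o sec classifier` (eth0 being whatever interface we care about).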

Another possibility would be to use eXpress Data Path (XDP), specially suited for packet processing on ingress.

Differently from tc, XDP hooks in right as packets come in from the network interface, before most of the kernel’s networking machinery gets involved:

    packet  --> net iface --> xdp context -->  ACTION --> continue to net
                                                 |        stack
                                                 |
                                                 *--> drop

As an example, let’s consider that we want to block any traffic destined to port 8080 over TCP on IPv4. Translating that to pseudo-code:

    fn classifier (pkt: Packet) {
            if pkt.network.protocol != IP {
                    return OK;
            }

            if pkt.transport.protocol != TCP {
                    return OK;
            }

            if pkt.port != 8080 {
                    return OK;
            }

            return DROP;
    }

Fortunately, we don’t need to write all of that in the form of bytecode - instead, we can craft a piece of C code that gets compiled down to BPF bytecode.

In a very verbose way, it’d look like this:

    int
    prog(struct xdp_md* ctx)
    {
            __be16         dst_port;
            __u32          off;
            struct ethhdr* eth;
            struct iphdr*  ip;
            struct tcphdr* tcp;

            void* data     = (void*)(long)ctx->data;
            void* data_end = (void*)(long)ctx->data_end;

            /**
             * as the ethernet header is the outermost header that encapsulates the
             * packet, we know that the first `sizeof(struct ethhdr)` bytes
             * correspond to the ethernet header.
             */
            eth = data;
            off = sizeof(struct ethhdr);

            /**
             * check whether the buffer has enough data to contain at least
             * the ethernet header in its full size (otherwise, we wouldn't
             * be able to safely read its contents).
             */
            if (data + off > data_end) {
                    bpf_trace_printk("info (pass): not enough data for ethhdr\n");
                    return XDP_PASS;
            }

            if (eth->h_proto != htons(ETH_P_IP)) {
                    bpf_trace_printk("info (pass): not ipv4 - %x\n",
                                     htons(eth->h_proto));
                    return XDP_PASS;
            }

            /**
             * being ip the very next thing after `eth`, we can just point to the
             * data at the offset that we've taken from `sizeof(struct ethhdr)`.
             */
            ip = data + off;
            off += sizeof(struct iphdr);
            if (data + off > data_end) {
                    bpf_trace_printk("info (pass): not enough data for iphdr\n");
                    return XDP_PASS;
            }

            if (ip->protocol != IPPROTO_TCP) {
                    bpf_trace_printk("info (pass): not tcp\n");
                    return XDP_PASS;
            }

            tcp = data + off;
            off += sizeof(struct tcphdr);

            if (data + off > data_end) {
                    bpf_trace_printk("info (pass): not enough data for tcphdr\n");
                    return XDP_PASS;
            }


            if (ntohs(tcp->dest) == 8080) {
                    bpf_trace_printk("info(drop): src=%d dst=%d\n", ntohs(tcp->source), ntohs(tcp->dest));
                    return XDP_DROP;
            }

            return XDP_PASS;
    }
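
To actually exercise it, one would typically compile this with clang targeting BPF (something like `clang -O2 -target bpf -c prog.c -o prog.o`) and then attach the resulting object to an interface - for instance, via iproute2 (`ip link set dev eth0 xdp obj prog.o sec <section>`) - but the exact loading mechanism (iproute2, libbpf, BCC, …) is a topic of its own.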

what about that system call observability stuff? how does it work internally?

Just like in the packet filtering case, where we get to write C and have it compiled down to bytecode instead of writing bytecode directly, we can do the same for observing syscalls.

In this space, bcc and bpftrace are the most prevalent.

The first provides you with a framework that helps with the whole process of going from C code to BPF, attaching the program to the right places, and interacting, from userspace, with the programs that sit in kernelspace.

For instance, consider the following “hello world” in Python leveraging BCC:

    #!/usr/bin/python

    from bcc import BPF

    code = """
    #include <uapi/linux/ptrace.h>

    int
    kprobe__sys_clone(struct pt_regs *ctx)
    {
            bpf_trace_printk("new thread!\\n");
            return 0;
    }
    """

    BPF(text=code).trace_print()

On the other hand, bpftrace provides a much quicker way of going from a question to an implementation that answers it.

For instance, here’s the equivalent of the example above - getting an event each time someone tries to create a thread:

    #!/snap/bin/bpftrace

    kprobe:__x64_sys_clone
    {
            printf("hello world!");
    }

Under the hood, bpftrace compiles its own DSL down to BPF programs (with the help of LLVM and BCC) and attaches them to the various BPF hook points.

Let’s try to take the examples that we mentioned before, and replicate them using bpftrace.

who’s reading /etc/hosts?

As mentioned before, to know who’s trying to do that, we can place our code right at the moment where the syscall is about to get served by the kernel.

Being there, we can capture the parameters passed to it, letting us use, for instance, the filename as a filter.

    #!/snap/bin/bpftrace

    BEGIN
    {
            printf("%-8s %s\n", "PID", "COMM");
    }

    tracepoint:syscalls:sys_enter_openat
    / str(args->filename) == "/etc/hosts" /
    {
            printf("%-8d %s\n", pid, comm);
    }

In the example above, str(args->filename) converts the const char* filename field into a string which we can use to perform string comparisons.

Then, in the case of matching /etc/hosts, we execute the code that prints out the pid of the process which issued the syscall, as well as comm, the name of the executable being run.

which container is reading /etc/hosts?

To get this information, we need to take a look at what is available for us in the context of execution of a system call.

    struct task_struct {
            /* Namespaces: */
            struct nsproxy* nsproxy;

            /* Executable name, excluding path */
            char comm[TASK_COMM_LEN];

            /* Effective (overridable) subjective task credentials (COW): */
            const struct cred __rcu* cred;


            // ... (a bunch of other fields)
    }

Having access to the contents of this struct, we’re then able to reach task->nsproxy, which contains the information we need to know more about the namespaces that this thread is part of:

    struct nsproxy {
            atomic_t                 count;
            struct uts_namespace*    uts_ns;
            struct ipc_namespace*    ipc_ns;
            struct mnt_namespace*    mnt_ns;
            struct pid_namespace*    pid_ns_for_children;
            struct net*              net_ns;
            struct cgroup_namespace* cgroup_ns;
    };

As I mentioned, UTS is the namespace that effectively isolates hostnames, and, given that container runtimes usually set a container’s hostname to its identifier, we can look further into what uts_namespace holds:

    struct uts_namespace {
            struct kref            kref;
            struct new_utsname     name;
            struct user_namespace* user_ns;
            struct ucounts*        ucounts;
            struct ns_common       ns;
    } __randomize_layout;

Looking at new_utsname, we get exactly what we want:

    #define __NEW_UTS_LEN 64

    struct new_utsname {
            char sysname[__NEW_UTS_LEN + 1];
            char nodename[__NEW_UTS_LEN + 1];
            char release[__NEW_UTS_LEN + 1];
            char version[__NEW_UTS_LEN + 1];
            char machine[__NEW_UTS_LEN + 1];
            char domainname[__NEW_UTS_LEN + 1];
    };

All that to say that if we want to discover a container’s hostname, we can follow the path task->nsproxy->uts_ns->name.nodename.

    #!/snap/bin/bpftrace

    #include <linux/sched.h>
    #include <linux/utsname.h>

    BEGIN
    {
            printf("%-10s %-8s\n", "CONTAINER", "COMM");
    }

    tracepoint:syscalls:sys_enter_openat
    / str(args->filename) == "/etc/hosts" /
    {
            $task = (struct task_struct*)curtask;
            $name = $task->nsproxy->uts_ns->name.nodename;

            printf("%s %s\n", $name, comm);
    }

To test it out, launch a container, install curl, make a request, then observe the results:

    Attaching 2 probes...
    CONTAINER  COMM
    9f4a0d06ce7a curl

  1. I wrote about how one can go about implementing, in C, an strace-like program that’s able to capture which syscalls are being executed by a process. You can see more about it here: https://ops.tips/gists/using-c-to-inspect-linux-syscalls/

  2. Yeah, it sucks that we can’t easily see the filename now. That’s because perf trace leverages the standard fmt definition for the openat tracepoint (you can find it under /sys/kernel/debug/tracing/events/syscalls/sys_enter_openat/format), which does not format filename as a string, but instead just as a hexadecimal value (filename: 0x%08lx).