Hey,

This week I wanted to discover a bit more about how dmesg works under the hood. In the past, I had wanted alerting based on error messages showing up in dmesg, so I figured that by creating something that relies on the same machinery that dmesg relies on, I could understand it better.

Also, I knew that on some systems it was possible to gather the same information from kern.log, and that dmesg was all about reading “the kernel ring buffer” - but what did that mean?

More specifically, where was dmesg reading from?

In this blog post, I go through some exploration to better understand the whole picture.

You can jump to the tl;dr here.

Producing logs from kernel space

Whenever something in the Linux kernel (for instance, a loadable module) wants to do the equivalent of a printf, a similar utility can be used: printk.

That allows the code to give information back to the developer (or any other user) in userspace about what’s going on in kernel space - be it a warning message, a stack trace, or some pure debugging information.

To test out producing logs from the Kernel, I came up with some ideas:

  1. iptables has a LOG target that produces some logs from iptables itself whenever the target is, well, targeted. Given that iptables runs in kernel space (not the CLI, I mean - the actual packet filtering), we could see logs coming out of the kernel;
  2. writing a minimal kernel module shouldn’t be very hard - many people have done it before. I could write a printk("hello-world") and that’s it!
  3. checking if there’s something simpler that we could use to produce the logs (maybe the kernel exposes a standard interface for doing that?)

This being an exploration, why not try all three?

Producing logs from iptables

Although this blog post is not particularly about iptables, it serves as a good example of a case in which a user sitting in userspace cares about something happening deep in the kernel.

From iptables man page:

Iptables is used to set up, maintain, and inspect the tables of IP packet filter rules in the Linux kernel. Several different tables may be defined.

These rules can be either very simple (drop a packet if it comes from a given source) or very complicated chains of rules that depend on a bunch of conditionals.

Since it’s very useful to be able to visualize the packets that are hitting a specific chain of rules, the LOG target came into existence.

LOG: Turn on kernel logging of matching packets.

When this option is set for a rule, the Linux kernel will print some information on all matching packets (like most IP/IPv6 header fields) via the kernel log.

Here’s an example setup that lets us see it working:

# List the current state of the `filter` table.
# As we can see below, nothing is set up:
# - any packet arriving w/ our machine as the
#   destination is accepted;
# - any packet arriving that should be forwarded
#   is dropped;
# - any packet that is sent from our machine is
#   accepted (can go through).
iptables --table filter --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy DROP)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination


# Add a rule to the OUTPUT chain to log every packet
# that is destined towards 8.8.8.8 (google's public
# dns) 
iptables \
        --table filter \
        --insert OUTPUT \
        --jump LOG \
        --destination 8.8.8.8 \
        --log-prefix="[google-dns-out]"

# Check that the rule has been placed in the OUTPUT 
# chain of the `filter` table
iptables --table filter --list OUTPUT --numeric
Chain OUTPUT (policy ACCEPT)
target   prot opt source       destination
LOG      all  --  0.0.0.0/0    8.8.8.8        LOG flags 0 level 4 prefix "[google-dns-out]"


# Perform a DNS query targetting 8.8.8.8
dig example.com @8.8.8.8
... example.com.		18272	IN	A	93.184.216.34

# Verify the request packet in our logs (dmesg)
dmesg | grep google-dns-out
[66458.608136] [google-dns-out]IN= OUT=enp0s3 SRC=10.0.2.15 DST=8.8.8.8 LEN=80 TOS=0x00 PREC=0x00 TTL=64 ID=30602 PROTO=UDP SPT=39573 DPT=53 LEN=60

Producing logs from a Linux loadable kernel module

Not being a driver developer myself (honestly, I had never written a kernel module before), it was interesting to look for resources that would teach me how to accomplish this task.

Based on a guide from The Linux Kernel Labs, I came up with the following code for a very simple kernel module:

/**
 * `module.h` contains the most basic functionality needed for
 * us to create a loadable kernel module, including the `MODULE_*`
 * macros, `module_*` functions and including a bunch of other
 * relevant headers that provide useful functionality for us
 * (for instance, `printk`, which comes from `linux/printk.h`,
 * a header included by `linux/module.h`).
 */
#include <linux/module.h>

/**
 * Next, we make use of several macros to provide information
 * about the kernel module that we're creating.
 *
 * The information supplied here is visible through tools like
 * `modinfo`.
 *
 * Note: the license you choose here **DOES AFFECT** other things -
 * by using a proprietary license, your kernel will be "tainted".
 */
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Ciro S. Costa");
MODULE_DESCRIPTION("A hello-world printer");
MODULE_VERSION("0.1");

/** hello_init - initializes the module
 *
 * The `hello_init` method defines the procedure that performs
 * the setup of our module.
 */
static int
hello_init(void)
{
	// By making use of `printk` here (in the initialization),
	// we can look at `dmesg` and verify that what we log here
	// appears there at the moment that we load the module with
	// `insmod`.
	printk(KERN_INFO "HELLO-WORLD: Hello world!\n");
	return 0;
}

static void
hello_exit(void)
{
	// Similar to `init`, but for removal time.
	printk(KERN_INFO "HELLO-WORLD: Bye bye world!\n");
}

// registers the `hello_init` method as the method to run at module
// insertion time.
module_init(hello_init);

// similar, but for `removal`
module_exit(hello_exit);

P.S.: the source code is available at github.com/cirocosta/hello-world-lkm.

Build the module against your kernel headers, load it, and there you go:

# load the module (assuming it was built as `main.ko`)
sudo insmod main.ko
dmesg | grep HELLO-WORLD
[62076.224353] HELLO-WORLD: Hello world!

Producing logs from kmsg

A standard interface that we can use to insert messages into the printk buffer is exactly the same one that we can use to read messages from it: /dev/kmsg.

From the kernel documentation (https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg):

Injecting messages: Every write() to the opened device node places a log entry in the kernel’s printk buffer.

That is, whatever we write to /dev/kmsg in a single write (say, echo "haha") goes into the buffer as a log entry.

Let’s try it then:

# Generate the message from my current unprivileged
# user and then pipe it to a privileged `tee`, which is
# able to write into `/dev/kmsg` (yep, /dev/kmsg can be
# seen and read by a regular user, but writing to it
# requires more privileges).
echo "haha" | sudo tee "/dev/kmsg"


# Check that we can see our message is indeed there:
dmesg | grep haha
[67030.010334] haha

Seeing kernel logs from user space

Having already spoiled the blog post by mentioning kmsg in the previous section, there’s not much left to talk about now.

The kernel provides us /dev/kmsg in userspace: a special device that allows us to read from the ring buffer (designed to support multiple concurrent readers), and that’s what dmesg uses.

# Trace the syscalls `openat` and `read` to verify
# that it reads from `/dev/kmsg`
strace -e openat,read -f dmesg > /dev/null
openat(AT_FDCWD, "/dev/kmsg", O_RDONLY|O_NONBLOCK) = 3
read(3, "5,0,0,-;Linux version 4.15.0-34-"..., 8191) = 192
read(3, "6,1,0,-;Command line: BOOT_IMAGE"..., 8191) = 142
read(3, "6,2,0,-;KERNEL supported cpus:\n", 8191) = 31
read(3, "6,3,0,-;  Intel GenuineIntel\n", 8191) = 29

Going even deeper, whenever a read(2) syscall targeting that file is issued, the kernel internally triggers devkmsg_read (from kernel/printk/printk.c), which takes a message from the circular queue, formats it, and then sends it back to the user.

# We can use `stacksnoop.py` (from `iovisor/bcc`) to gather
# the stack trace from the execution of `devkmsg_read` and
# verify that a read(2) lands right there once `vfs_read`
# is called.
./stacksnoop.py -v devkmsg_read
TIME(s)            COMM         PID    CPU FUNCTION
1.875072956        tail         1565   0   devkmsg_read
	devkmsg_read
	vfs_read
	sys_read
	do_syscall_64
	entry_SYSCALL_64_after_hwframe

A similar path is followed when writing to the circular queue: whenever a write(2) is issued against /dev/kmsg, at some point devkmsg_write gets called.

devkmsg_write translates that call into the equivalent of a printk, which then takes the path towards log_store, the function that finally grabs free space in the queue and adds the log message:

# In one terminal (as root, since writing to /dev/kmsg
# requires privileges)
echo "haha" > /dev/kmsg

# In another terminal
./stacksnoop.py -v log_store
TIME(s)            COMM         PID    CPU FUNCTION
1.786375046        bash         1450   2   log_store
	log_store
	printk_emit
	devkmsg_write
	new_sync_write
	__vfs_write
	vfs_write
	sys_write
	do_syscall_64
	entry_SYSCALL_64_after_hwframe

Knowing that the kernel provides this interface, we can implement a simple program that continuously performs read(2)s against the device and then parses the messages that arrive.
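
As a sketch of that idea (a minimal Go program of my own - not the actual code behind dmesg or dmesg_exporter), a reader could look like the following, relying on the fact that each successful read(2) gives us back exactly one message:

package main

import (
	"fmt"
	"os"
)

func main() {
	// Open the device read-only; each read(2) against it
	// returns a single log entry (and blocks once we reach
	// the end of the buffer, waiting for new messages).
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 8192)
	for {
		n, err := f.Read(buf)
		if err != nil {
			panic(err)
		}

		// `buf[:n]` holds a single raw message, e.g.
		// `6,2,0,-;KERNEL supported cpus:`
		fmt.Print(string(buf[:n]))
	}
}

Running it (with enough privileges to read /dev/kmsg) should print the same entries that dmesg shows, in their raw format.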

Interpreting the kmsg messages

Knowing that every read(2) syscall performed against /dev/kmsg returns us a single message, we’re left with two tasks:

  • implementing a reader that continuously looks at /dev/kmsg and extracts the raw messages from there; as well as
  • implementing the parsing of those messages.

The latter is the most interesting one, imo.

Each message comes packed in the following format: a list of comma-separated info fields, then a semicolon, then the message itself.

//
//                     INFO                        MSG
//     .-------------------------------------.  .--------.
//     |                                     |  |        |
//      int    int        int       char          string
//     prefix, seq, timestamp_us, flag[,...] ;  <message>
//

There are four standard fields in the info section, leaving room for other fields in the future.
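
To make that concrete, here’s a simplified parsing sketch in Go (the names are mine; the real implementation lives in the repository linked at the end of this section), taking one of the raw lines we saw in the strace output above:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// message holds the standard fields of a single kmsg entry.
type message struct {
	prefix      int64  // syslog prefix (facility and priority)
	sequence    int64  // position in the sequence of messages
	timestampUs int64  // microseconds since boot
	flags       string // e.g. `-` (extra info fields are ignored)
	text        string // the log message itself
}

// parse breaks a raw kmsg line into the four standard info
// fields and the message, following the
// `prefix,seq,timestamp_us,flag[,...];<message>` format.
func parse(raw string) (*message, error) {
	parts := strings.SplitN(raw, ";", 2)
	if len(parts) != 2 {
		return nil, fmt.Errorf("malformed message: no `;` found")
	}

	info := strings.Split(parts[0], ",")
	if len(info) < 4 {
		return nil, fmt.Errorf(
			"expected at least 4 info fields, got %d", len(info))
	}

	prefix, err := strconv.ParseInt(info[0], 10, 64)
	if err != nil {
		return nil, err
	}

	seq, err := strconv.ParseInt(info[1], 10, 64)
	if err != nil {
		return nil, err
	}

	timestampUs, err := strconv.ParseInt(info[2], 10, 64)
	if err != nil {
		return nil, err
	}

	return &message{
		prefix:      prefix,
		sequence:    seq,
		timestampUs: timestampUs,
		flags:       info[3],
		text:        strings.TrimRight(parts[1], "\n"),
	}, nil
}

func main() {
	msg, err := parse("6,2,0,-;KERNEL supported cpus:\n")
	if err != nil {
		panic(err)
	}

	fmt.Printf("%+v\n", *msg)
}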

From the first of those fields - the prefix - we can extract two pieces of information: the priority and the facility.

// DecodePrefix extracts both priority and facility from a given
// syslog(2)-encoded prefix.
//
//        facility     priority
//     .-------------. .-------.
//     |             | |       |
//      7  6  5  4  3   2  1  0    bits
//
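
Translating that into code, the decoding boils down to a bit mask and a shift. Here’s a small Go sketch (the function and variable names here are mine, for illustration):

package main

import "fmt"

// decodePrefix extracts the priority (the lowest 3 bits) and
// the facility (the remaining high bits) from a syslog-encoded
// prefix.
func decodePrefix(prefix int64) (facility, priority int64) {
	priority = prefix & 0x7 // bits 0..2
	facility = prefix >> 3  // bits 3 and above
	return facility, priority
}

func main() {
	// a prefix of `6` (as in a raw line like `6,2,0,-;...`)
	// means facility 0 (kern) and priority 6 (info).
	facility, priority := decodePrefix(6)
	fmt.Println(facility, priority)
}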

With the rest of the fields, we’re able to know when the message was produced and which ID it carries in the sequence of messages that have been put into the buffer.

If you’re interested in seeing this parsing in the form of actual code, I wrote a Golang implementation that you can check here: github.com/cirocosta/dmesg_exporter/kmsg/kmsg.go.

Closing thoughts

It was great to have the opportunity to go deep into the stack and verify how components of the kernel can communicate with userspace applications.

That led me to create dmesg_exporter, a tool for exporting dmesg log metrics so that people can eventually alert on messages marked as “critical” arriving from a specific facility.

During the writing of this blog post, I made use of the following book: The Linux Programming Interface. Please make sure you check it out! It’s definitely one of the best books on Linux out there.

If you have any questions, or ideas for improvements, please let me know! I’m cirowrc on Twitter, and I’d love to hear from you!

Have a good one!