Hey,
This week I wanted to discover a bit more about how dmesg works under the hood. In the past, I wanted to have alerting based on error messages showing up in dmesg, so maybe by building something that uses the same machinery that dmesg uses under the hood, I could understand it better.
Also, I knew that on some systems it was possible to gather the same information from kern.log, and that dmesg was all about reading "the kernel ring buffer" - but what did that mean? More specifically, where was dmesg reading from?
In this blog post, I go through some exploration to better understand the whole picture.
- Producing logs from kernel space
- Producing logs from iptables
- Producing logs from a Linux loadable kernel module
- Producing logs from kmsg
- Seeing kernel logs from user space
- Interpreting the kmsg messages
- Closing thoughts
You can jump to the tl;dr here.
Producing logs from kernel space
Whenever something in the Linux kernel (which could be a module) wants to do the equivalent of a printf, a similar utility (printk) can be used.
That allows the code to give information back to the developer (or any other user) in userspace about what's going on - be it a warning message, a stack trace, or some pure debugging information.
To test out producing logs from the kernel, I came up with some ideas:
- iptables has a LOG target that essentially produces some logs from iptables itself whenever the target is, well, targeted. Since iptables (not the cli, I mean - the actual thing) runs in kernel space, we could see logs coming out of the kernel;
- writing a minimal kernel module should not be very hard - many people have done that before. I could write a printk("hello-world") and that's it!
- check if there's something simpler that we could use to produce the logs (maybe the kernel exposes a standard interface for doing that?)
This being an exploration, why not try all of them?
Producing logs from iptables
Although this blog post is not particularly about iptables, we can use it as an example of something that goes on deep under the hood in the kernel but is still interesting to interact with from userspace.
From the iptables man page:
Iptables is used to set up, maintain, and inspect the tables of IP packet filter rules in the Linux kernel. Several different tables may be defined.
These rules can be either very simple things (drop a packet if it comes from a given source) or very complicated chains of rules that depend on a bunch of conditionals.
Acknowledging that it's very useful to be able to visualize the packets that are hitting a specific chain of rules, the LOG target came into existence.
LOG: Turn on kernel logging of matching packets.
When this option is set for a rule, the Linux kernel will print some information on all matching packets (like most IP/IPv6 header fields) via the kernel log.
Here’s an example set up that allows us to see that working:
# List the current state of the `filter` table.
# As we can see below, nothing is set up:
# - any packet arriving w/ our machine as the
# destination is accepted;
# - any packet arriving that should be forwarded
# is dropped;
# - any packet that is sent from our machine is
# accepted (can go through).
iptables --table filter --list
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy DROP)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
# Add a rule to the OUTPUT chain to log every packet
# that is destined towards 8.8.8.8 (google's public
# dns)
iptables \
--table filter \
--insert OUTPUT \
--jump LOG \
--destination 8.8.8.8 \
--log-prefix="[google-dns-out]"
# Check that the rule has been placed in the OUTPUT
# chain of the `filter` table
iptables --table filter --list OUTPUT --numeric
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
LOG all -- 0.0.0.0/0 8.8.8.8 LOG flags 0 level 4 prefix "[google-dns-out]"
# Perform a DNS query targeting 8.8.8.8
dig example.com @8.8.8.8
... example.com. 18272 IN A 93.184.216.34
# Verify the request packet in our logs (dmesg)
dmesg | grep google-dns-out
[66458.608136] [google-dns-out]IN= OUT=enp0s3 SRC=10.0.2.15 DST=8.8.8.8 LEN=80 TOS=0x00 PREC=0x00 TTL=64 ID=30602 PROTO=UDP SPT=39573 DPT=53 LEN=60
Producing logs from a Linux loadable kernel module
Not being a driver developer myself (honestly, I had never written a kernel module before), I found it interesting to look for resources that would teach me how to accomplish this task.
Based on a guide from The Linux Kernel Labs, I came up with the following code for a very simple kernel module:
/**
* `module.h` contains the most basic functionality needed for
* us to create a loadable kernel module, including the `MODULE_*`
* macros, `module_*` functions and including a bunch of other
* relevant headers that provide useful functionality for us
* (for instance, `printk`, which comes from `linux/printk.h`,
* a header included by `linux/module.h`).
*/
#include <linux/module.h>
/**
* Following, we make use of several macros to properly provide
* information about the kernel module that we're creating.
*
* The information supplied here is visible through tools like
* `modinfo`.
*
* Note: the license you choose here **DOES AFFECT** other things -
* by using a proprietary license your kernel will be "tainted".
*/
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Ciro S. Costa");
MODULE_DESCRIPTION("A hello-world printer");
MODULE_VERSION("0.1");
/** hello_init - initializes the module
*
* The `hello_init` method defines the procedure that performs the setup
* of our module.
*/
static int
hello_init(void)
{
// By making use of `printk` here (in the initialization),
// we can look at `dmesg` and verify that what we log here
// appears there at the moment that we load the module with
// `insmod`.
printk(KERN_INFO "HELLO-WORLD: Hello world!\n");
return 0;
}
static void
hello_exit(void)
{
// similar to `init`, but for the removal time.
printk(KERN_INFO "HELLO-WORLD: Bye bye world!\n");
}
// registers the `hello_init` method as the method to run at module
// insertion time.
module_init(hello_init);
// similar, but for `removal`
module_exit(hello_exit);
ps.: the source code is available at github.com/cirocosta/hello-world-lkm
Load the module, and there you go:
dmesg | grep HELLO-WORLD
[62076.224353] HELLO-WORLD: Hello world!
Producing logs from kmsg
https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg
A standard interface that we can use to insert messages into the printk buffer is exactly the same one that we can use to read messages from it: /dev/kmsg.
From the kernel documentation:
Injecting messages: Every write() to the opened device node places a log entry in the kernel’s printk buffer.
That is, if we echo "haha" to /dev/kmsg as a single write, it goes into the buffer.
Let’s try it then:
# Generate the message from my current unprivileged
# user and then pipe to the privileged `tee` which is
# able to write into `/dev/kmsg` (yep, /dev/kmsg can be
# seen and read, but to write you need more privileges)
echo "haha" | sudo tee "/dev/kmsg"
# Check that we can see our message is indeed there:
dmesg | grep haha
[67030.010334] haha
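Just to see the same thing from code instead of the shell, here's a tiny Go program (a sketch of mine, not part of any existing tool) that performs that single write(2). As a bonus, the dev-kmsg documentation mentions that the line can be prefixed with a <N> syslog value, which lets us pick the priority and facility explicitly - something we'll get back to when interpreting the messages:

package main

import (
	"fmt"
	"os"
)

func main() {
	// writing requires extra privileges, so this needs to run as root.
	f, err := os.OpenFile("/dev/kmsg", os.O_WRONLY, 0)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// "<14>" is the optional syslog prefix: facility 1 (user) shifted
	// left by 3 bits, plus priority 6 (info). Without it, the kernel
	// assigns defaults.
	if _, err := fmt.Fprint(f, "<14>haha, this time from Go\n"); err != nil {
		panic(err)
	}
}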
Seeing kernel logs from user space
Having already spoiled the blog post by mentioning kmsg in the previous section, there's not much left to cover now.
The kernel exposes /dev/kmsg to userspace, a special device that allows us to read from the ring buffer (with multitenancy in mind - each reader gets its own independent view), and that's what dmesg uses.
# Trace the syscalls `openat` and `read` to verify
# that it reads from `/dev/kmsg`
strace -e openat,read -f dmesg > /dev/null
openat(AT_FDCWD, "/dev/kmsg", O_RDONLY|O_NONBLOCK) = 3
read(3, "5,0,0,-;Linux version 4.15.0-34-"..., 8191) = 192
read(3, "6,1,0,-;Command line: BOOT_IMAGE"..., 8191) = 142
read(3, "6,2,0,-;KERNEL supported cpus:\n", 8191) = 31
read(3, "6,3,0,-; Intel GenuineIntel\n", 8191) = 29
Going even deeper: whenever a read(2) syscall is issued targeting that file, the kernel internally triggers devkmsg_read (under kernel/printk), which takes a message from the circular queue, formats it, and then sends it back to the user.
# We can use `iovisor/bcc#examples/trace/stacksnoop.py` to
# gather the stack trace from the execution of `devkmsg_read`
# and verify that the `read` ends up right there once `vfs_read`
# is called.
./stacksnoop.py -v devkmsg_read
TIME(s) COMM PID CPU FUNCTION
1.875072956 tail 1565 0 devkmsg_read
devkmsg_read
vfs_read
sys_read
do_syscall_64
entry_SYSCALL_64_after_hwframe
A similar path is followed when writing to the circular queue: whenever a write(2) is issued, at some point devkmsg_write is called.
devkmsg_write translates such a call into the equivalent of a printk, which eventually reaches log_store, the function that finally takes a free slot from the queue and adds the log message:
# In one terminal (as root, given that writing needs extra privileges)
echo "haha" > /dev/kmsg
# In another terminal
./stacksnoop.py -v log_store
TIME(s) COMM PID CPU FUNCTION
1.786375046 bash 1450 2 log_store
log_store
printk_emit
devkmsg_write
new_sync_write
__vfs_write
vfs_write
sys_write
do_syscall_64
entry_SYSCALL_64_after_hwframe
Knowing that the kernel provides this interface, we can implement a simple program that constantly performs read(2)s against this device and then parses the messages that arrive.
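As a starting point, here's a rough sketch of what that reading side could look like in Go (my own sketch - names and error handling are not from dmesg_exporter):

package main

import (
	"fmt"
	"os"
)

func main() {
	// each reader of /dev/kmsg gets its own cursor, starting at the
	// oldest record still present in the ring buffer.
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 8192)
	for {
		// each read(2) returns a single, complete record; once we
		// have caught up, the read blocks until a new message is
		// logged.
		n, err := f.Read(buf)
		if err != nil {
			panic(err)
		}

		fmt.Print(string(buf[:n]))
	}
}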
Interpreting the kmsg messages
Knowing that every read(2) syscall performed against /dev/kmsg returns a single message, we're left with:
- implementing a reader that continuously looks at /dev/kmsg and extracts the raw messages from there; as well as
- implementing the parsing of those messages.
The latter is the most interesting, imo.
Each message comes packed in the following format: a list of comma-separated info fields, then a semicolon, then the message itself.
//
//                    INFO                        MSG
// .-----------------------------------------. .-------.
// |                                         | |       |
//  int    int  int          char, <ignore>     string
// prefix, seq, timestamp_us,flag[,..........];<message>
//
There are four standard fields in the info section, leaving room for other fields in the future.
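To make that concrete, here's a rough sketch of the parsing in Go (field and function names here are mine - the actual implementation lives in the repository linked a bit further down): split on the first semicolon to separate the info section from the message, then split the info section on commas.

// a hypothetical `kmsg` package, just for illustration.
package kmsg

import (
	"fmt"
	"strconv"
	"strings"
)

// Message is the parsed form of a single record read from /dev/kmsg.
type Message struct {
	Prefix      int64 // syslog priority and facility, still encoded
	Sequence    int64 // position in the sequence of logged messages
	TimestampUs int64 // microseconds since boot
	Flag        string
	Text        string
}

// Parse breaks a single raw record into its info fields and message.
func Parse(raw string) (msg Message, err error) {
	parts := strings.SplitN(raw, ";", 2)
	if len(parts) != 2 {
		err = fmt.Errorf("malformed record: no ';' found")
		return
	}

	info := strings.Split(parts[0], ",")
	if len(info) < 4 {
		err = fmt.Errorf("malformed record: less than 4 info fields")
		return
	}

	if msg.Prefix, err = strconv.ParseInt(info[0], 10, 64); err != nil {
		return
	}
	if msg.Sequence, err = strconv.ParseInt(info[1], 10, 64); err != nil {
		return
	}
	if msg.TimestampUs, err = strconv.ParseInt(info[2], 10, 64); err != nil {
		return
	}

	msg.Flag = info[3]
	msg.Text = strings.TrimRight(parts[1], "\n")
	return
}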
From the first field (the priority/facility prefix), we can extract two pieces of information: priority and facility.
// DecodePrefix extracts both priority and facility from a given
// syslog(2) encoded prefix.
//
//    facility  priority
//   .---------. .-----.
//   |         | |     |
//    7 6 5 4 3   2 1 0    bits
//
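The body of such a function then boils down to a bit of masking and shifting. Here's a sketch that follows the layout above (the types and exact shape are my guess, not necessarily the repository's code):

func DecodePrefix(prefix int64) (facility int64, priority int64) {
	priority = prefix & 0x7 // the 3 lowest bits
	facility = prefix >> 3  // everything above them
	return facility, priority
}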
With the rest of the fields, we're able to know when the message was produced and which ID it carries in the sequence of messages that have been put into the buffer.
If you’re interested in seeing this parsing in the form of actual code, I wrote a Golang implementation that you can check here: github.com/cirocosta/dmesg_exporter/kmsg/kmsg.go.
Closing thoughts
It was great to have the opportunity of going deep into the stack and verifying how components from the kernel can communicate with userspace applications.
That led me to create dmesg_exporter, a tool that exports metrics about "dmesg" logs so that people can eventually alert on messages marked as "critical" arriving from a specific facility.
During the writing of this blog post, I made use of the following book: The Linux Programming Interface. Please make sure you check it out! It’s definitely one of the best books on Linux out there.
If you’ve had any questions, or have ideas for improvements, please let me know! I’m cirowrc on Twitter, and I’d love to hear from you!
Have a good one!