Hey,

some weeks ago I wrote a Docker logging plugin that acts as an oklog forwarder: oklog-docker-plugin.

oklog makes use of a fairly straightforward format for ingesting logs: send it a line with some predefined fields and ta-da, it’ll be properly ingested.

While coding the plugin, I remember considering whether I should buffer those writes or not.

My rationale was that I should probably weigh the tradeoff between immediate availability of the logs and some kind of network optimization from reducing the number of writes I’d perform, thus reducing the number of packets in flight at a given time.

That got me wondering how much of a difference we could perceive by buffering writes, and how that changes as we switch from one buffer size to another.

What I expected: in the beginning, the more we buffer, the better; we’d reach some peak, and after that we’d start doing worse than not buffering (or buffering at a smaller limit), given that the whole write would have to be broken into parts anyway. Does this hold true in real scenarios? How does it perform?

Testing it

To test that, I set up the following:

  1. set a given reasonable amount of data to transfer
  2. prepare a test executable that would transfer this data with different buffer sizes, selected via a command-line parameter
  3. run each test execution multiple times; average the results for each.

The easiest way I could find to control the buffering was to make use of setvbuf(3):

SETBUF(3)                 Linux Programmer's Manual                SETBUF(3)

NAME
       setbuf, setbuffer, setlinebuf, setvbuf - stream buffering operations

       ...

       int setvbuf(FILE *stream, char *buf, int mode, size_t size);

       The setvbuf() function may be used on any open stream to  change  its
       buffer.

The coolness of this method is that we can either provide a buffer of our own (and then, say, inspect it whenever we want) or let glibc take care of that. We can also skip buffering entirely by setting a flag, or have it buffer based on newlines. Something to take note of is that setvbuf applies to a FILE, which is not what we primarily deal with when making use of sockets.
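
For reference, those behaviors map to the mode flags that setvbuf accepts (_IOFBF, _IOLBF and _IONBF). A tiny standalone illustration against stdout, not code from the plugin:

#include <stdio.h>

int
main(void)
{
	// the buffer handed to `setvbuf` has to stay alive until the
	// stream is closed, hence `static` here
	static char buf[1 << 12];

	// fully buffered: data only reaches the underlying fd once
	// `buf` fills up (or on fflush/fclose)
	setvbuf(stdout, buf, _IOFBF, sizeof(buf));

	// line buffered: flush whenever a newline is written
	// setvbuf(stdout, NULL, _IOLBF, 1 << 12);

	// unbuffered: hand every write straight to the underlying fd
	// setvbuf(stdout, NULL, _IONBF, 0);

	printf("buffered until the buffer fills or the stream is flushed\n");
	return 0;
}

Keep in mind that setvbuf needs to be called right after opening the stream, before any other operation is performed on it.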

This means that to benefit from those methods we first have to associate the socket with a FILE, and only then set the buffering mode for the functions that make use of that stream.

Establishing connections and associating sockets with FILE streams

To associate a socket with a file stream we have to, right after getting a socket, make use of fdopen(3):

The fdopen() function associates a stream with the existing file descriptor, fd.

Given that when we establish a TCP connection we have both directions (read and write), we can associate the socket with two separate file streams.

To make that easy to deal with I started by creating a struct (see when-to-buf/conn.h) that would hold these two streams and then deal with establishing the connections.

typedef struct conn {
	FILE* rx;  // read side of the socket
	FILE* tx;  // write side of the socket
	int fd;    // the underlying socket file descriptor
} t_conn;

The struct gets initialized differently depending on whether we’re dealing with a client (active) or server (passive) connection.

For instance, on the server side we accept(2) on the listen_fd and then prepare the t_conn struct:

int
init_server_conn(t_conn* connection, int listen_fd)
{
	int conn_fd;
	struct sockaddr_in client_addr;
	socklen_t client_len = sizeof(client_addr);

        // `accept` extracts the first connection request from the queue of
        // pending connections for the listening socket `listen_fd`
        // and then creates a new connected socket.
	conn_fd =
	  accept(listen_fd, (struct sockaddr*)&client_addr, &client_len);
	if (conn_fd == -1) {
		perror("failed to accept connection - accept");
		return 1;
	}

	connection->fd = conn_fd;

        // with `fdopen` we create a FILE stream for the `read`
        // side of the socket
	connection->rx = fdopen(conn_fd, "r");
	if (connection->rx == NULL) {
		perror("failed to associate socket with rx stream - fdopen");
		return 1;
	}

        // to create the second FILE stream we first duplicate the
        // socket file descriptor and then open it with write only
        // capabilities
	connection->tx = fdopen(dup(conn_fd), "w");
	if (connection->tx == NULL) {
		perror("failed to associate socket with tx stream - fdopen");
		return 1;
	}

	return 0;
}
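
On the client (active) side the shape is similar: create the socket, connect(2), and then do the same fdopen dance. The sketch below is just a rough idea of what that initialization could look like (the name init_client_conn, its parameters and the error handling here are mine, not necessarily what’s in when-to-buf/conn.c):

int
init_client_conn(t_conn* connection, const char* addr, int port)
{
	int err;
	int conn_fd;
	struct sockaddr_in server_addr = { 0 };

	// create the TCP socket that we'll actively connect to the server
	conn_fd = socket(AF_INET, SOCK_STREAM, 0);
	if (conn_fd == -1) {
		perror("failed to create socket - socket");
		return 1;
	}

	server_addr.sin_family = AF_INET;
	server_addr.sin_port = htons(port);
	if (inet_pton(AF_INET, addr, &server_addr.sin_addr) != 1) {
		fprintf(stderr, "failed to parse address '%s' - inet_pton\n", addr);
		return 1;
	}

	// actively establish the connection (the three-way handshake
	// happens here)
	err = connect(conn_fd, (struct sockaddr*)&server_addr, sizeof(server_addr));
	if (err == -1) {
		perror("failed to connect - connect");
		return 1;
	}

	connection->fd = conn_fd;

	// same trick as on the server side: one FILE stream per direction,
	// duplicating the descriptor for the write side
	connection->rx = fdopen(conn_fd, "r");
	if (connection->rx == NULL) {
		perror("failed to associate socket with rx stream - fdopen");
		return 1;
	}

	connection->tx = fdopen(dup(conn_fd), "w");
	if (connection->tx == NULL) {
		perror("failed to associate socket with tx stream - fdopen");
		return 1;
	}

	return 0;
}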

Once we had a t_conn set up, writing to it became as simple as writing to a file. Any function that takes a FILE (like fprintf or fwrite…) could now be an entry point for writing to that connection.
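
For instance, something as small as this would do (just an illustration, not code from the actual client; with full buffering the line only hits the socket once the buffer fills up or the stream gets flushed):

	// write a log-ish line through the buffered stream and then
	// force whatever is sitting in the buffer out to the socket
	fprintf(connection->tx, "msg=\"hello over the wire\"\n");
	fflush(connection->tx);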

Writing to the connection

To perform the buffered writes, the setvbuf setup ended up looking like the following (see when-to-buf/client.c#work_on_connection):

int
work_on_connection(t_conn* connection, int bufsize)
{
        int to_write = bufsize;
	char* write_buf;
	// ... 

        // if we want to buffer our writes, then prepare a 
        // buffer that will hold the data prior to pushing it
        // to the real underlying connection via `write(2)`s
	if (bufsize > 0) {
		write_buf = malloc(bufsize * sizeof(char));
		if (write_buf == NULL) {
			perror("failed to allocate write_buf");
			return 1;
		}

                // have `glibc` make use of our buffer, fully buffering
                // up to `bufsize` bytes.
		err = setvbuf(connection->tx, write_buf, _IOFBF, bufsize);
		if (err != 0) {
			perror("failed to set block buffer - setvbuf");
			return 1;
		}
	        // ...
        }

	for (;;) {
                // ...
                // send data to `fwrite` such that it can perform
                // the buffering as performed above and then
                // end up writing to the connection at some point
		n = fwrite(SOURCE_BUFFER + total_written,
		           sizeof(char),
		           to_write,
		           connection->tx);
                // ...
	}
        // ...
}
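
One detail worth keeping in mind (presumably taken care of in the elided parts, or when the connection gets torn down): anything still sitting in the stdio buffer when the loop ends only reaches the socket once the stream is flushed or closed, so the tail of the transfer needs something along these lines:

	// push whatever is left in the stdio buffer down to the socket
	if (fflush(connection->tx) != 0) {
		perror("failed to flush tx stream - fflush");
		return 1;
	}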

With that set, it was now time to make sure that our underlying write(2) calls were being buffered as we intended.

To test that, the easiest way was to shrink the size of SOURCE_BUFFER (SRC_BUFSIZE) to something small, like 1024 bytes

  diff client.c
- #define SRC_BUFSIZE (1 << 25)
+ #define SRC_BUFSIZE (1 << 10)

and then verify, for different bufsizes, how many write(2) calls get made. With 1024 bytes to send, a bufsize of 1024 should mean a single write(2) to the socket, while a bufsize of 1 should mean roughly 1024 of them.

First we start with a big buffer (bufsize=1024):

# Run `strace` watching the execution of the 
# `./client` executable taking the arguments 
# `127.0.0.1` (address) and `1024` (bufsize).
#
#  ps.: -w : summarise the wall-clock time spent in each syscall
#       -c : count calls and errors per syscall and print a summary
#       -f : also trace child processes

# In this case, we expect a *very* low number of `write(2)`s,
# as the whole 1024-byte payload can go out in a single
# `write(2)` call.

strace -c -f -w ./client 127.0.0.1 1024

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 33.22    0.000288          72         4           close
 18.92    0.000164          82         2           write        <<<<<<<<<<
 13.03    0.000113         113         1           execve
  ...
  0.35    0.000003           3         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000867                    36         3 total
                                      /\ 
                                      ||
                                  low amount (Y)

After that, with a very small buffer (bufsize=1):

# Now, with a `bufsize` as low as 1 byte, we should expect
# the number of `write(2)`s to be pretty big (close to 1024):
strace -c -f -w ./client 127.0.0.1 1


% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.58    0.011862          12      1025           write        <<<<<<<<<
  2.72    0.000341          85         4           close
  ...
  0.02    0.000003           3         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.012542                  1059         3 total
                                    /\ 
                                    ||
                            a bunch of syscalls

Cool, so we have what we need set up:

  1. control bufsize via CLI arguments
  2. have buffered writes with different buffer sizes

Now it’s time to make it run!

Running some scenarios

The first scenario I wanted to test was the one whose outcome I could predict pretty well:

  • communication between client and server via loopback

As the loopback interface is nothing more than a virtual NIC and doesn’t even make use of Ethernet, we should be able to transfer these packets as fast as possible regardless of their size (ps.: we’d still not have everything sent in a single packet, as we still make use of IP).

That said, the improvements should be seen as we get bigger and bigger buffer sizes.

Graph of how perfect loopback performs

It really does: overall, increasing the size of the buffers helps.

Is this too far from reality? Well, in some cases. For those deploying Kubernetes services, it’s very common to have sidecars, i.e., colocation of containers. In this case, you’re running another container in the same shared namespaces, and you’re entirely able to communicate via loopback (e.g., for relaying logs!). It’s debatable whether you should be using the whole network stack in such a case (as opposed to unix sockets), but it’s a case.

Next, I wanted to check how it performs when there are some failures and delays in the communication. In theory, this is where buffers should shine after all.

Introducing problems in the network

To introduce both consistent and occasional delays, as well as the ability to drop packets at a given rate, I made use of tc(1). Change the queueing discipline of the interface, and you’ll have a messed up scenario*.

Keeping the total transfer size constant (1GB), I tried a handful of scenarios. First, I started with packet loss:

  • Random packet loss applied to 0.1% of all packets
Graph of how 0.1% packet loss affects time
  • Random packet loss applied to 1% of all packets
Graph of how a consistent 1% packet loss affects time
  • Random packet loss applied to 5% of all packets
Graph of how 5% packet loss affects time

Nothing too interesting here. The shape of the data looks much the same as in the perfect loopback scenario. It varies a little bit more, but nothing impressive.

The game changes when it comes to delays though.

  • Consistent delay of 5ms on all packets
Graph of how a consistent 5ms delay affects time
  • Consistent delay of 10ms on all packets
Graph of how 10ms of delay on all packets affects time
  • Bursting delay: an added delay of 100ms (+/- 10ms), with each next random element depending 25% on the previous one
Graph of how time varies in a bursting delay scenario

If you’re curious about how to perform the same kind of traffic shaping in your own loopback, these were the rules used:

tc qdisc add dev lo root netem loss 0.1%
tc qdisc change dev lo root netem loss 1%
tc qdisc change dev lo root netem loss 5%
tc qdisc change dev lo root netem delay 100ms 10ms 25%
tc qdisc change dev lo root netem delay 5ms

Closing thoughts

I found it pretty surprising how delays introduced much more varied results compared to packet loss.

In the future, I might explore this a bit more, but I think this is enough and worth sharing.

If you spot any mistakes or think I might have gotten everything wrong, please let me know! I’d really appreciate the help in perfecting this and better understanding the results.

You can get the Excel worksheet that I created here: when-to-buf.xslx.

I’m cirowrc on Twitter, feel free to drop me a message there!

Have a good one,

finis