Hey,

Continuing from where we left off in the last article (How Linux creates Sockets), in this blog post we go through what happens when we make our socket ready to accept connections (and what “getting ready to accept connections” really means under the hood).

It goes through the internals of bind(2), listen(2) and accept(2), building the whole foundation around the preparation of the socket data structures.

How to make sockets communicate

If you’ve wondered what that netstat command looks at, make sure you stick around until the end!

This is the seventh article in a series of 30 articles around procfs: A Month of /proc.

If you’d like to keep up to date with it, make sure you join the mailing list!

Creating the TCP socket

Given that a TCP connection needs TCP sockets, the first thing we need to do is create a socket on the server side, and another on the client side.

This step is crucial as it is what creates the endpoint for communication on both sides.

In both cases, the call is the same: socket(2).

#include <stdio.h>
#include <sys/socket.h>

int
main (int argc, char** argv)
{
        // Create a socket in the AF_INET (ipv4) 
        // communication domain, of type `SOCK_STREAM`
        // (sequenced, reliable, two-way, connection-based
        // stream - yeah, tcp), using the most adequate
        // protocol (that last argument - 0).
        int fd = socket(AF_INET, SOCK_STREAM, 0); 

        if (fd == -1) {
                perror("socket");
                return 1;
        }

        return 0;
}

The whole “under-the-hood” of socket(2) has already been covered!

Make sure you check How Linux creates Sockets.

Once the socket has been created with the appropriate family and protocol selected, we can start looking at how the next protocol-specific calls work in order to have a working connection.

Binding the socket to an address

Before binding, that socket we created exists in a given namespace, but it has no address assigned to it - the underlying data structures are there (allocated), but the semantics are not entirely defined.

What bind(2) does is assign an address specified at userspace to the socket referred to by the file descriptor that we received from socket(2).

/*
 * bind - bind a name to a socket.
 */
int bind(
        // The socket that we created before and which we 
        // want to associate an address with.
        int sockfd, 

        // Address varies depending on the address family.
        //
        // `struct sockaddr` defines a generic socket address,
        // but given that each family carries its own address
        // definition, we need to specialize this struct with
        // a `struct` that suits our protocol.
        const struct sockaddr *addr,

        // Size of the address structure pointed to by addr.
        socklen_t addrlen);

This function is particularly interesting due to the fact that it tries to be as generic as possible when it comes to taking an address - it expects a pointer to a chunk of memory and then takes the size that you told it that the struct has.

USERSPACE:

        bind( socket, [ ..... piece of memory ...... ], size of the piece of memory)

                 "hey kernel, here's some chunk of memory that corresponds
                  to something that the socket's family understands; and
                  btw, its size is `N`".


KERNELSPACE:

        oh, thx! I'll let `af_inet` know about it!

                >> grabs the memory by copying it to kernelspace;
                >> forwards it to the af_inet implementation of `bind`.

For instance, for our case (IPv4), we can make it listen on all interfaces (0.0.0.0), and port 1337, by specifying a struct sockaddr_in which we fill with the family (communication domain), port, and address information:

/*
 * server_bind - binds a given socket to a specific
 *               port and address.
 * @listen_fd: the socket to bind to an address.
 */
int
server_bind(int listen_fd)
{
        // Structure describing an internet socket
        // address (ipv4).
	struct sockaddr_in server_addr = { 0 };
        int err = 0;

        // The family that the address belongs to.
        // AF_INET: ipv4 "internet" addresses.
	server_addr.sin_family      = AF_INET;

        // The IPV4 address in network byte order (big endian)
	server_addr.sin_addr.s_addr = htonl(INADDR_ANY);

        // The service port that we're willing to
        // bind the socket to (in network byte order).
	server_addr.sin_port        = htons(PORT);

        err = bind(listen_fd, 
                (struct sockaddr*)&server_addr, 
                sizeof(server_addr));
	if (err == -1) {
		perror("bind");
		fprintf(stderr, "Failed to bind socket to address\n");
		return err;
	}

	return 0;
}

Even though we’re not allocating struct sockaddr_in in the heap (via malloc or similar), we still have a chunk of memory to reference, with a corresponding size (thus, we can pass it to bind).

Also, although bind(2) specifies that it expects a struct sockaddr, we can pass to it pretty much any chunk of memory of the size advertised in addrlen, which does not have to match the size of sockaddr.

For instance, consider the differences between IPv4 and IPv6.

In both cases, bind(2) is called to assign an address to a socket, and naturally, IPv6 has a much bigger address space compared to IPv4, thus, needing a much bigger structure.

// Generic `sockaddr`. A pointer to
// a struct like this is expected by the
// `bind` syscall.
//
// size: 16B
struct sockaddr {
	sa_family_t sa_family;   // 2B
	char        sa_data[14]; // 14B
};


// The socket address representation of an
// IPv4 address.
//
// size: 16B
struct sockaddr_in {
	sa_family_t    sin_family; // 2B
	in_port_t      sin_port;   // 2B
	struct in_addr sin_addr;   // 4B

	// Padding, so that the struct ends up with the
	// same size as `struct sockaddr` (16B).
	unsigned char sin_zero[8]; // 8B
};


// The socket address representation of an
// IPv6 address.
//
// size: 28B
struct sockaddr_in6 {
	sa_family_t     sin6_family;   // 2B
	in_port_t       sin6_port;     // 2B
	uint32_t        sin6_flowinfo; // 4B
	struct in6_addr sin6_addr;     // 16B
	uint32_t        sin6_scope_id; // 4B
};

What matters, in the end, is how the underlying family implementation of the bind operation deals with the chunk of memory that is meant to be the address (whatever that family considers an address to be).

What the Kernel does with the address passed to bind

Tracing down the bind operation of our program above, we can see the stack trace of the bind(2) for an AF_INET socket:

# Trace the `inet_bind` method, the one that
# gets called whenever a `bind` is called
# on a socket that has been created for
# the `af_inet` family (regardless of the
# type - SOCK_STREAM or SOCK_DGRAM).
trace -K inet_bind
PID     TID     COMM            FUNC
28700   28700   bind.out        inet_bind
        inet_bind+0x1 [kernel]
        sys_bind+0xe [kernel]
        do_syscall_64+0x73 [kernel]
        entry_SYSCALL_64_after_hwframe+0x3d [kernel]

Looking at each of those, we can then understand how that whole thing works.

Starting with sys_bind, the method that provides the bind(2) syscall functionality, we can see that it:

  1. looks up the underlying socket saved for such process file descriptor;
  2. copies the memory from userspace to kernelspace; and then
  3. lets the underlying socket family handle the bind operation.

Illustration of what the bind syscall does under the hood

In code, you can look at the system call definition in net/socket.c:

/*
 * Bind a name to a socket. Nothing much to do here since it's
 * the protocol's responsibility to handle the local address.
 * 
 * We move the socket address to kernel space before we call
 * the protocol layer (having also checked the address is ok).
 */
SYSCALL_DEFINE3(bind, 
        int, fd, 
        struct sockaddr __user*, umyaddr, 
        int, addrlen)
{
        // Reference to the underlying `struct socket`
        // associated with the file descriptor `fd` passed
        // from userspace.
	struct socket*          sock;
	struct sockaddr_storage address;
	int                     err, fput_needed;

	// Retrieve the underlying socket from the
	// file descriptor.
	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (sock) {
		// Copy the address to kernel space.
		err = move_addr_to_kernel(umyaddr, addrlen, &address);
		if (err >= 0) {
			// Call any security hooks registered for
			// the `bind` operation
			err = security_socket_bind(
			  sock, (struct sockaddr*)&address, addrlen);
			if (!err) {
				// Perform the underlying family's
				// bind operation.
				err = sock->ops->bind(
				  sock, (struct sockaddr*)&address, addrlen);
			}
		}
		fput_light(sock->file, fput_needed);
	}
	return err;
}

A little digression … something that caught my attention is the fact that all of these calls have either an audit_ or security_ method.

These hooks seem to be what brings Linux Security Modules to life (see Linux Security Modules: General Security Hooks for Linux):

The Linux Security Module (LSM) framework provides a mechanism for various security checks to be hooked by new kernel extensions.

Very interesting!


With the socket address in the kernel, and the socket structure associated with the file descriptor found, now it’s time for sock->ops->bind to be called (in our case - ipv4 -, inet_bind).

Considering that bind is pretty much a “mutator”, in the sense that its whole goal is to change some internal fields of the socket structure, we can start by looking at what these fields are (before checking how such mutation is performed).

/**
 * Higher-level interface for any type of sockets
 * that we end up creating through `sockfs`.
 */
struct socket {
        // The state of the socket (not to confuse with
        // the transport state).
        // 
        // Note.: this is an enumeration of five possible
        // states: 
        // - SS_FREE = 0        not allocated
        // - SS_UNCONNECTED,	unconnected to any socket
        // - SS_CONNECTING,	in process of connecting
        // - SS_CONNECTED,	connected to socket
        // - SS_DISCONNECTING	in process of disconnecting
        //
        // Given that we have already called `socket(2)`, at
        // this point, we're clearly not in the `SS_FREE` state.
	socket_state		state;

        // File description (kernelspace) associated with 
        // the file descriptor (userspace).
	struct file		*file;

         // Family-specific implementation of a
         // network socket.
	struct sock		*sk;
        // ...
};


/** 
 * AF_INET specialized representation of network sockets.
 *
 * struct inet_sock - representation of INET sockets
 *
 * @sk - ancestor class
 * @inet_daddr          - Foreign IPv4 addr
 * @inet_rcv_saddr      - Bound local IPv4 addr
 * @inet_dport          - Destination port
 * @inet_num            - Local port
 * @inet_saddr          - Sending source
 * @inet_sport          - Source port
 * @saddr               - Sending source
 */
 struct inet_sock {
	struct sock		sk;
#define inet_daddr		sk.__sk_common.skc_daddr
#define inet_rcv_saddr		sk.__sk_common.skc_rcv_saddr
#define inet_dport		sk.__sk_common.skc_dport
#define inet_num		sk.__sk_common.skc_num
	__be32			inet_saddr;
	// ...
};

From such definitions, we can imagine which fields are about to change: the ones related to source addresses and source ports.

Once a series of checks and security hooks are called, bind proceeds with changing the fields in sock.

Checks performed by the ipv4 implementation of bind

Looking at the code (buckle up, a bunch of C code below), we can see these changes happening (go check net/ipv4/af_inet.c!):

// `AF_INET` specific implementation of the
// `bind` operation (called by `sys_bind` after
// retrieving the underlying `struct socket`
// associated with the file descriptor supplied
// by the user from `userspace`).
int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;

        // Retrieves the `struct sock` associated with the non-family
        // specific representation of a socket (`struct socket`).
	struct sock *sk = sock->sk;


        // Cast the socket to the `inet-specific` definition 
        // of a socket.
	struct inet_sock *inet = inet_sk(sk);
	struct net *net = sock_net(sk);
	unsigned short snum;


        // ...
	// Make sure that the address supplied is
	// indeed of the size of a `sockaddr_in`
	err = -EINVAL;
	if (addr_len < sizeof(struct sockaddr_in))
		goto out;


	// Make sure that the address contains the right family
	// specified in its struct.
	if (addr->sin_family != AF_INET) {
		err = -EAFNOSUPPORT;
		if (addr->sin_family != AF_UNSPEC || addr->sin_addr.s_addr != htonl(INADDR_ANY))
			goto out;
	}


        // ...
        // Grab the service port as set in the address 
        // struct supplied from userspace.
	snum = ntohs(addr->sin_port);
	err = -EACCES;

	// Here is where we perform the check to make sure 
        // that the user has the necessary privileges to
	// bind to a privileged port.
	if (snum && snum < inet_prot_sock(net) &&
	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
		goto out;


	// Can't bind after the socket is already active,
	// or if it's already bound.
	err = -EINVAL;
	if (sk->sk_state != TCP_CLOSE || inet->inet_num)
		goto out_release_sock;

	// Set the source address of the socket to the
	// one that we've supplied.
	inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr;
	if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST)
		inet->inet_saddr = 0;

	/* Make sure we are allowed to bind here. */
	// CC: this is where you can retrieve a "random" port
	//     if you don't specify one.
	if (
                (snum || !inet->bind_address_no_port) && // should a port be grabbed now?
	        sk->sk_prot->get_port(sk, snum)          // non-zero: failed to grab the port
        ) {
		inet->inet_saddr = inet->inet_rcv_saddr = 0;
		err = -EADDRINUSE;
		goto out_release_sock;
	}

        // ...
        // Set the source port to the one that we've
        // specified in the address supplied.
	inet->inet_sport = htons(inet->inet_num);
	inet->inet_daddr = 0;
	inet->inet_dport = 0;
        // ...

	return err;
}

And that’s all of it!

At this moment, bind(2) finished all of its duties, providing an address (specified at user space) to the underlying socket structure in the kernel.

ps.: note that we still don’t see any sockets in /proc/net/tcp at this point. More on that later!

Making a socket passive

Once we have set an address for our socket, we need to let it either take the server or client role.

That is, it needs to either put itself for listening for incoming connections or start a connection to someone else that is listening.

With listen(2), we take the server role - setting ourselves up to listen for client connections.

// "listen for connections on a socket."
int listen(
        // File descriptor referring to a
        // socket of type SOCK_STREAM or
        // SOCK_SEQPACKET.
        int sockfd, 

        // Maximum length to which the queue
        // of pending connections for `sockfd`
        // can grow.
        int backlog);

Thus, this is how we can do it from userspace:

int
server_listen(int listen_fd)
{
        int err = 0;

	err = listen(listen_fd, BACKLOG);
	if (err == -1) {
		perror("listen");
		return err;
	}

	return 0;
}

From the manual page:

listen() marks the socket referred to by sockfd as a passive socket, that is, as a socket that will be used to accept incoming connection requests using accept(2).

So, what does “marking” really mean under the hood?

Also, what is happening with that “backlog” number being passed?

Under the hood of the listen syscall

Very much like bind(2), the listen(2) syscall implementation deals with figuring out what is the socket associated with the userspace file descriptor (associated with the process calling the syscall), performing some checks, and then letting the family implementation deal with the semantics.

Illustration of listen system call under the hood

The implementation can be found under net/socket.c:

SYSCALL_DEFINE2(listen, 
        int, fd, 
        int, backlog)
{
	struct socket *sock;
	int err, fput_needed, somaxconn;

        // Retrieve the underlying socket from
        // the userspace file descriptor associated
        // with the process.
	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (sock) {
                // Gather the `somaxconn` parameter set for the
                // network namespace (/proc/sys/net/core/somaxconn)
                // and use it to limit the size of the backlog that
                // can be specified.
                //
                // See https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
		somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
		if ((unsigned int)backlog > somaxconn)
			backlog = somaxconn;

                // Run the security hook associated with `listen`.
		err = security_socket_listen(sock, backlog);
		if (!err) {
                        // Call the ipv4 implementation of 
                        // `listen` that has been registered
                        // before at `socket(2)` time.
			err = sock->ops->listen(sock, backlog);
                }

		fput_light(sock->file, fput_needed);
	}
	return err;
}

Something that I found very interesting is that the backlog gets limited by the SOMAXCONN parameter set for the network namespace, and not the whole system.

Could we verify that? Sure!

Checking if the listen backlog limit is really set per namespace

If the limiting is really taking place per network namespace, we should be able to join a namespace, set a limit and then, from the outside, see a different limit:

# Check the somaxconn limit as set in the
# default network namespace
cat /proc/sys/net/core/somaxconn
128


# Create a new network namespace
ip netns add mynamespace


# Join the network namespace and then
# check the value set for `somaxconn`
# within it
ip netns exec mynamespace \
        cat /proc/sys/net/core/somaxconn
128


# Modify the limit set from within the 
# network namespace
ip netns exec mynamespace \
        /bin/sh -c "echo 1024 > /proc/sys/net/core/somaxconn"


# Check whether the limit is in place there
ip netns exec mynamespace \
        cat /proc/sys/net/core/somaxconn
1024


# Check that the host's limit is still the
# same as before (128), meaning that the change
# took effect only within the namespace
cat /proc/sys/net/core/somaxconn
128

So, /proc is telling us the sysctl parameters, but, are they really in place?

To figure that out, we need first to see how we can gather the backlog limits set for a given socket.

Gathering TCP socket information from procfs

Looking at /proc/net/tcp, we can have a great view of all the sockets that we have in our current namespace.

Usually, that’s the file that reveals most of the information that you’d need.

It carries some very useful information like:

  • connection state;
  • remote address and port;
  • local address and port;
  • the size of the receive queue; and
  • the size of the transmit queue.

For instance, running that after we’ve let our process’ socket listen, we can see the socket in the listen state:

# Retrieve a list of all of the TCP sockets that
# are either listening or that have (or had) an
# established connection.

        hexadecimal representation <-.
        of the conn state.           |
                                     |
cat /proc/net/tcp                    |
   .---------------.               .----.
sl | local_address | rem_address   | st | tx_queue rx_queue 
0: | 00000000:0016 | 00000000:0000 | 0A | 00000000:00000000 
   *---------------*               *----*
    |                                  |
    *-> Local address in the format    *-.
        <ip>:<port>, where numbers are   |
        represented in the hexadecimal   |
        format.                          |
                    .--------------------*
                    |
        The states here correspond to the
        ones in include/net/tcp_states.h:

enum {
	TCP_ESTABLISHED = 1,
	TCP_SYN_SENT,
	TCP_SYN_RECV,
	TCP_FIN_WAIT1,
	TCP_FIN_WAIT2,
	TCP_TIME_WAIT,     .-> 0A = 10 --> LISTEN
	TCP_CLOSE,         |
	TCP_CLOSE_WAIT,    |
	TCP_LAST_ACK,      |
	TCP_LISTEN,  ------*
	TCP_CLOSING,
        TCP_NEW_SYN_RECV,
	TCP_MAX_STATES,
};

Something that we don’t have here is the configured backlog for the listening socket. I suppose that the reason why we don’t have that there is the fact that it’s a piece of information very specific to sockets in the LISTEN state, but this is just a guess.

So, how can we check that?

Checking the backlog size of a listening socket

One quick way of figuring that out is making use of ss from iproute2.

Consider the following userspace code:

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main (int argc, char** argv) {
        // Create a socket for the AF_INET
        // communication domain, of type SOCK_STREAM
        // without a protocol specified.
        int sock_fd = socket(AF_INET, SOCK_STREAM, 0);
        if (sock_fd == -1) {
                perror("socket");
                return 1;
        }

        // Mark the socket as passive with a backlog
        // size of 128.
        int err = listen(sock_fd, 128);
        if (err == -1) {
                perror("listen");
                return 1;
        }

        // Sleep, so that there's time to inspect the socket.
        sleep(3000);

        return 0;
}

After running the code above, run ss:

# Display a list of passive tcp sockets, showing
# as much info as possible.
ss \
  --info \     .------> Number of connections waiting
  --tcp \      |        to be accepted.
  --listen \   |    .-> Maximum size of the backlog.
  --extended   |    |
        .--------..--------.
State   | Recv-Q || Send-Q | ...
LISTEN  | 0      || 128    | ...
        *--------**--------*

The reason why we use ss instead of /proc/net/tcp for this case is that the latter does not give us information about the backlog size of the socket, while ss does.

What allows ss to do it is the fact that it uses a different API for retrieving information from the kernel: instead of reading from procfs, it makes use of netlink:

Netlink is a datagram-oriented service. […] used to transfer information between the kernel and user-space processes.

Given that netlink supports communication with many different kernel subsystems, ss needs to specify which one it intends to talk with - in the case of sockets, it chooses sock_diag:

The sock_diag netlink subsystem provides a mechanism for obtaining information about sockets of various address families from the kernel.

This subsystem can be used to obtain information about individual sockets or request a list of sockets.

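More specifically, what allows us to gather the backlog information is the pair of queue fields that sock_diag reports. The man page describes them, for instance, under the UDIAG_SHOW_RQLEN flag used for UNIX sockets (for inet sockets like ours, the equivalent fields are idiag_rqueue and idiag_wqueue):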

UDIAG_SHOW_RQLEN
       ...
       udiag_rqueue
              For listening sockets: the number of pending
              connections. [ ... ] 

       udiag_wqueue
              For listening sockets: the backlog length which
              equals to the value passed as the second argument
              to listen(2). [...]

Now, running the same code as in the previous section, we can see how the limit is indeed applied per-namespace (go ahead and do it! If you have any questions on how to do that, let me know on Twitter - @cirowrc).

We’ve talked about the size of this backlog queue, but, how is it initialized?

The internals of the ipv4 version of listen

Right after the backlog size gets capped by the sysctl value (SOMAXCONN), the next step is offloading the execution of listen to the family operation (inet_listen).

So, here is where things actually happen.

Illustration of what inet_listen does under the hood

Having some of the code for TCP Fast Open redacted for better readability, here’s how inet_listen is implemented:

int
inet_listen(struct socket* sock, int backlog)
{
	struct sock*  sk = sock->sk;
	unsigned char old_state;
	int           err, tcp_fastopen;

	// Ensure that we have a fresh socket that has
	// not been put into `LISTEN` state before, and
	// is not connected.
	//
	// Also, ensure that it's of the TCP type (otherwise
	// the idea of a connection wouldn't make sense).
	err = -EINVAL;
	if (sock->state != SS_UNCONNECTED || sock->type != SOCK_STREAM)
		goto out;

	if (_some_tcp_fast_open_stuff_) {
		// ... do some TCP fast open stuff ...

		// Initialize the necessary data structures
		// for turning this socket into a listening socket
		// that is going to be able to receive connections.
		err = inet_csk_listen_start(sk, backlog);
		if (err)
			goto out;
	}

	// Annotate the protocol-specific socket structure
	// with the backlog configured by `sys_listen` (the
	// value from userspace after being capped by the
	// kernel).
	sk->sk_max_ack_backlog = backlog;
	err                    = 0;
	return err;
}

Once some checks get performed, inet_csk_listen_start takes care of performing the mutations on the socket, as well as allocating the connection queue:

int inet_csk_listen_start(struct sock *sk, int backlog)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct inet_sock *inet = inet_sk(sk);
	int err = -EADDRINUSE;

        // Initializes the internet connection accept
        // queue.
	reqsk_queue_alloc(&icsk->icsk_accept_queue);

        // Sets the maximum ACK backlog to the one that
        // was capped by the kernel.
	sk->sk_max_ack_backlog = backlog;

        // Sets the current size of the backlog to 0 (given
        // that it's not started yet).
	sk->sk_ack_backlog = 0;
	inet_csk_delack_init(sk);

        // Marks the socket as in the TCP_LISTEN state.
	sk_state_store(sk, TCP_LISTEN);

        // Tries to either reserve the port already
        // bound to the socket or pick a "random" one.
	if (!sk->sk_prot->get_port(sk, inet->inet_num)) {
		inet->inet_sport = htons(inet->inet_num);

		sk_dst_reset(sk);
		err = sk->sk_prot->hash(sk);

		if (likely(!err))
			return 0;
	}

        // If things went south, then return the error
        // but first set the state of the socket to
        // TCP_CLOSE.
	sk->sk_state = TCP_CLOSE;
	return err;
}

Now that we have an address assigned to the socket, as well as the right state set and a queue for having incoming connections lined up, we have everything to start accepting connections.

But before that, let’s look at some cases we might fall into.

What happens if you do not bind before listening

If not binding “at all”, listen(2) ends up choosing a random port for you.

That’s because if we take a closer look at the method used by inet_csk_listen_start to reserve the port (get_port), we can see that it gathers a random ephemeral port if the underlying socket has no port chosen.

/* Obtain a reference to a local port for the given sock,
 * if snum is zero it means select any available local port.
 * We try to allocate an odd port (and leave even ports for connect())
 */
int inet_csk_get_port(struct sock *sk, unsigned short snum)
{
	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
	int ret = 1, port = snum;
	struct inet_bind_hashbucket *head;
	struct inet_bind_bucket *tb = NULL;

        // If we didn't specify a port (port == 0)
	if (!port) {
		head = inet_csk_find_open_port(sk, &tb, &port);
		if (!head)
			return ret;
		if (!tb)
			goto tb_not_found;
		goto success;
	}

        // ...
}

So, yeah, if you don’t want to bother with picking a port when listening, there you go!

What metrics spike when you do not accept connections as fast as you should

Given that there are always two queues in place when we act as a passive socket (one queue for those connections that have not completed the three-way handshake, and another for those that completed but were not accepted yet), we can imagine that the second one will start filling up.

Illustration of the problems with having a slow accept

The first metric we can look at is the one that we’ve already covered before: the idiag_rqueue and idiag_wqueue values reported from sock_diag for a specific socket.

idiag_rqueue
      For listening sockets: the number of pending connections.

      For other sockets: the amount of data in the incoming queue.

idiag_wqueue
      For listening sockets: the backlog length.

      For other sockets: the amount of memory available for sending.

While those are great for per-socket analysis, we can look at higher level information to know if, overall, the machine is seeing the accept queue overflow.

Given that whenever the Kernel tries to transition an incoming request from the syn queue to the accept queue and fails, it records an error at ListenOverflows, we can keep track of that number (which you can get from /proc/net/netstat):

# Retrieve the number of listen overflows
# (accept queue full, making transitioning a
# connection from `syn queue` to `accept queue`
# not possible at the moment).
cat /proc/net/netstat
TcpExt: SyncookiesSent SyncookiesRecv ...  ListenOverflows
TcpExt: 0 0 ... 105 ...

Naturally, we can see that /proc/net/netstat doesn’t provide the most human readable format possible.

That’s where netstat (the tool) comes in:

netstat --statistics | \
        grep 'times the listen queue of a socket overflowed'
105 times the listen queue of a socket overflowed

Curious about where in the Kernel code that happens? Check out tcp_v4_syn_recv_sock.

/*
 * The three way handshake has completed - we got a valid synack -
 * now create the new socket.
 */
struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb,
				  struct request_sock *req,
				  struct dst_entry *dst,
				  struct request_sock *req_unhash,
				  bool *own_req)
{
        // ...
	if (sk_acceptq_is_full(sk))
		goto exit_overflow;
        // ...
exit_overflow:
	NET_INC_STATS(
                sock_net(sk), 
                LINUX_MIB_LISTENOVERFLOWS); // (ListenOverflows)
}

Now, what if the syn queue is getting all the load and not seeing three-way handshakes being completed, thus not transitioning connections to the accept queue until the point that it starts overflowing?

That’s where another metric comes in handy: TCPReqQFullDrop or TCPReqQFullDoCookies (depending on whether SYN cookies are enabled or not) - follow tcp_conn_request for in-depth info about it.

If we’d like to know, at any given point in time, how many connections are sitting in the first queue (syn queue), we can list all sockets that are still in the syn-recv state:

# List all sockets that are in
# the `SYN-RECV` state  towards
# the port 1337.
ss \
  --numeric \
  state syn-recv sport = :1337

The folks from CloudFlare have a great post on this topic: SYN packet handling in the wild.

Go check it out!

Closing thoughts

It was great to be able to understand some of the edge cases involved with setting up a server TCP socket for accepting new connections.

I plan to do some bigger exploration on some more relevant metrics involved in the process, as well as properly understanding some of the more modern TCP quirks, but that’s for another post.

Please let me know if you find anything weird in this blog post, or have any questions! I’m cirowrc on Twitter, and I’d love to chat :)

Have a good one!

Resources