Hey,

If you’ve been working with web servers for a while, you’ve certainly already hit the classic “address already in use” error (EADDRINUSE).
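
For instance, here’s a minimal sketch that reproduces the error (the port number is an arbitrary choice): it binds two TCP sockets to the same address and port, with the second bind(2) failing with EADDRINUSE.

// eaddrinuse.c - binds two sockets to the same
// address and port to reproduce EADDRINUSE.
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int
main(void)
{
	struct sockaddr_in addr = { 0 };
	addr.sin_family      = AF_INET;
	addr.sin_port        = htons(8080);
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	int fd1 = socket(AF_INET, SOCK_STREAM, 0);
	int fd2 = socket(AF_INET, SOCK_STREAM, 0);
	if (fd1 == -1 || fd2 == -1) {
		perror("socket");
		return 1;
	}

	// The first bind succeeds ...
	if (bind(fd1, (struct sockaddr*)&addr, sizeof(addr)) == -1) {
		perror("bind fd1");
		return 1;
	}

	// ... while the second one fails with
	// `EADDRINUSE` ("address already in use").
	if (bind(fd2, (struct sockaddr*)&addr, sizeof(addr)) == -1) {
		fprintf(stderr, "bind fd2: %s\n", strerror(errno));
	}

	return 0;
}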

In this article, we go through not only how to see whether such a condition is about to happen (by looking at the list of open sockets), but also how to verify, in the actual kernel code paths, where that bookkeeping takes place.

Illustration of someone wondering about how the socket syscall works

In case you’ve been wondering how the socket(2) syscall works and where these sockets get stored, make sure you stick around to the end!

This is the sixth article in a series of 30 articles around procfs: A Month of /proc.

If you’d like to keep up to date with it, make sure you join the mailing list!

What are these sockets about?

Sockets are the constructs that allow processes on different machines to communicate through an underlying network. They can also be used as a way of communicating between processes on the same host (through Unix sockets).

The analogy that really stuck with me is the one presented in the book Computer Networking: A top-down approach.

At a very high-level, we can think of the server machine as this “house” with a set of doors.

A house that represents a server with a door that represents the socket

With each door corresponding to a socket, the client can arrive at the door of the house and “knock” on it.

Right after knocking (sending the SYN packet), the house automatically responds back (SYN+ACK, yep, smart house with a “smart door”), and the client then acknowledges that response (ACK).

The interaction between the client and the house when the client is still being greeted by the house

Meanwhile, the process just sits there within the house while the clients get organized by the “smart house”, which forms two lines: one for those that the house is still greeting (the SYN queue), and another for those it has finished greeting (the accept queue).

Whenever new clients land in the second line, the process can then let them come in.

The server process accepting incoming connections from the two queues formed

Once this connection gets accepted (the client is told to come in), the server is then able to communicate with it, transmitting and receiving data at will.

One detail to note is that the client doesn’t really “get in” - the server creates a “private door” in the house (a client socket) and then communicates with the client from there.
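
To make the analogy concrete, here’s a minimal sketch of such a house (the port number and the backlog value are arbitrary choices): listen(2) sets up the door and its lines, and accept(2) takes clients from the second line, handing the server a brand new “private door” (file descriptor) for each of them.

// server.c - a minimal TCP server: one door
// (the listening socket), many private doors
// (one per accepted client).
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
	if (listen_fd == -1) {
		perror("socket");
		return 1;
	}

	struct sockaddr_in addr = { 0 };
	addr.sin_family      = AF_INET;
	addr.sin_port        = htons(8080);
	addr.sin_addr.s_addr = htonl(INADDR_ANY);

	if (bind(listen_fd, (struct sockaddr*)&addr, sizeof(addr)) == -1) {
		perror("bind");
		return 1;
	}

	// `128` caps the line of clients that finished
	// the greeting but weren't `accept`ed yet.
	if (listen(listen_fd, 128) == -1) {
		perror("listen");
		return 1;
	}

	for (;;) {
		// Blocks until a client lands in the
		// second line, then returns a new fd
		// (the "private door") for it.
		int conn_fd = accept(listen_fd, NULL, NULL);
		if (conn_fd == -1) {
			perror("accept");
			return 1;
		}

		write(conn_fd, "hello!\n", 7);
		close(conn_fd);
	}
}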

If you’d like to follow the step by step of implementing a TCP server in C, make sure you check this article! Implementing a TCP server.

Where to look for the list of sockets in my system?

With that mental model of what TCP connection establishment looks like, we can now “get into the house” and explore how the machine creates these “doors” (the sockets), how many doors our house has, and what state each of them is in (are they closed? are they open?).

To do so, let’s consider the example of a server that just creates a socket (the door!) and does nothing with it.

// socket.c - creates a socket and then sleeps.
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>


/**
 * Creates a TCP IPv4 socket and then just
 * waits.
 */
int
main(int argc, char** argv)
{
	// The `socket(2)` syscall creates an endpoint for communication
	// and returns a file descriptor that refers to that endpoint.
	//
	// It takes three arguments (the last being just to provide greater
	// specificity):
	// -    domain (communication domain)
	//      AF_INET              IPv4 Internet protocols
	//
	// -    type (communication semantics)
	//      SOCK_STREAM          Provides sequenced, reliable,
	//                           two-way, connection-based byte
	//                           streams.
	int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
	if (listen_fd == -1) {
		perror("socket");
		return 1;
	}

	// Just wait ...
	sleep(3600);

	return 0;
}

Under the hood, such a simple syscall ends up triggering a whole bunch of internal methods (more on that in the next section) that, at some point, allow us to seek information about active sockets in three different files: /proc/<pid>/net/tcp, /proc/<pid>/fd, and /proc/<pid>/net/sockstat.

While the fd directory presents us with the list of files opened by the process, the /proc/<pid>/net/tcp file gives us information regarding the currently active TCP connections (in their various states) under the process’ network namespace. sockstat, on the other hand, acts more like a summary.

Starting with the fd directory, we can see that after the socket(2) call the socket file descriptor shows up in the list of file descriptors:

# Run socket.out (gcc -Wall -o socket.out socket.c)
# and leave it running in the background
./socket.out &
[2] 21113
 
# Check out the open files that the process has.
ls -lah /proc/21113/fd
dr-x------ 2 ubuntu ubuntu  0 Oct 16 12:27 .
dr-xr-xr-x 9 ubuntu ubuntu  0 Oct 16 12:27 ..
lrwx------ 1 ubuntu ubuntu 64 Oct 16 12:27 0 -> /dev/pts/0
lrwx------ 1 ubuntu ubuntu 64 Oct 16 12:27 1 -> /dev/pts/0
lrwx------ 1 ubuntu ubuntu 64 Oct 16 12:27 2 -> /dev/pts/0
lrwx------ 1 ubuntu ubuntu 64 Oct 16 12:27 3 -> 'socket:[301666]'

Given that a simple call to socket(2) doesn’t give us a TCP connection, there’s no relevant information to be gathered from /proc/<pid>/net/tcp.

From the summary (sockstat), we can guess that we’re increasing the number of allocated TCP sockets:

# Check the summary regarding socket.
cat /proc/21424/net/sockstat
sockets: used 296
TCP: inuse 3 orphan 0 tw 4 alloc 106 mem 1
UDP: inuse 1 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

To make sure that we’re really increasing the alloc number, we can modify the source code above and allocate 100 sockets instead:

+ for (int i = 0; i < 100; i++) {
      int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
      if (listen_fd == -1) {
          perror("socket");
          return 1;
      }
+ }

Now, checking that again, we can see the alloc at a higher number:

cat /proc/21456/net/sockstat

                   bigger than before!
                                |
sockets: used 296          .----------.
TCP: inuse 3 orphan 0 tw 4 | alloc 207| mem 1
UDP: inuse 1 mem 0         *----------*
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

Now, the question is: how does the socket get created under the hood?

What happens under the hood when the socket syscall gets called?

socket(2) is pretty much a factory that produces the underlying structures for handling operations on the socket.

Making use of iovisor/bcc, we can trace the deepest invocation that happens in the sys_socket call stack and from there understand each step.

|  socket()
|--------------- (kernel boundary)
|  sys_socket    
|       (socket, type, protocol)
|  sock_create   
|       (family, type, protocol, res)
|  __sock_create 
|       (net, family, type, protocol, res, kern)
|  sock_alloc    
|       ()
v

Starting from sys_socket itself, this syscall wrapper is the first thing to be touched in kernelspace, being responsible for performing various checks and preparing some flags to pass down to subsequent invocations.

Once the preliminary validations have been performed, it allocates on its stack a pointer to a struct socket, the struct that will end up holding the non-protocol-specific information about the socket:

/**
 * Defines `socket` as a syscall with the
 * following arguments:
 * - int family;        - the communication domain
 * - int type; and      - the communication semantics
 * - int protocol.      - a specific protocol within a
 *                        certain domain and semantics.
 *                       
 */
SYSCALL_DEFINE3(socket, 
        int, family, 
        int, type, 
        int, protocol)
{
        // A pointer that is meant to point to
        // a `struct socket` that contains the whole
        // socket definition after it gets properly
        // allocated by the socket family.
	struct socket *sock;
	int retval, flags;


	// ... Checks some stuff and prepare some flags ...
        // Create the underlying socket structures.
	retval = sock_create(family, type, protocol, &sock);
	if (retval < 0)
		return retval;


        // Allocate the file descriptor for the process so
        // that it can consume the underlying socket from
        // userspace.
	return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
}


/**
 * High level wrapper of the socket structures.
 */
struct socket {
	socket_state            state;
	short                   type;
	unsigned long           flags;
	struct sock*            sk;
	const struct proto_ops* ops;
	struct file*            file;
        // ...
};

Given that at the moment we create a socket we can choose between different protocol families and types (like UDP, UNIX, and TCP), struct socket contains an interface (struct proto_ops*) that defines the basic constructs that sockets implement (regardless of their family/type), which gets initialized by the next method called: sock_create.
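
Here’s an abbreviated look at that interface (roughly what include/linux/net.h defines around the 4.x kernels, with most fields elided): each family plugs in its own implementations of these methods.

/**
 * The interface that a protocol family implements
 * so that generic socket operations can be
 * dispatched to family-specific code.
 */
struct proto_ops {
	int		family;
	struct module	*owner;
	int		(*release)(struct socket *sock);
	int		(*bind)(struct socket *sock,
				struct sockaddr *myaddr,
				int sockaddr_len);
	int		(*connect)(struct socket *sock,
				struct sockaddr *vaddr,
				int sockaddr_len, int flags);
	int		(*accept)(struct socket *sock,
				struct socket *newsock,
				int flags, bool kern);
	int		(*listen)(struct socket *sock, int len);
	int		(*sendmsg)(struct socket *sock,
				struct msghdr *m, size_t total_len);
	int		(*recvmsg)(struct socket *sock,
				struct msghdr *m, size_t total_len,
				int flags);
	// ...
};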

/**
 * Initializes `struct socket`, allocating the
 * necessary memory for it, as well as filling
 * the necessary information associated with
 * the socket.
 * 
 * It:
 * - Performs some argument checking;
 * - Runs a security check hook for `socket_create`
 * - Initializes the actual allocation of the `struct socket`
 *   (letting the `family` do it according to its own rules)
 */
int __sock_create(struct net *net, 
        int family, int type, int protocol, 
        struct socket **res, int kern)
{
	int err;
	struct socket *sock;
	const struct net_proto_family *pf;

        // Checks that the family and type are within range.
	if (family < 0 || family >= NPROTO)
		return -EAFNOSUPPORT;
	if (type < 0 || type >= SOCK_MAX)
		return -EINVAL;


	// Triggers custom security hooks for socket_create.
	err = security_socket_create(family, type, protocol, kern);
	if (err)
		return err;


        // Allocates a `struct socket` object and ties it to
        // a file under the `sockfs` filesystem.
        sock = sock_alloc();
	if (!sock) {
		net_warn_ratelimited("socket: no more sockets\n");
		return -ENFILE;	/* Not exactly a match, but its the
				   closest posix thing */
	}

	sock->type = type;

        // Tries to retrieve the protocol family methods
        // for performing the family-specific socket creation.
        pf = rcu_dereference(net_families[family]);
	err = -EAFNOSUPPORT;
	if (!pf)
		goto out_release;


        // Executes the protocol family specific 
        // socket creation method.
        //
        // For instance, if our family is AF_INET (ipv4)
        // and we're creating a TCP socket (SOCK_STREAM),
        // a specific method for handling such type of socket
        // is called.
        //
        // If we were specifying a local socket (UNIX),
        // then another method would be called (given that
        // such method would implement the `proto_ops` interface
        // and have been loaded).
	err = pf->create(net, sock, protocol, kern);
	if (err < 0)
		goto out_module_put;
        // ...
}
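
As a concrete example of that dispatch: in the AF_INET case, the entry found in net_families comes from the registration performed by net/ipv4/af_inet.c at initialization time, which wires the create method up to inet_create (shown here slightly abbreviated):

/**
 * The IPv4 protocol family, registered via
 * `sock_register()`, making `pf->create` above
 * resolve to `inet_create` for `AF_INET` sockets.
 */
static const struct net_proto_family inet_family_ops = {
	.family = PF_INET,
	.create = inet_create,
	.owner	= THIS_MODULE,
};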

Continuing with our deep dive, we can now look closely at how the actual struct socket gets allocated by sock_alloc().

Illustration of how the Linux kernel creates sockets

What that method does is allocate two things: a new inode, and a socket object.

These two are bound together via the sockfs filesystem, which is then responsible for not only keeping track of socket information in the system, but also providing the translation layer between regular filesystem calls (like write(2)) and the network stack (regardless of the underlying communication domain).
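
We can get a glimpse of that translation layer by looking at the file_operations that sockfs attaches to socket files, mapping the generic file methods to socket-specific implementations (an abbreviated look at net/socket.c):

/**
 * The `file_operations` used by files that live in
 * `sockfs`: a `write(2)` against a socket fd ends
 * up in `sock_write_iter`, a `read(2)` in
 * `sock_read_iter`, and so on.
 */
static const struct file_operations socket_file_ops = {
	.owner          = THIS_MODULE,
	.llseek         = no_llseek,
	.read_iter      = sock_read_iter,
	.write_iter     = sock_write_iter,
	.poll           = sock_poll,
	.unlocked_ioctl = sock_ioctl,
	.mmap           = sock_mmap,
	.release        = sock_close,
	// ...
};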

By tracing sock_alloc_inode, the method responsible for allocating the inode in sockfs, we’re able to see how that gets set up:

trace -K sock_alloc_inode
22384   22384   socket-create.out      sock_alloc_inode
        sock_alloc_inode+0x1 [kernel]
        new_inode_pseudo+0x11 [kernel]
        sock_alloc+0x1c [kernel]
        __sock_create+0x80 [kernel]
        sys_socket+0x55 [kernel]
        do_syscall_64+0x73 [kernel]
        entry_SYSCALL_64_after_hwframe+0x3d [kernel]


/**
 *	sock_alloc	-	allocate a socket
 *
 *	Allocate a new inode and socket object. The two are bound together
 *	and initialized. The socket is then returned. If we are out of inodes
 *	NULL is returned.
 */
struct socket *sock_alloc(void)
{
	struct inode *inode;
	struct socket *sock;

        // Given that the filesystem is in-memory,
        // perform the allocation using the kernel
        // memory.
	inode = new_inode_pseudo(sock_mnt->mnt_sb);
	if (!inode)
		return NULL;


        // Retrieves the `socket` struct from
        // the `inode` that lives in `sockfs`
	sock = SOCKET_I(inode);


        // Sets some filesystem aspects of the inode
        // (its number, mode, ownership, and operations).
	inode->i_ino = get_next_ino();
	inode->i_mode = S_IFSOCK | S_IRWXUGO;
	inode->i_uid = current_fsuid();
	inode->i_gid = current_fsgid();
	inode->i_op = &sockfs_inode_ops;


        // Update the per-cpu counter (which can then be
        // used by `sockstat` and other systems to
        // quickly know the socket count).
	this_cpu_add(sockets_in_use, 1);
	return sock;
}


static struct inode *sock_alloc_inode(
        struct super_block *sb)
{
	struct socket_alloc *ei;
	struct socket_wq *wq;

        // Create an entry in the kernel cache 
        // taking the necessary memory for it.
	ei = kmem_cache_alloc(sock_inode_cachep, GFP_KERNEL);
	if (!ei)
		return NULL;

	wq = kmalloc(sizeof(*wq), GFP_KERNEL);
	if (!wq) {
		kmem_cache_free(sock_inode_cachep, ei);
		return NULL;
	}

        
        // Performs the most basic initialization
        // possible
	ei->socket.state = SS_UNCONNECTED;
	ei->socket.flags = 0;
	ei->socket.ops = NULL;
	ei->socket.sk = NULL;
	ei->socket.file = NULL;

        // Returns the underlying vfs inode.
	return &ei->vfs_inode;
}
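
Note how sock_alloc_inode allocates a struct socket_alloc but returns just its vfs_inode member. That’s the trick that binds the inode and the socket together: both live side by side in a single allocation, so SOCKET_I (used in sock_alloc above) can get from the inode back to the socket with nothing more than pointer arithmetic (from include/net/sock.h):

/**
 * The single allocation that holds both the
 * `socket` and its `sockfs` inode, side by side.
 */
struct socket_alloc {
	struct socket socket;
	struct inode  vfs_inode;
};

/**
 * Given the inode, retrieves the `struct socket`
 * that sits right next to it in memory.
 */
static inline struct socket *SOCKET_I(struct inode *inode)
{
	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}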

Sockets and resource limits

Given that a filesystem inode can be referred to from userspace through a file descriptor, once the underlying kernel structs are all set up, sys_socket is then responsible for generating a file descriptor for the user (going through the resource limits validations presented in Process resource limits under the hood).

If you’ve ever wondered why you might receive a “too many open files” error from socket(2), that’s the reason: it goes through the same resource limit checks:

static int
sock_map_fd(struct socket* sock, int flags)
{
	struct file* newfile;

        // Do you recall this one? This is where the
        // kernel ends up performing a check against
        // resource limits, making sure that we don't
        // get past them!
	int          fd = get_unused_fd_flags(flags);
	if (unlikely(fd < 0)) {
		sock_release(sock);
		return fd;
	}

	newfile = sock_alloc_file(sock, flags, NULL);
	if (likely(!IS_ERR(newfile))) {
		fd_install(fd, newfile);
		return fd;
	}

	put_unused_fd(fd);
	return PTR_ERR(newfile);
}
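
We can check that behavior from userspace with a minimal sketch: shrink RLIMIT_NOFILE via setrlimit(2) and then create sockets until socket(2) fails with EMFILE (“too many open files”).

// rlimit-sockets.c - lowers the open-file limit
// and then creates sockets until hitting it.
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/socket.h>

int
main(void)
{
	// Allow at most 10 open file descriptors for
	// this process (stdin, stdout, and stderr
	// count towards the limit too).
	struct rlimit limit = { .rlim_cur = 10, .rlim_max = 10 };
	if (setrlimit(RLIMIT_NOFILE, &limit) == -1) {
		perror("setrlimit");
		return 1;
	}

	for (int i = 0;; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd == -1) {
			// Expected to fail with `EMFILE`
			// ("too many open files").
			fprintf(stderr, "socket #%d: %s\n",
			        i, strerror(errno));
			return 0;
		}
	}
}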

Counting the number of sockets in the system

If you paid attention to the sock_alloc call, there was a part of it that took care of increasing the number of sockets that are “in use”.

struct socket *sock_alloc(void)
{
	struct inode *inode;
	struct socket *sock;

        // ....

        // Update the per-cpu counter (which can then be
        // used by `sockstat` and other systems to
        // quickly know the socket count).
	this_cpu_add(sockets_in_use, 1);
	return sock;
}

Since this_cpu_add is a macro, we can look at its definition and understand a bit more about it:

/*
 * this_cpu operations (C) 2008-2013 Christoph Lameter <cl@linux.com>
 *
 * Optimized manipulation for memory allocated through the per cpu
 * allocator or for addresses of per cpu variables.
 *
 * These operations guarantee exclusivity of access for other operations
 * on the *same* processor. The assumption is that per cpu data is only
 * accessed by a single processor instance (the current one).
 * 
 * [...]
 */

Now, given that we’re always adding to sockets_in_use, we can at least guess that the method registered for /proc/net/sockstat is going to make use of that value, which it really does (also summing the values registered for each CPU):

/*
 *	Report socket allocation statistics [mea@utu.fi]
 */
static int sockstat_seq_show(struct seq_file *seq, void *v)
{
	struct net *net = seq->private;
	unsigned int frag_mem;
	int orphans, sockets;

        // Retrieve the counters related to TCP sockets.
	orphans = percpu_counter_sum_positive(&tcp_orphan_count);
	sockets = proto_sockets_allocated_sum_positive(&tcp_prot);

        // Show the stats!
        // As we saw in the beginning of the article,
        // `alloc` shows all of those that were allocated
        // and might not be in an "in-use" state yet.
	socket_seq_show(seq);
	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
		   sock_prot_inuse_get(net, &tcp_prot), orphans,
		   atomic_read(&net->ipv4.tcp_death_row.tw_count), sockets,
		   proto_memory_allocated(&tcp_prot));
	// ...
	seq_printf(seq,  "FRAG: inuse %u memory %u\n", !!frag_mem, frag_mem);
	return 0;
}
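
The “sockets: used” line, in turn, comes from socket_seq_show, which is where the per-CPU sockets_in_use values get summed up (roughly, from net/socket.c):

// The per-cpu counter incremented by `sock_alloc`
// via `this_cpu_add`.
static DEFINE_PER_CPU(int, sockets_in_use);

/**
 * Walks over each possible CPU, summing its
 * `sockets_in_use` counter to produce the
 * "sockets: used" line of `/proc/net/sockstat`.
 */
void socket_seq_show(struct seq_file *seq)
{
	int cpu;
	int counter = 0;

	for_each_possible_cpu(cpu)
		counter += per_cpu(sockets_in_use, cpu);

	/* It can be negative, by the way. 8) */
	if (counter < 0)
		counter = 0;

	seq_printf(seq, "sockets: used %d\n", counter);
}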

What about namespaces?

As you might’ve noticed, there’s no namespace-related logic in the code when it comes to counting how many sockets were allocated.

That really surprised me at first, given that I thought the networking stack was one of the most namespaced parts of the kernel, but it turns out that some points still aren’t.

If you’d like to see that by yourself, make sure you follow the article Using network namespaces and a virtual switch to isolate servers.

The gist of it is that you can create a bunch of sockets, look at sockstat, then create a network namespace, get into it, and see that although you can’t see the TCP sockets of the whole system from there (namespaces in action!), you can still see the total number of allocated sockets in the system (not namespaced):

# Create a bunch of sockets using our
# example in C
./sockets.out


# Check that we have a bunch of sockets
cat /proc/net/sockstat
sockets: used 296
TCP: inuse 5 orphan 0 tw 2 alloc 108 mem 3
UDP: inuse 1 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0


# Create a network namespace
ip netns add namespace1


# Get into it
ip netns exec namespace1 /bin/bash


# Check how `/proc/net/sockstat` shows the same
# number of allocated sockets.
cat /proc/net/sockstat
TCP: inuse 0 orphan 0 tw 0 alloc 108 mem 3
UDP: inuse 0 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

Closing thoughts

It’s interesting to see how exploring the inner workings of the kernel, just by being curious about /proc, leads to answers about why some specific behaviors that I’ve seen in daily operations work the way they do.

Given that this is just the first article about /proc/net and I’ve already learned a lot, I can’t wait to start digging deeper into the rest of it!

If you’d like to follow along with me, make sure you subscribe to the mailing list.

In case you have any questions or thoughts you’d like to share, please let me know!

I’m cirowrc on Twitter, and I’d love to chat with you!

Have a good one,

Ciro
