Hey,

I was reviewing a PR where it seemed like something odd could happen with a context whose timeout could be reached earlier than expected, and that got me thinking about how Go implements context cancellation for HTTP requests under the hood.

In this post I go through that exploration, getting to the point where we end up with C code that does something similar to what Go does, understanding which mechanisms are involved.

a regular tcp client

With the typical TCP client implementation that we usually learn when getting started with sockets, one would end up with something like this:

    int
    main(int argc, char** argv)
    {
            struct sockaddr_in addr = { 0 };
            int                fd;

            if (!~(fd = socket(AF_INET, SOCK_STREAM, IPPROTO_IP))) {
                    perror("socket");
                    return 1;
            }

            addr.sin_family = AF_INET;
            addr.sin_port   = htons(PORT);
            inet_pton(AF_INET, HOST, &addr.sin_addr);

            if (!~connect(fd, (struct sockaddr*)&addr, sizeof(addr))) {
                    perror("fd: connect");
                    return 1;
            }

            do_write(fd);
            do_read(fd);

            // ..
    }

with the subsequent reads and writes looking like the following

    int
    do_read(int sock_fd)
    {
            char buf[BUFSIZE] = { 0 };

            if (!~read(sock_fd, buf, BUFSIZE)) {
                    perror("read");
                    return -1;
            }

            printf("read: '%s'\n", buf);
            return 0;
    }

    int
    do_write(int sock_fd)
    {
            const char* out_msg = "GET / HTTP/1.1\r\nHost: 127.0.0.1\r\n\r\n";

            if (!~write(sock_fd, out_msg, strlen(out_msg))) {
                    perror("write");
                    return -1;
            }

            return 0;
    }

There's a problem though: being blocking calls, they'll essentially block for as long as other internal (very long) timeouts hold.

In the case of a language like Go, where one is supposed to perform IO operations (especially networking ones) without worrying too much about how they'll perform, going with blocking calls wouldn't really work - that'd be too costly, as it'd require at least one thread for each of those calls if doing them concurrently.

making it non-blocking

Making those calls non-blocking takes quite a bit more code, but the idea is not that complicated. Others have talked about this in way more depth than I plan to here, so I'll keep it short.

The idea is that rather than asking the kernel for something and waiting until it's available, we let the kernel know that at some point we'd like to have it, but not necessarily right now - when it's ready, just let us know.

On Linux, epoll is a common mechanism for doing pretty much that: add the socket to an epoll instance, and once it's ready, read data from it without blocking.

    create the socket in non-blocking mode

            socket(AF_INET, SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_IP) = 3


    partially start the connection - as the socket is non-blocking, the
    connection proceeds in the background, and we'll be notified later

            connect(3, {sa_family=AF_INET, sin_port=htons(1337), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)


    create the epoll device

            epoll_create1(0)                        = 4


    add the socket to the epoll device, expecting only notifications for the
    ability to `write` on that socket fd

            epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLOUT, {u32=3, u64=3}}) = 0


    wait for that

            epoll_wait(4, [{EPOLLOUT, {u32=3, u64=3}}], 32, -1) = 1


    now that we're ready for a write (`connect` finished), check if there
    were any errors (remember, this is non-blocking, so, if there were any
    errors, we'd see them stored in the socket)

            getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0


    write to the socket (here we send our HTTP request)

            write(3, "GET / HTTP/1.1\r\nHost: 127.0.0.1\r"..., 35) = 35


    with the request sent, now only wait for EPOLLIN (as we want to `read`)
    (ps.: I could've used `EPOLL_CTL_MOD` rather than trying an ADD followed
    by a DEL and another ADD)

            epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=3, u64=3}}) = -1 EEXIST (File exists)
            epoll_ctl(4, EPOLL_CTL_DEL, 3, NULL)    = 0
            epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=3, u64=3}}) = 0


    wait for the ability to read

            epoll_wait(4, [{EPOLLIN, {u32=3, u64=3}}], 32, -1) = 1


    read from the socket (without blocking)

            read(3, "HTTP/1.1 200 OK\r\nDate: Sat, 28 D"..., 4096) = 120

If you're curious about how this looks codewise, I've implemented an example client that sends a `GET / HTTP/1.1` to an HTTP server and then reads its response using that mechanism here: `cirocosta/http-ctx-cancellation#http-client.c`.

cancelling a long read

As you can tell from the traced syscalls above, there's plenty of opportunity between one blocking action and another to stop the whole thing.

For instance, consider that we're communicating with a server that's very far away, with a terrible internet connection.

Trying to read(2) from it wouldn't block - read(2) would return EAGAIN (or EWOULDBLOCK) right away, as no data has arrived yet.

At this point, we could either decide to wait a little longer (through epoll or nanosleep), or decide not to wait at all, because we've already waited too long. In the latter case, we'd be doing pretty much what Go does for request cancellation - give up on waiting and destroy that connection1 (close(2) it).

    ret = read(sock, buf, 4096);
    if (!~ret) {
            if (errno == EAGAIN && time_elapsed > threshold) {
                    close(sock);
                    return ERR_CONTEXT_DEADLINE_EXCEEDED;
            }

            return ERR_READ;
    }

    printf("%s\n", buf);
    return OK;

observing Go's behavior on context cancellation

We can confirm that the close(2) issued by Go is indeed the one we expect by looking at the syscalls it performs (filtering some stuff out):

    strace -f -e 'trace=!futex,nanosleep' ./http


    1. non-blocking socket gets created
            socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
            setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
            connect(3, {sa_family=AF_INET, sin_port=htons(1337), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 
                    EINPROGRESS (Operation now in progress)

    2. socket gets added to epoll facility

            epoll_ctl(4, EPOLL_CTL_ADD, 3, {
                    EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=812239112, u64=139655968835848}}) = 0

    3. wait on the fds that were added - once `connect` finishes, we
       should get an event for that fd

            epoll_pwait(4, [{
                    EPOLLOUT, {u32=812239112, u64=139655968835848}}], 128, 0, NULL, 824634156712) = 1

    4. check if there were any errors in the conn

            getsockopt(3, SOL_SOCKET, SO_ERROR,  <unfinished ...>

    5. write to the socket

            write(3, "GET / HTTP/1.1\r\nHost: localhost:"..., 95)

    6. try to read from it

            read(3,  <unfinished ...>
            <... read resumed> 0xc00011e000, 4096) 
                    -1 EAGAIN (Resource temporarily unavailable)
                    // ...

    7. would block, so let's continue waiting ..

    8. deadline reached, all we gotta do is remove from the set and
       close it

            epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc000132984) = 0
            close(3)                    = 0

And to be absolutely sure about that, we can even trace with bpftrace how we get from userspace down to the sys_enter_close tracepoint (where we can observe the close(2) syscall from the kernel's perspective), and verify that it indeed happened after a read that took too long:

    bpftrace -e 'tracepoint:syscalls:sys_enter_close / comm == "http" / { printf("%s", ustack); }'

    syscall.Syscall+48
    internal/poll.(*FD).destroy+67
    internal/poll.(*FD).readUnlock+81
    internal/poll.(*FD).Read+519
    net.(*netFD).Read+79
    net.(*conn).Read+104
    net/http.(*persistConn).Read+117
    bufio.(*Reader).fill+259
    bufio.(*Reader).Peek+79
    net/http.(*persistConn).readLoop+470

  1. in the case of HTTP/2, where there's a single connection for multiple requests, that's not really a connection destroy, but more of a "stream cancellation" AFAIK (please let me know if I'm wrong)