Hey,

This is a “quick” intro to get everyone on the concourse team up to speed on the beginning of our work on containerd - you can check out the progress in #4783 - along with some of what I’ve learned so far.

Keep in mind that I’m no containerd expert, so I might be stating things that are not 100% true or accurate.

overview

As you probably know, at some point concourse has to run definitions of “work to be done” in the form of processes on a machine.

    WEB
            "I gotta run this thing somewhere"

            - grabs a worker

                    --> hey, create this container following this spec
                    --> btw, run this process in it

Given that the garden interface is high-level enough, as long as the backend that implements it does what it’s supposed to, web doesn’t need to worry about which implementation it’s talking to.

    WEB     -----container action-----> Garden implementor (backend)
                                                              |
                                       (how it's called)   <--'
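
For reference, the contract that web codes against looks roughly like this - a trimmed-down sketch of the interfaces from code.cloudfoundry.org/garden (the real ones have a few more methods):

    // a trimmed-down sketch of the garden interfaces - see
    // code.cloudfoundry.org/garden for the full definitions.
    type Client interface {
            Create(ContainerSpec) (Container, error)
            Destroy(handle string) error
            Lookup(handle string) (Container, error)
            Containers(Properties) ([]Container, error)
            // ...
    }

    type Container interface {
            Run(ProcessSpec, ProcessIO) (Process, error)
            Stop(kill bool) error
            Properties() (Properties, error)
            // ...
    }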

For instance, if you look at the workers that are currently registered against https://ci.concourse-ci.org, you’ll see that we have two different garden implementors: houdini, for windows and darwin, and guardian for the linux ones.

    $ fly -t ci workers
    name                                              containers  platform  tags        team                   
    9561ac54-vm-f742202a-e8d9-45b6-5bbc-7fada8c2c958  4           windows   none        none                   
    c184f9d5-a073-4de7-8ded-e2b78055cc82              4           linux     bosh        main                   
    ci-monitoring-worker-0                            16          linux     none        monitoring-hush-house  
    ci-pr-worker-0                                    3           linux     pr          none                   
    ci-topgun-worker-0                                13          linux     k8s-topgun  none                   
    ci-worker-0                                       53          linux     none        none                   
    darwin-worker                                     4           darwin    none        main                   

As an example of how this looks in practice, container creation looks like this (from worker.createGardenContainer):

    func (w workerHelper) createGardenContainer(
            containerSpec ContainerSpec,
            fetchedImage FetchedImage,
            handleToCreate string,
            bindMounts []garden.BindMount,
    ) (gclient.Container, error) {

            // do some setup ...
            env := append(fetchedImage.Metadata.Env, containerSpec.Env...)

            return w.gardenClient.Create(
                    garden.ContainerSpec{
                            Handle:     handleToCreate,
                            RootFSPath: fetchedImage.URL,
                            Privileged: fetchedImage.Privileged,
                            BindMounts: bindMounts,
                            Limits:     containerSpec.Limits.ToGardenLimits(),
                            Env:        env,
                            Properties: gardenProperties,
                    })
    }

And running a process in a container looks like this (from worker.gardenWorkerContainer.RunScript, with some code deleted / modified for readability):

    func (container *gardenWorkerContainer) RunScript(ctx context.Context,
            path string, args []string,
            input []byte, output interface{},
            logDest io.Writer, recoverable bool,
    ) error {

            stdout := new(bytes.Buffer)
            stderr := new(bytes.Buffer)

            processIO := garden.ProcessIO{
                    Stdin:  bytes.NewBuffer(input),
                    Stdout: stdout,
                    Stderr: stderr,
            }

            process, err := container.Run(ctx, garden.ProcessSpec{
                    Path: path,
                    Args: args,
            }, processIO)
            if err != nil {
                    return err
            }

            var (
                    processStatus int
                    processErr    error
            )

            processExited := make(chan struct{})

            go func() {
                    processStatus, processErr = process.Wait()
                    close(processExited)
            }()

            select {
            case <-processExited:   // execution finished
                    if processErr != nil {
                            return processErr
                    }

                    if processStatus != 0 {
                            return runtime.ErrResourceScriptFailed{
                                    Path:       path,
                                    Args:       args,
                                    ExitStatus: processStatus,
                                    Stderr:     stderr.String(),
                            }
                    }

                    return nil

            case <-ctx.Done():       // cancelled
                    container.Stop(false)
                    <-processExited
                    return ctx.Err()
            }
    }

That’s all to say that as long as we implement the Garden interface, we can swap the container runtimes as we wish, and that’s exactly the first step that we’re taking with containerd: writing a replacement for guardian in the Linux stack.

    web ...........................................
    .
    .
    .       concourse web ----.
    .                         |
    .                         |
    .                         |
    worker ...................+....................
    .                         |
    .                         |
    .       concourse worker  |
    .               garden backend ---.     :7777
    .                                 |
    .                                 |
    .       containerd                |
    .               /run/containerd/containerd.sock
    .

Again, for web, it’s still communicating with a garden server.

containerd

containerd is a container runtime that’s able to manage the complete container lifecycle - from fetching images from a registry, to setting up storage, running containers, and destroying them. In terms of usage, it’s currently the engine under the hood of moby, buildkit, kubernetes (when using containerd-cri), pouch, and, recently, openfaas.

The way we’re aiming to run it is as a separate process that gets spawned by an ifrit runner, which brings it up from the binaries located under the usual Concourse assets path (/usr/local/concourse/bin).

    worker
            containerd runner
                    --> /usr/local/concourse/bin/containerd (separate process)
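
The exact shape of that runner is still up in the air, but a minimal sketch of the idea could look like the following - the containerdRunner type, fields, and flags below are illustrative, not the final code:

    package main // illustrative only

    import (
            "fmt"
            "os"
            "os/exec"
    )

    // containerdRunner is a hypothetical ifrit.Runner (Run(signals, ready) error)
    // that spawns containerd as a child process and blocks until it exits or a
    // signal arrives.
    type containerdRunner struct {
            bin    string // e.g. /usr/local/concourse/bin/containerd
            config string // path to a containerd config.toml
    }

    func (r containerdRunner) Run(signals <-chan os.Signal, ready chan<- struct{}) error {
            cmd := exec.Command(r.bin, "--config", r.config)
            cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr

            if err := cmd.Start(); err != nil {
                    return fmt.Errorf("start containerd: %w", err)
            }

            // a real runner would probably wait for the socket to come up
            // before signalling readiness
            close(ready)

            exited := make(chan error, 1)
            go func() { exited <- cmd.Wait() }()

            select {
            case err := <-exited:
                    return err
            case sig := <-signals:
                    cmd.Process.Signal(sig)
                    return <-exited
            }
    }

Such a runner would then get invoked (via ifrit) alongside the other members that `concourse worker` already brings up.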

Once up, the interaction with it takes place through a client that has all that we need to touch a running containerd instance on our machine (the worker):

    client, err := containerd.New("/run/containerd/containerd.sock")
    if err != nil {
            err = fmt.Errorf("containerd client conn: %w", err)
            return
    }

    defer client.Close()

Under the hood, this takes care of instantiating the gRPC client and “dialing” the unix socket, so that further interactions with containerd can take place through remote procedure calls that we perform against it.

    // New returns a new containerd client that is connected to the containerd
    // instance provided by address
    //
    func New(address string, opts ...ClientOpt) (*Client, error) {
            gopts := []grpc.DialOption{
                    grpc.WithBlock(),
                    grpc.WithInsecure(),
                    grpc.FailOnNonTempDialError(true),
                    grpc.WithBackoffMaxDelay(3 * time.Second),
                    grpc.WithContextDialer(dialer.ContextDialer),
                    grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(defaults.DefaultMaxRecvMsgSize)),
                    grpc.WithDefaultCallOptions(grpc.MaxCallSendMsgSize(defaults.DefaultMaxSendMsgSize)),
            }

            connector := func() (*grpc.ClientConn, error) {
                    ctx, cancel := context.WithTimeout(context.Background(), copts.timeout)
                    defer cancel()
                    conn, err := grpc.DialContext(ctx, dialer.DialAddress(address), gopts...)
                    if err != nil {
                            return nil, errors.Wrapf(err, "failed to dial %q", address)
                    }
                    return conn, nil
            }

            conn, err := connector()
            if err != nil {
                    return nil, err
            }

            // ...
    }

That’s because unlike garden, whose interface is a REST-like HTTP-based API, containerd exposes its functionality through gRPC services whose specs are defined in the form of protocol buffers in its interface definition language.

    worker
            backend
                    containerd client

                            ------grpc rpc ----> containerd

For instance, looking at the “containers” protobuf spec, we can see the definition of how one interacts with the provider of the “containers service”.

    syntax = "proto3";

    service Containers {
            rpc Get(GetContainerRequest) returns (GetContainerResponse);
            rpc List(ListContainersRequest) returns (ListContainersResponse);
            rpc ListStream(ListContainersRequest) returns (stream ListContainerMessage);
            rpc Create(CreateContainerRequest) returns (CreateContainerResponse);
            rpc Update(UpdateContainerRequest) returns (UpdateContainerResponse);
            rpc Delete(DeleteContainerRequest) returns (google.protobuf.Empty);
    }

    message Container {
            string id = 1;
            map<string, string> labels  = 2;

            // ...

            google.protobuf.Timestamp created_at = 8 [(gogoproto.stdtime) = true, (gogoproto.nullable) = false];
    }

    message GetContainerRequest {
            string id = 1;
    }

    message GetContainerResponse {
            Container container = 1 [(gogoproto.nullable) = false];
    }

(from containers.proto)

ps.: for a simple “hello world” example of grpc-go, check out cirocosta/hello-grpc.

Using the gRPC toolchain, the containerd maintainers take that definition and turn it into both client- and server-side code that implements it, which we can then consume (the client side) from the github.com/containerd/containerd package.

    containers, err := client.ContainerService().List(context.Background())
    if err != nil {
            err = fmt.Errorf("list containers: %w", err)
            return
    }

    fmt.Printf("%-16s %s\n", "ID", "CREATED-AT")
    for _, container := range containers {
            fmt.Printf("%-16s %s\n",
                    container.ID,
                    container.CreatedAt.String(),
            )
    }

That said, as long as we’re matching compatible versions of the server (containerd) and the client (the github.com/containerd/containerd Go package), that interface is “guaranteed” to be honored.

go.mod:

    module github.com/concourse/concourse

    require (
            github.com/containerd/containerd v1.3.2
            // ...
    )

    go 1.13

dockerfile:

    ARG CONTAINERD_VERSION=1.3.2
    RUN curl -sSL $URL/releases/download/v$CONTAINERD_VERSION/containerd-$CONTAINERD_VERSION.linux-amd64.tar.gz \
            | tar -zvxf - -C /usr/local/concourse/bin --strip-components=1 && \

namespaces

If we tried to run the code above though, it wouldn’t work - we’re missing something: namespaces.

Whenever interacting with the containerd API, we must specify which “tenant” we’re interacting with, allowing containerd to be targeted by multiple consumers without conflicts between the objects maintained for each of them.

    client
      | 
      | `` what are the containers
      |           in NS1?         ,,
      | 
      '------>      containerd
              .---------------.--------------------.
              |  NS1          |   NS2              |
              |  containers   |  other containers  |
              |  some imgs    |  some other imgs   |
              |   ...         |   ..               |

To fix that example above then, we could either get the namespace information into the context used for that call, or configure the client with a default namespace for all calls to containerd made by that client.

    ctx := namespaces.WithNamespace(context.Background(), "ns1")

    // or

    client, err := containerd.New(
            "/run/containerd/containerd.sock",
            containerd.WithDefaultNamespace("ns1"),
    )
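
With one of those in place, the listing example from before starts working - e.g., with the explicit context approach (reusing the same client and the “ns1” namespace from above):

    ctx := namespaces.WithNamespace(context.Background(), "ns1")

    containers, err := client.ContainerService().List(ctx)
    if err != nil {
            err = fmt.Errorf("list containers: %w", err)
            return
    }

    fmt.Println("containers in ns1:", len(containers))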

services

Internally, containerd is composed of multiple components whose interfaces are exposed as gRPC services, and these are presented in a nicer, higher-level form through the containerd API that the client interacts with (also through gRPC).

    concourse containerd backend   |
    ----------------------------   | client
          containerd client        |
                 |
                 | grpc
                 |
          containerd API           |
    ----------------------------   | server
       low-level grpc services     |

Despite these being very loosely coupled, they’re meant to work together, driven by the needs of the higher-level containerd API.
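
On the client side, each of those low-level services is reachable through its own accessor on the containerd client - a few examples (not an exhaustive list):

    // a handful of the service-level accessors exposed by *containerd.Client
    containerStore := client.ContainerService()         // container metadata
    imageStore := client.ImageService()                  // image metadata
    contentStore := client.ContentStore()                // content-addressable blob storage
    snapshotter := client.SnapshotService("overlayfs")   // layered filesystems (rootfs snapshots)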

For instance, in the process of going from having an image in a registry to actually running a container, quite a few of those need to interact:

    pull
            --> fetch
                    --> content
                    --> images

            --> unpack
                    <-- content
                    <-- images
                    --> snapshots
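
If we were using containerd’s fetcher, that whole flow would hide behind a single client call - roughly like this, reusing the namespaced ctx from before (the image reference is just an example):

    // Pull fetches the image's blobs into the content store, registers the
    // image in the image store, and (with WithPullUnpack) unpacks the layers
    // into a snapshot ready to be used as a rootfs.
    image, err := client.Pull(ctx, "docker.io/library/busybox:latest",
            containerd.WithPullUnpack,
    )
    if err != nil {
            err = fmt.Errorf("pull: %w", err)
            return
    }

    fmt.Println("pulled", image.Name())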

ps.: given this decoupled nature, some of these components are actually swappable through custom plugins - e.g., you can bring your own Snapshotter (storage), or Task (runtime).

In this first iteration of our Garden backend though, to have the least amount of divergence from what we currently run on top of (guardian), what I proposed was that we don’t use containerd’s fetcher, and instead continue leveraging baggageclaim for now, which already gives us a root filesystem that we can use in our containers.

    task to run
            --> baggageclaim prepares a volume for it
            --> garden references that volume
                    --> containerd uses that rootfs volume

By doing this, we can learn from all of the pieces that we’ll definitely get wrong, and have a more “oranges to oranges” comparison when it comes to its day-to-day operations.

container

For running a container, a few services are usually involved, just like for images.

Assuming that we’re using containerd’s image fetcher, the flow would look like the following:

    run
            initialize
                    <-- images
                    --> snapshot
            setup
                    <-- snapshot
                    --> containers  (metadata)
            start
                    <-- containers
                    --> tasks       *actual container

First, it starts by reading the image’s configuration and creating an OCI spec that describes the container that we want to run, then creates a copy-on-write (COW) layer to serve as the rootfs.

Once that’s done, it then moves on to setting up the linux namespaces, mounts, etc, and then actually starting the process itself.
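
Through the client, that flow looks roughly like this - a sketch assuming the image has already been pulled and unpacked, with illustrative identifiers:

    // container metadata: a new COW snapshot on top of the image's layers,
    // plus an OCI spec derived from the image's configuration
    container, err := client.NewContainer(ctx, "some-handle",
            containerd.WithNewSnapshot("some-handle-snapshot", image),
            containerd.WithNewSpec(oci.WithImageConfig(image)),
    )
    if err != nil {
            err = fmt.Errorf("new container: %w", err)
            return
    }

    // the task is the actual "live" part: namespaces, mounts, and the init
    // process that the shim + runc set up
    task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
    if err != nil {
            err = fmt.Errorf("new task: %w", err)
            return
    }

    err = task.Start(ctx)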

In our case though, if we assume that baggageclaim is there to give us the volumes, and that we don’t run containers “to completion” but rather use them as “places to execute stuff”, it looks more like this:

    1. garden.Create(containerSpec)

            setup
                    --> containers  (creates metadata that specifies
                                     that we want a container)

            start
                    <-- containers
                    --> task        * the container

    2. garden.Run(processSpec, processIO)

            exec
                    <-- task
                    --> process     * our process in the task sandbox
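
In containerd client terms, those two garden calls could map to something along these lines - a sketch, not the actual backend code: handle, rootfsPath, and the process spec are placeholders (specs being the OCI runtime-spec package), and a real backend would also have to deal with an init process, env, privileges, and so on:

    // 1. garden.Create: container metadata + task, with the rootfs pointing
    //    at the volume that baggageclaim prepared for us
    container, err := client.NewContainer(ctx, handle,
            containerd.WithNewSpec(oci.WithRootFSPath(rootfsPath)),
    )
    if err != nil {
            err = fmt.Errorf("new container: %w", err)
            return
    }

    task, err := container.NewTask(ctx, cio.NullIO)
    if err != nil {
            err = fmt.Errorf("new task: %w", err)
            return
    }

    if err = task.Start(ctx); err != nil {
            err = fmt.Errorf("task start: %w", err)
            return
    }

    // 2. garden.Run: exec a process inside that task's sandbox
    proc, err := task.Exec(ctx, "some-proc-id",
            &specs.Process{
                    Args: []string{"/opt/resource/check"},
                    Cwd:  "/",
            },
            cio.NewCreator(cio.WithStdio),
    )
    if err != nil {
            err = fmt.Errorf("exec: %w", err)
            return
    }

    err = proc.Start(ctx)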

Despite containerd being the thing that provides the API for dealing with the entire lifecycle of a container, the containers themselves are not tied to the lifecycle of containerd - there’s a strict separation between the two.

When it comes time to run a container, containerd forks off a shim which is responsible for calling out to runc, with both of these being re-parented to the system’s init (they decouple themselves from containerd).

    systemd───containerd-shim───executable───5*[{executable}]

As the shim is tailored towards a specific version of runc, that’s another dependency whose version we must also get right.