Hey,

Those who have already deployed Docker to an EC2 instance might have noticed (or not) that, from within the containers, you’re able to make requests to the EC2 metadata service and discover information about the host.

While in some cases that’s desirable (e.g., not needing to explicitly pass credentials to containers, relying instead on instance profiles to authenticate against AWS services), sometimes it ends up leaking information that the containers (which you might not control) shouldn’t have access to.

To show both the usefulness of the metadata service and a setup where containers can’t access it, consider an instance tailored like this:

Sample EC2 instance with containers blocked from accessing the EC2 metadata service while a regular process can

A regular EC2 instance containing two components:

  • cirocosta/awsmon service: a daemon that gathers host metrics that AWS doesn’t automatically retrieve for you (like memory and load) and forwards them to AWS CloudWatch - without any hardcoded credentials, it makes use of the EC2 metadata service to retrieve the credentials necessary for putting metrics in CloudWatch;
  • Docker, running containers that should not be able to access the EC2 metadata service.

It turns out that there’s not much documentation out there on how to accomplish this kind of setup.

In this blog post, I go through the steps of letting regular processes connect to a destination while blocking containers from doing so (without removing all of their external connectivity).

What’s the EC2 metadata service about

The instance metadata service is a service that you can access from within an EC2 instance by making requests to a well-known address: http://169.254.169.254.

# Check what types of metadata values we
# can gather.
curl 169.254.169.254/latest/meta-data/
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
hostname
iam/                    # << this is interesting
instance-action
instance-id
...


# Retrieve the public IPv4 address of the current
# instance.
curl 169.254.169.254/latest/meta-data/public-ipv4
18.231.176.21


# Retrieve the instance profile associated with the
# instance that we're running.
curl http://169.254.169.254/latest/meta-data/iam/info
{
  "Code" : "Success",
  "LastUpdated" : "2018-04-29T13:29:01Z",
  "InstanceProfileArn" : "arn:aws:iam::111111111:instance-profile/put-metrics-profile",
  "InstanceProfileId" : "AAAABBBBBBBCCCCCDDD"
}


# Retrieve the credentials generated for the role (default) 
# assigned to the instance profile associated with the instance.
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/default
{
  "Code" : "Success",
  "LastUpdated" : "2018-04-29T13:28:16Z",
  "Type" : "AWS-HMAC",
  "AccessKeyId" : "AAAEEEEEIIIIOOO",
  "SecretAccessKey" : "aaaaaaaa2222222233333330000aaa",
  "Token" : "a-very-long-token-that-looks-to-be-base64-encoded=",
  "Expiration" : "2018-04-29T19:50:14Z"
}

As you can see, it’s not only information about the machine that can be retrieved - security credentials are also first-class citizens in the metadata service.

When you create an instance with an instance profile attached, this instance can retrieve the role credentials by making requests to this service.

Given that the service (like almost any other in AWS) is throttled, making too many requests to it will make it unavailable for the machine.

If you’re using the instance metadata service to retrieve AWS security credentials, avoid querying for credentials during every transaction or concurrently from a high number of threads or processes, as this may lead to throttling. Instead, we recommend that you cache the credentials until they start approaching their expiry time.

A user container could cause a denial of service just by making too many requests, rendering the metadata service unavailable to the processes on the machine that legitimately need it.
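As a rough illustration of that caching recommendation, a process could keep the credentials JSON in a local file and only go back to the metadata service once the reported Expiration gets close. Below is a minimal sketch, assuming jq and GNU date are available; the role name (default) and the cache path are just examples:

#!/bin/sh
# Sketch: cache the instance-profile credentials locally and
# only hit the metadata service again when they get close to
# their expiry time.

CACHE=/tmp/instance-creds.json   # made-up cache location
ROLE=default                     # role name under security-credentials/

fresh_enough() {
        # Usable if the cached credentials expire more than
        # five minutes from now.
        expiration=$(jq -r .Expiration "$CACHE" 2>/dev/null) || return 1
        exp_epoch=$(date -d "$expiration" +%s 2>/dev/null) || return 1
        [ $((exp_epoch - $(date +%s))) -gt 300 ]
}

if ! fresh_enough; then
        curl -s "http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE" \
                > "$CACHE"
fi

cat "$CACHE"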

Blocking any connectivity to the EC2 metadata service

The most straightforward way to block any requests to the EC2 metadata service is by modifying the routing table of the instance.

# Show the kernel's IP routing table
route 
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.31.0.1      0.0.0.0         UG    100    0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.18.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker_gwbridge
172.31.0.0      0.0.0.0         255.255.240.0   U     0      0        0 eth0
172.31.0.1      0.0.0.0         255.255.255.255 UH    100    0        0 eth0


# Place a rejection for the fixed IP of the
# metadata service
route add -host 169.254.169.254 reject

# Check the updated routing table
route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.31.0.1      0.0.0.0         UG    100    0        0 eth0
169.254.169.254 -               255.255.255.255 !H    0      -        0 -
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.18.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker_gwbridge
172.31.0.0      0.0.0.0         255.255.240.0   U     0      0        0 eth0
172.31.0.1      0.0.0.0         255.255.255.255 UH    100    0        0 eth0

Given that we’re rejecting traffic at the moment of the routing decision, no traffic (either from a container or from the host) is able to get to that address.

# Try to make a request to the EC2 metadata service
# right from the host.
curl http://169.254.169.254/ -v
*   Trying 169.254.169.254...
* TCP_NODELAY set
* Immediate connect fail for 169.254.169.254: No route to host
* Closing connection 0
curl: (7) Couldn't connect to server

In a container, we can’t connect either:

# Run a container and then try the same
# command.
#
# Given that at some point the process inside
# the container will end up issuing a `connect`
# that goes through the routing table, it'll
# fail.
docker run -it alpine /bin/sh
apk add --update curl
curl http://169.254.169.254/ -v
*   Trying 169.254.169.254...
* TCP_NODELAY set
* connect to 169.254.169.254 port 80 failed: Host is unreachable
* Failed to connect to 169.254.169.254 port 80: Host is unreachable
* Closing connection 0
curl: (7) Failed to connect to 169.254.169.254 port 80: Host is unreachable

If all you need is to make sure that happens after you’ve extracted the static initial metadata (e.g., the instance’s region, the instance’s ID, etc.), this is enough, and there’s no concern about Docker changing iptables.
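For instance, a boot script could grab the static values it needs first and only then install the blocking route (a sketch; the destination files are arbitrary):

# Grab the static metadata we care about before
# cutting the access off (the files below are just
# example locations).
curl -s 169.254.169.254/latest/meta-data/instance-id \
        > /etc/instance-id
curl -s 169.254.169.254/latest/meta-data/placement/availability-zone \
        > /etc/instance-az

# From this point on, neither the host nor the
# containers can reach the metadata service.
route add -host 169.254.169.254 reject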

If you need to preserve host connectivity to the service while disallowing Docker containers from accessing it, then a different approach is needed.

ps.: you can remove the “reject route” by issuing route del 169.254.169.254 reject.

Using internal networks to block external requests

When using overlay networks, one easy way of prohibiting requests to external services is to create internal-only networks.

These are networks without connectivity to docker_gwbridge, the bridge that allows containers to make external requests.

For instance:

# Create two networks: one that is internal-only and another
# that is a regular network with external connectivity.
docker network create \
        --driver overlay \
        --attachable \
        --internal mynet-internal
docker network create \
        --driver overlay \
        --attachable mynet

# Run a container in the network with external connectivity
docker run \
        --network mynet \
        --tty \
        --detach \
        --name with-conn \
        alpine /bin/sh

docker exec with-conn ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=51 time=2.603 ms
64 bytes from 8.8.8.8: seq=1 ttl=51 time=2.629 ms
^C


# Run a container in the network without external connectivity
docker run \
        --network mynet-internal \
        --tty \
        --detach \
        --name without-conn \
        alpine /bin/sh

docker exec without-conn ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: sendto: Network unreachable

Naturally, this covers a very particular use case - total isolation from the outside world - which isn’t very suitable for most people.

If you’re curious about the difference between these two networks, check out how the interfaces and routing are set up inside the network namespaces of the containers we just created.
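One way of seeing the difference for yourself is inspecting the interfaces and routes from inside the two containers started above (busybox’s ifconfig and route are already present in the alpine image):

# Interfaces and routes inside the container with
# external connectivity: besides the overlay interface,
# there's an extra one, plus a default route pointing
# at the gateway on docker_gwbridge.
docker exec with-conn ifconfig -a
docker exec with-conn route -n

# Inside the internal-only container there's just the
# overlay interface and no default route at all.
docker exec without-conn ifconfig -a
docker exec without-conn route -n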

In the container attached to the network with external connectivity, there’s an extra interface set up within the container’s network namespace that ends up connected to the docker_gwbridge bridge on the host:

Sample container network configuration together with its routing table

Any requests that are not directed towards the internal network are then routed through docker_gwbridge on the host.

Given that, on the host, Docker configures the necessary iptables rules to NAT outgoing packets originating from the docker_gwbridge interface, packets can flow to the outside world without problems.
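You can verify that by listing the NAT table on the host and looking for the MASQUERADE rule covering the docker_gwbridge subnet (172.18.0.0/16 in the routing table shown earlier):

# List the POSTROUTING chain of the nat table; docker
# adds a MASQUERADE rule there for the subnet behind
# docker_gwbridge.
iptables \
        --table nat \
        --list POSTROUTING \
        --numeric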

In the other container, though, there’s only the overlay network interface and no extra route to a default gateway like in the first one.

Sample container connected to an internal network without external connectivity

Putting everything together, and assuming there’s an extra container without connectivity, the overview looks like the following:

Overview of the network configuration of containers in a network with and without external connectivity

Using classid marking

Given that every container runs in a cgroup, and that it’s possible to make Docker use a parent cgroup that encapsulates all the containers, we could in principle use net_cls.classid to put a little mark on every packet and then use that mark in iptables to block certain traffic.
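Just to make the idea concrete, such an attempt would look roughly like the following (a sketch under cgroup v1; the classid value is arbitrary and <container-id> is a placeholder - and, as described next, the rule never ends up matching the containers’ traffic):

# Tag the sockets of a container's processes by writing
# a classid to its net_cls cgroup (the exact path depends
# on the cgroup driver in use).
echo 0x00100001 > \
        /sys/fs/cgroup/net_cls/docker/<container-id>/net_cls.classid

# Try to match that classid from the host. The rule loads,
# but packets forwarded from the container never match it,
# as the classid isn't visible outside the container's
# namespace.
iptables \
        --table mangle \
        --insert POSTROUTING \
        --match cgroup --cgroup 0x00100001 \
        --destination 169.254.169.254 \
        --jump DROP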

Although it’s indeed possible to have net_cls.classid set up, we can’t make use of that mark from outside the network namespace of the container (as the classid is restricted to the namespace).

Sample container with net_cls.classid configured and making ICMP requests

This limitation means that, to go with this strategy, we’d need to place specific iptables rules inside every container network namespace created.

Not good.

A solution: using the DOCKER-USER chain

Recent Docker daemons include an extra chain that is never changed by the daemon - the DOCKER-USER chain - meaning that we can modify it without having to worry about it being flushed or messed up by Docker.

As this chain sits right in the forwarding path that packets take from containers to external networks, we can place our filtering rules there and block any container access to the EC2 metadata service.

iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy DROP)
target            prot opt source        destination         
DOCKER-USER       all  --  anywhere      anywhere     # <<<<<<<<<
DOCKER-ISOLATION  all  --  anywhere      anywhere            
ACCEPT            all  --  anywhere      anywhere     ctstate RELATED,ESTABLISHED
DOCKER            all  --  anywhere      anywhere            
ACCEPT            all  --  anywhere      anywhere            
ACCEPT            all  --  anywhere      anywhere            

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain DOCKER (1 references)
target     prot opt source               destination         

Chain DOCKER-ISOLATION (1 references)
target     prot opt source               destination         
RETURN     all  --  anywhere             anywhere            

Chain DOCKER-USER (1 references)                      # <<<<<<<<<
target     prot opt source               destination         
RETURN     all  --  anywhere             anywhere     

One easy way of making sure that this is the right chain for our rules is adding one that simply jumps to the LOG target, giving us palpable evidence of what gets matched when we run some scenarios.

# Insert a rule to the DOCKER-USER chain to
# jump to the LOG target such that we can
# visualize the logs in the kernel logs
# whenever a container reaches 1.1.1.1.
#
# Because `DOCKER-USER` comes referenced
# in FORWARD, it'll only be matched in the
# case of the host forwarding the connections
# from the containers, and not those being
# initiated directly from the host (not 
# namespaced).
iptables \
        --insert DOCKER-USER \
        --jump LOG \
        --destination 1.1.1.1 \
        --log-prefix="[container]"

# Insert rule to the OUTPUT chain to
# jump to the LOG target such that we can
# visualize the packets sent to 1.1.1.1 
# from the host.
iptables \
        --insert OUTPUT \
        --jump LOG \
        --destination 1.1.1.1 \
        --log-prefix="[host]"

# Check how our two rules have been placed in
# the two chains that matter.
iptables \
        --list \
        --numeric
(...)
Chain OUTPUT (policy ACCEPT)
target     prot opt source         destination
LOG        all  --  0.0.0.0/0      1.1.1.1       LOG flags 0 level 4 prefix "[host]"

(...)
Chain DOCKER-USER (1 references)
target     prot opt source         destination
LOG        all  --  0.0.0.0/0      1.1.1.1       LOG flags 0 level 4 prefix "[container]"
RETURN     all  --  0.0.0.0/0      0.0.0.0/0


# Clear the kernel ring buffer so we
# don't check old logs
dmesg --clear

# Send some packets from the host
ping 1.1.1.1 -c 5

# Check dmesg for kernel logs.
#
# ps.: you could initiate a second terminal and
# follow the logs with the `--follow` opt.
dmesg
[ 3632.384997] [host]IN= OUT=eth0 SRC=172.31.5.64 DST=1.1.1.1 LEN=84 ... SEQ=1 
[ 3633.386725] [host]IN= OUT=eth0 SRC=172.31.5.64 DST=1.1.1.1 LEN=84 ... SEQ=2 
[ 3634.388343] [host]IN= OUT=eth0 SRC=172.31.5.64 DST=1.1.1.1 LEN=84 ... SEQ=3 
[ 3635.389922] [host]IN= OUT=eth0 SRC=172.31.5.64 DST=1.1.1.1 LEN=84 ... SEQ=4 
[ 3636.391557] [host]IN= OUT=eth0 SRC=172.31.5.64 DST=1.1.1.1 LEN=84 ... SEQ=5 

# Send some packets from a container 
docker run \
        --rm \
        alpine \
        ping 1.1.1.1 -c 5

# From the host, check `dmesg` and see that
# there are no other new `[host]` entries, but
# only new `[container]` entries as we wanted.
#
# Note that both an `IN` and an `OUT` interface
# show up, as expected given that we're in the
# middle of a forwarding path.
dmesg
[ 3724.759112] [container]IN=docker0 OUT=eth0 (...) 0 SRC=172.17.0.3 DST=1.1.1.1 LEN=84 ... SEQ=0 
[ 3725.759295] [container]IN=docker0 OUT=eth0 (...) 0 SRC=172.17.0.3 DST=1.1.1.1 LEN=84 ... SEQ=1 
[ 3726.759484] [container]IN=docker0 OUT=eth0 (...) 0 SRC=172.17.0.3 DST=1.1.1.1 LEN=84 ... SEQ=2 
[ 3727.759673] [container]IN=docker0 OUT=eth0 (...) 0 SRC=172.17.0.3 DST=1.1.1.1 LEN=84 ... SEQ=3 
[ 3728.759849] [container]IN=docker0 OUT=eth0 (...) 0 SRC=172.17.0.3 DST=1.1.1.1 LEN=84 ... SEQ=4 

Knowing which chain to place a filter rule in, it’s now just a matter of rejecting connections to the EC2 metadata service.

# Delete (optionally) the log rule.
#
# ps.: The `1` argument is the number of the
# rule in the chain.
iptables \
        --delete DOCKER-USER 1

# Add the rule to reject traffic to the ec2 metadata
# service.
iptables \
        --insert DOCKER-USER \
        --destination 169.254.169.254 \
        --jump REJECT

# Try to access the metadata service from the host.
curl 169.254.169.254/latest/meta-data/instance-id
i-0648792307a6e6610

# Try to access the metadata service from the container.
docker run -it alpine /bin/sh
apk add --update curl 
curl -v 169.254.169.254/
*   Trying 169.254.169.254...
* TCP_NODELAY set
* connect to 169.254.169.254 port 80 failed: Connection refused
* Failed to connect to 169.254.169.254 port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 169.254.169.254 port 80: Connection refused

That’s it!

Make sure that you have the iptables configuration properly set up at boot (given that the changes above are not persistent across reboots) and you’re done.
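One way of doing that (a sketch, assuming systemd; the unit name and the iptables path are made up for the example) is a small unit that re-adds the rule once the Docker daemon - and thus the DOCKER-USER chain - is up:

# /etc/systemd/system/block-metadata.service (hypothetical name)
[Unit]
Description=Block container access to the EC2 metadata service
# DOCKER-USER only exists once the daemon has started.
After=docker.service
Requires=docker.service

[Service]
Type=oneshot
ExecStart=/sbin/iptables --insert DOCKER-USER \
        --destination 169.254.169.254 --jump REJECT

[Install]
WantedBy=multi-user.target

After placing the file, systemctl enable block-metadata.service takes care of running it on every boot; other mechanisms for persisting iptables rules work just as well.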

If you have any questions or saw anything wrong in the blog post, please let me know! I’m cirowrc on Twitter.

Have a good one!

Ciro