Hey,

a month ago I was looking at some issues with a container running a MySQL instance against an NFS-mounted directory on the host (should you do this? maybe not).

(Illustration: MySQL interacting with AWS EFS, showing an error)

The issues seemed pretty weird to me (I’m not a MySQL guy): all sorts of errors related to disk quotas kept popping up.

And yes, we did have plenty of space:

df -h
fs-<...>.amazonaws.com:/  8.0E  1.6G  8.0E   1% /mnt/nfs

The problem, as the logs revealed, was that InnoDB wasn’t able to grab a lock that it wanted.

That seemed even weirder to me as I thought we’d never hit a limit on that front.

Making the case

My idea for getting to the bottom of this was to try to replicate what the user was facing - he couldn’t bring his MySQL instance up.

Let’s bring something up and look at some MySQL details:

# Create a container named `inst1` with
# two environment variables set that alter
# the behavior of the mysql instance startup.
#
# The code that runs at startup time can be found
# at the repository that holds this image:
# https://hub.docker.com/mysql
docker run \
        --detach \
        --name inst1 \
        --env MYSQL_DATABASE=testdb \
        --env MYSQL_ALLOW_EMPTY_PASSWORD=true \
        mysql

These two environment variables give us a running MySQL instance with an empty root password and a database named testdb already created.

If you’re curious, this comes from the mysql image’s entrypoint:

# If the `MYSQL_DATABASE` variable has been
# set, then add the database creation
# statement to the list of mysql commands to
# run after the process has started.
# (line 159 of docker-entrypoint.sh)
if [ "$MYSQL_DATABASE" ]; then
        echo "CREATE DATABASE IF NOT EXISTS \`$MYSQL_DATABASE\` ;" | \
                "${mysql[@]}"
        mysql+=( "$MYSQL_DATABASE" )
fi

With this, we can start looking at the structure of the filesystem that MySQL creates.

# get into the container
docker exec \
        --interactive \
        --tty \
        inst1 \
        /bin/bash


# From within the container, get into the directory 
# where mysql puts our database files and everything 
# that we should persist
cd /var/lib/mysql


# Find all the directories that are out there
find . -type d

        .
        ./sys
        ./mysql
        ./testdb
        ./performance_schema


# Check that indeed, these are just databases
echo "SHOW DATABASES;" | mysql 

        Database
        information_schema
        mysql                   <<<<<<<
        performance_schema      <<<<<<<
        sys
        testdb                  <<<<<<<

Cool, but what about the locks?

We can count how many there are by either making use of lslocks or looking at /proc/locks. The benefit of using lslocks here is that it’ll resolve the paths for us, so we know which of the MySQL instance’s files are holding locks:

lslocks
COMMAND PID  TYPE SIZE MODE  M START END PATH
mysqld    1 POSIX  48M WRITE 0     0   0 ./ib_logfile0
mysqld    1 POSIX  48M WRITE 0     0   0 ./ib_logfile1
mysqld    1 POSIX  76M WRITE 0     0   0 ./ibdata1
mysqld    1 POSIX  12M WRITE 0     0   0 ./ibtmp1
mysqld    1 POSIX  96K WRITE 0     0   0 ./mysql/plugin.ibd
mysqld    1 POSIX  96K WRITE 0     0   0 ./mysql/innodb_table_stats.ibd
mysqld    1 POSIX  96K WRITE 0     0   0 ./mysql/innodb_index_stats.ibd
mysqld    1 POSIX  96K WRITE 0     0   0 ./mysql/gtid_executed.ibd
mysqld    1 POSIX  96K WRITE 0     0   0 ./mysql/server_cost.ibd
mysqld    1 POSIX  96K WRITE 0     0   0 ./mysql/engine_cost.ibd
mysqld    1 POSIX  96K WRITE 0     0   0 ./mysql/time_zone_leap_second.ibd
mysqld    1 POSIX  96K WRITE 0     0   0 ./mysql/servers.ibd
mysqld    1 POSIX 320K WRITE 0     0   0 ./mysql/time_zone_name.ibd
mysqld    1 POSIX 144K WRITE 0     0   0 ./mysql/time_zone.ibd
mysqld    1 POSIX  12M WRITE 0     0   0 ./mysql/time_zone_transition.ibd
mysqld    1 POSIX 464K WRITE 0     0   0 ./mysql/time_zone_transition_type.ibd
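
By the way, lslocks is just a friendlier view over /proc/locks; if all we wanted was a raw count, reading that file directly would do too, although it lists every process on the system and doesn’t resolve paths (just a quick sketch):

# Each line in /proc/locks is one active lock; the fcntl()
# locks that mysqld takes show up as POSIX entries.
grep -c POSIX /proc/locks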

Well, there’s no testdb in that lslocks listing.

So we know that it’s not locking at the database level.

Maybe it does lock, just differently - after all, there is a way of taking a mutually exclusive read lock with FLUSH TABLES WITH READ LOCK.

We can confirm that it actually locks at the table level:

# Create the query that creates a table called 
# `example` with just two dummy columns.
table_creation_query="
CREATE TABLE example ( 
        id INT, 
        data VARCHAR(100) 
);"


# Execute the query against our database
mysql --database testdb <<< "$table_creation_query"


# Verify that a lock has been created
lslocks | grep testdb
mysqld    1 POSIX  96K WRITE 0     0   0 ./testdb/example.ibd

Cool, we now know that MySQL will hold locks per-table.

How does MySQL behave on lock starvation?

Coming at this with close to zero knowledge about MySQL, my idea was to force MySQL to fail whenever it tried to get a lock, by setting a limit on the number of locks that the process could take.

The idea is to make use of the setrlimit syscall via the ulimit utility:

GETRLIMIT(2)    Linux Programmer's Manual         GETRLIMIT(2)

NAME
       getrlimit, setrlimit, prlimit - get/set resource limits

SYNOPSIS
       int setrlimit(int resource, const struct rlimit *rlim);

...

       The  getrlimit()  and  setrlimit()  system  calls  get  
       and set resource limits respectively.  Each resource has 
       an associated soft and hard limit ...


           struct rlimit {
               /* Soft limit */
               rlim_t rlim_cur;  

               /* Hard limit (ceiling for rlim_cur) */
               rlim_t rlim_max;  
           };

It essentially means that we can put limits on certain resources by specifying two numbers (a soft and a hard limit). The resource that we want to limit in this case is the number of locks a process can take:

       RLIMIT_LOCKS
              A limit on the combined number of flock(2) locks 
              and fcntl(2) leases that this process may establish.

That sounds pretty close but we don’t know if mysql is indeed making use of these calls.
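
For reference, this is the limit that bash exposes through its ulimit builtin as “file locks” (the -x flag on Linux); a quick sketch of reading and lowering it for the current shell:

# Show the current soft limit on file locks.
ulimit -x

# Lower it to 25 for this shell and its children.
ulimit -x 25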

We know that lslocks does see a new lock when we create a new table but … does mysqld use one of those syscalls to take it?

We can check that using strace:

# Get the PID of the MySQL daemon instance as seen by
# the host.
docker top inst1
PID          ..   CMD
861379       ..   mysqld
861589       ..   /bin/bash
864033       ..   /bin/bash


# Cool, PID of `mysqld` is 861379.
# From the host we can specify that pid to strace
# and tell it to attach to all the processes raised 
# by it. 
#
# This way we can look at every syscall
# made by them while it runs.
#
# Here I'm making use of the following arguments:
# -f:   Trace  child  processes (and "lightweight processes")
# -e:   Filter the calls so that we only get notified when
#       there's a syscall that matches the term
# -p:   Pid to attach to (and its children)

sudo strace -f -e fcntl -p 861379 
strace: Process 861379 attached with 28 threads


# Now that we got attached, in a separate terminal, create a new
# table by sending the table creation command to the
# container's mysql binary (client) that will get the command
# sent to the mysql daemon.
echo "CREATE TABLE example2 ( id INT, data VARCHAR(100) );" | 
        docker exec -i inst1 \
                mysql --database testdb


# Now look at our strace results!
# We spotted some `fcntl` syscalls with
# the lock-related parameters
[pid 862376] fcntl(44, F_SETLK, {       <<<<<
        l_type=F_WRLCK,                 <<<<<
        l_whence=SEEK_SET, 
        l_start=0, 
        l_len=0}) = 0

[pid 862376] fcntl(44, F_SETLK, {       <<<<<<
        l_type=F_WRLCK,                 <<<<<<
        l_whence=SEEK_SET, 
        l_start=0, 
        l_len=0}) = 0

From the declaration of fcntl we can see what those arguments are about:

fcntl - manipulate file descriptor

int fcntl(int fd, int cmd, ... /* arg */ );

FD:             file descriptor to act upon
CMD:            action to perform
ARG:            well, arguments

fd=44           (acting upon file descriptor 44)
cmd=F_SETLK     (set a lock; combined with l_type=F_WRLCK, a write - exclusive - lock)

But how do we know what that 44 is? What does that file descriptor point to?

To find out, we can look at the table of file descriptors opened by a given process.

From within the container (we could do this from the outside too) we can check that:

# Get inside the container
docker exec -it inst1 /bin/bash

# By looking at the proc virtual filesystem we
# can check what files are open.
#
# Because we're running a container that runs MySQL
# as PID 1, we look at the properties of the
# process whose PID is 1.
#
readlink /proc/1/fd/44 
/var/lib/mysql/testdb/example2.ibd

Exactly! Our table example2 has a lock associated with it (as we wanted), and that lock gets set using fcntl - a syscall whose locks we’d expect setrlimit to be able to limit.
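
As mentioned before, the same lookup works from the host too - just use the host-side PID that docker top reported instead of the container’s PID 1 (a sketch, assuming file descriptor 44 is still the one we saw in strace):

# From the host, resolve the same file descriptor using
# the host-side PID of mysqld (861379 in the `docker top`
# output above).
sudo readlink /proc/861379/fd/44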

So now we know that we can make use of setrlimit with RLIMIT_LOCKS and expect the kernel to limit the process’s resources.

With Docker, we make use of it by specifying an extra argument to the run command: --ulimit.

# Remove the previous container, then run the same
# command as before but this time with a limit
# of 25 locks on the new container.
docker rm -f inst1

docker run \
        --detach \
        --name inst1 \
        --ulimit locks=25 \
        --env MYSQL_DATABASE=testdb \
        --env MYSQL_ALLOW_EMPTY_PASSWORD=true \
        mysql

To see it in action, let’s try to create 30 tables.

#!/bin/bash

# exit on any failed execution
set -o errexit

# main routine that calls the
# table creation method 30 times
main () {
        for i in $(seq 1 30); do
                create_table mytable$i
        done
}

# execute a command within the mysql
# container to create a table.
# args:
#       1) name of the table
create_table () {
        local name=$1

        echo "CREATE TABLE $name ( 
        id INT, 
        data VARCHAR(100) 
);" | docker exec -i inst1 \
        mysql --database=testdb
}

main

Guess what? It doesn’t work! The script creates all 30 tables without a problem. We’re not limiting the locks as we wanted.
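
A quick way of double-checking that we really did blow past the limit is to count the per-table locks again (a sketch, assuming lslocks is available in the container as before):

# Count how many files under testdb are currently holding
# a lock; if --ulimit locks=25 were actually enforced, this
# could never go past 25.
docker exec inst1 sh -c "lslocks | grep -c testdb"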

The reason it doesn’t work is that RLIMIT_LOCKS stopped being enforced back in the Linux 2.4 days, meaning we were relying on a feature that isn’t even there anymore.
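
We can see that for ourselves with a tiny experiment outside of MySQL - lower the limit in a shell and grab a bunch of flock(2) locks anyway (a minimal sketch, assuming bash and util-linux’s flock):

# Soft-limit "file locks" to 2 for this shell.
ulimit -x 2

# Grab 10 locks, keeping the file descriptors open so that
# the locks stay held; every single one succeeds despite
# the limit of 2.
for i in $(seq 1 10); do
        exec {fd}>"/tmp/lockfile-$i"
        flock --nonblock "$fd" && echo "lock $i acquired"
done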

How can we get through that?

Messing up our syscalls

There’s a Linux syscall (ptrace) that can help us.

Quoting the man pages, it provides a means by which one process (the tracer) may observe and control the execution of another process (the tracee), even changing the tracee’s memory and registers.

So, given that we know exactly which syscall causes the trouble (the one that takes the locks), we can get in the way of its invocations for a given process and tamper with the results so that the tracee (mysqld in our case) thinks it has run out of locks (hit its disk quota).

Given that strace (the tool) makes use of ptrace and lets us tamper with syscalls, we can use it to emulate a quota issue: after n fcntl calls trying to lock, fail with ENOLCK (“Too many segment locks open, lock table is full, or a remote locking protocol failed”).

# Run `strace` targeting the process at pid $PID
# as well as its children.
#
# Inject an error into `fcntl` from its 25th
# invocation onwards.
strace \
        -p $PID \
        -f \
        -e fault=fcntl:error=ENOLCK:when=25+

Running the script from before (the one that creates 30 tables), we can see in the InnoDB logs:

mysqld: ready for connections.
Version: '5.7.21'  socket: '/var/run/mysqld/mysqld.sock'  
                   port: 3306  MySQL Community Server (GPL)
[ERROR] InnoDB: Unable to lock ./testdb/mytable22.ibd error: 37
[ERROR] InnoDB: Operating system error number 37 in a file operation.

                                        \/ \/ \/ \/
[ERROR] InnoDB: Error number 37 means 'No locks available'
                                        /\ /\ /\ /\

[ERROR] InnoDB: Cannot create file './testdb/mytable22.ibd'

Cool, we failed the table creation process!

Now it’s time to do it on AWS.

Terraforming a test scenario in AWS

The scenario is simple: a single machine and an EFS mount target in the same AZ (Availability Zone).

To provision that I made use of Terraform, keeping the setup declarative and reproducible.

I structured it like the following:

.
├── main.tf             # data retrieval and output
├── efs.tf              # provisions the filesystem and a mount target
│                       # in the specific AZ that our instance is in
├── instance.tf         # creates an instance in the region and AZ we want
├── keys.tf             # provisions the key in the region
├── keys                # dumb keys for this one-off experiment
│   ├── key.rsa
│   └── key.rsa.pub
└── terraform.tfstate   # state

1 directory, 7 files

If you’re curious about the full Terraform files, check them out at cirocosta/efs-locks-sample; here I want to highlight efs.tf.

Whenever you create an EFS filesystem, you create it in a region (note that not all regions support it - I think it’s something like only 5), and then within that region you create mount targets that attach the filesystem to subnets you’ve already provisioned.

For instance, say you’re in us-east-1 and have subnets in us-east-1a and us-east-1b. You create two mount targets, and then your instances in those two availability zones can connect to the filesystem.

In Terraform terms, that means you’d have a list of the subnets and availability zones you want your EFS mount targets in and then iterate over them (using count) to create the mount targets.

Here I’m sticking with a single AZ to make things easier.

# Creates a new empty file system in EFS.
resource "aws_efs_file_system" "main" {
  tags {
    Name = "locks-test-fs"
  }
}

# Creates a mount target of EFS in a specified subnet
# such that our instances can connect to it.
resource "aws_efs_mount_target" "main" {
  file_system_id  = "${aws_efs_file_system.main.id}"
  subnet_id       = "${var.subnet}"
  security_groups = ["${aws_security_group.efs.id}"]
}

# Allows both ingress and egress for port 2049 (nfs)
# such that our instances can get to the mount
# target in the AZ.
resource "aws_security_group" "efs" {
  name        = "efs-mnt"
  description = "Allows NFS traffic from instances within the VPC."
  vpc_id      = "${data.aws_vpc.default.id}"

  ingress {
    from_port = 2049
    to_port   = 2049
    protocol  = "tcp"

    cidr_blocks = [
      "${data.aws_vpc.default.cidr_block}",
    ]
  }

  egress {
    from_port = 2049
    to_port   = 2049
    protocol  = "tcp"

    cidr_blocks = [
      "${data.aws_vpc.default.cidr_block}",
    ]
  }

  tags {
    Name = "allow_nfs-ec2"
  }
}

The output of that is an address that we can mount as an NFS partition - onto whatever directory we want the filesystem to live in.

#!/bin/bash

# Performs the mounting of an EFS mount target into
# predefined MOUNT_LOCATION in the filesystem.


set -o errexit

readonly MOUNT_LOCATION="${MOUNT_LOCATION:-/mnt/efs}"
readonly MOUNT_TARGET="${MOUNT_TARGET}"


# Main routine - checks for the MOUNT_TARGET variable
# that does not have a default (can be empty) making 
# sure that it gets specified.
main () {
        if [[ -z "$MOUNT_TARGET" ]]; then
                echo 'ERROR: 
        MOUNT_TARGET environment variable must be specified.
                '
                exit 1
        fi

        echo "INFO:
        Mounting shared filesystem.

        MOUNT_LOCATION  $MOUNT_LOCATION
        MOUNT_TARGET    $MOUNT_TARGET
        "

        mount_nfs
}


# Performs the actual mounting of the EFS target
# in the AZ we specified into a directory we defined
# via MOUNT_LOCATION.
mount_nfs() {
        sudo mkdir -p $MOUNT_LOCATION

        sudo mount \
                -t nfs4 \
                -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
                $MOUNT_TARGET:/ $MOUNT_LOCATION
}

main

Once the script above has been executed, we’re ready to go (make sure you have nfs-common installed, by the way).
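
On a Debian-ish box, that last bit would be something along these lines (a sketch, assuming apt):

sudo apt-get update && sudo apt-get install -y nfs-common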

Exhausting the EFS mount target locks

Per the AWS documentation, we can check what the quotas look like:

Each unique mount on the instance can acquire up to a total of 8,192 locks across a maximum of 256 unique file/process pairs. For example, a single process can acquire one or more locks on 256 separate files, or 8 processes can each acquire one or more locks on 32 files.

See Limits for Client EC2 Instances

That means that, to make sure we’ll indeed exhaust it, we can create a single MySQL instance and then have it create some three hundred tables.
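
While the tables are being created, we can also keep an eye on the lock count from a second terminal on the instance (just a sketch, reusing the same lslocks counting trick):

# Refresh the count of locks held against the EFS mount
# every second.
watch -n 1 "lslocks | grep efs | wc -l"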

Adapting the script I showed above, this means:

local number_of_tables="$1"

if [[ -z "$number_of_tables" ]]; then
        echo "ERROR:
        An argument (number of tables) must be supplied.
        "
        exit 1
fi

for i in $(seq 1 $number_of_tables); do
        create_table mytable$i
done

then, calling it with 300:

./create-many-tables.sh 300

we can see that after a little while, it breaks:

ERROR 1030 (HY000) at line 1: Got error 168 from storage engine

Looking at the number of locks that the system is holding:

# Count the number of locks being held from the efs mount
lslocks | grep 'efs' | wc -l
256

We can clearly see the quota in place.

Naturally, InnoDB complained and failed:

# docker logs ...
[Note] mysqld: ready for connections.
[ERROR] InnoDB: Unable to lock ./testdb/mytable241.ibd error: 122
[ERROR] InnoDB: Operating system error number 122 in a file operation.
[ERROR] InnoDB: Error number 122 means 'Disk quota exceeded'
[ERROR] InnoDB: Cannot create file './testdb/mytable241.ibd'

That’s it: 256 locks against a single mount and you’re done.

Is this a problem with NFS in general?

I thought this limit that AWS imposes could be a hard limit imposed by the protocol itself (NFSv4.1), but it turns out that this isn’t the case.

If you create yourself an NFS server, mount it and try running a stress test that grabs as many locks as possible, you can go up into the millions (even though that puts some load on the server).
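
For reference, standing up a throwaway NFS server to compare against looks roughly like this (a sketch, assuming a Debian-based server; the export path and options are just illustrative):

# On the server: install the NFS server and export a directory.
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /srv/nfs
echo '/srv/nfs *(rw,sync,no_subtree_check)' | \
        sudo tee -a /etc/exports
sudo exportfs -ra


# On the client: mount it wherever we want the filesystem.
sudo mkdir -p /mnt/nfs
sudo mount -t nfs <server-address>:/srv/nfs /mnt/nfs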

To make sure that this is indeed the case, I set up a quick C program that stresses only this piece (check it out at cirocosta/efs-locks-sample/stress).

/**
 * Struct that keeps track of the runtime arguments
 * that we supply via the CLI.
 */ 
typedef struct runtime_t {
	int number_of_locks;
	char* base_directory;
} runtime_t;

/**
 * Prints the runtime configuration passed to STDOUT.
 */
void
runtime_show(runtime_t* runtime)
{
	// some printf to show the runtime
}

/**
 * Creates a file under a directory (base) with a number (i)
 * and then grabs a write lock to it.
 */
void
create_file(const char* base, const int i)
{
	FILE* fp;
	char* filename;
	struct flock lock = { 0 };
	int err = 0;

        // make the lock exclusive (write lock)
	lock.l_type = F_WRLCK;

	// .. do some work to set `filename` and fopen(3) the file into `fp` (elided here)
        // try to acquire the lock
	err = fcntl(fileno(fp), F_SETLKW, &lock);
	if (err == -1) {
		perror("fcntl");
		fprintf(stderr, "Couldn't acquire lock on %s", filename);
		exit(1);
	}

        // just write something stupid to the file
	fprintf(fp, "dummy");
	free(filename);
}

/**
 * Creates `N` files, with a write lock grabbed for each,
 * under a base directory `base`.
 */
void
runtime_create_files(runtime_t* runtime)
{
	// ...
}

/**
 * Main execution.
 *
 * Parses the cli arguments (number of locks
 * and base directory) and then starts the creation of the
 * files and locks.
 *
 * Once lock creation is done, pauses its execution until
 * a signal arrives.
 */
int
main(int argc, char** argv)
{
	// ...

	runtime_t runtime = {
		.number_of_locks = atoi(argv[1]), 
                .base_directory = argv[2],
	};

	runtime_show(&runtime);
	runtime_create_files(&runtime);
        
        // ... wait for signals

	return 0;
}
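
Running it against both mounts then looks roughly like this (the binary name is just for illustration; the first argument is the number of locks to grab, the second the base directory):

# Against the EFS mount - errors out pretty quickly.
./stress 10000 /mnt/efs

# Against the plain NFS mount - goes way further.
./stress 10000 /mnt/nfs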

As expected, we hit the limit pretty fast on AWS, while with a regular NFS server we can go to the sky.

Closing thoughts

What can you do if you need MySQL on EFS? Well, this is the funny part: not much!

That’s a limit that you can’t tell AWS to raise.

I’m not sure you’d want to run MySQL on EFS (or any other NFS mount) anyway, as it’s clearly not endorsed by MySQL, but people running other software might hit this kind of problem in scenarios where EFS (or NFS in general) does make sense.

I guess the best bet is proper MySQL replication on regular EBS disks, SSDs on other machines, or going straight to the new NVMe-based instances.

Have you faced problems like this? Did you spot any error?

Please let me know!

I’m cirowrc on Twitter and would appreciate knowing what your thoughts are!

Have a good one,

finis