Hey,
a month ago I was looking at some issues with a container running a MySQL instance against an NFS-mounted directory on the host (should you do this? maybe not).
The issues seemed pretty weird (I'm not a MySQL guy), with all sorts of errors popping up related to disk quotas.
Sure, we did have a bunch of space:
df -h
fs-<...>.amazonaws.com:/ 8.0E 1.6G 8.0E 1% /mnt/nfs
The problem, as the logs revealed, was that InnoDB
wasn’t able to grab a lock that it wanted.
That seemed even weirder to me, as I didn't think we'd ever hit a limit on that front.
- Making the case
- How does MySQL behave on lock starvation?
- Messing up our syscalls
- Terraforming a test scenario in AWS
- Exhausting the EFS mount target locks
- Is this a problem with NFS in general?
- Closing thoughts
Making the case
My idea for getting through this was to try to replicate what the user was facing - he couldn't bring his MySQL instance up.
Let’s bring something up and look at some MySQL details:
# Create a container named `inst1` with two
# environment variables set that alter the
# behavior of the mysql instance at startup.
#
# The code that runs at startup time can be found
# at the repository that holds this image:
# https://hub.docker.com/mysql
docker run \
--detach \
--name inst1 \
--env MYSQL_DATABASE=testdb \
--env MYSQL_ALLOW_EMPTY_PASSWORD=true \
mysql
These two environment variables give us a running MySQL instance with an empty password and a database named testdb already created.
If you’re curious about it, this comes from mysql’s image entrypoint:
# If `MYSQL_DATABASE` has been specified, create
# the database (if it doesn't exist yet) and make
# the mysql client default to it for the commands
# that run after the server has started.
# (line 159 of docker-entrypoint.sh)
if [ "$MYSQL_DATABASE" ]; then
  echo "CREATE DATABASE IF NOT EXISTS \`$MYSQL_DATABASE\` ;" | \
    "${mysql[@]}"
  mysql+=( "$MYSQL_DATABASE" )
fi
With this, we can start looking at the structure of the filesystem that MySQL creates.
# get into the container
docker exec \
--interactive \
--tty \
inst1 \
/bin/bash
# From within the container, get into the directory
# where mysql puts our database files and everything
# that we should persist
cd /var/lib/mysql
# Find all the directories that are out there
find . -type d
.
./sys
./mysql
./testdb
./performance_schema
# Check that indeed, these are just databases
echo "SHOW DATABASES;" | mysql
Database
information_schema
mysql <<<<<<<
performance_schema <<<<<<<
sys
testdb <<<<<<<
Cool, but what about the locks?
We can count how many there are either by making use of lslocks or by looking at /proc/locks.
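A raw count straight from the kernel's lock table is a one-liner (a quick sketch, counting POSIX locks only):
# count the POSIX locks currently registered by the kernel
grep -c POSIX /proc/locks
The benefit of using lslocks here, though, is that it resolves the paths for us, so we can see which files of the MySQL instance are holding locks: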
lslocks
COMMAND PID TYPE SIZE MODE M START END PATH
mysqld 1 POSIX 48M WRITE 0 0 0 ./ib_logfile0
mysqld 1 POSIX 48M WRITE 0 0 0 ./ib_logfile1
mysqld 1 POSIX 76M WRITE 0 0 0 ./ibdata1
mysqld 1 POSIX 12M WRITE 0 0 0 ./ibtmp1
mysqld 1 POSIX 96K WRITE 0 0 0 ./mysql/plugin.ibd
mysqld 1 POSIX 96K WRITE 0 0 0 ./mysql/innodb_table_stats.ibd
mysqld 1 POSIX 96K WRITE 0 0 0 ./mysql/innodb_index_stats.ibd
mysqld 1 POSIX 96K WRITE 0 0 0 ./mysql/gtid_executed.ibd
mysqld 1 POSIX 96K WRITE 0 0 0 ./mysql/server_cost.ibd
mysqld 1 POSIX 96K WRITE 0 0 0 ./mysql/engine_cost.ibd
mysqld 1 POSIX 96K WRITE 0 0 0 ./mysql/time_zone_leap_second.ibd
mysqld 1 POSIX 96K WRITE 0 0 0 ./mysql/servers.ibd
mysqld 1 POSIX 320K WRITE 0 0 0 ./mysql/time_zone_name.ibd
mysqld 1 POSIX 144K WRITE 0 0 0 ./mysql/time_zone.ibd
mysqld 1 POSIX 12M WRITE 0 0 0 ./mysql/time_zone_transition.ibd
mysqld 1 POSIX 464K WRITE 0 0 0 ./mysql/time_zone_transition_type.ibd
Well, no testdb
there.
So we know that it's not locking at the database level. Maybe it does lock, but differently: there is a statement that flushes tables and takes a server-wide read lock - FLUSH TABLES WITH READ LOCK (a quick sketch of issuing it is below).
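Just as an aside (not something we need for this investigation), you could take that global read lock from a client session, hold it for a few seconds, and release it:
# take the server-wide read lock, hold it for 5 seconds,
# then release it; the lock only lives as long as this
# client session does
docker exec -i inst1 mysql <<SQL
FLUSH TABLES WITH READ LOCK;
SELECT SLEEP(5);
UNLOCK TABLES;
SQL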
We can confirm that it locks at the table-level:
# Create the query that creates a table called
# `example` with just two dummy columns.
table_creation_query="
CREATE TABLE example (
id INT,
data VARCHAR(100)
);"
# Execute the query against our database
mysql --database testdb <<< "$table_creation_query"
# Verify that a lock has been created
lslocks | grep testdb
mysqld 1 POSIX 96K WRITE 0 0 0 ./testdb/example.ibd
Cool, we now know that MySQL will hold locks per-table.
How does MySQL behave on lock starvation?
As someone with close to zero knowledge about MySQL internals, my plan was to force the system to fail MySQL whenever it tried to get a lock, by setting limits on the number of locks that the process could take.
The idea is to make use of the setrlimit
syscall via the ulimit
utility:
GETRLIMIT(2) Linux Programmer's Manual GETRLIMIT(2)
NAME
getrlimit, setrlimit, prlimit - get/set resource limits
SYNOPSIS
int setrlimit(int resource, const struct rlimit *rlim);
...
The getrlimit() and setrlimit() system calls get
and set resource limits respectively. Each resource has
an associated soft and hard limit ...
struct rlimit {
    rlim_t rlim_cur;  /* Soft limit */
    rlim_t rlim_max;  /* Hard limit (ceiling for rlim_cur) */
};
It essentially means that we can cap specific resources by giving the kernel two numbers for each (a soft and a hard limit). The resource that we want to limit in this case is the number of locks a process can take:
RLIMIT_LOCKS
A limit on the combined number of flock(2) locks
and fcntl(2) leases that this process may establish.
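From a shell, this particular limit is exposed through bash's ulimit -x (on Linux builds of bash, at least; a quick sketch):
# show the current limit on the number of file locks (RLIMIT_LOCKS)
ulimit -x
# lower it to 25 for a child shell and anything it spawns
bash -c 'ulimit -x 25; ulimit -x'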
That sounds pretty close but we don’t know if mysql
is indeed making use of these calls.
We know that lslocks
does see a new lock when we create a new table but … does it use one of those syscalls?
We can check that using strace
:
# Get the PID of the MySQL daemon instance as seen by
# the host.
docker top inst1
PID .. CMD
861379 .. mysqld
861589 .. /bin/bash
864033 .. /bin/bash
# Cool, PID of `mysqld` is 861379.
# From the host we can specify that pid to strace
# and tell it to attach to all the processes raised
# by it.
#
# This way we can look at every syscall
# made by them while it runs.
#
# Here I'm making use of the following arguments:
# -f: Trace child processes (and "lightweight processes")
# -e: Filter the calls so that we only get notified when
# there's a syscall that matches the term
# -p: Pid to attach to (and its children)
sudo strace -f -e fcntl -p 861379
strace: Process 861379 attached with 28 threads
# Now that we got attached, in a separate terminal, create a new
# table by sending the table creation command to the
# container's mysql binary (client) that will get the command
# sent to the mysql daemon.
echo "CREATE TABLE example2 ( id INT, data VARCHAR(100) );" |
docker exec inst1 \
mysql --database testdb
# Now look at our strace results!
# We spotted some `fcntl` syscalls with
# the lock-related parameters
[pid 862376] fcntl(44, F_SETLK, { <<<<<
l_type=F_WRLCK, <<<<<
l_whence=SEEK_SET,
l_start=0,
l_len=0}) = 0
[pid 862376] fcntl(44, F_SETLK, { <<<<<<
l_type=F_WRLCK, <<<<<<
l_whence=SEEK_SET,
l_start=0,
l_len=0}) = 0
By the declaration of fcntl
we can check what those arguments are about:
fcntl - manipulate file descriptor
int fcntl(int fd, int cmd, ... /* arg */ );
FD: file descriptor to act upon
CMD: action to perform
ARG: well, arguments
fd=44 (acting upon file descriptor 44)
cmd=F_SETLK (set a lock; with l_type=F_WRLCK, a write - exclusive - one)
But how do we know what that 44 is? What does that file descriptor point to?
To find out, we can look at the table of file descriptors held open by a given process.
From within the container (we could do this from the outside too, via /proc on the host) we can check that:
# Get inside the container
docker exec -it inst1 /bin/bash
# By looking at the proc virtual filesystem we
# can check what files are open.
#
# Because the container runs mysqld as PID 1,
# we look at the file descriptors of the
# process whose PID is 1.
#
readlink /proc/1/fd/44
/var/lib/mysql/testdb/example2.ibd
Exactly! Our table example2 has a lock associated with it (as we wanted), and that lock gets set using fcntl, the very syscall whose lock-taking setrlimit is supposed to limit.
So now we know that we can make use of setrlimit with RLIMIT_LOCKS and expect the kernel to limit the process' resources.
With Docker, we make use of it by specifying an extra argument to the run command: --ulimit
.
# Run the same command as before that spawns
# a mysql container, but this time with a limit
# of 25 locks (remove the previous `inst1`
# container first so the name is free).
docker run \
--detach \
--name inst1 \
--ulimit locks=25 \
--env MYSQL_DATABASE=testdb \
--env MYSQL_ALLOW_EMPTY_PASSWORD=true \
mysql
To see it in action, let’s try to create 30 tables.
#!/bin/bash
# exit on any failed execution
set -o errexit
# main routine that calls the
# table creation method 30 times
main () {
  for i in $(seq 1 30); do
    create_table mytable$i
  done
}
# execute a command within the mysql
# container to create a table.
# args:
#   1) name of the table
create_table () {
  local name=$1
  echo "CREATE TABLE $name (
    id INT,
    data VARCHAR(100)
  );" | docker exec -i inst1 \
    mysql --database=testdb
}
main
Guess what? It doesn't work! The execution goes through all 30 tables without a problem. We're not limiting the locks as we wanted.
The reason it's not working is that RLIMIT_LOCKS was only ever enforced in early Linux 2.4 kernels; newer kernels accept the limit but simply ignore it, meaning that we were relying on a feature that isn't even there anymore.
How can we get through that?
Messing up our syscalls
There’s a Linux syscall
(ptrace
) that can help us.
Quoting the man pages, it provides a means by which a tracer program may observe and control the execution of another process, even changing the tracee's memory and registers.
So, given that we know exactly which syscall takes the locks, we can get in the way of its invocations for a given process and mess with the results, such that the tracee (mysqld in our case) thinks it has run out of locks (just as if it had hit a disk quota).
Given that strace (the tool) makes use of ptrace and allows us to tamper with syscalls, we can use it to emulate the issue: after n fcntl calls trying to lock, fail with ENOLCK ("Too many segment locks open, lock table is full, or a remote locking protocol failed").
# Run `strace` targeting the process at pid $PID
# as well as its children.
#
# Inject an `fcntl` failure from the 25th call onwards.
strace \
-p $PID \
-f \
-e fault=fcntl:error=ENOLCK:when=25+
Running the script from before (the one that creates 30 tables), we can see the following in the InnoDB logs:
mysqld: ready for connections.
Version: '5.7.21' socket: '/var/run/mysqld/mysqld.sock'
port: 3306 MySQL Community Server (GPL)
[ERROR] InnoDB: Unable to lock ./testdb/mytable22.ibd error: 37
[ERROR] InnoDB: Operating system error number 37 in a file operation.
\/ \/ \/ \/
[ERROR] InnoDB: Error number 37 means 'No locks available'
/\ /\ /\ /\
[ERROR] InnoDB: Cannot create file './testdb/mytable22.ibd'
Cool, we failed the table creation process!
Now it’s time to do it on AWS.
Terraforming a test scenario in AWS
The scenario is simple: a single machine and an EFS mount target in the same AZ (Availability Zone).
To provision that, I made use of Terraform so that the setup is declarative and reproducible.
I structured it like the following:
.
├── main.tf # data retrieval and output
├── efs.tf # provisioning the filesystem and a mount target
│ # in the specific AZ that our instance is in
├── instance.tf # creates an instance in the region and AZ we want
├── keys.tf # provisions the key in the region
├── keys # dumb keys for this one-off experiment
│ ├── key.rsa
│ └── key.rsa.pub
└── terraform.tfstate # state
1 directory, 7 files
If you're curious about the full set of Terraform files, check them out at cirocosta/efs-locks-sample, but I want to highlight efs.tf here.
Whenever you create an EFS filesystem, you create it per region (note that not all regions support EFS - at the time of writing it was something like only five), and then, within that region, you create mount targets that attach the filesystem to subnets you have already provisioned.
For instance, say you're in us-east-1 and have subnets in us-east-1a and us-east-1b. You create two mount targets, and then your instances in those two availability zones can connect to the filesystem.
In Terraform terms, that would mean having a list of the subnets and availability zones that you want your EFS mount targets in, and then iterating over them (using count) to create the mount targets.
Here I’m sticking with a single AZ to make things easier.
# Creates a new empty file system in EFS.
resource "aws_efs_file_system" "main" {
tags {
Name = "locks-test-fs"
}
}
# Creates a mount target of EFS in a specified subnet
# such that our instances can connect to it.
resource "aws_efs_mount_target" "main" {
file_system_id = "${aws_efs_file_system.main.id}"
subnet_id = "${var.subnet}"
security_groups = ["${aws_security_group.efs.id}"]
}
# Allows both ingress and egress for port 2049 (nfs)
# such that our instances can get to the mount
# target in the AZ.
resource "aws_security_group" "efs" {
name = "efs-mnt"
description = "Allows NFS traffic from instances within the VPC."
vpc_id = "${data.aws_vpc.default.id}"
ingress {
from_port = 2049
to_port = 2049
protocol = "tcp"
cidr_blocks = [
"${data.aws_vpc.default.cidr_block}",
]
}
egress {
from_port = 2049
to_port = 2049
protocol = "tcp"
cidr_blocks = [
"${data.aws_vpc.default.cidr_block}",
]
}
tags {
Name = "allow_nfs-ec2"
}
}
The output of that is an address that we can use to mount the NFS share into whatever directory we want the filesystem to live under.
#!/bin/bash
# Performs the mounting of an EFS mount target into
# predefined MOUNT_LOCATION in the filesystem.
set -o errexit
readonly MOUNT_LOCATION="${MOUNT_LOCATION:-/mnt/efs}"
readonly MOUNT_TARGET="${MOUNT_TARGET}"
# Main routine - makes sure that the MOUNT_TARGET
# variable, which has no default (and thus can be
# empty), has been specified.
main () {
  if [[ -z "$MOUNT_TARGET" ]]; then
    echo 'ERROR:
MOUNT_TARGET environment variable must be specified.
'
    exit 1
  fi
  echo "INFO:
Mounting shared filesystem.
MOUNT_LOCATION $MOUNT_LOCATION
MOUNT_TARGET $MOUNT_TARGET
"
  mount_nfs
}
# Performs the actual mounting of the EFS target
# in the AZ we specified into the directory we
# defined via MOUNT_LOCATION.
mount_nfs() {
  sudo mkdir -p "$MOUNT_LOCATION"
  sudo mount \
    -t nfs4 \
    -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    "$MOUNT_TARGET":/ "$MOUNT_LOCATION"
}
main
Having executed the script above, we're ready to go (make sure you have nfs-common installed, by the way).
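A quick sanity check (just a sketch, using the default MOUNT_LOCATION from the script):
# list the NFSv4 mounts and confirm the EFS target shows
# up at the expected location with the options we asked for
mount -t nfs4
df -h /mnt/efs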
Exhausting the EFS mount target locks
Per the AWS documentation, this is what the quotas look like:
Each unique mount on the instance can acquire up to a total of 8,192 locks across a maximum of 256 unique file/process pairs. For example, a single process can acquire one or more locks on 256 separate files, or 8 processes can each acquire one or more locks on 32 files.
That means that, to make sure we'll indeed exhaust it, we can create a single MySQL instance and have it create some three hundred tables.
Adapting the script I showed above, this means:
local number_of_tables="$1"
if [[ -z "$number_of_tables" ]]; then
echo "ERROR:
An argument (number of tables) must be supplied.
"
exit 1
fi
for i in $(seq 1 $number_of_tables); do
create_table mytable$i
done
then, calling it with 300
:
./create-many-tables.sh 300
we can see that after a little while, it breaks:
ERROR 1030 (HY000) at line 1: Got error 168 from storage engine
Looking at the number of locks that the system is holding:
# Count the number of locks being held from the efs mount
lslocks | grep 'efs' | wc -l
256
We can clearly see the quota in place.
Naturally, InnoDB complained and failed:
# docker logs ...
[Note] mysqld: ready for connections.
[ERROR] InnoDB: Unable to lock ./testdb/mytable241.ibd error: 122
[ERROR] InnoDB: Operating system error number 122 in a file operation.
[ERROR] InnoDB: Error number 122 means 'Disk quota exceeded'
[ERROR] InnoDB: Cannot create file './testdb/mytable241.ibd'
That's it: 256 file/process pairs holding locks on a single mount and you're done.
Is this a problem with NFS in general?
I thought this limit that AWS imposes could be a hard limit imposed by the protocol itself (NFSv4.1), but it turns out that this isn’t the case.
If you set up your own NFS server, mount it, and run a stress test that grabs as many locks as possible, you can go into the millions (even though that puts some load on the server).
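For reference, here's a rough sketch of setting such a server up locally (assuming a Debian/Ubuntu box; the export path, options, and mount location are just examples):
# install and configure a minimal NFS server, exporting
# a scratch directory to any client
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /srv/nfs/stress
echo '/srv/nfs/stress *(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra
# mount the export locally over NFSv4.1, mimicking the EFS setup
sudo mkdir -p /mnt/nfs-stress
sudo mount -t nfs4 -o nfsvers=4.1 localhost:/srv/nfs/stress /mnt/nfs-stress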
To make sure that this is indeed the case, I set up a quick C program that stresses only this piece (check it out at cirocosta/efs-locks-sample/stress).
/**
* Struct that keeps track of the runtime arguments
* that we supply via the CLI.
*/
typedef struct runtime_t {
  int number_of_locks;
  char* base_directory;
} runtime_t;
/**
 * Prints to STDOUT the runtime configuration passed.
 */
void
runtime_show(runtime_t* runtime)
{
  // some printf calls to show the runtime
}
/**
 * Creates a file under a directory (base) with a number (i)
 * and then grabs a write lock on it.
 */
void
create_file(const char* base, const int i)
{
  FILE* fp;
  char* filename;
  struct flock lock = { 0 };
  int err = 0;
  // make the lock exclusive (write lock)
  lock.l_type = F_WRLCK;
  // .. build `filename` from `base` and `i`, then
  //    fopen(3) it for writing, assigning the handle to `fp`
  // try to acquire the lock
  err = fcntl(fileno(fp), F_SETLKW, &lock);
  if (err == -1) {
    perror("fcntl");
    fprintf(stderr, "Couldn't acquire lock on %s", filename);
    exit(1);
  }
  // just write something stupid to the file
  fprintf(fp, "dummy");
  free(filename);
}
/**
 * Creates `N` files, with a write lock grabbed on each,
 * under a base directory `base`.
 */
void
runtime_create_files(runtime_t* runtime)
{
  // ...
}
/**
* Main execution.
*
* Parses the cli arguments (number of locks
* and base directory) and then starts the creation of the
* files and locks.
*
* Once lock creation is done, pauses its execution until
* a signal arrives.
*/
int
main(int argc, char** argv)
{
  // ...
  runtime_t runtime = {
    .number_of_locks = atoi(argv[1]),
    .base_directory = argv[2],
  };
  runtime_show(&runtime);
  runtime_create_files(&runtime);
  // ... wait for signals
  return 0;
}
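Running it is then just a matter of pointing it at a directory under each mount (the binary name and the counts here are just for illustration):
# try to grab 300 write-locked files under the EFS mount ...
./stress 300 /mnt/efs
# ... and a much bigger number under the plain NFS mount
./stress 1000000 /mnt/nfs-stress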
As expected, we hit the limit pretty fast in AWS while with a regular NFS server we can go to the sky.
Closing thoughts
What can you do if you need MySQL on EFS? Well, this is the funny part: not much!
That’s a limit that you can’t tell AWS to raise.
I'm not sure you'd want to run MySQL on EFS (or any other NFS mount) anyway, as it's clearly not endorsed by MySQL, but people running other software might hit this in scenarios where using EFS (or NFS in general) does make sense.
I guess the best option is to stick with proper MySQL replication on regular EBS volumes, SSDs on other machines, or to go straight for the new NVMe-based instances.
Have you faced problems like this? Did you spot any error?
Please let me know!
I’m cirowrc on Twitter and would appreciate knowing what your thoughts are!
Have a good one,
finis