Hey,

a friend of mine recently told me about a way of performing incremental backups making use of the ubiquitous tar tool.

I was really impressed that such a useful thing could be hidden there without people talking about it (or is it just me who didn’t know? anyway …)

Here are my two cents on how you can tailor a script that backs up a directory incrementally, sends the snapshots to S3, and restores them later.

In the end, I also include restic as an alternative.

Goal

The ultimate goal is pretty simple to describe:

  • perform backups that we can go back to if we need and that we can store in a remote location;

With an extra feature that is very nice to have: being able to restore the backups up to a specific date.

The idea of restoring to a point-in-time snapshot is that if we mess up and back up our mess, we want to be able to go back to a previous backup that wasn’t messed up.

Consider the following example: starting at t0, we perform our initial snapshot, which corresponds to the addition of all the files and directories present. Next, at t1, we perform a second snapshot which accounts only for the addition of a new file. Then, at t2, we perform another snapshot which, again, differs from the last one only by the addition of a new file. Now assume that at t3 we perform yet another snapshot, but we shouldn’t have, as we captured a bad state (e.g., the database was already broken and our snapshotting routine took the snapshot anyway).

Illustration of a snapshot approach

If we can go back in time to any snapshot that we want (instead of relying on a single backup), then we can recover even from a bad snapshot taken at the wrong moment. In our example, that would mean not recovering to t3 but instead recovering to t2, where we took a good snapshot.

With that in hand, we can make sure that even if we wrongly back up a directory where things got removed, we can still go to a previously taken snapshot and get back to that state.

Install the dependencies - gnu-tar (macOS)

If you’re using macOS, you can retrieve gnu-tar using brew, as it already has a formula (see the gnu-tar formula):

brew install gnu-tar
gtar --help

Usage: gtar [OPTION...] [FILE]...
GNU 'tar' saves many files together into a single tape or disk archive, and can
restore individual files from the archive.

Examples:
  tar -cf archive.tar foo bar  # Create archive.tar from files foo and bar.

If you don’t want to have to prefix the command with g, you can pass an additional argument, --with-default-names, as we can see in the formula (line 16 of gnu-tar.rb):

  option "with-default-names", "Do not prepend 'g' to the binary"

  # ...

  test do
    tar = build.with?("default-names") ? 
        bin/"tar" : 
        bin/"gtar"
    # ...
  end

so that we install it with the following command:

brew install gnu-tar --with-default-names

I’ll stick with gtar (i.e., an installation without --with-default-names). You can tell whether you’re using the BSD or the GNU version by looking at the help output:

tar --help

       ++==== BSD
       ||
       \/
tar(bsdtar): manipulate archive files
First option must be a mode specifier:
  -c Create  -r Add/Replace  -t List  -u Update  -x Extract
...

Cool, let’s jump to the backups.

Performing incremental backups

The tar manual contains a very interesting section, Using tar to perform incremental dumps, which states:

The option `--listed-incremental' instructs tar to operate 
on an incremental archive with additional metadata stored in 
a standalone file, called a snapshot file. The purpose of this 
file is to help determine which files have been changed, added 
or deleted since the last backup, so that the next incremental 
backup will contain only modified files. The name of the snapshot 
file is given as an argument to the option:

`--listed-incremental=file'

That essentially means that we can create a series of tar files and keep track of this sequence by mutating an index file that we provide via the --listed-incremental flag.

So, how do we perform incremental backups using tar?

  1. create a list that keeps track of all the files that compose the full state of the system up until a given point in time
  2. create the initial tar with the full current state
  3. keep creating files with only the difference (increments) and updating the file that lists these snapshots that we take.

As an example, assume we have the following file structure:

.
├── rootfs
│   └── 1.txt
└── snapshots

2 directories, 1 file

We want to back up ./rootfs, which right now contains only 1.txt, and send the snapshots to the snapshots directory.

Starting with the first and second points (creating the file that keeps track of the full state and then creating the first snapshot, which contains all of that state):

gtar \
        --create \
        --no-check-device \
        --file=./snapshots/1.tar \
        --listed-incremental=./snapshots/index \
        ./rootfs

which leads us to the following:

tree
.
├── rootfs
│   └── 1.txt
└── snapshots
    ├── 1.tar
    └── index

2 directories, 3 files


cat ./snapshots/index 
GNU tar-1.29-2
1511892293436890000151189201680544478167772208590981353./rootfsY1.txt%  

Now mutate the state of rootfs directory one more time by adding 2.txt and 3.txt and then perform a second snapshot:

# mutate the rootfs directory
echo "2.txt" > ./rootfs/2.txt
echo "3.txt" > ./rootfs/3.txt
tree
.
├── rootfs
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
└── snapshots
    ├── 1.tar
    └── index

# perform a second snapshot that should
# contain only the addition of the two files
 gtar \
        --create \
        --no-check-device \
        --file=./snapshots/2.tar \
        --listed-incremental=./snapshots/index \
        ./rootfs

# verify that the new snapshot has been created
# but we still have a single `index` file that
# lists the contents
tree
.
├── rootfs
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
└── snapshots
    ├── 1.tar
    ├── 2.tar
    └── index

# check the updated index file
cat ./snapshots/index 
GNU tar-1.29-2
151189253560533700001511892488471449116167772208590981353./rootfsN1.txtY2.txtY3.txt

# check the contents of the first snapshot
# (it should only contain 1.txt)
tar -tvf ./snapshots/1.tar 
drwxr-xr-x  0 cirocosta wheel       8 Nov 28 16:00 ./rootfs/
-rw-r--r--  0 cirocosta wheel       2 Nov 28 16:00 ./rootfs/1.txt

# check the contents of the second snapshot
# (it should only contain 2.txt and 3.txt)
tar -tvf ./snapshots/2.tar
drwxr-xr-x  0 cirocosta wheel      22 Nov 28 16:08 ./rootfs/
-rw-r--r--  0 cirocosta wheel       6 Nov 28 16:08 ./rootfs/2.txt
-rw-r--r--  0 cirocosta wheel       6 Nov 28 16:08 ./rootfs/3.txt

Cool, having that, all we need to do is restore and check that it’s really working.

# create a separate directory to put the 
# restoration files. This is important for
# the demo because otherwise `gtar` would
# remove the files from `rootfs` that were
# not at the snapshot at that given moment.
mkdir ./restore
cd restore
gtar \
        --extract \
        --listed-incremental=../snapshots/index \
        --file=../snapshots/1.tar 

# check if only `1.txt` is there in the 
# rootfs from the first moment
tree
.
└── rootfs
    └── 1.txt

# remove the contents so that we start again
# with a clean rootfs
rm -rf ./rootfs

# to restore all the files until the last snapshot
# we must perform the extraction of all the snapshots
# from the first to the last, one after another
gtar \       
        --extract \
        --listed-incremental=../snapshots/index \
        --file=../snapshots/1.tar 
gtar \
        --extract \
        --listed-incremental=../snapshots/index \
        --file=../snapshots/2.tar 

# now check that all the files of the last snapshot
# are there
tree
.
└── rootfs
    ├── 1.txt
    ├── 2.txt
    └── 3.txt

1 directory, 3 files
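
Extracting snapshot after snapshot by hand gets tedious once there are more than a couple of them, so here is a minimal sketch of a loop that restores every snapshot in order. It assumes the 1.tar, 2.tar, … numbering we used above and is meant to be run from inside ./restore; to restore only up to a specific point in time, stop the loop at the corresponding number. As far as I can tell, the contents of the snapshot file are not consulted during extraction, so /dev/null can be passed to --listed-incremental:

# restore all the snapshots, from the first to the last,
# one after another (stop earlier for a point-in-time
# restore)
n=1
while [ -f "../snapshots/${n}.tar" ]; do
        gtar \
                --extract \
                --listed-incremental=/dev/null \
                --file="../snapshots/${n}.tar"
        n=$((n + 1))
done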

That’s it!

Integrating with S3

Anyone who has worked with S3 knows that syncing this to S3 is not a big deal - make use of the AWS CLI and it’ll push the new or changed files to whichever bucket you want.

Besides the simplicity of uploading content to S3, I’d like to highlight a simple way of testing it locally. The easiest way to do that, in my opinion, is using minio, which provides an implementation of the S3 API that we can run locally:

Minio is an open source object storage server with Amazon S3 compatible API. Build cloud-native applications portable across all major public and private clouds.

If you have Docker installed, having minio running is one command away:

# Run a container in the background using the image 
# `minio/minio` publishing the target port 9000 as 9000 
# on the host and initialize it with the command `server`
# with the argument `/data` (location where minio
# will save the contents - this is a directory you'd
# like persisted somehow, here, we don't care).
# With MINIO_ACCESS_KEY and MINIO_SECRET_KEY we set some
# testing credentials that we can use with the AWS CLI tool.
docker run \
        --publish 9000:9000 \
        --name minio \
        --detach \
        --env 'MINIO_ACCESS_KEY=accesskey' \
        --env 'MINIO_SECRET_KEY=secretkey' \
        minio/minio \
        server /data

Now to make use of it we can install the AWS CLI and configure it:

# install `pip` to fetch python packages (in
# this case, `awscli`)
sudo easy_install pip

# fetch the `awscli` package.
pip install awscli --upgrade

Configure the credentials:

aws configure
AWS Access Key ID [None]: accesskey
AWS Secret Access Key [None]: secretkey
Default region name [us-east-1]: 
Default output format [None]:

# enable AWS signature V4 for the minio server
aws configure set default.s3.signature_version s3v4

# make sure we don't have previous environment
# variables set
unset AWS_DEFAULT_REGION    
unset AWS_ACCESS_KEY_ID
unset AWS_SECRET_ACCESS_KEY

With everything configured, we can create a bucket and then sync our snapshots:

aws \
        --endpoint-url=http://localhost:9000 \
        s3 mb \
        s3://backups
make_bucket: backups

# get into the snapshots directory
cd ./snapshots

# Sync the current directory (snapshots) with
# the S3 bucket.
# Only new files will be sent on each new run of
# the command. If there are no changes, nothing
# is sent. 
# Because it's our first run, everything is sent.
aws \
        --endpoint-url=http://localhost:9000 \
        s3 sync \
        ./ s3://backups
upload: ./2.tar to s3://backups/2.tar                             
upload: ./1.tar to s3://backups/1.tar                           
upload: ./index to s3://backups/index 

Now, to retrieve all the files from that bucket:

# copy from a remote location (s3 bucket `backups`)
# to the current location `./` recursively.
aws \
        --endpoint-url=http://localhost:9000 \
        s3 cp \
        s3://backups ./ \
        --recursive
download: s3://backups/index to ./index                           
download: s3://backups/1.tar to ./1.tar                         
download: s3://backups/2.tar to ./2.tar
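
With both pieces in place, the routine promised at the beginning boils down to a handful of lines. Here is a sketch of how I would glue them together - backup.sh, ROOTFS, SNAPSHOTS_DIR and BUCKET are names I made up, and the --endpoint-url flag is only there because we’re talking to minio:

#!/bin/sh
# backup.sh - take the next incremental snapshot of ROOTFS
# and sync the snapshots directory to an S3 bucket.
set -o errexit

readonly ROOTFS="./rootfs"
readonly SNAPSHOTS_DIR="./snapshots"
readonly BUCKET="s3://backups"

# the next snapshot number is simply the count of tar files
# already present in the snapshots directory, plus one.
next=$(($(find "$SNAPSHOTS_DIR" -name '*.tar' | wc -l) + 1))

# create the incremental snapshot, updating the index file.
gtar \
        --create \
        --no-check-device \
        --file="$SNAPSHOTS_DIR/$next.tar" \
        --listed-incremental="$SNAPSHOTS_DIR/index" \
        "$ROOTFS"

# send only the new / changed files to the bucket.
aws \
        --endpoint-url=http://localhost:9000 \
        s3 sync \
        "$SNAPSHOTS_DIR" "$BUCKET"

Running it periodically (cron or anything similar) gives us incremental snapshots landing in S3.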

Alternative - restic

In the search for a better UX for performing these incremental backups, I stumbled upon restic, a backup program written in Go that takes care of sending the backups to a remote location and presents a delightful view of the snapshots that have been taken. Besides that, it also takes care of encrypting the contents and performing integrity checks. Pretty cool.

To install it on macOS, you can go to the restic/releases GitHub page, take the latest release (in my case, version 0.8.0), fetch the .bz2 file, decompress it, and put it in your $PATH:

# Fetch the .bz2 file with some options:
# --show-error  will fail with error messages
#               if something goes wrong
# --remote-name instructs `curl` to save the
#               file with the name it receives
#               from the remote
# --location    follows redirects if there are
#               any
curl \
        --show-error \
        --remote-name \
        --location \
        https://github.com/restic/restic/releases/download/v0.8.0/restic_0.8.0_darwin_amd64.bz2

# Decompress the file
bzip2 \
        --decompress \
        ./restic_0.8.0_darwin_amd64.bz2

# Move it to somewhere in your $PATH
sudo mv ./restic_0.8.0_darwin_amd64 /usr/local/bin/restic

# Modify the permission bits (make it executable)
sudo chmod +x /usr/local/bin/restic

# Check if it worked
restic --help

Having restic set up we can perform a local snapshot as we performed before with gtar:

# Create a snapshots directory that will
# hold the repository structure that
# restic uses to keep track of the changes
# and store the snapshotted content.
mkdir snapshots

# Create a rootfs that will contain the data
# that we plan to backup.
mkdir rootfs
echo "1" > rootfs/1.txt

        tree
        .
        ├── rootfs
        │   └── 1.txt
        └── snapshots


# initialize the restic repository at `./snapshots`
restic init --repo ./snapshots 

        enter password for new backend: 
        enter password again: 
        created restic backend 100efe3e68 at ./snapshots


# check how the filesystem looks like after 
# the initialization
tree
        .
        ├── rootfs
        │   └── 1.txt
        └── snapshots
            ├── config
            ├── data
            │   ├── 00
            │   ├── 01
            ...
            │   └── ff
            ├── index
            ├── keys
            │   └── 04b12ce19886a5d25...
            ├── locks
            └── snapshots

# Perform our first snapshot of `rootfs`
# saving it in our repository that we
# initialized before.
restic --repo ./snapshots backup ./rootfs 

        enter password for repository: 
        password is correct
        scan [./rootfs]
        scanned 1 directories, 1 files in 0:00
        [0:00] 100.00%  0B/s  2B / 2B  2 / 2 items  0 errors  ETA 0:00 
        duration: 0:00, 0.00MiB/s
        snapshot 9171fe56 saved


# With the snapshot creation succeeded, let's
# check how the snapshots dir changed.
tree
        .
        ├── rootfs
        │   └── 1.txt
        └── snapshots
            ├── config
            ├── data
            │   ├── 00
            ...
            │   ├── 06          # some new stuff
            │   │   └── 0623cae88fefc31f07c09a11e8743b9eb94877883d458a6bcc2ec8a870bccf37
            ...
            ├── index           # index dir has some content
            │   └── 89cbbfae048c49f7abbaa5ac929971550537eb97195039f0a70220890b08d259
            ...
            ...
            └── snapshots       # snapshot!
                └── 9171fe56882a6ac7d4cc8836ec096dc761698dd2d47bfe8b0f9a6b82c061b7dc


# Having the snapshot created we can
# now create a `restored` directory where
# we'll put the results of the restoration
mkdir restored

# Perform the restoration by pointing restic
# at the snapshots directory and a target
# directory that will receive the restored files.
restic \
        --repo ./snapshots \
        restore \
        --target ./restored \
        latest

        enter password for repository: 
        password is correct
        restoring <Snapshot 9171fe56 of [./rootfs] (...) to ./restored


# Go to the `restored` directory and check if it worked
cd restored 
tree
        .
        └── rootfs
            └── 1.txt

# We can also check what are the files in
# a given snapshot as well as list all the
# snapshots we've taken so far.

# Let's first list the snapshots
restic --repo ./snapshots snapshots

        enter password for repository:
        password is correct
        ID        Date                 Host             Tags        Directory
        ----------------------------------------------------------------------
        9171fe56  2017-12-01 09:17:51  cirocosta.local              ./rootfs
        ----------------------------------------------------------------------
        1 snapshots

# With the ID of the snapshot we can then
# inspect the contents of the snapshot
restic \
        --repo ./snapshots \
        ls 9171fe56

        enter password for repository:
        password is correct
        snapshot 9171fe56 of (...):
        /rootfs
        /rootfs/1.txt

# Now, what if we create a new file, snapshot
# it and then check the contents?

# Create a new file:
echo "2" > ./rootfs/2.txt

# Take the new snapshot
restic --repo ./snapshots backup ./rootfs
        enter password for repository: 
        password is correct
        using parent snapshot 9171fe56
        scan [./rootfs]
        scanned 1 directories, 2 files in 0:00
        [0:00] 100.00%  0B/s  4B / 4B  3 / 3 items  0 errors  ETA 0:00 
        duration: 0:00, 0.00MiB/s
        snapshot 3b5e8701 saved


# Look at the snapshots that we created
restic --repo ./snapshots snapshots      
        ID        Date                 Host             Tags        Directory
        ----------------------------------------------------------------------
        9171fe56  2017-12-01 09:17:51  cirocosta.local              /tmp/cc/rootfs
        3b5e8701  2017-12-01 10:38:12  cirocosta.local              /tmp/cc/rootfs
        ----------------------------------------------------------------------
        2 snapshots


# Check the state of the latest snapshot
# (`latest` is an alias for the most recent snapshot)
restic --repo ./snapshots ls latest      
        /rootfs
        /rootfs/1.txt
        /rootfs/2.txt


# Interesting. Looking at the latest snapshot
# reveals all the files, not only the addition
# of `2` - it shows us the complete representation
# of the final state.

# If we look at the first snapshot though,
# then it shows only the final state up to
# that point in time.
 restic --repo ./snapshots ls 9171fe56
        /rootfs
        /rootfs/1.txt

It works! We have our files there.
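
And because every snapshot has an ID, going back to a specific point in time is just a matter of passing that ID instead of latest. A quick sketch reusing the first snapshot’s ID from above (./restored-t1 is just a name I picked for the target directory):

# restore the state as it was at the first snapshot
# (only 1.txt), putting the result under ./restored-t1
restic \
        --repo ./snapshots \
        restore \
        --target ./restored-t1 \
        9171fe56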

It’s important to note that, right now, restic does not support compression, but it performs deduplication across backups, and the backups are all incremental.

As it already has S3 support we can test it against minio:

# run the minio container just like before
docker run \
        --publish 9000:9000 \
        --name minio \
        --detach \
        --env 'MINIO_ACCESS_KEY=accesskey' \
        --env 'MINIO_SECRET_KEY=secretkey' \
        minio/minio \
        server /data


# Initialize the S3 (minio) repository passing
# the AWS credentials as environment variables
# and the repository pointing to our local minio
# instance
export AWS_ACCESS_KEY_ID=accesskey
export AWS_SECRET_ACCESS_KEY=secretkey
restic \
        --repo s3:http://localhost:9000/restic-test \
        init

        enter password for new backend: 
        enter password again: 
        created restic backend 0871674e7e at s3:http://localhost:9000/restic-test


# now, perform the snapshot:
restic \
        --repo s3:http://localhost:9000/restic-test \
        backup ./rootfs 

        enter password for repository: 
        password is correct
        scan [/tmp/cc/rootfs]
        scanned 1 directories, 2 files in 0:00
        [0:00] 100.00%  0B/s  4B / 4B  3 / 3 items  0 errors  ETA 0:00 
        duration: 0:00, 0.00MiB/s
        snapshot e870fc54 saved


# check that restic created the contents there 
# in S3 (minio):
aws \
        --endpoint-url=http://localhost:9000 \
        s3 ls --recursive \
        s3://restic-test

        2017-12-01 10:48:41        155 config
        2017-12-01 10:49:01        178 data/25/25fe4bc29169f29...
        2017-12-01 10:49:01       1312 data/5b/5b525003c0450a4...
        2017-12-01 10:49:01        661 index/e691cc0bd3ebc2e9c...
        2017-12-01 10:48:41        460 keys/50a8c58f10d1c4ae1d...
        2017-12-01 10:49:01        247 snapshots/e870fc54bb482...
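
Restoring from the remote repository works just like the local case - point --repo at the S3 repository string instead (again a sketch; ./restored-from-s3 is an arbitrary target directory):

# restore the latest snapshot straight from the
# minio (S3) repository
restic \
        --repo s3:http://localhost:9000/restic-test \
        restore \
        --target ./restored-from-s3 \
        latest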

Closing thoughts

I was very happy to learn that there’s an easy way of creating these incremental snapshots using standard Linux utilities. For sure there are better (or at least more modern) ways, but the one described in this article seems to be pretty much enough for many cases.

Restic turns out to be pretty good for my use case (performing some snapshots and having them sent directly to S3), but it has some limitations, like the lack of compression. As I’ve never used it very extensively, that’s the biggest one for me (p.s.: if you want to see a comparison of restic against some other tools, make sure you check https://github.com/gilbertchen/benchmarking).

What about you? Have you been performing incremental backups? What are you using to do it? Please let me know - I’m cirowrc on Twitter and would really like to know more about it!

Have a good one,

Ciro.