scat

Scatter your data before losing it

Backup tool that treats its stores as throwaway, untrustworthy commodities

Features

...pick some or all of the above, apply in any order.

Indeed, scat decomposes backing up and restoring into basic stream processors ("procs") arranged like filters in a pipeline. They're chained together, piping the output of proc x to the input of proc x+1. As such, though created for backing up data, its core doesn't actually know anything about backups; it merely provides the necessary procs.

Such modularity makes for great flexibility: stream data from anywhere (local/remote file, arbitrary command, etc.), process it in any way (encrypt, compress, filter through an arbitrary command, etc.) and send it anywhere: writing/reading/uploading/downloading is just another proc at the end/beginning of a chain.

                 +---------------------------------+
                 | chain proc                      |
                 |                                 |
+---------+      |  +--------+         +--------+  |
| chunk 0 +----->|  | proc 0 |         | proc 1 |  |
| (seed)  |      |  +--+-----+         +--------+  |
+---------+      |     |                    ^      |
                 |     |    +-------+       |      |
                 |     +--->|+-------+ -----+      |
                 |          +|+-------+            |
                 |           +| chunk |            |
                 |            +-------+            |
                 +---------------------------------+

...where the seed may be a tar stream and procs 0..n would be split, checksum, parity, gzip, scp, etc., all part of a chain that is itself a proc.
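
To get a feel for chaining, here's a minimal, store-less round trip built only from procs that appear in the examples below; it should simply write its input back out (an illustrative sketch to play with, not a real backup; see Proc string for what each proc actually does):

# split stdin into chunks, gzip then gunzip each chunk, reassemble to stdout
$ echo "hello scat" | scat "split | gzip | ugzip | join -"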

Demo


Full-length 4K demo video: on YouTube

Setup

  1. Download: latest release
    • flat versioning scheme: v0, v1, etc.
  2. Put scat in your $PATH
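
For example, on Linux or macOS (the paths below are hypothetical; adjust them to wherever you downloaded the binary and to a directory that is actually in your $PATH):

$ chmod +x ~/Downloads/scat
$ mv ~/Downloads/scat ~/bin/scat   # any directory in $PATH works
$ command -v scat                  # verify it's found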

Usage

Stream processing, like performing a backup from a tar stream, is done via a proc chain formulated as a proc string.

The following examples showcase proc strings for typical use cases. They're good starting points. Copy them into shell scripts and play around with them, backing up and restoring test files until you fully understand the mechanics at play and reach the desired behaviour. It's important to get comfortable with both directions, so that you back up often and don't fear the moment a restore becomes necessary.

See Proc string for syntax documentation and the full list of available procs.

Backup

Example: backing up directory foo/ in a RAID 5 fashion to 2 Google Drive accounts and 1 VPS (compress, encrypt, 2 data shards, 1 parity shard, upload >= 2 exclusive copies; using 8 threads, 4 concurrent transfers).

Command:

$ tar c foo | scat -stats "split | backlog 8 {
  checksum
  | index foo_index
  | gzip
  | parity 2 1
  | checksum
  | cmd gpg --batch -e -r 00828C1D
  | group 3
  | concur 4 stripe(1 2
      mydrive=rclone(drive:tmp)=7gib
      mydrive2=rclone(drive2:tmp)=14gib
      myvps=scp(bankmon tmp)
    )
  }"

The combination of parity, group and stripe creates a RAID 5:

  1. parity(2 1): split into 2 data shards and 1 parity shard
  2. group(3): aggregate all 3 shards for striping
  3. stripe(1 2 ...): interleave those across the given stores, making 1 copy of each and ensuring at least 2 of the 3 shards land on stores distinct from the others, so we can afford to lose any one store and still recompute the original data

Order matters. Notably:

Note:

Restore

Reverse chain:

Command:

$ scat -stats "uindex | backlog 8 {
  backlog 4 multireader(
    drive=rclone(drive:tmp)
    drive2=rclone(drive2:tmp)
    bankmon=scp(bankmon tmp)
  )
  | cmd gpg --batch -d
  | uchecksum
  | group 3
  | uparity 2 1
  | ugzip
  | join -
}" < foo_index | tar x

More

The above demonstrates only a subset of what's possible with scat. More procs exist, and they can be assembled in different ways to suit one's particular needs. See Proc string.

Command

$ scat [options] <proc>

Options:

Args:

Progress

Being stream-based implies not knowing the total size of the data to process in advance, so no progress percentage can be reported. However, when transferring files or directories, the size can be determined by the caller and passed to pv.

Note: When piping from pv, do not pass the -stats option to scat. Both commands would step on each other's toes writing to stderr and moving the terminal cursor.

File backup:

$ pv my_file | scat "..."

Directory backup (approximate progress, not taking into account tar headers):

# Using GNU du:
$ tar c my_dir | pv -s $(du -sb my_dir | cut -f1) | scat "..."

# Under macOS, install GNU coreutils:
$ brew install coreutils
$ # same as above, replacing du with gdu

# ...or using stock Darwin du, even more approximate:
$ tar c my_dir | pv -s $(du -sk my_dir | cut -f1)k | scat "..."
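
On restore, the total size generally isn't known up front either, but pv can still report throughput (a sketch; use the restore proc string from above, and leave out -stats so scat and pv don't fight over stderr):

$ scat "..." < foo_index | pv | tar x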

Snapshots

Making snapshots boils down to versioning the index file in a git repository:

$ git init
$ git add foo_index
$ git commit -m "backup of foo"

Restoring a snapshot consists of checking out a particular commit and restoring using the old index file:

$ git checkout <commit-ish>
$ # ...use foo_index: see restore example

You could keep a single repository for all your backups and commit index files after each backup, along with the backup and restore scripts used to write and read those particular indexes. This allows modifying proc strings from one backup to the next, while reusing identical chunks if any, and still being able to restore old snapshots created with potentially different proc strings, without having to remember what they were at the time.
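
A sketch of what such a backup script might look like (the file names are hypothetical and the proc string is elided; fill it in from the Backup example above):

#!/bin/sh
# backup_foo.sh: run the backup, then snapshot the index and the scripts
set -e
tar c foo | scat -stats "split | backlog 8 { ... }"
git add foo_index backup_foo.sh restore_foo.sh
git commit -m "backup of foo ($(date -u +%Y-%m-%d))"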

Rationale

scat was born out of frustration with existing backup solutions.

As of writing the initial version, I had one or more of the following gripes with available solutions:

I wanted to be able to:

without:

I believe scat achieves these objectives 🙂

Next

Thanks