ptrack

Overview

Ptrack is a block-level incremental backup engine for PostgreSQL. You can use the ptrack engine to take incremental backups with pg_probackup, a backup and recovery manager for PostgreSQL.

It is designed to allow false positives (i.e. a block/page is marked in the ptrack map but has not actually been changed), but never to allow false negatives (i.e. losing any PGDATA changes, except hint bits).

Currently, the ptrack codebase is split between a small PostgreSQL core patch and an extension. All public SQL API methods and the main engine live in the ptrack extension, while the core patch contains only certain hooks and modifies binary utilities to ignore ptrack.map.* files.

This extension is compatible with PostgreSQL 11, 12, 13, 14, and 15.

Installation

Specify the PostgreSQL branch to work with:

export PG_BRANCH=REL_15_STABLE

Get the latest PostgreSQL sources:

git clone https://github.com/postgres/postgres.git -b $PG_BRANCH

Get the latest ptrack sources:

git clone https://github.com/postgrespro/ptrack.git postgres/contrib/ptrack

Change to the ptrack directory:

cd postgres/contrib/ptrack

Apply the PostgreSQL core patch:

make patch

Compile and install PostgreSQL:

make install-postgres prefix=$PWD/pgsql  # or some other prefix of your choice

Add the newly created binaries to the PATH:

export PATH=$PWD/pgsql/bin:$PATH

Compile and install ptrack:

make install USE_PGXS=1

Load ptrack via shared_preload_libraries and set ptrack.map_size (in MB):

echo "shared_preload_libraries = 'ptrack'" >> <DATA_DIR>/postgresql.conf
echo "ptrack.map_size = 64" >> <DATA_DIR>/postgresql.conf

Run PostgreSQL and create the ptrack extension:

postgres=# CREATE EXTENSION ptrack;

Configuration

The only configurable option is ptrack.map_size (in MB). The default is 0, which means ptrack is turned off. To reduce the number of false positives, it is recommended to set ptrack.map_size to 1/1000 of the expected PGDATA size (i.e. 1000 for a 1 TB database).
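
The sizing rule above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and it assumes decimal units (1 MB = 1,000,000 bytes) so that a 1 TB database maps to exactly 1000.

```python
BYTES_PER_MB = 1_000_000  # assumption: decimal MB, so that 1 TB gives exactly 1000


def recommended_map_size_mb(pgdata_bytes: int) -> int:
    """Return about 1/1000 of the expected PGDATA size in MB, rounded up (at least 1)."""
    if pgdata_bytes <= 0:
        raise ValueError("pgdata_bytes must be positive")
    # Ceiling division in integer arithmetic to avoid float rounding.
    return max(1, -(-pgdata_bytes // (1000 * BYTES_PER_MB)))


print(recommended_map_size_mb(10**12))  # expected PGDATA size of 1 TB
```

The result is the value you would put into ptrack.map_size in postgresql.conf.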

To disable ptrack and clean up all remaining service files, set ptrack.map_size to 0.

Public SQL API

ptrack_version() - returns the ptrack version string.
ptrack_init_lsn() - returns the LSN of the last ptrack map initialization.
ptrack_get_pagemapset(start_lsn pg_lsn) - returns a set of changed data files with the number of changed blocks and their bitmaps since the specified start_lsn.
ptrack_get_change_stat(start_lsn pg_lsn) - returns statistics of changes (number of files, pages, and size in MB) since the specified start_lsn.

Usage example:

postgres=# SELECT ptrack_version();
 ptrack_version 
----------------
 2.4
(1 row)

postgres=# SELECT ptrack_init_lsn();
 ptrack_init_lsn 
-----------------
 0/1814408
(1 row)

postgres=# SELECT * FROM ptrack_get_pagemapset('0/185C8C0');
        path         | pagecount |                pagemap                 
---------------------+-----------+----------------------------------------
 base/16384/1255     |         3 | \x001000000005000000000000
 base/16384/2674     |         3 | \x0000000900010000000000000000
 base/16384/2691     |         1 | \x00004000000000000000000000
 base/16384/2608     |         1 | \x000000000000000400000000000000000000
 base/16384/2690     |         1 | \x000400000000000000000000
(5 rows)

postgres=# SELECT * FROM ptrack_get_change_stat('0/285C8C8');
 files | pages |        size, MB        
-------+-------+------------------------
    20 |    25 | 0.19531250000000000000
(1 row)
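
To read the output above on the client side, the following sketch decodes a pagemap value into block numbers and reproduces the size figure. It assumes that bit i of the bitmap stands for block i with the least significant bit first within each byte, and that the block size is the default 8192 bytes. Both are assumptions, not statements about the ptrack API, and the function names are hypothetical.

```python
from typing import List

BLOCK_SIZE = 8192  # assumption: default PostgreSQL block size


def changed_blocks(pagemap: bytes) -> List[int]:
    """List block numbers whose bit is set, assuming LSB-first order within each byte."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(pagemap)
        for bit in range(8)
        if byte & (1 << bit)
    ]


def size_mb(pages: int) -> float:
    """Size of the given number of pages in MB (1 MB = 1024 * 1024 bytes)."""
    return pages * BLOCK_SIZE / (1024 * 1024)


# pagemap of base/16384/1255 from the example, without the \x prefix
blocks = changed_blocks(bytes.fromhex("001000000005000000000000"))
print(len(blocks), blocks)
print(size_mb(25))
```

Under these assumptions the number of set bits equals the pagecount column (3 for base/16384/1255), and 25 pages of 8192 bytes give the 0.1953125 MB reported by ptrack_get_change_stat().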

Upgrading

Usually, you only have to install the new version of ptrack and run ALTER EXTENSION ptrack UPDATE;. However, some specific actions may be required as well:

Upgrading from 2.0.0 to 2.1.*:

Put shared_preload_libraries = 'ptrack' into postgresql.conf.
Rename ptrack_map_size to ptrack.map_size.
Do ALTER EXTENSION ptrack UPDATE;.
Restart your server.

Upgrading from 2.1.* to 2.2.*:

Since version 2.2 we use a different algorithm for tracking changed pages. Thus, data recorded in ptrack.map by pre-2.2 versions of ptrack is incompatible with newer versions. After the extension upgrade and server restart, the old ptrack.map will be discarded with a WARNING and initialized from scratch.

Upgrading from 2.2.* to 2.3.*:

Stop your server.
Update ptrack binaries.
Remove global/ptrack.map.mmap if it exists in the server data directory.
Start the server.
Do ALTER EXTENSION ptrack UPDATE;.

Upgrading from 2.3.* to 2.4.*:

Stop your server.
Update ptrack binaries.
Start the server.
Do ALTER EXTENSION ptrack UPDATE;.

Limitations

You can only use ptrack safely with wal_level >= 'replica'. Otherwise, you can lose tracking of some changes if crash recovery occurs, since certain commands are designed not to write WAL at all when wal_level is minimal, but we only durably flush the ptrack map at checkpoint time.
The only production-ready backup utility that fully supports ptrack is pg_probackup.
You cannot resize the ptrack map at runtime, only on postmaster start. Resizing also discards all tracked changes, so it is recommended to do it during a maintenance window and to follow it with a full backup.
You will need up to ptrack.map_size * 2 of additional disk space, since ptrack uses an additional temporary file for durability. See the Architecture section for details.

Benchmarks

Briefly, the TPS overhead of using ptrack usually does not exceed a couple of percent (~1-3%) for a database of dozens to hundreds of gigabytes in size, while the backup time scales down linearly with backup size with a coefficient of ~1. This means that an incremental ptrack backup of a database with only 20% of its pages changed will be about 5 times faster than a full backup. More details here.
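
The arithmetic behind the "5 times faster" figure can be written out as follows. This is a back-of-the-envelope sketch that takes the coefficient of ~1 at face value and ignores fixed costs; the function name is hypothetical.

```python
def estimated_speedup(changed_fraction: float) -> float:
    """Estimated full-backup time divided by incremental-backup time.

    Assumes backup time is proportional to the amount of data copied
    (coefficient ~1), as stated in the paragraph above.
    """
    if not 0 < changed_fraction <= 1:
        raise ValueError("changed_fraction must be in (0, 1]")
    return 1 / changed_fraction


print(estimated_speedup(0.2))  # 20% of pages changed
```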

Architecture

We use a single shared hash table in ptrack. Due to the fixed size of the map, there may be false positives (a block is marked as changed without actually being modified), but no false negatives. However, these false positives may be completely eliminated by setting a high enough ptrack.map_size.
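
Why a fixed-size map gives false positives but no false negatives can be seen in a toy model. The sketch below is not ptrack's implementation: the class, the CRC-based hash, and the one-slot-per-block layout are assumptions made purely for illustration.

```python
import zlib


class ToyChangeMap:
    """Fixed-size array of LSNs indexed by a hash of the block identifier."""

    def __init__(self, slots: int):
        self.entries = [0] * slots

    def _slot(self, path: str, block: int) -> int:
        # Hypothetical hash: any deterministic function of (path, block) works here.
        key = f"{path}:{block}".encode()
        return zlib.crc32(key) % len(self.entries)

    def mark(self, path: str, block: int, lsn: int) -> None:
        slot = self._slot(path, block)
        # Keep the newest LSN seen in this slot; older marks are never lost.
        self.entries[slot] = max(self.entries[slot], lsn)

    def maybe_changed(self, path: str, block: int, start_lsn: int) -> bool:
        return self.entries[self._slot(path, block)] >= start_lsn


tiny = ToyChangeMap(slots=1)  # one slot forces every block to collide
tiny.mark("base/16384/1255", 12, lsn=200)
print(tiny.maybe_changed("base/16384/1255", 12, start_lsn=100))  # the modified block
print(tiny.maybe_changed("base/16384/2674", 7, start_lsn=100))   # an untouched block
```

Both lookups report a change: the first is correct, the second is a false positive caused by the collision. A modified block can never be missed, because its slot always holds an LSN at least as new as its last modification. More slots mean fewer collisions, which is why a larger ptrack.map_size reduces false positives.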

All reads/writes are made using atomic operations on uint64 entries, so the map is completely lockless during normal PostgreSQL operation. Because we do not use locks for read/write access, ptrack keeps the map (ptrack.map) as of the last checkpoint intact and uses up to 1 additional temporary file:

temporary file ptrack.map.tmp to durably replace ptrack.map during checkpoint.

The map is written to disk atomically, block by block, at the end of a checkpoint. A CRC32 checksum is calculated during the write and verified the next time the whole map is re-read after crash recovery or restart.
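
The general write-to-temporary-file-then-rename pattern described here can be sketched as follows. This is a generic illustration, not ptrack's C code: the function names and the on-disk layout (payload followed by a 4-byte checksum) are assumptions, and zlib.crc32 stands in for whichever CRC variant ptrack actually uses.

```python
import os
import struct
import tempfile
import zlib


def write_map_durably(path: str, payload: bytes) -> None:
    """Write payload plus checksum to <path>.tmp, flush it, then rename over path."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.write(struct.pack("<I", zlib.crc32(payload)))
        f.flush()
        os.fsync(f.fileno())  # make sure the data is on disk before the rename
    os.replace(tmp_path, path)  # atomic replacement of the previous map


def read_map_checked(path: str) -> bytes:
    """Read the map back and verify the trailing checksum."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 4:
        raise ValueError("map file is too short")
    payload = data[:-4]
    (stored,) = struct.unpack("<I", data[-4:])
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch, the map must be re-initialized")
    return payload


with tempfile.TemporaryDirectory() as directory:
    map_path = os.path.join(directory, "ptrack.map")
    write_map_durably(map_path, b"\x00" * 16)
    restored = read_map_checked(map_path)
    print(restored == b"\x00" * 16)
```

A reader either sees the old complete file or the new complete file, never a half-written one, and the checksum catches corruption that happened after the write. A production version would also fsync the containing directory, which is omitted here for brevity.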

To gather the whole changeset of modified blocks in ptrack_get_pagemapset(), we walk the entire PGDATA (base/**/*, global/*, pg_tblspc/**/*) and use the map to check whether each block of each relation was modified since the specified LSN.
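
The directory walk can be pictured roughly as below. This is a simplified sketch under stated assumptions: it treats every file under the three directories as a relation file (the real code has to be more selective), assumes 8192-byte blocks, and takes the map lookup as a hypothetical callback named maybe_changed.

```python
import os
import tempfile
from typing import Callable, Iterator, Tuple

BLOCK_SIZE = 8192  # assumption: default PostgreSQL block size
TRACKED_DIRS = ("base", "global", "pg_tblspc")


def iter_changed_blocks(
    pgdata: str,
    maybe_changed: Callable[[str, int], bool],
) -> Iterator[Tuple[str, int]]:
    """Yield (relative path, block number) for every block reported as changed."""
    for top in TRACKED_DIRS:
        # followlinks=True because pg_tblspc holds symlinks to tablespace directories.
        for dirpath, _dirs, files in os.walk(os.path.join(pgdata, top), followlinks=True):
            for name in sorted(files):
                full_path = os.path.join(dirpath, name)
                rel_path = os.path.relpath(full_path, pgdata)
                nblocks = os.path.getsize(full_path) // BLOCK_SIZE
                for block in range(nblocks):
                    if maybe_changed(rel_path, block):
                        yield rel_path, block


with tempfile.TemporaryDirectory() as pgdata:
    rel_dir = os.path.join(pgdata, "base", "16384")
    os.makedirs(rel_dir)
    with open(os.path.join(rel_dir, "1255"), "wb") as f:
        f.write(b"\0" * (3 * BLOCK_SIZE))  # a fake three-block relation file
    # Pretend the map reports only block 1 as changed.
    found = list(iter_changed_blocks(pgdata, lambda rel, block: block == 1))
    print(found)
```

Because every block of every file is checked, the cost of this step grows with the total size of PGDATA rather than with the number of changed blocks.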

Contribution

Feel free to send a pull request, create an issue or reach us by e-mail if you are interested in ptrack.

Tests

All changes of the source code in this repository are checked by CI - see commit statuses and the project status badge. You can also run tests locally by executing a few Makefile targets.

Prerequisites

To run Python tests install the following packages:

OS packages:

python3-pip
python3-six
python3-pytest
python3-pytest-xdist

PIP packages:

testgres

For example, for Ubuntu:

sudo apt update
sudo apt install python3-pip python3-six python3-pytest python3-pytest-xdist
sudo pip3 install testgres

Testing

Install PostgreSQL and ptrack as described in Installation, install the testing prerequisites, then do (assuming the current directory is ptrack):

git clone https://github.com/postgrespro/pg_probackup.git ../pg_probackup  # clone the repository into postgres/contrib/pg_probackup
# remember to export PATH=/path/to/pgsql/bin:$PATH
make install-pg-probackup USE_PGXS=1 top_srcdir=../..
make test-tap USE_PGXS=1
make test-python

If pg_probackup is not located in postgres/contrib then additionally specify the path to the pg_probackup directory when building pg_probackup:

make install-pg-probackup USE_PGXS=1 top_srcdir=/path/to/postgres pg_probackup_dir=/path/to/pg_probackup

You can use a public Docker image which already has the necessary build environment (but not the testing prerequisites):

docker run  -e USER_ID=`id -u` -it -v $PWD:/work --name=ptrack ghcr.io/postgres-dev/ubuntu-22.04:1.0
dev@a033797d2f73:~$

Environment variables

Variable	Possible values	Required	Default value	Description
NPROC	An integer greater than 0	No	Output of `nproc`	The number of threads used for building and running tests
PG_CONFIG	File path	No	pg_config (from the PATH)	The path to the `pg_config` binary
TESTS	A Pytest filter expression	No	Not set (run all Python tests)	A filter to include only selected tests into the run. See the Pytest `-k` option for more information. This variable is only applicable to `test-python` for the tests located in tests.
TEST_MODE	normal, legacy, paranoia	No	normal	The "legacy" mode runs tests in an environment similar to a 32-bit Windows system. This mode is only applicable to `test-tap`. The "paranoia" mode compares the checksums of each block of the database catalog (PGDATA) contents before making a backup and after the restoration. This mode is only applicable to `test-python`.

TODO

Should we introduce ptrack.map_path to allow storing ptrack service files outside of PGDATA? That way we would avoid patching PostgreSQL binary utilities to ignore ptrack.map.* files.
Can we resize the ptrack map on restart but keep the previously tracked changes?
Can we write a formal proof that we never lose any modified page with ptrack? With TLA+?