Home

Awesome

About

This repo contains the source code and data in the ICLR 2022 paper, Towards Empirical Sandwich Bounds on the Rate-Distortion Function. For an introduction to this topic, see this blog post: part 1, part 2.

In this work, we make a first attempt at a numerical method for estimating sandwich bounds on the rate-distortion (R-D) function of a general data source using i.i.d. samples. Unlike the classic Blahut-Arimoto (BA) algorithm, which only works when the source and reproduction alphabets are finite and the source probabilities known, here we consider general (e.g., continuous) alphabets, without knowledge of the source distribution other than access to its samples, a common setting in applications like image compression.

We estimate an upper bound on the R-D function using an SGD version of the BA algorithm that is scalable to high-resolution image datasets, based on a close theoretical connection to beta-VAEs and neural compression autoencoders. To verify the tightness of our upper bounds, we also develop a lower bound algorithm based on Csiszár's dual characterization of the R-D function. Due to the associated computational challenges, we restrict to a squared error distortion and a continuous reproduction alphabet, and obtain non-trivial lower bound estimates on data with relatively low intrinsic dimensions, ranging from particle physics measurements to realistic images generated by a BigGAN. Our estimated R-D upper bound on natural images indicates that there is theoretical room for improving state-of-the-art image compression methods by at least 1 dB in PSNR, at various bitrates.

If you find this work helpful, please consider citing the paper

@inproceedings{yang2022towards,
  title={Towards Empirical Sandwich Bounds on the Rate-Distortion Function},
  author={Yang, Yibo and Mandt, Stephan},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

Overview

To get started, you may want to check out the demo notebook that estimates the proposed R-D sandwich bounds on a 2D Gaussian source, and compares with the BA algorithm and the analytical R-D function.

The main scripts implementing the various algorithms are:

  1. ba.py: this contains a vectorized implementation of the Blahut-Arimoto algorithm. It takes in samples from a continuous source, discretizes/estimates the source probabilities by a histogram, and then runs the BA algorithm on the discretized source.
  2. rdub_mlp.py: this implements the proposed upper bound algorithm, by training a beta-VAE with an MLP encoder and (optionally) decoder. It also implements the non-linear transform coding method from Ballé et al., 2021, by placing shape restrictions on the variational posterior and prior distributions to simulate quantization.
  3. rdlb.py: this implements the proposed lower bound algorithm, by training a neural-network function (referred to as "log u" in the paper) to maximize an unconstrained formulation of Csiszár's dual formulation of R-D.
  4. resnet_vae.py: this implements the proposed upper bound algorithm specialized to images, using a hierarchical convolutional VAE inspired by the ResNet-VAE.
  5. mbt2018.py: this implements the mean-scale hyperprior neural compression method from Minnen et al., 2018, based on the tfc implementation of the scale-only hyperprior model.
  6. msms2020.py: this implements the channelwise autoregressive neural compression method from Minnen et al., 2020, adapted from the tfc implementation.

During training, all these scripts will write to disk a log file in .jsonl format, containing the train and val/test objectives evaluated after each training epoch.

Software

The code was developed in a python 3.7 environment on Linux. The main dependices are tensorflow 2.5 and tensorflow-compression 2.2, which can be installed by pip install tensorflow-compression==2.2

I also use the syntax of the GNU parallel tool to conveniently capture experiment commands that are repeated with different hyper-parameters (most commonly, experiments are rerun with different lambda values to sweep out an R-D bound), and are embarrassingly parallelizable. The parallel tool is not required; I simply use its syntax, e.g., {1} {2} ::: a b ::: c d e, to specify all the possible tuples in the Cartesian product a c, a d, a e, b c, b d, and b e.

In fact, depending on the compute resources available, it may not make sense to directly run the parallel commands as written, or you may want to run them sequentially one job at a time with an extra -j 1 flag (e.g., parallel does not take care of allocating jobs to different GPUs, so running more than one job on a GPU may result in OOM crashes).

Data preparation

All other data is self-contained.

Running the experiments

1. Gaussian

UB experiments

Since we have access to the source and compute the training loss on true i.i.d. samples generated on the fly, there's no need for separate validation/test data (--max_validation_steps is set to a small dummy value of 1 here), and the training objectives (loss/rate/mse in the jsonl log) are unbiased estimates of the true Lagrangian/rate/distortion.

LB experiments

The main training command is

nn_size_to_n_ratio=100;
nn_size=$(( n * nn_size_to_n_ratio ));
parallel python rdlb.py  --dataset gaussian --data_dim $n --checkpoint_dir checkpoints/gaussian --model mlp --units $nn_size,$nn_size,$nn_size --lamb {1} train --num_Ck_samples 2 --batchsize 1024 --last_step 3000 --checkpoint_interval 3000 --y_init quick --y_quick_topn 10 --lr 5e-4 --lr_schedule ::: $lambs_to_use

which trains a MLP log u model with 3 hidden layers, with the number of units set based on the source dimension n.

The bash variables $n and $lambs_to_use for each run are set as follows:

######################
n=2
lambs_to_use="1.0 3.0 10.0 30.0 100.0"

######################
n=4
lambs_to_use="2.0 6.0 20.0 60.0 200.0"

######################
n=8
lambs_to_use="4.0 12.0 40.0 120.0 400.0"

######################
n=16
lambs_to_use="8.0 24.0 80.0 240.0 800.0"

The evaluation command is

parallel python rdlb.py --dataset gaussian --data_dim $n --checkpoint_dir checkpoints/gaussian --model mlp --units $nn_size,$nn_size,$nn_size --lamb {1} eval --batchsize 1024 --y_init exhaustive --num_Ck_samples 30 ::: $lambs_to_use

where the bash variables are set the same way as before. The evaluation code loads the trained log u model, and draws Ck samples to form the LB estimator xi as defined in Appendix A.6, saving the resulting Ck and xi samples in a .npz file. The expected value of xi then gives a natural sample-mean estimator of the LB intercept (denoted R_ in code), and samples of xi can also be used to estimate confidence intervals.

2. Particle physics

UB experiments

LB experiments

3. Speech

The commands are almost identical to the previous ones on physics data.

UB experiments

LB experiments

4. Banana-shaped source

UB experiments

The bash variables $n and $nn_size are set as follows:

######################
n=4
nn_size=400

######################
n=16
nn_size=1600

######################
n=100
nn_size=2000

######################
n=500
nn_size=2000

LB experiments

The bash variables $n and $nn_size are set as follows (somewhat arbitrarily, as I didn't do much architecture tuning):

######################
n=2
nn_size=200

######################
n=4
nn_size=400

######################
n=16
nn_size=1000

######################
n=100
nn_size=1000

######################
n=500
nn_size=1000

5. GAN-generated images

UB experiments

In the following commands, $d stands for the intrinsic dimensionality of the images, and is set to 2 or 4; $lamb is the coefficient in front of the distortion term in the Lagrangian, and takes value in 5e-4, 3e-4, 1e-4, 3e-5, 1e-5, 3e-6, 1e-6.

In the training commands, we set the amount of training based on how long it takes the model with the highest lambda to converge. The models with lower lambda values typically require less training.

In the evaluation commands, --batchsize sets the number of GAN image samples to evaluate on; we use --no_cast_xhat in order to treat the reconstruction as floating-point valued (rather than uint8) when computing the MSE, since we take the alphabets of the GAN source to be continuous (more precisely, [0, 1]^n). As the neural compression autoencoders assume input and output to be in [0, 255] (and similarly do resnet_vae.py and the evaluation code, for consistency), the resulting MSE needs to be divided by 255^2 to align with results in the paper.

LB experiments

Note that rdlb.py here trains the lower bound model with a gradually increasing lambda, and saves model checkpoints along the way when lambda reaches a value in --target_lambs. Thus, one training run suffices to sweep out an entire R-D lower bound curve, although the performance can likely still be improved by further training/fine-tuning the intermediate checkpoints saved along the way.

Again, the results can be collected using the convenience method utils.aggregate_lb_results.

6. Natural images

You can train the proposed hierarchical ResNet VAE with:

python resnet_vae.py -V --latent_channels 4,8,16,32,64,128 --ar_prior_levels 4 --ar_slices 8 --num_filters 256 --lambda $lamb --checkpoint_dir checkpoints/img_compression train --dataset cocotrain --patchsize 256 --batchsize 8 --epochs 600 --steps_per_epoch 10000 --warmup 400 --patience 20

and evaluate on Kodak and Tecnick with

parallel -j1 python resnet_vae.py -V --latent_channels 4,8,16,32,64,128 --ar_prior_levels 4 --ar_slices 8 --num_filters 256 --lambda $lamb --checkpoint_dir checkpoints/img_compression eval --dataset {1} --results_dir results/img_compression --no_sub_dir ::: 0.005 0.01 0.02 0.04 0.08 0.16 ::: kodak tecnick

Above $lamb takes value in 0.08, 0.04, 0.02, 0.01, 0.005, 0.0025, 0.001.