Learning from Irregularly-Sampled Time Series: A Missing Data Perspective
This repository provides a PyTorch implementation of the paper "Learning from Irregularly-Sampled Time Series: A Missing Data Perspective".
Requirements
This repository requires Python 3.6 or later. The file requirements.txt lists the required Python modules and the versions we tested with. To install the requirements:
pip install -r requirements.txt
Image
<img src="examples/images.png" alt="image completion" width="750" />
MNIST
Under the image directory, the following commands train P-VAE and P-BiGAN on incomplete MNIST:
# P-VAE:
python mnist_pvae.py
# P-BiGAN:
python mnist_pbigan.py
CelebA
For CelebA, you need to download the dataset from its website. Specifically, you may either:
- Download the file img_align_celeba.zip from this link and extract the zip file into the directory image/celeba-data, or
- Run the script download-celeba.sh under the directory image/celeba-data. Make sure you have curl on your system.
cd image/celeba-data && bash download-celeba.sh
Under the image directory, the following commands train P-VAE and P-BiGAN on incomplete CelebA:
# P-VAE:
python celeba_pvae.py
# P-BiGAN:
python celeba_pbigan.py
Command-line options
For both the MNIST and CelebA scripts, use the option --mask block --block-len n to specify "square observation" missingness with n-by-n observed blocks, and --mask indep --obs-prob .2 to specify "independent dropout" missingness with 80% missing pixels.
Use -h to see all the available command-line options for each script (this also applies to the time series scripts described below).
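The two missingness patterns can be pictured with a short NumPy sketch. This is not the code used by the training scripts: the function names block_mask and indep_mask are hypothetical, and the sketch assumes a single observed square placed uniformly at random in each image, a detail the option description does not spell out. It needs NumPy 1.17 or later for np.random.default_rng.

```python
import numpy as np


def block_mask(height, width, block_len, rng):
    """Mask with one observed block_len-by-block_len square (1 = observed).

    Assumption: a single square per image, placed uniformly at random.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    top = rng.integers(0, height - block_len + 1)
    left = rng.integers(0, width - block_len + 1)
    mask[top:top + block_len, left:left + block_len] = 1
    return mask


def indep_mask(height, width, obs_prob, rng):
    """Mask where each pixel is observed independently with prob. obs_prob."""
    return (rng.random((height, width)) < obs_prob).astype(np.float32)


rng = np.random.default_rng(0)
square = block_mask(28, 28, block_len=12, rng=rng)
dropout = indep_mask(28, 28, obs_prob=0.2, rng=rng)

print(int(square.sum()))  # 144 observed pixels (12 * 12)
print(dropout.mean())     # fraction of observed pixels, close to 0.2
```

With obs_prob set to 0.2, about 80% of the pixels are missing, which matches the --obs-prob .2 setting described above.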
Time Series
Our implementation takes as input a time series dataset composed of three tensors, time, data, and mask, saved as a NumPy npz file.
A time series dataset of N data cases, each with C channels and at most L observations (time-value pairs) per channel, is represented by three tensors time, data, and mask of size (N, C, L):
- mask is the binary mask indicating which entries in time and data correspond to a missing value. mask[n, c, k] is 1 if the k-th entry of the c-th channel of the n-th time series is observed, and 0 if it is missing.
- time stores the timestamps of the time series rescaled to the range [0, 1]. Note that missing entries, whose corresponding mask entry is zero, must still be set to values within [0, 1] for the decoder to work correctly. The easiest way is to set them to zero with time *= mask.
- data stores the time series values associated with time. Missing entries may contain arbitrary values.
The script gen_toy_data.py is an example of creating a synthetic time series dataset in this format.
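As a rough illustration of the format described above, the following sketch builds the three tensors for a tiny random dataset and saves them to an npz file. It is independent of gen_toy_data.py: the sizes, the sine-wave values, and the file name toy-example.npz are all made up for this example, and np.random.default_rng needs NumPy 1.17 or later.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
N, C, L = 4, 2, 10  # data cases, channels, max observations per channel

# Number of observations actually present in each channel of each case.
n_obs = rng.integers(1, L + 1, size=(N, C))

# mask[n, c, k] is 1 for the first n_obs[n, c] slots and 0 for the padding.
mask = (np.arange(L) < n_obs[..., None]).astype(np.float32)

# Sorted timestamps in [0, 1]; missing slots are zeroed so that they
# stay inside [0, 1].
time = np.sort(rng.random((N, C, L)), axis=-1).astype(np.float32)
time *= mask

# Values at those timestamps; missing slots may hold arbitrary values.
data = np.sin(2 * np.pi * time).astype(np.float32)

# Hypothetical output location for this example.
path = os.path.join(tempfile.gettempdir(), 'toy-example.npz')
np.savez_compressed(path, time=time, data=data, mask=mask)

with np.load(path) as f:
    print(sorted(f.files))  # ['data', 'mask', 'time']
    print(f['time'].shape)  # (4, 2, 10)
```

Placing the observed entries in the leading slots of each channel is only a convenience here; the format itself is defined by mask, so any layout with matching mask entries should fit the description above.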
Synthetic data
This notebook provides an overview of P-VAE and P-BiGAN and demonstrates how to train them on a synthetic dataset.
<img src="examples/time-series.png" alt="time series imputation" width="600" />
Under the time-series directory, the following commands train P-VAE and P-BiGAN on a synthetic multivariate time series dataset:
# P-VAE:
python toy_pvae.py
# P-BiGAN:
python toy_pbigan.py
MIMIC-III
MIMIC-III can be downloaded following the instructions from its website.
For the experiments, we apply the optional preprocessing used in this work to the MIMIC-III dataset.
For the time series classification task, our implementation takes as input labeled time series data in one of the following three formats:
- Unsplit format with an additional label vector, for a total of 4 fields. The data will be randomly split into training, test, and validation sets.
  - (time|data|mask): numpy array of shape (N, C, L) as described before.
  - label: binary label of shape (N,).
- Data with a train/test split, with the following 8 fields. The training set will be subsequently split into a smaller training set (80%) and a validation set (20%).
  - (train|test)_(time|data|mask)
  - (train|test)_label
- Data with a train/test/validation split, with the following 12 fields. This is useful for model selection based on a metric evaluated on the validation set over multiple runs (with different randomness).
  - (train|test|val)_(time|data|mask)
  - (train|test|val)_label
The function split_data in time_series.py demonstrates how the data file is read and split into training, test, and validation sets.
You can follow it to create time series data of your own.
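To make the field naming concrete, here is a minimal sketch that assembles the 8-field train/test layout from placeholder data and derives the 12-field layout with an 80%/20% split of the training set. It does not reproduce split_data in time_series.py: the helpers make_split and add_validation_split are hypothetical, the values are random, and the shuffling scheme is an assumption. It needs NumPy 1.17 or later for np.random.default_rng.

```python
import numpy as np

KEYS = ('time', 'data', 'mask', 'label')


def make_split(n, C, L, rng):
    """Random labeled time series with n data cases (placeholder values)."""
    mask = (rng.random((n, C, L)) < 0.5).astype(np.float32)
    return {
        'time': rng.random((n, C, L)).astype(np.float32) * mask,
        'data': rng.standard_normal((n, C, L)).astype(np.float32),
        'mask': mask,
        'label': rng.integers(0, 2, size=n),
    }


def add_validation_split(fields, val_frac=0.2, seed=0):
    """Turn the 8-field train/test layout into the 12-field layout."""
    n = len(fields['train_label'])
    perm = np.random.default_rng(seed).permutation(n)
    n_val = int(round(n * val_frac))
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    out = {}
    for key in KEYS:
        out[f'train_{key}'] = fields[f'train_{key}'][train_idx]
        out[f'val_{key}'] = fields[f'train_{key}'][val_idx]
        out[f'test_{key}'] = fields[f'test_{key}']
    return out


rng = np.random.default_rng(0)
eight = {}
for split, n in (('train', 50), ('test', 20)):
    for key, value in make_split(n, C=3, L=8, rng=rng).items():
        eight[f'{split}_{key}'] = value

twelve = add_validation_split(eight)
print(len(eight), len(twelve))  # 8 12
print(twelve['train_label'].shape, twelve['val_label'].shape)  # (40,) (10,)
```

Either dictionary can be written to disk with np.savez(path, **fields), which stores each entry under its field name.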
Once the time series data is ready, run the following commands under the time-series directory:
# P-VAE:
python mimic3_pvae.py
# P-BiGAN:
python mimic3_pbigan.py
Citation
If you find our work relevant to your research, please cite:
@InProceedings{li2020learning,
title = {Learning from Irregularly-Sampled Time Series: A Missing Data Perspective},
author = {Li, Steven Cheng-Xian and Marlin, Benjamin M.},
booktitle = {Proceedings of the 37th International Conference on Machine Learning},
year = {2020}
}
Contact
Your feedback would be greatly appreciated! Reach us at li.stevecx@gmail.com.