ctgan-server-cli

This package provides a simple way to deploy CTGAN, a GAN-based data synthesizer, onto a remote server. It lets you create synthetic samples of tabular data, e.g. synthetic versions of confidential or proprietary datasets that can then be shared. For more details and use cases, see References. The package contains the following additional features:

Important Caveats!!!

The username/password feature is not secure, because usernames and passwords are stored in an unencrypted dictionary. This feature exists as a proof of concept only, and as a simple method to separate models/data between TRUSTED users on the same server.
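To make the caveat concrete, the sketch below contrasts a plaintext credential dictionary with a salted-hash store built from Python's standard library. It is not this package's code: `PLAINTEXT_USERS`, `hash_password`, `verify_password`, and the iteration count are all hypothetical names and values chosen for illustration.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Roughly what an unencrypted credential store looks like (hypothetical).
# Anyone who can read this dictionary sees every password.
PLAINTEXT_USERS = {"user1": "password1"}

ITERATIONS = 200_000  # assumed work factor, chosen for illustration only


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key) computed with PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Re-derive the key from the candidate password and compare in constant time."""
    _, key = hash_password(password, salt)
    return hmac.compare_digest(key, expected_key)


# Store only the salt and derived key, never the password itself.
HASHED_USERS = {name: hash_password(pw) for name, pw in PLAINTEXT_USERS.items()}

if __name__ == "__main__":
    salt, key = HASHED_USERS["user1"]
    print(verify_password("password1", salt, key))
    print(verify_password("wrong-password", salt, key))
```

Even with hashing, this only protects the stored credentials; it does not by itself make a shared server safe for untrusted users.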

In the original ctgan package, you must specify which columns are discrete before fitting. For simplicity, I have set up this package to automatically treat 'string' columns as discrete and all numerical columns as continuous. In the future I will try to add the ability for users to specify discrete_columns.
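The rule described above can be sketched with the standard library alone. This is an illustration of the idea, not the package's implementation: the function names `is_number` and `split_columns` are hypothetical, and the sketch assumes a column counts as numerical only when every non-empty cell parses as a float.

```python
import csv
import io
from typing import Dict, List


def is_number(value: str) -> bool:
    """Return True if the cell text parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def split_columns(csv_text: str) -> Dict[str, List[str]]:
    """Classify each CSV column as discrete or continuous.

    A column is continuous only if it has at least one non-empty cell
    and every non-empty cell is numeric; otherwise it is discrete.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    result: Dict[str, List[str]] = {"discrete": [], "continuous": []}
    for name in reader.fieldnames or []:
        cells = [row[name] for row in rows if row[name] not in ("", None)]
        numeric = bool(cells) and all(is_number(cell) for cell in cells)
        result["continuous" if numeric else "discrete"].append(name)
    return result


if __name__ == "__main__":
    demo = "age,income,occupation\n39,77516.0,clerk\n50,83311.5,manager\n"
    print(split_columns(demo))
```

One consequence of a rule like this is that integer-coded categories (a numeric ID or postal code, for instance) are treated as continuous, which is the case a user-supplied discrete_columns list would address.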

Installation

You can install from GitHub by cloning the repository:

git clone https://github.com/oregonpillow/ctgan-server-cli.git

You will also need to install the required packages onto the server:

pip3 install ctgan --no-cache-dir

If you're using a fresh Linux server, you'll probably need to install pip and git before installing ctgan:

sudo apt -y update
sudo apt -y upgrade
sudo apt -y install python3-pip
sudo apt -y install git
git clone https://github.com/oregonpillow/ctgan-server-cli.git
pip3 install ctgan --no-cache-dir

Requirements

Example

A quick example:

bash fit.sh

#>
#>=========================================================
#>                  ___ _____ ___   _   _  _
#>                 / __|_   _/ __| /_\ | \| |
#>                | (__  | || (_ |/ _ \| .` |
#>                 \___| |_| \___/_/ \_\_|\_|
#>
#>               Deep Learning Synthetic Data.
#>                  github.com/sdv-dev/CTGAN
#>
#>=========================================================
#>
#>Original data files:
#>--------------------------------------------------------
#>original_data_demo.csv
#>--------------------------------------------------------
#>
#>Please enter original data filename: original_data_demo.csv
#>
#>original_data_demo.csv                                      100% 3722KB   3.6MB/s   00:01
#>Fit Module
#>
#>A CLI implemention of CTGAN
#>
#>
#>Please enter your username.
#>user1
#>Please enter your password.
#>password1
#>***** Login successful *****
#>
#>
#>Please re-confirm the file name you wish to fit: original_data_demo.csv
#>Data Integrity check PASSED
#>Please choose number of epochs for fitting: 5
#>Fitting Data. This may take a while...
#>
#>Epoch 1, Loss G: 1.9984, Loss D: -0.3420
#>Epoch 2, Loss G: 1.2560, Loss D: 0.0264
#>Epoch 3, Loss G: 0.3803, Loss D: 0.0651
#>Epoch 4, Loss G: -0.2293, Loss D: 0.0992
#>Epoch 5, Loss G: -0.4099, Loss D: 0.0719
#>Fitting Complete
#>/root/ctgan_server/model_database/user1/ Successfully found existing database for current user
#>
#>Model successfully added to database. Exiting...

The fitted model is now stored on the server in a compressed, serialized file format. We can sample from the model using the sampler.sh script:

bash sampler.sh

#>
#>=========================================================
#>                  ___ _____ ___   _   _  _
#>                 / __|_   _/ __| /_\ | \| |
#>                | (__  | || (_ |/ _ \| .` |
#>                 \___| |_| \___/_/ \_\_|\_|
#>
#>               Deep Learning Synthetic Data.
#>                  github.com/sdv-dev/CTGAN
#>
#>=========================================================
#>
#>Sample Module
#>
#>A CLI implemention of CTGAN
#>
#>
#>Please enter your username.
#>user1
#>Please enter your password.
#>password1
#>
#>***** Login successful *****
#>
#>user1 database models:
#>========================================================
#>
#>original_data_demo_20200301t181421_model.gz
#>
#>========================================================
#>
#>Please enter model to load:
#>original_data_demo_20200301t181421_model.gz
#>original_data_demo_20200301t181421_model.gz selected
#>Processing model...
#>Model loaded successfully
#>
#>Please enter sampling size: 100000
#>Sampling Data...
#>created synthetic csv folder for user
#>Download Synthetic data from server...
#>original_data_demo_20200301t181928_100000_synthetic.csv     100%   11MB   4.1MB/s   00:02
#>Download complete

The sampled data is now saved in the local folder synthetic_output.

If a user does not want to re-sample from a model, but instead wants an exact copy of a previously generated synthetic dataset, they can use the downloader.sh script:

bash downloader.sh

#>
#>=========================================================
#>                  ___ _____ ___   _   _  _
#>                 / __|_   _/ __| /_\ | \| |
#>                | (__  | || (_ |/ _ \| .` |
#>                 \___| |_| \___/_/ \_\_|\_|
#>
#>               Deep Learning Synthetic Data.
#>                  github.com/sdv-dev/CTGAN
#>
#>=========================================================
#>
#>Download module
#>
#>A CLI implemention of CTGAN
#>
#>
#>Please enter your username.
#>user1
#>Please enter your password.
#>password1
#>
#>*****Login successful*****
#>
#>user1 database models:
#>========================================================
#>
#>original_data_demo_20200301t181928_100000_synthetic.csv.gz
#>
#>========================================================
#>
#>Please enter synthetic data to download: original_data_demo_20200301t181928_100000_synthetic.csv.gz
#>
#>original_data_demo_20200301t181928_100000_synthetic.csv.gz selected
#>Processing data...
#>Download Synthetic data from server...
#>original_data_demo_20200301t181928_100000_synthetic.csv      100% 4480KB   4.4MB/s   00:01 ETA
#>Download complete

Description of folders inside 'ctgan_server'

File naming structure

All models and sampled data stored in the server database carry a timestamp based on ISO 8601 (yyyymmddthhmmss):

'yyyy' 4-digit year
'mm' 2-digit month
'dd' 2-digit day
't' time separator
'hh' 2-digit hour
'mm' 2-digit minute
'ss' 2-digit second

Model file example (compressed, serialized PyTorch model):

"original file name"_"timestamp"_model.gz
e.g. original_data_demo_20200301t181421_model.gz
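As a rough illustration of what a compressed, serialized file involves, the sketch below writes and reads a gzip-compressed pickle. The exact serialization this package uses is not stated here, so treat pickle as an assumed stand-in; `save_compressed` and `load_compressed` are hypothetical helper names, and a plain dictionary stands in for a fitted model.

```python
import gzip
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_compressed(obj: Any, path: Path) -> None:
    """Serialize obj with pickle and write it gzip-compressed to path."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path: Path) -> Any:
    """Read a gzip-compressed pickle back into memory.

    Only load files you created yourself: unpickling untrusted data
    can execute arbitrary code.
    """
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # A plain dict stands in for a fitted model object (assumption).
    stand_in_model = {"epochs": 5, "columns": ["age", "income", "occupation"]}
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "original_data_demo_20200301t181421_model.gz"
        save_compressed(stand_in_model, target)
        print(load_compressed(target) == stand_in_model)
```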

Sampled CSV example:

"original file name"_"timestamp"_"sample size"_synthetic.csv.gz
e.g. original_data_demo_20200301t181928_100000_synthetic.csv.gz
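The naming scheme above can be built and parsed with a few lines of Python. This is a minimal sketch rather than the package's own code: the helper names (`model_name`, `sample_name`, `parse_name`) and the regular expressions are assumptions derived from the two example file names.

```python
import re
from datetime import datetime
from typing import Optional

# yyyymmddthhmmss, with a literal lowercase 't' as the time separator.
TIMESTAMP_FORMAT = "%Y%m%dt%H%M%S"

MODEL_RE = re.compile(r"^(?P<stem>.+)_(?P<ts>\d{8}t\d{6})_model\.gz$")
SAMPLE_RE = re.compile(
    r"^(?P<stem>.+)_(?P<ts>\d{8}t\d{6})_(?P<size>\d+)_synthetic\.csv\.gz$"
)


def model_name(stem: str, when: datetime) -> str:
    """Build a model file name from the original file stem and a timestamp."""
    return f"{stem}_{when.strftime(TIMESTAMP_FORMAT)}_model.gz"


def sample_name(stem: str, when: datetime, size: int) -> str:
    """Build a sampled-CSV file name including the sample size."""
    return f"{stem}_{when.strftime(TIMESTAMP_FORMAT)}_{size}_synthetic.csv.gz"


def parse_name(filename: str) -> Optional[dict]:
    """Split a model or sample file name into its parts, or return None."""
    for kind, pattern in (("sample", SAMPLE_RE), ("model", MODEL_RE)):
        match = pattern.match(filename)
        if match:
            parts = match.groupdict()
            parts["kind"] = kind
            parts["ts"] = datetime.strptime(parts["ts"], TIMESTAMP_FORMAT)
            if "size" in parts:
                parts["size"] = int(parts["size"])
            return parts
    return None


if __name__ == "__main__":
    print(model_name("original_data_demo", datetime(2020, 3, 1, 18, 14, 21)))
    print(parse_name("original_data_demo_20200301t181928_100000_synthetic.csv.gz"))
```

Because the timestamp is fixed-width and ordered from year down to second, sorting file names alphabetically also sorts them chronologically for a given original file.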

References

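If you use ctgan, please cite the original work, "Modeling Tabular data using Conditional GAN" (Xu et al., 2019).

For an R package implementation of ctgan, see the following work, "Generative Synthesis of Insurance Datasets" (Kuo, 2019).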

@inproceedings{xu2019modeling,
  title={Modeling Tabular data using Conditional GAN},
  author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}

@misc{kuo2019generative,
  title={Generative Synthesis of Insurance Datasets},
  author={Kuo, Kevin},
  year={2019},
  eprint={1912.02423},
  archivePrefix={arXiv},
  primaryClass={stat.AP}
}