Update (2024-06)

The code has been updated to support evaluation on the dataset provided by https://github.com/alephzerox/ancestry-fhe.

Simply download the dataset from here, place the downloaded files in the data directory, and prepare the dataset using python preprocessing.py.

Performance on M1 MAX:

Summary
    Number of samples: 100

    Performance (Inference)
        Clear:  0.31 s/sample
        FHE:    293.79 s/sample
        
    Accuracy (Inference)
        Clear:  76.42 %
        FHE:    76.18 %

Installation

  1. Download the query file from here and place the vcf.gz file in the data directory.
  2. Install the required packages using pip install -r requirements.txt.
  3. Install bcftools for VCF file manipulation. Detailed instructions can be found here.
  4. Prepare the dataset using python preprocessing.py. This builds the dataset and simulations in data/.
  5. Run python main_clear.py to train and evaluate the model in the clear (no encryption).
  6. Run python main_fhe.py to train and evaluate the model under FHE.

Methodology

We follow gnomix's approach of first extracting features from windows (slices) of genomic data and then smoothing those features to obtain the final prediction. Specifically, we use LogisticRegression for per-window feature extraction and XGBClassifier for smoothing.

Training is conducted on the 1000genomes dataset, a large-scale genomic dataset. After training, the models are compiled into FHE-compatible models and then evaluated under FHE.
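
A minimal sketch of the two-stage pipeline, assuming Concrete ML as the FHE backend (the repo wires this up per genomic window; the single-window data and variable names below are illustrative, not the repo's API):

```python
import numpy as np
from concrete.ml.sklearn import LogisticRegression, XGBClassifier

rng = np.random.default_rng(0)
# Toy stand-in for one genomic window: binary SNP features, 3 ancestry labels.
X = rng.integers(0, 2, size=(600, 100))
y = rng.integers(0, 3, size=600)

# Stage 1: a per-window base model turns raw SNPs into class probabilities.
base = LogisticRegression(n_bits=6)
base.fit(X, y)
features = base.predict_proba(X)  # clear inference by default

# Stage 2: a smoother refines the per-window probability estimates.
smoother = XGBClassifier(n_bits=6, n_estimators=100, max_depth=4)
smoother.fit(features, y)

# Compile both stages to FHE circuits, then run one encrypted prediction.
base.compile(X)
smoother.compile(features)
enc_features = base.predict_proba(X[:1], fhe="execute")
pred = smoother.predict(enc_features, fhe="execute")
```

In the real pipeline the smoother consumes the concatenated probabilities of all windows, so treat this as a shape-level illustration only.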

Configurations

Main configurations are stored in config.py; a brief, illustrative sketch is given below.
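
This is a hedged sketch only: the names below are the ones mentioned elsewhere in this README, while the values are assumed placeholders, so consult config.py for the authoritative defaults.

```python
# Illustrative excerpt of config.py -- values are placeholders, not the
# repo's actual defaults.

# Dataset / simulation settings
BUILD_GENS = [0, 2, 4, 8]  # admixture generations to simulate (assumed values)
WINDOW_SIZE_CM = 0.2       # genomic window size in centimorgans (assumed value)

# Model hyper-parameters
N_ESTIMATORS = 100         # trees in the XGBClassifier smoother
MAX_DEPTH = 4              # maximum tree depth

# FHE settings
N_BITS = 6                 # quantization bit-width used at compile time
P_ERROR = 1e-2             # acceptable per-operation error probability
```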

Hyper-parameters tuning

For better performance, several hyper-parameters, including n_bits, n_estimators, and max_depth, can be tuned. We conducted extensive experiments on the 1000genomes dataset to find the best values. Note that we do not alter any dataset configurations here (e.g., BUILD_GENS, WINDOW_SIZE_CM), as they are tied to downstream tasks.

Specifically, we first tune the model hyper-parameters using main_clear.py. The results are shown in the following table.

| n_estimators | max_depth | Accuracy | Inference time (M1 MAX) |
|---|---|---|---|
| 100 | 4 | 0.9811 | 0.138 s/it |
| 50 | 4 | 0.9801 | 0.134 s/it |
| 20 | 4 | 0.9782 | 0.133 s/it |
| 100 | 3 | 0.9809 | 0.128 s/it |
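
A minimal sketch of such a sweep, assuming plain xgboost for the clear models and synthetic stand-in data (the repo's actual data loading lives in main_clear.py and preprocessing.py):

```python
import time

import numpy as np
from xgboost import XGBClassifier

# Synthetic stand-in for the preprocessed per-window features and labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 64)), rng.integers(0, 3, size=800)
X_test, y_test = rng.normal(size=(200, 64)), rng.integers(0, 3, size=200)

for n_estimators, max_depth in [(100, 4), (50, 4), (20, 4), (100, 3)]:
    model = XGBClassifier(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)
    start = time.time()
    acc = model.score(X_test, y_test)
    per_sample = (time.time() - start) / len(X_test)
    print(f"n_estimators={n_estimators} max_depth={max_depth} "
          f"acc={acc:.4f} time={per_sample:.3f} s/it")
```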

Then, keeping the hyper-parameters selected for clear training, we ran experiments using main_fhe.py to determine the FHE parameters with minimal impact on accuracy and performance. The results are shown in the following table.

| n_bits | p_error | Accuracy | Inference time (SIMULATE, M1 MAX) |
|---|---|---|---|
| 6 | 1e-40 | 0.9811 | 2.981 s/it |
| 6 | 1e-2 | 0.9810 | 2.931 s/it |
| 6 | 1e-1 | 0.9808 | 2.935 s/it |
| 4 | 1e-2 | 0.9788 | 2.891 s/it |
| 2 | 1e-1 | 0.9636 | 2.247 s/it |
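
This kind of sweep can be sketched with Concrete ML's simulation mode, which estimates FHE behaviour without the full encrypted runtime. The data below is synthetic and the loop merely mirrors the table above, so treat it as an assumption-laden illustration rather than the repo's script:

```python
import numpy as np
from concrete.ml.sklearn import XGBClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 64)), rng.integers(0, 3, size=500)

for n_bits, p_error in [(6, 1e-40), (6, 1e-2), (6, 1e-1), (4, 1e-2), (2, 1e-1)]:
    model = XGBClassifier(n_bits=n_bits, n_estimators=100, max_depth=4)
    model.fit(X, y)
    model.compile(X, p_error=p_error)  # bake the error budget into the circuit
    acc = (model.predict(X, fhe="simulate") == y).mean()
    print(f"n_bits={n_bits} p_error={p_error:g} acc={acc:.4f}")
```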

Performance for reference

Note that the following numbers are provided for reference only. Depending on hardware, downstream tasks, and the chosen parameters, performance may vary significantly.

Hardware: M1 MAX, 64GB RAM

Hyper-parameters:

Configurations are the defaults from config.py. There are 370 base models in the ensemble in total; depending on the downstream task, this can be adjusted by changing WINDOW_SIZE_CM and other related configurations (a toy computation is sketched after the table below).

| Dataset | Accuracy | Time (FHE execute) | Time (non-FHE) |
|---|---|---|---|
| 1000genomes (Augmented) | 0.9810 | 350 min/it | 0.164 s/it |
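
For intuition on the 370 figure: the ensemble has one base model per genomic window, so the count is the genetic length divided by WINDOW_SIZE_CM. Both values in the toy computation below are assumptions chosen for illustration, not defaults read from config.py:

```python
# Hedged toy arithmetic -- both constants are assumed for illustration only;
# check config.py for the real settings.
TOTAL_GENETIC_LENGTH_CM = 74.0  # assumed genetic length of the target region
WINDOW_SIZE_CM = 0.2            # assumed window size in centimorgans

n_base_models = round(TOTAL_GENETIC_LENGTH_CM / WINDOW_SIZE_CM)
print(n_base_models)  # 370 -> one base model per window
```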