Home

Awesome

Analysis pipeline

METEORE: MEthylation deTEction with nanopORE sequencing :stars:

About METEORE

METEORE provides snakemake pipelines for various tools to detect DNA methylation from Nanopore sequencing reads. Additionally, it provides new predictive models (random forest and multiple linear regression) that combine the outputs from the tools to produce a consensus prediction with higher accuracy than the individual tools.

NEW UPDATES (Mar-2021)

METEORE can now produce two per-site result files in an augmented BED format for each tool except for DeepMod (which will be updated very soon). The first output file contains the following fields:

  1. Reference chromosome
  2. Start position in chromosome
  3. End position in chromosome
  4. Read coverage
  5. Methylation (i.e. methylation frequency)
  6. Strandedness

In the second output file, we combine the methylation predictions from both strands on CpG sites by averaging the methylation frequencies and adding up the coverage. This output file contains the following fields:

  1. Reference chromosome
  2. Start position in chromosome
  3. End position in chromosome
  4. Read coverage
  5. Methylation (i.e. methylation frequency)

Table of Contents



Pipeline


Analysis pipeline Fig 1. Pipeline for CpG methylation detection form nanopore sequencing data. All tools take the input fast5 files, detect modified bases (5-methylcytosine at CG dinucleotides in this case) in reads and predict per-site methylation frequency at genome level.


Installation


We recommend to install software dependencies via Conda on Linux. You can find Miniconda installation instructions for Linux here. Make sure you install the Miniconda Python3 distribution.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Accept the license terms during installation.

For performance and compatibility reasons you should install Mamba via conda to install Snakemake for each pipeline later. See Snakemake documentation for more details.

conda install -c conda-forge mamba

Once you have installed Conda and Mamba, you can download the Snakemake pipelines and the example datasets.

git clone https://github.com/comprna/METEORE.git
cd METEORE/

Tutorial on an example dataset


We provide an example dataset data/example along with a genome reference data/ecoli_k12_mg1655.fasta for you to try the pipelines with. The example contains 50 single-read fast5 files from the positive control dataset for E.coli generated by Simpson et al. (2017).

Run the pipelines with your own data:

Nanopolish snakemake pipeline

Create and activate the Conda environment

To install packages for Nanopolish pipeline, run one of the following:

  1. Installing packages via Mamba
# Create an environment with Snakemake installed
mamba create -c conda-forge -c bioconda -n meteore_nanopolish_env snakemake
# Activate
conda activate meteore_nanopolish_env
# Install all required conda packages with mamba
mamba install -c bioconda nanopolish minimap2 samtools r-data.table r-dplyr r-plyr
  1. Installing packages using .yml file**
mamba env create -f nanopolish.yml
conda activate meteore_nanopolish_env

Run the snakemake

Before executing the workflow below, make sure you have the basecalled fastq file in the METEORE directory. Nanopolish needs to link the read ids from the fastq file with their signal-level data in the fast5 files. An example fastq file example.fastq is provided.

A Snakefile named Nanopolish contains all rules for the Snakemake workflow. Run the snakemake to create the output files:

snakemake -s Nanopolish nanopolish_results/example_nanopolish-freq-perCG.tsv --cores all

This will produce four index files example.fastq.index, example.fastq.index.fai, example.fastq.index.gzi and example.fastq.index.readdb, and the nanopolish_results output directory containing all output files.

Chr           Pos         Strand    Log.like.ratio  Read_ID
NC_000913.3   3499494     +         -0.62           094dfe6b-23ed-4195-8876-805a399fade5
NC_000913.3   3499526     +         -0.33           094dfe6b-23ed-4195-8876-805a399fade5
NC_000913.3   3499546     +         -0.12           094dfe6b-23ed-4195-8876-805a399fade5
NC_000913.3   3499563     +         8.26            094dfe6b-23ed-4195-8876-805a399fade5
Chr             Pos_start   Pos_end   Coverage    Methylation   Strand
NC_000913.3     3503839     3503840   7           1             +
NC_000913.3     3503840     3503841   7           1             -
NC_000913.3     3503849     3503850   7           1             +
NC_000913.3     3503850     3503851   7           1             -
Chr             Pos_start   Pos_end   Coverage    Methylation
NC_000913.3     3503839     3503840   14          1
NC_000913.3     3503849     3503850   14          1

Prepare the input file for combined model usage (optional)

You can also generate the per-read prediction output in a format that can be used in METEORE to generate a consensus prediction (see further below).

snakemake -s Nanopolish nanopolish_results/example_nanopolish-perRead-score.tsv --cores all

The output is in TSV format and contains the following fields: ID, Chr, Pos, Strand and Score, e.g.:

ID                                      Chr             Pos       Strand    Score
094dfe6b-23ed-4195-8876-805a399fade5    NC_000913.3     3499655   -         20.79
...

DeepSignal snakemake pipeline

Create and activate the first Conda environment

Note: Two environments with Snakemake installed are required. The first is for Deepsignal itself. The second is to produce output files that are in the same format generated by other tools.

We will create and first Conda environment:

# Here we install Python 3.6 with older version of Snakemake so we can install tensorflow==1.13.1 later
mamba create -n meteore_deepsignal_env1 -c bioconda -c conda-forge python=3.6.1 snakemake=5.3.0
# Activate
conda activate meteore_deepsignal_env1
# Install all required packages using pip
# We will run DeepSignal on CPUs so we install the CPU version of tensorflow
pip install numpy deepsignal 'tensorflow==1.13.1' ont-tombo ont-fast5-api

Alternatively, you can run DeepSignal on a GPU-enabled machine. Then you can install the GPU version of tensorflow. Please check out DeepSignal Github Page for more information.

Please download DeepSignal's trained model model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz here. To extract the model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz to the METEORE/data directory:

tar xvzf model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz -C <path_to_METEORE/data_directory>

Generate default outputs by DeepSignal

Note:

A Snakefile named Deepsignal1 contains rules to generate default output files from DeepSignal in the Snakemake workflow. Run it to create these files:

snakemake -s Deepsignal1 deepsignal_results/example_deepsignal-freq-perCG-raw.tsv

This will produce the deepsignal_results output directory containing the following output files:

#chr          pos       strand    0-based_pos   prob_unmethy_sum    prob_methyl_sum   count_modified  count_unmodified  coverage  mod_freq  k_mer
NC_000913.3   3501290   +         3501290       1.947               5.053             7                0                 7        1.000     AAAAGCACCGTGGACTT
NC_000913.3   3501291   -         1140360       3.103               1.897             1                4                 5        0.200     AAAGTCCACGGTGCTTT
NC_000913.3   3501308   +         3501308       1.032               5.968             7                0                 7        1.000     CTGGTCACCGAAAATAT
NC_000913.3   3501309   -         1140342       1.493               3.507             3                2                 5        0.600     CATATTTTCGGTGACCA

Create and activate the second Conda environment

Deactivate the meteore_deepsignal_env environment before creating a new environment. We use Conda rather than Mamba to avoid any caching issues (known bug in Mamba).

conda deactivate
conda create -n meteore_deepsignal_env2 -c bioconda -c conda-forge r-base=4.0.3 r-data.table r-dplyr snakemake
conda activate meteore_deepsignal_env2

Produce result files that are in the same format generated by other tools

Another Snakefile named Deepsignal2 contains the rules for the Snakemake workflow to produce per-site result files and prepare the input file for combined model usage. Note the uses of --cores in the newer version of Snakemake.

snakemake -s Deepsignal2 deepsignal_results/example_deepsignal-freq-perCG.tsv --cores 1

Two output files will be produced and saved to the deepsignal_resultsdirectory:

Chr             Pos_start   Pos_end   Coverage    Methylation   Strand
NC_000913.3     3501827     3501828   9           0.889         +
NC_000913.3     3501828     3501829   8           1             -
NC_000913.3     3501840     3501841   9           1             +
NC_000913.3     3501841     3501842   8           0.875         -
NC_000913.3     3501861     3501862   9           0.889         +
NC_000913.3     3501862     3501863   8           0.875         -
Chr             Pos_start   Pos_end     Coverage    Methylation
NC_000913.3     3501827     3501828     17          0.9445
NC_000913.3     3501840     3501841     17          0.9375
NC_000913.3     3501861     3501862     17          0.882

Prepare the input file for combined model usage (optional)

You can now generate the per-read prediction output in a format that can be used in METEORE to generate a consensus prediction (see further below). Note the uses of --cores in the newer version of Snakemake.

snakemake -s Deepsignal2 deepsignal_results/example_deepsignal-perRead-score.tsv --cores 1

The output is in TSV format and contains the following fields: ID, Chr, Pos, Strand and Score.

Tombo snakemake pipeline

Create and activate the Conda environment

# Create an environment with Snakemake installed
mamba create -c conda-forge -c bioconda -n meteore_tombo_env snakemake==5.3.0 python==3.6.10 ont-tombo r-data.table r-dplyr
# Activate
conda activate meteore_tombo_env
# Install all required packages using pip
pip install wiggelen ont-fast5-api h5py==2.10.0

Run the snakemake

Note:

A Snakefile named Tombo contains all rules in the Snakemake workflow. Run it to create the output files:

snakemake -s Tombo tombo_results/example_tombo-freq-perCG.tsv

This will produce the following original modified base prediction results generated by Tombo. You can find the detailed information for all Tombo commands and outputs on the tombo documentation page.

There are rules in the Snakemake workflow for downstream processing of the output files generated by Tombo. The above command also generates a tombo_results output directory which contains the following output files:

Prepare the input file for combined model usage (optional)

You can also generate the per-read prediction output in a format that can be used in METEORE to generate a consensus prediction (see further below).

snakemake -s Tombo tombo_results/example_tombo-perRead-score.tsv

The output is in TS format and contains the following fields: ID, Chr, Pos, Strand and Score.

Note:

######## specify region of interest below: ########
chromosome = 'NC_000913.3'
start_position = 412305
end_position = 4584088

Guppy Snakemake pipeline

Modified basecalling

You need to basecall with the standalone Guppy basecaller before running the Snakemake pipeline. The pipeline was only designed to process and analyze the Guppy's fast5 output using the open-source custom scripts available from https://github.com/kpalin/gcf52ref. This was only tested with Guppy (version 3.6) when modified basecalling was introduced. We highly recommend to use Megalodon (ONT-developed tool) coupled with Guppy for methylation calling (see Megalodon section below).

Guppy basecaller is only available to Oxford Nanopore Technologies' customers via the community site. Please check the community page for download and installation instructions.

Once you have installed Guppy, you can perform modified basecalling from the signal data using the dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg guppy config.

# For Guppy (CPU on Linux)
./<path_to_ont-guppy-cpu>/bin/guppy_basecaller --config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg --fast5_out --input_path data/example/ --save_path guppy_results/ --cpu_threads_per_caller 10
# For Guppy (GPU on Linux)
./<path_to_ont-guppy-gpu>/bin/guppy_basecaller --config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg --fast5_out --input_path data/example/ --save_path guppy_results/ --device auto

Once you have done modified basecalling with Guppy, the output files will be saved to the guppy_results directory where you will find the logs, basecalled fastq file(s) and a folder named workspace which contains basecalled fast5 files with modified base information.

Create and activate the Conda environment

Before running the Snakmake pipeline, you need to prepare the following files:

cd guppy_results/
cat *.fastq > example.fq
mv [original_fastq_file_with_long_filename] example.fq
git clone https://github.com/kpalin/gcf52ref.git

Then you can create and activate the environment from the guppy.yml file:

mamba env create -f guppy.yml
conda activate guppy_cpg_snakemake

Run the snakemake

A Snakefile named Guppy contains all rules in the Snakemake workflow. Run the snakemake to create the output files:

cd ../
snakemake -s Guppy guppy_results/example_guppy-freq-perCG.tsv

This will produce two output files in the guppy_results directory:

chromosome    strand  start     end       read_name                               log_lik_ratio   log_lik_methylated    log_lik_unmethylated    num_calling_strands   num_motifs  sequence
NC_000913.3   +       3505711   3505712   8386596c-ff56-4032-b54e-8f062c194b16    -2.41           -2.41                 0.0                     1                     1           CG
NC_000913.3   +       3505714   3505715   8386596c-ff56-4032-b54e-8f062c194b16    -0.575          -0.676                -0.101                  1                     1           CG
NC_000913.3   +       3505726   3505727   8386596c-ff56-4032-b54e-8f062c194b16    1.49            -0.012                -1.51                   1                     1           CG
NC_000913.3   +       3505728   3505729   8386596c-ff56-4032-b54e-8f062c194b16    0.361           -0.155                -0.516                  1                     1           CG
NC_000913.3   +       3505745   3505746   8386596c-ff56-4032-b54e-8f062c194b16    -0.112          -0.359                -0.247                  1                     1           CG

Prepare the input file for combined model usage (optional)

You can also generate the per-read prediction output in a format that can be used in METEORE to generate a consensus prediction (see further below).

snakemake -s Guppy guppy_results/example_guppy-perRead-score.tsv

The output is in tsv format and contains the following fields: ID, Chr, Pos, Strand and Score.

Megalodon

We did not develop a snakemake pipeline for Megalodon as it can basecall, map, detect base modifications with Guppy (GPU on Linux) and Remora running backend, using a single command.

Please check out Megalodon GitHub Page and Megalodon's documentation for more details.

Modified base detection

Here we used the most recent modified basecalling models in Remora that provides the most accurate methylation result. See Remora for installation and usage.

We tested with Megalodon (v2.5.0), Guppy (v6.1.1, GPU on linux), Remora (v1.0.0). GPU resources are required.

megalodon data/example/ \
--outputs basecalls mappings mod_mappings per_read_mods mods \
--guppy-config dna_r9.4.1_450bps_hac.cfg \
--guppy-server-path <path_to_ont-guppy-gpu>/bin/guppy_basecall_server \
--reference data/ecoli_k12_mg1655.fasta \
--remora-modified-bases dna_r9.4.1_e8 hac 0.0.0 5mc CG 0 \
--overwrite --mod-motif m CG 0  --write-mods-text \
--devices "cuda:all" --processes 20

The megalodon command produces the megalodon_results directory containing logs, per-read modified base output per_read_modified_base_calls.txt, per-site modified base output modified_bases.5mC.bed.

Output files

We provide a bash script to generate the methylation frequency file and the input file for combined model usage, which are in the same format used in every method. Run the bash script with your preferred filename (eg. example) for the output files:

./script/megalodon.sh example

You will get the following files:

DeepMod

We did not develop a snakemake pipeline for DeepMod as it can be run with a single command.

Please check out DeepMod GitHub Page for installation instructions and usage.

# Set the working directory where you install DeepMod
python bin/DeepMod.py detect --wrkBase <path_to_METEORE>/data/example/ --Ref <path_to_METEORE>/data/ecoli_k12_mg1655.fasta --outFolder <path_to_METEORE>/deepmod_results --Base C --modfile train_mod/rnn_conmodC_P100wd21_f7ne1u0_4/mod_train_conmodC_P100wd21_f3ne1u0 --FileID example_deepmod --threads 10

You will find a deepmod_results output folder which contains a .done file and a example_deepmod result folder. Inside the example_deepmod folder, you will find two output files in a bed format mod_pos.NC_000913.3-.C.bed and mod_pos.NC_000913.3+.C.bed, which per-site data including genomic position of the CpG site, strand, base, coverage, methylation percentage and methylation coverage. The format of the output of DeepMod is described here.

We provide a R script to generate a methylation frequency file in the same format used in other methods. Run Rscript script/run_deepmod.R deepmod_output_4_plus_strand.bed deepmod_output_4_minus_strand.bed output_file.tsv

# Set the working directory where you install METEORE
Rscript script/run_deepmod.R deepmod_results/example_deepmod/mod_pos.NC_000913.3+.C.bed deepmod_results/example_deepmod/mod_pos.NC_000913.3-.C.bed example_deepmod-freq-perCG.tsv

The example_deepmod-freq-perCG.tsv output file contains the per-site data including the genomic position of the CpG site, methylation frequency and coverage. The methylation calls from both strands are merged into a single strand.

Note that DeepMod does not produce per-read predictions, so we could not produce an output file in the same format as those described above (.tsv) to be later combined to build a consensus.


Combined model (random forest) usage


We have trained Random Forest models that combine the outputs from two of the methods above to produce consensus predictions with improved accuracy. We also provide the scripts to train new models with two or more methods.

First, please make sure you install the required libraries in the Conda environment

pip install -r requirements.txt

Prepare the per-read score file from the methods you select

To make the predictions using the combination model (Here we use DeepSignal and Nanopolish as an example), you can generate the per-read prediction input file example_deepsignal-perRead-score.tsv and example_nanopolish-perRead-score.tsv as described before in our snakemake pipelines. The input (.tsv) file from each method must be formatted as below:

ID                                    Chr           Pos       Strand  Score
094dfe6b-23ed-4195-8876-805a399fade5  NC_000913.3   3499495   -       2.99
094dfe6b-23ed-4195-8876-805a399fade5  NC_000913.3   3499527   -       3.55
094dfe6b-23ed-4195-8876-805a399fade5  NC_000913.3   3499547   -       6.19
094dfe6b-23ed-4195-8876-805a399fade5  NC_000913.3   3499564   -       4.45

where the score is the significance score given by each method. Nanopolish and Guppy provide a log-likelihood ratio value per site and per read. Similarly, Tombo produces a significance value per site and per read. On the other hand, for DeepSignal we give the log-ratio of the probabilities for a site to be methylated over the probablity of being unmethylated for each read. Similarly, for Megalodon we used the difference of the log-probabilities to obtain a log-ratio.

Command

Given the already pre-trained model (models available in ./saved_models/), to run the model you need the following command:

python combination_model_prediction.py  -i [path to the tsv file storing the methods' names and the paths to the per-read score file] -m [model_to_use (default or optimized)] -o [output_file]

The options (-i, -m and -o) are explained below:

Inputs (-i)

The file for option -i contains the methods you choose to use (1st column) and the path to the respective per-read score files (2nd column):

deepsignal	./example_results/deepsignal_results/example_deepsignal-perRead-score.tsv
nanopolish	./example_results/nanopolish_results/example_nanopolish-perRead-score.tsv

Here we call this file model_content.tsv (provided as an example with the package) and use the results example_results/ produced by the method from the example dataset.

Models (-m)

You can choose to use either -m default or -m optimised:

# Use the default model
python combination_model_prediction.py -i model_content.tsv -m default -o example_deepsignal_nanopolish-default-model-perRead.tsv
# Use the optimized model
python combination_model_prediction.py -i model_content.tsv -m optimized -o example_deepsignal_nanopolish-optimized-model-perRead.tsv

Run with your own files using a different combination of tools

  1. If you want a different combination, e.g. deepsignal+guppy, you can replace the "nanopolish" line in the model_content.tsv file with guppy and its input file path. You can also add other methods paths as well in the same file.
  2. The order of method's names in the model_content.tsv file should be same as the order used to generate the combined model. For example, we provide the models with name 'rf_model_default_deepsignal_nanopolish.model' so the order in the model_content.tsv file should be deepsignal and then nanopolish and not the other way round. You can check the models we trained and the order of the names in the saved_models directory.

Output (-o)

Use -o option to write the per-read consensus prediction output to the specified file. All outputs are written into the directory called combined_model_results. New results from subsequent runs will be saved into the same output directory.

The output will contain per-read predictions for the reads common to both deepsignal and nanopolish method. For example, in the example_deepsignal-nanopolish-optimized-model-perRead.tsv

ID                                    Chr           Pos       Prediction  Prob_methylation
c3e4c0c6-11f9-4ecb-858e-877e81e31a0c  NC_000913.3   3508489   0           0.41759488451060933
c3e4c0c6-11f9-4ecb-858e-877e81e31a0c  NC_000913.3   3508467   0           0.41759488451060933
c3e4c0c6-11f9-4ecb-858e-877e81e31a0c  NC_000913.3   3508454   1           0.9565835858934412

The column prediction (0 refers to unmethylated and 1 refers to methylated) is based on a threshold of 0.5 score. That is, if the P(methylation) is <= 0.5, it is predicted as unmethylated (0), otherwise as methylated (1). Predictions for a different thresholds can be obtained by parsing the column Prob_methylation.

Per-site predictions

We also provide a Python script to convert the per-site predictions for each individual read (methylated / unmethylated) into per-site predictions at genome level (methylation frequency) by summarizing the predictions on individual reads from the model.

prediction_to_mod_frequency.py [-h] -i INPUT [-t THRESHOLD] -o OUTPUT

where the input is the file produced from the previous step, and the output is the per-site result at genome level. You can specify a threshold which applies on the Prob_methylation column in the per-read combined prediction file (see above) to define the site on the read as methylated or unmethylated.

Run python prediction_to_mod_frequency.py -h for help.

For example:

# You can specify a threshold when -t option is used
python prediction_to_mod_frequency.py -i combined_model_results/example_deepsignal_nanopolish-optimized-model-perRead.tsv -t 0.46 -o combined_model_results/example_deepsignal_nanopolish-optimized-model-freq-perCG.tsv
# Or you can use the default threshold of 0.5
python prediction_to_mod_frequency.py -i combined_model_results/example_deepsignal_nanopolish-default-model-perRead.tsv -o combined_model_results/example_deepsignal_nanopolish-default-model-freq-perCG.tsv

The per-site prediction output contains the following fields:

Chr           Pos     Unmodified_read Modified_read   Coverage  Methylation_frequency
NC_000913.3   3504995 6               10              16        0.625
NC_000913.3   3505005 1               15              16        0.9375
NC_000913.3   3505010 1               15              16        0.9375
NC_000913.3   3505034 3               13              16        0.8125

Train your own combination model

We provide the script to train a combined model from the per-read and per-site output from any number of methods (from 2 to 5). For instance, the command to train a model with 5 methods would be:

python combination_model_train.py  -d [path of deepsignal file] -n [path of nanopolish file] -g [path of guppy file] -m [path of megalodon file] -t [path of tombo file] -c [number of methods to combine together for training (range from 2-5)] -o [output_path_to_save_model]

The order of the methods in the command is not relevant, but it will determine how to specify the model when making predictions, as described above. For example, the command to train a model with DeepSignal, Nanopolish, and guppy, could take the following form:

python combination_model_train.py  -d deepsignal.tsv -n nanopolish.tsv -g guppy.tsv -c 3 -o output_models

This command will create an output directory called output_models, and the models named rf_model_default_deepsignal_nanopolish_guppy.model and rf_model_max_depth_3_n_estimator_10_deepsignal_nanopolish_guppy.model will be saved inside this directory. To run this new model, you will need to create a new file new_model_content of the form, e.g.:

deepsignal    deepsignal_test2.tsv
nanopolish    nanopolish_test2.tsv
guppy         guppy_test2.tsv

where the _test2.tsv files contain the reads and scores that you want to use for consensus prediction in the format

ID                                    Pos       Strand  Score
b9fdd6aa-ba93-4424-8f4b-c632e4d16d2e  1817032    -      2.45
...

The command to obtain the consensus predictions could be of the form

python combination_model_prediction.py  -i new_model_content.tsv -m optimized -o output2

This command will use the model rf_model_max_depth_3_n_estimator_10_deepsignal_nanopolish_guppy.model to predict on the sites and reads common in the input files deepsignal_test2.tsv, nanopolish_test2.tsv, and guppy_test2.tsv.


Combined model (multiple linear regression) usage


An alternative combination model based on multiple linear regression (REG) is available through the meteore_reg.py script. This is a standalone script that first builds a model training data and then applies it to unknown test data generated from the same tools. It's a simple script to run but requires upwards of 6 GB of RAM, and approximately five minutes for the data sets used in our publication.

Training datasets

The METEORE REG model was trained on the E.coli datasets with 11 different mixture sets of methylated and unmethylated reads (0%, 10%,..., 90%, 100% of methylated reads), each set with ~2400 reads. Here we provide the training datasets of the three best performing tools (Nanopolish, DeepSignal, Megalodon) where you can download them from figshare. The minimum needed for training is to have two mixture proportions that are known.

Command and options

The example usage below would train the REG model on Nanopolish and DeepSignal outputs from the "mix1" dataset, and apply this model to the data in "mix2" for the same tools. Inputs and outputs rely on the ordering of arguments to remain consistent.

python meteore_reg.py --train set1_nanopolish.tsv set1_deepsignal.tsv \
      --tool_names nanopolish deepsignal --train_desc mix1 --test_desc mix2 \
      --test set2_nanopolish.tsv set2_deepsignal.tsv --testIsTraining \
      --trunc_min -50 -5 --trunc_max 50 5 --filter 0.1

The following command-line options are available:

python meteore_reg.py -h
usage: meteore_reg.py [-h] --train TRAIN [TRAIN ...] --train_desc TRAIN_DESC [--test [TEST [TEST ...]]]
                      [--test_desc TEST_DESC] --tool_names TOOL_NAMES [TOOL_NAMES ...]
                      [--trunc_max [TRUNC_MAX [TRUNC_MAX ...]]] [--trunc_min [TRUNC_MIN [TRUNC_MIN ...]]] [--debug]
                      [--inc_pred] [--filter FILTER] [--testIsTraining] [--no_scaling]

optional arguments:
  -h, --help            show this help message and exit
  --train TRAIN [TRAIN ...]
                        Path to training data
  --train_desc TRAIN_DESC
                        Describe the training data set
  --test [TEST [TEST ...]]
                        Path to test data. Must be omitted or same number of tool files as as --train (and in the same order)
  --test_desc TEST_DESC
                        Describe the test data set
  --tool_names TOOL_NAMES [TOOL_NAMES ...]
                        Names of each tool in the same order as given for --train (and --test if provided)
  --trunc_max [TRUNC_MAX [TRUNC_MAX ...]]
                        Truncate maximum values. Must match order of training data
  --trunc_min [TRUNC_MIN [TRUNC_MIN ...]]
                        Truncate minimum values. Must match order of training data
  --debug               Turn on debugging output (STDERR)
  --inc_pred            Include tool prediction categories in regression. NOTE: must be present in all input files
  --filter FILTER       Percentage of total reads to remove around mean score, default -1 (disabled)
  --testIsTraining      Use functions for handling training data for the test parameters
  --no_scaling          Disable score scaling

Regression specific options

Several arguments are specific to the REG model and deserve further discussion.

--trunc_min INT ... --trunc_max INT ...

The trunc_min and trunc_max parameters can be used to restrict the useful range of log difference scores on a per-tool basis. Most of the existing methylation calling tools are well behaved but Nanopolish in particular can output extreme scores. By default REG scales each tool's scores linearly between 0 and 1. Extreme positive or negative values will reduce the contribution by scores that are less extreme but still meaningful indictors of methylation status. Values less than trunc_min INT for tool N will be replaced with the user provided score, and vice versa for the scores larger than trunc_max INT.

--inc_pred

Enables methylation labels to be considered as a contributor to the prediction model. This requires the test data to have the same format as the test data, with "group" and "label" columns. In some case it can improve prediction if the tool makes good classifications in spite of oddly distributed scores.

--filter FLOAT 0..1

Filters out a fixed proportion of reads, evenly distributed either side of the scaling mid-point. If scaling is disabled then the mean score is considered the mid-point.

--testIsTraining

Informs REG to use the input format corresponding to training data for test data as well.

--no_scaling

Disables MinMax (0..1) scaling. Not recommended in general but might be necessary for tools with wildly different scoring systems.

Example run report

python meteore_reg.py --train set1_deepsignal.tsv.gz set1_nanopolish_updated.tsv.gz --tool_names deepsignal nanopolish \
  --train_desc mix1 --test set2_deepsignal.tsv.gz set2_nanopolish.tsv.gz --test_desc mix2 --testIsTraining \
  --trunc_min -5 -50 --trunc_max 5 50
Filter  Ignored deepsignal_pass deepsignal_prop nanopolish_pass nanopolish_prop Model_succ      Model_prop
-1      0       2981104 0.9105415715991803      2729348 0.8336457961081127      3029092 0.9251989163070138
Filter  Ignored deepsignal_pass deepsignal_prop nanopolish_pass nanopolish_prop Model_succ      Model_prop
-1      0       5728226 0.9151707857313122      5332267 0.8519103436420188      5778967 0.923277428318178

REG reports a standard summary of the training and test accuracy.

Filter is the proportion of (presumably low-accuracy) reads to be removed, with -1 indicated it is disabled. Likewise the Ignored column is the corresponding number of reads removed by filtering.

Each tool is then reported with the number of site observations that match the expected methylation status (not accurate for hemi-methylated sites), and the overall proportion accuracy.

Finally the combined regression model numbers are reported for training data where labels are available to ascertain truth.