Genotyping imputation : Pipeline V1.0
A nextflow pipeline to perform the genotyping imputation of a target dataset
Description
This pipeline performs the imputation of one or several target datasets provided in a standard input format.
Here is a summary of the method :
- Preprocessing of data : the nextflow script Preparation.nf creates a directory "files/" with all the dependencies.
- First step : ancestry estimation of the samples of the target dataset, using the admixture tool with the HapMap dataset as reference.
- Second step : series of SNP filters and quality checks on the target dataset before the imputation step.
- Third step : VCF production.
- Last step : phasing and imputation.
See the Usage section to test the full pipeline with your target dataset. A rough sketch of the main commands behind these steps is shown below.
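As an illustration only, the core commands behind these steps look roughly like the sketch below. All file names (merged_hapmap, my_target, genetic_map_hg38_withX.txt.gz, ref_chr1.m3vcf.gz) are hypothetical placeholders, and the exact options used by the pipeline may differ:

```bash
# First step (sketch) : ancestry estimation with ADMIXTURE, here with K=3
# ancestral populations, on the target merged with the HapMap reference
admixture --cv merged_hapmap.bed 3

# Second step (sketch) : SNP filtering and QC with plink, using the
# pipeline's default thresholds (see the Parameters section)
plink --bfile my_target --geno 0.03 --maf 0.01 --hwe 1e-8 --make-bed --out my_target_qc

# Third step (sketch) : VCF production, one file per chromosome
plink --bfile my_target_qc --chr 1 --recode vcf bgz --out my_target_chr1

# Last step (sketch) : phasing with eagle, then imputation with minimac4
eagle --vcf=my_target_chr1.vcf.gz --geneticMapFile=genetic_map_hg38_withX.txt.gz --outPrefix=my_target_chr1_phased
minimac4 --refHaps ref_chr1.m3vcf.gz --haps my_target_chr1_phased.vcf.gz --prefix my_target_chr1_imputed
```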
Dependencies
The pipeline works under Linux distributions.
- This pipeline is based on nextflow. As we have several nextflow pipelines, we have centralized the common information in the IARC-nf repository. Please read it carefully as it contains essential information for the installation, basic usage and configuration of nextflow and our pipelines.
- External software:
- LiftOver : conda install ucsc-liftover
- Plink (PLINK v1.90b6.12 64-bit (28 Oct 2019)) : conda install plink
- Admixture (ADMIXTURE Version 1.3.0) : conda install admixture
- Perl : conda install perl
- Term::ReadKey module : conda install perl-termreadkey
- BCFtools : conda install bcftools
- eagle 2.4.1 : See instructions
- minimac4 : conda install cmake ; pip install cget ; git clone https://github.com/statgen/Minimac4.git ; cd Minimac4 ; bash install.sh
- Samtools : conda install samtools
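For convenience, the conda-installable tools listed above can be grouped into a single environment; the channel choices (bioconda, conda-forge) and the environment name are assumptions of this sketch:

```bash
# Hypothetical one-shot environment for the conda-installable dependencies
conda create -n imputation -c bioconda -c conda-forge \
    ucsc-liftover plink admixture perl perl-termreadkey bcftools samtools
conda activate imputation
```

eagle and minimac4 still need to be installed separately, as described above.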
- Files to download :
- HapMap dataset : used as the reference dataset for admixture
- HGDP dataset : used as the test dataset. Use toMap.py & toPed.py in the 'conversion' directory to convert the files to the .map/.ped plink format, then convert this output to the .bed/.bim/.fam plink format with the plink command line (see the sketch after this list) and run the imputation pipeline.
- Perl tool : HRC-1000G-check-bim-NoReadKey.pl & 1000GP_Phase3_combined.legend
- LiftOver tool : hg19ToHg38.over.chain & hg18ToHg38.over.chain
- Preparation dataset tool : pone.0002551.s003.xls (convert it to .csv format)
- Admixture tool : relationships_w_pops_121708.txt
- CheckVCF, Fasta file in V37 & Fasta file in V38
- 1000G reference in hg38 with its documentation
- Create legend, bcf & m3vcf files for the reference (see the sketch after this list)
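As a rough illustration of the two conversion steps mentioned above (HGDP test dataset and reference file creation), with hypothetical file names (HGDP, ref_chr1.vcf.gz); the exact options used by the pipeline may differ:

```bash
# Convert the .map/.ped output of toMap.py & toPed.py to .bed/.bim/.fam
plink --file HGDP --make-bed --out HGDP

# Create the bcf (and its index) and the m3vcf file for one reference chromosome
bcftools view ref_chr1.vcf.gz -Ob -o ref_chr1.bcf
bcftools index ref_chr1.bcf
Minimac3 --processReference --refHaps ref_chr1.vcf.gz --prefix ref_chr1
```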
- Other things to know :
- See the Usage part to create the environment to run the pipeline. All the necessary dependencies are downloaded by the script Preparation.nf. To run it, you'll need to install the following software : in2csv (1.0.5), liftOver, plink, Minimac3 (2.0.1) & bcftools.
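For example, the preparation spreadsheet listed above can be converted to .csv with in2csv (from csvkit):

```bash
# Convert the preparation spreadsheet to .csv with in2csv
in2csv pone.0002551.s003.xls > pone.0002551.s003.csv
```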
You can avoid installing all the external software of the main script by only installing Docker. See the IARC-nf repository for more information.
Input
Type | Description |
---|---|
Plink dataset | The target dataset to be analysed, composed of the following files : .bed, .bim & .fam |
Input environment | Path to your input directory |
Parameters
- Mandatory
Name | Example value | Description |
---|---|---|
--target | my_target | Prefix of the target dataset, linking to the plink files .bed/.bim/.fam |
--input | user/main_data/ | Path of the main directory containing the 2 directories my_target/ and files/ |
--output | user/my_result/ | Path of the main directory where you want to place your results |
- Optional
Name | Default value | Description |
---|---|---|
--script | my/directory/script/bin | Path of the bin script directory, used to run the auxiliary programs from the pipeline |
--geno1 | 0.03 | First genotyping call rate plink threshold, applied to the full target dataset |
--geno2 | 0.03 | Second genotyping call rate plink threshold, applied to the target dataset divided by population |
--maf | 0.01 | Minor allele frequency plink threshold, applied to the full target dataset |
--pihat | 0.185 | Minimum pi_hat value used for the relatedness test; 0.185 is halfway between the expected IBD for third- and second-degree relatives (see the example after this table) |
--hwe | 1e-8 | Hardy-Weinberg Equilibrium plink p-value threshold |
--legend | ALL.chr_GRCh38.genotypes.20170504.legend | File to use as .legend |
--fasta | GRCh38_full_analysis_set_plus_decoy_hla.fa | File to use as fasta reference |
--chain | hg18ToHg38.over.chain | Chain file to use for the liftOver conversion |
--VCFref | my/directory/ref/vcf/ | Directory to use as VCF reference |
--BCFref | my/directory/ref/bcf/ | Directory to use as BCF reference |
--M3VCFref | my/directory/ref/m3vcf/ | Directory to use as M3VCF reference |
--conversion | hg38/hg18/hg19 | Genome build of the input data; data in hg18 or hg19 are converted to the hg38 version of the genome. Default value is hg38 |
--cloud | on | Option to run the imputation on the Michigan and/or TOPMed imputation servers (see Usage) |
--token_Michighan | path/to/my_token.txt | File containing your access token for the Michigan Imputation Server, used with --cloud |
--token_TOPMed | path/to/my_token.txt | File containing your access token for the TOPMed Imputation Server, used with --cloud |
--QC_cloud | my/directory/download_imputation_server | Directory containing the results downloaded from the imputation server, used to run the end of the QC analysis |
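For reference, the relatedness test described for --pihat corresponds to a plink IBD estimation along the following lines; my_target and the output prefix are placeholders, and the exact options used by the pipeline may differ:

```bash
# Estimate pairwise IBD and report only pairs with pi_hat >= 0.185
plink --bfile my_target --genome --min 0.185 --out relatedness_check
```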
- Flags
Flags are special parameters without value.
Name | Description |
---|---|
--help | Display help |
Usage
- Prepare the environment to run the imputation pipeline.
```bash
mkdir data
cd data
nextflow run IARCbioinfo/Imputation-nf/bin/Preparation.nf --out /data/
```
- Paste the .bed/.bim/.fam plink target files in a directory, and place that directory in your "data/" directory. The plink files and the directory must be named with the same pattern, as in the following example : data/target/target{.bed,.bim,.fam}. You now have 2 directories in your "data/" directory, as illustrated below :
- data/my_target/ : with the plink target files (my_target.bed, my_target.bim, my_target.fam).
- data/files/ : with all the dependencies.
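The expected layout thus looks as follows (my_target being a placeholder for your own dataset prefix):

```
data/
├── my_target/
│   ├── my_target.bed
│   ├── my_target.bim
│   └── my_target.fam
└── files/
```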
- Run the imputation pipeline.
```bash
nextflow run IARCbioinfo/Imputation.nf --target my_target --input /data/ --output /results/ -r v1.0 -profile singularity
```
- If you want to run the imputation on one of the imputation servers (Michigan and/or TOPMed), you need to write your access token in a file and give it as an argument. For example :
```bash
nextflow run IARCbioinfo/Imputation.nf --target my_target --input /data/ --output /results/ --cloud on --token_Michighan /folder/my_token_Michighan.txt --token_TOPMed /folder/my_token_TOPMed.txt -r v1.0 -profile singularity
```
Once your imputation data is downloaded, you can run the end of the QC analysis :
```bash
nextflow run IARCbioinfo/Imputation.nf --target my_target --input /data/ --output /results/ --QC_cloud /downloaded_imputation_server_file/ -r v1.0 -profile singularity
```
Output
Type | Description |
---|---|
output1 | ...... |
output2 | ...... |
Detailed description (optional section)
...
Directed Acyclic Graph
Contributions
Name | Email | Description |
---|---|---|
Gabriel Aurélie | gabriela@students.iarc.fr | Developer to contact for support |
Lipinski Boris | LipinskiB@students.iarc.fr / boris.lipinski@etu.univ-lyon1.fr | Developer to contact for support |