H3AGWAS: dataset and command line examples

Examples and command lines for running the h3agwas pipeline

These are very simple examples to show how the pipeline works and to get you started. They cover the basic operation of some of the key sub-workflows only; we can't show all options here -- see the main documentation for more.

1. Set-up

A dataset has been built using 1000 Genomes Project data, selecting 50,000 SNPs found on the H3Africa Custom Chip Array and 500 individuals. A description and explanation can be found here, and the bash script here
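If you want to sanity-check the download (assuming the PLINK files sit in data/array_plk/ as used in the examples below), counting the lines of the .bim and .fam files is enough:

# one SNP per .bim line, one individual per .fam line
wc -l data/array_plk/array.bim   # expect 50000
wc -l data/array_plk/array.fam   # expect 500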

Information to run locally can be found here

1.1 Data directory:

1.2 To run these examples

  1. Fetch or update the workflow itself: nextflow pull h3abionet/h3agwas
  2. Clone the example repository: git clone https://github.com/h3abionet/h3agwas-examples
  3. Change your working directory so you are in the new directory: cd h3agwas-examples
  4. You need to have Singularity or Docker installed, or install all the software tools yourself (a quick check is shown below)
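Before running anything, you can quickly confirm the prerequisites are in place (these are the tools' standard version commands, nothing h3agwas-specific):

# check that Nextflow and a container engine are available
nextflow -version
singularity --version   # if you plan to use -profile singularity
docker --version        # if you plan to use -profile docker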

2. Quality control

The first example takes raw PLINK data as input (that is, data produced after genotype calling) and performs quality control. The qc workflow is used. The key parameters are shown in the example below.

In this example, we run QC on the raw data found in data/array_plk/array

nextflow run h3abionet/h3agwas/qc/main.nf --input_dir data/array_plk  --input_pat array --output_dir qc  --output array_qc \
 --phenotype data/pheno/pheno_test.all --pheno_col phenoqc_ql \
 --case_control data/pheno/pheno_test.all --case_control_col Sex \
 --batch data/pheno/pheno_test.all --batch_col batch \
 -profile singularity -resume

The -profile option asks for the use of Singularity. The first time you run the workflow, Nextflow will automatically download all the Singularity images -- this may take a few minutes (it only happens the first time, and there's nothing special you have to do other than be patient). To use Docker instead of Singularity, say -profile docker. To use both Singularity and a scheduler like SLURM, say, in this order, -profile singularity,slurm.
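To make the three variants concrete, here they are side by side; '...' stands for the QC options shown in the command above:

nextflow run h3abionet/h3agwas/qc/main.nf ... -profile singularity        # Singularity images
nextflow run h3abionet/h3agwas/qc/main.nf ... -profile docker             # Docker images
nextflow run h3abionet/h3agwas/qc/main.nf ... -profile singularity,slurm  # Singularity + SLURM scheduler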

In the QC we look for batch effects (broadly understood) in three ways (you don't have to do such a complex analysis): we check for batch effects against the phenoqc_ql phenotype, the sex of the individuals, and the batch from which each sample came. The default QC cut-offs for missingness found in the nextflow.config file are used.
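You can confirm that the phenotype file really contains the columns used here (phenoqc_ql, Sex, batch); the exact set of columns will depend on your own file:

# show the column names of the phenotype file
head -1 data/pheno/pheno_test.all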

Output goes to the qc/ directory, with array_qc as the base for file names.

The PDF file gives the report of the QC. Note that because this is such a small and artificial data set, some of the pictures will look very odd.

3. Association pipeline

The pipeline offers multiple ways of testing for association. Here are just a few examples:

Here's a simple example, assuming you've done the QC example above and haven't changed working directory. We take the QCed data as genotype input and the data in data/pheno/pheno_test.all as the source of the phenotype data. There are two phenotype columns of interest (pheno_qt1, pheno_qt2) and we'll test for associations against both. We specify which directory and base file name to use for output. We do linear regression and use Singularity.

nextflow run h3abionet/h3agwas/assoc/main.nf --input_dir qc/ --input_pat array_qc --data data/pheno/pheno_test.all --pheno pheno_qt1,pheno_qt2 --output_dir assoc1 --output qt12 --linear 1 -profile singularity

Now we do a more complex example, adding to this:

nextflow run h3abionet/h3agwas/assoc/main.nf \
 --input_dir data/imputed/ --input_pat imput_data \
 --data data/pheno/pheno_test.all --pheno pheno_qt1,pheno_qt2 \
 --output_dir assoc --output assoc \
 --gemma 1 --sample_snps_rel 1 --linear 1 \
  -profile singularity

Output:

In the next example, we have separate BGEN files as input -- we can analyse them using the --list_bgen option. Here we also use BOLT-LMM, fastGWA, REGENIE and SAIGE, with some advanced options for these tools.

ls data/imputed/bgen_chro/*.bgen > listbgen
nextflow run h3abionet/h3agwas/assoc/main.nf --input_dir data/imputed/ --input_pat imput_data \
 --data data/pheno/pheno_test.all --pheno pheno_qt1,pheno_qt2 \
 --output_dir assoc_listbgen --output assoc \
 --boltlmm 1 --sample_snps_rel 1 --regenie 1 --fastgwa 1 --grm_nbpart 2 \
  -profile singularity \
 --list_bgen listbgen --bgen_sample data/imputed/bgen/out.sample --saige 1 -resume

Input types for different tools

The table below shows the different data types, the information they store, and how they are referenced.

| | plink | vcf | bgen | impute 2 |
| --- | --- | --- | --- | --- |
| information stored | genotype | dosage | dosage | dosage |
| option | --input_dir / --input_pat | --list_vcf | --list_bgen / --bgen / --bgen_sample | --bolt_impute2filelist / --bolt_impute2fidiid |
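For the list-based inputs, a list file is simply one file path per line, built however you like; for example (such list files are also used in sections 3 and 9):

# one path per line
ls data/imputed/vcf/*.vcf.gz     > utils/listvcf    # for --list_vcf
ls data/imputed/bgen_chro/*.bgen > utils/listbgen   # for --list_bgen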

The table below shows the data types each supported tool accepts and the option used to activate it.

| Software | plink | vcf | bgen | impute 2 | option to activate |
| --- | --- | --- | --- | --- | --- |
| gemma | yes | no | no | no | --gemma / --gemma_gxe |
| plink association | yes | no | no | no | --[linear/logistic/assoc/fisher] 1 / --plink_gxe |
| gcta/fastGWA | yes | no | yes | no | --fastgwa 1 |
| saige | yes | yes | yes | no | --saige 1 |
| bolt-LMM | yes | no | yes | yes | --boltlmm 1 |
| fast-lmm | yes | no | no | no | --fastlmm 1 |
| regenie | yes | no | yes | no | --regenie 1 |

4. Meta Analysis

Build an input file

There are some pre-computed GEMMA output files that we'll use in this example:

ls data/summarystat/*.gemma

We create a file input_meta.csv with the file information:

echo "rsID,Chro,Pos,A1,A2,Beta,Se,Pval,N,freqA1,direction,Imputed,Sep,File,Ncount" > utils/input_meta.csv
for File in `ls data/summarystat/{AFR,AMR,EAS,EUR,SAS}*.gemma`; do
   echo "rs,chr,ps,allele0,allele1,beta,se,p_wald,NA,af,NA,NA,TAB,$File,2500" >>  utils/input_meta.csv
done
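A quick check of the generated file shows the header plus one line per summary-statistics file (the File column holds whatever path ls found):

head -2 utils/input_meta.csv
# rsID,Chro,Pos,A1,A2,Beta,Se,Pval,N,freqA1,direction,Imputed,Sep,File,Ncount
# rs,chr,ps,allele0,allele1,beta,se,p_wald,NA,af,NA,NA,TAB,<first .gemma file>,2500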

Run the meta-analysis pipeline


nextflow run h3abionet/h3agwas/meta/meta-assoc.nf   --metal 1 --gwama 1 --metasoft 1 --mrmega 1 --plink  1  --file_config utils/input_meta.csv -resume -profile singularity --output_dir meta

Options for the meta-analysis software:

| Software | ma_genomic_cont | ma_inv_var_weigth | ma_overlap_sample | ma_random_effect |
| --- | --- | --- | --- | --- |
| explanation | genomic control | inverse variance weighting | sample overlap | random effect |
| default | 0 | 0 | 0 | 0 |
| metal | yes | yes | yes | no |
| gwama | yes | default | no | yes |
| Mr Mega | yes | no | no | no |
| plink | no | yes | no | no |
| Metasoft | no | no | no | default |

1 'weighted-z' requests weighted Z-score-based p-values (as computed by the Abecasis Lab's METAL software)

5. Fine-mapping

Fine-mapping can be run on full summary statistics or on specific windows, using two different scripts. There is also a script which just runs GCTA.

There are a number of general options.

First, we expect the summary statistics file(s) to have column headers. There are options that associate the column headers with the input values that the workflow expects.
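For instance, the GEMMA files used in these examples carry the column names that the --head_* options below refer to (chr, rs, ps, allele1, allele0, af, beta, se, p_wald); a one-liner confirms it:

# show the header line of the summary-statistics file
head -1 data/summarystat/all_pheno.gemma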

5.1 Full summary statistics

The pipeline applies clumping to define significant positions and, for each window, runs various fine-mapping tools.

nextflow run  h3abionet/h3agwas/finemapping/main.nf --head_pval p_wald --head_bp ps --head_chr chr --head_rs rs --head_beta beta --head_se se --head_A1 allele1 --head_A2 allele0 --list_phenogc "Type 2 diabetes" --input_dir  data/imputed/  --input_pat imput_data --file_gwas data/summarystat/all_pheno.gemma  --output_dir finemapping_pheno1 --output finemapping_pheno1 -resume  -profile singularity

The output folder contains, for each independent SNP, a folder with results:

5.2 Specific windows

The algorithm is the same as before, but the user must specify the chromosome (--chro) and the position range (--begin_seq, --end_seq); a sketch for deriving the range from a lead SNP follows the command.

nextflow run h3abionet/h3agwas/finemapping/finemap_region.nf  --head_pval p_wald --head_bp ps --head_chr chr --head_rs rs --head_beta beta --head_se se --head_A1 allele1 --head_A2 allele0 --list_pheno "Type 2 diabetes" --input_dir  data/imputed/  --input_pat imput_data --file_gwas data/summarystat/all_pheno.gemma  --output_dir finemapping_pheno1_wind --output finemapping_pheno1 -resume  -profile singularity --begin_seq 112178657 --end_seq 113178657 --chro 10
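Rather than hard-coding the window, you can derive it from a lead SNP. A minimal sketch, assuming a hypothetical lead position at 10:112678657 (taking +/- 500 kb reproduces exactly the window used in the command above):

# build --begin_seq/--end_seq as +/- 500 kb around a lead SNP
CHRO=10
LEAD=112678657
BEGIN=$((LEAD - 500000))   # 112178657
END=$((LEAD + 500000))     # 113178657
echo "--chro $CHRO --begin_seq $BEGIN --end_seq $END"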

5.3 GCTA-COJO

The cojo-assoc.nf pipeline performs GCTA-COJO conditional and joint analysis using summary data, accepting two types of input:

The command to execute:

nextflow run h3abionet/h3agwas/finemapping/cojo-assoc.nf --head_pval p_wald --head_bp ps --head_chr chr --head_rs rs --head_beta beta --head_se se --head_A1 allele1 --head_A2 allele0 --input_dir data/imputed/  --input_pat imput_data  --output_dir cojo --data data/pheno/pheno_test.all --pheno pheno_qt1 --file_gwas data/summarystat/all_pheno.gemma  -resume   -profile singularity --cojo_top_snps_chro 1

6. Annotating data

Warning (AWS): this workflow does not work on AWS at the moment.

The following inputs are required:

nextflow run  h3abionet/h3agwas/utils/annotation/main.nf --head_pval p_wald --head_bp ps --head_chr chr --head_rs rs --head_beta beta --head_se se --head_A1 allele1 --head_A2 allele0 --input_dir data/imputed/  --input_pat imput_data --file_gwas data/summarystat/all_pheno.gemma  --output_dir annotation --list_rs "2:45832137:A:G,1:117539108:G:T" --data data/pheno/pheno_test.all --pheno pheno_qt1  -resume  -profile singularity --loczm_bin  "<full path to your locusZoom>"

7. Simulating and building phenotypes

The algorithm uses the GWAS Catalog (by default) and lead positions to define the positions used by GCTA to build a phenotype. The required input is a list of genotype positions.

7.1 Basic operation

Key options:

For our example, we construct a list of positions. For real work your data should be in a similar format:

awk '{print $1"\t"$4}' data/array_plk/array.bim  > utils/list_posarray
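The resulting file holds one chromosome/position pair per line (fields 1 and 4 of a PLINK .bim file are the chromosome and base-pair position):

# inspect the first few positions
head -3 utils/list_posarray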

We can then execute the workflow:

nextflow run h3abionet/h3agwas/utils/build_example_data/main.nf -profile singularity   --pos_allgeno utils/list_posarray -resume --nb_snp 3 --output_dir simul_gcta_main

Output of the pipeline:

7.2 Other examples

7.3 Simulation using GCTA and genotype data

The algorithm is similar to the above but needs genotype data in PLINK format:

nextflow run h3abionet/h3agwas/utils/build_example_data/simul-assoc_gcta.nf -profile singularity  --input_dir data/imputed/  --input_pat  imput_data --output_dir simul_gcta

7.4 Simulation using random positions and phenosim

phenosim selects random positions in the genome, defines effects for those positions, and runs GEMMA or BOLT-LMM:

nextflow run h3abionet/h3agwas/utils/build_example_data/simul-assoc_phenosim.nf -profile singularity  --ph_normalise 0 --input_dir data/imputed/ --input_pat  imput_data --gemma 1

Output:

8. Conditional analysis using gemma

To perform a conditional analysis using GEMMA, we use some SNPs (positions) as covariates of the phenotype; the workflow tests whether the specified position (--pos_ref) and the conditioning positions (--pos_cond) are linked or independent.

The arguments are the same as for a GWAS using GEMMA, but you need to add:

nextflow run  h3abionet/h3agwas/finemapping/cond-assoc.nf --input_dir data/imputed/  --input_pat imput_data --output_dir cond --data data/pheno/pheno_test.all --pheno pheno_qt1  -profile singularity  -resume  --chro_cond 17  --pos_ref 78562921 --pos_cond 78494969,78502076,78534842  --sample_snps_rel=1

Output:

"rscond","chr","rs","ps","n_miss","allele1","allele0","af","beta","se","logl_H1","l_remle","p_wald","Num","CHR_A","BP_A","R2"
"Init",17,"17:78562921:G:A",78562921,0,"A","G",0.17,8.987846,1.361653,-2108.394,15.37693,1.050086e-10,1,NA,NA,1
"17:78494969:C:T",17,"17:78562921:G:A",78562921,0,"A","G",0.17,8.512503,2.45803,-2104.635,15.16254,0.0005800203,2,17,78494969,0.700696
"17:78534842:C:T",17,"17:78562921:G:A",78562921,0,"A","G",0.17,9.080245,1.380124,-2104.504,15.93441,1.200983e-10,3,17,78534842,0.00910659
"17:78502076:T:G",17,"17:78562921:G:A",78562921,0,"A","G",0.17,8.972201,1.362844,-2104.497,15.95574,1.170588e-10,4,17,78502076,0.0037796
"Merge",17,"17:78562921:G:A",78562921,0,"A","G",0.17,8.696802,2.570401,-2096.868,16.26486,0.0007726084,5,NA,NA,NA
* rscond: the rs used for conditioning
* Init: GEMMA run without any conditioning positions
* Merge: all conditioning positions used together
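To see at a glance which conditioning SNPs are in LD with the reference SNP, filter on the R2 column (field 17). A minimal sketch, assuming the table above was saved as cond/out.csv -- the actual output file name depends on your run:

# print rscond and R2 where R2 > 0.1 (the Init row compares the SNP with itself, so R2 = 1)
awk -F',' 'NR > 1 && $17 != "NA" && $17+0 > 0.1 {print $1, $17}' cond/out.csv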

9. Format summary statistics, prepare data for imputation, or format data after imputation

9.1 Format summary statistics

The format_gwasfile.nf script formats summary statistics and replaces header information.

Input:

Change the headers in a file:

nextflow run  h3abionet/h3agwas/formatdata/format_gwasfile.nf --head_pval p_wald --head_bp ps --head_chr chr --head_rs rs --head_beta beta --head_se se --head_A1 allele1 --head_A2 allele0 --file_gwas data/summarystat/all_pheno.gemma  --output_dir format_assoc   -resume --headnew_pval p --headnew_bp bp --headnew_chr CHR --headnew_rs SNP --headnew_beta beta --headnew_se se --headnew_A1 allele1 --headnew_A2 allele0 --file_ref_gzip data/utils/all_rsinfo.init.gz --input_dir data/imputed/ --input_pat imput_data -profile singularity

Add rsids to summary statistics:

nextflow run  h3abionet/h3agwas/formatdata/format_gwasfile.nf --head_pval p_wald  --head_chr chr --head_bp ps --head_beta beta --head_se se --head_A1 allele1 --head_A2 allele0 --file_gwas data/summarystat/all_pheno.gemma  --output_dir  ./ -resume  --file_ref_gzip data/utils/all_rsinfo.init.gz -profile singularity --output all_pheno.gemma.addrs -r dev

Add chromosome and position based on rsid:

nextflow run  h3abionet/h3agwas/formatdata/format_gwasfile.nf --head_pval p_wald   --head_beta beta --head_se se --head_A1 allele1 --head_A2 allele0 --file_gwas data/summarystat/all_pheno.gemma.addrs --output_dir  ./ -resume  --file_ref_gzip data/utils/all_rsinfo.init.gz --input_dir data/imputed/ --input_pat imput_data -profile slurmSingularity --head_out out_chrbp.gemma --head_rs rs --output all_pheno.gemma.addchrps -r dev

Format as GWAS Catalog output:

N=500
nextflow run  h3abionet/h3agwas/formatdata/format_gwasfile.nf --head_pval p_wald --head_bp ps --head_chr chr  --head_A1 allele1 --head_A2 allele0 --file_gwas  data/summarystat/all_pheno.gemma  -resume --file_ref_gzip data/utils/all_rsinfo.init.gz  -profile singularity  --out_gc 1 --output_dir format_gc --output format_gc --N_value $N --sep TAB --head_beta beta --head_se se --head_freq af 

9.2 Prepare data for imputation

The plk_in_vcf_imp.nf script takes a PLINK file as input and prepares the data for imputation:

mkdir -p utils_data/
cd utils_data/
wget -c http://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
cd ../
nextflow run h3abionet/h3agwas/formatdata/plk_in_vcf_imp.nf -profile slurm  --input_dir='data/array_plk/' --input_pat='array_qc' --output_dir='vcf_output' --output='data_qc' --file_ref_gzip="data/utils/all_rsinfo.init.gz" --reffasta="utils_data/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz"

9.3 Format VCF files after imputation to PLINK format

9.3.1 Example from the Sanger imputation panel

 ls data/imputed/vcf/*.vcf.gz  > utils/listvcf
 mkdir -p utils_data/
 cd utils_data/
 wget -c http://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
 cd ../
 nextflow run h3abionet/h3agwas/formatdata/vcf_in_plink.nf --file_listvcf utils/listvcf --output_pat  kgp_imputed --output_dir plink_imputed/   --reffasta utils_data/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz  -profile singularity  --file_ref_gzip data/utils/all_rsinfo.init.gz

9.3.2 Example from the Michigan Imputation Server with a genetic map

nextflow run h3abionet/h3agwas/formatdata/vcf_in_plink.nf --file_listvcf $ListVCF --min_scoreinfo 0.4 -profile slurm --output_pat 1000G_imp_mich --output_dir 1000G_imp_mich --genetic_maps genetic_map_hg19.txt --plink_mem_req 10GB -resume --big_time 1000h  --score_imp R2 --other_mem_req 20GB --reffasta hs37d5.fa.gz --unzip_zip 1 --unzip_password "xxx" --statfreq_vcf "%AF"

Output:

Look at this other example

9.4 Format VCF files in BIMBAM format

BIMBAM files contain dosage data used, for instance, by GEMMA. Our pipeline does not itself use BIMBAM in this case, but some users may find the conversion useful. The workflow uses QCTOOL (v2) and bcftools for the conversion:

nextflow run h3abionet/h3agwas/formatdata/vcf_in_bimbam.nf --file_listvcf utils/listvcf  --output_pat  kgp_imputed --output_dir bimbam/   -profile singularity

9.5 Format VCF files in BGEN format

The BGEN format is a dosage format used by fastGWA, BOLT-LMM and REGENIE.

Input option:

9.5.1 Convert each VCF file independently

Each VCF file is filtered independently; the files are then merged and formatted in BGEN format:

nextflow run h3abionet/h3agwas/formatdata/vcf_in_bgen.nf --file_listvcf utils/listvcf --output_pat  exampledata2_imp --output_dir ./bgen -resume -profile singularity

9.5.2 Merge VCF files and reformat

Each VCF is filtered independently; the files are merged and the final file is reformatted:

nextflow run h3abionet/h3agwas/formatdata/vcf_in_bgen_merge.nf --output_dir bgen_v3 --output all  -profile singularity --file_listvcf listvcf -resume

9.5.3 Format first, then merge

Each VCF is filtered and reformatted to BGEN independently; the BGEN files are then merged:

nextflow run h3abionet/h3agwas/formatdata/vcf_in_bgen_merge.nf --output_dir bgen_v3 --output all  -profile singularity --file_listvcf listvcf -resume

9.6 Format VCF files in IMPUTE2 format

IMPUTE2 is a dosage format used by BOLT-LMM.

Input option:

nextflow run h3abionet/h3agwas/formatdata/vcf_in_impute2.nf --file_listvcf listvcf --output_pat  utils/exampledata2_imp --output_dir ./impute2 -profile singularity -resume

10. Heritability and co-heritability estimation

The heritability pipeline can use summary statistics or genetic data for heritability and co-heritability estimation:

Example without LDSC:

nextflow run h3abionet/h3agwas/heritabilities/main.nf \
  --input_dir data/imputed/  --input_pat imput_data --data data/pheno/pheno_test.all --pheno pheno_qt1,pheno_qt2 \
  --file_gwas data/summarystat/all_pheno.gemma,data/summarystat/all_phenoq2.gemma   --head_pval  "p_wald"  --head_freq  "af" --head_bp  "bp" --head_chr  "chr" --head_rs  "rs" --head_beta "beta" --head_se "se" --head_A1 "allele1" --head_A2 "allele0" --Nind 500,500 \
  --ldsc_h2 0 --ldsc_h2_multi 0 --bolt_h2 1 --bolt_h2_multi 1 --gcta_h2 0 --gcta_h2_imp 0 --gcta_h2_multi 0 --gemma_h2 1 --gemma_h2_pval 1 -resume --output_dir heritability/ -profile singularity --grm_cutoff 0.5

Example with LDSC:

wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
tar xvjf eur_w_ld_chr.tar.bz2
nextflow run h3abionet/h3agwas/heritabilities/main.nf \
  --input_dir data/imputed/  --input_pat imput_data --data data/pheno/pheno_test.all --pheno pheno_qt1,pheno_qt2 \
  --file_gwas data/summarystat/all_pheno.gemma,data/summarystat/all_phenoq2.gemma   --head_pval  "p_wald"  --head_freq  "af" --head_bp  "ps" --head_chr  "chr" --head_rs  "rs" --head_beta "beta" --head_se "se" --head_A1 "allele1" --head_A2 "allele0" --Nind 500,500 \
  --ldsc_h2 1 --ldsc_h2_multi 1 --bolt_h2 1 --bolt_h2_multi 1 --gcta_h2 1 --gcta_h2_imp 0 --gcta_h2_multi 0 --gemma_h2 1 --gemma_h2_pval 1 -resume --output_dir heritability/ -profile singularity --grm_cutoff 0.5 --dir_ref_ld_chr eur_w_ld_chr

Output:

11. Multi-Trait Analysis of GWAS

Multi-trait analysis of GWAS (MTAG) is a method for the joint analysis of summary statistics from genome-wide association studies (GWAS) of different traits, possibly from overlapping samples.

Input:

Output:

nextflow run h3abionet/h3agwas/meta/mtag-assoc.nf --head_freq af --head_pval p_wald --head_bp ps --head_chr chr --head_rs rs --head_beta beta --head_se se --head_A1 allele1 --head_A2 allele0 --input_dir data/imputed/ --input_pat imput_data --file_gwas  data/summarystat/all_pheno.gemma,data/summarystat/all_phenoq2.gemma --pheno pheno_qt1,pheno_qt2 --data data/pheno/pheno_test.all -resume   -profile singularity

12. Conversion of positions between builds

12.1 GWAS catalog

The objective is to convert positions between builds in two ways: using rs values, and using CrossMap with chromosome positions.

Example: download the GWAS catalog in hg38 and convert to hg19/b37 positions:

nextflow run h3abionet/h3agwas/formatdata/convert_posversiongenome.nf -profile singularity -resume

12.2 Convert summary statistics between hg38 and hg19

Example: we use the output of h3abionet/h3agwas/formatdata/format_gwasfile.nf to convert from hg19 to hg38:

nextflow run h3abionet/h3agwas/formatdata/convert_posversiongenome.nf -profile singularity -resume --file_toconvert  data/summarystat/assoc_testwithrs.sumstat --link_rs_info=ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz --link_data_crossmap=http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz --head_chr CHR --head_bp bp --head_rs SNP --sep SPACE

13. Replication using the GWAS catalog

Replication using the GWAS catalog:

Extract the GWAS catalog used by default in the pipeline:

nextflow run h3abionet/h3agwas/replication/gwascat/main.nf --justpheno_gc 1

Do a replication:

nextflow run h3abionet/h3agwas/replication/gwascat/main.nf  --head_pval p_wald --head_bp ps --head_chr chr --head_rs rs --head_beta beta --head_se se --head_A1 allele1 --head_A2 allele0  --file_gwas data/summarystat/all_pheno.gemma  --output_dir replication_gc -profile singularity -resume  --input_dir data/imputed/ --input_pat imput_data --head_af af --pheno_gc "Type 2 diabetes"

14. Replication using summary statistics of other studies

Replication using the summary statistics of other studies:

Do a replication:

nextflow run h3abionet/h3agwas/replication/fullsumstat/main.nf  \
 --head_pval_sumstat1 p_wald --head_bp_sumstat1 ps --head_chr_sumstat1 chr --head_rs_sumstat1 rs --head_beta_sumstat1 beta --head_se_sumstat1 se --head_A1_sumstat1 allele1 --head_A2_sumstat1 allele0 --file_gwas_sumstat1 data/summarystat/all_pheno.gemma --head_frq_sumstat1 af --n_count1 500 \
 --head_pval_sumstat2 p_wald --head_bp_sumstat2 ps --head_chr_sumstat2 chr --head_rs_sumstat2 rs --head_beta_sumstat2 beta --head_se_sumstat2 se --head_A1_sumstat2 allele1 --head_A2_sumstat2 allele0 --file_gwas_sumstat2 data/summarystat/all_phenoq2.gemma --head_frq_sumstat2 af --n_count2 500 \
  --output_dir replication_ss -profile singularity -resume  --input_dir data/imputed/ --input_pat imput_data