Using the IARC nextflow bioinformatics pipelines course 2018

The aim of this course is to enable participants to use the bioinformatics pipelines developed at IARC using nextflow.

Description of the course

Learning objectives
After completing this course, participants will be able to use the IARC nextflow bioinformatics pipelines and more specifically:

set pipeline parameters and configuration,
execute, resume and debug pipelines,
trace and visualise pipeline execution,
run the pipelines on a workstation or on a high performance computing cluster,
use GitHub and Docker/Singularity containers to run reproducible analyses,
understand the basic concepts of the nextflow language used to describe the pipelines.

Please note that the development of pipelines will only be very briefly covered in this course.

Agenda and slides

Wednesday 23 May (slides)
09:00-10:00 Introduction to bioinformatics pipelines, nextflow, Docker, GitHub and the IARC organization
10:00-10:30 Practical application: running your first pipeline
10:30-11:00 Break
11:00-11:30 The hidden structure of nextflow: work folder and configuration
11:30-12:30 Practical application: configuring, crashing, resuming and debugging pipelines

Thursday 24 May (slides)
09:00-09:30 Introduction to HPC clusters and running pipelines on a cluster
09:30-10:30 Practical application: trace and visualise pipeline execution with log files
10:30-11:00 Break
11:00-11:30 Introduction to the nextflow language: understanding what the pipelines are doing
11:30-12:30 Practical application: advanced usage for reproducibility (choosing a container, GitHub releases and branches, modifying a pipeline etc.)

Gitter Chat

A Gitter chat room is open for the course. Participants can use it to discuss their projects and to ask any questions about the course.

Laptop setup

Laptops use Ubuntu 16.04.

Nextflow is already installed and in ~/bin, which is in your PATH.

Docker is already installed. If you are curious, the Docker website explains how to install it.

If you need a good text editor, Atom is also installed.

Demo commands

nextflow run iarcbioinfo/nf_coverage_demo -with-docker --bam_folder data_test/BAM/BAM_multiple/ --bed data_test/BED/TP53_exon2_11.bed

nextflow run iarcbioinfo/platypus-nf -with-docker --input_folder data_test/BAM/ --ref data_test/REF/17.fasta

nextflow run iarcbioinfo/RNAseq-nf -with-docker --input_folder data_test/BAM/BAM_multiple/ --output_folder BAM_realigned --ref_folder data_test/REF --gtf data_test/REF/TP53_small.gtf --bed data_test/BED/TP53_small.bed --mem 4
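Resuming and tracing, both covered in the practical sessions, only require extra Nextflow options on the same command line. Options with a single dash (`-resume`, `-with-trace`, `-with-docker`) are read by Nextflow itself, while options with a double dash (`--bam_folder`, `--bed`) are pipeline parameters. The sketch below only prints the first demo command with extra options appended, so it can be inspected before running; the helper name `demo_cmd` is hypothetical and not part of the course material.

```shell
# Hypothetical helper (an assumption for illustration): prints the first demo
# command followed by any extra Nextflow options given as arguments.
demo_cmd() {
    echo "nextflow run iarcbioinfo/nf_coverage_demo -with-docker" \
         "--bam_folder data_test/BAM/BAM_multiple/" \
         "--bed data_test/BED/TP53_exon2_11.bed" "$@"
}

# Print the command that resumes a previous run and writes a trace file
demo_cmd -resume -with-trace

# To actually launch it, evaluate the printed command:
# eval "$(demo_cmd -resume -with-trace)"
```

With `-resume`, Nextflow reuses cached results from the work folder for steps whose inputs have not changed, which is useful after fixing a crashed run.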

Config

Example config files are in the config folder of this repository. Note that adding -with-trace to your nextflow run command is equivalent to having a configuration file containing:

trace {
    enabled = true
}

or:

trace.enabled = true

One example of each possibility is given (nextflow.config_1 and nextflow.config_2). You will also find the configuration file I propose to use on the IARC Jupiter cluster.
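Nextflow reads a file named `nextflow.config` from the directory where the pipeline is launched, so one way to try the trace setting is to create that file before running a demo command. The following is a minimal sketch; the scratch directory created with `mktemp` is an assumption made here so that no existing configuration is overwritten.

```shell
# Work in a scratch directory (illustrative assumption) so that an existing
# nextflow.config is never overwritten.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Write the one-line form of the trace setting
cat > nextflow.config <<'EOF'
trace.enabled = true
EOF

# Show the file that Nextflow would read when launched from this directory
cat nextflow.config
```

A pipeline launched from this directory should then produce a trace file without `-with-trace` on the command line. A config file stored elsewhere can be passed explicitly with `-c <file>`.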

IARC Jupiter cluster

Create a symlink to singularity

ln -s /appli57/singularity/singularity-2.4.5/bin/singularity /home/username/bin/

Add to your ~/.bash_profile:

export NXF_JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk.x86_64/jre/
export NXF_TEMP=/data/tmp

export SINGULARITY_CACHEDIR=/data/username/.singularity
export SINGULARITY_LOCALCACHEDIR=/data/tmp/
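After editing `~/.bash_profile` and opening a new login shell (or running `source ~/.bash_profile`), it can help to confirm that the four variables are actually set. The sketch below is one way to do this; the helper name `check_var` is hypothetical.

```shell
# Hypothetical helper (an assumption for illustration): report whether the
# environment variable named in $1 is set and non-empty.
check_var() {
    eval "value=\${$1:-}"
    if [ -n "$value" ]; then
        echo "$1=$value"
    else
        echo "$1 is not set"
        return 1
    fi
}

# Check each variable from the ~/.bash_profile lines above
for name in NXF_JAVA_HOME NXF_TEMP SINGULARITY_CACHEDIR SINGULARITY_LOCALCACHEDIR; do
    check_var "$name" || true
done
```

Any line ending in "is not set" suggests that the profile was not reloaded or that the corresponding export line is missing.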

Replace your ~/.nextflow/config with the one in the config folder of this repository.

Check cluster usage with bhosts, or your own jobs with bjobs. You can also run our script to see what other users are running: /appli57/scripts/bjobs_monitor.r.

Useful links

IARC bioinformatics GitHub organization
Docker and DockerHub. See my short docker tutorial here if you want to know more about it. IARC bioinformatics DockerHub page.
Singularity and the PLOS ONE paper presenting it.
Nextflow resources:
- Nextflow website
- Nextflow documentation
- Nextflow releases on GitHub with changelogs
- Nextflow issues on GitHub
- Nextflow
- Nextflow blog
- Nextflow google group
- Nextflow twitter
- A curated list of Nextflow pipelines
- nf-core: an emerging effort to collect high quality pipelines
- Nextflow paper: Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
- Another paper by nextflow folks about the impact of Docker on performance: https://peerj.com/articles/1273/
Dataflow programming on wikipedia
Scientific workflow system on wikipedia
A paper in PLOS Comp. Bio. about using GitHub efficiently to manage your bioinformatics projects