nextflow-course

Contains course material for a short course on Nextflow DSL2

Scott Hazelhurst, School of Electrical & Information Engineering and Sydney Brenner Institute for Molecular Bioscience

University of the Witwatersrand

Set up

Clone the repo

git clone shaze/nextflow-course

Exercise 0

Read and run the show.nf script

nextflow run show.nf

Read and run the cleandups.nf script

Fix the obvious bug -- we will fix the deeper semantic issue later

You should commit your fix into git

git commit -a -m "fixed the problem"

Exercise 1

If you need some introduction to Groovy, check out the groovy branch

git checkout groovy

Read groovy.nf and run it. Make sure you understand it.

Otherwise, or once you are done, move on to Exercise 2

Exercise 2

Check out the phasing branch

git checkout phasing

In this exercise we fix the semantic problem from Exercise 0 so that we get the correct result even if we have multiple input files

First, look at and understand phase.nf, which shows a simple example of how join works. By default, join takes two channels of tuples and phases them using the first element of each tuple as a common key.
Note also that we can construct output names dynamically using input values, and note the use of the publishDir directive to direct output. In this case the output directory is fixed, but we could also construct it dynamically.
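
As a rough illustration of the default behaviour of join (this is a sketch, not the contents of phase.nf; the sample names and values are made up and are plain strings rather than real files):

```nextflow
workflow {
    // Two channels of tuples. The first element of each tuple is the key.
    // Note that the keys arrive in a different order in the two channels.
    bed_ch = Channel.of(['sampleA', 'sampleA.bed'], ['sampleB', 'sampleB.bed'])
    bim_ch = Channel.of(['sampleB', 'sampleB.bim'], ['sampleA', 'sampleA.bim'])

    // join matches tuples that share the same first element and emits
    // one combined tuple per key: [key, value from bed_ch, value from bim_ch]
    bed_ch.join(bim_ch).view()
}
```

Each key should appear once in the output with its two values side by side, although the order in which the keys are printed is not guaranteed.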

Now look at cleandups.nf. This takes our previous example and fixes the semantic problems. There is some new Nextflow here. Generally, we connect the channels of two processes by passing parameters:

processA(x,y)
processB(processA.out)

If processA had multiple output channels we could also specify which channel to be used. However, when the output signature of one process (i.e., the number of channels) matches the input of the next we can use the pipe ("|") symbol

processA(x,y) | processB

which is cleaner. The two are semantically equivalent though so you can choose which you'd prefer.
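
A minimal, self-contained sketch of the pipe style is given below. The two toy processes (sayHello and shout) are hypothetical and are not part of the course repository; they use exec: blocks so that nothing other than Nextflow is needed to run them.

```nextflow
// Hypothetical toy process: one input channel, one output channel
process sayHello {
    input:
    val name

    output:
    val greeting

    exec:
    greeting = "Hello, ${name}"
}

// Hypothetical toy process whose single input matches sayHello's single output
process shout {
    input:
    val text

    output:
    val loud

    exec:
    loud = text.toUpperCase()
}

workflow {
    names = Channel.of('alice', 'bob')

    // Pipe form. The parameter-passing form would instead be:
    //   sayHello(names)
    //   shout(sayHello.out)
    //   shout.out.view()
    names | sayHello | shout | view
}
```

Only one of the two forms can be active at a time here, since a process can be invoked only once within a workflow.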

-- note the use of join. Join takes two parameters. Generally we'd write a_ch.join(b_ch) using the traditional "." notation (so the first parameter is before the "."!). When we use piping, the first parameter comes from the preceding channel and the second is given explicitly

-- note the use of view here. view takes values from a channel, displays them on the terminal and then emits the same values on a new output channel. Its main purpose is debugging. I introduced it because there was a bug in my first version of the code and I wanted to see what was happening. Once I found the bug and fixed the code, I should have taken it out; I have left it in only to show you
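
The following sketch puts join and view together in the piped style. It is not taken from cleandups.nf, and the channel contents are invented for illustration.

```nextflow
workflow {
    a_ch = Channel.of(['s1', 1], ['s2', 2])
    b_ch = Channel.of(['s2', 'y'], ['s1', 'x'])

    // Equivalent to a_ch.join(b_ch): the first parameter comes from the pipe,
    // the second is given explicitly.
    // view prints each item and passes it on unchanged, so it can be
    // dropped into the middle of a pipeline while debugging.
    a_ch | join(b_ch) | view { row -> "joined: ${row}" }
}
```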

Run the code

nextflow run cleandups.nf
Note where the output can be found
Run it again: nextflow run cleandups.nf -resume

We now move to the next exercise.

Exercise 3

To move to this exercise, say git checkout grouping. (If you've changed any of my files in the current exercise, git may complain and you will first have to say git commit -a -m "MovingOn!!!")

In this exercise we are going to look at two things: creating configuration files, and more sophisticated ways of creating channels.

Configuration files

The default configuration file is nextflow.config. (We can actually have multiple configuration files, and parameters can also be supplied in a JSON or YAML file using the -params-file option, but we'll stick with the basics.)

A config file consists of one or more scopes. In our example, there are two scopes: manifest and params. The manifest is mainly informational, although it can be used by Nextflow to choose which script to run if there are multiple scripts in a directory (we'll explore this later).

The params scope is used to set a record/struct variable called params that can be accessed and manipulated in the Nextflow program. In this case, we can refer to params.input_dir, params.output, etc. So this is a very convenient way of sending values into our Nextflow program rather than hard-coding values. In particular, it's generally a bad idea to fix absolute directory paths into a Nextflow program; these should be set by using params.
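
For orientation, a config file with these two scopes might look roughly like the fragment below. The values are placeholders chosen for illustration and are not copied from the nextflow.config in the repository.

```nextflow
// nextflow.config (illustrative values only)

manifest {
    description = 'Grouping exercise'   // informational
    mainScript  = 'group.nf'            // script to run when none is named
}

params {
    input_dir = 'data'      // read in the script as params.input_dir
    output    = 'results'   // read in the script as params.output
}
```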

There are several ways in which params can be set. This causes occasional confusion, but it gives great flexibility in writing code that has sensible defaults for parameters which can be over-ridden by the user or environment. In order of increasing precedence (that is, parameters set by later methods over-ride values set by earlier methods) the key ones are:

you can set the parameter in the Nextflow program itself params.input_dir="myinputdir"
the nextflow.config file in the directory of the workflow script
the nextflow.config file in the directory from which you run the script
parameters set on the command line: for example, if I run nextflow run simplefreq.nf --input_dir data then in my program params.input_dir will have the value data.

There are other rules and you can also have different config files -- look at the documentation for detail.
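
One way to watch the precedence rules in action is a tiny script like the sketch below. The file name show_input.nf is hypothetical; the script is not part of the repository.

```nextflow
// show_input.nf (hypothetical file name)

// Lowest precedence: a default set in the script itself.
// It is used only if no config file or command-line option sets input_dir.
params.input_dir = "myinputdir"

workflow {
    println "input_dir = ${params.input_dir}"
}
```

Running it with nextflow run show_input.nf --input_dir data should report data, since the command line over-rides both the script default and any nextflow.config. Running it without the option shows whichever of the lower-precedence settings applies in your directory.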

Grouping input files

Have a look at group.nf. First, note how we create channels using different methods. The factory method fromFilePairs allows us to create a channel of file tuples that are grouped according to some rule. In our case, we do several things:

we use our params to select which directory and file sets will be used
we specify that we only want bed, bim and fam files. NB: the use of braces and commas like this {bed,bim,fam} is a standard command-line feature (a glob) and is not specific to Nextflow. Just as the glob "*" means everything, a list of things in braces, separated by commas, matches any one of the things listed
we specify that we are only interested in file sets with exactly three elements. For example, if there's a fam file missing for one data set, we are not interested in it. (As an aside, the use of the word Pairs is misleading since you can have any number of elements picked up; the default value for size is 2.)
finally, and very importantly, we pass fromFilePairs a closure which specifies how to group the elements

What fromFilePairs does is pick up all the files that match the expression given and then group them according to the closure: it applies the function specified in the closure to each file and groups all files for which the closure returns the same value.
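
A hedged sketch of such a call is shown below. It is not the code in group.nf: the default value of params.input_dir and the choice of baseName as the grouping key are assumptions made for illustration, and the directory is assumed to contain PLINK-style file sets.

```nextflow
// Assumed default; the course's nextflow.config may set this differently
params.input_dir = "data"

workflow {
    Channel
        // pick up bed, bim and fam files, keeping only complete sets of three
        .fromFilePairs("${params.input_dir}/*.{bed,bim,fam}", size: 3) { f ->
            // grouping key: the file name without its extension, so that
            // x.bed, x.bim and x.fam all fall into the group "x"
            f.baseName
        }
        // each item emitted is a tuple: [key, [list of the three files]]
        .view()
}
```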

Now run group.nf: nextflow run group.nf

There's another example you can uncomment. One slightly confusing thing when running is that Nextflow maximises concurrency where possible, and so you see the outputs all mixed up.

PLINK example

In this example, we are going to take a number of different PLINK file sets, compute the frequencies, and then merge the results. When plink is run it needs three input files: a bed, a bim and a fam file. If we are running multiple file sets through plink we have to make sure that we are consistent: we can't use the bed file from one data set, the bim from another and the fam from yet another. Even if by some miracle the sizes of the files somehow allow this to work, we are going to get nonsense as a result.

Thus we use fromFilePairs to do the grouping. Look at freq.nf
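
To show how the grouped file sets keep each plink run consistent, here is a sketch of the frequency step only. It is not the code in freq.nf. The process name, the default input directory and the grouping key are assumptions, and it presumes PLINK 1.9 is on the PATH (where --freq writes a file ending in .frq).

```nextflow
// Assumed default; set it in nextflow.config or on the command line instead
params.input_dir = "data"

process computeFreq {
    input:
    // base is the shared name; the three files are staged together
    tuple val(base), path(plink_files)

    output:
    path "${base}.frq"

    script:
    """
    plink --bfile ${base} --freq --out ${base}
    """
}

workflow {
    // one tuple per complete bed/bim/fam set, so files are never mixed
    filesets = Channel.fromFilePairs("${params.input_dir}/*.{bed,bim,fam}", size: 3) { f -> f.baseName }

    computeFreq(filesets) | view
}
```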

Exercise: The problem with this solution is that the output file is not in the right order -- how can it be fixed?

Exercise: Check out the pairs branch for a simple exercise. Here we have different data for different months of different years. Combine the monthly data by doing a paste and then combine the monthly data to get yearly data.

Exercise 4: Using the config file.

Check out the config branch: git checkout config (again, if you have amended any of my files you may have to commit the change).

Have a look at the nextflow.config file
Run the program show_param.nf: nextflow run show_param.nf
Can you understand the output?
Try different options like nextflow run show_param.nf --other 7 --cut-off 15
Modify the nextflow.config file and see how it works

Now check out the docker branch: git checkout docker

Exercise 5: Using containers

We will use singularity rather than docker.

First, do this to see what version of git is native on the computer: git --version

Now read both the nextflow.config and the git.nf file.

Run git.nf: nextflow run git.nf

Look at the output of the file to see what version of git runs. Explain what happens.

Exercise 6: Running on SLURM

Checkout slurm: git checkout slurm