nf-core/lncpipe

A Nextflow-based pipeline for comprehensive analyses of long non-coding RNAs from RNA-seq datasets

Introduction

Recently, long noncoding RNAs (lncRNAs) have attracted widespread attention for their critical roles in diverse biological processes and their important implications in a variety of human diseases and cancers. Identifying and profiling lncRNAs is a fundamental step toward advancing our knowledge of their functions and regulatory mechanisms. However, RNA-sequencing-based lncRNA discovery is currently limited by the complicated operation and implementation of the tools involved. We therefore present LncPipe, a one-stop, multi-tool integrated pipeline focused on characterizing lncRNAs from raw transcriptome sequencing data. The pipeline is built on the popular workflow framework Nextflow and is composed of four core procedures: read alignment, assembly, identification and quantification. Its features include a well-designed lncRNA annotation strategy, optimized computational efficiency, diversified classification and an interactive analysis report. LncPipe also gives users additional control: they can interrupt the pipeline, reset parameters from the command line, modify the main script directly and resume an analysis from a previous checkpoint.

Documentation

The nf-core/lncpipe pipeline comes with documentation about the pipeline, found in the docs/ directory:

  1. Installation
  2. Pipeline configuration
  3. Running the pipeline
  4. Output and how to interpret the results
  5. Troubleshooting

Citation

Qi Zhao, Yu Sun, Dawei Wang, Hongwan Zhang, Kai Yu, Jian Zheng, Zhixiang Zuo. LncPipe: A Nextflow-based pipeline for identification and analysis of long non-coding RNAs from RNA-Seq data. J Genet Genomics. 2018 Jul 20;45(7):399-401

LncPipe

Schematic diagram

Installation

Nextflow
LncPipe is implemented with the Nextflow pipeline management system. To run LncPipe, Nextflow must be pre-installed on a POSIX-compatible system (Linux, Solaris, OS X, etc.); it requires Bash and Java 7 or higher. We do not recommend running the pipeline on Windows, since most bioinformatics tools are not supported there.

Quick start

Here we show, step by step, how to install Nextflow on a Linux system as an example (adapted from the Nextflow documentation).

Running the Nextflow installer creates the main Nextflow executable file in the current directory.

Then download the LncPipe repository:

    git clone https://github.com/likelet/LncPipe.git
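
Before installing, it can help to confirm that the prerequisites named above (Bash and Java) are available. The snippet below is only a sketch and is not part of LncPipe: `have_cmd` is a hypothetical helper name. The commented last line is the one-line installer documented by the Nextflow project; it is left commented out so the snippet itself makes no network request.

```shell
#!/bin/sh
# Hypothetical helper (not part of LncPipe): succeed if a command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Nextflow needs Bash and Java, so report on each of them.
for tool in bash java; do
    if have_cmd "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool"
    fi
done

# Installer documented by the Nextflow project; creates ./nextflow when run:
# curl -s https://get.nextflow.io | bash
```

This check only confirms that the tools exist; it does not verify the Java version.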

Prepare input files

References, index and annotation files (mandatory).

Species

>Currently, LncPipe has been tested for detection of lncRNAs in 'humans' only.
However, LncPipe can be manually configured to run the analysis for other species as well; this requires the additional files "known_protein_coding.gtf" and "known_lncRNA.gtf" for coding probability calculations. More information on usage for non-human species can be found here.

Run Docker

  1. Prepare input files as mentioned earlier.

  2. Modify the mandatory section of docker.config.

  3. Install docker and download the latest LncPipe build using: docker pull bioinformatist/lncpipe

  4. Run LncPipe using the following command:

     nextflow -c docker.config run LncRNAanalysisPipe.nf
    

The Docker image for LncPipe is available on Docker Hub (https://hub.docker.com/r/bioinformatist/lncpipe/tags/). Alternatively, Nextflow can pull the image from docker.io automatically. The Dockerfile records how the image was built. Users in mainland China can pull the Docker image from this mirror site instead.

Dependencies

To install the software locally on your machine, please see the installation instructions here.

Interactive reports

The results of LncPipe are summarized and visualized as interactive plots by our novel R package LncPipeReporter. Users can also try LncPipeReporter as a stand-alone tool for visualizing known and novel lncRNAs.

Configuration

As a Nextflow-based analysis pipeline, LncPipe lets users edit the configuration file nextflow.config to set the index files and default file path parameters instead of typing them on the command line.

To configure the pipeline, go to the params block and set the file locations and system environment settings shown below:

        params {
        /*
            User setting options (mandatory)
             */
        // input file and genome reference
            fastq_ext = '*_{1,2}.fq.gz'
            fasta_ref = '/data/database/hg38/genome.fa'
            design = 'design.file'
            hisat2_index = '/data/database/hg38/hisatIndex/grch38_snp_tran/genome_snp_tran'
            cpatpath='/opt/CPAT-1.2.3'
            //human gtf only
            gencode_annotation_gtf = "/data/database/hg38/Annotation/gencode.v24.annotation.gtf"
            lncipedia_gtf = "/data/database/hg38/Annotation/lncipedia_4_0_hg38.gtf" // set "null" if you are going to perform analysis on other species

        // additional options for non-human species, else leaving them unchanged
            species="human"// mouse , zebrafish, fly
            known_coding_gtf=""
            known_lncRNA_gtf=""
            //for test (commented out so it does not override cpatpath above)
            //cpatpath = '/home/zhaoqi/software/CPAT/CPAT-1.2.2/'


        /*
            User setting options (optional)
             */
            // tools setting
            star_index = ''//set if star used
            bowtie2_index = ''//set if tophat used
            aligner = "hisat" // or "star","tophat"
            sam_processor="sambamba"//or "samtools(deprecated)"
            qctools ="fastp"  // or "afterqc","fastp","fastqc"
            detools = "edger"//or "deseq2","noiseq" not supported yet
            quant = "kallisto"// or 'htseq'

            //other setting
            singleEnd = false
            unstrand = false
            skip_combine = false
            lncRep_Output = 'reporter.html'
            lncRep_theme = 'npg'
            lncRep_cdf_percent = 10
            lncRep_max_lnc_len = 10000
            lncRep_min_expressed_sample = 50
            mem=60
            cpu=30
        }

        manifest {
            homePage = 'https://github.com/likelet/LncPipe'
            description = 'LncPipe: a Nextflow-based Long non-coding RNA analysis PIPELINE'
            mainScript = 'LncRNAanalysisPipe.nf'
        }


        timeline {
            enabled = true
            file = "timeline.html"
        }
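
Since the params block points at several reference files, a quick existence check before launching can save a failed run. The following is a sketch rather than anything shipped with LncPipe: `missing_files` is a hypothetical helper, and the paths are simply the example values from the configuration above.

```shell
#!/bin/sh
# Hypothetical helper (not part of LncPipe): print every path argument that
# is not an existing regular file, and fail if any of them is missing.
missing_files() {
    status=0
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done
    return "$status"
}

# Example paths copied from the params block above; replace with your own.
missing_files \
    /data/database/hg38/genome.fa \
    /data/database/hg38/Annotation/gencode.v24.annotation.gtf \
    /data/database/hg38/Annotation/lncipedia_4_0_hg38.gtf \
    || echo "fix the paths in nextflow.config before launching"
```

The hisat2_index value is left out on purpose: in the example it looks like an index prefix rather than a single file, so a plain file test would not apply to it.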

Parameters

These command-line parameters override the corresponding settings in the nextflow.config file.

| Name | Example/Default value | Description |
|---|---|---|
| --input_folder | . | Input folder |
| --species | human | Your species; mouse, fly and zebrafish are also supported |
| --fastq_ext | *_{1,2}.fastq.gz | Input raw paired reads |
| --out_folder | . | Output folder |
| --design | FALSE | A text file that stores the experimental design information; please see the --design section below for details |

| Name | Required | Description |
|---|---|---|
| --star_index/--bowtie2_index/--hisat2_index | - | Path to the STAR/bowtie2/hisat2 (mutually exclusive) index (required if not set in the config file) |
| --fasta | - | Path to the FASTA reference (required if not set in the config file) |
| --gencode_annotation_gtf | - | An annotation file from the GENCODE database for annotating lncRNAs (required if not set in the config file), e.g. gencode.v26.annotation.gtf |
| --lncipedia_gtf | - | An annotation file from the LNCipedia database for annotating lncRNAs (required if not set in the config file), e.g. lncipedia_4_0_hc_hg38.gtf |

| Name | Required | Description |
|---|---|---|
| --cpatpath | - | Home folder of the CPAT installation |

Since CPAT may load model data from its home path, users should specify where the model files are located, especially if they installed CPAT themselves rather than with our installation code.

| Name | Default value | Description |
|---|---|---|
| --singleEnd | FALSE | Specify that the reads are single-ended |
| --merged_gtf | FALSE | Skip the mapping and assembly steps by directly providing assembled, merged GTF files |
| --unstrand | FALSE | Specify that the library is unstranded |
| --aligner | star | Aligner for read mapping (optional): star/tophat/hisat2. STAR is the default and the only one supported at present |
| --qctools | fastp | Tool for assessing or filtering raw read quality: fastp, fastqc, afterqc or none (skip the QC step) |

| Name | Default value | Description |
|---|---|---|
| --lncRep_Output | reporter.html | Report file name |
| --lncRep_theme | npg | Plot theme for the interactive plots; values come from ggsci |
| --lncRep_min_expressed_sample | 50 | Minimum number of expressed genes allowed in each sample (default 50); samples that do not pass are filtered out of the analysis |
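
Because command-line parameters take precedence over values in nextflow.config, a single run can override individual settings. The sketch below only prints such a command instead of running it, so it works without Nextflow installed; `lncpipe_cmd` is a hypothetical helper, and the chosen options are just examples taken from the tables above.

```shell
#!/bin/sh
# Hypothetical helper (not part of LncPipe): print a LncPipe command line
# from option/value pairs, single-quoting each value (dry run only).
lncpipe_cmd() {
    printf 'nextflow run LncRNAanalysisPipe.nf'
    while [ "$#" -ge 2 ]; do
        printf " %s '%s'" "$1" "$2"
        shift 2
    done
    printf '\n'
}

lncpipe_cmd --fastq_ext '*_{1,2}.fq.gz' --design design.file --qctools fastp
```

Quoting the file pattern matters: without the quotes, the shell may expand `*_{1,2}.fq.gz` itself before Nextflow ever sees it.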

--fastq_ext

Raw FASTQ files are required for de novo analysis. This parameter should be set according to the file names of your paired-end or single-end reads.

For example:

    Sample1_1.fq.gz
    Sample1_2.fq.gz
    Sample2_1.fq.gz
    Sample2_2.fq.gz

Then you can pass the pattern *_{1,2}.fq.gz so that LncPipe recognizes all the paired-end files.

For single-end reads, supply the file pattern together with the --singleEnd parameter.
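
To see which sample names a naming scheme like the one above implies, the file suffix can be stripped by hand. This is a sketch under the assumption that files end in `_1.fq.gz` or `_2.fq.gz`; `sample_name` is a hypothetical helper and does not reproduce how LncPipe or Nextflow pair files internally.

```shell
#!/bin/sh
# Hypothetical helper (not part of LncPipe): derive the sample name from a
# paired-end file name that follows the *_{1,2}.fq.gz convention.
sample_name() {
    file=$(basename "$1")
    case "$file" in
        *_1.fq.gz) echo "${file%_1.fq.gz}" ;;
        *_2.fq.gz) echo "${file%_2.fq.gz}" ;;
        *) echo "unmatched: $file" >&2; return 1 ;;
    esac
}

# List the distinct samples among the example files above.
for f in Sample1_1.fq.gz Sample1_2.fq.gz Sample2_1.fq.gz Sample2_2.fq.gz; do
    sample_name "$f"
done | sort -u
```

For the four example files this prints Sample1 and Sample2, one per line.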

--star_index/--bowtie2_index/--hisat2_index

This parameter is required when the index is not configured in the nextflow.config file. It specifies the STAR/TopHat/HISAT2 (mutually exclusive) index folder built before running LncPipe. If you do not have a prebuilt index, you can use --fasta to specify the reference sequence instead, and LncPipe will build the index automatically.

--design

Experimental design file for differential expression analysis. Default: null. Format:

    WT:Sample1,Sample2,Sample3
    KO:Sample1,Sample2,Sample3

Here WT and KO are the two experimental conditions, and Sample1, Sample2 and Sample3 are replicates, which must be comma-delimited on the same line.

Each sample name should be the same as the prefix of the corresponding FASTQ file names, that is, what remains of the file name after the suffix given in --fastq_ext is trimmed off.

For example:

If the FASTQ files Sample1_1.fq.gz and Sample1_2.fq.gz come from one sample and --fastq_ext is set to *_{1,2}.fq.gz, the sample name should be Sample1.
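
A design file in this format can be sanity-checked before a run. The sketch below is an illustration only, not LncPipe's own parser: `design_samples` is a hypothetical helper that prints one condition/sample pair per line and fails on any line that does not look like `Condition:Sample1,Sample2,...`.

```shell
#!/bin/sh
# Hypothetical helper (not part of LncPipe): print one "condition<TAB>sample"
# line per sample from a design file in the Condition:S1,S2,... format.
design_samples() {
    awk -F: '
        /^[ \t]*$/ { next }                      # ignore blank lines
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d is not Condition:Sample1,Sample2,...\n", NR > "/dev/stderr"
            bad = 1
            next
        }
        {
            n = split($2, samples, ",")
            for (i = 1; i <= n; i++) print $1 "\t" samples[i]
        }
        END { exit bad }
    ' "$1"
}

# Write the example design from the text to a temporary file and list it.
design=$(mktemp)
printf 'WT:Sample1,Sample2,Sample3\nKO:Sample1,Sample2,Sample3\n' > "$design"
design_samples "$design"
rm -f "$design"
```

The sample column of this listing can then be compared with the sample names derived from the FASTQ file names.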

Output

Results are written to the Result folder under the current path (default) or under the output folder set by the user. A typical structure of Result is as follows:

    Result/
        ├── QC
        │   ├── N1141_1.clean_fastqc.html
        │   ├── N1141_2.clean_fastqc.html
        │   ├── N1177_1.clean_fastqc.html
        │   └── N1177_2.clean_fastqc.html
        ├── Identified_lncRNA
        │   ├── all_lncRNA_for_classifier.gtf
        │   ├── final_all.fa
        │   ├── final_all.gtf
        │   ├── lncRNA.fa
        │   ├── protein_coding.fa
        │   └── protein_coding.final.gtf
        ├── LncReporter
        │   ├── Differential_Expression_analysis.csv
        │   └── Report.html
        ├── Quantification
        │   ├── kallisto.count.txt
        │   └── kallisto.tpm.txt
        └── Star_alignment
            ├── STAR_N1141
            │   ├── N1141Aligned.sortedByCoord.out.bam
            │   ├── N1141Log.final.out
            │   ├── N1141Log.out
            │   ├── N1141Log.progress.out
            │   └── N1141SJ.out.tab
            └── STAR_N1177
                ├── N1177Aligned.sortedByCoord.out.bam
                ├── N1177Log.final.out
                ├── N1177Log.out
                ├── N1177Log.progress.out
                └── N1177SJ.out.tab
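
After a run, the layout above can be checked quickly from the shell. This is a sketch assuming the sub-folder names shown in the example tree; `check_result_dirs` is a hypothetical helper, and the alignment folder is skipped because its name appears to depend on the aligner used.

```shell
#!/bin/sh
# Hypothetical helper (not part of LncPipe): report which of the expected
# sub-folders are absent from a result directory.
check_result_dirs() {
    result_dir=$1
    status=0
    for sub in QC Identified_lncRNA LncReporter Quantification; do
        if [ ! -d "$result_dir/$sub" ]; then
            echo "missing: $result_dir/$sub"
            status=1
        fi
    done
    return "$status"
}

check_result_dirs Result || echo "some expected output is absent"
```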

Tips

Acknowledgement

Thanks to the author of AfterQC, Shifu Chen, for his help in providing gzip output support to meet the requirements of LncPipe. Thanks also to Hongwan Zhang and Yan Wang from the SYSUCC Cancer bioinformatics platform for internal testing.

FAQ

A: Use the following command, as suggested in the installation section.

    perl -CD -pi -e'tr/\x{feff}//d && s/[\r\n]+/\n/' *.py 

A: The cpat command requires Human_Hexamer.tsv to predict lncRNA coding potential; please check your cpatpath parameter.

A: This is a version conflict between htseq and the BAM files generated by hisat. A possible solution is to install an older version of htseq.

Contact

For implementation:

We strongly recommend users open new issues if they have questions or find bugs.