OVERVIEW

This repository contains the scripts and links to the data used to generate the results in the SQUID manuscript. The results fall into three sections: simulation data, previously studied cell lines, and TCGA data. The SQUID software is available at https://github.com/Kingsford-Group/squid.

ON SIMULATION DATA

Software Used

Required:

Optional (you will need them for simulating RNA-seq reads and SVs):

Workflow Description

image of workflow with simulation data

Re-producing the Result

Make sure that the software listed above is on your PATH. Then concatenate the two parts of the simulation data archive and extract it:

cat simulation_data00 simulation_data01 > simulation_data.tar.gz
tar -xzvf simulation_data.tar.gz
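A truncated download of either part only shows up as an extraction error halfway through. The following is a minimal sketch of a check that can run before extraction; `join_and_check` is a hypothetical helper and is not part of this repository.

```shell
# Hypothetical helper (not part of the repository): join split archive
# parts into one file and verify the gzip stream before extracting.
join_and_check() {
    out="$1"; shift
    # Refuse to continue if any part is missing.
    for part in "$@"; do
        [ -f "$part" ] || { echo "missing part: $part" >&2; return 1; }
    done
    # Parts must be given in their original order.
    cat "$@" > "$out" || return 1
    # gzip -t reads the whole stream and fails on truncation or corruption.
    gzip -t "$out" || { echo "archive is corrupt: $out" >&2; return 1; }
}
```

With the file names used above, this could be called as `join_and_check simulation_data.tar.gz simulation_data00 simulation_data01 && tar -xzvf simulation_data.tar.gz`.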

To continue with the workflow, we provide a script that performs alignment and TSV calling. You can generate all results by running:

./scripts/runSimulationData.sh <decompressed data folder> <number of threads>
./scripts/runWholeSimulation.sh

The specification of the output directory and files can be found here.

ON PREVIOUSLY STUDIED CELL LINES

Data Used

Here we use Ensembl genome 75 with the corresponding annotation file as the reference genome for alignment and TSV detection.

For RNA-seq, we use publicly available datasets from the SRA database. We merge SRR2532344 and SRR925710 to form the RNA-seq data of the HCC1954 cell line, and use SRR2532336 for the HCC1395 cell line.
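To illustrate what merging two runs means for paired-end data, here is a minimal sketch. It is not taken from the repository's scripts, and the download script may already perform this step itself. `merge_runs` is a hypothetical helper, and the file naming (`<run>_1.fastq`, `<run>_2.fastq`) is an assumption based on the output of `fastq-dump --split-files`.

```shell
# Hypothetical helper (not part of the repository): concatenate paired-end
# FASTQ files from several runs into one pair of files.
# Assumed naming: <run>_1.fastq and <run>_2.fastq for each run.
merge_runs() {
    out_prefix="$1"; shift
    for mate in 1 2; do
        # Start from an empty output file for this mate.
        : > "${out_prefix}_${mate}.fastq"
        # Runs are appended in the same order for both mates,
        # so read pairs stay in sync between the two output files.
        for run in "$@"; do
            cat "${run}_${mate}.fastq" >> "${out_prefix}_${mate}.fastq" || return 1
        done
    done
}
```

For example, `merge_runs HCC1954 SRR2532344 SRR925710` would write `HCC1954_1.fastq` and `HCC1954_2.fastq`; the output prefix is chosen here only for illustration.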

Run the following command to download the genome data and RNA-seq data. Make sure fastq-dump (from SRA Toolkit) is on your PATH.

./scripts/downloadHCC.sh <output directory>
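Since the download depends on fastq-dump being reachable, it can help to check for it first. The sketch below is one way to do so; `require_tools` is a hypothetical helper and is not provided by this repository.

```shell
# Hypothetical helper (not part of the repository): report every listed
# tool that cannot be found on PATH, and fail if any is missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `require_tools fastq-dump` returns a non-zero status when the tool is missing, so it can be chained with `&&` in front of the download command.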

Software Used

We don't provide scripts for automatic installation of the above tools. Please follow each tool's instructions for installation and reference preparation. To prepare the reference data for each fusion-gene detection tool, we use the following settings in the manuscript:

Workflow Description

image of workflow with studied cell line

Re-producing the Result

Run the following command to reproduce the result:

./scripts/runHCC.sh 

ON TCGA DATA

Software Used

Workflow Description

image of workflow with studied cell line

Re-producing the Result

We use Ensembl genome 85 and the corresponding gene annotation for alignment. To download the genome and prepare the STAR index, run the following command:

./scripts/downloadTCGA_genome.sh <number threads> <output directory>

The downloaded reference files will have the following structure in the output directory:

The barcodes of the TCGA samples we used are located in dataTCGA.tgz. After obtaining the RNA-seq fastq files (read1 and read2 in separate files), make sure the STAR and SpeedSeq executables are on your PATH, then run the following command to execute the workflow:

./scripts/TCGAcommand.sh -p <number threads> -r1 <read1 fastq> -r2 <read2 fastq> -i <StarIndex folder> -g <gtf file> -o <output folder>
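The command above processes one sample at a time. For several samples, a loop along the following lines may be convenient. This is a sketch under assumptions, not a script shipped with the repository: `run_tcga_batch` and the `TCGA_CMD` variable are hypothetical, the paired files are assumed to be named `<sample>_1.fastq` and `<sample>_2.fastq`, and the default of 4 threads is arbitrary.

```shell
# Hypothetical batch wrapper around the documented TCGAcommand.sh call.
# Assumes paired files are named <sample>_1.fastq and <sample>_2.fastq.
TCGA_CMD="${TCGA_CMD:-./scripts/TCGAcommand.sh}"

run_tcga_batch() {
    fastq_dir="$1"; star_index="$2"; gtf="$3"; out_root="$4"
    threads="${5:-4}"                       # arbitrary default
    for r1 in "$fastq_dir"/*_1.fastq; do
        [ -e "$r1" ] || continue            # glob matched nothing
        r2="${r1%_1.fastq}_2.fastq"
        sample=$(basename "${r1%_1.fastq}")
        if [ ! -f "$r2" ]; then
            echo "skipping $sample: missing $r2" >&2
            continue
        fi
        # One output folder per sample, using the flags documented above.
        "$TCGA_CMD" -p "$threads" -r1 "$r1" -r2 "$r2" \
            -i "$star_index" -g "$gtf" -o "$out_root/$sample" || return 1
    done
}
```

Called as `run_tcga_batch <fastq folder> <StarIndex folder> <gtf file> <output folder> <number threads>`, it stops at the first sample whose workflow fails and skips samples that lack a read2 file.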