Awesome

<img align="left" src="./images/icon.jpg"> TBSP: Trajectory Inference Based on SNP information.

INTRODUCTION
PREREQUISITES
INSTALLATION
USAGE
INPUTS AND PRE-PROCESSINGS
OUTPUTS
EXAMPLES
INTEGRATION WITH EXISTING METHODS

INTRODUCTION

<div style="text-align: justify"> Several recent studies focus on the inference of developmental and response trajectories from single cell RNA-Seq (scRNA-Seq) data. A number of computational methods, often referred to as pseudo-time ordering, have been developed for this task. Recently, CRISPR has also been used to reconstruct lineage trees by inserting random mutations. However, both approaches suffer from drawbacks that limit their use. Here we develop a method to detect significant, cell type specific, sequence mutations from scRNA-Seq data. We show that only a few mutations are enough for reconstructing good branching models. Integrating these mutations with expression data further improves the accuracy of the reconstructed models. </div>

flowchart

PREREQUISITES

python (python 2 and python 3 are both supported)
It was installed by default for most Linux distribution and MAC.
If not, please check https://www.python.org/downloads/ for installation instructions.
Python packages dependencies:
-- scikit-learn
-- scipy
-- numpy
-- matplotlib
-- networkx
-- pyBigWig
-- Biopython
-- decorator

Note that networkx might not work well with some versions of decorator. In that case, consider upgrading or downgrading decorator. For example, "pip install decorator==5.0.7" should solve the problem.

other dependencies:
-- python-dev (python2) or python3-dev (python3)
It can be installed easily for most linux distributions. For example, debian/ubuntu:
```
sudo apt-get install python-dev
or 
sudo apt-get install python3-dev
```
For Macos, it was installed by default. The python setup.py script (or pip) will try to install these packages automatically. However, please install them manually if, by any reason, the automatic installation fails.
Platform:
Macos and Linux verified. For windows, the dependent pyBigWig package is not available.
However, the Docker version of TBSP is available for all platforms.

INSTALLATION

There are 4 options to install tbsp.

Option 1: setup.py script installation from download directory
cd to the downloaded scdiff package root directory
```
$cd tbsp
```
run python setup to install
```
$python setup.py install
```
MacOS or Linux users might need the sudo/root access to install. Users without the root access can install the package using the pip/easy_install with a --user parameter (install python libraries without root)．
```
$sudo python setup.py install 
```
use python3 instead of python in the above commands to install if using python3.

Option 2: pip installation from Github:

python 2:

$sudo pip install  --trusted-host github.com --upgrade http://github.com/phoenixding/tbsp/zipball/master

python 3:

$sudo pip3 install --trusted-host github.com --upgrade http://github.com/phoenixding/tbsp/zipball/master

Option 3: pip installation from PyPI:

python 2:
```
$sudo pip install  --trusted-host pypi.org --trusted-host files.pythonhosted.org --upgrade tbsp
```
python 3:
```
$sudo pip3 install  --trusted-host pypi.org --trusted-host files.pythonhosted.org --upgrade tbsp
```
The above pip installation options should be working for Linux and MacOS systems.
For MacOS users, it's recommended to use python3 installation. The default python2 in MacOS has some compatibility issues with a few dependent libraries. The users would have to install their own version of python2 (e.g. via Anocanda) if they prefer to use python2 in MacOS.
Option 4: Installation as Docker images (Most robust way, no dependencies on all platforms)
Users would need to install Docker to run the tbsp image. Please refer to Docker installation page for detailed installation instructions.

To install tbsp image:
```
$sudo docker pull johnding1986/tbsp 
```
To run tbsp image :
```
$sudo docker run -v /PATH_TO_FILE_DIRECTORY:/tbsp  johnding1986/tbsp 
```
Users would need to use the -v command to mount local directories.

USAGE

usage: tbsp.py [-h] -i IVCF [-b [IBW]] [-k KCLUSTER] [-l [CELL_LABEL]] -o
               OUTPUT [--cutl CUTL] [--cuth CUTH] [--greedycut GREEDYCUT]
               [--cutc CUTC] [--maxiter MAXITER]

optional arguments:
  -h, --help            show this help message and exit
  -i IVCF, --ivcf IVCF  Required,directory with all input .vcf files. This
                        specifies the directory of SNP files (.vcf) for the
                        cells (one .vcf file for each cell). These .vcf files
                        can be obtained using the provided bam2vcf script or
                        other RNA-seq variant calling pipelines preferred by
                        the users.
  -b [IBW], --ibw [IBW]
                        Optional,directory with all input bigwig (.bw) files
                        with the information about the number of aligned reads
                        at each genomic position. These bigwig files are used
                        to filter the SNPs, which are redundant to expression
                        information.
  -k KCLUSTER, --kcluster KCLUSTER
                        Optional, number of clusters, Integer. If not
                        specified, the program will choose the k with best
                        silhouette score.
  -l [CELL_LABEL], --cell_label [CELL_LABEL]
                        Optional, labels for the cells. This is used only to
                        annotate the cells with known information, not used
                        for building the model.
  -o OUTPUT, --output OUTPUT
                        Required,output directory
  --cutl CUTL           Optional, lower bound cutoff to remove potential false
                        positive SNPs, default=0.1
  --cuth CUTH           Optional, upper bound cutoff to remove baseline SNPs,
                        which are common in most cells, default=0.8
  --greedycut GREEDYCUT
                        Optional, the stopping cutoff for the greedy search of
                        candidate SNPs, default=0.05 (less than 0.05 score
                        improvement). A larger cutoff means less strict SNP
                        candidate search
  --cutc CUTC           Optional, convergence cutoff, a smaller cutoff
                        represents a stricter convergence
                        criterion,default=0.001
  --maxiter MAXITER     Optional, the maximal number of iterations allowed

INPUTS AND PRE-PROCESSINGS

-i:
Required input, this specifies the directory of all SNP(.vcf) files. We recommend using GATK RNA-seq variant calling pipeline to call the vcfs from .bam (mapped reads) files. Users are also allowed to use the SNPs (.vcfs) identified by programs of their preferences (e.g. freebayes).
We also provied a GATK based SNP calling script (bam2vcf.py) at bam2vcf, which automates the SNP calling procedure from .bam files to .vcf files. Please use "python bam2vcf.py --help" to see the script usage.
-b:
Optional input, this specifies the directory of all bigwig (.bw) files. We provided the script bam2bw.py under bam2bw directory to convert the bam files to bigwig files. This files are used to filter SNPs, potentially redundant to expression.
-l:
Optional input, this specifies the labels for the cells. File format (tab-delimited):

cell1	label1
cell2	label2

These cell labels are only used to annotate the cells in the trajectory. The other optional parameters are specified above.

OUTPUTS

GroupCells.txt:
A text file, which describes the cells in each cluster.

Format:

Cell_ID	Cluster_ID
SRR1931024	cluter:0
SRR1930999	cluter:0
SRR1930977	cluter:0
SRR1931041	cluter:0
SRR1931012	cluter:0
SRR1930945	cluter:0
SRR1931003	cluter:0
SRR1931002	cluter:0
SRR1931004	cluter:0
..

SNP_matrix.tsv:
The SNP matrix for all the cells. Row: SNPs Column: Cells Value: Binary (0/1), which indicates whther the SNP is included in the cell.

SNP	SRR1931024	SRR1930999	SRR1930977	SRR1931041	SRR1931012	SRR1930945	SRR1931003	SRR1931002	SRR1931004	SRR1931009	SRR1931031	SRR1930994	SRR1930995	SRR1930992
chr18,31438715	0	0	0	0	0	0	0	0	0	0	0	0	0	0
chr13,99421798	0	0	0	0	0	0	0	0	0	0	0	0	0	0
chr18,48047497	0	0	0	0	0	0	0	0	0	0	0	0	0	0
chr1,98777608	0	1	1	1	0	1	1	0	0	0	0	0	1	0
chr12,75631154	0	0	0	0	0	0	0	0	0	0	0	0	0	0
chr2,80837890	0	0	0	0	0	0	0	0	0	0	0	0	0	0
chr10,56497362	1	0	0	1	0	1	1	0	1	1	0	0	1	0
chr9,42254324	0	0	0	0	0	0	0	0	0	0	1	0	1	0
chr10,56497777	0	0	0	0	0	0	1	1	0	1	0	0	0	1
chr9,42254361	0	0	0	0	0	0	0	0	0	0	1	0	1	0
chr18,28188943	0	1	1	1	1	1	1	1	0	0	1	1	0	1
chr14,46775753	1	0	0	1	0	1	0	1	0	1	1	0	1	1
chr4,141015053	0	0	0	1	0	0	0	0	0	0	0	0	0	0
chr6,3201327	0	0	1	0	0	0	1	1	1	0	0	0	0	0
chr2,152737258	0	1	0	0	1	0	1	1	0	1	1	0	1	0
chr4,150417174	1	0	0	0	1	0	1	0	1	0	0	1	0	0
chr6,136872487	0	1	1	1	0	0	1	1	0	0	1	0	0	0
chr8,41341106	1	1	1	0	1	1	1	1	1	0	1	1	0	1
chr17,72849135	1	1	1	1	0	0	1	1	1	1	1	1	1	1
chr17,6132279	0	0	0	0	0	0	0	0	0	0	0	0	0	0
chr8,102864840	0	0	0	0	0	0	0	0	0	0	0	0	0	0
chr13,74406968	1	1	1	1	0	1	1	1	1	1	1	0	1	1
chr11,45829261	0	1	1	0	0	1	1	0	0	1	0	0	0	0
chr8,54522180	0	1	1	0	0	1	0	0	0	0	0	0	0	0
chr9,67023057	0	1	1	0	1	1	0	1	0	0	1	1	1	1
chr10,128076602	1	1	1	1	1	0	1	1	1	1	1	1	1	1
chr6,4540674	0	1	1	1	1	1	1	0	1	1	1	0	1	1
chr17,70996716	1	1	1	1	0	1	1	1	1	1	1	1	1	1
chr5,115345677	0	1	1	1	1	1	0	1	0	0	0	1	1	1
chr2,136782052	0	0	0	0	0	0	0	0	0	0	0	0	0	0

SNP_matrix.jpg:
The SNP matrix in jpg image.

Trajectory.dat:

4	(-0.6168642633606496, -0.29504774213348317)
Inner4	(-0.01784069226314263, -0.19237987266625067)
2	(-1.0, 0.1907479586411944)
Inner5	(0.526511663382742, -0.048432121394020027)
6	(0.6167190582080249, -0.21090119687699874)
0	(-0.02807716265846153, -0.3564777531262085)
3	(-0.7661682030072015, 0.29873283057457906)
Inner3	(-0.4986251141938113, -0.13657170781011707)
Inner1	(0.785885623794461, 0.14046537917538365)
5	(0.8477737598817315, 0.3307492805568145)
1	(0.9509945159224894, 0.1308847244003101)
Inner2	(-0.8003091857061821, 0.14823022065879607)

First column: cluster id
second column: coordinates
TBSP trajectories were presented in the form of uprooted phylogenetic tree structure. The coordinates for each node(cluster) are given by the above Trajectory.dat result. With the coordinates, users can able to compute the length for each of the branches in the tree. User can setup a user-defined distance cutoff to find out major divergence events. An example cutoff can be set as top 25% of all branch lengths.

Trajectory.jpg:
Graph representation of Trajectory.dat

INTEGRATION WITH EXISTING METHODS

TBSP model provides a SNP MATRIX, which presents the SNP signature vector for each of the cells in the dataset. As we show in the paper, such SNP MATRIX is very informative for trajectory inference. The cell trajectories can be improved by integrating SNP matrix with expression data.

There are many ways to utilize the SNP matrix information, here we used a simple example to demonstrate the integration.

MONOCLE 2 is used widely for trajectory inference and has very good performance. However, it's based on only single-cell expression data and thus may be limited in many scenarios.

For example, the following is the Monocle 2 results on a 2016 neuron reprogramming single-cell dataset (https://www.ncbi.nlm.nih.gov/pubmed/27281220) images/monocle_neuron.jpg . The processed expression data can be found in the integration_example directory.

(1) In the above results, we observe that d2_induced cells are in a separate branch in the bottom right while the neuron cells are on the branch at top right, which contradicts the findings in the original study (Trajectory: MEF->d2_intermediate-> d2_induced-> d5_intermeidate-> d5_earlyiN->Neuron), in which the d2_induced cells are serving as the progenitors to the neuron cells.
(2) Running TBSP on the same dataset, we have obtained 36 SNPs as shown in integration_example/SNP_matrix.tsv.
(3) Combine the expression features and SNP features. There are multiple ways to combine these two types of information. In the paper, we have discussed the strategy of refining the cell assignment in the expression-based trajectories using SNP information. To be more specific, we integrate the SNP information to re-calculate the likelihood when re-assigning the cells to the trajectories. Here, we discuss the most naive way to integrate SNP features with expression features: Merge the features directly. For each cell, we put together the expression features (gene expression levels associated with cell) and SNP features (Binary SNP features 0/1 associated with the cell).

expression:
cell	gene1	gene2	gene3 ...
c1	1.6	2.4	3.8 ...
c2	2.8	4.8	6.4	...

+snp:
cell snp1	snp2	snp3
c1	1	0	1	...
c2	0	1	0	...

=>combined:
combined_info	gene1	gene2	gene3	snp1	snp2	snp3	...
c1	1.6	2.4	3.8	1	0	1	...
c2	2.8	4.8	6.5	0	1	0	...

Please note that number of the SNP features (usually in the range of 30-100) is much smaller than the number of genes (expression features). Therefore, to put more weights on the SNP features, we need to over-sample the SNP features. In the above example, we oversampled the SNPs for 100 times (=> 3600 SNP features). The combined dataset for the above example can be found in the integration_example directory.

(4) Run Monocle on the SNP-integrated dataset. images/monocle_snp.jpg

The above SNP-integrated Monocle results on the neuron reprogramming dataset perfectly matches the findings in the original study (MEF->d2_intermediate-> d2_induced-> d5_intermeidate-> d5_earlyiN->Neuron). d2_induced cells nows are the progenitors of the neuron cells. Also, d5 cells are coming later than d2 cells. Even using very simple integration strategy as shown above, the cell trajectories can be significantly improved.

EXAMPLES

Direct Run:
- Example inputs:
  We provided example vcf files under examples folder. To run tbsp on the example data:
```
$tbsp -i examples/vcf_example -o example_out
```
- Example outputs:
  Example output files can be found under examples folder.
Docker Run:
We provided example vcf files under examples folder. To run docker tbsp on the example data:
```
$sudo docker run -v ~/Download/examples:/tbsp  johnding1986/tbsp -i /tbsp/vcf_example -o /tbsp/example_out 
```
where ~/Download/examples should be replaced by the parent directory of the folder of vcf files. The result files are the same as described above.

CREDITS

This software was developed by ZIV-system biology group @ Carnegie Mellon University.
Implemented by Jun Ding.

LICENSE

This software is under MIT license.

CONTACT

zivbj at cs.cmu.edu
jund at cs.cmu.edu