<img align="left" src="./images/icon.jpg"> TBSP: Trajectory Inference Based on SNP information.
License: MIT

Table of Contents

  1. INTRODUCTION
  2. PREREQUISITES
  3. INSTALLATION
  4. USAGE
  5. INPUTS AND PRE-PROCESSINGS
  6. OUTPUTS
  7. EXAMPLES
  8. INTEGRATION WITH EXISTING METHODS

INTRODUCTION

<div style="text-align: justify"> Several recent studies focus on inferring developmental and response trajectories from single-cell RNA-Seq (scRNA-Seq) data. A number of computational methods, often referred to as pseudo-time ordering, have been developed for this task. Recently, CRISPR has also been used to reconstruct lineage trees by inserting random mutations. However, both approaches suffer from drawbacks that limit their use. Here we develop a method to detect significant, cell-type-specific sequence mutations from scRNA-Seq data. We show that only a few mutations are enough to reconstruct good branching models, and that integrating these mutations with expression data further improves the accuracy of the reconstructed models. </div>

flowchart

PREREQUISITES

Note that networkx might not work well with some versions of decorator. In that case, consider upgrading or downgrading decorator. For example, "pip install decorator==5.0.7" should solve the problem.
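
To see which versions of networkx and decorator are currently installed before upgrading or downgrading, a small check like the following may help. This is only a sketch built on the standard library; the helper names `installed_version` and `report_versions` are made up for illustration and are not part of tbsp.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages=("networkx", "decorator")):
    """Map each package name to its installed version (None if not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```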

INSTALLATION

There are 4 options to install tbsp.

USAGE

usage: tbsp.py [-h] -i IVCF [-b [IBW]] [-k KCLUSTER] [-l [CELL_LABEL]] -o
               OUTPUT [--cutl CUTL] [--cuth CUTH] [--greedycut GREEDYCUT]
               [--cutc CUTC] [--maxiter MAXITER]

optional arguments:
  -h, --help            show this help message and exit
  -i IVCF, --ivcf IVCF  Required,directory with all input .vcf files. This
                        specifies the directory of SNP files (.vcf) for the
                        cells (one .vcf file for each cell). These .vcf files
                        can be obtained using the provided bam2vcf script or
                        other RNA-seq variant calling pipelines preferred by
                        the users.
  -b [IBW], --ibw [IBW]
                        Optional,directory with all input bigwig (.bw) files
                        with the information about the number of aligned reads
                        at each genomic position. These bigwig files are used
                        to filter the SNPs, which are redundant to expression
                        information.
  -k KCLUSTER, --kcluster KCLUSTER
                        Optional, number of clusters, Integer. If not
                        specified, the program will choose the k with best
                        silhouette score.
  -l [CELL_LABEL], --cell_label [CELL_LABEL]
                        Optional, labels for the cells. This is used only to
                        annotate the cells with known information, not used
                        for building the model.
  -o OUTPUT, --output OUTPUT
                        Required,output directory
  --cutl CUTL           Optional, lower bound cutoff to remove potential false
                        positive SNPs, default=0.1
  --cuth CUTH           Optional, upper bound cutoff to remove baseline SNPs,
                        which are common in most cells, default=0.8
  --greedycut GREEDYCUT
                        Optional, the stopping cutoff for the greedy search of
                        candidate SNPs, default=0.05 (less than 0.05 score
                        improvement). A larger cutoff means less strict SNP
                        candidate search
  --cutc CUTC           Optional, convergence cutoff, a smaller cutoff
                        represents a stricter convergence
                        criterion,default=0.001
  --maxiter MAXITER     Optional, the maximal number of iterations allowed
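
When tbsp is driven from another Python script, the options listed above can be assembled into an argument list. The sketch below uses only the flags shown in the help text; the helper name `build_tbsp_command`, the `python tbsp.py` launcher, and all paths are assumptions for illustration and may differ depending on how tbsp was installed.

```python
import shlex


def build_tbsp_command(ivcf, output, ibw=None, kcluster=None, cell_label=None,
                       cutl=None, cuth=None, greedycut=None, cutc=None,
                       maxiter=None, script="tbsp.py"):
    """Assemble an argument list for tbsp.py from the documented options."""
    # -i and -o are the two required arguments.
    cmd = ["python", script, "-i", str(ivcf), "-o", str(output)]
    optional = [("-b", ibw), ("-k", kcluster), ("-l", cell_label),
                ("--cutl", cutl), ("--cuth", cuth), ("--greedycut", greedycut),
                ("--cutc", cutc), ("--maxiter", maxiter)]
    for flag, value in optional:
        if value is not None:  # omitted options fall back to tbsp's defaults
            cmd.extend([flag, str(value)])
    return cmd


# Hypothetical directories and file names, for illustration only.
cmd = build_tbsp_command("vcf_dir", "tbsp_out", ibw="bw_dir",
                         kcluster=3, cell_label="labels.txt")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to launch the run.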
                                             

INPUTS AND PRE-PROCESSINGS

cell1	label1
cell2	label2

These cell labels are only used to annotate the cells in the trajectory. The other optional parameters are specified above.
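
A cell label file in the two-column layout shown above can be read or validated with a few lines of Python. This is a minimal sketch that assumes tab-delimited lines with no header row; `read_cell_labels` is a hypothetical helper, not a function provided by tbsp.

```python
import csv
import io


def read_cell_labels(handle):
    """Parse tab-delimited 'cell<TAB>label' lines into a dict of cell -> label."""
    labels = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected 'cell<TAB>label', got: {row!r}")
        labels[row[0].strip()] = row[1].strip()
    return labels


# In-memory stand-in for a label file with the layout shown above.
example = io.StringIO("cell1\tlabel1\ncell2\tlabel2\n")
print(read_cell_labels(example))
```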

OUTPUTS

INTEGRATION WITH EXISTING METHODS

The TBSP model provides a SNP matrix, which holds the SNP signature vector for each cell in the dataset. As we show in the paper, this SNP matrix is very informative for trajectory inference. Cell trajectories can be improved by integrating the SNP matrix with expression data.

There are many ways to use the SNP matrix information; here we use a simple example to demonstrate the integration.

Monocle 2 is widely used for trajectory inference and performs very well. However, it relies on single-cell expression data alone and thus may be limited in many scenarios.

For example, the following shows the Monocle 2 results on a 2016 neuron reprogramming single-cell dataset (https://www.ncbi.nlm.nih.gov/pubmed/27281220): images/monocle_neuron.jpg. The processed expression data can be found in the integration_example directory.

(1) In the above results, the d2_induced cells form a separate branch at the bottom right while the neuron cells lie on the branch at the top right. This contradicts the findings of the original study (Trajectory: MEF->d2_intermediate->d2_induced->d5_intermediate->d5_earlyiN->Neuron), in which the d2_induced cells serve as progenitors of the neuron cells.
(2) Running TBSP on the same dataset, we obtained 36 SNPs, as shown in integration_example/SNP_matrix.tsv.
(3) Combine the expression features and SNP features. There are multiple ways to combine these two types of information. In the paper, we discuss the strategy of refining the cell assignment in expression-based trajectories using SNP information: specifically, we integrate the SNP information to recalculate the likelihood when reassigning cells to the trajectories. Here, we discuss the most naive way to integrate SNP features with expression features: merging the features directly. For each cell, we put together its expression features (gene expression levels) and its SNP features (binary 0/1 values).

expression:
cell	gene1	gene2	gene3	...
c1	1.6	2.4	3.8	...
c2	2.8	4.8	6.4	...

+snp:
cell	snp1	snp2	snp3	...
c1	1	0	1	...
c2	0	1	0	...

=>combined:
combined_info	gene1	gene2	gene3	snp1	snp2	snp3	...
c1	1.6	2.4	3.8	1	0	1	...
c2	2.8	4.8	6.4	0	1	0	...

Note that the number of SNP features (usually in the range of 30-100) is much smaller than the number of genes (expression features). Therefore, to put more weight on the SNP features, we need to oversample them. In the above example, we oversampled the SNPs 100 times (36 SNPs x 100 = 3600 SNP features). The combined dataset for the above example can be found in the integration_example directory.

(4) Run Monocle on the SNP-integrated dataset. images/monocle_snp.jpg

The above SNP-integrated Monocle results on the neuron reprogramming dataset match the findings of the original study (MEF->d2_intermediate->d2_induced->d5_intermediate->d5_earlyiN->Neuron). The d2_induced cells are now the progenitors of the neuron cells, and the d5 cells come later than the d2 cells. Even with a very simple integration strategy like the one shown above, the cell trajectories can be significantly improved.
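
The direct feature merge from step (3), including the SNP oversampling, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions and not the code used to build the files in integration_example: the helper `merge_features` is hypothetical, and oversampling is implemented here by repeating each cell's whole SNP vector, which is one possible reading of "oversampled".

```python
def merge_features(expression, snps, oversample=1):
    """Append each cell's SNP vector, repeated `oversample` times, to its expression vector.

    expression: dict mapping cell -> list of expression values
    snps:       dict mapping cell -> list of binary (0/1) SNP values
    Cells missing from either input are dropped (an assumption of this sketch).
    """
    if oversample < 1:
        raise ValueError("oversample must be >= 1")
    combined = {}
    for cell, expr_values in expression.items():
        if cell in snps:
            # Repeating the SNP block gives the SNP features more weight
            # relative to the much larger set of expression features.
            combined[cell] = list(expr_values) + list(snps[cell]) * oversample
    return combined


# Toy values taken from the small example tables above.
expression = {"c1": [1.6, 2.4, 3.8], "c2": [2.8, 4.8, 6.4]}
snps = {"c1": [1, 0, 1], "c2": [0, 1, 0]}
combined = merge_features(expression, snps, oversample=2)
print(combined["c1"])
```

With 36 SNPs and `oversample=100`, each cell would gain 3600 SNP columns, matching the count quoted above.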

EXAMPLES

CREDITS

This software was developed by ZIV-system biology group @ Carnegie Mellon University.
Implemented by Jun Ding.

LICENSE

This software is released under the MIT license.

CONTACT

zivbj at cs.cmu.edu
jund at cs.cmu.edu