
		 ____  ____  ____  _  _____ _____
		/ ___\/   _\/  _ \/ \/    //    /
		|    \|  /  | | \|| ||  __\|  __\
		\___ ||  \__| |_/|| || |   | |   
		\____/\____/\____/\_/\_/   \_/    

License: MIT

!!!!NEW!!!

For large single-cell datasets (e.g., > 2k cells), please use the new version of scdiff (scdiff2) at: https://github.com/phoenixding/scdiff2

SCDIFF 2.0 uses HDF5, sparse matrices, and multi-threading to reduce the program's resource requirements while improving efficiency. It also incorporates many new clustering and trajectory inference methods for more comprehensive and accurate predictions.

A few highlights:
(1) VERY EFFICIENT: Analyzes 40k cells (~10k genes/cell) within 1-2 hours (--ncores 12 --maxloop 0)
(2) VERY FLEXIBLE: It is composed of many moving pieces, each of which can be customized by the user.

INTRODUCTION

<div style="text-align: justify"> Most existing single-cell trajectory inference methods rely primarily on the assumption that descendant cells are similar to their parents in terms of gene expression levels. This assumption does not always hold for in vivo studies, which often include infrequently sampled, unsynchronized, and diverse cell populations. Thus, additional information may be needed to determine the correct ordering and branching of progenitor cells and the set of transcription factors (TFs) that are active during advancing stages of organogenesis. To enable such modeling we developed scdiff, which integrates expression similarity with regulatory information to reconstruct dynamic developmental cell trajectories.

SCDIFF is a package written in Python and JavaScript, designed to analyze cell differentiation trajectories using time-series single-cell RNA-seq data. It predicts the transcription factors and differential genes associated with the cell differentiation trajectories. It also visualizes the trajectories as an interactive tree-structure graph, in which nodes represent sub-populations of cells (clusters).

</div>


PREREQUISITES

The python setup.py script (or pip) will try to install these packages automatically. However, please install them manually if, for any reason, the automatic installation fails.

INSTALLATION

There are 3 options to install scdiff.

The above pip installation options should work on Linux, Windows, and macOS systems.
For macOS users, a Python 3 installation is recommended. The default Python 2 on macOS has compatibility issues with a few of the dependent libraries. Users who prefer Python 2 on macOS will have to install their own version (e.g., via Anaconda).

USAGE

scdiff.py [-h] -i INPUT -t TF_DNA -k CLUSTERS -o OUTPUT [-l LARGE]
                 [-s SPEEDUP] [-d DSYNC] [-a VIRTUALANCESTOR]
                 [-f LOG2FOLDCHANGECUT] [-e ETFLISTFILE] [--spcut SPCUT]

	-h, --help            show this help message and exit

	-i INPUT, --input INPUT, required 
						input single cell RNA-seq expression data
						
	-t TF_DNA, --tf_dna TF_DNA, required
						TF-DNA interactions used in the analysis
						
	-k CLUSTERS, --clusters CLUSTERS, required
						How to determine the number of clusters for each time
						point: user-defined or auto. If user-defined, specify
						the path to the configuration file. If set to "auto",
						scdiff learns the parameters automatically.
						
	-o OUTPUT, --output OUTPUT, required
						output folder to store all results
						
	-s SPEEDUP, --speedup SPEEDUP (1/None), optional
						If set to 'True' or '1', scdiff speeds up the run
						by reducing the number of iterations.
						
	-l LARGETYPE, --largetype LARGETYPE (1/None), optional
						If set to 'True' or '1', scdiff uses LargeType mode to
						improve running efficiency (both memory and time).
						Because spectral clustering does not scale to large data,
						PCA + K-Means clustering is used instead. The running speed improves
						significantly, but the performance is slightly worse. If there are
						more than 2k cells per time point on average, it is highly
						recommended to use this parameter to improve time and memory efficiency.
						
						
	-d DSYNC, --dsync DSYNC (1/None), optional
						If set to 'True' or '1', cell synchronization is disabled.
						Synchronization can be disabled if the user believes that cells at the
						same time point are similar in terms of differentiation/development.

	-a VIRTUALANCESTOR, --virtualAncestor VIRTUALANCESTOR (1/None), optional
						scdiff requires an 'Ancestor' node (the starting node;
						all other nodes are its descendants). By default,
						the 'Ancestor' node is set to the first time point. The underlying assumption is
						that the cells at the first time point are not yet differentiated
						(or are at a very early stage of differentiation with no clear sub-groups,
						so all cells at the first time point belong to the same cluster).
						
						If this is not the case, users can set -a to 'True' or '1' to enable
						a virtual ancestor before the first time point. The expression of the
						virtual ancestor is the median expression of all cells at the first time point.
						 
	-f LOG2FOLDCHANGECUT, --log2foldchangecut LOG2FOLDCHANGECUT (Float), optional
						By default, scdiff uses a log2 fold change of 1 (i.e., a 2^1 = 2-fold change)
						as the cutoff for differential genes (together with a t-test p-value cutoff of 0.05).
						Users can customize this cutoff for their
						application scenario (e.g., a log2 fold change of 1.5).
						
	-e ETFLISTFILE, --etfListFile ETFLISTFILE (String), optional
						By default, scdiff recognizes 1.6k
						TFs (collected from human and mouse). Users can
						provide a customized list of TFs instead using this
						option. It specifies the path to the TF list file, in
						which each line is a TF name. Target information
						is not required for these TFs; the list is used to infer
						eTFs (TFs predicted from their own expression rather than that of their targets).
						
	--spcut SPCUT (Float), optional
						By default, scdiff uses p-value=0.05
						as the cutoff to decide whether the DistanceToAncestor
						(DTA) values of clusters differ significantly.
						Clusters with similar DTA are placed at the same
						level.
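
As a rough illustration of the -a option described above, the sketch below computes a virtual ancestor profile as the per-gene median over all cells at the first time point. The helper name `virtual_ancestor_expression` and the toy data layout (a list of `(time_point, expression_list)` tuples) are assumptions made for this example and are not part of scdiff's API; that the median is taken gene by gene is likewise an assumption.

```python
from statistics import median


def virtual_ancestor_expression(cells):
    """Hypothetical helper: per-gene median over cells at the first time point.

    cells: list of (time_point, expression_list) tuples (assumed layout).
    """
    first_time = min(t for t, _ in cells)
    first_cells = [expr for t, expr in cells if t == first_time]
    # zip(*rows) transposes cell rows into per-gene columns.
    return [median(column) for column in zip(*first_cells)]


# Toy data: three cells at time 14, one cell at time 16.
toy_cells = [
    (14, [1.0, 0.0, 3.0]),
    (14, [2.0, 4.0, 3.0]),
    (14, [9.0, 2.0, 0.0]),
    (16, [5.0, 5.0, 5.0]),
]
ancestor = virtual_ancestor_expression(toy_cells)
print(ancestor)
```

Only the cells at the earliest time point contribute; the cell at time 16 is ignored.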

                        

INPUTS AND PRE-PROCESSING

scdiff takes two required input files (-i/--input and -t/--tf_dna), two optional files (a cluster configuration file passed via -k/--clusters and a TF list file passed via -e/--etfListFile), and a few other optional parameters.

For other scdiff optional parameters, please refer to the usage section.
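
The sketch below shows one way to parse an expression file before handing it to scdiff. The layout it assumes (tab-delimited, one header row, then one row per cell with cell ID, time point, and label in the first three columns followed by gene expression values) is inferred from the cell-reading example in the MODULES & FUNCTIONS section. The header names, gene names, and the `read_expression` helper are hypothetical.

```python
import io

# Toy file contents; column names and values are made up for illustration.
toy_expression = (
    "cell_id\ttime\tlabel\tGeneA\tGeneB\n"
    "c1\t14\tNA\t0.5\t2.25\n"
    "c2\t16\tNA\t1.0\t0.0\n"
)


def read_expression(handle):
    """Hypothetical reader for the assumed tab-delimited expression layout."""
    header = handle.readline().rstrip("\n").split("\t")
    genes = header[3:]  # gene names start at the fourth column
    cells = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        cells.append({
            "id": fields[0],
            "time": float(fields[1]),
            "label": fields[2],
            "expression": [float(x) for x in fields[3:]],
        })
    return genes, cells


genes, cells = read_expression(io.StringIO(toy_expression))
print(genes, len(cells))
```

Passing an open file handle instead of `io.StringIO` reads a real file the same way.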

RECOMMENDED PIPELINE

Please follow the steps below to analyze single-cell data.

$python semiAutomaticK.py -i example.E

images/MBE2.tsne.pdf

RESULTS AND VISUALIZATION

The results are written to the specified output directory. The predicted model is provided as a JSON file, which is visualized by the provided JavaScript. Please use the Chrome, Firefox, or Safari browser for the best experience.


The following is the manual for the visualization page.

Visualization Config (Left panel):

Visualization Canvas (Right Panel):

EXAMPLES

Run scdiff on given time-series single cell RNA-seq data.
An example script exampleRun.py is provided under the example directory.

1) Run with automatic config

$ scdiff -i example.E -t example.tf_dna -k auto -o example_out

The TF-DNA directory provides the TF-DNA interaction file used in this study.

2) Run with user-defined config

$scdiff -i example.E  -t example.tf_dna -k example.config -o example_out

The formats of example.E and example.tf_dna are the same as described above.

example.config specifies custom initial clustering parameters. Use it when prior knowledge is available. For example, if the number of sub-populations at each time point is known, the clustering parameters can be specified directly in example.config, which gives better performance.

example.config format (tab-delimited):

time	#_of_clusters

For example:

14	1
16	2
18	5

If no prior knowledge about the sub-populations at each time point is available, use the automatic initial clustering: -k auto.
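
As an illustration of the config format above, the following sketch parses such a file into a mapping from time point to number of clusters. The `read_cluster_config` helper is hypothetical and is not how scdiff itself reads the file; it only shows the two-column structure and a basic sanity check.

```python
import io


def read_cluster_config(handle):
    """Hypothetical parser: map each time point to its number of clusters."""
    config = {}
    for line in handle:
        fields = line.split()  # tolerates tabs or runs of spaces
        if not fields:
            continue  # skip blank lines
        time_point, n_clusters = fields[0], int(fields[1])
        if n_clusters < 1:
            raise ValueError("number of clusters must be >= 1: " + line.strip())
        config[time_point] = n_clusters
    return config


# Same values as the example config shown above.
toy_config = "14\t1\n16\t2\n18\t5\n"
cluster_config = read_cluster_config(io.StringIO(toy_config))
print(cluster_config)
```

Time points are kept as strings here so that they can be matched against the input file without assuming a numeric format.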

3) Run scdiff on large single cell dataset

$scdiff -i example.E -t example.tf_dna -k auto -o example_out -l True -s True

The -i, -t, -k, and -o parameters were discussed above.
For very large datasets (e.g., more than 20k cells), it is recommended to filter out genes with very low variance. This significantly cuts down the memory cost and running time.
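
A minimal sketch of such a variance filter is shown below, assuming the expression data is already loaded as a list of per-cell rows. The helper name `filter_low_variance_genes` and the threshold of 0.5 are arbitrary choices for this example and are not values recommended by scdiff.

```python
from statistics import pvariance


def filter_low_variance_genes(gene_names, expression_rows, min_variance):
    """Hypothetical helper: keep genes whose variance across cells is >= min_variance."""
    columns = list(zip(*expression_rows))  # one column per gene
    keep = [i for i, col in enumerate(columns) if pvariance(col) >= min_variance]
    kept_genes = [gene_names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in expression_rows]
    return kept_genes, kept_rows


# Toy data: GeneA is constant, GeneC is nearly constant, GeneB varies.
genes = ["GeneA", "GeneB", "GeneC"]
rows = [
    [1.0, 5.0, 0.0],
    [1.0, 0.0, 0.1],
    [1.0, 9.0, 0.0],
]
kept_genes, kept_rows = filter_low_variance_genes(genes, rows, min_variance=0.5)
print(kept_genes)
```

For real datasets of this size, a vectorized implementation (for example with NumPy) would be a more practical choice than pure Python.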

4) Run scdiff on large single cell dataset with synchronization disabled and virtual ancestor

$scdiff -i example.E -t example.tf_dna -k auto -o example_out -l True -s True -d True -a True

The -i, -t, -k, -o, -l, and -s parameters were defined above.
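
The command lines in the examples above can also be assembled from a script. The sketch below builds the argument list for the flags shown in this section; the `build_scdiff_command` helper is hypothetical, and actually running the command requires scdiff to be installed and on the PATH.

```python
def build_scdiff_command(expression, tf_dna, clusters, output,
                         large=False, speedup=False, dsync=False,
                         virtual_ancestor=False):
    """Hypothetical helper: assemble an scdiff command line as a list of arguments."""
    cmd = ["scdiff", "-i", expression, "-t", tf_dna, "-k", clusters, "-o", output]
    optional_flags = (
        (large, "-l"),
        (speedup, "-s"),
        (dsync, "-d"),
        (virtual_ancestor, "-a"),
    )
    for enabled, flag in optional_flags:
        if enabled:
            cmd += [flag, "True"]
    return cmd


cmd = build_scdiff_command(
    "example.E", "example.tf_dna", "auto", "example_out",
    large=True, speedup=True, dsync=True, virtual_ancestor=True,
)
print(" ".join(cmd))

# To run it (assumes scdiff is installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell quoting issues when file paths contain spaces.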

5) Example run result

The following link presents the results of an example run.
example_out

MODULES & FUNCTIONS

scdiff module

This Python module performs the single-cell differentiation analysis and builds the differentiation graph. Users can use it by importing the scdiff package in their own programs. In addition to the description below, a module testing example is provided in the example directory under the name moduleTestExample.py.

scdiff.Cell(Cell_ID, TimePoint, Expression,typeLabel,GeneList)<a id="cell"></a>
This class defines the cell.

Parameters:

Output:
A Cell class instance (with all information regarding a cell)

Attributes:

Example:

import scdiff
from scdiff.scdiff import *

# reading example cells ...
AllCells=[]
print("reading cells...")
with open("example.E","r") as f:
	line_ct=0
	for line in f:
		if line_ct==0:
			GL=line.strip().split("\t")[3:]
		else:
			line=line.strip().split("\t")
			iid=line[0]
			ti=float(line[1])
			li=line[2]
			ei=[round(float(item),2) for item in line[3:]]
			ci=scdiff.Cell(iid,ti,ei,li,GL)
			AllCells.append(ci)
		line_ct+=1
		print('cell:'+str(line_ct))

scdiff.Graph(Cells, tfdna, kc, largeType=None, dsync=None, virtualAncestor=None,fChangCut=1.0, etfile=None) <a id="graph"></a>
This class defines the differentiation graph.

Parameters:

Output:
A graph instance with all nodes and edges, which represents the differentiation structure for given inputs.

Attributes:

Example:

import scdiff
from scdiff.scdiff import *

print("testing scdiff.Graph module ...")
# create the graph using the scdiff.Graph class and the example cells built above
g1=scdiff.Graph(AllCells,"example.tf_dna",'auto')

scdiff.Clustering(Cells, kc,largeType=None)
This class represents the clustering.

Parameters:

Method: getClusteringPars()

import scdiff
from scdiff import *
Clustering_example=scdiff.Clustering(AllCells,'auto',None)
[dCK,dBS]=Clustering_example.getClusteringPars()

Method: performClustering()

import scdiff 
from scdiff import *
Clustering_example=scdiff.Clustering(AllCells,'auto',None)
Clusters=Clustering_example.performClustering()

scdiff.Cluster(Cells,TimePoint,Cluster_ID)<a id="cluster"></a>
This class defines the node in the differentiation graph.

Parameters:

Output: List of float. This function calculates the average gene expression of all cells in the cluster.

Attributes:

Example:

import scdiff 
from scdiff import *
cluster1=scdiff.Cluster([item for item in AllCells if item.T==14],14,'C1')
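
To make the cluster output concrete, the sketch below computes a per-gene average over the cells of one cluster. It works on plain lists rather than scdiff objects; the `average_expression` helper is hypothetical, and the assumption that the average is an arithmetic mean taken gene by gene is ours.

```python
def average_expression(expression_rows):
    """Hypothetical helper: per-gene arithmetic mean over the cells of one cluster."""
    n_cells = len(expression_rows)
    # zip(*rows) groups the values of each gene across all cells.
    return [sum(column) / n_cells for column in zip(*expression_rows)]


# Toy cluster with two cells and two genes.
cluster_rows = [
    [1.0, 4.0],
    [3.0, 0.0],
]
cluster_average = average_expression(cluster_rows)
print(cluster_average)
```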

scdiff.Path(fromNode,toNode,Nodes,dTD,dTG,dMb)

This class defines the edge in the differentiation graph.

Parameters:

Output: Graph edge instance.

Attributes:

Example:

import scdiff 
from scdiff import *
g1=scdiff.Graph(AllCells,"example.tf_dna",'auto')
p1=scdiff.Path(g1.Nodes[0],g1.Nodes[1],g1.Nodes,g1.dTD,g1.dTG,g1.dMb)

viz module

This module is designed to visualize the differentiation graph structure using JavaScript.

scdiff.viz(exName,Graph,output)

Parameters:

Output: a visualization folder with HTML page, JavaScript Code and Graph Structure in JSON format.

Example:

import os
import scdiff
from scdiff import *
print ("testing scdiff.viz module ...")
# visualizing graph using scdiff.viz module 
os.mkdir("e1_out")
scdiff.viz("example",g1,"e1_out")

The resulting HTML visualization page can then be found under the 'e1_out' directory.

CREDITS

This software was developed by ZIV-system biology group @ Carnegie Mellon University.
Implemented by Jun Ding.

Please cite our paper: "Reconstructing differentiation networks and their regulation from time series single cell expression data".

LICENSE

This software is released under the MIT license.
See the LICENSE.txt file for details.

CONTACT

zivbj at cs.cmu.edu
jund at cs.cmu.edu