Awesome

scGTM: Single-cell generalized trend model

scGTM (orignally named as scKGAM) is the abbreviation for Single-cell Gene Expression Generalized Trend Model. This is a Python package for modeling the statistical relationship between pseudotime and gene expression data. The paper is published in Bioinformatics and is also available at bioRXiv.

It is intended for bioinformatic scientists, applied statisticians, and students who prefer using Metaheuristic algorithms in solving their own bioinformatic optimization problems. scGKM is able to provide various marginal gene distributions with interpretable regression functions. Check out more features!

Free software: MIT license
Python versions: 3.6 and above

Installation

To install the bleeding-edge version of scGTM, clone this repo:

$ git clone -b git@github.com:ElvisCuiHan/scGTM.git

and then run

$ cd scGTM
$ python run_scGTM.py --model.iter 100 --model.marginal 'ZIP' --model.save_dir "your/path/to/save" --data.dir "your/path/file.csv" --gene.start 3 --gene.end 4

Usage

scGTM provides a high-level implementation of various marginal distributions including Poisson, negative binomial (NB), zero-inflated Poisson (ZIP) and zero-inflatd negative binomial (ZINB). Further, it utilizes particle swarm optimization algorithm in the package pyswarms to optimize the objective function. Thus, it aims to be user-friendly and customizable.

The data should be a cell-by-gene matrix where the first column corresponding to the pseudotime:

Index	Pseudotime	Gene1	Gene2	...
1.	t1	y11	y12	...
2.	t2	y21	y22	...
3.	t3	y31	y32	...
4.	t4	y41	y42	...

A typical data structure will be of the following form:

All-in-one function

Suppose we want to regress Gene 1 on pseudotime using the scGTM, simply we run the run_scGTM file in shell:

python run_scGTM.py --model.iter {# of iterations} --model.marginal 'ZIP' --model.save_dir "your/path/to/save" --data.dir "your/path/file.csv" --gene.start {START INDEX} --gene.end {END INDEX}

and we can replace run_scGTM.py with either run_scGTM_Hill_Only.py or run_scGTM_Valley_Only.py if we are only interested in one of the two trends.

Using the data in our demo folder, the command is:

python run_scGTM_Valley_Only.py --model.iter 120 --model.marginal 'ZIP' --model.save_dir "Demo/Results/" --data.dir "Demo/simu_nb_scGTM_input.csv" --gene.start 1 --gene.end 60

gene_index: The index of gene that we want to model.
model.marginal: The marginal distribution of the gene expression, should be one of ["NB", "ZINB", "Poisson", "ZIP"].
model.iter: Number of iterations run by PSO, usually 150 suffices.
model.save_dir: The directory to save our results.
data.dir: The path to our data file.
gene.start: Index of the first gene to fit.
gene.end: Index of the last gene to fit.

In the scGTM.py file (and the other two), we can modify the arguments to let the model outputs user-defined colors.

plot_args: A dictionary with keys color and cmap. color is a 4x1 vector and cmap is a string. For example:

plot_args={
             'color': ['red', 'tomato', 'orange', 'violet'],
             'cmap': 'Blues',
         }

If one wants to estimate many genes with different marginals, we can first change the data directory in the function parallel and then use the command in terminal:

python run_scGTM_Hill_Only.py  --gene.start {START INDEX} --gene.end {END INDEX} --model.marginal "NB" --model.save_dir "YourTargetPath" --model.iter 150

Note the data should be in .csv format. The main function will return a .json file and .png figure.

Example

The following figure has shown a typical output by the main function in scGTM.py.

Red line: fitted log mean expression (log(tau_c) in the paper).
Blue line: Red line minus -log(1-p_c) so that the zero-inflation part is removed from expectation.
Orange vertical line: Estimated t0, i.e., the turning point of the model.
Purple line: fitted zero-inflation parameter, for details, see paper.
Scatters/Points: observed log expression value (log(y+1)).

The confidence intervals of {t0, k1, k2, mu} are saved in a .json file in the same directory.