scJoint

scJoint is a transfer learning method for integrating atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semi-supervised framework and uses a neural network to train on labeled and unlabeled data simultaneously, enabling label transfer and joint visualization in an integrative framework. For more information, see the scJoint manuscript: https://doi.org/10.1101/2020.12.31.424916.

scJoint was developed using PyTorch 1.0.0 and has been tested under both PyTorch 1.0.0 and 1.4.0. scJoint requires one GPU to run.

Tutorials

Installation

scJoint can be obtained by cloning the GitHub repository:

git clone https://github.com/SydneyBioX/scJoint.git

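The following Python packages must be installed before running scJoint: h5py, torch, scipy, and numpy. The other modules scJoint imports (itertools, os, random, sys, time, and datetime) are part of the Python standard library and need no separate installation.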
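
As a quick pre-flight check, the sketch below reports which of the packages named above cannot be found in the current environment. The helper name `missing_packages` is hypothetical and not part of scJoint; it relies only on `importlib.util.find_spec` from the standard library.

```python
import importlib.util

# Third-party packages named in the requirements above;
# the remaining modules in that list ship with Python itself.
REQUIRED = ["h5py", "torch", "scipy", "numpy"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only confirms that a package can be located; it does not verify versions or that a GPU is visible to PyTorch.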

Preparing input for scJoint

scJoint's main function takes expression data in .npz format and cell type labels in .txt format. To prepare the input for scJoint, modify the dataset paths in process_db.py, which will:

  1. take the .h5 files of the expression matrices stored in matrix/data as input and generate an .npz file for each expression matrix;
  2. convert the .csv files of cell type labels to numeric labels stored in .txt files, and output a label_to_idx.txt file that gives the correspondence between the numeric labels and the cell type labels.

Note:

  1. The expression matrix for scRNA-seq data is the gene expression matrix (either normalised or raw data); for scATAC-seq data it is the gene activity matrix.
  2. Cell type labels are required for scRNA-seq data. Labels for scATAC-seq data are optional and are used only for accuracy calculation.
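
The label conversion described in step 2 can be pictured with the following sketch. It is not the code in process_db.py: the function names `encode_labels` and `write_labels`, the alphabetical ordering of the indices, and the "name index" layout of the mapping file are all assumptions made for illustration, and the actual label_to_idx.txt may be laid out differently.

```python
def encode_labels(cell_types):
    """Map cell type names to integer indices.

    Returns (indices, label_to_idx), where indices[i] is the numeric
    label of cell_types[i]. Indices follow the alphabetical order of the
    distinct names (an assumption, not necessarily scJoint's ordering).
    """
    label_to_idx = {name: i for i, name in enumerate(sorted(set(cell_types)))}
    indices = [label_to_idx[name] for name in cell_types]
    return indices, label_to_idx


def write_labels(indices, label_to_idx, label_path, mapping_path):
    """Write one numeric label per line, plus a 'name index' mapping file."""
    with open(label_path, "w") as f:
        f.writelines(f"{i}\n" for i in indices)
    with open(mapping_path, "w") as f:
        f.writelines(f"{name} {i}\n" for name, i in label_to_idx.items())


if __name__ == "__main__":
    indices, mapping = encode_labels(["T cell", "B cell", "T cell", "NK cell"])
    print(indices)   # [2, 0, 2, 1]
    print(mapping)   # {'B cell': 0, 'NK cell': 1, 'T cell': 2}
```

Keeping the mapping in a separate file means the numeric predictions can be translated back to cell type names after training.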

Running scJoint

Edit config.py according to the data input (see the Arguments section for more details).

In a terminal, run

python main.py

The output will be saved in the ./output folder.
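
After a run finishes, a short script such as the one below can list what was written. It is only a convenience sketch: `list_output_files` is a hypothetical helper, and the only facts taken from this README are that results are .txt files and that they are saved under ./output.

```python
from pathlib import Path


def list_output_files(output_dir="./output"):
    """Return the sorted names of .txt files directly inside output_dir."""
    folder = Path(output_dir)
    if not folder.is_dir():
        return []  # nothing written yet, or the path is wrong
    return sorted(p.name for p in folder.glob("*.txt"))


if __name__ == "__main__":
    for name in list_output_files():
        print(name)
```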

Arguments

The script config.py specifies the arguments for scJoint and needs to be modified according to the data.

Dataset information

Training config

The configuration we used in our paper can be found in link.

Output

scJoint will output 4 types of .txt files:

Visualisation

To generate tSNE and UMAP plots for the output data using R, run the following command in a terminal:

Rscript embedding_visualisation_R.R --output_dir output/ --input_dir data/ --TSNE TRUE --UMAP TRUE --proportion 1

where

Note: the R packages used by the script can be installed with:

install.packages(c("ggplot2", "ggthemes", "scattermore", "ggpubr", "Rtsne", "uwot", "pals", "grDevices", "optparse"))

Output of embedding_visualisation_R.R:

Online app

scJoint is also available via superbio: https://app.superbio.ai/apps/114/.

Reference

Lin, Y., Wu, T.Y., Wan, S., Yang, J.Y., Wong, W.H. and Wang, Y.X., 2022. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nature Biotechnology, 40(5), pp.703-710.