Home

Awesome

Readme

Akshay Tambe
Pachter and Doudna groups
UC Berkeley

Summary

sircel (pronounced "circle") separates reads in a fastq file based on barcode sequences that occur at known positions of reads. This is an essential first step in analyzing single-cell genomics data from experiments such as Drop-Seq. Barcode sequences often contain deletion and/or mismatch errors that arise during barcode synthesis and sequencing, and we have designed our barcode recovery approach with these issues in mind. In addition to identifying barcodes in an unbiased manner, sircel also quantifies their abundances.

sircel is well-suited for Drop-Seq, in which barcode deletion errors are prevalent. In order to identify barcodes in a deletion-robust manner, sircel begins by counting k-mers in circularized barcodes extracted from the reads. Circularization of the barcode portion ensures that a mismatch-containing barcode will share at least one k-mer with the true barcode sequence. This is not the case when using 'linear' barcode sequences for some larger values of k.

To address insertion and deletion errors, barcodes are also circularized after either appending or deleting one sequenced base. In the case of dropseq barcoded beadas, a deletion will result in the last sequenced position of the barcode actually arising from the first base of the UMI. Conversely an insertion will result in the first base (by position) of the UMI actually arising from the last base of the barcode. As each read might potentially contain an unknown mismatch / insertion / deltion error, each read is converted to three circularized variants. Kmers are then counted from these circularized barcodes, and a De Bruijn graph is constructed where a node represents by a kmer, and edges represent two kmers that were adjacent in at least one read. Edges are weighted by how many reads connect those kmers.

In this graph, a barcode will appear as a circular path of fixed length through this graph. We define the weight of such a path to be that of the lowest-weight edge within this path. We can identify such circular paths in this graph, and such paths represent possible consensus barcode sequences. It should be noted here that reads with either 0 or 1 error can contribute to these weighted paths; reads with 2 or more errors typically do not.

This path weight can be interpreted as the number of reads that support a given barcode at its lowest-confidence kmer. As such we identify the true consensus barcodes by selecting the paths with highest weight. This process is typically straightforward, as we have observed that the cumulative distribution of path weights contains an inflection point corresponding to true paths. This result is seen in a number of simulated and real datasets.

We assign each read in our dataset to a consensus barcode based on the number of kmers the read shares with the consensus barcode. When computing kmers for this assignment, the value of k can differ from the value used to produce the barcode De Bruijn graph. This allows us to assign reads which might include 2 or more errors.

Installation

Run the setup py script below. In order to use kallisto to quantify single-cell expression levels after sircel identifies cell barcodes, and kallisto is not on the PATH the --kallisto argument must point to the kallisto binary (requires kallisto version >= 0.43.0).

python3 setup.py install --kallisto PATH_TO_KALLISTO --install-scripts=./sircel/

The sircel executable should be in the directory ./sircel/sircel. After installation, you can run sircel via sircel.

Requirements

python3
numpy
scipy
scikit-learn (for TCCs)
	Optional. Not needed if only splitting reads by barcode (not quantifying single-cell expression levels)
kallisto (0.43.0 or higher)
	Optional. Not needed if only splitting reads by barcode (not quantifying single-cell expression levels)

Commandline arguments

--help				Prints help message

--dropseq			Use barcode / umi positions from Macosko et al.
--10xgenomics			Use barcodes / umi coordinates from 10xGenomics version2 chemistry
--barcodes			Fastq.gz file from bead barcodes (Reads 1 for dropseq, reads 1 for 10xGenomics)
--reads				Fastq.gz file from RNA-seq / 3' sequence tags (Reads 2 for dropseq, reads 2 for 10xGenomics)
--umis				UMIs file. (Index I1 file, only required for 10xGenomics )
--output_dir			Directory where output files are written
--kmer_size			Size of k-mer to use for building the barcode graph
--depth				Search depth at each node [Default: 4]
--breadth			Search breadth through the graph [Default: 1000]
--threads			Number of threads to use [Default: 1]

Optional arguments

--min_dist		Minimum Hamming distance between barcodes
--kallisto_idx		Path to pre-computed kallisto index.
				If this is provided the script will quantify single-cell expression profiles using kallisto
--barcode_start		First position of cell barcode within read in --barcode file.
				Not needed if --dropseq or --10xgenomics arguments are provided
--barcode_end		Last position of cell barcode within read in --barcode file.
				Not needed if --dropseq or --10xgenomics arguments are provided
--umi_start		First position of unique molecule ID within read in --barcode file.
				Not needed if --dropseq or --10xgenomics arguments are provided
--umi_end		Last position of unique molecule ID within read in --barcode file.
				Not needed if --dropseq or --10xgenomics arguments are provided
--num_cells		Estimated number of cells. Not required

Note about barcode and UMI position indexing Barcode and UMI positions within a read follow the same indexing convension as python strings. For example, the string BARCODEUMI would have coordinates:

barcode_start		0
barcode_end			7
umi_start			7
umi_end				10

This follows from the following python snippet:

>>> seq = 'BARCODEUMI'
>>> seq[0:7]
'BARCODE'
>>> seq[7:10]
'UMI'

Use example

1 million reads from from Macosko 2015

This tutorial will run this software on an example dataset from Macosko et al (SRR1873277). According to the authors' software we expect ~570 single cells from this dataset. To speed things up we only use the first 1 million reads in the file (rather than ~400 million).Note that if you are not using kallisto to quantify expression levels, omit the --kallisto_idx option. This command will run a complete singe-cell RNA-seq analysis, going from raw, un-split reads to single-cell expression profiles. Note that the human / mouse transcriptome fasta can be found at http://bio.math.berkeley.edu/kallisto/transcriptomes/. Build a kallisto index following instrutuions https://pachterlab.github.io/kallisto/starting

sircel \
	--threads 32 \
	--dropseq \
	--output_dir example \
	--reads example/SRR1873277_1m_r2.fastq.gz \
	--barcodes example/SRR1873277_1m_r1.fastq.gz \
	--kallisto_idx kallisto_idx/hgmm_kallisto.idx

The following output files will be produced in the example/ directory:

	all_paths.txt
		Tab-delimited file containing all circular paths (putative barcodes) found in the graph 
	merged_paths.txt
		Tab-delimited file containing circular paths after Hamming correction	
	paths_plotted.png
		Plots / histograms of circular path weights
	fits.txt
		Tab-delimited file containing Gaussian fits for plots above
	kallisto/
		Folder containing kallisto and TCC output files
		See Pachterlab scRNA-seq TCC pipeline for more:
			https://github.com/pachterlab/scRNA-Seq-TCC-prep
	reads_split/
		Folder containing fastq.gz files split by read.
		Also contains umi.txt and batch.txt files for kalisto single-cell
	run_log.txt
	run_outputs.json

The outputs from this command can be visualized with an included ipython notebook. Simply point the notebook to the appropriate run output file [codeblock 3] to re-compute the plots.