Home

Awesome

Introduction

cgranges is a small C library for genomic interval overlap queries: given a genomic region r and a set of regions R, finding all regions in R that overlaps r. Although this library is based on interval tree, a well known data structure, the core algorithm of cgranges is distinct from all existing implementations to the best of our knowledge. Specifically, the interval tree in cgranges is implicitly encoded as a plain sorted array (similar to binary heap but packed differently). Tree traversal is achieved by jumping between array indices. This treatment makes cgranges very efficient and compact in memory. The core algorithm can be implemented in ~50 lines of C++ code, much shorter than others as well. Please see the code comments in cpp/IITree.h for details.

Usage

Test with BED coverage

For testing purposes, this repo implements the bedtools coverage tool with cgranges. The source code is located in the test/ directory. You can compile and run the test with:

cd test && make
./bedcov-cr test1.bed test2.bed

The first BED file is loaded into RAM and indexed. The depth and the breadth of coverage of each region in the second file is computed by query against the index of the first file.

The test/ directory also contains a few other implementations based on IntervalTree.h in C++, quicksect in Cython and ncls in Cython. The table below shows timing and peak memory on two test BEDs available in the release page. The first BED contains GenCode annotations with ~1.2 million lines, mixing all types of features. The second contains ~10 million direct-RNA mappings. Time1a/Mem1a indexes the GenCode BED into memory. Time1b adds whole chromosome intervals to the GenCode BED when indexing. Time2/Mem2 indexes the RNA-mapping BED into memory. Numbers are averaged over 5 runs.

Algo.Lang.CovProgramTime1aTime1bMem1aTime2Mem2
IAITreeCYcgranges9.0s13.9s19.1MB4.6s138.4MB
IAITreeC++Ycpp/iitree.h11.1s24.5s22.4MB5.8s160.4MB
CITreeC++YIntervalTree.h17.4s17.4s27.2MB10.5s179.5MB
IAITreeCNcgranges7.6s13.0s19.1MB4.1s138.4MB
AIListCN3rd-party/AIList7.9s8.1s14.4MB6.5s104.8MB
NCListCN3rd-party/NCList13.0s13.4s21.4MB10.6s183.0MB
AITreeCN3rd-party/AITree16.8s18.4s73.4MB27.3s546.4MB
IAITreeCythonNcgranges56.6s63.9s23.4MB43.9s143.1MB
binningC++Ybedtools201.9s280.4s478.5MB149.1s3438.1MB

Here, IAITree = implicit augmented interval tree, used by cgranges; CITree = centered interval tree, used by Erik Garrison's IntervalTree; AIList = augmented interval list, by Feng et al; NCList = nested containment list, taken from ncls by Feng et al; AITree = augmented interval tree, from kerneltree. "Cov" indicates whether the program calculates breadth of coverage. Comments:

Use cgranges as a C library

cgranges_t *cr = cr_init(); // initialize a cgranges_t object
cr_add(cr, "chr1", 20, 30, 0); // add a genomic interval
cr_add(cr, "chr2", 10, 30, 1);
cr_add(cr, "chr1", 10, 25, 2);
cr_index(cr); // index

int64_t i, n, *b = 0, max_b = 0;
n = cr_overlap(cr, "chr1", 15, 22, &b, &max_b); // overlap query; output array b[] can be reused
for (i = 0; i < n; ++i) // traverse overlapping intervals
	printf("%d\t%d\t%d\n", cr_start(cr, b[i]), cr_end(cr, b[i]), cr_label(cr, b[i]));
free(b); // b[] is allocated by malloc() inside cr_overlap(), so needs to be freed with free()

cr_destroy(cr);

Use IITree as a C++ library

IITree<int, int> tree;
tree.add(12, 34, 0); // add an interval
tree.add(0, 23, 1);
tree.add(34, 56, 2);
tree.index(); // index
std::vector<size_t> a;
tree.overlap(22, 25, a); // retrieve overlaps
for (size_t i = 0; i < a.size(); ++i)
	printf("%d\t%d\t%d\n", tree.start(a[i]), tree.end(a[i]), tree.data(a[i]));

Cite cgranges

This library is integrated into bedtk, which is published in:

Li H and Rong J (2021) Bedtk: finding interval overlap with implicit interval tree. Bioinformatics, 37:1315-1316