The BUS format specification

The BUS format is a binary format for storing intermediate results for single-cell RNA-Seq datasets. This repository details the specification of the format.

The motivation and example usage of the BUS format are described in:

P Melsted, V Ntranos, L Pachter, The Barcode, UMI, Set format and BUStools, Bioinformatics, btz279, 2019.

Tools

BUS generation

BUS file manipulation

BUS parsing

Format specification

A BUS file is a binary file consisting of a header followed by zero or more BUS records. The BUS header consists of the following elements, in order:

| Field name | Description | Type | Value |
|---|---|---|---|
| magic | fixed magic string | char[4] | BUS\0 |
| version | BUS format version | uint32_t | |
| bc_len | Barcode length [1-32] | uint32_t | |
| umi_len | UMI length [1-32] | uint32_t | |
| tlen | Length of plain text header | uint32_t | |
| text | Plain text header | char[tlen] | |

The encoding used for the text header is ASCII.
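As an illustration of the header layout above, here is a minimal Python sketch that reads the fixed fields and the text header from a binary stream. The function name `read_bus_header` and the returned dictionary are hypothetical and are not part of any BUS tool; the sketch assumes only what the specification states (little-endian integers, ASCII text).

```python
import io
import struct


def read_bus_header(stream):
    """Read a BUS header from a binary stream (hypothetical helper)."""
    magic = stream.read(4)
    if magic != b"BUS\0":
        raise ValueError("bad magic string, not a BUS file")

    # version, bc_len, umi_len, tlen: four little-endian uint32_t values.
    fixed = stream.read(16)
    if len(fixed) != 16:
        raise ValueError("truncated BUS header")
    version, bc_len, umi_len, tlen = struct.unpack("<IIII", fixed)

    # The plain text header is tlen bytes of ASCII.
    text_bytes = stream.read(tlen)
    if len(text_bytes) != tlen:
        raise ValueError("truncated text header")

    return {
        "version": version,
        "bc_len": bc_len,
        "umi_len": umi_len,
        "text": text_bytes.decode("ascii"),
    }


# Example: build a header in memory and read it back.
raw = b"BUS\0" + struct.pack("<IIII", 1, 16, 10, 5) + b"hello"
header = read_bus_header(io.BytesIO(raw))
```

After a successful call the stream is positioned at the first BUS record.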

BUS records are stored directly after the header in the following format. The fields occupy 28 bytes, and each record is padded to 32 bytes by adding 32 unused bits at the end of the record.

| Field name | Description | Type |
|---|---|---|
| barcode | 2-bit encoded barcode | uint64_t |
| umi | 2-bit encoded UMI | uint64_t |
| ec | equivalence class | int32_t |
| count | fragment count | uint32_t |
| flags | flags | uint32_t |

The flags column can be used to store extra information, but the BUS format specification does not restrict or specify its content.
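The record layout can be sketched with Python's `struct` module. The names `BUS_RECORD`, `pack_record`, and `unpack_record` are hypothetical; the format string is an assumption derived directly from the table above, with `4x` standing in for the 32 unused trailing bits.

```python
import struct

# Little-endian: uint64 barcode, uint64 umi, int32 ec, uint32 count,
# uint32 flags, then 4 padding bytes. Total size is 32 bytes.
BUS_RECORD = struct.Struct("<QQiII4x")


def pack_record(barcode, umi, ec, count, flags=0):
    """Serialize one BUS record to 32 bytes (hypothetical helper)."""
    return BUS_RECORD.pack(barcode, umi, ec, count, flags)


def unpack_record(buf):
    """Return (barcode, umi, ec, count, flags) from a 32-byte buffer."""
    return BUS_RECORD.unpack(buf)


# Example: round-trip a single record.
data = pack_record(barcode=148, umi=27, ec=3, count=1)
record = unpack_record(data)
```

Because every record has the same size, the number of records in a file can be derived from the file size minus the header size, divided by 32.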

The 2-bit encoding encodes A, C, G, T as 00, 01, 10, 11, such that the first nucleotides are encoded in the most significant bits. For example, the barcode GCCA corresponds to the bit code 10010100, or the integer 148. All integers are encoded as little-endian.
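The 2-bit encoding can be illustrated with a short Python sketch. The helpers `encode_seq` and `decode_seq` are hypothetical names used only for this example; the decoder takes the sequence length as an argument because the integer alone does not record how many leading A nucleotides there are (in a BUS file that length comes from `bc_len` or `umi_len` in the header).

```python
_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def encode_seq(seq):
    """Pack a nucleotide string into an integer, first base in the high bits."""
    code = 0
    for base in seq:
        code = (code << 2) | _CODES[base]
    return code


def decode_seq(code, length):
    """Unpack `length` nucleotides from a 2-bit encoded integer."""
    return "".join(
        _BASES[(code >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )


# The example from the text: GCCA -> 0b10010100 -> 148.
assert encode_seq("GCCA") == 148
assert decode_seq(148, 4) == "GCCA"
```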