UST
UST is a bioinformatics tool for constructing a spectrum-preserving string set (SPSS) representation from sets of k-mers.
Quick start
To install, compile from source:
git clone https://github.com/jermp/UST
cd UST
make
After compiling, run UST as follows:
./ust -i [unitigs.fa] -k [kmer_size]
For example:
./ust -i examples/k11.unitigs.fa -k 11
The important parameters are:
k [int]
: The k-mer size that was used to generate the input, i.e. the length of the nodes of the node-centric de Bruijn graph.
i [input-file]
: Unitigs file produced by BCALM2 in FASTA format.
a [0 or 1]
: Default is 0. A value of 1 tells UST to preserve abundance. Use this option when the input file was generated with the -all-abundance-counts option of BCALM2.
The output is a FASTA file with the extension "ust.fa" in the working folder; it is the SPSS representation of the input.
If the program is run with the option -a 1, the header line of each sequence will also contain the abundance counts, as in the provided BCALM2 input file.
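To sanity-check that an output file really represents the same k-mer set as the input, the two spectra can be compared directly. The following is a minimal Python sketch, not part of UST. It assumes that a k-mer and its reverse complement are treated as the same k-mer, that every SPSS string has length at least k, and that each k-mer appears exactly once in the SPSS. All function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def read_fasta(path: str) -> list[str]:
    """Read all sequences from a FASTA file, ignoring the header lines."""
    seqs, chunks = [], []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    seqs.append("".join(chunks))
                    chunks = []
            elif line:
                chunks.append(line.upper())
    if chunks:
        seqs.append("".join(chunks))
    return seqs


def kmer_counts(seqs: list[str], k: int) -> Counter:
    """Count canonical k-mers over all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[canonical(s[i:i + k])] += 1
    return counts


def is_spss_of(spss_seqs: list[str], unitig_seqs: list[str], k: int) -> bool:
    """Check that spss_seqs covers exactly the k-mers of unitig_seqs, each once."""
    if any(len(s) < k for s in spss_seqs):
        return False
    spss = kmer_counts(spss_seqs, k)
    unitigs = kmer_counts(unitig_seqs, k)
    return set(spss) == set(unitigs) and all(c == 1 for c in spss.values())
```

For the example above, a call such as is_spss_of(read_fasta("k11.ust.fa"), read_fasta("examples/k11.unitigs.fa"), 11) would perform the check; the output file name k11.ust.fa is an assumption, so substitute the name of the file that UST actually wrote. The sketch holds all k-mers in memory and is only meant for small inputs.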
Detailed Usage
To build an SPSS representation for your k-mer set, you must first run BCALM2 on it. BCALM2 constructs a set of unitigs, which are then fed as input to ust, which outputs a FASTA file with the SPSS representation. Note that the k parameter to ust must match the -kmer-size used when running BCALM2.
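The two steps can be scripted so that both tools always receive the same k. The following Python sketch only illustrates that idea and makes several assumptions: the BCALM2 binary is called bcalm and is on the PATH, it accepts -in and -abundance-min as described in its own documentation, and -abundance-min 1 is used so that no k-mer is filtered out. The function names are hypothetical, and the caller passes the unitigs path explicitly because BCALM2's output naming is not described here.

```python
import subprocess


def pipeline_commands(kmers_path: str, unitigs_path: str, k: int,
                      preserve_abundance: bool = False) -> list[list[str]]:
    """Build the BCALM2 and UST command lines, using the same k for both."""
    # Assumed BCALM2 invocation; check the flags against the BCALM2 documentation.
    bcalm = ["bcalm", "-in", kmers_path, "-kmer-size", str(k),
             "-abundance-min", "1"]
    ust = ["./ust", "-i", unitigs_path, "-k", str(k)]
    if preserve_abundance:
        # UST's -a 1 expects unitigs produced with -all-abundance-counts.
        bcalm.append("-all-abundance-counts")
        ust += ["-a", "1"]
    return [bcalm, ust]


def run_pipeline(kmers_path: str, unitigs_path: str, k: int,
                 preserve_abundance: bool = False) -> None:
    """Run BCALM2 and then UST, stopping at the first command that fails."""
    for cmd in pipeline_commands(kmers_path, unitigs_path, k, preserve_abundance):
        subprocess.run(cmd, check=True)
```

Building the argument lists in one place means k cannot drift between the two invocations, and the abundance options are always switched on together.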
If you would like to store the data on disk in compressed form (like UST-Compress in our paper), you can then install and run MFCompress on the output of UST as follows:
MFCompressC mykmers.ust.fa
If you would like to build a membership data structure based on UST, then see the SSHash repository.
Citation
If using UST in your research, please cite
- Amatur Rahman and Paul Medvedev, Representation of k-mer sets using spectrum-preserving string sets, RECOMB 2020.
Here is the BibTeX entry:
@inproceedings{RahmanMedvedevRECOMB20,
author = {Amatur Rahman and Paul Medvedev},
title = {Representation of $k$-mer sets using spectrum-preserving string sets},
booktitle = {Research in Computational Molecular Biology - 24th Annual International Conference, {RECOMB} 2020, Padua, Italy, May 10-13, 2020, Proceedings},
series = {Lecture Notes in Computer Science},
volume = {12074},
pages = {152--168},
publisher = {Springer},
year = {2020}
}
Note that the general notion of an SPSS was independently introduced under the name of simplitigs. Therefore, if citing this general notion, please also cite:
- Brinda K, Baym M, and Kucherov G, Simplitigs as an efficient and scalable representation of de Bruijn graphs, bioRxiv 2020.