Assembled Genomes Compressor

Assembled Genomes Compressor (AGC) is a tool designed to compress collections of de-novo assembled genomes. It can be used for various types of datasets: short genomes (viruses) as well as long (humans).

The tool offers high compression ratios, especially for high-quality genomes. For example, the 96 haplotype sequences from the Human Pangenome Project (47 samples), the GRCh38 reference, and the CHM13 v1.1 assembly, containing about 290 Gb in total, are squeezed to less than 1.5 GB. The compressed samples remain easily accessible, as agc extracts single samples or contigs in a few seconds. Compression is also fast: on an AMD TR 3990X-based machine (32 threads used) it takes about 12 minutes to compress the HPP collection.

Quick start

git clone --recurse-submodules https://github.com/refresh-bio/agc
cd agc

# Linux compilation
make

# MacOS compilation: must specify g++ compiler
make CXX=g++-11

# Compress a collection of 3 genomes
./agc create ref.fa in1.fa in2.fa > col.agc                         # file names given in command-line
./agc create ref.fa in1.fa.gz in2.fa.gz > col.agc                   # gzipped non-reference FASTA files
./agc create -i fn.txt ref.fa > col.agc                             # fn.txt contains 2 file names in separate lines:
                                                                    # in1.fa in2.fa
./agc create -a -i fn.txt ref.fa > col.agc                          # adaptive mode (use for bacterial data)
./agc create -i fn.txt -o col.agc ref.fa                            # output file name is specified as a parameter
./agc create -i fn.txt -o col.agc -k 29 -l 22 -b 100 -t 16 ref.fa   # same as above, but manual selection 
                                                                    # of compression parameters
./agc create -c -o col.agc ref.fa samples.fa                        # compress samples stored in a single file
                                                                    # (reference must be given separately)

# Add new genomes to the collection
./agc append in.agc in3.fa in4.fa > out.agc                         # add 2 genomes to the compressed archive
./agc append -i fn.txt in.agc -o out.agc                            # add genomes (fn.txt contains file names)
./agc append -a -i fn.txt in.agc -o out.agc                         # add genomes (adaptive mode)

# Extract all genomes from the compressed archive
./agc getcol in.agc > out.fa                                        # extract all samples
./agc getcol -o out_path/ in.agc                                    # extract all samples and store them in separate files

# Extract a genome or genomes from the compressed archive
./agc getset in.agc in1 > out.fa                                    # extract sample in1 from the archive
./agc getset in.agc in1 in2 > out.fa                                # extract samples in1 and in2 from the archive

# Extract contigs from the compressed archive
./agc getctg in.agc ctg1 ctg2 > out.fa                              # extract contigs ctg1 and ctg2 from the archive
./agc getctg in.agc ctg1@gn1 ctg2@gn2 > out.fa                      # extract contigs ctg1 from genome gn1 and ctg2 from gn2 
                                                                    # (useful if contig names are not unique)
./agc getctg in.agc ctg1@gn1:from1-to1 ctg2@gn2:from2-to2 > out.fa  # extract parts of contigs 
./agc getctg in.agc ctg1:from1-to1 ctg2:from2-to2 > out.fa          # extract parts of contigs 

# List genome names in the archive
./agc listset in.agc > out.txt                                      # list sample names

# List contig names in the archive
./agc listctg in.agc gn1 gn2 > out.txt                              # list contig names in genomes gn1 and gn2

# Show info about the compression archive
./agc info in.agc                                                   # show some stats, parameters, command-lines 
                                                                    # used to create and extend the archive
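
The `-i` option used above reads input file names from a text file, one name per line. The following is a minimal sketch of how such a list could be built. The `genomes` directory, the `build_list` helper, and the assumption that `agc` is on the PATH (rather than invoked as `./agc`) are illustrative choices, not part of AGC itself. The `agc` call is only attempted when the binary, the reference, and a non-empty list are present.

```shell
#!/bin/sh
# Hypothetical layout: sample FASTA files live in ./genomes (assumption).
genomes_dir="genomes"
list_file="fn.txt"

# Write one file name per line, which is the format the -i option reads.
build_list() {
    dir=$1
    out=$2
    : > "$out"                      # start with an empty list
    for f in "$dir"/*.fa "$dir"/*.fa.gz; do
        [ -e "$f" ] || continue     # skip patterns that matched nothing
        printf '%s\n' "$f" >> "$out"
    done
}

build_list "$genomes_dir" "$list_file"

# Run agc only when it is installed and the inputs exist.
if command -v agc >/dev/null 2>&1 && [ -f ref.fa ] && [ -s "$list_file" ]; then
    agc create -i "$list_file" -o col.agc ref.fa
fi
```

Generating the list with a glob keeps it in sync with the directory contents, which may help when the collection grows and `agc append -i` is used later.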

Installation and configuration

agc should be downloaded from https://github.com/refresh-bio/agc and compiled. The supported OS are:

Compilation options

For better performance, gzipped input is read using the isa-l library on x64 CPUs. This, however, requires the NASM assembler to be installed (you can install it from GitHub or as the nasm package, e.g., sudo apt install nasm). If NASM is not present (or on ARM-based CPUs), zlib-ng is used.

Compilation with default options optimizes the tool for the native platform. If you want more control, you can specify the platform:

make PLATFORM=arm8		# compilation for ARM-based machines (turns on `-march=armv8-a`)
make PLATFORM=m1		# compilation for M1/M2/... (turns on `-march=armv8.4-a`)
make PLATFORM=SSE2		# compilation for x64 CPUs with SSE2 support
make PLATFORM=AVX		# compilation for x64 CPUs with AVX support
make PLATFORM=AVX2		# compilation for x64 CPUs with AVX2 support
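
As a rough illustration of choosing among the PLATFORM values above, the sketch below maps the output of `uname -s` and `uname -m` to a `make` command. The `pick_platform` helper and the mapping itself are assumptions made for this example and are not part of the AGC Makefile. On x64 machines it falls back to the default native build rather than guessing an instruction set. The script only prints the command it would run.

```shell
#!/bin/sh
# Map OS and CPU names (as printed by `uname -s` and `uname -m`) to a PLATFORM value.
# The mapping is an illustrative assumption, not something defined by AGC.
pick_platform() {
    os=$1
    arch=$2
    case "$os/$arch" in
        Darwin/arm64)        echo "m1" ;;     # Apple silicon
        */aarch64|*/arm64)   echo "arm8" ;;   # other ARM-based machines
        *)                   echo "" ;;       # empty: keep the default native build
    esac
}

platform=$(pick_platform "$(uname -s)" "$(uname -m)")

# Print the suggested command instead of running it.
if [ -n "$platform" ]; then
    echo "make PLATFORM=$platform"
else
    echo "make"
fi
```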

You can also specify the g++ compiler version (if installed):

make CXX=g++-11
make CXX=g++-12

Prebuilt releases

The release contains a set of precompiled binaries for Windows, Linux, and OS X.

The software is also available on Bioconda:

conda install -c bioconda agc

For detailed instructions on how to set up Bioconda, please refer to the Bioconda manual.

Version history

Usage

agc <command> [options]

Command:

Creating new archive

agc create [options] <ref.fa> [<in1.fa> ...] > <out.agc>

Options:

Hints

FASTA files can be optionally gzipped. It is, however, recommended (for performance reasons) to use an uncompressed reference FASTA file.

If all samples are given in a single file (concatenated genomes mode) the reference must be given in a separate file.

Setting parameters allows different compromises, usually between compressed size and decompression time. The impact of the most important options is discussed below.

Append new genomes to the existing archive

agc append [options] <in.agc> [<in1.fa> ...] > <out.agc>

Options:

Hints

FASTA files can be optionally gzipped.

Decompress whole collection

agc getcol [options] <in.agc> > <out.fa>

Options:

Hints

If an output path is specified, it must be an existing directory. Each sample is stored in a separate file (files in the directory are overwritten if their names match a sample name). Samples can be gzipped when the -g flag is provided.
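
Because the output directory must already exist, a wrapper script can create it before calling `getcol`. The sketch below does that and warns when the directory already holds files that might be overwritten. The names `out_path` and `in.agc` are placeholders taken from the quick start, and the `agc` call is only attempted when the binary and the archive are present.

```shell
#!/bin/sh
out_dir="out_path"        # placeholder output directory name
archive="in.agc"          # placeholder archive name

# getcol expects an existing directory, so create it first.
mkdir -p "$out_dir"

# Files named like samples would be overwritten, so warn if the directory is not empty.
if [ -n "$(ls -A "$out_dir")" ]; then
    echo "warning: $out_dir is not empty; files named like samples may be overwritten" >&2
fi

if command -v agc >/dev/null 2>&1 && [ -f "$archive" ]; then
    # -g requests gzipped output; the trailing slash follows the quick-start example.
    agc getcol -g -o "$out_dir/" "$archive"
fi
```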

Extract genomes from the archive

agc getset [options] <in.agc> <sample_name1> [<sample_name2> ...] > <out.fa>

Options:

Hints

Samples can be gzipped when the -g flag is provided.

Extract contigs from the archive

agc getctg [options] <in.agc> <contig1> [<contig2> ...] > <out.fa>
agc getctg [options] <in.agc> <contig1@sample1> [<contig2@sample2> ...] > <out.fa>
agc getctg [options] <in.agc> <contig1:from-to> [<contig2:from-to> ...] > <out.fa>
agc getctg [options] <in.agc> <contig1@sample1:from1-to1> [<contig2@sample2:from2-to2> ...] > <out.fa>
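
The four query forms share one pattern: a contig name, an optional `@sample` part, and an optional `:from-to` range. The sketch below assembles such a query string from its parts. The `contig_query` helper is hypothetical and only formats text. It does not call `agc` and makes no assumption about how coordinates are interpreted.

```shell
#!/bin/sh
# Build a getctg query of the form contig[@sample][:from-to].
# An empty argument means "omit this part".
contig_query() {
    contig=$1
    sample=$2
    from=$3
    to=$4
    query=$contig
    if [ -n "$sample" ]; then
        query="${query}@${sample}"
    fi
    if [ -n "$from" ] && [ -n "$to" ]; then
        query="${query}:${from}-${to}"
    fi
    printf '%s\n' "$query"
}

contig_query ctg1 ""  ""  ""      # prints ctg1
contig_query ctg1 gn1 ""  ""      # prints ctg1@gn1
contig_query ctg1 ""  100 200     # prints ctg1:100-200
contig_query ctg1 gn1 100 200     # prints ctg1@gn1:100-200
```

A query built this way could then be passed on, for example `agc getctg in.agc "$(contig_query ctg1 gn1 100 200)" > out.fa`.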

Options:

Hints

Contigs can be gzipped when the -g flag is provided.

List reference sample name in the archive

agc listref [options] <in.agc> > <out.txt>

Options:

List samples in the archive

agc listset [options] <in.agc> > <out.txt>

Options:

List contigs in the archive

agc listctg [options] <in.agc> <sample1> [<sample2> ...] > <out.txt>

Options:

Show some info about the archive

agc info [options] <in.agc> > <out.txt>

Options:

AGC decompression library

AGC files can also be accessed with the C/C++ or Python library.

C/C++ libraries

The C and C++ APIs are provided in the src/lib-cxx/agc-api.h file (in C++ you can use either the C or the C++ API). You can also take a look at src/examples to see both APIs in use.

Python library

AGC files can also be accessed with a Python wrapper for the AGC API, which was created using pybind11, version 2.11.1. It is available for Linux and macOS.

Warning: The Python binding is experimental. The library used to create the binding as well as the public interface may change in the future.

The Python module wrapping the AGC API must be compiled. To compile it, run:

make py_agc_api

As a result, pybind11 creates a *.so file that can be used as a Python module. The following file is created: py_agc_api.`python3-config --extension-suffix`

To use this file, make it visible to Python. One way to do this is to extend the PYTHONPATH environment variable, which can be done by running:

source py_agc_api/set_path.sh
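
For readers who prefer to set the variable by hand, the sketch below shows one way to extend PYTHONPATH and check that the module can be imported. The contents of `set_path.sh` are not reproduced here. The `module_dir` value is an assumption and should point at the directory where the compiled `py_agc_api` file ended up on your system.

```shell
#!/bin/sh
# Directory holding the compiled py_agc_api module. The location is an assumption:
# adjust it to wherever `make py_agc_api` placed the file.
module_dir="$PWD"

# Prepend the directory, keeping any existing PYTHONPATH entries.
PYTHONPATH="${module_dir}${PYTHONPATH:+:$PYTHONPATH}"
export PYTHONPATH

# Check whether Python can import the module (only if python3 is available).
if command -v python3 >/dev/null 2>&1; then
    if python3 -c "import py_agc_api" 2>/dev/null; then
        echo "py_agc_api is importable"
    else
        echo "py_agc_api was not found via $module_dir"
    fi
fi
```

The `${PYTHONPATH:+:$PYTHONPATH}` expansion adds the separating colon only when PYTHONPATH was already set, which avoids a stray empty entry.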

An example usage of the Python wrapper for the AGC API is presented in the file py_agc_api/py_agc_test.py. To test it, run:

python3 py_agc_api/py_agc_test.py 

It reads the AGC file from the toy example (toy_ex/toy_ex.agc).

Toy example

There are four example FASTA files (ref.fa, a.fa, b.fa and c.fa) in the toy_ex folder. They can be used to test AGC. The toy_ex/toy_ex.agc is an AGC archive created with:

agc create -o toy_ex/toy_ex.agc toy_ex/ref.fa toy_ex/a.fa toy_ex/b.fa toy_ex/c.fa

The AGC file is read in the Python test script (py_agc_api/py_agc_test.py) and can be used in the example using the C/C++ library (src/examples).

For more options, see the Usage section.
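
A quick smoke test of the toy archive could look like the sketch below. It assumes that `agc` is on the PATH, that the script is run from the repository root, and that `listset` prints one sample name per line. None of these is guaranteed by the text above. When `agc` or the archive is missing, the script reports the check as skipped instead of failing.

```shell
#!/bin/sh
archive="toy_ex/toy_ex.agc"     # path from the toy example, relative to the repository root
work_dir=$(mktemp -d)           # scratch directory for the extracted files
status="skipped"                # stays "skipped" when agc or the archive is missing

if command -v agc >/dev/null 2>&1 && [ -f "$archive" ]; then
    status="failed"
    # List the sample names, then extract the first one.
    if agc listset "$archive" > "$work_dir/samples.txt"; then
        first=$(head -n 1 "$work_dir/samples.txt")
        if agc getset "$archive" "$first" > "$work_dir/first.fa" \
            && [ -s "$work_dir/first.fa" ]; then
            status="ok"
        fi
    fi
fi

echo "toy example check: $status"
```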

Large datasets

Archives of 94 haplotype human assemblies released by HPRC (https://github.com/human-pangenomics/HPP_Year1_Data_Freeze_v1.0) in 2021, as well as 619,750 complete SARS-CoV-2 genomes published by NCBI (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes/), can be downloaded from Zenodo (https://zenodo.org/record/5826274).

Citing

S. Deorowicz, A. Danek, H. Li, AGC: Compact representation of assembled genomes with fast queries and updates. Bioinformatics, btad097 (2023) https://doi.org/10.1093/bioinformatics/btad097