Awesome
YaHS: yet another Hi-C scaffolding tool
Overview
YaHS is a scaffolding tool using Hi-C data. It relies on a new algothrim for contig joining detection which considers the topological distribution of Hi-C signals aiming to distingush real interaction signals from mapping nosies. YaHS has been tested in a wide range of genome assemblies. Compared to other Hi-C scaffolding tools, it usually generates more contiguous scaffolds - especially with a higher N90 and L90 statistics. It is also super fast - takes less than 5 minutes to reconstruct the human genome from an assembly of 5,483 contigs with ~45X Hi-C data. See the poster presented in the Bioversity Genemics 2021 conference for more information.
Installation
You need to have a C compiler, GNU make and zlib development files installed. Download the source code from this repo or with git clone https://github.com/c-zhou/yahs.git
. Then type make
in the source code directory to compile.
Run YaHS
YaHS has two required inputs: a FASTA format file with contig sequences which need to be indexed (with samtools faidx for example) and a BAM/BED/BIN/PA5 file with the alignment results of Hi-C reads to the contigs. There are many pipelines available to generate the alignment file such as the Arima Genomics' mapping pipeline, the Omni-C's mapping pipeline and the HiC-Pro. The BAM file is recommened to mark PCR/optical duplicates before feeding to YaHS. Several tools are available out there for marking duplicates such as bammarkduplicates2
from biobambam2 and MarkDuplicates
from Picard. For BED format input, each line should contain at least four columns, i.e., the contig name the read mapped to, the start position of the alignment, the end position of the alignment and the read name. The first and second read from a read pair is optionally marked by '/1' and '/2' suffix to the read name. The fifth column will be treated as the mapping quality score if they are numeric values. All the information after the fifth column are ignored. Each read pair should be placed in two consecutive lines (i.e., the BED file should be sorted by the first column). The BED format file can be generated from the BAM file with bedtools bamtobed for example. There is no need to convert the BAM format to BED format unless you want to compare YaHS to other tools. YaHS also accepts pair format input (PA5). Each line in a pair format file should at least contain five columns containing the mapping information of a read pair, i.e., the read pair name, the contig name the first read mapped to, the mapping position the of the first read, the contig name the second read mapped to, and the mapping position of the second read. The sixth and seventh column will be treated as mapping quality scores for the first and second reads respectively if they are numeric values. All the information after the seventh column are ignored. The BIN format is a binary format specific to YaHS. If the input file is not BIN format, the first step of YaHS is to convert them to BIN format. This is to save running time as multiple rounds of file IO are needed during the scaffolding process. If you have run YaHS and need to rerun it, the BIN file in the output directory could be reused to save some time - although might be just a few minutes. YaHS determines the input file format from file name extensions (case insensitive): .bam
for BAM files, .bed
for BED files, .pa5
for pair format files and .bin
for BIN files. YaHS also provides the parameter --file-type
to allow specification of input file format. Accepted values include BAM, BED, PA5 and BIN (case insensitive). This parameter has to be set when using pipe in as input where file name extension detection will fail.
NOTE 1: The input BAM could either sorted by read names (samtools sort with
-n
option) or not. The behaviours of the program are slightly different, which might lead to slightly different scaffolding results. For a BAM input sorted by read names, with each mapped read pair, a Hi-C link is counted between the middle positions of the read alignments; while for a BAM input sorted by coordinates or unsorted, Hi-C links are counted between the start positions of the read alignments. Also, for a BAM input not sorted by read names, the mapping quality filtering is suppressed (-q
option).
NOTE 2: The BAM file used to genereate BED file need to be filtered out unmapped reads, supplementary/secondary alignment records, and PCR/optical duplicates, and sorted by read names (otherwise the resulted BED file need to be sorted by the read name column).
Here is an example to run YaHS,
yahs contigs.fa hic-to-contigs.bam
The outputs include several AGP format files and a FASTA format file. The *_inital_break_[0-9]{2}.agp
AGP files are for initial assembly error corrections. The *_r[0-9]{2}.agp
and related *_r[0-9]{2}_break.agp
AGP files are for scaffolding results in each round. The *_scaffolds_final.agp
and *_scaffolds_final.fa
files are for the final scaffolding results.
There are some optional parameters.
With -o
option, you can specify the prefix of the output files. It is ./yash.out
by default. If a directory structure is included, the directory needs to be existed.
With -a
option, you can specify a AGP format file to ask YaHS to do scaffolding with the scaffolds in the AGP file as the start point.
With -r
option, you can specify a range of resultions (in ascending order). It is 10000,20000,50000,100000,200000,500000,1000000,2000000,5000000,10000000,20000000,50000000,100000000,200000000,500000000
by default and the upper limit is automatically adjusted by the genome size. For highly fragmented genome assemblies, you can try to start with higher resultions by adding smaller -r
values.
With -R
option, you can specify the number of rounds to run for each resolution level.
With -e
option, you can specify the restriction enzyme(s) used by the Hi-C experiment. For example, GATC
for the DpnII restriction enzyme used by the Dovetail Hi-C Kit; GATC,GANTC
and GATC,GANTC,CTNAG,TTAA
for Arima genomics 2-enzyme and 4-enzyme protocol, respectively. Sometimes, the specification of enzymes may not change the scaffolding result very much if not make it worse, especially when the base quality of the assembly is not very good, e.g., assembies constructed from noisy long reads.
With -l
option, you can specify the minimum contig length included for scaffolding.
With -q
option, you can set the minimum read mapping quality.
With --no-contig-ec
option, you can skip the initial assembly error correction step. With -a
option, this will be set automatically.
With --no-scaffold-ec
option, YaHS will skip the scaffolding error check in each round. There will be no *_r[0-9]{2}_break.agp
AGP output files.
With --no-mem-check
option, the runtime memory check is disabled. When running out of memory, the scaffolding process will terminate immediately instead of try lower resolutions.
With --file-type
option, you can specify the input file format. It has to be set when in the input is pipe in.
With --read-length
option, you can specify the read length of your HiC data. This is only required and used when the input is in pair format.
With --telo-motif
option, you can specify a telomeric motif. This will be used to identify telomeric sequences on each input sequences. Sequences representing telomeric ends will not be allowed to join with other sequences. YaHS maintains a telomeric motif database as default (run yahs --print-telo-motifs
to print a list the motifs). If this option is set, only this motif will be checked. The code for telomere identification was adopted from seqtk telo.
YaHS also allows customizations of the output AGP file, such as changing the scaffold gap size, linkage evidence etc. See the long help message (yahs -?
) and the AGP specification documentation for details.
Generate HiC contact maps
YaHS offers some auxiliary tools to help generating HiC contact maps for visualisation. A demo is provided in the bash script scripts/run_yahs.sh
. To generate and visualise a HiC contact map, the following tools are required.
- samtools: https://github.com/samtools/samtools
- juicer_tools: https://github.com/aidenlab/juicer/wiki/Download
- Juicebox: https://github.com/aidenlab/Juicebox
The first step is to convert the HiC alignment file (BAM/BED/BIN) to a file required by juicer_tools
using the tool juicer pre
provided by YaHS. To save time, BIN file is recommended which has already been generated in the scaffolding step. Here is an example bash command:
(juicer pre hic-to-contigs.bin scaffolds_final.agp contigs.fa.fai | sort -k2,2d -k6,6d -T ./ --parallel=8 -S32G | awk 'NF' > alignments_sorted.txt.part) && (mv alignments_sorted.txt.part alignments_sorted.txt)
The tool juicer pre
takes three positional parameters: the alignments of HiC reads to contigs, the scaffold AGP file and the contig FASTA index file. With -o
option, it will write the results to a file. Here, the outputs are directed to stdout
as we need a sorted (by scaffold names) file for juicer_tools
.
For sorting, we use 8 threads, 32Gb memory and the current directory for temporaries. You might need to adjust these settings according to your device.
The next step is to generate HiC contact matrix using juicer_tools
. Here is an example bash command:
(java -jar -Xmx32G juicer_tools.1.9.9_jcuda.0.8.jar pre alignments_sorted.txt out.hic.part scaffolds_final.chrom.sizes) && (mv out.hic.part out.hic)
The juicer_tools
's pre
command takes three positional parameters: the sorted alignment file generated in the first step, the output file name and the file for scaffold sizes. The file for scaffold sizes should contain two columns - scaffold name and scaffold size, which can be taken from the first two columns of the FASTA index file.
Finally, the output file out.hic
could be used for visualisation with Juicebox. More information about juicer_tools
and Juicebox can be found here.
Manual curation with Juicebox (JBAT)
You can generate a HiC contact map file that can be loaded by Juicebox (JBAT, Dudchenko et al. 2018) for manual editing with juicer pre
by adding -a
parameter. For example,
juicer pre -a -o out_JBAT hic-to-contigs.bin scaffolds_final.agp contigs.fa.fai >out_JBAT.log 2>&1
where hic-to-contigs.bin
(can be replaced with the original BED/BAM file with some sacrifice in running time) and scaffolds_final.agp
are the outputs of YaHS. This results in five output files,
out_JBAT.txt - BED format file for hic links
out_JBAT.liftover.agp - coordinate file between new and old contigs
out_JBAT.assembly - assembly annotation file for Juicebox
out_JBAT.assembly.agp - AGP file contains same information as the assembly annotation file. Not a real AGP file as there are no gaps.
out_JBAT.log - the output log file
You will then need to run juicer_tools pre
with out_JBAT.txt
file.
(java -jar -Xmx32G juicer_tools.1.9.9_jcuda.0.8.jar pre out_JBAT.txt out_JBAT.hic.part <(cat out_JBAT.log | grep PRE_C_SIZE | awk '{print $2" "$3}')) && (mv out_JBAT.hic.part out_JBAT.hic)
There will be a line like [I::main_pre] JUICER_PRE CMD: java -Xmx36G -jar ${juicer_tools} pre out_JBAT.txt out_JBAT.hic <(echo "assembly 183277074")
in the out_JBAT.log
file telling you how to run juicer_tools. The <(echo "assembly 183277074")
part can be replaced by a chromosome size file containing only one line - assembly 183277074.
The out_JBAT.hic
can be loaded into Juicebox along with out_JBAT.assembly
for manual editing.
NOTE 3: if your total assembly size is larger than 2Gb, there will be a scale factor applied to the HiC contact map file to make it loadable by Juicebox. This scale factor can be found in the log file (out_JBAT.log) and always a power of two, e. g.
[I::main_pre] scale factor: 4
. You need to set this parameter in Juicebox throughAssembly > Set Scale
. Otherwise the HiC contact map and assembly file will not match.
Once completed editing, there should be a file named something like out_JBAT.review.assembly
generated by Juicebox, which can be fed into juicer post
command to generate AGP and FASTA files for the final genome assembly. You also need the out_JBAT.liftover.agp
coordinate file previously generated with juicer pre
command.
juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp contigs.fa
This will end up with two files out_JBAT.FINAL.agp
and out_JBAT.FINAL.fa
. Together with hic-to-contigs.bin
or the original BED/BAM file, you can regenerate a HiC contact map for the final assembly as described in the previous section.
You can find more information about manual editing with Juicebox here and Issue 4.
Other tools
- juicer is a tool used to quickly generate HiC alignment file required for HiC contact map generation with tools like Juicebox, PretextMap and Higlass (
juicer pre
). It can be also used to generate AGP and FASTA files after manual editing with Juicebox JBAT (juicer post
). - agp_to_fasta creates a FASTA file from a AGP file. It takes two positional parameters: the AGP file and the contig FASTA file. By default, the output will be directed to
stdout
. You can write to a file with-o
option. It allows changing the FASTA line width with-l
option, which by default is 60. If the AGP file contains sequence components of unknown orientations ('?', '0' or 'na' identifiers, see AGP format), you will need-u
option, with which components with unknown orientation are treated as if they had '+' orientation.
Limitations
- In rare cases, YaHS has been seen making telomere-to-telomere false joins.
- YaHS can only handle up to 45,000 contigs. Consider excluding short contigs from scaffolding (with
-l
option) if the contig number exceeds this limit. - The memory consumption might be very high when the genome size is large and at the same time the contig number is large. In this case, consider starting from a lower resolution (larger
-r
values).
Citation
Chenxi Zhou, Shane A. McCarthy, Richard Durbin. YaHS: yet another Hi-C scaffolding tool. Bioinformatics, 39(1), btac808.