HPRC Pangenome Resources

This repo describes pangenomes produced by the Human Pangenome Reference Consortium from year 1 data. For information about data reuse and publishing with HPRC data, please see the HPRC's Data Use Protocol.

Note: The pangenomes and resultant files referred to in this repo have not been fully QC'd, are not published, and may have known issues.

Background Information

Preprint

A Draft Human Pangenome Reference

Graph Creation Strategies

Graphs were built with three different strategies, summarized in the table below and described in the relevant sections:

| | Minigraph | Minigraph-Cactus | PGGB |
|---|---|---|---|
| sequence comparison | reference-based, progressive | reference-based, progressive | symmetric, all-vs-all |
| resolution | SV only | base-level (via abPOA) | base-level (via abPOA) |
| scope | full assemblies | non-centromeric | full assemblies |
| cyclic paths | no | non-reference | all |
| short read mapping | untested | yes (fast) | untested |
| long read mapping | yes (fastest) | yes | yes (slowest) |
| assembly mapping | yes (direct) | untested | yes (via injection) |

Index files listing file locations for download with the AWS CLI can be found in the indexes folder of this repository. Alternatively, download links are tabulated below in each graph creation strategy's section. Note that the index files list file locations as s3:// URIs, whereas the tables use http:// URLs.
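If you start from an index file but want a plain web download, the s3:// locations can be rewritten as HTTPS URLs. The following is a minimal Python sketch of that conversion. The endpoint in `HTTPS_BASE`, the path-style URL layout, and the bucket and key in the usage line are assumptions for illustration and are not taken from this README; check them against a link in the tables before relying on them.

```python
from urllib.parse import quote

# Assumed endpoint (hypothetical default); verify against a link in the tables.
HTTPS_BASE = "https://s3-us-west-2.amazonaws.com"


def s3_uri_to_https(uri: str, base: str = HTTPS_BASE) -> str:
    """Rewrite 's3://bucket/key' as a path-style '<base>/bucket/key' URL."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    # quote() leaves '/' intact and percent-encodes characters such as '#'.
    return f"{base}/{bucket}/{quote(key)}"


# Hypothetical bucket and key, for illustration only.
print(s3_uri_to_https("s3://example-bucket/path/to/graph.gfa.gz"))
```

With the assumed defaults, the usage line prints `https://s3-us-west-2.amazonaws.com/example-bucket/path/to/graph.gfa.gz`.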

Assembly Inputs

Information about the source assemblies can be found in the HPRC Assembly GitHub repository. Of the 47 samples assembled in year 1 (94 assemblies), all but three were included in graph construction; HG002, HG005 and NA19240 were held out for evaluation purposes. The remaining 44 samples contribute 88 haplotypes, and adding GRCh38 and CHM13 brings the total number of haplotypes to 90.

Graphs

Minigraph

Minigraph (cite) generalizes minimap2 to graphs: it is very fast and builds the graph iteratively. Because its alignments give only approximate locations, it can be used to call structural variants (>50 nt). Two graphs were built, one with GRCh38 and one with CHM13+Y (found here) as the reference sequence.

| Description | GRCh38 Graph | CHM13 Graph |
|---|---|---|
| graph | graph | graph |
| bed | bed, index | bed, index |

Minigraph-Cactus

Minigraph-Cactus (cite) adds base-level alignment to minigraph graphs.

Note: The links below have been updated to point to version 1.1 of the graphs, which contains numerous bug fixes and updated file formats (this includes switching from . to # as the path name separator in all vg files). The original version 1.0 graph, which was described in the HPRC paper, has been moved here. The input assemblies are the same for both versions, so unless you are trying to exactly reproduce results from the paper, please consider using the updated version.
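Scripts that parse path names need to know which separator a given graph version uses. The sketch below splits a path name into three fields with a configurable separator. The sample/haplotype/contig layout and the example names are assumptions for illustration, not something this README specifies, so inspect the path names in your own files first.

```python
from typing import NamedTuple


class PathName(NamedTuple):
    sample: str
    haplotype: str
    contig: str


def parse_path_name(name: str, sep: str = "#") -> PathName:
    """Split a path name into (sample, haplotype, contig).

    Assumes a three-field layout (hypothetical here). Splitting at most
    twice keeps any separator characters inside the contig name intact.
    """
    parts = name.split(sep, 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"cannot parse {name!r} with separator {sep!r}")
    return PathName(*parts)


# Hypothetical names in the newer '#' style and the older '.' style.
print(parse_path_name("SAMPLE1#1#contig0001.1"))
print(parse_path_name("SAMPLE1.1.contig0001.1", sep="."))
```

Both calls return the same fields. The `#` form is easier to handle because contig names that themselves contain a `.` no longer collide with the separator.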

Graphs and associated files are summarized below.

| Description | GRCh38 Graph | CHM13 Graph |
|---|---|---|
| Graph | gfa, gbz | gfa, gbz |
| Full (Unclipped) Graph | gfa, gbz, odgi | gfa, gbz, odgi |
| Chromosome Graphs | chroms | chroms |
| Decomposed VCF | VCF, VCF index | VCF, VCF index, GRCh38-VCF, GRCh38-VCF index |
| Raw VCF | VCF, VCF index | VCF, VCF index, GRCh38-VCF, GRCh38-VCF index |
| Multiple Alignment | HAL, MAF, MAF Index, TAF, TAF Index | HAL, MAF, MAF Index, TAF, TAF Index |
| Multiple Alignment (Duplications removed) | MAF, MAF Index, TAF, TAF Index | MAF, MAF Index, TAF, TAF Index |
| VG Indexes | gbz, hapl, dist, min, snarls | gbz, hapl, dist, min, snarls |
| AF-Filtered VG Indexes | gbz, dist, min, snarls | gbz, dist, min, snarls |
| Excluded Regions | full graph bed, clipped graph bed | full graph bed, clipped graph bed |
| All Files | files | files |

The graphs are available in gfa format alongside other graph and index files. Information about the associated file formats can be found:

VCF Decomposition

The Raw VCF files contain a site for each bubble in the graph. Nested bubbles result in overlapping sites. The nesting relationships are denoted with the PS (parent snarl), LV (level) and AT (allele traversal) tags and need to be taken into account when interpreting the VCF. Alternatively, you can use the "Decomposed VCFs", which have been normalized by using vcfbub to "pop" bubbles with alleles larger than 100 kb and vcfwave to realign each alt allele to the reference (script). Note that in order to reproduce the PanGenie analyses from the papers, you should instead use the PanGenie HPRC Workflow. This workflow has a CHM13 branch to use when working with that reference.
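To show how the nesting tags can be taken into account, here is a minimal Python sketch that reads the INFO column of each record and keeps only top-level sites, so that nested (overlapping) sites are not counted twice. It assumes that LV=0 marks a top-level bubble and that a record without an LV tag is top-level; both are assumptions to check against the VCF header. The records in the example are hypothetical.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO string such as 'AT=>1>2>6;LV=1;PS=>1>6' into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # flag fields have no '='
    return fields


def top_level_site_ids(vcf_lines):
    """Yield the ID of every record whose LV tag is 0 (assumed top level)."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        info = parse_info(cols[7])
        if int(info.get("LV", 0)) == 0:
            yield cols[2]


# Hypothetical records: one top-level site and one site nested inside it.
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t>1>6\tACGT\tA\t60\tPASS\tAT=>1>2>5>6,>1>6;LV=0",
    "chr1\t101\t>2>5\tC\tG\t60\tPASS\tAT=>2>3>5,>2>4>5;LV=1;PS=>1>6",
]
print(list(top_level_site_ids(records)))
```

Filtering on LV alone discards the variation described by nested sites. To keep it, follow the PS tag from each nested record back to its parent instead of dropping the record.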

The exact tools and commands used to produce the VCFs are given here.

Filtered Graphs

The "AF-Filtered VG Indexes" above were created by dropping nodes and edges supported by fewer than 10% of haplotypes. They give the best performance with Giraffe and are the ones used in the various papers to date. Note that Giraffe requires only the .gbz, .dist and .min indexes.
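The filtering rule can be illustrated with a small Python sketch. This is not the tool or command used to build the filtered graphs; it only applies the stated threshold to a hypothetical mapping from graph elements (nodes or edges) to the haplotypes that traverse them. All names in it are illustrative.

```python
def af_filter(support: dict[str, set[str]], n_haplotypes: int,
              min_percent: int = 10) -> set[str]:
    """Return the elements supported by at least min_percent of haplotypes.

    Integer arithmetic avoids floating-point surprises at the threshold.
    """
    return {
        element
        for element, haplotypes in support.items()
        if len(haplotypes) * 100 >= min_percent * n_haplotypes
    }


# Hypothetical support sets for a graph built from 90 haplotypes.
support = {
    "node_a": {f"hap{i}" for i in range(9)},  # 9 of 90 = 10%, kept
    "node_b": {f"hap{i}" for i in range(8)},  # 8 of 90 is below 10%, dropped
}
print(af_filter(support, n_haplotypes=90))
```

With 90 haplotypes, an element needs support from at least 9 of them to survive a 10% cutoff.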

Excluded Sequence

Some input contigs could not be assigned to a reference chromosome and were dropped. See the "full graph bed" files above for a listing of these. Contig fragments >10kb that did not map anywhere were likewise excluded (these regions are predominantly centromeric). See the "clipped graph bed" files above for these regions (this file includes the unassigned contigs). dna-brnn was not used to make these graphs.
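To get a feel for how much sequence was excluded, the BED files can be summarized per contig. The Python sketch below assumes standard BED columns (name, 0-based start, exclusive end) and uses hypothetical contig names and coordinates; it does not merge overlapping intervals, so overlaps would be counted twice.

```python
from collections import defaultdict


def excluded_bases_per_contig(bed_lines) -> dict:
    """Sum interval lengths (end - start) for each sequence name in a BED file."""
    totals = defaultdict(int)
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and BED header lines
        name, start, end = line.split()[:3]
        totals[name] += int(end) - int(start)
    return dict(totals)


# Hypothetical BED records, for illustration only.
bed = [
    "contigA\t0\t15000",
    "contigA\t20000\t32000",
    "contigB\t100\t10200",
]
print(excluded_bases_per_contig(bed))
```

To run it on a downloaded file, pass an open file handle instead of the list, for example `excluded_bases_per_contig(open(path))`, where `path` points at one of the bed files.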

PGGB

The Pangenome Graph Builder pipeline (PGGB) (cite) creates an all-vs-all graph with base-level alignments and no clipping of mitochondrial or centromeric regions.

Graphs and associated files are summarized below.

| Description | Location |
|---|---|
| graph | gfa |
| untangle | delta, paf |
| Decomposed VCFs | GRCh38 VCF, GRCh38 VCF Index |
| Raw VCFs | chm13.1-22+X, chm13.M, grch38.1-22+X, grch38.M, grch38.Y |

Graph chromosome files and images can be found here and here.

See above for more information on VCF decomposition (script).

Change Log

* Dec 03, 2021: updated minigraph-cactus VCFs to fix headers (thanks to Wen-Wei)