The Human Pangenome Reference Consortium (HPRC) tested which combination of current genome sequencing and automated assembly approaches yields the most complete, accurate, and cost-effective diploid genome assemblies with minimal manual curation. Assemblies were generated for GIAB HG002. NIST evaluated twenty-nine assemblies, using dipcall v0.3 (https://github.com/lh3/dipcall) to produce variant calls from each assembly aligned to GRCh38. The small variant calls were then benchmarked against GIAB benchmark v4.2.1 using hap.py v3.12 (https://github.com/Illumina/hap.py).
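As a rough illustration of what small-variant benchmarking reports, the sketch below computes precision, recall, and F1 from true-positive, false-positive, and false-negative counts. It is a simplification and not hap.py's implementation: the `Counts` class and the example numbers are hypothetical, and real comparison tools count matches on the truth and query sides separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical summary of a call set compared against a benchmark."""
    tp: int  # calls that match a benchmark variant
    fp: int  # calls inside benchmark regions with no matching benchmark variant
    fn: int  # benchmark variants that were not called


def precision(c: Counts) -> float:
    """Fraction of calls that are correct."""
    total = c.tp + c.fp
    return c.tp / total if total else 0.0


def recall(c: Counts) -> float:
    """Fraction of benchmark variants that were recovered."""
    total = c.tp + c.fn
    return c.tp / total if total else 0.0


def f1(c: Counts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    example = Counts(tp=90, fp=10, fn=30)  # made-up counts for illustration
    print(f"precision={precision(example):.3f}")
    print(f"recall={recall(example):.3f}")
    print(f"f1={f1(example):.3f}")
```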
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
HG002 (NA24385) is a reference human genome sample used for benchmarking and comparing bioinformatics applications. This dataset is a downsampled subset of 20,000 reads from the HG002 sample, sequenced on an Oxford Nanopore Technologies PromethION sequencer. Sheared DNA libraries (~17 kb) were prepared using the ONT LSK114 ligation library prep, and an R10.4.1 flow cell was used to generate ~30X genome coverage. The original FAST5 data was converted to BLOW5 format using slow5tools v0.8.0, and the subset is provided in BLOW5 format.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
HG002 Illumina PCR-free VCF, 40x coverage
https://www.nist.gov/open/license
CMRG v1.00 is a small variant benchmark and structural variant benchmark focused on 273 challenging medically relevant genes for the Genome in a Bottle (GIAB) sample HG002 (aka Ashkenazi son). These benchmarks were generated from a trio-based hifiasm v0.11 (https://doi.org/10.1038/s41592-020-01056-5) diploid assembly of HG002, using PacBio HiFi reads from HG002 for assembly and Illumina reads from the parents, HG003 and HG004, for partitioning into phased haplotypes. This benchmark contains VCFs for small and structural variants along with corresponding benchmark BED files; positions inside the BED regions that have no variant in the VCF are homozygous reference. We extensively curated the variant calls, excluding any found to be questionable or erroneous. This benchmark helps measure performance in important challenging regions, including challenging segmental duplications, regions with complex variants, regions with structural variants, and regions affected by false duplications in GRCh37 or GRCh38. This benchmark is described in https://doi.org/10.1101/2021.06.07.444885.
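The relationship between the benchmark BED and VCF files can be sketched as follows. This is an illustrative simplification rather than the benchmark's own tooling: the function names are hypothetical, intervals are assumed to be sorted, non-overlapping, 0-based half-open BED intervals on a single chromosome, and variants are reduced to single 0-based positions (a VCF POS is 1-based, and indels or structural variants span more than one base).

```python
from bisect import bisect_right


def in_regions(regions: list[tuple[int, int]], pos: int) -> bool:
    """Return True if 0-based `pos` falls inside any half-open (start, end) interval.

    `regions` must be sorted by start and non-overlapping.
    """
    starts = [start for start, _ in regions]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and regions[i][0] <= pos < regions[i][1]


def classify(regions: list[tuple[int, int]], variant_positions: set[int], pos: int) -> str:
    """Classify a position using benchmark regions and benchmark variant positions."""
    if not in_regions(regions, pos):
        return "outside benchmark"  # no claim is made about this position
    if pos in variant_positions:
        return "benchmark variant"
    return "homozygous reference"  # inside the regions with no variant in the VCF


if __name__ == "__main__":
    regions = [(100, 200), (300, 400)]  # made-up intervals
    variants = {150}                    # made-up variant position
    for p in (150, 160, 250):
        print(p, classify(regions, variants, p))
```

Only positions inside the benchmark regions should count toward false positives or false negatives; calls outside them are neither confirmed nor refuted by the benchmark.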
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
HG007 Ultima VCF 40x coverage from 2024, from https://giab-data.s3.amazonaws.com/ultima-GIAB-Feb-2024/DeepVariant_vcfs/NA24385-Z0027.annotated.AF.vcf.gz
This distribution includes data analyzed by the sv-callers workflow (v1.1.0) in the single-sample (germline) and paired-sample (somatic) modes:
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
The dataset contains reference samples that will be useful for benchmarking and comparing bioinformatics tools for genome analysis. Currently, there are two samples, which are NA12878 (HG001) and NA24385 (HG002), sequenced on an Oxford Nanopore Technologies (ONT) PromethION using the latest R10.4.1 flowcells. Raw signal data output by the sequencer is provided for these datasets in BLOW5 format, and can be rebasecalled when basecalling software updates bring accuracy and feature improvements over the years. Raw signal data is not only for rebasecalling, but also can be used for emerging bioinformatics tools that directly analyse raw signal data. We also provide the basecalled data alongside the raw signal data and will continue to provide updated basecalls when there is a major update to the basecalling software. In the future, we plan to extend this open dataset with additional samples, including sequencing runs from vendors other than ONT.
This data set contains the variant call sets generated by different tools for the benchmarks in the paper "PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes". It includes the VCFs/BCFs for the following test cases: random deletion simulation on up to 1000 chromosome 21 samples; 1000 Genomes Project deletions inserted into simulated chromosomes 17 to 22 of up to 500 samples; HG001 (NA12878); trio of HG002 + HG003 + HG004; Polaris Diversity cohort; Polaris Kids cohort. Further, the long and short read reference call sets for HG001 are provided. For HG002, the reference call set and the high confidence regions by the Genome in a Bottle consortium are provided. For details on how the files have been created, please refer to the paper and the script repository on GitHub. References: Auton, A., Abecasis, G., Altshuler, D. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015). Zook, J.M., Chapman, B., Wang, J. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 32, 246–251 (2014). Zook, J.M., Hansen, N.F., Olson, N.D. et al. Author Correction: A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol (2020). This work was funded by the German Federal Ministry of Education and Research under grant number 031L0180.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Assembly graphs and reference Verkko assembly for HG002 dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
HG002 Ultima VCF 40x coverage from 2022, from https://s3.amazonaws.com/ultima-selected-1k-genomes-vcf-only/DeepVariant_vcfs/HG002_005401-UGAv3-1-CACATCCTGCATGTGAT.vcf.gz
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Pangenome graphs built with minigraph-0.14 for HPRC year-1 samples, excluding HG002, HG005 and NA19240. See 00README.txt for file description.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This repository contains the bgzipped variant calls in VCF format for the CHM1, NA12878 and AJ trio datasets that are used in the SVXplorer manuscript. Each file name contains the name of the sample (CHM1/NA12878/HG002/HG003/HG004) and the name of the method (SVXplorer/DELLY/LUMPY/TIDDIT/TARDIS/MANTA) used to call the variants. For each sample, there are three separate files for the DELLY calls, containing the deletion, duplication and inversion calls respectively. For NA12878, there are two sets of calls, one for each of the libraries (ERR194147/SRR505885).
A genome graph and associated GCSA index describing the transposable element pangenome derived from reference annotations and non-reference insertions from the HG00733 and HG002 genomes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
HG002 PacBio HiFi VCF, 37x coverage
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Performance of deletion calls for HG002.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Heuristics used to determine HG002 genotypes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Classification of inversions detected in HG002 using PacBio and ONT reads.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Additional file 2. Effect of contaminant cDNA reads and decontamination with cDNA-detector on structural variant calling in HG002 WGS.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Adaptive immune receptor repertoire (AIRR) is encoded by T cell receptor (TR) and immunoglobulin (IG) genes. Profiling these germline genes encoding AIRR (abbreviated as gAIRR) is important in understanding adaptive immune responses but is challenging due to the high genetic complexity. Our gAIRR Suite comprises three modules. gAIRR-seq, a probe capture-based targeted sequencing pipeline, profiles gAIRR from individual DNA samples. gAIRR-call and gAIRR-annotate call alleles from gAIRR-seq reads and annotate whole-genome assemblies, respectively. We gAIRR-seqed TRV and TRJ of seven Genome in a Bottle (GIAB) DNA samples with 100% accuracy and discovered novel alleles. We also gAIRR-seqed and gAIRR-called the TR and IG genes of a subject from both the peripheral blood mononuclear cells (PBMC) and oral mucosal cells. The calling results from these two cell types have a high concordance (99% for all known gAIRR alleles). We gAIRR-annotated 36 genomes to unearth 325 novel TRV alleles and 29 novel TRJ alleles. We could further profile the flanking sequences, including the recombination signal sequence (RSS). We validated two structural variants for HG002 and uncovered substantial differences of gAIRR genes in references GRCh37 and GRCh38. gAIRR Suite serves as a resource to sequence, analyze, and validate germline TR and IG genes to study various immune-related phenotypes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Additional file 1: Correlation of ATLs computed from PacBio and ONT datasets for HG002, using different methods for choosing a representative ATL from length distributions.