The Human Pangenome Reference Consortium (HPRC) tested which combination of current genome sequencing and automated assembly approaches yields the most complete, accurate, and cost-effective diploid genome assemblies with minimal manual curation. Assemblies were generated for GIAB HG002. NIST evaluated twenty-nine assemblies by aligning them to GRCh38 and calling variants with dipcall v0.3 (https://github.com/lh3/dipcall). The resulting small variant calls were then benchmarked against GIAB benchmark v4.2.1 using hap.py v3.12 (https://github.com/Illumina/hap.py).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
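Small-variant benchmarking of the kind described in the HPRC entry above is usually summarized as recall, precision, and F1. The following is a minimal sketch of that arithmetic, assuming simple true-positive, false-positive, and false-negative counts; the function name and the example counts are hypothetical and are not results from this dataset, and the sketch does not reproduce how hap.py itself matches or counts variants.

```python
def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Return recall, precision and F1 computed from raw counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = benchmark_metrics(tp=990, fp=10, fn=10)
print(metrics)
```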
Assembly graphs and reference Verkko assembly for HG002 dataset
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
This is an assembly of the human HG002 (GM24385) dataset generated by phasebook. The raw sequencing data are from the October 2021 GM24385 Q20+ Simplex Dataset Release; reads from 4 flowcells were used. https://labs.epi2me.io/gm24385_q20_2021.10/?
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
HG002 dorado0.4.3 SUP
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
HG007 Ultima VCF 40x coverage from 2024, from https://giab-data.s3.amazonaws.com/ultima-GIAB-Feb-2024/DeepVariant_vcfs/NA24385-Z0027.annotated.AF.vcf.gz
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Example data for creating a hybrid assembly from a supernova (10X Genomics linked reads) and canu (PacBio CLR long reads). Reads used to generate these assemblies were sourced from the Genome In A Bottle dataset for HG002.
Haplotigs have been removed using purge_dups.
The SV callsets of the HG002 human sample produced by cuteSV from multiple long-read sequencing platforms.
Reference: Tao Jiang, et al. Long Read based Human Genomic Structural Variation Detection with cuteSV. bioRxiv 780700; doi: https://doi.org/10.1101/780700
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Example data for creating a hybrid assembly from a supernova (10X Genomics linked reads) and canu (PacBio CLR long reads). Reads used to generate these assemblies were sourced from the Genome In A Bottle dataset for HG002. Pipeline details can be found here.
Assembly files with haplotigs removed using purge_dups:
1 kb BLASTn alignments of supernova against canu:
Hybrid assembly:
HG002 (NA24385) is a reference human genome sample used for benchmarking and comparing bioinformatics applications. This dataset contains a subset of 20,000 reads from the HG002 human reference sample, sequenced using an Oxford Nanopore Technologies PromethION sequencer on an R10.4.1 flowcell. Sheared DNA libraries (~17Kb) were prepared using the ONT LSK114 ligation library prep and an R10.4.1 flow cell was used to generate ~30X genome coverage. The original data in the FAST5 format was converted to BLOW5 format using slow5tools v0.8.0. This is a downsampled subset containing 20,000 reads in BLOW5 format.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
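The relationship between read count, read length, and genome coverage in the ONT entry above can be sketched with simple arithmetic. This is a rough illustration only: it assumes a mean read length equal to the ~17 kb shear size and a haploid genome size of 3.1 Gb, neither of which is stated as a measured value in the description, and the function names are hypothetical.

```python
import math


def mean_coverage(n_reads: int, mean_read_length_bp: int, genome_size_bp: int) -> float:
    """Expected mean depth = total sequenced bases / genome size."""
    return n_reads * mean_read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, mean_read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads required to reach a target mean depth."""
    return math.ceil(target_coverage * genome_size_bp / mean_read_length_bp)


# Assumed values (not taken from the dataset itself).
GENOME_SIZE_BP = 3_100_000_000
MEAN_READ_LENGTH_BP = 17_000

subset_depth = mean_coverage(20_000, MEAN_READ_LENGTH_BP, GENOME_SIZE_BP)
full_run_reads = reads_needed(30, MEAN_READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"20,000-read subset: ~{subset_depth:.2f}X")
print(f"Reads for ~30X at this read length: {full_run_reads:,}")
```

Under these assumptions the 20,000-read subset corresponds to roughly a tenth of one-fold coverage, which is consistent with its role as a small test dataset rather than a genome-scale one.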
HG002 Illumina PCR Free vcf 40x coverage
https://www.gnu.org/licenses/gpl-2.0.html
Variant calls for HG002 on PacBio HiFi, PacBio CLR, and Oxford Nanopore data are included for alignments generated by lra, minimap2, and ngmlr. Variants
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
The basecalling of the R9.4.1 data was performed with Guppy 2.3.7, Guppy 4.2.2, and Guppy 6.3.8. The basecalling of the R10.4.1 data was performed by Dorado 0.5.3.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
HG002 Ultima VCF 40x coverage from 2022, from https://s3.amazonaws.com/ultima-selected-1k-genomes-vcf-only/DeepVariant_vcfs/HG002_005401-UGAv3-1-CACATCCTGCATGTGAT.vcf.gz
CMRG v1.00 of a small variant benchmark and structural variant benchmark focused on 273 challenging medically relevant genes for the Genome in a Bottle (GIAB) sample HG002 (aka Ashkenazi son). These benchmarks were generated from a trio-based hifiasm v0.11 (https://doi.org/10.1038/s41592-020-01056-5) diploid assembly of HG002 using PacBio HiFi reads for HG002 for assembly and partitioning into phased haplotypes using Illumina reads for the parents, HG003 and HG004. This benchmark contains vcfs for small and structural variants along with corresponding benchmark bed files indicating regions that are homozygous reference if they do not have a variant in the vcf. We extensively curated the variant calls, excluding any found to be questionable or errors. This benchmark helps measure performance in important challenging regions, including challenging segmental duplications, regions with complex variants, regions with structural variants, and regions affected by false duplications in GRCh37 or GRCh38. This benchmark is described in https://doi.org/10.1101/2021.06.07.444885.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
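The CMRG entry above states that benchmark BED regions are homozygous reference wherever the VCF has no variant. The sketch below illustrates that rule under simplifying assumptions: intervals are 0-based, half-open, sorted, and non-overlapping per chromosome, and each variant is reduced to a single 0-based position. The function name and the toy data are hypothetical, and real comparisons (for example with hap.py) handle multi-base and complex variants far more carefully.

```python
from bisect import bisect_right


def classify_site(chrom, pos, benchmark_regions, benchmark_variants):
    """Classify a position as 'variant', 'hom_ref' or 'outside_benchmark'.

    benchmark_regions: dict mapping chrom -> sorted list of (start, end) intervals
    benchmark_variants: set of (chrom, pos) tuples
    """
    intervals = benchmark_regions.get(chrom, [])
    # Find the last interval whose start is <= pos.
    idx = bisect_right(intervals, (pos, float("inf"))) - 1
    inside = idx >= 0 and intervals[idx][0] <= pos < intervals[idx][1]
    if not inside:
        return "outside_benchmark"  # no truth statement can be made here
    if (chrom, pos) in benchmark_variants:
        return "variant"
    return "hom_ref"  # inside the benchmark regions with no variant recorded


# Toy data, for illustration only.
regions = {"chr1": [(100, 200), (300, 400)]}
variants = {("chr1", 150)}
calls = [classify_site("chr1", p, regions, variants) for p in (50, 150, 160, 250, 300)]
print(calls)
```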
HG002 PacBio HiFi vcf 37x coverage
This distribution includes data analyzed by the sv-callers workflow (v1.1.0) in the single-sample (germline) and paired-sample (somatic) modes:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
This is a data repository for the genome assemblies of three human samples enhanced by LOCLA (DOI: 10.5281/zenodo.8280853). LOCLA is a novel genome assembly optimization tool that iteratively improves the quality of an assembly by locating sequencing reads on partially assembled scaffolds, thus enabling gap filling and further scaffolding.
The three human genome assemblies and the assembly statistics are compressed into one single zip file. File names are explained as follows:
LLD0021C_locla.fasta : Whole genome assembly of a Taiwanese male individual generated by LOCLA
LLD0021C_locla_quality.txt : Assembly statistics of LLD0021C_locla.fasta
chm13_locla.fasta : Whole genome assembly of the CHM13 cell line generated by LOCLA
chm13_locla_quality.txt : Assembly statistics of chm13_locla.fasta
hg002_gma_locla.fasta : Whole genome assembly of the HG002 sample generated by LOCLA
hg002_gma_locla_quality.txt : Assembly statistics of hg002_gma_locla.fasta
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
ABSTRACT
Generating high-quality variant callsets across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a “universal” algorithm has magnified the unknown impacts when used with non-human species. We use bovine genomes to assess the limits of using human-genome-trained variant callers, including the allele frequency channel (DV-AF) and joint-caller DeepTrio (DT). Our novel approach, TrioTrain, automates extending DV for diploid species lacking Genome-in-a-Bottle (GIAB) resources, using a region shuffling approach to mitigate barriers for SLURM-based clusters. Imperfect animal truth labels are curated to remove Mendelian discordant sites before training DV to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to create the first multi-species-trained DV-AF checkpoint. Although incomplete bovine truth sets constrain recall within challenging repetitive regions, we observe a mean SNV F1 score >0.990 across new checkpoints during GIAB benchmarking. With HG002, a bovine-trained checkpoint (28) decreased the Mendelian Inheritance Error (MIE) rate by a factor of two compared to the default (DV). Checkpoint 28 has a mean MIE rate of 0.03 percent in three bovine interspecies cross genomes. These results illustrate that a multi-species, trio-based training strategy reduces inheritance errors during single-sample variant calling. While exclusively training with human genomes deters transferring deep-learning-based variant calling to new species, we use the diverse ancestry within bovids to illustrate the need for advanced tools designed for comparative genomics.
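The Mendelian Inheritance Error (MIE) rate mentioned in the abstract can be illustrated with a small sketch. It assumes autosomal, diploid, fully called genotypes written as pairs of allele indexes, and it counts a site as an error when the child's genotype cannot be formed from one maternal and one paternal allele. The function names and toy genotypes are hypothetical; this is not the TrioTrain implementation.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """True if the child genotype can be built from one allele of each parent."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == child_sorted
        for m, f in product(mother, father)
    )


def mie_rate(trio_genotypes):
    """Fraction of sites whose (child, mother, father) genotypes violate Mendelian inheritance."""
    trios = list(trio_genotypes)
    if not trios:
        return 0.0
    errors = sum(not is_mendelian_consistent(c, m, f) for c, m, f in trios)
    return errors / len(trios)


# Toy genotypes (child, mother, father), for illustration only.
sites = [
    ((0, 1), (0, 0), (0, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # error: mother cannot transmit allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 1), (1, 1), (0, 0)),  # consistent
]
rate = mie_rate(sites)
print(f"MIE rate: {rate:.2%}")
```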
TrioTrain_README.md
README file that describes the contents and purpose of these files in further detail.
TrioTrain_project_metadata.csv
Pedigree and breed labels for all bovine samples included in this study.
CallableRegions.tar.gz
Per-sample callable region files. After cohort QC, we generated truth sets based on the UMAG1 cohort using GATK-derived genotypes. The region files were produced by GATK (v3.8-1-0-gf15c1c3ef), and the per-sample CallableLoci output was then parsed to extract only PASS regions for downstream analyses.
UMAG1.POP.FREQ.vcf.gz
UMAGv1 cohort population allele frequency file.
ReferenceGenome.tar.gz
Bovine reference genome files
ModelCheckpoint.tar.gz
Final selected TrioTrain checkpoint (28). This file is compatible with DeepVariant (v1.4) for short-read, whole-genome-sequencing (WGS) data. Using this alternative checkpoint requires a Population VCF compatible with the reference genome provided to DeepVariant.
DV-TrioTrain-0.8.tar.gz
The source code for the TrioTrain pipeline (v0.8) at the time of publication. Additional information, including installation instructions, is available on GitHub: https://github.com/jkalleberg/DV-TrioTrain/releases/tag/v0.8
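The CallableRegions.tar.gz description in the file list above mentions parsing per-sample CallableLoci output to keep only PASS regions. A minimal sketch of such a filter follows. It assumes a whitespace-delimited, BED-like layout with the state label in the fourth column; the label to keep is a parameter because the exact string depends on the tool version, and it defaults to PASS only to follow the wording of the description. The function name and example lines are hypothetical and do not come from the TrioTrain code.

```python
def extract_regions(lines, keep_state="PASS"):
    """Return (chrom, start, end) tuples for records whose state matches keep_state."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed records
        chrom, start, end, state = fields[:4]
        if state == keep_state:
            regions.append((chrom, int(start), int(end)))
    return regions


# Toy records, for illustration only.
example = [
    "1 0 100 NO_COVERAGE",
    "1 100 500 PASS",
    "2 0 50 LOW_COVERAGE",
    "2 50 80 PASS",
]
kept = extract_regions(example)
print(kept)
```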
Short telomeres cause age-related disease and long telomeres predispose to cancer; however, the mechanisms regulating telomere length are unclear. To probe these mechanisms, we developed a nanopore sequencing method, Telomere Profiling, that determined mean telomere length with results similar to a Southern blot and to the clinical FlowFISH assay. We mapped telomere reads to specific chromosome ends and, strikingly, could identify chromosome end-specific lengths that differed by more than 6 kb. We measured chromosome end-specific telomere lengths for 147 individuals and found that specific chromosome ends were consistently shorter or longer. This rank order of specific chromosome end telomere lengths was also found in newborn cord blood, suggesting telomere length is determined at birth. The average telomere length at birth was ~8 kb +/- 250 bp, shorter than previously estimated. Understanding the mechanisms regulating length will allow deeper insights into telomere biology that can lead to new approach...
Telomeres were isolated and measured using the Telomere Profiling protocol with Oxford Nanopore's MinION instrument and basecalled using ONT's guppy basecaller.
Telomere Profiling Dataset: Telomere dataset used for calculating bulk and chromosome specific telomere lengths
https://doi.org/10.5061/dryad.dz08kps5d
The dataset includes demultiplexed raw telomere reads that were used for measuring telomere length across a wide population, telomere reads from the HG002 cell line, as well as custom CHM13 and HG002 reference genomes with truncated telomere sequences which were used for mapping reads.
1) Data file types are described in the table below.
| Data Type | File Name | Description ...
https://spdx.org/licenses/CC0-1.0.html
Complex structural variations (cxSVs) are often overlooked in genome analyses due to detection challenges. We developed ARC-SV, a probabilistic and machine-learning-based method that enables accurate detection and reconstruction of cxSVs from standard whole-genome sequencing datasets. By applying ARC-SV across 4,262 genomes representing all continental populations, we identified cxSVs as a significant source of natural human genetic variation. The 4,262 individual genomes are sourced from the 1000 Genomes Project, the Human Genome Diversity Project, and the Simons Genome Diversity Project. We also applied ARC-SV to Neanderthal genomes, a number of benchmarking genomes including CHM13-T2T, HG002, HuRef, PG1, and HepG2 (cancer), as well as 119 postmortem brains (79 from the CommonMind Consortium and 40 from the National Institute of Mental Health Human Brain Collection Core). Most brain samples are from donors with major psychiatric disorders. The high-confidence cxSV calls for all samples (including dot plot visualizations) are compiled into Dataset S1. The high-confidence simple SV calls produced by ARC-SV for all samples are also included and compiled into Dataset S2. In our study (Zhou et al., Cell 2024), analysis of these datasets revealed that rare cxSVs have a propensity to occur in neural genes and loci that underwent rapid human-specific evolution, including those regulating corticogenesis. By performing single-nucleus multiomics in postmortem brains, we discovered cxSVs associated with differential gene expression and chromatin accessibility across various brain regions and cell types. Additionally, cxSVs detected in brains of psychiatric cases are enriched for linkage with psychiatric GWAS risk alleles detected in the same brains. Furthermore, our analysis revealed significantly decreased brain-region- and cell-type-specific expression of cxSV genes, specifically for psychiatric cases, implicating cxSVs in the molecular etiology of major neuropsychiatric disorders.
Methods: Structural variation (SV) calls from standard whole-genome sequencing (WGS) datasets were made via ARC-SV (https://github.com/SUwonglab/arcsv). Dot plots were generated using LAST (https://github.com/lpryszcz/last).