30 datasets found

Data from: GIAB Benchmarking of HG002 Assemblies from HPRC Year 1 Bakeoff
catalog.data.gov
data.nist.gov
Updated Jul 29, 2022
+ more versions
Cite
National Institute of Standards and Technology (2022). GIAB Benchmarking of HG002 Assemblies from HPRC Year 1 Bakeoff [Dataset]. https://catalog.data.gov/dataset/giab-benchmarking-of-hg002-assemblies-from-hprc-year-1-bakeoff
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
Description
The Human Pangenome Reference Consortium (HPRC) tested which combination of current genome sequencing and automated assembly approaches yields the most complete, accurate, and cost-effective diploid genome assemblies with minimal manual curation. Assemblies were generated for GIAB HG002. NIST evaluated twenty-nine assemblies, using dipcall v0.3 (https://github.com/lh3/dipcall) to produce variant calls from each assembly aligned to GRCh38. Small variant calls were then benchmarked against GIAB benchmark v4.2.1 using hap.py v3.12 (https://github.com/Illumina/hap.py).
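Small-variant benchmarking of this kind is usually summarized as precision, recall, and F1, derived from counts of true positives, false positives, and false negatives. The Python sketch below shows only that arithmetic. The function name and the example counts are hypothetical and are not taken from this dataset, and hap.py's own accounting (which tracks truth-side and query-side counts separately) is simplified here to a single true-positive count.

```python
def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from variant-comparison counts.

    tp: calls that match the benchmark
    fp: calls absent from the benchmark
    fn: benchmark variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = benchmark_metrics(tp=990, fp=10, fn=10)
print(metrics)
```

Guarding each division keeps the function usable on empty strata, where a benchmark region contains no variants at all.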
HG002
figshare.com
application/gzip
Updated Dec 6, 2022
Cite
Ivan Tolstoganov (2022). HG002 [Dataset]. http://doi.org/10.6084/m9.figshare.21678842.v1
Available download formats
application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.21678842.v1
Dataset updated
Dec 6, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ivan Tolstoganov
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Assembly graphs and reference Verkko assembly for HG002 dataset
Assembly of human HG002 (GM24385) ONT Q20+ Simplex dataset generated by...
zenodo.org
application/gzip
Updated Nov 26, 2021
Cite
Xiao Luo (2021). Assembly of human HG002 (GM24385) ONT Q20+ Simplex dataset generated by phasebook [Dataset]. http://doi.org/10.5281/zenodo.5729181
Available download formats
application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.5729181
Dataset updated
Nov 26, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Xiao Luo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is an assembly of the human HG002 (GM24385) dataset generated by phasebook. The raw sequencing data are from the October 2021 GM24385 Q20+ Simplex Dataset Release. Reads from 4 flowcells were used. https://labs.epi2me.io/gm24385_q20_2021.10/?
HG002
scidb.cn
Updated Oct 8, 2024
Cite
Jun Mencius (2024). HG002 [Dataset]. http://doi.org/10.57760/sciencedb.14326
Unique identifier
https://doi.org/10.57760/sciencedb.14326
Dataset updated
Oct 8, 2024
Dataset provided by
Science Data Bank
Authors
Jun Mencius
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
HG002 dorado0.4.3 SUP
HG002 Ultima (2024)
figshare.com
application/gzip
Updated Apr 5, 2024
Cite
Nathan Dwarshuis (2024). HG002 Ultima (2024) [Dataset]. http://doi.org/10.6084/m9.figshare.25554984.v1
Available download formats
application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.25554984.v1
Dataset updated
Apr 5, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nathan Dwarshuis
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
HG002 Ultima VCF (40x coverage, 2024), from https://giab-data.s3.amazonaws.com/ultima-GIAB-Feb-2024/DeepVariant_vcfs/NA24385-Z0027.annotated.AF.vcf.gz
HG002 Supernova and Canu assemblies
zenodo.org
application/gzip, bin
Updated Mar 20, 2025
+ more versions
Cite
Scott Mastromatteo (2025). HG002 Supernova and Canu assemblies [Dataset]. http://doi.org/10.5281/zenodo.15058964
Available download formats
application/gzip, bin
Unique identifier
https://doi.org/10.5281/zenodo.15058964
Dataset updated
Mar 20, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Scott Mastromatteo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description
Example data for creating a hybrid assembly from a Supernova assembly (10X Genomics linked reads) and a Canu assembly (PacBio CLR long reads). Reads used to generate these assemblies were sourced from the Genome In A Bottle dataset for HG002.

Haplotigs have been removed using purge_dups.

https://github.com/ScottMastro/hybrid-pipeline
The SV callsets of the HG002 human sample produced by cuteSV with multi...
explore.openaire.eu
Updated Oct 9, 2019
Cite
Tao Jiang (2019). The SV callsets of the HG002 human sample produced by cuteSV with multi long-read sequencing platforms. [Dataset]. http://doi.org/10.5281/zenodo.3556403
Unique identifier
https://doi.org/10.5281/zenodo.3556403
Dataset updated
Oct 9, 2019
Authors
Tao Jiang
Description
The SV callsets of the HG002 human sample produced by cuteSV with multi long-read sequencing platforms.

Reference: Long Read based Human Genomic Structural Variation Detection with cuteSV. Tao Jiang, et al. bioRxiv 780700; doi: https://doi.org/10.1101/780700
HG002 Supernova and Canu - Hybrid Assembly
zenodo.org
application/gzip, bin +1
Updated Mar 21, 2025
+ more versions
Cite
Scott Mastromatteo (2025). HG002 Supernova and Canu - Hybrid Assembly [Dataset]. http://doi.org/10.5281/zenodo.15062708
Available download formats
application/gzip, bin, txt
Unique identifier
https://doi.org/10.5281/zenodo.15062708
Dataset updated
Mar 21, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Scott Mastromatteo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example data for creating a hybrid assembly from a Supernova assembly (10X Genomics linked reads) and a Canu assembly (PacBio CLR long reads). Reads used to generate these assemblies were sourced from the Genome In A Bottle dataset for HG002. Pipeline details can be found here.

Assembly files, with haplotigs removed using purge_dups:

HG002_canu.contigs.purged.fa.gz

HG002_canu.unitigs.clean.bed (unitigs from canu)

HG002_supernova.pseudohap.purged.fa.gz

1 kb BLASTn alignments of supernova against canu:

HG002.blastn.summary.txt

Hybrid assembly:

HG002_hybrid.assembly.fa.gz (main scaffolds)

HG002_hybrid.assembly.leftover.fa.gz (unplaced scaffolds)

HG002_hybrid.source.bed (source assembly)

HG002_hybrid.source_leftover.bed (source assembly)

HG002_hybrid.scaff.t2t.txt (best chromosome match against T2T)
Data for: Nanopore R10.4.1 LSK114 HG002: subset of 20000 reads in BLOW5...
dataone.org
search.dataone.org
+3 more
Updated May 18, 2025
Cite
Hasindu Gamaarachchi (2025). Data for: Nanopore R10.4.1 LSK114 HG002: subset of 20000 reads in BLOW5 format [Dataset]. http://doi.org/10.5061/dryad.905qfttq9
Unique identifier
https://doi.org/10.5061/dryad.905qfttq9
Dataset updated
May 18, 2025
Dataset provided by
Dryad Digital Repository
Authors
Hasindu Gamaarachchi
Time period covered
Jan 1, 2023
Description
HG002 (NA24385) is a reference human genome sample used for benchmarking and comparing bioinformatics applications. This dataset contains a subset of 20,000 reads from the HG002 human reference sample, sequenced using an Oxford Nanopore Technologies PromethION sequencer on an R10.4.1 flowcell. Sheared DNA libraries (~17Kb) were prepared using the ONT LSK114 ligation library prep and an R10.4.1 flow cell was used to generate ~30X genome coverage. The original data in the FAST5 format was converted to BLOW5 format using slow5tools v0.8.0. This is a downsampled subset containing 20,000 reads in BLOW5 format.
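To put the subset size in context, the back-of-the-envelope calculation below estimates how much coverage 20,000 reads represent. It is only a sketch: it assumes a human genome size of roughly 3.1 Gb and treats the ~17 kb sheared fragment size as the mean read length, which the description does not state, so the real figure may differ.

```python
def estimated_coverage(num_reads: int, mean_read_length_bp: float,
                       genome_size_bp: float) -> float:
    """Approximate depth of coverage as total sequenced bases / genome size."""
    return num_reads * mean_read_length_bp / genome_size_bp


GENOME_SIZE_BP = 3.1e9        # assumption: approximate human genome size
MEAN_READ_LENGTH_BP = 17_000  # assumption: reads are about as long as the ~17 kb fragments

subset_coverage = estimated_coverage(20_000, MEAN_READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"Estimated coverage of the 20,000-read subset: {subset_coverage:.3f}x")
```

Under these assumptions the subset is a small fraction of 1x, which suits quick tool tests but not variant calling.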
HG002 Illumina PCR Free
figshare.com
application/gzip
Updated Jun 21, 2023
+ more versions
Cite
Nathan Dwarshuis (2023). HG002 Illumina PCR Free [Dataset]. http://doi.org/10.6084/m9.figshare.22637347.v1
Available download formats
application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.22637347.v1
Dataset updated
Jun 21, 2023
Dataset provided by
figshare
Authors
Nathan Dwarshuis
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
HG002 Illumina PCR Free vcf 40x coverage
lra-supplemental-HG002-SV.vcf.tar.gz
figshare.com
application/x-gzip
Updated Nov 15, 2020
Cite
Mark Chaisson (2020). lra-supplemental-HG002-SV.vcf.tar.gz [Dataset]. http://doi.org/10.6084/m9.figshare.13238717.v1
Available download formats
application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.13238717.v1
Dataset updated
Nov 15, 2020
Dataset provided by
figshare
Authors
Mark Chaisson
License
https://www.gnu.org/licenses/gpl-2.0.html
Description
Variant calls for HG002 on PacBio HiFi, CLR, and Oxford Nanopore data are included for alignments generated by lra, minimap2, and ngmlr. Variants...
HG002-HG004 Basecalled and modcalled data
scidb.cn
Updated Apr 28, 2025
Cite
Chenxi Zhang (2025). HG002-HG004 Basecalled and modcalled data [Dataset]. http://doi.org/10.57760/sciencedb.23979
Unique identifier
https://doi.org/10.57760/sciencedb.23979
Dataset updated
Apr 28, 2025
Dataset provided by
Science Data Bank
Authors
Chenxi Zhang
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
The basecalling of the R9.4.1 data was performed with Guppy 2.3.7, Guppy 4.2.2, and Guppy 6.3.8. The basecalling of the R10.4.1 data was performed by Dorado 0.5.3.
HG002 Ultima (2022)
figshare.com
application/x-gzip
Updated Apr 5, 2024
+ more versions
Cite
Nathan Dwarshuis (2024). HG002 Ultima (2022) [Dataset]. http://doi.org/10.6084/m9.figshare.25554978.v1
Available download formats
application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.25554978.v1
Dataset updated
Apr 5, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nathan Dwarshuis
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
HG002 Ultima VCF (40x coverage, 2022), from https://s3.amazonaws.com/ultima-selected-1k-genomes-vcf-only/DeepVariant_vcfs/HG002_005401-UGAv3-1-CACATCCTGCATGTGAT.vcf.gz
Challenging Medically-Relevant Genes Benchmark Set
catalog.data.gov
data.nist.gov
Updated Jul 29, 2022
Cite
National Institute of Standards and Technology (2022). Challenging Medically-Relevant Genes Benchmark Set [Dataset]. https://catalog.data.gov/dataset/challenging-medically-relevant-genes-benchmark-set
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
Description
CMRG v1.00 of a small variant benchmark and structural variant benchmark focused on 273 challenging medically relevant genes for the Genome in a Bottle (GIAB) sample HG002 (aka Ashkenazi son). These benchmarks were generated from a trio-based hifiasm v0.11 (https://doi.org/10.1038/s41592-020-01056-5) diploid assembly of HG002 using PacBio HiFi reads for HG002 for assembly and partitioning into phased haplotypes using Illumina reads for the parents, HG003 and HG004. This benchmark contains vcfs for small and structural variants along with corresponding benchmark bed files indicating regions that are homozygous reference if they do not have a variant in the vcf. We extensively curated the variant calls, excluding any found to be questionable or errors. This benchmark helps measure performance in important challenging regions, including challenging segmental duplications, regions with complex variants, regions with structural variants, and regions affected by false duplications in GRCh37 or GRCh38. This benchmark is described in https://doi.org/10.1101/2021.06.07.444885.
HG002 PacBio Hifi
figshare.com
application/gzip
Updated Jun 21, 2023
+ more versions
Cite
Nathan Dwarshuis (2023). HG002 PacBio Hifi [Dataset]. http://doi.org/10.6084/m9.figshare.22637410.v1
Available download formats
application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.22637410.v1
Dataset updated
Jun 21, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nathan Dwarshuis
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
HG002 PacBio Hifi vcf 37x coverage
Test data for sv-callers workflow
zenodo.org
data.niaid.nih.gov
zip
Updated Jun 10, 2024
Cite
Arnold Kuzniar; Luca Santuari (2024). Test data for sv-callers workflow [Dataset]. http://doi.org/10.5281/zenodo.4001614
Available download formats
zip
Unique identifier
https://doi.org/10.5281/zenodo.4001614
Dataset updated
Jun 10, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Arnold Kuzniar; Luca Santuari
Description
This distribution includes data analyzed by the sv-callers workflow (v1.1.0) in the single-sample (germline) and paired-sample (somatic) modes:

human reference genomes (in .fa[sta])

excluded genomic regions (in .bed[pe])

    ENCODE:ENCFF001TDO

    CEPH by Layer et al. (2014)

structural variants (SVs) detected by the workflow (in .vcf)

SV truth sets (in .bed[pe] and .vcf.gz)

    Personalis/1000 Genomes Project data by Parikh et al. (2016)

    PacBio/Moleculo data by Layer et al. (2014)

    dbVar:nstd167 data by Wenger et al. (2019)

    dbVar:nstd137 data by Huddleston et al. (2017)

workflow samples (in .csv) and config files (in .yaml)

short-read alignments are not included due to large sizes but are freely available for download (in .bam)

    NA12878 sample

    NA24385 sample

    CHM1_CHM13 sample

    COLO829 tumor sample with matched normal sample

Jupyter Notebooks to analyze SV callsets (in .ipynb)
Human genome assemblies enhanced by LOCLA
data.niaid.nih.gov
Updated Aug 29, 2023
Cite
Pao-Yin Fu (2023). Human genome assemblies enhanced by LOCLA [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8297188
Dataset updated
Aug 29, 2023
Dataset provided by
Chung-Yen Lin
Yu-Jung Chang
Yi-Chen Huang
Jan-Ming Ho
Hsueh-Chien Cheng
Ping-Heng Hsieh
Shu-Hwa Chen
Pao-Yin Fu
Wei-Hsuan Chuang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a data repository for the genome assemblies of three human samples enhanced by LOCLA (DOI: 10.5281/zenodo.8280853). LOCLA is a novel genome assembly optimization tool that iteratively improves the quality of an assembly by locating sequencing reads on partially assembled scaffolds, thus enabling gap filling and further scaffolding.

The three human genome assemblies and the assembly statistics are compressed into one single zip file. File names are explained as follows:

LLD0021C_locla.fasta : Whole genome assembly of a Taiwanese male individual generated by LOCLA

LLD0021C_locla_quality.txt : Assembly statistics of LLD0021C_locla.fasta

chm13_locla.fasta : Whole genome assembly of the CHM13 cell line generated by LOCLA

chm13_locla_quality.txt : Assembly statistics of chm13_locla.fasta

hg002_gma_locla.fasta : Whole genome assembly of the HG002 sample generated by LOCLA

hg002_gma_locla_quality.txt : Assembly statistics of hg002_gma_locla.fasta
Data from: Overcoming limitations to customize DeepVariant for domesticated...
zenodo.org
application/gzip, bin +1
Updated May 22, 2025
Cite
Robert Schnabel; Jenna Kalleberg (2025). Overcoming limitations to customize DeepVariant for domesticated animals with TrioTrain [Dataset]. http://doi.org/10.5281/zenodo.15482485
Available download formats
bin, application/gzip, csv
Unique identifier
https://doi.org/10.5281/zenodo.15482485
Dataset updated
May 22, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Robert Schnabel; Jenna Kalleberg
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT

Generating high-quality variant callsets across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a “universal” algorithm has magnified the unknown impacts when used with non-human species. We use bovine genomes to assess the limits of using human-genome-trained variant callers, including the allele frequency channel (DV-AF) and joint-caller DeepTrio (DT). Our novel approach, TrioTrain, automates extending DV for diploid species lacking Genome-in-a-Bottle (GIAB) resources, using a region shuffling approach to mitigate barriers for SLURM-based clusters. Imperfect animal truth labels are curated to remove Mendelian discordant sites before training DV to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to create the first multi-species-trained DV-AF checkpoint. Although incomplete bovine truth sets constrain recall within challenging repetitive regions, we observe a mean SNV F1 score >0.990 across new checkpoints during GIAB benchmarking. With HG002, a bovine-trained checkpoint (28) decreased the Mendelian Inheritance Error (MIE) rate by a factor of two compared to the default (DV). Checkpoint 28 has a mean MIE rate of 0.03 percent in three bovine interspecies cross genomes. These results illustrate that a multi-species, trio-based training strategy reduces inheritance errors during single-sample variant calling. While exclusively training with human genomes deters transferring deep-learning-based variant calling to new species, we use the diverse ancestry within bovids to illustrate the need for advanced tools designed for comparative genomics.

TrioTrain_README.md

README file that describes the contents and purpose of these files in further detail.

TrioTrain_project_metadata.csv

Pedigree and breed labels for all bovine samples included in this study.

CallableRegions.tar.gz

Per-sample callable region files. After cohort QC, we generated truth sets based on the UMAG1 cohort using GATK-derived genotypes. The region files were produced by GATK (v3.8-1-0-gf15c1c3ef), followed by parsing the per-sample CallableLoci output to extract only PASS regions for downstream analyses.

UMAG1.POP.FREQ.vcf.gz

UMAGv1 cohort population allele frequency file.

ReferenceGenome.tar.gz

Bovine reference genome files

ModelCheckpoint.tar.gz

Final selected TrioTrain checkpoint (28). This file is compatible with DeepVariant (v1.4) for short-read, whole-genome-sequencing (WGS) data. Using this alternative checkpoint requires a Population VCF compatible with the reference genome provided to DeepVariant.

DV-TrioTrain-0.8.tar.gz

The source code for the TrioTrain pipeline (v0.8) at the time of publication. Additional information, including installation instructions, is available on GitHub: https://github.com/jkalleberg/DV-TrioTrain/releases/tag/v0.8
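The Mendelian Inheritance Error (MIE) rate cited in the abstract above can be illustrated with a small sketch: a child's diploid genotype is consistent when it can be formed by taking one allele from each parent. This is not TrioTrain's or DeepVariant's implementation. The function names and example genotypes are hypothetical, and the check ignores missing genotypes, sex chromosomes, and genuine de novo mutations.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father) -> bool:
    """True if the child's unordered genotype can be built from one maternal
    and one paternal allele. Genotypes are pairs of allele indices, e.g. (0, 1)."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == child_sorted
        for m, f in product(mother, father)
    )


def mie_rate(trio_sites) -> float:
    """Fraction of sites whose (child, mother, father) genotypes violate
    Mendelian inheritance."""
    sites = list(trio_sites)
    if not sites:
        return 0.0
    errors = sum(not is_mendelian_consistent(c, m, f) for c, m, f in sites)
    return errors / len(sites)


# Hypothetical genotypes (child, mother, father), for illustration only.
example_sites = [
    ((0, 1), (0, 0), (0, 1)),  # consistent: 0 from mother, 1 from father
    ((1, 1), (0, 0), (0, 1)),  # violation: mother cannot contribute allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent: 0 from each parent
    ((0, 1), (1, 1), (1, 1)),  # violation: neither parent carries allele 0
]
rate = mie_rate(example_sites)
print(f"MIE rate: {rate:.2f}")
```

Comparing sorted allele pairs keeps the check independent of phasing, so `(0, 1)` and `(1, 0)` are treated as the same genotype.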
Telomere dataset used for calculating bulk and chromosome specific telomere...
search.dataone.org
data.niaid.nih.gov
+1 more
Updated Apr 19, 2024
Cite
Kayarash Karimian; Aljona Groot; Vienna Huso; Ramin Kahidi; Samantha Sholes; Rebecca Keener; Andreas Rechtsteiner; Jonathan Alder; John McDyer; Carol Greider (2024). Telomere dataset used for calculating bulk and chromosome specific telomere length [Dataset]. http://doi.org/10.5061/dryad.dz08kps5d
Unique identifier
https://doi.org/10.5061/dryad.dz08kps5d
Dataset updated
Apr 19, 2024
Dataset provided by
Dryad Digital Repository
Authors
Kayarash Karimian; Aljona Groot; Vienna Huso; Ramin Kahidi; Samantha Sholes; Rebecca Keener; Andreas Rechtsteiner; Jonathan Alder; John McDyer; Carol Greider
Time period covered
Apr 5, 2024
Description
Short telomeres cause age-related disease and long telomeres predispose to cancer; however, the mechanisms regulating telomere length are unclear. To probe these mechanisms, we developed a nanopore sequencing method, Telomere Profiling, that identified mean telomere lengths similar to those from a Southern blot and the clinical FlowFISH assay. We mapped telomere reads to specific chromosome ends and, strikingly, could identify chromosome end-specific lengths that differed by more than 6 kb. We measured chromosome end-specific telomere lengths for 147 individuals and found that specific chromosome ends were consistently shorter or longer. This rank order of specific chromosome end telomere lengths was also found in newborn cord blood, suggesting telomere length is determined at birth. The average telomere length at birth was ~8 kb +/- 250 bp, shorter than previously estimated. Understanding the mechanisms regulating length will allow deeper insights into telomere biology that can lead to new approach...

Telomeres were isolated and measured using the telomere profiling protocol with Oxford Nanopore's MinION instrument and basecalled using ONT's guppy basecaller.

Telomere Profiling Dataset: Telomere dataset used for calculating bulk and chromosome specific telomere lengths

https://doi.org/10.5061/dryad.dz08kps5d

The dataset includes demultiplexed raw telomere reads that were used for measuring telomere length across a wide population, telomere reads from the HG002 cell line, as well as custom CHM13 and HG002 reference genomes with truncated telomere sequences which were used for mapping reads.

Description of the data and file structure

1) Data file types are described in the table below.

| Data Type | File Name | Description ...
Data from: Detection and analysis of complex structural variation in human...
data.niaid.nih.gov
datadryad.org
zip
Updated Oct 7, 2024
Cite
Bo Zhou; Joseph Arthur; Hanmin Guo; Taeyoung Kim; Yiling Huang; Reenal Pattni; Tao Wang; Soumya Kundu; Jay Luo; HoJoon Lee; Daniel Nachun; Carolin Purmann; Emma Monte; Annika Weimer; Pingping Qu; Minyi Shi; Lixia Jiang; Xinqiong Yang; John Fullard; Jaroslav Bendl; Kiran Girdhar; Xi Chen; PsychENCODE Consortium; William Greenleaf; Laramie Duncan; Hanlee Ji; Xiang Zhu; Giltae Song; Stephen Montgomery; Dean Palejev; Heinrich Dohna; Panos Roussos; Anshul Kundaje; Joachim Hallmayer; Michael Snyder; Wing Wong; Alexander Urban (2024). Detection and analysis of complex structural variation in human genomes across populations and in brains of donors with psychiatric disorders [Dataset]. http://doi.org/10.5061/dryad.z08kprrpc
Available download formats
zip
Unique identifier
https://doi.org/10.5061/dryad.z08kprrpc
Dataset updated
Oct 7, 2024
Dataset provided by
National Institute of Mental Health (http://www.nimh.nih.gov/)
Stanford University
Pennsylvania State University
Pusan National University
Stanford University School of Medicine
Icahn School of Medicine at Mount Sinai
Bulgarian Academy of Sciences
American University of Beirut
Authors
Bo Zhou; Joseph Arthur; Hanmin Guo; Taeyoung Kim; Yiling Huang; Reenal Pattni; Tao Wang; Soumya Kundu; Jay Luo; HoJoon Lee; Daniel Nachun; Carolin Purmann; Emma Monte; Annika Weimer; Pingping Qu; Minyi Shi; Lixia Jiang; Xinqiong Yang; John Fullard; Jaroslav Bendl; Kiran Girdhar; Xi Chen; PsychENCODE Consortium; William Greenleaf; Laramie Duncan; Hanlee Ji; Xiang Zhu; Giltae Song; Stephen Montgomery; Dean Palejev; Heinrich Dohna; Panos Roussos; Anshul Kundaje; Joachim Hallmayer; Michael Snyder; Wing Wong; Alexander Urban
License
https://spdx.org/licenses/CC0-1.0.html
Description
Complex structural variations (cxSVs) are often overlooked in genome analyses due to detection challenges. We developed ARC-SV, a probabilistic and machine-learning-based method that enables accurate detection and reconstruction of cxSVs from standard whole-genome sequencing datasets. By applying ARC-SV across 4,262 genomes representing all continental populations, we identified cxSVs as a significant source of natural human genetic variation. The 4,262 individual genomes are sourced from the 1000 Genomes Project, the Human Genome Diversity Project, and the Simons Genome Diversity Project. We also applied ARC-SV to Neanderthal genomes, a number of benchmarking genomes including CHM13-T2T, HG002, HuRef, PG1, and HepG2 (cancer), as well as 119 postmortem brains (79 from the CommonMind Consortium and 40 from the National Institute of Mental Health Human Brain Collection Core). Most brain samples are from donors with major psychiatric disorders. The high-confidence cxSV calls for all samples (including dot plot visualizations) are compiled into Dataset S1. The high-confidence simple SV calls produced by ARC-SV for all samples are also included and compiled into Dataset S2. In our study (Zhou et al, Cell 2024), our analysis of these datasets revealed that rare cxSVs have a propensity to occur in neural genes and loci that underwent rapid human-specific evolution, including those regulating corticogenesis. By performing single-nucleus multiomics in postmortem brains, we discovered cxSVs associated with differential gene expression and chromatin accessibility across various brain regions and cell types. Additionally, cxSVs detected in brains of psychiatric cases are enriched for linkage with psychiatric GWAS risk alleles detected in the same brains. Furthermore, our analysis revealed significantly decreased brain-region- and cell-type-specific expression of cxSV genes, specifically for psychiatric cases, implicating cxSVs in the molecular etiology of major neuropsychiatric disorders.

Methods: Structural variation (SV) calls from standard whole-genome sequencing (WGS) datasets were made via ARC-SV (https://github.com/SUwonglab/arcsv). Dot plots were generated using LAST (https://github.com/lpryszcz/last).
