30 datasets found
  1. Data from: GIAB Benchmarking of HG002 Assemblies from HPRC Year 1 Bakeoff

    • catalog.data.gov
    • data.nist.gov
    Updated Jul 29, 2022
    + more versions
    Cite
    National Institute of Standards and Technology (2022). GIAB Benchmarking of HG002 Assemblies from HPRC Year 1 Bakeoff [Dataset]. https://catalog.data.gov/dataset/giab-benchmarking-of-hg002-assemblies-from-hprc-year-1-bakeoff
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The Human Pangenome Reference Consortium (HPRC) tested which combination of current genome sequencing and automated assembly approaches yields the most complete, accurate, and cost-effective diploid genome assemblies with minimal manual curation. Assemblies were generated for GIAB HG002. Variant calls from twenty-nine assemblies were evaluated by NIST using dipcall v0.3 (https://github.com/lh3/dipcall) to produce variant calls when aligned to GRCh38. Benchmarking of small variant calls was then performed against GIAB benchmark v4.2.1 using hap.py v3.12 (https://github.com/Illumina/hap.py).
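
    The benchmarking step described above compares a query callset with a truth set and summarizes the result as precision, recall, and F1. A minimal Python sketch of that arithmetic follows. The function name and the example counts are hypothetical, and it assumes a single true-positive count, whereas hap.py reports truth-side and query-side counts separately.

```python
def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Summarize variant-call accuracy from TP, FP, and FN counts.

    Simplifying assumption: one TP count is used for both precision and recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not taken from the dataset.
    print(benchmark_metrics(tp=90, fp=10, fn=30))
```

    Guarding the divisions keeps the function usable on empty strata, where a callset has no calls in a region.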

  2. HG002

    • figshare.com
    application/gzip
    Updated Dec 6, 2022
    Cite
    Ivan Tolstoganov (2022). HG002 [Dataset]. http://doi.org/10.6084/m9.figshare.21678842.v1
    Available download formats
    application/gzip
    Dataset updated
    Dec 6, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ivan Tolstoganov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Assembly graphs and reference Verkko assembly for HG002 dataset

  3. Assembly of human HG002 (GM24385) ONT Q20+ Simplex dataset generated by...

    • zenodo.org
    application/gzip
    Updated Nov 26, 2021
    Cite
    Xiao Luo (2021). Assembly of human HG002 (GM24385) ONT Q20+ Simplex dataset generated by phasebook [Dataset]. http://doi.org/10.5281/zenodo.5729181
    Available download formats
    application/gzip
    Dataset updated
    Nov 26, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiao Luo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an assembly of the human HG002 (GM24385) dataset generated by phasebook. The raw sequencing data are from the October 2021 GM24385 Q20+ Simplex Dataset Release; reads from 4 flowcells were used. https://labs.epi2me.io/gm24385_q20_2021.10/?

  4. HG002

    • scidb.cn
    Updated Oct 8, 2024
    Cite
    Jun Mencius (2024). HG002 [Dataset]. http://doi.org/10.57760/sciencedb.14326
    Dataset updated
    Oct 8, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Jun Mencius
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HG002 dorado0.4.3 SUP

  5. HG002 Ultima (2024)

    • figshare.com
    application/gzip
    Updated Apr 5, 2024
    Cite
    Nathan Dwarshuis (2024). HG002 Ultima (2024) [Dataset]. http://doi.org/10.6084/m9.figshare.25554984.v1
    Available download formats
    application/gzip
    Dataset updated
    Apr 5, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nathan Dwarshuis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

  6. HG002 Supernova and Canu assemblies

    • zenodo.org
    application/gzip, bin
    Updated Mar 20, 2025
    + more versions
    Cite
    Scott Mastromatteo (2025). HG002 Supernova and Canu assemblies [Dataset]. http://doi.org/10.5281/zenodo.15058964
    Available download formats
    application/gzip, bin
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Scott Mastromatteo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data for creating a hybrid assembly from a Supernova assembly (10X Genomics linked reads) and a Canu assembly (PacBio CLR long reads). Reads used to generate these assemblies were sourced from the Genome In A Bottle dataset for HG002.

    Haplotigs have been removed using purge_dups.

    https://github.com/ScottMastro/hybrid-pipeline

  7. The SV callsets of the HG002 human sample produced by cuteSV with multi...

    • explore.openaire.eu
    Updated Oct 9, 2019
    Cite
    Tao Jiang (2019). The SV callsets of the HG002 human sample produced by cuteSV with multi long-read sequencing platforms. [Dataset]. http://doi.org/10.5281/zenodo.3556403
    Dataset updated
    Oct 9, 2019
    Authors
    Tao Jiang
    Description

    The SV callsets of the HG002 human sample produced by cuteSV with multi long-read sequencing platforms.

    Reference: Long Read based Human Genomic Structural Variation Detection with cuteSV. Tao Jiang, et al. bioRxiv 780700; doi: https://doi.org/10.1101/780700

  8. HG002 Supernova and Canu - Hybrid Assembly

    • zenodo.org
    application/gzip, bin +1
    Updated Mar 21, 2025
    + more versions
    Cite
    Scott Mastromatteo (2025). HG002 Supernova and Canu - Hybrid Assembly [Dataset]. http://doi.org/10.5281/zenodo.15062708
    Available download formats
    application/gzip, bin, txt
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Scott Mastromatteo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data for creating a hybrid assembly from a Supernova assembly (10X Genomics linked reads) and a Canu assembly (PacBio CLR long reads). Reads used to generate these assemblies were sourced from the Genome In A Bottle dataset for HG002. Pipeline details can be found here.

    Assembly files with haplotigs removed using purge_dups:

    • HG002_canu.contigs.purged.fa.gz
    • HG002_canu.unitigs.clean.bed (unitigs from canu)
    • HG002_supernova.pseudohap.purged.fa.gz

    1 kb BLASTn alignments of supernova against canu:

    • HG002.blastn.summary.txt

    Hybrid assembly:

    • HG002_hybrid.assembly.fa.gz (main scaffolds)
    • HG002_hybrid.assembly.leftover.fa.gz (unplaced scaffolds)
    • HG002_hybrid.source.bed (source assembly)
    • HG002_hybrid.source_leftover.bed (source assembly)
    • HG002_hybrid.scaff.t2t.txt (best chromosome match against T2T)

  9. Data for: Nanopore R10.4.1 LSK114 HG002: subset of 20000 reads in BLOW5...

    • dataone.org
    • search.dataone.org
    • +3more
    Updated May 18, 2025
    Cite
    Hasindu Gamaarachchi (2025). Data for: Nanopore R10.4.1 LSK114 HG002: subset of 20000 reads in BLOW5 format [Dataset]. http://doi.org/10.5061/dryad.905qfttq9
    Dataset updated
    May 18, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Hasindu Gamaarachchi
    Time period covered
    Jan 1, 2023
    Description

    HG002 (NA24385) is a reference human genome sample used for benchmarking and comparing bioinformatics applications. This dataset contains a subset of 20,000 reads from the HG002 human reference sample, sequenced using an Oxford Nanopore Technologies PromethION sequencer on an R10.4.1 flowcell. Sheared DNA libraries (~17Kb) were prepared using the ONT LSK114 ligation library prep and an R10.4.1 flow cell was used to generate ~30X genome coverage. The original data in the FAST5 format was converted to BLOW5 format using slow5tools v0.8.0. This is a downsampled subset containing 20,000 reads in BLOW5 format.
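
    To put the size of the downsampled subset in context, expected coverage can be estimated as total sequenced bases divided by genome size. The Python sketch below is illustrative only. It assumes a mean read length equal to the ~17 kb sheared fragment size and a human genome size of 3.1 Gb; neither figure is stated as such in the dataset record.

```python
def estimated_coverage(n_reads: int, mean_read_length_bp: float, genome_size_bp: float) -> float:
    """Return expected fold coverage: total bases sequenced / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return n_reads * mean_read_length_bp / genome_size_bp


if __name__ == "__main__":
    # Assumed values: 17 kb mean read length, 3.1 Gb genome.
    subset = estimated_coverage(20_000, 17_000, 3.1e9)
    print(f"20,000-read subset: about {subset:.2f}X")
```

    Under these assumptions the subset covers only a small fraction of the genome, which suits quick tool testing rather than variant calling.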

  10. HG002 Illumina PCR Free

    • figshare.com
    application/gzip
    Updated Jun 21, 2023
    + more versions
    Cite
    Nathan Dwarshuis (2023). HG002 Illumina PCR Free [Dataset]. http://doi.org/10.6084/m9.figshare.22637347.v1
    Available download formats
    application/gzip
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    figshare
    Authors
    Nathan Dwarshuis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HG002 Illumina PCR Free vcf 40x coverage

  11. lra-supplemental-HG002-SV.vcf.tar.gz

    • figshare.com
    application/x-gzip
    Updated Nov 15, 2020
    Cite
    Mark Chaisson (2020). lra-supplemental-HG002-SV.vcf.tar.gz [Dataset]. http://doi.org/10.6084/m9.figshare.13238717.v1
    Available download formats
    application/x-gzip
    Dataset updated
    Nov 15, 2020
    Dataset provided by
    figshare
    Authors
    Mark Chaisson
    License

    https://www.gnu.org/licenses/gpl-2.0.html

    Description

    Variant calls for HG002 on PacBio HiFi, CLR, and Oxford Nanopore data are included for alignments generated by lra, minimap2, and ngmlr. Variants

  12. HG002-HG004 Basecalled and modcalled data

    • scidb.cn
    Updated Apr 28, 2025
    Cite
    Chenxi Zhang (2025). HG002-HG004 Basecalled and modcalled data [Dataset]. http://doi.org/10.57760/sciencedb.23979
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Chenxi Zhang
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The basecalling of the R9.4.1 data was performed with Guppy 2.3.7, Guppy 4.2.2, and Guppy 6.3.8. The basecalling of the R10.4.1 data was performed by Dorado 0.5.3.

  13. HG002 Ultima (2022)

    • figshare.com
    application/x-gzip
    Updated Apr 5, 2024
    + more versions
    Cite
    Nathan Dwarshuis (2024). HG002 Ultima (2022) [Dataset]. http://doi.org/10.6084/m9.figshare.25554978.v1
    Available download formats
    application/x-gzip
    Dataset updated
    Apr 5, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nathan Dwarshuis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

  14. Challenging Medically-Relevant Genes Benchmark Set

    • catalog.data.gov
    • data.nist.gov
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Challenging Medically-Relevant Genes Benchmark Set [Dataset]. https://catalog.data.gov/dataset/challenging-medically-relevant-genes-benchmark-set
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    CMRG v1.00 of a small variant benchmark and structural variant benchmark focused on 273 challenging medically relevant genes for the Genome in a Bottle (GIAB) sample HG002 (aka Ashkenazi son). These benchmarks were generated from a trio-based hifiasm v0.11 (https://doi.org/10.1038/s41592-020-01056-5) diploid assembly of HG002 using PacBio HiFi reads for HG002 for assembly and partitioning into phased haplotypes using Illumina reads for the parents, HG003 and HG004. This benchmark contains vcfs for small and structural variants along with corresponding benchmark bed files indicating regions that are homozygous reference if they do not have a variant in the vcf. We extensively curated the variant calls, excluding any found to be questionable or errors. This benchmark helps measure performance in important challenging regions, including challenging segmental duplications, regions with complex variants, regions with structural variants, and regions affected by false duplications in GRCh37 or GRCh38. This benchmark is described in https://doi.org/10.1101/2021.06.07.444885.

  15. HG002 PacBio Hifi

    • figshare.com
    application/gzip
    Updated Jun 21, 2023
    + more versions
    Cite
    Nathan Dwarshuis (2023). HG002 PacBio Hifi [Dataset]. http://doi.org/10.6084/m9.figshare.22637410.v1
    Available download formats
    application/gzip
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nathan Dwarshuis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HG002 PacBio Hifi vcf 37x coverage

  16. Test data for sv-callers workflow

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 10, 2024
    Cite
    Arnold Kuzniar; Luca Santuari (2024). Test data for sv-callers workflow [Dataset]. http://doi.org/10.5281/zenodo.4001614
    Available download formats
    zip
    Dataset updated
    Jun 10, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Arnold Kuzniar; Luca Santuari
    Description

    This distribution includes data analyzed by the sv-callers workflow (v1.1.0) in the single-sample (germline) and paired-sample (somatic) modes:

  17. Human genome assemblies enhanced by LOCLA

    • data.niaid.nih.gov
    Updated Aug 29, 2023
    Cite
    Pao-Yin Fu (2023). Human genome assemblies enhanced by LOCLA [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8297188
    Dataset updated
    Aug 29, 2023
    Dataset provided by
    Chung-Yen Lin
    Yu-Jung Chang
    Yi-Chen Huang
    Jan-Ming Ho
    Hsueh-Chien Cheng
    Ping-Heng Hsieh
    Shu-Hwa Chen
    Pao-Yin Fu
    Wei-Hsuan Chuang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a data repository for the genome assemblies of three human samples enhanced by LOCLA (DOI: 10.5281/zenodo.8280853). LOCLA is a novel genome assembly optimization tool that iteratively improves the quality of an assembly by locating sequencing reads on partially assembled scaffolds, thus enabling gap filling and further scaffolding.

    The three human genome assemblies and the assembly statistics are compressed into one single zip file. File names are explained as follows:

    LLD0021C_locla.fasta : Whole genome assembly of a Taiwanese male individual generated by LOCLA

    LLD0021C_locla_quality.txt : Assembly statistics of LLD0021C_locla.fasta

    chm13_locla.fasta : Whole genome assembly of the CHM13 cell line generated by LOCLA

    chm13_locla_quality.txt : Assembly statistics of chm13_locla.fasta

    hg002_gma_locla.fasta : Whole genome assembly of the HG002 sample generated by LOCLA

    hg002_gma_locla_quality.txt : Assembly statistics of hg002_gma_locla.fasta

  18. Data from: Overcoming limitations to customize DeepVariant for domesticated...

    • zenodo.org
    application/gzip, bin +1
    Updated May 22, 2025
    Cite
    Robert Schnabel; Jenna Kalleberg (2025). Overcoming limitations to customize DeepVariant for domesticated animals with TrioTrain [Dataset]. http://doi.org/10.5281/zenodo.15482485
    Available download formats
    bin, application/gzip, csv
    Dataset updated
    May 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Robert Schnabel; Jenna Kalleberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT

    Generating high-quality variant callsets across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a “universal” algorithm has magnified the unknown impacts when used with non-human species. We use bovine genomes to assess the limits of using human-genome-trained variant callers, including the allele frequency channel (DV-AF) and joint-caller DeepTrio (DT). Our novel approach, TrioTrain, automates extending DV for diploid species lacking Genome-in-a-Bottle (GIAB) resources, using a region shuffling approach to mitigate barriers for SLURM-based clusters. Imperfect animal truth labels are curated to remove Mendelian discordant sites before training DV to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to create the first multi-species-trained DV-AF checkpoint. Although incomplete bovine truth sets constrain recall within challenging repetitive regions, we observe a mean SNV F1 score >0.990 across new checkpoints during GIAB benchmarking. With HG002, a bovine-trained checkpoint (28) decreased the Mendelian Inheritance Error (MIE) rate by a factor of two compared to the default (DV). Checkpoint 28 has a mean MIE rate of 0.03 percent in three bovine interspecies cross genomes. These results illustrate that a multi-species, trio-based training strategy reduces inheritance errors during single-sample variant calling. While exclusively training with human genomes deters transferring deep-learning-based variant calling to new species, we use the diverse ancestry within bovids to illustrate the need for advanced tools designed for comparative genomics.

    1. TrioTrain_README.md

      README file that describes the contents and purpose of these files in further detail.

    2. TrioTrain_project_metadata.csv

      Pedigree and breed labels for all bovine samples included in this study.

    3. CallableRegions.tar.gz

      Per-sample callable region files. After cohort QC, we generated truth sets based on the UMAG1 cohort using GATK-derived genotypes. The region files were produced by GATK (v3.8-1-0-gf15c1c3ef), and the per-sample CallableLoci output was then parsed to extract only PASS regions for downstream analyses.

    4. UMAG1.POP.FREQ.vcf.gz

      UMAGv1 cohort population allele frequency file.

    5. ReferenceGenome.tar.gz

      Bovine reference genome files

    6. ModelCheckpoint.tar.gz

      Final selected TrioTrain checkpoint (28). This file is compatible with DeepVariant (v1.4) for short-read, whole-genome-sequencing (WGS) data. Using this alternative checkpoint requires a Population VCF compatible with the reference genome provided to DeepVariant.

    7. DV-TrioTrain-0.8.tar.gz

      The source code for the TrioTrain pipeline (v0.8) at the time of publication. Additional information, including installation instructions, is available on GitHub: https://github.com/jkalleberg/DV-TrioTrain/releases/tag/v0.8
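
    The abstract above reports results as a Mendelian Inheritance Error (MIE) rate. As a rough illustration of what that metric counts, the Python sketch below checks whether a child's diploid genotype can be formed from one maternal and one paternal allele. This is a simplified, hypothetical helper for autosomal sites with complete genotypes; it is not the TrioTrain implementation.

```python
from typing import Iterable, Tuple

Genotype = Tuple[int, int]  # two allele indices, e.g. (0, 1) for a heterozygous call


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child could inherit one allele from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def mie_rate(trios: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of (child, mother, father) sites that violate Mendelian inheritance."""
    total = 0
    errors = 0
    for child, mother, father in trios:
        total += 1
        if not is_mendelian_consistent(child, mother, father):
            errors += 1
    return errors / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical genotypes for illustration only.
    sites = [
        ((0, 1), (0, 0), (1, 1)),  # consistent
        ((1, 1), (0, 0), (0, 1)),  # error: mother cannot contribute allele 1
    ]
    print(mie_rate(sites))
```

    Real pipelines also have to handle missing genotypes, sex chromosomes, and de novo variants, which this sketch ignores.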

  19. Telomere dataset used for calculating bulk and chromosome specific telomere...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Apr 19, 2024
    Cite
    Kayarash Karimian; Aljona Groot; Vienna Huso; Ramin Kahidi; Samantha Sholes; Rebecca Keener; Andreas Rechtsteiner; Jonathan Alder; John McDyer; Carol Greider (2024). Telomere dataset used for calculating bulk and chromosome specific telomere length [Dataset]. http://doi.org/10.5061/dryad.dz08kps5d
    Dataset updated
    Apr 19, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Kayarash Karimian; Aljona Groot; Vienna Huso; Ramin Kahidi; Samantha Sholes; Rebecca Keener; Andreas Rechtsteiner; Jonathan Alder; John McDyer; Carol Greider
    Time period covered
    Apr 5, 2024
    Description

    Short telomeres cause age-related disease and long telomeres predispose to cancer; however, the mechanisms regulating telomere length are unclear. To probe these mechanisms, we developed a nanopore sequencing method, Telomere Profiling, that identified mean telomere lengths similar to those from a Southern blot and from the clinical FlowFISH assay. We mapped telomere reads to specific chromosome ends and, strikingly, could identify chromosome end-specific lengths that differed by more than 6kb. We measured chromosome end-specific telomere lengths for 147 individuals and found that specific chromosome ends were consistently shorter or longer. This rank order of specific chromosome end telomere lengths was also found in newborn cord blood, suggesting telomere length is determined at birth. The average telomere length at birth was ~8kb +/- 250 bp, shorter than previously estimated. Understanding the mechanisms regulating length will allow deeper insights into telomere biology that can lead to new approach...

    Telomeres were isolated and measured using the telomere profiling protocol with Oxford Nanopore's MinION instrument and basecalled using ONT's guppy basecaller.

    Telomere Profiling Dataset: Telomere dataset used for calculating bulk and chromosome specific telomere lengths

    https://doi.org/10.5061/dryad.dz08kps5d

    The dataset includes demultiplexed raw telomere reads that were used for measuring telomere length across a wide population, telomere reads from the HG002 cell line, as well as custom CHM13 and HG002 reference genomes with truncated telomere sequences which were used for mapping reads.

    Description of the data and file structure

    1) Data files types are described in the table below.

    | Data Type | File Name | Description ...

  20. Data from: Detection and analysis of complex structural variation in human...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Oct 7, 2024
    Cite
    Bo Zhou; Joseph Arthur; Hanmin Guo; Taeyoung Kim; Yiling Huang; Reenal Pattni; Tao Wang; Soumya Kundu; Jay Luo; HoJoon Lee; Daniel Nachun; Carolin Purmann; Emma Monte; Annika Weimer; Pingping Qu; Minyi Shi; Lixia Jiang; Xinqiong Yang; John Fullard; Jaroslav Bendl; Kiran Girdhar; Xi Chen; PsychENCODE Consortium; William Greenleaf; Laramie Duncan; Hanlee Ji; Xiang Zhu; Giltae Song; Stephen Montgomery; Dean Palejev; Heinrich Dohna; Panos Roussos; Anshul Kundaje; Joachim Hallmayer; Michael Snyder; Wing Wong; Alexander Urban (2024). Detection and analysis of complex structural variation in human genomes across populations and in brains of donors with psychiatric disorders [Dataset]. http://doi.org/10.5061/dryad.z08kprrpc
    Available download formats
    zip
    Dataset updated
    Oct 7, 2024
    Dataset provided by
    National Institute of Mental Health (http://www.nimh.nih.gov/)
    Stanford University
    Pennsylvania State University
    Pusan National University
    Stanford University School of Medicine
    Icahn School of Medicine at Mount Sinai
    Bulgarian Academy of Sciences
    American University of Beirut
    Authors
    Bo Zhou; Joseph Arthur; Hanmin Guo; Taeyoung Kim; Yiling Huang; Reenal Pattni; Tao Wang; Soumya Kundu; Jay Luo; HoJoon Lee; Daniel Nachun; Carolin Purmann; Emma Monte; Annika Weimer; Pingping Qu; Minyi Shi; Lixia Jiang; Xinqiong Yang; John Fullard; Jaroslav Bendl; Kiran Girdhar; Xi Chen; PsychENCODE Consortium; William Greenleaf; Laramie Duncan; Hanlee Ji; Xiang Zhu; Giltae Song; Stephen Montgomery; Dean Palejev; Heinrich Dohna; Panos Roussos; Anshul Kundaje; Joachim Hallmayer; Michael Snyder; Wing Wong; Alexander Urban
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Complex structural variations (cxSVs) are often overlooked in genome analyses due to detection challenges. We developed ARC-SV, a probabilistic and machine-learning-based method that enables accurate detection and reconstruction of cxSVs from standard whole-genome sequencing datasets. By applying ARC-SV across 4,262 genomes representing all continental populations, we identified cxSVs as a significant source of natural human genetic variation. The 4,262 individual genomes are sourced from the 1000 Genomes Project, the Human Genome Diversity Project, and the Simons Genome Diversity Project. We also applied ARC-SV to Neanderthal genomes, a number of benchmarking genomes including CHM13-T2T, HG002, HuRef, PG1, and HepG2 (cancer), as well as 119 postmortem brains (79 from the CommonMind Consortium and 40 from the National Institute of Mental Health Human Brain Collection Core). Most brain samples are from donors with major psychiatric disorders. The high-confidence cxSV calls for all samples (including dot plot visualizations) are compiled into Dataset S1. The high-confidence simple SV calls produced by ARC-SV for all samples are also included and compiled into Dataset S2. In our study (Zhou et al, Cell 2024), our analysis of these datasets revealed that rare cxSVs have a propensity to occur in neural genes and loci that underwent rapid human-specific evolution, including those regulating corticogenesis. By performing single-nucleus multiomics in postmortem brains, we discovered cxSVs associated with differential gene expression and chromatin accessibility across various brain regions and cell types. Additionally, cxSVs detected in brains of psychiatric cases are enriched for linkage with psychiatric GWAS risk alleles detected in the same brains. Furthermore, our analysis revealed significantly decreased brain-region- and cell-type-specific expression of cxSV genes, specifically for psychiatric cases, implicating cxSVs in the molecular etiology of major neuropsychiatric disorders.

    Methods

    Structural variation (SV) calls from standard whole-genome sequencing (WGS) datasets were made via ARC-SV (https://github.com/SUwonglab/arcsv). Dot plots were generated using LAST (https://github.com/lpryszcz/last).
