Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genotyping of known SNPs from ClinVar using the VCF and gVCF file formats and the number of homozygous reference sites and no-calls based on WGS data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The genotype dataset with 17,717,568 SNPs was used for a resampling-based genome-wide association study on 162 phenotypes from two partially overlapping maize association panels, namely the SAM and WiDiv panels. SNPs were filtered to remove markers with a minor allele frequency of less than 0.01 or a proportion of heterozygous SNP calls greater than 0.1 to produce the final SNP set.
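The two filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used for the dataset: genotypes are assumed to be coded as alternate-allele counts (0, 1, 2, or None for a missing call), and the function names are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed coding: 0 = homozygous reference, 1 = heterozygous,
# 2 = homozygous alternate, None = missing call.
Genotypes = List[Optional[int]]


def minor_allele_frequency(genotypes: Genotypes) -> float:
    """Frequency of the rarer allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def heterozygous_proportion(genotypes: Genotypes) -> float:
    """Share of called genotypes that are heterozygous."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    return sum(1 for g in called if g == 1) / len(called)


def filter_snps(
    snps: Dict[str, Genotypes],
    min_maf: float = 0.01,
    max_het: float = 0.1,
) -> Dict[str, Genotypes]:
    """Drop SNPs with MAF below min_maf or heterozygous proportion above max_het."""
    return {
        name: gts
        for name, gts in snps.items()
        if minor_allele_frequency(gts) >= min_maf
        and heterozygous_proportion(gts) <= max_het
    }
```

Calling `filter_snps` on a dictionary that maps marker names to genotype lists returns only the markers that pass both thresholds; the defaults mirror the 0.01 and 0.1 cut-offs quoted in the description.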
https://ega-archive.org/dacs/EGAC50000000708
Merged VCF file from sporadic Meniere disease cohort
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset makes available the UCSC Genome Browser (genome.ucsc.edu) GRCh37 genome build public session NA12878 WES Benchmark files in a single dataset so that these files can be used in other applications or genome browsers such as IGV. All genomic variant calls in all VCF files were decomposed and normalized with vt. This dataset contains:
Genome in a Bottle (GIAB) version 3.3.2 high confidence (HC) variant calls and genomic regions for HapMap individual NA12878:
GIAB_v3.3.2_NA12878-decomposed-normalized.vcf.gz
GIAB_v3.3.2_NA12878-decomposed-normalized.vcf.gz.tbi
GIAB_v3.3.2_NA12878_HC_regions.bed
HapMap individual NA12878 WES variant calls (VCF) and capture regions (BED) from diagnostic laboratories:
ARUP whole exome sequencing data (HiSeq 2000) publicly available from NCBI GeT-RM Browser
converted_ARUP_NA12878_Exome-decomposed-normalized.vcf.gz
converted_ARUP_NA12878_Exome-decomposed-normalized.vcf.gz.tbi
ARUP_SeqCap_EZ_Exome.bed
UCSF whole exome sequencing data (HiSeq 2500) publicly available from NCBI GeT-RM Browser
converted_UCSF_NA12878_WES_Agilent_V4_Custom-decomposed-normalized.vcf.gz
converted_UCSF_NA12878_WES_Agilent_V4_Custom-decomposed-normalized.vcf.gz.tbi
UCSF_WES_Agilent_V4_Custom.bed
Whole exome data (NextSeq 500) sequenced in CHEO diagnostic laboratory
CHEO_NA12878_WES_S1dataset.vcf.gz
CHEO_NA12878_WES_S1dataset.vcf.gz.tbi
Agilent_CRE_v2.bed
Genomic coordinates (BED) of OMIM genes for which a molecular basis of the associated disease is known (as of September 2019):
Omim_Genes.bed
https://ega-archive.org/dacs/EGAC50000000708
This dataset contains a merged VCF file generated from WES data of patients diagnosed with familial Meniere disease (FMD). Variant calling followed GATK best practices using the nf-core/Sarek pipeline (v3), and variants were filtered using genotype-level thresholds consistent with gnomAD filters. Multiallelic variants were split and INDELs were left-aligned during normalization. Variant Quality Score Recalibration (VQSR) was applied separately to SNVs and INDELs using well-established truth sets, with a 90% sensitivity threshold to maximize the detection of rare variants. Final variants were annotated with Ensembl VEP.
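To make the normalization step more concrete, the following Python sketch splits a multiallelic site into biallelic records and trims bases shared by both alleles. It is a simplified illustration rather than the tool used for this dataset: the function name is hypothetical, per-sample genotype columns are ignored, and true left-alignment of INDELs additionally requires the reference sequence, which is not modeled here.

```python
from typing import List, Tuple

Record = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def split_multiallelic(chrom: str, pos: int, ref: str, alts: List[str]) -> List[Record]:
    """Emit one biallelic record per ALT allele, trimmed to a minimal representation."""
    records = []
    for alt in alts:
        r, a, p = ref, alt, pos
        # Trim shared trailing bases, keeping at least one base per allele.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Trim shared leading bases, advancing the position accordingly.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a = r[1:], a[1:]
            p += 1
        records.append((chrom, p, r, a))
    return records
```

For example, a site with REF `CAG` and ALT alleles `C` and `CAGAG` becomes two records at the same position, a deletion `CAG>C` and an insertion `C>CAG`.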
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Variant Call Format (VCF) files associated with the publication entitled: "Selective whole-genome amplification reveals population genetics of Leishmania braziliensis directly from patient skin biopsies"
LOCATION CHANGE FOR ALZHEIMER'S DISEASE SEQUENCING PROJECT (ADSP) DATA: Please go to NIAGADS DSS to apply for build 38 ADSP genetic and phenotypic data. See Background below for more details. For instructions on how to access the additional ADSP data that are shared through NIAGADS DSS, visit the Application Instructions page.
Background: Additional sequencing data are continuously being generated by the ADSP. These data are mapped to the latest Genome Reference Consortium human genome build GRCh38 (hg38) and are being shared through the NIA Genetics of Alzheimer's Disease Data Storage Site (NIAGADS) Data Sharing Service (DSS). As of May 1, 2020 there are 4,789 whole genomes and 19,922 whole exomes available to the research community. Later in 2020 there will be a total of ~17,000 whole genomes and 19,922 whole exomes available through NIAGADS DSS (ng00067). The total number of genomes from multi-ethnic cohorts is anticipated to exceed 50,000. Please see the ADSP Design page for the complete study description. ADSP whole exome and whole genome sequence data that were shared through dbGaP were mapped to GRCh37 (build 37). These data are from the Discovery Phase of the project (described below) and will continue to be available at this site.
STUDY DESCRIPTION FOR dbGaP BUILD 37 ADSP DATA: The overarching goals of the Alzheimer's Disease Sequencing Project (ADSP) are to: (1) identify new genomic variants contributing to increased risk of developing Alzheimer's Disease (AD), (2) identify new genomic variants contributing to protection against developing AD, and (3) provide insight as to why individuals with known risk factor variants escape from developing AD. These factors will be studied in multi-ethnic populations in order to identify new pathways for disease prevention.
Such a study of human genomic variation and its relationship to health and disease requires examination of a large number of study participants and needs to capture information about common and rare variants (both single nucleotide and copy number) in well phenotyped individuals. Using existing samples from NIH funded and other studies, three NHGRI funded Large Scale Sequencing and Analysis Centers (LSAC) - Broad, Baylor, and Washington University - produced the DNA sequence data. Variant call data are being made available to the scientific community through NIH-approved data repositories. Statistical analysis of the sequence data is anticipated to identify new genetic risk and protective factors. The ADSP will conduct and facilitate analysis of sequence data to extend previous discoveries that may ultimately result in new directions for AD therapeutics. Analysis of ADSP data will be done in two phases. The Discovery Phase analysis (2014-2018) is funded under PAR-12-183. The entire Discovery dataset contains whole-genome sequencing data on 584 subjects from 113 families, and pedigree data for > 4000 subjects; whole exome sequencing data on 5096 cases and 4965 controls; and whole exome sequence data on an additional 853 subjects (682 cases [510 non-Hispanic, 172 Hispanic] and 171 Hispanic controls) from families that are multiply affected with AD. The Replication Phase (2016-2021) analysis will be funded under RFA-AG-16-001 and RFA-AG-16-002 and is expected to include a combination of genotyping and sequencing approaches on at least 30,000 subjects. Targeted sequencing will be done by the LSACs.
GRCh37 Data Releases
The first ADSP data release occurred on November 25, 2013. It included the whole-genome sequencing data in BAM file format on 410 individuals. The second ADSP data release occurred on March 31, 2014, and included the whole-genome sequencing data in BAM file format for an additional 168 individuals.
The third ADSP data release occurred on November 03, 2014 and included whole-exome sequencing data in BAM file format for 10,939 individuals. The fourth ADSP data release occurred on February 13, 2015 and included revised ethnic data for subjects with whole-exome sequencing data. The fifth ADSP data release occurred on July 13, 2015 and included whole-genome genotypes and updated phenotypes as well as changes to pedigree structures and sample IDs. The sixth ADSP data release occurred on December 8, 2015, and included whole-exome genotypes and updated phenotypes as well as changes to subject IDs. This seventh ADSP data release on April 12, 2016 includes: (1) WES and WGS SNV VCF files and (2) WES and WGS Indel PLINK files.
ADSP Data Available through dbGaP (ADSP - Whole Genome Sequencing; ADSP - Whole Exome Sequencing; Comments):
DNA-Seq (BAM): WGS n=578; WES n=10913; Sequence data available (plus n=38 replications w/out genotype data)
Concordant SNV Genotypes (PLINK format): WGS N/A; WES n=10913; QC'ed genotypes that are concordant between the Atlas (Baylor's) and GATK (Broad's) calling pipelines (a subset of the consensus genotype set)
Consensus Genotypes (PLINK and VCF format): WGS n=578; WES n=10913; QC'ed genotypes that are concordant between Atlas and GATK pipelines as well as those that were called uniquely by Atlas or GATK
Concordant Indel Genotypes (PLINK format): WGS n=578; WES n=10913; QC'ed genotypes that are concordant between the Atlas and GATK calling pipelines
Phenotype Data: WGS n=4735; WES n=10913; Data of n=53 phenotype variables available (plus administrative data), including APOE genotype. WGS phenotypes include data of connecting family members.
Please use the release notes provided by dbGaP to obtain detailed information about study release updates. The ADSP data portal provides a customized interface for users to quickly identify and retrieve files by covariates, phenotypes, and data properties such as sequencing facility or coverage.
For more information about the ADSP study and the data portal, please visit https://www.niagads.org/adsp/.
https://ega-archive.org/dacs/EGAC00001000319
This dataset contains VCF files from a variant calling analysis of 19 neuroblastoma patients. WES or WGS data of the primary tumor were compared to WES cfDNA analysis at the time of diagnosis and at a second timepoint (complete remission, partial remission, disease progression or relapse). For 4 patients, WGS of germline, tumor at diagnosis and tumor at relapse DNA was performed on Illumina HiSeq2500, with 100-bp paired-end reads. For the other patients, WES was performed using either an Agilent SureSelect Human All Exon v5 or a Roche Nimblegen SeqCap EZ Exome V3 kit on Illumina HiSeq2000, with 100-bp paired-end reads. SNVs observed in any of the primary tumors or cfDNA samples studied by WES were targeted using a capture sequencing panel at all intermediate time points.
https://ega-archive.org/dacs/EGAC00001000222
The need for a detailed catalogue of local variability for the study of rare diseases within the context of the Medical Genome Project motivated the whole exome sequencing of 267 unrelated individuals, representative of the healthy Spanish population.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the content and size of different standard file formats for the storage of genomic data.
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Exome sequencing data (VCF files) of nine adult patients with liver diseases. For exome sequencing, the library was prepared using Illumina DNA Prep with Exome 2.5 Enrichment product and sequenced on a NovaSeq 6000 instrument (Illumina, San Diego, CA). Reads were aligned to the GRCh38 Human Reference Genome using the Illumina DRAGEN Bio-IT Platform v3.9.
Filtered WGS reads (fastp) were aligned to the Verticillium dahliae reference (https://www.ncbi.nlm.nih.gov/assembly/95341/GCA_000150675.2) with BWA. Variants were called with freebayes v1.3.6 and annotated with SnpEff.
V-pipe pipeline for SARS-CoV-2 sequencing data
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The American chestnut (Castanea dentata) is a functionally extinct tree species that was decimated by an invasive fungal pathogen in the early 20th century. An understanding of the genomic architecture of local adaptation in wild American chestnut was necessary in order to deploy locally adapted, disease-resistant American chestnut populations. Here, we characterize the genomic basis of climate adaptation in remnant wild American chestnut, develop new computational methods, and evaluate the adaptive genomic content captured within backcross breeding populations. Whole genome re-sequencing data of 356 trees from Sandercock et al. (2022) coupled with genotype-environment association methods identified 18483 climate associated loci.
Hypertrophic cardiomyopathy (HCM) is the most common inherited cardiac disease in cats, often leading to congestive heart failure, arterial thromboembolism, and sudden cardiac death. The genetics of feline HCM are poorly understood, and limited genetic discoveries remain breed- or family-specific. We aimed to identify novel causative or disease-modifying variants in a large cohort of cats reflective of the general cat population. In a second cohort, we sought to characterize transcriptomic differences between HCM-affected cats and healthy controls. DNA was isolated from 138 domestic cats (109 HCM and 29 controls). No single variant or combination of variants of high, moderate, or modifying impact was identified in genome-wide analysis as causing or modifying the disease severity of HCM. Several rare high- and moderate-impact variants in genes associated with human HCM were detected in diseased cats. In a second cohort, left ventricular (LV), interventricular septal (IVS), and left atrial (LA) tissues...
WGS data generation
A total of 1-2 mL of whole blood was collected from the cephalic, saphenous, or jugular vein into EDTA blood collection tubes. DNA was isolated either from whole blood or from buffy coats after whole blood centrifugation at 2000 rpm for 15 minutes. Genomic DNA isolation was performed using commercially available kits (Gentra Puregene Blood kit, QIAGEN, Hilden, Germany; ArchivePure, 5Prime) following the respective manufacturer's protocol. High-quality unfragmented DNA was selected by a combination of 1% agarose gel visualization and spectrophotometric confirmation (a 260/280 ratio of ~1.8 and a concentration of > 50 ng/uL; NanoDrop One/One, Thermofisher, Waltham, MA, USA). Samples were stored at -20°C until ready for shipment to Theragen Bio Co., Ltd, Gyeonggi-do, Republic of Korea for WGS. Paired-end DNA libraries were generated with a TruSeq DNA Nano library prep kit. Samples were then pooled and sequenced at ~30x coverage on the Illumina NovaSeq6000 platf...
# Unraveling the genetics of feline hypertrophic cardiomyopathy: A multiomics study of 138 cats
Dataset DOI: 10.5061/dryad.cjsxksnjh
1. A population-level VCF of polymorphic SNP and indel variants called among 138 domestic cats with and without hypertrophic cardiomyopathy (HCM). The VCF was generated by mapping paired WGS fastq reads to the Fca126 reference genome with bwa mem and calling variants through GATK4 best practices. Variant annotations were generated with Ensembl's VEP based on Fca126 gene and exon boundaries. The VCF file contains meta-information lines, followed by a header line naming the fixed fields and the per-sample columns, and then data lines that each describe a variant at a genomic position. The fixed fields include chromosome (CHROM), position (POS), identifier (ID), the reference base(s) (REF), alternate base(s) (ALT), quality (QUAL), filter status (FILTER), and additional information ...,
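As a rough illustration of the fixed-field layout described above, the Python sketch below parses the first eight tab-separated columns of a single VCF data line. The record type, function name, and the example line are made up for illustration and are not taken from the dataset; per-sample genotype columns are ignored.

```python
from typing import Dict, List, NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: List[str]
    qual: str
    filter_status: str
    info: Dict[str, str]


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the eight fixed fields of one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filter_status, info = fields[:8]
    info_dict: Dict[str, str] = {}
    if info != ".":
        for entry in info.split(";"):
            # Flag-style INFO keys carry no value and map to an empty string.
            key, _, value = entry.partition("=")
            info_dict[key] = value
    return VcfRecord(
        chrom, int(pos), variant_id, ref, alt.split(","), qual, filter_status, info_dict
    )
```

Lines beginning with `#` (meta-information and the header) would need to be skipped before calling `parse_vcf_line`.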
https://spdx.org/licenses/CC0-1.0.html
A comprehensive understanding of the degree to which genomic variation is maintained by selection versus drift and gene flow is lacking in many important species such as Cannabis sativa (C. sativa), one of the oldest known crops to be cultivated by humans worldwide. We generated whole genome resequencing data across diverse samples of feralized (escaped domesticated lineages) and domesticated lineages of C. sativa. We performed analyses to examine population structure, and genome wide scans for FST, balancing selection, and positive selection. Our analyses identified evidence for sub-population structure and further support the Asian origin hypothesis of this species. Feral plants sourced from the U.S. exhibited broad regions on chromosomes 4 and 10 with high F̅ST which may indicate chromosomal inversions maintained at high frequency in this sub-population. Both our balancing and positive selection analyses identified loci that may reflect differential selection for traits favored by natural selection and artificial selection in feral versus domesticated sub-populations. In the U.S. feral sub-population, we found six loci related to stress response under balancing selection and one gene involved in disease resistance under positive selection, suggesting local adaptation to new climates and biotic interactions. In the marijuana sub-population, we identified the gene SMALLER TRICHOMES WITH VARIABLE BRANCHES 2 to be under positive selection which suggests artificial selection for increased tetrahydrocannabinol yield. Overall, the data generated and results obtained from our study help to form a better understanding of the evolutionary history in C. sativa.
https://ega-archive.org/dacs/EGAC00001000259
NIHR BioResource Rare Diseases WGS project - Hypertrophic Cardiomyopathy (HCM) Rare Disease domain (VCF data)
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/9JSMED
These datasets contain phenotypic and genotypic data from three connected populations of common bean (Phaseolus vulgaris L.) that were used to identify the genomic regions controlling the phenotypic response to Bean Leaf Crumple Virus (BLCrV). The first is the Andean by Meso (AxM) population, which contains 190 individuals derived from bi-parental crosses between Andean and Mesoamerican breeding lines. The AxM population included 120 additional breeding lines of Andean and Mesoamerican origin that were used as checks for their response against other viral diseases, such as Bean Golden Yellow Mosaic Virus (BGYMV). The second is a pre-breeding population (termed P135-136) composed of 111 lines that was obtained from two-way and three-way crosses between elite Andean lines and resistant sources against viral diseases. The third population is a panel of 186 Mesoamerican breeding lines assembled from a collection of elite materials from the Mesoamerican breeding pipeline at CIAT. The AxM population was evaluated in three yield trials in Palmira (Colombia) between 2013 and 2015 for flowering, maturity time and yield. All three populations were evaluated in three BLCrV trials in Pradera (Colombia), where the disease pressure is naturally high. The AxM and the Mesoamerican panel were genotyped by sequencing (GBS), and these datasets contain their corresponding genotypic matrices in variant-call format (VCF, v4.2) with sequence variants mapped against the reference genome of P. vulgaris (G19833, v2.1). A joint genotypic matrix with all available GBS data from these three populations is also included. The population P135-136 was genotyped with the DArTag targeted genotyping service offered by Diversity Arrays Technology (DArT PL, Bruce ACT, Australia), and the genotypic matrix is similarly included in VCF format.
https://spdx.org/licenses/CC0-1.0.html
The increasing prevalence of vector-borne diseases around the world highlights the pressing need for an in-depth exploration of the genetic and environmental factors that shape the adaptability and widespread distribution of mosquito populations. This research focuses on Culex tarsalis, a principal vector for various viral diseases including West Nile Virus (WNV). Through the development of a new reference genome and the examination of Restriction-Site Associated DNA sequencing (RAD-seq) data from over 300 individuals and 28 locations, we demonstrate that variables such as temperature, evaporation rates, and the density of vegetation significantly impact the genetic makeup of Cx. tarsalis populations. Among the alleles most strongly associated with environmental factors is a nonsynonymous mutation in a key gene related to circadian rhythms. These results offer new insights into the mechanisms of spread and adaptation in a key North American vector species, which is poised to become a growing health threat to both humans and animals in the face of ongoing climate change.
Methods
Sample Collection
Individual mosquitoes were trapped and collected from 28 different locations across the United States and Canada as part of the North American Mosquito Project (NAMP). All samples used in this study were collected in 2012 between the months of April and October.
Genome Sequencing, Assembly, and Annotation
An F4 population was used to generate the reference genome assembly, and high molecular weight DNA was extracted and sequenced on a Pacific Biosciences (PacBio) RS II (University of Delaware). Thirty-five SMRTcells were generated. The resulting reads provided 76X coverage of the ~790Mb Cx. tarsalis genome, and were assembled with MECAT. Gene annotation was completed by MAKER using EST and protein data from the Culex quinquefasciatus and Aedes aegypti mosquitoes. Sequences were downloaded from the NCBI Taxonomy database and both Trinotate and InterProScan were used for functional annotation of the MAKER predicted genes. The annotated assembly was assessed for completeness and quality using BUSCO and QUAST.
RAD-Seq Library Preparation, Sequencing, and SNP Calling
DNA was extracted from individual mosquitoes and libraries were constructed for Restriction-site Associated DNA Sequencing (RAD-Seq) according to previously established protocols. The SbfI enzyme was used to digest purified DNA, and individual samples were barcoded prior to Illumina sequencing. Raw sequencing reads were subsequently filtered to remove any reads with an uncalled base, an error in the restriction enzyme cut site, or with an average Phred quality score less than 20 over 15 consecutive nucleotides. Filtered reads were then de-multiplexed using the Stacks software package. After de-multiplexing, raw reads from each individual were aligned to the draft assembly of the Cx. tarsalis genome using BWA MEM, and individuals with poor mapping rates (less than 50%) were excluded from subsequent analyses. The mapped reads for the remaining 378 samples were then merged using the Samtools pipeline and SNPs were called using the GATK HaplotypeCaller. The SNPs were filtered using VCFtools v0.1.12a to retain only sites with a minimum average individual read depth of 10X and a maximum of 20% missing data, resulting in a total of 457,387 sites. Individual samples were then filtered again to remove individuals with missing data at more than 50% of the remaining SNP sites, leaving 322 samples from 28 different locations for further analysis.
Environmental data
Climate data was extracted from the ERA5-Land monthly averaged dataset provided by the Copernicus Climate Change Service. The original dataset was characterized by a temporal resolution of 1 hour and a native spatial resolution of 9 km on a reduced Gaussian grid (TCo1279). To facilitate broader accessibility and suitability for diverse analyses, the data underwent regridding to a regular lat-lon grid with a finer resolution of 0.1x0.1 degrees.
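The two-stage SNP filtering described in the RAD-Seq methods (sites first, then individuals) could be sketched as follows in Python. This is not the VCFtools procedure itself: the depth matrix layout, the function names, and the choice to average depth over called genotypes only are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Rows are sites, columns are samples; None marks a missing genotype.
DepthMatrix = List[List[Optional[int]]]


def keep_site(depths: List[Optional[int]],
              min_mean_depth: float = 10.0,
              max_missing: float = 0.2) -> bool:
    """Keep a site with mean depth >= 10X and at most 20% missing data."""
    called = [d for d in depths if d is not None]
    if not called:
        return False
    missing_fraction = (len(depths) - len(called)) / len(depths)
    mean_depth = sum(called) / len(called)
    return mean_depth >= min_mean_depth and missing_fraction <= max_missing


def filter_sites_then_samples(
    matrix: DepthMatrix, max_sample_missing: float = 0.5
) -> Tuple[DepthMatrix, List[int]]:
    """Filter sites, then drop samples missing at more than 50% of kept sites."""
    sites = [row for row in matrix if keep_site(row)]
    n_samples = len(matrix[0]) if matrix else 0
    kept_samples = []
    for j in range(n_samples):
        missing = sum(1 for row in sites if row[j] is None)
        if sites and missing / len(sites) <= max_sample_missing:
            kept_samples.append(j)
    filtered = [[row[j] for j in kept_samples] for row in sites]
    return filtered, kept_samples
```

The order matters: the per-sample missingness is computed only over sites that survived the first filter, matching the "remaining SNP sites" wording in the methods.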
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The study that produced this dataset aimed to discover and analyze polymorphisms in abaca (Musa textilis) vital for varietal authentication and cross-species genotyping for advantageous traits such as disease resistance, climate change resilience and enhanced agronomic traits. This dataset contains the resulting genotype calls in abaca (within M. textilis and between M. textilis and other Musa spp.) stored in variant call format (VCF) files. The genotypes are in the form of SNPs or InDels.
The VCF files starting with 'Mtextilis' and 'Musa' pertain to genotypes mined within M. textilis and between Musa spp., respectively. The VCF files containing SNPs or InDels are denoted by the 'SNPsonly' and 'Indelsonly' file labels, respectively. The 'AP' label indicates that the reference genome used for mapping and variant calling is a polished version of Galvez et al. (2020)'s reference genome. The 'minQ40' label indicates that the VCF files contain only SNPs and InDels with mapping quality of at least 40. The 'geno0.1' and 'pruned' labels indicate that variants having at most 10% missing genotypes and those that are pruned-in (based on linkage disequilibrium thresholds) were selected.
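The file-label convention above lends itself to a small lookup helper. The Python sketch below decodes the labels from a file name; the underscore separator, the `.vcf`/`.vcf.gz` extension handling, the function name, and the example file names are all assumptions, since the actual file names are not listed here.

```python
import re
from typing import Dict, Union


def describe_vcf_labels(filename: str) -> Dict[str, Union[str, bool]]:
    """Decode the dataset's file labels, assuming underscore-separated tokens."""
    stem = re.sub(r"\.vcf(\.gz)?$", "", filename)
    tokens = stem.split("_")

    if tokens[0].startswith("Mtextilis"):
        scope = "within M. textilis"
    elif tokens[0].startswith("Musa"):
        scope = "between Musa spp."
    else:
        scope = "unknown"

    if "SNPsonly" in tokens:
        variant_type = "SNPs"
    elif "Indelsonly" in tokens:
        variant_type = "InDels"
    else:
        variant_type = "unspecified"

    return {
        "scope": scope,
        "variant_type": variant_type,
        "polished_reference": "AP" in tokens,   # 'AP' label
        "min_quality_40": "minQ40" in tokens,   # 'minQ40' label
        "max_missing_10pct": "geno0.1" in tokens,  # 'geno0.1' label
        "ld_pruned": "pruned" in tokens,        # 'pruned' label
    }
```

With a hypothetical name such as `Mtextilis_AP_minQ40_SNPsonly_geno0.1_pruned.vcf.gz`, every flag in the returned dictionary would be true and the scope would read "within M. textilis".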