Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Standard VCF files.
https://ega-archive.org/dacs/EGAC50000000708
This dataset contains a merged VCF file generated from WES data of patients diagnosed with familial Meniere disease (FMD). Variant calling followed GATK best practices using the nf-core/Sarek pipeline (v3), and variants were filtered using genotype-level thresholds consistent with gnomAD filters. Multiallelic variants were split and INDELs were left-aligned during normalization. Variant Quality Score Recalibration (VQSR) was applied separately to SNVs and INDELs using well-established truth sets, with a 90% sensitivity threshold to maximize the detection of rare variants. Final variants were annotated with Ensembl VEP.
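The normalization step described above (splitting multiallelic records into biallelic ones) can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function names are hypothetical, genotype fields are not rewritten, and true left-alignment of INDELs requires the reference sequence, which this sketch does not use. It only splits ALT alleles and trims bases shared by REF and ALT.

```python
def split_multiallelic(chrom, pos, ref, alts):
    """Return one (chrom, pos, ref, alt) tuple per comma-separated ALT allele."""
    return [(chrom, pos, ref, alt) for alt in alts.split(",")]


def trim_common_affixes(pos, ref, alt):
    """Trim shared trailing bases, then shared leading bases.

    At least one base is always kept in both REF and ALT, as VCF requires.
    This is parsimony trimming only, not reference-aware left-alignment.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt


# Hypothetical record with a deletion and an insertion at the same site.
records = split_multiallelic("chr1", 100, "ATG", "A,ATGTG")
normalized = [
    (chrom, *trim_common_affixes(pos, ref, alt))
    for chrom, pos, ref, alt in records
]
print(normalized)
```

In practice this step is usually done with a dedicated tool that also has access to the reference FASTA, so the sketch is only meant to show what "split" means for a single record.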
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VCF files submitted for each group/pipeline.
https://ega-archive.org/dacs/EGAC50000000708
This dataset contains a merged VCF file generated from WES data of patients diagnosed with sporadic Meniere disease (SMD). Variant calling followed GATK best practices using the nf-core/Sarek pipeline (v3), and variants were filtered using genotype-level thresholds consistent with gnomAD filters. Multiallelic variants were split and INDELs were left-aligned during normalization. Variant Quality Score Recalibration (VQSR) was applied separately to SNVs and INDELs using well-established truth sets, with a 90% sensitivity threshold to maximize the detection of rare variants. Final variants were annotated with Ensembl VEP.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VCF file containing filtered mutated sites in SARS-CoV-2 genomes obtained from GISAID EpiCoV, separated by individual mutations. The columns correspond to viral genome accession ID, nucleotide position in the genome, mutation ID (left blank in all rows), reference nucleotide, identified mutation, quality, filter, and information columns (all left blank), format (GT in all rows), column corresponding to reference genome (all 0, referring to reference nucleotide column), and columns corresponding to isolate genomes, with each row identifying the nucleotide in the POS column, and whether it is non-mutant (0), or the mutant indicated in the identified mutation column (1). The file is tab delimited, with 22546 rows including the names, and 30690 columns.
The file was generated to test whether the five most common mutations in each of the SARS-CoV-2 genome replication complex proteins nsp7, nsp8, nsp12, and nsp14 significantly affect the mutation density of the virus over time, and whether they affect synonymous and nonsynonymous mutation densities differently. We found that mutations in nsp14, an exonuclease with error-correcting capabilities, are the most likely to be correlated with increased mutational load across the genome compared to wildtype SARS-CoV-2. These results were obtained by identifying the frequency of mutations across all isolates in genomic regions of interest, analyzing which of the twenty mutations (five per nsp) have a statistically meaningful relationship with the mutation density in the M and E genes (chosen because they are under little selective pressure), and identifying the synonymous and nonsynonymous genomic SNV density for isolates with any of the statistically meaningful mutations, as well as for isolates with none of the identified mutations.
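The column layout described for this file (nine fixed VCF fields, one reference-genome column, then one column per isolate holding 0 or 1) lends itself to a simple per-isolate mutation count. The following Python sketch assumes a single header row and no `##` meta lines; the column names `REF_GENOME`, `isoA`, and `isoB` and the two example rows are hypothetical and not taken from the dataset.

```python
import csv
import io


def mutations_per_isolate(handle):
    """Count, for each isolate column, the rows whose genotype is '1'."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Columns 0-8 are the fixed VCF fields, column 9 is the reference genome,
    # and everything from column 10 onward is an isolate genome.
    isolates = header[10:]
    counts = dict.fromkeys(isolates, 0)
    for row in reader:
        for name, genotype in zip(isolates, row[10:]):
            if genotype == "1":
                counts[name] += 1
    return counts


# Tiny hypothetical stand-in for the real file.
example = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tREF_GENOME\tisoA\tisoB\n"
    "ref\t241\t\tC\tT\t\t\t\tGT\t0\t1\t0\n"
    "ref\t14408\t\tC\tT\t\t\t\tGT\t0\t1\t1\n"
)
result = mutations_per_isolate(io.StringIO(example))
print(result)
```

For the real file, which has tens of thousands of columns, the same function can be passed an open file handle; reading row by row keeps memory use proportional to one row rather than to the whole matrix.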
https://ega-archive.org/dacs/EGAC00001000259
Short read whole genome sequencing (WGS) VCF files for the NIHR BioResource Rare Diseases WGS project – Participants from the Hypertrophic Cardiomyopathy (HCM) Rare Disease domain
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Concordance of genotypes represented in VCF and gVCF files with those detected by the MI RISK Plus kit.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset makes available the UCSC Genome Browser (genome.ucsc.edu) GRCh37 genome build public session NA12878 WES Benchmark files in a single dataset so that these files can be used in other applications or genome browsers such as IGV. All genomic variant calls in all VCF files were decomposed and normalized with vt. This dataset contains:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genotyping of GWAS catalog sites using the VCF and gVCF file formats and the number of homozygous reference sites and no-calls based on WGS data.
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
These datasets are important to genomics researchers because they characterize several aspects of what the scientific community has learned to date about human sequence variants. Making this human annotation data freely available in GCP lets researchers spend less effort on the data movement and management tasks associated with procuring the data and instead make immediate use of it: to better understand the clinical relevance of particular variants, such as disease-causing or protective variants (ClinVar), to search a catalog of SNPs that have been identified in the human genome (dbSNP), and to discover how frequently a particular variant occurs across the human population (1000Genomes, ESP, ExAC, gnomAD). This human annotation dataset contains a mirror of the original Variant Call Format (VCF) files from NCBI, the NHLBI Exome Sequencing Project (ESP), and Ensembl as Google Cloud Storage (GCS) objects. In addition, these human sequence variants have been translated into a variant table format and made available in Google BigQuery, giving researchers the ability to use cloud technology and code repositories such as the Verily Life Sciences Annotation Toolkit to perform analyses in parallel. This public dataset is hosted in Google BigQuery and is included in BigQuery's free tier of 1 TB of processing per month: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset. The dataset is also hosted in Google Cloud Storage and is free to use.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VCF genotype data for the paper "Genetic dissection of QTLs for starch content in four maize DH populations". Four DH populations (SC1, SC2, SC3 and SC4) were developed from F1 plants of crosses among eight corresponding parents (SC*-P*). All lines were genotyped with the GenoBaits Maize 1K marker panel developed by Mol Breeding Biotechnology Co., Ltd., Shijiazhuang, China (http://www.molbreeding.com/), based on a genotyping-by-target-sequencing platform in maize.
Verticillium dahliae is an important soil-borne pathogen causing Verticillium wilt. It is also the primary causal agent of Potato Early Dying, a disease complex involving the root-lesion nematode. Here, we report the whole-genome sequencing of 192 isolates of V. dahliae originating from the major potato production areas across Canada. Our results yielded a resource of 277,010 genetic variations that will be useful for genetic analyses and revealed the presence of two major lineages, both present in all provinces but exhibiting differences in regional prevalence.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genotyping of known SNPs from ClinVar using the VCF and gVCF file formats and the number of homozygous reference sites and no-calls based on WGS data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The American chestnut (Castanea dentata) is a functionally extinct tree species that was decimated by an invasive fungal pathogen in the early 20th century. An understanding of the genomic architecture of local adaptation in wild American chestnut is needed to deploy locally adapted, disease-resistant American chestnut populations. Here, we characterize the genomic basis of climate adaptation in remnant wild American chestnut, develop new computational methods, and evaluate the adaptive genomic content captured within backcross breeding populations. Whole genome re-sequencing data of 356 trees from Sandercock et al. (2022), coupled with genotype-environment association methods, identified 18483 climate associated loci.
Methods:
VCF file: The ~21 million SNP dataset from Sandercock et al. (2022) was first imputed using BEAGLE and filtered to remove SNPs with MAF < 0.05. Climate associated loci were then identified using RDA and LFMM2 genotype-environment association methods.
Seed zone shape files: Three seed zones were identified using the ~18k climate associated loci. These regions partition the chestnut range into geographic seed zones that reflect relatively homogeneous areas with respect to multivariate adaptive genomic variation. These regions can be used to conserve germplasm ex situ and guide subsequent breeding crosses that lead to climate-matched restoration populations.
Genetic map: gmbigxhorn.jtl.map.2022.csv is a genetic map generated from American chestnut backcross genotyping-by-sequencing data.
R code: Code for estimating the average migration distance for each seed zone under future climate change conditions.
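The MAF < 0.05 filter mentioned in the methods can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: it assumes biallelic SNPs with diploid genotype strings such as `0|1` or `0/1`, skips missing alleles (`.`), and uses hypothetical site names and genotypes.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF of a biallelic SNP from genotype strings like '0|1'."""
    alleles = []
    for genotype in genotypes:
        for allele in genotype.replace("|", "/").split("/"):
            if allele in ("0", "1"):  # ignore missing calls such as '.'
                alleles.append(int(allele))
    if not alleles:
        return 0.0
    alt_frequency = sum(alleles) / len(alleles)
    return min(alt_frequency, 1.0 - alt_frequency)


def filter_by_maf(sites, threshold=0.05):
    """Keep sites whose MAF is at least the threshold (drop MAF < threshold)."""
    return {
        site_id: genotypes
        for site_id, genotypes in sites.items()
        if minor_allele_frequency(genotypes) >= threshold
    }


# Hypothetical sites: snp1 has MAF 1/40 = 0.025, snp2 has MAF 10/40 = 0.25.
sites = {
    "snp1": ["0|0"] * 19 + ["0|1"],
    "snp2": ["0|1"] * 10 + ["0|0"] * 10,
}
kept = filter_by_maf(sites)
print(sorted(kept))
```

Taking the minimum of the ALT frequency and its complement means the filter behaves the same whether the minor allele is encoded as REF or ALT.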
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Exome sequencing data (VCF files) of nine adult patients with liver diseases. For exome sequencing, the library was prepared using Illumina DNA Prep with Exome 2.5 Enrichment and sequenced on a NovaSeq 6000 instrument (Illumina, San Diego, CA). Reads were aligned to the GRCh38 Human Reference Genome using the Illumina DRAGEN Bio-IT Platform v3.9.
These datasets contain phenotypic and genotypic data from three connected populations of common bean (Phaseolus vulgaris L.) that were used to identify the genomic regions controlling the phenotypic response to Bean Leaf Crumple Virus (BLCrV). The first is the Andean by Meso (AxM) population, which contains 190 individuals derived from bi-parental crosses between Andean and Mesoamerican breeding lines. The AxM population included 120 additional breeding lines of Andean and Mesoamerican origin that were used as checks for their response against other viral diseases, such as Bean Golden Yellow Mosaic Virus (BGYMV). The second is a pre-breeding population (termed P135-136) composed of 111 lines that was obtained from two-way and three-way crosses between elite Andean lines and resistant sources against viral diseases. The third population is a panel of 186 Mesoamerican breeding lines assembled from a collection of elite materials from the Mesoamerican breeding pipeline at CIAT. The AxM population was evaluated in three yield trials in Palmira (Colombia) between 2013 and 2015 for flowering, maturity time and yield. All three populations were evaluated in three BLCrV trials in Pradera (Colombia), where the disease pressure is naturally high. The AxM population and the Mesoamerican panel were genotyped by sequencing (GBS), and these datasets contain their corresponding genotypic matrices in variant call format (VCF, v4.2) with sequence variants mapped against the reference genome of P. vulgaris (G19833, v2.1). A joint genotypic matrix with all available GBS data from these populations is also included. The population P135-136 was genotyped with the DArTag targeted genotyping service offered by Diversity Arrays Technology (DArT PL, Bruce ACT, Australia), and the genotypic matrix is similarly included in VCF format.
The increasing prevalence of vector-borne diseases around the world highlights the pressing need for an in-depth exploration of the genetic and environmental factors that shape the adaptability and widespread distribution of mosquito populations. This research focuses on Culex tarsalis, a principal vector for various viral diseases including West Nile Virus (WNV). Through the development of a new reference genome and the examination of Restriction-Site Associated DNA sequencing (RAD-seq) data from over 300 individuals and 28 locations, we demonstrate that variables such as temperature, evaporation rates, and the density of vegetation significantly impact the genetic makeup of Cx. tarsalis populations. Among the alleles most strongly associated with environmental factors is a nonsynonymous mutation in a key gene related to circadian rhythms. These results offer new insights into the mechanisms of spread and adaptation in a key North American vector species, which is poised to become a g...
Sample Collection: Individual mosquitoes were trapped and collected from 28 different locations across the United States and Canada as part of the North American Mosquito Project (NAMP). All samples used in this study were collected in 2012 between the months of April and October.
Genome Sequencing, Assembly, and Annotation: An F4 population was used to generate the reference genome assembly, and high molecular weight DNA was extracted and sequenced on a Pacific Biosciences (PacBio) RS II (University of Delaware). Thirty-five SMRTcells were generated. The resulting reads provided 76X coverage of the ~790Mb Cx. tarsalis genome, and were assembled with MECAT. Gene annotation was completed by MAKER using EST and protein data from the Culex quinquefasciatus and Aedes aegypti mosquitoes. Sequences were downloaded from the NCBI Taxonomy database and both Trinotate and InterProScan were used for functional annotation of the MAKER predicted genes.
The annotated assembly was assessed for complet...
# Culex tarsalis dataset
https://doi.org/10.5061/dryad.51c59zwh3
The data were stored in 5 different files.
The code that uses the data above is available on GitHub: https://github.com/Afei99357/Culex_Tarsalis_GWAS_manuscript.git
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VCF files containing filtered mutated sites in SARS-CoV-2 genomes obtained from GISAID EpiCoV and submitted from the UK and the US, separated by individual mutations. The columns correspond to viral genome accession ID, nucleotide position in the genome, mutation ID (left blank in all rows), reference nucleotide, identified mutation, quality, filter, and information columns (all left blank), format (GT in all rows), column corresponding to reference genome (all 0, referring to reference nucleotide column), and columns corresponding to isolate genomes, with each row identifying the nucleotide in the POS column, and whether it is non-mutant (0), or the mutant indicated in the identified mutation column (1). The files are tab delimited, with the UK file having 12696 rows including the names, and 18135 columns, and the US file having 15588 rows including the names, and 16277 columns.
The files were generated to test whether the different SARS-CoV-2 genes or protein coding regions are positively or negatively selected differently between 14408C>T / 23403A>G double mutants and double wildtype isolates, using mutation rate models, and whether regional distributions affect the mutation rates. Our findings show that the RdRp coding region and the S gene show the highest amount of selection across viral generations, and that the synonymous and nonsynonymous mutation rates for individual genes can differ between countries.
https://ega-archive.org/dacs/EGAC00001000319
This dataset contains VCF files from a variant calling analysis of 19 neuroblastoma patients. WES or WGS data of the primary tumor were compared to WES cfDNA analysis at the time of diagnosis and at a second timepoint (complete remission, partial remission, disease progression or relapse). For 4 patients, WGS of germline, tumor at diagnosis and tumor at relapse DNA was performed on Illumina HiSeq2500, with 100-bp paired-end reads. For the other patients, WES was performed using either an Agilent SureSelect Human All Exon v5 or a Roche Nimblegen SeqCap EZ Exome V3 kit on Illumina HiSeq2000, with 100-bp paired-end reads. SNVs observed in any of the primary tumors or cfDNA samples studied by WES were targeted using a capture sequencing panel at all intermediate time points.
Database of phenotype-genotype associations in humans. Used by clinical researchers to store standardized phenotypic information, diagnosis, and pedigree data, and then to run analyses on VCF files from individuals, families or cohorts with suspected Mendelian disease.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Standard VCF files.