100+ datasets found

Data from: The Building Data Genome 2 (BDG2) Data-Set
openenergyhub.ornl.gov
Updated Jul 26, 2024
Cite
(2024). The Building Data Genome 2 (BDG2) Data-Set [Dataset]. https://openenergyhub.ornl.gov/explore/dataset/the-building-data-genome-2-bdg2-data-set/
Dataset updated
Jul 26, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BDG2 is an open data set made up of 3,053 energy meters from 1,636 buildings. The time-series data cover two full years (2016 and 2017) at hourly frequency and include electricity, heating and cooling water, steam, and irrigation meters.
buds-lab/building-data-genome-project-2: v1.0
zenodo.org
data.niaid.nih.gov
zip
Updated Sep 2, 2020
Cite
Clayton Miller; Anjukan Kathirgamanathan; Bianca Picchetti; Pandarasamy Arjunan; June Young Park; Zoltan Nagy; Paul Raftery; Brodie W. Hobson; Zixiao Shi; Forrest Meggers (2020). buds-lab/building-data-genome-project-2: v1.0 [Dataset]. http://doi.org/10.5281/zenodo.3887306
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.3887306
Dataset updated
Sep 2, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Clayton Miller; Anjukan Kathirgamanathan; Bianca Picchetti; Pandarasamy Arjunan; June Young Park; Zoltan Nagy; Paul Raftery; Brodie W. Hobson; Zixiao Shi; Forrest Meggers
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The BDG2 open data set consists of 3,053 energy meters from 1,636 non-residential buildings with a range of two full years (2016 and 2017) at an hourly frequency (17,544 measurements per meter resulting in approximately 53.6 million measurements). These meters are collected from 19 sites across North America and Europe, and they measure electrical, heating and cooling water, steam, and solar energy as well as water and irrigation meters. Part of these data was used in the Great Energy Predictor III (GEPIII) competition hosted by the ASHRAE organization in October-December 2019. This subset includes data from 2,380 meters from 1,448 buildings that were used in the GEPIII, a machine learning competition for long-term prediction with an application to measurement and verification. This paper describes the process of data collection, cleaning, and convergence of time-series meter data, the meta-data about the buildings, and complementary weather data. This data set can be used for further prediction benchmarking and prototyping as well as anomaly detection, energy analysis, and building type classification.
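The measurement counts quoted in this description can be checked with a short calculation. The sketch below assumes every meter has one reading for every hour of 2016 and 2017 with no gaps; the function name is illustrative and not part of the BDG2 tooling.

```python
from datetime import datetime


def hourly_readings(first_year: int, last_year: int) -> int:
    """Count the hours from 1 Jan of first_year to 31 Dec of last_year inclusive."""
    start = datetime(first_year, 1, 1)
    end = datetime(last_year + 1, 1, 1)
    return (end - start).days * 24


METERS = 3053  # meter count stated in the description

per_meter = hourly_readings(2016, 2017)  # 2016 is a leap year, so 731 days in total
total = per_meter * METERS

print(per_meter, total)
```

Because 2016 is a leap year, the two-year span contains 731 days, which is where the per-meter figure of 17,544 comes from.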
Synthetic genomic data
kaggle.com
Updated Apr 23, 2023
Cite
Oleg Bushuev (2023). Synthetic genomic data [Dataset]. https://www.kaggle.com/datasets/oubush/synthetic-genomic-data
Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Apr 23, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Oleg Bushuev
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset comprises images representing animal genotypes and offers a unique opportunity to delve into the realm of image processing techniques applied to genomic analysis. The original genomic data were sourced from Daniela Lourenco's GitHub repository https://github.com/danielall/Data_ssGBLUP, which contains data used as examples in the paper entitled "Single-step genomic evaluations from theory to practice: using SNP chips and sequence data in blupf90" by Lourenco et al. (2020). According to the data description, these data were simulated using QMSim (Sargolzaei & Schenkel, 2009). All the genetic variance was explained by 500 QTL. Animals were genotyped for 45,000 SNP and the average LD was 0.18. 2024 animals have genotypes and phenotypes. SNP genotype is coded based on the number of copies of the alternative allele (0, 1, 2).
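The 0/1/2 coding mentioned above can be illustrated with a minimal sketch. The function name, the two-character genotype strings, and the choice of "G" as the alternative allele are assumptions made for illustration only; the dataset itself is described as already coded.

```python
def code_genotype(alleles: str, alt: str) -> int:
    """Return the number of copies of the alternative allele (0, 1, or 2).

    `alleles` is a hypothetical two-character diploid call such as "AG".
    """
    if len(alleles) != 2:
        raise ValueError("expected a diploid genotype with exactly two alleles")
    return alleles.count(alt)


calls = ["AA", "AG", "GG"]  # made-up example calls
coded = [code_genotype(call, alt="G") for call in calls]
print(coded)
```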

Simulation details
Data were simulated using the software QMSim (Sargolzaei and Schenkel, 2009). In the first simulation step, 200 generations of the historical population were simulated to create mutation and drift equilibrium and linkage disequilibrium (LD). This historical population started from 50,000 individuals and decreased to 2,100 in the last generation, with an equal proportion of males and females. The second step generated an expanded population, which started with 10 males and 2000 females from the last historical generation. Each one of the 2000 females was randomly mated and produced 1 offspring per generation. Sire and dam were randomly replaced over 20 generations, and the replacement was 50% and 20%, respectively. The third step was used to generate the recent population that had the same parameters as the expansion population. Five generations were simulated, and all animals were genotyped. Only data from the recent population were used, which comprised pedigree information and phenotypes for 10,000 animals, and genotypes for 1020 parents from generations 1-4 and 1004 individuals in generation 5. For the genome, 29 chromosomes with a total of 2319 cM were simulated. Each chromosome had a similar number of SNP as the BovineSNP50k BeadChip (Illumina Inc., San Diego, CA). Although the number of simulated SNP was 54,000, nearly 45,000 passed the quality control and remained in the analyses. Along with SNP, 500 biallelic QTL were randomly placed on chromosomes. The QTL effects were sampled from a gamma distribution. The QTL and SNP had recurrent mutations with a probability of 2.5 × 10⁻⁵.
Building Data Genome Project 2 - Dataset - CERC/NGCI CKAN
ngci.encs.concordia.ca
Updated Mar 20, 2025
Cite
(2025). Building Data Genome Project 2 - Dataset - CERC/NGCI CKAN [Dataset]. https://ngci.encs.concordia.ca/ckan/dataset/building-data-genome-project-2
Dataset updated
Mar 20, 2025
Description
BDG2 is an open data set made up of 3,053 energy meters from 1,636 buildings. The time-series data cover two full years (2016 and 2017) at hourly frequency and include electricity, heating and cooling water, steam, and irrigation meters. A subset of the data was used in the Great Energy Predictor III (GEPIII) competition hosted by the ASHRAE organization in late 2019.
COVID-19 Genome Sequence Dataset
registry.opendata.aws
Updated Jul 9, 2020
Cite
National Library of Medicine (NLM) (2020). COVID-19 Genome Sequence Dataset [Dataset]. https://registry.opendata.aws/ncbi-covid-19/
Dataset updated
Jul 9, 2020
Dataset provided by
National Library of Medicine (NLM) (http://nlm.nih.gov/)
Description
This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six hours, with updates to the AWS ODP bucket occurring daily.
1000 Genome Project samples STR genotypes
figshare.le.ac.uk
application/gzip
Updated May 23, 2022
Cite
Ed Hollox (2022). 1000 Genome Project samples STR genotypes [Dataset]. http://doi.org/10.25392/leicester.data.19804360.v1
Available download formats: application/gzip
Unique identifier
https://doi.org/10.25392/leicester.data.19804360.v1
Dataset updated
May 23, 2022
Dataset provided by
University of Leicester
Authors
Ed Hollox
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
STR genotypes generated by GangSTR and HipSTR on selected 1000 Genomes samples. A readme is included for further technical information.
Whole Genome Shotgun Submissions
catalog.data.gov
datadiscovery.nlm.nih.gov
Updated Jun 19, 2025
Cite
National Library of Medicine (2025). Whole Genome Shotgun Submissions [Dataset]. https://catalog.data.gov/dataset/whole-genome-shotgun-submissions
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
Whole Genome Shotgun (WGS) projects are genome assemblies of incomplete genomes or incomplete chromosomes of prokaryotes or eukaryotes that are generally being sequenced by a whole genome shotgun strategy. WGS projects may be annotated, but annotation is not required. NCBI has a Prokaryotic Genomes Annotation Pipeline that may be requested at the time the genome files are submitted to GenBank. This pipeline generates a submission-ready annotated file that is posted back to the submitter for review and which the submitter could edit prior to data release. The public WGS projects are at the list of WGS projects. https://www.ncbi.nlm.nih.gov/Traces/wgs/
Data from: The Pacific Biosciences de novo assembled genome dataset from a...
catalog.data.gov
omicsdi.org
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). Data from: The Pacific Biosciences de novo assembled genome dataset from a parthenogenetic New Zealand wild population of the longhorned tick, Haemaphysalis longicornis Neumann, 1901 [Dataset]. https://catalog.data.gov/dataset/data-from-the-pacific-biosciences-de-novo-assembled-genome-dataset-from-a-parthenogenetic--62c3a
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
The longhorned tick, Haemaphysalis longicornis, feeds upon a wide range of bird and mammalian hosts. Mammalian hosts include cattle, deer, sheep, goats, humans, and horses. This tick is known to transmit a number of pathogens causing tick-borne diseases, and was the vector of a recent serious outbreak of oriental theileriosis in New Zealand. A New Zealand-USA consortium was established to sequence, assemble, and annotate the genome of this tick, using ticks obtained from New Zealand's North Island. In New Zealand, the tick is considered exclusively parthenogenetic and this trait was deemed useful for genome assembly. Very high molecular weight genomic DNA was sequenced on the Illumina HiSeq4000 and the long-read PacBio Sequel platforms. Twenty-eight SMRT cells produced a total of 21.3 million reads which were assembled with Canu on a reserved supercomputer node with access to 12 TB of RAM, running continuously for over 24 days. The final assembly dataset consisted of 34,211 contigs with an average contig length of 215,205 bp. The quality of the annotated genome was assessed by BUSCO analysis, an approach that provides quantitative measures for the quality of an assembled genome. Over 95% of the BUSCO gene set was found in the assembled genome. Only 48 of the 1066 BUSCO genes were missing and only 9 were present in a fragmented condition. The raw sequencing reads and the assembled contigs/scaffolds are archived at the National Center for Biotechnology Information. Funded by USDA-ARS Knipling-Bushland US Livestock Insects Research Laboratory CRIS project 3094-32000-036-00.
Resources in this dataset:
Resource Title: The Pacific Biosciences de novo assembled genome dataset from a parthenogenetic New Zealand wild population of the longhorned tick, Haemaphysalis longicornis Neumann, 1901. File Name: Web Page, url: https://doi.org/10.1016/j.dib.2019.104602
NCBI data referenced in the article can be found in the related content links of this record.
Genomic Data Analysis Service Report
archivemarketresearch.com
doc, pdf, ppt
Updated Mar 10, 2025
Cite
Archive Market Research (2025). Genomic Data Analysis Service Report [Dataset]. https://www.archivemarketresearch.com/reports/genomic-data-analysis-service-55807
Available download formats: pdf, doc, ppt
Dataset updated
Mar 10, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global Genomic Data Analysis Service market is experiencing robust growth, projected to reach $4192.3 million in 2025. While the provided CAGR is missing, considering the rapid advancements in genomics technologies and increasing demand for personalized medicine, a conservative estimate of 15% CAGR from 2025-2033 seems reasonable. This implies significant market expansion, driven by factors such as decreasing sequencing costs, growing adoption of next-generation sequencing (NGS) technologies, and the increasing need for efficient and accurate analysis of large genomic datasets. The market is segmented by application (humanity, plant, animal, microorganism, virus) and by type of analysis (whole genome sequence analysis, whole exome sequence analysis, and others). The growth is fueled by the expanding application of genomic analysis across diverse sectors like healthcare, agriculture, and environmental science. Whole genome sequencing is expected to dominate the market due to its comprehensive nature, providing a complete picture of an organism's genetic makeup. However, whole exome sequencing remains a significant segment due to its cost-effectiveness and ability to target specific protein-coding regions. Key players such as Illumina, QIAGEN, and BGI Genomics are leading the market through continuous innovation in software and analytical tools. The market's geographical spread is substantial, with North America and Europe currently holding the largest market shares due to well-established research infrastructure and technological advancements. However, the Asia-Pacific region is projected to witness significant growth driven by rising investments in healthcare infrastructure and increasing adoption of genomic technologies. The market is expected to continue its upward trajectory throughout the forecast period (2025-2033), driven by ongoing technological innovations that enhance data analysis speed and accuracy. 
The increasing availability of large genomic datasets, fueled by large-scale genomics initiatives, provides a fertile ground for the development of advanced analytical tools. Furthermore, the increasing demand for personalized medicine and precision agriculture is further accelerating the adoption of genomic data analysis services. However, challenges remain, including the need for standardized data formats, data security concerns associated with handling sensitive genomic data, and the need for skilled professionals to interpret and utilize the complex data generated. Addressing these challenges will be critical for continued market growth and widespread adoption of genomic data analysis services.
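To see what the description's growth assumption implies, the compound-growth formula can be applied to the stated 2025 figure. This is only a sketch: the 15% rate is the description's own estimate rather than a reported value, and the function name is illustrative.

```python
def project_value(base: float, annual_rate: float, years: int) -> float:
    """Compound `base` forward by `annual_rate` for `years` periods."""
    return base * (1.0 + annual_rate) ** years


base_2025 = 4192.3   # USD million, as stated in the description
assumed_cagr = 0.15  # the description's estimate, not a measured figure

# Eight compounding periods separate 2025 from 2033.
projected_2033 = project_value(base_2025, assumed_cagr, 2033 - 2025)
print(f"Implied 2033 market size: ${projected_2033:,.1f} million")
```

A different assumed rate changes the result substantially, so the projection is only as reliable as the estimated CAGR.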
1000 Cannabis Genomes Project
kaggle.com
zip
Updated Feb 26, 2019
Cite
Google BigQuery (2019). 1000 Cannabis Genomes Project [Dataset]. https://www.kaggle.com/bigquery/genomics-cannabis
Available download formats: zip (0 bytes)
Dataset updated
Feb 26, 2019
Dataset provided by
BigQuery (https://cloud.google.com/bigquery)
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Cannabis is a genus of flowering plants in the family Cannabaceae.

Source: https://en.wikipedia.org/wiki/Cannabis

Content

In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.

https://medium.com/google-cloud/dna-sequencing-of-1000-cannabis-strains-publicly-available-in-google-bigquery-a33430d63998

These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, as well as via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911, as well as in the BigQuery dataset bigquery-public-data:genomics_cannabis.

All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.

The following tables are included in the Cannabis Genomes Project dataset:

Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other data that give indications about the type of sample. Sample types include: strain, library prep methods, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis Sativa subspecies Purple Kush.

MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of Cannabis Sativa subspecies Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.

MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.

MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.
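The [BUILD_DATE] naming convention above can be captured in a small helper that builds fully qualified table names. This is a sketch under assumptions: the helper name is hypothetical, the dotted `project.dataset.table` form is the BigQuery standard SQL spelling of the `bigquery-public-data:genomics_cannabis` dataset named in the description, and 201703 is simply the example suffix quoted there.

```python
import re

# Dataset from the description, written in standard SQL dotted form (assumption).
DATASET = "bigquery-public-data.genomics_cannabis"


def table_ref(stem: str, build_date: str) -> str:
    """Build a table reference of the form <dataset>.<stem>_<BUILD_DATE>."""
    if not re.fullmatch(r"\d{6}", build_date):
        raise ValueError("build_date should be six digits, e.g. 201703")
    return f"{DATASET}.{stem}_{build_date}"


build = "201703"  # example suffix from the description; newer builds may exist
variants = table_ref("MNPR01", build)
reference = table_ref("MNPR01_reference", build)
transcriptome = table_ref("MNPR01_transcriptome", build)
print(variants)
```

Keeping the build date in one variable makes it easier to move every query to a newer release at once.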

Fork this kernel to get started with this dataset.

Acknowledgements

Dataset Source: http://opencannabisproject.org/
Category: Genomics
Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Update frequency: As additional data are released to GenBank
View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis
View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis

Banner Photo by Rick Proctor from Unsplash.

Inspiration

Which Cannabis samples are included in the variants table?

Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?

How many variants does each sample have at the THC Synthase gene (THCA1) locus?
Genome in a Bottle on AWS
registry.opendata.aws
Updated Nov 12, 2020
Cite
National Institute of Standards and Technology (2020). Genome in a Bottle on AWS [Dataset]. https://registry.opendata.aws/giab/
Dataset updated
Nov 12, 2020
Dataset provided by
National Institute of Standards and Technology (http://www.nist.gov/)
Description
Several reference genomes to enable translation of whole human genome sequencing to clinical practice. On 11/12/2020 these data were updated to reflect the most up to date GIAB release.
Data from: Cacao Genome Database
s.cnmilf.com
datasets.ai
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). Cacao Genome Database [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/cacao-genome-database-0d068
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service (https://www.ars.usda.gov/)
Description
Not only is cacao the basic ingredient in the world’s favorite confection, chocolate, but it provides a livelihood for over 6.5 million farmers in Africa, South America and Asia and ranks as one of the top ten agriculture commodities in the world. Historically, cocoa production has been plagued by serious losses due to pests and diseases. The release of the cacao genome sequence will provide researchers with access to the latest genomic tools, enabling more efficient research and accelerating the breeding process, thereby expediting the release of superior cacao cultivars. The sequenced genotype, Matina 1-6, is representative of the genetic background most commonly found in the cacao producing countries, enabling results to be applied immediately and broadly to current commercial cultivars. Matina 1-6 is highly homozygous which greatly reduces the complexity of the sequence assembly process. While the sequence provided is a preliminary release, it already covers 92% of the genome, with approximately 35,000 genes. We will continue to refine the assembly and annotation, working toward a complete finished sequence. Updates will be made available via the main project website.
Resources in this dataset:
Resource Title: Cacao Genome Database. File Name: Web Page, url: http://www.cacaogenomedb.org/
Data from: Sol Genomics Network (SGN)
agdatacommons.nal.usda.gov
datasetcatalog.nlm.nih.gov
bin
Updated Feb 13, 2024
Cite
Noe Fernandez-Pozo; Naama Menda; Jeremy D. Edwards; Surya Saha; Isaak Y. Tecle; Susan R. Strickler; Aureliano Bombarely; Thomas Fisher-York; Anuradha Pujar; Hartmut Foerster; Aimin Yan; Lukas A. Mueller (2024). Sol Genomics Network (SGN) [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Sol_Genomics_Network_SGN_/24852978
Available download formats: bin
Dataset updated
Feb 13, 2024
Dataset provided by
Boyce Thompson Institute for Plant Research, Cornell University
Authors
Noe Fernandez-Pozo; Naama Menda; Jeremy D. Edwards; Surya Saha; Isaak Y. Tecle; Susan R. Strickler; Aureliano Bombarely; Thomas Fisher-York; Anuradha Pujar; Hartmut Foerster; Aimin Yan; Lukas A. Mueller
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Sol Genomics Network (SGN) is a clade-oriented database dedicated to the biology of the Solanaceae family, which includes many closely related and agronomically important species such as tomato, potato, tobacco, eggplant, pepper, and the ornamental Petunia hybrida. SGN is part of the International Solanaceae Initiative (SOL), which has the long-term goal of creating a network of resources and information to address key questions in plant adaptation and diversification. A key problem of the post-genomic era is the linking of the phenome to the genome, and SGN helps researchers track and discover such linkages.
Data:

Solanaceae and other Genomes: SGN is a home for Solanaceae and closely related genomes, such as selected Rubiaceae genomes (e.g., Coffea). The tomato, potato, pepper, and eggplant genomes are examples of genomes that are currently available. If you would like to include a Solanaceae genome that you sequenced in SGN, please contact us.
ESTs: SGN houses EST collections for tomato, potato, pepper, eggplant and petunia and corresponding unigene builds. EST sequence data and cDNA clone resources greatly facilitate cloning strategies based on sequence similarity, the study of syntenic relationships between species in comparative mapping projects, and are essential for microarray technology.
Unigenes: SGN assembles and publishes unigene builds from these EST sequences. For more information, see Unigene Methods.
Maps and Markers: SGN has genetic maps and a searchable catalog of markers for tomato, potato, pepper, and eggplant.
Tools: SGN makes available a wide range of web-based bioinformatics tools for use by anyone, listed here. Some of our most popular tools include BLAST searches, the SolCyc biochemical pathways database, a CAPS experiment designer, an Alignment Analyzer and browser for phylogenetic trees. The VIGS tool can help predict the properties of VIGS (Viral Induced Gene Silencing) constructs.

The data in SGN have been submitted by many different research groups around the world. A web form is available to submit data for display on SGN.
SGN community-driven gene and phenotype database: Simple web interfaces have been developed for the SGN user-community to submit, annotate, and curate the Solanaceae locus and phenotype databases. The goal is to share biological information, and have the experts in their field review existing data and submit information about their favorite genes and phenotypes.
Resources in this dataset:
Resource Title: Website Pointer to Sol Genomics Network. File Name: Web Page, url: https://solgenomics.net/
Specialized Search interfaces are provided for: Organisms/Taxon; Genes and Loci; Genomic sequences and annotations; QTLs, Mutants & Accessions, Traits; Transcripts: Unigenes, ESTs, & Libraries; Unigene families; Markers; Genomic clones; Images; Expression: Templates, Experiments, Platforms; Traits.
Test dataset: Sequence and variant data from public 1000 Genomes Project
ega-archive.org
Updated Oct 10, 2017
Cite
(2017). Test dataset: Sequence and variant data from public 1000 Genomes Project [Dataset]. https://ega-archive.org/datasets/EGAD00001003338
Dataset updated
Oct 10, 2017
License
https://ega-archive.org/dacs/EGAC00001000514
Description
This is a test dataset derived from public data of the 1000 Genomes Project. Its purpose is not to allow for any inference about cohort data or results, but to aid bioinformaticians in the technical development and testing of tools, as well as data consumers in learning how to access information.
This dataset consists of 2508 samples from the 1000 Genomes Project (https://www.nature.com/articles/nature15393). Samples' (e.g. NA18534) data can be accessed through the IGSR portal (e.g. https://www.internationalgenome.org/data-portal/sample/NA18534) or their corresponding folder at the 1000 Genomes' FTP site (e.g. http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/CHB/NA18534/exome_alignment/). The dataset encompasses several types of data: Variant Call Format (VCF, or its binary counterpart BCF) files, both joint (e.g. ALL_chr22_20130502_2504Individuals.vcf.gz) and split (e.g. HG01775.chrY.vcf.gz); exome sequencing CRAM files (e.g. NA18534.GRCh38DH.exome.cram); whole genome sequencing CRAM/BAM files (e.g. NA19239.cram). Additionally, multiple files were sliced to create shorter files that download quickly, with names formatted as "{FILE-INFO}_{NUMBER-OF-READS}r_{CHR}.{START-COORDINATE}-{END-COORDINATE}.{FILETYPE}" (e.g. "HG01500.GRCh38DH_90r_3.10000-10500_4.10000-10500.cram"). These files can be downloaded directly through the EGA-download-client PyEGA3 (https://github.com/EGA-archive/ega-download-client).
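The sliced-file naming pattern quoted above can be parsed with a regular expression. The sketch below is an assumption-laden illustration, not part of PyEGA3: the function and field names are invented here, multiple regions are assumed to be joined by underscores (as in the quoted example), and the file type is assumed to be a single suffix such as "cram".

```python
import re

# {FILE-INFO}_{NUMBER-OF-READS}r_{regions}.{FILETYPE}
SLICED = re.compile(
    r"^(?P<file_info>.+?)_(?P<reads>\d+)r_(?P<regions>.+)\.(?P<filetype>[A-Za-z0-9]+)$"
)
# One region: {CHR}.{START-COORDINATE}-{END-COORDINATE}
REGION = re.compile(r"(?P<chr>[^._]+)\.(?P<start>\d+)-(?P<end>\d+)")


def parse_sliced_name(name: str) -> dict:
    """Split a sliced-file name into file info, read count, regions, and type."""
    match = SLICED.match(name)
    if match is None:
        raise ValueError(f"not a sliced-file name: {name!r}")
    regions = []
    for part in match["regions"].split("_"):
        region = REGION.fullmatch(part)
        if region is None:
            raise ValueError(f"unrecognised region: {part!r}")
        regions.append((region["chr"], int(region["start"]), int(region["end"])))
    return {
        "file_info": match["file_info"],
        "reads": int(match["reads"]),
        "regions": regions,
        "filetype": match["filetype"],
    }


parsed = parse_sliced_name("HG01500.GRCh38DH_90r_3.10000-10500_4.10000-10500.cram")
print(parsed)
```

Names with compound extensions such as ".vcf.gz" would need the file-type group widened; this sketch rejects them rather than guessing.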
Genomic Data of North American Sea Ducks
catalog.data.gov
data.usgs.gov
Updated Nov 8, 2024
Cite
U.S. Geological Survey (2024). Genomic Data of North American Sea Ducks [Dataset]. https://catalog.data.gov/dataset/genomic-data-of-north-american-sea-ducks
Dataset updated
Nov 8, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This data set describes accession numbers for nucleotide sequence data derived from whole mitochondrial genome and double digest restriction-site associated DNA (ddRAD).
Genomic and Demographic data from the San Francisco gartersnake
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Genomic and Demographic data from the San Francisco gartersnake [Dataset]. https://catalog.data.gov/dataset/genomic-and-demographic-data-from-the-san-francisco-gartersnake
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
We used genome-wide single nucleotide polymorphism (SNP) data and capture-mark-recapture methods to evaluate the genetic diversity and demography within seven focal sites of the endangered San Francisco gartersnake (Thamnophis sirtalis tetrataenia). As Thamnophis sirtalis tetrataenia is listed as endangered by the U.S. Fish and Wildlife Service (USFWS), sensitive location information can be made available upon request by contacting Brian J. Halstead and/or Amy G. Vandergast.
European Genome-phenome Archive
healthinformationportal.eu
html
Updated Mar 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). European Genome-phenome Archive [Dataset]. https://www.healthinformationportal.eu/health-information-sources/european-genome-phenome-archive
Available download formats: html
Dataset updated
Mar 31, 2023
Variables measured
sex, title, topics, acronym, country, funding, language, description, sample_size, age_range_to, and 15 more
Measurement technique
Data from other records
Dataset funded by
Public funding, mainly through competitive European projects
Description
The European Genome-phenome Archive (EGA) is a global network for permanent archiving and sharing of personally identifiable genetic, phenotypic, and clinical data generated for the purposes of biomedical research projects or in the context of research-focused healthcare systems.

We aim to advance biomedical research and promote personalised medicine worldwide by enabling discovery of and access to human genomic and health research data.

With our expertise in data management and our technical infrastructure, we promote FAIR data reuse and enable researchers to share their data securely. By leveraging public funding and our strategic partnerships, the EGA provides a free service for permanent data storage, data discovery, and secure data access. In addition, we foster a federated network to provide transnational access to human research data in compliance with legal frameworks.
Genomics England - Quick View
healthdatagateway.org
unknown
Updated Mar 30, 2023
Cite
The 100,000 Genomes Project Protocol v3, Genomics England. doi:10.6084/m9.figshare.4530893.v3. 2017. Publications that use the Genomics England Database should include an author as Genomics England Research Consortium. Please see the publication policy. (2023). Genomics England - Quick View [Dataset]. https://healthdatagateway.org/en/dataset/378
Explore at:
Available download formats: unknown
Dataset updated
Mar 30, 2023
Dataset provided by
Genomics England
Authors
The 100,000 Genomes Project Protocol v3, Genomics England. doi:10.6084/m9.figshare.4530893.v3. 2017. Publications that use the Genomics England Database should include an author as Genomics England Research Consortium. Please see the publication policy.
License
https://www.genomicsengland.co.uk/about-gecip/joining-research-community/
Description
Quickviews bring together data from several LabKey tables for convenient access, including:

rare_disease_analysis: Data for all rare disease participants, including: sex, ethnicity, disease recruited for and relationship to proband; latest genome build, QC status of latest genome, path to latest genomes and whether tiering data are available; as well as family selection quality checks for rare disease genomes on GRCh38, reporting abnormalities of the sex chromosomes, family relatedness, Mendelian inconsistencies and reported vs genetic sex summary checks. Please note that only sex checks are unpacked into individual data fields; a final status is shown in the “genetic vs reported results” column.

cancer_analysis: Data for all cancer participants whose genomes have been through Genomics England bioinformatics interpretation and passed quality checks, including: sex, ethnicity, disease recruited for and diagnosis; tumour ID, build of latest genome, QC status of latest genome and path to latest genomes; as well as file paths to the genomes. This table includes information derived from laboratory_sample and cancer_participant_tumour.
Data from: A High-Quality Genome Assembly from a Single, Field-collected...
agdatacommons.nal.usda.gov
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
zip
Updated Dec 18, 2023
Sarah Kingan; Julie Urban; Christine Lambert; Primo Baybayan; Anna Childers; Brad Coates; Brian Scheffler; Kevin Hackett; Jonas Korlach; Scott M. Geib (2023). Data from: A High-Quality Genome Assembly from a Single, Field-collected Spotted Lanternfly (Lycorma delicatula) using the PacBio Sequel II System [Dataset]. http://doi.org/10.15482/USDA.ADC/1503745
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.15482/USDA.ADC/1503745
Dataset updated
Dec 18, 2023
Dataset provided by
Ag Data Commons
Authors
Sarah Kingan; Julie Urban; Christine Lambert; Primo Baybayan; Anna Childers; Brad Coates; Brian Scheffler; Kevin Hackett; Jonas Korlach; Scott M. Geib
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
A high-quality reference genome is an essential tool for applied and basic research on arthropods. Long-read sequencing technologies may be used to generate more complete and contiguous genome assemblies than alternate technologies; however, long-read methods have historically had greater input DNA requirements and higher costs than next-generation sequencing, which are barriers to their use on many samples. Here, we present a 2.3 Gb de novo genome assembly of a field-collected adult female Spotted Lanternfly (Lycorma delicatula) using a single PacBio SMRT Cell. The Spotted Lanternfly is an invasive species recently discovered in the northeastern United States, threatening to damage economically important crop plants in the region. The DNA from one individual female specimen collected in Reading, Berks County, Pennsylvania was used to make one standard, size-selected library with an average DNA fragment size of ~20 kb. The library was run on one Sequel II SMRT Cell 8M, generating a total of 132 Gb of long-read sequences, of which 82 Gb were from unique library molecules, representing approximately 38x coverage of the genome. The assembly had high contiguity (contig N50 length = 1.5 Mb), completeness, and sequence level accuracy as estimated by conserved gene set analysis (96.8% of conserved genes both complete and without frame shift errors). Further, it was possible to segregate more than half of the diploid genome into the two separate haplotypes. The assembly also recovered two microbial symbiont genomes known to be associated with L. delicatula, each microbial genome being assembled into a single contig. We demonstrate that field-collected arthropods can be used for the rapid generation of high-quality genome assemblies, an attractive approach for projects on emerging invasive species, disease vectors, or conservation efforts of endangered species.
Supporting files for the manuscript "A High-Quality Genome Assembly from a Single, Field-collected Spotted Lanternfly (Lycorma delicatula) using the PacBio Sequel II System" include several intermediate versions of the assembly (raw output from Falcon, raw output from Falcon unzip, etc.) as well as the final assembly primary contigs and haplotigs (for the regions of the genome that were phased).

Resources in this dataset:

Resource Title: Final Assembly file. File Name: FinalAssembly.zip. Resource Description: Primary contigs and haplotigs in fasta format. The file slf.8M.final.primary.fasta contains the primary contigs, and slf.8M.final.haplotigs.fasta contains the haplotigs.

Resource Title: Falcon Raw assembly, polished with arrow. File Name: FalconAssembly.zip. Resource Description: Raw primary contig assembly prior to Falcon unzip. Contigs were polished with all subreads using the arrow polishing tool.

Resource Title: Fasta file of contig assemblies of the two symbiont genomes. File Name: Symbiont.zip. Resource Description: Contains contig fasta files for the Sulcia (Sulciamuelleri.fa) and Vidania (vidania.fa) symbiont genomes recovered from the de novo assembly.

Resource Title: Haplotig placement file in PAF format. File Name: slf.haplotigPlacement.paf.zip. Resource Description: Final assembly placement file, describing the placement of haplotigs on the primary contig assembly.

Resource Title: Falcon Unzip assembly, polished with arrow. File Name: FalconUnzipAssembly.zip. Resource Description: Falcon unzip assembly, both the primary contigs and haplotigs, unfiltered.
1002 Yeast Genome Dataset ploidy information
auckland.figshare.com
xlsx
Updated Feb 16, 2021
Diksha Sharma (2021). 1002 Yeast Genome Dataset ploidy information [Dataset]. http://doi.org/10.17608/k6.auckland.13888811.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.17608/k6.auckland.13888811.v1
Dataset updated
Feb 16, 2021
Dataset provided by
The University of Auckland
Authors
Diksha Sharma
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ploidy information for 789 isolates from the 1002 Yeast Genome dataset.