Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BuildingsBench datasets consist of:
Buildings-900K can be used for pretraining models on day-ahead short-term load forecasting (STLF) for residential and commercial buildings. It addresses the lack of time series datasets that are large and diverse enough for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see the link to this database further below). However, the EULP was not originally developed for STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Like the EULP, Buildings-900K is a collection of Parquet files, and it follows nearly the same Parquet dataset organization as the EULP. Because it contains only a single energy consumption time series per building, it is much smaller than the EULP (~110 GB).
BuildingsBench also provides an evaluation benchmark, which is a collection of open source energy consumption datasets from real residential and commercial buildings. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files that contain annual energy consumption. Altogether, the evaluation datasets are smaller than 1 GB. They are listed below:
A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
[NOTE: PLEXdb is no longer available online. Oct 2019.] PLEXdb (Plant Expression Database) is a unified gene expression resource for plants and plant pathogens. PLEXdb is a genotype-to-phenotype, hypothesis-building information warehouse, leveraging highly parallel expression data with seamless portals to related genetic, physical, and pathway data. PLEXdb (http://www.plexdb.org), in partnership with community databases, supports comparisons of gene expression across multiple plant and pathogen species, encouraging individuals and/or consortia to upload genome-scale data sets and contrast them with previously archived data. These analyses facilitate the interpretation of structure, function and regulation of genes in economically important plants. A list of Gene Atlas experiments highlights data sets that give responses across different developmental stages, conditions and tissues. Tools at PLEXdb allow users to perform complex analyses quickly and easily. The Model Genome Interrogator (MGI) tool supports mapping gene lists onto corresponding genes from model plant organisms, including rice and Arabidopsis. MGI predicts homologies and displays gene structures and supporting information for annotated genes and full-length cDNAs. The gene list-processing wizard guides users through PLEXdb functions for creating, analyzing, annotating and managing gene lists. Users can upload their own lists or create them from the output of PLEXdb tools, and then apply diverse higher-level analyses, such as ANOVA and clustering. PLEXdb also provides methods for users to track how gene expression changes across many different experiments using the Gene OscilloScope. This tool can identify interesting expression patterns, such as up-regulation under diverse conditions, or check any gene’s suitability as a steady-state control.
Resources in this dataset:
Resource Title: Website Pointer for Plant Expression Database, Iowa State University.
File Name: Web Page, url: https://www.bcb.iastate.edu/plant-expression-database [NOTE: PLEXdb is no longer available online. Oct 2019.] Project description for the Plant Expression Database (PLEXdb) and integrated tools.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Targeted top-down strategies for genome reduction are considered to have a high potential for providing robust basic strains for synthetic biology and industrial biotechnology. Recently, we created a library of 26 genome-reduced strains of Corynebacterium glutamicum carrying broad deletions in single gene clusters and showing wild-type-like biological fitness. Here, we proceeded with combinatorial deletions of these irrelevant gene clusters in two parallel orders, and the resulting library of 28 strains was characterized under various environmental conditions. The final chassis strain C1* carries a genome reduction of 13.4% (412 deleted genes) and shows wild-type-like growth behavior in defined medium with d-glucose as carbon and energy source. Moreover, C1* proves to be robust against several stresses (including oxygen limitation) and shows long-term growth stability under defined and complex medium conditions. In addition to providing a novel prokaryotic chassis strain, our results comprise a large strain library and a revised genome annotation list, which will be valuable sources for future systemic studies of C. glutamicum.
Phenotypic, genotypic, and environment data for the 2015 field season: The data is stored in CyVerse. Data types in this directory tree are: hybrid and inbred agronomic and performance traits; inbred genotypic data; and environmental (soil, weather) data collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.
Resources in this dataset:
Resource Title: CyVerse Genomes To Fields 2015 dataset download.
File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Carolyn_Lawrence_Dill_G2F_Mar_2017
Dataset (csv) and metadata (BibTex, Endnote) data downloads. See _readme.txt for file contents.
Database for mapping gene expression profiles to pathways and genomes. Repository of microarray gene expression profile data for Synechocystis PCC6803 (syn), Bacillus subtilis (bsu), Escherichia coli W3110 (ecj), Anabaena PCC7120 (ana), and other species contributed by the Japanese research community.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping
## 1.code-and-documents
This directory contains the source code, executable binaries, and documents of KMCP,
which are also hosted on GitHub: https://github.com/shenwei356/kmcp .
Databases, usage, and tutorials of KMCP are also available at https://bioinf.shenwei.me/kmcp/.
- [Installation](https://bioinf.shenwei.me/kmcp/download)
- [Databases](https://bioinf.shenwei.me/kmcp/database)
- Tutorials
    - [Taxonomic profiling](https://bioinf.shenwei.me/kmcp/tutorial/profiling)
    - [Sequence and genome searching](https://bioinf.shenwei.me/kmcp/tutorial/searching)
- [Usage](https://bioinf.shenwei.me/kmcp/usage)
- [Benchmarks](https://bioinf.shenwei.me/kmcp/benchmark)
- [FAQs](https://bioinf.shenwei.me/kmcp/faq)
## 2.databases
This directory contains the building steps and reference genome accessions for
KMCP databases used in the manuscript.

- `cami2`: databases used in benchmarks on CAMI2 mouse gut datasets
- `kmcp`: databases used in other benchmarks
## 3.figures
Each subdirectory contains steps to run the benchmark (`README.md`), steps for plotting (`README-plot.md`),
benchmark results, and figures.
ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Data types in this directory tree are: hybrid and inbred agronomic and performance traits; inbred genotypic data; and environmental data collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version 3 (22 November, 2021)
See https://doi.org/10.24072/pcjournal.173 for a detailed description of the database. See http://evocellbio.com/eukprot/ for a BLAST database, interactive plots of BUSCO scores and ‘The Comparative Set’ (TCS): A selected subset of EukProt for comparative genomics investigations. Protein sequence FASTA files of the TCS are available at https://doi.org/10.6084/m9.figshare.21586065. See https://github.com/beaplab/EukProt for utility scripts, annotations, and all the files necessary to build the tree in Figures 1 and 3 (from the DOI above).
Scroll to the end of this page for changes since version 2.
Are we missing anything? Please let us know!
EukProt is a database of published and publicly available predicted protein sets selected to represent the breadth of eukaryotic diversity, currently including 993 species from all major supergroups as well as orphan taxa. The goal of the database is to provide a single, convenient resource for gene-based research across the spectrum of eukaryotic life, such as phylogenomics and gene family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are available at http://evocellbio.com/eukprot/. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.
This release contains 5 files:
EukProt_proteins.v03.2021_11_22.tgz: 993 protein data sets, for species with either a genome (375) or single-cell genome (56), a transcriptome (498), a single-cell transcriptome (47), or an EST assembly (17).
EukProt_genome_annotations.v03.2021_11_22.tgz: gene annotations, in GFF format, as produced by EukMetaSanity (https://github.com/cjneely10/EukMetaSanity) for 40 genomes lacking publicly available protein annotations. The proteins predicted from these annotations are included in the proteins file.
EukProt_included_data_sets.v03.2021_11_22.txt and EukProt_not_included_data_sets.v03.2021_11_22.txt: tables of information on data sets either included (993 data sets) or not included (163) in the database. The tables are tab-delimited; multiple entries in the same cell are comma-delimited; missing data is represented with the “N/A” value. The tables have the following columns:
EukProt_ID: the unique identifier associated with the data set. This will not change among versions. If a new data set becomes available for the species, it will be assigned a new unique identifier.
Name_to_Use: the name of the species for protein/genome annotation/assembled transcriptome files.
Strain: the strain(s) of the species sequenced.
Previous_Names: any previous names that this species was known by.
Replaces_EukProt_ID/Replaced_by_EukProt_ID: if the data set changes with respect to an earlier version, the EukProt ID of the data set that it replaces (in the included table) or that it is replaced by (in the not_included table).
Genus_UniEuk, Epithet_UniEuk, Supergroup_UniEuk, Taxogroup1_UniEuk, Taxogroup2_UniEuk: taxonomic identifiers at different levels of the UniEuk taxonomy (Berney et al. 2017, DOI: 10.1111/jeu.12414, based on Adl et al. 2019, DOI: 10.1111/jeu.12691).
Taxonomy_UniEuk: the full lineage of the species in the UniEuk taxonomy (semicolon-delimited).
Merged_Strains: whether multiple strains of the same species were merged to create the data set.
Data_Source_URL: the URL(s) from which the data were downloaded.
Data_Source_Name: the name of the data set (as assigned by the data source).
Paper_DOI: the DOI(s) of the paper(s) that published the data set.
Actions_Prior_to_Use: the action(s) that were taken to process the publicly available files in order to produce the data set in this database. Actions taken (see our manuscript for more details):
‘assemble mRNA’: Trinity v. 2.8.4, http://trinityrnaseq.github.io/
‘CD-HIT’: v. 4.6, http://weizhongli-lab.org/cd-hit/
‘extractfeat’, ‘seqret’, ‘transeq’, ‘trimseq’: from EMBOSS package v. 6.6.0.0, http://emboss.sourceforge.net/
‘translate mRNA’: Transdecoder v. 5.3.0, http://transdecoder.github.io/
‘gffread’: v. 0.12.3, https://github.com/gpertea/gffread
‘predict genes’: EukMetaSanity, https://github.com/cjneely10/EukMetaSanity (cloned on 21 September, 2021)
All parameter values were default, unless otherwise specified.
Data_Source_Type: the type of the source data (possible types: EST, transcriptome, single-cell transcriptome, genome, single-cell genome).
Notes: additional information on the data set (including why it is replaced by/is replacing another data set, or why it was not included).
Columns_Modified_Since_Previous_Version: column(s) in this file modified for the data set since the previous release. Not listed: modifications to the Notes column or to new columns added in this version.
Alternative_Strain_Names: non-exhaustive list of alternative names for the sequenced strain for this data set.
18S_Sequence_GenBank_ID: GenBank identifier for the strain sequenced in the data set. When multiple strains were sequenced, identifiers are separated with a comma, in the same order as the Strain column. Ranges of identifiers for the same strain are separated by a hyphen. ‘N/A’ indicates either that there is no GenBank sequence for the strain or that all available sequences are not full-length (< 1,500 bp).
18S_Sequence: 18S for the strain derived from publicly available sequences associated with the data set, in the case where a GenBank sequence is not available.
18S_Sequence_Source: the source for the sequence in the 18S_Sequence column, if any.
18S_Sequence_Other_Strain_GenBank_ID: GenBank identifier for 18S sequence(s) from other strains of the same species as the data set.
18S_Sequence_Other_Strain_Name: strain name(s) for the sequences in the 18S_Sequence_Other_Strain_GenBank_ID column.
18S_and_Taxonomy_Notes: additional information on the values in the 18S_Sequence columns.
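The table conventions described above (tab-delimited rows, comma-delimited multi-entry cells, “N/A” for missing data) can be read with a few lines of Python. This is a minimal sketch, not an official EukProt utility: the function names, the choice of which columns to treat as multi-valued, and the sample rows (including the IDs) are made up for illustration, and it assumes every row has the same number of fields as the header.

```python
import csv
import io

MISSING = "N/A"  # missing-data marker described in the table documentation


def parse_cell(value, multi_valued):
    """Convert one raw cell: None if missing, a list if multi-valued, else a string."""
    if value is None:
        return None
    value = value.strip()
    if value == MISSING:
        return None
    if multi_valued:
        # Multiple entries in the same cell are comma-delimited.
        return [part.strip() for part in value.split(",")]
    return value


def read_table(handle, multi_columns=()):
    """Read a tab-delimited table into a list of dicts, one per data set."""
    # QUOTE_NONE keeps quote characters (e.g. in taxonomy strings) as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {col: parse_cell(val, col in multi_columns) for col, val in row.items()}
        for row in reader
    ]


# Hypothetical two-row table; real files have many more columns.
sample = (
    "EukProt_ID\tName_to_Use\tStrain\tData_Source_Type\n"
    "EP99999\tExample_species\tstrainA, strainB\ttranscriptome\n"
    "EP99998\tOther_species\tN/A\tgenome\n"
)
rows = read_table(io.StringIO(sample), multi_columns={"Strain"})
for row in rows:
    print(row["EukProt_ID"], row["Strain"])
```

Only the columns named in `multi_columns` are split on commas, because free-text columns such as Notes may contain commas that are not separators.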
Changes since version 2
There are 324 new data sets included. 57 of these replace data sets from version 2.
40 newly published data sets were added to the list that are not included in the database (annotated in the Notes column with the reasons they were not included).
Instead of unannotated genomes (for published genomes lacking protein predictions), we now include predicted proteins and gene annotations (in GFF3 format).
All sequences within each file are now assigned a standardized, unique identifier based on the data set’s EukProt_ID and on the type of data (protein or transcriptome). Illegal characters are removed from sequences.
In the UniEuk_Taxonomy field, single quotes are now used instead of double quotes, to be consistent with other UniEuk databases (EukMap, EukRibo).
Changes to metadata of individual data sets (in the included and not_included tables) with respect to the previous version are now listed in the Columns_Modified_Since_Previous_Version column.
The Taxogroup_UniEuk column has been split into the Taxogroup1_UniEuk and Taxogroup2_UniEuk columns. This resulted in the Supergroup_UniEuk column changing for Opisthokonta.
In addition, the following new columns have been added (see our manuscript for details): Alternative_Strain_Names, 18S_Sequence_GenBank_ID, 18S_Sequence, 18S_Sequence_Source, 18S_Sequence_Other_Strain_GenBank_ID, 18S_Sequence_Other_Strain_Name, 18S_and_Taxonomy_Notes.
EukProt_assembled_transcriptomes.v03.2021_11_22.tgz: assembled transcriptome contigs, for 126 species with publicly available mRNA sequence reads but no publicly available assembly. The proteins predicted from these assemblies are included in the proteins file.
Sequence names in the proteins and transcriptomes files have standardized, unique identifiers with the following format:
[EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter] [Previous header contents]
Type abbreviations are P (protein) and T (transcriptome).
All characters not in the following list are removed from nucleic acid sequences: ACGTNUKSYMWRBDHV
All characters not in the following list are removed from protein sequences: ABCDEFGHIKLMNPQRSTUVWYZX*
Lists of legal characters are from: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
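The identifier format and the character filtering described above can be illustrated in Python. This is a sketch under stated assumptions rather than the code used to build the database: it assumes the EukProt ID itself contains no underscore, that the counter is a run of digits, and that filtering is case-sensitive (the description does not say whether lowercase letters are normalized first). The example header and sequences are hypothetical.

```python
import re

# Legal character lists as given in the description above.
NUCLEOTIDE_CHARS = set("ACGTNUKSYMWRBDHV")
PROTEIN_CHARS = set("ABCDEFGHIKLMNPQRSTUVWYZX*")

# [EukProt ID]_[Name_to_Use]_[Type abbreviation][Counter]
# Assumption: the ID has no underscore; Name_to_Use may contain underscores.
IDENTIFIER_RE = re.compile(r"^([^_\s]+)_(\S+)_([PT])(\d+)$")


def parse_header(header):
    """Split a FASTA header into its standardized parts."""
    header = header.lstrip(">")
    identifier, _, previous = header.partition(" ")
    match = IDENTIFIER_RE.match(identifier)
    if match is None:
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    eukprot_id, name, type_abbrev, counter = match.groups()
    return {
        "eukprot_id": eukprot_id,
        "name_to_use": name,
        "type": "protein" if type_abbrev == "P" else "transcriptome",
        "counter": counter,  # kept as a string to preserve any zero padding
        "previous_header": previous,
    }


def clean_sequence(sequence, legal_chars):
    """Drop every character that is not in the given legal set."""
    return "".join(ch for ch in sequence if ch in legal_chars)


# Hypothetical header and sequences, for illustration only.
parsed = parse_header(">EP99999_Example_species_P000001 original header text")
print(parsed)
print(clean_sequence("MKT-L?X*", PROTEIN_CHARS))
print(clean_sequence("ACGT-NXacgt", NUCLEOTIDE_CHARS))
```

Matching the type abbreviation and counter from the end of the identifier lets species names with underscores pass through intact.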
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Phenotypic, genotypic, and environment data for the 2016 field season: The data is stored in CyVerse. Data types in this directory tree are: hybrid and inbred agronomic and performance traits; inbred genotypic data; and environmental (soil, weather) data collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.
Resources in this dataset:
Resource Title: CyVerse Genomes To Fields 2016 dataset download.
File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/GenomesToFields_G2F_2016_Data_Mar_2018
Dataset (csv) and metadata (BibTex, Endnote) data downloads. See _readme.txt for file contents.
ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the GDAb and GDAt datasets. GDAb and GDAt are large-scale, distantly supervised, and manually enhanced datasets for Gene-Disease Association (GDA) extraction. Each dataset consists of three text files, corresponding to train, validation, and test sets, plus an additional JSON file containing the mapping between relation names and IDs. Each record in train, validation, or test files corresponds to a single GDA extracted from a sentence. Records are represented as JSON objects with the following structure:
text: sentence from which the GDA was extracted.
relation: relation name associated to the given GDA.
h: JSON object representing the gene entity, composed of:
id: UMLS CUI associated to the gene entity.
name: UMLS preferred name associated to the gene entity.
pos: list consisting of starting position and length of the gene mention within text.
t: JSON object representing the disease entity, composed of:
id: UMLS CUI associated to the disease entity.
name: UMLS preferred name associated to the disease entity.
pos: list consisting of starting position and length of the disease mention within text.
Both datasets contain over 2,500,000 sentences and 500,000 bags. The zip file consists of two folders, GDAb and GDAt, containing the files corresponding to the two datasets, respectively.
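The record structure described above can be explored with a short Python sketch. It is illustrative only: the record below is invented (the sentence, names, relation, and CUI-like IDs are placeholders, not real dataset content), and it assumes that the starting position in `pos` is a 0-based character offset into `text`, which the description does not state explicitly.

```python
import json


def mention(text, pos):
    """Return the substring of text addressed by a [start, length] pair."""
    start, length = pos
    return text[start:start + length]


# Hypothetical record following the documented fields: text, relation, h, t.
sentence = "Mutations in GENE1 are associated with disease X."
record_json = json.dumps({
    "text": sentence,
    "relation": "example_relation",  # placeholder relation name
    "h": {
        "id": "C0000001",  # placeholder, not a real UMLS CUI
        "name": "GENE1 gene",
        "pos": [sentence.index("GENE1"), len("GENE1")],
    },
    "t": {
        "id": "C0000002",  # placeholder, not a real UMLS CUI
        "name": "Disease X",
        "pos": [sentence.index("disease X"), len("disease X")],
    },
})

# Parse the record and recover both entity mentions from the sentence.
record = json.loads(record_json)
gene_mention = mention(record["text"], record["h"]["pos"])
disease_mention = mention(record["text"], record["t"]["pos"])
print(record["relation"], gene_mention, disease_mention)
```

If the offsets in the actual files turn out to follow a different convention, only the `mention` helper needs to change.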
Prion infection results in progressive neurodegeneration of the central nervous system invariably resulting in death. The pathological effects of prion diseases in the brain are morphologically well defined, such as gliosis, vacuolation, and the accumulation of disease-specific protease-resistant prion protein (PrPSc). However, the underlying molecular events that lead to the death of neurons are poorly characterised. In this study cDNA microarrays were used to profile gene expression changes in the brains of two different strains of mice infected with three strains of mouse-adapted scrapie. Extensive data were collected and analyzed, from which we identified a core group of 349 prion-related genes (PRGs) that consistently showed altered expression in mouse models. Gene ontology analysis assigned many of the up-regulated genes to functional groups associated with one of the primary neuropathological features of prion diseases, astrocytosis and gliosis: protein synthesis, inflammation, cell proliferation and lipid metabolism. Using a computational tool, Ingenuity Pathway Analysis (IPA), we were able to build networks of interacting genes from the PRG list. The regulatory cytokine TGFB1, involved in modulating the inflammatory response, was identified as the outstanding interaction partner for many of the PRGs. The majority of genes expressed in neurons were down-regulated; a number of these were involved in regulatory pathways including synapse function, calcium signalling, long-term potentiation and ERK/MAPK signalling. Two down-regulated genes coding for the transcription regulators EGR1 and CREB1 were also identified as central to interacting networks of genes; these factors are often used as markers of neuronal activity, and their deregulation could be key to loss of neuronal function. These data provide a comprehensive list of genes that are consistently differentially expressed in multiple scrapie-infected mouse models.
Building networks of interactions between these genes provides a means to understand the complex interplay in the brain during neurodegeneration. Resolving the key regulatory and signaling events that underlie prion pathogenesis will provide targets for the design of novel therapies and the elucidation of biomarkers.
Keywords: disease state analysis
C57BL/6 mice were inoculated by intracerebral injection of brain homogenate from mice clinically infected with the ME7, 79a and 22A strains of scrapie. In addition, VM mice were inoculated with the 22A scrapie strain. Mice were sacrificed at the onset of clinical disease as manifested by uncoordinated gait, flaccid paralysis of the hind limbs, rigidity and abolishment of the righting reflex. Brain tissue was collected from these mice and the RNA isolated. Mouse CNS gene expression was analysed by two-colour microarray experiments using an in-house manufactured 11K mouse cDNA microarray. RNA from individual infected mice was hybridized to each array versus pooled reference RNA from an equivalent number of age-matched, mock-infected control mice. In total we hybridized 34 different samples to microarrays in this experiment; 8-10 individual mice from each of the four sample groups were individually processed for separate microarrays. Hierarchical clustering shows that the patterns of gene expression are for the most part common to the different mouse models. We used the program EDGE to identify genes that were differentially expressed in mouse brain during clinical disease. We used a P value cut-off of 0.05 as the criterion for selection of significantly differentially expressed genes. A similar design was used to determine gene expression profiles from C57BL/6 mice infected with a fourth strain of scrapie, RML. Total RNA from three mice, plus dye swaps, was hybridized to Agilent whole genome mouse arrays to validate genes identified using our manufactured BMAP arrays.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
A subset of ~30 inbreds was evaluated in 2014 and 2015 to develop an image-based ear phenotyping tool. The data is stored in CyVerse. Data types in this directory tree are: dimension and width profile data collected from scanned images of ears, cobs, and kernels collected from the Genomes To Fields (G2F) project cooperators. G2F is an umbrella initiative to support translation of maize (Zea mays) genomic information for the benefit of growers, consumers and society. This public-private partnership is building on publicly funded corn genome sequencing projects to develop approaches to understand the functions of corn genes and specific alleles across environments. Ultimately this information will be used to enable accurate prediction of the phenotypes of corn plants in diverse environments. There are many dimensions to the over-arching goal of understanding genotype-by-environment (GxE) interactions, including which genes impact which traits and trait components, how genes interact among themselves (GxG), the relevance of specific genes under different growing conditions, and how these genes influence plant growth during various stages of development.
Resources in this dataset:
Resource Title: CyVerse Genomes To Fields Inbred Ear Imaging 2017 dataset download.
File Name: Web Page, url: http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/Edgar_Spalding_G2F_Inbred_Ear_Imaging_June_2017
Dataset (csv, tar.gz) and metadata (BibTex/Endnote) downloads. See _readme.txt for file contents.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Glioblastoma Multiforme (TCGA-GBM) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
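As a rough illustration of how matched patient identifiers let clinical and imaging records be linked, the sketch below joins two small in-memory tables on a shared patient ID. Everything in it is hypothetical: the field names (`barcode`, `patient_id`, `modality`, `vital_status`), the example identifiers, and the assumption that the patient-level ID is the first three dash-separated fields of a TCGA-style barcode. It does not query the GDC Data Portal or TCIA.

```python
def patient_id(barcode):
    """Return the patient-level prefix of a TCGA-style barcode.

    Assumption for this sketch: the first three dash-separated fields
    (e.g. "TCGA-00-0001") identify the patient.
    """
    parts = barcode.split("-")
    if len(parts) < 3 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA-style identifier: {barcode!r}")
    return "-".join(parts[:3])


def match_records(clinical_rows, imaging_rows):
    """Pair each clinical row with the imaging rows that share its patient ID."""
    # Index imaging rows by patient so each clinical row is a single lookup.
    series_by_patient = {}
    for row in imaging_rows:
        series_by_patient.setdefault(patient_id(row["patient_id"]), []).append(row)

    matched = {}
    for row in clinical_rows:
        pid = patient_id(row["barcode"])
        if pid in series_by_patient:
            matched[pid] = {"clinical": row, "imaging": series_by_patient[pid]}
    return matched


# Made-up records standing in for a clinical export and an imaging manifest.
clinical = [
    {"barcode": "TCGA-00-0001-01A", "vital_status": "alive"},
    {"barcode": "TCGA-00-0002-01A", "vital_status": "deceased"},
]
imaging = [
    {"patient_id": "TCGA-00-0001", "modality": "MR"},
    {"patient_id": "TCGA-00-0001", "modality": "CT"},
    {"patient_id": "TCGA-00-0003", "modality": "MR"},
]

matched = match_records(clinical, imaging)
for pid, record in matched.items():
    modalities = sorted(r["modality"] for r in record["imaging"])
    print(pid, record["clinical"]["vital_status"], modalities)
```

Only patients present in both tables appear in the result, which reflects the point made above: correlation studies are limited to subjects with both kinds of data, and the imaging side may mix modalities and protocols.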
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Glioma Phenotype Research Group.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Rare diseases affect more than 30 million individuals, with the majority facing limited treatment options, heightening the urgency for innovative therapeutic solutions. Addressing these medical challenges necessitates an exploration of novel treatment modalities. Among these, drug repurposing emerges as a promising avenue, offering both potential and risk mitigation. To this end, we focused primarily on developing predictive models that use computational techniques to uncover latent relationships between gene targets and chemical compounds for drug repurposing. Building on our previous investigation, in which we identified gene targets for compounds from the Tox21 in vitro assays, we expanded to a systematic prediction of potential targets for drug repurposing, employing machine learning models built on diverse algorithms such as Support Vector Classifier, K-Nearest Neighbors, Random Forest, and Extreme Gradient Boosting. These models were trained on comprehensive biological activity profile data to predict the relationship between 143 gene targets and over 6000 compounds. Our models demonstrated high accuracy (>0.75), with predictions further validated using public experimental datasets. Furthermore, several findings were evaluated via case studies. By elucidating these connections, we aim to streamline the drug repurposing process, ultimately catalyzing the discovery of more effective therapeutic interventions for rare diseases.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Ovarian Cancer (TCGA-OV) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the TCGA Ovarian Phenotype Research Group.
The timing and tempo of the processes involved in community assembly are of substantial concern to community ecologists and conservation managers. The fossil record is a valuable source of data for studying past changes in community composition, but it is not always detailed enough to allow the process of community assembly to be resolved at regional or site scales while tracing the trajectories of known species with associated known traits. We present a three‐step framework for studying present‐day species accumulation through time: DNA sampling from multiple individuals from multiple species within a community; estimates of coalescence times for each species using molecular dating methods; and plotting the accumulation of present‐day species through time using the inferred population ages. Our approach is illustrated using whole chloroplast genomes from plants from three rainforest communities in eastern Australia. Expected times to coalescence for multiple species in each community were inferred from pooled high‐throughput sequence libraries. Local assemblage accumulation curves for each community were constructed. We also explored the variation in assemblage accumulation curves of species with different functional traits. Models of equilibrium species richness informed our null hypothesis and largely explained the shape of the assemblage accumulation curves and indicated that the complexities of the accumulation process should be explored with additional parameters, for example allowing species classes with different extinction rates. The assemblage accumulation curves for the study sites showed evidence of recent population expansions within each of the communities. This signal of recent accumulation is consistent with the increase in suitable rainforest habitat that followed the Last Glacial Maximum. Our method of constructing assemblage accumulation curves provides a simple approach for visualizing species‐accumulation data. 
It can be used to test hypotheses such as the relative survival potential of species‐specific ecological attributes. Although our example used single‐nucleotide polymorphisms derived from whole‐chloroplast sequencing, this framework can be applied to mitochondrial genomes and to communities of other organisms.
AAC_branchrate estimate_chloroplast168: FASTA files containing concatenated chloroplast genes.
estimatingtimetocoalescence: An .xlsx file containing the data and calculations used for building the AAC curves presented in the manuscript.
simulations: Several files containing code and data for simulations for assemblage accumulation curves. A readme file is included in the zip folder.
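The third step of the framework described in this abstract, plotting the accumulation of present-day species through time from inferred population ages, can be sketched as a simple counting exercise. The code below is a minimal illustration and not the authors' implementation: the species names, the ages, the time units, and the function name `accumulation_curve` are all hypothetical, and a species is counted as present at a given time before present if its inferred coalescence time is at least that old.

```python
from bisect import bisect_left


def accumulation_curve(coalescence_times, time_grid):
    """Count, for each time before present in time_grid, the species whose
    inferred population age is at least that old.

    coalescence_times: dict mapping species name -> inferred age (any unit).
    time_grid: iterable of times before present, in the same unit.
    Returns a list of (time, species_count) pairs in the order of time_grid.
    """
    ages = sorted(coalescence_times.values())
    n = len(ages)
    # bisect_left gives the index of the first age >= t, so n minus that
    # index is the number of species already present at time t.
    return [(t, n - bisect_left(ages, t)) for t in time_grid]


# Hypothetical inferred ages (for example, in thousands of years).
example_ages = {"sp_a": 5.0, "sp_b": 20.0, "sp_c": 1.0, "sp_d": 12.0}

# Walk from the oldest time towards the present (t = 0).
curve = accumulation_curve(example_ages, [25, 15, 10, 2, 0])
for t, count in curve:
    print(f"{t:>4} before present: {count} species")
```

Computing separate curves for subsets of species, for example grouped by a functional trait, only requires calling the function once per subset; comparing the curves to a null model is beyond this sketch.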
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes by providing clinical images matched to subjects from The Cancer Genome Atlas (TCGA). Clinical, genetic, and pathological data resides in the Genomic Data Commons (GDC) Data Portal while the radiological data is stored on The Cancer Imaging Archive (TCIA).
Matched TCGA patient identifiers allow researchers to explore the TCGA/TCIA databases for correlations between tissue genotype, radiological phenotype and patient outcomes. Tissues for TCGA were collected from many sites all over the world in order to reach their accrual targets, usually around 500 specimens per cancer type. For this reason the image data sets are also extremely heterogeneous in terms of scanner modalities, manufacturers and acquisition protocols. In most cases the images were acquired as part of routine care and not as part of a controlled research study or clinical trial.
Imaging Source Site (ISS) Groups are being populated and governed by participants from institutions that have provided imaging data to the archive for a given cancer type. Modeled after TCGA analysis groups, ISS groups are given the opportunity to publish a marker paper for a given cancer type per the guidelines in the table above. This opportunity will generate increased participation in building these multi-institutional data sets as they become an open community resource. Learn more about the CIP TCGA Radiology Initiative.