Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ChemTastesDB is a database that includes curated information on 4075 molecular tastants. ChemTastesDB is distributed to the scientific community to expand the information available on molecular tastants, which could assist the analysis of the relationships between molecular structure and taste, as well as in silico (QSAR/QSPR) studies for taste prediction. Examples of QSPR approaches for the prediction of molecular taste are given in the following publication: Rojas, C., Abril-González, M., Ballabio, D. & García, F. (2025). ChemTastesPredictor: An ensemble of machine learning classifiers to predict the taste of molecular tastants (Under review).
The 4075 molecular tastants are categorized into one of the five basic tastes (sweet, bitter, umami, sour and salty) or into other classes related to non-basic tastes (tasteless, non-sweet, non-bitter, multitaste and miscellaneous). The molecules are categorized into the following ten classes: sweet (1313), bitter (1615), umami (220), sour (49), salty (16), multitaste (179), tasteless (232), non-sweet (304), non-bitter (28), and miscellaneous (119).
ChemTastesDB provides the following information for each molecule: name, PubChem CID, CAS registry number, canonical SMILES string, taste class and the reference to the scientific sources from which the data were retrieved. In addition, the molecular structure of each compound is provided in the HyperChem (.hin) format.
This is version 2.0 of the ChemTastesDB. In this new version, 1131 newly curated compounds were added. These new molecules were retrieved from 52 new bibliographic references.
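The class sizes listed for ChemTastesDB can be checked against the stated total, and the class shares give a quick view of how imbalanced the ten classes are, which matters for any taste classifier trained on them. The following is a minimal sketch that uses only the counts quoted in the description; the variable and function names are illustrative and are not part of the database.

```python
# Class sizes as quoted in the ChemTastesDB description.
CLASS_COUNTS = {
    "sweet": 1313,
    "bitter": 1615,
    "umami": 220,
    "sour": 49,
    "salty": 16,
    "multitaste": 179,
    "tasteless": 232,
    "non-sweet": 304,
    "non-bitter": 28,
    "miscellaneous": 119,
}

BASIC_TASTES = ("sweet", "bitter", "umami", "sour", "salty")


def class_shares(counts):
    """Return the fraction of all molecules that falls into each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    total = sum(CLASS_COUNTS.values())
    basic = sum(CLASS_COUNTS[name] for name in BASIC_TASTES)
    print(f"total molecules: {total}")
    print(f"molecules in the five basic tastes: {basic}")
    for name, share in sorted(class_shares(CLASS_COUNTS).items(), key=lambda kv: -kv[1]):
        print(f"{name:>14s}: {share:6.1%}")
```

Because classes such as salty and non-bitter are very small, stratified splitting or class weighting may be worth considering in downstream modelling.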
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We extend questionable research practices (QRPs) research by conducting a robust, large-scale analysis of p-hacking in organizational research. We leverage a manually curated database of more than 1,000,000 correlation coefficients and sample sizes, with which we calculate exact p-values. We test for the prevalence and magnitude of p-hacking across the complete database as well as various subsets of the database according to common bivariate relation types in the organizational literature (e.g., attitudes-behaviors). Results from two analytical approaches (i.e., z-curve, critical bin comparisons) were consistent in both direction and significance in nine of 18 datasets. Critical bin comparisons indicated p-hacking in 12 of 18 subsets, three of which reached statistical significance. Z-curve analyses indicated p-hacking in 11 of 18 subsets, two of which reached statistical significance. Generally, results indicated that p-hacking is detectable but small in magnitude. We also tested for three predictors of p-hacking: Publication year, journal prestige, and authorship team size. Across two analytic approaches, we observed a relatively consistent positive relation between p-hacking and journal prestige, and no relationship between p-hacking and authorship team size. Results were mixed regarding the temporal trends (i.e., evidence for p-hacking over time). In sum, the present study of p-hacking in organizational research indicates that the prevalence of p-hacking is smaller and less concerning than earlier research has suggested.
This dataset contains a curated set of 19,164 airfoil shapes from various applications and the data-driven design space of separable shape tensors (PGA space), which can be used as a parameter space for machine-learning applications focused on airfoil shapes. We constructed the airfoil dataset in two main stages. First, we identified 13 baseline airfoils from the NREL 5MW and IEA 15MW reference wind turbines. We reparameterized these shapes using least-squares fits of 8th-order CST parametrizations, which involve 18 coefficients. By uniformly perturbing all 18 CST coefficients by +/-20% around each baseline airfoil, we generated 1,000 unique airfoils per baseline. Each airfoil was sampled with 1,001 shape landmarks whose x-coordinates followed a cosine distribution along the chord. This process resulted in a total of 13,000 airfoil shapes, each with 1,001 landmarks. In the second stage, we gathered additional airfoils from the extensive BigFoil database, which consolidates data from sources such as the University of Illinois Urbana-Champaign (UIUC) airfoil database, the JavaFoil database, the NACA-TR-824 database, and others. We undertook a thorough pre-processing step to filter out shapes with sparse, noisy, or incomplete data. We also removed airfoils with a sharp leading edge and those exceeding our threshold for trailing-edge thickness. Additionally, we thinned out the collection of NACA airfoils (parametric sweeps of NACA airfoils with increasing thickness and camber present in the BigFoil database) by selecting every fourth step in the parameter sweeps. Finally, we regularized the airfoils by reparametrizing them with an 8th-order CST parametrization (with 1,001 shape landmarks whose x-coordinates follow a cosine distribution along the chord) and removing airfoils with high reconstruction errors. This data pre-processing resulted in a set of 6,164 airfoils.
In total, our curated airfoil dataset comprises 19,164 airfoils, each with 1,001 landmarks, and is stored in the curated_airfoils.npz file. Using this curated airfoil dataset, we used the separable shape tensors framework to develop a data-driven parameterization of airfoils based on principal geodesic analysis (PGA) of separable shape tensors. This PGA space is provided in the PGAspace.npz file.
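The first stage of the airfoil construction (CST surfaces, cosine-spaced landmarks, and a uniform +/-20% perturbation of the coefficients) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the class-function exponents 0.5 and 1.0, the omission of a trailing-edge thickness term, the split of landmarks between upper and lower surfaces, and the baseline coefficient values are all hypothetical. An 8th-order Bernstein basis has 9 coefficients per surface, which matches the 18 coefficients mentioned above for two surfaces.

```python
from math import comb

import numpy as np


def cosine_spacing(n):
    """Return n chordwise x-locations in [0, 1], clustered near both edges."""
    beta = np.linspace(0.0, np.pi, n)
    return 0.5 * (1.0 - np.cos(beta))


def cst_surface(x, coeffs, n1=0.5, n2=1.0):
    """Evaluate one CST surface: class function times a Bernstein shape function.

    n1 and n2 are assumed class-function exponents (round nose, sharp trailing edge).
    """
    order = len(coeffs) - 1
    class_fn = x**n1 * (1.0 - x) ** n2
    shape_fn = sum(
        a * comb(order, i) * x**i * (1.0 - x) ** (order - i)
        for i, a in enumerate(coeffs)
    )
    return class_fn * shape_fn


def perturb(coeffs, rel=0.2, rng=None):
    """Scale each coefficient by an independent uniform factor in [1 - rel, 1 + rel]."""
    rng = np.random.default_rng() if rng is None else rng
    return coeffs * (1.0 + rng.uniform(-rel, rel, size=coeffs.shape))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical baseline: 9 upper-surface and 9 lower-surface coefficients.
    baseline = np.concatenate([np.full(9, 0.2), np.full(9, -0.15)])
    x = cosine_spacing(501)  # landmarks per surface: an assumption
    variants = [perturb(baseline, rel=0.2, rng=rng) for _ in range(5)]
    for coeffs in variants:
        upper = cst_surface(x, coeffs[:9])
        lower = cst_surface(x, coeffs[9:])
        print(f"max thickness: {np.max(upper - lower):.4f}")
```

Perturbing multiplicatively keeps each coefficient's sign, so the upper and lower surfaces do not cross for a baseline like the one assumed here.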
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EukRibo is a manually curated database of reference small-subunit ribosomal RNA gene (18S rDNA) sequences of eukaryotes, specifically aimed at taxonomic annotation of high-throughput metabarcoding datasets. Unlike other reference databases of ribosomal genes, it is not meant to exhaustively capture all publicly available 18S rDNA sequences from the INSDC repositories, but to represent a subset of highly trustable sequences covering the whole known diversity of eukaryotes, with a focus on protists, manually verified taxonomic identifications, and relatively low genetic redundancy.
EukRibo is part of a suite of public resources generated by the UniEuk project (www.unieuk.org), which are all designed to follow a common taxonomic framework for maximal interoperability. The high level of taxonomic accuracy of EukRibo, together with a newly designed, phylogenetically-informed annotation approach, allows high confidence in the taxonomic annotation of environmental metabarcodes, as well as identification of new eukaryotic diversity at various taxonomic levels using a connected components approach.
Accompanying preprint available at https://doi.org/10.1101/2022.11.03.515105.
EukRibo ReadMe file, versions 1 and 2
Each EukRibo release consists of 4 files:
- a tsv table containing the taxonomic and other information about the 18S rDNA sequences included in the release
- a fasta file containing the full sequences as retrieved from the INSDC repositories (NCBI, EMBL-EBI/ENA, DDBJ)
- a fasta file containing the variable region V4 extracted from all these sequences (based on the fragment amplified with the Tara-Oceans V4 primers)
- a fasta file containing the variable region V9 extracted from the subset of sequences where it is present (based on the fragment amplified with the Tara-Oceans V9 primers)
The primary goal of EukRibo was to be used to annotate the EukBank meta-dataset of available V4 metabarcoding datasets, and therefore all sequences included in EukRibo contain the variable region V4. Only a subset of these sequences (about 75%) also contain the variable region V9; this is because many 18S rDNA sequences in the INSDC repositories stop before the V9 fragment.
Sequences with slightly incomplete V4 or V9 fragments were kept if phylogenetically useful, i.e. if they are the only available representatives of a certain taxonomic lineage.
V4: We allowed up to 50 missing positions in the relatively conserved area at the 5' end of the V4 fragment (for an average fragment length of about 380 bp); no sequence incomplete at the 3' end of the V4 fragment is included.
V9: We allowed up to 30 missing positions in the relatively conserved area at the 3' end of the V9 fragment (for an average length of about 135 bp); no sequence incomplete at the 5' end of the V9 fragment is included. We allowed a higher proportion of missing positions for the V9 region because being more conservative would imply losing too many sequences, including entire taxonomic lineages.
Version 1 of EukRibo
This is the starting version of EukRibo that was used for the taxonomic annotation of the EukBank dataset, with taxonomy strings that were fixed as of October 2020.
- Contains 46,345 sequences with a sufficiently complete V4 region; 46,299 with the actual complete V4 region and 46 (about 0.1%) with missing positions at the 5' end.
- Of these, 34,438 also include a sufficiently complete V9 region; 23,226 with the actual complete V9 region and 11,206 (about 33%) with missing positions at the 3' end.
Version 2 of EukRibo
This is a version of EukRibo that was made taxonomically compatible with version 3 of the EukProt database (https://doi.org/10.1101/2020.06.30.180687), with taxonomic revisions as of July 2022 as well as additional information on the included selection of sequences that was not provided in the tsv file of version 1.
- Contains the exact same selection of sequences as in version 1, with the addition of genus Meteora, the last remaining known supergroup-level eukaryotic lineage for which an 18S rDNA was not previously available. (The Meteora sequence contains the full V4 fragment but does not include a sufficiently complete V9 fragment.)
- Only 34,432 sequences with a sufficiently complete V9 region are now retained because of 6 previously unrecognised chimeric sequences where the V9 fragment does not originate from the same organism as the V4 fragment.
Files in EukRibo version 1:
- 46345_EukRibo.tsv.gz
- 46345_EukRibo_full_seqs.fas.gz
- 46345_EukRibo_V4.fas.gz
- 34438_EukRibo_V9.fas.gz
The tsv file contains 6 columns:
- gb_accession - INSDC accession number of the sequence
- supergroup, taxogroup1, taxogroup2 - binning of the taxa into strictly monophyletic clades of evolutionary and/or ecological significance
- UniEuk_taxonomy_string - full UniEuk-compatible taxonomic annotation of the sequence
  - an unlimited number of levels is allowed (going down to strain for isolated organisms or to clone for environmental sequences)
  - informal names are used for phylogenetically supported clades without formal name
- V9 - presence ('Y') or absence ('N') of a sufficiently complete V9 fragment in the sequence
Files in EukRibo version 2:
- 46346_EukRibo-02.tsv.gz
- 46346_EukRibo-02_full_seqs.fas.gz
- 46346_EukRibo-02_V4.fas.gz
- 34432_EukRibo-02_V9.fas.gz
The tsv file now contains 12 columns:
- gb_accession, supergroup, taxogroup1, taxogroup2, UniEuk_taxonomy_string - same columns as in version 1
- alternative_strain_names (new) - provides alternative strain/isolate names when known to help cross-linking genetic data coming from the same organism
- V4 (new) - indicates whether the V4 fragment is complete ('yes - complete') or missing positions at the 5' end ('yes - partial')
- V9 (emended content) - now contains more precise information than in version 1 about whether it is complete ('yes - complete'), missing positions at the 3' end ('yes - partial'), or was excluded, and the 6 possible reasons why ('no - missing', 'no - too incomplete', 'no - chimera', 'no - bad quality', 'no - deletion in V9', 'no - Ns in V9')
- EukProt_ID_same_strain (new) - accession of EukProt datasets from the same isolate
- EukProt_ID_different_strain (new) - accession of EukProt datasets from a different isolate of the same species
- columns_modified_since_previous_version (new) - lists all of the 6 pre-existing columns that have a modified content compared to version 1
- remarks (new) - additional information such as presence of an intron in the V9 fragment, taxonomic identity of the two parts of chimeric sequences, or the presence of Ns or a deletion in the V4 or the V9 fragment (but insufficient to warrant exclusion)
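Because the V9 column changed between releases ('Y'/'N' in version 1, 'yes - ...'/'no - ...' in version 2), a reader that handles both conventions can be useful. The following is a minimal sketch using only the Python standard library. It assumes the tsv files start with a header row whose column names match those listed above, which the ReadMe does not state explicitly; the accession values in the usage example are made up.

```python
import csv
import gzip
import io


def read_eukribo_tsv(handle):
    """Parse an open text handle of an EukRibo tsv table into a list of dicts.

    Assumes the first row is a header with the column names from the ReadMe.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


def has_v9(row):
    """Return True if the row reports a sufficiently complete V9 fragment.

    Version 1 uses 'Y'/'N'; version 2 uses values starting with 'yes' or 'no'.
    """
    value = row["V9"].strip().lower()
    return value == "y" or value.startswith("yes")


def load_gz(path):
    """Read a gzip-compressed EukRibo tsv file such as 46346_EukRibo-02.tsv.gz."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return read_eukribo_tsv(handle)


if __name__ == "__main__":
    # Tiny in-memory example with hypothetical accessions.
    sample = (
        "gb_accession\tsupergroup\tV9\n"
        "XX000001\tAlveolata\tyes - complete\n"
        "XX000002\tRhizaria\tno - chimera\n"
    )
    rows = read_eukribo_tsv(io.StringIO(sample))
    print([row["gb_accession"] for row in rows if has_v9(row)])
```

Filtering with `has_v9` on the version 2 table should select the same sequences that appear in the V9 fasta file, if the header assumption holds.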
A curated catalog of biological databases worldwide that provides an overview of the global database landscape and enables easy retrieval of, and access to, specific collections of databases of interest. The catalog covers the databases as well as their curated meta-information and derived statistics.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here, we make available the supplemental material regarding data collection from the publication "Research Data Curation in Visualization: Position Paper". The dataset represents an aggregated collection of the data policies of selected publication venues in the areas of visualization, computer graphics, software, HCI, and Virtual Reality with inclusions from multimedia, collaboration, and network visualization, for the years 2021-2022. Based on a derived index, long-term preservation and data sharing are evaluated for each venue. The index ranges from No policy to Required sharing and preservation. Additionally, the verbatim statements (or the lack thereof) used to reach the concluded score are also provided.
Abstract: Research data curation is the act of carefully preparing research data and artifacts for sharing and long-term preservation. Research data management is centrally implemented and formally defined in a data management plan to enable data curation. In tandem, data curation and management facilitate research repeatability. In contrast to other research fields, data curation and management in visualization are not yet part of the researcher’s compendium. In this position paper, we discuss the unique challenges visualization faces and propose how data curation can be practically realized. We share eight lessons learned in managing data in two large research consortia, outline the larger curation workflow, and define the typical roles. We complement our lessons with minimum criteria for selecting a suitable data repository and five challenging scenarios that occur in practice. We conclude with a vision of how the visualization research community can pave the way for new curation standards.
https://www.verifiedmarketresearch.com/privacy-policy/
Data Prep Market size was valued at USD 4.02 Billion in 2024 and is projected to reach USD 16.12 Billion by 2031, growing at a CAGR of 19% from 2024 to 2031.
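The relation between a start value, an end value, and a compound annual growth rate (CAGR) can be written out in a few lines. The sketch below uses the two market sizes quoted above; the number of compounding periods is an assumption, since the text does not say whether 2024 counts as the base year, so it is left as a parameter.

```python
def cagr(start, end, periods):
    """Compound annual growth rate that takes `start` to `end` in `periods` years."""
    return (end / start) ** (1.0 / periods) - 1.0


def project(start, rate, periods):
    """Value reached from `start` after compounding `rate` for `periods` years."""
    return start * (1.0 + rate) ** periods


if __name__ == "__main__":
    start_usd_bn = 4.02   # 2024 value quoted in the text
    end_usd_bn = 16.12    # 2031 value quoted in the text
    for periods in (7, 8):  # assumed period counts; the source does not specify
        rate = cagr(start_usd_bn, end_usd_bn, periods)
        print(f"{periods} periods -> implied CAGR {rate:.2%}")
```

`project` is the inverse operation, so `project(start, cagr(start, end, n), n)` returns `end` up to floating-point rounding.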
Global Data Prep Market Drivers
Increasing Demand for Data Analytics: Businesses across all industries are increasingly relying on data-driven decision-making, which requires clean, reliable, and useful information. This rising reliance on data increases the demand for better data preparation technologies, which are required to transform raw data into meaningful insights.
Growing Volume and Complexity of Data: The increase in data generation continues unabated, with information streaming in from a variety of sources. This data frequently lacks consistency or organization, so effective data preparation is critical for accurate analysis. To assure quality and coherence while dealing with such a large and complicated data landscape, powerful technologies are required.
Increased Use of Self-Service Data Preparation Tools: User-friendly, self-service data preparation solutions are gaining popularity because they enable non-technical users to access, clean, and prepare data independently. This democratizes data access, decreases reliance on IT departments, and speeds up the data analysis process, making data-driven insights more available to all business units.
Integration of AI and ML: Advanced data preparation technologies are progressively using AI and machine learning capabilities to improve their effectiveness. These technologies automate repetitive activities, detect data quality issues, and recommend data transformations, increasing productivity and accuracy. The use of AI and ML streamlines the data preparation process, making it faster and more reliable.
Regulatory Compliance Requirements: Many businesses are subject to tight regulations governing data security and privacy. Data preparation technologies play an important role in ensuring that data meets these compliance requirements. By providing functions that help manage and protect sensitive information, these technologies help firms navigate complex regulatory environments.
Cloud-based Data Management: The transition to cloud-based data storage and analytics platforms requires data preparation solutions that can work smoothly with cloud-based data sources. These solutions must be able to integrate with a variety of cloud environments to support effective data administration and preparation as well as modern data infrastructure.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ChemTastesDB is a database that includes curated information on 4075 molecular tastants. ChemTastesDB is distributed to the scientific community to expand the information available on molecular tastants, which could assist the analysis of the relationships between molecular structure and taste, as well as in silico (QSAR/QSPR) studies for taste prediction. Examples of QSPR approaches for the prediction of molecular taste are given in the following publication: Rojas, C., Abril-González, M., Ballabio, D. & García, F. (2025). ChemTastesPredictor: An ensemble of machine learning classifiers to predict the taste of molecular tastants. Chemometrics and Intelligent Laboratory Systems. 261, 105380. https://doi.org/10.1016/j.chemolab.2025.105380.
The 4075 molecular tastants are categorized into one of the five basic tastes (sweet, bitter, umami, sour and salty) or into other classes related to non-basic tastes (tasteless, non-sweet, non-bitter, multitaste and miscellaneous). The molecules are categorized into the following ten classes: sweet (1313), bitter (1615), umami (220), sour (49), salty (16), multitaste (179), tasteless (232), non-sweet (304), non-bitter (28), and miscellaneous (119).
ChemTastesDB provides the following information for each molecule: name, PubChem CID, CAS registry number, canonical SMILES string, taste class and the reference to the scientific sources from which the data were retrieved. In addition, the molecular structure of each compound is provided in the HyperChem (.hin) format.
This is version 2.1 of the ChemTastesDB. In this new version, 1131 newly curated compounds were added. These new molecules were retrieved from 52 new bibliographic references.
Manually curated database of all conditions with known genetic causes, focusing on medically significant genetic data with available interventions. Includes gene symbol, conditions, allelic conditions, inheritance, age at which interventions are indicated, clinical categorization, and general description of interventions/rationale. Contents are intended to describe types of interventions that might be considered. Includes only single gene alterations and does not include genetic associations or susceptibility factors related to more complex diseases.
Milky seas are a rare form of nocturnal oceanic bioluminescence distinguished by a steady, non-flashing, white/gray/green glow. Scientific inquiry into milky seas has, for centuries, been held back by the remote, ephemeral nature of this phenomenon. Combining centuries of eyewitness accounts with modern satellite observations, we present a curated list of milky sea observations since 1600. This database greatly expands the ability to study when and where milky seas occur, as well as the commonly observed features of a milky sea.
# A curated database of milky sea observations from 1600 to present
https://doi.org/10.5061/dryad.0gb5mkmbc
The data in this archive was collected for the paper "From Sailors to Satellites: A Curated Database of Milky Seas Since 1600". By combining centuries of eyewitness accounts with satellite observations for the first time, we hope to expand the ability to study and understand milky seas.
Description: A human-readable PDF of every eyewitness account and satellite observation within the database. This file contains the date, location, description, and who reported the account for every milky sea observation. This is Supplemental 1 for the paper.
Description: A machine-readable tab-separated values file containing detailed information on every milky sea observation within the database. ...,
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multilevel logit regression results for predictors of bin membership (non-substantive database).
LOCATE is a curated database that houses data describing the membrane organization and subcellular localization of proteins from the RIKEN FANTOM4 mouse and human protein sequence set. The membrane organization is predicted by the high-throughput, computational pipeline MemO. The subcellular locations were determined by a high-throughput, immunofluorescence-based assay and by manually reviewing peer-reviewed publications.
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
An expert-curated database of pharmacological targets with quantitative information on the prescription medicines and experimental drugs that act on them.
The aim of the Data Rescue & Curation Best Practices Guide is to provide an accessible and hands-on approach to handling data rescue and digital curation of at-risk data for use in secondary research. We provide a set of examples and workflows for addressing common challenges with social science survey data that can be applied to other social and behavioural research data. The goal of this guide and set of workflows presented is to improve librarians’ and data curators’ skills in providing access to high-quality, well-documented, and reusable research data. The aspects of data curation that are addressed throughout this guide are adopted from long-standing data library and archiving practices, including: documenting data using standard metadata, file and data organization; using open and software-agnostic formats; and curating research data for reuse.
The ECMDB is an expertly curated database containing extensive metabolomic data and metabolic pathway diagrams about Escherichia coli (strain K12, MG1655). This database includes significant quantities of “original” data compiled by members of the Wishart laboratory as well as additional material derived from hundreds of textbooks, scientific journals, metabolic reconstructions and other electronic databases. Each metabolite is linked to more than 100 data fields describing the compound, its ontology, physical properties, reactions, pathways, references, external links and associated proteins or enzymes.
Organisms living in honey bees and honey bee colonies form large associative holobiont communities that are integral to bee biology. High-throughput sequencing approaches to characterize these holobiont communities from honey bees in various states of health and disease are now commonplace, producing large amounts of nucleotide sequence data that must be accurately and consistently analyzed in order to produce reliable and comparable reports. In addition, new species designations and revisions are actively being made from honey bee holobiont communities, complicating nomenclature in larger databases where taxonomic descriptions associated with archived sequences can quickly become outdated and misleading. To improve the accuracy and consistency of honey bee holobiont research, we have developed HoloBee: a curated database of publicly accessioned nucleotide sequences from the honey bee holobiont community. Except in rare cases noted by curators, sequences used in HoloBee were obtained from, or in association with, Apis mellifera (Western honey bee) as well as other honey bee species where available (e.g. Apis cerana, Apis dorsata, Apis laboriosa, Apis koschevnikovi, Apis florea, Apis andreniformis and Apis nigrocincta). Sources include: within or on the surface of honey bees (adult, pupae, larvae, egg), corbicular pollen, bee bread, royal jelly, honey, comb, hive surfaces (e.g. bottom board debris, frames, landing platforms), and isolates of microbes, parasites and pathogens from honey bees. HoloBee contains two non-overlapping sets of sequence data, HoloBee-Barcode and HoloBee-Mop, each of which has a distinct intended use. HoloBee-Barcode is a non-redundant database of taxonomically informative barcoding loci for all viruses, bacteria, fungi, protozoans and metazoans associated with honey bees (Apis spp.). It was created from an exhaustive master sequence archive of all valid holobiont sequences.
Redundancy was removed from this master archive using a clustering algorithm that grouped sequences with ≥ 99% identity and retained the longest sequence from each cluster as the representative accession for that sequence type (“centroid”). These centroid sequences were concatenated into a fasta formatted file to create the HoloBee-Barcode database. Associated taxonomy for each centroid, including Superkingdom through Species and Strain/Isolate, was individually reviewed and corrected when necessary by a curator. Cross reference tables (separated according to 5 major taxonomic groups) provide a user-friendly outline of information for each centroid accession within HoloBee-Barcode including taxonomy, gene/product name, sequence length, the unaltered NCBI definition line, the number and identity of redundant sequences clustered within each centroid, and any additional information provided by the curator. HoloBee-Barcode centroid counts are: Viruses = 86; Bacteria = 496; Fungi = 41; Protozoa = 4; Metazoa = 60. HoloBee-Barcode is intended to improve and standardize quantitative and qualitative metagenomic descriptions of holobiont communities associated with honey bees by providing a curated set of barcode sequences. The goal of genetic barcoding is to associate a nucleotide sequence sample with a taxonomically valid species. Genomic regions targeted for such barcoding purposes vary by taxonomic group. The small subunit (SSU) ribosomal RNA, or 16S rRNA, is the most commonly used barcode for bacteria and is used in HoloBee-Barcode. These 16S rRNA sequences will support the analysis of data generated with the widely used approach of amplicon-based 16S rRNA deep sequencing to study microbiota communities. Although barcode markers for fungi are less definitive than those for bacteria, HoloBee-Barcode defaults to the ribosomal RNA internal transcribed spacer region (ITS), which typically includes ITS-1, 5.8S, and ITS-2.
For some clades that cannot be resolved by this region, other barcode markers were selected. The majority of barcodes for metazoan taxa are the mitochondrial locus cytochrome c oxidase subunit I (COI). Complete mitochondrial DNA (mtDNA) sequences for Apis cerana (Asian honey bee) and Galleria mellonella (Greater wax moth) are included as barcodes for these species. We note that A. cerana mtDNA is included because it is considered a potentially invasive honey bee species and monitoring for its occurrence is practiced regionally, including in Australia, New Zealand and the USA. Protozoan barcodes include cytochrome b oxidase (Cytb), SSU, or ITS while entire genomes are used for viral barcoding. HoloBee-Mop is a database comprised mostly of chromosomal, mitochondrial and plasmid genome assemblies in order to aggregate as much honey bee holobiont genomic sequence information as possible. For a few organisms without genome assembly data, transcriptome data are included (e.g. Aethina tumida, small hive beetle). Unlike HoloBee-Barcode, redundancy removal was not performed on the HoloBee-Mop database and thus this resource provides an archive of nucleotide sequence assemblies from honey bee holobionts. However, since full viral genomes are used in HoloBee-Barcode, only redundant viral sequences occur in HoloBee-Mop. All accessions within each of these assemblies were concatenated into a single fasta formatted file to create the HoloBee-Mop database. The intended purpose of HoloBee-Mop is to improve honey bee genome and transcriptome assemblies by “mopping-up” as much viral, bacterial, fungal, protozoan and non-honey bee metazoan sequence data as possible. Therefore, sequence data remaining after processing reads through both HoloBee-Barcode and HoloBee-Mop that do not map to the honey bee genome may contain unique data from taxonomic variants or novel species.
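The redundancy removal used for HoloBee-Barcode (group sequences at ≥ 99% identity, keep the longest member of each group as the centroid) can be illustrated with a greedy, length-sorted clustering. The description does not name the clustering tool, so this is only a sketch of the general idea and not the curators' procedure. In particular, `ungapped_identity` is a toy stand-in that compares positions without alignment; a real workflow would use an alignment-based identity from a dedicated clustering program. The sequence names in the example are hypothetical.

```python
def ungapped_identity(a, b):
    """Toy identity: fraction of matching positions over the shorter sequence.

    This assumes both sequences start at the same position and contain no
    indels; it is a placeholder for a proper alignment-based identity.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def cluster_centroids(seqs, threshold=0.99):
    """Greedy clustering: the longest unassigned sequence seeds each cluster.

    Returns a dict mapping each centroid name to the names of the redundant
    sequences that were merged into it.
    """
    clusters = {}
    # Longest first, so every centroid is the longest member of its cluster.
    for name in sorted(seqs, key=lambda k: len(seqs[k]), reverse=True):
        for centroid in clusters:
            if ungapped_identity(seqs[centroid], seqs[name]) >= threshold:
                clusters[centroid].append(name)
                break
        else:
            clusters[name] = []
    return clusters


if __name__ == "__main__":
    base = "ACGT" * 50
    seqs = {
        "seq_a": base,                 # longest copy, becomes a centroid
        "seq_b": base[:-1] + "A",      # one mismatch in 200 positions
        "seq_c": base[:160],           # shorter, identical fragment
        "seq_d": "T" * 200,            # unrelated sequence
    }
    for centroid, members in cluster_centroids(seqs).items():
        print(centroid, "<-", members)
```

The returned mapping mirrors the "number and identity of redundant sequences clustered within each centroid" reported in the cross reference tables.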
Details for each sequence assembly within HoloBee-Mop are tabulated in cross reference tables according to each major taxonomic group. HoloBee-Mop assembly counts are: Viruses = 2; Bacteria = 55; Fungi = 5; Protozoa = 1; Metazoa = 6.
Follow the HoloBee database on Twitter at: https://twitter.com/HoloBee_db
For questions about the HoloBee database, contact:
- HoloBee database team: holobee.db@gmail.com
- Jay Evans: Jay.Evans@ars.usda.gov
- Anna Childers: Anna.Childers@ars.usda.gov
Resources in this dataset:
Resource Title: HoloBee_v2016.1 sequence database.
File Name: HB_v2016.1.zip
Resource Description: This compressed file contains two fasta sequence files:
- HB_Bar_v2016.1.fasta (HoloBee-Barcode database)
- HB_Mop_v2016.1.fasta (HoloBee-Mop database)
md5 values:
- HB_v2016.1.zip: 6e372e443744282128eb51488176503f
- HB_Bar_v2016.1.fasta: 109e1f686a690c70ef78fc4b5066a01f
- HB_Mop_v2016.1.fasta: ced8c3f5987dce69e800c8c491471eba
Resource Title: data dictionary for HoloBee_v2016.1.
File Name: Data_Dictionary_HoloBee_v2016.1.xlsx
Resource Title: HoloBee_v2016.1 cross reference tables.
File Name: HB_v2016.1_crossref.zip
Resource Description: This compressed file contains ten spreadsheet files (.xlsx) tabulating detailed information for all centroids (HoloBee-Barcode database) and sequence assemblies (HoloBee-Mop database) used in HoloBee v2016.1:
- HB_Bar_v2016.1_bacteria_crossref_2016-05-18.xlsx
- HB_Bar_v2016.1_fungi_crossref_2016-05-20.xlsx
- HB_Bar_v2016.1_metazoa_crossref_2016-05-16.xlsx
- HB_Bar_v2016.1_protozoa_crossref_2016-05-20.xlsx
- HB_Bar_v2016.1_viruses_crossref_2016-05-17.xlsx
- HB_Mop_v2016.1_bacteria_crossref_2016-05-12.xlsx
- HB_Mop_v2016.1_fungi_crossref_2016-05-12.xlsx
- HB_Mop_v2016.1_metazoa_crossref_2016-04-15.xlsx
- HB_Mop_v2016.1_protozoa_crossref_2016-04-11.xlsx
- HB_Mop_v2016.1_viruses_crossref_2016-05-12.xlsx
md5 value:
- HB_v2016.1_crossref.zip: a8a57d92830eb77904743afc95980465
Resource Title: data dictionary for HoloBee_v2016.1.
File Name: Data_Dictionary_HoloBee_v2016.1.csv
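The md5 values published with the HoloBee files can be used to confirm that a download is intact. The following sketch uses Python's standard `hashlib` module; the checksums are copied from the resource descriptions above, while the helper names and the assumption that the files sit together in one local directory are illustrative.

```python
import hashlib
import os

# Checksums as listed in the HoloBee v2016.1 resource descriptions.
EXPECTED_MD5 = {
    "HB_v2016.1.zip": "6e372e443744282128eb51488176503f",
    "HB_Bar_v2016.1.fasta": "109e1f686a690c70ef78fc4b5066a01f",
    "HB_Mop_v2016.1.fasta": "ced8c3f5987dce69e800c8c491471eba",
    "HB_v2016.1_crossref.zip": "a8a57d92830eb77904743afc95980465",
}


def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's md5 digest matches the expected value."""
    return md5sum(path) == expected.lower()


def check_all(directory="."):
    """Check every listed file found in `directory`; missing files are skipped."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if os.path.exists(path):
            results[name] = verify(path, expected)
    return results


if __name__ == "__main__":
    for name, ok in check_all().items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Reading in chunks keeps memory use flat even for the larger fasta files.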
This is the updated version of the dataset from 10.5281/zenodo.6320761
Information
The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. However, analysis of their coverage of compound entries and biological targets revealed considerable differences, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144648 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell-lines and phenotypic readouts). It also provides simplified information on assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, which eases generic use in multiple applications such as chemogenomics and data-driven drug design. The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.
This dataset belongs to the publication: https://doi.org/10.3390/molecules27082513
Structure and content of the dataset
Dataset structure (columns, in order):
ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check (Tanimoto) | Source
The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file. Except for the canonical SMILES columns, all columns are filled with the datatype ‘string’. The datatype for the canonical SMILES columns is the smiles-format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly. In addition, only this node can read the compressed format.
Column content:
ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifiers of the databases
Target: biological target of the molecule expressed as the HGNC gene symbol
Activity type: for example, pIC50
Assay type: simplification/classification of the assay into cell-free, cellular, functional and unspecified
Unit: unit of bioactivity measurement
Mean columns of the databases: mean of bioactivity values or activity comments denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) -> the value for this compound-target pair occurs 15 times in the ChEMBL database
Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence
  no comment: bioactivity values are within one log unit;
  check activity data: bioactivity values are not within one log unit;
  only one data point: only one value was available, no comparison and no range calculated;
  no activity value: no precise numeric activity value was available;
  no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
Ligand names: all unique names contained in the five source databases are listed
Canonical SMILES columns: molecular structure of the compound from each database
Structure check (Tanimoto): denotes matching or differing compound structures in different source databases
  match: molecule structures are the same between different sources;
  no match: the structures differ. We calculated the Jaccard-Tanimoto similarity coefficient from Morgan Fingerprints to reveal true differences between sources and reported the minimum value;
  1 structure: no structure comparison is possible, because there was only one structure available;
  no structure: no structure comparison is possible, because there was no structure available.
Source: the databases from which the data originate
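The column conventions described above can be illustrated with a short sketch. It is not the KNIME workflow used to build the dataset. It assumes that a Mean cell always has the layout shown in the example ("7.5 *(15)"), that "within one log unit" means the spread between the largest and smallest value is at most 1.0, and it uses plain sets of on-bit indices as a stand-in for Morgan fingerprints. All function names are hypothetical.

```python
import re

# Assumed cell layout, taken from the example "7.5 *(15)": value, then "*(count)".
MEAN_CELL = re.compile(r"^\s*(?P<value>.+?)\s*\*\((?P<count>\d+)\)\s*$")


def parse_mean_cell(cell):
    """Split a Mean cell into (value, count); value is a float when numeric, else the raw comment."""
    match = MEAN_CELL.match(cell)
    if match is None:
        return None
    raw = match.group("value")
    try:
        value = float(raw)
    except ValueError:
        value = raw  # an activity comment rather than a number
    return value, int(match.group("count"))


def activity_check(values):
    """Annotate the per-database values of one compound-target pair (assumed precedence of rules)."""
    numeric = [v for v in values if isinstance(v, float)]
    if not numeric:
        return "no activity value"
    if len(numeric) == 1:
        return "only one data point"
    # Assumption: "within one log unit" means max - min <= 1.0.
    if max(numeric) - min(numeric) <= 1.0:
        return "no comment"
    return "check activity data"


def tanimoto(bits_a, bits_b):
    """Jaccard-Tanimoto coefficient of two sets of fingerprint on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    # Convention chosen here: two empty fingerprints count as identical.
    return len(a & b) / len(union) if union else 1.0
```

For real structures, the on-bit sets would come from a cheminformatics toolkit; the sketch only shows how the coefficient and the annotations relate to the values stored in the table.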
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the fasta file containing the phage genomes used to generate the Curated Phage Database (CPD) utilized in our manuscript "The circulating phageome reflects bacterial infections", which is currently in review. The corresponding phage characteristic data will be provided with the manuscript as a supplemental file and can be used to connect a Genbank ID to the identified bacteriophage host and the phage taxonomic information, if known.
Please note that this database is built from phage sequences in the NCBI nucleotide repository. Because the field is biased towards sequencing human disease-related bacteria and their phages, the database reflects this bias: it is most representative of bacteriophages associated with human pathogens and underrepresents environmental phages in comparison. Keep this limitation in mind when using the database to interpret potential phage sequences.
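Linking a sequence in the fasta file to its host and taxonomy, as described above, amounts to a lookup keyed on the Genbank ID. The sketch below is only illustrative: it assumes the Genbank ID is the first whitespace-delimited token of each fasta header, and it represents the supplemental characteristics table as a plain dictionary with hypothetical "host" and "taxonomy" keys, since the actual layout of that file is not given here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession_of(header):
    """Assumption: the Genbank ID is the first whitespace-delimited token of the header."""
    tokens = header.split()
    return tokens[0] if tokens else ""


def annotate(fasta_lines, characteristics):
    """Join each sequence with its host and taxonomy.

    characteristics: {accession: {"host": ..., "taxonomy": ...}}, a hypothetical
    in-memory form of the supplemental phage characteristics table.
    """
    table = {}
    for header, sequence in read_fasta(fasta_lines):
        accession = accession_of(header)
        info = characteristics.get(accession, {})
        table[accession] = {
            "length": len(sequence),
            "host": info.get("host"),          # None when the host is unknown
            "taxonomy": info.get("taxonomy"),  # None when the taxonomy is unknown
        }
    return table
```

Passing an open file handle as `fasta_lines` avoids loading the whole genome collection into memory at once; entries without a match in the characteristics table keep `None` for host and taxonomy.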
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ActDES constitutes a novel resource for the community of Actinobacterial researchers that will be useful primarily for two types of analyses: (i) comparative genomic studies, facilitated by reliable ortholog identification across a set of defined, phylogenetically representative genomes, and (ii) phylogenomic studies, which will be improved by the identification of gene subsets at a specified taxonomic level. These studies can then act as a springboard for studying the evolution of virulence genes and of metabolism, and for identifying metabolic engineering targets.
A manually curated database of small molecule metabolites found in or produced by Saccharomyces cerevisiae (also known as baker's yeast and brewer's yeast). This database covers metabolites described in textbooks, scientific journals, metabolic reconstructions and other electronic databases. YMDB contains metabolites arising from normal S. cerevisiae metabolism under defined laboratory conditions as well as metabolites generated by S. cerevisiae when used in baking and in the production of wines, beers and spirits. YMDB currently contains 2027 small molecules with 857 associated enzymes and 138 associated transporters. Each small molecule has 48 data fields describing the metabolite, its chemical properties and links to spectral and chemical databases. Each enzyme/transporter is linked to its associated metabolites and has 30 data fields describing both the gene and the corresponding protein. Users may search through the YMDB using a variety of database-specific tools. The simple text query supports general text queries of the textual component of the database. By selecting either metabolites or proteins in the "Search for" field, it is possible to restrict the search and the returned results to only those data associated with metabolites or with proteins. Clicking on the Browse button generates a tabular synopsis of YMDB's content. This browser view allows users to casually scroll through the database or re-sort its contents. Clicking on a given MetaboCard button brings up the full data content for the corresponding metabolite. A complete explanation of all the YMDB fields and sources is available. Under the Search link users will find a number of search options listed in a pull-down menu. The Chem Query option allows users to draw (using a MarvinSketch applet or a ChemSketch applet) or to type (as a SMILES string) a chemical compound and to search the YMDB for chemicals similar or identical to the query compound.
The Advanced Search option supports a more sophisticated text search of the text portion of YMDB. The Sequence Search button allows users to conduct BLASTP (protein) sequence searches of all sequences contained in YMDB. Both single and multiple sequence (i.e. whole proteome) BLAST queries are supported. YMDB also supports a Data Extractor option that allows specific data fields or combinations of data fields to be searched and/or extracted. Spectral searches of YMDB's reference compound NMR and MS spectral data are also supported through its MS, MS/MS, GC/MS and NMR Spectra Search links. Users may download YMDB's complete textual data, chemical structures and sequence data by clicking on the Download button.