28 datasets found
  1. Database Commons

    • rrid.site
    • dknet.org
    • +1 more
    Updated Jun 24, 2025
    Cite
    (2025). Database Commons [Dataset]. http://identifiers.org/RRID:SCR_023661/resolver?q=*&i=rrid
    Description

    Curated catalog of biological databases worldwide. It provides a landscape of these databases and enables easy retrieval of, and access to, specific collections of interest. The catalog also includes curated meta-information and derived statistics.

  2. Data from: Analysis of Commercial and Public Bioactivity Databases

    • acs.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Pekka Tiikkainen; Lutz Franke (2023). Analysis of Commercial and Public Bioactivity Databases [Dataset]. http://doi.org/10.1021/ci2003126.s003
    Dataset provided by
    ACS Publications
    Authors
    Pekka Tiikkainen; Lutz Franke
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Activity data for small molecules are invaluable in chemoinformatics. Various bioactivity databases exist containing detailed information of target proteins and quantitative binding data for small molecules extracted from journals and patents. In the current work, we have merged several public and commercial bioactivity databases into one bioactivity metabase. The molecular presentation, target information, and activity data of the vendor databases were standardized. The main motivation of the work was to create a single relational database which allows fast and simple data retrieval by in-house scientists. Second, we wanted to know the amount of overlap between databases by commercial and public vendors to see whether the former contain data complementing the latter. Third, we quantified the degree of inconsistency between data sources by comparing data points derived from the same scientific article cited by more than one vendor. We found that each data source contains unique data which is due to different scientific articles cited by the vendors. When comparing data derived from the same article we found that inconsistencies between the vendors are common. In conclusion, using databases of different vendors is still useful since the data overlap is not complete. It should be noted that this can be partially explained by the inconsistencies and errors in the source data.
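
    The overlap and inconsistency comparison described above can be illustrated with a minimal sketch. The record layout (article, compound, target, activity value), the two toy sources, and the tolerance are hypothetical; none of them are taken from the paper or its data files.

```python
# Hypothetical records: (article_id, compound, target, activity_value).
# All identifiers and values below are illustrative only.
source_a = [
    ("art1", "cmpd1", "P001", 7.2),
    ("art1", "cmpd2", "P001", 6.0),
    ("art2", "cmpd3", "P002", 8.1),
]
source_b = [
    ("art1", "cmpd1", "P001", 7.2),
    ("art1", "cmpd2", "P001", 5.0),
    ("art3", "cmpd4", "P003", 6.5),
]


def index_by_key(records):
    """Map (article, compound, target) to the reported activity value."""
    return {(art, cmpd, tgt): value for art, cmpd, tgt, value in records}


def compare_sources(a_records, b_records, tolerance=0.01):
    """Count shared data points, unique data points, and shared points that disagree."""
    a, b = index_by_key(a_records), index_by_key(b_records)
    shared = a.keys() & b.keys()
    # A shared data point is inconsistent when the two sources report different values.
    inconsistent = {key for key in shared if abs(a[key] - b[key]) > tolerance}
    return {
        "shared": len(shared),
        "only_a": len(a.keys() - b.keys()),
        "only_b": len(b.keys() - a.keys()),
        "inconsistent": len(inconsistent),
    }


summary = compare_sources(source_a, source_b)
print(summary)
```

    Keying on the cited article keeps the inconsistency check restricted to data points that both sources derived from the same publication, which is the comparison the description refers to.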

  3. IDPredictor: predict database links in biomedical database. Supplementary...

    • doi.ipk-gatersleben.de
    Updated Jan 1, 2012
    Cite
    Matthias Lange; Hendrik Mehlhorn; Uwe Scholz; Falk Schreiber (2012). IDPredictor: predict database links in biomedical database. Supplementary material A.3 for the paper [Dataset]. https://doi.ipk-gatersleben.de/DOI/ce9f7e62-56e5-4554-bb11-d7ab29e6fa1d/dd34a994-daf0-4b7f-9809-d875c1e771d2/2
    Dataset provided by
    e!DAL - Plant Genomics and Phenomics Research Data Repository (PGP), IPK Gatersleben, Seeland OT Gatersleben, Corrensstraße 3, D-06466, Germany
    Authors
    Matthias Lange; Hendrik Mehlhorn; Uwe Scholz; Falk Schreiber
    Description

    Supplementary material A.3 for the paper 'IDPredictor: predict database links in biomedical database'. Abstract: Knowledge found in biomedical databases, in particular in Web information systems, is a major bioinformatics resource. In general, this biological knowledge is represented worldwide in a network of databases. These data are spread among thousands of databases, which overlap in content but differ substantially with respect to content detail, interface, formats and data structure. To support a functional annotation of lab data, such as protein sequences, metabolites or DNA sequences, as well as semi-automated data exploration in information retrieval environments, an integrated view of databases is essential. Search engines have the potential to assist in data retrieval from these structured sources, but fall short of providing a comprehensive knowledge excerpt from the interlinked databases. A prerequisite for supporting the concept of an integrated data view is acquiring insights into cross-references among database entities. But only a fraction of all possible cross-references are explicitly tagged in the particular biomedical information systems. In this work, we investigate to what extent an automated construction of an integrated data network is possible. We propose a method that predicts and extracts cross-references from multiple life science databases and their possible referenced data targets. We study the retrieval quality of our method and the relationship between manually crafted relevance ranking and relevance ranking based on cross-references, and report on first, promising results.

  4. DNA DataBank of Japan (DDBJ)

    • neuinfo.org
    • dknet.org
    • +2 more
    Updated Mar 24, 2025
    Cite
    (2025). DNA DataBank of Japan (DDBJ) [Dataset]. http://identifiers.org/RRID:SCR_002359
    Description

    Maintains and provides archival, retrieval and analytical resources for biological information. The central DDBJ resource consists of public, open-access nucleotide sequence databases including raw sequence reads, assembly information and functional annotation. Database content is exchanged with EBI and NCBI within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). In 2011, DDBJ launched two new resources: DDBJ Omics Archive (DOR) and BioProject. DOR is an archival database of functional genomics data generated by microarray and highly parallel new generation sequencers. Data are exchanged between ArrayExpress at EBI and DOR in the common MAGE-TAB format. BioProject provides an organizational framework to access metadata about research projects and data from projects that are deposited into different databases.

  5. Toward Reproducible Computational Research: An Empirical Analysis of Data...

    • plos.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Victoria Stodden; Peixuan Guo; Zhaokun Ma (2023). Toward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals [Dataset]. http://doi.org/10.1371/journal.pone.0067111
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Victoria Stodden; Peixuan Guo; Zhaokun Ma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Journal policy on research data and code availability is an important part of the ongoing shift toward publishing reproducible computational science. This article extends the literature by studying journal data sharing policies by year (for both 2011 and 2012) for a referent set of 170 journals. We make a further contribution by evaluating code sharing policies, supplemental materials policies, and open access status for these 170 journals for each of 2011 and 2012. We build a predictive model of open data and code policy adoption as a function of impact factor and publisher and find higher impact journals more likely to have open data and code policies and scientific societies more likely to have open data and code policies than commercial publishers. We also find open data policies tend to lead open code policies, and we find no relationship between open data and code policies and either supplemental material policies or open access journal status. Of the journals in this study, 38% had a data policy, 22% had a code policy, and 66% had a supplemental materials policy as of June 2012. This reflects a striking one year increase of 16% in the number of data policies, a 30% increase in code policies, and a 7% increase in the number of supplemental materials policies. We introduce a new dataset to the community that categorizes data and code sharing, supplemental materials, and open access policies in 2011 and 2012 for these 170 journals.

  6. European Molecular Biology Laboratory Australian Mirror

    • gbif.org
    • researchdata.edu.au
    • +1 more
    Updated Aug 21, 2025
    Cite
    GBIF (2025). European Molecular Biology Laboratory Australian Mirror [Dataset]. http://doi.org/10.15468/ypsvix
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    European Molecular Biology Laboratory (http://www.embl.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The EBI is a centre for research and services in bioinformatics. The Institute manages databases of biological data including nucleic acid, protein sequences and macromolecular structures.

    As we move towards understanding biology at the systems level, access to large data sets of many different types has become crucial. Technologies such as genome-sequencing, microarrays, proteomics and structural genomics have provided 'parts lists' for many living organisms, and researchers are now focusing on how the individual components fit together to build systems. The hope is that scientists will be able to translate their new insights into improving the quality of life for everyone. However, the high-throughput revolution also threatens to drown us in data. There is an ongoing, and growing, need to collect, store and curate all this information in ways that allow its efficient retrieval and exploitation. The European Bioinformatics Institute is one of the few places in the world that has the resources and expertise to fulfil this important task.

  7. Code Availability in the Journal of the American Statistical Association.

    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Victoria Stodden; Peixuan Guo; Zhaokun Ma (2023). Code Availability in the Journal of the American Statistical Association. [Dataset]. http://doi.org/10.1371/journal.pone.0067111.t001
    Dataset provided by
    PLOS ONE
    Authors
    Victoria Stodden; Peixuan Guo; Zhaokun Ma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code Availability in the Journal of the American Statistical Association.

  8. RESNET

    • rrid.site
    • scicrunch.org
    • +2 more
    Updated Jan 29, 2022
    Cite
    (2022). RESNET [Dataset]. http://identifiers.org/RRID:SCR_002121/resolver?q=&i=rrid
    Description

    Databases that represent sets of pre-compiled information on biological relationships and associations, interactions and facts which have been extracted from the biomedical literature using Ariadne's MedScan technology. ResNet databases store information harvested from the entire PubMed in a formal structure that allows searching, retrieval and updating by Pathway Studio users. ResNet is seamlessly installed when Pathway Studio is installed. There are several available ResNet databases:

    • ResNet Mammalian Database includes data for Human, Rat, and Mouse
    • ResNet Plant Database has data on Arabidopsis, Rice and several other plants

    Features of ResNet:

    • All extracted relations have linked access to the original article or abstract
    • Synonyms and homologs are included to maintain gene identity and to obviate redundancy in search results
    • Users can update ResNet as often as required using the MedScan technology built into all Ariadne products
    • Updates are made available by Ariadne every quarter

    To purchase Pathway Studio software with ResNet database, for information, or to schedule a web demonstration, call our sales department at (240) 453-6272, or (866) 340-5040 (toll free).

  9. Data from: MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022...

    • data.mendeley.com
    Updated Jul 25, 2022
    + more versions
    Cite
    Nirmalya Thakur (2022). MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak [Dataset]. http://doi.org/10.17632/xmcg82mx9k.3
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2

    Abstract: The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.

    Data Description: The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files.

    • Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022)
    • Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022)
    • Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022)
    • Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022)
    • Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022)
    • Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)

    The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
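
    As a rough illustration of preparing the files for hydration, the sketch below checks the per-file counts against the stated total and groups IDs into request-sized batches. It assumes one Tweet ID per line in each .txt file and a batch size of 100; both are assumptions, not details given in the description. The actual hydration call is left out because it depends on the API client and credentials in use.

```python
# File names and ID counts exactly as listed in the dataset description.
FILE_COUNTS = {
    "TweetIDs_Part1.txt": 13926,
    "TweetIDs_Part2.txt": 17705,
    "TweetIDs_Part3.txt": 17585,
    "TweetIDs_Part4.txt": 19718,
    "TweetIDs_Part5.txt": 47718,
    "TweetIDs_Part6.txt": 138711,
}


def read_ids(path):
    """Read Tweet IDs from a text file (assumed layout: one ID per line)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def batches(ids, size=100):
    """Yield consecutive chunks of at most `size` IDs (size is an assumed limit)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# The per-file counts should add up to the total stated in the description.
total = sum(FILE_COUNTS.values())
print(total)  # 255363
```

    Each batch produced by `batches(read_ids(path))` can then be passed to whichever hydration tool is used; the appropriate batch size should be checked against that tool's documentation.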

  10. Classification of Journal Policies.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Victoria Stodden; Peixuan Guo; Zhaokun Ma (2023). Classification of Journal Policies. [Dataset]. http://doi.org/10.1371/journal.pone.0067111.t005
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Victoria Stodden; Peixuan Guo; Zhaokun Ma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification of Journal Policies.

  11. ISI Classifications Represented in the Journal Titles.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Victoria Stodden; Peixuan Guo; Zhaokun Ma (2023). ISI Classifications Represented in the Journal Titles. [Dataset]. http://doi.org/10.1371/journal.pone.0067111.t002
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Victoria Stodden; Peixuan Guo; Zhaokun Ma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ISI Classifications Represented in the Journal Titles.

  12. FishNET: An automated relational database for zebrafish colony management

    • plos.figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Abiud Cantu Gutierrez; Manuel Cantu Gutierrez; Alexander M. Rhyner; Oscar E. Ruiz; George T. Eisenhoffer; Joshua D. Wythe (2023). FishNET: An automated relational database for zebrafish colony management [Dataset]. http://doi.org/10.1371/journal.pbio.3000343
    Dataset provided by
    PLOS Biology
    Authors
    Abiud Cantu Gutierrez; Manuel Cantu Gutierrez; Alexander M. Rhyner; Oscar E. Ruiz; George T. Eisenhoffer; Joshua D. Wythe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The zebrafish Danio rerio is a powerful model system to study the genetics of development and disease. However, maintenance of zebrafish husbandry records is both time intensive and laborious, and a standardized way to manage and track the large amount of unique lines in a given laboratory or centralized facility has not been embraced by the field. Here, we present FishNET, an intuitive, open-source, relational database for managing data and information related to zebrafish husbandry and maintenance. By creating a “virtual facility,” FishNET enables users to remotely inspect the rooms, racks, tanks, and lines within a given facility. Importantly, FishNET scales from one laboratory to an entire facility with several laboratories to multiple facilities, generating a cohesive laboratory and community-based platform. Automated data entry eliminates confusion regarding line nomenclature and streamlines maintenance of individual lines, while flexible query forms allow researchers to retrieve database records based on user-defined criteria. FishNET also links associated embryonic and adult biological samples with data, such as genotyping results or confocal images, to enable robust and efficient colony management and storage of laboratory information. A shared calendar function with email notifications and automated reminders for line turnover, automated tank counts, and census reports promote communication with both end users and administrators. The expected benefits of FishNET are improved vivaria efficiency, increased quality control for experimental numbers, and flexible data reporting and retrieval. FishNET’s easy, intuitive record management and open-source, end-user–modifiable architecture provides an efficient solution to real-time zebrafish colony management for users throughout a facility and institution and, in some cases, across entire research hubs.

  13. Neocryptolepine Derivatives Anti-tumor Database System (NDADS)

    • scidb.cn
    Updated Feb 27, 2025
    + more versions
    Cite
    Chen Peng; Zhang Hua (2025). Neocryptolepine Derivatives Anti-tumor Database System (NDADS) [Dataset]. http://doi.org/10.57760/sciencedb.21408
    Dataset provided by
    Science Data Bank
    Authors
    Chen Peng; Zhang Hua
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Neocryptolepine is a natural alkaloid isolated from the African climbing plant Cryptolepis sanguinolenta and belongs to the indoloquinoline alkaloid class. Medicinal chemists have studied it widely as a natural precursor because of its diverse biological activities, especially its potential anti-tumor, anti-inflammatory, and antimalarial applications. As a natural product with multiple biological activities, neocryptolepine has great potential in cancer treatment research, and its further research and development may provide new treatment options for cancer patients in the future.

    Cancer is a global health challenge that has long troubled the medical community and patients. It is a disease caused by the uncontrolled proliferation, invasion, and metastasis of abnormal cells, and it can affect any part of the human body. With changing lifestyles, worsening environmental pollution, and population aging, the incidence of cancer has increased year by year, and cancer has become the second leading cause of death in the world. Despite the potential of neocryptolepine in cancer treatment, the diversity of its derivatives, their mechanisms, and their unknown targets of action make it extremely challenging to identify their anti-cancer pathways. It is also difficult to find systematic information on anti-cancer neocryptolepine derivatives among the large amount of information available on the internet and elsewhere. Despite these challenges in screening and utilization, neocryptolepine derivatives remain important resources for drug development.

    To construct the NDADS database, literature search websites such as PubMed and Google Scholar were searched with keywords such as Neocryptolepine, Cancer, and Target to systematically collect key information on the generic name, anti-tumor activity, cancer type, mechanism of action, and targets of neocryptolepine and its derivatives. These data were integrated with data on 85 neocryptolepine derivatives from the laboratory, ultimately forming a database containing information on 203 anti-tumor compounds derived from neocryptolepine. To integrate and evaluate numerous research resources and results, the database provides retrieval and analysis tools, such as cross-database retrieval, citation retrieval, and journal retrieval, enabling users to easily search for anti-tumor information on neocryptolepine derivatives. It currently covers the names, structures, molecular weights, activities, functions, cancer types, cancer cells, targets/signaling pathways, references, and corresponding website sources of the compounds, and the interface supports queries over this content. The database should therefore help researchers study the potential of neocryptolepine derivatives in cancer treatment from multiple aspects, such as activity, structure, mechanism of action, and target, in support of cancer treatment and improved cancer survival rates.

  14. Net Changes in Journal Policy Classifications from 2011 to 2012.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Victoria Stodden; Peixuan Guo; Zhaokun Ma (2023). Net Changes in Journal Policy Classifications from 2011 to 2012. [Dataset]. http://doi.org/10.1371/journal.pone.0067111.t006
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Victoria Stodden; Peixuan Guo; Zhaokun Ma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Net Changes in Journal Policy Classifications from 2011 to 2012.

  15. FANTASIA V3 - LookUp Table - UniProt July 2025 - Experimental Evidence code

    • zenodo.org
    csv, tar
    Updated Jul 29, 2025
    Cite
    Francisco M. Perez-Canales; Ana Rojas (2025). FANTASIA V3 - LookUp Table - UniProt July 2025 - Experimental Evidence code [Dataset]. http://doi.org/10.5281/zenodo.16582433
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francisco M. Perez-Canales; Ana Rojas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📘 FANTASIA V3 - LookUp Table (UniProt July 2025)

    Experimental Evidence Code Only

    Overview

    This is a PostgreSQL database backup using the pgvector extension to store high-dimensional protein embeddings. It contains precomputed embeddings and functional annotations from the UniProt July 2025 release, including only entries supported by experimental evidence.

    This lookup table was generated using version v2.0.0 of the Protein Information System (PIS), an integrated biological information system designed for the automated extraction, processing, and management of protein-related data. PIS consolidates information from UniProt, PDB, and GOA, allowing efficient retrieval and organization of sequences, structures, and annotations.

    The resulting database is designed for compatibility with FANTASIA V3, an advanced pipeline for large-scale functional annotation of proteins using state-of-the-art Protein Language Models (PLMs). While the lookup table is stored in a vector database for persistence, FANTASIA loads the relevant data into memory at runtime to enable high-speed annotation.

    FANTASIA uses precomputed deep learning embeddings to perform nearest-neighbor searches in embedding space and transfer Gene Ontology (GO) terms from experimentally annotated proteins to query sequences.
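
    The nearest-neighbor annotation transfer described above can be sketched in a few lines. This is a toy illustration only: the in-memory lookup table, the three-dimensional embeddings, the cosine distance metric, and the choice of k are all assumptions made for the example and are not taken from FANTASIA or its database schema.

```python
import math

# Hypothetical toy lookup table: protein ID -> (embedding, GO terms).
# Real embeddings have far more dimensions; these values are illustrative.
LOOKUP = {
    "P1": ([1.0, 0.0, 0.0], {"GO:0003677"}),
    "P2": ([0.0, 1.0, 0.0], {"GO:0005524"}),
    "P3": ([0.9, 0.1, 0.0], {"GO:0003677", "GO:0006355"}),
}


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def transfer_go_terms(query, lookup, k=2):
    """Rank reference proteins by distance to the query and pool GO terms of the k nearest."""
    ranked = sorted(lookup, key=lambda pid: cosine_distance(query, lookup[pid][0]))
    nearest = ranked[:k]
    terms = set()
    for pid in nearest:
        terms |= lookup[pid][1]
    return nearest, terms


neighbors, terms = transfer_go_terms([1.0, 0.05, 0.0], LOOKUP, k=2)
print(neighbors, sorted(terms))
```

    A brute-force scan like this is only practical for small tables; at the scale listed under Dataset Details, an indexed or vectorized search would be the more realistic choice.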

    Dataset Details

    • Total proteins: 127,546

    • Total sequences: 124,397

    • Total embeddings: 621,849

    • Total GO annotations: 627,932

    • Included evidence codes (Gene Ontology, experimental only):

      • EXP – Inferred from Experiment

      • IDA – Inferred from Direct Assay

      • IPI – Inferred from Physical Interaction

      • IMP – Inferred from Mutant Phenotype

      • IGI – Inferred from Genetic Interaction

      • IEP – Inferred from Expression Pattern

      • TAS – Traceable Author Statement

      • IC – Inferred by Curator

    Included Embedding Models

    • ESM-2 (650M parameters)
      A transformer-based protein language model trained on UniRef50 using masked language modeling. It captures structural and functional features directly from raw sequences without requiring MSAs. ESM-2 is widely used for contact map prediction, unsupervised learning, and representation extraction.

    • ProtT5-XL-UniRef50 (~1.2B parameters)
      A large-scale encoder-decoder model using the T5 architecture, trained on UniRef50 via masked span prediction. It generates high-dimensional sequence representations that perform well across structure and function prediction tasks.

    • ProstT5 (~1.2B parameters)
      A multi-modal extension of ProtT5, trained to predict both sequence and coarse-grained 3Di structural states. Useful for downstream applications like contact prediction, functional annotation, and classification.

    • Ankh3-Large (620M parameters)
      An encoder-only T5-style model trained with masked span prediction. Optimized for fast inference, it encodes both semantic and structural protein information and can replace ProtT5 in many ML pipelines.

    • ESM3c (Cambrian 600M)
      Part of the new ESM C model family, trained on UniRef, MGnify, and JGI datasets. With rotary embeddings and 36 layers, it offers enhanced performance for masked language modeling, producing high-quality structural and functional embeddings without alignments.

    Missing Proteins

    A small subset of proteins could not be processed on the Finisterrae III (CESGA) supercomputer due to memory limitations with 40 GB A100 GPUs.

    The file missing_proteins.csv lists all affected UniProt identifiers. These entries are excluded from the final lookup table.

  16. Southern Ocean Diet and Energetics Database

    • search.dataone.org
    • npdc.nl
    • +1 more
    Updated Sep 14, 2021
    + more versions
    Cite
    Netherlands Polar Data Center (NPDC) (2021). Southern Ocean Diet and Energetics Database [Dataset]. https://search.dataone.org/view/sha256%3A431892ab62316b222f48c6417a60ad96678b27b58b470c78afdbec98731e61e9
    Dataset provided by
    Netherlands Polar Data Center (NPDC)
    Time period covered
    Jan 1, 1 - Jan 1, 2090
    Description

    Information related to diet and energy flow is fundamental to a diverse range of Antarctic and Southern Ocean biological and ecosystem studies. The SCAR Expert Groups on Antarctic Biodiversity Informatics (EG-ABI) and Birds and Marine Mammals (EG-BAMM) are collating a centralised database of such information to assist the scientific community in this work. It includes data related to diet and energy flow from conventional (e.g. gut content) and modern (e.g. molecular) studies, stable isotopes, fatty acids, and energetic content.

  17. Distribution of Impact Factors for Journal Titles.

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Victoria Stodden; Peixuan Guo; Zhaokun Ma (2023). Distribution of Impact Factors for Journal Titles. [Dataset]. http://doi.org/10.1371/journal.pone.0067111.t003
    Available download formats
    xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Victoria Stodden; Peixuan Guo; Zhaokun Ma
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distribution of Impact Factors for Journal Titles.

  18. Publishing Houses for Journal Titles.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Victoria Stodden; Peixuan Guo; Zhaokun Ma (2023). Publishing Houses for Journal Titles. [Dataset]. http://doi.org/10.1371/journal.pone.0067111.t004
    Available download formats
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Victoria Stodden; Peixuan Guo; Zhaokun Ma
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Publishing Houses for Journal Titles.

  19. Open Access and Open Data/Code Policies 2012.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Victoria Stodden; Peixuan Guo; Zhaokun Ma (2023). Open Access and Open Data/Code Policies 2012. [Dataset]. http://doi.org/10.1371/journal.pone.0067111.t010
    Available download formats
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Victoria Stodden; Peixuan Guo; Zhaokun Ma
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Open Access and Open Data/Code Policies 2012.

  20. Journal Review and Hosting Policies, 2012.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Victoria Stodden; Peixuan Guo; Zhaokun Ma (2023). Journal Review and Hosting Policies, 2012. [Dataset]. http://doi.org/10.1371/journal.pone.0067111.t007
    Available download formats
    xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Victoria Stodden; Peixuan Guo; Zhaokun Ma
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Journal Review and Hosting Policies, 2012.

(2025). Database Commons [Dataset]. http://identifiers.org/RRID:SCR_023661/resolver?q=*&i=rrid

Database Commons

RRID:SCR_023661, Database Commons (RRID:SCR_023661)

467 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jun 24, 2025
Description

Curated catalog of worldwide biological databases to provide landscape of biological databases throughout the world and enable easy retrieval and access to specific collection of databases of interest. Catalog of worldwide biological databases as well as their curated meta information and derived statistics.
