100+ datasets found

Data from: PROSITE
prosite.expasy.org
identifiers.org
+7more
Updated Oct 15, 2025
Cite
(2025). PROSITE [Dataset]. https://prosite.expasy.org/
Explore at:
Dataset updated
Oct 15, 2025
Description
PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.
CharProtDB: Characterized Protein Database
scicrunch.org
rrid.site
+2more
Updated Dec 4, 2011
Cite
(2011). CharProtDB: Characterized Protein Database [Dataset]. http://identifiers.org/RRID:SCR_005872
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005872
Dataset updated
Dec 4, 2011
Description
The Characterized Protein Database (CharProtDB) is designed and being developed as a resource of expertly curated, experimentally characterized proteins described in published literature. Each protein record in CharProtDB supports several data types: functional annotation (several instances of protein names and gene symbols), taxonomic classification, literature links, specific Gene Ontology (GO) terms and GO evidence codes, EC (Enzyme Commission) and TC (Transport Classification) numbers, and protein sequence. Additionally, each protein record is cross-linked to all public accessions in major protein databases as "synonymous accessions". Each of these data types can be linked to as many literature references as possible. Every CharProtDB entry requires a minimum set of data types: protein name, GO terms, and supporting reference(s) associated with GO evidence codes. Annotating with the GO system is important for several reasons: the GO system captures defined concepts (the GO terms) with unique IDs, which can be attached to specific genes, and the three controlled vocabularies of the GO capture much more annotation information than is traditionally captured in protein common names, including, for example, not just the function of the protein but its location as well. GO evidence codes implemented in CharProtDB directly correlate with the GO consortium definitions of experimental codes. CharProtDB tools link characterization data from multiple input streams through synonymous accessions or direct sequence identity. CharProtDB can represent multiple characterizations of the same protein, with proper attribution and links to database sources. Users can search the database with a variety of terms, including protein name, gene symbol, EC number, organism name, accessions, or any text. Following the search, a display page lists all the proteins that match the search term. Click on the protein name to view more detailed annotated information for each protein. Additionally, each protein record can be annotated.
TM Function Database
neuinfo.org
rrid.site
+2more
Updated Nov 16, 2024
Cite
(2024). TM Function Database [Dataset]. http://identifiers.org/RRID:SCR_007058
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007058
Dataset updated
Nov 16, 2024
Description
THIS RESOURCE IS NO LONGER IN SERVICE. Documented on October 29, 2025. Database of functional residues in alpha-helical and beta-barrel membrane proteins. Each protein is identified by its name and source along with its UniProt code. Protein Data Bank (PDB) codes are also given for available proteins. Different methods and experimental parameters, for example affinity, dissociation constant, IC50 and activity, are given in the database. Further, the database provides the numerical experimental value for each residue (or mutant) in a protein. The experimental data are collected from the literature, both by searching journals and by keyword search in PubMed. In addition, a complete reference is given with journal citation and PMID. TMFunction is cross-linked with the sequence database UniProt, the structural database PDB, and the literature database PubMed. The web interface enables users to search data based on various terms with different display options for outputs.
NCBI Protein Database
rrid.site
neuinfo.org
+2more
Updated Jan 29, 2022
Cite
(2001). NCBI Protein Database [Dataset]. http://identifiers.org/RRID:SCR_003257
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_003257
Dataset updated
Jan 29, 2022
Description
Databases of protein sequences and 3D structures of proteins. Collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.
Bioinformatics Protein Dataset - Simulated
kaggle.com
zip
Updated Dec 27, 2024
+ more versions
Cite
Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
Explore at:
Available download formats: zip (12928905 bytes)
Dataset updated
Dec 27, 2024
Authors
Rafael Gallo
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Subtitle

"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

Description

Introduction

This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

Columns Included

ID_Protein: Unique identifier for each protein.

Sequence: String of amino acids.

Molecular_Weight: Molecular weight calculated from the sequence.

Isoelectric_Point: Estimated isoelectric point based on the sequence composition.

Hydrophobicity: Average hydrophobicity calculated from the sequence.

Total_Charge: Sum of the charges of the amino acids in the sequence.

Polar_Proportion: Percentage of polar amino acids in the sequence.

Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.

Sequence_Length: Total number of amino acids in the sequence.

Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

Inspiration and Sources

While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.

Proposed Uses

This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.

How This Dataset Was Created

Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.

Property Calculation: Physicochemical properties were calculated using the Biopython library.

Class Assignment: Classes were randomly assigned for classification purposes.
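The three steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the dataset author's script: the description says the properties were calculated with Biopython, whereas this sketch substitutes a hand-written Kyte-Doolittle table so that it runs without dependencies. The helper names, the `P00000`-style identifiers, and the uniform choice of residues and classes are all hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def random_sequence(rng, min_len=50, max_len=300):
    """Draw a random amino acid chain of 50 to 300 residues (assumed uniform)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydrophobicity(sequence):
    """Average Kyte-Doolittle hydropathy over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(rng, protein_id):
    """Build one row using a subset of the columns listed in the description."""
    sequence = random_sequence(rng)
    return {
        "ID_Protein": protein_id,
        "Sequence": sequence,
        "Hydrophobicity": mean_hydrophobicity(sequence),
        "Sequence_Length": len(sequence),
        # The class is drawn independently of the sequence, as in the source.
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [make_record(rng, f"P{i:05d}") for i in range(5)]
```

Because the class is drawn independently of the sequence, no feature in such a table carries information about the label, which is consistent with the limitations stated in the description.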

Limitations

The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.

The functional classes are simulated and do not correspond to actual biological characteristics.

Data Split

The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
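A quick sanity check on either subset is to count rows per class. The sketch below assumes the CSV files use the column names listed under "Columns Included" (in particular `Class`); the function name `class_counts` is hypothetical and the file layout has not been verified against the actual download.

```python
import csv
from collections import Counter


def class_counts(csv_file):
    """Count rows per value of the Class column in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Class"] for row in reader)


# Example usage, assuming the file sits in the working directory:
#
#     with open("proteinas_train.csv", newline="") as f:
#         print(class_counts(f))
```

Since the description states that classes were randomly assigned, a classifier trained on these labels should not be expected to beat chance on the test subset; the split is mainly useful for exercising a pipeline end to end.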

Acknowledgment

This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
Data from: PROSITE
rrid.site
dknet.org
+1more
Updated Jan 29, 2022
Cite
(2022). PROSITE [Dataset]. http://identifiers.org/RRID:SCR_003457
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_003457
Dataset updated
Jan 29, 2022
Description
Database of protein families and domains that is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor. It is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. ScanProsite finds matches of your protein sequences to PROSITE signatures. PROSITE currently contains patterns and profiles specific for more than a thousand protein families or domains. Each of these signatures comes with documentation providing background information on the structure and function of these proteins. The database is available via FTP.
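To illustrate what matching a sequence against a PROSITE pattern involves, the sketch below translates the basic pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for terminal anchors) into a Python regular expression. It is a simplified stand-in, not ScanProsite's implementation: it ignores profiles, ProRule, and less common syntax such as anchors inside brackets, and it reports only non-overlapping hits. The function name `prosite_to_regex` and the test sequence are made up for the example; the pattern shown is the N-glycosylation site pattern (PS00001).

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern string into a Python regex string."""
    pattern = pattern.rstrip(".")  # patterns are conventionally ended by a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        # Single letters and [ABC] sets are already valid regex syntax.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# N-glycosylation site pattern (PS00001) applied to a made-up sequence.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hits = [m.start() for m in re.finditer(regex, "MNGSAANPSA")]
```

Here `regex` is `N[^P][ST][^P]` and `hits` holds zero-based start positions; the second `N` in the example is followed by `P` and is therefore not reported.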
An information table for proteins.
plos.figshare.com
figshare.com
xls
Updated Jun 1, 2023
Cite
Hafeez Ur Rehman; Nouman Azam; JingTao Yao; Alfredo Benso (2023). An information table for proteins. [Dataset]. http://doi.org/10.1371/journal.pone.0171702.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0171702.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Hafeez Ur Rehman; Nouman Azam; JingTao Yao; Alfredo Benso
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
An information table for proteins.
PSCDB - Protein Structural Change DataBase
neuinfo.org
scicrunch.org
+2more
Updated Nov 12, 2011
+ more versions
Cite
(2011). PSCDB - Protein Structural Change DataBase [Dataset]. http://identifiers.org/RRID:SCR_006116
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_006116
Dataset updated
Nov 12, 2011
Description
Database of protein structural changes upon ligand binding, classified into 7 classes in terms of the ligand binding sites and the location where the dominant motion occurs:
- Coupled Domain motions are the domain motions induced upon ligand binding.
- Independent Domain motions are the observable domain motions regardless of ligand binding.
- Coupled Local motions are the local motions induced upon ligand binding.
- Independent Local motions are the observable local motions regardless of ligand binding.
- Burying ligand motions are the inferred motions required to hold the ligand inside the protein.
- No significant motions means that no significant motion occurs.
- Other motions are motions that cannot be classified as domain or local motions.
Proteins are flexible molecules that undergo structural changes to function. The Protein Data Bank contains multiple entries for identical proteins determined under different conditions, e.g. with and without a ligand molecule, which provides important information for understanding the structural changes related to protein functions. We gathered 839 protein structural pairs of ligand-free and ligand-bound states from monomeric or homo-dimeric proteins, and constructed the Protein Structural Change DataBase (PSCDB). In the database, we focused on whether the motions were coupled with ligand binding. As a result, the protein structural changes were classified into seven classes, i.e. coupled domain motion (59 structural changes), independent domain motion (70), coupled local motion (125), independent local motion (135), burying ligand motion (104), no significant motion (311) and other type motion (35). PSCDB provides lists of each class. On each entry page, users can view detailed information about the motion, accompanied by a morphing animation of the structural changes.
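As a consistency check on the figures quoted above, the per-class counts can be tallied against the stated total of 839 structural pairs. The dictionary below simply transcribes the numbers from the description; the variable names are illustrative.

```python
# Per-class counts as quoted in the PSCDB description.
motion_counts = {
    "coupled domain motion": 59,
    "independent domain motion": 70,
    "coupled local motion": 125,
    "independent local motion": 135,
    "burying ligand motion": 104,
    "no significant motion": 311,
    "other type motion": 35,
}

total = sum(motion_counts.values())  # 839, matching the stated number of pairs

# Share of each class, as a percentage of all structural pairs.
shares = {name: 100 * n / total for name, n in motion_counts.items()}
largest_class = max(motion_counts, key=motion_counts.get)
```

The counts add up to the stated total, and the largest class is "no significant motion".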
H-Invitational Database: Protein-Protein Interaction Viewer
neuinfo.org
rrid.site
+2more
Updated Jan 29, 2022
+ more versions
Cite
(2022). H-Invitational Database: Protein-Protein Interaction Viewer [Dataset]. http://identifiers.org/RRID:SCR_008054/resolver?q=&i=rrid
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_008054 https://identifiers.org/RRID:SCR_008054/resolver?q=&i=rrid
Dataset updated
Jan 29, 2022
Description
The PPI view displays H-InvDB human protein-protein interaction (PPI) information. It is constructed by assigning interaction data to H-InvDB proteins which were originally predicted from transcriptional products generated by the H-Invitational project. The PPI view currently provides 32,198 human PPIs involving 9,268 H-InvDB proteins. H-Invitational Database (H-InvDB) is an integrated database of human genes and transcripts. By extensive analyses of all human transcripts, we provide curated annotations of human genes and transcripts that include gene structures, alternative splicing isoforms, non-coding functional RNAs, protein functions, functional domains, sub-cellular localizations, metabolic pathways, protein 3D structure, genetic polymorphisms (SNPs, indels and microsatellite repeats), relation with diseases, gene expression profiling, molecular evolutionary features, protein-protein interactions (PPIs) and gene families/groups. Sponsors: This research is financially supported by the Ministry of Economy, Trade and Industry of Japan (METI), the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT) and the Japan Biological Informatics Consortium (JBIC). This work is also partly supported by the Research Grant for the RIKEN Genome Exploration Research Project from MEXT to Y.H. and the Grant for the RIKEN Frontier Research System, Functional RNA research program.
Data from: ProtNote: a multimodal method for protein-function annotation
data.niaid.nih.gov
zenodo.org
Updated Oct 13, 2024
Cite
Char, Samir; Corley, Nathaniel; Alamdari, Sarah; Yang, Kevin K.; Amini, Ava P. (2024). ProtNote: a multimodal method for protein-function annotation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13897919
Explore at:
Dataset updated
Oct 13, 2024
Dataset provided by
University of Washington
Microsoft Research
Microsoft (United States)
Authors
Char, Samir; Corley, Nathaniel; Alamdari, Sarah; Yang, Kevin K.; Amini, Ava P.
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Understanding protein sequence-function relationships is essential for advancing protein biology and engineering. However, fewer than 1% of known protein sequences have human-verified functions, and scientists continually update the set of possible functions. While deep learning methods have demonstrated promise for protein function prediction, current models are limited to predicting only those functions on which they were trained. Here, we introduce ProtNote, a multimodal deep learning model that leverages free-form text to enable both supervised and zero-shot protein function prediction. ProtNote not only maintains near state-of-the-art performance for annotations in its train set, but also generalizes to unseen and novel functions in zero-shot test settings. We envision that ProtNote will enhance protein function discovery by enabling scientists to use free text inputs, without restriction to predefined labels – a necessary capability for navigating the dynamic landscape of protein biology.
CAFA 5 Protein Database Files (PDB)
kaggle.com
zip
Updated Jul 25, 2023
Cite
A Merii (2023). CAFA 5 Protein Database Files (PDB) [Dataset]. https://www.kaggle.com/datasets/amerii/cafa-5-pdbs
Explore at:
Available download formats: zip (12654687498 bytes)
Dataset updated
Jul 25, 2023
Authors
A Merii
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Description
This dataset contains 3D protein structure files in PDB format, gathered via the AlphaFoldDB API, for the Critical Assessment of protein Function Annotation (CAFA) 5 challenge protein entries.

The AlphaFoldDB is a comprehensive database that stores protein structures predicted by AlphaFold2 - an AI model developed by DeepMind that predicts the 3D structure of a protein based on its sequence. AlphaFold's predictions have been recognized for their remarkable accuracy, often comparable to those obtained from experimental methods.

The CAFA challenge is a community-wide effort to assess computational methods that predict protein function. The protein entries in this dataset are specifically related to the 5th iteration of the challenge - CAFA 5.

The dataset provides the following information for each protein:

The naming conventions for the files are: `

A functional update of the

data.virginia.gov
catalog.data.gov

html

Updated Sep 6, 2025

Cite

National Institutes of Health (2025). A functional update of the [Dataset]. https://data.virginia.gov/dataset/a-functional-update-of-the

Explore at:

Available download formats: html

Dataset updated

Sep 6, 2025

Dataset provided by

National Institutes of Health

Description

Background Since the genome of Escherichia coli K-12 was initially annotated in 1997, additional functional information based on biological characterization and functions of sequence-similar proteins has become available. On the basis of this new information, an updated version of the annotated chromosome has been generated.

   Results
   The E. coli K-12 chromosome is currently represented by 4,401 genes encoding 116 RNAs and 4,285 proteins. The boundaries of the genes identified in the GenBank Accession U00096 were used. Some protein-coding sequences are compound and encode multimodular proteins. The coding sequences (CDSs) are represented by modules (protein elements of at least 100 amino acids with biological activity and independent evolutionary history). There are 4,616 identified modules in the 4,285 proteins. Of these, 48.9% have been characterized, 29.5% have an imputed function, 2.1% have a phenotype and 19.5% have no function assignment. Only 7% of the modules appear unique to E. coli, and this number is expected to be reduced as more genome data becomes available. The imputed functions were assigned on the basis of manual evaluation of functions predicted by BLAST and DARWIN analyses and by the MAGPIE genome annotation system.


   Conclusions
   Much knowledge has been gained about functions encoded by the E. coli K-12 genome since the 1997 annotation was published. The data presented here should be useful for analysis of E. coli gene products as well as gene products encoded by other genomes.

Large protein databases reveal structural complementarity and functional...
figshare.com
bin
Updated Jul 12, 2025
Cite
Paweł Szczerbiak; Tomasz Kosciolek; Lukasz Szydlowski; Witold Wydmański; P. Douglas Renfrew; Julia Koehler Leman (2025). Large protein databases reveal structural complementarity and functional locality [Dataset]. http://doi.org/10.6084/m9.figshare.27203073.v3
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.27203073.v3
Dataset updated
Jul 12, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Paweł Szczerbiak; Tomasz Kosciolek; Lukasz Szydlowski; Witold Wydmański; P. Douglas Renfrew; Julia Koehler Leman
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Recent breakthroughs in protein structure prediction have led to an unprecedented surge in high-quality 3D models, highlighting the need for efficient computational solutions to manage and analyze this wealth of structural data. In our work, we comprehensively examine the structural clusters obtained from the AlphaFold Protein Structure Database (AFDB), a high-quality subset of ESMAtlas, and the Microbiome Immunity Project (MIP). We create a single cohesive low-dimensional representation of the resulting protein space. Our results show that, while each database occupies distinct regions within the protein structure space, they collectively exhibit significant overlap in their functional profiles. High-level biological functions tend to cluster in particular regions, revealing a shared functional landscape despite the diverse sources of data. By creating a single, cohesive low-dimensional representation of protein structure space integrating data from diverse sources, localizing functional annotations within this space, and providing an open-access web-server for exploration, this work offers insights for future research concerning protein sequence-structure-function relationships, enabling various biological questions to be asked about taxonomic assignments, environmental factors, or functional specificity. This approach is generalizable to other or future datasets, enabling further discovery beyond findings presented here.
Data from: Protein Clusters
catalog.data.gov
datadiscovery.nlm.nih.gov
+1more
Updated Jun 19, 2025
Cite
National Library of Medicine (2025). Protein Clusters [Dataset]. https://catalog.data.gov/dataset/protein-clusters
Explore at:
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
A collection of Reference Sequence (RefSeq) proteins, from the complete genomes of prokaryotes, plasmids, and organelles, that have been grouped and annotated based on sequence similarity and protein function.
Protein Structural Domain Classification
cathdb.info
ec.i4cologne.com
+3more
Updated Sep 30, 2024
Cite
(2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
Explore at:
Unique identifier
https://identifiers.org/MIR:00100005
Dataset updated
Sep 30, 2024
Description
CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.

Data from: Towards understanding the first genome sequence of a crenarchaeon...

odgavaprod.ogopendata.com
catalog.data.gov

html

Updated Sep 6, 2025

Cite

National Institutes of Health (2025). Towards understanding the first genome sequence of a crenarchaeon by genome annotation using clusters of orthologous groups of proteins (COGs) [Dataset]. https://odgavaprod.ogopendata.com/dataset/towards-understanding-the-first-genome-sequence-of-a-crenarchaeon-by-genome-annotation-using-cl

Explore at:

Available download formats: html

Dataset updated

Sep 6, 2025

Dataset provided by

National Institutes of Health

Description

Background: Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi.

   Results:
   A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix.


   Conclusions:
   Special-purpose databases organized on the basis of phylogenetic analysis and carefully curated with respect to known and predicted protein functions provide for a significant improvement in genome annotation. A differential genome display approach helps in a systematic investigation of common and distinct features of gene repertoires and in some cases reveals unexpected connections that may be indicative of functional similarities between phylogenetically distant organisms and of lateral gene exchange.

Data for: 'FAS: assessing the similarity between proteins using...
data.niaid.nih.gov
Updated May 4, 2023
Cite
Julian Dosch; Holger Bergmann; Vinh Tran; Ingo Ebersberger (2023). Data for: 'FAS: assessing the similarity between proteins using multi-layered feature architectures' [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7896005
Explore at:
Dataset updated
May 4, 2023
Dataset provided by
Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, Germany
Authors
Julian Dosch; Holger Bergmann; Vinh Tran; Ingo Ebersberger
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Raw data and result data for the analyses made for the manuscript:

'FAS: assessing the similarity between proteins using multi-layered feature architectures'

https://doi.org/10.1093/bioinformatics/btad226

This dataset contains raw data obtained from QFO Orthobench and Gene Ontology database. Analyses were made to showcase the different uses of the FAS algorithm.
MfunGD - MIPS Mouse Functional Genome Database
neuinfo.org
dknet.org
+2more
Updated Jan 29, 2022
+ more versions
Cite
(2022). MfunGD - MIPS Mouse Functional Genome Database [Dataset]. http://identifiers.org/RRID:SCR_007783
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007783
Dataset updated
Jan 29, 2022
Description
THIS RESOURCE IS NO LONGER IN SERVICE. Documented on August 16, 2019. Database for annotated mouse proteins and their occurrence in protein networks. It contains cDNA and protein sequences, annotation, gene models and mapping, FunCat, UCSC Genome Viewer, SIMAP, pseudogenes (Genome Viewer Track), InterPro, and splice variants. Protein function annotation is performed using the Functional Catalogue (FunCat) annotation scheme, which is a hierarchically structured classification system. To provide up-to-date similarity search results and InterPro domain analyses, the protein entries are interconnected with the SIMAP database. The gene models are based on the RefSeq mouse cDNAs. The work of our group is focussed on the annotation of biological systems. Therefore, results from the Mammalian Protein-Protein Interaction Database and the Comprehensive Resource of Mammalian Protein Complexes are linked to the MfunGD dataset. Links to external resources are also provided. MfunGD is implemented in GenRE, a J2EE-based, component-oriented, multi-tier architecture.
Data from: Sequence-structure-function relationships in the microbial...
data.niaid.nih.gov
Updated Jun 4, 2022
Cite
Koehler Leman, Julia; Szczerbiak, Pawel; Renfrew, P. Douglas; Gligorijevic, Vladimir; Berenberg, Daniel; Vatanen, Tommi; Taylor, Bryn C.; Chandler, Chris; Janssen, Stefan; Pataki, Andras; Carriero, Nick; Fisk, Ian; Xavier, Ramnik J.; Knight, Rob; Bonneau, Richard; Kosciolek, Tomasz (2022). Sequence-structure-function relationships in the microbial protein universe [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6477241
Explore at:
Dataset updated
Jun 4, 2022
Dataset provided by
Simons Foundation (https://www.simonsfoundation.org/)
Broad Institute, Cambridge, MA, USA
Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA 92093, USA
Authors
Koehler Leman, Julia; Szczerbiak, Pawel; Renfrew, P. Douglas; Gligorijevic, Vladimir; Berenberg, Daniel; Vatanen, Tommi; Taylor, Bryn C.; Chandler, Chris; Janssen, Stefan; Pataki, Andras; Carriero, Nick; Fisk, Ian; Xavier, Ramnik J.; Knight, Rob; Bonneau, Richard; Kosciolek, Tomasz
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The Microbiome Immunity Project (MIP) dataset contains models predicted with both Rosetta and DMPFold (folder dataset/). It also contains DeepFRI function predictions for all models.

The metadata folder contains additional data which may be useful for searching the MIP database (FASTA files, BLAST databases and useful scripts for structure/function search) as well as retrieving the sequence/structural annotations.

The intermediate_data folder contains preprocessed output for reproducing many of the figures in our manuscript in conjunction with scripts and Jupyter notebooks found in our git repository: https://github.com/microbiome-immunity-project/protein_universe

More information about the dataset and associated metadata is provided in the README.md file.

We are also providing workflows to search the MIP database against a protein sequence or structure or function of interest (see SEARCHING.md for more details).
Data underlying the paper: Plasmonic Enhancement of Protein Function
data.4tu.nl
zip
Updated Aug 16, 2024
Cite
Marco Locarno; Qiangrui Dong; Xin Meng; Cristiano Glessi; Nynke Hettema; Nidas Brandsma; Sebbe Blokhuizen; Alejandro Castañeda Garcia; Srividya Ganapathy; Marco Post; Thieme Schmidt; Lars van Roemburg; Bing Xu; Chun-Ting Cho; Liedewij Laan; Miao-Ping Chien; Daan Brinks (2024). Data underlying the paper: Plasmonic Enhancement of Protein Function [Dataset]. http://doi.org/10.4121/909b7170-816f-4d81-a8b0-12b113f29207.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.4121/909b7170-816f-4d81-a8b0-12b113f29207.v1
Dataset updated
Aug 16, 2024
Dataset provided by
4TU.ResearchData
Authors
Marco Locarno; Qiangrui Dong; Xin Meng; Cristiano Glessi; Nynke Hettema; Nidas Brandsma; Sebbe Blokhuizen; Alejandro Castañeda Garcia; Srividya Ganapathy; Marco Post; Thieme Schmidt; Lars van Roemburg; Bing Xu; Chun-Ting Cho; Liedewij Laan; Miao-Ping Chien; Daan Brinks
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
License information was derived automatically
Dataset funded by
European Research Council
Dutch Research Council
Description
These data are part of the paper Plasmonic Enhancement of Protein Function; it contains physics data pertaining to coupling plasmonic nanoparticles to proteins to enhance their fluorescence and modify their function. The data are predominantly image data (.TIF format) obtained in fluorescence imaging experiments.
