100+ datasets found

PRINTS
ebi.ac.uk
Updated Jun 14, 2012
Cite
(2012). PRINTS [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Jun 14, 2012
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain. PRINTS is based at the University of Manchester, UK.
SUPERFAMILY
ebi.ac.uk
Updated Nov 8, 2010
Cite
(2010). SUPERFAMILY [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Nov 8, 2010
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY is based at the University of Bristol, UK.
Data from: PROSITE
prosite.expasy.org
toothandnail-mailorder.com
+7more
Updated Oct 15, 2025
Cite
(2025). PROSITE [Dataset]. https://prosite.expasy.org/
Explore at:
Dataset updated
Oct 15, 2025
Description
PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.
PROSITE profiles
ebi.ac.uk
Updated Feb 5, 2025
Cite
(2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Feb 5, 2025
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
PIRSF
ebi.ac.uk
Updated Apr 7, 2020
Cite
(2020). PIRSF [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Apr 7, 2020
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The PIRSF protein classification system is a network with multiple levels of sequence diversity, from superfamilies to subfamilies, that reflects the evolutionary relationships of full-length proteins and domains. PIRSF is based at the Protein Information Resource, Georgetown University Medical Centre, Washington DC, US.
Protein Structural Domain Classification
cathdb.info
ec.i4cologne.com
+3more
Updated Sep 30, 2024
Cite
(2024). Protein Structural Domain Classification [Dataset]. http://identifiers.org/MIR:00100005
Explore at:
Unique identifier
https://identifiers.org/MIR:00100005
Dataset updated
Sep 30, 2024
Description
CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.
NCBIFAM
ebi.ac.uk
Updated Aug 6, 2025
Cite
(2025). NCBIFAM [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Aug 6, 2025
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NCBIfam is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. NCBIfam is maintained at the National Center for Biotechnology Information (Bethesda, MD). NCBIfam includes models from TIGRFAMs, another database of protein families developed at The Institute for Genomic Research, then at the J. Craig Venter Institute (Rockville, MD, US).
Protein Domain Analysis of Genomic Sequence Data Reveals Regulation of LRR...
figshare.com
docx
Updated Jun 1, 2023
Cite
Tiange Lang; Kangquan Yin; Jinyu Liu; Kunfang Cao; Charles H. Cannon; Fang K. Du (2023). Protein Domain Analysis of Genomic Sequence Data Reveals Regulation of LRR Related Domains in Plant Transpiration in Ficus [Dataset]. http://doi.org/10.1371/journal.pone.0108719
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.1371/journal.pone.0108719
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Tiange Lang; Kangquan Yin; Jinyu Liu; Kunfang Cao; Charles H. Cannon; Fang K. Du
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Predicting protein domains is essential for understanding a protein’s function at the molecular level. However, up till now, there has been no direct and straightforward method for predicting protein domains in species without a reference genome sequence. In this study, we developed a functionality with a set of programs that can predict protein domains directly from genomic sequence data without a reference genome. Using whole genome sequence data, the programming functionality mainly comprised DNA assembly in combination with next-generation sequencing (NGS) assembly methods and traditional methods, peptide prediction and protein domain prediction. The proposed new functionality avoids problems associated with de novo assembly due to micro reads and small single repeats. Furthermore, we applied our functionality for the prediction of leucine rich repeat (LRR) domains in four species of Ficus with no reference genome, based on NGS genomic data. We found that the LRRNT_2 and LRR_8 domains are related to plant transpiration efficiency, as indicated by the stomata index, in the four species of Ficus. The programming functionality established in this study provides new insights for protein domain prediction, which is particularly timely in the current age of NGS data expansion.
Data from: Prediction of Protein Domain with mRMR Feature Selection and...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Jun 15, 2012
Cite
Li, Bi-Qing; Chen, Lei; Feng, Kai-Yan; Chou, Kuo-Chen; Hu, Le-Le; Cai, Yu-Dong (2012). Prediction of Protein Domain with mRMR Feature Selection and Analysis [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001164184
Explore at:
Dataset updated
Jun 15, 2012
Authors
Li, Bi-Qing; Chen, Lei; Feng, Kai-Yan; Chou, Kuo-Chen; Hu, Le-Le; Cai, Yu-Dong
Description
The domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to develop effective methods for predicting the protein domains according to the sequences information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from the sequence information still remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), as well as by incorporating the features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28–40% higher than those by the existing method on the same benchmark dataset. Furthermore, it was revealed by an in-depth analysis that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool in annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impacts to the domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be utilized to study protein signal peptides, B-cell epitopes, HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
PANTHER
ebi.ac.uk
Updated Jun 20, 2024
Cite
(2024). PANTHER [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Jun 20, 2024
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function, as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences. PANTHER is based at the University of Southern California, CA, US.
The Encyclopedia of Domains (TED) structural domains assignments for...
zenodo.org
application/gzip, bz2 +1
Updated Oct 31, 2024
Cite
Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13369203
Explore at:
Available download formats: application/gzip, bz2, zip
Unique identifier
https://doi.org/10.5281/zenodo.13369203
Dataset updated
Oct 31, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset description:

The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.

Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes.

For both TED100 and TED-redundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

We are making available PDB files for 7,427 novel folds, identified during the TED classification process, with an annotation table sorted by novelty.

Please use the gunzip command to extract files with a '.gz' extension.

CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.

This dataset contains:

ted_214m_per_chain_segmentation.tsv
The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
1. AFDB_model_ID: chain identifier from AFDB in the format AF-

ted_365m_domain_boundaries_consensus_level.tsv.gz
The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
1. TED_ID: TED domain identifier in the format AF-

ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv).

ted_324m_seq_clustering.cathlabels.tsv
The file contains the results of the domain sequences clustering with MMseqs2.
Columns:
1. Cluster_representative
2. Cluster_member
3. CATH code assignment, if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
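
The distinction between a superfamily-level and a fold-level assignment in column 3 can be read off the number of dot-separated fields in the CATH code. The sketch below illustrates this under the assumption that codes follow the standard Class.Architecture.Topology.Homologous-superfamily numbering; the function name `cath_level` is hypothetical and is not part of the TED tooling.

```python
def cath_level(code: str) -> str:
    """Name the CATH hierarchy level implied by the number of fields in a code."""
    # Assumed mapping: C.A.T.H numbering, one more field per level of the hierarchy.
    levels = {
        1: "class",
        2: "architecture",
        3: "topology (fold)",
        4: "homologous superfamily",
    }
    parts = code.strip().split(".")
    if len(parts) not in levels or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a CATH code: {code!r}")
    return levels[len(parts)]


if __name__ == "__main__":
    for example in ("3.40.50.300", "3.20.20"):
        print(example, "->", cath_level(example))
```

With the two codes quoted above, the four-field code maps to the homologous superfamily level and the three-field code to the fold (topology) level.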

novel_folds_set.domain_summary.tsv is sorted by novelty.
1. ted_id - TED domain identifier in the format AF-

Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
The files contain a header with the following fields. Each column is tab-separated (.tsv).
1. TED_redundant_id - TED chain identifier in the format AF-

novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.

All per-tool domain boundaries predictions are in the same format with the following columns.
1. TED_chainID - TED chain identifier in the format AF-

Domain boundary predictions share the same format: domains are separated by ',', segments within a domain by '_', and segment boundaries (start, stop) by '-'.

e.g. the domain prediction by Merizo for AF-A0A000-F1-model_v4:
AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

Merizo predicts one discontinuous domain and one continuous domain:
Domain 1 (discontinuous): 10-52_289-394
segment 1: 10-52
segment 2: 289-394
Domain 2 (continuous): 53-288
segment 1: 53-288
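
The boundary notation above can be unpacked with a few string splits. The following is a minimal sketch, not code from ted-tools: the function names are hypothetical, and it assumes residue ranges are inclusive and that the boundary string is the fifth whitespace-separated field, as in the example row.

```python
from typing import List, Tuple

Segment = Tuple[int, int]


def parse_boundaries(boundaries: str) -> List[List[Segment]]:
    """Split a string such as '10-52_289-394,53-288' into domains and segments."""
    domains = []
    for domain in boundaries.split(","):      # ',' separates domains
        segments = []
        for segment in domain.split("_"):     # '_' separates segments of one domain
            start, stop = segment.split("-")  # '-' separates start and stop
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


def domain_length(segments: List[Segment]) -> int:
    """Number of residues in a domain, assuming inclusive start and stop."""
    return sum(stop - start + 1 for start, stop in segments)


if __name__ == "__main__":
    row = ("AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 "
           "394 2 10-52_289-394,53-288 0.90077")
    fields = row.split()
    # Assumption: the boundary string is the fifth field, as in the example row.
    for number, segments in enumerate(parse_boundaries(fields[4]), start=1):
        kind = "discontinuous" if len(segments) > 1 else "continuous"
        print(f"Domain {number} ({kind}): {segments}, {domain_length(segments)} residues")
```

A domain with more than one segment is reported as discontinuous, which matches the Merizo example above.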

ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.

cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.

ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info)

gofocus_data.tar.bz2 - GOFocus model weights
Beyond the E-Value: Stratified Statistics for Protein Domain Prediction
plos.figshare.com
pdf
Updated Jun 1, 2023
Cite
Alejandro Ochoa; John D. Storey; Manuel Llinás; Mona Singh (2023). Beyond the E-Value: Stratified Statistics for Protein Domain Prediction [Dataset]. http://doi.org/10.1371/journal.pcbi.1004509
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pcbi.1004509
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Alejandro Ochoa; John D. Storey; Manuel Llinás; Mona Singh
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems—that is, those in which statistical tests can be partitioned naturally—controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
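
The idea of stratifying by domain family can be illustrated with a small sketch that computes q-values separately within each stratum. This is not the estimator from the paper: it assumes p-values are available for each hit and uses the Benjamini-Hochberg step-up adjustment (equivalent to a q-value with the null proportion fixed at 1). The function names and the toy input are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def bh_qvalues(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues


def stratified_qvalues(hits: Sequence[Tuple[Hashable, float]]) -> List[float]:
    """Compute q-values separately within each stratum (e.g. each domain family)."""
    by_stratum = defaultdict(list)
    for index, (stratum, pvalue) in enumerate(hits):
        by_stratum[stratum].append((index, pvalue))
    result = [0.0] * len(hits)
    for members in by_stratum.values():
        adjusted = bh_qvalues([pvalue for _, pvalue in members])
        for (index, _), qvalue in zip(members, adjusted):
            result[index] = qvalue
    return result


if __name__ == "__main__":
    # Toy input: (family label, p-value) pairs; labels and values are made up.
    toy_hits = [("famA", 0.01), ("famB", 0.01), ("famA", 0.5)]
    print(stratified_qvalues(toy_hits))
```

Because each family is adjusted on its own, the same p-value can map to different q-values in different strata, which is the effect the stratified thresholds exploit.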
SMART
ebi.ac.uk
Updated Feb 14, 2020
Cite
(2020). SMART [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Feb 14, 2020
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
Performance evaluation per protein domain.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Jul 22, 2024
Cite
Chu, Simon K. S.; Narang, Kush; Siegel, Justin B. (2024). Performance evaluation per protein domain. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001352376
Explore at:
Dataset updated
Jul 22, 2024
Authors
Chu, Simon K. S.; Narang, Kush; Siegel, Justin B.
Description
Protein stability plays a crucial role in a variety of applications, such as food processing, therapeutics, and the identification of pathogenic mutations. Engineering campaigns commonly seek to improve protein stability, and there is a strong interest in streamlining these processes to enable rapid optimization of highly stabilized proteins with fewer iterations. In this work, we explore utilizing a mega-scale dataset to develop a protein language model optimized for stability prediction. ESMtherm is trained on the folding stability of 528k natural and de novo sequences derived from 461 protein domains and can accommodate deletions, insertions, and multiple-point mutations. We show that a protein language model can be fine-tuned to predict folding stability. ESMtherm performs reasonably on small protein domains and generalizes to sequences distal from the training set. Lastly, we discuss our model’s limitations compared to other state-of-the-art methods in generalizing to larger protein scaffolds. Our results highlight the need for large-scale stability measurements on a diverse dataset that mirrors the distribution of sequence lengths commonly observed in nature.
A dataset for predicting protein-protein interactions in humans
datadryad.org
search.dataone.org
zip
Updated Sep 16, 2025
Cite
Jing Zhang; Ian R. Humphrey; Jimin Pei; Jinuk Kim; Chulwon Choi; Rongqing Yuan; Jesse Durham; Siqi Liu; Hee-Jung Choi; Minkyung Baek; David Baker; Qian Cong (2025). A dataset for predicting protein-protein interactions in humans [Dataset]. http://doi.org/10.5061/dryad.15dv41p84
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.15dv41p84
Dataset updated
Sep 16, 2025
Dataset provided by
Dryad
Authors
Jing Zhang; Ian R. Humphrey; Jimin Pei; Jinuk Kim; Chulwon Choi; Rongqing Yuan; Jesse Durham; Siqi Liu; Hee-Jung Choi; Minkyung Baek; David Baker; Qian Cong
Time period covered
Aug 15, 2025
Description
A dataset for predicting protein-protein interactions in humans

Dataset DOI: 10.5061/dryad.15dv41p84

Description of the data and file structure

protein_omicMSAs.tar.gz (17 GB)

These MSAs are in an A3M-like format. Compared to the standard A3M format, we inserted an additional sequence at the beginning, named “mask,” to indicate the alignment quality at each position. In this “mask,” an asterisk (*) indicates high-quality positions, and a dash (-) indicates low-quality positions (these are poorly conserved and thus cannot be reliably assembled from genomic data). We recommend using only the high-quality positions (marked with *), as we did in our work. Insertions relative to the human (query) sequence are represented by lowercase letters. Each sequence corresponds to one draft genome or genomic dataset, and the NCBI accession number of the genome or dataset is used to name the sequence in the header. We also include the taxonomic information of...
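
Restricting an alignment to the high-quality positions might look like the sketch below. It rests on several assumptions that the description does not spell out: the first record's header is literally "mask", the mask has exactly one character per column of the human (query) sequence, and lowercase insertions are dropped before columns are selected. The function names and the toy alignment are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, str]


def read_a3m(text: str) -> List[Record]:
    """Parse FASTA-style records into (name, sequence) pairs."""
    records: List[Record] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def keep_high_quality(records: List[Record]) -> List[Record]:
    """Keep only the columns that the leading 'mask' record marks with '*'."""
    mask_name, mask = records[0]
    if mask_name != "mask":  # assumption: the mask header is literally "mask"
        raise ValueError("first record is not the mask")
    keep = [i for i, flag in enumerate(mask) if flag == "*"]
    filtered = []
    for name, sequence in records[1:]:
        # Lowercase letters are insertions relative to the query; drop them
        # so every sequence has one character per query column.
        aligned = "".join(c for c in sequence if not c.islower())
        if len(aligned) != len(mask):
            raise ValueError(f"{name}: length does not match the mask")
        filtered.append((name, "".join(aligned[i] for i in keep)))
    return filtered


if __name__ == "__main__":
    # Toy alignment; the sequence names are made up.
    toy = ">mask\n*-**\n>query\nACDE\n>genome_1\nAcdC-E\n"
    for name, sequence in keep_high_quality(read_a3m(toy)):
        print(name, sequence)
```

Dropping insertions first keeps every sequence at one character per query column, so the mask indices can be applied directly.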
SFLD
ebi.ac.uk
Updated Sep 7, 2018
Cite
(2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Sep 7, 2018
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
Dissection of the role of a SH3 domain in the evolution of binding...
search.dataone.org
datadryad.org
Updated Jul 17, 2025
Cite
Pascale Lemieux; David Bradley; Alexandre K. Dubé; Christian Landry (2025). Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins [Dataset]. http://doi.org/10.5061/dryad.sj3tx968m
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.sj3tx968m
Dataset updated
Jul 17, 2025
Dataset provided by
Dryad Digital Repository
Authors
Pascale Lemieux; David Bradley; Alexandre K. Dubé; Christian Landry
Time period covered
Jan 1, 2023
Description
Protein-protein interactions drive many cellular processes. Some protein interactions are directed by Src homology 3 (SH3) domains that bind proline-rich motifs on other proteins. The evolution of the binding specificity of SH3 domains is not completely understood, particularly following gene duplication. Paralogous genes accumulate mutations that can modify protein functions and, for SH3 domains, their binding preferences. Here, we examined how the binding of the SH3 domains of two paralogous yeast type I myosins, Myo3 and Myo5, evolved following duplication. We found that the paralogs have subtly different SH3-dependent interaction profiles. However, by swapping SH3 domains between the paralogs and characterizing the SH3 domains freed from their protein context, we find that few of the differences in interactions, if any, depend on the SH3 domains themselves. We used ancestral sequence reconstruction to resurrect the pre-duplication SH3 domains and examined, moving back in time, how t...

The data published in this dataset were collected by multiple methods. Among the methods used are DHFR Protein-fragment Complementation Assay, cytometry, ancestral sequence reconstruction with IQ-TREE and FastML, protein structure prediction with AlphaFold2 and AlphaFold Multimer, molecular docking with Haddock2.4, orthology analysis and coevolution predictions with EVCouplings. See the README.md file and the methods section of the paper "Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins" for more details.

File S1: Tables S1 - S12
File S2: Detailed protocols
FiguresS: Figures S1 - S10
DataS1: DHFR PCA results
DataS2: Phylogeny and sequence alignment
DataS3: AlphaFold results
DataS4: Molecular docking input and output files
DataS5: Orthology input and motif conservation results
DataS6: EVCouplings output

Please refer to Lemieux et al. 2023 for details on the data collection and transformation.

All files can be opened with either R, a text editor, Excel or ChimeraX.

This README file was generated on 2023-09-19 by Pascale Lemieux.

GENERAL INFORMATION

Title of Dataset: Data from: Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins

Author Information

A. Principal Investigator Contact Information
Name: Christian Landry
Institution: Université Laval, Québec, CA
Email: christian.landry@bio.ulaval.ca

B. Associate or Co-investigator Contact Information
Name: Pascale Lemieux
Institution: Université Laval, Québec, CA
Email: pascale.lemieux.4@ulaval.ca

Date of data collection (single date, range, approximate date): 2020-2023

Information about funding sources that supported the collection of the data: Canadian Institutes of Health Research (CIHR) Foundation grant 387697 and a HFSP grant (RGP0034/2018) to CRL

SHARING/ACCESS INFORMATION

Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain

Links to publications that cite or use ...
Pfam seed random split
kaggle.com
zip
Updated Apr 19, 2019
Cite
Google AI (2019). Pfam seed random split [Dataset]. https://www.kaggle.com/googleai/pfam-seed-random-split
Explore at:
Available download formats: zip (517047246 bytes)
Dataset updated
Apr 19, 2019
Dataset authored and provided by
Google AI (http://ai.google/)
License
CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
Description
Problem description

This directory contains data to train a model to predict the function of protein domains, based on the PFam dataset.

Domains are functional sub-parts of proteins; much like images in ImageNet are pre-segmented to contain exactly one object class, this data is pre-segmented to contain exactly one domain.

The purpose of the dataset is to repose the PFam seed dataset as a multiclass classification machine learning task.

The task is: given the amino acid sequence of the protein domain, predict which class it belongs to. There are about 1 million training examples, and 18,000 output classes.

Data structure

This data is more completely described by the publication "Can Deep Learning Classify the Protein Universe", Bileschi et al.

Data split and layout

The approach used to partition the data into training/dev/testing folds is a random split.

Training data should be used to train your models.

Dev (development) data should be used in a close validation loop (maybe for hyperparameter tuning or model validation).

Test data should be reserved for much less frequent evaluations - this helps avoid overfitting on your test data, as it should only be used infrequently.

File content

Each fold (train, dev, test) has a number of files in it. Each of those files contains one CSV record per line, with the following fields:

sequence: HWLQMRDSMNTYNNMVNRCFATCIRSFQEKKVNAEEMDCTKRCVTKFVGYSQRVALRFAE
family_accession: PF02953.15
sequence_name: C5K6N5_PERM5/28-87
aligned_sequence: ....HWLQMRDSMNTYNNMVNRCFATCI...........RS.F....QEKKVNAEE.....MDCT....KRCVTKFVGYSQRVALRFAE
family_id: zf-Tim10_DDP

Description of fields:
- sequence: These are usually the input features to your model. Amino acid sequence for this domain. There are 20 very common amino acids (frequency > 1,000,000), and 5 symbols that are quite uncommon: X, U, B, O, Z.
- family_accession: These are usually the labels for your model. Accession number in the form PFxxxxx.y (Pfam), where xxxxx is the family accession and y is the version number. Some values of y are ten or greater, so y can have two digits.
- family_id: One-word name for the family.
- sequence_name: Sequence name, in the form "$uniprot_accession_id/$start_index-$end_index".
- aligned_sequence: Contains a single sequence from the multiple sequence alignment (with the rest of the members of the family in seed), with gaps retained.

Generally, the family_accession field is the label, and the sequence (or aligned sequence) is the training feature.
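
The field formats described above can be checked with a short parser. This is a minimal sketch rather than code from the dataset authors: the names `DomainRecord`, `parse_record` and `ungap` are hypothetical, and it assumes inclusive start and end indices and '.' or '-' as gap characters, which is consistent with the example record.

```python
import re
from typing import NamedTuple


class DomainRecord(NamedTuple):
    uniprot_id: str
    start: int
    end: int
    family: str
    version: int


# sequence_name has the form "$uniprot_accession_id/$start_index-$end_index".
_NAME = re.compile(r"^(?P<id>[^/]+)/(?P<start>\d+)-(?P<end>\d+)$")
# family_accession has the form PFxxxxx.y, where y may have several digits.
_ACCESSION = re.compile(r"^(?P<family>PF\d{5})\.(?P<version>\d+)$")


def parse_record(sequence_name: str, family_accession: str) -> DomainRecord:
    """Split the two composite fields of a record into their parts."""
    name = _NAME.match(sequence_name)
    accession = _ACCESSION.match(family_accession)
    if name is None or accession is None:
        raise ValueError("unexpected field format")
    return DomainRecord(
        uniprot_id=name["id"],
        start=int(name["start"]),
        end=int(name["end"]),
        family=accession["family"],
        version=int(accession["version"]),
    )


def ungap(aligned_sequence: str) -> str:
    """Remove gap characters (assumed to be '.' and '-') from an aligned sequence."""
    return aligned_sequence.replace(".", "").replace("-", "")


if __name__ == "__main__":
    record = parse_record("C5K6N5_PERM5/28-87", "PF02953.15")
    print(record)
    # Domain length implied by the inclusive start and end indices.
    print(record.end - record.start + 1)
```

For the example record, the length implied by the indices can be compared with the length of the sequence field, and the ungapped aligned_sequence with the sequence itself, as a quick consistency check when loading the files.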

This sequence corresponds to a domain, not a full protein.

The contents of these fields are the same as the data provided in Stockholm format by PFam at ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.seed.gz


License

Creative Commons Legal Code

CC0 1.0 Universal

Table_1_Method for Identifying Essential Proteins by Key Features of...
frontiersin.figshare.com
xlsx
Updated Jun 5, 2023
+ more versions
Cite
Xin He; Linai Kuang; Zhiping Chen; Yihong Tan; Lei Wang (2023). Table_1_Method for Identifying Essential Proteins by Key Features of Proteins in a Novel Protein-Domain Network.XLSX [Dataset]. http://doi.org/10.3389/fgene.2021.708162.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fgene.2021.708162.s001
Dataset updated
Jun 5, 2023
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Xin He; Linai Kuang; Zhiping Chen; Yihong Tan; Lei Wang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, because traditional biological experiments are costly and have low accuracy, a growing number of computational models have been proposed to infer potential essential proteins. This paper proposes a prediction method called KFPM. First, a novel protein-domain heterogeneous network is built by combining known protein-protein interactions with known associations between proteins and domains. Next, key topological characteristics extracted from this network and functional characteristics extracted from multiple sources of biological information about proteins are integrated by an improved PageRank algorithm to infer potential essential proteins. Finally, KFPM was compared with 13 state-of-the-art prediction methods. Among the top 1, 5, and 10% of candidate proteins predicted by KFPM, the prediction accuracy reaches 96.08, 83.14, and 70.59%, respectively, significantly outperforming all 13 competing methods. These results suggest that KFPM may be a useful tool for predicting potential essential proteins.
Table_3_Method for Identifying Essential Proteins by Key Features of...
frontiersin.figshare.com
xlsx
Updated Jun 29, 2021
+ more versions
Cite
Xin He; Linai Kuang; Zhiping Chen; Yihong Tan; Lei Wang (2021). Table_3_Method for Identifying Essential Proteins by Key Features of Proteins in a Novel Protein-Domain Network.XLSX [Dataset]. http://doi.org/10.3389/fgene.2021.708162.s003
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fgene.2021.708162.s003
Dataset updated
Jun 29, 2021
Dataset provided by
Frontiers
Authors
Xin He; Linai Kuang; Zhiping Chen; Yihong Tan; Lei Wang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, because traditional biological experiments are costly and have low accuracy, a growing number of computational models have been proposed to infer potential essential proteins. This paper proposes a prediction method called KFPM. First, a novel protein-domain heterogeneous network is built by combining known protein-protein interactions with known associations between proteins and domains. Next, key topological characteristics extracted from this network and functional characteristics extracted from multiple sources of biological information about proteins are integrated by an improved PageRank algorithm to infer potential essential proteins. Finally, KFPM was compared with 13 state-of-the-art prediction methods. Among the top 1, 5, and 10% of candidate proteins predicted by KFPM, the prediction accuracy reaches 96.08, 83.14, and 70.59%, respectively, significantly outperforming all 13 competing methods. These results suggest that KFPM may be a useful tool for predicting potential essential proteins.
