100+ datasets found
  1. SMART

    • ebi.ac.uk
    Updated Feb 14, 2020
    Cite
    (2020). SMART [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 14, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.

  2. Data from: PROSITE

    • prosite.expasy.org
    • the-mouth.com
    • +6 more
    Updated Jun 18, 2025
    Cite
    (2025). PROSITE [Dataset]. https://prosite.expasy.org/
    Explore at:
    Dataset updated
    Jun 18, 2025
    Description

    PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.

  3. HAMAP

    • ebi.ac.uk
    Updated Feb 5, 2025
    Cite
    (2025). HAMAP [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 5, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies. HAMAP is based at the SIB Swiss Institute of Bioinformatics, Geneva, Switzerland.

  4. SFLD

    • ebi.ac.uk
    Updated Sep 7, 2018
    Cite
    (2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Sep 7, 2018
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.

  5. Prediction of Protein Domain with mRMR Feature Selection and Analysis

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 5, 2023
    Cite
    Bi-Qing Li; Le-Le Hu; Lei Chen; Kai-Yan Feng; Yu-Dong Cai; Kuo-Chen Chou (2023). Prediction of Protein Domain with mRMR Feature Selection and Analysis [Dataset]. http://doi.org/10.1371/journal.pone.0039308
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Bi-Qing Li; Le-Le Hu; Lei Chen; Kai-Yan Feng; Yu-Dong Cai; Kuo-Chen Chou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from sequence information still remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), as well as by incorporating the features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28–40% higher than those achieved by the existing method on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be utilized to study protein signal peptides, B-cell epitopes, HIV protease cleavage sites, among many other important topics in protein science and biomedicine.

  6. Beyond the E-Value: Stratified Statistics for Protein Domain Prediction

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Alejandro Ochoa; John D. Storey; Manuel Llinás; Mona Singh (2023). Beyond the E-Value: Stratified Statistics for Protein Domain Prediction [Dataset]. http://doi.org/10.1371/journal.pcbi.1004509
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Alejandro Ochoa; John D. Storey; Manuel Llinás; Mona Singh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems, that is, those in which statistical tests can be partitioned naturally, controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperforms q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
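
    As a rough illustration of what thresholding on stratified q-values means, the sketch below computes Benjamini-Hochberg adjusted p-values separately within each stratum (here, each domain family) and uses them as simple q-value estimates. This is a simplification under stated assumptions, not the estimator developed by the authors: the function names `bh_qvalues` and `stratified_qvalues`, the family labels, and the p-values are all hypothetical.

```python
from collections import defaultdict

def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, used here as simple q-value estimates."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def stratified_qvalues(hits):
    """hits: list of (stratum, p-value) pairs; q-values are computed per stratum."""
    by_stratum = defaultdict(list)
    for idx, (stratum, p) in enumerate(hits):
        by_stratum[stratum].append((idx, p))
    result = [None] * len(hits)
    for members in by_stratum.values():
        qs = bh_qvalues([p for _, p in members])
        for (idx, _), q in zip(members, qs):
            result[idx] = q
    return result

# Hypothetical hits: (domain family, p-value)
hits = [("famA", 0.01), ("famA", 0.04), ("famB", 0.03), ("famA", 0.03), ("famB", 0.5)]
q = stratified_qvalues(hits)
```

    Because each family is corrected only against its own tests, the same p-value (0.03 above) can receive a different q-value depending on the stratum it belongs to.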

  7. The Encyclopedia of Domains (TED) structural domains assignments for...

    • zenodo.org
    application/gzip, bz2 +1
    Updated Oct 31, 2024
    + more versions
    Cite
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones (2024). The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4 [Dataset]. http://doi.org/10.5281/zenodo.13369203
    Explore at:
    Available download formats: application/gzip, bz2, zip
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andy Lau; Nicola Bordin; Shaun Kandathil; Ian Sillitoe; Vaishali Waman; Jude Wells; Christine Orengo; David T Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset description:

    The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

    In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.

    For all chains in the TED-redundant dataset, the attached file contains boundaries predictions, consensus level and information on the TED100 representative.

    Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes.

    For both TED100 and TEDredundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

    We are making available PDB files for 7,427 novel folds identified during the TED classification process, with an annotation table sorted by novelty.

    Please use the gunzip command to extract files with a '.gz' extension.
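
    As an alternative to decompressing on disk, a '.gz' table can also be streamed line by line. The following is a minimal sketch using Python's standard `gzip` module; the helper name `iter_tsv_gz` and the tiny demo file are made up for illustration and do not reflect the actual column layout of the TED files.

```python
import gzip
import os
import tempfile

def iter_tsv_gz(path):
    """Yield each line of a gzip-compressed TSV file as a list of fields."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")

# Self-contained demo on a tiny, made-up file (not real TED data).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as out:
    out.write("id_1\t10-52\nid_2\t53-288\n")

rows = list(iter_tsv_gz(demo_path))
```

    Streaming avoids holding a multi-gigabyte table in memory, which matters for files with hundreds of millions of rows.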

    CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
    Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.


    This dataset contains:

    • ted_214m_per_chain_segmentation.tsv
      The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
      1. AFDB_model_ID: chain identifier from AFDB in the format AF-
    • ted_365m_domain_boundaries_consensus_level.tsv.gz
      The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
      1. TED_ID: TED domain identifier in the format AF-
    • ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv).
    • ted_324m_seq_clustering.cathlabels.tsv
      The file contains the results of the domain sequences clustering with MMseqs2.
      Columns:
      1. Cluster_representative
      2. Cluster_member
      3. CATH code assignment, if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
      4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
    • novel_folds_set.domain_summary.tsv is sorted by novelty.
      1. ted_id - TED domain identifier in the format AF-
    • Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
      The files contain a header with the following fields. Each column is tab-separated (.tsv).
      1. TED_redundant_id - TED chain identifier in the format AF-
    • novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
    • All per-tool domain boundaries predictions are in the same format with the following columns.
      1. TED_chainID - TED chain identifier in the format AF-
    • Domain boundaries predictions share the same format, with each segment separated by '_' and segment boundaries (start,stop) separated by '-'

      e.g., domain prediction by Merizo for AF-A0A000-F1-model_v4:
      AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

      Merizo predicts one discontinuous domain and one continuous domain:
      Domain 1 (discontinuous): 10-52_289-394
      segment 1: 10-52
      segment 2: 289-394
      Domain 2 (continuous): 53-288
      segment 1: 53-288
    • ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.
    • cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.
    • ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info)
    • gofocus_data.tar.bz2 - GOFocus model weights
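
    The boundary format described above (segments joined by '_', start and stop joined by '-') can be parsed with a few string splits. The sketch below is a hedged illustration, not part of the TED tooling: the function name `parse_chopping` is hypothetical, the use of ',' as the separator between domains is inferred from the Merizo example line, and taking the fifth whitespace-separated field as the boundary string is likewise an assumption based on that example.

```python
def parse_chopping(chopping):
    """Parse a boundary string such as '10-52_289-394,53-288'.

    Returns a list of domains; each domain is a list of (start, stop) segments.
    Assumes ',' separates domains, '_' separates segments, '-' separates bounds.
    """
    domains = []
    for domain in chopping.split(","):
        segments = []
        for segment in domain.split("_"):
            start, stop = segment.split("-")
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains

# Example line from the dataset description (Merizo prediction).
line = ("AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 "
        "394 2 10-52_289-394,53-288 0.90077")
fields = line.split()
domains = parse_chopping(fields[4])  # assumed position of the boundary field

# Residues covered by each domain, treating boundaries as inclusive.
lengths = [sum(stop - start + 1 for start, stop in d) for d in domains]
is_discontinuous = [len(d) > 1 for d in domains]
```

    Keeping each domain as a list of segments preserves the distinction between continuous and discontinuous domains, which a flat list of ranges would lose.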
  8. Performance evaluation per protein domain.

    • datasetcatalog.nlm.nih.gov
    Updated Jul 22, 2024
    Cite
    Chu, Simon K. S.; Narang, Kush; Siegel, Justin B. (2024). Performance evaluation per protein domain. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001352376
    Explore at:
    Dataset updated
    Jul 22, 2024
    Authors
    Chu, Simon K. S.; Narang, Kush; Siegel, Justin B.
    Description

    Protein stability plays a crucial role in a variety of applications, such as food processing, therapeutics, and the identification of pathogenic mutations. Engineering campaigns commonly seek to improve protein stability, and there is a strong interest in streamlining these processes to enable rapid optimization of highly stabilized proteins with fewer iterations. In this work, we explore utilizing a mega-scale dataset to develop a protein language model optimized for stability prediction. ESMtherm is trained on the folding stability of 528k natural and de novo sequences derived from 461 protein domains and can accommodate deletions, insertions, and multiple-point mutations. We show that a protein language model can be fine-tuned to predict folding stability. ESMtherm performs reasonably on small protein domains and generalizes to sequences distal from the training set. Lastly, we discuss our model’s limitations compared to other state-of-the-art methods in generalizing to larger protein scaffolds. Our results highlight the need for large-scale stability measurements on a diverse dataset that mirrors the distribution of sequence lengths commonly observed in nature.

  9. DOMINE: Database of Protein Interactions

    • neuinfo.org
    • dknet.org
    • +2 more
    Updated Sep 30, 2024
    + more versions
    Cite
    (2024). DOMINE: Database of Protein Interactions [Dataset]. http://identifiers.org/RRID:SCR_002399
    Explore at:
    Dataset updated
    Sep 30, 2024
    Description

    Database of known and predicted protein domain (domain-domain) interactions containing interactions inferred from PDB entries, and those that are predicted by 8 different computational approaches using Pfam domain definitions. DOMINE contains a total of 26,219 domain-domain interactions (among 5,410 domains) out of which 6,634 are inferred from PDB entries, and 21,620 are predicted by at least one computational approach. Of the 21,620 computational predictions, 2,989 interactions are high-confidence predictions (HCPs), 2,537 interactions are medium-confidence predictions (MCPs), and the remaining 16,094 are low-confidence predictions (LCPs). (May 2014)
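
    The counts quoted above can be cross-checked with simple arithmetic. The short sketch below assumes that the 26,219 total is the union of the PDB-inferred and computationally predicted sets; under that assumption, inclusion-exclusion gives the number of interactions that appear in both. The variable names are illustrative only.

```python
# Counts as quoted in the DOMINE description (May 2014).
total = 26_219        # all domain-domain interactions
from_pdb = 6_634      # inferred from PDB entries
predicted = 21_620    # predicted by at least one computational approach
hcp, mcp, lcp = 2_989, 2_537, 16_094  # high/medium/low-confidence predictions

# The three confidence tiers should add up to the predicted set.
tiers_sum = hcp + mcp + lcp

# Assuming total = |PDB ∪ predicted|, the overlap follows by inclusion-exclusion.
overlap = from_pdb + predicted - total
```

    Under that assumption, `overlap` is the number of PDB-inferred interactions that were also recovered by at least one prediction method.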

  10. Improvement in Protein Domain Identification Is Reached by Breaking...

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Juliana Bernardes; Gerson Zaverucha; Catherine Vaquero; Alessandra Carbone (2023). Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence [Dataset]. http://doi.org/10.1371/journal.pcbi.1005038
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Juliana Bernardes; Gerson Zaverucha; Catherine Vaquero; Alessandra Carbone
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, attempts for annotation fail. Here we address the fundamental question of domain identification for highly divergent proteins. By using high performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: 1. for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences, 2. a decision-making protocol combines outcomes obtained from multiple models, 3. a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins against 63% achieved previously, corresponding to 30% of improvement over the total number of Pfam domain predictions on the whole genome. 
    The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE.

  11. Dissection of the role of a SH3 domain in the evolution of binding...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Jul 17, 2025
    Cite
    Pascale Lemieux; David Bradley; Alexandre K. Dubé; Christian Landry (2025). Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins [Dataset]. http://doi.org/10.5061/dryad.sj3tx968m
    Explore at:
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Pascale Lemieux; David Bradley; Alexandre K. Dubé; Christian Landry
    Time period covered
    Jan 1, 2023
    Description

    Protein-protein interactions drive many cellular processes. Some protein interactions are directed by Src homology 3 (SH3) domains that bind proline-rich motifs on other proteins. The evolution of the binding specificity of SH3 domains is not completely understood, particularly following gene duplication. Paralogous genes accumulate mutations that can modify protein functions and, for SH3 domains, their binding preferences. Here, we examined how the binding of the SH3 domains of two paralogous yeast type I myosins, Myo3 and Myo5, evolved following duplication. We found that the paralogs have subtly different SH3-dependent interaction profiles. However, by swapping SH3 domains between the paralogs and characterizing the SH3 domains freed from their protein context, we find that few of the differences in interactions, if any, depend on the SH3 domains themselves. We used ancestral sequence reconstruction to resurrect the pre-duplication SH3 domains and examined, moving back in time, how t...

    The data published in this dataset was collected by multiple methods. Among the methods used are DHFR Protein-fragment Complementation Assay, cytometry, ancestral sequence reconstruction with IQ-TREE and FastML, protein structure prediction with AlphaFold2 and AlphaFold Multimer, molecular docking with Haddock2.4, orthology analysis and coevolution predictions with EVCouplings. See the README.md file and the method section of the paper "Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins" for more details.

    • File S1: Tables S1 - S12
    • File S2: Detailed protocols
    • FiguresS: Figures S1 - S10
    • DataS1: DHFR PCA results
    • DataS2: Phylogeny and sequence alignment
    • DataS3: AlphaFold results
    • DataS4: Molecular docking input and output files
    • DataS5: Orthology input and motif conservation results
    • DataS6: EVCouplings output

    Please refer to Lemieux et al. 2023 for details on the data collection and transformation.

    All files can be opened with either R, a text editor, Excel or ChimeraX.

    This README file was generated on 2023-09-19 by Pascale Lemieux.

    GENERAL INFORMATION

    1. Title of Dataset: Data from : Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins
    2. Author Information A. Principal Investigator Contact Information Name: Christian Landry Institution: Université Laval, Québec CA Email: christian.landry@bio.ulaval.ca

      B. Associate or Co-investigator Contact Information Name: Pascale Lemieux Institution: Université Laval, Québec, CA Email: pascale.lemieux.4@ulaval.ca

    3. Date of data collection (single date, range, approximate date): 2020-2023

    4. Information about funding sources that supported the collection of the data: Canadian Institutes of Health Research (CIHR) Foundation grant 387697 and a HFSP grant (RGP0034/2018) to CRL

    SHARING/ACCESS INFORMATION

    1. Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain
    2. Links to publications that cite or use ...
  12. Pfam seed random split

    • kaggle.com
    Updated Apr 19, 2019
    Cite
    Google Research (2019). Pfam seed random split [Dataset]. https://www.kaggle.com/googleai/pfam-seed-random-split/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 19, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Google Research
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Problem description

    This directory contains data to train a model to predict the function of protein domains, based on the PFam dataset.

    Domains are functional sub-parts of proteins; much like images in ImageNet are pre segmented to contain exactly one object class, this data is presegmented to contain exactly and only one domain.

    The purpose of the dataset is to repose the PFam seed dataset as a multiclass classification machine learning task.

    The task is: given the amino acid sequence of the protein domain, predict which class it belongs to. There are about 1 million training examples, and 18,000 output classes.

    Data structure

    This data is more completely described by the publication "Can Deep Learning Classify the Protein Universe", Bileschi et al.

    Data split and layout

    The approach used to partition the data into training/dev/testing folds is a random split.

    • Training data should be used to train your models.
    • Dev (development) data should be used in a close validation loop (maybe for hyperparameter tuning or model validation).
    • Test data should be reserved for much less frequent evaluations - this helps avoid overfitting on your test data, as it should only be used infrequently.

    File content

    Each fold (train, dev, test) has a number of files in it. Each of those files contains csv on each line, which has the following fields:

    sequence: HWLQMRDSMNTYNNMVNRCFATCIRSFQEKKVNAEEMDCTKRCVTKFVGYSQRVALRFAE 
    family_accession: PF02953.15
    sequence_name: C5K6N5_PERM5/28-87
    aligned_sequence: ....HWLQMRDSMNTYNNMVNRCFATCI...........RS.F....QEKKVNAEE.....MDCT....KRCVTKFVGYSQRVALRFAE 
    family_id: zf-Tim10_DDP
    

    Description of fields:

    • sequence: These are usually the input features to your model. Amino acid sequence for this domain. There are 20 very common amino acids (frequency > 1,000,000), and 5 amino acid codes that are quite uncommon: X, U, B, O, Z.
    • family_accession: These are usually the labels for your model. Accession number in the form PFxxxxx.y (Pfam), where xxxxx is the family accession, and y is the version number. Some values of y are greater than ten, and so 'y' has two digits.
    • family_id: One-word name for the family.
    • sequence_name: Sequence name, in the form "$uniprot_accession_id/$start_index-$end_index".
    • aligned_sequence: Contains a single sequence from the multiple sequence alignment (with the rest of the members of the family in seed), with gaps retained.

    Generally, the family_accession field is the label, and the sequence (or aligned sequence) is the training feature.

    This sequence corresponds to a domain, not a full protein.

    The contents of these fields are the same as the data provided in Stockholm format by PFam at ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.seed.gz
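
    The field conventions described above can be checked on the example record. The sketch below is a hedged illustration rather than an official loader: the helper names `parse_sequence_name` and `parse_family_accession` are hypothetical, and the record is entered as a dictionary so that no assumption is made about the column order in the CSV files. It treats the start and end indices as 1-based and inclusive, which is an assumption that happens to agree with the length of the example sequence.

```python
# Example record copied from the dataset description.
record = {
    "sequence": "HWLQMRDSMNTYNNMVNRCFATCIRSFQEKKVNAEEMDCTKRCVTKFVGYSQRVALRFAE",
    "family_accession": "PF02953.15",
    "sequence_name": "C5K6N5_PERM5/28-87",
    "aligned_sequence": "....HWLQMRDSMNTYNNMVNRCFATCI...........RS.F....QEKKVNAEE"
                        ".....MDCT....KRCVTKFVGYSQRVALRFAE",
    "family_id": "zf-Tim10_DDP",
}

def parse_sequence_name(name):
    """Split '$uniprot_accession_id/$start_index-$end_index' into its parts."""
    accession, _, span = name.rpartition("/")
    start, end = span.split("-")
    return accession, int(start), int(end)

def parse_family_accession(accession):
    """Split 'PFxxxxx.y' into the family accession and the integer version."""
    family, _, version = accession.partition(".")
    return family, int(version)

uniprot_id, start, end = parse_sequence_name(record["sequence_name"])
family, version = parse_family_accession(record["family_accession"])

# Consistency checks: span length matches the sequence (inclusive indices assumed),
# and removing gap characters from the aligned sequence recovers the sequence.
span_matches = (end - start + 1) == len(record["sequence"])
ungapped_matches = record["aligned_sequence"].replace(".", "") == record["sequence"]
```

    Dropping the version suffix, as `parse_family_accession` does, may be convenient when the label should stay stable across Pfam releases; whether to do so depends on the task.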


    License

    Creative Commons Legal Code

    CC0 1.0 Universal


  13. Exploiting Amino Acid Composition for Predicting Protein-Protein...

    • plos.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Sushmita Roy; Diego Martinez; Harriett Platero; Terran Lane; Margaret Werner-Washburne (2023). Exploiting Amino Acid Composition for Predicting Protein-Protein Interactions [Dataset]. http://doi.org/10.1371/journal.pone.0007813
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Sushmita Roy; Diego Martinez; Harriett Platero; Terran Lane; Margaret Werner-Washburne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background

    Computational prediction of protein interactions typically uses protein domains as classifier features because they capture conserved information of interaction surfaces. However, approaches relying on domains as features cannot be applied to proteins without any domain information. In this paper, we explore the contribution of pure amino acid composition (AAC) for protein interaction prediction. This simple feature, which is based on normalized counts of single or pairs of amino acids, is applicable to proteins from any sequenced organism and can be used to compensate for the lack of domain information.

    Results

    AAC performed at par with protein interaction prediction based on domains on three yeast protein interaction datasets. Similar behavior was obtained using different classifiers, indicating that our results are a function of features and not of classifiers. In addition to yeast datasets, AAC performed comparably on worm and fly datasets. Prediction of interactions for the entire yeast proteome identified a large number of novel interactions, the majority of which co-localized or participated in the same processes. Our high confidence interaction network included both well-studied and uncharacterized proteins. Proteins with known function were involved in actin assembly and cell budding. Uncharacterized proteins interacted with proteins involved in reproduction and cell budding, thus providing putative biological roles for the uncharacterized proteins.

    Conclusion

    AAC is a simple, yet powerful feature for predicting protein interactions, and can be used alone or in conjunction with protein domains to predict new and validate existing interactions. More importantly, AAC alone performs at par with existing, but more complex, features indicating the presence of sequence-level information that is predictive of interaction, but which is not necessarily restricted to domains.

  14. Data from: Coevolution of Drosophila-type timeless with partner clock...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Mar 10, 2025
    Cite
    David Dolezel; Bullo Enrico; Chen Ping; Smýkal Vlastimil; Fiala Ivan (2025). Coevolution of Drosophila-type timeless with partner clock proteins [Dataset]. http://doi.org/10.5061/dryad.44j0zpcq0
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 10, 2025
    Dataset provided by
    Czech Academy of Sciences, Biology Centre
    Authors
    David Dolezel; Bullo Enrico; Chen Ping; Smýkal Vlastimil; Fiala Ivan
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Drosophila-type timeless (dTIM) is an established key clock protein in fruit flies, regulating rhythmicity and light-mediated entrainment. However, as indicated by functional experiments, its contribution to the clock differs in various insects. Therefore, we conducted a comprehensive phylogenetic analysis of dTIM across animals and dated its origin, gene duplications, and losses. We identified variable and conserved protein domains and pinpointed animal lineages that underwent the biggest changes in the dTIM sequence. While dTIM modifications are only mildly affected by changes in the PER protein, even the complete loss of PER in echinoderms had no impact on dTIM. However, changes in dTIM always co-occur with the loss of CRYPTOCHROMES or JETLAG. This is exemplified by the remarkably accelerated evolution of dTIM in phylloxera and aphids. Finally, alternative d-tim splicing, characteristic of D. melanogaster temperature-dependent function, is conserved at least to some extent in Diptera, albeit with unique alterations. Altogether, this study pinpoints major changes that shaped dTIM origin and evolution.

    Methods

    Gene Identification and Data Sets

    A systematic search for clock components was conducted, building on previous studies exploring the evolution, duplication, and loss of circadian clock components (Thakkar et al., 2022; Kotwica-Rolinska et al., 2022a). To identify circadian clock proteins and genes encoding dTIM, mTIM, TOF1, PERIOD, dCRY, mCRY, 6-4 Photolyase, JETLAG, and BRWD3 in Bilateria/Metazoa, GenBank (NCBI) protein and genomic databases, as well as transcriptome shotgun assemblies (TSA), were utilized. BLASTP and tBLASTn algorithms were applied with taxon-restricted searches targeting specific lineages at the levels of orders, suborders, infraorders, and, in some cases, families. In certain instances, annotated genomes or whole-genome shotgun contigs (wgs) were also examined. Protein sequences of circadian clock genes from Drosophila melanogaster, Danaus plexippus, and Pyrrhocoris apterus served as initial queries. Reciprocal and lineage-focused searches incorporated queries representing identified proteins from related taxa. To detect duplicate hits (common in TSAs) or closely related proteins (e.g., bHLH-PAS proteins instead of PERIOD), the E-INS-i algorithm in MAFFT was used for alignment, followed by FastTree analysis (Price et al., 2009), both conducted in Geneious Prime 21.0.3 (Biomatters, New Zealand).

    Phylogenetic analyses

    To identify specific types of TIM, CRY, FBXL, or PAS proteins (e.g., distinguishing PER from bHLH-PAS proteins), sequences were aligned using the MAFFT algorithm in Geneious Prime 21.0.3 (Biomatters, New Zealand). Representative datasets containing target proteins and related types were included. Ambiguously aligned regions were trimmed, and phylogenetic analyses were conducted using RAxML with a maximum likelihood GAMMA-based model in Geneious Prime 21.0.3. The Metazoan phylogenies presented in Figures 2, 4, 5, 6, 7, 8 and S9 were retrieved using TIMETREE 5 (Kumar et al., 2022) and cross-referenced with recent molecular phylogenomic studies. These included works focused on insects (Misof et al., 2014), chelicerates, and crustaceans (von Reumont et al., 2012; Thomas et al., 2020; Bernot et al., 2023). For insects, phylogenomic studies specific to Polyneoptera (Wipfler et al., 2019), the hemipteroid assembly (Johnson et al., 2018), and Coleoptera (McKenna et al., 2019) were used to refine the corresponding sections of the phylogeny.

    Gene Loss

    While it is impossible to definitively prove the absence of a gene, in some cases gene loss is the most plausible explanation. Recent advances in phylogenomics, the availability of extensive TSA data, and an increasing number of sequenced genomes have enabled a systematic exploration of circadian clock genes across major Bilateria groups (Protostomia and Deuterostomia). Our analysis focuses on lineage-specific gene losses that are strongly supported by data from multiple species, whole-genome assemblies, and deep transcriptome sequencing. Evidence for gene loss is summarized for specific genes and animal groups in Supplementary Table 1. Finally, the CTT-like motif, located on the ARM2 domain, and the CTT motif, located on dCRY, were annotated on D. melanogaster dTIM and dCRY sequences following the boundaries defined by Lin et al. (2023). After MAFFT alignment with the other dTIM sequences in the dataset, it was considered conserved when the degree of similarity exceeded 50%.

    Prediction of Protein Domains

    dTIM functional and binding domains were originally annotated based on the sequence of Drosophila melanogaster dTIM, following the boundaries defined by Lin et al. (2023) or identified in cell-based experiments (Saez and Young, 1996). Accordingly, the dCRY-interaction domains, ARMADILLO repeats (ARM1 and ARM2), and PER-binding sites 1 (PER-bind #1) and 2 (PER-bind #2) were annotated in the D. melanogaster dTIM protein isoform P (NP_001334730). This annotated sequence was then aligned to each sequence in the dataset using MAFFT (Katoh and Standley, 2013, v7.490, algorithm: E-INS-i, scoring matrix: BLOSUM80) within the Geneious Prime 2024 software. The obtained similarities were plotted as intensities corresponding to numerical values in Figures 2 and 3. The PER-bind #1 domain was annotated when the total length of the region was at least 35 amino acids and the degree of similarity exceeded 40%. The CRY-interaction domain was highlighted when similarities exceeded 20%. In the gene models (Figures 6 and 7), only two shades were used: regions corresponding to ARM1, ARM2, and PER-binding sites were highlighted when similarity exceeded 40%. For Limulus, a value of 38% was depicted as "low similarity" using a paler shade of red. The CRY-interaction domain was highlighted when similarity exceeded 20%. For Bemisia, "low similarity" was represented by a paler shade of blue, highlighting values of 19%. In the supplementary figures depicting protein models (Figures S2, S4, S6), specific numerical values were presented next to domain annotations when a single shade was used for each domain. Nuclear localization signal (NLS) domains were predicted for each protein sequence using PSORT II and NLStradamus prediction software.

    Acidic domains were annotated based on the following criteria: a sequence located in the corresponding protein region that is at least half the size of the reference sequence, contains at least 20% acidic residues (D, E), and includes less than 10% basic residues (K, H, R). To further investigate the acidic domains, their sequence motifs were scanned. Conserved acidic motifs, presented in Figure S5, were identified using Gapped Local Alignment of Motifs (GLAM2) within the MEME Suite. This analysis focused on two distinct regions of 35 insect dTIM proteins: the sequences between ARM1 and ARM2 and the region covering the C-terminal tails.

    The length of the entire protein, ARM1-ARM2 region, and C-tail length

    The protein sequence was considered (likely) complete when the N-terminal part included a complete ARM1 domain starting with a methionine and the ARM2 domain was present. If a TSA (transcriptome shotgun assembly) sequence was used, the stop codon indicated the predicted C-terminal end. In the case of multiple paralogs in Daphnia, only protein sequences longer than 600 amino acids (aa) and containing both ARM domains were further analyzed. To determine the length of the variable region (in amino acids) in the central part of the protein, all protein sequences in the dataset were aligned using MAFFT (Katoh and Standley, 2013, v7.490, algorithm: E-INS-i, scoring matrix: BLOSUM80). Conserved motifs corresponding to the Drosophila melanogaster ARM1 and ARM2 domains were identified and the number of aa separating ARM1 and ARM2 was calculated. Additionally, conserved motifs corresponding to Drosophila YKDQ (located in ARM1) and LLLR (located in PER-bind #2 and ARM2) were identified in each dTIM protein as a parallel measurement. To measure the length of the C-terminal tail, a conserved motif corresponding to Drosophila DLIE (located at the C-terminal end of ARM2) was identified in each dTIM. The number of amino acids between the DLIE-like motif and the C-terminus was then calculated and plotted (Figure 2E). For motif positions, see Figure S2. These values were plotted as dots representing each sequence distance in PRISM 7 for all proteins in the dataset, with exact values provided in Table S2. If a species lacked one or more conserved motifs (due to partial sequences or deletions), the length of the corresponding region was not calculated and was annotated as “n.d.” (not determined) in Supplementary Table S2.

    Substitutions per Amino Acid per Million Years

    To calculate substitution rates per amino acid position for dTIM and PER proteins, we used the following approach. Protein sequences (dTIM or PER) were aligned using the E-INS-i algorithm in MAFFT (Katoh and Standley 2013). The complete alignments were used to infer phylogenetic trees with RAxML (Stamatakis 2014) under the PROTGAMMAJTT model, ensuring the topology matched the evolutionary relationships of the organisms. Constraint trees were created in TreeGraph 2 (Stöver and Müller 2010). For Crustacea, where phylogenies are still debated, we enforced monophyly with insects but did not specify internal branching. Similarly, Polyneoptera were constrained as monophyletic without defining their internal topologies. The resulting unrooted trees were swapped to position Protostomia and Deuterostomia as sister groups. Branch lengths were extracted and summed from the Protostomia/Deuterostomia split to terminal species. These values were then divided by 700 million years (the estimated divergence time of Protostomia and Deuterostomia) to compute substitution rates per amino acid position per million years. The calculated rates are presented in Supplementary Tables 3 and 4. These
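
Two of the numeric rules in the Methods text above are simple enough to sketch in Python: the acidic-domain criteria (at least half the reference length, at least 20% D/E, less than 10% K/H/R) and the substitution rate (summed branch lengths divided by 700 million years). The function names and signatures are hypothetical, and the sketch assumes the candidate region and the branch lengths have already been extracted by other tools; it is not the authors' pipeline.

```python
ACIDIC = set("DE")
BASIC = set("KHR")


def is_acidic_domain(candidate, reference_length):
    """Apply the three acidic-domain criteria quoted in the Methods text.

    candidate: residues of the region under test (hypothetical input).
    reference_length: length of the reference acidic domain, in residues.
    """
    if not candidate or reference_length <= 0:
        return False
    candidate = candidate.upper()
    n = len(candidate)
    long_enough = n >= reference_length / 2        # at least half the reference size
    acidic_fraction = sum(aa in ACIDIC for aa in candidate) / n
    basic_fraction = sum(aa in BASIC for aa in candidate) / n
    return long_enough and acidic_fraction >= 0.20 and basic_fraction < 0.10


def substitution_rate(branch_lengths, divergence_my=700.0):
    """Substitutions per amino acid position per million years.

    branch_lengths: branch lengths along the path from the
    Protostomia/Deuterostomia split to one terminal species.
    divergence_my: divergence time in million years (700 in the text).
    """
    return sum(branch_lengths) / divergence_my
```

For example, a path with branch lengths 0.35, 0.7 and 0.35 sums to 1.4 substitutions per site, which gives 1.4 / 700 = 0.002 substitutions per site per million years.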

  15. e

    GOLD domain superfamily

    • ebi.ac.uk
    Updated Oct 13, 2017
    + more versions
    Cite
    (2017). GOLD domain superfamily [Dataset]. https://www.ebi.ac.uk/interpro/entry/interpro/IPR036598
    Explore at:
    Dataset updated
    Oct 13, 2017
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The GOLD (for Golgi dynamics) domain is a protein module found in several eukaryotic Golgi and lipid-traffic proteins. It is typically between 90 and 150 amino acids long. Most of the size difference observed in the GOLD-domain superfamily is traceable to a single large low-complexity insert that is seen in some versions of the domain. With the exception of the p24 proteins, which have a simple architecture with the GOLD domain as their only globular domain, all other GOLD-domain proteins contain additional conserved globular domains. In these proteins, the GOLD domain co-occurs with lipid-, sterol- or fatty acid-binding domains such as PH, CRAL-TRIO, FYVE, oxysterol-binding and acyl-CoA-binding domains, suggesting that these proteins may interact with membranes. The GOLD domain can also be found associated with a RUN domain, which may have a role in the interaction of various proteins with cytoskeletal filaments. The GOLD domain is predicted to mediate diverse protein-protein interactions. A secondary structure prediction for the GOLD domain reveals that it is likely to adopt a compact all-β-fold structure with six to seven strands. Most of the sequence conservation is centred on the hydrophobic cores that support these predicted strands. The predicted secondary-structure elements and the size of the conserved core of the domain suggest that it may form a β-sandwich fold with the strands arranged in two β-sheets stacked on each other.

    Some proteins known to contain a GOLD domain are listed below:

    • Eukaryotic proteins of the p24 family.
    • Animal Sec14-like proteins. They are involved in secretion.
    • Human Golgi resident protein GCP60. It interacts with the Golgi integral membrane protein Giantin.
    • Yeast oxysterol-binding protein homologue 3 (OSH3).
    • Plant Patellin 1-6. It may be involved in membrane-trafficking events.

  16. f

    Table_2_Method for Identifying Essential Proteins by Key Features of...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 8, 2023
    + more versions
    Cite
    Xin He; Linai Kuang; Zhiping Chen; Yihong Tan; Lei Wang (2023). Table_2_Method for Identifying Essential Proteins by Key Features of Proteins in a Novel Protein-Domain Network.XLSX [Dataset]. http://doi.org/10.3389/fgene.2021.708162.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    Frontiers
    Authors
    Xin He; Linai Kuang; Zhiping Chen; Yihong Tan; Lei Wang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, because traditional biological experiments are costly and of limited accuracy, a growing number of computational models have been proposed to infer potential essential proteins. In this paper, a novel prediction method called KFPM is proposed. First, a novel protein-domain heterogeneous network is established by combining known protein-protein interactions with known associations between proteins and domains. Next, key topological characteristics extracted from the newly constructed protein-domain network and functional characteristics extracted from multiple sources of biological information on proteins are integrated by a new computational method, based on an improved PageRank algorithm, to infer potential essential proteins. Finally, to evaluate the performance of KFPM, we compared it with 13 state-of-the-art prediction methods. Experimental results show that, among the top 1, 5, and 10% of candidate proteins predicted by KFPM, the prediction accuracy reaches 96.08, 83.14, and 70.59%, respectively, significantly outperforming all 13 competing methods. This suggests that KFPM may be a useful tool for predicting potential essential proteins.
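
The evaluation described above reports accuracy among the top 1, 5, and 10% of ranked candidates. A minimal Python sketch of that kind of top-percent metric follows. It assumes that "accuracy" means the fraction of top-ranked candidates that appear in a known essential-protein set and that the cut-off is rounded up to a whole protein; the function name and inputs are hypothetical, and the sketch does not reproduce KFPM's scoring or its improved PageRank step.

```python
import math


def top_percent_accuracy(scores, essential, percent):
    """Fraction of the top `percent`% ranked proteins that are essential.

    scores: dict mapping protein ID to predicted score (hypothetical input).
    essential: set of protein IDs known to be essential.
    percent: cut-off as a percentage, e.g. 1, 5, or 10.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, math.ceil(len(ranked) * percent / 100))  # rounding up is an assumption
    top = ranked[:k]
    hits = sum(1 for protein in top if protein in essential)
    return hits / k
```

With four scored proteins of which the first- and third-ranked are essential, a 50% cut-off keeps two candidates and one hit, so the function returns 0.5.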

  17. d

    A dataset for predicting protein-protein interactions in humans

    • search.dataone.org
    • datadryad.org
    Updated Sep 17, 2025
    Cite
    Jing Zhang; Ian R. Humphrey; Jimin Pei; Jinuk Kim; Chulwon Choi; Rongqing Yuan; Jesse Durham; Siqi Liu; Hee-Jung Choi; Minkyung Baek; David Baker; Qian Cong (2025). A dataset for predicting protein-protein interactions in humans [Dataset]. http://doi.org/10.5061/dryad.15dv41p84
    Explore at:
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Jing Zhang; Ian R. Humphrey; Jimin Pei; Jinuk Kim; Chulwon Choi; Rongqing Yuan; Jesse Durham; Siqi Liu; Hee-Jung Choi; Minkyung Baek; David Baker; Qian Cong
    Description

    Protein-protein interactions (PPIs) are fundamental to biological function. While recent advances in coevolutionary analysis and deep learning (DL)-based structure prediction have enabled large-scale PPI identification in bacterial and yeast proteomes, their application to the more complex human proteome has remained limited. To address this challenge, we 1) enhanced coevolutionary signals by generating 7-fold deeper multiple sequence alignments (MSAs) from 30 petabytes of unassembled genomic data, and 2) developed a new DL model trained on augmented datasets of domain-domain interactions derived from 200 million predicted protein structures. These improvements led to a 4-fold increase in the performance of our de novo PPI prediction pipeline for human proteins. We systematically screened around 190 million human protein pairs and predicted 17,849 high-confidence PPIs at an estimated precision of 90%, including 3,631 interactions not previously detected by experimental methods. The resu...

    Dataset DOI: 10.5061/dryad.15dv41p84

    Description of the data and file structure

    protein_omicMSAs.tar.gz (17 GB)

    These MSAs are in an A3M-like format. Compared to the standard A3M format, we inserted an additional sequence at the beginning, named “mask,” to indicate the alignment quality at each position. In this “mask,” an asterisk (*) indicates high-quality positions, and a dash (-) indicates low-quality positions (these are poorly conserved and thus cannot be reliably assembled from genomic data). We recommend using only the high-quality positions (marked with *), as we did in our work. Insertions relative to the human (query) sequence are represented by lowercase letters. Each sequence corresponds to one draft genome or genomic dataset, and the NCBI accession number of the genome or dataset is used to name the sequence in the header. We also include the taxonomic information of...
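
The recommendation above, to use only the positions marked with an asterisk, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' tooling: it assumes the first record's header begins with the word "mask", that the mask has one character per aligned (non-insertion) column, and that lowercase insertions are dropped before filtering. The function names are hypothetical.

```python
def read_a3m_like(text):
    """Parse FASTA/A3M-style text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def keep_high_quality(records):
    """Keep only the columns that the leading "mask" record marks with '*'."""
    if not records or records[0][0].split()[:1] != ["mask"]:
        raise ValueError("expected the first record to be named 'mask'")
    mask = records[0][1]
    keep = [i for i, flag in enumerate(mask) if flag == "*"]
    filtered = []
    for name, seq in records[1:]:
        # Lowercase letters are insertions relative to the query; drop them
        # so that the remaining characters line up with the mask.
        aligned = "".join(c for c in seq if not c.islower())
        if len(aligned) != len(mask):
            raise ValueError(f"aligned length of {name!r} does not match the mask")
        filtered.append((name, "".join(aligned[i] for i in keep)))
    return filtered
```

Calling `keep_high_quality(read_a3m_like(text))` on the contents of one alignment file would return the query and homolog rows restricted to the high-quality columns, ready for downstream coevolution analysis.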

  18. f

    Results from the assembly software.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Tiange Lang; Kangquan Yin; Jinyu Liu; Kunfang Cao; Charles H. Cannon; Fang K. Du (2023). Results from the assembly software. [Dataset]. http://doi.org/10.1371/journal.pone.0108719.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Tiange Lang; Kangquan Yin; Jinyu Liu; Kunfang Cao; Charles H. Cannon; Fang K. Du
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FA, FT, FL and FF stand for Ficus altissima, Ficus tinctoria, Ficus langkokensis and Ficus fistulosa, respectively.
    #fastq reads: number of fastq reads from Illumina HiSeq 2000.
    #contig_250: number of predicted contigs longer than 250 base pairs.
    max_len (bp): length in base pairs (bp) of the longest predicted contig.
    #pep: number of peptides predicted.
    max_len (aa): length in amino acids (aa) of the longest predicted peptide.
    #LRRNT_2: number of LRRNT_2 domains predicted.
    #LRR_8: number of LRR_8 domains predicted.

  19. r

    DBD: Transcription factor prediction database

    • rrid.site
    • dknet.org
    • +1more
    Updated Aug 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). DBD: Transcription factor prediction database [Dataset]. http://identifiers.org/RRID:SCR_002300
    Explore at:
    Dataset updated
    Aug 23, 2025
    Description

    Database of predicted transcription factors in completely sequenced genomes. The predicted transcription factors all contain assignments to sequence specific DNA-binding domain families. The predictions are based on domain assignments from the SUPERFAMILY and Pfam hidden Markov model libraries. Benchmarks of the transcription factor predictions show they are accurate and have wide coverage on a genomic scale. The DBD consists of predicted transcription factor repertoires for 930 completely sequenced genomes.

  20. d

    Data from: Functional associations of proteins in entire genomes by means of...

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    Updated Sep 7, 2025
    Cite
    National Institutes of Health (2025). Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions [Dataset]. https://catalog.data.gov/dataset/functional-associations-of-proteins-in-entire-genomes-by-means-of-exhaustive-detection-of-
    Explore at:
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background

    It has recently been shown that the detection of gene fusion events across genomes can be used for predicting functional associations of proteins, including physical interaction or complex formation. To obtain such predictions we have made an exhaustive search for gene fusion events within 24 available completely sequenced genomes.

    Results

    Each genome was used as a query against the remaining 23 complete genomes to detect gene fusion events. Using an improved, fully automatic protocol, a total of 7,224 single-domain proteins that are components of gene fusions in other genomes were detected, many of which were identified for the first time. The total number of predicted pairwise functional associations is 39,730 for all genomes. Component pairs were identified by virtue of their similarity to 2,365 multidomain composite proteins. We also show for the first time that gene fusion is a complex evolutionary process with a number of contributory factors, including paralogy, genome size and phylogenetic distance. On average, 9% of genes in a given genome appear to code for single-domain, component proteins predicted to be functionally associated. These proteins are detected by an additional 4% of genes that code for fused, composite proteins.

    Conclusions

    These results provide an exhaustive set of functionally associated genes and also delineate the power of fusion analysis for the prediction of protein interactions.
