100+ datasets found
  1. CATH-Gene3D

    • ebi.ac.uk
    Updated Oct 21, 2020
    Cite
    (2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Oct 21, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.

  2. EukProt v3 Pfam domain annotations (Pfam version 34.0)

    • figshare.com
    application/x-gzip
    Updated Nov 30, 2023
    Cite
    Alex Galvez Morante; Daniel Richter (2023). EukProt v3 Pfam domain annotations (Pfam version 34.0) [Dataset]. http://doi.org/10.6084/m9.figshare.24680811.v1
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    figshare
    Authors
    Alex Galvez Morante; Daniel Richter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pfam domains predicted on all EukProt v3 datasets (one output file per dataset). Domains were predicted with InterProScan version 5.56 and InterPro version 89.0 (which includes the Pfam database version 34.0; note that this is no longer the most recent version of Pfam), with default parameter values.

  3. Data from: Pfam

    • bioregistry.io
    Updated Apr 22, 2021
    + more versions
    Cite
    (2021). Pfam [Dataset]. http://identifiers.org/re3data:r3d100012850
    Explore at:
    Dataset updated
    Apr 22, 2021
    License

    https://bioregistry.io/spdx:CC0-1.0

    Description

    The Pfam database contains information about protein domains and families. For each entry, a protein sequence alignment and a hidden Markov model are stored.

  4. SUPERFAMILY

    • ebi.ac.uk
    Updated Nov 8, 2010
    Cite
    (2010). SUPERFAMILY [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Nov 8, 2010
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY is based at the University of Bristol, UK.

  5. Tangled history of a multigene family: The evolution of...

    • plos.figshare.com
    • omicsdi.org
    pdf
    Updated May 31, 2023
    Cite
    Kanae Nishii; Frank Wright; Yun-Yu Chen; Michael Möller (2023). Tangled history of a multigene family: The evolution of ISOPENTENYLTRANSFERASE genes [Dataset]. http://doi.org/10.1371/journal.pone.0201198
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Kanae Nishii; Frank Wright; Yun-Yu Chen; Michael Möller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ISOPENTENYLTRANSFERASE (IPT) genes play important roles in the initial steps of cytokinin synthesis, exist in plants and pathogenic bacteria, and form a multigene family in plants. Protein domain searches revealed that bacterial and plant IPT proteins were assigned to different protein domain families in the Pfam database, namely the Pfam IPT (IPTPfam) and Pfam IPPT (IPPTPfam) families, which are closely related in the P-loop NTPase clan. To understand the origin and evolution of the genes, a species matrix was assembled across the tree of life and intensively in plant lineages. The IPTPfam domain was found in only a few bacterial lineages, whereas IPPTPfam is common except in Archaea and Mycoplasma bacteria. The bacterial IPPTPfam domain miaA genes were shown to be ancestral to eukaryotic IPPTPfam domain genes. Plant IPTs diversified into class I tRNA-IPTs, class II tRNA-IPTs, and Adenosine-phosphate IPTs; the class I tRNA-IPTs appeared to represent direct successors of miaA genes and were found in all plant genomes, whereas class II tRNA-IPTs originated from eukaryotic genes and were found in prasinophyte algae and in euphyllophytes. Adenosine-phosphate IPTs were only found in angiosperms. Gene duplications resulted in gene redundancies with ubiquitous expression or diversification in expression. In conclusion, it is shown that IPT genes have a complex history prior to the protein family split, and might have experienced losses or HGTs, and gene duplications that are likely to be correlated with the rise in morphological complexity involved in fine-tuning cytokinin production.

  6. Data from: Pfam

    • neuinfo.org
    • dknet.org
    • +2 more
    Updated Sep 18, 2007
    Cite
    (2007). Pfam [Dataset]. http://identifiers.org/RRID:SCR_004726
    Explore at:
    Dataset updated
    Sep 18, 2007
    Description

    A database of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Users can analyze protein sequences for Pfam matches, view Pfam family annotation and alignments, see groups of related families, look at the domain organization of a protein sequence, find the domains on a PDB structure, and query Pfam by keywords. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high-quality, manually curated families. A supplement to Pfam-A is generated automatically using the ADDA database; these automatically generated entries are called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. Pfam also generates higher-level groupings of related families, known as clans (collections of Pfam-A entries that are related by similarity of sequence, structure or profile-HMM).

  7. SMART

    • ebi.ac.uk
    Updated Feb 14, 2020
    Cite
    (2020). SMART [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 14, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.

  8. active_sites v1.0: enzyme domain sequences with annotated active sites from...

    • zenodo.org
    application/gzip
    Updated Jul 18, 2025
    Cite
    Andrzej Zielezinski; Adam Gudyś; Sebastian Deorowicz (2025). active_sites v1.0: enzyme domain sequences with annotated active sites from Pfam v37.1 for benchmarking MSA tools [Dataset]. http://doi.org/10.5281/zenodo.16023627
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jul 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Zielezinski; Adam Gudyś; Sebastian Deorowicz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 772 enzyme domain families annotated with 1,376 active sites in Pfam v37.1.

    Protein sequences containing Pfam domains were retrieved from UniProt, and active site residues were predicted using Pfam’s pfam_scan.pl tool v1.626 together with the active site database active_site.dat v37.1. For each protein, only the sequence corresponding to the annotated domain was extracted from the full-length protein sequence. Domain sequences were excluded if the annotated domain was shorter than 25% of the length of the corresponding Pfam HMM, or if more than 10% of residues were non-standard amino acids.
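
    The two exclusion rules described above can be sketched in Python as follows. This is a minimal illustration and not the dataset authors' pipeline: the function name keep_domain, its parameters, and the reading of "standard" as the 20 canonical amino acids are assumptions, as is the handling of values that fall exactly on the 25% and 10% boundaries.

```python
# Assumption: "standard" means the 20 canonical amino acids (one-letter codes).
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def keep_domain(domain_seq: str, hmm_length: int,
                min_coverage: float = 0.25,
                max_nonstandard: float = 0.10) -> bool:
    """Return True if a domain sequence passes both exclusion rules."""
    if not domain_seq or hmm_length <= 0:
        return False
    # Rule 1: drop domains shorter than 25% of the Pfam HMM length.
    if len(domain_seq) < min_coverage * hmm_length:
        return False
    # Rule 2: drop domains with more than 10% non-standard residues.
    nonstandard = sum(1 for aa in domain_seq.upper() if aa not in STANDARD_AA)
    if nonstandard / len(domain_seq) > max_nonstandard:
        return False
    return True
```

    For a hypothetical HMM of length 20, keep_domain("ACDEFGHIKL", 20) passes both rules, while keep_domain("ACD", 20) fails the length rule and keep_domain("ACDEFGHIXX", 20) fails the non-standard residue rule.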

    Directory structure

    The dataset contains two directories:

    • families/ – domain protein sequences grouped by enzyme family [FASTA format]
    • active_sites/ – active site residue annotations for each family [TSV format]

    A metadata file (metadata.tsv) is also included, providing detailed information for each enzyme family.

    Metadata

    A metadata file (metadata.tsv) provides:

    • family_id – Pfam family identifier (e.g., PF02615)
    • family_name – Pfam family name
    • seqs_count – total number of domain sequences in the family
    • seqs_with_active_sites – number of sequences containing at least one annotated active site
    • seqs_with_active_sites_percent – percentage of sequences with active sites
    • active_site_ids – comma-separated list of active site identifiers for the family
    • min_seq_length – minimum domain sequence length
    • mean_seq_length – average domain sequence length
    • max_seq_length – maximum domain sequence length
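
    Assuming metadata.tsv is tab-separated and begins with a header row whose column names match the fields listed above (the description does not state this explicitly), it could be read roughly as follows. The function name load_metadata is hypothetical, and the sample row is made up for illustration; its values are not taken from the dataset.

```python
import csv
import io
from typing import Dict, Iterable, List


def load_metadata(handle: Iterable[str]) -> List[Dict[str, object]]:
    """Parse metadata rows, converting counts and the site-ID list."""
    families = []
    for row in csv.DictReader(handle, delimiter="\t"):
        families.append({
            "family_id": row["family_id"],
            "family_name": row["family_name"],
            "seqs_count": int(row["seqs_count"]),
            "seqs_with_active_sites": int(row["seqs_with_active_sites"]),
            # active_site_ids is described as a comma-separated list.
            "active_site_ids": [
                s.strip() for s in row["active_site_ids"].split(",") if s.strip()
            ],
        })
    return families


# Made-up sample (a subset of the columns); NOT real values from the dataset.
SAMPLE = (
    "family_id\tfamily_name\tseqs_count\tseqs_with_active_sites\tactive_site_ids\n"
    "PF02615\texample_family\t100\t80\t114_H,42_H\n"
)
families = load_metadata(io.StringIO(SAMPLE))
print(families[0]["family_id"], families[0]["active_site_ids"])
```

    With the real file, the same function could be given an open file handle, for example open("metadata.tsv", newline="", encoding="utf-8"), where the path and encoding are again assumptions.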

    Active sites

    Each family has a corresponding TSV file in active_sites/ listing sequence-specific active site annotations:

    • protein_id – protein sequence identifier
    • site_id – active site identifier (e.g., 114_H, 42_H)
    • protein_position – residue position within the sequence
    • protein_residue – amino acid at the position
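
    A per-family annotation file with the four columns above could be grouped by protein along these lines. This sketch assumes a tab-separated file with a header row using exactly those column names; the function name sites_by_protein and the sample rows (protein IDs and positions) are hypothetical and not drawn from the dataset.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (site_id, protein_position, protein_residue)
Site = Tuple[str, int, str]


def sites_by_protein(handle: Iterable[str]) -> Dict[str, List[Site]]:
    """Group active site annotations by protein identifier."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        grouped[row["protein_id"]].append(
            (row["site_id"], int(row["protein_position"]), row["protein_residue"])
        )
    return dict(grouped)


# Made-up sample rows for illustration only.
SAMPLE = (
    "protein_id\tsite_id\tprotein_position\tprotein_residue\n"
    "protA\t114_H\t57\tH\n"
    "protA\t42_H\t12\tH\n"
    "protB\t114_H\t60\tH\n"
)
grouped = sites_by_protein(io.StringIO(SAMPLE))
print(sorted(grouped))
```

    Whether protein_position is 1-based, and whether it indexes the extracted domain or the full-length protein, is not stated in the description, so the sketch keeps the value as a plain integer without interpreting it.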

  9. PROSITE profiles

    • ebi.ac.uk
    Updated Feb 5, 2025
    Cite
    (2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 5, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.

  10. Table_2_Mapping OMIM Disease–Related Variations on Protein Domains Reveals...

    • figshare.com
    docx
    Updated Jun 5, 2023
    + more versions
    Cite
    Castrense Savojardo; Giulia Babbi; Pier Luigi Martelli; Rita Casadio (2023). Table_2_Mapping OMIM Disease–Related Variations on Protein Domains Reveals an Association Among Variation Type, Pfam Models, and Disease Classes.docx [Dataset]. http://doi.org/10.3389/fmolb.2021.617016.s003
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    Frontiers
    Authors
    Castrense Savojardo; Giulia Babbi; Pier Luigi Martelli; Rita Casadio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Human genome resequencing projects provide an unprecedented amount of data about single-nucleotide variations occurring in protein-coding regions and often leading to observable changes in the covalent structure of gene products. For many of these variations, links to Online Mendelian Inheritance in Man (OMIM) genetic diseases are available and are reported in many databases that are collecting human variation data such as Humsavar. However, the current knowledge on the molecular mechanisms that are leading to diseases is, in many cases, still limited. For understanding the complex mechanisms behind disease insurgence, the identification of putative models, when considering the protein structure and chemico-physical features of the variations, can be useful in many contexts, including early diagnosis and prognosis. In this study, we investigate the occurrence and distribution of human disease–related variations in the context of Pfam domains. The aim of this study is the identification and characterization of Pfam domains that are statistically more likely to be associated with disease-related variations. The study takes into consideration 2,513 human protein sequences with 22,763 disease-related variations. We describe patterns of disease-related variation types in biunivocal relation with Pfam domains, which are likely to be possible markers for linking Pfam domains to OMIM diseases. Furthermore, we take advantage of the specific association between disease-related variation types and Pfam domains for clustering diseases according to the Human Disease Ontology, and we establish a relation among variation types, Pfam domains, and disease classes. We find that Pfam models are specific markers of patterns of variation types and that they can serve to bridge genes, diseases, and disease classes. Data are available as Supplementary Material for 1,670 Pfam models, including 22,763 disease-related variations associated to 3,257 OMIM diseases.

  11. Annotation

    • figshare.com
    txt
    Updated Mar 3, 2020
    Cite
    Asif Ali (2020). Annotation [Dataset]. http://doi.org/10.6084/m9.figshare.11926770.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 3, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Asif Ali
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Annotations of transcript isoforms: Pfam protein domains, with E-values and GO terms, obtained via InterProScan.

  12. SFLD

    • ebi.ac.uk
    Updated Sep 7, 2018
    Cite
    (2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Sep 7, 2018
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.

  13. Statistics of conserved ordered and disordered PFAM domains.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Iga Korneta; Janusz M. Bujnicki (2023). Statistics of conserved ordered and disordered PFAM domains. [Dataset]. http://doi.org/10.1371/journal.pcbi.1002641.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Iga Korneta; Janusz M. Bujnicki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    (*) Including the LSM domain present in Sm and Lsm proteins. (**) In >100 copies.

  14. CDD

    • ebi.ac.uk
    Updated Apr 18, 2024
    Cite
    (2024). CDD [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Apr 18, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CDD is a protein annotation resource that consists of a collection of annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domain models, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases.

  15. Top Pfam domains identified in Agrilus planipennis sequences.

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Omprakash Mittapalli; Xiaodong Bai; Praveen Mamidala; Swapna Priya Rajarapu; Pierluigi Bonello; Daniel A. Herms (2023). Top Pfam domains identified in Agrilus planipennis sequences. [Dataset]. http://doi.org/10.1371/journal.pone.0013708.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Omprakash Mittapalli; Xiaodong Bai; Praveen Mamidala; Swapna Priya Rajarapu; Pierluigi Bonello; Daniel A. Herms
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Top Pfam domains identified in Agrilus planipennis sequences.

  16. PIRSF

    • ebi.ac.uk
    Updated Apr 7, 2020
    Cite
    (2020). PIRSF [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Apr 7, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The PIRSF protein classification system is a network with multiple levels of sequence diversity, from superfamilies to subfamilies, that reflects the evolutionary relationships of full-length proteins and domains. PIRSF is based at the Protein Information Resource, Georgetown University Medical Centre, Washington DC, US.

  17. PRINTS

    • ebi.ac.uk
    Updated Jun 14, 2012
    Cite
    (2012). PRINTS [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Jun 14, 2012
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain. PRINTS is based at the University of Manchester, UK.

  18. Most frequent PFAM domains found throughout the secretome of Mycosphaerella...

    • figshare.com
    xls
    Updated Jun 5, 2023
    + more versions
    Cite
    Alexandre Morais do Amaral; John Antoniw; Jason J. Rudd; Kim E. Hammond-Kosack (2023). Most frequent PFAM domains found throughout the secretome of Mycosphaerella graminicola (Mg), and corresponding frequency in Fusarium graminearum (Fg). [Dataset]. http://doi.org/10.1371/journal.pone.0049904.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Alexandre Morais do Amaral; John Antoniw; Jason J. Rudd; Kim E. Hammond-Kosack
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Most frequent PFAM domains found throughout the secretome of Mycosphaerella graminicola (Mg), and corresponding frequency in Fusarium graminearum (Fg).

  19. Performance of NADDA based on Pfam; When similar domains are present in the...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Armen Abnousi; Shira L. Broschat; Ananth Kalyanaraman (2023). Performance of NADDA based on Pfam; When similar domains are present in the training set (repetitive) and when some domains are withheld from the training set (non-repetitive). [Dataset]. http://doi.org/10.1371/journal.pone.0161338.t013
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Armen Abnousi; Shira L. Broschat; Ananth Kalyanaraman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of NADDA based on Pfam; When similar domains are present in the training set (repetitive) and when some domains are withheld from the training set (non-repetitive).

  20. Top Pfam domains identified in Chilo suppressalis midgut sequences.

    • figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Weihua Ma; Zan Zhang; Chuanhua Peng; Xiaoping Wang; Fei Li; Yongjun Lin (2023). Top Pfam domains identified in Chilo suppressalis midgut sequences. [Dataset]. http://doi.org/10.1371/journal.pone.0038151.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Weihua Ma; Zan Zhang; Chuanhua Peng; Xiaoping Wang; Fei Li; Yongjun Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Top Pfam domains identified in Chilo suppressalis midgut sequences.
