Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pfam domains predicted on all EukProt v3 datasets (one output file per dataset). Domains were predicted with InterProScan version 5.56 and InterPro version 89.0 (which includes the Pfam database version 34.0; note that this is no longer the most recent version of Pfam), with default parameter values.
https://bioregistry.io/spdx:CC0-1.0
The Pfam database contains information about protein domains and families. For each entry, a protein sequence alignment and a hidden Markov model are stored.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY is based at the University of Bristol, UK.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ISOPENTENYLTRANSFERASE (IPT) genes play important roles in the initial steps of cytokinin synthesis, exist in plants and pathogenic bacteria, and form a multigene family in plants. Protein domain searches revealed that bacterial and plant IPT proteins are assigned to different protein domain families in the Pfam database, namely the Pfam IPT (IPTPfam) and Pfam IPPT (IPPTPfam) families, which are closely related within the P-loop NTPase clan. To understand the origin and evolution of the genes, a species matrix was assembled across the tree of life, with intensive sampling of plant lineages. The IPTPfam domain was found in only a few bacterial lineages, whereas IPPTPfam is common except in Archaea and Mycoplasma bacteria. The bacterial IPPTPfam domain miaA genes were shown to be ancestral to eukaryotic IPPTPfam domain genes. Plant IPTs diversified into class I tRNA-IPTs, class II tRNA-IPTs, and adenosine-phosphate IPTs. The class I tRNA-IPTs appeared to represent direct successors of miaA genes and were found in all plant genomes, whereas class II tRNA-IPTs originated from eukaryotic genes and were found in prasinophyte algae and in euphyllophytes. Adenosine-phosphate IPTs were found only in angiosperms. Gene duplications resulted in gene redundancies with ubiquitous expression or diversification in expression. In conclusion, it is shown that IPT genes have a complex history prior to the protein family split, and might have experienced losses or HGTs, and gene duplications that are likely to be correlated with the rise in morphological complexity involved in fine-tuning cytokinin production.
A database of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Users can analyze protein sequences for Pfam matches, view Pfam family annotation and alignments, see groups of related families, look at the domain organization of a protein sequence, find the domains on a PDB structure, and query Pfam by keywords. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high-quality, manually curated families. They are supplemented by entries generated automatically using the ADDA database; these automatically generated entries are called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. Pfam also generates higher-level groupings of related families, known as clans (collections of Pfam-A entries that are related by similarity of sequence, structure or profile HMM).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 772 enzyme domain families annotated with 1,376 active sites in Pfam v37.1.
Protein sequences containing Pfam domains were retrieved from UniProt, and active site residues were predicted using Pfam’s pfam_scan.pl tool v1.626 together with the active site database active_site.dat v37.1. For each protein, only the sequence corresponding to the annotated domain was extracted from the full-length protein sequence. Domain sequences were excluded if the annotated domain was shorter than 25% of the length of the corresponding Pfam HMM, or if more than 10% of residues were non-standard amino acids.
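The two exclusion criteria can be sketched as a small Python filter. This is an illustrative reading of the description, not the dataset authors' code. The function name `keep_domain`, its parameters, and the choice to measure domain length as the number of residues in the extracted sequence (rather than in HMM coordinates) are assumptions made for the example.

```python
# The 20 standard amino acids (one-letter codes).
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def keep_domain(domain_seq: str, hmm_length: int,
                min_coverage: float = 0.25,
                max_nonstandard: float = 0.10) -> bool:
    """Return True if a domain sequence passes both filters.

    Assumption: domain length is the number of residues in the
    extracted domain sequence, compared against the HMM length.
    """
    if not domain_seq or hmm_length <= 0:
        return False
    # Filter 1: exclude domains shorter than 25% of the HMM length.
    if len(domain_seq) < min_coverage * hmm_length:
        return False
    # Filter 2: exclude domains with more than 10% non-standard residues.
    nonstandard = sum(1 for aa in domain_seq.upper() if aa not in STANDARD_AA)
    return nonstandard / len(domain_seq) <= max_nonstandard
```

For a hypothetical HMM of length 100, a 40-residue domain made only of standard residues passes, while a 4-residue fragment or a sequence with 20% `X` residues is excluded.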
The dataset contains two directories:
families/ – domain protein sequences grouped by enzyme family [FASTA format]
active_sites/ – active site residue annotations for each family [TSV format]
A metadata file (metadata.tsv) is also included, providing detailed information for each enzyme family.
Each family has a corresponding TSV file in active_sites/ listing sequence-specific active site annotations.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Human genome resequencing projects provide an unprecedented amount of data about single-nucleotide variations occurring in protein-coding regions and often leading to observable changes in the covalent structure of gene products. For many of these variations, links to Online Mendelian Inheritance in Man (OMIM) genetic diseases are available and are reported in many databases that collect human variation data, such as Humsavar. However, current knowledge of the molecular mechanisms that lead to disease is, in many cases, still limited. For understanding the complex mechanisms behind disease onset, the identification of putative models that consider the protein structure and physicochemical features of the variations can be useful in many contexts, including early diagnosis and prognosis. In this study, we investigate the occurrence and distribution of human disease-related variations in the context of Pfam domains. The aim of this study is the identification and characterization of Pfam domains that are statistically more likely to be associated with disease-related variations. The study takes into consideration 2,513 human protein sequences with 22,763 disease-related variations. We describe patterns of disease-related variation types in one-to-one correspondence with Pfam domains, which are possible markers for linking Pfam domains to OMIM diseases. Furthermore, we take advantage of the specific association between disease-related variation types and Pfam domains for clustering diseases according to the Human Disease Ontology, and we establish a relation among variation types, Pfam domains, and disease classes. We find that Pfam models are specific markers of patterns of variation types and that they can serve to bridge genes, diseases, and disease classes. Data are available as Supplementary Material for 1,670 Pfam models, including 22,763 disease-related variations associated with 3,257 OMIM diseases.
https://www.gnu.org/licenses/gpl-3.0.html
Annotations of transcript isoforms. Pfam protein domains along with E-values and GO terms via InterProScan.
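Annotations of this kind are commonly read from InterProScan's tab-separated output. The sketch below assumes the standard InterProScan TSV column order (protein accession, MD5, length, analysis, signature accession, signature description, start, stop, score, status, date, InterPro accession, InterPro description, GO terms, pathways) and assumes GO terms were requested when InterProScan was run. The names `PfamHit` and `parse_pfam_hits`, the E-value cutoff, and the sample record are illustrative only and are not taken from the dataset.

```python
import io
from typing import Iterator, NamedTuple, TextIO, Tuple


class PfamHit(NamedTuple):
    protein: str
    pfam_acc: str
    start: int
    end: int
    evalue: float
    go_terms: Tuple[str, ...]


def parse_pfam_hits(handle: TextIO, max_evalue: float = 1e-5) -> Iterator[PfamHit]:
    """Yield Pfam rows from InterProScan TSV output below an E-value cutoff."""
    for line in handle:
        row = line.rstrip("\n").split("\t")
        # Keep only rows produced by the Pfam analysis.
        if len(row) < 9 or row[3] != "Pfam":
            continue
        try:
            evalue = float(row[8])
        except ValueError:
            continue  # score column is "-" for some analyses
        if evalue > max_evalue:
            continue
        # GO terms are pipe-separated; newer releases append a source
        # suffix such as "(InterPro)", which is stripped here.
        go_field = row[13] if len(row) > 13 else "-"
        go_terms = tuple(t.split("(")[0] for t in go_field.split("|")
                         if t.startswith("GO:"))
        yield PfamHit(row[0], row[4], int(row[6]), int(row[7]), evalue, go_terms)


# Made-up example records for illustration (hypothetical transcript "tx1.p1").
sample = "\n".join([
    "\t".join(["tx1.p1", "md5", "300", "Pfam", "PF00069", "Protein kinase domain",
               "10", "250", "1.2E-40", "T", "01-01-2024", "IPR000719",
               "Protein kinase domain", "GO:0004672|GO:0005524", "-"]),
    "\t".join(["tx1.p1", "md5", "300", "SMART", "SM00220", "serkin_6",
               "10", "250", "3.0E-50", "T", "01-01-2024", "IPR000719",
               "Protein kinase domain", "-", "-"]),
])
hits = list(parse_pfam_hits(io.StringIO(sample)))
```

With the sample above, only the Pfam row is kept; rows from other member databases are skipped.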
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
() Including the LSM domain present in Sm and Lsm proteins.
(*) In >100 copies.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CDD is a protein annotation resource that consists of a collection of annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domain models, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top Pfam domains identified in Agrilus planipennis sequences.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PIRSF protein classification system is a network with multiple levels of sequence diversity from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins and domains. PIRSF is based at the Protein Information Resource, Georgetown University Medical Centre, Washington DC, US.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain. PRINTS is based at the University of Manchester, UK.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Most frequent PFAM domains found throughout the secretome of Mycosphaerella graminicola (Mg), and corresponding frequency in Fusarium graminearum (Fg).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of NADDA based on Pfam, when similar domains are present in the training set (repetitive) and when some domains are withheld from the training set (non-repetitive).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top Pfam domains identified in Chilo suppressalis midgut sequences.