Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. SMART is based at EMBL, Heidelberg, Germany.
PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies. HAMAP is based at the SIB Swiss Institute of Bioinformatics, Geneva, Switzerland.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28–40% higher than that of the existing method on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
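To make the idea of stratifying tests by domain family concrete, here is a minimal sketch that computes q-values separately within each family. It is not the algorithm from the paper: it swaps in the standard Benjamini-Hochberg procedure, assumes each hit already has a p-value, and uses placeholder family labels and made-up p-values.

```python
from collections import defaultdict

def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values, returned in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the minimum of
    # p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues

def stratified_qvalues(hits):
    """hits: list of (family, p_value). q-values are computed per family."""
    positions_by_family = defaultdict(list)
    for pos, (family, _) in enumerate(hits):
        positions_by_family[family].append(pos)
    result = [None] * len(hits)
    for positions in positions_by_family.values():
        qs = bh_qvalues([hits[p][1] for p in positions])
        for p, q in zip(positions, qs):
            result[p] = q
    return result

# Hypothetical hits: (family label, p-value). Values are illustrative only.
hits = [("famA", 0.01), ("famB", 0.20), ("famA", 0.04),
        ("famA", 0.03), ("famB", 0.02), ("famA", 0.50)]
q = stratified_qvalues(hits)
```

Because each family is corrected on its own, a family with few tests is not penalized by the number of tests run against unrelated families.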
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.
In this data release, we will be making available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH SuperFamily or Fold assignments for all 324 million domains in TED100.
For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative.
Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:
For both TED100 and TEDredundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).
We are making available PDB files for 7,427 novel folds identified during the TED classification process, with an annotation table sorted by novelty.
Please use the gunzip command to extract files with a '.gz' extension.
CATH annotations have been assigned using the FoldSeek algorithm applied in various modes and the FoldClass algorithm, both of which are used to report significant structural similarity to a known CATH domain.
Note: The TED protocol differs from that of our standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.
Protein stability plays a crucial role in a variety of applications, such as food processing, therapeutics, and the identification of pathogenic mutations. Engineering campaigns commonly seek to improve protein stability, and there is a strong interest in streamlining these processes to enable rapid optimization of highly stabilized proteins with fewer iterations. In this work, we explore utilizing a mega-scale dataset to develop a protein language model optimized for stability prediction. ESMtherm is trained on the folding stability of 528k natural and de novo sequences derived from 461 protein domains and can accommodate deletions, insertions, and multiple-point mutations. We show that a protein language model can be fine-tuned to predict folding stability. ESMtherm performs reasonably on small protein domains and generalizes to sequences distal from the training set. Lastly, we discuss our model’s limitations compared to other state-of-the-art methods in generalizing to larger protein scaffolds. Our results highlight the need for large-scale stability measurements on a diverse dataset that mirrors the distribution of sequence lengths commonly observed in nature.
Database of known and predicted protein domain (domain-domain) interactions containing interactions inferred from PDB entries, and those that are predicted by 8 different computational approaches using Pfam domain definitions. DOMINE contains a total of 26,219 domain-domain interactions (among 5,410 domains) out of which 6,634 are inferred from PDB entries, and 21,620 are predicted by at least one computational approach. Of the 21,620 computational predictions, 2,989 interactions are high-confidence predictions (HCPs), 2,537 interactions are medium-confidence predictions (MCPs), and the remaining 16,094 are low-confidence predictions (LCPs). (May 2014)
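The counts quoted above can be cross-checked with a few lines of arithmetic. This sketch assumes the total of 26,219 is the union of the PDB-inferred and predicted sets, which the description implies but does not state; under that assumption the overlap follows by inclusion-exclusion.

```python
# Counts as quoted in the DOMINE description (May 2014).
total_interactions = 26219
inferred_from_pdb = 6634
predicted = 21620
high_conf, medium_conf, low_conf = 2989, 2537, 16094

# The three confidence classes should partition the predicted set.
confidence_sum = high_conf + medium_conf + low_conf

# Assumption: total = |PDB-inferred union predicted|, so the interactions
# that are both PDB-inferred and predicted follow by inclusion-exclusion.
overlap = inferred_from_pdb + predicted - total_interactions

print(confidence_sum == predicted, overlap)
```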
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, attempts for annotation fail. Here we address the fundamental question of domain identification for highly divergent proteins. By using high performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: 1. for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences, 2. a decision-making protocol combines outcomes obtained from multiple models, 3. a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins against 63% achieved previously, corresponding to 30% of improvement over the total number of Pfam domain predictions on the whole genome. 
The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE.
Protein-protein interactions drive many cellular processes. Some protein interactions are directed by Src homology 3 (SH3) domains that bind proline-rich motifs on other proteins. The evolution of the binding specificity of SH3 domains is not completely understood, particularly following gene duplication. Paralogous genes accumulate mutations that can modify protein functions and, for SH3 domains, their binding preferences. Here, we examined how the binding of the SH3 domains of two paralogous yeast type I myosins, Myo3 and Myo5, evolved following duplication. We found that the paralogs have subtly different SH3-dependent interaction profiles. However, by swapping SH3 domains between the paralogs and characterizing the SH3 domains freed from their protein context, we find that few of the differences in interactions, if any, depend on the SH3 domains themselves. We used ancestral sequence reconstruction to resurrect the pre-duplication SH3 domains and examined, moving back in time, how t...
The data published in this dataset was collected by multiple methods. Among the methods used are DHFR Protein-fragment Complementation Assay, cytometry, ancestral sequence reconstruction with IQ-TREE and FastML, protein structure prediction with AlphaFold2 and AlphaFold Multimer, molecular docking with Haddock2.4, orthology analysis and coevolution predictions with EVCouplings. See the README.md file and the method section of the paper "Dissection of the role of a SH3 domain in the evolution of binding preference of paralogous proteins" for more details.
File S1: Tables S1 - S12
File S2: Detailed protocols
FiguresS: Figures S1 - S10
DataS1: DHFR PCA results
DataS2: Phylogeny and sequence alignment
DataS3: AlphaFold results
DataS4: Molecular docking input and output files
DataS5: Orthology input and motif conservation results
DataS6: EVCouplings output
Please refer to Lemieux et al. 2023 for details on the data collection and transformation. All files can be opened with either R, a text editor, Excel or ChimeraX. This README file was generated on 2023-09-19 by Pascale Lemieux.
Author Information
A. Principal Investigator Contact Information
Name: Christian Landry
Institution: Université Laval, Québec, CA
Email: christian.landry@bio.ulaval.ca
B. Associate or Co-investigator Contact Information
Name: Pascale Lemieux
Institution: Université Laval, Québec, CA
Email: pascale.lemieux.4@ulaval.ca
Date of data collection (single date, range, approximate date): 2020-2023
Information about funding sources that supported the collection of the data: Canadian Institutes of Health Research (CIHR) Foundation grant 387697 and a HFSP grant (RGP0034/2018) to CRL
SHARING/ACCESS INFORMATION
https://creativecommons.org/publicdomain/zero/1.0/
This directory contains data to train a model to predict the function of protein domains, based on the Pfam dataset.
Domains are functional sub-parts of proteins; much like images in ImageNet are presegmented to contain exactly one object class, this data is presegmented to contain exactly and only one domain.
The purpose of the dataset is to repose the Pfam seed dataset as a multiclass classification machine learning task.
The task is: given the amino acid sequence of the protein domain, predict which class it belongs to. There are about 1 million training examples, and 18,000 output classes.
This data is more completely described by the publication "Can Deep Learning Classify the Protein Universe", Bileschi et al.
The approach used to partition the data into training/dev/testing folds is a random split.
Each fold (train, dev, test) has a number of files in it. Each of those files is in CSV format, with one record per line and the following fields (example values shown):
sequence: HWLQMRDSMNTYNNMVNRCFATCIRSFQEKKVNAEEMDCTKRCVTKFVGYSQRVALRFAE
family_accession: PF02953.15
sequence_name: C5K6N5_PERM5/28-87
aligned_sequence: ....HWLQMRDSMNTYNNMVNRCFATCI...........RS.F....QEKKVNAEE.....MDCT....KRCVTKFVGYSQRVALRFAE
family_id: zf-Tim10_DDP
Description of fields:
- sequence: These are usually the input features to your model. Amino acid sequence for this domain. There are 20 very common amino acids (frequency > 1,000,000), and 5 amino acids that are quite uncommon: X, U, B, O, Z.
- family_accession: These are usually the labels for your model. Accession number in form PFxxxxx.y (Pfam), where xxxxx is the family accession, and y is the version number. Some values of y are ten or greater, so 'y' can have two digits.
- family_id: One word name for family.
- sequence_name: Sequence name, in the form "$uniprot_accession_id/$start_index-$end_index".
- aligned_sequence: Contains a single sequence from the multiple sequence alignment (with the rest of the members of the family in seed), with gaps retained.
Generally, the family_accession field is the label, and the sequence (or aligned sequence) is the training feature.
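The field layout described above can be read with Python's standard csv module. The sketch below is illustrative only: it assumes each file starts with a header row naming the five fields (the column order shown is a guess), and it reuses the example record from this description as in-memory sample data. The helper name parse_record is hypothetical.

```python
import csv
import io

# In-memory stand-in for one data file; the header row and column order
# are assumptions, the record is the example given in the description.
SAMPLE_CSV = (
    "family_id,sequence_name,family_accession,aligned_sequence,sequence\n"
    "zf-Tim10_DDP,C5K6N5_PERM5/28-87,PF02953.15,"
    "....HWLQMRDSMNTYNNMVNRCFATCI...........RS.F....QEKKVNAEE.....MDCT....KRCVTKFVGYSQRVALRFAE,"
    "HWLQMRDSMNTYNNMVNRCFATCIRSFQEKKVNAEEMDCTKRCVTKFVGYSQRVALRFAE\n"
)

def parse_record(row):
    """Split the composite fields of one CSV record into typed parts."""
    # sequence_name has the form "<name>/<start>-<end>".
    name, span = row["sequence_name"].rsplit("/", 1)
    start, end = (int(x) for x in span.split("-"))
    # family_accession has the form "PFxxxxx.y"; y may have two digits.
    family, version = row["family_accession"].split(".")
    # Dropping gap characters from the aligned sequence recovers the domain.
    ungapped = row["aligned_sequence"].replace(".", "").replace("-", "")
    return {
        "name": name, "start": start, "end": end,
        "family": family, "version": int(version),
        "label": row["family_accession"],
        "sequence": row["sequence"], "ungapped": ungapped,
    }

records = [parse_record(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
rec = records[0]
print(rec["label"], rec["start"], rec["end"], len(rec["sequence"]))
```

For the example record, the ungapped aligned sequence equals the sequence field, and the span 28-87 matches the domain length of 60 residues.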
This sequence corresponds to a domain, not a full protein.
The contents of these fields are the same as the data provided in Stockholm format by Pfam at ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.seed.gz
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
Background: Computational prediction of protein interactions typically uses protein domains as classifier features because they capture conserved information of interaction surfaces. However, approaches relying on domains as features cannot be applied to proteins without any domain information. In this paper, we explore the contribution of pure amino acid composition (AAC) for protein interaction prediction. This simple feature, which is based on normalized counts of single or pairs of amino acids, is applicable to proteins from any sequenced organism and can be used to compensate for the lack of domain information.
Results: AAC performed on par with protein interaction prediction based on domains on three yeast protein interaction datasets. Similar behavior was obtained using different classifiers, indicating that our results are a function of features and not of classifiers. In addition to yeast datasets, AAC performed comparably on worm and fly datasets. Prediction of interactions for the entire yeast proteome identified a large number of novel interactions, the majority of which co-localized or participated in the same processes. Our high confidence interaction network included both well-studied and uncharacterized proteins. Proteins with known function were involved in actin assembly and cell budding. Uncharacterized proteins interacted with proteins involved in reproduction and cell budding, thus providing putative biological roles for the uncharacterized proteins.
Conclusion: AAC is a simple, yet powerful feature for predicting protein interactions, and can be used alone or in conjunction with protein domains to predict new and validate existing interactions. More importantly, AAC alone performs on par with existing, but more complex, features, indicating the presence of sequence-level information that is predictive of interaction but not necessarily restricted to domains.
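As a rough illustration of the AAC feature described above, the sketch below computes normalized counts of single amino acids and of amino acid pairs. Several details are assumptions not given in the abstract: pairs are taken as adjacent, overlapping residues, only the 20 standard amino acids are counted, and the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def single_composition(seq):
    """20-dimensional vector of normalized single-residue counts."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for aa in seq:
        if aa in counts:  # skip non-standard symbols such as X
            counts[aa] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]

def pair_composition(seq):
    """400-dimensional vector of normalized adjacent-pair counts."""
    counts = {pair: 0 for pair in PAIRS}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[pair] / total if total else 0.0 for pair in PAIRS]

singles = single_composition("AAC")
pairs = pair_composition("AAC")
```

Both vectors have a fixed length regardless of protein length, which is what lets the feature be computed for any sequenced protein, with or without domain annotations.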
https://spdx.org/licenses/CC0-1.0.html
Drosophila-type timeless (dTIM) is an established key clock protein in fruit flies, regulating rhythmicity and light-mediated entrainment. However, as indicated by functional experiments, its contribution to the clock differs in various insects. Therefore, we conducted a comprehensive phylogenetic analysis of dTIM across animals, dated its origin, gene duplications, and losses. We identified variable and conserved protein domains and pinpointed animal lineages that underwent the biggest changes in the dTIM sequence. While dTIM modifications are only mildly affected by changes in the PER protein, even the complete loss of PER in echinoderms had no impact on dTIM. However, changes in dTIM always co-occur with the loss of CRYPTOCHROMES or JETLAG. This is exemplified by the remarkably accelerated evolution of dTIM in phylloxera and aphids. Finally, alternative d-tim splicing, characteristic of D. melanogaster temperature-dependent function, is conserved at least to some extent in Diptera, albeit with unique alterations. Altogether, this study pinpoints major changes that shaped dTIM origin and evolution.
Methods
Gene Identification and Data Sets
A systematic search for clock components was conducted, building on previous studies exploring the evolution, duplication, and loss of circadian clock components (Thakkar et al., 2022; Kotwica-Rolinska et al., 2022a). To identify circadian clock proteins and genes encoding dTIM, mTIM, TOF1, PERIOD, dCRY, mCRY, 6-4 Photolyase, JETLAG, and BRWD3 in Bilateria/Metazoa, GenBank (NCBI) protein and genomic databases, as well as transcriptome shotgun assemblies (TSA), were utilized. BLASTP and tBLASTn algorithms were applied with taxon-restricted searches targeting specific lineages at the levels of orders, suborders, infraorders, and, in some cases, families. In certain instances, annotated genomes or whole-genome shotgun contigs (wgs) were also examined. Protein sequences of circadian clock genes from Drosophila melanogaster, Danaus plexippus, and Pyrrhocoris apterus served as initial queries. Reciprocal and lineage-focused searches incorporated queries representing identified proteins from related taxa. To detect duplicate hits (common in TSAs) or closely related proteins (e.g., bHLH-PAS proteins instead of PERIOD), the E-INS-i algorithm in MAFFT was used for alignment, followed by FastTree analysis (Price et al., 2009), both conducted in Geneious Prime 21.0.3 (Biomatters, New Zealand).
Phylogenetic analyses
To identify specific types of TIM, CRY, FBXL, or PAS proteins (e.g., distinguishing PER from bHLH PAS proteins), sequences were aligned using the MAFFT algorithm in Geneious Prime 21.0.3 (Biomatters, New Zealand). Representative datasets containing target proteins and related types were included. Ambiguously aligned regions were trimmed, and phylogenetic analyses were conducted using RAxML with a maximum likelihood GAMMA-based model in Geneious Prime 21.0.3.
The Metazoan phylogenies presented in Figures 2, 4, 5, 6, 7, 8 and S9 were retrieved using TIMETREE 5 (Kumar et al., 2022) and cross-referenced with recent molecular phylogenomic studies. These included works focused on insects (Misof et al., 2014), chelicerates, and crustaceans (von Reumont et al., 2012; Thomas et al., 2020; Bernot et al., 2023). For insects, phylogenomic studies specific to Polyneoptera (Wipfler et al., 2019), the hemipteroid assembly (Johnson et al., 2018), and Coleoptera (McKenna et al., 2019) were used to refine the corresponding sections of the phylogeny.
Gene Loss
While it is impossible to definitively prove the absence of a gene, in some cases, gene loss is the most plausible explanation. Recent advances in phylogenomics, the availability of extensive TSA data, and an increasing number of sequenced genomes have enabled a systematic exploration of circadian clock genes across major Bilateria groups (Protostomia and Deuterostomia). Our analysis focuses on lineage-specific gene losses that are strongly supported by data from multiple species, whole-genome assemblies, and deep transcriptome sequencing. Evidence for gene loss is summarized for specific genes and animal groups in Supplementary Table 1.
Finally, the CTT-like motif, located on the ARM2 domain, and the CTT motif, located on dCRY, were annotated on the D. melanogaster dTIM and dCRY sequences following the boundaries defined by Lin et al. (2023). After MAFFT alignment with the other dTIM sequences in the dataset, a motif was considered conserved when the degree of similarity exceeded 50%.
Prediction of Protein Domains
dTIM functional and binding domains were originally annotated based on the sequence of Drosophila melanogaster dTIM, following the boundaries defined by Lin et al. (2023) or identified in cell-based experiments (Saez and Young, 1996). Accordingly, the dCRY-interaction domains, ARMADILLO repeats (ARM1 and ARM2), and PER-binding sites 1 (PER-bind #1) and 2 (PER-bind #2) were annotated in the D. melanogaster dTIM protein isoform P (NP_001334730).
This annotated sequence was then aligned to each sequence in the dataset using MAFFT (Katoh and Standley, 2013; v7.490, algorithm: E-INS-i, scoring matrix: BLOSUM80) within the Geneious Prime 2024 software. The obtained similarities were plotted as intensities corresponding to numerical values in Figures 2 and 3. The PER-bind #1 domain was annotated when the total length of the region was at least 35 amino acids and the degree of similarity exceeded 40%. The CRY-interaction domain was highlighted when similarities exceeded 20%.
In the gene models (Figures 6 and 7), only two shades were used: regions corresponding to ARM1, ARM2, and PER-binding sites were highlighted when similarity exceeded 40%. For Limulus, a value of 38% was depicted as "low similarity" using a paler shade of red. The CRY-interaction domain was highlighted when similarity exceeded 20%. For Bemisia, "low similarity" was represented by a paler shade of blue, highlighting values of 19%. In the supplementary figures depicting protein models (Figures S2, S4, S6), specific numerical values were presented next to domain annotations when a single shade was used for each domain.
Nuclear localization signal (NLS) domains were predicted for each protein sequence using PSORT II and NLStradamus prediction software. Acidic domains were annotated based on the following criteria: a sequence located in the corresponding protein region that is at least half the size of the reference sequence, contains at least 20% acidic residues (D, E), and includes less than 10% basic residues (K, H, R). To further investigate the acidic domains, their sequence motifs were scanned. Conserved acidic motifs, presented in Figure S5, were identified using Gapped Local Alignment of Motifs (GLAM2) within the MEME Suite. This analysis focused on two distinct regions of 35 insect dTIM proteins: the sequences between ARM1 and ARM2 and the region covering the C-terminal tails.
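The three criteria for acidic domains stated above translate directly into a small predicate. This is a sketch under assumptions: the candidate region and the reference length are supplied by the caller (how the "corresponding protein region" is located is not modeled here), and the function name is hypothetical.

```python
def is_acidic_domain(region, reference_length):
    """Apply the stated criteria to a candidate region.

    region: amino acid string from the corresponding protein region
    reference_length: length of the reference acidic sequence
    """
    if not region:
        return False
    # Criterion 1: at least half the size of the reference sequence.
    if len(region) < reference_length / 2:
        return False
    # Criterion 2: at least 20% acidic residues (D, E).
    acidic_fraction = sum(region.count(aa) for aa in "DE") / len(region)
    # Criterion 3: less than 10% basic residues (K, H, R).
    basic_fraction = sum(region.count(aa) for aa in "KHR") / len(region)
    return acidic_fraction >= 0.20 and basic_fraction < 0.10

# Toy sequences for illustration only.
print(is_acidic_domain("DDEEAAAAAA", reference_length=20))  # True
print(is_acidic_domain("DDKAAAAAAA", reference_length=20))  # False
```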
The length of the entire protein, ARM1-ARM2 region, and C-tail length
The protein sequence was considered (likely) complete when the N-terminal part included a complete ARM1 domain starting with a methionine and the ARM2 domain was present. If a TSA (transcriptome shotgun assembly) sequence was used, the stop codon indicated the predicted C-terminal end. In the case of multiple paralogs in Daphnia, only protein sequences longer than 600 amino acids (aa) and containing both ARM domains were further analyzed.
To determine the length of the variable region (in amino acids) in the central part of the protein, all protein sequences in the dataset were aligned using MAFFT (Katoh and Standley, 2013; v7.490, algorithm: E-INS-i, scoring matrix: BLOSUM80). Conserved motifs corresponding to the Drosophila melanogaster ARM1 and ARM2 domains were identified, and the number of aa separating ARM1 and ARM2 was calculated. Additionally, conserved motifs corresponding to Drosophila YKDQ (located in ARM1) and LLLR (located in PER-bind #2 and ARM2) were identified in each dTIM protein as a parallel measurement.
To measure the length of the C-terminal tail, a conserved motif corresponding to Drosophila DLIE (located at the C-terminal end of ARM2) was identified in each dTIM. The number of amino acids between the DLIE-like motif and the C-terminus was then calculated and plotted (Figure 2E). For motif positions, see Figure S2. These values were plotted as dots representing each sequence distance in PRISM 7 for all proteins in the dataset, with exact values provided in Table S2. If a species lacked one or more conserved motifs (due to partial sequences or deletions), the length of the corresponding region was not calculated and was annotated as “n.d.” (non-determined) in Supplementary Table S2.
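The C-terminal tail measurement described above amounts to counting the residues that follow the motif. The sketch below simplifies the procedure: it assumes the motif can be found by exact string match and uses its last occurrence, whereas the study identified DLIE-like motifs from the alignment. The function name and toy sequences are hypothetical.

```python
def c_tail_length(protein, motif="DLIE"):
    """Residues between the end of the motif and the C-terminus.

    Returns None when the motif is absent, mirroring the "n.d." annotation.
    Assumption: exact match, last occurrence in the sequence.
    """
    start = protein.rfind(motif)
    if start == -1:
        return None
    return len(protein) - (start + len(motif))

# Toy sequences for illustration only.
print(c_tail_length("MAAADLIEKKKKK"))  # 5 residues after DLIE
print(c_tail_length("MAAAKKKKK"))      # None: motif not found
```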
Substitutions per Amino Acid per Million Years
To calculate substitution rates per amino acid position for dTIM and PER proteins, we used the following approach. Protein sequences (dTIM or PER) were aligned using the E-INS-i algorithm in MAFFT (Katoh and Standley 2013). The complete alignments were used to infer phylogenetic trees with RAxML (Stamatakis 2014) under the PROTGAMMAJTT model, ensuring the topology matched the evolutionary relationships of the organisms. Constraint trees were created in TreeGraph 2 (Stöver and Müller 2010). For Crustacea, where phylogenies are still debated, we enforced monophyly with insects but did not specify internal branching. Similarly, Polyneoptera were constrained as monophyletic without defining their internal topologies.
The resulting unrooted trees were swapped to position Protostomia and Deuterostomia as sister groups. Branch lengths were extracted and summed from the Protostomia/Deuterostomia split to terminal species. These values were then divided by 700 million years (the estimated divergence time of Protostomia and Deuterostomia) to compute substitution rates per amino acid position per million years. The calculated rates are presented in Supplementary Tables 3 and 4. These
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
The GOLD (for Golgi dynamics) domain is a protein module found in several eukaryotic Golgi and lipid-traffic proteins. It is typically between 90 and 150 amino acids long. Most of the size difference observed in the GOLD-domain superfamily is traceable to a single large low-complexity insert that is seen in some versions of the domain. With the exception of the p24 proteins, which have a simple architecture with the GOLD domain as their only globular domain, all other GOLD-domain proteins contain additional conserved globular domains. In these proteins, the GOLD domain co-occurs with lipid-, sterol- or fatty acid-binding domains such as PH, CRAL-TRIO, FYVE, oxysterol-binding and acyl-CoA-binding domains, suggesting that these proteins may interact with membranes. The GOLD domain can also be found associated with a RUN domain, which may have a role in the interaction of various proteins with cytoskeletal filaments. The GOLD domain is predicted to mediate diverse protein-protein interactions. A secondary structure prediction for the GOLD domain reveals that it is likely to adopt a compact all-β-fold structure with six to seven strands. Most of the sequence conservation is centred on the hydrophobic cores that support these predicted strands. The predicted secondary-structure elements and the size of the conserved core of the domain suggest that it may form a β-sandwich fold with the strands arranged in two β-sheets stacked on each other.
Some proteins known to contain a GOLD domain are listed below:
- Eukaryotic proteins of the p24 family.
- Animal Sec14-like proteins. They are involved in secretion.
- Human Golgi resident protein GCP60. It interacts with the Golgi integral membrane protein Giantin.
- Yeast oxysterol-binding protein homologue 3 (OSH3).
- Plant Patellin 1-6. It may be involved in membrane-trafficking events.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, owing to the low accuracy and high cost of traditional biological experiments, a growing number of computational models have been proposed to infer potential essential proteins. In this paper, a novel prediction method called KFPM is proposed. First, a protein-domain heterogeneous network is established by combining known protein-protein interactions with known associations between proteins and domains. Next, key topological characteristics extracted from the newly constructed protein-domain network are integrated with functional characteristics extracted from multiple sources of biological information about proteins, and potential essential proteins are inferred with an improved PageRank algorithm. Finally, to evaluate the performance of KFPM, we compared it with 13 state-of-the-art prediction methods. Experimental results show that, among the top 1, 5, and 10% of candidate proteins predicted by KFPM, the prediction accuracy reaches 96.08, 83.14, and 70.59%, respectively, which significantly outperforms all 13 competing methods. This suggests that KFPM may be a useful tool for predicting potential essential proteins in the future.
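To make the ranking step concrete, the sketch below runs plain PageRank by power iteration on a toy protein network, with an optional prior vector standing in for protein-level feature scores. It is not KFPM's improved PageRank, whose details are not given here; the function name `pagerank`, the `prior` argument, the damping value, and the toy protein labels are all assumptions made for illustration.

```python
def pagerank(adj, prior=None, damping=0.85, tol=1e-10, max_iter=200):
    """Plain PageRank by power iteration (illustrative, not the KFPM variant).

    adj   : dict mapping every node to a list of its neighbours
            (each neighbour must also be a key of adj).
    prior : optional dict of non-negative node weights used for teleportation;
            here it is a hypothetical stand-in for per-protein feature scores.
    """
    nodes = sorted(adj)
    n = len(nodes)
    if prior is None:
        prior = {v: 1.0 / n for v in nodes}
    total = sum(prior[v] for v in nodes)
    prior = {v: prior[v] / total for v in nodes}  # normalise to sum to 1

    score = dict(prior)
    for _ in range(max_iter):
        # Teleportation term, biased by the prior.
        new = {v: (1.0 - damping) * prior[v] for v in nodes}
        for u in nodes:
            neighbours = adj[u]
            if neighbours:
                share = damping * score[u] / len(neighbours)
                for v in neighbours:
                    new[v] += share
            else:
                # Dangling node: redistribute its score according to the prior.
                for v in nodes:
                    new[v] += damping * score[u] * prior[v]
        delta = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if delta < tol:
            break
    return score


# Toy undirected network with made-up protein labels: P1 is a hub.
toy_network = {
    "P1": ["P2", "P3", "P4"],
    "P2": ["P1"],
    "P3": ["P1"],
    "P4": ["P1"],
}
scores = pagerank(toy_network)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Taking the top 1, 5, or 10% of `ranking` as candidates would mirror the evaluation protocol described above, under the assumption that candidates are ranked by a single score.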
Protein-protein interactions (PPIs) are fundamental to biological function. While recent advances in coevolutionary analysis and deep learning (DL)-based structure prediction have enabled large-scale PPI identification in bacterial and yeast proteomes, their application to the more complex human proteome has remained limited. To address this challenge, we 1) enhanced coevolutionary signals by generating 7-fold deeper multiple sequence alignments (MSAs) from 30 petabytes of unassembled genomic data, and 2) developed a new DL model trained on augmented datasets of domain-domain interactions derived from 200 million predicted protein structures. These improvements led to a 4-fold increase in the performance of our de novo PPI prediction pipeline for human proteins. We systematically screened around 190 million human protein pairs and predicted 17,849 high-confidence PPIs at an estimated precision of 90%, including 3,631 interactions not previously detected by experimental methods. The resu...
# A dataset for predicting protein-protein interactions in humans
Dataset DOI: 10.5061/dryad.15dv41p84
These MSAs are in an A3M-like format. Compared to the standard A3M format, we inserted an additional sequence at the beginning, named “mask,” to indicate the alignment quality at each position. In this “mask,” an asterisk (*) indicates high-quality positions, and a dash (-) indicates low-quality positions (these are poorly conserved and thus cannot be reliably assembled from genomic data). We recommend using only the high-quality positions (marked with *), as we did in our work. Insertions relative to the human (query) sequence are represented by lowercase letters. Each sequence corresponds to one draft genome or genomic dataset, and the NCBI accession number of the genome or dataset is used to name the sequence in the header. We also include the taxonomic information of...
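The format described above can be read with a few lines of Python. The sketch below is a minimal illustration, not the dataset authors' code: it assumes the first record's header is exactly `>mask`, that the mask has one character per query (match) column, and that lowercase insertion letters do not occupy a mask column, as in standard A3M. The function names and the example accession are hypothetical.

```python
def read_a3m(text):
    """Parse FASTA/A3M-style text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def keep_high_quality_columns(records):
    """Drop insertions, then keep only the columns marked '*' in the mask."""
    mask_name, mask = records[0]
    if mask_name != "mask":  # assumption: the mask record is named exactly "mask"
        raise ValueError("first record is not the mask")
    keep = [i for i, flag in enumerate(mask) if flag == "*"]
    filtered = []
    for name, seq in records[1:]:
        # Lowercase letters are insertions relative to the query; remove them
        # so that every remaining character sits in a query (match) column.
        match_cols = "".join(c for c in seq if not c.islower())
        if len(match_cols) != len(mask):
            raise ValueError(f"{name}: length does not match the mask")
        filtered.append((name, "".join(match_cols[i] for i in keep)))
    return filtered


# Made-up example; "GCA_000000000.0" is a placeholder, not a real accession.
example = """>mask
**-*
>query
MKTA
>GCA_000000000.0
MrR-A
"""
filtered = keep_high_quality_columns(read_a3m(example))
```

Here the third column is marked low quality, so both sequences are reduced to three columns, and the lowercase `r` insertion is discarded before the mask is applied.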
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FA, FT, FL and FF stand for Ficus altissima, Ficus tinctoria, Ficus langkokensis and Ficus fistulosa, respectively.
#fastq reads: number of fastq reads from Illumina HiSeq 2000.
#contig_250: number of predicted contigs longer than 250 base pairs.
max_len (bp): length in base pairs (bp) of the longest predicted contig.
#pep: number of peptides predicted.
max_len (aa): length in amino acids (aa) of the longest predicted peptide.
#LRRNT_2: number of LRRNT_2 domains predicted.
#LRR_8: number of LRR_8 domains predicted.
Results from the assembly software.
Database of predicted transcription factors in completely sequenced genomes. The predicted transcription factors all contain assignments to sequence specific DNA-binding domain families. The predictions are based on domain assignments from the SUPERFAMILY and Pfam hidden Markov model libraries. Benchmarks of the transcription factor predictions show they are accurate and have wide coverage on a genomic scale. The DBD consists of predicted transcription factor repertoires for 930 completely sequenced genomes.
Background: It has recently been shown that the detection of gene fusion events across genomes can be used for predicting functional associations of proteins, including physical interaction or complex formation. To obtain such predictions we have made an exhaustive search for gene fusion events within 24 available completely sequenced genomes.
Results: Each genome was used as a query against the remaining 23 complete genomes to detect gene fusion events. Using an improved, fully automatic protocol, a total of 7,224 single-domain proteins that are components of gene fusions in other genomes were detected, many of which were identified for the first time. The total number of predicted pairwise functional associations is 39,730 for all genomes. Component pairs were identified by virtue of their similarity to 2,365 multidomain composite proteins. We also show for the first time that gene fusion is a complex evolutionary process with a number of contributory factors, including paralogy, genome size and phylogenetic distance. On average, 9% of genes in a given genome appear to code for single-domain, component proteins predicted to be functionally associated. These proteins are detected by an additional 4% of genes that code for fused, composite proteins.
Conclusions: These results provide an exhaustive set of functionally associated genes and also delineate the power of fusion analysis for the prediction of protein interactions.
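The core idea of fusion analysis can be sketched as follows: two proteins from the query genome are paired when both match the same composite protein in another genome in largely non-overlapping regions and are not similar to each other. This is a simplified illustration under those assumptions, not the protocol used in the study; the function names, the `max_overlap` threshold, and the toy hit coordinates are hypothetical.

```python
from itertools import combinations


def overlap(a, b):
    """Number of shared positions between two inclusive intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def fusion_pairs(hits, similar_pairs=frozenset(), max_overlap=10):
    """Infer component pairs linked by a composite (fused) protein.

    hits          : iterable of (query_protein, composite_protein, start, end),
                    with start/end given in composite-protein coordinates.
    similar_pairs : set of frozenset({p, q}) for query proteins that are similar
                    to each other (e.g. paralogues); such pairs are skipped.
    max_overlap   : hypothetical tolerance for overlap between the two matches.
    """
    by_composite = {}
    for query, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((query, start, end))

    pairs = set()
    for composite, matches in by_composite.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(matches, 2):
            if q1 == q2 or frozenset((q1, q2)) in similar_pairs:
                continue
            if overlap((s1, e1), (s2, e2)) <= max_overlap:
                pairs.add((min(q1, q2), max(q1, q2), composite))
    return pairs


# Toy similarity hits with made-up identifiers and coordinates.
toy_hits = [
    ("A", "C", 1, 200),    # A matches the N-terminal part of composite C
    ("B", "C", 210, 400),  # B matches the C-terminal part of composite C
    ("D", "C", 150, 380),  # D overlaps both A and B on C
]
predicted = fusion_pairs(toy_hits)
```

In this toy case only A and B are reported as a component pair, because D's match overlaps both of the others by more than the tolerance.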