Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file contains the protein-protein interaction analysis dataset that was used in the unpublished manuscript and was further analyzed with the STRING online software.Significantly upregulated mRNAs (2,777 genes; p < 0.05) identified by bulk RNA-seq were analyzed using the STRING module in Cytoscape v.2.2.0 (Institute for System Biology; WA; USA). A cluster network was constructed using the MCL algorithm with a granularity parameter of 4, followed by filtering nodes with mcl.cluster > 10. The resulting 1,848 nodes were processed through STRING v12.0 (Swiss Institute of Bioinformatics; Lausanne; Switzerland) to generate a protein–protein interaction (PPI) network, incorporating evidence from text mining, genomic neighborhood, experimental data, curated databases, co-expression, gene fusion, and co-occurrence, with a minimum confidence score threshold of 0.40. Network modules were defined using the DBSCAN clustering algorithm with an ε parameter of 2. Cluster 1, representing the largest gene set (101 genes), was further analyzed by sorting the top 20 nodes with the highest node degree, resulting in a network comprising 101 nodes and 756 edges. Global network metrics indicated an average node degree of 15, a local clustering coefficient of 0.600, and a PPI enrichment p-value of < 1 × 10⁻¹⁶. The average values of coexpression, experimentally determined interactions, automated text mining, and combined scores were calculated.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An extensive dataset of binary physical protein-protein interaction extracted from STRING 12.0 (>12,000 organisms) with artificially generated negatives. The dataset includes 72M positive pairs with STRING confidence scores> 0.9 and 720M negative pairs. The corresponding protein sequences are located in the .fasta files. The generation of the negatives was derived from https://doi.org/10.1016/j.isci.2024.110371
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data collection contains the data sets related to human (9606) that were previously deposed as separate datasets in STRING ver.10.5 before changing the download files structure with release of ver.11.0.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data for RAPPPID, a method for the Regularised Automative Prediction of Protein-Protein Interactions using Deep Learning.
These datasets are in a format that RAPPPID is ready to read.
Comparatives Dataset
These datasets were derived from the STRING v11 H. sapiens dataset, according to the C1, C2, and C3 procedures outlined by Park and Marcotte, 2012. Negative samples are sampled randomly from the space of proteins not known to interact. See Szymborski & Emad for details.
Repeatability Datasets
The following datasets are all derived from STRING in the manner as the comparatives dataset, but three different random seeds are used for drawing proteins.
References
Park,Y. and Marcotte,E.M. (2012) Flaws in evaluation schemes for pair-input computational predictions. Nat Methods, 9, 1134–1136.
Szklarczyk, D., Gable, A. L., Lyon, D., Junge, A., Wyder, S., Huerta-Cepas, J., Simonovic, M., Doncheva, N. T., Morris, J. H., Bork, P., Jensen, L. J., and Mering, C. (2019). String v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1), D607–D613.
Szymborski,J. and Emad,A. (2021) RAPPPID: Towards Generalisable Protein Interaction Prediction with AWD-LSTM Twin Networks. bioRxiv https://doi.org/10.1101/2021.08.13.456309
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All human protein interactions were obtained from STRING (https://string-db.org/, version 11.0). Interactions were then filtered to those involving only BM zone proteins. Related to Fig. S6B.
The prediction of protein complexes from protein-protein interactions (PPIs) is a well-studied problem in bioinformatics. However, the currently available PPI data is not enough to describe all known protein complexes. In this paper, we express the problem of determining the minimum number of (additional) required protein-protein interactions as a graph theoretic problem under the constraint that each complex constitutes a connected component in a PPI network. For this problem, we develop two computational methods: one is based on integer linear programming (ILPMinPPI) and the other one is based on an existing greedy-type approximation algorithm (GreedyMinPPI) originally developed in the context of communication and social networks. Since the former method is only applicable to datasets of small size, we apply the latter method to a combination of the CYC2008 protein complex dataset and each of eight PPI datasets (STRING, MINT, BioGRID, IntAct, DIP, BIND, WI-PHI, iRefIndex). The results show that the minimum number of additional required PPIs ranges from 51 (STRING) to 964 (BIND), and that even the four best PPI databases, STRING (51), BioGRID (67), WI-PHI (93) and iRefIndex (85), do not include enough PPIs to form all CYC2008 protein complexes. We also demonstrate that the proposed problem framework and our solutions can enhance the prediction accuracy of existing PPI prediction methods. ILPMinPPI can be freely downloaded from http://sunflower.kuicr.kyoto-u.ac.jp/~nakajima/.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Validation of the total new predicted links and the new predicted links associated with the 10 proteins by STRING database for the 14317_PPI data.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset for protein-protein interaction prediction across bacteria (Protein sequences)
A dataset of 10,533 bacterial genomes across 6,956 species with protein-protein interaction (PPI) scores for each genome. The genome protein sequences and PPI scores have been extracted from STRING DB. Each row contains a set of protein sequences from a genome, ordered by their location on the chromosome and plasmids and a set of associated PPI scores. The PPI scores have been extracted using the… See the full description on the dataset page: https://huggingface.co/datasets/macwiatrak/bacbench-ppi-stringdb-protein-sequences.
STRING PPI Human 1M
This dataset contains 1 million human protein–protein interactions (PPIs) derived from STRING v11.5. Columns:
seq_a, seq_b: Amino acid sequences of the interacting proteins (≤2048 AA). seq_name_a, seq_name_b: Protein names from STRING. score: Combined score from STRING (0–1000, normalized to 0–1). This score integrates various evidence channels (experimental data, text mining, co-expression, etc.) into a single confidence metric. label: Binary interaction label.… See the full description on the dataset page: https://huggingface.co/datasets/vladak/string_ppi_human_1M.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Homophily/heterophily evaluation, expressed in terms of z-score values, is related to the human Protein-Protein Interaction Network (PPI), obtained from the STRING v11.5 database (https://string-db.org) setting standard threshold on edge score (T=700). Each protein occurring in the PPI was assigned to a class corresponding to the chromosome the related gene belongs to.
A total of 23 classes (chr1, chr2, ..., chr22, chrX) were considered (excluding the class corresponding to chromosome Y because of the small number of genes occurring in the network).
The homophily/heterophily nature of the network, with respect to chromosome classes, was evaluated through HONTO tool (https://github.com/cumbof/honto).
In other words, the tendency of proteins to preferentially interact with proteins whose genes are physically located on the same chromosome (homophily) or on different chromosomes (heterophily) was investigated and evaluated in terms of z-scores.
Values related to intra (along the diagonal) and inter chromosomal interactions (other than the diagonal) are also reported as a heatmap.
As one can observe, values occurring in the diagonal are clearly higher than values out of the diagonal, leading to assess a homophilic nature of the network, confirming the link between shared chromosome and interaction in the PPI.
Selection of 30 central genes from PPI network, including 17 upregulated and 13 downregulated genes, by using the STRING and Cytoscape software.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of proteins and known PPIs per species in BIOGRID. (version 3.5.171).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data files necessary to reproduce the analysis of associations between cancer driver mutation status and its protein physical interactors. Original research article available at: https://www.biorxiv.org/content/10.1101/2025.01.18.633728v1
Code for the analysis is available at: https://github.com/GamaPintoLab/driver_neighbours. Clone the code repository and extract the data folder to the directory's root.
Raw data files were retrieved from the TCGA Pan Cancer cohort through the UCSC Xena online tool (https://pancanatlas.xenahubs.net):
Protein-protein interaction data:
Cancer driver genes from the Network of Cancer Genes and Heathy Drivers (NCG7.0): http://network-cancer-genes.org/
Consensus RNA data from The Human Protein Atlas (version 24.0): https://www.proteinatlas.org/humanproteome/tissue/data#consensus_tissues_rna
The opentargets folder contains files from the Open Targets Platform (Version 24.09): (https://platform.opentargets.org/)
Additionally, we also include a processed folder, which includes all generated files required to reproduce the figures in the mentioned article, without the necessity of running all the provided code.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The underlying mechanism of obesity and hyperuricemia. (A) Protein-protein interaction (PPI) network obtained from STRING database and constructed by Cytoscape. Each node size and color depth are proportional to their node degree. Edge width is proportional to the edge betweenness. (B) Grouping of KEGG enrichment analysis of the 235 intersected targets of obesity leading to hyperuricemia. Functionally related groups partially overlap. KEGG pathway is represented as a node. The nodes of a group are labeled in the same color. Two groups share the nodes with two colors
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary statistics for protein-protein interaction networks identified with STRING amongst genes corresponding to significant SNPs or k-mers (inside or adjacent to genes). PPI enrichment p-value corresponds to the likelihood nodes and edges would be selected from the S. aureus database by chance.
Despite being one of the most important human fungal pathogens, Candida albicans has not been studied extensively at the level of protein-protein interactions (PPIs) and data on PPIs are not readily available in online databases. In January 2018, the database called “Biological General Repository for Interaction Datasets (BioGRID)” that contains the most PPIs for C. albicans, only documented 188 physical or direct PPIs (release 3.4.156) while several more can be found in the literature. Other databases such as the String database, the Molecular INTeraction Database (MINT), and the Database for Interacting Proteins (DIP) database contain even fewer interactions or do not even include C. albicans as a searchable term. Because of the non-canonical codon usage of C. albicans where CUG is translated as serine rather than leucine, it is often problematic to use the yeast two-hybrid system in Saccharomyces cerevisiae to study C. albicans PPIs. However, studying PPIs is crucial to gain a thorough understanding of the function of proteins, biological processes and pathways. PPIs can also be potential drug targets. To aid in creating PPI networks and updating the BioGRID, we performed an exhaustive literature search in order to provide, in an accessible format, a more extensive list of known PPIs in C. albicans.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Data from the Gene Expression Omnibus (GEO) database (GSE119600) for PSC were downloaded and analyzed using R software to identify differentially expressed genes (DEGs). Online analysis tools were employed for Gene Ontology (GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. The STRING database (https://www.string-db.org) was used for protein-protein interaction (PPI) analysis to identify key genes in PSC, and ssGSEA was used to analyze immune cell infiltration.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ameloblastoma is a highly aggressive odontogenic tumor, and its pathogenesis is associated with multiple participating genes. Objective: Our aim was to identify and validate new critical genes of conventional ameloblastoma using microarray and bioinformatics analysis. Methods: Gene expression microarray and bioinformatic analysis were performed to use CHIP H10KA and DAVID software for enrichment. Protein-protein interactions (PPI) were visualized using STRING-Cytoscape with MCODE plugin, followed by Kaplan-Meier and GEPIA analysis that were employed for the candidate's postulation. RT-qPCR and IHC assays were performed to validate the bioinformatic approach. Results: 376 upregulated genes were identified. PPI analysis revealed 14 genes that were validated by Kaplan-Meier and GEPIA resulting in PDGFA and IL2RA as candidate genes. The RT-qPCR analysis confirmed their intense expression. Immunohistochemistry analysis showed that PDGFA expression is parenchyma located. Conclusion: With bioinformatics methods, we can identify upregulated genes in conventional ameloblastoma, and with RT-qPCR and immunoexpression analysis validate that PDGFA could be a more specific and localized therapeutic target.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains pre-processed data required to reproduce the results of the paper "Identification of transcription factor co-binding patterns with non-negative matrix factorization". The repository with the code can be found here: https://bitbucket.org/CBGR/cobind_manuscript/src/master/. Data include transcription factor binding sites (TFBSs) for 7 species from UniBind 2021 database, joined motif collection from CIS-BP and JASPAR 2022 databases and corresponding physical protein-protein interaction (PPI) data from STRING database.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 3: Phenotype-specific coherence estimates. The PPI database (STRING, STRING filtered, Biogrid) are specified in the corresponding column names. For each phenotype category, the results are sorted alphabetically. In addition to the normalized coherence and the permutation p-values indicating significant difference of phenotype-specific coherence from random, modularity, untransformed slopes, and the corresponding lower and upper confidence interval bounds are shown. “NA” indicates the corresponding value cannot be estimated due to lack of sufficient number of genes annotated with PPIs.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file contains the protein-protein interaction analysis dataset that was used in the unpublished manuscript and was further analyzed with the STRING online software.Significantly upregulated mRNAs (2,777 genes; p < 0.05) identified by bulk RNA-seq were analyzed using the STRING module in Cytoscape v.2.2.0 (Institute for System Biology; WA; USA). A cluster network was constructed using the MCL algorithm with a granularity parameter of 4, followed by filtering nodes with mcl.cluster > 10. The resulting 1,848 nodes were processed through STRING v12.0 (Swiss Institute of Bioinformatics; Lausanne; Switzerland) to generate a protein–protein interaction (PPI) network, incorporating evidence from text mining, genomic neighborhood, experimental data, curated databases, co-expression, gene fusion, and co-occurrence, with a minimum confidence score threshold of 0.40. Network modules were defined using the DBSCAN clustering algorithm with an ε parameter of 2. Cluster 1, representing the largest gene set (101 genes), was further analyzed by sorting the top 20 nodes with the highest node degree, resulting in a network comprising 101 nodes and 756 edges. Global network metrics indicated an average node degree of 15, a local clustering coefficient of 0.600, and a PPI enrichment p-value of < 1 × 10⁻¹⁶. The average values of coexpression, experimentally determined interactions, automated text mining, and combined scores were calculated.