License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
This repository contains labeled, weighted networks of chemical-gene, gene-gene, gene-disease, and chemical-disease relationships based on single sentences in PubMed abstracts. All raw dependency paths are provided in addition to the labeled relationships.
PART I: Connects dependency paths to labels, or "themes". Each record contains a dependency path followed by its score for each theme, and indicators of whether or not the path is part of the flagship path set for each theme (meaning that it was manually reviewed and determined to reflect that theme). The themes themselves are listed below and are in our paper (reference below).
PART II: Connects sentences to dependency paths. It consists of sentences and associated metadata, entity pairs found in the sentences, and dependency paths connecting those entity pairs. Each record contains the following information:
The "with-themes.txt" files only contain dependency paths with corresponding theme assignments from Part I. The plain ".txt" files contain all dependency paths.
This release contains the annotated network for the September 15, 2019 version of PubTator. The version discussed in our paper, below, is an older one - from April 30, 2016. If you're interested in that network, it can be found in Version 1 of this repository. We will be releasing updated networks periodically, as the PubTator community continues to release new versions of named entity annotations for Medline each month or so.
------------------------------------------------------------------------------------
REFERENCES
Percha B, Altman RB (2017) A global network of biomedical relationships derived from text. Bioinformatics, 34(15): 2614-2624.
Percha B, Altman RB (2015) Learning the structure of biomedical relationships from unstructured text. PLoS Computational Biology, 11(7): e1004216.
This project depends on named entity annotations from the PubTator project:
https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/
Reference:
Wei CH et al., PubTator: a Web-based text mining tool for assisting Biocuration, Nucleic Acids Research, 2013, 41 (W1): W518-W522.
Dependency parsing was provided by the Stanford CoreNLP toolkit (version 3.9.1):
https://stanfordnlp.github.io/CoreNLP/index.html
Reference:
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
------------------------------------------------------------------------------------
THEMES
chemical-gene
(A+) agonism, activation
(A-) antagonism, blocking
(B) binding, ligand (esp. receptors)
(E+) increases expression/production
(E-) decreases expression/production
(E) affects expression/production (neutral)
(N) inhibits
gene-chemical
(O) transport, channels
(K) metabolism, pharmacokinetics
(Z) enzyme activity
chemical-disease
(T) treatment/therapy (including investigatory)
(C) inhibits cell growth (esp. cancers)
(Sa) side effect/adverse event
(Pr) prevents, suppresses
(Pa) alleviates, reduces
(J) role in disease pathogenesis
disease-chemical
(Mp) biomarkers (of disease progression)
gene-disease
(U) causal mutations
(Ud) mutations affecting disease course
(D) drug targets
(J) role in pathogenesis
(Te) possible therapeutic effect
(Y) polymorphisms alter risk
(G) promotes progression
disease-gene
(Md) biomarkers (diagnostic)
(X) overexpression in disease
(L) improper regulation linked to disease
gene-gene
(B) binding, ligand (esp. receptors)
(W) enhances response
(V+) activates, stimulates
(E+) increases expression/production
(E) affects expression/production (neutral)
(I) signaling pathway
(H) same protein or complex
(Rg) regulation
(Q) production by cell population
------------------------------------------------------------------------------------
FORMATTING NOTE
A few users have mentioned that the dependency paths in the "part-i" files are all lowercase text, whereas those in the "part-ii" files maintain the case of the original sentence. This complicates mapping between the two sets of files.
We kept the part-ii files in the same case as the original sentence to facilitate downstream debugging: it's easier to tell which words in a particular sentence contribute to the dependency path if their original case is maintained. When working with the part-ii "with-themes" files, simply convert the dependency path to lowercase; it is then guaranteed to match one of the paths in the corresponding part-i file, and you'll be able to get the theme scores.
Apologies for the additional complexity, and please reach out to us if you have any questions (see correspondence information in the Bioinformatics manuscript, above).
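The lowercase matching described in the formatting note can be sketched roughly as follows. This is only an illustration: the function names, the in-memory record layout, and the toy path and scores are assumptions made up for the example, not the actual file format or contents of the part-i and part-ii files.

```python
def build_theme_index(part_i_records):
    """Map each (already lowercase) part-i dependency path to its theme scores."""
    return {path: scores for path, scores in part_i_records}


def lookup_theme_scores(part_ii_path, theme_index):
    """Lowercase a part-ii dependency path and look it up in the part-i index.

    Returns None if the path has no theme assignment (for example, a path
    taken from a plain ".txt" file rather than a "with-themes" file).
    """
    return theme_index.get(part_ii_path.lower())


# Toy, made-up records: a hypothetical path string and hypothetical scores.
part_i_records = [
    ("start_entity|nsubj|inhibits|dobj|end_entity", {"N": 12.0, "E-": 3.0}),
]
theme_index = build_theme_index(part_i_records)

# A part-ii path keeps the case of the original sentence.
scores = lookup_theme_scores(
    "START_ENTITY|nsubj|Inhibits|dobj|END_ENTITY", theme_index
)
print(scores)
```

Building the index once and reusing it keeps each lookup to a single dictionary access, which matters when the part-ii files are large.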
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Network metrics for Protein-Protein Interaction (PPI). The table shows important network measures for certain proteins associated with vascular dementia. Degree shows how many links each protein has, Betweenness Centrality shows how often a protein lies on the shortest paths between other proteins (that is, how strongly it acts as a network hub), Clustering Coefficient shows how connected its neighbours are, and Edge Confidence Score shows how reliable the interaction is based on estimates from the STRING database.
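As a rough illustration of two of the measures named above, the sketch below computes degree and the local clustering coefficient on a small undirected graph. The protein labels P1 to P4 and the edge list are hypothetical placeholders, not values from the table, and the helper names are invented for this example.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from edge pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def degree(adjacency, node):
    """Number of links attached to a node."""
    return len(adjacency[node])


def local_clustering(adjacency, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no neighbour pairs exist, so the coefficient is taken as 0
    linked_pairs = sum(
        1 for a, b in combinations(neighbours, 2) if b in adjacency[a]
    )
    return 2.0 * linked_pairs / (k * (k - 1))


# Hypothetical toy network (labels are placeholders, not real proteins).
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P1", "P4")]
adjacency = build_adjacency(edges)

for node in sorted(adjacency):
    print(node, degree(adjacency, node), round(local_clustering(adjacency, node), 3))
```

Here P1 has three neighbours, of which only one pair (P2, P3) is linked, so its clustering coefficient is 1/3, while P2 and P3 each score 1.0.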
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Rare diseases affect 1 in 10 people in the United States and, despite increased genetic testing, up to half never receive a diagnosis. Even when advanced genome sequencing platforms are used to discover variants, if the literature contains no connection between the variants found in the patient’s genome and their phenotypes, the patient will remain undiagnosed. When a direct variant-phenotype connection is not known, putting a patient’s information in the larger context of phenotype relationships and protein-protein interactions may provide an opportunity to find an indirect explanation. Databases such as STRING contain millions of protein-protein interactions, and the Human Phenotype Ontology (HPO) contains the relations of thousands of phenotypes. By integrating these networks and clustering the entities within, we can potentially discover latent gene-to-phenotype connections. The historical records for STRING and HPO provide a unique opportunity to create a network time series for evaluating the cluster significance. Most excitingly, working with Children’s Hospital Colorado, we have provided promising hypotheses about latent gene-to-phenotype connections for 38 patients. We also provide potential answers for 14 patients listed on MyGene2. Clusters that our tool finds significant harbor 2.35 to 8.72 times as many gene-to-phenotype edges inferred from known drug interactions as clusters found to be insignificant. Our tool, BOCC, is available as a web app and command line tool.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
The ‘Total Nodes’ column contains the total number of nodes available in the network, while the ‘Gene (Protein) Nodes’ column shows the number of nodes with at least one gene in KEGG (or one protein in STRING). The fourth and fifth columns contain the total number of edges and the number of connected components having at least one gene (or protein), respectively. ‘Avg. Node Degree’ represents the number of edges a node has on average. ‘Max Node Degree’ denotes the maximum number of edges a node has in the network. ‘Clustering Coefficient’ is the ratio of the triangles to the connected triples in a graph.
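The degree and clustering columns described above can be illustrated with a small sketch. It assumes a simple undirected graph without self-loops, and it reads "ratio of the triangles to the connected triples" in the common transitivity sense (closed triples divided by all connected triples, which equals three times the triangle count over the triple count); whether the table uses exactly this convention is an assumption. The function name and toy edge list are made up for the example.

```python
from itertools import combinations


def graph_summary(edges):
    """Return average degree, maximum degree, and global clustering coefficient."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    degrees = [len(neighbours) for neighbours in adjacency.values()]

    # A connected triple is a centre node together with two of its neighbours.
    connected_triples = sum(k * (k - 1) // 2 for k in degrees)

    # A triple is closed when the two outer nodes are also linked.
    # Each triangle is counted three times here, once per centre node.
    closed_triples = sum(
        1
        for neighbours in adjacency.values()
        for a, b in combinations(neighbours, 2)
        if b in adjacency[a]
    )

    clustering = closed_triples / connected_triples if connected_triples else 0.0
    return {
        "avg_degree": sum(degrees) / len(degrees),
        "max_degree": max(degrees),
        "clustering_coefficient": clustering,
    }


# Toy graph: one triangle (A, B, C) plus a pendant node D attached to A.
summary = graph_summary([("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")])
print(summary)
```

For this toy graph there is one triangle and five connected triples, so the coefficient is 3 × 1 / 5 = 0.6.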
There are no widely accepted prognostic markers currently available to predict outcomes in patients with triple-negative breast cancer (TNBC), and no targeted therapies with confirmed benefit. We have used MALDI mass spectrometry imaging (MSI) of tryptic peptides to compare regions of cancer and benign tissue in 10 formalin-fixed, paraffin-embedded sections of TNBC tumors. Proteins were identified by reference to a peptide library constructed by LC-MALDI-MS/MS analyses of the same tissues. The prognostic significance of proteins that distinguished between cancer and benign regions was estimated by Kaplan-Meier analysis of their gene expression from public databases. Among peptides that distinguished between cancer and benign tissue in at least 3 tissues with a ROC area under the curve >0.7, 14 represented proteins identified from the reference library, including proteins not previously associated with breast cancer. Initial network analysis using the STRING database showed no obvious functional relationships except among collagen subunits COL1A1, COL1A2, and COL6A3, but manual curation, including the addition of EGFR to the analysis, revealed a unique network connecting 10 of the 14 proteins. Kaplan-Meier survival analysis of the relationship between tumor expression of the genes encoding the 14 proteins and recurrence-free survival (RFS) in patients with basal-like TNBC showed that, compared to low expression, high expression of nine of the genes was associated with significantly worse RFS, most with hazard ratios >2. In contrast, in estrogen receptor-positive tumors, high expression of these genes showed only low, or no, association with worse RFS. These proteins are proposed as putative markers of RFS in TNBC, and some may also be considered as possible targets for future therapies.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Whole-genome protein-protein association networks are not random, and their topological properties stem from genome evolution mechanisms. In fact, more connected but less clustered proteins are related to genes that, in general, have more paralogs than other genes, indicating frequent previous gene duplication episodes. On the other hand, genes related to conserved biological functions have few or no paralogs and yield proteins that are highly connected and clustered. These general network characteristics must have an evolutionary explanation. Considering data from the STRING database, we present here experimental evidence that the protein degree distributions of organisms, besides not being scale-free, show an increased probability of high-degree nodes. Furthermore, based on this experimental evidence, we propose a simulation model for genome evolution, where genes in a network are either acquired de novo using a preferential attachment rule, or duplicated with a probability that grows linearly with gene degree and decreases with its clustering coefficient. For the first time, a model yields results that simultaneously describe different topological distributions. This model also correctly predicts that producing protein-protein association networks with numbers of links and nodes in the range observed for Eukaryotes requires 90% gene duplication and 10% de novo gene acquisition. This scenario implies a universal mechanism for genome evolution.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
This is nowhere near an exhaustive list of papers or tools on the topic. It is not intended to be a systematic review but highlights the breadth and general shift in the methods up to the present.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Breakdowns are given for the four sets of clusters identified by the corresponding predictive models.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Introduction: Cirrhosis is one of the most important risk factors for the development of hepatocellular carcinoma (HCC). Recent studies have shown that removing or adequately controlling the underlying cause can reduce but not eliminate the risk of HCC. Therefore, it is important to elucidate the molecular mechanisms that drive the progression of cirrhosis to HCC.
Materials and Methods: Microarray datasets incorporating cirrhosis and HCC subjects were identified from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) were determined with GEO2R software. Functional enrichment analysis was performed with the clusterProfiler package in R. Liver carcinogenesis-related networks and modules were established using the STRING database and the MCODE plug-in, respectively, and were visualized with Cytoscape software. The ability of modular gene signatures to discriminate cirrhosis from HCC was assessed by hierarchical clustering, principal component analysis (PCA), and receiver operating characteristic (ROC) curves. The association of top modular genes with HCC grades or prognosis was analyzed with the UALCAN web tool. Protein expression and distribution of top modular genes were analyzed using the Human Protein Atlas (HPA) database.
Results: Four microarray datasets were retrieved from the GEO database. Compared with cirrhotic livers, 125 upregulated and 252 downregulated genes were found in HCC tissues. These DEGs constituted a liver carcinogenesis-related network with 272 nodes and 2954 edges, in which 65 highly connected nodes formed a liver carcinogenesis-related module. The modular genes were significantly involved in several KEGG pathways, such as “cell cycle,” “DNA replication,” “p53 signaling pathway,” “mismatch repair,” and “base excision repair.” These identified modular gene signatures could robustly discriminate cirrhosis from HCC in the validation dataset. In contrast, the expression pattern of the modular genes was consistent between cirrhotic and normal livers. The top modular genes TOP2A, CDC20, PRC1, CCNB2, and NUSAP1 were associated with HCC onset, progression, and prognosis, and exhibited higher expression in HCC compared with normal livers in the HPA database.
Conclusion: Our study revealed a highly connected module associated with liver carcinogenesis on a cirrhotic background, which may provide a deeper understanding of the genetic alterations involved in the transition from cirrhosis to HCC and offer valuable variables for screening and surveillance of HCC in high-risk patients with cirrhosis.