Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction to Computational Proteomics
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Comparison of computational and proteomics datasets for the secretomes of T. gondii and P. falciparum.
Protein glycosylation is a complex post-translational modification with crucial cellular functions in all domains of life. Currently, large-scale glycoproteomics approaches rely on glycan database dependent algorithms and are thus unsuitable for discovery-driven analyses of glycoproteomes. Therefore, we devised SugarPy, a glycan database independent Python module, and validated it on the glycoproteome of human breast milk. We further demonstrated its applicability by analyzing glycoproteomes with uncommon glycans stemming from the green alga Chlamydomonas reinhardtii and the archaeon Haloferax volcanii. Finally, SugarPy facilitated the novel characterization of glycoproteins from Cyanidioschyzon merolae.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Data for an exemplary metaproteomics data analysis with the MetaProteomeAnalyzer (MPA) and Prophane software tools. Data is from the PRIDE dataset PXD010550.
Files include:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
An example input file for Geena 2. This example file can be used for testing purposes. It includes 12 MS spectra generated by MALDI/TOF from four biological samples in the context of a real experiment. Three spectra were generated for each sample. The format of the file is described in detail, with examples, in the manuscript and in the information file on Input/Output data formats on the web site. (TXT 26 kb)
This dataset consists of 44 raw MS files, comprising 27 DIA (SWATH) and 15 DDA runs on a TripleTOF 5600 and two raw mass spectrometry files acquired on a Q Exactive. The composition of the dataset is described in the manuscript by Tsou et al., titled "DIA-Umpire: comprehensive computational framework for data independent acquisition proteomics", Nature Methods, in press. Raw files are deposited here in ProteomeXchange and are associated with the DIA-Umpire processed data. All DIA-Umpire processed results for each sample, together with DDA results, are deposited in separate folders. Also see the "DataSampleID.xlsx" associated with this Readme file. Internal reference from the Gingras lab ProHits implementation: Project 94, Export version VS2 (Tsou_DIA-Umpire)
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
Proteomics data analysis strongly benefits from not studying single proteins in isolation but taking their multivariate interdependence into account. We introduce PerseusNet, the new Perseus network module for the biological analysis of proteomics data. Proteomics is commonly used to generate networks, e.g., with affinity purification experiments, but networks are also used to explore proteomics data. PerseusNet supports the biomedical researcher for both modes of data analysis with a multitude of activities. For affinity purification, a volcano-plot-based statistical analysis method for network generation is featured which is scalable to large numbers of baits. For posttranslational modifications of proteins, such as phosphorylation, a collection of dedicated network analysis tools helps in elucidating cellular signaling events. Co-expression network analysis of proteomics data adopts established tools from transcriptome co-expression analysis. PerseusNet is extensible through a plugin architecture in a multi-lingual way, integrating analyses in C#, Python, and R, and is freely available at http://www.perseus-framework.org.
In high-throughput LC-MS/MS-based proteomics, information about the presence and stoichiometry of post-translational modifications is normally not readily available. To overcome this problem we developed multiFLEX-LF, a computational tool that builds upon FLEXIQuant and FLEXIQuant-LF, which detect modified peptides and quantify their modification extent by monitoring the differences between observed and expected intensities of the unmodified peptides. To this end, multiFLEX-LF relies on robust linear regression to calculate the modification extent of a given peptide relative to a within-study reference. multiFLEX-LF can analyze entire label-free discovery proteomics datasets. Furthermore, to analyze modification dynamics and co-regulated modifications, the peptides of all proteins are hierarchically clustered based on their computed relative modification scores. To demonstrate the versatility of multiFLEX-LF we applied it on a cell-cycle time series dataset acquired using data-independent acquisition. The clustering of the peptides highlighted several groups of peptides with different modification dynamics across the four analyzed time points providing evidence of the kinases involved in the cell-cycle. Overall, multiFLEX-LF enables fast identification of potentially differentially modified peptides and quantification of their differential modification extent in large datasets. Additionally, multiFLEX-LF can drive large-scale investigation of modification dynamics of peptides in time series and case-control studies. multiFLEX-LF is available at https://gitlab.com/SteenOmicsLab/multiflex-lf.
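The idea of comparing observed intensities with the intensities expected from a reference can be sketched as follows. This is a minimal illustration, not the multiFLEX-LF implementation: the abstract only says "robust linear regression", so a Theil-Sen estimator is swapped in here as one robust option, and the function names, the capping of scores at 1.0, and the toy intensities are all assumptions made for the example.

```python
from itertools import combinations
from statistics import median


def theil_sen(x, y):
    """Robust line fit: median of pairwise slopes, then median residual offset."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    if not slopes:
        raise ValueError("need at least two distinct x values")
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


def relative_scores(reference, sample):
    """Observed / expected intensity per peptide, capped at 1.0 (assumed convention)."""
    slope, intercept = theil_sen(reference, sample)
    scores = []
    for r, s in zip(reference, sample):
        expected = slope * r + intercept
        scores.append(min(s / expected, 1.0) if expected > 0 else float("nan"))
    return scores


# Hypothetical unmodified-peptide intensities of one protein.
reference = [10.0, 20.0, 30.0, 40.0, 50.0]
sample = [20.0, 40.0, 30.0, 80.0, 100.0]  # third peptide is depleted
scores = relative_scores(reference, sample)
```

In this toy input the outlying third peptide does not distort the fitted line, so its score drops well below the others, which is the signal one would read as a candidate modification.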
Cross-linking combined with mass spectrometry (XL-MS) provides a wealth of information about the 3D structure of proteins and their interactions. We introduce MaxLynx, a novel computational proteomics workflow for XL-MS integrated into the MaxQuant environment. Here, we tested the performance of MaxLynx on data sets generated with a Bruker timsTOF Pro instrument.
The computational biology market is experiencing robust growth, driven by the increasing adoption of advanced technologies like artificial intelligence (AI) and machine learning (ML) in drug discovery and development. The market's Compound Annual Growth Rate (CAGR) of 13.33% from 2019 to 2024 indicates a significant upward trajectory, projected to continue into the forecast period (2025-2033). Key drivers include the rising prevalence of chronic diseases necessitating faster and more efficient drug development processes, the decreasing cost of high-throughput sequencing and data storage, and the increasing availability of large biological datasets fueling advanced computational analyses. The market segmentation reveals strong demand across various applications, including cellular and biological simulations (particularly in genomics and proteomics), drug discovery and disease modeling (with target identification and validation being prominent areas), and preclinical drug development (focused on pharmacokinetics and pharmacodynamics). Clinical trial applications are also significant, spanning Phases I, II, and III. Software tools like databases, analysis software, and specialized infrastructure are critical components, further segmented by service type (in-house vs. contract) and end-user (academic institutions and commercial entities). North America currently holds a significant market share, but Asia-Pacific is projected to witness substantial growth owing to increasing investments in research and development and the rising adoption of computational biology techniques in emerging economies. The competitive landscape is dynamic, with several major players such as Dassault Systèmes SE, Certara, and Schrödinger contributing to innovation. However, the market also includes numerous smaller, specialized companies focusing on niche applications or specific technologies. 
This competitive landscape encourages continuous innovation, driving the development of more sophisticated software, improved algorithms, and enhanced analytical capabilities. While data limitations exist regarding precise market size figures, extrapolating from the provided CAGR and industry reports suggests a substantial current market value, exceeding several billion dollars and poised for continued expansion. The focus on precision medicine and personalized therapies further strengthens the long-term growth potential of the computational biology market. Challenges include the complexity of biological systems, the need for robust data validation, and the ethical considerations associated with the use of AI and big data in healthcare. Recent developments include: February 2023: The Centre for Development of Advanced Computing (C-DAC) launched two software tools critical for research in life sciences. Integrated Computing Environment, one of the products, is an indigenous cloud-based genomics computational facility for bioinformatics that integrates ICE-cube, a hardware infrastructure, and ICE flakes. This software will help securely store and analyze petascale to exascale genomics data. January 2023: Insilico Medicine, a clinical-stage, end-to-end artificial intelligence (AI)-driven drug discovery company, launched the 6th generation Intelligent Robotics Lab to accelerate its AI-driven drug discovery. The fully automated AI-powered robotics laboratory performs target discovery, compound screening, precision medicine development, and translational research. Key drivers for this market are: Increase in Bioinformatics Research, Increasing Number of Clinical Studies in Pharmacogenomics and Pharmacokinetics; Growth of Drug Designing and Disease Modeling. Potential restraints include: Increase in Bioinformatics Research, Increasing Number of Clinical Studies in Pharmacogenomics and Pharmacokinetics; Growth of Drug Designing and Disease Modeling.
Notable trends are: the Industry and Commercials sub-segment is expected to hold the highest market share in the end-user segment.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
We used decoupleR to evaluate the performance of individual methods by recovering perturbed transcription factors (TFs) from a curation of single-gene perturbation experiments (Holland et al., 2020). As a resource we used DoRothEA, a gene regulatory network linking TFs to target genes by their mode of regulation (Garcia-Alonso et al., 2019). Perturbation experiments where the targeted regulator was not in DoRothEA were removed. After filtering, this dataset is composed of gene expression data from 92 knockdown and overexpression experiments of 40 unique TFs in human cells. Additionally, we tested the performance of decoupleR on phospho-proteomic data. For this, we filtered in a similar fashion a curated set of knockdown and overexpression single-kinase perturbation experiments, obtaining 63 experiments including 14 unique kinases, and applied a weighted resource from the same publication that links kinases to their target phosphosites (Hernandez-Armenta et al., 2017). For the transcriptomic dataset, differential expression analysis was performed with limma (Ritchie et al., 2015) and the resulting t-values were used as input. For the phospho-proteomics, the quantile-normalized log2-fold changes from different studies were used to make them comparable.
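To make the input and output of such a benchmark concrete, here is a small sketch of one simple activity score: a weighted sum of gene-level statistics over a regulator's targets, with the sign of each weight encoding the mode of regulation. It is an illustration only and does not use the decoupleR API; the function name, gene and regulator names, and the numbers are hypothetical.

```python
def weighted_sum_activity(stat, network):
    """Score each regulator as sum(weight * statistic) over its measured targets.

    stat    -- dict: gene -> statistic (e.g. a limma t-value)
    network -- dict: regulator -> {target gene: signed weight}
    """
    scores = {}
    for regulator, targets in network.items():
        # Targets missing from the data are skipped rather than treated as zero.
        shared = [gene for gene in targets if gene in stat]
        scores[regulator] = sum(targets[gene] * stat[gene] for gene in shared)
    return scores


# Hypothetical t-values from a differential expression analysis.
t_values = {"GENE_A": 4.0, "GENE_B": -3.0, "GENE_C": 0.5, "GENE_D": -2.0}

# Hypothetical network: +1 activates, -1 represses.
network = {
    "TF1": {"GENE_A": 1.0, "GENE_B": -1.0},
    "TF2": {"GENE_C": 1.0, "GENE_D": 1.0, "GENE_X": -1.0},
}

activity = weighted_sum_activity(t_values, network)
```

A perturbation benchmark of the kind described above would then check whether the regulator that was knocked down or overexpressed ranks among the most extreme scores.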
Background: In recent years, an innovative strategy using laser microdissection and mass spectrometry markedly expanded the landscape of antigens associated with membranous nephropathy (MN). Specific associations with phenotypes, diseases and sometimes reversible triggers led to a novel antigen-based classification of MN, paving the way for precision medicine and stressing the need for more routine use of proteomics in MN. Methods: To explore the proteomic landscape of human glomeruli and identify podocyte antigens and disease mechanisms in MN, we expanded the original technique to an integrative approach combining laser capture microdissection, next-generation mass spectrometry and computational analysis. Next to conventional data-dependent acquisition (DDA), we used and assessed the diagnostic yield of the more comprehensive data-independent acquisition (DIA) mass spectrometry, which enables the detection and quantification of every peptide in a sample irrespective of its level of abundance or m/z value. Our proteomic pipeline was applied to residual material from kidney biopsies in 64 individuals, including 31 healthy controls; 5 disease controls; 5 PLA2R-associated MN; and 23 PLA2R-negative MN. Results: Unbiased analyses confirmed the significant enrichment in PLA2R, IgG4 and complement proteins in glomeruli from patients with PLA2R-MN compared with healthy and disease controls, while molecular characterization of complement fragments provided evidence for complement activation in PLA2R-MN. Compared to DDA, DIA mass spectrometry increased the number of glomerular proteins (~3800 vs. ~1200) identified in healthy glomeruli; allowed the detection of all known antigens except NELL1 in normal glomeruli; and increased the detection rate of podocyte antigens from 50% to >80% in PLA2R-negative MN.
Conclusions: This proof-of-concept study suggests that an integrative approach combining laser microdissection, DIA mass spectrometry and computational biology is a powerful tool, with translational potential, to identify podocyte antigens and unravel disease mechanisms in MN.
Label-free quantitative mass spectrometry (MS) based on the Normalized Spectral Abundance Factor (NSAF) has emerged as a simple and reasonably robust method to determine the relative abundance of individual proteins within complex mixtures. Here, we describe Morpheus Spectral Counter (MSpC) as the first computational tool that directly calculates NSAF values from output obtained from Morpheus, a fast, open-source, peptide-MS/MS matching engine compatible with high-resolution mass instruments. NSAF has distinct advantages over other MS-based quantification methods, including a higher dynamic range as compared to isobaric tags, no requirement to align and re-extract MS1 peaks, and increased speed. MSpC features an easy-to-use graphical user interface that additionally calculates both distributed and unique NSAF values to permit analyses of both protein families and isoforms/proteoforms. MSpC determinations of protein concentration were linear over several orders of magnitude based on the analysis of several high-mass accuracy datasets either obtained from the Proteomics Identifications Repository or generated de novo with total cell extracts spiked with purified Arabidopsis 20S proteasomes. The MSpC software was developed in C# and is open sourced under a permissive license with the code made available at http://dcgemperline.github.io/Morpheus_SpC/.
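For readers unfamiliar with NSAF, the basic calculation divides each protein's spectral count by its length and then normalizes by the sum of those ratios over all proteins in the sample. The sketch below shows only this basic form in Python; it is not the MSpC code (which is written in C#), it does not cover the distributed or unique NSAF variants, and the protein identifiers, counts, and lengths are made up.

```python
def nsaf(spectral_counts, lengths):
    """Return NSAF per protein: (SpC / L) divided by the sum of SpC / L over all proteins."""
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("no spectral counts to normalize")
    return {p: value / total for p, value in saf.items()}


# Hypothetical spectral counts and protein lengths (in residues).
counts = {"P1": 30, "P2": 10, "P3": 5}
lengths = {"P1": 300, "P2": 100, "P3": 500}

values = nsaf(counts, lengths)
```

P1 and P2 receive the same NSAF here even though P1 has three times the spectra, because P1 is also three times longer; this length correction is the point of the method.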
These tissue-level multi-omic graphical analysis reports are provided as additional data for the manuscript “Temporal dynamics of the multi-omic response to endurance exercise training across tissues” (MoTrPAC Study Group, bioRxiv, 2022). Extensive background is included in each report. Briefly, we used a graphical clustering approach to define and visualize the temporal dynamics of molecular analytes regulated by endurance exercise training at multiple training time points in male and female rats across many data types ("omes") and tissues. The objective of these multi-omic reports is to share representations of >34,000 training-regulated molecular features in interactive HTML reports that allow researchers to extract meaningful biology from a complex dataset. Each report presents a summary of the significantly training-regulated features across omes in a specific tissue and the corresponding graphical analysis results, as well as features and pathway enrichment results corresponding to the largest graphical clusters (nodes, edges, and paths) for that tissue. A graphical cluster is a group of training-regulated features that share temporal behavior at some point during the training time course. These multi-omic reports are generated using data and functions available through the MotrpacRatTraining6mo R package. Install this R package to explore the data yourself!
Post-translational modifications (PTMs) are under significant focus in molecular biomedicine due to their importance in signal transduction in most cellular and organismal processes. Identification of PTMs, determination of PTM location sites, discrimination between functional and inert PTMs, and quantification of their occupancies are demanding tasks, especially in the light of PTM crosstalk in each biosystem. On top of that, the study of each PTM often necessitates a particular experimental design. Computational approaches can identify the relevant PTMs in a biosystem and help to design follow-up experiments involving specific PTM enrichment. Here, we present a PTM-centric proteome informatic pipeline for prediction of the most probable and relevant PTMs in mass spectrometry-based proteomics data and for refining raw data search parameters based on the acquired knowledge. Using expression profiling, we identified cellular proteins that are differentially regulated in response to the multikinase inhibitors dasatinib and staurosporine at four different concentrations. Computational enrichment analysis was employed to determine the potential PTMs of protein targets for both drugs. Finally, we conducted an additional round of database search with these predicted chemical modifications. Our pipeline helped analyze the enriched PTMs and even detected proteins that were not picked up in the initial search. Our findings support the idea of PTM-oriented searching of MS data in proteomics based on computational enrichment analysis.
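The abstract does not state which statistic the enrichment analysis used. As one common choice, a one-sided hypergeometric test asks whether proteins annotated with a given PTM are over-represented among the regulated proteins. The sketch below assumes that test; the function name and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n proteins from N, of which K carry the annotation.

    k -- regulated proteins carrying the PTM annotation
    n -- regulated proteins in total
    K -- annotated proteins in the background
    N -- background proteins in total
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 3 of 4 regulated proteins are annotated as phosphoproteins,
# against a background of 20 proteins with 5 annotated.
p_value = hypergeom_pvalue(k=3, n=4, K=5, N=20)
```

PTMs with small p-values (after correction for the number of PTM types tested) would be the candidates to add as variable modifications in a second database search.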
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
The Library of Integrated Network-Based Cellular Signatures (LINCS) project aims to create a network-based understanding of biology by cataloging changes in gene expression and signal transduction that occur when cells are exposed to a variety of perturbations. It is helpful for understanding cell pathways and facilitating drug discovery. Here, we developed a novel approach to infer cell-specific pathways and identify a compound's effects using gene expression and phosphoproteomics data under treatments with different compounds. Gene expression data were employed to infer potential targets of compounds and create a generic pathway map. Binary linear programming (BLP) was then developed to optimize the generic pathway topology based on the mid-stage signaling response of phosphorylation. To demonstrate effectiveness of this approach, we built a generic pathway map for the MCF7 breast cancer cell line and inferred the cell-specific pathways by BLP. The first group of 11 compounds was utilized to optimize the generic pathways, and then 4 compounds were used to identify effects based on the inferred cell-specific pathways. Cross-validation indicated that the cell-specific pathways reliably predicted a compound's effects. Finally, we applied BLP to re-optimize the cell-specific pathways to predict the effects of 4 compounds (trichostatin A, MS-275, staurosporine, and digoxigenin) according to compound-induced topological alterations. Trichostatin A and MS-275 (both HDAC inhibitors) inhibited the downstream pathway of HDAC1 and caused cell growth arrest via activation of p53 and p21; the effects of digoxigenin were totally opposite. Staurosporine blocked the cell cycle via p53 and p21, but also promoted cell growth via activated HDAC1 and its downstream pathway. Our approach was also applied to the PC3 prostate cancer cell line, and the cross-validation analysis showed very good accuracy in predicting effects of 4 compounds. 
In summary, our computational model can be used to elucidate potential mechanisms of a compound's efficacy.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Summary of the dataset and subsequent analyses.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
The majority of large-scale proteomics quantification methods yield long lists of quantified proteins that are often difficult to interpret and poorly reproduced. Computational approaches are required to analyze such intricate quantitative proteomics data sets. We propose a statistical approach to computationally identify protein sets (e.g., Gene Ontology (GO) terms) that are significantly enriched with abundant proteins with reproducible quantification measurements across a set of replicates. To this end, we developed PSEA-Quant, a protein set enrichment analysis algorithm for label-free and label-based protein quantification data sets. It offers an alternative approach to classic GO analyses, models protein annotation biases, and allows the analysis of samples originating from a single condition, unlike analogous approaches such as GSEA and PSEA. We demonstrate that PSEA-Quant produces results complementary to GO analyses. We also show that PSEA-Quant provides valuable information about the biological processes involved in cystic fibrosis using label-free protein quantification of a cell line expressing a CFTR mutant. Finally, PSEA-Quant highlights the differences in the mechanisms taking place in the human, rat, and mouse brain frontal cortices based on tandem mass tag quantification. Our approach, which is available online, will thus improve the analysis of proteomics quantification data sets by providing meaningful biological insights.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
Computational analysis of shotgun proteomics data can now be performed in a completely automated and statistically rigorous way, as exemplified by the freely available MaxQuant environment. The sophisticated algorithms involved and the sheer amount of data translate into very high computational demands. Here we describe parallelization and memory optimization of the MaxQuant software with the aim of executing it on a large computer cluster. We analyze and mitigate bottlenecks in overall performance and find that the most time-consuming algorithms are those detecting peptide features in the MS1 data as well as the fragment spectrum search. These tasks scale with the number of raw files and can readily be distributed over many CPUs as long as memory access is properly managed. Here we compared the performance of a parallelized version of MaxQuant running on a standard desktop, an I/O performance optimized desktop computer (“game computer”), and a cluster environment. The modified gaming computer and the cluster vastly outperformed a standard desktop computer when analyzing more than 1000 raw files. We apply our high performance platform to investigate incremental coverage of the human proteome by high resolution MS data originating from in-depth cell line and cancer tissue proteome measurements.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
Isobaric labeling-based proteomics is widely applied in deep proteome quantification. Among the platforms for isobaric labeled proteomic data analysis, the commercial software Proteome Discoverer (PD) is widely used, incorporating the search engine CHIMERYS, while FragPipe (FP) is relatively new, free for noncommercial purposes, and integrates the engine MSFragger. Here, we compared PD and FP over three public proteomic data sets labeled using 6plex, 10plex, and 16plex tandem mass tags. Our results showed that the protein abundances generated by the two software packages are highly correlated. PD quantified more proteins (10.02%, 15.44%, 8.19%) than FP with comparable NA ratios (0.00% vs. 0.00%, 0.85% vs. 0.38%, and 11.74% vs. 10.52%) in the three data sets. Using the 16plex data set, PD and FP outputs showed high consistency in quantifying technical replicates, batch effects, and functional enrichment in differentially expressed proteins. However, FP saved 93.93%, 96.65%, and 96.41% of processing time compared to PD for analyzing the three data sets, respectively. In conclusion, while PD is well-maintained commercial software that integrates various additional functions and can quantify more proteins, FP is freely available and achieves similar output with a shorter computational time. Our results will guide users in choosing the most suitable quantification software for their needs.