The Peptide Sequence Database contains putative peptide sequences from human, mouse, rat, and zebrafish. Compressed to eliminate redundancy, these are about 40-fold smaller than a brute-force enumeration. Current and old releases are available for download. Each species' peptide sequence database comprises peptide sequence data from relevant species-specific UniGene and IPI clusters, plus all sequences from their constituent EST, mRNA and protein sequence databases, namely RefSeq proteins and mRNAs, UniProt's SwissProt and TrEMBL, GenBank mRNA, ESTs, and high-throughput cDNAs, HInv-DB, VEGA, EMBL, IPI protein sequences, plus the enumeration of all combinations of UniProt sequence variants, Met loss PTM, and signal peptide cleavages. The README file contains some information about the non-amino-acid symbols O (digest site corresponding to a protein N- or C-terminus) and J (no digest sequence join) used in these peptide sequence databases and information about how to configure various search engines to use them. Some search engines handle (very) long sequences badly and in some cases must be patched to use these peptide sequence databases. All search engines supported by the PepArML meta-search engine can (or can be patched to) successfully search these peptide sequence databases.
A database of peptides based on sequence text mining and public peptide data sources. Only peptides that are 20 amino acids or shorter are stored. Only peptides with available sequences are stored. After submitting a query you can further refine the results using the new heat map retrieval tool to quickly find the entries that are most relevant to you. Text classification helps you find candidate peptides that are related to cancer, cardiovascular diseases, diabetes, apoptosis, angiogenesis and molecular imaging or peptides for which binding data exist.
PepBank is a database of peptides based on sequence text mining and public peptide data sources. Only peptides that are 20 amino acids or shorter are stored. Only peptides with available sequences are stored.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically.
In recent years, a variety of approaches have been developed using decoy databases to empirically assess the error associated with peptide identifications from large-scale proteomics experiments. We have developed an approach for calculating the expected uncertainty associated with false-positive rate determination using concatenated reverse and forward protein sequence databases. After explaining the theoretical basis of our model, we compare predicted error with the results of experiments characterizing a series of mixtures containing known proteins. In general, results from characterization of known proteins show good agreement with our predictions. Finally, we consider how these approaches may be applied to more complicated data sets, as when peptides are separated by charge state prior to false-positive determination. Keywords: Peptide Identification • False-Positive Rate • False Discovery Rate • Proteomics • Data Analysis • Mass Spectrometry • Reversed Database • Decoy Database
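As a rough illustration of the concatenated forward/reverse strategy described above, the sketch below estimates the false-positive rate as twice the number of reverse (decoy) hits divided by all hits passing a threshold, a commonly used estimator for concatenated searches. The attached uncertainty simply treats the reverse-hit count as a Poisson variable. The function name and this error approximation are assumptions made for illustration and are not the authors' model.

```python
import math


def estimate_fp_rate(n_forward: int, n_reverse: int) -> tuple[float, float]:
    """Return (estimated false-positive rate, approximate standard error).

    n_forward: hits to forward (target) sequences above the score threshold
    n_reverse: hits to reversed (decoy) sequences above the same threshold
    """
    total = n_forward + n_reverse
    if total == 0:
        raise ValueError("no identifications above threshold")
    # Each reverse hit is assumed to be matched by one false forward hit,
    # hence the factor of two.
    rate = 2.0 * n_reverse / total
    # Assumption: reverse-hit count is Poisson, so its sd is sqrt(n_reverse).
    std_err = 2.0 * math.sqrt(n_reverse) / total
    return rate, std_err


rate, err = estimate_fp_rate(n_forward=975, n_reverse=25)
print(f"estimated FP rate: {rate:.3f} +/- {err:.3f}")
```

With few reverse hits the relative uncertainty grows quickly, which is the practical reason to report an error estimate alongside the rate.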
SYFPEITHI is a database comprising more than 7000 peptide sequences known to bind class I and class II MHC molecules. The entries are compiled from published reports only. It contains a collection of MHC class I and class II ligands and peptide motifs of humans and other species, such as apes, cattle, chicken, and mouse, and is continuously updated. Searches for MHC alleles, MHC motifs, natural ligands, T-cell epitopes, source proteins/organisms and references are possible. Hyperlinks to the EMBL and PubMed databases are included. In addition, ligand predictions are available for a number of MHC allelic products. The database is based on previous publications on T-cell epitopes and MHC ligands. It contains information on peptide sequences, anchor positions, MHC specificity, source proteins and source organisms, and publication references. Since the number of motifs continuously increases, it was necessary to set up a database which facilitates the search for peptides and allows the prediction of T-cell epitopes. The prediction is based on published motifs (pool sequencing, natural ligands) and takes into consideration the amino acids in the anchor and auxiliary anchor positions, as well as other frequent amino acids. The score is calculated according to the following rules: the amino acids of a given peptide are assigned a specific value depending on whether they are anchor, auxiliary anchor or preferred residues. Ideal anchors are given 10 points, unusual anchors 6-8 points, auxiliary anchors 4-6 points and preferred residues 1-4 points. Amino acids that are regarded as having a negative effect on the binding ability are given values between -1 and -3. Sponsors: SYFPEITHI is supported by DFG-Sonderforschungsbereich 685 and the European Union: EU BIOMED CT95-1627, BIOTECH CT95-0263, and EU QLQ-CT-1999-00713.
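The scoring rules above amount to a position-specific lookup followed by a sum. The sketch below shows that idea for a 9-mer. The motif table is entirely hypothetical (it is not a SYFPEITHI matrix for any real allele); its values were only chosen to fall inside the point ranges quoted above.

```python
# Hypothetical 9-mer motif: 0-based position -> {residue: points}.
# Values are illustrative only and follow the ranges given in the text.
MOTIF = {
    1: {"L": 10, "M": 10, "I": 8},  # anchor: ideal = 10, unusual = 6-8
    8: {"V": 10, "L": 10, "A": 6},  # anchor
    5: {"V": 6, "I": 4},            # auxiliary anchor: 4-6
    3: {"E": 3, "K": 2, "P": -2},   # preferred: 1-4, unfavourable: -1 to -3
}


def score_peptide(peptide: str, motif: dict = MOTIF) -> int:
    """Sum the points of every residue that the motif assigns a value to."""
    return sum(motif.get(pos, {}).get(aa, 0) for pos, aa in enumerate(peptide))


for candidate in ("ALAEGVLAV", "SYFPEITHI"):
    print(candidate, score_peptide(candidate))
```

Residues and positions absent from the table contribute zero, so a peptide's score depends only on the positions the motif describes.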
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically.
For the analysis of homogeneous post-translational modifications such as protein phosphorylation and acetylation, a variable modification is set on the specific residue(s) to identify the modified peptides during database searching. However, this approach is often not applicable to the identification of intact mucin-type O-glycopeptides because of the high microheterogeneity of the glycosylation. Because there is virtually no carbohydrate-related tag on the peptide fragments after the O-glycopeptides are dissociated in HCD, we find it is unnecessary to set the variable mass tags on the Ser/Thr residues to identify the peptide sequences. In this study, we present a novel approach, termed O-Search, for the interpretation of O-glycopeptide HCD spectra. Instead of setting the variable mass tags on the Ser/Thr residues, we set variable mass tags at the peptide level. Every possible summed mass of O-glycan combinations on at most three S/T residues was subtracted from the precursor mass of the MS/MS spectrum. All the spectra with these new precursor masses were searched against the protein sequence database without setting variable glycan modifications. This method had a much smaller search space and excellent sensitivity in the identification of O-glycopeptides. Compared with the conventional searching approach, O-Search yielded 96%, 86%, and 79% improvement in glycopeptide spectra matching, glycopeptide identification, and peptide sequence identification, respectively. O-Search also enabled the consideration of more glycan structures and was well suited to analyzing the microheterogeneity of O-glycosylation.
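The precursor-mass step described above can be sketched as an enumeration of glycan combinations on one to three sites, each yielding a candidate glycan-free peptide mass. The glycan list below is a small illustrative set, not the list used in the study, and the function and variable names are hypothetical. The monosaccharide values are standard monoisotopic residue masses.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da.
HEXNAC, HEX, NEUAC = 203.07937, 162.05282, 291.09542

# Illustrative O-glycan compositions (assumption: the study's list is not given).
GLYCANS = {
    "HexNAc": HEXNAC,
    "HexNAc-Hex": HEXNAC + HEX,
    "HexNAc-Hex-NeuAc": HEXNAC + HEX + NEUAC,
}


def candidate_peptide_masses(precursor_mass, glycans=GLYCANS, max_sites=3):
    """Map each glycan combination on 1..max_sites sites to the neutral
    precursor mass minus the summed glycan mass."""
    candidates = {}
    for n_sites in range(1, max_sites + 1):
        # Order of sites does not change the summed mass, so unordered
        # combinations with repetition are enough.
        for combo in combinations_with_replacement(sorted(glycans), n_sites):
            candidates[combo] = precursor_mass - sum(glycans[g] for g in combo)
    return candidates


cands = candidate_peptide_masses(2000.0)
print(len(cands), "candidate peptide masses")
```

Each candidate mass would then be searched as an ordinary unmodified precursor, which is why the search space stays small compared with per-residue variable modifications.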
Protein sequence databases are indispensable tools for life science research including mass spectrometry (MS)-based proteomics. In current database construction processes, sequence similarity clustering is used to reduce redundancies in the source data. Albeit powerful, it ignores the peptide-centric nature of proteomic data and the fact that MS is able to distinguish similar sequences. Therefore, we introduce an approach that structures the protein sequence space at the peptide level using theoretical and empirical information from large-scale proteomic data to generate a mass spectrometry-centric protein sequence database (MScDB). The core modules of MScDB are an in-silico proteolytic digest and a peptide-centric clustering algorithm that groups protein sequences that are indistinguishable by mass spectrometry. Analysis of various MScDB use cases against five complex human proteomes resulted in 69 peptide identifications not present in UniProtKB as well as 79 putative single amino acid polymorphisms. MScDB retains ∼99% of the identifications in comparison to common databases despite a 3–48% increase in the theoretical peptide search space (but comparable protein sequence space). In addition, MScDB enables cross-species applications such as human/mouse graft models, and our results suggest that the uncertainty in protein assignments to one species can be smaller than 20%.
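The two core modules named above, an in-silico digest and a peptide-centric grouping, can be illustrated with a deliberately simplified sketch. It assumes tryptic cleavage (after K or R, not before P) with no missed cleavages, and it treats two proteins as indistinguishable only when their peptide sets are identical. Both choices, and all names below, are assumptions for illustration and not the MScDB algorithm itself.

```python
import re
from collections import defaultdict


def tryptic_digest(sequence: str, min_len: int = 6) -> frozenset:
    """Cleave after K or R unless followed by P; keep peptides >= min_len."""
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return frozenset(p for p in peptides if len(p) >= min_len)


def cluster_indistinguishable(proteins: dict, min_len: int = 6) -> list:
    """Group protein names whose digests yield the same set of peptides."""
    clusters = defaultdict(list)
    for name, seq in proteins.items():
        clusters[tryptic_digest(seq, min_len)].append(name)
    return sorted(sorted(names) for names in clusters.values())


# Toy sequences: P1 and P2 differ in order but give the same peptides.
proteins = {
    "P1": "MAAAAAKGGGGGGR",
    "P2": "GGGGGGRMAAAAAK",
    "P3": "MAAAAAKGGGGGGRLLLLLLK",
}
groups = cluster_indistinguishable(proteins)
print(groups)
```

In this toy case P1 and P2 fall into one group even though a sequence-similarity view would treat them as different proteins, which is the peptide-level perspective the abstract argues for.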
THIS RESOURCE IS NO LONGER IN SERVICE, documented on June 04, 2014. Curated database of proteins and peptides selected from randomized pools, designed for the accumulation of experimental data on protein functionality obtained by in vitro directed evolution methods (phage display, ribosome display, SIP, etc.). ASPD is integrated by means of hyperlinks with different databases (SWISS-PROT, PDB, PROSITE, etc.). The database also contains modules for pairwise correlation analysis and BLAST search.
Alternative splicing is a mechanism in eukaryotes by which different forms of mRNAs are generated from the same gene. Identification of alternative splice variants requires the identification of peptides specific for alternative splice forms. For this purpose, we generated a human database that contains only unique tryptic peptides specific for alternative splice forms from Swiss-Prot entries. This database allows easy access to splice variant-specific peptide sequences that match MS data. Furthermore, we combined this database without alternative splice variant-1-specific peptides with human Swiss-Prot. This combined database can be used as a general database for searching LC–MS data. LC–MS data derived from in-solution digests of two different cell lines (LNCaP, HeLa) and phosphoproteomics studies were analyzed using these two databases. Several nonalternative splice variant-1-specific peptides were found in both cell lines, and some of them seemed to be cell-line-specific. Control and apoptotic phosphoproteomes from Jurkat T cells revealed several nonalternative splice variant-1-specific peptides, and some of them showed clear quantitative differences between the two states.
Background: Microbiome research is providing important new insights into the metabolic interactions of complex microbial ecosystems involved in fields as diverse as the pathogenesis of human diseases, agriculture and climate change. Poor correlations typically observed between RNA and protein expression datasets make it hard to accurately infer microbial protein synthesis from metagenomic data. Additionally, mass spectrometry-based metaproteomic analyses typically rely on focused search sequence databases based on prior knowledge for protein identification that may not represent all the proteins present in a set of samples. Metagenomic 16S rRNA sequencing only targets the bacterial component, while whole genome sequencing is at best an indirect measure of expressed proteomes. Here we describe a novel approach, MetaNovo, that combines existing open-source software tools to perform scalable de novo sequence tag matching with a novel algorithm for probabilistic optimization of the entire UniProt knowledgebase to create tailored sequence databases for target-decoy searches directly at the proteome level, enabling metaproteomic analyses without prior expectation of sample composition or metagenomic data generation and compatible with standard downstream analysis pipelines.
Results: We compared MetaNovo to published results from the MetaPro-IQ pipeline on 8 human mucosal-luminal interface samples, with comparable numbers of peptide and protein identifications, many shared peptide sequences and a similar bacterial taxonomic distribution compared to that found using a matched metagenome sequence database, but simultaneously identified many more non-bacterial peptides than the previous approaches. MetaNovo was also benchmarked on samples of known microbial composition against matched metagenomic and whole genomic sequence database workflows, yielding many more MS/MS identifications for the expected taxa, with improved taxonomic representation, while also highlighting previously described genome sequencing quality concerns for one of the organisms, and identifying an experimental sample contaminant without prior expectation.
Conclusions: By estimating taxonomic and peptide level information directly on microbiome samples from tandem mass spectrometry data, MetaNovo enables the simultaneous identification of peptides from all domains of life in metaproteome samples, bypassing the need for curated sequence databases to search. We show that the MetaNovo approach to mass spectrometry metaproteomics is more accurate than current gold standard approaches of tailored or matched genomic sequence database searches, can identify sample contaminants without prior expectation and yields insights into previously unidentified metaproteomic signals, building on the potential for complex mass spectrometry metaproteomic data to speak for itself.
Multi-organism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments. Mass spectrometer output files are collected for human, mouse, yeast, and several other organisms, and searched using the latest search engines and protein sequences. All results of sequence and spectral library searching are subsequently processed through the Trans Proteomic Pipeline to derive a probability of correct identification for all results in a uniform manner to ensure a high quality database, along with false discovery rates at the whole atlas level. The raw data, search results, and full builds can be downloaded for other uses. All results of sequence searching are processed through PeptideProphet to derive a probability of correct identification for all results in a uniform manner, ensuring a high quality database. All peptides are mapped to Ensembl and can be viewed as custom tracks on the Ensembl genome browser. The long-term goal of the project is full annotation of eukaryotic genomes through a thorough validation of expressed proteins. The PeptideAtlas provides a method and a framework to accommodate proteome information coming from high-throughput proteomics technologies. The online database administers experimental data in the public domain. You are encouraged to contribute to the database.
The MHC-Peptide Interaction Database version T (MPID-T) is a new generation database for sequence-structure-function information on T cell receptor/peptide/MHC interactions. It contains all structures of TcR/pMHC and pMHC complexes, with emphasis on the structural characterization of these complexes. MPID-T will facilitate the development of algorithms to predict whether a peptide sequence will bind to a specific MHC allele. The database has been populated with data from the Protein Data Bank (PDB). The data from the PDB are manually verified and classified, after which each structure is analysed for atomic interactions relevant to the MHC-peptide complex.
Bottom-up mass spectrometry-based proteomics is challenged by the task of identifying the peptide that generates a tandem mass spectrum. Traditional methods that rely on known peptide sequence databases are limited and may not be applicable in certain contexts. De novo peptide sequencing, which assigns peptide sequences to the spectra without prior information, is valuable for various biological applications; yet, due to a lack of accuracy, it remains challenging to apply this approach in many situations. Here, we introduce InstaNovo, a transformer neural network with the ability to translate fragment ion peaks into the sequence of amino acids that make up the studied peptide(s). The model was trained on 28 million labelled spectra matched to ~742k human peptides from the ProteomeTools project. We demonstrate that InstaNovo outperforms current state-of-the-art methods on benchmark datasets and showcase its utility in several applications. Building upon human intuition, we also introduce InstaNovo+, a multinomial diffusion model that further improves performance by iterative refinement of predicted sequences. Using these models, we could de novo sequence antibody-based therapeutics with unprecedented coverage, discover novel peptides, and detect unreported organisms in different datasets, thereby expanding the scope and detection rate of proteomics searches. Finally, we could experimentally validate tryptic and non-tryptic peptides with targeted proteomics, demonstrating the fidelity of our predictions.
License: https://spdx.org/licenses/etalab-2.0.html
Leguminous crops are vital to sustainable agriculture due to their ability to fix atmospheric nitrogen, improving soil fertility and reducing the need for synthetic fertilizers. Additionally, they are an excellent source of protein for both human consumption and animal feed. AntiMicrobial Peptides (AMPs), found in various leguminous seeds, exhibit broad-spectrum antimicrobial activity through diverse mechanisms, including interaction with microbial cell membranes and interference with cellular processes, making them valuable for enhancing crop resilience and food safety. In the field of plant sciences, computational biology methods have been instrumental in the discovery and optimization of AMPs. These methods enable rapid exploration of sequence space and the prediction of AMPs using deep learning technologies. Optimizing AMP annotations through computational design offers a strategic approach to enhance efficacy and minimize potential side effects, providing a viable alternative to conventional antimicrobial agents. However, the presence of overlapping sequences across multiple databases poses a challenge for creating a reliable dataset for AMP prediction. To address this, we conducted a comprehensive analysis of sequence redundancy across various AMP databases. These databases encompass a wide range of AMPs from different sources and with specific functions, including both naturally occurring and artificially synthesized AMPs. Our analysis revealed significant overlap, underscoring the need for a non-redundant AMP sequence database. We present the development of a new database that consolidates unique AMP sequences derived from leguminous seeds, aiming to create a more refined dataset for the binary classification and prediction of plant-derived AMPs. This database will support the advancement of sustainable agricultural practices by enhancing the use of plant-based AMPs in agroecology, contributing to improved crop protection and food security.
We present new algorithms and a software implementation for assigning confidence to peptide sequence assignments obtained through classic accurate mass and retention time (AMT) matching techniques, as well as methods for integrating these assignments with standard proteomics workflows. The algorithms are intended to increase the number of peptides and proteins identified (and, when applicable, quantitated by isotopic labeling) among related proteomics experiments that use high-resolution mass spectrometry instrumentation. The motivations for our extensions include the need to exploit high-resolution data to support highly complex proteomics experiments, especially those involving extensive off-line fractionation, to which recent label-free workflows might not easily generalize.
DNA sequence and relationships for mature peptide (protein)
AAindex is a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids. AAindex now consists of three sections: AAindex1 for the amino acid index of 20 numerical values, AAindex2 for the amino acid mutation matrix and AAindex3 for the statistical protein contact potentials. All data are derived from published literature. An amino acid index is a set of 20 numerical values representing any of the different physicochemical and biological properties of amino acids. The AAindex1 section of the Amino Acid Index Database is a collection of published indices together with the result of cluster analysis using the correlation coefficient as the distance between two indices. This section currently contains 544 indices. Another important feature of amino acids that can be represented numerically is the similarity between amino acids. Thus, a similarity matrix, also called a mutation matrix, is a set of 210 numerical values, 20 diagonal and 20×19/2 = 190 off-diagonal elements, used for sequence alignments and similarity searches. The AAindex2 section of the Amino Acid Index Database is a collection of published amino acid mutation matrices together with the result of cluster analysis. This section currently contains 94 matrices. In release 9.0, we added a collection of published protein pairwise contact potentials to AAindex as AAindex3. This section currently contains 47 contact potential matrices. Sponsors: This work was supported by grants and resources from the Ministry of Education, Culture, Sports, Science and Technology, and the Japan Science and Technology Agency, and the Bioinformatics Center, Institute for Chemical Research, Kyoto University and the Super Computer System, Human Genome Center, Institute of Medical Science, University of Tokyo.
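The count of 210 values in a symmetric mutation matrix follows from listing every unordered pair of the 20 standard amino acids, including each residue paired with itself. The short sketch below checks that arithmetic; the one-letter ordering and variable names are arbitrary choices made for illustration and say nothing about how AAindex stores its matrices.

```python
from itertools import combinations_with_replacement

# The 20 standard amino acids in one-letter code (ordering is arbitrary here).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Unordered pairs, including identical pairs, cover one triangle of a
# symmetric 20 x 20 matrix plus its diagonal.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
diagonal = [p for p in pairs if p[0] == p[1]]
off_diagonal = [p for p in pairs if p[0] != p[1]]

print(len(diagonal), len(off_diagonal), len(pairs))
```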
Complex MS-based proteomics datasets are usually analyzed by protein database searches. While this approach performs well for sequenced organisms, direct inference of peptide sequences from tandem mass spectra, i.e. de novo peptide sequencing, is often the only way to obtain information when protein databases are absent. However, available algorithms suffer from drawbacks such as lack of validation and often high rates of false positive hits (FP). Here we present a simple method of combining results from commonly available de novo peptide sequencing algorithms, which, in conjunction with minor tweaks in data acquisition, yields a lower empirical FDR compared to analysis using single algorithms. Results were validated using state-of-the-art database search algorithms as well as specifically synthesized reference peptides. Thus, we could increase the number of PSMs meeting a stringent FDR of 5% more than threefold compared to the single best de novo sequencing algorithm alone, accounting for an average of 11,120 PSMs (combined) instead of 3,476 PSMs (alone) in triplicate 2 h LC-MS runs of tryptic HeLa digests.
CottonGen offers BLAST with genome, transcriptome, peptide and marker sequence databases from Gossypium species. Searches can be run using nucleotide sequences or peptide sequences: enter or upload FASTA sequence(s) to query and select a BLAST database. BLAST functionality is similar to that on NCBI. BLAST programs: blastn: search a nucleotide database using a nucleotide query; blastx: search a protein database using a translated nucleotide query; tblastn: search a translated nucleotide database using a protein query; blastp: search a protein database using a protein query. Resources in this dataset: Resource Title: Website Pointer for CottonGen BLAST Search. File Name: Web Page, url: https://www.cottongen.org/blast
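The four programs listed above differ only in the type of the query and the type of the database, so choosing one can be expressed as a small lookup. The helper below is a hypothetical convenience function written for illustration; it is not part of CottonGen or of the BLAST tools.

```python
# (query type, database type) -> BLAST program, as described in the listing.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",   # query is translated
    ("protein", "nucleotide"): "tblastn",  # database is translated
    ("protein", "protein"): "blastp",
}


def choose_blast_program(query_type: str, database_type: str) -> str:
    """Return the program name for a query/database type combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_type!r} query, "
            f"{database_type!r} database"
        ) from None


print(choose_blast_program("protein", "nucleotide"))
```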
Database containing several body fluid proteomes, including plasma, urine, and cerebrospinal fluid. Cell lines have been mapped to a depth of several thousand proteins and the red blood cell proteome has also been analyzed in depth. The liver proteome is represented with 3200 proteins. By employing high resolution MS and stringent validation criteria, false positive identification rates in MAPU are lower than 1:1000. Thus MAPU datasets can serve as reference proteomes in biomarker discovery. MAPU contains the peptides identifying each protein, measured masses, scores and intensities using a clickable interface of cell or body parts. Proteome data can be queried across proteomes by protein name, accession number, sequence similarity, peptide sequence and annotation information. More than 4500 mouse and 2500 human proteins have already been identified in at least one proteome. Basic annotation information and links to other public databases are provided in MAPU and we plan to add further analysis tools.