This dataset is a selection of protein IDs for successful targets from the Therapeutic Target Database (release 4.3.02, 18th Oct 2013). The web page states 388 targets, but these reduce to 345 human Swiss-Prot accessions.
The UniProtKB/Swiss-Prot database contains protein sequence information.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
All unigenes of Portunus sanguinolentus hit to the Swiss-Prot database.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BLASTP vs SwissProt
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
False Positives
Other signaling molecules: FGF-3,5,7,10,17,18; GDNF; CD8,28; PDGF-2; TGF; VEGF (vascular endothelial growth factor); HBNF-1; MIP; NGF (nerve growth factor); Cytokine A21; IFN-α (interferon alpha); IGF binding protein 1B,2,3; IL7 (interleukin 7).
Other: MAGF (microfibril associated protein), MINK (K-channel), K-channel related peptide, L-type Ca2+ channel, gamma subunit, myelin Po protein, Dif-2, Eosinophil, Syntaxin 1B (vesicle docking), Syntaxin 2, TMP21 (vesicle trafficking protein), Coagulation factor III, PGD2 synthase, syndecans, FKBP12 (FK506 binding protein), Folate receptor, ERp29, COMT, Connexin 32, Cytostatin.
This dataset is supplementary data from "Analysis of in vitro bioactivity data extracted from drug discovery literature and patents: Ranking 1654 human protein targets by assayed compounds and molecular scaffolds" (2011). In this case the Entrez Gene IDs were mapped to 1651 human Swiss-Prot accessions, but this set includes both approved and research targets.
Results of blastx searches against the Swiss-Prot database
The UniProt reviewed proteins database, uploaded with all columns for easier use in Kaggle notebooks. Every column has a description; if you have any questions, check UniProt Help, where every column has a full explanation.
For UniProt Species Proteomes check this dataset.
License: Creative Commons Attribution 4.0 International (CC BY 4.0) License
This dataset is supplementary data from "Novelty in the target landscape of the pharmaceutical industry" (2013). The listing of proven drug targets is converted to 248 human Swiss-Prot accessions.
Shotgun and positional proteomics study of a mouse embryonic stem cell line. We devised a proteogenomic approach constructing a custom protein sequence search space, built from both SwissProt and RIBO-seq derived translation products, applicable for LC-MSMS spectrum identification. To record the impact of using the constructed deep proteome database we performed two alternative MS-based proteomic strategies: (I) a regular shotgun proteomic and (II) an N-terminal COFRADIC approach. The obtained fragmentation spectra were searched against the custom database (combination of UniProtKB-SwissProt and RIBO-seq derived translation sequences) using three different search engines: OMSSA (version 2.1.9), X!Tandem (TORNADO, version 2010.01.01.04) and Mascot (version 2.3). The first two were run from the SearchGUI graphical user interface (version 1.10.4). A combination of X!Tandem and Mascot was used for the N-terminal COFRADIC analysis, and a combination of all three search engines for the shotgun proteome analysis. Note that OMSSA cannot cope with the protease setting semi-ArgC/P needed to analyze N-terminal COFRADIC data.
For the shotgun proteome data, trypsin was set as cleavage enzyme allowing for one missed cleavage. Singly to triply charged precursors were taken into account for Mascot, and singly to quadruply charged precursors for X!Tandem/OMSSA. The precursor and fragment mass tolerances were set to 10 ppm and 0.5 Da, respectively. Methionine oxidation to methionine-sulfoxide, pyroglutamate formation of N-terminal glutamine and acetylation (protein N-terminus) were set as variable modifications.
For the N-terminal COFRADIC analysis the protease setting semi-ArgC/P (Arg-C specificity with arginine-proline cleavage allowed) was used. No missed cleavages were allowed, and the precursor and fragment mass tolerances were again set to 10 ppm and 0.5 Da, respectively. Carbamidomethylation of cysteine, methionine oxidation to methionine-sulfoxide and 13C3D2-acetylation of lysines were set as fixed modifications. Peptide N-terminal acetylation or 13C3D2-acetylation and pyroglutamate formation of N-terminal glutamine were set as variable modifications, and the instrument setting was put on ESI-TRAP.
Protein and peptide identification, in addition to data interpretation, was done using the PeptideShaker algorithm (http://code.google.com/p/peptide-shaker, version 0.18.3), setting the false discovery rate to 1% at all levels (protein, peptide, and peptide to spectrum matching). The aforementioned tools and algorithms (SearchGUI, X!Tandem, OMSSA, and PeptideShaker) are freely available as open source.
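The two tolerances quoted above behave differently: the 10 ppm precursor tolerance is relative to the mass being matched, while the 0.5 Da fragment tolerance is absolute. The following is a minimal sketch of that arithmetic only; the function names and the example m/z values are illustrative assumptions and are not taken from any of the search engines named above.

```python
def ppm_window(mz: float, tolerance_ppm: float) -> tuple[float, float]:
    """Return the (low, high) m/z window for a relative tolerance given in ppm."""
    delta = mz * tolerance_ppm / 1e6
    return mz - delta, mz + delta


def precursor_matches(observed_mz: float, theoretical_mz: float,
                      tolerance_ppm: float = 10.0) -> bool:
    """Relative check: the mass error in ppm must not exceed the tolerance."""
    error_ppm = (observed_mz - theoretical_mz) / theoretical_mz * 1e6
    return abs(error_ppm) <= tolerance_ppm


def fragment_matches(observed_mz: float, theoretical_mz: float,
                     tolerance_da: float = 0.5) -> bool:
    """Absolute check: the mass error in Da must not exceed the tolerance."""
    return abs(observed_mz - theoretical_mz) <= tolerance_da


# Hypothetical values, chosen only to illustrate the scale of each tolerance.
low, high = ppm_window(800.0, 10.0)      # 10 ppm of 800 is 0.008
print(f"10 ppm window at m/z 800: {low:.3f} to {high:.3f}")
print(precursor_matches(800.005, 800.0))  # 6.25 ppm error
print(fragment_matches(500.3, 500.0))     # 0.3 Da error
```

At m/z 800 the precursor window is only 0.008 wide on each side, which is why the much looser absolute tolerance is reserved for fragment ions.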
Curated component of UniProtKB (produced by the UniProt consortium). It contains hundreds of thousands of protein descriptions, including function, domain structure, subcellular location, post-translational modifications and functionally characterized variants.
THIS RESOURCE IS NO LONGER IN SERVICE, documented on June 08, 2011. The Brain Gene Expression Database (BGED) contains gene expression data for various physiological and pathological processes in mouse brain. All the data have been obtained by adaptor-tagged competitive PCR, an advanced version of quantitative PCR.
Manual
1. Data retrieval: Gene expression data can be retrieved from this page either by ID numbers or by keywords representing functional annotations. The ID numbers include GenBank, RefSeq, SwissProt, Gene Ontology, and BED (our own ID). The keyword search is based on the definition in GenBank, SwissProt and RefSeq, on the functional annotation of the SwissProt database, or on Gene Ontology terms.
2. Gene expression pattern display
* Display of multiple gene expression patterns: Expression patterns of multiple genes selected by the keyword search can be displayed from the result page of the keyword search.
* Gene expression pattern similarity search: This function is available on the information page of each gene accessed through BED ID (in-house ID).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Protein sequence and structure data
This data set contains data from Uniprot (in the files called protein_sequence, protein_synonyms, protein_names, organism_synonyms) and PDB (in the files called PDB and PDB_chain) as used by the Aquaria web resource at the time of download (2022-02-08).
The PSSH2 data set
PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.
Calculating PSSH2
The Swiss-Prot and PDB data were downloaded in November 2021.
Generating PSSH2: We used UniRef30_2021_03 (originally called UniRef30_2021_06) from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30%. The HHblits code and the code for running the calculations were retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe until December 2021.
PDB based sequence-to-structure alignments
In addition to the PSSH2 data, new PDB structures were retrieved based on the primary accession of the proteins, by querying for all chains in all PDB entries with exact matches using the sequence cross references records given in PDB. Sequence-to-structure alignments were then created, again based on information provided in each PDB entry. These are contained in the PDBchain data.
This data covers sequences and PDB structures in the timeframe until February 2022.
Evaluating PSSH2
The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits:
The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/)
The set of expected hits comprises all PDB structures containing a domain with the same CATH code, either if contained in the set of processed sequences (-> all) or only if also contained in the set of non-redundant sequences (-> nr40).
The set of true positives is defined by sharing the same CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").
The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate TPR) by cumulatively considering all hits with an E-value below the threshold ("C") or in bins with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in November 2021 (202111) as well as previous data from October 2020 (202010), February 2020 (202002) and September 2017 (201709). The results are collected in PSSH CATH validation.csv.
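The cumulative ("C") and binned ("B") evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce PSSH CATH validation.csv: the function name `rates`, the toy hit list and the choice to divide binned true positives by the full number of expected hits are all assumptions made for the example.

```python
from typing import NamedTuple


class Rates(NamedTuple):
    fdr: float  # false discovery rate: FP / (TP + FP)
    tpr: float  # recall (true positive rate): TP / number of expected hits


def rates(hits: list[tuple[float, bool]], threshold: float,
          n_expected: int, mode: str = "C") -> Rates:
    """Compute FDR and TPR for hits given as (e_value, is_true_positive).

    mode "C": cumulative, all hits with an E-value below the threshold.
    mode "B": binned, hits with threshold / 10 <= E-value < threshold.
    """
    if mode == "C":
        selected = [ok for e_value, ok in hits if e_value < threshold]
    elif mode == "B":
        selected = [ok for e_value, ok in hits
                    if threshold / 10 <= e_value < threshold]
    else:
        raise ValueError("mode must be 'C' or 'B'")
    true_pos = sum(selected)
    false_pos = len(selected) - true_pos
    fdr = false_pos / len(selected) if selected else 0.0
    tpr = true_pos / n_expected if n_expected else 0.0
    return Rates(fdr, tpr)


# Toy data (assumed): E-value plus whether the hit shares the query's CATH code.
toy_hits = [(1e-30, True), (1e-12, True), (5e-11, False),
            (1e-5, True), (5e-4, False)]
print(rates(toy_hits, threshold=1e-10, n_expected=4, mode="C"))
print(rates(toy_hits, threshold=1e-10, n_expected=4, mode="B"))
```

The cumulative view shows what a user would get when applying the threshold as a cut-off, while the binned view isolates the quality of the hits that sit just below it.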
Known errors
Due to a processing error, the profile of PDB structure 5fia A / B (sequence md5 052667679fc644184f40063c7602c9e1) is incomplete in the pdb_full hhblits database. This led to further errors in generating sequence-based alignments for 1vtm P (sequence md5 c844aff103449363cb8489c78c58ebf1) and 434t A / B (sequence md5 d67aa1c3a36492c719cb48b5e7ecc624).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The PSSH2 data set
PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.
This dataset contains a subset of the usual PSSH2 database, including only the proteins relevant to visualise Sars-CoV-2 structures.
It contains Swissprot and PDB data used for generating PSSH2 along with the PSSH2 data itself. This consists of the sequence-to-structure alignments used in Aquaria (aquaria.ws) and also for the Covid19 resource of Aquaria (http://aquaria.ws/covid).
Calculating PSSH2
The bulk of the Swiss-Prot and PDB data was downloaded in October 2020, but incremental updates, especially those related to Covid19, were added until April 2021.
Generating PSSH2: We used Uniclust30 from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30% (http://gwdu111.gwdg.de/~compbiol/uniclust/2020_03/UniRef30_2020_03_hhsuite.tar.gz). The HHblits code and the code for running the calculations were retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe until April 2021.
ASDB was established with the intention of assembling, in a central, publicly accessible site, information about alternatively spliced genes, their products and expression patterns. Version 2.1 of ASDB consists of two divisions: ASDB(proteins), which contains amino acid sequences, and ASDB(nucleotides), which contains genomic sequences. SWISS-PROT uses two formats for the description of alternative splicing. Thus the protein sequences were selected from SWISS-PROT using a full text search for both the words alternative splicing (usually in the CC lines) and varsplic (in the FT lines). In order to group proteins that could arise by alternative splicing of the same gene, we developed a clustering procedure. Two proteins were linked if they had a common fragment of at least 20 amino acids, and clusters were initially defined as maximum connected groups of linked proteins. It turned out that some clusters were chimeric, in the sense that they contained members of multi-gene families rather than alternatively spliced variants of one gene. Therefore the multiple alignments were subjected to additional analysis aimed at the detection of chimeric clusters. Each cluster is represented by a multiple alignment of its members constructed using CLUSTALW. The distribution of cluster size, representation of species and other relevant statistics of ASDB(proteins) can be accessed through the links below. This processing covers the cases when alternatively spliced variants are described in separate SWISS-PROT entries. The other kinds of ASDB records, originating from the SWISS-PROT entries with the varsplic field in the feature table, usually describe proteins that are not part of any cluster. In these cases, the information on the variable fragments of the several proteins which result from the alternative splicing of a single gene is contained in the entry itself.
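The initial clustering step described above (link two proteins when they share a fragment of at least 20 amino acids, then take maximal connected groups) can be sketched as follows. This is an illustrative reconstruction, not the ASDB implementation: it assumes that "common fragment" means an identical contiguous substring, uses a 20-mer index with union-find to find connected groups, and the function name and toy sequences are hypothetical. The later chimeric-cluster analysis is not covered.

```python
from collections import defaultdict


def cluster_by_shared_fragment(sequences: dict[str, str],
                               min_len: int = 20) -> list[set[str]]:
    """Group proteins into connected components of the 'shares a fragment' graph.

    Two sequences share an identical fragment of length >= min_len exactly
    when they share at least one substring of length min_len, so indexing
    fixed-length substrings is sufficient.
    """
    parent = {name: name for name in sequences}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_owner: dict[str, str] = {}  # fragment -> first protein containing it
    for name, seq in sequences.items():
        for start in range(len(seq) - min_len + 1):
            fragment = seq[start:start + min_len]
            owner = first_owner.setdefault(fragment, name)
            if owner != name:
                parent[find(name)] = find(owner)  # link the two proteins

    clusters: dict[str, set[str]] = defaultdict(set)
    for name in sequences:
        clusters[find(name)].add(name)
    return list(clusters.values())


# Toy sequences (assumed): A and B share one 20-mer, B and C share another.
core1 = "ACDEFGHIKLMNPQRSTVWY"
core2 = "YWVTSRQPNMLKIHGFEDCA"
toy = {
    "A": "MMMM" + core1 + "AAAA",
    "B": core1 + core2,
    "C": core2 + "KKKK",
    "D": "P" * 25,
}
print(sorted(sorted(cluster) for cluster in cluster_by_shared_fragment(toy)))
```

In the toy data A and C have no fragment in common, yet they land in the same cluster through B. This transitivity is what allows members of a multi-gene family to be merged into one group, which is the chimeric case the text says required additional analysis.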
ASDB(proteins) entries are marked with different symbols to allow for easy differentiation among the three types: those proteins which are part of the ASDB clusters and the corresponding multialignments, those which have the information on different variants in the associated SWISS-PROT entries, and those for which the information on the variants is not available at the present time. ASDB contains internal links between entries and/or clusters, as well as external links to Medline, GenBank and SWISS-PROT entries. The ASDB(nucleotides) division was generated by collecting all GenBank entries containing the words alternative splicing and further selection of those entries that contain complete gene sequences (all CDS fields are complete, i.e. they do not have continuation signs). Sponsors: This work was supported by the Director, Office of Energy Research, Office of Biological and Environmental Research, of the US Department of Energy under Contract No. DE-ACO3-76SF00098. Additional support came from grants from the Russian Fund of Basic Research (99-04-48347), the Russian State Scientific Program Human Genome (65/99), and the Merck Genome Research Institute (244).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The large-scale identification and quantitation of proteins via nanoliquid chromatography (LC)-tandem mass spectrometry (MS/MS) offers a unique opportunity to gain unprecedented insight into the microbial composition and biomolecular activity of true environmental samples. However, in order to realize this potential for marine biofilms, new methods of protein extraction must be developed as many compounds naturally present in biofilms are known to interfere with common proteomic manipulations and LC-MS/MS techniques. In this study, we used amino acid analyses (AAA) and LC-MS/MS to compare the efficacy of three sample preparation methods [6 M guanidine hydrochloride (GuHCl) protein extraction + in-solution digestion + 2D LC; sodium dodecyl sulfate (SDS) protein extraction + 1D gel LC; phenol protein extraction + 1D gel LC] for the metaproteomic analyses of an environmental marine biofilm. The AAA demonstrated that proteins constitute 1.24% of the biofilm wet weight and that the compared methods varied in their protein extraction efficiencies (0.85–15.15%). Subsequent LC-MS/MS analyses revealed that the GuHCl method resulted in the greatest number of proteins identified by one or more peptides whereas the phenol method provided the greatest sequence coverage of identified proteins. As expected, metagenomic sequencing of the same biofilm sample enabled the creation of a searchable database that increased the number of protein identifications by 48.7% (≥1 peptide) or 54.7% (≥2 peptides) when compared to SwissProt database identifications. Taken together, our results provide methods and evidence-based recommendations to consider for qualitative or quantitative biofilm metaproteome experimental design.
Central repository for collection of functional information on proteins, with accurate and consistent annotation. In addition to capturing core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and experimental and computational data. The UniProt Knowledgebase consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot (reviewed) is a high quality manually annotated and non-redundant protein sequence database which brings together experimental results, computed features, and scientific conclusions. UniProtKB/TrEMBL (unreviewed) contains protein sequences associated with computationally generated annotation and large-scale functional characterization that await full manual annotation. Users may browse by taxonomy, keyword, gene ontology, enzyme class or pathway.
Database of mitochondrial and human nuclear encoded proteins involved in mitochondrial biogenesis and function. This database consolidates information from SwissProt, LocusLink, Protein Data Bank (PDB), GenBank, Genome Database (GDB), Online Mendelian Inheritance in Man (OMIM), Human Mitochondrial Genome Database (mtDB), MITOMAP, Neuromuscular Disease Center and Human 2-D PAGE Databases. The mitochondrion plays a central role in cellular metabolism, and evidence of mitochondrial involvement in a number of different human diseases is increasing. This database is intended as a tool to aid in studying not only the mitochondrion but also the associated diseases.
Mitochondrial DNA Sequence: A graphical tool was developed to visualize the human mitochondrial DNA sequences and highlight coding regions for RNAs and proteins. Disease susceptible mutations are also noted in the sequence.
Mitochondrial DNA Polymorphism: Human mitochondrial sequences of different ethnic groups were obtained from the Human Mitochondrial Genome Database. A DNA sequence analysis tool was developed to compare polymorphisms of different human mitochondrial DNA sequences. This tool allows the user to select mitochondrial sequences from any two human populations and compare them for sequence variations.
Mitochondrial proteins related diseases: Malfunction of mitochondrial proteins affects many cells of the brain, heart, liver, skeletal muscles, kidney, and the endocrine and respiratory systems, which leads to many diseases. Relevant information on mitochondrial related diseases from OMIM, the Neuromuscular Disease Center and MITOMAP is gathered, and mitochondrion-associated diseases are grouped, categorized, and linked to OMIM.
3-D Structures of Mitochondrial proteins: The available 3D structures for mitochondrial proteins are presented through a custom-made interface.
A concise HTML page is generated for reporting the structural details and the associated information obtained from relevant web sites (PDBREPORT, Interatomic Contacts of Structural Units (CSU), PROCHECK, Ligand Protein Contacts (LPC), PROMOTIF and CastP). References are linked to the PubMed site. The 3-D structures are presented through the use of a Kinemage.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The description that follows is from the official UniProt embeddings page, which is also the original host of this dataset.
Protein embeddings are a way to encode functional and structural properties of a protein, mostly from its sequence only, in a machine-friendly format (vector representation). Generating such embeddings is computationally expensive, but once computed they can be leveraged for different tasks, such as sequence similarity search, sequence clustering, and sequence classification.
UniProt provided raw embeddings (mean pooled, per-protein using the ProtT5 model) for UniProtKB/Swiss-Prot.
Note: Protein sequences longer than 12k residues are excluded due to limitation of GPU memory (this concerns only a handful of proteins).
Sample code
The embeddings.h5 files store the embeddings as key-value pairs. The key is the protein accession number and the value is the embeddings vector. The following code snippet shows how to read and iterate over an embeddings file in Python.
import numpy as np
import h5py

with h5py.File("path/to/embeddings.h5", "r") as file:
    print(f"number of entries: {len(file.items())}")
    for sequence_id, embedding in file.items():
        print(
            f" id: {sequence_id}, "
            f" embeddings shape: {embedding.shape}, "
            f" embeddings mean: {np.array(embedding).mean()}"
        )
Sample output (SARS-CoV-2 embeddings from release 2022_04) per-protein file:
number of entries: 17
id: A0A663DJA2, embeddings shape: (1024,), embeddings mean: 0.0006136894226074219
id: P0DTC1, embeddings shape: (1024,), embeddings mean: 0.0011968612670898438
id: P0DTC2, embeddings shape: (1024,), embeddings mean: 0.001041412353515625
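As one example of the downstream tasks mentioned above, per-protein embeddings can be compared by cosine similarity to rank proteins against a query. The sketch below is an illustration only: it uses the standard library so that it runs without the embeddings file, and the identifiers TOY_A, TOY_B and TOY_C and their random 1024-dimensional vectors are made-up stand-ins, not real UniProt accessions or ProtT5 embeddings.

```python
import math
import random


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_id: str,
                       embeddings: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return all other entries sorted by decreasing similarity to the query."""
    query = embeddings[query_id]
    scores = [(other_id, cosine_similarity(query, vector))
              for other_id, vector in embeddings.items()
              if other_id != query_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Synthetic stand-in data (assumed): TOY_B is TOY_A plus small noise,
# TOY_C is an unrelated random vector.
rng = random.Random(0)
base = [rng.gauss(0, 1) for _ in range(1024)]
toy_embeddings = {
    "TOY_A": base,
    "TOY_B": [x + 0.01 * rng.gauss(0, 1) for x in base],
    "TOY_C": [rng.gauss(0, 1) for _ in range(1024)],
}
ranking = rank_by_similarity("TOY_A", toy_embeddings)
print(ranking[0][0])  # prints TOY_B, the near-duplicate of the query
```

With a real file, the dictionary could be filled from the key-value pairs read in the sample code above, converting each dataset to a list or array first.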
SOURCE:
https://www.uniprot.org/help/embeddings
https://www.uniprot.org/help/downloads#embeddings
Reviewed (Swiss-Prot) - per-protein: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/embeddings/uniprot_sprot/per-protein.h5
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of human protein variations collected from the UniProt/Swiss-Prot database.