The UniProtKB/Swiss-Prot database contains protein sequence information.
This dataset is a selection of Therapeutic Target Database (release 4.3.02, 18 Oct 2013) protein IDs for successful targets. The web page states 388 targets, but these reduced to 345 human Swiss-Prot accessions.
The UniProt reviewed proteins database, uploaded with all columns for easier use in Kaggle notebooks. Every column has a description; if you have any questions, check UniProt Help, where every column has a full explanation.
For UniProt Species Proteomes check this dataset.
License: Creative Commons Attribution 4.0 International (CC BY 4.0) License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
All unigenes of Portunus sanguinolentus with hits in the Swiss-Prot database.
This dataset is supplementary data from "Novelty in the target landscape of the pharmaceutical industry" (2013). The listing of proven drug targets was converted to 248 human Swiss-Prot accessions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Curated component of UniProtKB (produced by the UniProt consortium). It contains hundreds of thousands of protein descriptions, including function, domain structure, subcellular location, post-translational modifications and functionally characterized variants.
The Peptide Sequence Database contains putative peptide sequences from human, mouse, rat, and zebrafish. Compressed to eliminate redundancy, these are about 40-fold smaller than a brute-force enumeration. Current and old releases are available for download. Each species' peptide sequence database comprises peptide sequence data from relevant species-specific UniGene and IPI clusters, plus all sequences from their constituent EST, mRNA and protein sequence databases, namely RefSeq proteins and mRNAs, UniProt's SwissProt and TrEMBL, GenBank mRNA, ESTs, and high-throughput cDNAs, HInv-DB, VEGA, EMBL, IPI protein sequences, plus the enumeration of all combinations of UniProt sequence variants, Met loss PTM, and signal peptide cleavages. The README file contains some information about the non-amino-acid symbols O (digest site corresponding to a protein N- or C-terminus) and J (no digest sequence join) used in these peptide sequence databases, and about how to configure various search engines to use them. Some search engines handle (very) long sequences badly and in some cases must be patched to use these peptide sequence databases. All search engines supported by the PepArML meta-search engine can (or can be patched to) successfully search these peptide sequence databases.
Databases of protein sequences and 3D structures of proteins. Collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.
ASDB has been established with the intention of assembling, in a central, publicly accessible site, information about alternatively spliced genes, their products and expression patterns. Version 2.1 of ASDB consists of two divisions: ASDB(proteins), which contains amino acid sequences, and ASDB(nucleotides), with genomic sequences. SWISS-PROT uses two formats for the description of alternative splicing. Thus, the protein sequences were selected from SWISS-PROT using a full-text search for both the words alternative splicing (usually in the CC lines) and varsplic (in the FT lines). In order to group proteins that could arise by alternative splicing of the same gene, we developed a clustering procedure. Two proteins were linked if they had a common fragment of at least 20 amino acids, and clusters were initially defined as maximal connected groups of linked proteins. It turned out that some clusters were chimeric, in the sense that they contained members of multi-gene families rather than alternatively spliced variants of one gene. Therefore, the multiple alignments were subjected to additional analysis aimed at detecting chimeric clusters. Each cluster is represented by a multiple alignment of its members constructed using CLUSTALW. The distribution of cluster size, representation of species and other relevant statistics of ASDB(proteins) can be accessed through the links below. This processing covers the cases in which alternatively spliced variants are described in separate SWISS-PROT entries. The other kinds of ASDB records, originating from the SWISS-PROT entries with the varsplic field in the feature table, usually describe proteins that are not part of any cluster. In these cases, the information on the variable fragments of the several proteins that result from the alternative splicing of a single gene is contained in the entry itself.
ASDB(proteins) entries are marked with different symbols to allow for easy differentiation among the three types: those proteins which are part of the ASDB clusters and the corresponding multialignments, those which have the information on different variants in the associated SWISS-PROT entries, and those for which the information on the variants is not available at the present time. ASDB contains internal links between entries and/or clusters, as well as external links to Medline, GenBank and SWISS-PROT entries. The ASDB(nucleotides) division was generated by collecting all GenBank entries containing the words alternative splicing and further selection of those entries that contain complete gene sequences (all CDS fields are complete, i.e. they do not have continuation signs). Sponsors: This work was supported by the Director, Office of Energy Research, Office of Biological and Environmental Research, of the US Department of Energy under Contract No. DE-ACO3-76SF00098. Additional support came from grants from the Russian Fund of Basic Research (99-04-48347), the Russian State Scientific Program Human Genome (65/99), and the Merck Genome Research Institute (244).
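The linking rule described for ASDB(proteins) above (two proteins are linked when they share a fragment of at least 20 amino acids, and clusters are the connected groups of linked proteins) can be sketched roughly as follows. This is only an illustration, not the ASDB code: it assumes that a "common fragment" means an exactly identical substring, and the function name `cluster_by_shared_fragment` and the union-find approach are choices made here for the example.

```python
from collections import defaultdict


def cluster_by_shared_fragment(seqs, k=20):
    """Group sequences that share at least one identical k-residue fragment.

    seqs maps a (hypothetical) entry name to its amino acid sequence.
    Assumption: "common fragment" is read as an exact shared substring.
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_owner = {}  # k-mer -> first sequence in which it was seen
    for name, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            owner = first_owner.setdefault(kmer, name)
            if owner != name:
                union(owner, name)  # shared fragment: link the two proteins

    clusters = defaultdict(set)
    for name in seqs:
        clusters[find(name)].add(name)
    # Connected components, returned in a deterministic order.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

Sharing any identical fragment of 20 or more residues implies sharing an identical 20-mer, so indexing 20-mers is enough under that assumption. The sketch does not attempt the later step of detecting chimeric clusters.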
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 6: Sequences of the 1,823 oak BESs with a match in the Swissprot database (release 2010-04). (FASTA 1 MB)
Shotgun and positional proteomics study of a mouse embryonic stem cell line. We devised a proteogenomic approach constructing a custom protein sequence search space, built from both SwissProt and RIBO-seq derived translation products, applicable for LC-MSMS spectrum identification. To record the impact of using the constructed deep proteome database we performed two alternative MS-based proteomic strategies: (I) a regular shotgun proteomic and (II) an N-terminal COFRADIC approach. The obtained fragmentation spectra were searched against the custom database (combination of UniProtKB-SwissProt and RIBO-seq derived translation sequences) using three different search engines: OMSSA (version 2.1.9), X!Tandem (TORNADO, version 2010.01.01.04) and Mascot (version 2.3). The first two were run from the SearchGUI graphical user interface (version 1.10.4). A combination of X!Tandem and Mascot was used for the N-terminal COFRADIC analysis, and a combination of all three search engines for the shotgun proteome analysis. Note that OMSSA cannot cope with the protease setting semi-ArgC/P needed to analyze N-terminal COFRADIC data.
For the shotgun proteome data, trypsin was set as cleavage enzyme allowing for one missed cleavage. Singly to triply charged precursors were taken into account for the Mascot search engine and singly to quadruply charged precursors for the X!Tandem/OMSSA search engines, and the precursor and fragment mass tolerances were set to 10 ppm and 0.5 Da, respectively. Methionine oxidation to methionine-sulfoxide, pyroglutamate formation of N-terminal glutamine and acetylation (protein N-terminus) were set as variable modifications. For the N-terminal COFRADIC analysis the protease setting semi-ArgC/P (Arg-C specificity with arginine-proline cleavage allowed) was used. No missed cleavages were allowed and the precursor and fragment mass tolerances were also set to 10 ppm and 0.5 Da, respectively.
Carbamidomethylation of cysteine, methionine oxidation to methionine-sulfoxide and 13C3D2-acetylation of lysines were set as fixed modifications. Peptide N-terminal acetylation or 13C3D2-acetylation and pyroglutamate formation of N-terminal glutamine were set as variable modifications, and the instrument setting was put on ESI-TRAP. Protein and peptide identification, in addition to data interpretation, was done using the PeptideShaker algorithm (http://code.google.com/p/peptide-shaker, version 0.18.3), setting the false discovery rate to 1% at all levels (protein, peptide, and peptide to spectrum matching). The aforementioned tools and algorithms (SearchGUI, X!Tandem, OMSSA, and PeptideShaker) are freely available as open source.
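To make the precursor tolerance above concrete: a ppm tolerance is relative to the mass being matched, whereas the 0.5 Da fragment tolerance is absolute. The short sketch below shows that arithmetic only. The helper names `ppm_window` and `within_ppm` are invented for this illustration and are not taken from any of the search engines named above.

```python
def ppm_window(mz, ppm):
    """Return the (low, high) bounds around mz for a relative tolerance in ppm."""
    delta = mz * ppm / 1_000_000  # parts per million of the reference value
    return mz - delta, mz + delta


def within_ppm(observed, theoretical, ppm=10.0):
    """True if the observed value lies within +/- ppm of the theoretical value."""
    return abs(observed - theoretical) <= theoretical * ppm / 1_000_000
```

At a value of 1000, for example, 10 ppm corresponds to a window of plus or minus 0.01, far narrower than the absolute 0.5 Da allowed for fragments.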
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Protein sequence and structure data
This data set contains data from UniProt (in the files called protein_sequence, protein_synonyms, protein_names, organism_synonyms) and PDB (in the files called PDB and PDB_chain) as used by the Aquaria web resource at the time of download (2022-02-08).
The PSSH2 data set
PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.
Calculating PSSH2
The Swiss-Prot and PDB data were downloaded in November 2021. Generating PSSH2: We used UniRef30_2021_03 (originally called UniRef30_2021_06) from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30%. The HHblits code and the code for running the calculations were retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git, respectively) at the respective time of calculation, in the timeframe until December 2021.
PDB based sequence-to-structure alignments
In addition to the PSSH2 data, new PDB structures were retrieved based on the primary accession of the proteins, by querying for all chains in all PDB entries with exact matches using the sequence cross references records given in PDB. Sequence-to-structure alignments were then created, again based on information provided in each PDB entry. These are contained in the PDBchain data.
This data covers sequences and PDB structures in the timeframe until February 2022.
Evaluating PSSH2
The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits:
The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/)
The set of all expected hits comprises all PDB structures containing a domain with the same CATH code, counted either if contained in the set of processed sequences (-> all) or only if also contained in the set of non-redundant sequences (-> nr40).
The set of true positives is defined by sharing the same CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").
The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate TPR) by cumulatively considering all hits with an E-value below the threshold ("C") or in bins with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in November 2021 (202111) as well as previous data from October 2020 (202010), February 2020 (202002) and September 2017 (201709). The results are collected in PSSH CATH validation.csv.
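The two evaluation modes described above can be sketched as follows. This is an assumption-laden illustration rather than the PSSH2 evaluation code: the function name `fdr_and_recall`, the input layout (a list of `(evalue, is_true_hit)` pairs plus a count of expected hits), and the exact handling of the bin boundaries are all choices made here.

```python
def fdr_and_recall(hits, threshold, n_expected, mode="C"):
    """Compute (FDR, TPR) for hits selected by E-value.

    hits        -- iterable of (evalue, is_true_hit) pairs
    threshold   -- E-value threshold
    n_expected  -- number of expected (true) hits, used as the recall denominator
    mode        -- "C": cumulative, all hits with E-value below the threshold
                   "B": binned, hits with threshold/10 <= E-value < threshold
                        (which boundary is inclusive is an assumption)
    """
    if mode == "C":
        selected = [ok for e, ok in hits if e < threshold]
    elif mode == "B":
        selected = [ok for e, ok in hits if threshold / 10 <= e < threshold]
    else:
        raise ValueError("mode must be 'C' or 'B'")

    tp = sum(1 for ok in selected if ok)
    fp = len(selected) - tp
    fdr = fp / len(selected) if selected else 0.0  # false hits among reported hits
    tpr = tp / n_expected if n_expected else 0.0   # true hits among expected hits
    return fdr, tpr
```

Whether a hit counts as true would, per the text, depend on agreement of the CATH code at the homology ("CATH") or topology ("CAT") level; that decision is left to the caller here.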
Known errors
Due to a processing error, the profile of PDB structure 5fia A / B (sequence md5 052667679fc644184f40063c7602c9e1) is incomplete in the pdb_full HHblits database, which led to further errors in generating sequence-based alignments for 1vtm P (sequence md5 c844aff103449363cb8489c78c58ebf1) and 434t A / B (sequence md5 d67aa1c3a36492c719cb48b5e7ecc624).
This dataset is supplementary data from "Analysis of in vitro bioactivity data extracted from drug discovery literature and patents: Ranking 1654 human protein targets by assayed compounds and molecular scaffolds" (2011). In this case the Entrez Gene IDs were mapped to 1651 human Swiss-Prot accessions, which include both approved and research targets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We assembled a dual-layered biological network to study the roles of resistance gene analogs (RGAs) in the resistance of sugarcane to infection by the biotrophic fungus causing smut disease. Based on sugarcane-Arabidopsis orthology, the modeling used metabolic and protein-protein interaction (PPI) data from Arabidopsis thaliana (from Kyoto Encyclopedia of Genes and Genomes (KEGG) and BioGRID databases) and plant resistance curated knowledge for Viridiplantae obtained through text mining of the UniProt/SwissProt database. With the network, we integrated functional annotations and transcriptome data from two sugarcane genotypes that differ significantly in resistance to smut and applied a series of analyses to compare the transcriptomes and understand both signal perception and transduction in plant resistance. We show that the smut-resistant sugarcane has a larger arsenal of RGAs encompassing transcriptionally modulated subnetworks with other resistance elements, reaching hub proteins of primary metabolism. This approach may benefit molecular breeders in search of markers associated with quantitative resistance to diseases in non-model systems.
Dataset Description
Dataset Summary
This dataset is a mirror of the UniProt/Swiss-Prot database. It contains the names and sequences of >500K proteins. This dataset was parsed from the FASTA file at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz.
Supported Tasks and Leaderboards: None
Languages: English
Dataset Structure
Data Instances
Data Fields: id, description, sequence Data… See the full description on the dataset page: https://huggingface.co/datasets/damlab/uniprot.
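As an illustration of how a FASTA file could be turned into records with the three fields listed above, here is a minimal sketch. It is not the code used to build this dataset: the split of the header into `id` (first whitespace-delimited token) and `description` (the remainder) is an assumption, and the function names are invented for the example.

```python
import gzip


def parse_fasta(handle):
    """Yield dicts with id, description and sequence from a FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []  # start a new record
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap over several lines
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # Assumption: id is the first token of the header, description is the rest.
    rec_id, _, description = header.partition(" ")
    return {"id": rec_id, "description": description, "sequence": "".join(chunks)}


def iter_gzipped_fasta(path):
    """Stream records from a gzip-compressed FASTA file at a local path."""
    with gzip.open(path, "rt") as handle:
        yield from parse_fasta(handle)
```

With a local copy of the file, `for rec in iter_gzipped_fasta("uniprot_sprot.fasta.gz"): ...` would stream the records one at a time without loading the whole file into memory.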
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for UniProtKB/Swiss-Prot
Dataset Summary
[More Information Needed]
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]
Dataset Creation
Curation Rationale
[More Information Needed]
Source… See the full description on the dataset page: https://huggingface.co/datasets/zgcarvalho/swiss-prot-test.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of human protein variations collected from the UniProt/Swiss-Prot database.