100+ datasets found

The Universal Protein Resource (UniProt)
data.virginia.gov
healthdata.gov
Updated Jul 26, 2023
Cite
National Institutes of Health (NIH) (2023). The Universal Protein Resource (UniProt) [Dataset]. https://data.virginia.gov/dataset/the-universal-protein-resource-uniprot
Explore at:
Dataset updated
Jul 26, 2023
Dataset provided by
National Institutes of Health (NIH)
Description
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc).
uniprot
huggingface.co
Updated Apr 9, 2022
Cite
Will Dampier (2022). uniprot [Dataset]. https://huggingface.co/datasets/damlab/uniprot
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 9, 2022
Authors
Will Dampier
Description
Dataset Description

Dataset Summary

This dataset is a mirror of the UniProt/Swiss-Prot database. It contains the names and sequences of more than 500K proteins. It was parsed from the FASTA file at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz.
Supported Tasks and Leaderboards: None
Languages: English

Dataset Structure Data Instances

Data Fields: id, description, sequence Data… See the full description on the dataset page: https://huggingface.co/datasets/damlab/uniprot.
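The three fields listed above can be illustrated with a minimal FASTA parser. This is a sketch under assumptions, not the dataset author's actual parsing code: the split of the header into `id` (first token) and `description` (the rest) is a guess, and the two records in `SAMPLE` are made up for illustration.

```python
from typing import Dict, Iterator, List

# Made-up records in the UniProtKB FASTA header style (not real entries).
SAMPLE = """>sp|XXXXX1|DEMO1_HUMAN Hypothetical protein one OS=Homo sapiens OX=9606 PE=1 SV=1
MKTAYIAK
QRQISFVK
>sp|XXXXX2|DEMO2_HUMAN Hypothetical protein two OS=Homo sapiens OX=9606 PE=1 SV=1
MSEQ
"""


def _make_record(header: str, chunks: List[str]) -> Dict[str, str]:
    # Assumption: "id" is the first whitespace-delimited token of the header,
    # "description" is everything after it.
    ident, _, description = header.partition(" ")
    return {"id": ident, "description": description, "sequence": "".join(chunks)}


def parse_fasta(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per FASTA record, joining wrapped sequence lines."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


records = list(parse_fasta(SAMPLE))
for rec in records:
    print(rec["id"], len(rec["sequence"]))
```

For the real file, the same function could be fed the decompressed text of `uniprot_sprot.fasta.gz` (for example via `gzip.open(path, "rt")`).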
UniProt
scicrunch.org
dknet.org
Updated Jan 29, 2022
Cite
(2022). UniProt [Dataset]. http://identifiers.org/RRID:SCR_002380
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_002380
Dataset updated
Jan 29, 2022
Description
Collection of data of protein sequence and functional information. Resource for protein sequence and annotation data. Consortium for preservation of the UniProt databases: UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), and UniProt Archive (UniParc), UniProt Proteomes. Collaboration between European Bioinformatics Institute (EMBL-EBI), SIB Swiss Institute of Bioinformatics and Protein Information Resource. Swiss-Prot is a curated subset of UniProtKB.
uniprot-database_(type_ko).27.09.2019.tab.rar
figshare.com
application/x-rar
Updated Jun 24, 2020
Cite
Daniel Kumazawa Morais (2020). uniprot-database_(type_ko).27.09.2019.tab.rar [Dataset]. http://doi.org/10.6084/m9.figshare.12555422.v1
Explore at:
Available download formats: application/x-rar
Unique identifier
https://doi.org/10.6084/m9.figshare.12555422.v1
Dataset updated
Jun 24, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Daniel Kumazawa Morais
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The current database was downloaded on 27.09.2019 and has the data fields (columns) described below:
# 1 Entry
# 2 Entry name
# 3 Status
# 4 Protein names
# 5 Gene names
# 6 Organism
# 7 Length
# 8 Cross-reference (KO)
# 9 Taxonomic lineage (PHYLUM)
# 10 Taxonomic lineage (SPECIES) # This field carries current and old* taxonomic classifications.
# 11 Taxonomic lineage (GENUS)
# 12 Taxonomic lineage (KINGDOM)
# 13 Taxonomic lineage (SUPERKINGDOM)
# 14 Cross-reference (OrthoDB)
# 15 Cross-reference (eggNOG)
*Details about the classification used in UniProt can be found at the link: https://www.uniprot.org/help/taxonomy
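The column layout described above can be read with Python's standard `csv` module once the `.rar` archive has been extracted (extraction is not shown and needs an external tool). This is a minimal sketch: the presence of a header row, the trailing `;` in the KO field, and the single data row in `SAMPLE` are assumptions made for illustration, not facts taken from the file.

```python
import csv
import io

# Column names in the order given by the dataset description.
COLUMNS = [
    "Entry", "Entry name", "Status", "Protein names", "Gene names",
    "Organism", "Length", "Cross-reference (KO)",
    "Taxonomic lineage (PHYLUM)", "Taxonomic lineage (SPECIES)",
    "Taxonomic lineage (GENUS)", "Taxonomic lineage (KINGDOM)",
    "Taxonomic lineage (SUPERKINGDOM)", "Cross-reference (OrthoDB)",
    "Cross-reference (eggNOG)",
]

# Made-up row for illustration; every value here is hypothetical.
SAMPLE = "\t".join(COLUMNS) + "\n" + "\t".join([
    "XXXXX1", "DEMO_ORG", "unreviewed", "Hypothetical protein", "demoA",
    "Demo organism", "120", "K00000;", "Demo phylum", "Demo species",
    "Demo genus", "", "Bacteria", "", "",
]) + "\n"


def read_rows(handle, has_header=True):
    """Yield one dict per row; rows with an unexpected column count are skipped."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # assumption: the first line holds column titles
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue
        row = dict(zip(COLUMNS, fields))
        row["Length"] = int(row["Length"])
        yield row


def ko_ids(row):
    """Split the KO cross-reference field on ';' and drop empty parts."""
    return [k for k in row["Cross-reference (KO)"].split(";") if k.strip()]


rows = list(read_rows(io.StringIO(SAMPLE)))
print(rows[0]["Entry"], rows[0]["Length"], ko_ids(rows[0]))
```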
Repository URL
cinergi.sdsc.edu
resource url
Cite
Repository URL [Dataset]. http://cinergi.sdsc.edu/geoportal/rest/metadata/item/323ebc5365ec476ebdcb92329cf10b57/html
Explore at:
Available download formats: resource url
Description
Link Function: information
Prediction and Visualization of Human Transmembrane Proteins using AlphaFold...
data.niaid.nih.gov
Updated Jul 16, 2024
Cite
Heinzinger, Michael (2024). Prediction and Visualization of Human Transmembrane Proteins using AlphaFold and Protein Language Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6816082
Explore at:
Dataset updated
Jul 16, 2024
Dataset provided by
Grekova, Anastasia
Marquet, Céline
Houri, Leen
Heinzinger, Michael
Rost, Burkhard
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description: TMvis ("TMvis496.tar.gz") is a dataset containing 496 3D structures of predicted human transmembrane proteins (TMPs) and their predicted membrane embedding. The method TMbed [1], based on the protein language model ProtT5 [2], predicted 4,967 TMPs for the human proteome (20,375 proteins, UniProt [3] version April 2022; excluding TITIN_HUMAN due to length). For these proteins, we obtained AlphaFold [4] structures from AlphaFoldDB [5] with an average per-residue confidence score (pLDDT) of more than 90. This resulted in the 496 proteins of TMvis, which are listed in "TMvis496.fasta". The membrane embedding was predicted using the methods ANVIL [6], PPM3 [7], and per-residue TMbed predictions. As the three methods are based on different approaches, we decided to publish results for all of them. The figure “TMvis_project_overview.png” provides a graphical overview of each step described above.

TMvis Folder Structure: TMvis is separated into “alpha”, containing predicted alpha-helical TMPs, and “beta”, containing predicted beta-barrel TMPs. Within these folders, each protein is assigned one folder, identifiable by the respective unique UniProt ID. Each protein folder consists of:
- “UniprotID.fasta” with UniProt ID, sequence, TMbed per-residue prediction
- “AF-UniprotID-F1-model_v2.pdb” with the AlphaFold structure
- “AF-UniprotID-F1-model_v2.cif” with the AlphaFold structure
- “AF-UniprotID-F1-model_v2_ANVIL.pdb” with predicted ANVIL membrane embedding
- “AF-UniprotID-F1-model_v2_ppm.pdb” with predicted PPM3 membrane embedding

TMvis
├── alpha
│   ├── A0A087X1C5
│   │   ├── A0A087X1C5.fasta
│   │   ├── AF-A0A087X1C5-F1-model_v2.pdb
│   │   ├── AF-A0A087X1C5-F1-model_v2.cif
│   │   ├── AF-A0A087X1C5-F1-model_v2_ANVIL.pdb
│   │   └── AF-A0A087X1C5-F1-model_v2_ppm.PDB
│   └── ...
└── beta
    └── P45880
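The folder layout above maps a UniProt ID to a fixed set of file names, which can be sketched as a small path helper. The function name `tmvis_files` and the dictionary keys are hypothetical. The description shows the PPM3 file both as `_ppm.pdb` and as `_ppm.PDB`; this sketch assumes the lowercase form, which should be checked against the extracted archive on case-sensitive file systems.

```python
from pathlib import Path


def tmvis_files(root, kind, uniprot_id):
    """Return the expected per-protein file paths for one TMvis entry.

    kind is "alpha" (alpha-helical TMPs) or "beta" (beta-barrel TMPs).
    """
    if kind not in ("alpha", "beta"):
        raise ValueError("kind must be 'alpha' or 'beta'")
    folder = Path(root) / kind / uniprot_id
    stem = f"AF-{uniprot_id}-F1-model_v2"
    return {
        "fasta": folder / f"{uniprot_id}.fasta",      # sequence + TMbed prediction
        "alphafold_pdb": folder / f"{stem}.pdb",      # AlphaFold structure (PDB)
        "alphafold_cif": folder / f"{stem}.cif",      # AlphaFold structure (mmCIF)
        "anvil_pdb": folder / f"{stem}_ANVIL.pdb",    # ANVIL membrane embedding
        "ppm_pdb": folder / f"{stem}_ppm.pdb",        # PPM3 embedding (case assumed)
    }


files = tmvis_files("TMvis", "alpha", "A0A087X1C5")
for label, path in files.items():
    print(label, path)
```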

TMvis visualization: The 3D visualization of every protein in the TMvis dataset can be accessed using the Jupyter Notebook “TMvis.ipynb”. It contains detailed descriptions of the different membrane prediction tools ANVIL, PPM3, and TMbed, as well as the respective code. Additionally, it allows visualizing the per-residue confidence scores (pLDDT) of AlphaFold.

——————————————————————————————————————————————————————————————————————————

References:

[1] TMbed - Bernhofer, Michael, and Burkhard Rost. 2022. “TMbed – Transmembrane Proteins Predicted through Language Model Embeddings.” bioRxiv.

[2] ProtT5 - Elnaggar, A., et al. “ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing.” IEEE Transactions on Pattern Analysis and Machine Intelligence. doi: 10.1109/TPAMI.2021.3095381.

[3] UniProt - UniProt Consortium. 2021. “UniProt: The Universal Protein Knowledgebase in 2021.” Nucleic Acids Research 49 (D1): D480–D489.

[4] AlphaFold - Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89.

[5] AlphaFold DB - Varadi, Mihaly, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, et al. 2022. “AlphaFold Protein Structure Database: Massively Expanding the Structural Coverage of Protein-Sequence Space with High-Accuracy Models.” Nucleic Acids Research 50 (D1): D439–44.

[6] ANVIL - Postic, Guillaume, Yassine Ghouzam, Vincent Guiraud, and Jean-Christophe Gelly. 2016. “Membrane Positioning for High- and Low-Resolution Protein Structures through a Binary Classification Approach.” Protein Engineering, Design & Selection: PEDS 29 (3): 87–91.

[7] PPM3 - Lomize, Mikhail A., Irina D. Pogozheva, Hyeon Joo, Henry I. Mosberg, and Andrei L. Lomize. 2012. “OPM Database and PPM Web Server: Resources for Positioning of Proteins in Membranes.” Nucleic Acids Research 40 (Database issue): D370–76.

——————————————————————————————————————————————————————————————————————————

License:

This work is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).
UniProt Chordata protein annotation program
neuinfo.org
scicrunch.org
Updated Jul 12, 2013
Cite
(2013). UniProt Chordata protein annotation program [Dataset]. http://identifiers.org/RRID:SCR_007071
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_007071
Dataset updated
Jul 12, 2013
Description
Data set of manually annotated Chordata-specific proteins as well as those that are widely conserved. The program keeps existing human entries up-to-date and broadens the manual annotation to other vertebrate species, especially model organisms, including great apes, cow, mouse, rat, chicken, zebrafish, as well as Xenopus laevis and Xenopus tropicalis. A draft of the complete human proteome is available in UniProtKB/Swiss-Prot, and one of the current priorities of the Chordata protein annotation program is to improve the quality of the human sequences provided. To this aim, they are updating sequences which show discrepancies with those predicted from the genome sequence. Dubious isoforms, sequences based on experimental artifacts and protein products derived from erroneous gene model predictions are also revisited. This work is in part done in collaboration with the Hinxton Sequence Forum (HSF), which allows active exchange between UniProt, HAVANA, Ensembl and HGNC groups, as well as with the RefSeq database. UniProt is a member of the Consensus CDS project and they are in the process of reviewing their records to support convergence towards a standard set of protein annotation. They also continuously update human entries with functional annotation, including novel structural, post-translational modification, interaction and enzymatic activity data. In order to identify candidates for re-annotation, they use, among others, information extraction tools such as the STRING database. In addition, they regularly add new sequence variants and maintain disease information. Indeed, this annotation program includes the Variation Annotation Program, the goal of which is to annotate all known human genetic diseases and disease-linked protein variants, as well as neutral polymorphisms.
NCBIFAM
ebi.ac.uk
Updated Dec 16, 2024
Cite
(2024). NCBIFAM [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Dec 16, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NCBIfam is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. NCBIfam is maintained at the National Center for Biotechnology Information (Bethesda, MD). NCBIfam includes models from TIGRFAMs, another database of protein families developed at The Institute for Genomic Research, then at the J. Craig Venter Institute (Rockville, MD, US).
UniParc
neuinfo.org
dknet.org
Updated Aug 6, 2024
Cite
(2024). UniParc [Dataset]. http://identifiers.org/RRID:SCR_005818
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005818
Dataset updated
Aug 6, 2024
Description
Database that contains publicly available protein sequences with stable and unique identifiers (UPI) which are never removed, changed or reassigned. UniParc tracks sequence changes in the source databases and archives the history of all changes. Information other than protein sequence must be retrieved from the UniParc source databases using the database cross-references.
PROSITE profiles
ebi.ac.uk
Updated Feb 5, 2025
Cite
(2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Feb 5, 2025
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.
Functional annotation from protein sequence database of Uniprot, Myristica...
figshare.com
xml
Updated Nov 21, 2022
Cite
Deden Derajat Matra (2022). Functional annotation from protein sequence database of Uniprot, Myristica fatua [Dataset]. http://doi.org/10.6084/m9.figshare.21524646.v1
Explore at:
Available download formats: xml
Unique identifier
https://doi.org/10.6084/m9.figshare.21524646.v1
Dataset updated
Nov 21, 2022
Dataset provided by
figshare
Authors
Deden Derajat Matra
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Functional annotation from protein sequence database of Uniprot, Myristica fatua
UniRef
dknet.org
scicrunch.org
Updated Aug 9, 2024
Cite
(2024). UniRef [Dataset]. http://identifiers.org/RRID:SCR_010646
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_010646
Dataset updated
Aug 9, 2024
Description
Databases which provide clustered sets of sequences from UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view. The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry. The sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records are all displayed in the entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% (UniRef90) or 50% (UniRef50) sequence identity to the longest sequence (UniRef seed sequence). All the sequences in each cluster are ranked to facilitate the selection of a representative sequence for the cluster.
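The seed-based clustering idea described above can be illustrated with a toy greedy procedure. This is only a sketch and not the method UniRef uses in production: `naive_identity` is a crude, ungapped positional comparison standing in for a real alignment-based identity, and all function names and sequences are hypothetical.

```python
MIN_LEN = 11  # the description restricts clustering to sequences of 11+ residues


def naive_identity(seed, other):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(seed), len(other))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(seed, other) if a == b)
    return matches / n


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first cluster whose seed it matches.

    Sequences are visited longest first, so each cluster's seed is its
    longest member, mirroring the "longest sequence as seed" wording.
    """
    eligible = sorted((s for s in sequences if len(s) >= MIN_LEN),
                      key=len, reverse=True)
    clusters = []  # list of (seed, members)
    for seq in eligible:
        for seed, members in clusters:
            if naive_identity(seed, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Made-up sequences: two near-identical, one unrelated, one too short.
toy = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRA",
    "GGGGGGGGGGGGGGGGGGGGGG",
    "MKT",
]
clusters_90 = greedy_cluster(toy, 0.90)
for seed, members in clusters_90:
    print(seed, len(members))
```

Lowering the threshold (for example to 0.50) merges more sequences per seed, which is the sense in which the three UniRef levels offer different resolutions.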
Peptide Sequence Database
dknet.org
scicrunch.org
Updated Jan 29, 2022
Cite
(2022). Peptide Sequence Database [Dataset]. http://identifiers.org/RRID:SCR_005764
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_005764
Dataset updated
Jan 29, 2022
Description
The Peptide Sequence Database contains putative peptide sequences from human, mouse, rat, and zebrafish. Compressed to eliminate redundancy, these are about 40-fold smaller than a brute-force enumeration. Current and old releases are available for download. Each species' peptide sequence database comprises peptide sequence data from relevant species-specific UniGene and IPI clusters, plus all sequences from their constituent EST, mRNA and protein sequence databases, namely RefSeq proteins and mRNAs, UniProt's SwissProt and TrEMBL, GenBank mRNA, ESTs, and high-throughput cDNAs, HInv-DB, VEGA, EMBL, IPI protein sequences, plus the enumeration of all combinations of UniProt sequence variants, Met loss PTM, and signal peptide cleavages. The README file contains some information about the non-amino-acid symbols O (digest site corresponding to a protein N- or C-terminus) and J (no digest sequence join) used in these peptide sequence databases and information about how to configure various search engines to use them. Some search engines handle (very) long sequences badly and in some cases must be patched to use these peptide sequence databases. All search engines supported by the PepArML meta-search engine can (or can be patched to) successfully search these peptide sequence databases.
CATH-Gene3D
ebi.ac.uk
Updated Oct 21, 2020
Cite
(2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
Explore at:
Dataset updated
Oct 21, 2020
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College London, UK.
PSSH2 - database of protein sequence-to-structure homologies (including...
data.niaid.nih.gov
zenodo.org
Updated Feb 11, 2022
Cite
Sandeep Kaur (2022). PSSH2 - database of protein sequence-to-structure homologies (including Sars-CoV-2 structures) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4279163
Explore at:
Dataset updated
Feb 11, 2022
Dataset provided by
Neblina Sikta
Sandeep Kaur
Andrea Schafferhans
Sean O'Donoghue
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Protein sequence and structure data

This data set contains data from UniProt (in the files called protein_sequence, protein_synonyms, protein_names, organism_synonyms) and PDB (in the files called PDB and PDB_chain) as used by the Aquaria web resource at the time of download (2022-02-08).

The PSSH2 data set

PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.

Calculating PSSH2

The Swiss-Prot and PDB data were downloaded in November 2021. Generating PSSH2: we used UniRef30_2021_03 (originally called UniRef30_2021_06) from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30%. The HHblits code and the code for running the calculations were retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation, in the timeframe until December 2021.

PDB based sequence-to-structure alignments

In addition to the PSSH2 data, new PDB structures were retrieved based on the primary accession of the proteins, by querying for all chains in all PDB entries with exact matches using the sequence cross references records given in PDB. Sequence-to-structure alignments were then created, again based on information provided in each PDB entry. These are contained in the PDBchain data.

This data covers sequences and PDB structures in the timeframe until February 2022.

Evaluating PSSH2

The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits:

The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/)

The set of all expected hits consists of all PDB structures containing a domain with the same CATH code, either if contained in the set of processed sequences (-> all) or only if also contained in the set of non-redundant sequences (-> nr40).

The set of true positives is defined by sharing the same CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").

The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate TPR) by cumulatively considering all hits with an E-value below the threshold ("C") or in bins with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in November 2021 (202111) as well as previous data from October 2020 (202010), February 2020 (202002) and September 2017 (201709). The results are collected in PSSH CATH validation.csv.
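The evaluation scheme described above (CATH versus CAT matching, cumulative "C" versus binned "B" thresholds, FDR and recall) can be sketched as follows. This is an illustrative reading of the description, not the authors' evaluation code: the function names, the tuple layout of a hit, and the sample hits are all hypothetical, and FDR is taken as FP / (TP + FP) with recall as TP divided by the number of expected hits.

```python
def same_cath(code_a, code_b, level):
    """Compare dotted CATH codes on the first `level` fields.

    level=4 compares up to homology ("CATH"), level=3 up to topology ("CAT").
    """
    return code_a.split(".")[:level] == code_b.split(".")[:level]


def evaluate(hits, query_code, n_expected, threshold, mode="C", level=4):
    """Return (FDR, recall) for hits given as (e_value, cath_code) tuples.

    mode "C": all hits with an E-value below the threshold (cumulative).
    mode "B": hits with an E-value between threshold / 10 and the threshold.
    """
    if mode == "C":
        selected = [h for h in hits if h[0] < threshold]
    elif mode == "B":
        selected = [h for h in hits if threshold / 10 <= h[0] < threshold]
    else:
        raise ValueError("mode must be 'C' or 'B'")
    tp = sum(1 for _, code in selected if same_cath(code, query_code, level))
    fp = len(selected) - tp
    fdr = fp / len(selected) if selected else 0.0
    recall = tp / n_expected if n_expected else 0.0
    return fdr, recall


# Made-up hits for one query whose domain has CATH code 3.40.50.300.
hits = [
    (1e-30, "3.40.50.300"),
    (1e-12, "3.40.50.300"),
    (5e-6, "3.40.50.720"),
    (1e-3, "1.10.10.10"),
]
print(evaluate(hits, "3.40.50.300", n_expected=4, threshold=1e-5, mode="C", level=4))
print(evaluate(hits, "3.40.50.300", n_expected=4, threshold=1e-5, mode="C", level=3))
print(evaluate(hits, "3.40.50.300", n_expected=4, threshold=1e-5, mode="B", level=4))
```

In this toy case the third hit counts as a false positive at the "CATH" level but as a true positive at the "CAT" level, which shows how the choice of level changes both metrics.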

Known errors

Due to a processing error, the profile of PDB structure 5fia A / B (sequence md5 052667679fc644184f40063c7602c9e1) is incomplete in the pdb_full HHblits database, which led to further errors in generating sequence-based alignments for 1vtm P (sequence md5 c844aff103449363cb8489c78c58ebf1) and 434t A / B (sequence md5 d67aa1c3a36492c719cb48b5e7ecc624).
Isoelectric point for all UniProtKB/TrEMBL proteins April 2016
repod.icm.edu.pl
commons.datacite.org
7z, bin
Updated May 18, 2016
Cite
Kozlowski, Lukasz (2016). Isoelectric point for all UniProtKB/TrEMBL proteins April 2016 [Dataset]. http://doi.org/10.18150/repod.9948646
Explore at:
Available download formats: 7z (11492396457), bin (11492396457)
Unique identifier
https://doi.org/10.18150/repod.9948646
Dataset updated
May 18, 2016
Dataset provided by
RepOD
Authors
Kozlowski, Lukasz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Predicted isoelectric point for all UniProtKB/TrEMBL proteins (April 2016), computed using 18 different algorithms. Over 63 million protein sequences. Compressed using 7zip.
Primary reference: Kozlowski, LP (2016) Proteome-pI: proteome isoelectric point database. Nucleic Acids Research doi: 10.1093/nar/gkw978
Website: http://isoelectricpointdb.org
Number of human protein variations collected from the UniProt/Swiss-Prot...
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Yongwook Choi; Gregory E. Sims; Sean Murphy; Jason R. Miller; Agnes P. Chan (2023). Number of human protein variations collected from the UniProt/Swiss-Prot database. [Dataset]. http://doi.org/10.1371/journal.pone.0046688.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0046688.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yongwook Choi; Gregory E. Sims; Sean Murphy; Jason R. Miller; Agnes P. Chan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of human protein variations collected from the UniProt/Swiss-Prot database.
Gene Ontology Annotation Database
bioregistry.io
Updated Apr 24, 2021
Cite
(2021). Gene Ontology Annotation Database [Dataset]. https://bioregistry.io/goa
Explore at:
Dataset updated
Apr 24, 2021
Description
The GOA (Gene Ontology Annotation) project provides high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and International Protein Index (IPI). This involves electronic annotation and the integration of high-quality manual GO annotation from all GO Consortium model organism groups and specialist groups.
Protein database for SAGApipeline
data.niaid.nih.gov
Updated Jun 3, 2022
Cite
Li Rui (2022). Protein database for SAGApipeline [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6591506
Explore at:
Dataset updated
Jun 3, 2022
Dataset authored and provided by
Li Rui
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Proteins from multiple databases were used in algal genome annotation. We first downloaded the UniProt protein data for green algae (Chlorophyta) and red algae (Rhodophyta) to support gene prediction for 10 higher-quality assemblies. We refer to these protein sequences, used for gene prediction on RefSeq genomes, as seed_algae_mix; their role is to complete gene prediction of higher-quality genomes quickly and accurately. The predictions obtained for the RefSeq genomes are then used as query sequences against the NR (RefSeq non-redundant proteins) database, and the matched target sequences are extracted as the NR_extract part. Subsequently, we retrieved the UniProt protein sequences for all lineages of algae. Note that we did not select only the protein sequences of the 17 lineages to be annotated, but all algae from the 21 lineages found in the NCBI Taxonomy: a total of 2,432,633 (1.17 GB) algal protein sequences, named total_algae_mix, kept for future predictions. The plants part of OrthoDB V10.1 and BUSCO proteins of 7 lineages were also merged into pr_total_mix.
UniProtKB
scicrunch.org
neuinfo.org
Updated Oct 24, 2019
Cite
(2019). UniProtKB [Dataset]. http://identifiers.org/RRID:SCR_004426
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_004426
Dataset updated
Oct 24, 2019
Description
Central repository for collection of functional information on proteins, with accurate and consistent annotation. In addition to capturing core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and experimental and computational data. The UniProt Knowledgebase consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot (reviewed) is a high quality manually annotated and non-redundant protein sequence database which brings together experimental results, computed features, and scientific conclusions. UniProtKB/TrEMBL (unreviewed) contains protein sequences associated with computationally generated annotation and large-scale functional characterization that await full manual annotation. Users may browse by taxonomy, keyword, gene ontology, enzyme class or pathway.
