100+ datasets found
  1. h

    uniprot

    • huggingface.co
    Updated Apr 9, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Will Dampier (2022). uniprot [Dataset]. https://huggingface.co/datasets/damlab/uniprot
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 9, 2022
    Authors
    Will Dampier
    Description

    Dataset Description

      Dataset Summary
    

    This dataset is a mirror of the Uniprot/SwissProt database. It contains the names and sequences of >500K proteins. This dataset was parsed from the FASTA file at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz. Supported Tasks and Leaderboards: None Languages: English

      Dataset Structure
    
    
    
    
    
    
    
      Data Instances
    

    Data Fields: id, description, sequence Data… See the full description on the dataset page: https://huggingface.co/datasets/damlab/uniprot.

  2. The Therapeutic Drug Target Database Human SwissProt

    • johnsnowlabs.com
    csv
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    John Snow Labs, The Therapeutic Drug Target Database Human SwissProt [Dataset]. https://www.johnsnowlabs.com/marketplace/the-therapeutic-drug-target-database-human-swissprot/
    Explore at:
    csvAvailable download formats
    Dataset authored and provided by
    John Snow Labs
    Area covered
    N/A
    Description

    This dataset is a selection of The Therapeutic Target Database (release 4.3.02, 18th Oct 2013) protein IDs for successful targets. The web page states 388 but these reduced to 345 human Swiss-Prot accessions.

  3. Swiss-Prot database

    • springernature.figshare.com
    application/cdfv2
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shuqi Wang; Cuihong You; Hongyu Ma; Yin Zhang; Guidong Miao; Qingyang Wu; Fan Lin; Jude Juventus Aweya (2023). Swiss-Prot database [Dataset]. http://doi.org/10.6084/m9.figshare.6124457.v1
    Explore at:
    application/cdfv2Available download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Shuqi Wang; Cuihong You; Hongyu Ma; Yin Zhang; Guidong Miao; Qingyang Wu; Fan Lin; Jude Juventus Aweya
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    All unigenes of Portunus sanguinolentus hit to the Swiss-Prot database.

  4. Approved and Researched Drug Targets Human SwissProt Accessions

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    John Snow Labs (2021). Approved and Researched Drug Targets Human SwissProt Accessions [Dataset]. https://www.johnsnowlabs.com/marketplace/approved-and-researched-drug-targets-human-swissprot-accessions/
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    N/A
    Description

    This dataset is a supplementary data from "Analysis of in vitro bioactivity data extracted from drug discovery literature and patents: Ranking 1654 human protein targets by assayed compounds and molecular scaffolds" (2011). In this case the Entrez Gene IDs were mapped to 1651 human Swiss-Prot accessions but this includes both approved and research targets.

  5. Proven Drug Targets Converted to Human SwissProt Accessions

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    John Snow Labs (2021). Proven Drug Targets Converted to Human SwissProt Accessions [Dataset]. https://www.johnsnowlabs.com/marketplace/proven-drug-targets-converted-to-human-swissprot-accessions/
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    N/A
    Description

    This dataset is a supplementary data from "Novelty in the target landscape of the pharmaceutical industry" (2013). The listing of proven drug targets is converted to 248 human Swiss-Prot accessions.

  6. Data from: Method Development for Metaproteomic Analyses of Marine Biofilms

    • acs.figshare.com
    xls
    Updated Jun 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dagmar Hajkova Leary; W. Judson Hervey; Robert W. Li; Jeffrey R. Deschamps; Anne W. Kusterbeck; Gary J. Vora (2023). Method Development for Metaproteomic Analyses of Marine Biofilms [Dataset]. http://doi.org/10.1021/ac203315n.s003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Dagmar Hajkova Leary; W. Judson Hervey; Robert W. Li; Jeffrey R. Deschamps; Anne W. Kusterbeck; Gary J. Vora
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The large-scale identification and quantitation of proteins via nanoliquid chromatography (LC)-tandem mass spectrometry (MS/MS) offers a unique opportunity to gain unprecedented insight into the microbial composition and biomolecular activity of true environmental samples. However, in order to realize this potential for marine biofilms, new methods of protein extraction must be developed as many compounds naturally present in biofilms are known to interfere with common proteomic manipulations and LC-MS/MS techniques. In this study, we used amino acid analyses (AAA) and LC-MS/MS to compare the efficacy of three sample preparation methods [6 M guanidine hydrochloride (GuHCl) protein extraction + in-solution digestion + 2D LC; sodium dodecyl sulfate (SDS) protein extraction + 1D gel LC; phenol protein extraction + 1D gel LC] for the metaproteomic analyses of an environmental marine biofilm. The AAA demonstrated that proteins constitute 1.24% of the biofilm wet weight and that the compared methods varied in their protein extraction efficiencies (0.85–15.15%). Subsequent LC-MS/MS analyses revealed that the GuHCl method resulted in the greatest number of proteins identified by one or more peptides whereas the phenol method provided the greatest sequence coverage of identified proteins. As expected, metagenomic sequencing of the same biofilm sample enabled the creation of a searchable database that increased the number of protein identifications by 48.7% (≥1 peptide) or 54.7% (≥2 peptides) when compared to SwissProt database identifications. Taken together, our results provide methods and evidence-based recommendations to consider for qualitative or quantitative biofilm metaproteome experimental design.

  7. r

    UniProtKB/Swiss-Prot

    • rrid.site
    • scicrunch.org
    • +2more
    Updated Jun 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). UniProtKB/Swiss-Prot [Dataset]. http://identifiers.org/RRID:SCR_021164
    Explore at:
    Dataset updated
    Jun 28, 2025
    Description

    Curated component of UniProtKB (produced by the UniProt consortium). It contains hundreds of thousands of protein descriptions, including function, domain structure, subcellular location, post-translational modifications and functionally characterized variants.

  8. h

    SwissProt-EC-leaf

    • huggingface.co
    Updated Jun 30, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    LightOn AI (2022). SwissProt-EC-leaf [Dataset]. https://huggingface.co/datasets/lightonai/SwissProt-EC-leaf
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 30, 2022
    Dataset authored and provided by
    LightOn AI
    Description

    Dataset

    Swissprot is a high quality manually annotated protein database. The dataset contains annotations with the functional properties of the proteins. Here we extract proteins with Enzyme Commission labels. The dataset is ported from Protinfer: https://github.com/google-research/proteinfer. The leaf level EC-labels are extracted and indexed, the mapping is provided in idx_mapping.json. Proteins without leaf-level-EC tags are removed.

      Example
    

    The protein Q87BZ2 have… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/SwissProt-EC-leaf.

  9. o

    Data from: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in...

    • omicsdi.org
    Updated Jan 1, 1994
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (1994). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. [Dataset]. https://www.omicsdi.org/dataset/biostudies/S-EPMC165542
    Explore at:
    Dataset updated
    Jan 1, 1994
    Variables measured
    Unknown
    Description

    The SWISS-PROT protein knowledgebase (http://www.expasy.org/sprot/ and http://www.ebi.ac.uk/swissprot/) connects amino acid sequences with the current knowledge in the Life Sciences. Each protein entry provides an interdisciplinary overview of relevant information by bringing together experimental results, computed features and sometimes even contradictory conclusions. Detailed expertise that goes beyond the scope of SWISS-PROT is made available via direct links to specialised databases. SWISS-PROT provides annotated entries for all species, but concentrates on the annotation of entries from human (the HPI project) and other model organisms to ensure the presence of high quality annotation for representative members of all protein families. Part of the annotation can be transferred to other family members, as is already done for microbes by the High-quality Automated and Manual Annotation of microbial Proteomes (HAMAP) project. Protein families and groups of proteins are regularly reviewed to keep up with current scientific findings. Complementarily, TrEMBL strives to comprise all protein sequences that are not yet represented in SWISS-PROT, by incorporating a perpetually increasing level of mostly automated annotation. Researchers are welcome to contribute their knowledge to the scientific community by submitting relevant findings to SWISS-PROT at swiss-prot@expasy.org.

  10. Gene Ontology according to the Swiss-Prot database for the substrates of the...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sander H. Diks; Kaushal Parikh; Marijke van der Sijde; Jos Joore; Tita Ritsema; Maikel P. Peppelenbosch (2023). Gene Ontology according to the Swiss-Prot database for the substrates of the minimal kinome, shown for humanized substrate set. [Dataset]. http://doi.org/10.1371/journal.pone.0000777.t004
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Sander H. Diks; Kaushal Parikh; Marijke van der Sijde; Jos Joore; Tita Ritsema; Maikel P. Peppelenbosch
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gene Ontology according to the Swiss-Prot database for the substrates of the minimal kinome, shown for humanized substrate set.

  11. e

    PROSITE profiles

    • ebi.ac.uk
    Updated Feb 5, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Feb 5, 2025
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.

  12. s

    Peptide Sequence Database

    • scicrunch.org
    • dknet.org
    • +1more
    Updated Jun 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Peptide Sequence Database [Dataset]. http://identifiers.org/RRID:SCR_005764
    Explore at:
    Dataset updated
    Jun 17, 2025
    Description

    The Peptide Sequence Database contains putative peptide sequences from human, mouse, rat, and zebrafish. Compressed to eliminate redundancy, these are about 40 fold smaller than a brute force enumeration. Current and old releases are available for download. Each species'' peptide sequence database comprises peptide sequence data from releveant species specific UniGene and IPI clusters, plus all sequences from their consituent EST, mRNA and protein sequence databases, namely RefSeq proteins and mRNAs, UniProt''s SwissProt and TrEMBL, GenBank mRNA, ESTs, and high-throughput cDNAs, HInv-DB, VEGA, EMBL, IPI protein sequences, plus the enumeration of all combinations of UniProt sequence variants, Met loss PTM, and signal peptide cleavages. The README file contains some information about the non amino-acid symbols O (digest site corresponding to a protein N- or C-terminus) and J (no digest sequence join) used in these peptide sequence databases and information about how to configure various search engines to use them. Some search engines handle (very) long sequences badly and in some cases must be patched to use these peptide sequence databases. All search engines supported by the PepArML meta-search engine can (or can be patched to) successfully search these peptide sequence databases.

  13. e

    NCBIFAM

    • ebi.ac.uk
    Updated Dec 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). NCBIFAM [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Dec 16, 2024
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NCBIfam is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. NCBIfam is maintained at the National Center for Biotechnology Information (Bethesda, MD). NCBIfam includes models from TIGRFAMs, another database of protein families developed at The Institute for Genomic Research, then at the J. Craig Venter Institute (Rockville, MD, US).

  14. h

    swissprot

    • huggingface.co
    Updated Jun 23, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bayes Group (2023). swissprot [Dataset]. https://huggingface.co/datasets/bayes-group-diffusion/swissprot
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 23, 2023
    Authors
    Bayes Group
    Description

    bayes-group-diffusion/swissprot dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. s

    UniProtKB

    • scicrunch.org
    Updated Oct 24, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2019). UniProtKB [Dataset]. http://identifiers.org/RRID:SCR_004426
    Explore at:
    Dataset updated
    Oct 24, 2019
    Description

    Central repository for collection of functional information on proteins, with accurate and consistent annotation. In addition to capturing core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and experimental and computational data. The UniProt Knowledgebase consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot (reviewed) is a high quality manually annotated and non-redundant protein sequence database which brings together experimental results, computed features, and scientific conclusions. UniProtKB/TrEMBL (unreviewed) contains protein sequences associated with computationally generated annotation and large-scale functional characterization that await full manual annotation. Users may browse by taxonomy, keyword, gene ontology, enzyme class or pathway.

  16. Z

    PSSH2 - database of protein sequence-to-structure homologies (including...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 11, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sandeep Kaur (2022). PSSH2 - database of protein sequence-to-structure homologies (including Sars-CoV-2 structures) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4279163
    Explore at:
    Dataset updated
    Feb 11, 2022
    Dataset provided by
    Neblina Sikta
    Andrea Schafferhans
    Sean O'Donoghue
    Sandeep Kaur
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Protein sequence and structure data

    This data set contains data from Uniprot (in the files called protein_sequence, protein_synonyms, protein_names, organism_synonyms) and PDB (in the files called PDB and PDB_chain) as used by the Aquaria web resource at the time of download (2022-02-08).

    The PSSH2 data set

    PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.

    Calculating PSSH2

    The Swissprot and PDB data was downloaded in November 2021. Generating PSSH2: We used UniRef30_2021_03 (originally called UniRef30_2021_06) from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30%. The HHblits code and the code for running the calculations was retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe until December 2021.

    PDB based sequence-to-structure alignments

    In addition to the PSSH2 data, new PDB structures were retrieved based on the primary accession of the proteins, by querying for all chains in all PDB entries with exact matches using the sequence cross references records given in PDB. Sequence-to-structure alignments were then created, again based on information provided in each PDB entry. These are contained in the PDBchain data.

    This data covers sequences and PDB structures in the timeframe until February 2022.

    Evaluating PSSH2

    The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits:

    The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/)

    The set of all expected hits are all pdb structures containing a domain with the same CATH code if contained in the set of processed sequences (-> all) or only if also contained in the set of non redundant sequences (-> nr40).

    The set of true positives is defined by sharing the same CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").

    The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate TPR) by cumulatively considering all hits with an E-value below the threshold ("C") or in bins with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in November 2021 (202111) as well as previous data from October 2020 (202010), February 2020 (202002) and September 2017 (201709). The results are collected in PSSH CATH validation.csv.

    Known errors

    Due to processing error, the profile of pdb structure 5fia A / B (sequence md5 052667679fc644184f40063c7602c9e1) is incomplete in the pdb_full hhblits database which led to further errors in generating sequence based alignments for sequences for 1vtm P (sequence md5 c844aff103449363cb8489c78c58ebf1) and 434t A / B (sequence md5 d67aa1c3a36492c719cb48b5e7ecc624).

  17. n

    Secreted Protein Database

    • neuinfo.org
    • scicrunch.org
    • +1more
    Updated Apr 14, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). Secreted Protein Database [Dataset]. http://identifiers.org/RRID:SCR_013448
    Explore at:
    Dataset updated
    Apr 14, 2024
    Description

    A collection of secreted proteins from Human, Mouse and Rat proteomes, which includes sequences from SwissProt, Trembl, Ensembl and Refseq. The 18,152 entries are classified into fourteen functional categories, including "apolipoprotein", "cytokine", "protease", "toxin", etc. To make the dataset more comprehensive, nine related datasets were also collected, such as SPDI, Riken mouse secretome, SwissProt vertebrate secreted proteins, SubLoc etc.

  18. e

    CATH-Gene3D

    • ebi.ac.uk
    Updated Oct 21, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2020). CATH-Gene3D [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Oct 21, 2020
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov models libraries representing CATH and Pfam domains. CATH-Gene3D is based at University College, London, UK.

  19. PSSH2 - database of protein sequence-to-structure homologies

    • zenodo.org
    application/gzip
    Updated Feb 10, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andrea Schafferhans; Andrea Schafferhans; Sean O'Donoghue; Sean O'Donoghue (2022). PSSH2 - database of protein sequence-to-structure homologies [Dataset]. http://doi.org/10.5281/zenodo.4279164
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Feb 10, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Andrea Schafferhans; Andrea Schafferhans; Sean O'Donoghue; Sean O'Donoghue
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The PSSH2 data set

    PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences.

    This dataset contains the Swissprot and PDB data used for generating PSSH2 along with the PSSH2 data itself. This consists of the sequence-to-structure alignments used in Aquaria (aquaria.ws) and also for the Covid19 resource of Aquaria (http://aquaria.ws/covid).

    Calculating PSSH2

    The main bunch of Swissprot and PDB data was downloaded in February 2020, but incremental updates, especially as related to Covid19 were added until July 2020.
    Generating PSSH2: We used Uniclust30 from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30% (http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz). The HHblits code and the code for running the calculations was retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe between February and July 2020.

    Evaluating PSSH2

    The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits:

    • The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/)
    • The set of all expected hits are all pdb structures containing a domain with the same CATH code if contained in the set of processed sequences (-> all) or only if also contained in the set of non redundant sequences (-> nr40).
    • The set of true positives is defined by sharing the same CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").

    The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate TPR) by cumulatively considering all hits with an E-value below the threshold ("C") or in bins with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in February 2020 (202002) as well as previous data from September 2017 (201709) and has since been repeated for data from October 2020 (202010). The results are collected in PSSH CATH validation.csv.

  20. e

    SFLD

    • ebi.ac.uk
    Updated Sep 7, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
    Explore at:
    Dataset updated
    Sep 7, 2018
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Will Dampier (2022). uniprot [Dataset]. https://huggingface.co/datasets/damlab/uniprot

uniprot

damlab/uniprot

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 9, 2022
Authors
Will Dampier
Description

Dataset Description

  Dataset Summary

This dataset is a mirror of the Uniprot/SwissProt database. It contains the names and sequences of >500K proteins. This dataset was parsed from the FASTA file at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz. Supported Tasks and Leaderboards: None Languages: English

  Dataset Structure







  Data Instances

Data Fields: id, description, sequence Data… See the full description on the dataset page: https://huggingface.co/datasets/damlab/uniprot.

Search
Clear search
Close search
Google apps
Main menu