100+ datasets found
  1. Peptide Sequence Database

    • dknet.org
    • scicrunch.org
    • +1more
    Updated Jan 29, 2022
    Cite
    (2022). Peptide Sequence Database [Dataset]. http://identifiers.org/RRID:SCR_005764
    Dataset updated
    Jan 29, 2022
    Description

    The Peptide Sequence Database contains putative peptide sequences from human, mouse, rat, and zebrafish. Compressed to eliminate redundancy, these are about 40-fold smaller than a brute-force enumeration. Current and old releases are available for download. Each species' peptide sequence database comprises peptide sequence data from relevant species-specific UniGene and IPI clusters, plus all sequences from their constituent EST, mRNA and protein sequence databases, namely RefSeq proteins and mRNAs, UniProt's SwissProt and TrEMBL, GenBank mRNA, ESTs, and high-throughput cDNAs, HInv-DB, VEGA, EMBL, IPI protein sequences, plus the enumeration of all combinations of UniProt sequence variants, Met loss PTM, and signal peptide cleavages. The README file contains some information about the non-amino-acid symbols O (digest site corresponding to a protein N- or C-terminus) and J (no digest sequence join) used in these peptide sequence databases and information about how to configure various search engines to use them. Some search engines handle (very) long sequences badly and in some cases must be patched to use these peptide sequence databases. All search engines supported by the PepArML meta-search engine can (or can be patched to) successfully search these peptide sequence databases.

  2. Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 9, 2023
    Cite
    Eric W. Deutsch; Zhi Sun; David S. Campbell; Pierre-Alain Binz; Terry Farrah; David Shteynberg; Luis Mendoza; Gilbert S. Omenn; Robert L. Moritz (2023). Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics [Dataset]. http://doi.org/10.1021/acs.jproteome.6b00445.s002
    Available download formats: xlsx
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    ACS Publications
    Authors
    Eric W. Deutsch; Zhi Sun; David S. Campbell; Pierre-Alain Binz; Terry Farrah; David Shteynberg; Luis Mendoza; Gilbert S. Omenn; Robert L. Moritz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances, a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted genes and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ∼20,000 primary isoforms plus contaminants to a very large database that includes almost all nonredundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/.
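The "search small, then check uniqueness against large" idea in the description above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the accessions, sequences, and the helper name `peptide_to_proteins` are all hypothetical, matching is plain substring search, and isoleucine/leucine ambiguity is ignored.

```python
from collections import defaultdict

def peptide_to_proteins(peptides, database):
    """Map each peptide to the accessions whose sequence contains it."""
    hits = defaultdict(list)
    for accession, sequence in database.items():
        for peptide in peptides:
            if peptide in sequence:
                hits[peptide].append(accession)
    return hits

# Hypothetical toy databases: a small "Tier 1" set and a larger superset.
tier1 = {"P1": "MKTAYIAKQRQISFVK", "P2": "MSHFSRLEDK"}
tier4 = dict(tier1, P1_iso2="MKTAYIAKQRGGSFVK", V9="AAAQISFVKSS")

# Peptides assumed to have been identified by searching the small database.
identified = ["TAYIAK", "QISFVK", "LEDK"]

small = peptide_to_proteins(identified, tier1)
large = peptide_to_proteins(identified, tier4)

# A peptide stays "unique" only if it maps to one protein in both databases.
still_unique = [p for p in identified
                if len(small[p]) == 1 and len(large[p]) == 1]
print(still_unique)
```

In this toy case two of the three peptides look unique in the small database but map to additional entries in the larger one, which is the situation the uniqueness check is meant to catch.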

  3. MScDB: A Mass Spectrometry-centric Protein Sequence Database for Proteomics

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xlsx
    Updated May 30, 2023
    Cite
    Harald Marx; Simone Lemeer; Susan Klaeger; Thomas Rattei; Bernhard Kuster (2023). MScDB: A Mass Spectrometry-centric Protein Sequence Database for Proteomics [Dataset]. http://doi.org/10.1021/pr400215r.s003
    Available download formats: xlsx
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Harald Marx; Simone Lemeer; Susan Klaeger; Thomas Rattei; Bernhard Kuster
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Protein sequence databases are indispensable tools for life science research including mass spectrometry (MS)-based proteomics. In current database construction processes, sequence similarity clustering is used to reduce redundancies in the source data. Albeit powerful, it ignores the peptide-centric nature of proteomic data and the fact that MS is able to distinguish similar sequences. Therefore, we introduce an approach that structures the protein sequence space at the peptide level using theoretical and empirical information from large-scale proteomic data to generate a mass spectrometry-centric protein sequence database (MScDB). The core modules of MScDB are an in-silico proteolytic digest and a peptide-centric clustering algorithm that groups protein sequences that are indistinguishable by mass spectrometry. Analysis of various MScDB use cases against five complex human proteomes resulted in 69 peptide identifications not present in UniProtKB as well as 79 putative single amino acid polymorphisms. MScDB retains ∼99% of the identifications in comparison to common databases despite a 3–48% increase in the theoretical peptide search space (but comparable protein sequence space). In addition, MScDB enables cross-species applications such as human/mouse graft models, and our results suggest that the uncertainty in protein assignments to one species can be smaller than 20%.
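The two core ideas named above, an in-silico digest and grouping of proteins that yield the same peptides, can be illustrated with a small sketch. It is not the MScDB algorithm: it assumes a simple trypsin rule (cleave after K or R, not before P, no missed cleavages), a hypothetical minimum observable peptide length, and exact equality of peptide sets, whereas the description says MScDB also uses empirical information. All sequences and accessions are made up.

```python
import re
from collections import defaultdict

def tryptic_digest(sequence, min_length=6):
    """Cleave after K or R unless followed by P; no missed cleavages."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    # Peptides shorter than min_length are treated as unobservable.
    return {p for p in pieces if len(p) >= min_length}

def group_indistinguishable(proteins, min_length=6):
    """Group accessions whose digests give identical peptide sets."""
    groups = defaultdict(list)
    for accession, sequence in proteins.items():
        key = frozenset(tryptic_digest(sequence, min_length))
        groups[key].append(accession)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical sequences: B differs from A only by a short peptide (GGK),
# while C carries a substitution inside a longer peptide.
proteins = {
    "A": "MAGTLLSEKDVNAAIRPEPTIDEK",
    "B": "MAGTLLSEKGGKDVNAAIRPEPTIDEK",
    "C": "MAGTLLSEKDVNAGIRPEPTIDEK",
}
print(group_indistinguishable(proteins))
```

Under these assumptions A and B fall into one group because their only difference lies in a peptide too short to count, while C stays separate because the substitution changes an observable peptide.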

  4. NCBI Protein Database

    • rrid.site
    • neuinfo.org
    • +2more
    Updated Jan 29, 2022
    Cite
    (2001). NCBI Protein Database [Dataset]. http://identifiers.org/RRID:SCR_003257
    Dataset updated
    Jan 29, 2022
    Description

    Databases of protein sequences and 3D structures of proteins. Collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.

  5. TMBETA-GENOME

    • dbarchive.biosciencedbc.jp
    Cite
    National Institute of Advanced Industrial Science and Technology (the original website was terminated). TMBETA-GENOME [Dataset]. http://doi.org/10.18908/lsdba.nbdc00713-000
    Description

    TMBETA-GENOME is a database for transmembrane β-barrel proteins in complete genomes. For each genome, calculations with machine learning algorithms and statistical methods have been performed and the annotation results are accumulated in the database.

  6. The GenBank Non-Redundant Protein Sequence Database (NRDB)

    • data.niaid.nih.gov
    • piroplasmadb.org
    • +1more
    Updated Jan 21, 2021
    Cite
    NRDB (2021). The GenBank Non-Redundant Protein Sequence Database (NRDB) [Dataset]. https://data.niaid.nih.gov/resources?id=ds_a7163a9f0d
    Dataset updated
    Jan 21, 2021
    Dataset provided by
    National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
    Authors
    NRDB
    Description

    The GenBank non-redundant protein sequence database (NRDB) is a component of the NCBI BLAST databases and contains entries from GenPept, SwissProt, PIR, PRF, PDB and NCBI RefSeq.

  7. UniParc

    • neuinfo.org
    • scicrunch.org
    • +2more
    Updated Aug 6, 2024
    + more versions
    Cite
    (2024). UniParc [Dataset]. http://identifiers.org/RRID:SCR_005818
    Dataset updated
    Aug 6, 2024
    Description

    Database that contains publicly available protein sequences with stable and unique identifiers (UPI) which are never removed, changed or reassigned. UniParc tracks sequence changes in the source databases and archives the history of all changes. Information other than protein sequence must be retrieved from the UniParc source databases using the database cross-references.

  8. Performance and comparison of the metagenomic predicted protein sequence...

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Brandi L. Cantarel; Alison R. Erickson; Nathan C. VerBerkmoes; Brian K. Erickson; Patricia A. Carey; Chongle Pan; Manesh Shah; Emmanuel F. Mongodin; Janet K. Jansson; Claire M. Fraser-Liggett; Robert L. Hettich (2023). Performance and comparison of the metagenomic predicted protein sequence databases. [Dataset]. http://doi.org/10.1371/journal.pone.0027173.t001
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS http://plos.org/
    Authors
    Brandi L. Cantarel; Alison R. Erickson; Nathan C. VerBerkmoes; Brian K. Erickson; Patricia A. Carey; Chongle Pan; Manesh Shah; Emmanuel F. Mongodin; Janet K. Jansson; Claire M. Fraser-Liggett; Robert L. Hettich
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The database composition and SEQUEST/DTASelect search results (compute time, identified non-redundant spectra and peptides) with a 2-peptide and deltCN of 0.08 filters are shown for samples 6a (Run 2 and 3) and 6b (Run 1 and 2).

  9. CPBI_seqdb_demo sample QFO sequence library

    • data.niaid.nih.gov
    Updated Jan 21, 2020
    Cite
    William R. Pearson (2020). CPBI_seqdb_demo sample QFO sequence library [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_377027
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    U. of Virginia
    Authors
    William R. Pearson
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A medium-sized (approx. 1 million entries) protein sequence database constructed from the NCBI 'nr' (Jan, 2017) database by selecting UniProt (SwissProt), RefSeq, and PDB entries for 66 species (taxon_id's) from the Quest for Orthologs organism set. These files are designed to be used in conjunction with scripts and SQL files to construct the seqdb_demo database, as described in Current Protocols in Bioinformatics Unit 3.9, revised Spring 2017. The files are:

    qfo_demo.gz - a FASTA-format sequence library with the current NR Defline format (gzip compressed)

    qfo_prot.accession2taxonid.gz, qfo_pdb.accession2taxid.gz - tables that map accessions to taxon_id's and gi-numbers, similar to those available in the NCBI pub/taxonomy/accession2taxid/prot.accession2taxid and pdb.accession2taxid files (gzip compressed).
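Reading such a mapping table might look like the sketch below. It assumes the files follow the NCBI accession2taxid layout (tab-separated, with a header row naming the columns accession, accession.version, taxid, gi); the description only says they are "similar", so the column names should be checked against the real files. The accessions in the toy data are invented, and the table is built in memory so the example runs without any download.

```python
import csv
import gzip
import io

def load_accession2taxid(handle):
    """Return {accession.version: taxid} from a tab-separated table
    whose first row is a header (assumed NCBI-style column names)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}

# Toy stand-in for a gzip-compressed mapping file (made-up accessions).
raw = (
    "accession\taccession.version\ttaxid\tgi\n"
    "XX_000001\tXX_000001.1\t9606\t0\n"
    "XX_000002\tXX_000002.2\t10090\t0\n"
)
blob = gzip.compress(raw.encode("utf-8"))

# gzip.open accepts a file object; text mode decodes while decompressing.
with gzip.open(io.BytesIO(blob), mode="rt", newline="") as handle:
    taxid_by_accession = load_accession2taxid(handle)

print(taxid_by_accession)
```

For a real file, the `io.BytesIO(blob)` argument would be replaced by the path to the downloaded `.gz` file.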

  10. Protein Identification Using Customized Protein Sequence Databases Derived...

    • acs.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Xiaojing Wang; Robbert J. C. Slebos; Dong Wang; Patrick J. Halvey; David L. Tabb; Daniel C. Liebler; Bing Zhang (2023). Protein Identification Using Customized Protein Sequence Databases Derived from RNA-Seq Data [Dataset]. http://doi.org/10.1021/pr200766z.s002
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Xiaojing Wang; Robbert J. C. Slebos; Dong Wang; Patrick J. Halvey; David L. Tabb; Daniel C. Liebler; Bing Zhang
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The standard shotgun proteomics data analysis strategy relies on searching MS/MS spectra against a context-independent protein sequence database derived from the complete genome sequence of an organism. Because transcriptome sequence analysis (RNA-Seq) promises an unbiased and comprehensive picture of the transcriptome, we reason that a sample-specific protein database derived from RNA-Seq data can better approximate the real protein pool in the sample and thus improve protein identification. In this study, we have developed a two-step strategy for building sample-specific protein databases from RNA-Seq data. First, the database size is reduced by eliminating unexpressed or lowly expressed genes according to transcript quantification. Second, high-quality nonsynonymous coding single nucleotide variations (SNVs) are identified based on RNA-Seq data, and corresponding protein variants are added to the database. Using RNA-Seq and shotgun proteomics data from two colorectal cancer cell lines SW480 and RKO, we demonstrated that customized protein sequence databases could significantly increase the sensitivity of peptide identification, reduce ambiguity in protein assembly, and enable the detection of known and novel peptide variants. Thus, sample-specific databases from RNA-Seq data can enable more sensitive and comprehensive protein discovery in shotgun proteomics studies.
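The two-step strategy described above (drop unexpressed or lowly expressed genes, then add variant proteins) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `build_sample_db`, the expression threshold and its units, the variant tuple format, and all gene names and sequences are hypothetical.

```python
def build_sample_db(proteins, expression, variants, min_expression=1.0):
    """proteins: {gene: sequence}; expression: {gene: abundance};
    variants: list of (gene, 1-based position, ref residue, alt residue)."""
    # Step 1: keep only genes at or above the (hypothetical) expression cutoff.
    db = {gene: seq for gene, seq in proteins.items()
          if expression.get(gene, 0.0) >= min_expression}

    # Step 2: add one variant entry per substitution that matches the reference.
    for gene, pos, ref, alt in variants:
        seq = db.get(gene)
        if seq is None or not 1 <= pos <= len(seq) or seq[pos - 1] != ref:
            continue  # gene filtered out, or variant inconsistent with sequence
        db[f"{gene}_{ref}{pos}{alt}"] = seq[:pos - 1] + alt + seq[pos:]
    return db

# Made-up inputs for illustration only.
proteins = {"GENE_A": "MKTAYIAK", "GENE_B": "MSHFSR", "GENE_C": "MLEDKPT"}
expression = {"GENE_A": 25.0, "GENE_B": 0.2, "GENE_C": 8.5}
variants = [("GENE_A", 4, "A", "V"),   # applied
            ("GENE_B", 2, "S", "T"),   # skipped: gene below cutoff
            ("GENE_C", 3, "Q", "E")]   # skipped: reference residue mismatch

custom_db = build_sample_db(proteins, expression, variants)
for name, seq in custom_db.items():
    print(name, seq)
```

Keeping the reference entry alongside each variant entry lets a search engine match either form of the peptide; whether the original study did the same is not stated in the description.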

  11. Protein-Protein Interaction Database

    • neuinfo.org
    • dknet.org
    • +1more
    Updated Jan 29, 2022
    Cite
    (2022). Protein-Protein Interaction Database [Dataset]. http://identifiers.org/RRID:SCR_007288
    Dataset updated
    Jan 29, 2022
    Description

    Mammalian protein-protein interaction database focusing on synaptic proteins. The Protein-Protein Interaction Database was originally a single person's attempt to integrate a gamut of biological/bibliographical/molecular data and build a framework which might help in understanding how cells orchestrate their protein content in order to become what they are: machines with a purpose. This is based on the simple paradigm that functionality like signal cascades is held together in a close space, thereby allowing specific events to occur without the necessity of passive diffusion and random events. The PPID database arose from the need to interpret proteomic datasets, which were generated analysing the NMDA-receptor complex (see H. Husi, M. A. Ward, J. S. Choudhary, W. P. Blackstock and S. G. Grant (2000). Proteomic analysis of NMDA receptor-adhesion protein signaling complexes. Nat Neurosci 3, 661-669.). Studying these clusters of proteins unavoidably requires handling large datasets, which PPID is aimed at and tailored for. This database unifies molecular entries across three species, namely human, rat and mouse, and is based on sequence databases such as SwissProt, EMBL, TrEMBL (translated EMBL sequences) and Unigene and the literature database PubMed. A typical entry in PPID holds up to three general entries for the three species, all protein and gene accession numbers associated with them (assembled from Blast2 searches of the databases) and the OMIM entry as maintained by Johns Hopkins University. Furthermore, protein sequence information is also included, together with known and novel splice-variants of each molecule as found by ClustalW sequence alignments. Entry points also include protein-binding information together with the literature reference. The whole database is curated manually to ensure accuracy and quality.
    Querying the database will be possible by online browsing and batch-submission for large datasets holding accession number information, as can be generated using software like Mascot for mass-spectrometry. Cluster-analysis of the submitted datasets in the form of a graphical output will be developed as well as an easy-to-use web-interface. An interface is currently being built in collaboration with the Department of Informatics (T. Theodosiou and D. Armstrong) and will be deployed soon. The current team of people collating and deploying the database are H. Husi (database mining and information gathering) and T. Theodosiou (web-interface and deployment). Please note that this database is not funded financially, and cannot survive without sponsorship.

  12. Data from: Assessing protein sequence database suitability using de novo...

    • data.niaid.nih.gov
    • ebi.ac.uk
    xml
    Updated Nov 20, 2019
    Cite
    Richard Johnson; Michael J. MacCoss (2019). Assessing protein sequence database suitability using de novo sequencing [Dataset]. https://data.niaid.nih.gov/resources?id=pxd015083
    Available download formats: xml
    Dataset updated
    Nov 20, 2019
    Dataset provided by
    University of Washington
    Department of Genome Sciences, University of Washington, Seattle, WA, United States
    Authors
    Richard Johnson; Michael J. MacCoss
    Variables measured
    Proteomics
    Description

    The analysis of samples from unsequenced and/or understudied species as well as samples where the proteome is derived from multiple organisms poses two key questions. The first is whether the proteomic data obtained from an unusual sample type even contains peptide tandem mass spectra. The second question is whether an appropriate protein sequence database is available for proteomic searches. We describe the use of automated de novo sequencing for evaluating both the quality of a collection of tandem mass spectra and the suitability of a given protein sequence database for searching that data. Applications of this method include the proteome analysis of closely related species, metaproteomics, and proteomics of extant organisms.

  13. TM Function Database

    • neuinfo.org
    • rrid.site
    • +2more
    Updated Nov 16, 2024
    Cite
    (2024). TM Function Database [Dataset]. http://identifiers.org/RRID:SCR_007058
    Dataset updated
    Nov 16, 2024
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE. Documented on October 29, 2025. Database of functional residues in alpha-helical and beta-barrel membrane proteins. Each protein is identified with its name and source along with the UniProt code. The Protein Data Bank (PDB) codes are also given for available proteins. Different methods and experimental parameters, for example, affinity, dissociation constant, IC50, activity etc., are given in the database. Further, the database provides the numerical experimental value for each residue (or mutant) in a protein. The experimental data are collected from the literature both by searching the journals and by keyword search at PubMed. In addition, a complete reference is given with journal citation and PMID number. TMFunction is cross-linked with the sequence database, UniProt, the structural database, PDB, and the literature database, PubMed. The WWW interface enables users to search data based on various terms with different display options for outputs.

  14. CharProtDB: Characterized Protein Database

    • scicrunch.org
    • rrid.site
    • +2more
    Updated Dec 4, 2011
    Cite
    (2011). CharProtDB: Characterized Protein Database [Dataset]. http://identifiers.org/RRID:SCR_005872
    Dataset updated
    Dec 4, 2011
    Description

    The Characterized Protein Database, CharProtDB, is designed and being developed as a resource of expertly curated, experimentally characterized proteins described in published literature. For each protein record in CharProtDB, storage of several data types is supported. It includes functional annotation (several instances of protein names and gene symbols), taxonomic classification, literature links, specific Gene Ontology (GO) terms and GO evidence codes, EC (Enzyme Commission) and TC (Transport Classification) numbers and protein sequence. Additionally, each protein record is associated with cross links to all public accessions in major protein databases as "synonymous accessions". Each of the above data types can be linked to as many literature references as possible. Every CharProtDB entry requires minimum data types to be furnished. They are protein name, GO terms and supporting reference(s) associated with GO evidence codes. Annotating using the GO system is of importance for several reasons: the GO system captures defined concepts (the GO terms) with unique ids, which can be attached to specific genes, and the three controlled vocabularies of the GO allow for the capture of much more annotation information than is traditionally captured in protein common names, including, for example, not just the function of the protein, but its location as well. GO evidence codes implemented in CharProtDB directly correlate with the GO consortium definitions of experimental codes. CharProtDB tools link characterization data from multiple input streams through synonymous accessions or direct sequence identity. CharProtDB can represent multiple characterizations of the same protein, with proper attribution and links to database sources. Users can use a variety of search terms including protein name, gene symbol, EC number, organism name, accessions or any text to search the database. Following the search, a display page lists all the proteins that match the search term.
    Click on the protein name to view more detailed annotated information for each protein. Additionally, each protein record can be annotated.

  15. Evaluation SIHUMI dataset: A sectioning and database enrichment approach for...

    • data.niaid.nih.gov
    Updated Apr 16, 2020
    Cite
    Kumar, Praveen; Johnson, James; Easterly, Caleb; Mehta, Subina; Sajulga, Ray; Nunn, Brook; Jagtap, Pratik; Griffin, Timothy (2020). Evaluation SIHUMI dataset: A sectioning and database enrichment approach for improved peptide spectrum matching in large, genome-guided protein sequence databases [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3754788
    Dataset updated
    Apr 16, 2020
    Dataset provided by
    University of Minnesota
    University of Washington
    Authors
    Kumar, Praveen; Johnson, James; Easterly, Caleb; Mehta, Subina; Sajulga, Ray; Nunn, Brook; Jagtap, Pratik; Griffin, Timothy
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for evaluation of database sectioning method for generating an enriched database for mass-spectrometry-based proteomic approaches using large databases. Our evaluation demonstrates that this method helps to increase the sensitivity of PSMs while maintaining acceptable FDR statistics. This dataset was acquired from the protein samples containing proteins from eight microorganisms (Anaerostipes caccae, Bacteroides thetaiotaomicron, Bifidobacterium longum, Blautia producta, Clostridium butyricum, Clostridium ramosum, Escherichia coli, Lactobacillus plantarum) that were grown in a bioreactor. The MS-data for this dataset was acquired by Dr. Hettich's group at Oak Ridge National Laboratory, and it was made available by Dr. Nico Jehmlich from Helmholtz Center for Environmental Research through the 3rd International Metaproteome Symposium (https://www.ufz.de/index.php?en=44639).

  16. Data from: PROSITE

    • prosite.expasy.org
    • identifiers.org
    • +7more
    Updated Oct 15, 2025
    Cite
    (2025). PROSITE [Dataset]. https://prosite.expasy.org/
    Dataset updated
    Oct 15, 2025
    Description

    PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.

  17. Genomic Distribution of structural Superfamilies

    • rrid.site
    • bioregistry.io
    Updated Nov 11, 2025
    Cite
    (2025). Genomic Distribution of structural Superfamilies [Dataset]. http://identifiers.org/RRID:SCR_007670/resolver?q=&i=rrid
    Dataset updated
    Nov 11, 2025
    Description

    Genomic Distribution of structural Superfamilies (GenDiS) identifies and classifies evolutionarily related proteins at the superfamily level in whole genome databases. GenDiS has been curated in direct correspondence with SCOP and represents 4001 highly resolved domains in 1194 structural superfamilies across protein sequence databases. Sequences showing reliable homology to entries in SCOP and PASS2 databases have been obtained from the non-redundant protein sequence database and aligned. Similar alignments of the superfamily members are provided at the genome level. GenDiS provides a platform for cross-genome comparison at the superfamily level. GenDiS relates protein sequence information across all strata of taxonomy. One may navigate through the database to obtain structural homologues across different levels in taxonomic classification. The nomenclature of the various genomes and their hierarchy is in direct correspondence with the taxonomy database maintained at the NCBI. Sequence homologues for the various structural members are obtained from the non-redundant protein sequence database employing sensitive sequence search methods. Multiple approaches such as PSI-BLAST, HMMsearch of the HMMer suite and an interacting motif constrained PHI-BLAST have been employed to identify homologues in the sequence databases.

  18. Data from: CluSTr

    • dknet.org
    • rrid.site
    • +2more
    Updated Jan 29, 2022
    + more versions
    Cite
    (2022). CluSTr [Dataset]. http://identifiers.org/RRID:SCR_007600
    Dataset updated
    Jan 29, 2022
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented May 10, 2017. A pilot effort that has developed a centralized, web-based biospecimen locator that presents biospecimens collected and stored at participating Arizona hospitals and biospecimen banks, which are available for acquisition and use by researchers. Researchers may use this site to browse, search and request biospecimens to use in qualified studies. The development of the ABL was guided by the Arizona Biospecimen Consortium (ABC), a consortium of hospitals and medical centers in the Phoenix area, and is now being piloted by this Consortium under the direction of ABRC. You may browse by type (cells, fluid, molecular, tissue) or disease. Common data elements decided by the ABC Standards Committee, based on data elements on the National Cancer Institute's (NCI's) Common Biorepository Model (CBM), are displayed. These describe the minimum set of data elements that the NCI determined were most important for a researcher to see about a biospecimen. The ABL currently does not display information on whether or not clinical data is available to accompany the biospecimens. However, a requester has the ability to solicit clinical data in the request. Once a request is approved, the biospecimen provider will contact the requester to discuss the request (and the requester's questions) before finalizing the invoice and shipment. The ABL is available to the public to browse. In order to request biospecimens from the ABL, the researcher will be required to submit the requested required information. Upon submission of the information, shipment of the requested biospecimen(s) will be dependent on the scientific and institutional review approval. Account required. Registration is open to everyone., documented June 24, 2013 as per the Miriam database (http://www.ebi.ac.uk/miriam/main/collections/MIR:00000021).
    The CluSTr database offers an automatic classification of UniProt Knowledgebase and IPI proteins into groups of related proteins. The clustering is based on analysis of all pairwise comparisons between protein sequences. The database provides links to InterPro, which integrates information on protein families, domains and functional sites from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, Gene3D, SUPERFAMILY, PIR Superfamily and PANTHER. To date (2011), CluSTr contains the following information:
    • 9,450,285 sequences from UniProt Knowledgebase release 15.6
    • 308,281 sequences from IPI
    • 3,636,831,744 similarities, with pairwise alignments generated on-the-fly
    • 17,616,060 clusters
    • Clustering for 972 organisms with completely sequenced genomes. For the full list of the genomes see Integr8
    • Putative homologue predictions for the above species. For more information see Homologue Selection at Integr8

  19. MARMICRODB database for taxonomic classification of (marine) metagenomes

    • zenodo.org
    application/gzip, bin +3
    Updated Mar 20, 2020
    Cite
    Shane L Hogle (2020). MARMICRODB database for taxonomic classification of (marine) metagenomes [Dataset]. http://doi.org/10.5281/zenodo.3520509
    Available download formats: bin, application/gzip, tsv, html, bz2
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shane L Hogle
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction:
    This sequence database (MARMICRODB) was introduced in the publication JW Becker, SL Hogle, K Rosendo, and SW Chisholm. 2019. Co-culture and biogeography of Prochlorococcus and SAR11. ISME J. doi:10.1038/s41396-019-0365-4. Please see the original publication and its associated supplementary material for the original description of this resource.

    Motivation:
    We needed a reference database to annotate shotgun metagenomes from the Tara Oceans project [1], the GEOTRACES cruises GA02, GA03, GA10, and GP13, and the HOT and BATS time series [2]. Our interests are primarily in quantifying and annotating the free-living, oligotrophic bacterial groups Prochlorococcus, Pelagibacterales/SAR11, SAR116, and SAR86 from these samples using the protein classifier tool Kaiju [3]. Kaiju's sensitivity and classification accuracy depend on the composition of the reference database, and the highest sensitivity is achieved when the reference database contains a comprehensive representation of expected taxa from an environment/sample of interest. However, the speed of the algorithm decreases as database size increases. Therefore, we aimed to create a reference database that maximized the representation of sequences from marine bacteria, archaea, and microbial eukaryotes, while minimizing (but not excluding) the sequences from clinical, industrial, and terrestrial host-associated samples.

    Results/Description:
    MARMICRODB consists of 56 million non-redundant protein sequences from 18769 bacterial/archaeal/eukaryote genome and transcriptome bins and 7492 viral genomes, optimized for use with the protein homology classifier Kaiju [3]. To ensure maximum representation of marine bacteria, archaea, and microbial eukaryotes, we included translated genes/transcripts from 5397 representative "specI" species clusters from the proGenomes database [4]; 113 transcriptomes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [5]; 10509 metagenome assembled genomes from the Tara Oceans expedition [6,7], the Red Sea [8], the Baltic Sea [9], and other aquatic and terrestrial sources [10]; 994 isolate genomes from the Genomic Encyclopedia of Bacteria and Archaea [11]; 7492 viral genomes from NCBI RefSeq [12]; 786 bacterial and archaeal genomes from MarRef [13]; and 677 marine single cell genomes [14]. In order to annotate metagenomic reads at the clade/ecotype level (subspecies) for the focal taxa Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116, we generated custom MARMICRODB taxonomies based on curated genome phylogenies for each group. The curated phylogenies, Kaiju-formatted Burrows-Wheeler index, translated genes, the custom taxonomy hierarchy, an interactive kronaplot of the taxonomic composition, and scripts and instructions for how to use or rebuild the resource are available from 10.5281/zenodo.3520509.

    Methods:
    The curation and quality control of MARMICRODB single cell, metagenome assembled, and isolate genomes were performed as described in [15]. Briefly, we downloaded all MARMICRODB genomes as raw nucleotide assemblies from NCBI. We determined an initial genome taxonomy for these assemblies using checkM with the default lineage workflow [16]. All genome bins met the completion/contamination thresholds outlined in prior studies [7,17]. For single cell and metagenome assembled genomes, especially those from Tara Oceans Mediterranean Sea samples [18], we used the GTDB-Tk classification workflow [19] to verify the taxonomic fidelity of each genome bin. We then selected genomes with a checkM taxonomic assignment of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 for further analysis and confirmed taxonomic assignment using blast matches to known Prochlorococcus/Synechococcus ITS sequences and by matching 16S sequences to the SILVA database [20]. To refine our estimates of completeness/contamination of Prochlorococcus genome bins, we created a custom set of 730 single copy protein families (available from 10.5281/zenodo.3719132) from closed, isolate Prochlorococcus genomes [21] for quality assessments with checkM. For Synechococcus we used the CheckM taxonomic-specific workflow with the genus Synechococcus. After the custom CheckM quality control, we excluded from downstream analysis any genome bins that had an estimated quality < 30, defined as %completeness – 5 × %contamination, resulting in 18769 genome/transcriptome bins. We predicted genes in the resulting genome bins using prodigal [22], excluded protein sequences with lengths less than 20 or greater than 20000 amino acids, removed non-standard amino acid residues, and condensed redundant protein sequences to a single representative sequence, to which we assigned a lowest common ancestor (LCA) taxonomy identifier from the NCBI taxonomy database [23]. The resulting protein sequences were compiled and used to build a Kaiju [3] search database.
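
The filtering steps described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy taxonomy layout (a child-to-parent dictionary in which the root is its own parent), and the reading of "removed non-standard amino acid residues" as stripping individual characters after the length check are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional

# The 20 standard amino acids; everything else is treated as non-standard (assumption).
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def bin_quality(completeness: float, contamination: float) -> float:
    """Quality score as stated in the text: %completeness - 5 x %contamination."""
    return completeness - 5.0 * contamination


def keep_bin(completeness: float, contamination: float, min_quality: float = 30.0) -> bool:
    """A bin is excluded when its quality is below the threshold (< 30 in the text)."""
    return bin_quality(completeness, contamination) >= min_quality


def clean_protein(seq: str, min_len: int = 20, max_len: int = 20000) -> Optional[str]:
    """Apply the length window, then strip non-standard residues.

    Returns None when the sequence falls outside the length window.
    The order of the two steps is an assumption.
    """
    seq = seq.upper()
    if len(seq) < min_len or len(seq) > max_len:
        return None
    return "".join(c for c in seq if c in STANDARD_AA)


def lineage(taxid: int, parent: Dict[int, int]) -> List[int]:
    """Path from a node up to the root (the root is its own parent)."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxids: Iterable[int], parent: Dict[int, int]) -> int:
    """Deepest node shared by the lineages of all given taxon ids."""
    ids = list(taxids)
    common = lineage(ids[0], parent)
    for t in ids[1:]:
        other = set(lineage(t, parent))
        common = [node for node in common if node in other]
    return common[0]


if __name__ == "__main__":
    # Hypothetical taxonomy: 1 is the root, 2 and 5 are its children, 3 and 4 are children of 2.
    toy_parent = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}
    print(keep_bin(90.0, 2.0))                          # quality 80, kept
    print(keep_bin(50.0, 5.0))                          # quality 25, excluded
    print(lowest_common_ancestor([3, 4], toy_parent))   # shared parent of 3 and 4
```

In a real run, identical cleaned sequences found in several genomes would be collapsed to one record and labelled with the LCA of the contributing taxon ids.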

    The above filtering criteria resulted in 605 Prochlorococcus, 96 Synechococcus, 186 SAR11/Pelagibacterales, 60 SAR86, and 59 SAR116 high-quality genome bins. We constructed a high-quality fixed reference phylogenetic tree for each taxonomic group based on genomes manually selected for completeness and phylogenetic diversity. For example, the Prochlorococcus and Synechococcus genomes for the fixed reference phylogeny are estimated > 90% complete, and SAR11 genomes are estimated > 70% complete. We created multiple sequence alignments of phylogenetically conserved genes from these genomes using the GTDB-Tk pipeline [19] with default settings. The pipeline identifies conserved proteins (120 bacterial proteins) and generates concatenated multi-protein alignments [17] from the genome assemblies using hmmalign from the hmmer software suite. We further filtered the resulting alignment columns using the bacterial and archaeal alignment masks from [17] (http://gtdb.ecogenomic.org/downloads). We removed columns represented by fewer than 50% of all taxa and/or columns with no single amino acid residue occurring at a frequency greater than 25%. We trimmed the alignments using trimal [24] with the automated -gappyout option to trim columns based on their gap distribution. We inferred reference phylogenies using multithreaded RAxML [25] with the GAMMA model of rate heterogeneity, empirically determined base frequencies, and the LG substitution model [26] (PROTGAMMALGF). Branch support is based on 250 resampled bootstrap trees. This tree was then pruned to only allow a maximum average distance to the closest leaf (ADCL) of 0.003 to reduce the phylogenetic redundancy in the tree [27]. We then "placed" genomes that either did not pass the completeness threshold or were considered phylogenetically redundant by ADCL within the fixed reference phylogeny for each group using pplacer [28], representing each placed genome as a pendant edge in the final tree. We then examined the resulting tree and manually selected clade/ecotype cutoffs to be as consistent as possible with clade definitions previously outlined for these groups [29–32]. We then gave clades from each taxonomic group custom taxonomic identifiers and we added these identifiers to the MARMICRODB Kaiju taxonomic hierarchy.
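
The alignment column filter described above (drop columns covered by fewer than 50% of taxa, or with no residue above 25% frequency) can be sketched in Python as follows. This is an illustrative reimplementation, not the GTDB-Tk code: the function names are hypothetical, and measuring the residue frequency against all taxa (rather than against non-gap characters only) is an assumption.

```python
from collections import Counter
from typing import Dict, Sequence

GAP_CHARS = set("-.")


def keep_column(column: Sequence[str],
                min_coverage: float = 0.5,
                min_consensus: float = 0.25) -> bool:
    """Return True when a column passes both filters from the text.

    A column is dropped when fewer than `min_coverage` of taxa have a residue,
    or when no single residue occurs at a frequency greater than `min_consensus`.
    """
    n = len(column)
    residues = [c for c in column if c not in GAP_CHARS]
    if not residues or len(residues) < min_coverage * n:
        return False
    top_count = Counter(residues).most_common(1)[0][1]
    return top_count / n > min_consensus


def filter_alignment(seqs: Dict[str, str]) -> Dict[str, str]:
    """Keep only the columns that pass `keep_column`; sequences must be equal length."""
    columns = list(zip(*seqs.values()))
    kept = [i for i, col in enumerate(columns) if keep_column(col)]
    return {name: "".join(s[i] for i in kept) for name, s in seqs.items()}


if __name__ == "__main__":
    toy = {"g1": "AA-A", "g2": "AC-A", "g3": "AD-D", "g4": "A--E"}
    # Column 2 has no residue above 25%, column 3 is all gaps; both are dropped.
    print(filter_alignment(toy))
```

The gap-distribution trimming done with trimal and the tree inference itself are separate steps and are not reproduced here.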

    Software/databases used:
    checkM v1.0.11 [16]
    HMMER v3.1b2 (http://hmmer.org/)
    prodigal v2.6.3 [22]
    trimAl v1.4.rev22 [24]
    AliView v1.18.1 [33] [34]
    Phyx v0.1 [35]
    RAxML v8.2.12 [36]
    Pplacer v1.1alpha [28]
    GTDB-Tk v0.1.3 [19]
    Kaiju v1.6.0 [34]
    GTDB RS83 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release83/83.0/)
    NCBI Taxonomy (accessed 2018-07-02) [23]
    TIGRFAM v14.0 [37]
    PFAM v31.0 [38]

    Discussion/Caveats:
    MARMICRODB is optimized for metagenomic samples from the marine environment, in particular planktonic microbes from the pelagic euphotic zone. We expect this database may also be useful for classifying other types of marine metagenomic samples (for example mesopelagic, bathypelagic, or even benthic or marine host-associated), but it has not been tested as such. The original purpose of this database was to quantify clades/ecotypes of Prochlorococcus, Synechococcus, SAR11/Pelagibacterales, SAR86, and SAR116 in metagenomes from the Tara Oceans Expedition and the GEOTRACES project. We carefully annotated and quality controlled genomes from these five groups, but the processing of the other marine taxa was largely automated and unsupervised. Taxonomy for other groups was copied over from the Genome Taxonomy Database (GTDB) [19,39] and NCBI Taxonomy [23], so any inconsistencies in those databases will be propagated to MARMICRODB. For most use cases MARMICRODB can probably be used unmodified, but if the user's goal is to focus on a particular organism/clade that we did not curate in the database, then the user may wish to spend some time curating those genomes (i.e., checking for contamination, dereplicating, building a genome phylogeny for custom taxonomy node assignment). Currently the custom taxonomy is hardcoded in the MARMICRODB.fmi index, but if users wish to modify MARMICRODB by adding or removing genomes, or reconfiguring taxonomic ranks, the names.dmp and nodes.dmp files can easily be modified as well as the fasta file of protein sequences. However, the Kaiju index will need to be rebuilt, and user will require a high

  20. Human Gene and Protein Database (HGPD)

    • neuinfo.org
    • scicrunch.org
    • +1more
    Updated Nov 23, 2008
    (2008). Human Gene and Protein Database (HGPD) [Dataset]. http://identifiers.org/RRID:SCR_002889
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE. Documented on January 4, 2023. The Human Gene and Protein Database presents SDS-PAGE patterns and other information on human genes and proteins. The HGPD was constructed from full-length cDNAs. For conversion to Gateway entry clones, we first determined an open reading frame (ORF) region in each cDNA meeting the criteria. Those ORF regions were PCR-amplified utilizing selected resource cDNAs as templates. All the details of the construction and utilization of entry clones will be published elsewhere. Amino acid and nucleotide sequences of an ORF for each cDNA and sequence differences of Gateway entry clones from source cDNAs are presented in the GW: Gateway Summary window. Utilizing those clones with a very efficient cell-free protein synthesis system featuring wheat germ, we have produced a large number of human proteins in vitro. Expressed proteins were detected in almost all cases. Proteins in both total and supernatant fractions are shown in the PE: Protein Expression window. In addition, we have also successfully expressed proteins in HeLa cells and determined subcellular localizations of human proteins. These biological data are presented on the frame of cDNA clusters in the Human Gene and Protein Database. To build the basic frame of HGPD, sequences of FLJ full-length cDNAs and others deposited in public databases (Human ESTs, RefSeq, Ensembl, MGC, etc.) are assembled onto the genome sequences (NCBI Build 35 (UCSC hg17)). The majority of analysis data for cDNA sequences in HGPD are shared with the FLJ Human cDNA Database (http://flj.hinv.jp/), constructed as a human cDNA sequence analysis database focusing on mRNA varieties caused by variations in transcription start site (TSS) and splicing.


Peptide Sequence Database

RRID:SCR_005764, nlx_149230, Peptide Sequence Database (RRID:SCR_005764), PepSeqDB

