84 datasets found
  1. Conserved Domain Database

    • neuinfo.org
    • dknet.org
    • +2 more
    Updated Jan 22, 2025
    Cite
    (2025). Conserved Domain Database [Dataset]. http://identifiers.org/RRID:SCR_002077
    Dataset updated
    Jan 22, 2025
    Description

    Database of annotations of functional units in proteins, including multiple sequence alignment models for ancient domains and full-length proteins. This collection of models includes 3D structures that display the sequence/structure/function relationships in proteins. It also includes alignments of the domains to known three-dimensional protein structures in the MMDB database. The source databases are Pfam, SMART, and COG. Users can identify amino acids in protein sequences with the resources available, as well as view single sequences embedded within multiple sequence alignments.

  2. Determining the amino acid composition of every protein in Escherichia coli...

    • figshare.com
    xlsx
    Updated Dec 9, 2019
    Cite
    Wenfa Ng (2019). Determining the amino acid composition of every protein in Escherichia coli K-12 [Dataset]. http://doi.org/10.6084/m9.figshare.11341157.v1
    Available download formats: xlsx
    Dataset updated
    Dec 9, 2019
    Dataset provided by
    figshare
    Authors
    Wenfa Ng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Understanding the amino acid composition of every protein in a microorganism may be the first step in gaining an appreciation of the metabolic needs of a species. Specifically, the amino acid composition of proteins informs the cellular need for each amino acid, and this in turn provides a glimpse into the metabolic needs of the cell. This may hold implications for how the biosynthesis pathways for each amino acid are regulated. Information on the amino acid composition of an organism also guides the formulation of a suitable growth medium that enables the species to attain fast growth. However, all these discussions must be taken in light of the fact that not all genes in a genome are expressed at any point in time. Using in-house MATLAB software, the amino acid composition of every protein in Escherichia coli K-12 was calculated. Raw proteome information of the species was obtained from UniProt (accession number: UP000000625) and processed to yield a database comprising protein name, protein amino acid sequence, and the percentage of each amino acid in the protein sequence. Amino acid composition is presented as the percentage of each amino acid, identified by its single-letter abbreviation. Overall, the presented database should find use in fundamental studies aiming to calculate the total metabolic need of E. coli K-12 with respect to amino acid utilization, as well as in assessing which relatively heavily utilized amino acids may need supplementation in the growth medium.
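
The per-protein percentage calculation described above can be sketched in a few lines. This is a minimal Python illustration, not the author's MATLAB software: the function name `amino_acid_composition`, the choice to count only the 20 standard residues, and the toy sequence are all assumptions made for the example.

```python
from collections import Counter

# The 20 standard amino acids, by single-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the percentage of each standard amino acid in a protein sequence.

    Non-standard symbols (for example X or U) are ignored; that choice is an
    assumption of this sketch, not something the dataset description states.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Toy sequence for illustration only (not taken from the E. coli proteome).
composition = amino_acid_composition("MKKLLA")
print(round(composition["K"], 2), round(composition["L"], 2))
```

Applying the function to every sequence in a proteome file would give one row of 20 percentages per protein, which matches the layout the description outlines.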

  3. Peptide Sequence Database

    • dknet.org
    • rrid.site
    Updated Jan 19, 2011
    Cite
    (2011). Peptide Sequence Database [Dataset]. http://identifiers.org/RRID:SCR_005764
    Dataset updated
    Jan 19, 2011
    Description

    The Peptide Sequence Database contains putative peptide sequences from human, mouse, rat, and zebrafish. Compressed to eliminate redundancy, these are about 40-fold smaller than a brute-force enumeration. Current and old releases are available for download. Each species' peptide sequence database comprises peptide sequence data from relevant species-specific UniGene and IPI clusters, plus all sequences from their constituent EST, mRNA and protein sequence databases, namely RefSeq proteins and mRNAs, UniProt's SwissProt and TrEMBL, GenBank mRNA, ESTs, and high-throughput cDNAs, HInv-DB, VEGA, EMBL, IPI protein sequences, plus the enumeration of all combinations of UniProt sequence variants, Met loss PTM, and signal peptide cleavages. The README file contains some information about the non-amino-acid symbols O (digest site corresponding to a protein N- or C-terminus) and J (no digest sequence join) used in these peptide sequence databases, and information about how to configure various search engines to use them. Some search engines handle (very) long sequences badly and in some cases must be patched to use these peptide sequence databases. All search engines supported by the PepArML meta-search engine can (or can be patched to) successfully search these peptide sequence databases.

  4. Data from: Kabat Database of Sequences of Proteins of Immunological Interest...

    • neuinfo.org
    • dknet.org
    • +1 more
    Updated Dec 30, 2024
    Cite
    (2024). Kabat Database of Sequences of Proteins of Immunological Interest [Dataset]. http://identifiers.org/RRID:SCR_006465
    Dataset updated
    Dec 30, 2024
    Description

    The Kabat Database determines the combining site of antibodies based on the available amino acid sequences. The precise delineation of complementarity determining regions (CDR) of both light and heavy chains provides the first example of how properly aligned sequences can be used to derive structural and functional information of biological macromolecules. The Kabat database now includes nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules, and other proteins of immunological interest. The Kabat Database searching and analysis tools package is an ASP.NET web-based portal containing lookup tools, sequence matching tools, alignment tools, length distribution tools, positional correlation tools and much more. The searching and analysis tools are custom made for the aligned data sets contained in both the SQL Server and ASCII text flat file formats. The searching and analysis tools may be run on a single PC workstation or in a distributed environment. The analysis tools are written in ASP.NET and C# and are available in Visual Studio .NET 2003/2005/2008 formats. The Kabat Database was initially started in 1970 to determine the combining site of antibodies based on the available amino acid sequences at that time. Bence Jones proteins, mostly from human, were aligned, using the now-known Kabat numbering system, and a quantitative measure, variability, was calculated for every position. Three peaks, at positions 24-34, 50-56 and 89-97, were identified and proposed to form the complementarity determining regions (CDR) of light chains. Subsequently, antibody heavy chain amino acid sequences were also aligned using a different numbering system, since the locations of their CDRs (31-35B, 50-65 and 95-102) are different from those of the light chains. 
CDRL1 starts right after the first invariant Cys 23 of light chains, while CDRH1 is eight amino acid residues away from the first invariant Cys 22 of heavy chains. During the past 30 years, the Kabat database has grown to include nucleotide sequences, sequences of T cell receptors for antigens (TCR), major histocompatibility complex (MHC) class I and II molecules and other proteins of immunological interest. It has been used extensively by immunologists to derive useful structural and functional information from the primary sequences of these proteins.
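
The description does not spell out how variability is computed. A commonly cited definition (the Wu-Kabat variability) divides the number of different amino acids seen at a position by the frequency of the most common amino acid at that position. The Python sketch below assumes that definition; the function names and the toy alignment are hypothetical and are not taken from the Kabat tools.

```python
from collections import Counter


def wu_kabat_variability(column):
    """Variability of one alignment column, ignoring gap characters ('-').

    Assumed definition: (number of distinct residues) divided by
    (frequency of the most common residue).
    """
    residues = [aa for aa in column.upper() if aa != "-"]
    if not residues:
        raise ValueError("column has no residues")
    counts = Counter(residues)
    most_common_frequency = counts.most_common(1)[0][1] / len(residues)
    return len(counts) / most_common_frequency


def variability_profile(aligned_sequences):
    """Variability at every position of equal-length aligned sequences."""
    return [wu_kabat_variability("".join(col)) for col in zip(*aligned_sequences)]


# Toy alignment for illustration only (not real Bence Jones protein data).
toy_alignment = ["QSVL", "QSAL", "QTGL", "QSYL"]
profile = variability_profile(toy_alignment)
print(profile)
```

An invariant position scores 1, and the score grows as more residue types share a position, so peaks in such a profile are the kind of signal the description associates with the CDRs.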

  5. Human Gene and Protein Database (HGPD)

    • neuinfo.org
    • scicrunch.org
    Updated Nov 23, 2008
    Cite
    (2008). Human Gene and Protein Database (HGPD) [Dataset]. http://identifiers.org/RRID:SCR_002889
    Dataset updated
    Nov 23, 2008
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE. Documented on January 4, 2023. The Human Gene and Protein Database presents SDS-PAGE patterns and other information on human genes and proteins. The HGPD was constructed from full-length cDNAs. For conversion to Gateway entry clones, we first determined an open reading frame (ORF) region in each cDNA meeting the criteria. Those ORF regions were PCR-amplified utilizing selected resource cDNAs as templates. All the details of the construction and utilization of entry clones will be published elsewhere. Amino acid and nucleotide sequences of an ORF for each cDNA and sequence differences of Gateway entry clones from source cDNAs are presented in the GW: Gateway Summary window. Utilizing those clones with a very efficient cell-free protein synthesis system featuring wheat germ, we have produced a large number of human proteins in vitro. Expressed proteins were detected in almost all cases. Proteins in both total and supernatant fractions are shown in the PE: Protein Expression window. In addition, we have also successfully expressed proteins in HeLa cells and determined subcellular localizations of human proteins. These biological data are presented on the frame of cDNA clusters in the Human Gene and Protein Database. To build the basic frame of HGPD, sequences of FLJ full-length cDNAs and others deposited in public databases (Human ESTs, RefSeq, Ensembl, MGC, etc.) are assembled onto the genome sequences (NCBI Build 35 (UCSC hg17)). The majority of analysis data for cDNA sequences in HGPD are shared with the FLJ Human cDNA Database (http://flj.hinv.jp/), constructed as a human cDNA sequence analysis database focusing on mRNA varieties caused by variations in transcription start site (TSS) and splicing.

  6. Ribosomal protein database of Serratia marcescens strain SMO3

    • figshare.com
    xls
    Updated Mar 14, 2021
    Cite
    Wenfa Ng (2021). Ribosomal protein database of Serratia marcescens strain SMO3 [Dataset]. http://doi.org/10.6084/m9.figshare.14213543.v1
    Available download formats: xls
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    figshare
    Authors
    Wenfa Ng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This work presents the ribosomal protein database of Serratia marcescens strain SMO3. Original data for the work came from the annotated proteome data of the bacterium downloaded from UniProt. Using an in-house MATLAB ribosomal protein database analysis software, the original proteome data file was parsed to extract protein name and amino acid sequence of all ribosomal proteins in the species. The database also includes calculated variables such as number of residues, molecular weight, and nucleotide sequence. Overall, the presented database could serve as a ribosomal protein mass fingerprint for use in microbial identification, or it could be used in fundamental studies seeking to uncover new insights into ribosomal protein biology.
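
The parsing step described above (read a UniProt proteome file, keep the ribosomal proteins, compute residue count and molecular weight) might look roughly like the following. This is a hedged Python sketch, not the author's MATLAB software. It assumes UniProt-style FASTA headers of the form `>db|ACCESSION|ENTRY Protein name OS=...`, uses a crude keyword filter on the protein name, and uses standard average residue masses. The two records in `example` are invented, and the nucleotide sequence column is not covered.

```python
# Average residue masses in daltons; one water is added per chain.
AVERAGE_RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886, "C": 103.1388,
    "E": 129.1155, "Q": 128.1307, "G": 57.0519, "H": 137.1411, "I": 113.1594,
    "L": 113.1594, "K": 128.1741, "M": 131.1926, "F": 147.1766, "P": 97.1167,
    "S": 87.0782, "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.01528


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def protein_name(header):
    """Extract the protein name, assuming a UniProt-style header."""
    description = header.split(" ", 1)[1] if " " in header else header
    return description.split(" OS=", 1)[0]


def molecular_weight(sequence):
    """Average molecular weight; raises KeyError on non-standard residues."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


def ribosomal_table(fasta_text, keyword="ribosomal"):
    """Keep proteins whose name contains the keyword (a crude, assumed filter)."""
    rows = []
    for header, sequence in parse_fasta(fasta_text):
        name = protein_name(header)
        if keyword in name.lower():
            rows.append({
                "name": name,
                "sequence": sequence,
                "residues": len(sequence),
                "molecular_weight": molecular_weight(sequence),
            })
    return rows


# Hypothetical records for illustration; not real UniProt entries.
example = """>sp|X00001|RL0_TOY 50S ribosomal protein L0 OS=Toy organism
MKG
A
>sp|X00002|ENO_TOY Enolase OS=Toy organism
MSK
"""
rows = ribosomal_table(example)
print(rows[0]["name"], rows[0]["residues"], round(rows[0]["molecular_weight"], 2))
```

A keyword filter of this kind would also match names such as ribosomal RNA methyltransferases, so a real pipeline would likely need a stricter rule.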

  7. Ribosomal protein database of Klebsiella pneumoniae

    • figshare.com
    xlsx
    Updated Feb 6, 2021
    Cite
    Wenfa Ng (2021). Ribosomal protein database of Klebsiella pneumoniae [Dataset]. http://doi.org/10.6084/m9.figshare.13724617.v1
    Available download formats: xlsx
    Dataset updated
    Feb 6, 2021
    Dataset provided by
    figshare
    Authors
    Wenfa Ng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This work presents the ribosomal protein database of Klebsiella pneumoniae. Original data for the work came from the annotated proteome data of the bacterium downloaded from UniProt. Using an in-house MATLAB ribosomal protein database analysis software, the original proteome data file was parsed to extract protein name and amino acid sequence of all ribosomal proteins in the species. The database also includes calculated variables such as number of residues, molecular weight, and nucleotide sequence. Overall, the presented database could serve as a ribosomal protein mass fingerprint for use in microbial identification, or it could be used in fundamental studies seeking to uncover new insights into ribosomal protein biology.

  8. Data from: PMD

    • integbio.jp
    Updated Jun 17, 2013
    Cite
    National Institute of Genetics (2013). PMD [Dataset]. https://integbio.jp/dbcatalog/en/record/nbdc00170
    Dataset updated
    Jun 17, 2013
    Dataset provided by
    National Institute of Genetics
    Description

    PMD is a collection of literature data related to natural and artificial mutants for all kinds of proteins except members of the globin and immunoglobulin families. PMD treats an article as one entry, and each entry contains data about one or more protein mutants. This database allows users to search articles by keywords or amino acid sequences using BLAST. Each entry in the search results shows details of the article (authors, journal, MEDLINE link, etc.) and of the mutation (the position of the amino acid substitution, insertion or deletion, kind of mutation, functional or structural features, etc.). The authors of articles in academic journals can submit protein mutant data to PMD.

  9. Artificial Selected Proteins/Peptides Database

    • neuinfo.org
    • scicrunch.org
    • +1 more
    Updated Mar 22, 2025
    Cite
    (2025). Artificial Selected Proteins/Peptides Database [Dataset]. http://identifiers.org/RRID:SCR_007557
    Dataset updated
    Mar 22, 2025
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on June 04, 2014. Curated database of proteins and peptides selected from randomized pools, designed to accumulate experimental data on protein functionality obtained by in vitro directed evolution methods (phage display, ribosome display, SIP, etc.). ASPD is integrated by means of hyperlinks with different databases (SWISS-PROT, PDB, PROSITE, etc.). The database also contains modules for pairwise correlation analysis and BLAST search.

  10. Proteome database of Aeromonas hydrophila AH1

    • figshare.com
    xlsx
    Updated Dec 24, 2019
    Cite
    Wenfa Ng (2019). Proteome database of Aeromonas hydrophila AH1 [Dataset]. http://doi.org/10.6084/m9.figshare.11447622.v1
    Available download formats: xlsx
    Dataset updated
    Dec 24, 2019
    Dataset provided by
    figshare
    Authors
    Wenfa Ng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Resistant to most common antibiotics and cold temperatures, Aeromonas hydrophila is a microorganism that could emerge as a human pathogen of concern. More importantly, its ability to live in aerobic and anaerobic environments endows it with the capability to infect body wounds, which would result in morbidity. Finally, its ability to use a variety of organic substrates as carbon sources further positions A. hydrophila as a potential pathogen, given its ability to subsist on small amounts of nutrients in the interstitial fluid that bathes human cells and tissues. This work sought to illuminate the fundamental biology of A. hydrophila AH1 through parsing the UniProt proteome file of the bacterium using in-house MATLAB software. Specifically, a proteome database comprising protein name, amino acid sequence, number of residues, molecular weight and nucleotide sequence was built that would help guide experiments to deduce the nutritional and environmental conditions most conducive to the growth of this versatile microorganism. Finally, proteomic information would serve as a useful reference for thinking about possible interventions or therapeutic strategies, especially as more research confirms A. hydrophila as an emerging human pathogen.

  11. Animal Genome Database

    • dknet.org
    • rrid.site
    • +1 more
    Updated Mar 8, 2025
    Cite
    (2025). Animal Genome Database [Dataset]. http://identifiers.org/RRID:SCR_008165/resolver
    Dataset updated
    Mar 8, 2025
    Description

    Database of comparative gene mapping between species to assist the mapping of the genes related to phenotypic traits in livestock. The linkage maps, cytogenetic maps, polymerase chain reaction primers of pig, cattle, mouse and human, and their references have been included in the database, and the correspondence among species has been stipulated in the database. AGP is an animal genome database developed on a Unix workstation and maintained by a relational database management system. It is a joint project of the National Institute of Agrobiological Sciences (NIAS) and the Institute of the Society for Techno-innovation of Agriculture, Forestry and Fisheries (STAFF-Institute), in cooperation with other related research institutes. AGP also contains the Pig Expression Data Explorer (PEDE), a database of porcine EST collections derived from full-length cDNA libraries and full-length sequences of the cDNA clones picked from the EST collection. The EST sequences have been clustered and assembled, and their similarity to sequences in RefSeq and UniGene determined. The PEDE database system was constructed to store sequences and similarity data of swine full-length cDNA libraries and to make them available to users. It provides interfaces for keyword and ID searches of BLAST results and enables users to obtain sequence data and names of clones of interest. Putative SNPs in EST assemblies have been classified according to breed specificity and their effect on coding amino acids, and the assemblies are equipped with an SNP search interface. The database contains porcine nucleotide sequences and cDNA clones that are ready for analyses such as expression in mammalian cells, because of their high likelihood of containing full-length CDS. PEDE will be useful for researchers who want to explore genes that may be responsible for traits such as disease susceptibility. 
The database also offers information regarding major and minor porcine-specific antigens, which might be investigated in regard to the use of pigs as models in various medical research applications.

  12. Proteome database of Pseudomonas protegens Pf-5

    • figshare.com
    xlsx
    Updated Dec 20, 2019
    Cite
    Wenfa Ng (2019). Proteome database of Pseudomonas protegens Pf-5 [Dataset]. http://doi.org/10.6084/m9.figshare.11416824.v1
    Available download formats: xlsx
    Dataset updated
    Dec 20, 2019
    Dataset provided by
    figshare
    Authors
    Wenfa Ng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Harbouring a large genome by bacterial standards, the Gram-negative bacterium Pseudomonas protegens Pf-5 possesses a versatile metabolism that enables it to colonize a variety of environments. More importantly, P. protegens Pf-5 has found applications in biotechnology, such as its emerging role as a useful bacterium for leaching precious metals from electronic scrap materials. Biotechnology potential aside, P. protegens Pf-5 is fundamentally interesting given its large genome harbouring hitherto unknown metabolic and signalling pathways that may find use in biotechnology. This work sought to provide some fundamental information about P. protegens Pf-5 through constructing a proteome database of the organism at the global level. Parsed by in-house MATLAB software, the database comprises protein name, amino acid sequence, number of residues, molecular weight and corresponding nucleotide sequence of the protein. The latter three features were calculated using MATLAB built-in functions. Overall, the proteome database of P. protegens Pf-5 should inform researchers of the ensemble of proteins detected in the species, which forms the foundation on which a variety of fundamental and applied microbiology studies could be tackled.

  13. GTDB Rel. 202 FastAAI databaseFastAAI Databases

    • figshare.com
    application/gzip
    Updated May 3, 2024
    Cite
    Kenji Gerhardt (2024). GTDB Rel. 202 FastAAI databaseFastAAI Databases [Dataset]. http://doi.org/10.6084/m9.figshare.25746876.v1
    Available download formats: application/gzip
    Dataset updated
    May 3, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kenji Gerhardt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database containing a dataset for FastAAI version 1.

  14. Protein Post Translational Modifications

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). Protein Post Translational Modifications [Dataset]. https://www.johnsnowlabs.com/marketplace/protein-post-translational-modifications/
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    N/A
    Description

    This dataset includes protein post-translational modifications, as well as associated annotation data obtained from the Biological General Repository for Interaction Datasets (BioGRID), for major model organism species, including the type of modification, the protein sequence and the specific amino acid involved.

  15. Relative frequency of amino acid substitutions.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    János Molnár; Gergely Szakács; Gábor E. Tusnády (2023). Relative frequency of amino acid substitutions. [Dataset]. http://doi.org/10.1371/journal.pone.0151760.t004
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    János Molnár; Gergely Szakács; Gábor E. Tusnády
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mutated amino acids are shown in rows; mutant amino acids are shown in columns associated with diseases.

  16. Global import data of Amino Acid

    • volza.com
    csv
    Updated Mar 25, 2025
    Cite
    Volza FZ LLC (2025). Global import data of Amino Acid [Dataset]. https://www.volza.com/p/amino-acid/import/import-in-united-states/
    Available download formats: csv
    Dataset updated
    Mar 25, 2025
    Dataset authored and provided by
    Volza FZ LLC
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Count of importers, Sum of import value, 2014-01-01/2021-09-30, Count of import shipments
    Description

    10,244 global import shipment records of amino acid with prices, volumes and current buyer-supplier relationships, based on an actual global export trade database.

  17. Database of Interacting Proteins (DIP)

    • neuinfo.org
    • scicrunch.org
    • +2 more
    Updated Mar 19, 2025
    Cite
    (2025). Database of Interacting Proteins (DIP) [Dataset]. http://identifiers.org/RRID:SCR_003167
    Dataset updated
    Mar 19, 2025
    Description

    Database to catalog experimentally determined interactions between proteins, combining information from a variety of sources to create a single, consistent set of protein-protein interactions that can be downloaded in a variety of formats. The data were curated both manually and automatically, using computational approaches that utilize the knowledge about the protein-protein interaction networks extracted from the most reliable, core subset of the DIP data. Because the reliability of experimental evidence varies widely, methods of quality assessment have been developed and utilized to identify the most reliable subset of the interactions. This CORE set can be used as a reference when evaluating the reliability of high-throughput protein-protein interaction data sets, for the development of prediction methods, and in studies of the properties of protein interaction networks. Tools are available to analyze, visualize and integrate users' own experimental data with the information about protein-protein interactions available in the DIP database. The DIP database lists protein pairs that are known to interact with each other. By interact they mean that two amino acid chains were experimentally identified to bind to each other. The database lists such pairs to aid those studying a particular protein-protein interaction, as well as those investigating entire regulatory and signaling pathways and those studying the organization and complexity of the protein interaction network at the cellular level. Registration is required to gain access to most of the DIP features. Registration is free to members of the academic community. Trial accounts for commercial users are also available.

  18. CADB - Conformational Angles DataBase of Proteins

    • neuinfo.org
    • dknet.org
    Updated Mar 22, 2025
    Cite
    (2025). CADB - Conformational Angles DataBase of Proteins [Dataset]. http://identifiers.org/RRID:SCR_007573/resolver?q=&i=rrid
    Dataset updated
    Mar 22, 2025
    Description

    Conformation Angles DataBase is a comprehensive, authoritative and timely knowledge base developed to facilitate retrieval of information related to the conformational angles (main-chain and side-chain) of the amino acid residues present in the non-redundant (both 25% and 90%) data set. The database includes the options of determining the dependency of the conformation angles of a particular residue upon the flanking residues in main-chain, doublet analysis, triplet analysis and analysis of a particular protein structure. It is worth mentioning that for all the options, a user-friendly and convenient Java Graphical User Interface (GUI) has been provided to display the output on the client machine.
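
A conformational angle of the kind catalogued here is a dihedral (torsion) angle defined by four consecutive atoms, for example C(i-1), N, CA, C for the main-chain phi angle and N, CA, C, N(i+1) for psi. The Python sketch below shows the standard vector calculation; it is an illustration only and is not code from CADB. The function name and the test coordinates are made up for the example.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points in 3D.

    The angle is measured between the plane through (p0, p1, p2) and the
    plane through (p1, p2, p3), in the range -180 to 180.
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the first plane
    n2 = _cross(b1, b2)  # normal of the second plane
    b1_length = math.sqrt(_dot(b1, b1))
    b1_unit = tuple(x / b1_length for x in b1)
    x = _dot(n1, n2)
    y = _dot(b1_unit, _cross(n1, n2))
    return math.degrees(math.atan2(y, x))


# Made-up coordinates: the last atom is rotated a quarter turn about the z axis.
angle = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(angle)
```

Feeding the backbone atom coordinates of each residue from a structure file into such a function would give per-residue phi and psi values of the sort the database tabulates.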

  19. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    Updated Dec 27, 2024
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. http://doi.org/10.34740/kaggle/dsv/10315204
    Dataset updated
    Dec 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rafael Gallo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
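
The first two steps above can be sketched as follows. The description says Biopython was used for the property calculations; this minimal Python version instead uses only the standard library and the published Kyte-Doolittle hydropathy values, so it is a stand-in and not the dataset author's code. Uniform sampling of residues, the fixed seed, and the `P00000`-style identifiers are assumptions made for illustration.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "".join(sorted(KYTE_DOOLITTLE))


def random_sequence(rng, min_len=50, max_len=300):
    """Draw a random amino acid chain with a length in [min_len, max_len]."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mean_hydropathy(sequence):
    """Average Kyte-Doolittle hydropathy over the whole sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = []
for i in range(5):
    seq = random_sequence(rng)
    records.append({
        "ID_Protein": f"P{i:05d}",  # hypothetical identifier format
        "Sequence": seq,
        "Sequence_Length": len(seq),
        "Hydrophobicity": mean_hydropathy(seq),
    })

for record in records:
    print(record["ID_Protein"], record["Sequence_Length"], round(record["Hydrophobicity"], 3))
```

Because residues and classes are drawn at random, any classifier trained on such data should perform near chance, which is consistent with the limitations the description lists.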

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  20. Ribosomal protein database of Acinetobacter baumannii

    • dataverse.harvard.edu
    • dataone.org
    Updated Dec 19, 2020
    Cite
    Wenfa Ng (2020). Ribosomal protein database of Acinetobacter baumannii [Dataset]. http://doi.org/10.7910/DVN/ZFXEAV
    Dataset updated
    Dec 19, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Wenfa Ng
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This work presents the ribosomal protein database of Acinetobacter baumannii. Original data for the work came from the annotated proteome data of the bacterium downloaded from UniProt. Using an in-house MATLAB ribosomal protein database analysis software, the original proteome data file was parsed to extract protein name and amino acid sequence of all ribosomal proteins in the species. The database also includes calculated variables such as number of residues, molecular weight, and nucleotide sequence. Overall, the presented database could serve as a ribosomal protein mass fingerprint for use in microbial identification, or it could be used in fundamental studies seeking to uncover new insights into ribosomal protein biology.

Cite
(2025). Conserved Domain Database [Dataset]. http://identifiers.org/RRID:SCR_002077

Conserved Domain Database

RRID:SCR_002077, nif-0000-02647, Conserved Domain Database (RRID:SCR_002077), CDD, Conserved Domains Database, Conserved Domains

14 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jan 22, 2025