100+ datasets found
  1. CDD

    • ebi.ac.uk
    Updated Apr 18, 2024
    Cite
    (2024). CDD [Dataset]. https://www.ebi.ac.uk/interpro/
    Dataset updated
    Apr 18, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CDD is a protein annotation resource that consists of a collection of annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domain models, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases.

  2. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    zip
    Updated Dec 27, 2024
    + more versions
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated
    Available download formats: zip (12928905 bytes)
    Dataset updated
    Dec 27, 2024
    Authors
    Rafael Gallo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."

    Description

    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:

    • UniProt: A comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: Calculations of hydrophobicity.
    • Biopython: A tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:

    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
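
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the author's generation script: it uses plain Python instead of Biopython, assumes residues are drawn uniformly from the 20 standard amino acids, and uses a hypothetical ID format (`P00000`). Only the length range, the column names, the five classes, and the Kyte-Doolittle scale come from the description.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def random_sequence(rng, min_len=50, max_len=300):
    """Step 1: random chain of 50 to 300 residues (uniform sampling is an assumption)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def average_hydrophobicity(sequence):
    """Step 2 (one property only): mean Kyte-Doolittle value over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def make_record(rng, protein_id):
    """Build one row; step 3 assigns the class at random."""
    sequence = random_sequence(rng)
    return {
        "ID_Protein": protein_id,  # hypothetical ID format
        "Sequence": sequence,
        "Hydrophobicity": average_hydrophobicity(sequence),
        "Sequence_Length": len(sequence),
        "Class": rng.choice(CLASSES),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
records = [make_record(rng, f"P{i:05d}") for i in range(3)]
for record in records:
    print(record["ID_Protein"], record["Sequence_Length"],
          round(record["Hydrophobicity"], 3), record["Class"])
```

Because the class is drawn independently of the sequence, a classifier trained on such data should not be expected to beat chance, which is consistent with the limitations listed for this dataset.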

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:

    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  3. NCBIFAM

    • ebi.ac.uk
    Updated Aug 6, 2025
    Cite
    (2025). NCBIFAM [Dataset]. https://www.ebi.ac.uk/interpro/
    Dataset updated
    Aug 6, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NCBIfam is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. NCBIfam is maintained at the National Center for Biotechnology Information (Bethesda, MD). NCBIfam includes models from TIGRFAMs, another database of protein families developed at The Institute for Genomic Research, then at the J. Craig Venter Institute (Rockville, MD, US).

  4. Protein Sequence Analysis Tool Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jan 5, 2026
    Cite
    Data Insights Market (2026). Protein Sequence Analysis Tool Report [Dataset]. https://www.datainsightsmarket.com/reports/protein-sequence-analysis-tool-1425541
    Available download formats: ppt, doc, pdf
    Dataset updated
    Jan 5, 2026
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The size of the Protein Sequence Analysis Tool market was valued at USD XXX million in 2024 and is projected to reach USD XXX million by 2033, with an expected CAGR of XX% during the forecast period.

  5. Structural Protein Sequences

    • kaggle.com
    zip
    Updated Feb 3, 2018
    Cite
    SHAHIR (2018). Structural Protein Sequences [Dataset]. https://www.kaggle.com/datasets/shahir/protein-data-set
    Available download formats: zip (28782775 bytes)
    Dataset updated
    Feb 3, 2018
    Authors
    SHAHIR
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    This is a protein data set retrieved from Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

    The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.

    The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.

    Content

    There are two data files. Both are keyed on the "structureId" of the protein:

    • pdb_data_no_dups.csv contains protein metadata, including details on protein classification, extraction methods, etc.

    • data_seq.csv contains more than 400,000 protein structure sequences.
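
Since both files share the "structureId" key, the metadata and sequences can be joined on it. The sketch below uses small in-memory stand-ins rather than the real files: the IDs, the toy sequences, and the `classification` and `sequence` column names are assumptions made for illustration; only `structureId` is taken from the description above.

```python
import csv
import io

# Hypothetical miniature stand-ins for pdb_data_no_dups.csv and data_seq.csv.
META_CSV = """structureId,classification
ID01,CLASS_A
ID02,CLASS_B
"""

SEQ_CSV = """structureId,sequence
ID01,MKTAYIAK
ID02,GSHMLEDP
ID02,GSHML
ID99,AAAA
"""


def join_on_structure_id(meta_text, seq_text):
    """Attach metadata to every sequence row that has a matching structureId."""
    meta = {row["structureId"]: row for row in csv.DictReader(io.StringIO(meta_text))}
    joined = []
    for row in csv.DictReader(io.StringIO(seq_text)):
        info = meta.get(row["structureId"])
        if info is None:
            continue  # sequence without metadata: skipped (inner-join behavior)
        joined.append({**info, **row})
    return joined


rows = join_on_structure_id(META_CSV, SEQ_CSV)
for row in rows:
    print(row["structureId"], row["classification"], row["sequence"])
```

One structure can contribute several sequence rows, so the join is one-to-many; the sketch keeps every matching sequence row and drops rows that have no metadata.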

    Acknowledgements

    Original data set downloaded from http://www.rcsb.org/pdb/

    Inspiration

    Protein databases have helped the life science community study different diseases and develop new drugs and solutions that support human survival.

  6. PROSITE profiles

    • ebi.ac.uk
    Updated Feb 5, 2025
    Cite
    (2025). PROSITE profiles [Dataset]. https://www.ebi.ac.uk/interpro/
    Dataset updated
    Feb 5, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs. PROSITE is based at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland.

  7. Bacterial_conserved_protein_for_selection_test-protein.fa

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Aug 15, 2018
    + more versions
    Cite
    Xuepeng Sun (2018). Bacterial_conserved_protein_for_selection_test-protein.fa [Dataset]. http://doi.org/10.6084/m9.figshare.6972455.v1
    Available download formats: txt
    Dataset updated
    Aug 15, 2018
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Xuepeng Sun
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Protein sequences for the 36 conserved bacterial proteins subjected to evolutionary analysis.

  8. Protein Sequence Analysis Tool Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jan 25, 2026
    Cite
    Archive Market Research (2026). Protein Sequence Analysis Tool Report [Dataset]. https://www.archivemarketresearch.com/reports/protein-sequence-analysis-tool-36100
    Available download formats: ppt, doc, pdf
    Dataset updated
    Jan 25, 2026
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The size of the Protein Sequence Analysis Tool market was valued at USD XXX million in 2024 and is projected to reach USD XXX million by 2033, with an expected CAGR of XX % during the forecast period.

  9. Protein Sequence Analysis Tool Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jan 5, 2026
    Cite
    Data Insights Market (2026). Protein Sequence Analysis Tool Report [Dataset]. https://www.datainsightsmarket.com/reports/protein-sequence-analysis-tool-1941839
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jan 5, 2026
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Protein Sequence Analysis Tool market is booming, projected to reach $7.8B by 2033 (CAGR 12%). This in-depth analysis explores market drivers, trends, restraints, and key players, including Waters Corp and Thermo Fisher. Discover insights into software, services, and regional market shares for biopharma, clinical diagnostics, and research.

  10. PIRSF

    • ebi.ac.uk
    Updated Apr 7, 2020
    Cite
    (2020). PIRSF [Dataset]. https://www.ebi.ac.uk/interpro/
    Dataset updated
    Apr 7, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The PIRSF protein classification system is a network with multiple levels of sequence diversity, from superfamilies to subfamilies, that reflects the evolutionary relationship of full-length proteins and domains. PIRSF is based at the Protein Information Resource, Georgetown University Medical Center, Washington DC, US.

  11. Data from: PROSITE

    • prosite.expasy.org
    • toothandnail-mailorder.com
    • +7more
    Updated Oct 15, 2025
    Cite
    (2025). PROSITE [Dataset]. https://prosite.expasy.org/
    Dataset updated
    Oct 15, 2025
    Description

    PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.

  12. CDTree

    • datadiscovery.nlm.nih.gov
    • datahub.hhs.gov
    • +4more
    csv, xlsx, xml
    Updated Jun 30, 2021
    + more versions
    Cite
    (2021). CDTree [Dataset]. https://datadiscovery.nlm.nih.gov/NLM-Products-and-Services/CDTree/vkhf-hsp7
    Available download formats: xml, xlsx, csv
    Dataset updated
    Jun 30, 2021
    Description

    CDTree is a stand-alone application for classifying protein sequences and investigating their evolutionary relationships. CDTree can import, analyze and update existing Conserved Domain (CDD) records and hierarchies, and also allows users to create their own.

  13. CD Search (Conserved Domain Search Service)

    • datadiscovery.nlm.nih.gov
    • odgavaprod.ogopendata.com
    • +4more
    csv, xlsx, xml
    Updated Jun 30, 2021
    Cite
    (2021). CD Search (Conserved Domain Search Service) [Dataset]. https://datadiscovery.nlm.nih.gov/Biology/CD-Search-Conserved-Domain-Search-Service-/j6ef-yjai
    Available download formats: csv, xml, xlsx
    Dataset updated
    Jun 30, 2021
    Description

    Identifies the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (Reverse Position-Specific BLAST) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD).

  14. Mice Protein

    • kaggle.com
    zip
    Updated Dec 16, 2020
    Cite
    Muhammet Varlı (2020). Mice Protein [Dataset]. https://www.kaggle.com/datasets/muhammetvarl/mice-protein/data
    Available download formats: zip (434808 bytes)
    Dataset updated
    Dec 16, 2020
    Authors
    Muhammet Varlı
    Description

    Source: UCI, 2015.

    Please cite: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126.

    Expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.

    The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. There are 38 control mice and 34 trisomic mice (Down syndrome), for a total of 72 mice. In the experiments, 15 measurements were registered of each protein per sample/mouse. Therefore, for control mice, there are 38x15, or 570 measurements, and for trisomic mice, there are 34x15, or 510 measurements. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample/mouse.

    The eight classes of mice are described based on features such as genotype, behavior and treatment. According to genotype, mice can be control or trisomic. According to behavior, some mice have been stimulated to learn (context-shock) and others have not (shock-context). To assess the effect of the drug memantine in recovering the ability to learn in trisomic mice, some mice have been injected with the drug and others have not.

    Classes:

    • c-CS-s: control mice, stimulated to learn, injected with saline (9 mice)
    • c-CS-m: control mice, stimulated to learn, injected with memantine (10 mice)
    • c-SC-s: control mice, not stimulated to learn, injected with saline (9 mice)
    • c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice)
    • t-CS-s: trisomy mice, stimulated to learn, injected with saline (7 mice)
    • t-CS-m: trisomy mice, stimulated to learn, injected with memantine (9 mice)
    • t-SC-s: trisomy mice, not stimulated to learn, injected with saline (9 mice)
    • t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (9 mice)
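
As a quick consistency check, the per-class mouse counts can be tallied against the totals quoted in the description (38 control, 34 trisomic, 72 mice, 1080 measurements per protein). The sketch below only restates numbers given in this entry; the variable names are illustrative.

```python
# Mouse counts per class, as listed in the description.
CLASS_SIZES = {
    "c-CS-s": 9, "c-CS-m": 10, "c-SC-s": 9, "c-SC-m": 10,
    "t-CS-s": 7, "t-CS-m": 9, "t-SC-s": 9, "t-SC-m": 9,
}
MEASUREMENTS_PER_MOUSE = 15

# Class labels starting with "c-" are control mice, "t-" are trisomic mice.
control_mice = sum(n for label, n in CLASS_SIZES.items() if label.startswith("c-"))
trisomic_mice = sum(n for label, n in CLASS_SIZES.items() if label.startswith("t-"))
total_mice = control_mice + trisomic_mice

control_rows = control_mice * MEASUREMENTS_PER_MOUSE
trisomic_rows = trisomic_mice * MEASUREMENTS_PER_MOUSE
total_rows = control_rows + trisomic_rows

print(control_mice, trisomic_mice, total_mice)   # 38 34 72
print(control_rows, trisomic_rows, total_rows)   # 570 510 1080
```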

    The aim is to identify subsets of proteins that are discriminant between the classes.

    Attribute Information:

    1 Mouse ID
    2..78 Values of expression levels of 77 proteins; the names of proteins are followed by "_n", indicating that they were measured in the nuclear fraction. For example: DYRK1A_n
    79 Genotype: control (c) or trisomy (t)
    80 Treatment type: memantine (m) or saline (s)
    81 Behavior: context-shock (CS) or shock-context (SC)
    82 Class: c-CS-s, c-CS-m, c-SC-s, c-SC-m, t-CS-s, t-CS-m, t-SC-s, t-SC-m
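
The class label in column 82 encodes the same information as columns 79 to 81, in the order genotype, behavior, treatment. A minimal sketch of decoding it, with the helper and type names chosen here for illustration:

```python
from typing import NamedTuple


class MouseClass(NamedTuple):
    genotype: str
    behavior: str
    treatment: str


# Codes as given in the attribute list.
GENOTYPES = {"c": "control", "t": "trisomy"}
BEHAVIORS = {"CS": "context-shock", "SC": "shock-context"}
TREATMENTS = {"m": "memantine", "s": "saline"}


def parse_class(label):
    """Split a label such as 't-CS-m' into its three components."""
    genotype, behavior, treatment = label.split("-")
    return MouseClass(GENOTYPES[genotype], BEHAVIORS[behavior], TREATMENTS[treatment])


print(parse_class("t-CS-m"))
```

An unknown code raises a `KeyError`, which is convenient for catching malformed labels early.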
    
  15. Encoding of amino acids, deletions and missing protein sequence data after...

    • plos.figshare.com
    xls
    Updated Oct 5, 2023
    Cite
    Luryane F. Souza; Hernane B. de B. Pereira; Tarcisio M. da Rocha Filho; Bruna A. S. Machado; Marcelo A. Moret (2023). Encoding of amino acids, deletions and missing protein sequence data after alignment. [Dataset]. http://doi.org/10.1371/journal.pone.0287880.t001
    Available download formats: xls
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Luryane F. Souza; Hernane B. de B. Pereira; Tarcisio M. da Rocha Filho; Bruna A. S. Machado; Marcelo A. Moret
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code based on molecular structure of amino acid side chains by Chaudhuri et al. [18].

  16. MScDB: A Mass Spectrometry-centric Protein Sequence Database for Proteomics

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xlsx
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Harald Marx; Simone Lemeer; Susan Klaeger; Thomas Rattei; Bernhard Kuster (2023). MScDB: A Mass Spectrometry-centric Protein Sequence Database for Proteomics [Dataset]. http://doi.org/10.1021/pr400215r.s003
    Available download formats: xlsx
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Harald Marx; Simone Lemeer; Susan Klaeger; Thomas Rattei; Bernhard Kuster
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Protein sequence databases are indispensable tools for life science research including mass spectrometry (MS)-based proteomics. In current database construction processes, sequence similarity clustering is used to reduce redundancies in the source data. Albeit powerful, it ignores the peptide-centric nature of proteomic data and the fact that MS is able to distinguish similar sequences. Therefore, we introduce an approach that structures the protein sequence space at the peptide level using theoretical and empirical information from large-scale proteomic data to generate a mass spectrometry-centric protein sequence database (MScDB). The core modules of MScDB are an in-silico proteolytic digest and a peptide-centric clustering algorithm that groups protein sequences that are indistinguishable by mass spectrometry. Analysis of various MScDB use cases against five complex human proteomes resulted in 69 peptide identifications not present in UniProtKB as well as 79 putative single amino acid polymorphisms. MScDB retains ∼99% of the identifications in comparison to common databases despite a 3–48% increase in the theoretical peptide search space (but comparable protein sequence space). In addition, MScDB enables cross-species applications such as human/mouse graft models, and our results suggest that the uncertainty in protein assignments to one species can be smaller than 20%.
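
To make the two core modules more concrete, here is a simplified sketch of the idea, not the MScDB implementation. It assumes a trypsin-like rule (cleave after K or R unless followed by P) and treats two proteins as indistinguishable when they yield exactly the same set of peptides; both choices, and all function names, are assumptions for illustration.

```python
from collections import defaultdict


def tryptic_digest(sequence):
    """In-silico digest: cleave after K or R, except when the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    return peptides


def group_by_peptides(proteins):
    """Group protein names whose digests give identical peptide sets."""
    groups = defaultdict(list)
    for name, sequence in proteins.items():
        groups[frozenset(tryptic_digest(sequence))].append(name)
    return sorted(sorted(names) for names in groups.values())


# Toy sequences, invented for this example.
toy_proteins = {"p1": "AKGR", "p2": "GRAK", "p3": "AKGRC"}
print(tryptic_digest("AKRPGK"))
print(group_by_peptides(toy_proteins))
```

Here `p1` and `p2` differ as sequences but produce the same peptides, so a peptide-level view places them in one group, whereas `p3` carries an extra peptide and stays separate.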

  17. SARS-CoV-2 vs. Homo sapiens BLASTP protein sequence analysis results

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 21, 2020
    Cite
    Arumugham, Vinu (2020). SARS-CoV-2 vs. Homo sapiens BLASTP protein sequence analysis results [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3758729
    Dataset updated
    Apr 21, 2020
    Authors
    Arumugham, Vinu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SARS-CoV-2 vs. Homo sapiens BLASTP protein sequence analysis results

  18. Data_Sheet_1_Identification and Analysis of Long Repeats of Proteins at the...

    • figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    David Mary Rajathei; Subbiah Parthasarathy; Samuel Selvaraj (2023). Data_Sheet_1_Identification and Analysis of Long Repeats of Proteins at the Domain Level.xlsx [Dataset]. http://doi.org/10.3389/fbioe.2019.00250.s001
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    David Mary Rajathei; Subbiah Parthasarathy; Samuel Selvaraj
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Amino acid repeats play an important role in the structure and function of proteins. Analysis of long repeats in protein sequences enables one to understand their abundance, structure and function in the protein universe. In the present study, amino acid repeats of length >50 (long repeats) were identified in a non-redundant set of UniProt sequences using the RADAR program. The underlying structures and functions of these long repeats were analyzed using Gene3D for structural domains, Pfam for functional domains, and enzyme and non-enzyme functional classification for the catalytic and binding functions of the proteins. From a structural perspective, these long repeats seem to occur predominantly in certain architectures such as sandwich, bundle, barrel, and roll and, within these architectures, to be abundant in the superfolds. The lengths of the repeats within each fold are not uniform, exhibiting different structures for different functions. We also observed that long repeats are in the domain regions of the family and are involved in the function of the proteins. After grouping based on enzyme and non-enzyme classes, we observed the abundant occurrence of long repeats in specific catalytic and binding functions of the proteins. In this study, we have analyzed the occurrence of long repeats in the protein sequence universe, apart from well-characterized short tandem repeats, together with their structures and functions at the domain level. The present study suggests that long repeats may play an important role in the structure and function of domains of the proteins.

  19. SFLD

    • ebi.ac.uk
    Updated Sep 7, 2018
    + more versions
    Cite
    (2018). SFLD [Dataset]. https://www.ebi.ac.uk/interpro/
    Dataset updated
    Sep 7, 2018
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.

  20. Neurotransmitter Receptors & Protein Sequences

    • kaggle.com
    zip
    Updated Jul 7, 2025
    Cite
    Yashasvi Goswami (2025). Neurotransmitter Receptors & Protein Sequences [Dataset]. https://www.kaggle.com/datasets/yashasvigoswami/neurotransmitter-receptors
    Available download formats: zip (35950 bytes)
    Dataset updated
    Jul 7, 2025
    Authors
    Yashasvi Goswami
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset provides a curated collection of proteins with their sequences, functional annotations, and structural information. It serves as a resource for researchers working in bioinformatics, structural biology, and systems biology, offering insights into the molecular machinery of life.

    By linking protein sequence data with functional and structural attributes, it supports diverse applications such as:

    • Protein classification and annotation tasks
    • Sequence-to-function machine learning models
    • Structural modeling and docking studies
    • Comparative proteomics and evolutionary studies

    The dataset is valuable for both computational researchers and students of life sciences, offering real-world biological data that can be directly integrated into pipelines for protein analysis, modeling, and prediction.
