License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.
While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.
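As an illustration of the kind of physicochemical property listed here, the Kyte-Doolittle scale can be averaged over a sequence to give a hydropathy (GRAVY) score. The sketch below is not the dataset author's code; the function name `gravy` and the example sequence are made up for illustration, and the dataset may compute its properties differently.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("sequence is empty")
    unknown = set(residues) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


if __name__ == "__main__":
    # Hypothetical example sequence, not taken from the dataset.
    print(round(gravy("MKTAYIAKQR"), 3))
```

Positive scores indicate a more hydrophobic sequence on average, negative scores a more hydrophilic one.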
This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.
The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
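A quick sanity check on the split can be sketched as follows. With the stated sizes, 16,000 of 20,000 samples is an 80/20 split. The helper names are hypothetical, and the code assumes only that each file has a single header row.

```python
import csv


def count_rows(csv_path):
    """Count data rows (excluding the header row) in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def split_fractions(train_path, test_path):
    """Return (train_fraction, test_fraction) based on row counts."""
    n_train = count_rows(train_path)
    n_test = count_rows(test_path)
    total = n_train + n_test
    if total == 0:
        raise ValueError("both files are empty")
    return n_train / total, n_test / total
```

Calling `split_fractions("proteinas_train.csv", "proteinas_test.csv")` on the files named above should return fractions close to 0.8 and 0.2 if the files match the stated sizes.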
This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of 11,832 protein structures obtained from the RCSB Protein Data Bank (PDB), covering the period from 2015 to 2023. It includes features relevant to X-ray crystallography-determined protein structures from the organism Homo sapiens (human). In addition, new structural features such as the number of residues, chains, and secondary structure components (helix, sheet, coil) have been added. These features provide additional insights that enhance the dataset’s application for various machine learning (ML) and deep learning (DL) tasks.
The dataset is optimized for tasks such as protein-ligand binding prediction, oligomeric state analysis, enzyme classification, protein secondary structure prediction, structure quality assessment, and more. By incorporating secondary structure data, the dataset also supports tasks related to structural motif detection, protein stability prediction, and protein domain analysis.
For a detailed explanation of how these new features were extracted and integrated, including code snippets, please refer to PrepareData|RCSB_PDBHumanMacromoleculeDataset and prepareRCSB_PDBDatasetwithStructuralFeatures.
The dataset was curated by filtering proteins based on X-ray diffraction data and a resolution range of 1.0 Å to 3.0 Å, ensuring high-quality structural data for downstream ML applications.
The dataset is licensed under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, allowing unrestricted use, modification, and distribution. It can be used for both academic and commercial purposes.
A custom query was constructed to retrieve X-ray crystallography structures of human proteins from the RCSB Protein Data Bank (PDB) with resolutions between 1.0 and 3.0 Å. The query also restricted results to structures whose symmetry type is asymmetric and to entries that carry enzyme classification information.
Query:
(experimental_method:"X-RAY DIFFRACTION")
AND (resolution:[1.0 TO 3.0])
AND (entity_poly.rcsb_entity_polymer_type:"Protein (only)")
AND (pdbx_audit_revision_history.revision_date:[2015-01-01 TO 2023-09-17])
AND (rcsb_entity_source_organism.taxonomy_lineage.name:"Homo sapiens")
AND (rcsb_symmetry.symmetry_type:"Asymmetric")
AND (rcsb_enzyme_classification.ec_name:*)
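The same criteria can be expressed as a local filter over already-downloaded records. This is only a sketch under stated assumptions: the `Entry` class and its field names are hypothetical and do not correspond to the RCSB API or to the dataset's column names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Entry:
    """Hypothetical flattened record; field names are illustrative only."""
    pdb_id: str
    method: str
    resolution: float  # in angstroms
    polymer_type: str
    revision_date: date
    organism: str
    symmetry_type: str
    ec_name: Optional[str]


def matches_query(entry: Entry) -> bool:
    """Apply each clause of the query above to one record."""
    return (
        entry.method == "X-RAY DIFFRACTION"
        and 1.0 <= entry.resolution <= 3.0
        and entry.polymer_type == "Protein (only)"
        and date(2015, 1, 1) <= entry.revision_date <= date(2023, 9, 17)
        and entry.organism == "Homo sapiens"
        and entry.symmetry_type == "Asymmetric"
        and bool(entry.ec_name)  # wildcard clause: any non-empty EC name
    )
```

A list of records can then be narrowed with `[e for e in entries if matches_query(e)]`.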
After fetching the relevant protein structures, the following features were selected for extraction:
These newly added features enrich the dataset's potential for structural bioinformatics applications. For a full breakdown of the extraction process, refer to PrepareData|RCSB_PDBHumanMacromoleculeDataset and prepareRCSB_PDBDatasetwithStructuralFeatures.
The extracted data, along with the new structural features, was consolidated into a tabular format. This dataset was further prepared by merging multiple CSV files into a single dataset using a custom Python notebook.
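The merge step can be sketched with the standard library as follows. This is not the authors' notebook, whose contents are not shown here; the function name `merge_csv_files` is hypothetical, and the sketch assumes each input file has a header row.

```python
import csv


def merge_csv_files(input_paths, output_path):
    """Concatenate CSV files into one, using the union of their columns.

    Returns the number of data rows written. Columns missing from a
    given input file are left empty in the merged output.
    """
    fieldnames = []
    rows = []
    for path in input_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Taking the union of columns keeps features that appear in only some of the input files, at the cost of empty cells for rows that lack them.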
The dataset includes a broad range of features, now expanded with structural data to enhance ML/DL applications. Here is a detailed explanation of all the features:
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
📖 Context & Inspiration
This dataset was created to provide a comprehensive, easily accessible archive of Protein Data Bank (PDB) files—vital for researchers and developers working in structural biology, molecular modeling, and bioinformatics. The files were gathered from a public directory-style server, where PDB structures are organized into subfolders based on naming conventions.
The goal was to simplify access to this valuable data by restructuring it in a more machine-learning-friendly format, eliminating the need for repetitive scraping or manual downloads. Inspiration came from real-world bottlenecks in data preprocessing for protein folding prediction models, drug design simulations, and structure-based machine learning pipelines.
🌐 Source
All files were sourced from a publicly available FTP-style archive mirroring the PDB repository. The original structure was preserved to ensure traceability and compatibility with existing workflows.
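Because the files keep their original subfolder layout, a first step in most workflows is to index them by folder. The sketch below assumes a local copy of the archive and guesses at the file extensions (`.pdb`, `.ent`, `.gz`); both the function name and the suffix list are assumptions, not details confirmed by the dataset description.

```python
from collections import defaultdict
from pathlib import Path


def index_structure_files(root, suffixes=(".pdb", ".ent", ".gz")):
    """Map each subfolder name to the structure files it contains.

    Only the final suffix is checked, so 'pdb1abc.ent.gz' matches '.gz'.
    """
    index = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            index[path.parent.name].append(path)
    return dict(index)
```

The returned dictionary can be used to sample folders, count files per folder, or build a manifest before parsing any structure.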
⚠️ Note: This dataset is intended for research and educational use. Ensure compliance with any licensing or usage terms defined by the PDB archive.
License: http://opendatacommons.org/licenses/dbcl/1.0/
This is a protein dataset retrieved from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).
The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.
The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.
There are two data files. Both are keyed on the "structureId" of the protein:
pdb_data_no_dups.csv contains protein metadata, including details on protein classification, extraction methods, etc.
data_seq.csv contains more than 400,000 protein structure sequences.
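Since both files share "structureId", they can be combined with an inner join. The sketch below works on rows already read as dictionaries (for example with `csv.DictReader`). Only the `structureId` key comes from the description above; the other column names in the usage example are hypothetical.

```python
from collections import defaultdict


def join_on_structure_id(meta_rows, seq_rows, key="structureId"):
    """Inner-join metadata rows with sequence rows on a shared key.

    A structure with several sequence rows yields one joined row per
    sequence. If both sides share a column name, the metadata value wins.
    """
    sequences = defaultdict(list)
    for row in seq_rows:
        sequences[row[key]].append(row)

    joined = []
    for meta in meta_rows:
        for seq in sequences.get(meta[key], []):
            merged = dict(seq)
            merged.update(meta)  # metadata values take precedence
            joined.append(merged)
    return joined


if __name__ == "__main__":
    # Hypothetical rows; real column names may differ.
    meta = [{"structureId": "XXXX", "classification": "HYDROLASE"}]
    seqs = [
        {"structureId": "XXXX", "chainId": "A", "sequence": "MKT"},
        {"structureId": "XXXX", "chainId": "B", "sequence": "MKT"},
    ]
    for row in join_on_structure_id(meta, seqs):
        print(row)
```

An inner join drops structures that appear in only one file; whether that is desirable depends on the downstream task.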
Original dataset downloaded from http://www.rcsb.org/pdb/
The Protein Data Bank has helped the life science community study different diseases and develop new drugs and solutions that support human survival.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
This dataset provides a curated collection of proteins with their sequences, functional annotations, and structural information. It serves as a resource for researchers working in bioinformatics, structural biology, and systems biology, offering insights into the molecular machinery of life.
By linking protein sequence data with functional and structural attributes, it supports diverse applications such as:
- Protein classification and annotation tasks
- Sequence-to-function machine learning models
- Structural modeling and docking studies
- Comparative proteomics and evolutionary studies
The dataset is valuable for both computational researchers and students of life sciences, offering real-world biological data that can be directly integrated into pipelines for protein analysis, modeling, and prediction.