http://opendatacommons.org/licenses/dbcl/1.0/
Read this article to unlock the wonderful world of Deep Reinforcement Learning for Drug Design.
ReLeaSE is a public dataset, consisting of molecular structures and their corresponding binding affinity to proteins. The dataset was created for the purpose of evaluating and comparing machine learning models for the prediction of protein-ligand binding affinity.
The dataset contains a total of 10,000 molecules and their binding affinities to several target proteins, including thrombin, kinase, and protease. The molecular structures are represented in Simplified Molecular Input Line Entry System (SMILES) notation, a standardized method for representing molecular structures as strings of characters. The binding affinity is given as the negative base-10 logarithm of the dissociation constant (pKd = -log10 Kd), a measure of the strength of the interaction between the molecule and the target protein; higher pKd values indicate tighter binding.
The ReLeaSE dataset provides a standardized benchmark for evaluating machine learning models for protein-ligand binding affinity prediction. The dataset is publicly available and can be used for research purposes, making it an important resource for the drug discovery community.
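The pKd representation described above can be illustrated with a short sketch. It assumes that Kd is expressed in mol/L; the function names and the example records are hypothetical and are not taken from the dataset itself.

```python
import math


def kd_to_pkd(kd_molar: float) -> float:
    """Convert a dissociation constant in mol/L to pKd = -log10(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)


def pkd_to_kd(pkd: float) -> float:
    """Invert the transform: Kd = 10 ** (-pKd), in mol/L."""
    return 10.0 ** (-pkd)


# Hypothetical (SMILES, Kd in mol/L) records; the values are invented for illustration.
records = [("CCO", 1e-3), ("c1ccccc1", 1e-6), ("CC(=O)O", 1e-9)]

# A smaller Kd gives a larger pKd, so sorting by pKd in descending order
# puts the tightest binder first.
ranked = sorted(records, key=lambda rec: kd_to_pkd(rec[1]), reverse=True)

for smiles, kd in ranked:
    print(f"{smiles}\tpKd={kd_to_pkd(kd):.2f}")
```

Working on the logarithmic scale keeps affinities that span many orders of magnitude in a range that is easier to compare and to use as a regression target.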
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the HF molecule using the STO-3G basis set at various bond lengths.
SII-minxiyu/molecule-data dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the C2 molecule using the STO-3G basis set at various bond lengths.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Data: Effector Molecule Analysis
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Molecular Entities in Linked Data (MEiLD) dataset comprises data on distinct atoms, molecules, ions, ion pairs, radicals, radical ions, and other species that are identifiable as separately distinguishable chemical entities. The dataset is provided in JSON-LD format and was generated by SDFEater, a tool for parsing atoms, bonds, and other molecule data. MEiLD contains 349,960 "small" chemical entities.
https://www.gnu.org/licenses/gpl-3.0.html
"DUDE_pocket.tar.gz" contains the pocket structures of the 101 DUD-E targets used for model evaluation.
"lingo3dmol_confs.tar.gz" includes molecules generated by Lingo3DMol for the 101 DUD-E targets. It includes 3D conformations directly generated by the model without any force field-based refinement.
"pocket2mol_confs.tar.gz" consists of molecules generated by Pocket2Mol for the 101 DUD-E targets. It includes 3D conformations directly generated by the model without any force field-based refinement.
"targetdiff_confs.tar.gz" contains molecules generated by TargetDiff for the 101 DUD-E targets. It includes 3D conformations directly generated by the model without any force field-based refinement.
"random_confs.tar.gz" contains random molecules for the 101 DUD-E targets. It includes 3D conformations obtained through docking.
The molecule files in the "*_confs.tar.gz" archives are named using the format {pdb_id}-{mol_id}.
Additional information about the generated molecules, including mol_id, SMILES, and the metric scores involved in the evaluation, can be found in the "*_moleculars_metric_score_with_dude.csv" files.
"Training_data.tar.gz" contains all the complex structures used for model fine-tuning.
"Training_data_homology_with_DUDE.csv" contains information about the PDB IDs in the training set and their maximum sequence identity with the DUD-E targets used in the evaluation.
"Pretraining_molecules_SMILES" contains a subset of the data used for pretraining: 1.4 million publicly available molecules.
https://www.apache.org/licenses/LICENSE-2.0.html
The METLIN Small Molecule Retention Time (SMRT) dataset
The METLIN SMRT is a reverse-phase retention time dataset covering a total of 80,038 small molecules. The SMRT dataset includes, for each molecule, the retention time (in seconds), the PubChem number, the molfiles containing the structure (SDF format), and molecular descriptors and extended connectivity fingerprints (ECFP) calculated with Dragon 7 (Kode Chemoinformatics, Pisa, Italy). The SMRT is a freely available resource. Use and redistribution of the data, in whole or in part, requires explicit acknowledgment of the source material and the original publication:
Domingo-Almenara, X. et al. The METLIN small molecule dataset for machine learning-based retention time prediction. Nature Communications (2019) DOI: 10.1038/s41467-019-13680-7
This dataset was created by Danny Ahn
Single molecule map stretch per scan in recent flowcells. Bases per pixel (bpp) is plotted for scans 1..n for each flowcell of mouse lemur molecules (purple). The first scan of each flowcell is indicated with a grey dashed line. The pre-adjusted molecule map stretch was determined by aligning molecule maps to the in silico maps. Data made available by P.A. Larsen, J. Rogers, A.D. Yoder and the Duke Lemur Center. (ZIP 55 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains example Molecular Dynamics (MD) simulation files. A Pseudomonas aeruginosa membrane was simulated with Polymyxin B1, and the result serves as an example data set for MeTrEx (Membrane Trajectory Explorer). MeTrEx is a Python-based software tool for exploring and analysing ligand and membrane molecule interaction data from MD simulations.
The manuscript is currently under review, and the data set will be published open-source after acceptance.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the O2 molecule using the STO-3G basis set at various bond lengths.
https://creativecommons.org/publicdomain/zero/1.0/
Version 3 description below
This dataset is an extended version of the "Wikipedia Molecules Properties Dataset" with added SMILES representations, additional physicochemical properties calculated using the thermo library, and chemical classification according to the ClassyFire system for approximately 4,200 compounds.
CC0: Public Domain - This dataset is in the public domain. You can copy, modify, distribute and perform the data, even for commercial purposes, all without asking permission.
When using this dataset, please cite:
- Original dataset: Wikipedia Molecules Properties Dataset
- This extended dataset: "Yakovlev, I. (2025). Physical and Chemical Properties of Substances [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18242974" or "Ivan Yakovlev. (2025). Physical and Chemical Properties of Substances [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/11611464"
Ivan Yakovlev
Email: yakovlev.ivan.g@gmail.com
LinkedIn: www.linkedin.com/in/ivanyakovlevg
Cumulative length and number of single molecule maps per BNX file for T. castaneum data generated over time. Detailed metrics for molecule maps per BNX file (cumulative length and number of maps). Columns include cumulative length of molecule maps > 150 kb, number of molecule maps > 150 kb and date that BNX file was generated. (CSV 4.11 kb)
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Cleaned versions of several public pKa datasets containing experimental and calculated data for ML modeling. Protonated and deprotonated molecular topology graphs are provided, along with the reaction centers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains two datasets for our recent work "Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data". The first dataset is multi-fidelity band gap data for crystals, and the second is molecular energy data for molecules.
1. Multi-fidelity band gap data for crystals
The full band gap data used in the paper is located at band_gap_no_structs.gz. Users can use the following code to extract it.
    import gzip
    import json
    with gzip.open("band_gap_no_structs.gz", "rb") as f:
        data = json.loads(f.read())
data is a dictionary with the following format
    {"pbe": {mp_id: PBE band gap, mp_id: PBE band gap, ...},
     "hse": {mp_id: HSE band gap, mp_id: HSE band gap, ...},
     "gllb-sc": {mp_id: GLLB-SC band gap, mp_id: GLLB-SC band gap, ...},
     "scan": {mp_id: SCAN band gap, mp_id: SCAN band gap, ...},
     "ordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...},
     "disordered_exp": {icsd_id: Exp band gap, icsd_id: Exp band gap, ...}}
where mp_id is the Materials Project materials ID for the material, and icsd_id is the ICSD materials ID. For example, the PBE band gap of NaCl (mp-22862, band gap 5.003 eV) can be accessed by data['pbe']['mp-22862']. Note that the Materials Project database evolves over time: an ID may be removed in a later release, and the band gap value for the same material may change.
To get the structure that corresponds to a specific material ID in Materials Project, users can use the pymatgen REST API.
1.1. Register at Materials Project https://www.materialsproject.org and get an API key.
1.2. In Python, do the following to get the corresponding computational structure.
    # Older pymatgen releases also accepted "from pymatgen import MPRester".
    from pymatgen.ext.matproj import MPRester
    api_key = "YOUR_API_KEY"  # replace with your own API key
    mp_id = "mp-22862"        # replace with the material ID of interest
    mpr = MPRester(api_key)
    structure = mpr.get_structure_by_material_id(mp_id)
A dump of all the material IDs and structures for the 2019.04.01 MP version is provided here: https://ndownloader.figshare.com/files/15108200. Users can download the file and extract the material_id and structure from it for all materials. The structure in this case is a CIF string. Users can again use pymatgen to read the CIF string and get the structure.
    from pymatgen.core import Structure
    # cif_string holds the CIF text extracted from the dump
    structure = Structure.from_str(cif_string, fmt="cif")
For the ICSD structures, users are required to have commercial ICSD access. Hence the structures will not be provided here.
2. Multi-fidelity molecular energy data
The molecule_data.zip contains two datasets in JSON format.
2.1 G4MP2.json contains two-fidelity calculation results on QM9 molecules: G4MP2 (6095) and B3LYP (130831).
    {"G4MP2": {"U0": {ID: G4MP2 energy (eV), ...},
               "molecules": {ID: Pymatgen molecule dict, ...}},
     "B3LYP": {"U0": {ID: B3LYP energy (eV), ...},
               "molecules": {ID: Pymatgen molecule dict, ...}}}
2.2 qm7b.json contains the molecule energy calculation results for 7211 molecules using the HF, MP2 and CCSD(T) methods with the 6-31g, sto-3g and cc-pvdz bases.
    {"molecules": {ID: Pymatgen molecule dict, ...},
     "targets": {ID: {"HF": {"sto3g": Atomization energy (kcal/mol),
                             "631g": Atomization energy (kcal/mol),
                             "cc-pvdz": Atomization energy (kcal/mol)},
                      "MP2": {"sto3g": Atomization energy (kcal/mol),
                              "631g": Atomization energy (kcal/mol),
                              "cc-pvdz": Atomization energy (kcal/mol)},
                      "CCSD(T)": {"sto3g": Atomization energy (kcal/mol),
                                  "631g": Atomization energy (kcal/mol),
                                  "cc-pvdz": Atomization energy (kcal/mol)},
                      ...}}}
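As a minimal sketch of how the band gap dictionary described above might be used, the following pairs up materials that have values at two fidelities. The toy file, its IDs and its gap values are invented for illustration and are not taken from band_gap_no_structs.gz; load_band_gaps and paired_gaps are hypothetical helper names, not part of the authors' code.

```python
import gzip
import json
import os
import tempfile


def load_band_gaps(path):
    """Read a gzip-compressed JSON file shaped like {fidelity: {material_id: gap}}."""
    with gzip.open(path, "rb") as f:
        return json.loads(f.read())


def paired_gaps(data, low="pbe", high="hse"):
    """Return {material_id: (low_gap, high_gap)} for IDs present at both fidelities."""
    shared = data[low].keys() & data[high].keys()
    return {mid: (data[low][mid], data[high][mid]) for mid in sorted(shared)}


# Toy stand-in for the real file; IDs and values are made up.
toy = {
    "pbe": {"id-1": 1.0, "id-2": 0.5, "id-3": 2.0},
    "hse": {"id-1": 1.8, "id-3": 2.9},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_band_gap.gz")
    with gzip.open(path, "wb") as f:
        f.write(json.dumps(toy).encode("utf-8"))
    data = load_band_gaps(path)

pairs = paired_gaps(data)
for mid, (low_gap, high_gap) in pairs.items():
    # The difference shows how far the two fidelities disagree for one material.
    print(mid, low_gap, high_gap, round(high_gap - low_gap, 3))
```

With the real file, the same call pattern would apply to any two of the keys listed in the format description, for example "pbe" and "ordered_exp", provided the two fidelities share an ID scheme.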
Summary from GEO:
"Replicon-seq is a method to study the progression of sister replisomes during DNA replication. This method relies on excision of the full length of replicons by the fusion of MNase to MCM4 and sequencing via Nanopore technology."
Overall design from GEO:
"MCM4 was fused to Micrococcal nuclease (MNase) to generate DNA double strand break at the site of replisomes. DNA ends are repaired and MinION compatible DNA adaptors are ligated. Full length molecules are sequenced. Because cells have been released from a G1 arrest in presence of BrdU, we can select for replicon reads (reads that contain BrdU) informatically using DNAscent."
Old version. Use https://www.kaggle.com/cedben/pickled-molecules instead
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many approaches not only fail to consider the intricate binding pocket interactions, leading to molecules with suboptimal properties and stability, but also struggle with designing selective inhibitors. To address this challenge, we have developed an innovative structure-based three-dimensional molecular generation framework named Coarse-grained and Multi-dimensional Data-driven molecular generation (CMD-GEN). This framework bridges three-dimensional ligand-protein complex data with two-dimensional drug-like molecule data by utilizing coarse-grained pharmacophore points sampled from diffusion models, thereby enriching the training data for generative models. Through a hierarchical architecture, it decomposes the generation of three-dimensional molecules within the pocket into sampling of coarse-grained pharmacophore points, generation of chemical structures, and alignment of conformations, avoiding the instability issues inherent in deep generative model-based generation of molecular conformations.
This project provides the source dataset used to train and evaluate the overall model.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The dataset used in this augmentation process (a subset of the original training data) is sourced from the Leash Bio - Predict New Medicines with BELKA competition. It comprises examples of small molecules labeled by binary classification, indicating whether each molecule binds to one of three protein targets. The data were collected using DNA-encoded chemical library (DEL) technology.
Chemical representations are expressed in SMILES (Simplified Molecular-Input Line-Entry System), while the labels denote binary binding classifications, corresponding to three distinct protein targets.
I've expanded the original dataset by augmenting it with additional features derived from the existing data. Specifically, I've calculated and included three new features (mol_wt, logP, and rotamers). The full set of columns is:
id - A unique example_id we use to identify the molecule-binding target pair.
buildingblock1_smiles - The structure, in SMILES, of the first building block.
buildingblock2_smiles - The structure, in SMILES, of the second building block.
buildingblock3_smiles - The structure, in SMILES, of the third building block.
molecule_smiles - The structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note we use a [Dy] as the stand-in for the DNA linker.
protein_name - The protein target name.
binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.
mol_wt - The molecule's molecular weight, derived from SMILES data using RDKit.
logP - The logP of the molecule, derived from SMILES data using RDKit.
rotamers - The number of rotamers of the molecule, derived from SMILES data using RDKit.
Proteins are encoded in the genome, and the names of the genes encoding those proteins are typically bestowed by their discoverers and regulated by the HUGO Gene Nomenclature Committee. The protein products of these genes can sometimes have different names, often due to the history of their discovery.
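To make the column layout of this augmented dataset concrete, here is a minimal sketch that reads rows with some of those columns and computes the fraction of binders per protein target. The rows, the target names target_A and target_B, and all numeric values are invented for illustration; the building block columns are omitted for brevity, and binder_rate_by_protein is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows using a subset of the dataset's column names.
SAMPLE = """id,molecule_smiles,protein_name,binds,mol_wt,logP,rotamers
0,C[Dy],target_A,0,100.0,1.2,3
1,C[Dy],target_B,1,100.0,1.2,3
2,CC[Dy],target_A,1,114.0,1.6,4
3,CC[Dy],target_B,0,114.0,1.6,4
"""


def binder_rate_by_protein(rows):
    """Return {protein_name: fraction of rows with binds == 1}."""
    counts = defaultdict(lambda: [0, 0])  # protein -> [binders, total]
    for row in rows:
        tally = counts[row["protein_name"]]
        tally[0] += int(row["binds"])
        tally[1] += 1
    return {protein: binders / total for protein, (binders, total) in counts.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rates = binder_rate_by_protein(rows)
print(rates)
```

Each molecule appears once per protein target, so grouping on protein_name rather than on molecule_smiles is what yields one binding rate per target.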