MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The dataset used in this augmentation process(used a subset of the original training data) is sourced from the Leash Bio - Predict New Medicines with BELKA competition(Read More). It comprises examples of small molecules categorized through binary classification, determining whether each molecule is a binder to one of three protein targets. The data collection method involves utilizing DNA-encoded chemical library (DEL) technology.
Chemical representations are expressed in SMILES (Simplified Molecular-Input Line-Entry System), while the labels denote binary binding classifications, corresponding to three distinct protein targets.
I've expanded the original dataset by augmenting it with additional features derived from the existing data. Specifically, I've calculated and included three new features:
id- A unique example_id we use to identify the molecule-binding target pair. buildingblock1_smiles - The structure, in SMILES, of the first building block **buildingblock2_smiles **- The structure, in SMILES, of the second building block buildingblock3_smiles - The structure, in SMILES, of the third building block **molecule_smiles **- The structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note we use a [Dy] as the stand-in for the DNA linker. protein_name - The protein target name binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set. mol_wt - The molecule's molecular weight derived from SMILES data using RDKit. logP - The logP of the molecule derived from SMILES data using RDKit. **rotamers **- The number of rotamers of the molecule derived from SMILES data using RDKit.
binds
Proteins are encoded in the genome, and names of the genes encoding those proteins are typically bestowed by their discoverers and regulated by the Hugo Gene Nomenclature Committee. The protein products of these genes can sometimes have different names, often due to the history of their discovery.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the H2O Molecule using the STO-3G basis set at various bondlengths.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The analysis of small molecule crystal structures is a common way to gather valuable information for drug development. The necessary structural data is usually provided in specific file formats containing only element identities and three-dimensional atomic coordinates as reliable chemical information. Consequently, the automated perception of molecular structures from atomic coordinates has become a standard task in cheminformatics. The molecules generated by such methods must be both chemically valid and reasonable to provide a reliable basis for subsequent calculations. This can be a difficult task since the provided coordinates may deviate from ideal molecular geometries due to experimental uncertainties or low resolution. Additionally, the quality of the input data often differs significantly thus making it difficult to distinguish between actual structural features and mere geometric distortions. We present a method for the generation of molecular structures from atomic coordinates based on the recently published NAOMI model. By making use of this consistent chemical description, our method is able to generate reliable results even with input data of low quality. Molecules from 363 Protein Data Bank (PDB) entries could be perceived with a success rate of 98%, a result which could not be achieved with previously described methods. The robustness of our approach has been assessed by processing all small molecules from the PDB and comparing them to reference structures. The complete data set can be processed in less than 3 min, thus showing that our approach is suitable for large scale applications.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A test dataset for MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE) data analysis. See "\Python codes for MUSCLE data analysis\README.txt" for the instructions on running the data analysis codes. Use the files in the "Test MUSCLE dataset" folder as input for the codes. "Test MUSCLE dataset\Output_tile1" contains the code output for the test dataset. The example dataset corresponds to one MiSeq tile in an experiment analyzing dCas9-induced R-loop formation for a library of 256 different target sequences.The latest version of the Python codes for matching single-molecule FRET traces with sequenced clusters is available at https://github.com/deindllab/MUSCLE/.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the H5 Molecule using the STO-3G basis set at various bondlengths.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository comprises a dataset of ~2 million unique compounds saved in an hdf5 small molecule library store, which includes the following fields for each molecule:
The repository also includes a Jupyter notebook that serves as an initial guide for querying the small molecule library store. Export both files to the same folder, allocate approximately 40 GB of available memory disk space, unzip the library store, and then launch the notebook to begin querying.
Upon usage, please cite this publication:
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Cleaned some public pka datasets containing experimental and calculated data for ML modeling and providing protonated and deprotonated molecular topology graphs as well as the reaction centers
With the recent advances in machine learning for quantum chemistry, it is now possible to predict the chemical properties of compounds and to generate novel molecules. Existing generative models mostly use a string- or graph-based representation, but the precise three-dimensional coordinates of the atoms are usually not encoded. First attempts in this direction have been proposed, where autoregressive or GAN-based models generate atom coordinates. Those either lack a latent space in the autoregressive setting, such that a smooth exploration of the compound space is not possible, or cannot generalize to varying chemical compositions. We propose a new approach to efficiently generate molecular structures that are not restricted to a fixed size or composition. Our model is based on the variational autoencoder which learns a translation-, rotation-, and permutation-invariant low-dimensional representation of molecules. Our experiments yield a mean reconstruction error below 0.05 Angstrom, outperforming the current state-of-the-art methods by a factor of four, and which is even lower than the spatial quantization error of most chemical descriptors. The compositional and structural validity of newly generated molecules has been confirmed by quantum chemical methods in a set of experiments.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
37532 Global import shipment records of Small,molecule with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data archive contains single-molecule source data for "Delayed inhibition mechanism for secondary channel factor regulation of ribosomal RNA transcription" by Sarah K. Stumper, Harini Ravi, Larry J. Friedman, Rachel Anne Mooney, Ivan R. Corrêa, Jr., Anne Gershenson, Robert Landick, and Jeff Gelles.
The data archive (doi: 10.5281/zenodo.2530159) provides files for each figure and figure supplement. The files are ‘intervals’ files readable by the imscroll program (https://github.com/gelles-brandeis/CoSMoS_Analysis).
Three dimensional structures provide a wealth of information on the biological function and the evolutionary history of macromolecules. They can be used to examine sequence-structure-function relationships, interactions, active sites, and more.
We present a database of energy and NMR chemical shifts DFT calculations of 4150 crystal organic solids. The structures contain only H/C/N/O/S atoms and were subject to all-atoms geometry optimisation. Calculations were carried out using Quantum Espresso and GIPAW.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Identifying small molecules that bind strongly to target proteins in rational molecular design is crucial. Machine learning techniques, such as generative adversarial networks (GAN), are now essential tools for generating such molecules. In this study, we present an enhanced method for molecule generation using objective-reinforced GANs. Specifically, we introduce BEGAN (Boltzmann-enhanced GAN), a novel approach that adjusts molecule occurrence frequencies during training based on the Boltzmann distribution exp(−ΔU/τ), where ΔU represents the estimated binding free energy derived from docking algorithms and τ is a temperature-related scaling hyperparameter. This Boltzmann reweighting process shifts the generation process toward molecules with higher binding affinities, allowing the GAN to explore molecular spaces with superior binding properties. The reweighting process can also be refined through multiple iterations without altering the overall distribution shape. To validate our approach, we apply it to the design of sex pheromone analogs targeting Spodoptera frugiperda pheromone receptor SfruOR16, illustrating that the Boltzmann reweighting significantly increases the likelihood of generating promising sex pheromone analogs with improved binding affinities to SfruOR16, further supported by atomistic molecular dynamics simulations. Furthermore, we conduct a comprehensive investigation into parameter dependencies and propose a reasonable range for the hyperparameter τ. Our method offers a promising approach for optimizing molecular generation for enhanced protein binding, potentially increasing the efficiency of molecular discovery pipelines.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We propose an objective and robust method to extract the electrical conductance of single molecules connected to metal electrodes from a set of measured conductance data. Our method roots in the physics of tunneling and is tested on octanedithiol using mechanically controllable break junctions. The single molecule conductance values can be deduced without the need for data selection.
Interference observed in a double-slit experiment most conclusively demonstrates the wave properties of particles. We construct a quantum mechanical double-slit interferometer by rovibrationally exciting D2 (v=2, j=2) molecules in a biaxial state using Stark-induced adiabatic Raman passage. In D2(v=2, j=2)→D2(v=2, j'=0) rotational relaxation via a cold collision with ground state He, the entangled bond axis orientations in the biaxial state act as two slits generating two indistinguishable quantum mechanical pathways connecting initial and final states of the colliding system. The interference disappears when we decouple the two orientations of the bond axis by separately constructing the uniaxial states of D2, unequivocally establishing the double-slit action of the biaxial state. This double slit opens new possibilities in the coherent control of molecular collisions.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data supporting the manuscript: Zhang, O., Guo, Z., He, Y. et al. Six-dimensional single-molecule imaging with isotropic resolution using a multi-view reflector microscope. Nat. Photon. 17, 179–186 (2023). https://doi.org/10.1038/s41566-022-01116-6
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains Hamiltonian information, molecular data, VQE data, and tapering data for the H8 Molecule using the STO-3G basis set at various bondlengths.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Network parameters and datasets to reproduce the results of the article "Predicting the binding of small molecules to proteins through invariant representation of the molecular structure" by R. Beccaria, A. Lazzeri and G. Tiana. The scripts and the instructions can be obtained from https://github.com/guidotiana/Milbinding
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data set recorded in 6 scans, three omega and three phi, on the updated Diamond Light Source beamline I19-1, as part of routine commissioning. These data have been processed several times with different packages and are being made available to the community as an example set to allow other data processing software authors to verify the content of the data and headers.
The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 6712 drug entries including 1448 FDA-approved small molecule drugs, 131 FDA-approved biotech (protein/peptide) drugs, 85 nutraceuticals and 5080 experimental drugs. Additionally, 4227 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 150 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data. DrugBank is supported by David Wishart, Departments of Computing Science X Biological Sciences, University of Alberta. DrugBank is also supported by The Metabolomics Innovation Centre, a Genome Canada-funded core facility serving the scientific community and industry with world-class expertise and cutting-edge technologies in metabolomics.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The dataset used in this augmentation process(used a subset of the original training data) is sourced from the Leash Bio - Predict New Medicines with BELKA competition(Read More). It comprises examples of small molecules categorized through binary classification, determining whether each molecule is a binder to one of three protein targets. The data collection method involves utilizing DNA-encoded chemical library (DEL) technology.
Chemical representations are expressed in SMILES (Simplified Molecular-Input Line-Entry System), while the labels denote binary binding classifications, corresponding to three distinct protein targets.
I've expanded the original dataset by augmenting it with additional features derived from the existing data. Specifically, I've calculated and included three new features:
id- A unique example_id we use to identify the molecule-binding target pair. buildingblock1_smiles - The structure, in SMILES, of the first building block **buildingblock2_smiles **- The structure, in SMILES, of the second building block buildingblock3_smiles - The structure, in SMILES, of the third building block **molecule_smiles **- The structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note we use a [Dy] as the stand-in for the DNA linker. protein_name - The protein target name binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set. mol_wt - The molecule's molecular weight derived from SMILES data using RDKit. logP - The logP of the molecule derived from SMILES data using RDKit. **rotamers **- The number of rotamers of the molecule derived from SMILES data using RDKit.
binds
Proteins are encoded in the genome, and names of the genes encoding those proteins are typically bestowed by their discoverers and regulated by the Hugo Gene Nomenclature Committee. The protein products of these genes can sometimes have different names, often due to the history of their discovery.