License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
While polymerization-induced self-assembly (PISA) has become a preferred synthetic route toward amphiphilic block copolymer self-assemblies, predicting their phase behavior from experimental design is extremely challenging, requiring the time- and labor-intensive creation of empirical phase diagrams whenever self-assemblies of novel monomer pairs are sought for specific applications. To alleviate this burden, we develop here the first framework for a data-driven methodology for the probabilistic modeling of PISA morphologies, based on a selection and suitable adaptation of statistical machine learning methods. As the complexity of PISA precludes generating large volumes of training data with in silico simulations, we focus on interpretable, low-variance methods that can be interrogated for conformity with chemical intuition and that promise to work well with only the 592 training data points we curated from the PISA literature. Among the evaluated linear models, generalized additive models, and rule and tree ensembles, we found that all but the linear models show decent interpolation performance, with an estimated error rate of around 0.2 and an expected cross-entropy loss (surprisal) of about 1 bit when predicting the mixture of morphologies formed from monomer pairs already encountered in the training data. When extrapolating to new monomer combinations, model performance is weaker, but the best model (random forest) still achieves highly nontrivial prediction performance (0.27 error rate, 1.6 bit surprisal), which renders it a good candidate to support the creation of empirical phase diagrams for new monomers and conditions. Indeed, we find in three case studies that, when used to actively learn phase diagrams, the model selects a smart set of experiments that lead to satisfactory phase diagrams after observing only relatively few data points (5–16) for the targeted conditions. The data set, as well as all model training and evaluation code, is publicly available through the GitHub repository of the last author.
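A minimal sketch of the kind of probabilistic model described above, assuming hypothetical file and feature names (the actual descriptors are documented in the repository): a random forest is trained to output a probability distribution over morphologies, and grouped cross-validation over monomer pairs approximates the extrapolation setting. Note that scikit-learn's log_loss reports nats; dividing by ln 2 converts to the bit surprisal quoted above.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import log_loss, accuracy_score

df = pd.read_csv("pisa_morphologies.csv")            # hypothetical curated data set
X = df[["dp_core", "dp_corona", "solids_content"]]   # hypothetical descriptors
y = df["morphology"]                                 # e.g. sphere / worm / vesicle
groups = df["monomer_pair"]                          # held-out groups = extrapolation

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    proba = model.predict_proba(X.iloc[test_idx])
    err = 1 - accuracy_score(y.iloc[test_idx], model.predict(X.iloc[test_idx]))
    bits = log_loss(y.iloc[test_idx], proba, labels=model.classes_) / np.log(2)
    print(f"error rate: {err:.2f}, surprisal: {bits:.2f} bit")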
License: https://www.nist.gov/open/license
The NIST Excerpts Benchmark Data are a set of target data for deidentification algorithms. The data are configured to work with "SDNist: Synthetic Data Report Tool", a package for evaluating synthetic data generators: https://github.com/usnistgov/SDNist. An installation of SDNist will download the data resources automatically.
Jan 2025 -- Benchmark Excerpts:
- NIST American Community Survey (ACS) Data Excerpts: 24 demographic features over 40k records
- NIST Survey of Business Owners (SBO) Data Excerpts: 130 demographic and financial features over 161k records
The data are curated subsets of U.S. Census Bureau products.
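A minimal sketch of a first look at the excerpts with pandas, assuming a hypothetical local file name; as noted above, installing SDNist (pip install sdnist) downloads the benchmark data automatically.

import pandas as pd

acs = pd.read_csv("acs_excerpt.csv")   # hypothetical path to the ACS excerpt
print(acs.shape)                       # expect roughly 40k records x 24 features
print(acs.columns.tolist())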
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Given the size of the chemical space relevant for drug discovery, working with fully enumerated compound libraries (especially in three dimensions (3D)) is unfeasible. Nonenumerated virtual chemical spaces are a practical solution to this issue, in which compounds are described as building blocks that are then connected by rules. One concrete example is the BioSolveIT chemical space file format (.space). Tools exist to search these space files using ligand-based methods, including two-dimensional (2D) fingerprint similarity, substructure matching, and fuzzier similarity metrics such as FTrees. However, no software has been available that enables the screening of these nonenumerated spaces using a protein structure as the input query. Here, a hybrid ligand/structure-based virtual screening tool, called SpaceHASTEN, was developed on top of SpaceLight, FTrees, LigPrep, and Glide to allow efficient structure-based virtual screening of nonenumerated chemical spaces. SpaceHASTEN was validated using three public targets picked from the DUD-E data set. It was able to retrieve a large number of diverse and novel high-scoring compounds (virtual hits) from nonenumerated chemical spaces of billions of molecules after docking only a few million compounds. The software can be freely used and is available from http://github.com/TuomoKalliokoski/SpaceHASTEN.
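An illustrative skeleton of the hybrid loop just described; the two helper functions are hypothetical placeholders for the commercial components (SpaceLight/FTrees similarity search, LigPrep preparation, Glide docking), not the SpaceHASTEN API.

def similarity_search(queries, k):        # hypothetical: 2D search of the .space file
    raise NotImplementedError

def prepare_and_dock(receptor, smiles):   # hypothetical: LigPrep + Glide wrapper
    raise NotImplementedError

def hybrid_screen(receptor, seeds, rounds=3, keep=100):
    """Alternate ligand-based retrieval with structure-based rescoring."""
    hits, queries = [], list(seeds)
    for _ in range(rounds):
        # Only this fast 2D step touches the billion-compound space.
        candidates = similarity_search(queries, k=1_000_000)
        # 3D preparation and docking run on the retrieved subset only.
        scored = prepare_and_dock(receptor, candidates)  # [(smiles, score), ...]
        scored.sort(key=lambda t: t[1])                  # lower docking score = better
        hits.extend(scored[:keep])
        queries = [s for s, _ in scored[:keep]]          # feed best hits back in
    return hits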
As described in the README.md file, the GitHub repository github.com/USEPA/PRTR-QSTR-models/tree/data-driven contains Python scripts for running Quantitative Structure–Transfer Relationship (QSTR) models, chemical structure-based machine learning (ML) models that support environmental regulatory decision-making. Using features associated with annual chemical transfer amounts, chemical generator industry sectors, environmental policy stringency, gross value added by industry sectors, chemical descriptors, and chemical unit prices, as in the GitHub repository PRTR_transfers, the QSTR models developed here can predict potential end-of-life (EoL) activities for chemicals transferred to off-site locations for EoL management. This contribution also shows that QSTR models aid in estimating the mass-fraction allocation of chemicals of concern transferred off-site for EoL activities, and it describes the Python libraries required to run the code, how to use it, the output files obtained after running the Python scripts, and how to obtain all manuscript figures and results. This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, and M. Martín. Predicting Chemical End-of-Life Scenarios Using Structure-Based Classification Models. ACS Sustainable Chemistry & Engineering. American Chemical Society, Washington, DC, USA, 11(9): 3594-3602, (2023).
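A minimal sketch of a transfer-feature classifier in the spirit of the QSTR models described above, assuming hypothetical column and file names rather than the repository's actual feature set; predict_proba gives the probability split over EoL activities that underlies the mass-fraction estimates mentioned above.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

df = pd.read_csv("prtr_transfers.csv")                  # hypothetical training table
features = ["transfer_amount_kg", "sector_code",
            "policy_stringency", "unit_price_usd"]      # hypothetical features
X, y = df[features], df["eol_activity"]                 # e.g. recycling / incineration

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("macro F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
print(clf.predict_proba(X_te.iloc[:1]))                 # allocation over EoL activities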
Source record: https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de567319
Abstract (en):
Inputs: The synthetic population was generated from the 2010-2014 ACS PUMS housing and person files. United States Department of Commerce. Bureau of the Census. (2017-03-06). American Community Survey 2010-2014 ACS 5-Year PUMS File [Data set]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor]. Persistent URL: http://doi.org/10.3886/E100486V1
Funding support: This work is supported under Grant G-2015-13903 from the Alfred P. Sloan Foundation on "The Economics of Socially-Efficient Privacy and Confidentiality Management for Statistical Agencies" (PI: John M. Abowd).
Outputs: There are 17 housing files (data/housing), repHus0.csv, repHus1.csv, ..., repHus16.csv, and 32 person files (data/person), rep_recode_ACSpus0.csv, rep_recode_ACSpus1.csv, ..., rep_recode_ACSpus31.csv. The files are split to be roughly equal in size; each contains data for the entire country, and the split does not follow any demographic characteristic. The person files and housing files must be concatenated to form a complete person file and a complete housing file, respectively.
Programs: The programs that generated this release can be found at https://github.com/labordynamicsinstitute/SynUSpopulation/releases/tag/v201703-beta and http://doi.org/10.5281/zenodo.556424
Additional information: For more information, see README.md.
Smallest Geographic Unit: United States
Funding institution(s): Alfred P. Sloan Foundation
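A minimal sketch, assuming the file layout listed above, of the required concatenation step with pandas:

import pandas as pd

housing = pd.concat((pd.read_csv(f"data/housing/repHus{i}.csv")
                     for i in range(17)), ignore_index=True)
person = pd.concat((pd.read_csv(f"data/person/rep_recode_ACSpus{i}.csv")
                    for i in range(32)), ignore_index=True)
print(len(housing), len(person))   # complete national housing and person files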
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Most existing computational tools for de novo library design focus on the generation, rational selection, and combination of promising structural motifs to form members of the new library. However, the absence of a direct link between the chemical space of the retrosynthetically generated fragments and the pool of available reagents makes such approaches appear rather theoretical and disconnected from reality. In this context, we present Synthons Interpreter (SynthI), a new open-source toolkit for de novo library design that allows merging those two chemical spaces into a single synthon space. Here, synthons are defined as actual fragments with valid valences and special labels specifying the position and nature of the reactive centers. They can be derived either from the "breakup" of reference compounds according to 38 retrosynthetic rules or from real reagents, after leaving-group removal or transformation. Such an approach not only enables the design of synthetically accessible libraries and analog generation but also facilitates the analysis of reagents (building blocks) in the medicinal chemistry context. The SynthI code is publicly available at https://github.com/Laboratoire-de-Chemoinformatique/SynthI.
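SynthI's 38 retrosynthetic rules are its own, but RDKit's built-in BRICS rules illustrate the same "breakup" idea of cutting a reference compound into fragments with labeled attachment points; this sketch uses BRICS as a stand-in, not SynthI itself.

from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")   # paracetamol as an example input
for smi in sorted(BRICS.BRICSDecompose(mol)):
    print(smi)   # dummy atoms such as [1*] mark the reactive attachment points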
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In recent years, reinforcement learning (RL) has emerged as a valuable tool in drug design, offering the potential to propose and optimize molecules with desired properties. However, striking a balance between capabilities, flexibility, reliability, and efficiency remains challenging due to the complexity of advanced RL algorithms and the significant reliance on specialized code. In this work, we introduce ACEGEN, a comprehensive and streamlined toolkit tailored for generative drug design, built using TorchRL, a modern RL library that offers thoroughly tested reusable components. We validate ACEGEN by benchmarking against other published generative modeling algorithms and show comparable or improved performance. We also show examples of ACEGEN applied in multiple drug discovery case studies. ACEGEN is accessible at https://github.com/acellera/acegen-open and available for use under the MIT license.
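As a generic illustration of the core idea (not ACEGEN's TorchRL-based API), the sketch below applies REINFORCE to a character-level recurrent policy: sampled sequences earn a reward, and the policy gradient raises the probability of high-reward sequences. The toy alphabet and reward are placeholders for a real SMILES vocabulary and scoring function.

import torch
import torch.nn as nn

vocab = ["^", "$", "C", "N", "O", "(", ")", "=", "1"]   # toy SMILES alphabet
stoi = {c: i for i, c in enumerate(vocab)}

class Policy(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(len(vocab), hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, len(vocab))

    def forward(self, x, h=None):
        o, h = self.gru(self.emb(x), h)
        return self.out(o), h

def sample(policy, max_len=30):
    """Sample one sequence; return its string and summed log-probability."""
    tok, h, logp, seq = torch.tensor([[stoi["^"]]]), None, 0.0, []
    for _ in range(max_len):
        logits, h = policy(tok, h)
        dist = torch.distributions.Categorical(logits=logits[:, -1])
        choice = dist.sample()
        logp = logp + dist.log_prob(choice)
        if choice.item() == stoi["$"]:
            break
        seq.append(vocab[choice.item()])
        tok = choice.unsqueeze(0)
    return "".join(seq), logp

policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for step in range(200):
    smiles, logp = sample(policy)
    reward = smiles.count("N")            # toy reward; real runs score molecules
    loss = -(reward * logp).sum()         # REINFORCE policy gradient
    opt.zero_grad(); loss.backward(); opt.step()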
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GitHub code link: https://github.com/nsndimt/ChemSSP
JCIM paper link: https://pubs.acs.org/doi/10.1021/acs.jcim.5c00248
Note: use tar xf [filename].tar.xz to decompress all files
Each of the following compressed files corresponds to a folder under the project root:
- model.tar.xz: the SciBERT model checkpoint
- data.tar.xz: the six human-annotated chemical NER datasets
- episode.tar.xz: sampled episodes for the six datasets
- chatGPT.tar.xz, claude.tar.xz, gemini.tar.xz: LLM annotation files and the corresponding sampled episodes
Additionally, codereadme.tar.xz contains all the code and a readme file under the project root. More detailed instructions are in the readme.
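A minimal sketch of the decompression step with Python's standard library, equivalent to running the tar command noted above on each archive:

import tarfile
from pathlib import Path

for archive in Path(".").glob("*.tar.xz"):
    with tarfile.open(archive, mode="r:xz") as tf:
        tf.extractall(".")      # each archive unpacks to a folder under the root
    print("extracted", archive)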
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The continuous improvement of computer architectures allows for the simulation of molecular systems of growing sizes. However, such calculations still require the input of initial structures, which are also becoming increasingly complex. In this work, we present CAT, a Compound Attachment Tool (source code available at https://github.com/nlesc-nano/CAT) and Python package for the automatic construction of composite chemical compounds, which supports the functionalization of organic, inorganic, and hybrid organic–inorganic materials. The CAT workflow consists of defining the anchoring sites on the reference material, usually a large molecular system denoted as a scaffold, and on the molecular species that are attached to it, i.e., the ligands. Usually, ligands are pre-optimized in a conformation biased toward more linear structures to minimize interligand steric interactions, a bias that is important when multiple ligands are attached to the scaffold. The resulting superstructure(s) are then stored in various formats that can be used afterward in quantum chemical calculations or classical force-field-based simulations.
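An illustration of the anchoring-site idea (not CAT's actual API): with RDKit, a ligand's anchor can be located by substructure matching, here a carboxylic acid group whose atoms would define the attachment point to the scaffold. The SMILES and SMARTS are example inputs.

from rdkit import Chem

ligand = Chem.MolFromSmiles("CCCCCCCC(=O)O")   # octanoic acid as an example ligand
anchor = Chem.MolFromSmarts("C(=O)[OX2H1]")    # protonated carboxyl anchor pattern
print("anchor atom indices:", ligand.GetSubstructMatches(anchor))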
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Motivated by the challenges deep learning faces in the low-data regime and the urgent demand for the intelligent design of highly energetic materials, we explore a correlated deep learning framework, consisting of three recurrent neural networks (RNNs) correlated by a transfer learning strategy, to efficiently generate new energetic molecules with a high detonation velocity when only very limited data are available. To avoid dependence on an external big data set, data augmentation by fragment shuffling of 303 energetic compounds is used to produce 500,000 molecules for pretraining the RNN, through which the model can learn sufficient structural knowledge. The pretrained RNN is then fine-tuned on the 303 energetic compounds to generate 7153 molecules similar to them. To screen molecules with a high detonation velocity more reliably, SMILES enumeration augmentation coupled with the pretrained knowledge is used to build an RNN-based prediction model, through which R2 is boosted from 0.4446 to 0.9572. Performance comparable to a transfer learning strategy based on an existing big database (ChEMBL) for producing energetic and drug-like molecules further supports the effectiveness and generality of our strategy in the low-data regime. High-precision quantum mechanics calculations further confirm that 35 new molecules present a higher detonation velocity and lower synthetic accessibility than the classic explosive RDX, along with good thermal stability. In particular, three new molecules are comparable to caged CL-20 in detonation velocity. All the source code and the data set are freely available at https://github.com/wangchenghuidream/RNNMGM.
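A minimal sketch of the transfer-learning step, with a hypothetical checkpoint, vocabulary size, and data batch (the dummy tensors stand in for tokenized SMILES of the 303 compounds): the pretrained weights are loaded and next-token training continues at a small learning rate on the target set.

import torch
import torch.nn as nn

class SmilesRNN(nn.Module):
    def __init__(self, vocab_size=64, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

model = SmilesRNN()
model.load_state_dict(torch.load("pretrained_500k.pt"))   # hypothetical checkpoint
opt = torch.optim.Adam(model.parameters(), lr=1e-4)        # small LR for fine-tuning
loss_fn = nn.CrossEntropyLoss()

x = torch.randint(0, 64, (8, 40))       # dummy batch of SMILES token ids
y = torch.roll(x, shifts=-1, dims=1)    # dummy next-token targets
for epoch in range(10):
    logits = model(x)                   # next-token prediction over the vocabulary
    loss = loss_fn(logits.transpose(1, 2), y)
    opt.zero_grad(); loss.backward(); opt.step()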
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Efficient computational screenings are integral to materials discovery in highly sought-after gas adsorption and storage applications, such as CO2 capture. Preprocessing techniques have been developed to render experimental crystal structures suitable for molecular simulations by mimicking experimental activation protocols, particularly residual solvent removal. Current accounts examining these preprocessed materials databases indicate the presence of assorted structural errors introduced by solvent removal and preprocessing, including the improper elimination of charge-balancing ions and ligands. Here, we address the need for a reliable experimental crystal structure preprocessing protocol by introducing a novel solvent removal method, which we call SAMOSA, informed by systematic ligand charge and metal oxidation state calculations. A robust set of solvent removal criteria is outlined, which identifies solvent molecules and counterions without predefined molecule lists or significant reliance on experimental chemical information. Validation against popular metal–organic framework (MOF) databases suggests that this method achieves significant performance improvements in the retention of charged ligands and the recognition of charged frameworks. SAMOSA enhances structure fidelity with respect to the original as-synthesized material, thereby representing a powerful tool in computational materials database curation and preprocessing for molecular simulation. The source code is accessible at https://github.com/uowoolab/SAMOSA.
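SAMOSA's own criteria involve ligand charges and metal oxidation states, but the basic solvent-detection idea can be illustrated with pymatgen (a sketch assuming a hypothetical input CIF, not the SAMOSA algorithm): bonded subgraphs of the crystal that do not extend periodically are isolated molecules sitting in the pores, i.e., solvent candidates.

from pymatgen.core import Structure
from pymatgen.analysis.graphs import StructureGraph
from pymatgen.analysis.local_env import JmolNN

structure = Structure.from_file("mof_as_deposited.cif")    # hypothetical CIF
graph = StructureGraph.with_local_env_strategy(structure, JmolNN())
for molecule in graph.get_subgraphs_as_molecules():
    print(molecule.composition.reduced_formula)            # e.g. H2O, DMF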
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Quantum chemical spectroscopic calculations have grown increasingly popular in natural products research for aiding the elucidation of chemical structures, especially their stereochemical configurations. These calculations have become faster with modern computational speeds, but subsequent data handling, inspection, and presentation remain key bottlenecks for many researchers. In this article, we introduce the SpectroIBIS computer program as a user-friendly tool to automate tedious tasks commonly encountered in this workflow. Through a simple graphical user interface, researchers can drag and drop Gaussian or ORCA output files to produce Boltzmann-averaged ECD, VCD, UV–vis, and IR data, optical rotations, and/or 1H and 13C NMR chemical shifts in seconds. Also produced are formatted, publication-quality supplementary data tables containing conformer energies and atomic coordinates, saved to a DOCX file compatible with Microsoft Word and LibreOffice. Importantly, SpectroIBIS can assist researchers in finding common calculation issues by automatically checking for redundant conformers and imaginary frequencies. Additional useful features include recognition of conformer energy recalculations at a higher level of theory, and automated generation of input files for quantum chemistry programs with optional exclusion of high-energy conformers. Lastly, we demonstrate the applicability of SpectroIBIS with spectroscopic calculations for five natural products. SpectroIBIS is open-source software available as a free desktop application (https://github.com/bbulcock/SpectroIBIS).
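A minimal sketch of the Boltzmann averaging that SpectroIBIS automates: per-conformer properties are weighted by populations proportional to exp(-dE/RT). The energies and property values below are made-up example numbers.

import numpy as np

R, T = 8.31446e-3, 298.15   # kJ/(mol K), K

def boltzmann_average(energies_kjmol, properties):
    """Weight per-conformer properties by their Boltzmann populations."""
    e = np.asarray(energies_kjmol)
    w = np.exp(-(e - e.min()) / (R * T))
    return (w / w.sum()) @ np.asarray(properties)

# Three conformers: relative energies and, say, their optical rotations.
print(boltzmann_average([0.0, 2.1, 4.8], [-12.0, 35.0, 8.0]))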
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We present a novel algorithm to compute the distance between synthetic routes based on tree edit distances. Such distances can be used to cluster synthesis routes generated by a retrosynthesis prediction tool. We show that the clustering of selected routes from a retrosynthesis analysis is performed in less than 10 s on average and constitutes only seven percent of the total time (prediction + clustering). Furthermore, we show that representative routes from each cluster can be used to reduce the set of predicted routes. Finally, we show with a number of examples that the algorithm gives intuitive clusters that can be easily rationalized and that the routes in a cluster tend to use similar chemistry. The algorithm is included in the latest version of the open-source AiZynthFinder software (https://github.com/MolecularAI/aizynthfinder) and as a separate package (https://github.com/MolecularAI/route-distances).
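A minimal sketch of the tree-edit-distance idea using the zss package (Zhang-Shasha algorithm); the toy route trees stand in for AiZynthFinder's retrosynthesis trees and are not the route-distances package's actual data model.

from zss import Node, simple_distance

route_a = Node("product").addkid(
    Node("intermediate").addkid(Node("reagent A"))).addkid(Node("reagent B"))
route_b = Node("product").addkid(
    Node("intermediate").addkid(Node("reagent C"))).addkid(Node("reagent B"))

print(simple_distance(route_a, route_b))   # one relabeled leaf -> distance 1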
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We present sbml-diff, a tool that is able to read a model of a biochemical reaction network in SBML format and produce a range of diagrams showing different levels of detail. Each diagram type can be used to visualize a single model or to visually compare two or more models. The default view depicts species as ellipses, reactions as rectangles, rules as parallelograms, and events as diamonds. A cartoon view replaces the symbols used for reactions on the basis of the associated Systems Biology Ontology terms. An abstract view represents species as ellipses and draws edges between them to indicate whether a species increases or decreases the production or degradation of another species. sbml-diff is freely licensed under the three-clause BSD license and can be downloaded from https://github.com/jamesscottbrown/sbml-diff. It can be used as a Python package called from other software, as a free-standing command-line application, or online using the form at http://sysos.eng.ox.ac.uk/tebio/upload.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The isoelectric point (pI) of a peptide is a physicochemical property that can be accurately predicted from the sequence when the peptide is built from natural amino acids. Peptides can, however, carry chemical modifications, such as phosphorylations, amidations, and unnatural amino acids, which can result in erroneous predictions if not accounted for. Here we report on an open-source program, pICalculax, which can handle pI calculations of modified peptides in an extensible way. Tests on a database of modified peptides with experimentally determined pI values show an improvement in pI predictions when taking the modifications into account: the correlation coefficient improves from 0.45 to 0.91, and the root-mean-square deviation likewise improves from 3.3 to 0.9. The program is available at https://github.com/EBjerrum/pICalculax.
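A minimal sketch of the textbook pI calculation that pICalculax extends to modified residues: bisection finds the pH at which the Henderson-Hasselbalch net charge crosses zero. The pKa table is a small illustrative subset, not pICalculax's parameter set.

PKA = {"K": 10.5, "R": 12.5, "H": 6.0, "D": 3.9, "E": 4.1,
       "Nterm": 9.0, "Cterm": 2.1}
POSITIVE = {"K", "R", "H", "Nterm"}

def net_charge(seq, ph):
    charge = 0.0
    for g in ["Nterm", "Cterm"] + [aa for aa in seq if aa in PKA]:
        frac = 1.0 / (1.0 + 10 ** (ph - PKA[g]))     # protonated fraction
        charge += frac if g in POSITIVE else frac - 1.0
    return charge

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if net_charge(seq, mid) > 0 else (lo, mid)
    return round((lo + hi) / 2, 2)

print(isoelectric_point("GDKRHE"))   # unmodified peptide example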