15 datasets found
  1. Data from: Interpretable Machine Learning Models for Phase Prediction in...

    • acs.figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Yiwen Lu; Dilek Yalcin; Paul J. Pigram; Lewis D. Blackman; Mario Boley (2023). Interpretable Machine Learning Models for Phase Prediction in Polymerization-Induced Self-Assembly [Dataset]. http://doi.org/10.1021/acs.jcim.3c00460.s002
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Yiwen Lu; Dilek Yalcin; Paul J. Pigram; Lewis D. Blackman; Mario Boley
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    While polymerization-induced self-assembly (PISA) has become a preferred synthetic route toward amphiphilic block copolymer self-assemblies, predicting their phase behavior from experimental design is extremely challenging, requiring time and work-intensive creation of empirical phase diagrams whenever self-assemblies of novel monomer pairs are sought for specific applications. To alleviate this burden, we develop here the first framework for a data-driven methodology for the probabilistic modeling of PISA morphologies based on a selection and suitable adaption of statistical machine learning methods. As the complexity of PISA precludes generating large volumes of training data with in silico simulations, we focus on interpretable low variance methods that can be interrogated for conformity with chemical intuition and that promise to work well with only 592 training data points which we curated from the PISA literature. We found that among the evaluated linear models, generalized additive models, and rule and tree ensembles, all but the linear models show a decent interpolation performance with around 0.2 estimated error rate and 1 bit expected cross entropy loss (surprisal) when predicting the mixture of morphologies formed from monomer pairs already encountered in the training data. When considering extrapolation to new monomer combinations, the model performance is weaker but the best model (random forest) still achieves highly nontrivial prediction performance (0.27 error rate, 1.6 bit surprisal), which renders it a good candidate to support the creation of empirical phase diagrams for new monomers and conditions. Indeed, we find in three case studies that, when used to actively learn phase diagrams, the model is able to select a smart set of experiments that lead to satisfactory phase diagrams after observing only relatively few data points (5–16) for the targeted conditions. 
    The data set as well as all model training and evaluation code are publicly available through the GitHub repository of the last author.
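
    The two metrics quoted in this description, error rate and expected cross entropy loss (surprisal) in bits, can be illustrated with a minimal sketch. It assumes a simplified single-label setting in which each example has one true class and the model outputs a probability per class; the paper predicts mixtures of morphologies, so its exact evaluation may differ. The function names are hypothetical and are not taken from the authors' repository.

```python
import math


def error_rate(probs, labels):
    """Fraction of examples whose most probable class is not the true class."""
    wrong = 0
    for p, y in zip(probs, labels):
        # Index of the highest predicted probability (first index wins ties).
        predicted = max(range(len(p)), key=p.__getitem__)
        wrong += predicted != y
    return wrong / len(labels)


def mean_surprisal_bits(probs, labels):
    """Average of -log2 p(true class); undefined if a true class gets probability 0."""
    return sum(-math.log2(p[y]) for p, y in zip(probs, labels)) / len(labels)


if __name__ == "__main__":
    # Illustrative probabilities only, not data from the paper.
    probs = [[0.5, 0.5], [0.25, 0.75]]
    labels = [0, 1]
    print(error_rate(probs, labels), mean_surprisal_bits(probs, labels))
```

    A model that assigns probability 0.5 to the true class incurs exactly 1 bit of surprisal on that example, which gives a feel for the "1 bit" figure reported above.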

  2. NIST Excerpts Benchmark Data

    • nist.gov
    • data.nist.gov
    • +1 more
    Updated Jan 31, 2025
    Cite
    National Institute of Standards and Technology (2025). NIST Excerpts Benchmark Data [Dataset]. http://doi.org/10.18434/mds2-2895
    Dataset updated
    Jan 31, 2025
    Dataset provided by
    National Institute of Standards and Technology: http://www.nist.gov/
    License

    https://www.nist.gov/open/license

    Description

    The NIST Excerpts Benchmark Data are a set of target data for deidentification algorithms. The data are configured to work with "SDNist: Synthetic Data Report Tool", a package for evaluating synthetic data generators: https://github.com/usnistgov/SDNist. An installation of SDNist will download the data resources automatically.

    Jan 2025 -- Benchmark Excerpts:

    • NIST American Community Survey (ACS) Data Excerpts: 24 demographic features over 40k records
    • NIST Survey of Business Owners (SBO) Data Excerpts: 130 demographic and financial features over 161k records

    The data are curated subsets of U.S. Census Bureau products.

  3. Data from: SpaceHASTEN: A Structure-Based Virtual Screening Tool for...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Dec 23, 2024
    Cite
    Tuomo Kalliokoski; Ainoleena Turku; Heikki Käsnänen (2024). SpaceHASTEN: A Structure-Based Virtual Screening Tool for Nonenumerated Virtual Chemical Libraries [Dataset]. http://doi.org/10.1021/acs.jcim.4c01790.s001
    Available download formats: zip
    Dataset updated
    Dec 23, 2024
    Dataset provided by
    ACS Publications
    Authors
    Tuomo Kalliokoski; Ainoleena Turku; Heikki Käsnänen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Given the size of the relevant chemical space for drug discovery, working with fully enumerated compound libraries (especially in three-dimensional (3D)) is unfeasible. Nonenumerated virtual chemical spaces are a practical solution to this issue, where compounds are described as building blocks which are then connected by rules. One concrete example of such is the BioSolveIT chemical spaces file format (.space). Tools to search these space-files exist that are using ligand-based methods including two-dimensional (2D) fingerprint similarity, substructure matching, and fuzzier similarity metrics such as FTrees. However, there is no software available that enables the screening of these nonenumerated spaces using protein structure as the input query. Here, a hybrid ligand/structure-based virtual screening tool, called SpaceHASTEN, was developed on top of SpaceLight, FTrees, LigPrep, and Glide to allow efficient structure-based virtual screening of nonenumerated chemical spaces. SpaceHASTEN was validated using three public targets picked from the DUD-E data set. It was able to retrieve a large number of diverse and novel high-scoring compounds (virtual hits) from nonenumerated chemical spaces of billions of molecules, after docking a few million compounds. The software can be freely used and is available from http://github.com/TuomoKalliokoski/SpaceHASTEN.

  4. Datasets for manuscript "Predicting chemical end-of-life scenarios using...

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Apr 1, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). Datasets for manuscript "Predicting chemical end-of-life scenarios using structure-based classification models" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-predicting-chemical-end-of-life-scenarios-using-structure-based-cl
    Dataset updated
    Apr 1, 2023
    Dataset provided by
    United States Environmental Protection Agency: http://www.epa.gov/
    Description

    As described in the README.md file, the GitHub repository github.com/USEPA/PRTR-QSTR-models/tree/data-driven contains Python scripts written to run Quantitative Structure–Transfer Relationship (QSTR) models, which are based on chemical structure-based machine learning (ML) models and support environmental regulatory decision-making. Using features associated with annual chemical transfer amounts, chemical generator industry sectors, environmental policy stringency, gross value added by industry sectors, chemical descriptors, and chemical unit prices, as in the GitHub repository PRTR_transfers, the QSTR models developed here can predict potential end-of-life (EoL) activities for chemicals transferred to off-site locations for EoL management. This contribution also shows that QSTR models aid in estimating the mass fraction allocation of chemicals of concern transferred off-site for EoL activities. The README.md file also describes the Python libraries required for running the code, how to use it, the output files obtained after running the Python script, and how to obtain all manuscript figures and results. This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, and M. Martín. Predicting Chemical End-of-Life Scenarios Using Structure-Based Classification Models. ACS Sustainable Chemistry & Engineering. American Chemical Society, Washington, DC, USA, 11(9): 3594-3602, (2023).

  5. Synthetic population housing and person records for the United States

    • search.gesis.org
    • openicpsr.org
    • +1 more
    Cite
    GESIS search, Synthetic population housing and person records for the United States [Dataset]. http://doi.org/10.3886/E100274V1
    Dataset provided by
    GESIS search
    ICPSR - Interuniversity Consortium for Political and Social Research
    License

    https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de567319

    Area covered
    United States
    Description

    Abstract (en):

    Inputs: The synthetic population was generated from the 2010-2014 ACS PUMS housing and person files. United States Department of Commerce. Bureau of the Census. (2017-03-06). American Community Survey 2010-2014 ACS 5-Year PUMS File [Data set]. Ann Arbor, MI: Inter-university Consortium of Political and Social Research [distributor]. http://doi.org/10.3886/E100486V1

    Persistent URL: http://doi.org/10.3886/E100486V1

    Funding support: This work is supported under Grant G-2015-13903 from the Alfred P. Sloan Foundation on "The Economics of Socially-Efficient Privacy and Confidentiality Management for Statistical Agencies" (PI: John M. Abowd).

    Outputs: There are 17 housing files (data/housing): repHus0.csv, repHus1.csv, ... repHus16.csv, and 32 person files (data/person): rep_recode_ACSpus0.csv, rep_recode_ACSpus1.csv, ... rep_recode_ACSpus31.csv. Files are split to be roughly equal in size. The files contain data for the entire country. Files are not split along any demographic characteristic. The person files and housing files must be concatenated to form a complete person file and a complete housing file, respectively.

    Programs: Programs that generated this release can be found at https://github.com/labordynamicsinstitute/SynUSpopulation/releases/tag/v201703-beta and http://doi.org/10.5281/zenodo.556424

    Additional information: For more information, see README.md.

    Smallest Geographic Unit: United States

    Funding institution(s): Alfred P. Sloan Foundation.
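
    The concatenation step mentioned in the description could be sketched as follows. This is an illustration only: it assumes every part file begins with the same header row (which the description does not confirm), and the helper names concat_csv_parts, housing_parts, and person_parts are hypothetical. The file names and counts are taken from the description above.

```python
import csv
from pathlib import Path


def concat_csv_parts(part_paths, out_path):
    """Concatenate CSV parts into one file, keeping only the first header row.

    Assumes each part starts with an identical header row (an assumption,
    not something stated in the dataset description).
    Returns the number of data rows written.
    """
    rows_written = 0
    header = None
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in part_paths:
            with open(path, newline="") as part_file:
                reader = csv.reader(part_file)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # skip empty parts
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


def housing_parts(data_dir):
    # 17 housing files: repHus0.csv ... repHus16.csv
    return [Path(data_dir) / "housing" / f"repHus{i}.csv" for i in range(17)]


def person_parts(data_dir):
    # 32 person files: rep_recode_ACSpus0.csv ... rep_recode_ACSpus31.csv
    return [Path(data_dir) / "person" / f"rep_recode_ACSpus{i}.csv" for i in range(32)]
```

    For example, concat_csv_parts(housing_parts("data"), "housing_all.csv") would build the complete housing file, with "housing_all.csv" being an arbitrary output name. Rows are streamed one at a time, so the combined file never has to fit in memory.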

  6. Data from: SynthI: A New Open-Source Tool for Synthon-Based Library Design

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated May 31, 2023
    Cite
    Yuliana Zabolotna; Dmitriy M. Volochnyuk; Sergey V. Ryabukhin; Kostiantyn Gavrylenko; Dragos Horvath; Olga Klimchuk; Oleksandr Oksiuta; Gilles Marcou; Alexandre Varnek (2023). SynthI: A New Open-Source Tool for Synthon-Based Library Design [Dataset]. http://doi.org/10.1021/acs.jcim.1c00754.s002
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Yuliana Zabolotna; Dmitriy M. Volochnyuk; Sergey V. Ryabukhin; Kostiantyn Gavrylenko; Dragos Horvath; Olga Klimchuk; Oleksandr Oksiuta; Gilles Marcou; Alexandre Varnek
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Most of the existing computational tools for de novo library design are focused on the generation, rational selection, and combination of promising structural motifs to form members of the new library. However, the absence of a direct link between the chemical space of the retrosynthetically generated fragments and the pool of available reagents makes such approaches appear as rather theoretical and reality-disconnected. In this context, here we present Synthons Interpreter (SynthI), a new open-source toolkit for de novo library design that allows merging those two chemical spaces into a single synthons space. Here synthons are defined as actual fragments with valid valences and special labels, specifying the position and the nature of reactive centers. They can be issued from either the “breakup” of reference compounds according to 38 retrosynthetic rules or real reagents, after leaving group withdrawal or transformation. Such an approach not only enables the design of synthetically accessible libraries and analog generation but also facilitates reagents (building blocks) analysis in the medicinal chemistry context. SynthI code is publicly available at https://github.com/Laboratoire-de-Chemoinformatique/SynthI.

  7. Data from: ACEGEN: Reinforcement Learning of Generative Chemical Agents for...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Nov 21, 2024
    Cite
    Albert Bou; Morgan Thomas; Sebastian Dittert; Carles Navarro; Maciej Majewski; Ye Wang; Shivam Patel; Gary Tresadern; Mazen Ahmad; Vincent Moens; Woody Sherman; Simone Sciabola; Gianni De Fabritiis (2024). ACEGEN: Reinforcement Learning of Generative Chemical Agents for Drug Discovery [Dataset]. http://doi.org/10.1021/acs.jcim.4c00895.s001
    Available download formats: zip
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    ACS Publications
    Authors
    Albert Bou; Morgan Thomas; Sebastian Dittert; Carles Navarro; Maciej Majewski; Ye Wang; Shivam Patel; Gary Tresadern; Mazen Ahmad; Vincent Moens; Woody Sherman; Simone Sciabola; Gianni De Fabritiis
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In recent years, reinforcement learning (RL) has emerged as a valuable tool in drug design, offering the potential to propose and optimize molecules with desired properties. However, striking a balance between capabilities, flexibility, reliability, and efficiency remains challenging due to the complexity of advanced RL algorithms and the significant reliance on specialized code. In this work, we introduce ACEGEN, a comprehensive and streamlined toolkit tailored for generative drug design, built using TorchRL, a modern RL library that offers thoroughly tested reusable components. We validate ACEGEN by benchmarking against other published generative modeling algorithms and show comparable or improved performance. We also show examples of ACEGEN applied in multiple drug discovery case studies. ACEGEN is accessible at https://github.com/acellera/acegen-open and available for use under the MIT license.

  8. Dataset for "Rapid Adaptation of Chemical Named Entity Recognition using...

    • zenodo.org
    application/gzip, xz
    Updated Sep 19, 2025
    Cite
    Yue Zhang; Vlachos Dionisios G.; Liu Dongxia; Fang Hui (2025). Dataset for "Rapid Adaptation of Chemical Named Entity Recognition using Few-Shot Learning and LLM Distillation" [Dataset]. http://doi.org/10.5281/zenodo.14788490
    Available download formats: xz, application/gzip
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Yue Zhang; Vlachos Dionisios G.; Liu Dongxia; Fang Hui
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 2, 2025
    Description

    Github code link: https://github.com/nsndimt/ChemSSP
    JCIM paper link: https://pubs.acs.org/doi/10.1021/acs.jcim.5c00248

    Note: use tar xf [filename].tar.xz to decompress all files

    Each of the following compressed files corresponds to a folder under the project root:

    • model.tar.xz: the SciBERT model checkpoint
    • data.tar.xz: the six human-annotated chemical NER datasets
    • episode.tar.xz: sampled episodes for the six datasets
    • chatGPT.tar.xz, claude.tar.xz, gemini.tar.xz: LLM annotation files and corresponding sampled episodes

    Additionally, codereadme.tar.xz contains all the code and a readme file under the project root. More detailed instructions are in the readme.
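
    For readers who prefer scripting the decompression over running tar xf by hand, a minimal Python sketch using the standard-library tarfile module follows. It assumes the archives sit in one download directory and that each archive already contains its top-level folder, as the note above suggests; the function name extract_archives is hypothetical. Only extract archives you trust, since extraction writes whatever paths the archive contains.

```python
import tarfile
from pathlib import Path

# Archive names as listed in the dataset description.
ARCHIVES = [
    "model.tar.xz",
    "data.tar.xz",
    "episode.tar.xz",
    "chatGPT.tar.xz",
    "claude.tar.xz",
    "gemini.tar.xz",
    "codereadme.tar.xz",
]


def extract_archives(archive_dir, project_root, names=ARCHIVES):
    """Extract every listed .tar.xz found in archive_dir into project_root.

    Archives that are missing are skipped. Returns the names that were extracted.
    """
    archive_dir = Path(archive_dir)
    project_root = Path(project_root)
    project_root.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in names:
        archive = archive_dir / name
        if not archive.exists():
            continue
        # "r:xz" opens the archive for reading with LZMA (xz) decompression.
        with tarfile.open(archive, "r:xz") as tar:
            tar.extractall(project_root)
        extracted.append(name)
    return extracted
```

    Calling extract_archives("downloads", "ChemSSP") would unpack whichever of the listed archives are present in a hypothetical downloads directory; both directory names are placeholders.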

    Please email zhangyue@udel.edu if you have any questions.

  9. Data from: CAT: A Compound Attachment Tool for the Construction of Composite...

    • acs.figshare.com
    • figshare.com
    txt
    Updated Jun 5, 2023
    Cite
    Bas van Beek; Juliette Zito; Lucas Visscher; Ivan Infante (2023). CAT: A Compound Attachment Tool for the Construction of Composite Chemical Compounds [Dataset]. http://doi.org/10.1021/acs.jcim.2c00690.s002
    Available download formats: txt
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Bas van Beek; Juliette Zito; Lucas Visscher; Ivan Infante
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The continuous improvement of computer architectures allows for the simulation of molecular systems of growing sizes. However, such calculations still require the input of initial structures, which are also becoming increasingly complex. In this work, we present CAT, a Compound Attachment Tool (source code available at https://github.com/nlesc-nano/CAT) and Python package for the automatic construction of composite chemical compounds, which supports the functionalization of organic, inorganic, and hybrid organic–inorganic materials. The CAT workflow consists in defining the anchoring sites on the reference material, usually a large molecular system denoted as a scaffold, and on the molecular species that are attached to it, i.e., the ligands. Usually, ligands are pre-optimized in a conformation biased toward more linear structures to minimize interligand(s) steric interactions, a bias that is important when multiple ligands are attached onto the scaffold. The resulting superstructure(s) are then stored in various formats that can be used afterward in quantum chemical calculations or classical force field-based simulations.

  10. Data from: Correlated RNN Framework to Quickly Generate Molecules with...

    • acs.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Chuan Li; Chenghui Wang; Ming Sun; Yan Zeng; Yuan Yuan; Qiaolin Gou; Guangchuan Wang; Yanzhi Guo; Xuemei Pu (2023). Correlated RNN Framework to Quickly Generate Molecules with Desired Properties for Energetic Materials in the Low Data Regime [Dataset]. http://doi.org/10.1021/acs.jcim.2c00997.s002
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Chuan Li; Chenghui Wang; Ming Sun; Yan Zeng; Yuan Yuan; Qiaolin Gou; Guangchuan Wang; Yanzhi Guo; Xuemei Pu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Motivated by the challenge of deep learning in the low data regime and the urgent demand for intelligent design of highly energetic materials, we explore a correlated deep learning framework, which consists of three recurrent neural networks (RNNs) correlated by the transfer learning strategy, to efficiently generate new energetic molecules with a high detonation velocity when very limited data are available. To avoid dependence on an external big data set, data augmentation by fragment shuffling of 303 energetic compounds is utilized to produce 500,000 molecules to pretrain the RNN, through which the model can learn sufficient structure knowledge. Then the pretrained RNN is fine-tuned by focusing on the 303 energetic compounds to generate 7153 molecules similar to the energetic compounds. In order to more reliably screen the molecules with a high detonation velocity, SMILES enumeration augmentation coupled with the pretrained knowledge is utilized to build an RNN-based prediction model, through which R² is boosted from 0.4446 to 0.9572. The comparable performance with the transfer learning strategy based on an existing big database (ChEMBL) to produce the energetic molecules and drug-like ones further supports the effectiveness and generality of our strategy in the low data regime. High-precision quantum mechanics calculations further confirm that 35 new molecules present a higher detonation velocity and lower synthetic accessibility than the classic explosive RDX, along with good thermal stability. In particular, three new molecules are comparable to caged CL-20 in the detonation velocity. All the source code and the data set are freely available at https://github.com/wangchenghuidream/RNNMGM.

  11. Data from: Incorporation of Ligand Charge and Metal Oxidation State...

    • acs.figshare.com
    xlsx
    Updated Dec 23, 2024
    Cite
    Marco Gibaldi; Anna Kapeliukha; Andrew White; Tom K. Woo (2024). Incorporation of Ligand Charge and Metal Oxidation State Considerations into the Computational Solvent Removal and Activation of Experimental Crystal Structures Preceding Molecular Simulation [Dataset]. http://doi.org/10.1021/acs.jcim.4c01897.s002
    Available download formats: xlsx
    Dataset updated
    Dec 23, 2024
    Dataset provided by
    ACS Publications
    Authors
    Marco Gibaldi; Anna Kapeliukha; Andrew White; Tom K. Woo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Efficient computational screenings are integral to materials discovery in highly sought-after gas adsorption and storage applications, such as CO2 capture. Preprocessing techniques have been developed to render experimental crystal structures suitable for molecular simulations by mimicking experimental activation protocols, particularly residual solvent removal. Current accounts examining these preprocessed materials databases indicate the presence of assorted structural errors introduced by solvent removal and preprocessing, including improper elimination of charge-balancing ions and ligands. Here, we address the need for a reliable experimental crystal structure preprocessing protocol by introducing a novel solvent removal method, which we call SAMOSA, that is informed by systematic ligand charge and metal oxidation state calculations. A robust set of solvent removal criteria is outlined, which identifies solvent molecules and counterions without predefined molecule lists or significant reliance on experimental chemical information. Validation results against popular metal–organic framework (MOF) databases suggest that this method observes significant performance improvements regarding the retention of charged ligands and recognition of charged frameworks. SAMOSA enhances structure fidelity with respect to the original material as-synthesized, thereby representing a powerful tool in computational materials database curation and preprocessing for molecular simulation. The source code is accessible at https://github.com/uowoolab/SAMOSA.

  12. Data from: SpectroIBIS: Automated Data Processing for Multiconformer Quantum...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1 more
    xlsx
    Updated Feb 7, 2025
    Cite
    Brodie W. Bulcock; Yit-Heng Chooi; Gavin R. Flematti (2025). SpectroIBIS: Automated Data Processing for Multiconformer Quantum Chemical Spectroscopic Calculations [Dataset]. http://doi.org/10.1021/acs.jnatprod.4c01321.s001
    Available download formats: xlsx
    Dataset updated
    Feb 7, 2025
    Dataset provided by
    ACS Publications
    Authors
    Brodie W. Bulcock; Yit-Heng Chooi; Gavin R. Flematti
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Quantum chemical spectroscopic calculations have grown increasingly popular in natural products research for aiding the elucidation of chemical structures, especially their stereochemical configurations. These calculations have become faster with modern computational speeds, but subsequent data handling, inspection, and presentation remain key bottlenecks for many researchers. In this article, we introduce the SpectroIBIS computer program as a user-friendly tool to automate tedious tasks commonly encountered in this workflow. Through a simple graphical user interface, researchers can drag and drop Gaussian or ORCA output files to produce Boltzmann-averaged ECD, VCD, UV–vis and IR data, optical rotations, and/or 1H and 13C NMR chemical shifts in seconds. Also produced are formatted, publication-quality supplementary data tables containing conformer energies and atomic coordinates, saved to a DOCX file compatible with Microsoft Word and LibreOffice. Importantly, SpectroIBIS can assist researchers in finding common calculation issues by automatically checking for redundant conformers and imaginary frequencies. Additional useful features include recognition of conformer energy recalculations at a higher theory level, and automated generation of input files for quantum chemistry programs with optional exclusion of high-energy conformers. Lastly, we demonstrate the applicability of SpectroIBIS with spectroscopic calculations for five natural products. SpectroIBIS is open-source software available as a free desktop application (https://github.com/bbulcock/SpectroIBIS).

  13. Data from: Clustering of Synthetic Routes Using Tree Edit Distance

    • acs.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Samuel Genheden; Ola Engkvist; Esben Bjerrum (2023). Clustering of Synthetic Routes Using Tree Edit Distance [Dataset]. http://doi.org/10.1021/acs.jcim.1c00232.s001
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Samuel Genheden; Ola Engkvist; Esben Bjerrum
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We present a novel algorithm to compute the distance between synthetic routes based on tree edit distances. Such distances can be used to cluster synthesis routes generated using a retrosynthesis prediction tool. We show that the clustering of selected routes from a retrosynthesis analysis is performed in less than 10 s on average and only constitutes seven percent of the total time (prediction + clustering). Furthermore, we are able to show that representative routes from each cluster can be used to reduce the set of predicted routes. Finally, we show with a number of examples that the algorithm gives intuitive clusters that can be easily rationalized and that the routes in a cluster tend to use similar chemistry. The algorithm is included in the latest version of open-source AiZynthFinder software (https://github.com/MolecularAI/aizynthfinder) and as a separate package (https://github.com/MolecularAI/route-distances).

  14. Data from: sbml-diff: A Tool for Visually Comparing SBML Models in Synthetic...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    James Scott-Brown; Antonis Papachristodoulou (2023). sbml-diff: A Tool for Visually Comparing SBML Models in Synthetic Biology [Dataset]. http://doi.org/10.1021/acssynbio.6b00273.s002
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    James Scott-Brown; Antonis Papachristodoulou
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We present sbml-diff, a tool that is able to read a model of a biochemical reaction network in SBML format and produce a range of diagrams showing different levels of detail. Each diagram type can be used to visualize a single model or to visually compare two or more models. The default view depicts species as ellipses, reactions as rectangles, rules as parallelograms, and events as diamonds. A cartoon view replaces the symbols used for reactions on the basis of the associated Systems Biology Ontology terms. An abstract view represents species as ellipses and draws edges between them to indicate whether a species increases or decreases the production or degradation of another species. sbml-diff is freely licensed under the three-clause BSD license and can be downloaded from https://github.com/jamesscottbrown/sbml-diff and used as a python package called from other software, as a free-standing command-line application, or online using the form at http://sysos.eng.ox.ac.uk/tebio/upload

  15. Data from: pICalculax: Improved Prediction of Isoelectric Point for Modified...

    • acs.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Esben J. Bjerrum; Jan H. Jensen; Jakob L. Tolborg (2023). pICalculax: Improved Prediction of Isoelectric Point for Modified Peptides [Dataset]. http://doi.org/10.1021/acs.jcim.7b00030.s002
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Esben J. Bjerrum; Jan H. Jensen; Jakob L. Tolborg
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The isoelectric point of a peptide is a physicochemical property that can be accurately predicted from the sequence of the peptide when the peptide is built from natural amino acids. Peptides can however have chemical modifications, such as phosphorylations, amidations, and unnatural amino acids, which can result in erroneous predictions if not accounted for. Here we report on an open source program, pICalculax, which in an extensible way can handle pI calculations of modified peptides. Tests on a database of modified peptides and experimentally determined pI values show an improvement in pI predictions when taking the modifications into account. The correlation coefficient improves from 0.45 to 0.91, and the root-mean-square deviation likewise improves from 3.3 to 0.9. The program is available at https://github.com/EBjerrum/pICalculax
