27 datasets found
  1. Data from: Active Learning Guided Drug Design Lead Optimization Based on...

    • acs.figshare.com
    zip
    Updated May 30, 2023
    Cite
    Filipp Gusev; Evgeny Gutkin; Maria G. Kurnikova; Olexandr Isayev (2023). Active Learning Guided Drug Design Lead Optimization Based on Relative Binding Free Energy Modeling [Dataset]. http://doi.org/10.1021/acs.jcim.2c01052.s002
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Filipp Gusev; Evgeny Gutkin; Maria G. Kurnikova; Olexandr Isayev
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In silico identification of potent protein inhibitors commonly requires prediction of a ligand binding free energy (BFE). Thermodynamic integration (TI) based on molecular dynamics (MD) simulations is a BFE calculation method capable of producing accurate BFE values, but it is computationally expensive and time-consuming. In this work, we have developed an efficient automated workflow for identifying compounds with the lowest BFE among thousands of congeneric ligands, which requires only hundreds of TI calculations. Automated machine learning (AutoML) orchestrated by active learning (AL) in an AL–AutoML workflow allows an unbiased and efficient search for a small set of best-performing molecules. We have applied this workflow to select inhibitors of the SARS-CoV-2 papain-like protease and were able to find 133 compounds with improved binding affinity, including 16 compounds with better than 100-fold binding affinity improvement. We obtained a hit rate that exceeds what is expected of traditional expert medicinal chemist-guided campaigns. Thus, we demonstrate that the combination of AL and AutoML with free energy simulations provides at least a 20× speedup relative to a naïve brute-force approach.

  2. Data from: Fast Bayesian force fields from active learning: study of...

    • materialscloud-archive-failover.cineca.it
    • archive.materialscloud.org
    text/markdown, txt +1
    Updated Dec 1, 2020
    + more versions
    Cite
    Yu Xie; Jonathan Vandermause; Lixin Sun; Andrea Cepellotti; Boris Kozinsky (2020). Fast Bayesian force fields from active learning: study of inter-dimensional transformation of stanene [Dataset]. http://doi.org/10.24435/materialscloud:qg-99
    Available download formats: zip, text/markdown, txt
    Dataset updated
    Dec 1, 2020
    Dataset provided by
    Materials Cloud
    Authors
    Yu Xie; Jonathan Vandermause; Lixin Sun; Andrea Cepellotti; Boris Kozinsky
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gaussian process (GP) regression is a promising technique for constructing machine learning force fields with built-in uncertainty quantification, which can be used to monitor the quality of model predictions. A current limitation of existing GP force fields is that the prediction cost grows linearly with the size of the training data set, making accurate GP predictions slow. In this work, we exploit the special structure of the kernel function to construct a mapping of the trained Gaussian process model, including both forces and their uncertainty predictions, onto spline functions of low-dimensional structural features. This method is incorporated in the Bayesian active learning workflow for training of Bayesian force fields. To demonstrate the capabilities of this method, we construct a force field for stanene and perform large-scale dynamics simulations of its structural evolution. We provide a fully open-source implementation of our method, as well as the training and testing examples with the stanene dataset.

  3. Data from: Uncertainty-aware molecular dynamics from Bayesian active...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 8, 2022
    Cite
    Vandermause, Jonathan (2022). Uncertainty-aware molecular dynamics from Bayesian active learning: Phase Transformations and Thermal Transport in SiC [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5797176
    Dataset updated
    Mar 8, 2022
    Dataset provided by
    Xie, Yu
    Johansson, Anders
    Protik, Nakib H.
    Ramakers, Senja
    Kozinsky, Boris
    Vandermause, Jonathan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning interatomic force fields are promising for combining high computational efficiency and accuracy in modeling quantum interactions and simulating atomic level processes. Active learning methods have been recently developed to train force fields efficiently and automatically. Among them, Bayesian active learning utilizes principled uncertainty quantification to make data acquisition decisions. In this work, we present an efficient Bayesian active learning workflow, where the force field is constructed from a sparse Gaussian process regression model based on atomic cluster expansion descriptors. To circumvent the high computational cost of the sparse Gaussian process uncertainty calculation, we formulate a high-performance approximate mapping of the uncertainty and demonstrate a speedup of several orders of magnitude. As an application, we train a model for silicon carbide (SiC), a wide-gap semiconductor with complex polymorphic structure and diverse technological applications in power electronics, nuclear physics and astronomy. We show that the high pressure phase transformation is accurately captured by the autonomous active learning workflow. The trained force field shows excellent agreement with both ab initio calculations and experimental measurements, and outperforms existing empirical models on vibrational and thermal properties. The active learning workflow is readily generalized to a wide range of systems and accelerates computational understanding and design.

  4. Fast Bayesian force fields from active learning and mapped Gaussian...

    • b2find.eudat.eu
    Updated Feb 4, 2023
    Cite
    (2023). Fast Bayesian force fields from active learning and mapped Gaussian processes: application to structural phase transition of stanene [Dataset]. https://b2find.eudat.eu/dataset/d71bd391-7231-590e-bac9-67059e2b2a7e
    Dataset updated
    Feb 4, 2023
    Description

    Gaussian process (GP) regression is a promising technique for constructing machine learning force fields with built-in uncertainty quantification, which can be used to monitor the quality of model predictions. A current limitation of existing GP force fields is that the prediction cost grows linearly with the size of the training data set, making accurate GP predictions slow. In this work, we exploit the special structure of the kernel function to construct a mapping of the trained Gaussian process model, including both forces and their uncertainty predictions, onto spline functions of low-dimensional structural features. This method is incorporated in the Bayesian active learning workflow for training of Bayesian force fields. To demonstrate the capabilities of this method, we construct a force field for stanene and perform large-scale dynamics simulations of its structural evolution. We provide a fully open-source implementation of our method, as well as the training and testing examples with the stanene dataset.

  5. Data from: Active Learning FEP: Impact on Performance of AL Protocol and...

    • acs.figshare.com
    zip
    Updated Apr 17, 2025
    Cite
    Richard Lonsdale; Jack Glancy; Leen Kalash; David Marcus; Ian D. Wall (2025). Active Learning FEP: Impact on Performance of AL Protocol and Chemical Diversity [Dataset]. http://doi.org/10.1021/acs.jctc.5c00128.s002
    Available download formats: zip
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    ACS Publications
    Authors
    Richard Lonsdale; Jack Glancy; Leen Kalash; David Marcus; Ian D. Wall
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Active learning using models built on binding potency predictions from free energy perturbation (AL-FEP) has been proposed as a method for generating machine learning models capable of predicting biochemical potency for early-stage lead optimization where limited measured data are available. Two applications of AL-FEP are described here for different bromodomain inhibitor series that were developed in historic GSK projects: one where the core is kept constant and the other where core changes are included in the pool of compound ideas. Measured biochemical potency data have been used to assess the performance of the final models and demonstrate that well-performing models can be generated within several rounds of active learning, especially when the core is kept constant. To apply this method routinely to drug discovery projects, a retrospective evaluation of the AL-FEP workflow has been conducted covering parameters including the compound selection strategy, explore–exploit ratios, and number of compounds selected per cycle. Significant differences in performance in terms of model enrichment and R² are observed and rationalized. Recommendations are made as to when specific parameters should be employed for AL-FEP depending on the context (maximizing potency or broad-range prediction accuracy) in which the final model is to be deployed.
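
    The parameters named in this description (selection strategy, explore–exploit ratio, compounds per cycle) can be illustrated with a minimal sketch. It is not the authors' protocol: the function name `select_batch`, the use of predicted potency for the exploit part and predictive uncertainty for the explore part, and the toy numbers are all assumptions made for illustration.

```python
def select_batch(pred_mean, pred_std, batch_size, exploit_fraction):
    """Pick compound indices for the next cycle (hypothetical helper).

    pred_mean: predicted potency per compound (larger is better here).
    pred_std: predictive uncertainty per compound.
    exploit_fraction: share of the batch taken from the top predictions.
    """
    n_exploit = round(batch_size * exploit_fraction)
    indices = range(len(pred_mean))

    # Exploit: compounds with the highest predicted potency.
    exploit = sorted(indices, key=lambda i: -pred_mean[i])[:n_exploit]

    # Explore: most uncertain compounds among those not already chosen.
    chosen = set(exploit)
    remaining = [i for i in indices if i not in chosen]
    explore = sorted(remaining, key=lambda i: -pred_std[i])[: batch_size - n_exploit]

    return exploit + explore


# Toy pool of five compounds (made-up values, for illustration only).
pred_mean = [5.0, 7.5, 6.0, 8.0, 4.0]
pred_std = [0.9, 0.1, 0.2, 0.3, 0.8]
picked = select_batch(pred_mean, pred_std, batch_size=4, exploit_fraction=0.5)
print(picked)
```

    Raising `exploit_fraction` biases the batch toward the most potent predictions, while lowering it spends more of each cycle on poorly characterized compounds, which is the trade-off the description refers to.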

  6. Data from: Active Learning FEP Using 3D-QSAR for Prioritizing Bioisosteres...

    • acs.figshare.com
    zip
    Updated Apr 29, 2025
    Cite
    Venkata K. Ramaswamy; Matthew Habgood; Mark D. Mackey (2025). Active Learning FEP Using 3D-QSAR for Prioritizing Bioisosteres in Medicinal Chemistry [Dataset]. http://doi.org/10.1021/acsmedchemlett.4c00554.s002
    Available download formats: zip
    Dataset updated
    Apr 29, 2025
    Dataset provided by
    ACS Publications
    Authors
    Venkata K. Ramaswamy; Matthew Habgood; Mark D. Mackey
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Bioisostere replacement is a powerful and popular tool used to optimize the potency and selectivity of candidate molecules in drug discovery. Selecting the right bioisosteres to invest resources in for synthesis and subsequent optimization is key to an efficient drug discovery project. Here we demonstrate how 3D-quantitative structure–activity relationship (3D-QSAR) and relative binding free energy calculations can be combined into an active learning workflow to prioritize molecules from a pool of hundreds of bioisosteres. We demonstrate on a human aldose reductase test case that the use of this workflow can rapidly locate the strongest-binding bioisosteric replacements with a relatively modest computational cost.

  7. Atomistic fracture in bcc iron revealed by active learning of Gaussian...

    • b2find.eudat.eu
    Updated Oct 22, 2023
    + more versions
    Cite
    (2023). Atomistic fracture in bcc iron revealed by active learning of Gaussian approximation potential [Dataset]. https://b2find.eudat.eu/dataset/b216c3fe-d905-5298-b073-fdcabc6bf4b2
    Dataset updated
    Oct 22, 2023
    Description

    Existing classical interatomic potentials for bcc iron predict contradictory crack-tip mechanisms (i.e. cleavage, dislocation emission, phase transition) for the same crack systems, thus leaving the crack propagation mechanism in bcc iron unclear. In this work, we develop a Gaussian approximation potential (GAP) by extending a DFT database for ferromagnetic bcc iron to include highly distorted primitive bcc cells and surface separation, along with small crack-tip configurations that are identified by means of a fully automated active learning workflow. Our GAP (referred to as Fe-GAP22) predicts crack propagation within 8 meV/atom accuracy. The fully automated active learning workflow is made publicly available on GitHub. With the newly developed Fe-GAP22, we find that in the absence of other defects around the crack tip (e.g. nanovoids, dislocations), the static (T = 0 K) crack-tip mechanism is cleavage, thus settling the contradictions in the literature. Our work also highlights the need for multi-scale modelling to predict fracture at finite temperatures and finite strain rates.

  8. Data from: Active Learning-Guided Hit Optimization for the Leucine-Rich...

    • acs.figshare.com
    xlsx
    Updated May 26, 2025
    Cite
    Filipp Gusev; Evgeny Gutkin; Francesco Gentile; Fuqiang Ban; S. Benjamin Koby; Fengling Li; Irene Chau; Suzanne Ackloo; Cheryl H. Arrowsmith; Albina Bolotokova; Pegah Ghiabi; Elisa Gibson; Levon Halabelian; Scott Houliston; Rachel J. Harding; Ashley Hutchinson; Peter Loppnau; Sumera Perveen; Almagul Seitova; Hong Zeng; Matthieu Schapira; Artem Cherkasov; Olexandr Isayev; Maria G. Kurnikova (2025). Active Learning-Guided Hit Optimization for the Leucine-Rich Repeat Kinase 2 WDR Domain Based on In Silico Ligand-Binding Affinities [Dataset]. http://doi.org/10.1021/acs.jcim.5c00588.s002
    Available download formats: xlsx
    Dataset updated
    May 26, 2025
    Dataset provided by
    ACS Publications
    Authors
    Filipp Gusev; Evgeny Gutkin; Francesco Gentile; Fuqiang Ban; S. Benjamin Koby; Fengling Li; Irene Chau; Suzanne Ackloo; Cheryl H. Arrowsmith; Albina Bolotokova; Pegah Ghiabi; Elisa Gibson; Levon Halabelian; Scott Houliston; Rachel J. Harding; Ashley Hutchinson; Peter Loppnau; Sumera Perveen; Almagul Seitova; Hong Zeng; Matthieu Schapira; Artem Cherkasov; Olexandr Isayev; Maria G. Kurnikova
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The leucine-rich repeat kinase 2 (LRRK2) is the most mutated gene in familial Parkinson’s disease, and its mutations lead to pathogenic hallmarks of the disease. The LRRK2 WDR domain is an understudied drug target for Parkinson’s disease, with no known inhibitors prior to the first phase of the Critical Assessment of Computational Hit-Finding Experiments (CACHE) Challenge. A unique advantage of the CACHE Challenge is that the predicted molecules are experimentally validated in-house. Here, we report the design and experimental confirmation of LRRK2 WDR inhibitor molecules. We used an active learning (AL) machine learning (ML) workflow based on optimized free-energy molecular dynamics (MD) simulations utilizing the thermodynamic integration (TI) framework to expand a chemical series around two of our previously confirmed hit molecules. We identified 8 experimentally verified novel inhibitors out of 35 experimentally tested (23% hit rate). These results demonstrate the efficacy of our free-energy-based active learning workflow to explore large chemical spaces quickly and efficiently while minimizing the number and length of expensive simulations. This workflow is widely applicable to screening any chemical space for small-molecule analogs with increased affinity, subject to the general constraints of RBFE calculations. The mean absolute error of the TI MD calculations was 2.69 kcal/mol, with respect to the measured KD of hit compounds.

  9. Metadynamics enhanced training Datasets, DFT-SCAN accurate GAP Model and MD...

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Jicheng Guo; Vanessa Woo; David Andersson; Nathaniel Hoyt; Mark Williamson; Ian Foster; Chris Benmore; Nicholas E. Jackson; Ganesh Sivaraman (2023). Metadynamics enhanced training Datasets, DFT-SCAN accurate GAP Model and MD trajectories for "AL4GAP: Active Learning Workflow for generating DFT-SCAN Accurate Machine-Learning Potentials for Combinatorial Molten Salt Mixtures" [Dataset]. http://doi.org/10.6084/m9.figshare.22534981.v1
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Jicheng Guo; Vanessa Woo; David Andersson; Nathaniel Hoyt; Mark Williamson; Ian Foster; Chris Benmore; Nicholas E. Jackson; Ganesh Sivaraman
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metadynamics enhanced training Datasets, DFT-SCAN accurate GAP Model and MD trajectories for "AL4GAP: Active Learning Workflow for generating DFT-SCAN Accurate Machine-Learning Potentials for Combinatorial Molten Salt Mixtures"

  10. Data from: Solvent Selection for Mitsunobu Reaction Driven by an Active...

    • acs.figshare.com
    xlsx
    Updated Jun 5, 2023
    Cite
    Chonghuan Zhang; Yehia Amar; Liwei Cao; Alexei A. Lapkin (2023). Solvent Selection for Mitsunobu Reaction Driven by an Active Learning Surrogate Model [Dataset]. http://doi.org/10.1021/acs.oprd.0c00376.s001
    Available download formats: xlsx
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Chonghuan Zhang; Yehia Amar; Liwei Cao; Alexei A. Lapkin
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Optimization of a synthetic reaction with respect to solvent choice and operating conditions was implemented as a machine learning-based workflow. The approach is exemplified on the case study of selection of a promising solvent to maximize the yield of a Mitsunobu reaction producing isopropyl benzoate. A solvent was defined with 15 molecular descriptors, and a library of solvent descriptors was built. The descriptors were converted into a reduced dimensionality form using an Autoencoder. Experimental yields were used to train a multilayered artificial neural network (ANN) surrogate model, which was used for the optimization and design of experiments (DoE). DoE was performed in an active learning mode to reduce the number of experiments required for reaction optimization. The final surrogate model identified 1-chloropentane as a promising solvent, which resulted in an experimental yield of 93%.

  11. Data from: Atomistic fracture in bcc iron revealed by active learning of...

    • archive.materialscloud.org
    bin, text/markdown +1
    Updated Aug 11, 2022
    + more versions
    Cite
    Lei Zhang; Gábor Csányi; Erik van der Giessen; Francesco Maresca (2022). Atomistic fracture in bcc iron revealed by active learning of Gaussian approximation potential [Dataset]. http://doi.org/10.24435/materialscloud:ps-p7
    Available download formats: zip, bin, text/markdown
    Dataset updated
    Aug 11, 2022
    Dataset provided by
    Materials Cloud
    Authors
    Lei Zhang; Gábor Csányi; Erik van der Giessen; Francesco Maresca
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Existing classical interatomic potentials for bcc iron predict contradictory crack-tip mechanisms (i.e. cleavage, dislocation emission, phase transition) for the same crack systems, thus leaving the crack propagation mechanism in bcc iron unclear. In this work, we develop a Gaussian approximation potential (GAP) by extending a DFT database for ferromagnetic bcc iron to include highly distorted primitive bcc cells and surface separation, along with small crack-tip configurations that are identified by means of a fully automated active learning workflow. Our GAP (referred to as Fe-GAP22) predicts crack propagation within 8 meV/atom accuracy. The fully automated active learning workflow is made publicly available on GitHub. With the newly developed Fe-GAP22, we find that in the absence of other defects around the crack tip (e.g. nanovoids, dislocations), the static (T = 0 K) crack-tip mechanism is cleavage, thus settling the contradictions in the literature. Our work also highlights the need for multi-scale modelling to predict fracture at finite temperatures and finite strain rates.

  12. Data from: FEP Protocol Builder: Optimization of Free Energy Perturbation...

    • acs.figshare.com
    zip
    Updated Aug 18, 2023
    Cite
    César de Oliveira; Karl Leswing; Shulu Feng; René Kanters; Robert Abel; Sathesh Bhat (2023). FEP Protocol Builder: Optimization of Free Energy Perturbation Protocols Using Active Learning [Dataset]. http://doi.org/10.1021/acs.jcim.3c00681.s002
    Available download formats: zip
    Dataset updated
    Aug 18, 2023
    Dataset provided by
    ACS Publications
    Authors
    César de Oliveira; Karl Leswing; Shulu Feng; René Kanters; Robert Abel; Sathesh Bhat
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Significant improvements have been made in the past decade to methods that rapidly and accurately predict binding affinity through free energy perturbation (FEP) calculations. This has been driven by recent advances in small-molecule force fields and sampling algorithms combined with the availability of low-cost parallel computing. Predictive accuracies of ∼1 kcal mol⁻¹ have been regularly achieved, which are sufficient to drive potency optimization in modern drug discovery campaigns. Despite the robustness of these FEP approaches across multiple target classes, there are invariably target systems that do not display expected performance with default FEP settings. Traditionally, these systems required labor-intensive manual protocol development to arrive at parameter settings that produce a predictive FEP model. Due to the (a) relatively large parameter space to be explored, (b) significant compute requirements, and (c) limited understanding of how combinations of parameters can affect FEP performance, manual FEP protocol optimization can take weeks to months to complete, and often does not involve rigorous train-test set splits, resulting in potential overfitting. These manual FEP protocol development timelines do not coincide with tight drug discovery project timelines, essentially preventing the use of FEP calculations for these target systems. Here, we describe an automated workflow termed FEP Protocol Builder (FEP-PB) to rapidly generate accurate FEP protocols for systems that do not perform well with default settings. FEP-PB uses an active-learning workflow to iteratively search the protocol parameter space to develop accurate FEP protocols. To validate this approach, we applied it to pharmaceutically relevant systems where default FEP settings could not produce predictive models. We demonstrate that FEP-PB can rapidly generate accurate FEP protocols for the previously challenging MCL1 system with limited human intervention. We also apply FEP-PB in a real-world drug discovery setting to generate an accurate FEP protocol for the p97 system. FEP-PB is able to generate a more accurate protocol than the expert user, rapidly validating p97 as amenable to free energy calculations. Additionally, through the active-learning workflow, we are able to gain insight into which parameters are most important for a given system. These results suggest that FEP-PB is a robust tool that can aid in rapidly developing accurate FEP protocols and increasing the number of targets that are amenable to the technology.

  13. Replication Data for: The (real) need for a human touch: testing a...

    • search.dataone.org
    Updated Nov 12, 2023
    Cite
    Sebők, Miklós; Kacsuk, Zoltán; Máté, Ákos (2023). Replication Data for: The (real) need for a human touch: testing a human–machine hybrid topic classification workflow on a New York Times corpus [Dataset]. http://doi.org/10.7910/DVN/I24CYV
    Dataset updated
    Nov 12, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Sebők, Miklós; Kacsuk, Zoltán; Máté, Ákos
    Description

    The classification of the items of ever-increasing textual databases has become an important goal for a number of research groups active in the field of computational social science. Due to the increased amount of text data there is a growing number of use-cases where the initial effort of human classifiers was successfully augmented using supervised machine learning (SML). In this paper, we investigate such a hybrid workflow solution classifying the lead paragraphs of New York Times front-page articles from 1996 to 2006 according to policy topic categories (such as education or defense) of the Comparative Agendas Project (CAP). The SML classification is conducted in multiple rounds, and within each round, we run the SML algorithm on n samples and n times if the given algorithm is non-deterministic (e.g., SVM). If all the SML predictions point towards a single label for a document, then it is classified as such (this approach is also called a “voting ensemble”). In the second step, we explore several scenarios, ranging from using the SML ensemble without human validation to incorporating active learning. Using these scenarios, we can quantify the gains from the various workflow versions. We find that using human coding and validation combined with an ensemble SML hybrid approach can reduce the need for human coding while maintaining very high precision rates and offering a modest to good level of recall. The modularity of this hybrid workflow allows for various setups to address the idiosyncratic resource bottlenecks that a large-scale text classification project might face.
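
    The unanimity rule described above (a document is labelled automatically only when every SML run predicts the same label) can be sketched as follows. This is an illustrative sketch, not the authors' replication code: the function names, the dictionary layout, and the example labels are assumptions.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def unanimous_label(predictions: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the shared label if every run agrees, otherwise None."""
    if not predictions:
        return None
    first = predictions[0]
    return first if all(p == first for p in predictions) else None


def split_by_agreement(
    runs_per_doc: Dict[str, Sequence[Hashable]],
) -> Tuple[Dict[str, Hashable], List[str]]:
    """Separate documents the ensemble agrees on from those left for human coding."""
    accepted: Dict[str, Hashable] = {}
    to_review: List[str] = []
    for doc_id, predictions in runs_per_doc.items():
        label = unanimous_label(predictions)
        if label is None:
            to_review.append(doc_id)
        else:
            accepted[doc_id] = label
    return accepted, to_review


# Hypothetical predictions from three runs per document.
runs = {
    "doc-1": ["education", "education", "education"],
    "doc-2": ["defense", "education", "defense"],
}
accepted, to_review = split_by_agreement(runs)
print(accepted, to_review)
```

    Requiring full agreement rather than a simple majority trades recall for precision, which is consistent with the high precision and modest-to-good recall the description reports.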

  14. Data from: Thermodynamics-Informed Machine Learning for the Design of...

    • curate.nd.edu
    pdf
    Updated Nov 11, 2024
    Cite
    João Dinis Oliveira Abranches (2024). Thermodynamics-Informed Machine Learning for the Design of Sustainable Materials: The Dawn of Digital Molecular Discovery [Dataset]. http://doi.org/10.7274/25545994.v1
    Available download formats: pdf
    Dataset updated
    Nov 11, 2024
    Dataset provided by
    University of Notre Dame
    Authors
    João Dinis Oliveira Abranches
    License

    https://www.law.cornell.edu/uscode/text/17/106

    Description

    Scientists have traditionally employed trial-and-error methodologies to design novel materials, often complemented by basic heuristic rules or chemical intuition (e.g., “like dissolves like”). However, to date, this simplistic approach has led to the discovery and characterization of only a small fraction of all synthesizable compounds. Data-driven approaches such as machine learning are promising alternative routes to these traditional trial-and-error methodologies. Unfortunately, most machine learning models proposed so far do not embed chemical or thermodynamic information in their architectures and molecular descriptors. In turn, this leads to overly complex models that require a tremendous volume of experimental data to be properly trained.

    At the interface between artificial intelligence and green chemistry, the work developed throughout this dissertation uses thermodynamics-informed machine learning to bridge the gap between small, scarce datasets and data-driven approaches. This is accomplished using two major avenues. The first is through the development of active learning workflows, based on Gaussian process machine learning models, that target the description of activity coefficients. This unique approach was particularly directed at capturing the physicochemical properties of mixtures, namely deep eutectic solvents. Active learning was able to efficiently guide the acquisition of experimental data, and, in many cases, a single data point was sufficient to accurately describe mixture properties (namely phase equilibria diagrams), dramatically reducing the effort and cost necessary to characterize novel sustainable materials.

    The second major avenue lies in the development of a digital molecular space based on sigma profiles. These molecular descriptors, derived from quantum chemistry, were shown to be a powerful feature set for neural networks, leading to the accurate prediction of assorted physicochemical properties (e.g., boiling points and aqueous solubilities) for organic and inorganic molecules. A graph neural network was also developed to predict sigma profiles, bypassing the need for expensive quantum chemistry calculations. Finally, sigma profiles were shown to behave as a digital molecular space where optimization tasks can be performed. A remarkable example of this was that of Bayesian optimization towards boiling point optimization. Holding no knowledge of chemistry except for the sigma profile and normal boiling temperature of carbon monoxide (the worst possible initial guess), Bayesian optimization found the global maximum of the available normal boiling temperature dataset (over 1000 molecules encompassing more than 40 families of organic and inorganic compounds) in just fifteen iterations (i.e., fifteen property measurements), cementing sigma profiles as an ideal digital chemical space for molecular optimization and discovery, particularly when little experimental data is available.

  15. Data from: Dimensional Control over Metal Halide Perovskite Crystallization...

    • figshare.com
    txt
    Updated Jun 16, 2023
    Cite
    Zhi Li; Philip W. Nega; Mansoor Ani Najeeb Nellikkal; Chaochao Dun; Matthias Zeller; Jeffrey J. Urban; Wissam A. Saidi; Joshua Schrier; Alexander J. Norquist; Emory M. Chan (2023). Dimensional Control over Metal Halide Perovskite Crystallization Guided by Active Learning [Dataset]. http://doi.org/10.1021/acs.chemmater.1c03564.s002
    Available download formats: txt
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    ACS Publications
    Authors
    Zhi Li; Philip W. Nega; Mansoor Ani Najeeb Nellikkal; Chaochao Dun; Matthias Zeller; Jeffrey J. Urban; Wissam A. Saidi; Joshua Schrier; Alexander J. Norquist; Emory M. Chan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Metal halide perovskite (MHP) derivatives, a promising class of optoelectronic materials, have been synthesized with a range of dimensionalities that govern their optoelectronic properties and determine their applications. We demonstrate a data-driven approach combining active learning and high-throughput experimentation to discover, control, and understand the formation of phases with different dimensionalities in the morpholinium (morph) lead iodide system. Using a robot-assisted workflow, we synthesized and characterized two novel MHP derivatives that have distinct optical properties: a one-dimensional (1D) morphPbI3 phase ([C4H10NO][PbI3]) and a two-dimensional (2D) (morph)2PbI4 phase ([C4H10NO]2[PbI4]). To efficiently acquire the data needed to construct a machine learning (ML) model of the reaction conditions where the 1D and 2D phases are formed, data acquisition was guided by a diverse-mini-batch-sampling active learning algorithm, using prediction confidence as a stopping criterion. Querying the ML model uncovered the reaction parameters that have the most significant effects on dimensionality control. Based on these insights, we discuss possible reaction schemes that may selectively promote the formation of morph-Pb-I phases with different dimensionalities. The data-driven approach presented here, including the use of additives to manipulate dimensionality, will be valuable for controlling the crystallization of a range of materials over large reaction-composition spaces.

  16. Incongruent Melting and Phase Diagram of SiC from Machine Learning Molecular...

    • zenodo.org
    bin, zip
    Updated Mar 30, 2025
    Yu Xie; Menghang Wang; Senja Ramakers; Frans Spaepen; Boris Kozinsky (2025). Incongruent Melting and Phase Diagram of SiC from Machine Learning Molecular Dynamics (Part I) [Dataset]. http://doi.org/10.5281/zenodo.14648292
    Available download formats: zip, bin
    Dataset updated
    Mar 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yu Xie; Menghang Wang; Senja Ramakers; Frans Spaepen; Boris Kozinsky
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Silicon Carbide Machine Learning Molecular Dynamics Dataset (Part I)

    Silicon carbide (SiC) is an important technological material, but its high-temperature phase diagram has remained unclear due to conflicting experimental results about congruent versus incongruent melting. Here, we employ large-scale machine learning molecular dynamics (MLMD) simulations to gain insights into SiC decomposition and phase transitions. Our approach relies on a Bayesian active learning workflow to efficiently train an accurate machine learning force field on density functional theory data. Our large-scale simulations provide direct indication that melting of SiC proceeds incongruently via decomposition into silicon-rich and carbon phases at high temperature and pressure. During cooling at high pressures, carbon nanoclusters nucleate and grow within the homogeneous molten liquid. During heating, the decomposed mixture reversibly transitions back into a homogeneous SiC liquid. The full pressure-temperature phase diagram of SiC is systematically constructed using MLMD simulations, providing new understanding of the nature of phases, resolving long-standing inconsistencies from previous experiments and yielding technologically relevant implications for processing and deposition of this material.

    This dataset provides:

    • Large-scale (512K atoms) cooling MLMD trajectories
    • Medium-scale (8K atoms) cooling and heating MLMD trajectories
    • Two-phase (16K atoms) MLMD trajectories
    • ML force fields training data and uncertainty estimations
    Please refer to README.md for more dataset information. Additional MLMD trajectories with defects are available in https://doi.org/10.5281/zenodo.15066528.

  17. Data from: Data-Driven Multi-Objective Optimization Tactics for Catalytic...

    • acs.figshare.com
    zip
    Updated Jun 5, 2023
    Jordan J. Dotson; Lucy van Dijk; Jacob C. Timmerman; Samantha Grosslight; Richard C. Walroth; Francis Gosselin; Kurt Püntener; Kyle A. Mack; Matthew S. Sigman (2023). Data-Driven Multi-Objective Optimization Tactics for Catalytic Asymmetric Reactions Using Bisphosphine Ligands [Dataset]. http://doi.org/10.1021/jacs.2c08513.s001
    Available download formats: zip
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Jordan J. Dotson; Lucy van Dijk; Jacob C. Timmerman; Samantha Grosslight; Richard C. Walroth; Francis Gosselin; Kurt Püntener; Kyle A. Mack; Matthew S. Sigman
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Optimization of the catalyst structure to simultaneously improve multiple reaction objectives (e.g., yield, enantioselectivity, and regioselectivity) remains a formidable challenge. Herein, we describe a machine learning workflow for the multi-objective optimization of catalytic reactions that employ chiral bisphosphine ligands. This was demonstrated through the optimization of two sequential reactions required in the asymmetric synthesis of an active pharmaceutical ingredient. To accomplish this, a density functional theory-derived database of 550 bisphosphine ligands was constructed, and a designer chemical space mapping technique was established. The protocol used classification methods to identify active catalysts, followed by linear regression to model reaction selectivity. This led to the prediction and validation of significantly improved ligands for all reaction outputs, suggesting a general strategy that can be readily implemented for reaction optimizations where performance is controlled by bisphosphine ligands.

  18. Data Collection and Labeling market size was USD 2.41 Billion in 2022!

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated May 15, 2025
    Cognitive Market Research (2025). Data Collection and Labeling market size was USD 2.41 Billion in 2022! [Dataset]. https://www.cognitivemarketresearch.com/data-collection-and-labeling-market-report
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    As per Cognitive Market Research's latest published report, the global Data Collection and Labeling market size was USD 2.41 billion in 2022 and is forecast to reach USD 18.60 billion by 2030. The industry's compound annual growth rate (CAGR) will be 29.1% from 2023 to 2030.

    Key Dynamics of Data Collection And Labeling Market

    Key Drivers of Data Collection And Labeling Market

    Surge in AI and Machine Learning Adoption: The increasing integration of AI across various industries has led to a notable rise in the demand for high-quality labeled datasets. Precise data labeling is essential for training machine learning models, particularly in fields such as autonomous vehicles, healthcare diagnostics, and facial recognition.

    Proliferation of Unstructured Data: With the surge of images, videos, and audio data generated from digital platforms, businesses are in need of structured labeling services to transform raw data into usable datasets. This trend is propelling the growth of data annotation services, especially for applications in natural language processing and computer vision.

    Rising Use in Healthcare and Retail: Data labeling plays a vital role in applications such as medical imaging, drug discovery, and e-commerce personalization. Industries like healthcare and retail are allocating resources towards labeled datasets to enhance AI-driven diagnostics, recommendation systems, and predictive analytics, thereby increasing market demand.

    Key Restraints for Data Collection And Labeling Market

    High Cost and Time-Intensive Process: The process of manual data labeling is both labor-intensive and costly, particularly for intricate projects that necessitate expert annotators. This can pose a challenge for small businesses or startups that operate with limited budgets and stringent development timelines.

    Data Privacy and Compliance Challenges: Managing sensitive information, including personal photographs, biometric data, or patient records, raises significant concerns regarding security and regulatory compliance. Ensuring compliance with GDPR, HIPAA, or other data protection regulations complicates the data labeling process.

    Lack of Skilled Workforce: The industry is experiencing a shortage of qualified data annotators, especially in specialized areas such as radiology or autonomous systems. The inconsistency in labeling quality due to insufficient domain expertise can adversely affect the accuracy and reliability of AI models.

    Key Trends in Data Collection And Labeling Market

    Emergence of Automated and Semi-Automated Labeling Tools: Companies are progressively embracing AI-driven labeling tools to minimize manual labor. Innovations such as active learning, auto-labeling, and transfer learning are enhancing efficiency and accelerating the data preparation workflow.

    Expansion of Crowdsourcing Platforms: Crowdsourced data labeling via platforms like Amazon Mechanical Turk is gaining traction as a favored approach. It facilitates quicker turnaround times at reduced costs by utilizing a global workforce, particularly for tasks involving image classification, sentiment analysis, and object detection.

    Transition Towards Industry-Specific Labeling Solutions: Providers are creating domain-specific labeling platforms customized for sectors such as agriculture, autonomous vehicles, or legal technology. These specialized tools enhance accuracy, shorten time-to-market, and cater to the specific requirements of vertical AI applications.

    What is Data Collection and Labeling?

    Data collection and labeling is the process of gathering and organizing data and adding metadata to it for better analysis and understanding. This process is critical in machine learning and artificial intelligence, as it provides the foundation for training algorithms that can identify patterns and make predictions. Data collection involves gathering raw data from various sources, including sensors, databases, websites, and other forms of digital media. The collected data may be unstructured or structured, and it may be in different formats, such as text, images, videos, or audio.
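The market-size figures quoted in this entry can be sanity-checked with a short compound-growth calculation. The sketch below assumes eight annual compounding periods counted from the 2022 base year (2023 through 2030); the helper names `project` and `cagr` are illustrative and do not come from the report.

```python
def project(value, rate, years):
    # Compound `value` forward by `rate` per year for `years` years.
    return value * (1.0 + rate) ** years

def cagr(start, end, years):
    # Compound annual growth rate implied by a start and an end value.
    return (end / start) ** (1.0 / years) - 1.0

# USD 2.41 billion growing at 29.1% per year for eight years
projected_2030 = project(2.41, 0.291, 8)

# Growth rate implied by USD 2.41 billion (2022) and USD 18.60 billion (2030)
implied_rate = cagr(2.41, 18.60, 8)
```

Under that assumption the projection lands at roughly USD 18.6 billion and the implied rate at roughly 29.1%, so the three quoted numbers agree with one another. With only seven compounding periods they would not.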

  19. Data from: Artificial Intelligent Approaches for Navigating Thermodynamic,...

    • curate.nd.edu
    Updated Apr 23, 2025
    Etinosa James Osaro (2025). Artificial Intelligent Approaches for Navigating Thermodynamic, Molecular, and Material Design Space in Porous Materials [Dataset]. http://doi.org/10.7274/28656632.v1
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    University of Notre Dame
    Authors
    Etinosa James Osaro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The rapid discovery and optimization of metal-organic frameworks (MOFs) for adsorption, diffusion, and gas separation applications require computational strategies that efficiently balance predictive accuracy with computational cost. Traditional simulation techniques such as Classical and Quantum Chemistry methods provide accurate insights but are computationally prohibitive when applied to large-scale materials screening. Machine learning (ML) has emerged as a powerful tool to accelerate MOF discovery, but its effectiveness depends on the availability of large, high-quality training datasets. This dissertation integrates Active Learning (AL), Reinforcement Learning (RL), and Inducing Points (IPs) to systematically explore the thermodynamic, molecular, and materials design space of MOFs, significantly enhancing the efficiency of computational screening workflows. Active Learning is employed to optimize gas adsorption predictions in MOFs while minimizing the number of required simulations. Gaussian Process Regression (GPR) models, combined with various acquisition functions, guide iterative data selection for adsorption modeling, enabling high predictive accuracy with a fraction of the data typically required. Extending this approach, an alchemical molecule-based AL strategy is introduced to predict real-molecule adsorption using surrogate molecular interactions, reducing training dataset size while maintaining accuracy. Furthermore, AL is applied to selectivity predictions in gas separations by integrating adsorption and diffusion modeling into an end-to-end (E2E) framework, improving data acquisition efficiency and minimizing redundant sampling across adsorption and diffusion models. To refine training dataset selection, Bayesian Optimization strategies such as Expected Improvement (EI) and Probability of Improvement (PI) are integrated within an AL framework, ensuring that the most informative MOFs are selected based on key structural properties. 
    Inducing Points (IPs) are incorporated as a complementary strategy to further enhance model efficiency by selecting a representative subset of MOFs that captures the diversity of the full dataset. By leveraging kernel-based methods and principal component analysis (PCA), IPs reduce training data requirements while maintaining model generalizability. A comparative analysis across different AL, BO, and IP-based approaches reveals that combining these strategies significantly improves model robustness while minimizing computational expense. Reinforcement learning (RL) is also introduced to actively guide data selection for adsorption modeling. Using Q-learning within a Gaussian Process framework, RL optimizes the selection of MOFs for gas adsorption studies, improving predictive convergence while reducing computational cost compared to standard AL strategies. By integrating AL, BO, RL, and IP methodologies, this research establishes a scalable and computationally efficient framework for MOF screening, offering a transformative approach to materials discovery. The findings contribute to the broader field of AI-assisted materials informatics, facilitating the rapid identification of MOFs for applications in energy storage, carbon capture, and industrial gas separations. Through these innovations, this work advances the role of artificial intelligence in accelerating the exploration of porous materials, bridging the gap between computational efficiency and predictive accuracy.
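The abstract names Expected Improvement (EI) and Probability of Improvement (PI) without stating them. For reference, the standard textbook forms for a maximization problem are given below, assuming a Gaussian posterior with mean \mu(x) and standard deviation \sigma(x), an incumbent best value f^{*}, and \Phi and \phi as the standard normal CDF and PDF. Whether the dissertation uses exactly these forms, or adds an exploration offset, is not stated in this listing.

```latex
z(x) = \frac{\mu(x) - f^{*}}{\sigma(x)}, \qquad
\mathrm{PI}(x) = \Phi\bigl(z(x)\bigr), \qquad
\mathrm{EI}(x) = \bigl(\mu(x) - f^{*}\bigr)\,\Phi\bigl(z(x)\bigr) + \sigma(x)\,\phi\bigl(z(x)\bigr)
```

PI only scores how likely a candidate is to beat the incumbent, while EI also weighs by how much, which is why EI tends to favor candidates with larger predictive uncertainty.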

  20. Data from: Experimental Design Strategy: Weak Reinforcement Leads to...

    • acs.figshare.com
    xlsx
    Updated Jun 3, 2023
    Mateusz Maciejewski; Anne Mai Wassermann; Meir Glick; Eugen Lounkine (2023). Experimental Design Strategy: Weak Reinforcement Leads to Increased Hit Rates and Enhanced Chemical Diversity [Dataset]. http://doi.org/10.1021/acs.jcim.5b00054.s002
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Mateusz Maciejewski; Anne Mai Wassermann; Meir Glick; Eugen Lounkine
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    High Throughput Screening (HTS) is a common approach in life sciences to discover chemical matter that modulates a biological target or phenotype. However, low assay throughput, reagents cost, or a flowchart that can deal with only a limited number of hits may impair screening large numbers of compounds. In this case, a subset of compounds is assayed, and in silico models are utilized to aid in iterative screening design, usually to expand around the found hits and enrich subsequent rounds for relevant chemical matter. However, this may lead to an overly narrow focus, and the diversity of compounds sampled in subsequent iterations may suffer. Active learning has been recently successfully applied in drug discovery with the goal of sampling diverse chemical space to improve model performance. Here we introduce a robust and straightforward iterative screening protocol based on naïve Bayes models. Instead of following up on the compounds with the highest scores in the in silico model, we pursue compounds with very low but positive values. This includes unique chemotypes of weakly active compounds that enhance the applicability domain of the model and increase the cumulative hit rates. We show in a retrospective application to 81 Novartis assays that this protocol leads to consistently higher compound and scaffold hit rates compared to a standard expansion around hits or an active learning approach. We recommend using the weak reinforcement strategy introduced herein for iterative screening workflows.
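The selection rule in this abstract, following up on compounds with very low but positive model scores rather than the highest ones, can be illustrated with a small sketch. It assumes each untested compound already has a numeric score from some model. The function names and the example scores are hypothetical, and the paper's actual score definition and cutoffs are not given in this listing.

```python
def select_weak_positives(scores, n_pick):
    # Pick the n_pick compounds whose scores are positive but closest to zero.
    positives = [(score, cid) for cid, score in scores.items() if score > 0]
    positives.sort()
    return [cid for _, cid in positives[:n_pick]]

def select_top_scoring(scores, n_pick):
    # Conventional follow-up: pick the n_pick highest-scoring compounds.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [cid for cid, _ in ranked[:n_pick]]

# Hypothetical model scores for five untested compounds
example_scores = {"a": 5.2, "b": 0.1, "c": -1.0, "d": 0.4, "e": 2.0}
weak_picks = select_weak_positives(example_scores, 2)   # ["b", "d"]
top_picks = select_top_scoring(example_scores, 2)       # ["a", "e"]
```

The two rules choose disjoint compounds in this toy case, which is the point of the strategy: the weak-positive picks sample chemistry the model is less sure about instead of expanding around the strongest predicted hits.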
