Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In silico identification of potent protein inhibitors commonly requires prediction of a ligand binding free energy (BFE). Thermodynamic integration (TI) based on molecular dynamics (MD) simulations is a BFE calculation method capable of producing accurate BFEs, but it is computationally expensive and time-consuming. In this work, we have developed an efficient automated workflow for identifying compounds with the lowest BFE among thousands of congeneric ligands that requires only hundreds of TI calculations. Automated machine learning (AutoML) orchestrated by active learning (AL) in an AL–AutoML workflow allows an unbiased and efficient search for a small set of best-performing molecules. We applied this workflow to select inhibitors of the SARS-CoV-2 papain-like protease and found 133 compounds with improved binding affinity, including 16 compounds with better than 100-fold binding affinity improvement. The hit rate we obtained exceeds that expected of traditional expert medicinal-chemist-guided campaigns. Thus, we demonstrate that combining AL and AutoML with free energy simulations provides at least a 20× speedup over the naïve brute-force approach.
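The AL loop at the heart of such a workflow (train a cheap surrogate on the TI results gathered so far, then spend the next batch of expensive TI calculations on the ligands the surrogate ranks best) can be sketched in Python. Everything below is an illustrative toy: `toy_bfe` stands in for a real TI calculation, a quadratic least-squares fit stands in for AutoML model selection, and 1-D "ligands" stand in for molecular descriptors.

```python
import random

def toy_bfe(x):
    # Stand-in for an expensive TI/MD free-energy calculation (hypothetical).
    return (x - 0.3) ** 2

def fit_quadratic(xs, ys):
    # Least-squares fit y ~ a + b*x + c*x^2 via the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    A = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    rhs = t[:]
    for i in range(3):                       # Gaussian elimination w/ pivoting
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        rhs[i], rhs[p] = rhs[p], rhs[i]
        for r in range(i + 1, 3):
            f = A[r][i] / A[i][i]
            for c in range(i, 3):
                A[r][c] -= f * A[i][c]
            rhs[r] -= f * rhs[i]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                      # back-substitution
        coef[i] = (rhs[i] - sum(A[i][c] * coef[c] for c in range(i + 1, 3))) / A[i][i]
    return coef

def al_loop(pool, budget=25, batch=5, seed=1):
    rng = random.Random(seed)
    labeled = {x: toy_bfe(x) for x in rng.sample(pool, batch)}  # seed round
    while len(labeled) < budget:
        a, b, c = fit_quadratic(list(labeled), list(labeled.values()))
        unlabeled = [x for x in pool if x not in labeled]
        # Exploit: spend the next "TI" batch on the best-predicted ligands.
        for x in sorted(unlabeled, key=lambda x: a + b * x + c * x * x)[:batch]:
            labeled[x] = toy_bfe(x)
    return min(labeled, key=labeled.get)     # lowest observed BFE

pool = [i / 100 for i in range(100)]         # 100 candidate "ligands"
best = al_loop(pool)                         # only 25 of 100 ever evaluated
```

In a real campaign the surrogate would be retrained by an AutoML engine at each round and the batch size tuned to the available compute, but the loop structure is the same: label a little, retrain, and concentrate the expensive simulations where the model points.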
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gaussian process (GP) regression is a promising technique for constructing machine learning force fields with built-in uncertainty quantification, which can be used to monitor the quality of model predictions. A current limitation of existing GP force fields is that the prediction cost grows linearly with the size of the training data set, making accurate GP predictions slow. In this work, we exploit the special structure of the kernel function to construct a mapping of the trained Gaussian process model, including both forces and their uncertainty predictions, onto spline functions of low-dimensional structural features. This method is incorporated into the Bayesian active learning workflow for training Bayesian force fields. To demonstrate the capabilities of this method, we construct a force field for stanene and perform a large-scale dynamics simulation of its structural evolution. We provide a fully open-source implementation of our method, as well as training and testing examples with the stanene dataset.
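The speedup claimed here comes from paying the kernel-sum cost once, on a grid of low-dimensional features, and answering all later queries from a cheap interpolant. A toy 1-D illustration, with plain linear interpolation standing in for the paper's splines; the kernel, data, and hyperparameters are all made up:

```python
import math

def rbf(a, b, ell=0.5):
    # Squared-exponential kernel on a 1-D "structural feature".
    return math.exp(-0.5 * ((a - b) / ell) ** 2)

def solve(A, rhs):
    # Small dense linear solver (Gaussian elimination with partial pivoting).
    n = len(A)
    A = [row[:] for row in A]
    rhs = rhs[:]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        rhs[i], rhs[p] = rhs[p], rhs[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
            rhs[r] -= f * rhs[i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (rhs[i] - sum(A[i][c] * x[c] for c in range(i + 1, n))) / A[i][i]
    return x

X = [0.0, 0.5, 1.0, 1.5, 2.0]                # toy training features
y = [math.sin(x) for x in X]                 # toy training targets
K = [[rbf(a, b) + (1e-8 if i == j else 0.0) for j, b in enumerate(X)]
     for i, a in enumerate(X)]
alpha = solve(K, y)                          # K^-1 y, computed once

def gp_mean(x):
    # Direct GP prediction: cost grows linearly with len(X).
    return sum(rbf(x, xi) * ai for xi, ai in zip(X, alpha))

# Mapping step: tabulate the GP prediction once on a dense feature grid,
# then answer queries by interpolation at O(1) cost per query.
h = 0.01
grid = [i * h for i in range(201)]           # covers [0, 2]
table = [gp_mean(g) for g in grid]

def mapped_mean(x):
    i = min(int(x / h), 199)
    t = (x - grid[i]) / h
    return (1 - t) * table[i] + t * table[i + 1]
```

The same tabulation applies to the predictive variance, which is what makes the mapped model's uncertainty nearly free to evaluate during dynamics.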
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning interatomic force fields are promising for combining high computational efficiency and accuracy in modeling quantum interactions and simulating atomic-level processes. Active learning methods have recently been developed to train force fields efficiently and automatically. Among them, Bayesian active learning utilizes principled uncertainty quantification to make data acquisition decisions. In this work, we present an efficient Bayesian active learning workflow in which the force field is constructed from a sparse Gaussian process regression model based on atomic cluster expansion descriptors. To circumvent the high computational cost of the sparse Gaussian process uncertainty calculation, we formulate a high-performance approximate mapping of the uncertainty and demonstrate a speedup of several orders of magnitude. As an application, we train a model for silicon carbide (SiC), a wide-gap semiconductor with a complex polymorphic structure and diverse technological applications in power electronics, nuclear physics, and astronomy. We show that the high-pressure phase transformation is accurately captured by the autonomous active learning workflow. The trained force field shows excellent agreement with both ab initio calculations and experimental measurements, and outperforms existing empirical models on vibrational and thermal properties. The active learning workflow is readily generalized to a wide range of systems and accelerates computational understanding and design.
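The acquisition decision in such a workflow reduces to a simple rule: call the expensive ab initio code only when the model's predictive uncertainty on the current frame exceeds a threshold. A minimal sketch, with distance to the training set standing in for the sparse-GP uncertainty and 1-D "configurations" standing in for atomic environments:

```python
def proxy_uncertainty(x, train):
    # Cheap stand-in for the sparse-GP predictive uncertainty: distance
    # from the current configuration to the nearest training point.
    return min(abs(x - t) for t in train) if train else float("inf")

def bayesian_al(trajectory, threshold=0.25):
    train, dft_calls = [], 0
    for frame in trajectory:
        if proxy_uncertainty(frame, train) > threshold:
            train.append(frame)      # "call DFT", label the frame, retrain
            dft_calls += 1
        # otherwise the force field's own prediction is trusted
    return dft_calls, train

traj = [i * 0.1 for i in range(21)]  # a slowly evolving 1-D configuration
calls, train = bayesian_al(traj)     # only a fraction of frames need DFT
```

On this toy trajectory only 7 of the 21 frames trigger a "DFT" call; the rest are handled by the surrogate, which is the source of the workflow's efficiency.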
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Active learning using models built on binding potency predictions from free energy perturbation (AL-FEP) has been proposed as a method for generating machine learning models capable of predicting biochemical potency for early-stage lead optimization where limited measured data are available. Two applications of AL-FEP are described here for different bromodomain inhibitor series that were developed in historic GSK projects: one where the core is kept constant and the other where core changes are included in the pool of compound ideas. Measured biochemical potency data have been used to assess the performance of the final models and demonstrate that well-performing models can be generated within several rounds of active learning, especially when the core is kept constant. To apply this method routinely to drug discovery projects, a retrospective evaluation of the AL-FEP workflow has been conducted covering parameters including the compound selection strategy, explore–exploit ratios, and number of compounds selected per cycle. Significant differences in performance in terms of model enrichment and R² are observed and rationalized. Recommendations are made as to when specific parameters should be employed for AL-FEP depending on the context (maximizing potency or broad-range prediction accuracy) in which the final model is to be deployed.
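The explore–exploit ratio studied in such retrospective evaluations amounts to splitting each AL batch between the compounds the model likes best and the compounds it is least sure about. A sketch of that selection step (all scores hypothetical; `pred` maps a compound to its predicted binding free energy, `unc` to its model uncertainty):

```python
def select_batch(cands, pred, unc, n=4, explore_frac=0.5):
    # Split the AL batch: exploitation takes the best-predicted compounds,
    # exploration takes the most uncertain of the rest.
    n_explore = int(n * explore_frac)
    exploit = sorted(cands, key=pred)[: n - n_explore]   # lowest predicted dG
    rest = [c for c in cands if c not in exploit]
    explore = sorted(rest, key=unc, reverse=True)[:n_explore]
    return exploit + explore

cands = list(range(10))                                  # compound IDs (toy)
pred = {0: -9.1, 1: -8.0, 2: -7.5, 3: -6.0, 4: -5.5,
        5: -5.0, 6: -4.5, 7: -4.0, 8: -3.5, 9: -3.0}     # predicted dG
unc = {0: 0.2, 1: 0.3, 2: 0.1, 3: 0.4, 4: 1.5,
       5: 0.2, 6: 1.9, 7: 0.3, 8: 0.6, 9: 0.1}           # model uncertainty
batch = select_batch(cands, pred.get, unc.get, n=4)      # [0, 1, 6, 4]
```

Sweeping `explore_frac` between 0 and 1 is exactly the parameter scan such an evaluation performs: pure exploitation maximizes near-term potency, while added exploration improves broad-range prediction accuracy.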
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Bioisostere replacement is a powerful and popular tool used to optimize the potency and selectivity of candidate molecules in drug discovery. Selecting the right bioisosteres to invest resources in for synthesis and subsequent optimization is key to an efficient drug discovery project. Here we demonstrate how 3D-quantitative structure–activity relationship (3D-QSAR) and relative binding free energy calculations can be combined into an active learning workflow to prioritize molecules from a pool of hundreds of bioisosteres. We demonstrate on a human aldose reductase test case that the use of this workflow can rapidly locate the strongest-binding bioisosteric replacements with a relatively modest computational cost.
Existing classical interatomic potentials for bcc iron predict contradictory crack-tip mechanisms (i.e., cleavage, dislocation emission, phase transition) for the same crack systems, thus leaving the crack propagation mechanism in bcc iron unclear. In this work, we develop a Gaussian approximation potential (GAP) by extending a DFT database for ferromagnetic bcc iron to include highly distorted primitive bcc cells and surface separation, along with small crack-tip configurations that are identified by means of a fully automated active learning workflow. Our GAP (referred to as Fe-GAP22) predicts crack propagation within 8 meV/atom accuracy. The fully automated active learning workflow is made publicly available on GitHub. With the newly developed Fe-GAP22, we find that in the absence of other defects around the crack tip (e.g., nanovoids, dislocations), the static (T = 0 K) crack-tip mechanism is cleavage, thus settling the contradictions in the literature. Our work also highlights the need for multi-scale modelling to predict fracture at finite temperatures and finite strain rates.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The leucine-rich repeat kinase 2 (LRRK2) gene is the most frequently mutated gene in familial Parkinson’s disease, and its mutations lead to pathogenic hallmarks of the disease. The LRRK2 WDR domain is an understudied drug target for Parkinson’s disease, with no known inhibitors prior to the first phase of the Critical Assessment of Computational Hit-Finding Experiments (CACHE) Challenge. A unique advantage of the CACHE Challenge is that the predicted molecules are experimentally validated in-house. Here, we report the design and experimental confirmation of LRRK2 WDR inhibitor molecules. We used an active learning (AL) machine learning (ML) workflow based on optimized free-energy molecular dynamics (MD) simulations within the thermodynamic integration (TI) framework to expand a chemical series around two of our previously confirmed hit molecules. We identified 8 novel inhibitors out of 35 compounds experimentally tested (a 23% hit rate). These results demonstrate the efficacy of our free-energy-based active learning workflow for exploring large chemical spaces quickly and efficiently while minimizing the number and length of expensive simulations. This workflow is widely applicable to screening any chemical space for small-molecule analogs with increased affinity, subject to the general constraints of relative binding free energy (RBFE) calculations. The mean absolute error of the TI MD calculations was 2.69 kcal/mol with respect to the measured KD values of the hit compounds.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metadynamics enhanced training Datasets, DFT-SCAN accurate GAP Model and MD trajectories for "AL4GAP: Active Learning Workflow for generating DFT-SCAN Accurate Machine-Learning Potentials for Combinatorial Molten Salt Mixtures"
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Optimization of a synthetic reaction with respect to solvent choice and operating conditions was implemented as a machine learning-based workflow. The approach is exemplified on the case study of selection of a promising solvent to maximize the yield of a Mitsunobu reaction producing isopropyl benzoate. A solvent was defined with 15 molecular descriptors, and a library of solvent descriptors was built. The descriptors were converted into a reduced dimensionality form using an Autoencoder. Experimental yields were used to train a multilayered artificial neural network (ANN) surrogate model, which was used for the optimization and design of experiments (DoE). DoE was performed in an active learning mode to reduce the number of experiments required for reaction optimization. The final surrogate model identified 1-chloropentane as a promising solvent, which resulted in an experimental yield of 93%.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Significant improvements have been made in the past decade to methods that rapidly and accurately predict binding affinity through free energy perturbation (FEP) calculations. This has been driven by recent advances in small-molecule force fields and sampling algorithms combined with the availability of low-cost parallel computing. Predictive accuracies of ∼1 kcal/mol have been regularly achieved, which are sufficient to drive potency optimization in modern drug discovery campaigns. Despite the robustness of these FEP approaches across multiple target classes, there are invariably target systems that do not display the expected performance with default FEP settings. Traditionally, these systems required labor-intensive manual protocol development to arrive at parameter settings that produce a predictive FEP model. Due to (a) the relatively large parameter space to be explored, (b) the significant compute requirements, and (c) the limited understanding of how combinations of parameters can affect FEP performance, manual FEP protocol optimization can take weeks to months to complete, and often does not involve rigorous train-test set splits, resulting in potential overfitting. These manual FEP protocol development timelines do not coincide with tight drug discovery project timelines, essentially preventing the use of FEP calculations for these target systems. Here, we describe an automated workflow termed FEP Protocol Builder (FEP-PB) to rapidly generate accurate FEP protocols for systems that do not perform well with default settings. FEP-PB uses an active-learning workflow to iteratively search the protocol parameter space to develop accurate FEP protocols. To validate this approach, we applied it to pharmaceutically relevant systems where default FEP settings could not produce predictive models. We demonstrate that FEP-PB can rapidly generate accurate FEP protocols for the previously challenging MCL1 system with limited human intervention. We also apply FEP-PB in a real-world drug discovery setting to generate an accurate FEP protocol for the p97 system. FEP-PB is able to generate a more accurate protocol than the expert user, rapidly validating p97 as amenable to free energy calculations. Additionally, through the active-learning workflow, we are able to gain insight into which parameters are most important for a given system. These results suggest that FEP-PB is a robust tool that can aid in rapidly developing accurate FEP protocols and increasing the number of targets that are amenable to the technology.
The classification of the items of ever-increasing textual databases has become an important goal for a number of research groups active in the field of computational social science. Due to the increased amount of text data, there is a growing number of use cases where the initial effort of human classifiers was successfully augmented using supervised machine learning (SML). In this paper, we investigate such a hybrid workflow solution, classifying the lead paragraphs of New York Times front-page articles from 1996 to 2006 according to the policy topic categories (such as education or defense) of the Comparative Agendas Project (CAP). The SML classification is conducted in multiple rounds, and within each round, we run the SML algorithm on n samples, and n times if the given algorithm is non-deterministic (e.g., SVM). If all the SML predictions point towards a single label for a document, then it is classified as such (this approach is also called a “voting ensemble”). In the second step, we explore several scenarios, ranging from using the SML ensemble without human validation to incorporating active learning. Using these scenarios, we can quantify the gains from the various workflow versions. We find that using human coding and validation combined with an ensemble SML hybrid approach can reduce the need for human coding while maintaining very high precision rates and offering modest to good recall. The modularity of this hybrid workflow allows for various setups to address the idiosyncratic resource bottlenecks that a large-scale text classification project might face.
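The unanimity rule described above is easy to state in code: a document receives an automatic label only when every SML run agrees, and is otherwise deferred to human coders. A minimal sketch (labels illustrative):

```python
def voting_ensemble(runs_per_doc):
    # Unanimity rule: automatic label only if every SML run agrees;
    # None means the document is deferred to human coders.
    labels = []
    for preds in runs_per_doc:
        labels.append(preds[0] if len(set(preds)) == 1 else None)
    return labels

runs = [["defense", "defense", "defense"],
        ["education", "defense", "education"],
        ["health", "health", "health"]]
labels = voting_ensemble(runs)       # ["defense", None, "health"]
```

The fraction of `None` outcomes is exactly the residual human-coding workload, which is what the different workflow scenarios in the paper trade off against precision and recall.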
https://www.law.cornell.edu/uscode/text/17/106
Scientists have traditionally employed trial-and-error methodologies to design novel materials, often complemented by basic heuristic rules or chemical intuition (e.g., “like dissolves like”). However, to date, this simplistic approach has led to the discovery and characterization of only a small fraction of all synthesizable compounds. Data-driven approaches such as machine learning are promising alternative routes to these traditional trial-and-error methodologies. Unfortunately, most machine learning models proposed so far do not embed chemical or thermodynamic information in their architectures and molecular descriptors. In turn, this leads to overly complex models that require a tremendous volume of experimental data to be properly trained.
At the interface between artificial intelligence and green chemistry, the work developed throughout this dissertation uses thermodynamics-informed machine learning to bridge the gap between small, scarce datasets and data-driven approaches. This is accomplished using two major avenues. The first is through the development of active learning workflows, based on Gaussian process machine learning models, that target the description of activity coefficients. This unique approach was particularly directed at capturing the physicochemical properties of mixtures, namely deep eutectic solvents. Active learning was able to efficiently guide the acquisition of experimental data, and, in many cases, a single data point was sufficient to accurately describe mixture properties (namely phase equilibria diagrams), dramatically reducing the effort and cost necessary to characterize novel sustainable materials.
The second major avenue lies in the development of a digital molecular space based on sigma profiles. These molecular descriptors, derived from quantum chemistry, were shown to be a powerful feature set for neural networks, leading to the accurate prediction of assorted physicochemical properties (e.g., boiling points and aqueous solubilities) for organic and inorganic molecules. A graph neural network was also developed to predict sigma profiles, bypassing the need for expensive quantum chemistry calculations. Finally, sigma profiles were shown to behave as a digital molecular space where optimization tasks can be performed. A remarkable example of this was that of Bayesian optimization towards boiling point optimization. Holding no knowledge of chemistry except for the sigma profile and normal boiling temperature of carbon monoxide (the worst possible initial guess), Bayesian optimization found the global maximum of the available normal boiling temperature dataset (over 1000 molecules encompassing more than 40 families of organic and inorganic compounds) in just fifteen iterations (i.e., fifteen property measurements), cementing sigma profiles as an ideal digital chemical space for molecular optimization and discovery, particularly when little experimental data is available.
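The Bayesian optimization loop described above (GP surrogate, expected-improvement acquisition, one "measurement" per iteration) can be sketched end to end. This is a self-contained toy with a made-up 1-D objective in place of the sigma-profile space and boiling-point dataset:

```python
import math

def rbf(a, b, ell=0.2):
    return math.exp(-0.5 * ((a - b) / ell) ** 2)

def solve(A, rhs):
    # Dense Gaussian elimination with partial pivoting (small systems only).
    n = len(A)
    A = [row[:] for row in A]
    rhs = rhs[:]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        rhs[i], rhs[p] = rhs[p], rhs[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
            rhs[r] -= f * rhs[i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (rhs[i] - sum(A[i][c] * x[c] for c in range(i + 1, n))) / A[i][i]
    return x

def gp_posterior(x, X, y, noise=1e-6):
    # GP posterior mean and standard deviation at the query point x.
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k = [rbf(x, xi) for xi in X]
    mu = sum(ki * ai for ki, ai in zip(k, solve(K, y)))
    var = max(rbf(x, x) - sum(ki * vi for ki, vi in zip(k, solve(K, k))), 0.0)
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    if sigma < 1e-12:
        return 0.0
    z = (mu - best) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * Phi + sigma * phi

def bayes_opt(f, grid, n_iter=15):
    X = [grid[0], grid[-1]]                  # two initial "measurements"
    y = [f(X[0]), f(X[1])]
    for _ in range(n_iter):
        best = max(y)
        x_next = max(grid, key=lambda x:
                     expected_improvement(*gp_posterior(x, X, y), best))
        X.append(x_next)
        y.append(f(x_next))                  # one new "measurement"
    i = max(range(len(y)), key=y.__getitem__)
    return X[i], y[i]

f = lambda x: -(x - 0.7) ** 2                # toy objective (hypothetical)
grid = [i / 50 for i in range(51)]           # discrete candidate "molecules"
best_x, best_y = bayes_opt(f, grid)
```

The fifteen-iteration budget quoted in the dissertation corresponds to `n_iter` here; expected improvement is what lets the loop balance probing uncertain regions of the candidate space against refining around the current best.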
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Metal halide perovskite (MHP) derivatives, a promising class of optoelectronic materials, have been synthesized with a range of dimensionalities that govern their optoelectronic properties and determine their applications. We demonstrate a data-driven approach combining active learning and high-throughput experimentation to discover, control, and understand the formation of phases with different dimensionalities in the morpholinium (morph) lead iodide system. Using a robot-assisted workflow, we synthesized and characterized two novel MHP derivatives that have distinct optical properties: a one-dimensional (1D) morphPbI3 phase ([C4H10NO][PbI3]) and a two-dimensional (2D) (morph)2PbI4 phase ([C4H10NO]2[PbI4]). To efficiently acquire the data needed to construct a machine learning (ML) model of the reaction conditions where the 1D and 2D phases are formed, data acquisition was guided by a diverse-mini-batch-sampling active learning algorithm, using prediction confidence as a stopping criterion. Querying the ML model uncovered the reaction parameters that have the most significant effects on dimensionality control. Based on these insights, we discuss possible reaction schemes that may selectively promote the formation of morph-Pb-I phases with different dimensionalities. The data-driven approach presented here, including the use of additives to manipulate dimensionality, will be valuable for controlling the crystallization of a range of materials over large reaction-composition spaces.
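A diverse-mini-batch acquisition strategy with a confidence-based stopping rule can be sketched with a simple proxy: treat distance to the nearest labeled point as (inverse) prediction confidence, pick each mini-batch greedily so its members stay spread out, and stop once every candidate is close to labeled data. All quantities here are illustrative stand-ins for the paper's classifier confidences over reaction conditions:

```python
def min_gap(x, pts):
    # Distance to the nearest labeled point: far from training data means
    # low prediction confidence in this toy proxy.
    return min(abs(x - p) for p in pts) if pts else float("inf")

def al_until_confident(pool, batch=3, tol=2):
    labeled, rounds = [], 0
    # Stop once every pool point lies within `tol` of a labeled point.
    while not (labeled and max(min_gap(x, labeled) for x in pool) <= tol):
        picks = []
        for _ in range(batch):
            cand = [x for x in pool if x not in labeled and x not in picks]
            if not cand:
                break
            # Greedy max-min choice keeps each mini-batch diverse.
            picks.append(max(cand, key=lambda x: min_gap(x, labeled + picks)))
        labeled += picks
        rounds += 1
    return labeled, rounds

pool = list(range(21))                  # candidate reaction conditions
labeled, rounds = al_until_confident(pool)
```

On this toy pool the stopping criterion fires after two rounds, with only 6 of 21 conditions ever "synthesized", which is the kind of saving the robot-assisted workflow relies on.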
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Silicon carbide (SiC) is an important technological material, but its high-temperature phase diagram has remained unclear due to conflicting experimental results about congruent versus incongruent melting. Here, we employ large-scale machine learning molecular dynamics (MLMD) simulations to gain insights into SiC decomposition and phase transitions. Our approach relies on a Bayesian active learning workflow to efficiently train an accurate machine learning force field on density functional theory data. Our large-scale simulations provide direct indication that melting of SiC proceeds incongruently via decomposition into silicon-rich and carbon phases at high temperature and pressure. During cooling at high pressures, carbon nanoclusters nucleate and grow within the homogeneous molten liquid. During heating, the decomposed mixture reversibly transitions back into a homogeneous SiC liquid. The full pressure-temperature phase diagram of SiC is systematically constructed using MLMD simulations, providing new understanding of the nature of phases, resolving long-standing inconsistencies from previous experiments and yielding technologically relevant implications for processing and deposition of this material.
This dataset provides: see README.md for more dataset information. Additional MLMD trajectories with defects are available at https://doi.org/10.5281/zenodo.15066528.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Optimization of the catalyst structure to simultaneously improve multiple reaction objectives (e.g., yield, enantioselectivity, and regioselectivity) remains a formidable challenge. Herein, we describe a machine learning workflow for the multi-objective optimization of catalytic reactions that employ chiral bisphosphine ligands. This was demonstrated through the optimization of two sequential reactions required in the asymmetric synthesis of an active pharmaceutical ingredient. To accomplish this, a density functional theory-derived database of 550 bisphosphine ligands was constructed, and a designer chemical space mapping technique was established. The protocol used classification methods to identify active catalysts, followed by linear regression to model reaction selectivity. This led to the prediction and validation of significantly improved ligands for all reaction outputs, suggesting a general strategy that can be readily implemented for reaction optimizations where performance is controlled by bisphosphine ligands.
https://www.cognitivemarketresearch.com/privacy-policy
As per Cognitive Market Research's latest published report, the Global Data Collection and Labeling market size was USD 2.41 Billion in 2022, and it is forecast to reach USD 18.60 Billion by 2030, a compound annual growth rate of 29.1% from 2023 to 2030.
Key Dynamics of Data Collection And Labeling Market
Key Drivers of Data Collection And Labeling Market
Surge in AI and Machine Learning Adoption: The increasing integration of AI across various industries has led to a notable rise in the demand for high-quality labeled datasets. Precise data labeling is essential for training machine learning models, particularly in fields such as autonomous vehicles, healthcare diagnostics, and facial recognition.
Proliferation of Unstructured Data: With the surge of images, videos, and audio data generated from digital platforms, businesses are in need of structured labeling services to transform raw data into usable datasets. This trend is propelling the growth of data annotation services, especially for applications in natural language processing and computer vision.
Rising Use in Healthcare and Retail: Data labeling plays a vital role in applications such as medical imaging, drug discovery, and e-commerce personalization. Industries like healthcare and retail are allocating resources towards labeled datasets to enhance AI-driven diagnostics, recommendation systems, and predictive analytics, thereby increasing market demand.
Key Restraints for Data Collection And Labeling Market
High Cost and Time-Intensive Process: The process of manual data labeling is both labor-intensive and costly, particularly for intricate projects that necessitate expert annotators. This can pose a challenge for small businesses or startups that operate with limited budgets and stringent development timelines.
Data Privacy and Compliance Challenges: Managing sensitive information, including personal photographs, biometric data, or patient records, raises significant concerns regarding security and regulatory compliance. Ensuring compliance with GDPR, HIPAA, or other data protection regulations complicates the data labeling process.
Lack of Skilled Workforce: The industry is experiencing a shortage of qualified data annotators, especially in specialized areas such as radiology or autonomous systems. The inconsistency in labeling quality due to insufficient domain expertise can adversely affect the accuracy and reliability of AI models.
Key Trends in Data Collection And Labeling Market
Emergence of Automated and Semi-Automated Labeling Tools: Companies are progressively embracing AI-driven labeling tools to minimize manual labor. Innovations such as active learning, auto-labeling, and transfer learning are enhancing efficiency and accelerating the data preparation workflow.
Expansion of Crowdsourcing Platforms: Crowdsourced data labeling via platforms like Amazon Mechanical Turk is gaining traction as a favored approach. It facilitates quicker turnaround times at reduced costs by utilizing a global workforce, particularly for tasks involving image classification, sentiment analysis, and object detection.
Transition Towards Industry-Specific Labeling Solutions: Providers are creating domain-specific labeling platforms customized for sectors such as agriculture, autonomous vehicles, or legal technology. These specialized tools enhance accuracy, shorten time-to-market, and cater to the specific requirements of vertical AI applications.
What is Data Collection and Labeling?
Data collection and labeling is the process of gathering and organizing data and adding metadata to it for better analysis and understanding. This process is critical in machine learning and artificial intelligence, as it provides the foundation for training algorithms that can identify patterns and make predictions. Data collection involves gathering raw data from various sources, including sensors, databases, websites, and other forms of digital media. The collected data may be unstructured or structured, and it may be in different formats, such as text, images, videos, or audio.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rapid discovery and optimization of metal-organic frameworks (MOFs) for adsorption, diffusion, and gas separation applications require computational strategies that efficiently balance predictive accuracy with computational cost. Traditional simulation techniques such as Classical and Quantum Chemistry methods provide accurate insights but are computationally prohibitive when applied to large-scale materials screening. Machine learning (ML) has emerged as a powerful tool to accelerate MOF discovery, but its effectiveness depends on the availability of large, high-quality training datasets. This dissertation integrates Active Learning (AL), Reinforcement Learning (RL), and Inducing Points (IPs) to systematically explore the thermodynamic, molecular, and materials design space of MOFs, significantly enhancing the efficiency of computational screening workflows. Active Learning is employed to optimize gas adsorption predictions in MOFs while minimizing the number of required simulations. Gaussian Process Regression (GPR) models, combined with various acquisition functions, guide iterative data selection for adsorption modeling, enabling high predictive accuracy with a fraction of the data typically required. Extending this approach, an alchemical molecule-based AL strategy is introduced to predict real-molecule adsorption using surrogate molecular interactions, reducing training dataset size while maintaining accuracy. Furthermore, AL is applied to selectivity predictions in gas separations by integrating adsorption and diffusion modeling into an end-to-end (E2E) framework, improving data acquisition efficiency and minimizing redundant sampling across adsorption and diffusion models. To refine training dataset selection, Bayesian Optimization strategies such as Expected Improvement (EI) and Probability of Improvement (PI) are integrated within an AL framework, ensuring that the most informative MOFs are selected based on key structural properties. 
Inducing Points (IPs) are incorporated as a complementary strategy to further enhance model efficiency by selecting a representative subset of MOFs that capture the diversity of the full dataset. By leveraging kernel-based methods and principal component analysis (PCA), IPs reduce training data requirements while maintaining model generalizability. A comparative analysis across different AL, BO, and IP-based approaches reveals that combining these strategies significantly improves model robustness while minimizing computational expense. On the other hand, reinforcement learning (RL) is further introduced to actively guide data selection for adsorption modeling. Using Q-learning within a Gaussian Process framework, RL optimizes the selection of MOFs for gas adsorption studies, improving predictive convergence while reducing computational cost compared to standard AL strategies. By integrating AL, BO, RL, and IP methodologies, this research establishes a scalable and computationally efficient framework for MOF screening, offering a transformative approach to materials discovery. The findings contribute to the broader field of AI-assisted materials informatics, facilitating the rapid identification of MOFs for applications in energy storage, carbon capture, and industrial gas separations. Through these innovations, this work advances the role of artificial intelligence in accelerating the exploration of porous materials, bridging the gap between computational efficiency and predictive accuracy.
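The inducing-point idea, replacing the full training set with a small subset that still covers its diversity, can be illustrated with greedy max-min ("farthest point") selection on 1-D features; the kernel- and PCA-based selection described above is the higher-dimensional analogue. All data here are illustrative:

```python
def farthest_point_subset(points, m):
    # Greedy max-min selection: repeatedly add the candidate farthest
    # from the representatives already chosen.
    chosen = [points[0]]
    while len(chosen) < m:
        chosen.append(max(points,
                          key=lambda p: min(abs(p - c) for c in chosen)))
    return chosen

pool = [0.0, 0.1, 0.2, 0.5, 0.9, 1.0]   # 1-D stand-ins for MOF features
ips = farthest_point_subset(pool, 3)    # [0.0, 1.0, 0.5]
```

Note how the dense cluster near 0 contributes only one representative: the subset spans the feature space rather than mirroring its density, which is what keeps a sparse GP trained on the inducing points generalizable.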
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
High Throughput Screening (HTS) is a common approach in life sciences to discover chemical matter that modulates a biological target or phenotype. However, low assay throughput, reagent costs, or a flowchart that can deal with only a limited number of hits may impair screening large numbers of compounds. In this case, a subset of compounds is assayed, and in silico models are utilized to aid in iterative screening design, usually to expand around the found hits and enrich subsequent rounds for relevant chemical matter. However, this may lead to an overly narrow focus, and the diversity of compounds sampled in subsequent iterations may suffer. Active learning has recently been applied successfully in drug discovery with the goal of sampling diverse chemical space to improve model performance. Here we introduce a robust and straightforward iterative screening protocol based on naïve Bayes models. Instead of following up on the compounds with the highest scores in the in silico model, we pursue compounds with very low but positive values. This includes unique chemotypes of weakly active compounds that enhance the applicability domain of the model and increase the cumulative hit rates. We show in a retrospective application to 81 Novartis assays that this protocol leads to consistently higher compound and scaffold hit rates compared to a standard expansion around hits or an active learning approach. We recommend using the weak reinforcement strategy introduced herein for iterative screening workflows.
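The "weak reinforcement" rule differs from standard exploitation in a single line: instead of following up the highest-scoring compounds, the next iteration pursues those with the lowest still-positive model scores. A sketch with hypothetical naïve Bayes score values:

```python
def weak_reinforcement_picks(scores, n):
    # Pursue the n compounds with the LOWEST positive model scores,
    # not the top-ranked ones: weakly active, often novel chemotypes.
    positive = [(s, c) for c, s in scores.items() if s > 0]
    return [c for s, c in sorted(positive)[:n]]

scores = {"cpd1": 4.2, "cpd2": 0.3, "cpd3": -1.1,
          "cpd4": 0.05, "cpd5": 2.8, "cpd6": -0.4}   # hypothetical scores
picks = weak_reinforcement_picks(scores, 2)          # ["cpd4", "cpd2"]
```

Negative-scoring compounds are still excluded, so the selection stays inside the model's predicted-active region while deliberately sampling its least-confident edge, which is what broadens the applicability domain over iterations.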