License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
While polymerization-induced self-assembly (PISA) has become a preferred synthetic route toward amphiphilic block copolymer self-assemblies, predicting their phase behavior from experimental design is extremely challenging, requiring the time- and labor-intensive creation of empirical phase diagrams whenever self-assemblies of novel monomer pairs are sought for specific applications. To alleviate this burden, we develop here the first framework for a data-driven methodology for the probabilistic modeling of PISA morphologies, based on a selection and suitable adaptation of statistical machine learning methods. As the complexity of PISA precludes generating large volumes of training data with in silico simulations, we focus on interpretable, low-variance methods that can be interrogated for conformity with chemical intuition and that promise to work well with only the 592 training data points we curated from the PISA literature. Among the evaluated linear models, generalized additive models, and rule and tree ensembles, we found that all but the linear models show decent interpolation performance, with an estimated error rate of around 0.2 and an expected cross-entropy loss (surprisal) of about 1 bit when predicting the mixture of morphologies formed from monomer pairs already encountered in the training data. When extrapolating to new monomer combinations, model performance is weaker, but the best model (random forest) still achieves highly nontrivial prediction performance (0.27 error rate, 1.6 bit surprisal), which renders it a good candidate to support the creation of empirical phase diagrams for new monomers and conditions. Indeed, we find in three case studies that, when used to actively learn phase diagrams, the model selects a smart set of experiments that lead to satisfactory phase diagrams after observing only relatively few data points (5–16) for the targeted conditions. The data set, as well as all model training and evaluation code, is publicly available through the GitHub repository of the last author.
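A minimal sketch of the kind of probabilistic model described above, assuming hypothetical file and feature names (the actual descriptors are documented in the repository): a random forest is trained to output a probability distribution over morphologies, and grouped cross-validation over monomer pairs approximates the extrapolation setting. Note that scikit-learn's log_loss reports nats; dividing by ln 2 converts to the bit surprisal quoted above.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import log_loss, accuracy_score

df = pd.read_csv("pisa_morphologies.csv")            # hypothetical curated data set
X = df[["dp_core", "dp_corona", "solids_content"]]   # hypothetical descriptors
y = df["morphology"]                                 # e.g. sphere / worm / vesicle
groups = df["monomer_pair"]                          # held-out groups = extrapolation

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    proba = model.predict_proba(X.iloc[test_idx])
    err = 1 - accuracy_score(y.iloc[test_idx], model.predict(X.iloc[test_idx]))
    bits = log_loss(y.iloc[test_idx], proba, labels=model.classes_) / np.log(2)
    print(f"error rate: {err:.2f}, surprisal: {bits:.2f} bit")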
License: https://www.nist.gov/open/license
The NIST Excerpts Benchmark Data are a set of target data for deidentification algorithms. The data are configured to work with "SDNist: Synthetic Data Report Tool", a package for evaluating synthetic data generators: https://github.com/usnistgov/SDNist. An installation of SDNist will download the data resources automatically.
Jan 2025 -- Benchmark Excerpts:
- NIST American Community Survey (ACS) Data Excerpts: 24 demographic features over 40k records
- NIST Survey of Business Owners (SBO) Data Excerpts: 130 demographic and financial features over 161k records
The data are curated subsets of U.S. Census Bureau products.
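A minimal sketch of a first look at the excerpts with pandas, assuming a hypothetical local file name; as noted above, installing SDNist (pip install sdnist) downloads the benchmark data automatically.

import pandas as pd

acs = pd.read_csv("acs_excerpt.csv")   # hypothetical path to the ACS excerpt
print(acs.shape)                       # expect roughly 40k records x 24 features
print(acs.columns.tolist())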
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Given the size of the chemical space relevant for drug discovery, working with fully enumerated compound libraries (especially in three dimensions (3D)) is unfeasible. Nonenumerated virtual chemical spaces are a practical solution to this issue, in which compounds are described as building blocks that are then connected by rules. One concrete example is the BioSolveIT chemical space file format (.space). Tools exist to search these space files using ligand-based methods, including two-dimensional (2D) fingerprint similarity, substructure matching, and fuzzier similarity metrics such as FTrees. However, no software has been available that enables the screening of these nonenumerated spaces using a protein structure as the input query. Here, a hybrid ligand/structure-based virtual screening tool, called SpaceHASTEN, was developed on top of SpaceLight, FTrees, LigPrep, and Glide to allow efficient structure-based virtual screening of nonenumerated chemical spaces. SpaceHASTEN was validated using three public targets picked from the DUD-E data set. It was able to retrieve a large number of diverse and novel high-scoring compounds (virtual hits) from nonenumerated chemical spaces of billions of molecules after docking only a few million compounds. The software can be freely used and is available from http://github.com/TuomoKalliokoski/SpaceHASTEN.
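An illustrative skeleton of the hybrid loop just described; the two helper functions are hypothetical placeholders for the commercial components (SpaceLight/FTrees similarity search, LigPrep preparation, Glide docking), not the SpaceHASTEN API.

def similarity_search(queries, k):        # hypothetical: 2D search of the .space file
    raise NotImplementedError

def prepare_and_dock(receptor, smiles):   # hypothetical: LigPrep + Glide wrapper
    raise NotImplementedError

def hybrid_screen(receptor, seeds, rounds=3, keep=100):
    """Alternate ligand-based retrieval with structure-based rescoring."""
    hits, queries = [], list(seeds)
    for _ in range(rounds):
        # Only this fast 2D step touches the billion-compound space.
        candidates = similarity_search(queries, k=1_000_000)
        # 3D preparation and docking run on the retrieved subset only.
        scored = prepare_and_dock(receptor, candidates)  # [(smiles, score), ...]
        scored.sort(key=lambda t: t[1])                  # lower docking score = better
        hits.extend(scored[:keep])
        queries = [s for s, _ in scored[:keep]]          # feed best hits back in
    return hits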
As described in the README.md file, the GitHub repository github.com/USEPA/PRTR-QSTR-models/tree/data-driven contains Python scripts for running Quantitative Structure–Transfer Relationship (QSTR) models, chemical structure-based machine learning (ML) models that support environmental regulatory decision-making. Using features associated with annual chemical transfer amounts, chemical generator industry sectors, environmental policy stringency, gross value added by industry sectors, chemical descriptors, and chemical unit prices, as in the GitHub repository PRTR_transfers, the QSTR models developed here can predict potential end-of-life (EoL) activities for chemicals transferred to off-site locations for EoL management. This contribution also shows that QSTR models aid in estimating the mass-fraction allocation of chemicals of concern transferred off-site for EoL activities, and it describes the Python libraries required to run the code, how to use it, the output files obtained after running the Python scripts, and how to obtain all manuscript figures and results. This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, and M. Martín. Predicting Chemical End-of-Life Scenarios Using Structure-Based Classification Models. ACS Sustainable Chemistry & Engineering. American Chemical Society, Washington, DC, USA, 11(9): 3594-3602, (2023).
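A minimal sketch of a transfer-feature classifier in the spirit of the QSTR models described above, assuming hypothetical column and file names rather than the repository's actual feature set; predict_proba gives the probability split over EoL activities that underlies the mass-fraction estimates mentioned above.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

df = pd.read_csv("prtr_transfers.csv")                  # hypothetical training table
features = ["transfer_amount_kg", "sector_code",
            "policy_stringency", "unit_price_usd"]      # hypothetical features
X, y = df[features], df["eol_activity"]                 # e.g. recycling / incineration

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("macro F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
print(clf.predict_proba(X_te.iloc[:1]))                 # allocation over EoL activities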
Source record: https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de567319
Abstract (en):
Inputs: The synthetic population was generated from the 2010-2014 ACS PUMS housing and person files. United States Department of Commerce. Bureau of the Census. (2017-03-06). American Community Survey 2010-2014 ACS 5-Year PUMS File [Data set]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor]. Persistent URL: http://doi.org/10.3886/E100486V1
Funding support: This work is supported under Grant G-2015-13903 from the Alfred P. Sloan Foundation on "The Economics of Socially-Efficient Privacy and Confidentiality Management for Statistical Agencies" (PI: John M. Abowd).
Outputs: There are 17 housing files (data/housing), repHus0.csv, repHus1.csv, ..., repHus16.csv, and 32 person files (data/person), rep_recode_ACSpus0.csv, rep_recode_ACSpus1.csv, ..., rep_recode_ACSpus31.csv. The files are split to be roughly equal in size; each contains data for the entire country, and the split does not follow any demographic characteristic. The person files and housing files must be concatenated to form a complete person file and a complete housing file, respectively.
Programs: The programs that generated this release can be found at https://github.com/labordynamicsinstitute/SynUSpopulation/releases/tag/v201703-beta and http://doi.org/10.5281/zenodo.556424
Additional information: For more information, see README.md.
Smallest Geographic Unit: United States
Funding institution(s): Alfred P. Sloan Foundation
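A minimal sketch, assuming the file layout listed above, of the required concatenation step with pandas:

import pandas as pd

housing = pd.concat((pd.read_csv(f"data/housing/repHus{i}.csv")
                     for i in range(17)), ignore_index=True)
person = pd.concat((pd.read_csv(f"data/person/rep_recode_ACSpus{i}.csv")
                    for i in range(32)), ignore_index=True)
print(len(housing), len(person))   # complete national housing and person files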
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Most existing computational tools for de novo library design focus on the generation, rational selection, and combination of promising structural motifs to form members of the new library. However, the absence of a direct link between the chemical space of the retrosynthetically generated fragments and the pool of available reagents makes such approaches appear rather theoretical and disconnected from reality. In this context, we present Synthons Interpreter (SynthI), a new open-source toolkit for de novo library design that allows merging those two chemical spaces into a single synthon space. Here, synthons are defined as actual fragments with valid valences and special labels specifying the position and nature of the reactive centers. They can be derived either from the "breakup" of reference compounds according to 38 retrosynthetic rules or from real reagents, after leaving-group removal or transformation. Such an approach not only enables the design of synthetically accessible libraries and analog generation but also facilitates the analysis of reagents (building blocks) in the medicinal chemistry context. The SynthI code is publicly available at https://github.com/Laboratoire-de-Chemoinformatique/SynthI.
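SynthI's 38 retrosynthetic rules are its own, but RDKit's built-in BRICS rules illustrate the same "breakup" idea of cutting a reference compound into fragments with labeled attachment points; this sketch uses BRICS as a stand-in, not SynthI itself.

from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")   # paracetamol as an example input
for smi in sorted(BRICS.BRICSDecompose(mol)):
    print(smi)   # dummy atoms such as [1*] mark the reactive attachment points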
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In recent years, reinforcement learning (RL) has emerged as a valuable tool in drug design, offering the potential to propose and optimize molecules with desired properties. However, striking a balance between capabilities, flexibility, reliability, and efficiency remains challenging due to the complexity of advanced RL algorithms and the significant reliance on specialized code. In this work, we introduce ACEGEN, a comprehensive and streamlined toolkit tailored for generative drug design, built using TorchRL, a modern RL library that offers thoroughly tested reusable components. We validate ACEGEN by benchmarking against other published generative modeling algorithms and show comparable or improved performance. We also show examples of ACEGEN applied in multiple drug discovery case studies. ACEGEN is accessible at https://github.com/acellera/acegen-open and available for use under the MIT license.
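As a generic illustration of the core idea (not ACEGEN's TorchRL-based API), the sketch below applies REINFORCE to a character-level recurrent policy: sampled sequences earn a reward, and the policy gradient raises the probability of high-reward sequences. The toy alphabet and reward are placeholders for a real SMILES vocabulary and scoring function.

import torch
import torch.nn as nn

vocab = ["^", "$", "C", "N", "O", "(", ")", "=", "1"]   # toy SMILES alphabet
stoi = {c: i for i, c in enumerate(vocab)}

class Policy(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(len(vocab), hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, len(vocab))

    def forward(self, x, h=None):
        o, h = self.gru(self.emb(x), h)
        return self.out(o), h

def sample(policy, max_len=30):
    """Sample one sequence; return its string and summed log-probability."""
    tok, h, logp, seq = torch.tensor([[stoi["^"]]]), None, 0.0, []
    for _ in range(max_len):
        logits, h = policy(tok, h)
        dist = torch.distributions.Categorical(logits=logits[:, -1])
        choice = dist.sample()
        logp = logp + dist.log_prob(choice)
        if choice.item() == stoi["$"]:
            break
        seq.append(vocab[choice.item()])
        tok = choice.unsqueeze(0)
    return "".join(seq), logp

policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for step in range(200):
    smiles, logp = sample(policy)
    reward = smiles.count("N")            # toy reward; real runs score molecules
    loss = -(reward * logp).sum()         # REINFORCE policy gradient
    opt.zero_grad(); loss.backward(); opt.step()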
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GitHub code link: https://github.com/nsndimt/ChemSSP
JCIM paper link: https://pubs.acs.org/doi/10.1021/acs.jcim.5c00248
Note: use tar xf [filename].tar.xz to decompress all files
Each of the following compressed files corresponds to a folder under the project root:
- model.tar.xz: the SciBERT model checkpoint
- data.tar.xz: the six human-annotated chemical NER datasets
- episode.tar.xz: sampled episodes for the six datasets
- chatGPT.tar.xz, claude.tar.xz, gemini.tar.xz: LLM annotation files and the corresponding sampled episodes
Additionally, codereadme.tar.xz contains all the code and a readme file under the project root. More detailed instructions are in the readme.
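A minimal sketch of the decompression step with Python's standard library, equivalent to running the tar command noted above on each archive:

import tarfile
from pathlib import Path

for archive in Path(".").glob("*.tar.xz"):
    with tarfile.open(archive, mode="r:xz") as tf:
        tf.extractall(".")      # each archive unpacks to a folder under the root
    print("extracted", archive)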
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The continuous improvement of computer architectures allows for the simulation of molecular systems of growing sizes. However, such calculations still require the input of initial structures, which are also becoming increasingly complex. In this work, we present CAT, a Compound Attachment Tool (source code available at https://github.com/nlesc-nano/CAT) and Python package for the automatic construction of composite chemical compounds, which supports the functionalization of organic, inorganic, and hybrid organic–inorganic materials. The CAT workflow consists of defining the anchoring sites on the reference material, usually a large molecular system denoted as a scaffold, and on the molecular species that are attached to it, i.e., the ligands. Usually, ligands are pre-optimized in a conformation biased toward more linear structures to minimize interligand steric interactions, a bias that is important when multiple ligands are attached to the scaffold. The resulting superstructure(s) are then stored in various formats that can be used afterward in quantum chemical calculations or classical force-field-based simulations.
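An illustration of the anchoring-site idea (not CAT's actual API): with RDKit, a ligand's anchor can be located by substructure matching, here a carboxylic acid group whose atoms would define the attachment point to the scaffold. The SMILES and SMARTS are example inputs.

from rdkit import Chem

ligand = Chem.MolFromSmiles("CCCCCCCC(=O)O")   # octanoic acid as an example ligand
anchor = Chem.MolFromSmarts("C(=O)[OX2H1]")    # protonated carboxyl anchor pattern
print("anchor atom indices:", ligand.GetSubstructMatches(anchor))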
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Motivated by the challenges deep learning faces in the low-data regime and the urgent demand for the intelligent design of highly energetic materials, we explore a correlated deep learning framework, consisting of three recurrent neural networks (RNNs) correlated by a transfer learning strategy, to efficiently generate new energetic molecules with a high detonation velocity when only very limited data are available. To avoid dependence on an external big data set, data augmentation by fragment shuffling of 303 energetic compounds is used to produce 500,000 molecules for pretraining the RNN, through which the model can learn sufficient structural knowledge. The pretrained RNN is then fine-tuned on the 303 energetic compounds to generate 7153 molecules similar to them. To screen molecules with a high detonation velocity more reliably, SMILES enumeration augmentation coupled with the pretrained knowledge is used to build an RNN-based prediction model, through which R2 is boosted from 0.4446 to 0.9572. Performance comparable to a transfer learning strategy based on an existing big database (ChEMBL) for producing energetic and drug-like molecules further supports the effectiveness and generality of our strategy in the low-data regime. High-precision quantum mechanics calculations further confirm that 35 new molecules present a higher detonation velocity and lower synthetic accessibility than the classic explosive RDX, along with good thermal stability. In particular, three new molecules are comparable to caged CL-20 in detonation velocity. All the source code and the data set are freely available at https://github.com/wangchenghuidream/RNNMGM.
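A minimal sketch of the transfer-learning step, with a hypothetical checkpoint, vocabulary size, and data batch (the dummy tensors stand in for tokenized SMILES of the 303 compounds): the pretrained weights are loaded and next-token training continues at a small learning rate on the target set.

import torch
import torch.nn as nn

class SmilesRNN(nn.Module):
    def __init__(self, vocab_size=64, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

model = SmilesRNN()
model.load_state_dict(torch.load("pretrained_500k.pt"))   # hypothetical checkpoint
opt = torch.optim.Adam(model.parameters(), lr=1e-4)        # small LR for fine-tuning
loss_fn = nn.CrossEntropyLoss()

x = torch.randint(0, 64, (8, 40))       # dummy batch of SMILES token ids
y = torch.roll(x, shifts=-1, dims=1)    # dummy next-token targets
for epoch in range(10):
    logits = model(x)                   # next-token prediction over the vocabulary
    loss = loss_fn(logits.transpose(1, 2), y)
    opt.zero_grad(); loss.backward(); opt.step()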
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Efficient computational screenings are integral to materials discovery in highly sought-after gas adsorption and storage applications, such as CO2 capture. Preprocessing techniques have been developed to render experimental crystal structures suitable for molecular simulations by mimicking experimental activation protocols, particularly residual solvent removal. Current accounts examining these preprocessed materials databases indicate the presence of assorted structural errors introduced by solvent removal and preprocessing, including the improper elimination of charge-balancing ions and ligands. Here, we address the need for a reliable experimental crystal structure preprocessing protocol by introducing a novel solvent removal method, which we call SAMOSA, informed by systematic ligand charge and metal oxidation state calculations. A robust set of solvent removal criteria is outlined, which identifies solvent molecules and counterions without predefined molecule lists or significant reliance on experimental chemical information. Validation against popular metal–organic framework (MOF) databases suggests that this method achieves significant performance improvements in the retention of charged ligands and the recognition of charged frameworks. SAMOSA enhances structure fidelity with respect to the original as-synthesized material, thereby representing a powerful tool in computational materials database curation and preprocessing for molecular simulation. The source code is accessible at https://github.com/uowoolab/SAMOSA.
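SAMOSA's own criteria involve ligand charges and metal oxidation states, but the basic solvent-detection idea can be illustrated with pymatgen (a sketch assuming a hypothetical input CIF, not the SAMOSA algorithm): bonded subgraphs of the crystal that do not extend periodically are isolated molecules sitting in the pores, i.e., solvent candidates.

from pymatgen.core import Structure
from pymatgen.analysis.graphs import StructureGraph
from pymatgen.analysis.local_env import JmolNN

structure = Structure.from_file("mof_as_deposited.cif")    # hypothetical CIF
graph = StructureGraph.with_local_env_strategy(structure, JmolNN())
for molecule in graph.get_subgraphs_as_molecules():
    print(molecule.composition.reduced_formula)            # e.g. H2O, DMF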
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Quantum chemical spectroscopic calculations have grown increasingly popular in natural products research for aiding the elucidation of chemical structures, especially their stereochemical configurations. These calculations have become faster with modern computational speeds, but subsequent data handling, inspection, and presentation remain key bottlenecks for many researchers. In this article, we introduce the SpectroIBIS computer program as a user-friendly tool to automate tedious tasks commonly encountered in this workflow. Through a simple graphical user interface, researchers can drag and drop Gaussian or ORCA output files to produce Boltzmann-averaged ECD, VCD, UV–vis, and IR data, optical rotations, and/or 1H and 13C NMR chemical shifts in seconds. Also produced are formatted, publication-quality supplementary data tables containing conformer energies and atomic coordinates, saved to a DOCX file compatible with Microsoft Word and LibreOffice. Importantly, SpectroIBIS can assist researchers in finding common calculation issues by automatically checking for redundant conformers and imaginary frequencies. Additional useful features include recognition of conformer energy recalculations at a higher level of theory, and automated generation of input files for quantum chemistry programs with optional exclusion of high-energy conformers. Lastly, we demonstrate the applicability of SpectroIBIS with spectroscopic calculations for five natural products. SpectroIBIS is open-source software available as a free desktop application (https://github.com/bbulcock/SpectroIBIS).
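A minimal sketch of the Boltzmann averaging that SpectroIBIS automates: per-conformer properties are weighted by populations proportional to exp(-dE/RT). The energies and property values below are made-up example numbers.

import numpy as np

R, T = 8.31446e-3, 298.15   # kJ/(mol K), K

def boltzmann_average(energies_kjmol, properties):
    """Weight per-conformer properties by their Boltzmann populations."""
    e = np.asarray(energies_kjmol)
    w = np.exp(-(e - e.min()) / (R * T))
    return (w / w.sum()) @ np.asarray(properties)

# Three conformers: relative energies and, say, their optical rotations.
print(boltzmann_average([0.0, 2.1, 4.8], [-12.0, 35.0, 8.0]))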
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We present a novel algorithm to compute the distance between synthetic routes based on tree edit distances. Such distances can be used to cluster synthesis routes generated by a retrosynthesis prediction tool. We show that the clustering of selected routes from a retrosynthesis analysis is performed in less than 10 s on average and constitutes only seven percent of the total time (prediction + clustering). Furthermore, we show that representative routes from each cluster can be used to reduce the set of predicted routes. Finally, we show with a number of examples that the algorithm gives intuitive clusters that can be easily rationalized and that the routes in a cluster tend to use similar chemistry. The algorithm is included in the latest version of the open-source AiZynthFinder software (https://github.com/MolecularAI/aizynthfinder) and as a separate package (https://github.com/MolecularAI/route-distances).
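A minimal sketch of the tree-edit-distance idea using the zss package (Zhang-Shasha algorithm); the toy route trees stand in for AiZynthFinder's retrosynthesis trees and are not the route-distances package's actual data model.

from zss import Node, simple_distance

route_a = Node("product").addkid(
    Node("intermediate").addkid(Node("reagent A"))).addkid(Node("reagent B"))
route_b = Node("product").addkid(
    Node("intermediate").addkid(Node("reagent C"))).addkid(Node("reagent B"))

print(simple_distance(route_a, route_b))   # one relabeled leaf -> distance 1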
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We present sbml-diff, a tool that is able to read a model of a biochemical reaction network in SBML format and produce a range of diagrams showing different levels of detail. Each diagram type can be used to visualize a single model or to visually compare two or more models. The default view depicts species as ellipses, reactions as rectangles, rules as parallelograms, and events as diamonds. A cartoon view replaces the symbols used for reactions on the basis of the associated Systems Biology Ontology terms. An abstract view represents species as ellipses and draws edges between them to indicate whether a species increases or decreases the production or degradation of another species. sbml-diff is freely licensed under the three-clause BSD license and can be downloaded from https://github.com/jamesscottbrown/sbml-diff. It can be used as a Python package called from other software, as a free-standing command-line application, or online using the form at http://sysos.eng.ox.ac.uk/tebio/upload.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The isoelectric point (pI) of a peptide is a physicochemical property that can be accurately predicted from the sequence when the peptide is built from natural amino acids. Peptides can, however, carry chemical modifications, such as phosphorylations, amidations, and unnatural amino acids, which can result in erroneous predictions if not accounted for. Here we report on an open-source program, pICalculax, which can handle pI calculations of modified peptides in an extensible way. Tests on a database of modified peptides with experimentally determined pI values show an improvement in pI predictions when taking the modifications into account: the correlation coefficient improves from 0.45 to 0.91, and the root-mean-square deviation likewise improves from 3.3 to 0.9. The program is available at https://github.com/EBjerrum/pICalculax.
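A minimal sketch of the textbook pI calculation that pICalculax extends to modified residues: bisection finds the pH at which the Henderson-Hasselbalch net charge crosses zero. The pKa table is a small illustrative subset, not pICalculax's parameter set.

PKA = {"K": 10.5, "R": 12.5, "H": 6.0, "D": 3.9, "E": 4.1,
       "Nterm": 9.0, "Cterm": 2.1}
POSITIVE = {"K", "R", "H", "Nterm"}

def net_charge(seq, ph):
    charge = 0.0
    for g in ["Nterm", "Cterm"] + [aa for aa in seq if aa in PKA]:
        frac = 1.0 / (1.0 + 10 ** (ph - PKA[g]))     # protonated fraction
        charge += frac if g in POSITIVE else frac - 1.0
    return charge

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if net_charge(seq, mid) > 0 else (lo, mid)
    return round((lo + hi) / 2, 2)

print(isoelectric_point("GDKRHE"))   # unmodified peptide example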