Dataset containing reaction centers used to train the disconnection aware model
The NIST Chemical Kinetics Database includes essentially all reported kinetics results for thermal gas-phase chemical reactions. The database is designed to be searched for kinetics data based on the specific reactants involved, for reactions resulting in specified products, for all the reactions of a particular species, or for various combinations of these. In addition, the bibliography can be searched by author name or combination of names. The database contains in excess of 38,000 separate reaction records for over 11,700 distinct reactant pairs. These data have been abstracted from over 12,000 papers with literature coverage through early 2000. Rate constant records for a specified reaction are found by searching the Reaction Database. All rate constant records for that reaction are returned, with a link to 'Details' on that record. Each rate constant record contains the following information (as available):
a) Reactants and, if defined, reaction products;
b) Rate parameters: A, n, Ea/R, where k = A * (T/298)^n * exp[-(Ea/R)/T] and T is the temperature in kelvins;
c) Uncertainty in A, n, and Ea/R, if reported;
d) Temperature range of experiment or temperature range of validity of a review or theoretical paper;
e) Pressure range and bulk gas of the experiment;
f) Data type of the record (i.e., experimental, relative rate measurement, theoretical calculation, modeling result, etc.). If the result is a relative rate measurement, then the reaction to which the rate is relative is also given;
g) Experimental procedure, including separate fields for the description of the apparatus, the time resolution of the experiment, and the excitation technique.
A majority of contemporary chemical kinetics methods are represented. The Kinetics Database is being expanded to include other resources for the convenience of the users. Presently this includes direct links to the corresponding NIST WebBook page for all substances for which such a link is possible.
This is indicated by underlining and highlighting the species. The WebBook provides thermodynamic, spectral, and other data on the species. Note that the link to the WebBook is opened as a new frame in your browser.
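As a rough illustration of how the rate parameters listed under item b) are used, the sketch below evaluates the quoted expression k = A (T/298)^n exp[-(Ea/R)/T]. The function name and the numeric parameters are hypothetical and are not taken from any database record; the units of k follow those of A.

```python
import math


def rate_constant(A, n, Ea_over_R, T):
    """Evaluate k = A * (T/298)**n * exp(-(Ea/R)/T) for T in kelvins."""
    if T <= 0:
        raise ValueError("temperature must be positive (kelvins)")
    return A * (T / 298.0) ** n * math.exp(-Ea_over_R / T)


# Hypothetical parameters for illustration only (not a real record).
for temperature in (298.0, 500.0, 1000.0):
    k = rate_constant(A=1.0e-11, n=1.5, Ea_over_R=2000.0, T=temperature)
    print(f"T = {temperature:6.1f} K  k = {k:.3e}")
```

Ea/R is already expressed in kelvins in the database, so no gas constant appears in the code.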
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
In 2017 Lowe shared curated and published USPTO-based chemical reaction datasets in CSV format. Based on this, Schwaller et al. published curated reaction SMILES (they in turn used the curated set disclosed by Jin and coworkers). Both versions have the drawback of containing only partially curated yields. In those datasets, two columns are available, TextMinedYield and Calculated yield. Many entries contain no, partial, or incorrect numbers. For certain forms of reaction analysis that focus on yield as the only available correlation, that information becomes essentially useless since there is no correlation to reaction conditions (unless one data-mines the CML files or original XML).
By correcting and merging the yield into a new column, followed by eliminating faulty entries, the noise in the data set is reduced. The new datasets are reduced by nearly 50%.
Attached are two kinds of datasets (of each, Lowe and Schwaller):
* A "cropped" version, containing only the reaction SMILES and the curated yield (and an added ID), and only entries with valid yields. Everything else was filtered out.
* A second type, a "full" version, including the curated yields and all original input columns and entries (no filtration). The latter might come in handy for other applications where one doesn't agree with the applied removal of invalid entries, or to apply further curation.
More details can be found on GitHub, which contains the Python scripts used to procure the attached datasets and a Readme file.
For the less adept programmer, a graphical workflow based on the open-source data analysis platform Knime(R) is also available. The latter furthermore contains a proof-of-concept reaction splitter (data not included here).
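The merging step described above might look roughly like the following sketch. It is not the authors' script: the rule of preferring the text-mined value, the validity window of 0 to 100 percent, the key name CalculatedYield, and the output field names are all assumptions made for illustration.

```python
import re


def parse_yield(raw):
    """Return a yield in percent as a float, or None if the entry is unusable."""
    if raw is None:
        return None
    match = re.search(r"-?\d+(?:\.\d+)?", str(raw))
    if match is None:
        return None
    value = float(match.group())
    # Assumption: anything outside (0, 100] is treated as a faulty entry.
    return value if 0.0 < value <= 100.0 else None


def curated_yield(text_mined, calculated):
    """Merge both yield columns (assumed rule: prefer the text-mined value)."""
    first = parse_yield(text_mined)
    return first if first is not None else parse_yield(calculated)


# Hypothetical rows; real files carry many more columns.
rows = [
    {"TextMinedYield": "85%", "CalculatedYield": "84.7"},
    {"TextMinedYield": "", "CalculatedYield": "120.3"},
    {"TextMinedYield": None, "CalculatedYield": "42.0"},
]

# "Cropped" variant: keep only rows with a valid merged yield.
cropped = []
for row in rows:
    merged = curated_yield(row["TextMinedYield"], row["CalculatedYield"])
    if merged is not None:
        cropped.append({"ID": len(cropped), "CuratedYield": merged})

print(cropped)
```

A "full" variant would instead attach the merged value to every row and leave the filtering to the user.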
BIOINF595 W2025 Bioactivity Project Dataset Author: Carl Mauro The reaction data used in this project is from the following publication, accessed through the Open Reaction Database (https://open-reaction-database.org/). The original data is used under an MIT license, and is under copyright by the original authors (see LICENSE.txt file for details). Ahneman, D. T.; Estrada, J. G.; Lin, S.; Dreher, S. D.; Doyle, A. G. Predicting Reaction Performance in C–N Cross-Coupling Using Machine… See the full description on the dataset page: https://huggingface.co/datasets/cmmauro/ORD_Ahneman_2018.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Transformer models trained on tasks in organic chemistry on ORDerly benchmark datasets.
* ORDerly_retro: Retrosynthesis prediction (predict reactants given a desired product).
* ORDerly_forward_separated: Forward reaction prediction (predict reaction products given reactants, solvents, and agents), with reactants separated by > from the solvents and agents in the reaction string.
* ORDerly_forward_mixed: Forward reaction prediction (predict reaction products given reactants, solvents, and agents), with reactants, solvents, and agents mixed together in the reaction string.
* non-uspto-eval: Evaluation of transformer models trained on USPTO data on non-USPTO data available in the Open Reaction Database.
Full details can be found in our paper: https://chemrxiv.org/engage/chemrxiv/article-details/64ca5d3e4a3f7d0c0d78ca42
NeurIPS workshop paper: https://openreview.net/forum?id=R8FQMsECIS
Code: https://github.com/sustainable-processes/orderly
The supplementary datasets used for this work can be found here: https://doi.org/10.6084/m9.figshare.23502372.v3
Transformer model architecture is from Molecular Transformer: https://pubs.acs.org/doi/10.1021/acscentsci.9b00576
Find the results, models, and checkpoints within MolecularTransformer/experiments. Note that the "wandb" folder was deleted since figshare only allows uploads up to 500 files.
Notes:
* There's a limit of 500 files in figshare, so I deleted the "docs", "onmt", "OpenNMT_py.egg-info", and "tools" folders from all folders except "ORDerly_retro". I also deleted all wandb-associated files and all checkpoint files.
* Empty files cannot be uploaded to figshare, so you have to create these yourself, where appropriate (e.g. MolecularTransformer/onmt/tests/_init_.py and non-uspto-eval/MolecularTransformer/experiments/models/ofs_1.pt).
Feel free to email me, Daniel Wigh, at dsw46@cam.ac.uk or daniel@reactwise.com, or my supervisor Alexei A. Lapkin.
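To make the difference between the "separated" and "mixed" inputs concrete, here is a minimal sketch assuming the conventional reaction SMILES layout reactants>agents>products. The exact strings fed to the ORDerly models may differ; the function names and the example reaction are hypothetical.

```python
def split_reaction(rxn):
    """Split 'reactants>agents>products' into three lists of SMILES."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    reactants, agents, products = (p.split(".") if p else [] for p in parts)
    return reactants, agents, products


def to_mixed(rxn):
    """Move solvents and agents in with the reactants (the 'mixed' layout)."""
    reactants, agents, products = split_reaction(rxn)
    return ".".join(reactants + agents) + ">>" + ".".join(products)


# Hypothetical example: acid-catalysed esterification of acetic acid with ethanol.
separated = "CC(=O)O.OCC>O=S(=O)(O)O>CC(=O)OCC"
print(split_reaction(separated))
print(to_mixed(separated))
```

In the separated form a model can tell which species contribute atoms to the product; in the mixed form it has to infer that itself.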
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is public data to be used with the aizynthfinder tool for retrosynthesis planning (https://github.com/MolecularAI/aizynthfinder).
There are three files available:
* full_uspto_03_05_19_rollout_policy.hdf5 - the Keras neural network model used as rollout policy
* full_uspto_03_05_19_unique_templates.hdf5 - unique template codes that are used together with the policy to generate new precursors in the tree search
* zinc_stock_17_04_20.hdf - stock file made from the ZINC database on 17 April 2020
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The data extract is a series of compressed ASCII text files of the full data set contained in the Canada Vigilance Adverse Reaction Online Database. It is intended for users who are familiar with database structures and setting up their own queries. Find details on the data structure required for the data file in the Canada Vigilance Adverse Reaction Online Database - Data Structure. In order to use the data, the file must be loaded into an existing database or information system provided by the user. The Canada Vigilance Adverse Reaction Online Database contains information about suspected adverse reactions (also known as side effects) to health products, captured from adverse reaction reports submitted to Health Canada by consumers and health professionals, who submit reports voluntarily, as well as by market authorization holders (manufacturers and distributors), who are required to submit reports according to the Food and Drugs Regulations. Information concerning vaccines used for immunization has only been included in the database since January 1, 2011. Indication data has recently been added to the data extract files and the Detailed Adverse Reaction Report. Indication refers to the particular condition for which a health product was taken. For example, diabetes is an indication for insulin. Health products are often authorised for use in treating more than one indication. Note: The database cannot be used on its own to evaluate a health product's safety profile. It does not provide conclusive information on the safety of health products, and is not a substitute for medical advice. Should you have an issue of medical concern, consult a qualified health professional.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The robust estimation of chemical kinetic parameters and their associated uncertainty is essential in the field of chemistry and catalysis. The Chemical Kinetics Bayesian Inference Toolbox (CKBIT) is a Python software library introduced to enable users to implement advanced Bayesian inference techniques for kinetic parameter estimation and uncertainty quantification. Leveraging functionalities of other open source Python packages and offering simplified implementation through minimal user-required coding and straightforward Excel input files, CKBIT aspires to make the inference method easily accessible for chemical kinetics. CKBIT provides maximum a posteriori, Markov chain Monte Carlo, and variational inference estimation options. Users may apply these functionalities to estimate activation energies, reaction orders, and pre-exponential terms from chemical reaction data from batch reactors, continuous stirred-tank reactors, and plug flow reactors. The availability of prior distribution specification and the implementation of hierarchical modeling in CKBIT provide a heightened level of accuracy in estimates of kinetic parameters and their uncertainties.
Ames Quantum Chemistry Dataset collects electronic structure, reaction kinetics, and dynamics data calculated at Ames Research Center. This includes potential energy curves and surfaces as well as the reaction cross sections and rate coefficients.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This CODMAC level 3 data set contains the key housekeeping parameters of the four reaction wheels. In particular, it provides information on the reaction wheel friction, measured angular momentum, and wheel direction. It covers the period from launch in 2004 through the three Earth flybys and one Mars flyby, the hibernation phases, and the asteroid flybys, and finally covers the Prelanding, Comet Escort, and Extension phases at the prime target of the mission. The prime target is comet 67P/Churyumov-Gerasimenko 1 (1969 R1). This version V1.0 is the first version of this dataset.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Although the Canada Vigilance Adverse Reaction Online Database is a relational database, there is a requirement to provide the data to users in a common format; therefore the data has been extracted into a flat file format. All files are dollar-sign ($) delimited, with each field enclosed in double quotes.
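A file in this layout can be read with Python's standard csv module by setting the delimiter and quote character explicitly. The sketch below uses an in-memory sample with made-up field values; the real column layout is defined in the Data Structure document, and the function name is hypothetical.

```python
import csv
import io


def read_extract(handle):
    """Parse a dollar-delimited, double-quoted flat file into lists of strings."""
    return list(csv.reader(handle, delimiter="$", quotechar='"'))


# Made-up sample lines for illustration; not actual database content.
sample = io.StringIO(
    '"1"$"000000001"$"Text with a $ inside"\n'
    '"2"$"000000002"$""\n'
)
rows = read_extract(sample)
for row in rows:
    print(row)
```

Because every field is quoted, a dollar sign inside a value does not split the field.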
Background
We would expect information on adverse drug reactions in randomised clinical trials to be easily retrievable from specific searches of electronic databases. However, complete retrieval of such information may not be straightforward, for two reasons. First, not all clinical drug trials provide data on the frequency of adverse effects. Secondly, not all electronic records of trials include terms in the abstract or indexing fields that enable us to select those with adverse effects data. We have determined how often automated search methods, using indexing terms and/or textwords in the title or abstract, would fail to retrieve trials with adverse effects data.
Methods
We used a sample set of 107 trials known to report frequencies of adverse drug effects, and measured the proportion that (i) were not assigned the appropriate adverse effects indexing terms in the electronic databases, and (ii) did not contain identifiable adverse effects textwords in the title or abstract.
Results
Of the 81 trials with records on both MEDLINE and EMBASE, 25 were not indexed for adverse effects in either database. Twenty-six trials were indexed in one database but not the other. Only 66 of the 107 trials reporting adverse effects data mentioned this in the abstract or title of the paper. Simultaneous use of textword and indexing terms retrieved only 82/107 (77%) papers.
Conclusions
Specific search strategies based on adverse effects textwords and indexing terms will fail to identify nearly a quarter of trials that report on the rate of drug adverse effects.
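The "nearly a quarter" figure follows directly from the counts reported in the Results; a short check of the arithmetic:

```python
total_trials = 107   # trials known to report adverse effect frequencies
retrieved = 82       # found by combined textword and indexing search
missed = total_trials - retrieved

print(f"retrieved: {retrieved / total_trials:.0%}")
print(f"missed:    {missed / total_trials:.0%}")
```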
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The reduction of certain group 4 metallocene dichlorides by magnesium or lithium in the presence or absence of Me3SiC2SiMe3 in THF or toluene was investigated, giving in the case of titanium the dinuclear Ti(III) complex [rac-(ebthi)Ti(μ-Cl)]2 (1). For zirconium the 1-oxa-2-zirconacyclohexane 2 was formed by ring-opening reaction of rac-(ebthi)Zr(η2-Me3SiC2SiMe3) with THF. As a byproduct from the synthesis of Cp*2Zr(η2-Me3SiC2SiMe3) starting from Cp*2ZrCl2 another 1-oxa-2-zirconacyclohexane (3) was obtained by ring-opening reaction of THF via the dinuclear complex Cp*2Zr(Cl)-(CH2)4O−Zr(Cl)Cp*2 (4). In the case of hafnium the analogous dinuclear complex Cp*2Hf(Cl)−(CH2)4O−Hf(Cl)Cp*2 (5) and 1-oxa-2-hafnacyclohexane (6) were the main products of the reaction, inhibiting the synthesis of Cp*2Hf(η2-Me3SiC2SiMe3) (7). The tendency for ring opening of THF initiated by metallocenes increases in the series Ti, Zr, Hf, thus leading to consequences for the synthesis of metallocene complexes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adiabatic temperature rise (ATR) is an important method for determining isocyanate conversion in polyurethane foam reactions as well as many other exothermic chemical reactions. ATR can be used in conjunction with change in height and mass measurements to gain insight into the blowing and gelling reactions that occur during polyurethane foaming, as well as to give important information on cell morphology. FoamPi is an open-source Raspberry Pi device for monitoring polyurethane foaming reactions. The device monitors temperature rise, change in foam height, and changes in mass during the reaction. Three Python scripts are also presented. The first logs raw data during the reaction. The second corrects temperature data such that it can be used in ATR reactions for calculating isocyanate conversion; additionally, this script reduces noise in all the data and removes erroneous readings. The final script extracts important information from the corrected data, such as maximum temperature change and maximum height change, as well as the time to reach these points. Commercial examples of such equipment exist; however, their price (£10,000) makes these systems inaccessible for many research laboratories. The FoamPi build presented is inexpensive (£350).
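The kind of summary produced by the final script can be sketched as follows. This is not the FoamPi code: the function name, the choice of the first reading as the baseline, and the sample readings are assumptions for illustration.

```python
def summarize(times, values):
    """Return (maximum change from the first reading, time at which it occurs)."""
    if not times or len(times) != len(values):
        raise ValueError("times and values must be non-empty and equally long")
    baseline = values[0]
    changes = [value - baseline for value in values]
    peak = max(range(len(changes)), key=changes.__getitem__)
    return changes[peak], times[peak]


# Hypothetical corrected readings (seconds, degrees Celsius, millimetres).
times_s = [0, 30, 60, 90, 120]
temps_c = [22.0, 45.5, 98.0, 131.5, 128.0]
heights_mm = [0.0, 40.0, 150.0, 210.0, 212.0]

max_dT, t_at_max_dT = summarize(times_s, temps_c)
max_dH, t_at_max_dH = summarize(times_s, heights_mm)
print(f"max temperature change: {max_dT} C at {t_at_max_dT} s")
print(f"max height change: {max_dH} mm at {t_at_max_dH} s")
```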
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental data for manuscript "Strengthening of Calcite Assemblages through Chemical Complexation Reaction" by R. C. Choens, J. Wilson, and A. G. Ilgen; Sandia National Laboratories. The data includes scanning electron microscope images of various calcite assemblages along with experimental data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets were used in the training and testing of Machine Learning Interatomic Potentials (MLIPs) as part of the work represented in the article titled "Does Hessian Data Improve the Performance of Machine Learning Potentials?".
RTP Dataset (Reactant–Transition State–Product Dataset):
The RTP dataset forms the core training and evaluation set and consists of 35,087 molecular geometries sampled from 11,961 unique elementary reactions. For each reaction, three critical geometries are included: the optimized reactant, transition state (TS), and product. Each geometry is labeled with its corresponding DFT-computed potential energy, atomic forces, and Hessian matrix, calculated at the wb97xd/6-31g(d) level of theory. This dataset represents stationary points (critical points) on the potential energy surface and serves as the foundation for training the MLIPs to reproduce energies, gradients, and curvatures.
IRC Dataset (Intrinsic Reaction Coordinate Dataset):
To assess the extrapolation performance of the trained MLIPs along continuous reaction pathways, a dataset of 34,248 geometries was compiled from 600 Intrinsic Reaction Coordinate (IRC) paths, each corresponding to a distinct elementary reaction in the RTP dataset. These geometries were obtained by following the minimum energy path (MEP) from the transition state to both reactant and product wells using quantum chemistry calculations at the wb97xd/6-31g(d) level of theory. While these geometries are not explicitly used in training, they provide a rigorous benchmark for evaluating the ability of MLIPs to generalize beyond training data and accurately model transition state connectivity and reaction dynamics.
NMS Dataset (Normal Mode Sampling Dataset):
To evaluate MLIP robustness on off-equilibrium, perturbed structures, 62,527 geometries were generated via Normal Mode Sampling (NMS).
These structures are derived by displacing intermediate IRC geometries along their vibrational modes with random amplitudes, simulating thermal fluctuations and non-equilibrium distortions. The properties of these perturbed structures were calculated at the wb97xd/6-31g(d) level of theory. This dataset allows for testing the model's stability and accuracy in more realistic, noisy molecular environments as encountered in molecular dynamics simulations or under experimental conditions.
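The displacement step behind normal mode sampling can be illustrated with a deliberately simplified sketch: each mode contributes its displacement vector scaled by a random amplitude. The Gaussian amplitudes with one fixed width, the function name, and the toy geometry are assumptions; the dataset's actual sampling scheme (for example, any scaling by force constants or temperature) is not described here and may differ.

```python
import random


def displace(coords, modes, scale=0.05, rng=None):
    """Displace a geometry along each mode by a random Gaussian amplitude.

    coords: list of (x, y, z) tuples; modes: list of flat 3N-long vectors.
    """
    rng = rng or random.Random()
    flat = [component for atom in coords for component in atom]
    for mode in modes:
        if len(mode) != len(flat):
            raise ValueError("each mode needs 3N components")
        amplitude = rng.gauss(0.0, scale)  # assumed: same width for every mode
        flat = [x + amplitude * d for x, d in zip(flat, mode)]
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


# Toy diatomic along x with a single hypothetical stretching mode.
geometry = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)]
stretch = [-0.7071, 0.0, 0.0, 0.7071, 0.0, 0.0]
perturbed = displace(geometry, [stretch], scale=0.05, rng=random.Random(0))
print(perturbed)
```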
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The data set is updated on a monthly basis and currently covers the following time period: 1965 to 2023-10-31. The data extract is a series of compressed ASCII text files of the full data set contained in the Canada Vigilance Adverse Reaction Online Database. It is intended for users who are familiar with database structures and setting up their own queries.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The Canada Vigilance Adverse Reaction Online Database contains information about suspected adverse reactions (also known as side effects) to health products.
https://www.nist.gov/open/copyright-fair-use-and-licensing-statements-srd-data-software-and-technical-series-publications#SRD
The NIST Chemistry WebBook provides users with easy access to chemical and physical property data for chemical species through the internet. The data provided in the site are from collections maintained by the NIST Standard Reference Data Program and outside contributors. Data in the WebBook system are organized by chemical species. The WebBook system allows users to search for chemical species by various means. Once the desired species has been identified, the system will display data for the species. Data include thermochemical properties of species and reactions, thermophysical properties of species, and optical, electronic and mass spectra.
Background
Better automation, lower cost per reaction, and a heightened interest in comparative genomics have led to a dramatic increase in DNA sequencing activities. Although the large sequencing projects of specialized centers are supported by in-house bioinformatics groups, many smaller laboratories face difficulties managing the appropriate processing and storage of their sequencing output. The challenges include documentation of clones, templates, and sequencing reactions, and the storage, annotation, and analysis of the large number of generated sequences.
Results
We describe here a new program, named FOUNTAIN, for the management of large sequencing projects. FOUNTAIN uses the Java computer language and data storage in a relational database. Starting with a collection of sequencing objects (clones), the program generates and stores information related to the different stages of the sequencing project using a web browser interface for user input. The generated sequences are subsequently imported and annotated based on BLAST searches against the public databases. In addition, simple algorithms to cluster sequences and determine putative polymorphic positions are implemented.
Conclusions
A simple but flexible and scalable software package is presented to facilitate data generation and storage for large sequencing projects. FOUNTAIN is open source and largely platform and database independent, and we hope it will be improved and extended in a community effort.