4 datasets found
  1. Database of scalable training of neural network potentials for complex interfaces through data augmentation

    • b2find.eudat.eu
    Updated Apr 2, 2025
    Cite
    (2025). Database of scalable training of neural network potentials for complex interfaces through data augmentation - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/46e840d3-d4f3-5754-b86f-30d99487fa30
    Description

    This database contains the reference data used for direct force training of Artificial Neural Network (ANN) interatomic potentials using the atomic energy network (ænet) and ænet-PyTorch packages (https://github.com/atomisticnet/aenet-PyTorch). It also includes the GPR-augmented data used for indirect force training via Gaussian Process Regression (GPR) surrogate models using the ænet-GPR package (https://github.com/atomisticnet/aenet-gpr). Each data file contains atomic structures, energies, and atomic forces in XCrySDen Structure Format (XSF). The dataset includes all reference training/test data and corresponding GPR-augmented data used in the four benchmark examples presented in the reference paper, “Scalable Training of Neural Network Potentials for Complex Interfaces Through Data Augmentation”. A hierarchy of the dataset is described in the README.txt file, and an overview of the dataset is also summarized in supplementary Table S1 of the reference paper.
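    The XSF files can be read with ænet's own tools or with ASE. As a rough illustration only, a minimal parser for an ATOMS-style XSF with a total-energy comment line and three force columns per atom might look like the following; the exact file layout is an assumption here, so consult the dataset's README.txt for the authoritative structure.

```python
def parse_xsf(text):
    """Parse a minimal aenet-style XSF string into (energy, atoms).

    Assumes (hypothetically) a '# total energy = <E> eV' comment line,
    an ATOMS block, and six numeric columns (x y z fx fy fz) per atom.
    Returns the energy and a list of (symbol, position, force) tuples.
    """
    energy = None
    atoms = []
    in_atoms = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") and "total energy" in line:
            # e.g. "# total energy = -123.400 eV"
            energy = float(line.split("=")[1].split()[0])
        elif line == "ATOMS":
            in_atoms = True
        elif in_atoms and line:
            parts = line.split()
            symbol, nums = parts[0], [float(v) for v in parts[1:7]]
            atoms.append((symbol, nums[:3], nums[3:]))
    return energy, atoms
```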

  2. Supplementary data (CC BY-NC-SA 4.0): A reactive neural network framework...

    • zenodo.org
    bin, zip
    Updated Jun 7, 2024
    Cite
    Andreas Erlebach; Martin Šípka; Indranil Saha; Petr Nachtigall; Christopher J. Heard; Lukáš Grajciar (2024). Supplementary data (CC BY-NC-SA 4.0): A reactive neural network framework for water-loaded acidic zeolites [Dataset]. http://doi.org/10.5281/zenodo.10361794
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andreas Erlebach; Martin Šípka; Indranil Saha; Petr Nachtigall; Christopher J. Heard; Lukáš Grajciar
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Time period covered
    Dec 12, 2023
    Description

    Content (Creative Commons Attribution Non Commercial Share Alike 4.0 International):

    This dataset provides supplementary data to "A reactive neural network framework for water-loaded acidic zeolites". It contains trained Neural Network Potentials (NNP and ΔNNP model), scripts, and all energy and force data used in this work at the (Δ)NNP, ReaxFF, and DFT (SCAN+D3(BJ) and ωB97X-D3(BJ)) level. Energy and forces are stored as ASE trajectory files (traj), readable by the Atomic Simulation Environment (ASE). In addition, this repository contains the generated training database with DFT (SCAN+D3(BJ)) energies and forces as SchNetPack1.0 database (SiAlOH.db) file readable by ASE and SchNetPack version 1.0.

    1. "aimd_simulations.zip" - VASP INCAR file, XDATCAR and traj file for 10 ps AIMD run (Supplementary Figure 6) and NNP level (re-)calculated energies/forces ("aimd_nnp_recalc.traj")
    2. "biased_dynamics.zip" - VASP/Plumed input and output files for DFT (SCAN+D3(BJ)) and NNP level biased dynamics including traj files (Supplementary Figure 12)
    3. "database_input.zip" - structure (cif) files of the initial structures used for database generation (Supplementary Table 1)
    4. "delta_nnp.zip" - (pytorch) ΔNNP model (compatible with SchNetPack version 1.0) together with example scripts
    5. "error_stats.zip" - traj files of all generalization tests (Figure 1 and Supplementary Figure 4) storing energies/forces at the SCAN+D3(BJ), ReaxFF, and NNP level as well as traj files with ΔNNP and ωB97X-D3(BJ) energies/forces for a subset taken from biased dynamics runs (Supplementary Figure 11)
    6. "md_simulations.zip" - NNP level MD trajectories of all generalization test runs (Figure 1 and Supplementary Figure 4), including an example script for an MD run
    7. "neb_calculations.zip" - traj files and example scripts for NEB calculations at the (Δ)NNP along with the corresponding DFT energy/force data (SCAN+D3(BJ) and ωB97X-D3(BJ))
    8. "nnps.zip" - (pytorch) NNP model files (compatible with SchNetPack version 1.0)
    9. "silica_database.zip" - output files of the single-point (SP) and optimization test runs (Supplementary Figure 1) of pure silica structures together with an example structure optimization script
    10. "SiAlOH.db" - DFT (SCAN+D3(BJ)) training database as SchNetPack1.0 database file readable by ASE and SchNetPack version 1.0

  3. MPI-MNIST Dataset

    • zenodo.org
    application/gzip, pdf
    Updated Jan 14, 2025
    Cite
    Meira Iske; Hannes Albers; Tobias Kluth; Tobias Knopp (2025). MPI-MNIST Dataset [Dataset]. http://doi.org/10.5281/zenodo.12799417
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Meira Iske; Hannes Albers; Tobias Kluth; Tobias Knopp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset for magnetic particle imaging based on the MNIST dataset.

    This dataset contains simulated MPI measurements along with ground truth phantoms selected from the MNIST database of handwritten digits (https://yann.lecun.com/exdb/mnist/). A state-of-the-art model-based system matrix is used to simulate the MPI measurements of the MNIST phantoms. These measurements are equipped with noise perturbations captured by the preclinical MPI system (Bruker, Ettlingen, Germany). The dataset can be utilized in its provided form, while additional data is included to offer flexibility for creating customized versions.

    MPI-MNIST features four different system matrices, each available in three spatial resolutions. The provided data is generated using a specified system matrix at the highest spatial resolution. Reconstruction can be performed using any of the provided system matrices at a lower resolution. This setup allows for simulating reconstructions from either an exact or an inexact forward operator. To cover further operator deviation setups, we provide additional noise data for applying pixelwise noise to the reconstruction system matrix.
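    The exact-versus-inexact operator setup can be sketched in a few lines of pure Python. The toy dimensions and Gaussian noise below are illustrative assumptions only; the dataset's actual noise consists of measured scanner noise stored in the MDF files.

```python
import random

random.seed(0)

# Toy stand-ins for illustration: a 4x6 "system matrix" A and a
# 6-pixel "phantom" x. In the real dataset these come from the MDF
# system-matrix files and the ground-truth HDF5 phantoms.
m, n = 4, 6
A = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
x = [random.uniform(0.0, 10.0) for _ in range(n)]

def matvec(M, v):
    # Plain matrix-vector product.
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# Noise-free observation (cf. {dataset}_obs) ...
y = matvec(A, x)

# ... plus additive measurement noise (cf. NoiseMeas_phantom_{dataset}.mdf).
y_noisy = [yi + random.gauss(0.0, 0.01) for yi in y]

# Inexact forward operator: pixelwise noise added to the reconstruction
# system matrix (cf. NoiseMeas_SM_{dataset}.mdf). Reconstructing y_noisy
# with A_inexact instead of A simulates operator deviation.
A_inexact = [[A[i][j] + random.gauss(0.0, 0.01) for j in range(n)]
             for i in range(m)]
```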

    To support the development of learning-based methods, a large number of further noise samples, captured by the Bruker scanner, is provided.

    For a detailed description of the dataset, see https://arxiv.org/abs/2501.05583.

    The Python-based GitHub repository at https://github.com/meiraiske/MPI-MNIST can be used to download the data from this website and prepare it for project use, including integration with PyTorch or PyTorch Lightning modules.

    File Structure

    All data, except for the phantoms, is provided in the MDF file format. This format is specifically tailored to store MPI data and contains metadata corresponding to the experimental setup. The ground truth phantoms are provided as HDF5 files since they do not require any metadata.

    • SM: Contains twelve system matrices named SM_{physical model}_{resolution}.mdf. It covers four physical models given in three resolutions ('coarse', 'int' and 'fine'). The highest resolution ('fine') is used for data generation.
    • large_noise: Contains large_NoiseMeas.mdf with 390060 noise measurements. Each noise measurement has been averaged over ten empty scanner measurements. This can be used e.g. for learning-based methods.

    For dataset in ['train', 'test']:

    • {dataset}_noise: Contains four noise matrices, where each noise measurement has been averaged over ten empty scanner measurements:
      1. NoiseMeas_phantom_{dataset}.mdf : Additive measurement noise for simulated measurements.
      2. NoiseMeas_phantom_bg_{dataset}.mdf : Unused noise reserved for background correction of 1.
      3. NoiseMeas_SM_{dataset}.mdf : System matrix noise that can be applied to each pixel of the reconstruction system matrix.
      4. NoiseMeas_SM_bg_{dataset}.mdf : Unused noise reserved for background correction of 3.
    • {dataset}_gt: Contains {dataset}_gt.hdf5 with flattened and preprocessed ground truth MNIST phantoms given in coarse resolution (15x17=255 pixels) with pixel values in [0, 10].
    • {dataset}_obs: Contains {dataset}_obs.mdf with noise free simulated measurements (observations) of {dataset}_gt.hdf5 using the system matrix stored in SM_fluid_opt_fine.mdf.
    • {dataset}_obsnoisy: Contains {dataset}_obsnoisy.mdf with noise-contaminated simulated measurements, resulting from {dataset}_obs.mdf and {dataset}_phantom_noise.mdf.


    In line with MNIST, each MDF/HDF5 file in {dataset}_gt, {dataset}_obs, {dataset}_obsnoisy for dataset in ['train', 'test'] contains 60000 samples for 'train' and 10000 samples for 'test'. The data can be manually reproduced in the intermediate resolution (45x51=2295 pixels) from the files in this dataset, using the system matrices in intermediate ('int') resolution for reconstruction and upsampling the ground truth phantoms by a factor of 3 per dimension. This case is also implemented in the GitHub repository.
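    The upsampling of the ground truth phantoms from coarse (15x17) to intermediate (45x51) resolution corresponds to a factor of 3 per dimension. A minimal sketch, assuming nearest-neighbor pixel replication (the repository may use a different interpolation):

```python
def upsample(img, factor=3):
    """Nearest-neighbor upsampling: replicate each pixel factor x factor
    times. A 15x17 phantom becomes 45x51 with factor=3."""
    return [[img[r // factor][c // factor]
             for c in range(len(img[0]) * factor)]
            for r in range(len(img) * factor)]

coarse = [[0, 5], [10, 0]]   # tiny stand-in for a 15x17 phantom
fine = upsample(coarse)      # 6x6 here; 45x51 for the real phantoms
```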

    The PDF file MPI-MNIST_Metadata.pdf contains a list of meta information for each of the MDF files of this dataset.

  4. Transformer Network trained on Simulated X-ray photoelectron spectroscopy data for organic and inorganic compounds

    • researchdata.tuwien.at
    Updated Jul 1, 2025
    Cite
    Florian Simperl (2025). Transformer Network trained on Simulated X-ray photoelectron spectroscopy data for organic and inorganic compounds [Dataset]. http://doi.org/10.48436/mvrkc-dz146
    Dataset provided by
    TU Wien
    Authors
    Florian Simperl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This data repository provides the underlying data and neural network training scripts associated with the manuscript titled "A Transformer Network for High-Throughput Material Characterisation with X-ray Photoelectron Spectroscopy" by Simperl and Werner.

    All data files are released under the Creative Commons Attribution 4.0 International (CC-BY) license, while all code files are distributed under the MIT license.

    The repository contains simulated X-ray photoelectron spectroscopy (XPS) spectra stored as hdf5 files in the zipped (h5_files.zip) folder, which was generated using the software developed by the authors. The NIST Standard Reference Database 100 – Simulation of Electron Spectra for Surface Analysis (SESSA) is freely available at https://www.nist.gov/srd/nist-standard-reference-database-100.

    The neural network architecture is implemented using the PyTorch Lightning framework and is fully available within the attached materials as Transformer_SimulatedSpectra.py contained in the python_scripts.zip.

    The trained model and the list of materials for the train, test and validation sets are contained in the models.zip folder.

    The repository contains all the data necessary to replot the figures from the manuscript. These data are available in the form of .csv files or .h5 files for the spectra. In addition, the repository also contains a Python script (Plot_Data_Manuscript.ipynb) which is contained in the python_scripts.zip file.

    Context and methodology

    The dataset and accompanying Python code files included in this repository were used to train a transformer-based neural network capable of directly inferring chemical concentrations from simulated survey X-ray photoelectron spectroscopy (XPS) spectra of bulk compounds.

    The spectral dataset provided here represents the raw output of the SESSA software (version 2.2.2), prior to the normalization procedure described in the associated manuscript. This normalization step is essential for effective training of the neural network.

    The repository contains the Python scripts used to execute the spectral simulations and the neural network training on the Vienna Scientific Cluster (VSC5). For guidance on configuring the Command Line Interface (CLI) tools required for SESSA, consult the official SESSA manual, available at https://nvlpubs.nist.gov/nistpubs/NSRDS/NIST.NSRDS.100-2024.pdf.

    To run the neural network training, the requirements_nn_training.txt file lists all necessary Python packages and version numbers. All other Python scripts can be run locally with the libraries listed in requirements_data_analysis.txt.

    Data details

    HDF5 (in zip folder): As described in the manuscript, we simulate X-ray photoelectron spectra for each of the 7,587 inorganic [1] and organic [2] materials in our dataset. To reflect realistic experimental conditions, each simulated spectrum was augmented by systematically varying parameters such as peak width, peak shift, and peak type (all configurable within the SESSA software), as well as by applying statistical Poisson noise to simulate varying signal-to-noise ratios. These modifications account for experimentally observed and material-specific spectral broadening, peak shifts, and detector-induced noise. Each material is represented by an individual HDF5 (.h5) file, named according to its chemical formula and mass density (in g/cm³). For example, the file for SiO2 with a density of 2.196 g/cm³ is named SiO2_2.196.h5. For more complex chemical formulas, such as Co(ClO4)2 with a density of 3.33 g/cm³, the file is named Co_ClO4_2_3.33.h5. Within each HDF5 file, the metadata for each spectrum is stored alongside a fixed energy axis and the corresponding intensity values. The spectral data are organized hierarchically by augmentation parameters, e.g. for Ac_10.0.h5: SNR_0/WIDTH_0.3/SHIFT_-3.0/PEAK_gauss/Ac_10.0/. These files can be inspected with H5Web in Visual Studio Code, with h5py in Python, or with any other HDF5-capable program.
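    The naming rule above can be sketched as a small helper. The parenthesis-to-underscore rule is inferred from the two examples in the description and is an assumption; it may not cover every formula in the dataset.

```python
def h5_filename(formula, density):
    """Build the per-material HDF5 filename from chemical formula and
    mass density in g/cm3, following the examples SiO2_2.196.h5 and
    Co_ClO4_2_3.33.h5. The '(' / ')' -> '_' mapping is inferred, not
    documented."""
    stem = formula.replace("(", "_").replace(")", "_")
    return f"{stem}_{density}.h5"
```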

    Session Files: The .ses files are SESSA-specific input files that can be loaded directly into SESSA to specify input parameters for the initialization (ini), the geometry (geo), and the simulation parameters (sim_para). They are required by the Python script Simulation_Script_VSC_json.py to run the simulation on the cluster.

    Json Files: The two json files (MaterialsListVSC_gauss.json, MaterialsListVSC_lorentz.json) are used as the input files to the Python script Simulation_Script_VSC_json.py. These files contain all the material specific information for the SESSA simulation.

    csv files: The csv files are used to generate the plots from the manuscript described in the section "Plotting Scripts".

    npz files: The two .npz files (element_counts.npz, single_elements.npz) are Python arrays needed by the Transformer_SimulatedSpectra.py script; they contain the count of each single element in the dataset and an array of each single element present, respectively.

    SESSA Simulation Script

    There is one Python file that handles the communication with SESSA:

    • Simulation_Script_VSC_json.py: This script is the heart of the simulation, controlling the communication with SESSA through the CLI using the input parameters specified in the .json and .ses files, together with external functions defined in VSC_function.py.

    Technical Details

    Simulation_Script_VSC_json.py: This script uses functions from VSC_function.py (which must therefore be placed in the same directory) and can be called with the following command:

    python3 Simulation_Script_VSC_json.py MaterialsListVSC_gauss.json 0

    It simulates the spectrum for the material at index 0 in the .json file and with the corresponding parameters specified in the .json file.
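    The index-based dispatch in that invocation can be sketched as follows. The assumption that the JSON holds a top-level list of per-material records is hypothetical; see MaterialsListVSC_gauss.json for the actual structure.

```python
import json
import sys

def select_material(json_path, index):
    """Return the material record at the given index from the JSON
    materials list (assumed here to be a top-level list of per-material
    records)."""
    with open(json_path) as fh:
        materials = json.load(fh)
    return materials[index]

# Mirrors: python3 Simulation_Script_VSC_json.py MaterialsListVSC_gauss.json 0
if __name__ == "__main__" and len(sys.argv) > 2:
    print(select_material(sys.argv[1], int(sys.argv[2])))
```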

    Before running this script, the following paths must be specified:

    • sessa_path: The path to the SESSA installation.
    • folder_path: The path to the .ses session files. An output folder will be generated in this directory, where all output files, including the simulated spectra, are written.

    To run SESSA on a computing cluster, a working Xvfb (X virtual framebuffer) or similar tool must be available to which any graphical output from SESSA can be written.

    Neural Network Training Script

    Before running the training script, the data must be normalized such that the squared integral of each spectrum is 1, as described in the manuscript and implemented in normalize_spectra.py.
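    A minimal sketch of this normalization condition, assuming a uniform energy grid and a rectangle-rule integral (normalize_spectra.py is the authoritative implementation):

```python
import math

def normalize_spectrum(intensities, de):
    """Scale intensities so the squared integral of the spectrum is 1,
    i.e. sum(I_k**2) * dE == 1 after scaling. Uses a rectangle rule on
    a uniform energy grid with spacing dE (an assumption of this
    sketch)."""
    norm = math.sqrt(sum(i * i for i in intensities) * de)
    return [i / norm for i in intensities]
```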

    For the neural network training we use Transformer_SimulatedSpectra.py, with its external functions defined in external_functions.py. This script contains the full description of the neural network architecture, the hyperparameter tuning, and the Wandb logging.

    The models.zip folder contains the fully trained network presented in the manuscript (final_trained_model.ckpt), as well as the lists of training, validation, and testing materials (train_materials_list.pt, val_materials_list.pt, test_materials_list.pt) whose corresponding spectra are extracted from the HDF5 files. The .ckpt and .pt files can be read with PyTorch's load function, e.g.

    torch.load("train_materials_list.pt")

    Technical Details

    normalize_spectra.py: To run this script, set up a Python environment with the libraries specified in the requirements_data_analysis.txt file. Then it can be called with

    python3 normalize_spectra.py

    where it is important to specify the path to the .h5 files containing the unnormalized spectra.

    Transformer_SimulatedSpectra.py: To run this script on the cluster, set up a Python environment with the libraries specified in the requirements_nn_training.txt file. The script also relies on external_functions.py, single_elements.npz, and element_counts.npz (which should be placed in the same directory as the script). These are needed for creating the training, validation, and testing datasets and ensure that every single element appears in the testing set. You can call this script (on the cluster) within a Slurm script to start the GPU training:

    python3 Transformer_SimulatedSpectra.py

    Before running this script, the following paths must be specified:

    • data_path: General path where all the data is stored
    • neural_network_data: The location where you keep your normalized HDF5 files
    • wandb_api_key: The API key to use Wandb
    • ray_results: The location where you want to save your tuning results
    • checkpoints: The location where you want to save your Ray checkpoints

