100+ datasets found
  1. Supporting data and code: Beyond Economic Dispatch: Modeling Renewable...

    • zenodo.org
    zip
    Updated Apr 15, 2025
    Cite
    Jaruwan Pimsawan; Stefano Galelli (2025). Supporting data and code: Beyond Economic Dispatch: Modeling Renewable Purchase Agreements in Production Cost Models [Dataset]. http://doi.org/10.5281/zenodo.15219959
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jaruwan Pimsawan; Stefano Galelli
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository provides the necessary data and Python code to replicate the experiments and generate the figures presented in our manuscript "Beyond Economic Dispatch: Modeling Renewable Purchase Agreements in Production Cost Models".

    Contents:

    • pownet.zip: Contains PowNet version 3.2, the specific version of the simulation software used in this study.
    • inputs.zip: Contains essential modeling inputs required by PowNet for the experiments, including network data, and pre-generated synthetic load and solar time series.
    • scripts.zip: Contains the Python scripts used for installing PowNet, optionally regenerating synthetic data, running simulation experiments, processing results, and generating figures.
    • thai_data.zip (Reference Only): Contains raw data related to the 2023 Thai power system. This data served as a reference during the creation of the PowNet inputs for this study but is not required to run the replication experiments themselves. Code to process the raw data is also provided.

    System Requirements:

    • Python version 3.10+
    • pip package manager

    Setup Instructions:

    1. Download and Unzip Core Files: Download pownet.zip, inputs.zip, scripts.zip, and thai_data.zip. Extract their contents into the same parent folder. Your directory structure should look like this:

      Parent_Folder/
      ├── pownet/     # from pownet.zip
      ├── inputs/     # from inputs.zip
      ├── scripts/    # from scripts.zip
      ├── thai_data/  # from thai_data.zip
      ├── figures/    # created by scripts later
      └── outputs/    # created by scripts later
    2. Install PowNet:

      • Open your terminal or command prompt.
      • Navigate into the pownet directory that you just extracted:

    cd path/to/Parent_Folder/pownet

    pip install -e .

      • These commands install PowNet and its required dependencies into your active Python environment.

    Workflow and Usage:

    Note: All subsequent Python script commands should be run from the scripts directory. Navigate to it first:

    cd path/to/Parent_Folder/scripts

    1. Generate Synthetic Time Series (Optional):

    • This step is optional as the required time series files are already provided within the inputs directory (extracted from inputs.zip). If you wish to regenerate them:
    • Run the generation scripts:
      python create_synthetic_load.py
      python create_synthetic_solar.py
    • Evaluate the generated time series (optional):
      python eval_synthetic_load.py
      python eval_synthetic_solar.py

    2. Calculate Total Solar Availability:

    • Process solar scenarios using data from the inputs directory:
      python process_scenario_solar.py
      

    3. Experiment 1: Compare Strategies for Modeling Purchase Obligations:

    • Run the base case simulations for different modeling strategies:
      • No Must-Take (NoMT):
        python run_basecase.py --model_name "TH23NMT"
        
      • Zero-Cost Renewables (ZCR):
        python run_basecase.py --model_name "TH23ZC"
        
      • Penalized Curtailment (Proposed Method):
        python run_basecase.py --model_name "TH23"
        
    • Run the base case simulation for the Minimum Capacity (MinCap) strategy:
      python run_min_cap.py

      A separate script is needed here because the MinCap strategy modifies the objective function and adds constraints.

    4. Experiment 2: Simulate Partial-Firm Contract Switching:

    • Run simulations comparing the base case with the partial-firm contract scenario:
      • Base Case Scenario:
        python run_scenarios.py --model_name "TH23"
        
      • Partial-Firm Contract Scenario:
        python run_scenarios.py --model_name "TH23ESB"
        

    5. Visualize Results:

    • Generate all figures presented in the manuscript:
      python run_viz.py
      
    • Figures will typically be saved in a figures directory within the Parent_Folder.
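    For convenience, the simulation and visualization commands above can also be batched from a single Python file. The following is a minimal sketch (not part of the released scripts), assuming it is run from the scripts directory:

      import subprocess

      # Experiment 1: base cases for the three modeling strategies, then MinCap
      for model in ["TH23NMT", "TH23ZC", "TH23"]:
          subprocess.run(["python", "run_basecase.py", "--model_name", model], check=True)
      subprocess.run(["python", "run_min_cap.py"], check=True)

      # Experiment 2: base case vs. partial-firm contract scenario
      for model in ["TH23", "TH23ESB"]:
          subprocess.run(["python", "run_scenarios.py", "--model_name", model], check=True)

      # Generate all figures from the simulation outputs
      subprocess.run(["python", "run_viz.py"], check=True)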
  2. text-to-python-synthetic

    • huggingface.co
    Cite
    AI Data Advice, text-to-python-synthetic [Dataset]. https://huggingface.co/datasets/AI-Data-Advice-Comp/text-to-python-synthetic
    Explore at:
    Dataset authored and provided by
    AI Data Advice
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    AI-Data-Advice-Comp/text-to-python-synthetic dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. Data from: A large synthetic dataset for machine learning applications in...

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Cite
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Explore at:
    zip, png, csv, json (available download formats)
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
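    A matched set of tables for one synthetic year can be loaded and sanity-checked with pandas; the sketch below assumes the yearly archives have already been unzipped into the working directory:

    import pandas as pd

    # Use the same label (reference year and index) for all three file types.
    loads = pd.read_csv('loads_2020_1.csv')
    gens = pd.read_csv('gens_2020_1.csv')
    lines = pd.read_csv('lines_2020_1.csv')

    # All tables of a given synthetic year share the same 8736 hourly rows.
    assert len(loads) == len(gens) == len(lines) == 8736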

    Usage

    The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  4. SDNist v1.3: Temporal Map Challenge Environment

    • datasets.ai
    • data.nist.gov
    • +1more
    Updated Aug 6, 2024
    + more versions
    Cite
    National Institute of Standards and Technology (2024). SDNist v1.3: Temporal Map Challenge Environment [Dataset]. https://datasets.ai/datasets/sdnist-benchmark-data-and-evaluation-tools-for-data-synthesizers
    Explore at:
    Available download formats
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    SDNist (v1.3) is a set of benchmark data and metrics for the evaluation of synthetic data generators on structured tabular data. This version (1.3) reproduces the challenge environment from Sprints 2 and 3 of the Temporal Map Challenge. These benchmarks are distributed as a simple open-source Python package to allow standardized and reproducible comparison of synthetic generator models on real-world data and use cases. These data and metrics were developed for and vetted through the NIST PSCR Differential Privacy Temporal Map Challenge, where the evaluation tools, k-marginal and Higher Order Conjunction, proved effective in distinguishing competing models in the competition environment. SDNist is available via pip (pip install sdnist==1.2.8, for Python >= 3.6) or from the USNIST GitHub repository. The sdnist Python module will download data from NIST as necessary, and users are not required to download data manually.

  5. SDNist: Benchmark data and evaluation tools for data synthesizers.

    • data.amerigeoss.org
    bin, csv, json +1
    Updated Dec 28, 2021
    Cite
    United States (2021). SDNist: Benchmark data and evaluation tools for data synthesizers. [Dataset]. http://identifiers.org/ark:/88434/mds2-2515
    Explore at:
    bin, json, csv, python 3.8 module (available download formats)
    Dataset updated
    Dec 28, 2021
    Dataset provided by
    United States
    License

    https://www.nist.gov/open/license

    Description

    SDNist is a set of benchmark data and metrics for the evaluation of synthetic data generators on structured tabular data. These benchmarks are distributed as a simple open-source Python package to allow standardized and reproducible comparison of synthetic generator models on real-world data and use cases. These data and metrics were developed for and vetted through the NIST PSCR Differential Privacy Temporal Map Challenge, where the evaluation tools, k-marginal and Higher Order Conjunction, proved effective in distinguishing competing models in the competition environment. SDNist is available via pip (pip install sdnist, for Python >= 3.6) or from the USNIST GitHub repository (https://github.com/usnistgov/SDNist/). The sdnist Python module will download data from NIST as necessary, and users are not required to download data manually.

  6. Data from: Domain-adaptive Data Synthesis for Large-scale Supermarket...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 5, 2024
    Cite
    Julian Strohmayer; Martin Kampel (2024). Domain-adaptive Data Synthesis for Large-scale Supermarket Product Recognition [Dataset]. http://doi.org/10.5281/zenodo.7750242
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julian Strohmayer; Martin Kampel
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition

    This repository contains the data synthesis pipeline and synthetic product recognition datasets proposed in [1].

    Data Synthesis Pipeline:

    We provide the Blender 3.1 project files and Python source code of our data synthesis pipeline (pipeline.zip), accompanied by the FastCUT models used for synthetic-to-real domain translation (models.zip). For the synthesis of new shelf images, a product assortment list and product images must be provided in the corresponding directories products/assortment/ and products/img/. The pipeline expects product images to follow the naming convention c.png, with c corresponding to a GTIN or generic class label (e.g., 9120050882171.png). The assortment list, assortment.csv, is expected to use the sample format [c, w, d, h], with c being the class label and w, d, and h being the packaging dimensions of the given product in mm (e.g., [4004218143128, 140, 70, 160]). The assortment list to use and the number of images to generate can be specified in generateImages.py (see comments). The rendering process is initiated by executing load.py either from within Blender or from a command-line terminal as a background process.
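    As an illustration of the expected input format (not code from the released pipeline), an assortment list can be written as follows; the first row repeats the example from the description above, while the second row pairs the example product-image GTIN with placeholder dimensions:

    import csv

    # Rows follow the [c, w, d, h] sample format: class label (GTIN or generic
    # label), then packaging width, depth, and height in mm.
    assortment = [
        [4004218143128, 140, 70, 160],   # example from the description
        [9120050882171, 90, 60, 210],    # placeholder dimensions, illustrative only
    ]

    with open('products/assortment/assortment.csv', 'w', newline='') as f:
        csv.writer(f).writerows(assortment)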

    Datasets:

    • SG3k - Synthetic GroZi-3.2k (SG3k) dataset, consisting of 10,000 synthetic shelf images with 851,801 instances of 3,234 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SG3kt - Domain-translated version of SG3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SGI3k - Synthetic GroZi-3.2k dataset, consisting of 10,000 synthetic shelf images with 838,696 instances of 1,063 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SGI3kt - Domain-translated version of SGI3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.
    • SPS8k - Synthetic Product Shelves 8k (SPS8k) dataset, comprised of 16,224 synthetic shelf images with 1,981,967 instances of 8,112 supermarket products. Instance-level bounding boxes and GTIN class labels are provided for all product instances.
    • SPS8kt - Domain-translated version of SPS8k, utilizing SKU110k as the target domain. Instance-level bounding boxes and GTIN class labels for all product instances.

    Table 1: Dataset characteristics.

    Dataset   #images   #products   #instances   Labels                          Translation
    SG3k      10,000    3,234       851,801      bounding box & generic class¹   none
    SG3kt     10,000    3,234       851,801      bounding box & generic class¹   GroZi-3.2k
    SGI3k     10,000    1,063       838,696      bounding box & generic class²   none
    SGI3kt    10,000    1,063       838,696      bounding box & generic class²   GroZi-3.2k
    SPS8k     16,224    8,112       1,981,967    bounding box & GTIN             none
    SPS8kt    16,224    8,112       1,981,967    bounding box & GTIN             SKU110k

    Sample Format

    A sample consists of an RGB image (i.png) and an accompanying label file (i.txt), which contains the labels for all product instances present in the image. Labels use the YOLO format [c, x, y, w, h].

    ¹SG3k and SG3kt use generic pseudo-GTIN class labels, created by combining the GroZi-3.2k food product category number i (1-27) with the product image index j (j.jpg), following the convention i0000j (e.g., 13000097).

    ²SGI3k and SGI3kt use the generic GroZi-3.2k class labels from https://arxiv.org/abs/2003.06800.
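    A minimal Python sketch for reading one label file in the YOLO format described above (in the usual YOLO convention, x, y, w, h are the normalised box centre and box size):

    def read_labels(path):
        # Each line of i.txt holds one product instance: [c, x, y, w, h]
        boxes = []
        with open(path) as f:
            for line in f:
                c, x, y, w, h = line.split()
                boxes.append((c, float(x), float(y), float(w), float(h)))
        return boxes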

    Download and Use
    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, Julian, and Martin Kampel. "Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition." International Conference on Computer Analysis of Images and Patterns. Cham: Springer Nature Switzerland, 2023.

    BibTeX citation:

    @inproceedings{strohmayer2023domain,
     title={Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition},
     author={Strohmayer, Julian and Kampel, Martin},
     booktitle={International Conference on Computer Analysis of Images and Patterns},
     pages={239--250},
     year={2023},
     organization={Springer}
    }
  7. CK4Gen, High Utility Synthetic Survival Datasets

    • figshare.com
    zip
    Updated Nov 5, 2024
    Cite
    Nicholas Kuo (2024). CK4Gen, High Utility Synthetic Survival Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.27611388.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 5, 2024
    Dataset provided by
    figshare
    Authors
    Nicholas Kuo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview:

    This repository provides high-utility synthetic survival datasets generated using the CK4Gen framework, optimised to retain critical clinical characteristics for use in research and educational settings. Each dataset is based on a carefully curated ground truth dataset, processed with standardised variable definitions and analytical approaches, ensuring a consistent baseline for survival analysis.

    Description:

    The repository includes synthetic versions of four widely utilised and publicly accessible survival analysis datasets, each anchored in foundational studies and aligned with established ground truth variations to support robust clinical research and training.

    • GBSG2: Based on Schumacher et al. [1]. The study evaluated the effects of hormonal treatment and chemotherapy duration in node-positive breast cancer patients, tracking recurrence-free and overall survival among 686 women over a median of 5 years. Our synthetic version is derived from a variation of the GBSG2 dataset available in the lifelines package [2], formatted to match the descriptions in Sauerbrei et al. [3], which we treat as the ground truth.
    • ACTG320: Based on Hammer et al. [4]. The study investigates the impact of adding the protease inhibitor indinavir to a standard two-drug regimen for HIV-1 treatment. The original clinical trial involved 1,151 patients with prior zidovudine exposure and low CD4 cell counts, tracking outcomes over a median follow-up of 38 weeks. Our synthetic dataset is derived from a variation of the ACTG320 dataset available in the sksurv package [5], which we treat as the ground truth dataset.
    • WHAS500: Based on Goldberg et al. [6]. The study follows 500 patients to investigate survival rates following acute myocardial infarction (MI), capturing a range of factors influencing MI incidence and outcomes. Our synthetic data replicates a ground truth variation from the sksurv package, which we treat as the ground truth dataset.
    • FLChain: Based on Dispenzieri et al. [7]. The study assesses the prognostic relevance of serum immunoglobulin free light chains (FLCs) for overall survival in a large cohort of 15,859 participants. Our synthetic version is based on a variation available in the sksurv package, which we treat as the ground truth dataset.

    Notes:

    An in-depth discussion of these datasets and their generation process can be found in our paper: Kuo, et al. "CK4Gen: A Knowledge Distillation Framework for Generating High-Utility Synthetic Survival Datasets in Healthcare." arXiv preprint arXiv:2410.16872 (2024). https://arxiv.org/abs/2410.16872

    References:

    [1] Schumacher, et al. "Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group." Journal of Clinical Oncology, 1994.
    [2] Davidson-Pilon. "lifelines: Survival Analysis in Python." Journal of Open Source Software, 2019.
    [3] Sauerbrei, et al. "Modelling the effects of standard prognostic factors in node-positive breast cancer." British Journal of Cancer, 1999.
    [4] Hammer, et al. "A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less." New England Journal of Medicine, 1997.
    [5] Pölsterl. "scikit-survival: A library for time-to-event analysis built on top of scikit-learn." Journal of Machine Learning Research, 2020.
    [6] Goldberg, et al. "Incidence and case fatality rates of acute myocardial infarction (1975–1984): the Worcester Heart Attack Study." American Heart Journal, 1988.
    [7] Dispenzieri, et al. "Use of nonclonal serum immunoglobulin free light chains to predict overall survival in the general population." Mayo Clinic Proceedings, 2012.
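    For reference, the publicly cited ground-truth variants can be loaded directly from the packages mentioned above. A minimal sketch, assuming current releases of lifelines and scikit-survival expose these loaders (this is not the CK4Gen code, and the ACTG320 loader is omitted because its function name is not given in this listing):

    from lifelines.datasets import load_gbsg2                # GBSG2 variant [2]
    from sksurv.datasets import load_whas500, load_flchain   # WHAS500 and FLChain variants [5]

    gbsg2 = load_gbsg2()              # pandas DataFrame
    X_whas, y_whas = load_whas500()   # features plus structured survival outcome
    X_flc, y_flc = load_flchain()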

  8. Data archive for paper "Copula-based synthetic data augmentation for...

    • zenodo.org
    zip
    Updated Mar 15, 2022
    Cite
    David Meyer; David Meyer (2022). Data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators" [Dataset]. http://doi.org/10.5281/zenodo.5081927
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Meyer; David Meyer
    Description

    Overview

    This is the data archive for paper “Copula-based synthetic data augmentation for machine-learning emulators”. It contains the paper’s data archive with all model outputs as well as the Singularity image.

    For the Python tool used to generate the synthetic data, please refer to the Synthia repository.

    Requirements

    Although PBS is not a strict requirement, it is required to run all helper scripts as included in this repository. Please note that depending on your specific system settings and resource availability, you may need to modify PBS parameters at the top of submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=72:00:00).

    Usage

    To reproduce the results from the experiments described in the paper, first fit all copula models to the reduced NWP-SAF dataset with:

    qsub hpc/fit.sh

    then, to generate synthetic data, run all machine learning model configurations, and compute the relevant statistics use:

    qsub hpc/stats.sh
    qsub hpc/ml_control.sh
    qsub hpc/ml_synth.sh

    Finally, to plot all artifacts included in the paper use:

    qsub hpc/plot.sh

    Licence

    Code released under MIT license. Data released under CC BY 4.0.

  9. Kimberlina 1.2 CCUS Geophysical Models and Synthetic Data Sets

    • osti.gov
    Updated Sep 14, 2022
    Cite
    USDOE Office of Fossil Energy (FE) (2022). Kimberlina 1.2 CCUS Geophysical Models and Synthetic Data Sets [Dataset]. http://doi.org/10.18141/1887287
    Explore at:
    Dataset updated
    Sep 14, 2022
    Dataset provided by
    National Energy Technology Laboratory (https://netl.doe.gov/)
    USDOE Office of Fossil Energy (FE)
    Description

    This synthetic multi-scale and multi-physics data set was produced in collaboration with teams at the Lawrence Berkeley National Laboratory, National Energy Technology Laboratory, Los Alamos National Laboratory, and Colorado School of Mines through the Science-informed Machine Learning for Accelerating Real-Time Decisions in Subsurface Applications (SMART) Initiative. Data are associated with the following publication: Alumbaugh, D., Gasperikova, E., Crandall, D., Commer, M., Feng, S., Harbert, W., Li, Y., Lin, Y., and Samarasinghe, S., "The Kimberlina Synthetic Geophysical Model and Data Set for CO2 Monitoring Investigations", The Geoscience Data Journal, 2023, DOI: 10.1002/gdj3.191.

    The dataset uses the Kimberlina 1.2 CO2 reservoir flow model simulations based on a hypothetical CO2 storage site in California (Birkholzer et al., 2011; Wainwright et al., 2013). Geophysical properties models (P- and S-wave seismic velocities, saturated density, and electrical resistivity) were produced with an approach similar to that of Yang et al. (2019) and Gasperikova et al. (2022) for 100 Kimberlina 1.2 reservoir models. Links to individual resources are provided below: CO2 Saturation Models; Resistivity Models – part 1, part 2, and part 3; Vp Velocity Models; Vs Velocity Models; Density Models.

    The 3D distributions of geophysical properties for the 33 time stamps of the SIM001 model were used to generate synthetic seismic, gravity, and electromagnetic (EM) responses for 33 times between zero and 200 years. Synthetic surface seismic data were generated using 2D and 3D finite-difference codes that simulate the acoustic wave equation (Moczo et al., 2007). 2D data were simulated for six point-pressure sources along a 2D line with 10 m receiver spacing and a time spacing of 0.0005 s. 3D simulations were completed for 25 surface pressure sources using a source separation of 1 km in both the x and y directions and a time spacing of 0.001 s. Links to individual resources are provided below: 2D velocity models and 2D surface seismic data. 3D velocity models, and 3D seismic data year0, year1, year2, year5, year10, year15, year20, year25, year30, year35, year40, year45, year49, year50, year51, year52, year55, year60, year65, year70, year75, year80, year85, year90, year95, year100, year110, year120, year130, year140, year150, year175, year200. The Python scripts to read these models and data are provided here.

    EM simulations used a borehole-to-surface survey configuration, with the source located near the reservoir level and receivers on the surface, using the code developed by Commer and Newman (2008). Pseudo-2D data for the source at 2500 m and 3025 m used a 2D inline receiver configuration to simulate a response over 3D resistivity models. The 3D data contain electric fields generated by borehole sources at monitoring well locations and measured over a surface receiver grid.

    Vector gravity data, both on the surface and in boreholes, were simulated using a modeling code developed by Rim and Li (2015). The simulation scenarios were parallel to those used for the EM: pseudo-2D data were calculated along the same lines and within the same boreholes, and 3D data were simulated over 3D models on the surface and in three monitoring wells. A series of synthetic well logs of CO2 saturation, acoustic velocity, density, and induction resistivity in the injection well and three monitoring wells are also provided at 0, 1, 2, 5, 10, 15, and 20 years after the initiation of injection.
These were constructed by combining the low-frequency trend of the geophysical models with the high-frequency variations of actual well logs collected in the Kimberlina 1 well that was drilled at the proposed site. Measurements of permeability and pore connectivity were made on cores of Vedder Sandstone, which forms the primary reservoir unit: [CT micro

  10. SPIDER - Synthetic Person Information Dataset for Entity Resolution

    • figshare.com
    Updated Jul 18, 2025
    Cite
    Praveen Chinnappa; Rose Mary Arokiya Dass; yash mathur (2025). SPIDER - Synthetic Person Information Dataset for Entity Resolution [Dataset]. http://doi.org/10.6084/m9.figshare.29595599.v1
    Explore at:
    text/x-script.python (available download formats)
    Dataset updated
    Jul 18, 2025
    Dataset provided by
    figshare
    Authors
    Praveen Chinnappa; Rose Mary Arokiya Dass; yash mathur
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SPIDER (Synthetic Person Information Dataset for Entity Resolution) offers researchers ready-to-use data for benchmarking duplicate-detection and entity resolution algorithms. The dataset is aimed at person-level fields that are typical of customer data. Because it is hard to source real-world person-level data due to Personally Identifiable Information (PII) constraints, very few synthetic datasets are publicly available, and the existing ones are small and often miss core person-level fields. SPIDER addresses these challenges by focusing on core person-level attributes: first/last name, email, phone, address, and date of birth. Using the Python Faker library, 40,000 unique, synthetic person records are created. An additional 10,000 duplicate records are generated from the base records using 7 real-world transformation rules. Each duplicate record is linked to its original base record and to the rule used to generate it through the is_duplicate_of and duplication_rule fields.

    Duplicate Rules

    • Duplicate record with a variation in email address
    • Duplicate record with a variation in email address
    • Duplicate record with last name variation
    • Duplicate record with first name variation
    • Duplicate record with a nickname
    • Duplicate record with near exact spelling
    • Duplicate record with only same email and name

    Output Format

    The dataset is presented in both JSON and CSV formats for use in data processing and machine learning tools.

    Data Regeneration

    The project includes the Python script used for generating the 50,000 person records. The script can be expanded to include additional duplicate rules, fuzzy names, geographical name variations, and volume adjustments.

    Files Included

    • spider_dataset_20250714_035016.csv
    • spider_dataset_20250714_035016.json
    • spider_readme.md
    • DataDescriptions
    • pythoncodeV1.py
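    A minimal sketch of how such records can be produced with Faker (illustrative only; the field names below are not necessarily those used in the released files):

    from faker import Faker

    fake = Faker()

    def make_person(pid):
        # One synthetic base record with the core person-level attributes
        # listed above: name, email, phone, address, and date of birth.
        return {
            'id': pid,
            'first_name': fake.first_name(),
            'last_name': fake.last_name(),
            'email': fake.email(),
            'phone': fake.phone_number(),
            'address': fake.address().replace('\n', ', '),
            'dob': fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
        }

    base_records = [make_person(i) for i in range(1000)]  # SPIDER itself uses 40,000

    # One duplicate rule as an example: vary the email address of a base record.
    dup = dict(base_records[0],
               email=fake.email(),
               is_duplicate_of=base_records[0]['id'],
               duplication_rule='email_variation')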

  11. 1 Million Employee (Synthetic Data) By Faker

    • kaggle.com
    Updated Nov 26, 2023
    Cite
    Shivam Maurya (2023). 1 Million Employee (Synthetic Data) By Faker [Dataset]. https://www.kaggle.com/datasets/mauryansshivam/1-million-employee-synthetic-data-by-faker/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shivam Maurya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    1 Million Employee (Synthetic Data) By Faker contains 1 million fake employee records generated with the Faker library in Python.

    Note: it does not contain any real data and is generated only for testing purposes.

    Python Code used for generating Data
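    The original generation script is not reproduced in this listing; the following is only an illustrative Faker sketch of how a table of this kind could be produced (the actual column schema is not specified here):

    import csv
    from faker import Faker

    fake = Faker()

    with open('employees.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        # Hypothetical columns, for illustration only.
        writer.writerow(['employee_id', 'name', 'email', 'job', 'company', 'city'])
        for i in range(1_000_000):   # one million synthetic rows
            writer.writerow([i + 1, fake.name(), fake.email(), fake.job(),
                             fake.company(), fake.city()])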

  12. Data from: ESAT: Environmental Source Apportionment Toolkit Python package

    • s.cnmilf.com
    • catalog.data.gov
    Updated Nov 29, 2024
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). ESAT: Environmental Source Apportionment Toolkit Python package [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/esat-environmental-source-apportionment-toolkit-python-package
    Explore at:
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The Environmental Source Apportionment Toolkit (ESAT) is an open-source software package that provides API and CLI functionality to create source apportionment workflows specifically targeting environmental datasets. Source apportionment in environmental science is the process of mathematically estimating the profiles and contributions of multiple sources in some dataset; in the case of ESAT, this is done while considering data uncertainty. There are many potential use cases for source apportionment in environmental science research, such as in the fields of air quality, water quality, and potentially many others. The ESAT toolkit is written in Python and Rust, and uses common packages such as numpy, scipy and pandas for data processing. The source apportionment algorithms provided in ESAT include two variants of non-negative matrix factorization (NMF), both of which are written in Rust and contained within the Python package. A collection of data processing and visualization features is included for data and model analytics. The ESAT package includes a synthetic data generator and comparison tools to evaluate ESAT model outputs.
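    As a generic illustration of NMF-based source apportionment (this is not the ESAT API), a samples-by-species concentration matrix X can be factorised into source contributions W and source profiles H so that X ≈ W @ H:

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    X = rng.random((200, 15))    # hypothetical samples-by-species concentrations

    model = NMF(n_components=4, init='nndsvda', max_iter=500, random_state=0)
    W = model.fit_transform(X)   # contributions: samples x factors
    H = model.components_        # profiles: factors x species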

  13. replicAnt - Plum2023 - Pose-Estimation Datasets and Trained Models

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 21, 2023
    + more versions
    Cite
    Beck, Hendrik (2023). replicAnt - Plum2023 - Pose-Estimation Datasets and Trained Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7849595
    Explore at:
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    Imirzian, Natalie
    Plum, Fabian
    Bulla, René
    Labonte, David
    Beck, Hendrik
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the pose-estimation experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data were generated with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data has been generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Benchmark data

    Two pose-estimation datasets were procured. Both datasets used first instar Sungaya inexpectata (Zompro 1996) stick insects as a model species. Recordings from an evenly lit platform served as representative of controlled laboratory conditions; recordings from a hand-held phone camera served as an approximate example of serendipitous recordings in the field.

    For the platform experiments, walking S. inexpectata were recorded using a calibrated array of five FLIR blackfly colour cameras (Blackfly S USB3, Teledyne FLIR LLC, Wilsonville, Oregon, U.S.), each equipped with 8 mm c-mount lenses (M0828-MPW3 8MM 6MP F2.8-16 C-MOUNT, CBC Co., Ltd., Tokyo, Japan). All videos were recorded with 55 fps, and at the sensors’ native resolution of 2048 px by 1536 px. The cameras were synchronised for simultaneous capture from five perspectives (top, front right and left, back right and left), allowing for time-resolved, 3D reconstruction of animal pose.

    The handheld footage was recorded in landscape orientation with a Huawei P20 (Huawei Technologies Co., Ltd., Shenzhen, China) in stabilised video mode: S. inexpectata were recorded walking across cluttered environments (hands, lab benches, PhD desks etc), resulting in frequent partial occlusions, magnification changes, and uneven lighting, so creating a more varied pose-estimation dataset.

    Representative frames were extracted from videos using DeepLabCut (DLC)-internal k-means clustering. 46 key points in 805 and 200 frames for the platform and handheld case, respectively, were subsequently hand-annotated using the DLC annotation GUI.

    Synthetic data

    We generated a synthetic dataset of 10,000 images at a resolution of 1500 by 1500 px, based on a 3D model of a first instar S. inexpectata specimen, generated with the scAnt photogrammetry workflow. Generating 10,000 samples took about three hours on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super). We applied 70% scale variation, and enforced hue, brightness, contrast, and saturation shifts, to generate 10 separate sub-datasets containing 1000 samples each, which were combined to form the full dataset.

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

  14. Synthetic total-field magnetic anomaly data and code to perform Euler...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa (2023). Synthetic total-field magnetic anomaly data and code to perform Euler deconvolution on it [Dataset]. http://doi.org/10.6084/m9.figshare.923450.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Leonardo Uieda; Vanderlei C. Oliveira Jr.; Valeria C. F. Barbosa
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic data, source code, and supplementary text for the article "Euler deconvolution of potential field data" by Leonardo Uieda, Vanderlei C. Oliveira Jr., and Valéria C. F. Barbosa. This is part of a tutorial submitted to The Leading Edge (http://library.seg.org/journal/tle). Results were generated using the open-source Python package Fatiando a Terra version 0.2 (http://www.fatiando.org). This material along with the manuscript can also be found at https://github.com/pinga-lab/paper-tle-euler-tutorial

    Synthetic data and model

    Examples in the tutorial use synthetic data generated with the IPython notebook create_synthetic_data.ipynb. File synthetic_data.txt has 4 columns: x (north), y (east), z (down) and the total field magnetic anomaly. x, y, and z are in meters. The total field anomaly is in nanoTesla (nT). File metadata.json contains extra information about the data, such as inclination and declination of the inducing field (in degrees), shape of the data grid (number of points in y and x, respectively), the area containing the data (W, E, S, N, in meters), and the model boundaries (W, E, S, N, top, bottom, in meters). File model.pickle is a serialized version of the model used to generate the data. It contains a list of instances of the PolygonalPrism class of Fatiando a Terra. The serialization was done using the cPickle Python module.

    Reproducing the results in the tutorial

    The notebook euler-deconvolution-examples.ipynb runs the Euler deconvolution on the synthetic data and generates the figures for the manuscript. It also presents a more detailed explanation of the method and more tests than went into the finished manuscript.
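    A minimal sketch for reading the synthetic data file with numpy (assuming any header lines in synthetic_data.txt are '#'-commented, which numpy skips by default; the metadata.json key names are not spelled out in this description and are therefore not shown):

    import numpy as np

    # Columns: x (north), y (east), z (down) in meters, and the
    # total-field magnetic anomaly in nT.
    x, y, z, anomaly = np.loadtxt('synthetic_data.txt', unpack=True)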

  15. SPP_30K_reasoning_tasks

    • huggingface.co
    Updated Aug 20, 2023
    + more versions
    Cite
    Farouk (2023). SPP_30K_reasoning_tasks [Dataset]. https://huggingface.co/datasets/pharaouk/SPP_30K_reasoning_tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 20, 2023
    Authors
    Farouk
    Description

    Dataset Card for "SPP_30K_verified_tasks"

      Dataset Summary
    

    This is an augmented version of the Synthetic Python Problems (SPP) Dataset. It has been generated from the subset of the data that has been de-duplicated and verified using a Python interpreter (SPP_30k_verified.jsonl). The original dataset contains small Python functions that include a docstring with a small description of what the function does and some calling examples for the function. The current… See the full description on the dataset page: https://huggingface.co/datasets/pharaouk/SPP_30K_reasoning_tasks.
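    A minimal sketch for loading the dataset with the Hugging Face datasets library (split and column names are as defined on the dataset page):

    from datasets import load_dataset

    ds = load_dataset('pharaouk/SPP_30K_reasoning_tasks')
    print(ds)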

  16. Python code and data examples for SMatStack

    • data.mendeley.com
    Updated Oct 20, 2023
    Cite
    Hongjian Fang (2023). Python code and data examples for SMatStack [Dataset]. http://doi.org/10.17632/sxbmk6c2t3.1
    Explore at:
    Dataset updated
    Oct 20, 2023
    Authors
    Hongjian Fang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set includes Python code, synthetic data, and observed data for an earthquake on the Chain transform fault. A Jupyter notebook is also provided with examples showing how to run the code.

  17. gretel-text-to-python-fintech-en-v1

    • huggingface.co
    Updated Nov 12, 2024
    Cite
    Gretel.ai (2024). gretel-text-to-python-fintech-en-v1 [Dataset]. https://huggingface.co/datasets/gretelai/gretel-text-to-python-fintech-en-v1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Gretel Synthetic Text-to-Python Dataset for FinTech

    This dataset is a synthetically generated collection of natural language prompts paired with their corresponding Python code snippets, specifically tailored for the FinTech industry. Created using Gretel Navigator's Data Designer, with mistral-nemo-2407 and Qwen/Qwen2.5-Coder-7B as the backend models, it aims to bridge the gap between natural language inputs and high-quality Python code, empowering professionals to implement… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/gretel-text-to-python-fintech-en-v1.

  18. synthetic-code-understanding-v2-python

    • huggingface.co
    Updated Jun 2, 2025
    Cite
    Justus Mattern (2025). synthetic-code-understanding-v2-python [Dataset]. https://huggingface.co/datasets/justus27/synthetic-code-understanding-v2-python
    Explore at:
    Dataset updated
    Jun 2, 2025
    Authors
    Justus Mattern
    Description

    justus27/synthetic-code-understanding-v2-python dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. metal-python-synthetic-explanations-gpt4

    • huggingface.co
    + more versions
    Cite
    LUM AI, metal-python-synthetic-explanations-gpt4 [Dataset]. https://huggingface.co/datasets/lum-ai/metal-python-synthetic-explanations-gpt4
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    LUM AI
    Description

    Dataset Card for "metal-python-synthetic-explanations-gpt4"

    More Information needed

  20. tiny-codes

    • huggingface.co
    Updated Sep 6, 2023
    Cite
    Nam Pham (2023). tiny-codes [Dataset]. http://doi.org/10.57967/hf/0937
    Explore at:
    Dataset updated
    Sep 6, 2023
    Authors
    Nam Pham
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Reasoning with Language and Code

    This synthetic dataset is a collection of 1.6 million short and clear code snippets that can help LLM models learn how to reason with both natural and programming languages. The dataset covers a wide range of programming languages, such as Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. It also includes two database languages: Cypher (for graph databases) and SQL (for relational databases) in order to study the… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-codes.
