MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset Description
Overview: This dataset contains three distinct fake datasets generated using the Faker and Mimesis libraries. These libraries are commonly used for generating realistic-looking synthetic data for testing, prototyping, and data science projects. The datasets were created to simulate real-world scenarios while ensuring no sensitive or private information is included.
Data Generation Process: The data creation process is documented in the accompanying notebook, Creating_simple_Sintetic_data.ipynb. This notebook showcases the step-by-step procedure for generating synthetic datasets with customizable structures and fields using the Faker and Mimesis libraries.
File Contents:
Datasets: CSV files containing the three synthetic datasets. Notebook: Creating_simple_Sintetic_data.ipynb detailing the data generation process and the code used to create these datasets.
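As a hedged illustration of how Faker and Mimesis are typically combined for this kind of generation (a minimal sketch, not the notebook's actual code; the field names are assumptions):

import csv
from faker import Faker
from mimesis import Person

fake = Faker()
person = Person()

# Generate a small table of synthetic people (illustrative fields only)
rows = []
for _ in range(100):
    rows.append({
        "name": fake.name(),                            # Faker: realistic full name
        "email": person.email(),                        # Mimesis: synthetic email address
        "address": fake.address().replace("\n", ", "),  # Faker: one-line postal address
        "occupation": person.occupation(),              # Mimesis: job title
    })

with open("synthetic_people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)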
https://www.licenses.ai/ai-licenses
This dataset uses Gemma 7B-IT to generate a synthetic dataset for the LLM Prompt Recovery competition.
Please go upvote these other datasets, as my work would not be possible without them.
Update 1 - February 29, 2024
The only file presently found in this dataset is gemma1000_7b.csv which uses the dataset created by @thedrcat found here: https://www.kaggle.com/datasets/thedrcat/llm-prompt-recovery-data?select=gemma1000.csv
The file below is the file Darek created, with two additional columns appended. The first is the raw output of Gemma 7B-IT (generated per the instructions below, vs. the 2B-IT that Darek used), and the second is the same output with the leading 'Sure... blah blah' sentence removed.
I generated things using the following setup:
# I used a vLLM server to host Gemma 7B-IT on Paperspace (A100)
# Step 1 - Install vLLM
$ pip install vllm
# Step 2 - Authenticate the Hugging Face CLI (for the model weights)
$ huggingface-cli login --token <YOUR_HF_TOKEN>
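The description stops after authentication. As a hedged sketch of the remaining steps (an assumption, not the author's exact commands), one would launch vLLM's OpenAI-compatible server and query it:

# Step 3 (assumed) - Serve Gemma 7B-IT with vLLM's OpenAI-compatible server
$ python -m vllm.entrypoints.openai.api_server --model google/gemma-7b-it
# Step 4 (assumed) - Query the completions endpoint; prompt and max_tokens are placeholders
$ curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "google/gemma-7b-it", "prompt": "Rewrite the following text: ...", "max_tokens": 256}'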
v-i-s-h-w-a-s/python-code-generation-synthetic dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
This competition features two independent synthetic data challenges that you can join separately:
- The FLAT DATA Challenge
- The SEQUENTIAL DATA Challenge
For each challenge, generate a dataset with the same size and structure as the original, capturing its statistical patterns — but without being significantly closer to the (released) original samples than to the (unreleased) holdout samples.
Train a generative model that generalizes well, using any open-source tools (Synthetic Data SDK, synthcity, reprosyn, etc.) or your own solution; a naive baseline sketch follows the data specifications below. Submissions must be fully open-source, reproducible, and runnable within 6 hours on a standard machine.
Flat Data
- 100,000 records
- 80 data columns: 60 numeric, 20 categorical

Sequential Data
- 20,000 groups
- each group contains 5-10 records
- 10 data columns: 7 numeric, 3 categorical
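As a minimal illustration of the task shape (a naive marginal-sampling baseline, not a competitive entry; file names are assumptions), one could start from something like:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
real = pd.read_csv("flat-training.csv")  # assumed name of the released flat-data file

synthetic = pd.DataFrame(index=range(len(real)))
for col in real.columns:
    if real[col].dtype.kind in "if":
        # Numeric column: resample observed values with a little jitter
        vals = real[col].dropna().to_numpy()
        synthetic[col] = rng.choice(vals, size=len(real)) + rng.normal(0.0, 0.05 * vals.std(), len(real))
    else:
        # Categorical column: sample from the empirical frequency table
        freqs = real[col].value_counts(normalize=True)
        synthetic[col] = rng.choice(freqs.index.to_numpy(), size=len(real), p=freqs.to_numpy())

synthetic.to_csv("submission.csv", index=False)

Note that this ignores cross-column and cross-record structure, which is exactly what the challenge scores; a real entry would train a proper generative model (e.g. via synthcity) instead.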
If you use this dataset in your research, please cite:
@dataset{mostlyaiprize,
author = {MOSTLY AI},
title = {MOSTLY AI Prize Dataset},
year = {2025},
url = {https://www.mostlyaiprize.com/},
}
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
About the Dataset
This is a simulated credit card transaction dataset containing legitimate and fraudulent transactions spanning 1st Jan 2019 - 31st Dec 2020. It covers credit cards of 1000 customers transacting with a pool of 800 merchants.
Source of Simulation
This was generated using the Sparkov Data Generation tool (GitHub) created by Brandon Harris. The simulation was run for the duration 1 Jan 2019 to 31 Dec 2020. The files were combined and converted into a standard format.
Information about the Simulator
I do not own the simulator. I used the one built by Brandon Harris, and to understand how it works I went through a few portions of the code. This is what I understood from what I read:
The simulator has pre-defined lists of merchants, customers, and transaction categories. Using a Python library called "faker", and the numbers of customers and merchants that you specify for the simulation, an intermediate list is created.
After this, depending on the profile you choose, e.g. "adults_2550_female_rural.json" (which means simulating adult females in the age range 25-50 who are from rural areas), the transactions are created. For this profile (see "Sparkov | Github | adults_2550_female_rural.json"), parameter value ranges are defined in terms of minimum and maximum transactions per day, the distribution of transactions across days of the week, and normal-distribution properties (mean, standard deviation) for amounts in various categories. Using these measures of distribution, the transactions are generated using faker.
What I did was generate transactions across all profiles and then merged them together to create a more realistic representation of simulated transactions.
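As a hedged illustration of the mechanism described above (a sketch under assumed parameter values, not Sparkov's actual code):

import numpy as np
from faker import Faker

fake = Faker()
rng = np.random.default_rng()

# Illustrative per-category amount distributions (mean, std), as a profile might define them
categories = {"grocery": (80.0, 25.0), "gas_transport": (55.0, 15.0), "entertainment": (40.0, 30.0)}

transactions = []
for _ in range(10):  # e.g. a handful of transactions for one simulated customer
    cat = rng.choice(list(categories))
    mean, std = categories[cat]
    transactions.append({
        "timestamp": fake.date_time_between(start_date="-1y", end_date="now"),
        "merchant": fake.company(),
        "category": cat,
        "amount": round(max(1.0, rng.normal(mean, std)), 2),  # clamp to a positive amount
    })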
Acknowledgements - Brandon Harris for his amazing work in creating this easy-to-use simulation tool for creating fraud transaction datasets.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Synthetic Healthcare Dataset
Overview
This dataset is a synthetic healthcare dataset created for use in data analysis. It mimics real-world patient healthcare data and is intended for applications within the healthcare industry.
Data Generation
The data has been generated using the Faker Python library, which produces randomized and synthetic records that resemble real-world data patterns. It includes various healthcare-related fields such as patient… See the full description on the dataset page: https://huggingface.co/datasets/vrajakishore/dummy_health_data.
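As a hedged sketch of what Faker-based generation of such records can look like (the field names are assumptions based on the description, not the dataset's actual schema):

import random
from faker import Faker

fake = Faker()

# A few illustrative patient records with healthcare-flavoured fields
patients = [{
    "name": fake.name(),
    "date_of_birth": fake.date_of_birth(minimum_age=0, maximum_age=95).isoformat(),
    "blood_type": random.choice(["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]),
    "admission_date": fake.date_this_decade().isoformat(),
    "hospital": fake.company(),
} for _ in range(5)]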
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
SPIDER - Synthetic Person Information Dataset for Entity Resolution offers researchers ready-to-use data for benchmarking duplicate-detection and entity-resolution algorithms. The dataset focuses on the person-level fields that are typical in customer data. As real-world person-level data is hard to source due to Personally Identifiable Information (PII) concerns, very few synthetic datasets are publicly available, and the existing ones suffer from small volume and missing core person-level fields. SPIDER addresses these challenges by focusing on core person-level attributes: first/last name, email, phone, address, and dob. Using the Python Faker library, 40,000 unique synthetic person records are created. An additional 10,000 duplicate records are generated from the base records using 7 real-world transformation rules. Each duplicate record is linked to its original base record and to the rule used to generate it through the is_duplicate_of and duplication_rule fields.

Duplicate Rules
- Duplicate record with a variation in email address
- Duplicate record with a variation in email address
- Duplicate record with last name variation
- Duplicate record with first name variation
- Duplicate record with a nickname
- Duplicate record with near exact spelling
- Duplicate record with only same email and name

Output Format
The dataset is presented in both JSON and CSV formats for use in data processing and machine learning tools.

Data Regeneration
The project includes the Python script used for generating the 50,000 person records. The script can be extended to cover additional duplicate rules, fuzzy names, geographical name variations, and volume adjustments.

Files Included
- spider_dataset_20250714_035016.csv
- spider_dataset_20250714_035016.json
- spider_readme.md
- DataDescriptions
- pythoncodeV1.py
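As a hedged sketch of the base-plus-duplicate scheme described above (illustrative only; the transformation shown is an assumed variant of the email-variation rule, not the project's actual script):

import uuid
from faker import Faker

fake = Faker()

def base_record():
    # One synthetic base person record with SPIDER's core attributes
    return {
        "id": str(uuid.uuid4()),
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "email": fake.email(),
        "phone": fake.phone_number(),
        "address": fake.address().replace("\n", ", "),
        "dob": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
    }

def email_variation_duplicate(rec):
    # One illustrative transformation rule: vary the email's local part
    dup = dict(rec)
    local, domain = rec["email"].split("@")
    dup["email"] = f"{local}.{fake.random_int(1, 99)}@{domain}"
    dup["is_duplicate_of"] = rec["id"]
    dup["duplication_rule"] = "email_variation"
    dup["id"] = str(uuid.uuid4())
    return dup

record = base_record()
duplicate = email_variation_duplicate(record)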
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The MatSim Dataset and benchmark
Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.
MatSim: a synthetic dataset, a benchmark, and a method for computer-vision-based recognition of similarities and transitions between materials and textures, focused on identifying any material under any conditions using one or a few examples (one-shot learning).
Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering
Benchmark_MATSIM.zip: contains the benchmark made of real-world images, as described in the paper.
MatSim_object_train_split_1,2,3.zip: contain a subset of the synthetic dataset: CGI images of materials on random objects, as described in the paper.
MatSim_Vessels_Train_1,2,3.zip: contain a subset of the synthetic dataset: CGI images of materials inside transparent containers, as described in the paper.
*Note: these are subsets of the dataset; the full dataset can be found at:
https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX
or
https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF
Code:
Up-to-date code for generating the dataset, for reading and evaluation, and for the trained nets can be found at this URL: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net
Dataset Generation Scripts.zip: contains the Blender (3.1) Python scripts used for generating the dataset; this code might be old, and up-to-date code can be found at the URL above.
Net_Code_And_Trained_Model.zip: contains reference neural-net code, including loaders, trained models, and evaluator scripts that can be used to read and train with the synthetic dataset or to test the model with the benchmark. Note: the code in the ZIP file is not up to date and contains some bugs; for the latest version, see the URL above.
Further documentation can be found inside the zip files or in the paper.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Exploring the creation of a unique dataset of synthetic influencer profiles using AI technologies, including OpenAI's GPT-3.5.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Synthetic Datasets for Numeric Uncertainty Quantification

The Source of Dataset with Generation Script
We generated these synthetic datasets with the following Python script on Kaggle: https://www.kaggle.com/dipuk0506/toy-dataset-for-regression-and-uq

How to Use Datasets

Train Shallow NNs
The following notebook presents how to train shallow NNs: https://www.kaggle.com/dipuk0506/shallow-nn-on-toy-datasets
Version-N of the notebook applies a shallow NN to Data-N.

Train RVFL
The following notebook presents how to train Random Vector Functional Link (RVFL) networks: https://www.kaggle.com/dipuk0506/shallow-nn-on-toy-datasets
Version-N of the notebook applies an RVFL network to Data-N.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
- C program implementing the method described in the paper
- GCC makefile for compiling the C program
- Example data for use with the C program
- Python program for generating synthetic test data
- Instructions for use of the other files
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the detection and tracking experiments in the manuscript replicAnt - generating annotated images of animals in complex environments using Unreal Engine. Unless stated otherwise, all 3D animal models used in the synthetically generated data were created with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data was generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.
Abstract:
Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.
Benchmark data
Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).
The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm and 40 cm. All video recordings were well exposed, and captured at 23.976 fps.
Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory and field datasets, respectively: each visible individual was assigned a constant-size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender add-on aided hand-annotation: the add-on is a semi-automated multi-animal tracker which leverages Blender's internal contrast-based motion tracker, but also includes track refinement options and CSV export functionality. Comprehensive documentation of this tool, and Jupyter notebooks for track visualisation and benchmarking, are provided on the replicAnt and BlenderMotionExport GitHub repositories.
Synthetic data generation
Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.
A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.
Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).
Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varies from 10/1 to 1/100.
Funding
This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This dataset was created synthetically with the Python package faker. It is intended for practicing the deduplication of databases.
unique_data.csv is the main data frame, without duplicates; everything starts here. The other files (01_duplicate*, 02_duplicate*, etc.) hold only duplicates of unique_data.csv entries. You can mix unique_data.csv with one of the duplicate CSVs, or with parts of one, to obtain a dataset with duplicate values on which to practice your deduplication skills.
One provided augmentation replaces a random fraction (50%) of cells in the data frame with np.nan; the columns ['company', 'name', 'uuid4'] are excluded from this augmentation. A sketch of this idea follows below.
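A minimal sketch of how such an augmentation could be implemented with pandas and NumPy (illustrative, not the dataset's actual generation code):

import numpy as np
import pandas as pd

def mask_random_cells(df, frac=0.5, exclude=("company", "name", "uuid4")):
    """Replace a random fraction of cells with np.nan, skipping excluded columns."""
    out = df.copy()
    for col in out.columns:
        if col in exclude:
            continue  # these columns stay fully populated
        mask = np.random.rand(len(out)) < frac
        out.loc[mask, col] = np.nan
    return out

# Usage: augmented = mask_random_cells(pd.read_csv("unique_data.csv"))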
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Glaive-code-assistant
Glaive-code-assistant is a dataset of ~140k code problems and solutions generated using Glaive's synthetic data generation platform. The data is intended to make models act as code assistants, and is therefore structured in a QA format where the questions are worded similarly to how real users ask code-related questions. Roughly 60% of the samples are Python. To report any problems or suggestions with the data, join the Glaive Discord.
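For reference, a minimal way to pull such a dataset with the Hugging Face datasets library; the dataset id and split name below are assumptions, not stated above:

from datasets import load_dataset

# Dataset id is an assumption; check the Glaive page on the Hugging Face Hub
ds = load_dataset("glaiveai/glaive-code-assistant", split="train")
print(ds[0])  # one QA-style record: a user-worded question and its solution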
https://creativecommons.org/publicdomain/zero/1.0/
Data collection is perhaps the most crucial part of any machine learning model: without it being done properly, not enough information is present for the model to learn from the patterns leading to one output or another. Data collection is however a very complex endeavor, time-consuming due to the volume of data that needs to be acquired and annotated. Annotation is an especially problematic step, due to its difficulty, length, and vulnerability to human error and inaccuracies when annotating complex data.
With high processing power becoming ever more accessible, synthetic dataset generation is becoming a viable option when looking to generate large volumes of accurately annotated data. With the help of photorealistic renderers, it is for example possible now to generate immense amounts of data, annotated with pixel-perfect precision and whose content is virtually indistinguishable from real-world pictures.
As an exercise in synthetic dataset generation, the data offered here was generated using the Python API of Blender, with the images rendered through the Cycles ray-tracing engine. It represents plausible pictures of a chessboard and its pieces. The goal is, from those pictures and their annotations, to build a model capable of recognizing the pieces, as well as their positions on the board.
The dataset contains a large number of synthetic, randomly generated images representing pictures of a chessboard, taken at an angle overlooking the board and its pieces. Each image is associated with a .json file containing its annotations. The naming convention is that each render is associated with a number X, and the image and annotations associated with that render are named X.jpg and X.json, respectively.
The data has been generated using the Python scripts and .blend file present in this repository; a hedged sketch of the general approach follows below. The chessboard and piece models used for those renders are not provided with the code.
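A sketch of how such renders can be scripted through Blender's Python API (bpy); this is illustrative, not the repository's actual script, and the camera jitter, resolution, paths, and annotation fields are assumptions:

import json
import random
import bpy  # only available inside Blender's bundled Python interpreter

scene = bpy.context.scene
scene.render.engine = 'CYCLES'                    # render through the Cycles engine
scene.render.resolution_x = 1024                  # output resolution (assumed values)
scene.render.resolution_y = 1024
scene.render.image_settings.file_format = 'JPEG'

for i in range(10):
    # Jitter the camera slightly so every render is unique (illustrative values)
    scene.camera.location.x += random.uniform(-0.05, 0.05)
    scene.camera.location.y += random.uniform(-0.05, 0.05)

    scene.render.filepath = f"//renders/{i}.jpg"
    bpy.ops.render.render(write_still=True)

    # Write the matching annotation file next to the render (fields are assumptions)
    with open(bpy.path.abspath(f"//renders/{i}.json"), "w") as f:
        json.dump({"render_id": i, "pieces": []}, f)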
Data characteristics:
No distinction has been hard-built between training, validation, and testing data, and is left completely up to the users. A proposed pipeline for the extraction, recognition, and placement of chess pieces is proposed in a notebook added with this dataset.
I would like to express my gratitude for the efforts of the Blender Foundation and all its participants, for their incredible open-source tool which once again has allowed me to conduct interesting projects with great ease.
Two interesting papers on the generation and use of synthetic data, which have inspired me to conduct this project :
Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt (2021). Fake It Till You Make It: Face Analysis in the Wild Using Synthetic Data Alone. https://arxiv.org/abs/2109.15102
Salehe Erfanian Ebadi, You-Cyuan Jhang, Alex Zook (2021). PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision. https://arxiv.org/abs/2112.09290
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains 36 synthetic fruit images generated using the Python PIL library. It includes three categories of fruits: Apple, Banana, and Orange, with 12 images per class. Each image has a resolution of 224×224 pixels in RGB PNG format and is properly labeled.
The dataset is primarily designed for educational and research purposes, including:
- Multi-class image classification tasks
- Introductory computer vision practice
- Demonstration of dataset creation and publishing on Mendeley Data
File Structure:
├── apple/  → 12 images
├── banana/ → 12 images
└── orange/ → 12 images
Key Features:
- 3 fruit categories (apple, banana, orange)
- 36 images in total
- 224×224 pixels, RGB, PNG format
- Synthetic illustrations (not real photographs)
- Suitable for classification tasks, teaching, and dataset publishing demonstrations
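A minimal sketch of how such synthetic fruit illustrations can be drawn with PIL (an assumption about the method, not the dataset's actual script):

import os
from PIL import Image, ImageDraw

os.makedirs("apple", exist_ok=True)

# Draw a simple 224x224 synthetic "apple": a red disc with a short stem
img = Image.new("RGB", (224, 224), "white")
draw = ImageDraw.Draw(img)
draw.ellipse((48, 72, 176, 200), fill=(200, 30, 30))   # fruit body
draw.rectangle((108, 48, 116, 80), fill=(90, 60, 20))  # stem
img.save("apple/apple_01.png")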
... License: CC BY 4.0
Keywords: Fruits, Image Classification, Computer Vision, Synthetic Dataset, Machine Learning
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
These data files contain the source code for dataset creation & model learning (Joint-Space-SCA.zip) and the collected synthetic dataset of free & collided postures for the humanoid robot iCub (raw_binary_data.zip). Follow the Readme.MD files to launch the code if needed.
Corresponding Git repo: https://github.com/epfl-lasa/Joint-Space-SCA
Augmented Texas 7000-bus synthetic grid
An augmented version of the synthetic Texas 7k dataset published by Texas A&M University. The system has been populated with high-resolution distributed photovoltaic (PV) generation, comprising 4,499 PV plants of varying sizes with associated time series for 1 year of operation. This high-resolution dataset was produced from publicly available data and is free of CEII. Details on the procedure followed to generate the PV dataset can be found in the Open COG Grid Project Year 1 Report (Chapter 6). The technical data of the system is provided using the (open) CTM specification for easy accessibility from Python without additional packages (the data can be loaded as a dictionary). The time series for demand and PV production are provided as an HDF5 file, also loadable with standard open-source tools. We additionally provide example scripts for parsing the data in Python. Prepared by LLNL under Contract DE-AC52-07NA27344. LLNL control number: LLNL-DATA-2001833.
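A hedged sketch of loading both artifacts (the filenames and HDF5 layout are assumptions; the CTM data is JSON loadable as a dictionary and the time series are HDF5, per the description above):

import json
import h5py  # pip install h5py

# Load the CTM network description as a plain Python dictionary (assumed filename)
with open("texas7k_ctm.json") as f:
    grid = json.load(f)

# Open the demand and PV time-series HDF5 file (assumed filename)
with h5py.File("texas7k_timeseries.h5", "r") as ts:
    print(list(ts.keys()))  # inspect the available datasets first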
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
3D-DST-models
As part of our data release for 3D-DST, we present aligned CAD models for all 1000 classes in ImageNet-1k. See wufeim/DST3D for synthetic data generation with 3D annotations using the CAD models provided here. Besides the .csv file visualized in the dataset viewer above, we also provide a Python script (models_3d_dst.py) to help integrate with other Python modules.
Fields
For each CAD model, there are seven fields:
synset: synset associated with each… See the full description on the dataset page: https://huggingface.co/datasets/ccvl/3D-DST-models.
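A minimal sketch of loading the table with the Hugging Face datasets library (the split name is an assumption):

from datasets import load_dataset

ds = load_dataset("ccvl/3D-DST-models", split="train")  # repo id from the URL above
print(ds[0])  # e.g. inspect the synset and the other fields of the first CAD model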