Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the necessary data and Python code to replicate the experiments and generate the figures presented in our manuscript: "Supporting data and code: Beyond Economic Dispatch: Modeling Renewable Purchase Agreements in Production Cost Models".
Contents:
- pownet.zip: Contains PowNet version 3.2, the specific version of the simulation software used in this study.
- inputs.zip: Contains essential modeling inputs required by PowNet for the experiments, including network data and pre-generated synthetic load and solar time series.
- scripts.zip: Contains the Python scripts used for installing PowNet, optionally regenerating synthetic data, running simulation experiments, processing results, and generating figures.
- thai_data.zip (Reference Only): Contains raw data related to the 2023 Thai power system. This data served as a reference during the creation of the PowNet inputs for this study but is not required to run the replication experiments themselves. Code to process the raw data is also provided.
System Requirements:
- pip package manager
Setup Instructions:
Download and Unzip Core Files: Download pownet.zip, inputs.zip, scripts.zip, and thai_data.zip. Extract their contents into the same parent folder. Your directory structure should look like this:
Parent_Folder/
├── pownet/ # from pownet.zip
├── inputs/ # from inputs.zip
├── scripts/ # from scripts.zip
├── thai_data/ # from thai_data.zip
├── figures/ # Created by scripts later
└── outputs/ # Created by scripts later
Install PowNet: Navigate to the pownet directory that you just extracted and install the package in editable mode:
cd path/to/Parent_Folder/pownet
pip install -e .
Workflow and Usage:
Note: All subsequent Python script commands should be run from the scripts directory. Navigate to it first:
cd path/to/Parent_Folder/scripts
1. Generate Synthetic Time Series (Optional):
Pre-generated synthetic load and solar time series are already provided in the inputs directory (extracted from inputs.zip). If you wish to regenerate them:
python create_synthetic_load.py
python create_synthetic_solar.py
python eval_synthetic_load.py
python eval_synthetic_solar.py
2. Calculate Total Solar Availability:
This step operates on the inputs directory:
python process_scenario_solar.py
3. Experiment 1: Compare Strategies for Modeling Purchase Obligations:
python run_basecase.py --model_name "TH23NMT"
python run_basecase.py --model_name "TH23ZC"
python run_basecase.py --model_name "TH23"
python run_min_cap.py
Note: run_min_cap.py is a separate script because this strategy requires modifying the objective function and adding constraints.
4. Experiment 2: Simulate Partial-Firm Contract Switching:
python run_scenarios.py --model_name "TH23"
python run_scenarios.py --model_name "TH23ESB"
5. Visualize Results:
python run_viz.py
The generated figures are saved in the figures directory within the Parent_Folder.
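The command sequence above can also be chained from a single Python driver. The following is only a sketch: the script names and `--model_name` values are copied from the steps above, while the `run_workflow` helper, its `dry_run` flag, and the decision to skip the optional step 1 are assumptions made for illustration and are not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Commands in the order listed in the workflow (optional step 1 is omitted here).
WORKFLOW = [
    ["process_scenario_solar.py"],
    ["run_basecase.py", "--model_name", "TH23NMT"],
    ["run_basecase.py", "--model_name", "TH23ZC"],
    ["run_basecase.py", "--model_name", "TH23"],
    ["run_min_cap.py"],
    ["run_scenarios.py", "--model_name", "TH23"],
    ["run_scenarios.py", "--model_name", "TH23ESB"],
    ["run_viz.py"],
]


def run_workflow(scripts_dir, dry_run=True):
    """Print each command; execute it from scripts_dir unless dry_run is True."""
    scripts_dir = Path(scripts_dir)
    commands = [[sys.executable, *step] for step in WORKFLOW]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the chain as soon as one script fails
            subprocess.run(cmd, cwd=scripts_dir, check=True)
    return commands


# Hypothetical helper call: the path is the placeholder used in the instructions
commands = run_workflow("path/to/Parent_Folder/scripts", dry_run=True)
```

With `dry_run=True` the helper only lists the commands, which is a cheap way to check the order before starting the simulations.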
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
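Since all values are multiples of the 100 MW base unit, converting a table to physical units is a single multiplication. The sketch below uses a small random table as a stand-in for a real load file; the column names `"1"` and `"2"` are hypothetical element identifiers, not taken from the dataset.

```python
import numpy as np
import pandas as pd

BASE_MW = 100.0            # base unit stated in the description
HOURS_PER_YEAR = 24 * 364  # 8736 hourly steps per table

# Toy stand-in for a load table in per-unit (hypothetical column names)
rng = np.random.default_rng(0)
loads_pu = pd.DataFrame(
    rng.uniform(0.0, 2.0, size=(HOURS_PER_YEAR, 2)), columns=["1", "2"]
)

# Per-unit to megawatts
loads_mw = loads_pu * BASE_MW

# Steps are hourly, so summing MW over all rows gives energy in MWh
annual_energy_mwh = loads_mw.sum()
```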
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
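One way to avoid mixing labels is to build the three file names from a single (year, index) pair. This is a minimal sketch that follows the naming pattern of the examples above; the `table_names` helper itself is hypothetical.

```python
def table_names(year, index):
    """Return matching loads/gens/lines file names for one synthetic year."""
    if year not in range(2016, 2021) or index not in range(1, 5):
        raise ValueError("year must be 2016-2020 and index must be 1-4")
    label = f"{year}_{index}"
    return {kind: f"{kind}_{label}.csv" for kind in ("loads", "gens", "lines")}


names = table_names(2020, 1)

# All 20 labels of the dataset: 5 reference years times 4 indices
all_labels = [(y, i) for y in range(2016, 2021) for i in range(1, 5)]
```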
The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
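Because the yearly series are periodic, a time window that runs past the last row can be continued from the first row. The sketch below illustrates this with modular indexing on a toy table; the `periodic_window` helper is hypothetical and not part of the dataset's code.

```python
import numpy as np
import pandas as pd


def periodic_window(table, start, length):
    """Return `length` consecutive rows from `start`, wrapping around the end."""
    positions = (start + np.arange(length)) % len(table)
    return table.iloc[positions].reset_index(drop=True)


# Toy hourly table whose values equal the row number, to make the wrap visible
hourly = pd.DataFrame({"load": np.arange(24 * 364, dtype=float)})

# A 24-hour window starting 12 hours before the end of the year
window = periodic_window(hourly, start=24 * 364 - 12, length=24)
```

The first 12 rows of `window` come from the end of the table and the last 12 from its beginning.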
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPIDER (Synthetic Person Information Dataset for Entity Resolution) offers researchers ready-to-use data for benchmarking duplicate detection and entity resolution algorithms. The dataset is aimed at person-level fields that are typical in customer data. Real-world person-level data is hard to source because it contains Personally Identifiable Information (PII), and very few synthetic datasets are publicly available. The existing datasets also have limitations: their volume is small and core person-level fields are missing. SPIDER addresses these challenges by focusing on core person-level attributes: first/last name, email, phone, address and dob. Using the Python Faker library, 40,000 unique synthetic person records are created. An additional 10,000 duplicate records are generated from the base records using 7 real-world transformation rules. Each duplicate record is labelled with its original base record and the rule used to generate it through the is_duplicate_of and duplication_rule fields.
Duplicate Rules
- Duplicate record with a variation in email address.
- Duplicate record with a variation in email address
- Duplicate record with last name variation
- Duplicate record with first name variation
- Duplicate record with a nickname
- Duplicate record with near exact spelling
- Duplicate record with only same email and name
Output Format
The dataset is presented in both JSON and CSV formats for use in data processing and machine learning tools.
Data Regeneration
The project includes the Python script used for generating the 50,000 person records. The script can be expanded to include additional duplicate rules, fuzzy names, geographical name variations, and volume adjustments.
Files Included
- spider_dataset_20250714_035016.csv
- spider_dataset_20250714_035016.json
- spider_readme.md
- DataDescriptions
- pythoncodeV1.py
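The is_duplicate_of labels can serve as ground truth when scoring a matcher. The following is a rough sketch on a toy table, not the dataset's own evaluation code: the `record_id` column name, the rule strings, and the `predicted` pairs are all hypothetical, and base records are assumed to have an empty is_duplicate_of value.

```python
import pandas as pd

# Toy records; `record_id` and the rule names are hypothetical placeholders
records = pd.DataFrame(
    {
        "record_id": [1, 2, 3, 4],
        "is_duplicate_of": [None, None, 1, 2],
        "duplication_rule": [None, None, "nickname", "email_variation"],
    }
)

# Ground-truth (base, duplicate) pairs derived from the labels
dupes = records.dropna(subset=["is_duplicate_of"])
truth = {
    (int(base), int(dup))
    for base, dup in zip(dupes["is_duplicate_of"], dupes["record_id"])
}

# Pairs returned by some matcher under test (made up for illustration)
predicted = {(1, 3), (2, 3)}

true_positives = len(truth & predicted)
precision = true_positives / len(predicted)
recall = true_positives / len(truth)
```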
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the detection and tracking experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data were generated with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data was generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.
Abstract:
Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.
Benchmark data
Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).
The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm and 40 cm. All video recordings were well exposed, and captured at 23.976 fps.
Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory and field datasets, respectively: each visible individual was assigned a constant-size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi-animal tracker, which leverages Blender’s internal contrast-based motion tracker, but also includes track refinement options and CSV export functionality. Comprehensive documentation of this tool and Jupyter notebooks for track visualisation and benchmarking are provided on the replicAnt and BlenderMotionExport GitHub repositories.
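The constant-size box annotation can be illustrated with a few lines of Python. This is only a sketch: the text does not specify how boxes are adjusted near the image borders, so clipping to the image bounds is an assumption here, and the `constant_size_box` helper is hypothetical rather than part of the Blender Add-on.

```python
def constant_size_box(cx, cy, box_size, img_w, img_h):
    """Return (x0, y0, x1, y1) for a square box centred on (cx, cy).

    ASSUMPTION: boxes that would extend past the image are clipped to it.
    """
    half = box_size / 2
    x0 = max(0.0, cx - half)
    y0 = max(0.0, cy - half)
    x1 = min(float(img_w), cx + half)
    y1 = min(float(img_h), cy + half)
    return x0, y0, x1, y1


# A box fully inside the image keeps its full size
inner = constant_size_box(500, 500, 100, 1024, 1024)

# A box near the left border is cut off at x = 0
edge = constant_size_box(10, 500, 100, 1024, 1024)
```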
Synthetic data generation
Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.
A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively, with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.
Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).
Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets ranged from 10/1 to 1/100.
Funding
This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Healthcare Dataset
Overview
This dataset is a synthetic healthcare dataset created for use in data analysis. It mimics real-world patient healthcare data and is intended for applications within the healthcare industry.
Data Generation
The data has been generated using the Faker Python library, which produces randomized and synthetic records that resemble real-world data patterns. It includes various healthcare-related fields such as patient… See the full description on the dataset page: https://huggingface.co/datasets/vrajakishore/dummy_health_data.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains 30,000 synthetic records of movies released between 1980 and 2020. The data includes movie titles, director names, genres, release dates, durations, and ratings.
The dataset is generated using the Faker library in Python and includes a diverse range of movie genres such as Action, Comedy, Drama, Horror, Romance, Sci-Fi, Thriller, Fantasy, Documentary, and Adventure.
This dataset is suitable for exploratory data analysis, machine learning projects, and movie trend analysis over the given period.
Attributes:
- Title: Movie title (string)
- Director: Director name (string)
- Genre: Movie genre (string)
- Release Date: Date when the movie was released (datetime)
- Duration: Length of the movie in minutes (integer)
- Rating: Rating score of the movie (1-10, float)
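A quick way to start a trend analysis is to parse the release date and group by decade. The sketch below runs on two made-up rows; it assumes the CSV column headers match the attribute names listed above, which is not confirmed by the description.

```python
import io

import pandas as pd

# Two invented rows standing in for the real file (headers are assumed)
sample_csv = io.StringIO(
    "Title,Director,Genre,Release Date,Duration,Rating\n"
    "Example One,Jane Doe,Drama,1994-05-12,112,7.4\n"
    "Example Two,John Roe,Sci-Fi,2011-11-03,98,6.1\n"
)

# Parse the release date as datetime while loading
movies = pd.read_csv(sample_csv, parse_dates=["Release Date"])

# Derive the decade and average the rating per decade
movies["Decade"] = movies["Release Date"].dt.year // 10 * 10
mean_rating_by_decade = movies.groupby("Decade")["Rating"].mean()
```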
Tags: synthetic-data, movies, data-generation, python, faker, pandas
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The MatSim Dataset and benchmark
Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.
MatSim: a synthetic dataset, a benchmark, and a method for computer vision-based recognition of similarities and transitions between materials and textures focusing on identifying any material under any conditions using one or a few examples (one-shot learning).
Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering
Benchmark_MATSIM.zip: Contains the benchmark made of real-world images, as described in the paper.
MatSim_object_train_split_1,2,3.zip: Contain a subset of the synthetic dataset with CGI images of materials on random objects, as described in the paper.
MatSim_Vessels_Train_1,2,3.zip: Contain a subset of the synthetic dataset with CGI images of materials inside transparent containers, as described in the paper.
Note: these are subsets of the dataset; the full dataset can be found at:
https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX
or
https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF
Code:
Up-to-date code for generating the dataset, reading and evaluation, and trained nets can be found at this URL: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net
Dataset Generation Scripts.zip: Contains the Blender (3.1) Python scripts used for generating the dataset. This code might be old; up-to-date code can be found at the URL above.
Net_Code_And_Trained_Model.zip: Contains reference neural net code, including loaders, trained models, and evaluator scripts that can be used to read and train with the synthetic dataset or test the model with the benchmark. Note that the code in the ZIP file is not up to date and contains some bugs. For the latest version of this code, see the URL above.
Further documentation can be found inside the zip files or in the paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
- C program implementing the method described in the paper
- GCC makefile for compiling the C program
- Example data for use with the C program
- Python program for generating synthetic test data
- Instructions for use of the other files
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Reasoning with Language and Code
This synthetic dataset is a collection of 1.6 million short and clear code snippets that can help LLMs learn how to reason with both natural and programming languages. The dataset covers a wide range of programming languages, such as Python, TypeScript, JavaScript, Ruby, Julia, Rust, C++, Bash, Java, C#, and Go. It also includes two database languages: Cypher (for graph databases) and SQL (for relational databases) in order to study the… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-codes.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.
Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.
Column Name | Description |
---|---|
Patient_ID | Unique ID for each patient (e.g., PID000001) |
Age | Age of the patient (in years) |
Gender | Gender of the patient (Male/Female) |
Chest_Pain | Presence of chest pain (Yes/No) |
Cough_Severity | Severity of cough (Scale: 0-9) |
Breathlessness | Severity of breathlessness (Scale: 0-4) |
Fatigue | Level of fatigue experienced (Scale: 0-9) |
Weight_Loss | Weight loss (in kg) |
Fever | Level of fever (Mild, Moderate, High) |
Night_Sweats | Whether night sweats are present (Yes/No) |
Sputum_Production | Level of sputum production (Low, Medium, High) |
Blood_in_Sputum | Presence of blood in sputum (Yes/No) |
Smoking_History | Smoking status (Never, Former, Current) |
Previous_TB_History | Previous tuberculosis history (Yes/No) |
Class | Target variable indicating the condition (Normal, Tuberculosis) |
The dataset was generated using Python with the following libraries:
- Pandas: To create and save the dataset as a CSV file
- NumPy: To generate random numbers and simulate realistic data
- Random Seed: Set to ensure reproducibility
The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
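A generator along these lines could look like the sketch below. It is not the creator's script: the seed value, the use of `numpy.random.default_rng`, and the independent sampling of every column are assumptions, and only a few of the documented columns are reproduced. Independent sampling in particular does not create the symptom patterns the description mentions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducibility (value is arbitrary)
n = 20_000

df = pd.DataFrame(
    {
        "Patient_ID": [f"PID{i:06d}" for i in range(1, n + 1)],
        "Cough_Severity": rng.integers(0, 10, size=n),  # scale 0-9
        "Breathlessness": rng.integers(0, 5, size=n),   # scale 0-4
        "Fever": rng.choice(["Mild", "Moderate", "High"], size=n),
        # 70-30 split between the two target classes
        "Class": rng.choice(["Normal", "Tuberculosis"], size=n, p=[0.7, 0.3]),
    }
)

class_share = df["Class"].value_counts(normalize=True)
```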
This dataset is intended for:
- Machine Learning and Deep Learning classification tasks
- Data exploration and feature analysis
- Model evaluation and comparison
- Educational and research purposes
This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.
This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Glaive-code-assistant
Glaive-code-assistant is a dataset of ~140k code problems and solutions generated using Glaive’s synthetic data generation platform. The data is intended to be used to make models act as code assistants, and so the data is structured in a QA format where the questions are worded similarly to how real users ask code-related questions. About 60% of the samples are Python. To report any problems or suggestions in the data, join the Glaive Discord.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic Datasets for Numeric Uncertainty Quantification
The Source of Dataset with Generation Script
We generate these synthetic datasets with the help of the following Python script on Kaggle:
https://www.kaggle.com/dipuk0506/toy-dataset-for-regression-and-uq
How to Use Datasets
Train Shallow NNs
The following notebook presents how to train shallow NNs:
https://www.kaggle.com/dipuk0506/shallow-nn-on-toy-datasets
Version-N of the notebook applies a shallow NN to Data-N.
Train RVFL
The following notebook presents how to train Random Vector Functional Link (RVFL) networks:
https://www.kaggle.com/dipuk0506/shallow-nn-on-toy-datasets
Version-N of the notebook applies an RVFL network to Data-N.
This publication corresponds to the Common Data Model (CDM) specification of the Baseline Use Case proposed in T.5.2 (WP5) in the BY-COVID project on “SARS-CoV-2 Vaccine(s) effectiveness in preventing SARS-CoV-2 infection.”
Research Question: “How effective have the SARS-CoV-2 vaccination programmes been in preventing SARS-CoV-2 infections?”
Intervention (exposure): COVID-19 vaccine(s)
Outcome: SARS-CoV-2 infection
Subgroup analysis: Vaccination schedule (type of vaccine)
Study Design: An observational retrospective longitudinal study to assess the effectiveness of the SARS-CoV-2 vaccine in preventing SARS-CoV-2 infections using routinely collected social, health and care data from several countries. A causal model was established using Directed Acyclic Graphs (DAGs) to map domain knowledge, theories and assumptions about the causal relationship between exposure and outcome. The DAG developed for the research question of interest is provided in the causal model report listed below.
Cohort definition: All people eligible to be vaccinated (from 5 to 115 years old, included) or with at least one dose of a SARS-CoV-2 vaccine (any of the available brands), with or without a previous SARS-CoV-2 infection.
Inclusion criteria:
- All people vaccinated with at least one dose of the COVID-19 vaccine (any available brands) in an area of residence.
- Any person eligible to be vaccinated (from 5 to 115 years old, included) with a positive diagnosis (irrespective of the type of test) for SARS-CoV-2 infection (COVID-19) during the period of study.
Exclusion criteria: People not eligible for the vaccine (from 0 to 4 years old, included)
Study period: From the date of the first documented SARS-CoV-2 infection in each country to the most recent date in which data is available at the time of analysis. Roughly from 01-03-2020 to 30-06-2022, depending on the country.
Files included in this publication:
Causal model (responding to the research question)
- SARS-CoV-2 vaccine effectiveness causal model v.1.0.0 (HTML) - Interactive report showcasing the structural causal model (DAG) to answer the research question
- SARS-CoV-2 vaccine effectiveness causal model v.1.0.0 (QMD) - Quarto RMarkdown script to produce the structural causal model
Common data model specification (following the causal model)
- SARS-CoV-2 vaccine effectiveness data model specification (XLSX) - Human-readable version (Excel)
- SARS-CoV-2 vaccine effectiveness data model specification dataspice (HTML) - Human-readable version (interactive report)
- SARS-CoV-2 vaccine effectiveness data model specification dataspice (JSON) - Machine-readable version
Synthetic dataset (complying with the common data model specifications)
- SARS-CoV-2 vaccine effectiveness synthetic dataset (CSV) [UTF-8, pipe | separated, N~650,000 registries]
- SARS-CoV-2 vaccine effectiveness synthetic dataset EDA (HTML) - Interactive report of the exploratory data analysis (EDA) of the synthetic dataset
- SARS-CoV-2 vaccine effectiveness synthetic dataset EDA (JSON) - Machine-readable version of the exploratory data analysis (EDA) of the synthetic dataset
- SARS-CoV-2 vaccine effectiveness synthetic dataset generation script (IPYNB) - Jupyter notebook with Python scripting and commenting to generate the synthetic dataset
Baseline Use Case: SARS-CoV-2 vaccine effectiveness assessment - Common Data Model Specification v.1.1.0 change log
- Updated Causal model to eliminate the consideration of 'vaccination_schedule_cd' as a mediator
- Adjusted the study period to be consistent with the Study Protocol
- Updated 'sex_cd' as a required variable
- Added 'chronic_liver_disease_bl' as a comorbidity at the individual level
- Updated 'socecon_lvl_cd' at the area level as a recommended variable
- Added crosswalks for the definition of 'chronic_liver_disease_bl' in a separate sheet
- Updated the 'vaccination_schedule_cd' reference to the 'Vaccine' node in the updated DAG
- Updated the description of the 'confirmed_case_dt' and 'previous_infection_dt' variables to clarify the definition and the need for a single registry per person
The scripts (software) accompanying the data model specification are offered "as-is" without warranty and disclaiming liability for damages resulting from using it. The software is released under the CC-BY-4.0 licence, which permits you to use the content for almost any purpose (but does not grant you any trademark permissions), so long as you note the license and give credit.
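Because the synthetic dataset is pipe-separated, it needs an explicit separator when loaded with pandas. The sketch below reads a two-row stand-in from memory; the variable names are taken from the change log above, but the coded values and the reduced column set are invented for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for the pipe-separated CSV (values are made up)
sample = io.StringIO(
    "sex_cd|vaccination_schedule_cd|confirmed_case_dt|previous_infection_dt\n"
    "1|A|2021-03-14|\n"
    "2|B||2020-11-02\n"
)

# For the real file, pass its path and encoding="utf-8" instead of the buffer
cohort = pd.read_csv(
    sample,
    sep="|",
    parse_dates=["confirmed_case_dt", "previous_infection_dt"],
)

# Number of people with a confirmed infection date
n_cases = cohort["confirmed_case_dt"].notna().sum()
```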
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides a structured representation of logistics data designed to evaluate and optimize hybrid truck-and-drone delivery networks. It captures a comprehensive set of parameters essential for modeling real-world logistics scenarios, including spatial coordinates, environmental conditions, and operational constraints. The data is meticulously organized into distinct keys, each representing a critical aspect of the delivery network, enabling researchers and practitioners to conduct flexible and in-depth analyses.
The dataset is a curated subset derived from the research presented in the paper "Synthetic Dataset Generation for Optimizing Multimodal Drone Delivery Systems" by Altinses et al. (2024), published in Drones. It serves as a practical resource for studying the interplay between ground-based and aerial delivery systems, with a focus on efficiency, environmental impact, and operational feasibility.
Altinses, D., Torres, D. O. S., Gobachew, A. M., Lier, S., & Schwung, A. (2024). Synthetic Dataset Generation for Optimizing Multimodal Drone Delivery Systems. Drones (2504-446X), 8(12).
Each data file contains information on ten customer locations, specified by their x and y coordinates, which facilitate the modeling of delivery routes and service areas. Additionally, the dataset includes communication data represented as a two-dimensional grid, which can be used to assess signal strength, connectivity, or other network-related factors that influence drone operations.
A key feature of this dataset is the inclusion of wind data, structured as a two-dimensional grid with four distinct features per grid point. These features likely represent wind velocity components (such as horizontal and vertical directions) along with auxiliary parameters like turbulence intensity or wind shear, which are crucial for drone path planning and energy consumption estimation. The wind data enables researchers to simulate realistic environmental conditions and evaluate their impact on drone performance, stability, and battery life.
By integrating geospatial, environmental, and operational data, this dataset supports a wide range of applications, from route optimization and energy efficiency studies to risk assessment and resilience planning in multimodal delivery systems. Its synthetic nature ensures reproducibility while maintaining relevance to real-world logistics challenges, making it a valuable tool for advancing research in drone-assisted delivery networks.
X and Y (Grid Positions): These define where the arrows start (usually a meshgrid).
U and V (Arrow Directions):
U = Horizontal component (e.g., gradient in x).
V = Vertical component (e.g., gradient in y).
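import numpy as np

# Load the raw values from the text file
data = np.loadtxt('data.txt')
# Just for wind data: reshape into 4 features on a 16 x 16 grid
data = data.reshape((4, 16, 16))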
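The X, Y, U, V description above matches the arguments of a quiver (arrow) plot. The sketch below shows one way to go from the reshaped wind array to such a plot. It rests on an assumption that the dataset description does not confirm: that the first two of the four feature planes hold the horizontal (U) and vertical (V) wind components. Random values stand in for `np.loadtxt('data.txt')`, and the function names are illustrative.

```python
import numpy as np


def split_wind_grid(values, n_features=4, size=16):
    """Reshape raw wind values and return X, Y, U, V for an arrow plot."""
    wind = np.asarray(values, dtype=float).reshape((n_features, size, size))
    # Assumption: planes 0 and 1 are the horizontal and vertical components.
    U, V = wind[0], wind[1]
    # Grid positions where the arrows start.
    X, Y = np.meshgrid(np.arange(size), np.arange(size))
    return X, Y, U, V


def plot_wind(X, Y, U, V):
    """Draw the wind field as arrows; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.quiver(X, Y, U, V)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return fig


# Random stand-in for the values that np.loadtxt('data.txt') would return.
rng = np.random.default_rng(0)
flat = rng.normal(size=4 * 16 * 16)
X, Y, U, V = split_wind_grid(flat)
print(X.shape, U.shape)
```

Calling `plot_wind(X, Y, U, V)` then renders the field. If the feature order differs, only the two indices in `split_wind_grid` need to change.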
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides a simulated retail data warehouse designed using star schema modeling principles.
It includes both normalized and denormalized versions of a retail sales star schema, making it a valuable resource for data engineers, analysts, and data warehouse enthusiasts who want to explore real-world scenarios, performance tuning, and modeling strategies.
This dataset has two fact tables:
- fact_sales_normalized.csv – No columns from the dim_* tables have been normalised.
[Figure: Normalized Star Schema]
However, the dim_* tables stay the same for both, as follows:
- Dim_Customers.csv
- Dim_Products.csv
- Dim_Stores.csv
- Dim_Dates.csv
- Dim_Salesperson
- Dim_Campaign
Explore how denormalization affects storage, redundancy, and performance
All data is synthetic and randomly generated via Python scripts that use the Polars library for data manipulation; no real customer or business data is included.
Ideal for use with tools like SQL engines, Redshift, BigQuery, Snowflake, or even DuckDB.
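To make the normalized versus denormalized trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module. All table contents, the column names (`store_id`, `store_city`, `amount`) and the table name `fact_sales_denormalized` are hypothetical and chosen only for illustration; they are not taken from the dataset files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature tables; names, columns and values are illustrative only.
conn.executescript(
    """
    CREATE TABLE dim_stores (store_id INTEGER PRIMARY KEY, store_city TEXT);
    CREATE TABLE fact_sales_normalized (
        sale_id INTEGER PRIMARY KEY, store_id INTEGER, amount REAL);
    CREATE TABLE fact_sales_denormalized (
        sale_id INTEGER PRIMARY KEY, store_city TEXT, amount REAL);
    INSERT INTO dim_stores VALUES (1, 'Pune'), (2, 'Mumbai');
    INSERT INTO fact_sales_normalized VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 70.0);
    INSERT INTO fact_sales_denormalized VALUES
        (10, 'Pune', 100.0), (11, 'Pune', 50.0), (12, 'Mumbai', 70.0);
    """
)

# Normalized fact table: the city lives in the dimension, so a join is needed.
normalized = conn.execute(
    """
    SELECT d.store_city, SUM(f.amount)
    FROM fact_sales_normalized AS f
    JOIN dim_stores AS d ON d.store_id = f.store_id
    GROUP BY d.store_city
    ORDER BY d.store_city
    """
).fetchall()

# Denormalized fact table: the city is repeated on every row, so no join is needed.
denormalized = conn.execute(
    """
    SELECT store_city, SUM(amount)
    FROM fact_sales_denormalized
    GROUP BY store_city
    ORDER BY store_city
    """
).fetchall()

print(normalized == denormalized)
```

Both queries return the same totals; the denormalized form avoids the join at the cost of storing the city name redundantly on each fact row.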
Shrinivas Vishnupurikar, Data Engineer @Velotio Technologies.
Dataset
The cf-llm-finetune uses a synthetic parallel dataset built from the Codeforces submissions and problems. C++ ICPC-style solutions are filtered, cleaned, and paired with problem statements to generate Python translations using GPT-4.1, creating a fine-tuning dataset for code translation. The final dataset consists of C++ solutions from 2,000 unique problems, and synthetic Python answers, split into train (1,400), validation (300), and test (300) sets. For details on dataset… See the full description on the dataset page: https://huggingface.co/datasets/demoversion/cf-cpp-to-python-code-generation.
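The 1,400 / 300 / 300 split sizes above can be reproduced in a few lines. The sketch below assumes a seeded random shuffle over problem identifiers; the description does not say how the actual split was made, and the function name and identifiers here are hypothetical.

```python
import random


def split_problem_ids(problem_ids, n_train=1400, n_val=300, n_test=300, seed=0):
    """Shuffle problem ids with a fixed seed and cut them into three disjoint sets."""
    ids = sorted(problem_ids)
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of problems")
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for the 2,000 unique problems.
train, val, test = split_problem_ids(f"problem-{i}" for i in range(2000))
print(len(train), len(val), len(test))
```

Splitting by problem rather than by individual solution keeps every solution to a given problem inside a single split.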
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Augmented Texas 7000-bus synthetic grid
Augmented version of the synthetic Texas 7k dataset published by Texas A&M University. The system has been populated with high-resolution distributed photovoltaic (PV) generation, comprising 4,499 PV plants of varying sizes with associated time series for 1 year of operation. This high-resolution dataset was produced from publicly available data and is free of CEII. Details on the procedure followed to generate the PV dataset can be found in the Open COG Grid Project Year 1 Report (Chapter 6). The technical data of the system is provided using the (open) CTM specification for easy accessibility from Python without additional packages (data can be loaded as a dictionary). The time series for demand and PV production are provided as an HDF5 file, also loadable with standard open-source tools. We additionally provide example scripts for parsing the data in Python. Prepared by LLNL under Contract DE-AC52-07NA27344. LLNL control number: LLNL-DATA-2001833.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Developments in Artificial Intelligence (AI) have had an enormous impact on scientific research in recent years. Yet, relatively few robust methods have been reported in the field of structure-based drug discovery. To train AI models to abstract from structural data, highly curated and precise biomolecule-ligand interaction datasets are urgently needed. We present MISATO, a curated dataset of almost 20,000 experimental structures of protein-ligand complexes, associated molecular dynamics traces, and electronic properties. Semi-empirical quantum mechanics was used to systematically refine protonation states of proteins and small molecule ligands. Molecular dynamics traces for protein-ligand complexes were obtained in explicit water. The dataset is made readily available to the scientific community via simple Python data loaders. AI baseline models are provided for dynamical and electronic properties. This highly curated dataset is expected to enable the next generation of AI models for structure-based drug discovery. Our vision is to make MISATO the first step of a vibrant community project for the development of powerful AI-based drug discovery tools.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Multi-IaC-Eval
Multi-IaC-Eval is a novel benchmark dataset for evaluating LLM-based IaC generation and mutation across AWS CloudFormation, Terraform, and Cloud Development Kit (CDK) formats. The dataset consists of triplets containing initial IaC templates, natural language modification requests, and corresponding updated templates, created through a synthetic data generation pipeline with rigorous validation. CloudFormation: 263 Terraform: 446 CDK (Python): 64 CDK… See the full description on the dataset page: https://huggingface.co/datasets/samdavidson/Multi-IaC-Eval.
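The triplet structure described above can be pictured as a small record type. This is only a sketch: the class name, field names and example contents are hypothetical and are not the dataset's actual column names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IaCTriplet:
    """One benchmark item; all field names here are hypothetical."""
    iac_format: str        # e.g. "cloudformation", "terraform" or "cdk-python"
    initial_template: str  # IaC template before the change
    request: str           # natural language modification request
    updated_template: str  # IaC template after the change


def count_by_format(triplets):
    """Count how many triplets belong to each IaC format."""
    counts = {}
    for triplet in triplets:
        counts[triplet.iac_format] = counts.get(triplet.iac_format, 0) + 1
    return counts


# Made-up placeholder items, used only to exercise the helper.
examples = [
    IaCTriplet("terraform", "<initial>", "Enable versioning on the bucket", "<updated>"),
    IaCTriplet("terraform", "<initial>", "Add a cost-center tag", "<updated>"),
    IaCTriplet("cloudformation", "<initial>", "Increase the instance size", "<updated>"),
]
counts = count_by_format(examples)
print(counts)
```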
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the necessary data and Python code to replicate the experiments and generate the figures presented in our manuscript: "Supporting data and code: Beyond Economic Dispatch: Modeling Renewable Purchase Agreements in Production Cost Models".
Contents:
pownet.zip: Contains PowNet version 3.2, the specific version of the simulation software used in this study.
inputs.zip: Contains essential modeling inputs required by PowNet for the experiments, including network data and pre-generated synthetic load and solar time series.
scripts.zip: Contains the Python scripts used for installing PowNet, optionally regenerating synthetic data, running simulation experiments, processing results, and generating figures.
thai_data.zip (Reference Only): Contains raw data related to the 2023 Thai power system. This data served as a reference during the creation of the PowNet inputs for this study but is not required to run the replication experiments themselves. Code to process the raw data is also provided.
System Requirements:
pip package manager
Setup Instructions:
Download and Unzip Core Files: Download pownet.zip, inputs.zip, scripts.zip, and thai_data.zip. Extract their contents into the same parent folder. Your directory structure should look like this:
Parent_Folder/
├── pownet/ # from pownet.zip
├── inputs/ # from inputs.zip
├── scripts/ # from scripts.zip
├── thai_data/ # from thai_data.zip
├── figures/ # Created by scripts later
├── outputs/ # Created by scripts later
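Before installing, it can help to confirm that the extracted folders are where the scripts expect them. The helper below is a hypothetical sketch and is not part of the provided scripts. It checks only the three folders needed for replication, since figures/ and outputs/ are created later, and it demonstrates the check on a temporary folder.

```python
import tempfile
from pathlib import Path

# Folders that must exist after unzipping; figures/ and outputs/ are created later.
REQUIRED_DIRS = ("pownet", "inputs", "scripts")


def missing_dirs(parent):
    """Return the required folder names that are not present under parent."""
    parent = Path(parent)
    return [name for name in REQUIRED_DIRS if not (parent / name).is_dir()]


# Demonstration on a temporary folder that lacks the scripts directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pownet").mkdir()
    (Path(tmp) / "inputs").mkdir()
    result = missing_dirs(tmp)

print(result)  # → ['scripts']
```

In practice, `missing_dirs("path/to/Parent_Folder")` should return an empty list before continuing.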
Install PowNet:
Navigate to the pownet directory that you just extracted:
cd path/to/Parent_Folder/pownet
pip install -e .
Workflow and Usage:
Note: All subsequent Python script commands should be run from the scripts directory. Navigate to it first:
cd path/to/Parent_Folder/scripts
1. Generate Synthetic Time Series (Optional):
Pre-generated synthetic load and solar time series are already included in the inputs directory (extracted from inputs.zip). If you wish to regenerate them:
python create_synthetic_load.py
python create_synthetic_solar.py
python eval_synthetic_load.py
python eval_synthetic_solar.py
2. Calculate Total Solar Availability:
Run the following script, which works with the data in the inputs directory:
python process_scenario_solar.py
3. Experiment 1: Compare Strategies for Modeling Purchase Obligations:
python run_basecase.py --model_name "TH23NMT"
python run_basecase.py --model_name "TH23ZC"
python run_basecase.py --model_name "TH23"
python run_min_cap.py
run_min_cap.py is a separate script because this strategy requires modifying the objective function and adding constraints.
4. Experiment 2: Simulate Partial-Firm Contract Switching:
python run_scenarios.py --model_name "TH23"
python run_scenarios.py --model_name "TH23ESB"
5. Visualize Results:
python run_viz.py
The figures are saved to the figures directory within the Parent_Folder.
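The steps above can also be chained from a single driver. The following is a hypothetical sketch, not one of the provided scripts. It uses only the script names and arguments listed above and assumes they should run in the order given, which the instructions imply but do not state. Nothing is executed until `run_all` is called.

```python
import subprocess
import sys

# Step 1 is optional: regenerate and evaluate the synthetic time series.
OPTIONAL_STEPS = [
    ["create_synthetic_load.py"],
    ["create_synthetic_solar.py"],
    ["eval_synthetic_load.py"],
    ["eval_synthetic_solar.py"],
]

# Steps 2 to 5, in the order listed above.
REQUIRED_STEPS = [
    ["process_scenario_solar.py"],
    ["run_basecase.py", "--model_name", "TH23NMT"],
    ["run_basecase.py", "--model_name", "TH23ZC"],
    ["run_basecase.py", "--model_name", "TH23"],
    ["run_min_cap.py"],
    ["run_scenarios.py", "--model_name", "TH23"],
    ["run_scenarios.py", "--model_name", "TH23ESB"],
    ["run_viz.py"],
]


def build_commands(regenerate=False):
    """Return the full command lines, using the current Python interpreter."""
    steps = (OPTIONAL_STEPS if regenerate else []) + REQUIRED_STEPS
    return [[sys.executable, *step] for step in steps]


def run_all(scripts_dir, regenerate=False):
    """Run every step from the scripts directory, stopping at the first failure."""
    for command in build_commands(regenerate):
        subprocess.run(command, cwd=scripts_dir, check=True)


commands = build_commands()
print(len(commands), "commands prepared")
```

Calling `run_all("path/to/Parent_Folder/scripts")` runs the steps in sequence; passing `regenerate=True` adds the optional synthetic-data scripts first.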