Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
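For a quick look at the network file, it can be parsed with the standard json module. The key names below ("bus", "branch", "gen") follow the usual PowerModels convention and are an assumption here, not verified against this particular file:
import json
# Load the PowerModels-style network description (assumed key names)
with open('europe_network.json') as f:
    network = json.load(f)
print(len(network['bus']), 'buses')
print(len(network['branch']), 'branches (lines and transformers)')
print(len(network['gen']), 'generators')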
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with:
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
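For example, a minimal sketch of the analogous steps for loads (the file names follow the patterns stated above):
import pandas as pd
# Read the list of Swiss loads and use it to select the corresponding columns
CH_loads = pd.read_csv('loads_by_country.csv', usecols=['CH'], dtype=str)
CH_loads_list = CH_loads.dropna().squeeze().to_list()
CH_load_series = pd.read_csv('loads_2016_1.csv', usecols=CH_loads_list)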
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
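Since all values are expressed in per-unit with a base of 100 MW (as noted above), converting any of these tables to physical units is a simple rescaling:
# Convert per-unit values to MW using the 100 MW base
hourly_loads_MW = hourly_loads * 100
daily_loads_MW = daily_loads * 100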
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists in two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
https://choosealicense.com/licenses/other/
Instella-GSM8K-synthetic
The Instella-GSM8K-synthetic dataset was used in the second stage pre-training of Instella-3B model, which was trained on top of the Instella-3B-Stage1 model. This synthetic dataset was generated using the training set of GSM8k dataset, where we first used Qwen2.5-72B-Instruct to
Abstract numerical values as function parameters and generate a Python program to solve the math question. Identify and replace numerical values in the existing question with… See the full description on the dataset page: https://huggingface.co/datasets/amd/Instella-GSM8K-synthetic.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for semantic and instance segmentation experiments in the replicAnt - generating annotated images of animals in complex environments using Unreal Engine manuscript. Unless stated otherwise, all 3D animal models used in the synthetically generated data have been generated with the open-source photogrammetry platform scAnt peerj.com/articles/11155/. All synthetic data has been generated with the associated replicAnt project available from https://github.com/evo-biomech/replicAnt.
Abstract:
Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.
Benchmark data
Two pose-estimation datasets were procured. Both datasets used first instar Sungaya inexpectata (Zompro 1996) stick insects as a model species. Recordings from an evenly lit platform served as representative of controlled laboratory conditions; recordings from a hand-held phone camera served as an approximate example of serendipitous recordings in the field.
For the platform experiments, walking S. inexpectata were recorded using a calibrated array of five FLIR Blackfly colour cameras (Blackfly S USB3, Teledyne FLIR LLC, Wilsonville, Oregon, U.S.), each equipped with 8 mm c-mount lenses (M0828-MPW3 8MM 6MP F2.8-16 C-MOUNT, CBC Co., Ltd., Tokyo, Japan). All videos were recorded at 55 fps and at the sensors’ native resolution of 2048 px by 1536 px. The cameras were synchronised for simultaneous capture from five perspectives (top, front right and left, back right and left), allowing for time-resolved, 3D reconstruction of animal pose.
The handheld footage was recorded in landscape orientation with a Huawei P20 (Huawei Technologies Co., Ltd., Shenzhen, China) in stabilised video mode: S. inexpectata were recorded walking across cluttered environments (hands, lab benches, PhD desks etc), resulting in frequent partial occlusions, magnification changes, and uneven lighting, so creating a more varied pose-estimation dataset.
Representative frames were extracted from the videos using DeepLabCut (DLC)-internal k-means clustering. 46 key points were subsequently hand-annotated in 805 frames for the platform case and 200 frames for the handheld case, using the DLC annotation GUI.
Synthetic data
We generated a synthetic dataset of 10,000 images at a resolution of 1500 by 1500 px, based on a 3D model of a first instar S. inexpectata specimen, generated with the scAnt photogrammetry workflow. Generating 10,000 samples took about three hours on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super). We applied 70% scale variation, and enforced hue, brightness, contrast, and saturation shifts, to generate 10 separate sub-datasets containing 1000 samples each, which were combined to form the full dataset.
Funding
This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.
Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.
Column Name | Description |
---|---|
Patient_ID | Unique ID for each patient (e.g., PID000001) |
Age | Age of the patient (in years) |
Gender | Gender of the patient (Male/Female) |
Chest_Pain | Presence of chest pain (Yes/No) |
Cough_Severity | Severity of cough (Scale: 0-9) |
Breathlessness | Severity of breathlessness (Scale: 0-4) |
Fatigue | Level of fatigue experienced (Scale: 0-9) |
Weight_Loss | Weight loss (in kg) |
Fever | Level of fever (Mild, Moderate, High) |
Night_Sweats | Whether night sweats are present (Yes/No) |
Sputum_Production | Level of sputum production (Low, Medium, High) |
Blood_in_Sputum | Presence of blood in sputum (Yes/No) |
Smoking_History | Smoking status (Never, Former, Current) |
Previous_TB_History | Previous tuberculosis history (Yes/No) |
Class | Target variable indicating the condition (Normal, Tuberculosis) |
The dataset was generated using Python with the following libraries:
- Pandas: To create and save the dataset as a CSV file
- NumPy: To generate random numbers and simulate realistic data
- Random Seed: Set to ensure reproducibility
The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
This dataset is intended for:
- Machine Learning and Deep Learning classification tasks
- Data exploration and feature analysis
- Model evaluation and comparison
- Educational and research purposes
This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.
This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic dataset used in "The maximum weighted submatrix coverage problem: A CP approach".
Includes both the generated datasets as a zip archive and the python script used to generate them.
Each instance is composed of two files in the form
With:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Outline
This dataset is originally created for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI)
Video data that simulates daily life actions in a virtual space from Scenario Data.
Knowledge graphs, and transcriptions of the Video Data content ("who" did what "action" with what "object," when and where, and the resulting "state" or "position" of the object).
Knowledge Graph Embedding Data are created for reasoning based on machine learning
This data is open to the public as open data
Details
Videos
mp4 format
203 action scenarios
For each scenario, there is a character rear view (file name ending in 0), an indoor camera switching view (file name ending in 1), and fixed camera views placed in each corner of the room (file names ending in 2-5). Also, for each action scenario, data was generated for a minimum of 1 to a maximum of 7 patterns with different room layouts (scenes), for a total of 1,218 videos.
Videos with slowly moving characters simulate the movements of elderly people.
Knowledge Graphs
RDF format
203 knowledge graphs corresponding to the videos
Includes schema and location supplement information
The schema is described below
SPARQL endpoints and query examples are available
Script Data
txt format
Data provided to VirtualHome2KG to generate videos and knowledge graphs
Includes the action title and a brief description in text format.
Embedding
Embedding Vectors in TransE, ComplEx, and RotatE. Created with DGL-KE (https://dglke.dgl.ai/doc/)
Embedding Vectors created with jRDF2vec (https://github.com/dwslab/jRDF2Vec).
Specification of Ontology
Please refer to the specification for descriptions of all classes, instances, and properties: https://aistairc.github.io/VirtualHome2KG/vh2kg_ontology.htm
Related Resources
KGRC4SI Final Presentations with automatic English subtitles (YouTube)
VirtualHome2KG (Software)
VirtualHome-AIST (Unity)
VirtualHome-AIST (Python API)
Visualization Tool (Software)
Script Editor (Software)
We created a dataset of stories generated by OpenAI’s gpt-4o-mini, using a Python script to construct prompts that were sent to the OpenAI API. We used Statistics Norway’s list of 252 countries, added demonyms for each country, for example Norwegian for Norway, and removed countries without demonyms, leaving us with 236 countries. Our base prompt was “Write a 1500 word potential {demonym} story”, and we generated 50 stories for each country. The scripts used to generate the data, and additional scripts for analysis, are available at the GitHub repository https://github.com/MachineVisionUiB/GPT_stories
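A minimal sketch of such a prompt-construction loop is shown below, assuming the current openai Python client; the demonym list, output file naming, and generation parameters are illustrative assumptions rather than the repository's actual code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
demonyms = ["Norwegian", "Kenyan", "Peruvian"]  # illustrative subset of the 236 demonyms

for demonym in demonyms:
    prompt = f"Write a 1500 word potential {demonym} story"
    for i in range(50):  # 50 stories per country
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        story = response.choices[0].message.content
        with open(f"{demonym}_{i}.txt", "w") as f:  # hypothetical output naming
            f.write(story)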
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition
This repository contains the data synthesis pipeline and synthetic product recognition datasets proposed in [1].
Data Synthesis Pipeline:
We provide the Blender 3.1 project files and Python source code of our data synthesis pipeline (pipeline.zip), accompanied by the FastCUT models used for synthetic-to-real domain translation (models.zip). For the synthesis of new shelf images, a product assortment list and product images must be provided in the corresponding directories products/assortment/ and products/img/. The pipeline expects product images to follow the naming convention c.png, with c corresponding to a GTIN or generic class label (e.g., 9120050882171.png). The assortment list, assortment.csv, is expected to use the sample format [c, w, d, h], with c being the class label and w, d, and h being the packaging dimensions of the given product in mm (e.g., [4004218143128, 140, 70, 160]). The assortment list to use and the number of images to generate can be specified in generateImages.py (see comments). The rendering process is initiated either by executing load.py from within Blender or by running it in a command-line terminal as a background process.
Datasets:
SG3k - Synthetic GroZi-3.2k (SG3k) dataset, consisting of 10,000 synthetic shelf images with 851,801 instances of 3,234 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.
SG3kt - Domain-translated version of SG3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.
SGI3k - Synthetic GroZi-3.2k (SGI3k) dataset, consisting of 10,000 synthetic shelf images with 838,696 instances of 1,063 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.
SGI3kt - Domain-translated version of SGI3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.
SPS8k - Synthetic Product Shelves 8k (SPS8k) dataset, comprised of 16,224 synthetic shelf images with 1,981,967 instances of 8,112 supermarket products. Instance-level bounding boxes and GTIN class labels are provided for all product instances.
SPS8kt - Domain-translated version of SPS8k, utilizing SKU110k as the target domain. Instance-level bounding boxes and GTIN class labels for all product instances.
Table 1: Dataset characteristics.
Dataset | Images | Classes | Instances | Labels | Domain translation
---|---|---|---|---|---
SG3k | 10,000 | 3,234 | 851,801 | bounding box & generic class¹ | none
SG3kt | 10,000 | 3,234 | 851,801 | bounding box & generic class¹ | GroZi-3.2k
SGI3k | 10,000 | 1,063 | 838,696 | bounding box & generic class² | none
SGI3kt | 10,000 | 1,063 | 838,696 | bounding box & generic class² | GroZi-3.2k
SPS8k | 16,224 | 8,112 | 1,981,967 | bounding box & GTIN | none
SPS8kt | 16,224 | 8,112 | 1,981,967 | bounding box & GTIN | SKU110k
Sample Format
A sample consists of an RGB image (i.png) and an accompanying label file (i.txt), which contains the labels for all product instances present in the image. Labels use the YOLO format [c, x, y, w, h].
¹SG3k and SG3kt use generic pseudo-GTIN class labels, created by combining the GroZi-3.2k food product category number i (1-27) with the product image index j (j.jpg), following the convention i0000j (e.g., 13000097).
²SGI3k and SGI3kt use the generic GroZi-3.2k class labels from https://arxiv.org/abs/2003.06800.
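For illustration, a label file in this format can be parsed with a few lines of Python; the file name is hypothetical and only standard YOLO conventions (normalized center coordinates and sizes) are assumed:
# Parse a YOLO-format label file: one "c x y w h" line per product instance
with open('0.txt') as f:  # hypothetical label file accompanying 0.png
    for line in f:
        c, x, y, w, h = line.split()
        x, y, w, h = map(float, (x, y, w, h))
        print(f"class {c}: center=({x:.3f}, {y:.3f}), size=({w:.3f}, {h:.3f})")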
Download and Use
This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].
[1] Strohmayer, Julian, and Martin Kampel. "Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition." International Conference on Computer Analysis of Images and Patterns. Cham: Springer Nature Switzerland, 2023.
BibTeX citation:
@inproceedings{strohmayer2023domain, title={Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition}, author={Strohmayer, Julian and Kampel, Martin}, booktitle={International Conference on Computer Analysis of Images and Patterns}, pages={239--250}, year={2023}, organization={Springer} }
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic data, source code, and supplementary text for the article "Euler deconvolution of potential field data" by Leonardo Uieda, Vanderlei C. Oliveira Jr., and Valéria C. F. Barbosa. This is part of a tutorial submitted to The Leading Edge (http://library.seg.org/journal/tle). Results were generated using the open-source Python package Fatiando a Terra version 0.2 (http://www.fatiando.org). This material along with the manuscript can also be found at https://github.com/pinga-lab/paper-tle-euler-tutorial

Synthetic data and model

Examples in the tutorial use synthetic data generated with the IPython notebook create_synthetic_data.ipynb. File synthetic_data.txt has 4 columns: x (north), y (east), z (down) and the total field magnetic anomaly. x, y, and z are in meters. The total field anomaly is in nanoTesla (nT). File metadata.json contains extra information about the data, such as inclination and declination of the inducing field (in degrees), shape of the data grid (number of points in y and x, respectively), the area containing the data (W, E, S, N, in meters), and the model boundaries (W, E, S, N, top, bottom, in meters). File model.pickle is a serialized version of the model used to generate the data. It contains a list of instances of the PolygonalPrism class of Fatiando a Terra. The serialization was done using the cPickle Python module.

Reproducing the results in the tutorial

The notebook euler-deconvolution-examples.ipynb runs the Euler deconvolution on the synthetic data and generates the figures for the manuscript. It also presents a more detailed explanation of the method and more tests than went into the finished manuscript.
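A minimal sketch for loading the synthetic data described above with numpy and the standard json module (column order as stated; Fatiando a Terra is not required for this step):
import json
import numpy as np

# synthetic_data.txt: x (north), y (east), z (down), total field anomaly (nT)
x, y, z, anomaly = np.loadtxt('synthetic_data.txt', unpack=True)

# metadata.json: inclination/declination, grid shape, data area, model boundaries
with open('metadata.json') as f:
    metadata = json.load(f)
print(metadata)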
https://creativecommons.org/publicdomain/zero/1.0/
Overview:
This dataset contains 1000 rows of synthetic online retail sales data, mimicking transactions from an e-commerce platform. It includes information about customer demographics, product details, purchase history, and (optional) reviews. This dataset is suitable for a variety of data analysis, data visualization and machine learning tasks, including but not limited to: customer segmentation, product recommendation, sales forecasting, market basket analysis, and exploring general e-commerce trends. The data was generated using the Python Faker library, ensuring realistic values and distributions, while maintaining no privacy concerns as it contains no real customer information.
Data Source:
This dataset is entirely synthetic. It was generated using the Python Faker library and does not represent any real individuals or transactions.
Data Content:
Column Name | Data Type | Description |
---|---|---|
customer_id | Integer | Unique customer identifier (ranging from 10000 to 99999) |
order_date | Date | Order date (a random date within the last year) |
product_id | Integer | Product identifier (ranging from 100 to 999) |
category_id | Integer | Product category identifier (10, 20, 30, 40, or 50) |
category_name | String | Product category name (Electronics, Fashion, Home & Living, Books & Stationery, Sports & Outdoors) |
product_name | String | Product name (randomly selected from a list of products within the corresponding category) |
quantity | Integer | Quantity of the product ordered (ranging from 1 to 5) |
price | Float | Unit price of the product (ranging from 10.00 to 500.00, with two decimal places) |
payment_method | String | Payment method used (Credit Card, Bank Transfer, Cash on Delivery) |
city | String | Customer's city (generated using Faker's city() method, so the locations will depend on the Faker locale you used) |
review_score | Integer | Customer's product rating (ranging from 1 to 5, or None with a 20% probability) |
gender | String | Customer's gender (M/F, or None with a 10% probability) |
age | Integer | Customer's age (ranging from 18 to 75) |
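For example, a short pandas sketch computing revenue by category (the file name is hypothetical):
import pandas as pd

df = pd.read_csv('synthetic_online_retail.csv', parse_dates=['order_date'])  # hypothetical file name
df['revenue'] = df['quantity'] * df['price']
print(df.groupby('category_name')['revenue'].sum().sort_values(ascending=False))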
Potential Use Cases (Inspiration):
Customer Segmentation: Group customers based on demographics, purchasing behavior, and preferences.
Product Recommendation: Build a recommendation system to suggest products to customers based on their past purchases and browsing history.
Sales Forecasting: Predict future sales based on historical trends.
Market Basket Analysis: Identify products that are frequently purchased together.
Price Optimization: Analyze the relationship between price and demand.
Geographic Analysis: Explore sales patterns across different cities.
Time Series Analysis: Investigate sales trends over time.
Educational Purposes: Great for practicing data cleaning, EDA, feature engineering, and modeling.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the necessary data and Python code to replicate the experiments and generate the figures presented in our manuscript: "Supporting data and code: Beyond Economic Dispatch: Modeling Renewable Purchase Agreements in Production Cost Models".
Contents:

pownet.zip: Contains PowNet version 3.2, the specific version of the simulation software used in this study.
inputs.zip: Contains essential modeling inputs required by PowNet for the experiments, including network data and pre-generated synthetic load and solar time series.
scripts.zip: Contains the Python scripts used for installing PowNet, optionally regenerating synthetic data, running simulation experiments, processing results, and generating figures.
thai_data.zip (Reference Only): Contains raw data related to the 2023 Thai power system. This data served as a reference during the creation of the PowNet inputs for this study but is not required to run the replication experiments themselves. Code to process the raw data is also provided.

System Requirements:

Python and the pip package manager

Setup Instructions:

1. Download and Unzip Core Files: Download pownet.zip, inputs.zip, scripts.zip, and thai_data.zip. Extract their contents into the same parent folder. Your directory structure should look like this:

Parent_Folder/
├── pownet/ # from pownet.zip
├── inputs/ # from inputs.zip
├── scripts/ # from scripts.zip
├── thai_data/ # from thai_data.zip
├── figures/ # Created by scripts later
├── outputs/ # Created by scripts later

2. Install PowNet: Navigate to the pownet directory that you just extracted and install the package:

cd path/to/Parent_Folder/pownet
pip install -e .

Workflow and Usage:

Note: All subsequent Python script commands should be run from the scripts directory. Navigate to it first:

cd path/to/Parent_Folder/scripts

1. Generate Synthetic Time Series (Optional): Pre-generated synthetic load and solar time series are already provided in the inputs directory (extracted from inputs.zip). If you wish to regenerate them:

python create_synthetic_load.py
python create_synthetic_solar.py
python eval_synthetic_load.py
python eval_synthetic_solar.py

2. Calculate Total Solar Availability: Process the solar time series in the inputs directory:

python process_scenario_solar.py

3. Experiment 1: Compare Strategies for Modeling Purchase Obligations:

python run_basecase.py --model_name "TH23NMT"
python run_basecase.py --model_name "TH23ZC"
python run_basecase.py --model_name "TH23"
python run_min_cap.py

run_min_cap.py is a new script because we need to modify the objective function and add constraints.

4. Experiment 2: Simulate Partial-Firm Contract Switching:

python run_scenarios.py --model_name "TH23"
python run_scenarios.py --model_name "TH23ESB"

5. Visualize Results:

python run_viz.py

The figures are saved in the figures directory within the Parent_Folder.
https://spdx.org/licenses/etalab-2.0.html
This repository contains the data sets and Python routines to replicate results outlined in the manuscript: Towards tsunami early-warning with Distributed Acoustic Sensing: expected seafloor strains induced by tsunamis. The contents of this repository are divided into 2 zip files containing:

Repository_Part1.zip
A) Input files used to define and render the simulation with the SeisSol software package.
B) StrainModel_Fig4_5.py -- Python routine to generate Figures 3 and 4 from the seafloor strain model as described in the manuscript.
C) PREM.csv -- PREM model, auxiliary file for StrainModel_Fig4_5.py
D) Rcvr_Processing.py -- Python routine to extract and process data contained in F) through I). Generates results observed in Fig. 6
E) receiver_lines1234567.dat -- Receiver location file, auxiliary file for Rcvr_Processing.py
F) Seafloor_Array_Y_0km -- Directory containing synthetic data (SeisSol generated) from the seafloor-buried (10cm) receivers along the array Y=0 km.
G) SeaSurface_Array_Y_0km -- Directory containing synthetic data (SeisSol generated) from the receivers placed 10cm below the sea surface, along the array Y=0 km.

Repository_Part2.zip
H) Seafloor_Array_X_100km -- Directory containing synthetic data (SeisSol generated) from the seafloor-buried (10cm) receivers along the array X=100 km.
I) SeaSurface_Array_X_100km -- Directory containing synthetic data (SeisSol generated) from the receivers placed 10cm below the sea surface, along the array X=100 km.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Glaive-code-assistant
Glaive-code-assistant is a dataset of ~140k code problems and solutions generated using Glaive’s synthetic data generation platform. The data is intended to be used to make models act as code assistants, and so the data is structured in a QA format where the questions are worded similar to how real users will ask code related questions. The data has ~60% python samples. To report any problems or suggestions in the data, join the Glaive discord
Accurate and robust 6DOF (Six Degrees of Freedom) pose estimation is a critical task in various fields, including computer vision, robotics, and augmented reality. This research paper presents a novel approach to enhance the accuracy and reliability of 6DOF pose estimation by introducing a robust method for generating synthetic data and leveraging the ease of multi-class training using the generated dataset. The proposed method tackles the challenge of insufficient real-world annotated data by creating a large and diverse synthetic dataset that accurately mimics real-world scenarios. The proposed method only requires a CAD model of the object and there is no limit to the number of unique data that can be generated. Furthermore, a multi-class training strategy that harnesses the synthetic dataset's diversity is proposed and presented. This approach mitigates class imbalance issues and significantly boosts accuracy across varied object classes and poses. Experimental results underscore th...

This dataset has been synthetically generated using 3D software like Blender and APIs like BlenderProc.

# Data Repository README
This repository contains data organized into a structured format. The data consists of three main folders and two files, each serving a specific purpose. The data contains two folders - Cat and Hand.
Cat Dataset: 63492 labeled data with images, masks, and poses.
Hand Dataset: 42418 labeled data with images, masks, and poses.
Usage: The dataset is ready for use by simply extracting the contents of the zip file, whether for training in a segmentation task or a pose estimation task.
To view .npy files you will need to use Python with the numpy package installed. In Python use the following commands.
import numpy
data = numpy.load('file.npy')
print(data)
What free/open software is appropriate for viewing the .ply files?
These files can be opened using any 3D modeling software like Blender, Meshlab, etc.
Camera Matrix Intrinsics Format:
Fx  0   px
0   Fy  py
0   0   1
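For illustration, the intrinsics can be assembled into a 3x3 matrix with numpy (placeholder values, not taken from this dataset):
import numpy

Fx, Fy = 600.0, 600.0   # focal lengths in pixels (placeholders)
px, py = 320.0, 240.0   # principal point in pixels (placeholders)
K = numpy.array([[Fx, 0.0, px],
                 [0.0, Fy, py],
                 [0.0, 0.0, 1.0]])
print(K)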
Below is an overview of the data organization:
Python Plagiarism Code Dataset
Overview
This dataset contains pairs of Python code samples with varying degrees of similarity, designed for training and evaluating plagiarism detection systems. The dataset was created using Large Language Models (LLMs) to generate synthetic code variations at different transformation levels, simulating real-world plagiarism scenarios in an academic context.
Purpose
The dataset addresses the limitations of existing code… See the full description on the dataset page: https://huggingface.co/datasets/nop12/python_plagiarism_code_dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthetic Python Problems(SPP) Dataset
The dataset includes around 450k synthetic Python programming problems. Each Python problem consists of a task description, 1-3 examples, code solution and 1-3 test cases. The CodeGeeX-13B model was used to generate this dataset. A subset of the data has been verified by Python interpreter and de-duplicated. This data is SPP_30k_verified.jsonl. The dataset is in a .jsonl format (json per line). Released as part of Self-Learning to Improve Code… See the full description on the dataset page: https://huggingface.co/datasets/wuyetao/spp.
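Since the data is line-delimited JSON, the verified subset can be read with the standard json module; the field names inside each record are not assumed here:
import json

# Read the verified subset, one JSON object per line
with open('SPP_30k_verified.jsonl') as f:
    problems = [json.loads(line) for line in f]

print(len(problems))
print(problems[0].keys())  # inspect the available fields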
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
These resources comprise a large and diverse collection of multi-site, multi-modality, and multi-cancer clinical DICOM images from 538 subjects infused with synthetic PHI/PII in areas encountered by TCIA curation teams. Also provided is a TCIA-curated version of the synthetic dataset, along with mapping files for mapping identifiers between the two.
This new MIDI data resource includes DICOM datasets used in the Medical Image De-Identification Benchmark (MIDI-B) challenge at MICCAI 2024. They are accompanied by ground truth answer keys and a validation script for evaluating the effectiveness of medical image de-identification workflows. The validation script systematically assesses de-identified data against an answer key outlining appropriate actions and values for proper de-identification of medical images, promoting safer and more consistent medical image sharing.
Medical imaging research increasingly relies on large-scale data sharing. However, reliable de-identification of DICOM images still presents significant challenges due to the wide variety of DICOM header elements and pixel data where identifiable information may be embedded. To address this, we have developed an openly accessible synthetic dataset containing artificially generated protected health information (PHI) and personally identifiable information (PII).
These resources complement our earlier work (Pseudo-PHI-DICOM-data ) hosted on The Cancer Imaging Archive. As an example of its use, we also provide a version curated by The Cancer Imaging Archive (TCIA) curation team. This resource builds upon best practices emphasized by the MIDI Task Group who underscore the importance of transparency, documentation, and reproducibility in de-identification workflows, part of the themes at recent conferences (Synapse:syn53065760) and workshops (2024 MIDI-B Challenge Workshop).
This framework enables objective benchmarking of de-identification performance, promotes transparency in compliance with regulatory standards, and supports the establishment of consistent best practices for sharing clinical imaging data. We encourage the research community to use these resources to enhance and standardize their medical image de-identification workflows.
The source data were selected from imaging already hosted in de-identified form on TCIA. Imaging containing faces were excluded, and no new human studies were performed for this project.
To build the synthetic dataset, image series were selected from TCIA’s curated datasets to represent a broad range of imaging modalities (CR, CT, DX, MG, MR, PT, SR, US), manufacturers (including GE, Siemens, Varian, Confirma, Agfa, Eigen, Elekta, Hologic, KONICA MINOLTA, and others), scan parameters, and regions of the body. These were processed to inject the synthetic PHI/PII as described.
Synthetic pools of PHI, like subject and scanning institution information, were generated using the Python package Faker (https://pypi.org/project/Faker/8.10.3/). These were inserted into DICOM metadata of selected imaging files using a system of inheritable rule-based templates outlining re-identification functions for data insertion and logging for answer key creation. Text was also burned-in to the pixel data of a number of images. By systematically embedding realistic synthetic PHI into image headers and pixel data, accompanied by a detailed ground-truth answer key, our framework enables users transparency, documentation, and reproducibility in de-identification practices, aligned with the HIPAA Safe Harbor method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices.
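For illustration, a synthetic PHI pool of this kind could be drawn with Faker roughly as follows; this is a hedged sketch, not the project's actual rule-based template system:
from faker import Faker

fake = Faker()
Faker.seed(0)  # reproducible pool for illustration

# Draw a small pool of synthetic subject and institution records
phi_pool = [
    {
        "PatientName": fake.name(),
        "PatientBirthDate": fake.date_of_birth().strftime("%Y%m%d"),
        "InstitutionName": fake.company(),
        "InstitutionAddress": fake.address().replace("\n", ", "),
    }
    for _ in range(5)
]
print(phi_pool[0])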
This DICOM collection is split into two datasets, synthetic and curated. The synthetic dataset is the PHI/PII infused DICOM collection accompanied by a validation script and answer keys for testing, refining and benchmarking medical image de-identification pipelines. The curated dataset is a version of the synthetic dataset curated and de-identified by members of The Cancer Imaging Archive curation team. It can be used as a guide, an example of medical image curation best practices. For the purposes of the De-Identification challenge at MICCAI 2024, the synthetic and curated datasets each contain two subsets, a portion for Validation and the other for Testing.
To link a curated dataset to the original synthetic dataset and answer keys, a mapping between the unique identifiers (UIDs) and patient IDs must be provided in CSV format to the evaluation software. We include the mapping files associated with the TCIA-curated set as an example. Lastly, for both the Validation and Testing datasets, an answer key in sqlite.db format is provided. These components are for use with the Python validation script linked below (4). Combining these components, a user developing or evaluating de-identification methods can ensure they meet a specification for successfully de-identifying medical image data.
This publication corresponds to the Common Data Model (CDM) specification of the Baseline Use Case proposed in T.5.2 (WP5) in the BY-COVID project on “SARS-CoV-2 Vaccine(s) effectiveness in preventing SARS-CoV-2 infection.”

Research Question: “How effective have the SARS-CoV-2 vaccination programmes been in preventing SARS-CoV-2 infections?”
Intervention (exposure): COVID-19 vaccine(s)
Outcome: SARS-CoV-2 infection
Subgroup analysis: Vaccination schedule (type of vaccine)
Study Design: An observational retrospective longitudinal study to assess the effectiveness of the SARS-CoV-2 vaccine in preventing SARS-CoV-2 infections using routinely collected social, health and care data from several countries. A causal model was established using Directed Acyclic Graphs (DAGs) to map domain knowledge, theories and assumptions about the causal relationship between exposure and outcome. The DAG developed for the research question of interest is shown below.
Cohort definition: All people eligible to be vaccinated (from 5 to 115 years old, included) or with, at least, one dose of a SARS-CoV-2 vaccine (any of the available brands), having or not a previous SARS-CoV-2 infection.
Inclusion criteria: All people vaccinated with at least one dose of the COVID-19 vaccine (any available brands) in an area of residence. Any person eligible to be vaccinated (from 5 to 115 years old, included) with a positive diagnosis (irrespective of the type of test) for SARS-CoV-2 infection (COVID-19) during the period of study.
Exclusion criteria: People not eligible for the vaccine (from 0 to 4 years old, included)
Study period: From the date of the first documented SARS-CoV-2 infection in each country to the most recent date in which data is available at the time of analysis, roughly from 01-03-2020 to 30-06-2022, depending on the country.
Files included in this publication:

Causal model (responding to the research question)
- SARS-CoV-2 vaccine effectiveness causal model v.1.0.0 (HTML) - Interactive report showcasing the structural causal model (DAG) to answer the research question
- SARS-CoV-2 vaccine effectiveness causal model v.1.0.0 (QMD) - Quarto RMarkdown script to produce the structural causal model

Common data model specification (following the causal model)
- SARS-CoV-2 vaccine effectiveness data model specification (XLSX) - Human-readable version (Excel)
- SARS-CoV-2 vaccine effectiveness data model specification dataspice (HTML) - Human-readable version (interactive report)
- SARS-CoV-2 vaccine effectiveness data model specification dataspice (JSON) - Machine-readable version

Synthetic dataset (complying with the common data model specifications)
- SARS-CoV-2 vaccine effectiveness synthetic dataset (CSV) [UTF-8, pipe | separated, N~650,000 registries]
- SARS-CoV-2 vaccine effectiveness synthetic dataset EDA (HTML) - Interactive report of the exploratory data analysis (EDA) of the synthetic dataset
- SARS-CoV-2 vaccine effectiveness synthetic dataset EDA (JSON) - Machine-readable version of the exploratory data analysis (EDA) of the synthetic dataset
- SARS-CoV-2 vaccine effectiveness synthetic dataset generation script (IPYNB) - Jupyter notebook with Python scripting and commenting to generate the synthetic dataset

Baseline Use Case: SARS-CoV-2 vaccine effectiveness assessment - Common Data Model Specification v.1.1.0 change log:
- Updated the causal model to eliminate the consideration of 'vaccination_schedule_cd' as a mediator
- Adjusted the study period to be consistent with the Study Protocol
- Updated 'sex_cd' as a required variable
- Added 'chronic_liver_disease_bl' as a comorbidity at the individual level
- Updated 'socecon_lvl_cd' at the area level as a recommended variable
- Added crosswalks for the definition of 'chronic_liver_disease_bl' in a separate sheet
- Updated the 'vaccination_schedule_cd' reference to the 'Vaccine' node in the updated DAG
- Updated the description of the 'confirmed_case_dt' and 'previous_infection_dt' variables to clarify the definition and the need for a single registry per person

The scripts (software) accompanying the data model specification are offered "as-is" without warranty and disclaiming liability for damages resulting from using it. The software is released under the CC-BY-4.0 licence, which permits you to use the content for almost any purpose (but does not grant you any trademark permissions), so long as you note the license and give credit.
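Since the synthetic dataset is a pipe-separated, UTF-8 encoded CSV, it can be loaded with pandas along these lines (the file name is hypothetical):
import pandas as pd

# Pipe-separated, UTF-8 encoded synthetic dataset (~650,000 registries)
df = pd.read_csv('vaccine_effectiveness_synthetic_dataset.csv', sep='|', encoding='utf-8')
print(df.shape)
print(df.columns.tolist())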
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Synthetic Electrochemical Impedance Spectra Generator
This Python script generates synthetic EIS spectra for predefined circuits using the `impedance.models.circuits.CustomCircuit` library. It simulates realistic experimental datasets for educational purposes, incorporating random file names, missing values, and empty files.
## Features
- **Circuit Modeling**: Supports circuits like `R0-C0`, `R0-p(R1,C1)`, etc., with randomized parameters.
- **Custom Frequency Range**: Logarithmic sweep from \(10^5\) to \(10^{-2}\) Hz.
- **Realistic Data Challenges**:
- Random 3-line headers in files.
- Missing values in every other 100th file.
- Empty data in every 100th file.
## Output Format
- **Columns**: `Freq_Hz`, `Re_Z_Ohm`, `-Im_Z_Ohm`, `|Z|_Ohm`, `Phase_deg`.
- **File Naming**: Random 8-character alphanumeric strings.
Customize circuits, frequency range, and data patterns as needed.
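As a rough illustration of what one generated file contains, the sketch below computes an R0-p(R1,C1) spectrum with plain numpy and the five columns listed above; it does not reproduce the script's use of the impedance library or its randomized headers and gaps.
import numpy as np

# Example parameters for an R0-p(R1,C1) circuit (illustrative values)
R0, R1, C1 = 10.0, 100.0, 1e-5

# Logarithmic frequency sweep from 1e5 Hz down to 1e-2 Hz
freq = np.logspace(5, -2, 50)
omega = 2 * np.pi * freq

# Series resistance plus a parallel RC element
Z = R0 + 1.0 / (1.0 / R1 + 1j * omega * C1)

# Assemble the five output columns: Freq_Hz, Re_Z_Ohm, -Im_Z_Ohm, |Z|_Ohm, Phase_deg
table = np.column_stack([freq, Z.real, -Z.imag, np.abs(Z), np.degrees(np.angle(Z))])
print(table[:3])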
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Smashcima is a library and framework for synthesizing images containing handwritten music for creating synthetic training data for OMR models. It is primarily intended to be used as part of optical music recognition workflows, esp. with domain adaptation in mind. The target user is therefore a machine-learning, document processing, library sciences, or computational musicology researcher with minimal skills in python programming.
Smashcima is the only tool that simultaneously:
- synthesizes handwritten music notation,
- produces not only raster images but also segmentation masks, classification labels, bounding boxes, and more,
- synthesizes entire pages as well as individual symbols,
- synthesizes background paper textures,
- synthesizes also polyphonic and pianoform music images,
- accepts just MusicXML as input,
- is written in Python, which simplifies its adoption and extensibility.
Therefore, Smashcima brings a unique new capability for optical music recognition (OMR): synthesizing a near-realistic image of handwritten sheet music from just a MusicXML file. As opposed to notation editors, which work with a fixed set of fonts and a set of layout rules, it can adapt handwriting styles from existing OMR datasets to arbitrary music (beyond the music encoded in existing OMR datasets), and randomize layout to simulate the imprecisions of handwriting, while guaranteeing the semantic correctness of the output rendering. Crucially, the rendered image is provided also with the positions of all the visual elements of music notation, so that both object detection-based and sequence-to-sequence OMR pipelines can utilize Smashcima as a synthesizer of training data.
(In combination with the LMX canonical linearization of MusicXML, one can imagine the endless possibilities of running Smashcima on inputs from a MusicXML generator.)