59 datasets found

Z
Data Cleaning, Translation & Split of the Dataset for the Automatic...
data.niaid.nih.gov
Updated Aug 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
Explore at:
Dataset updated
Aug 8, 2022
Dataset authored and provided by
Köhler, Juliane
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.
t
FAIR Dataset for Disease Prediction in Healthcare Applications
test.researchdata.tuwien.ac.at
bin, csv, json, png
Updated Apr 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Explore at:
csv, json, bin, pngAvailable download formats
Unique identifier
https://doi.org/10.70124/5n77a-dnf02
Dataset updated
Apr 14, 2025
Dataset provided by
TU Wien
Authors
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

Context and Methodology

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

Technical Details

Structure of the Dataset:
The dataset consists of several files organized into folders by data type:

Training Data: Contains the training dataset used to train the machine learning model.

Validation Data: Used for hyperparameter tuning and model selection.

Test Data: Reserved for final model evaluation.

Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:

Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

Further Details

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Z
DustNet - structured data and Python code to reproduce the model,...
data.niaid.nih.gov
zenodo.org
Updated Jul 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Simmons, Benno I. (2024). DustNet - structured data and Python code to reproduce the model, statistical analysis and figures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10631953
Explore at:
Dataset updated
Jul 7, 2024
Dataset provided by
Augousti, Andy T.
Siegert, Stefan
Nowak, T. E.
Simmons, Benno I.
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and Python code used for AOD prediction with DustNet model - a Machine Learning/AI based forecasting.

Model input data and code

Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables* ready to reproduce the DustNet model results or for similar forecasting with Machine Learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as Jupyter Notebook ‘DustNet_model_code.ipynb’. A subfolder with normalised and split data into training/validation/testing sets is also provided with Python code for two additional ML based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.

Model output data and code

This dataset was constructed by running the ‘DustNet_model_code.ipynb’ (see above). It consists of 1095 days of forecased AOD data (2020-2022) by CAMS, DustNet model, naïve prediction (persistence) and gridded climatology. The ground truth raw AOD data form MODIS is provided for comparison and statystical analysis of predictions. It is intended for a quick reproduction of figures and statystical analysis presented in DustNet introducing paper.

*datasets are NumPy arrays (v1.23) created in Python v3.8.18.

**all ML models were created with Keras in Python v3.10.10.
Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...
data.europa.eu
zenodo.org
unknown
Updated Mar 1, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://data.europa.eu/88u/dataset/oai-zenodo-org-4571228
Explore at:
unknown(395470535)Available download formats
Dataset updated
Mar 1, 2021
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is gathered on Sep. 17th 2020. It has more than 5.4K Python repositories that are hosted on GitHub. Check out the file ManyTypes4PyDataset.spec for repositories URL and their commit SHA. The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in duplicate_files.txt file. All of its Python projects are processed in JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of JSON-formatted files is described in JSONOutput.md file. The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding set is provided in dataset_split.csv file. Notable changes to each version of the dataset are documented in CHANGELOG.md.
d
MC-LSTM papers, model runs
search.dataone.org
hydroshare.org
Updated Dec 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan Martin Frame (2023). MC-LSTM papers, model runs [Dataset]. http://doi.org/10.4211/hs.d750278db868447dbd252a8c5431affd
Explore at:
Unique identifier
https://doi.org/10.4211/hs.d750278db868447dbd252a8c5431affd
Dataset updated
Dec 30, 2023
Dataset provided by
Hydroshare
Authors
Jonathan Martin Frame
Time period covered
Jan 1, 1989 - Jan 1, 2015
Area covered

Description
Runs from two papers exploring the use of mass conserving LSTM. Model results used in the papers are 1) model_outputs_for_analysis_extreme_events_paper.tar.gz, and 2) model_outputs_for_analysis_mass_balance_paper.tar.gz.

The models here are trained/calibrated on three different time periods. Standard Time Split (time split 1): test period(1989-1999) is the same period used by previous studies which allows us to confirm that the deep learning models (LSTM andMC-LSTM) trained for this project perform as expected relative to prior work. NWM Time Split (time split 2): The second test period (1995-2014) allows us to benchmark against the NWM-Rv2, which does not provide data prior to 1995. Return period split: The third test period (based on return periods) allows us to benchmark only on water years that contain streamflow events that are larger (per basin) than anything seen in the training data (<= 5-year return periods in training and > 5-year return periods in testing).

Also included are an ensemble of model runs for LSTM, MC-LSTM for the "standard" training period and two forcing products. These files are provided in the format "

IMPORTANT NOTE: This python environment should be used to extract and load the data: https://github.com/jmframe/mclstm_2021_extrapolate/blob/main/python_environment.yml, as the pickle files serialized the data with specific versions of python libraries. Specifically, the pickle serialization was done with xarray=0.16.1.

Code to interpret these runs can be found here: https://github.com/jmframe/mclstm_2021_extrapolate https://github.com/jmframe/mclstm_2021_mass_balance

Papers are available here: https://hess.copernicus.org/preprints/hess-2021-423/
Dataset for "Predicting the growth trajectory and yield of greenhouse...
zenodo.org
explore.openaire.eu
text/x-python, zip
Updated Apr 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qi Yang; Qi Yang; Zhenong Jin; Zhenong Jin (2024). Dataset for "Predicting the growth trajectory and yield of greenhouse strawberries based on knowledge-guided computer vision" [Dataset]. http://doi.org/10.5281/zenodo.10957909
Explore at:
text/x-python, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10957909
Dataset updated
Apr 11, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Qi Yang; Qi Yang; Zhenong Jin; Zhenong Jin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overall

A strawberry dataset for the paper "Qi Yang, Licheng Liu, Junxiong Zhou, Mary Rogers, Zhenong Jin, 2024. Predicting the growth trajectory and yield of greenhouse strawberries based on knowledge-guided computer vision, Computers and Electronics in Agriculture, 220, 108911. https://doi.org/10.1016/j.compag.2024.108911" target="_blank" rel="noreferrer noopener">https://doi.org/10.1016/j.compag.2024.108911"

Plant traits measurements

The folder "measurement.zip" includes treatment-level and fruit-level ground truth data.

Treatment-level

data_dryMatter_2022.csv data_dryMatter_2023.csv data_freshMatter_2022.csv data_freshMatter_2023.csv data_fruitNumber_2022.csv data_fruitNumber_2023.csv data_plantBiomass_2022.csv data_plantBiomass_2023.csv

Fruit-level

Fruit conditions with five classes, 1-5 represent Normal, Wizened, Malformed, Wizened & Malformed, and Overripe, respectively.

data_size_freshWeight_condition_2022_0N.csv data_size_freshWeight_condition_2022_50N.csv data_size_freshWeight_condition_2022_100N.csv data_size_freshWeight_condition_2022_150N.csv

Fruit size for tagged fruits

data_taggedFruit_diameter_2022.csv data_taggedFruit_diameter_2023.csv data_taggedFruit_length_2022.csv data_taggedFruit_length_2023.csv

Fresh yield and lifespan for tagged fruits (only available in experiment 2023)

data_taggedFruit_freshMatter_2023.csv data_taggedFruit_lifespan_2023.csv

Weather data

weather_daily_2022.csv weather_daily_2023.csv

Image data with label

Object and phenology detection

The folder "strawberry_img_random.zip" contains images and the corresponding JSON labels for object and phenological stages detection.

Fruit size and decimal phenological stage

The folder "strawberry_img_tagged.zip" contains images and the corresponding JSON labels for fruit size and decimal phenological stages detection.

For example, "label": "small g, 8.84, 7.62, 0.4", This label means the fruit has an 8.84mm diameter and 7.62mm length, with the main stage being small green and the decimal stage being DS-4

Merge and split Data

A Python script, "datasetProcessing.py", can be used to merge and split the image data into training and testing set.

Pre-trained models

models.zip

Data collector: Dr. Qi Yang, University of Minnesota, USA. Email: qiyang577@gmail.com

All the files belong to Prof. Zhenong Jin, University of Minnesota, USA. Email: jinzn@umn.edu
T
corr2cause
tensorflow.org
Updated Sep 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). corr2cause [Dataset]. https://www.tensorflow.org/datasets/catalog/corr2cause
Explore at:
Dataset updated
Sep 9, 2023
Description
Corr2cause

Causal inference is one of the hallmarks of human intelligence.

Corr2cause is a large-scale dataset of more than 400K samples, on which seventeen existing LLMs are evaluated in the related paper.

Overall, Corr2cause contains 415,944 samples, with 18.57% in valid samples. The average length of the premise is 424.11 tokens, and hypothesis 10.83 tokens. The data is split into 411,452 training samples, 2,246 development and test samples, respectively. Since the main purpose of the dataset is to benchmark the performance of LLMs, the test and development sets have been prioritized to have a comprehensive coverage over all sizes of graphs.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('corr2cause', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
e
Damped pendulum for nonlinear system identification - inputs are sampled...
b2find.eudat.eu
Updated Jul 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Damped pendulum for nonlinear system identification - inputs are sampled from a multivariate-normal distribution - synthetically generated - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/a411d807-9ffa-54f8-9f12-4646791cbda4
Explore at:
Dataset updated
Jul 21, 2025
Description
Overview This dataset contains input-output data of a damped nonlinear pendulum that is actuated at the mounting point. The data was generated with statesim [1], a python package for simulating linear and nonlinear ODEs, for the system actuated pendulum. The configuration .json files for the corresponding datasets (in-distribution and out-of-distribution) can be found in the respective folders. After creating the dataset, the files are stored in the raw folder. Then, they are split into subsets for training, testing, and validation and can be found in the processed folder; details about the splitting are found in the config.json file. The dataset can be used to test system identification algorithms and methods that aim to identify nonlinear dynamics from input-output measurements. The training dataset is used to optimize the model parameters, the validation set for hyperparameter optimization, and the test set only for the final evaluation. In [2], the authors used the same underlying dynamics to create their dataset but without damping terms. Input generation Input trajectories are sampled from a multivariate-normal distribution. Noise Gaussian white noise of approximately 30dB is added at the output. Statistics The input and output size is one. In-distribution data: 2 100 000 data points Training: 10 000 trajectories of length 150 Validation: 2 000 trajectories of length 150 Test: 2 000 trajectories of length 150 Out-of-distribution data: 7 times 100 000 data points 7 different datasets were only used for testing. Each dataset contains 200 trajectories of length 500. References Frank, D. statesim [Computer software]. https://github.com/Dany-L/statesim Lu, L., Jin, P., Pang, G., Zhang, Z., & Karniadakis, G. E. (2021). Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature machine intelligence, 3(3), 218-229.
r
Data from: JSON Dataset of Simulated Building Heat Control for System of...
researchdata.se
gimi9.com
+1more
Updated Mar 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacob Nilsson (2025). JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability [Dataset]. http://doi.org/10.5878/e5hb-ne80
Explore at:
(438755370), (110041420), (156812), (5417)Available download formats
Unique identifier
https://doi.org/10.5878/e5hb-ne80
Dataset updated
Mar 21, 2025
Dataset provided by
Luleå University of Technology
Authors
Jacob Nilsson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Luleå Municipality
Description
Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see attached data documentation.

The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset, the validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON-messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data.

The simulation data is not meant to be opened and analyzed in spreadsheet software, it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.

The data file with temperatures (smhi-july-23-29-2018.csv) acts as input for the thermodynamic building simulation found on Github, where it is used to get the outside temperature and corresponding timestamps. Temperature data for Luleå Summer 2018 were downloaded from SMHI.
t
Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...
researchdata.tuwien.at
b2find.eudat.eu
html, pdf, zip
Updated Mar 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
Explore at:
html, zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.48436/m2ha4-t1v92
Dataset updated
Mar 19, 2025
Dataset provided by
TU Wien
Authors
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
How To Cite?

Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

Folder Structure

The folder named “submission” contains the following:

“pythonProject”: This folder contains all the Python files and subfolders needed for analysis.

ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

Setting Up the Environment

Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.

The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

Subfolders

1. Data_4_IJGIS

This folder contains the data used for the results reported in the paper.

Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

2. results_[DateTime] (e.g., results_20240906_15_00_13)

This folder will be generated when you run the code and will store the output of each step.

The current folder contains results created during code debugging for the submission.

When you run the code, a new folder with fresh results will be generated.

Python Files

1. helper_functions.py

Contains reusable functions used throughout the analysis.

Each function includes a description of its purpose and the input parameters required.

2. create_sanity_plots.py

Generates scatter plots like those in Figure 3 of the paper.

Although the code has been run for all 309 trials, it can be used to check the sample data provided.

Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.

Usage: Run this file to create visualizations similar to Figure 3.

3. overlapping_sliding_window_loop.py

Implements overlapping sliding window segmentation and generates plots like those in Figure 4.

Output:

Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.

Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.

A visualization of the segments, similar to Figure 4, will be automatically generated.

4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

These files compute features as explained in Tables 1 and 2 of the paper, respectively.

They process the segmented recordings generated by the overlapping_sliding_window_loop.py.

Usage: Just to know how the features are calculated, you can run this code after the segmentation with the sliding window and run these files to calculate the features from the segmented data.

5. training_prediction.py

This file contains the main machine learning analysis of the paper. This file contains all the code for the training of the model, its evaluation, and its use for the inference of the “monitoring part”. It covers the following steps:

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.

A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.

b. Training/Validation/Test Split

Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).

Make sure that you follow the instructions in the comments to the code exactly.

Output: The split data is saved as .csv files in the results folder.

c. Machine and Deep Learning Experiments

This part contains three main code blocks:

iii. One for the XGboost code with correct hyperparameter tuning:
Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically test the confidence threshold of

MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.

XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.

XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

d. Inference (Monitoring Part)

Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.

Figure 8 in the paper is generated using this part of the code.

6. sequence_analysis.py

Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.

This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

Licenses

The data is licensed under CC-BY, the code is licensed under MIT.
Rescaled CIFAR-10 dataset
zenodo.org
explore.openaire.eu
Updated Jun 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15188748
Dataset updated
Jun 27, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
Description
Motivation

The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

The Rescaled CIFAR-10 dataset was introduced in the paper:

[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

with a pre-print available at arXiv:

[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

and is therefore significantly more challenging.

Access and rights

The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

and also for this new rescaled version, using the reference [1] above.

The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset

The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64x64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

The h5 files containing the dataset

The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^k/4, with k being integers in the range [-4, 4]:

cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

Instructions for loading the data set

The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

The training dataset can be loaded in Python as:

with h5py.File(`

x_train = np.array( f["/x_train"], dtype=np.float32)
x_val = np.array( f["/x_val"], dtype=np.float32)
x_test = np.array( f["/x_test"], dtype=np.float32)
y_train = np.array( f["/y_train"], dtype=np.int32)
y_val = np.array( f["/y_val"], dtype=np.int32)
y_test = np.array( f["/y_test"], dtype=np.int32)

We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))

The test datasets can be loaded in Python as:

with h5py.File(`

x_test = np.array( f["/x_test"], dtype=np.float32)
y_test = np.array( f["/y_test"], dtype=np.int32)

The test datasets can be loaded in Matlab as:

x_test = h5read(`

The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
t
Tour Recommendation Model
test.researchdata.tuwien.at
test.researchdata.tuwien.ac.at
bin, png +1
Updated May 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
Explore at:
text/markdown, png, binAvailable download formats
Unique identifier
https://doi.org/10.70124/akpf6-8p175
Dataset updated
May 14, 2025
Dataset provided by
TU Wien
Authors
Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 28, 2025
Description
Dataset Description for Tour Recommendation Model

Context and Methodology:

Research Domain/Project:
This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

Purpose:
The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

Creation Methodology:
The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

Technical Details:

Structure of the Dataset:
The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

place_or_event_id: Unique identifier for each tourist place or event.

rating: Rating given by the user, ranging from 1 to 5.

The data is split into three subsets:

Training Set: 80% of the dataset used to train the model.

Validation Set: A small portion used for hyperparameter tuning.

Test Set: 20% used to evaluate model performance.

Folder and File Naming Conventions:
The dataset files are stored in the following structure:

user_ratings_dataset.csv: The original dataset file containing user ratings.

tour_recommendation_model.pkl: The saved model after training.

actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

Software Requirements:
To open and work with this dataset, the following software and libraries are required:

Python 3.x

Pandas for data manipulation

Scikit-learn for training and evaluating machine learning models

Matplotlib for chart generation

Joblib for saving and loading the trained model

The dataset can be opened and processed using any Python environment that supports these libraries.

Additional Resources:

The model training code, README file, and performance chart are available in the project repository.

For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

Further Details:

Dataset Reusability:
The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

Train other types of models (e.g., regression, classification).

Experiment with different features or add more metadata to enrich the dataset.

Data Integrity:
The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

Licensing:
The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.
m
Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...
data.mendeley.com
narcis.nl
Updated Jan 11, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TaeKeun Yoo (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
Explore at:
Unique identifier
https://doi.org/10.17632/ffn745r57z.2
Dataset updated
Jan 11, 2021
Authors
TaeKeun Yoo
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim MD, In Sik Lee, MD, PhD, Jin Kook Kim MD, Wakako Ando CO, Nobuyuki Shoji, MD, PhD, Tomofusa, Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

Python version:

from sklearn.model_selection import train_test_split import pandas as pd import numpy as np from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import RandomForestRegressor

connect data in your google drive

from google.colab import auth auth.authenticate_user() from google.colab import drive drive.mount('/content/gdrive')

Change the path for the custom data

In this case, we used ICL vault prediction using preop measurement

dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv') dataset.head()

optimal features (sorted by importance) :

1. ICL size 2. ICL power 3. LV 4. CLR 5. ACD 6. ATA

7. MSE 8.Age 9. Pupil size 10. WTW 11. CCT 12. ACW

y = dataset['Vault_1M'] X = dataset.drop(['Vault_1M'], axis = 1)

Split the dataset to train and test data, if necessary.

For example, we can split data to 8:2 as a simple validation test

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

In our study, we already defined the training (B&VIIT Eye Center, n=1455) and test (Kitasato University, n=290) dataset, this code was not necessary to perform our analysis.

Optimal parameter search could be performed in this section

parameters = {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 500, 'criterion': 'mae' 'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6, 'max_leaf_nodes': None}

RF_model = RandomForestRegressor(**parameters) RF_model.fit(train_X, train_y) RF_predictions = RF_model.predict(test_X) importance = RF_model.feature_importances_
CYP450 80/20 splits
figshare.com
txt
Updated Jan 19, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniel Siegle (2016). CYP450 80/20 splits [Dataset]. http://doi.org/10.6084/m9.figshare.1066108.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.1066108.v1
Dataset updated
Jan 19, 2016
Dataset provided by
Figsharehttp://figshare.com/
Authors
Daniel Siegle
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data from an NIH HTS of 17K compounds against five isozymes of cytochrome P450 screening for inhibition. The activity score is taken from the NIH assay and merged with all the 2-D descriptors from the program Molecular Operating Environment (MOE). The datasets are separated by isozyme and then balanced between actives and inactives. Finally the balanced datasets are subject to an 80/20 training/test split. Link to python script of data manipulation...
Monkeypox Skin Lesion Dataset
kaggle.com
Updated Jul 5, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TensorKitty (2022). Monkeypox Skin Lesion Dataset [Dataset]. https://www.kaggle.com/datasets/nafin59/monkeypox-skin-lesion-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 5, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
TensorKitty
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
An updated version of the MSLD dataset, MSLD v2.0 has been released after being verified by an expert dermatologist!

For details, check our GitHub repo!

Context

The recent monkeypox outbreak has become a global healthcare concern owing to its rapid spread in more than 65 countries around the globe. To obstruct its expeditious pace, early diagnosis is a must. But the confirmatory Polymerase Chain Reaction (PCR) tests and other biochemical assays are not readily available in sufficient quantities. In this scenario, computer-aided monkeypox identification from skin lesion images can be a beneficial measure. Nevertheless, so far, such datasets are not available. Hence, the "Monkeypox Skin Lesion Dataset (MSLD)" is created by collecting and processing images from different means of web-scrapping i.e., from news portals, websites and publicly accessible case reports.

The creation of "Monkeypox Image Lesion Dataset" is primarily focused on distinguishing the monkeypox cases from the similar non-monkeypox cases. Therefore, along with the 'Monkeypox' class, we included skin lesion images of 'Chickenpox' and 'Measles' because of their resemblance to the monkeypox rash and pustules in initial state in another class named 'Others' to perform binary classification.

Content

There are 3 folders in the dataset.

1) Original Images: It contains a total number of 228 images, among which 102 belongs to the 'Monkeypox' class and the remaining 126 represents the 'Others' class i.e., non-monkeypox (chickenpox and measles) cases.

2) Augmented Images: To aid the classification task, several data augmentation methods such as rotation, translation, reflection, shear, hue, saturation, contrast and brightness jitter, noise, scaling etc. have been applied using MATLAB R2020a. Although this can be readily done using ImageGenerator/other image augmentors, to ensure reproducibility of the results, the augmented images are provided in this folder. Post-augmentation, the number of images increased by approximately 14-folds. The classes 'Monkeypox' and 'Others' have 1428 and 1764 images, respectively.

3) Fold1: One of the three-fold cross validation datasets. To avoid any sort of bias in training, three-fold cross validation was performed. The original images were split into training, validation and test set(s) with the approximate proportion of 70 : 10 : 20 while maintaining patient independence. According to the commonly perceived data preparation practice, only the training and validation images were augmented while the test set contained only the original images. Users have the option of using the folds directly or using the original data and employing other algorithms to augment it.

Additionally, a CSV file is provided that has 228 rows and two columns. The table contains the list of all the ImageID(s) with their corresponding label.

Web Application

Since monkeypox is demonstrating a very rapid community transmission pattern, a consumer-level software is truly necessary to increase awareness and encourage people to take rapid action. We have developed an easy-to-use web application named Monkey Pox Detector using the open-source python streamlit framework that uses our trained model to address this issue. It makes predictions on whether or not to see a specialist along with the prediction accuracy. Future updates will benefit from the user data we continue to collect and use to improve our model. The web app has a flask core, so that it can be deployed cross-platform in the future.

Learn more at our GitHub repo!

Citation

If this dataset helped your research, please cite the following articles:

Ali, S. N., Ahmed, M. T., Paul, J., Jahan, T., Sani, S. M. Sakeef, Noor, N., & Hasan, T. (2022). Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study. arXiv preprint arXiv:2207.03342.

@article{Nafisa2022, title={Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study}, author={Ali, Shams Nafisa and Ahmed, Md. Tazuddin and Paul, Joydip and Jahan, Tasnim and Sani, S. M. Sakeef and Noor, Nawshaba and Hasan, Taufiq}, journal={arXiv preprint arXiv:2207.03342}, year={2022} }

Ali, S. N., Ahmed, M. T., Jahan, T., Paul, J., Sani, S. M. Sakeef, Noor, N., Asma, A. N., & Hasan, T. (2023). A Web-based Mpox Skin Lesion Detection System Using State-of-the-art Deep Learning Models Considering Racial Diversity. arXiv preprint arXiv:2306.14169.

@article{Nafisa2023, title={A Web-base...
S
machine learning models on the WDBC dataset
scidb.cn
Updated Apr 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.23537
Dataset updated
Apr 15, 2025
Dataset provided by
Science Data Bank
Authors
Mahdi Aghaziarati
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling. The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
e
JSON dataset för simulerad byggnadsvärmekontroll för system-av-system...
b2find.eudat.eu
Updated Apr 19, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). JSON dataset för simulerad byggnadsvärmekontroll för system-av-system interoperabilitet - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/442bc87f-092d-57d9-a2f0-ba1c7e049d36
Explore at:
Dataset updated
Apr 19, 2022
Description
Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see attached data documentation. The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset, the validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON-messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data. The simulation data is not meant to be opened and analyzed in spreadsheet software, it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/. Datasetet innehåller simulerad servicedata för system-av-system interoperabilitetsforskning. För mer information se bifogad dokumentation och den engelska katalogsidan. Data kommer i två semikolonseparerade (;) csv-filer, training.csv och test.csv. Träning/testfördelningen är inte slumpmässig; träningsdata kommer från de första 80 % av de simulerade tidsstegen och testdata är de sista 20 %. Det finns ingen specifik valideringsdatauppsättning, valideringsdatan bör istället väljas slumpmässigt från träningsdatan. Simuleringen körs i lika många tidssteg som det finns tillgängliga utetemperaturvärden. De ursprungliga SMHI-data samplar bara en gång i timmen, som linjärt interpolerar för att få ett temperaturprov var tionde sekund. Data som sparas vid varje tidssteg består av 34 JSON-meddelanden (fyra per rum och två temperaturavläsningar utifrån), 9 temperaturvärden (ett per rum och utanför), 8 börvärden och 8 ställdonutgångar. Data som är associerade med vart och ett av dessa 34 JSON-meddelanden lagras som en enda rad i tabellerna. Detta innebär att mycket data dupliceras, ett val som görs för att göra det lättare att använda datan. Simuleringsdata är inte avsedd att öppnas och analyseras i kalkylprogram, det är avsett att träna maskininlärningsmodeller. Det rekommenderas att öppna data med pandas-biblioteket för Python, tillgängligt på https://pypi.org/project/pandas/. Building temperature simulation. Simulering av byggnadstemperatur. Simulation
e
Coupled mass-spring-damper system for nonlinear system identification -...
b2find.eudat.eu
Updated Aug 15, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Coupled mass-spring-damper system for nonlinear system identification - actuated with random static inputs - synthetically generated - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/e96c92da-aa2e-517f-a915-e33031f009f2
Explore at:
Dataset updated
Aug 15, 2025
Description
Overview This dataset contains input-output data of a coupled mass-spring-damper system with a nonlinear force profile. The data was generated with statesim [1], a python package for simulating linear and nonlinear ODEs, for the system coupled-msd. The configuration .json files for the corresponding datasets (in-distribution and out-of-distribution) can be found in the respective folders. After creating the dataset, the files are stored in the raw folder. Then, they are split into subsets for training, testing, and validation and can be found in the processed folder; details about the splitting are found in the config.json file. The dataset can be used to test system identification algorithms and methods that aim to identify nonlinear dynamics from input-output measurements. The training dataset is used to optimize the model parameters, the validation set for hyperparameter optimization, and the test set only for the final evaluation. In [2], the authors use the same underlying dynamics to create their dataset. Input generation Input trajectories are piecewise constant trajectories. Noise Gaussian white noise of approximately 30dB is added at the output. Statistics The input and output size is one. In-distribution data: 1,500,000 data points Training: 120 trajectories of length 7500 Validation: 20 trajectories of length 7500 Test: 60 trajectories of length 7500 Out-of-distribution data: 10 times 3000 data points 10 different datasets were only used for testing. Each dataset contains 50 trajectories of length 6000. References Frank, D. statesim [Computer software]. https://github.com/Dany-L/statesim Revay, M., Wang, R., & Manchester, I. R. (2020). A convex parameterization of robust recurrent neural networks. IEEE Control Systems Letters, 5(4), 1363-1368.
Z
Dataset and model weights for particle-based plasma simulation using a graph...
data.niaid.nih.gov
Updated Mar 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mlinarević, Marin (2025). Dataset and model weights for particle-based plasma simulation using a graph neural network [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14941474
Explore at:
Dataset updated
Mar 4, 2025
Dataset provided by
Holt, George
Agnello, Adriano
Mlinarević, Marin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This record contains the trained graph network simulator models described in the paper "Particle-based plasma simulation using a graph neural network" [1], and the data used to train and evaluate them. The data were generated from simualtions of two counterpropagating beams of electrons in one spatial dimension using the EPOCH particle-in-cell code [2].

The provided files are:

models.zip - contains a directory for each of models A, B and C, each containing three files:

hyperparameters.json - contains the model hyperparameters and parameters used for training the model (batch size, learning rate, added noise)

model-*.pt - contains trained model weights

train_state-*.pt - contains the optimizer state, training step and training loss; it can be used to resume training

data.zip - contains the data split into training, validation and test sets, each in its own NPZ file, and JSON files "metadata_long_timestep.json" and "metadata_short_timestep.json" with information about the dataset.

The metadata JSON files contain the following information:

"bounds": the lower and upper bounds of the simulated space, in metres

"sequence_length" : the number of snapshots in each simulation

"default_connectivity_radius": the value that will be used to construct graphs unless overriden when running the code, in metres

"dim": the number of spatial dimensions

"dt": nominal time step between snapshots, in seconds

"vel_mean" and "vel_std": the mean and standard deviation of velocities computed from displacements between adjacent snapshots in the training data, used for scaling input features

"acc_mean" and "acc_std": the mean and standard deviation of acclerations computed from displacements between adjacent snapshots in the training data, used for scaling the outputs of the model

"E_mean", "E_std", "B_mean", "B_std": the mean and standard deviation of x, y and z components of the electric (E) and magnetic (B) fields in the training data, with x being the single dimension in which particles are constrained to move in the simulation; used for scaling input features

"weight_max" and "weight_min": the maximum and minimum particle weight in the training dataset

Model A was trained with the entire training dataset, using the parameters in metadata_short_timestep.json for scaling. For models B and C, every fourth snapshot from the data was taken and the rest were discarded; the corresponding metadata is contained in metadata_long_timestep.json.

Each NPZ file contains an array named "gns_data", which can be loaded with the following Python code:

import numpy as np with np.load(, allow_pickle=True) as data_file: data = data_file['gns_data']

This contains a list of dictionaries, each containing data from one simulation:

'particle_weights': array of shape (number_of_particles,)

'particle_positions': array of shape (number_of_particles, number_of_snapshots, dim), containing the position of each particle in metres

'momenta': array of shape (number_of_particles, number_of_snapshots, dim), containing the momentum of each particle in kg m/s

'grid_positions': array of shape (number_of_grid_points,), containing the positions of grid points in metres

'electric_field': array of shape (number_of_grid_points, number_of_snapshots, 3), containing the x, y and z componenets of the electric field at each grid point from each snapshot in newtons per coulomb

'magnetic_field': array of shape (number_of_grid_points, number_of_snapshots, 3), containing the x, y and z componenets of the electric field at each grid point from each snapshot in teslas

'num_left': number of particles initially in the beam travelling to the left

'num_right': number of particles initially in the beam travelling to the right

'original_index': string, unique ID for the sample

'simulation_time': array of shape (number_of_snapshots,) containing the simulated time in seconds at which each snapshot was recorded

'temp': the initial temperature of the beams in kelvins, as given to EPOCH

'drift_p': the magnitude of the initial drift momentum of the beams in kg m/s, as given to EPOCH

'dens': the initial number of electrons per metre as given to EPOCH

References

[1] M. Mlinarevic, G. K. Holt and A. Agnello, Particle-based plasma simulation using a graph neural network, 2025, arXiv:2503.00274 [physics.plasm-ph]

[2] T. D. Arber et al., Contemporary particle-in-cell approach to laser-plasma modelling, Plasma Phys. Control. Fusion 57 (2015) 113001.
o
Train and Evaluation Code, Road Classification Models and Test set of the...
explore.openaire.eu
zenodo.org
Updated Mar 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Miguel Angel Manso Callejo; Calimanut-Ionut Cira; Teresa Iturrioz; Francisco Serradilla Garcia (2024). Train and Evaluation Code, Road Classification Models and Test set of the paper "Impact of Image Resolution and Image Overlap on the Prediction Performance of Convolutional Neural Networks Trained for Road Classification" [Dataset]. http://doi.org/10.5281/zenodo.10835684
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.10835684
Dataset updated
Mar 20, 2024
Authors
Miguel Angel Manso Callejo; Calimanut-Ionut Cira; Teresa Iturrioz; Francisco Serradilla Garcia
Description
This repository contains the Python scripts built for training and evaluation of the implementation, together with the test data and the resulting road classification models corresponding to the paper "Impact of Image Resolution and Image Overlap on the Prediction Performance of Convolutional Neural Networks Trained for Road Classification". The scripts make use of the Tensorflow with Keras framework and the additional required dependencies. The training and validation set is based on the binary SROADEX dataset (https://zenodo.org/records/6482346) that was re-split into tiles that feature the image resolutions (256 x 256, 512 x 512, and 1024 x 1024 pixels) and image overlaps (0% and 12.5%) considered in this study. The data have been generated using scripts developed in Python using Open Source libraries (GDAL/OGR and MapScript) for rasterization of vector cartography that represents the axes of the different types of roads (urban, interurban and rural). This binary road data contains information from 16 full orthoimages (28.5 km * 18.5 km) with spatial resolution of 0.5 m/pixel from the insular and peninsular Spanish territory. Due to the size on disk of approximately 546 gigabytes, this training and validation data is only available upon request from the corresponding author. The test set has been generated from a novel area of 28.5 km * 18.5 km and features binary road labels. The test sets are provided in the repository for each resolution (with no overlap), so that additional DL models can be evaluated on the same data and compared with the results achieved in this study. The structure of the information shared in this repository is as follows:The scripts have been grouped by tile resolution (256, 512 and 1024). First, the test set and the evaluation script can be found. For each tile resolution, there are two subfolders (corresponding to the "no overlap" and "12.5% overlap"). In each case, the Python scripts for training the models in the three repetitions are shared, and the trained models (H5 format) are shared in compressed form. Finally, for each resolution we also share the testing dataset which consists of two folders. The material is distributed under a CC-BY 4.0 license.

Facebook

Twitter

Click to copy link

Link copied

Cite

Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841

Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft

Explore at:

Dataset updated

Aug 8, 2022

Dataset authored and provided by

Köhler, Juliane

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.

Clear search

Close search

Google apps

Main menu

Data Cleaning, Translation & Split of the Dataset for the Automatic...

FAIR Dataset for Disease Prediction in Healthcare Applications

Dataset Description

Context and Methodology

Technical Details

Further Details

DustNet - structured data and Python code to reproduce the model,...

Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

MC-LSTM papers, model runs

Dataset for "Predicting the growth trajectory and yield of greenhouse...

Overall

Plant traits measurements

Treatment-level

Fruit-level

Weather data

Image data with label

Object and phenology detection

Fruit size and decimal phenological stage

Merge and split Data

Pre-trained models

corr2cause

Corr2cause

Damped pendulum for nonlinear system identification - inputs are sampled...

Data from: JSON Dataset of Simulated Building Heat Control for System of...

Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

How To Cite?

Folder Structure

Setting Up the Environment

Subfolders

1. Data_4_IJGIS

2. results_[DateTime] (e.g., results_20240906_15_00_13)

Python Files

1. helper_functions.py

2. create_sanity_plots.py

3. overlapping_sliding_window_loop.py

4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

5. training_prediction.py

a. Data Preparation (corresponding to Section 5.1.1 of the paper)

b. Training/Validation/Test Split

c. Machine and Deep Learning Experiments

d. Inference (Monitoring Part)

6. sequence_analysis.py

Licenses

Rescaled CIFAR-10 dataset

Motivation

Access and rights

The dataset

The h5 files containing the dataset

Instructions for loading the data set

Tour Recommendation Model

Dataset Description for Tour Recommendation Model

Context and Methodology:

Technical Details:

Further Details:

Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

connect data in your google drive

Change the path for the custom data

In this case, we used ICL vault prediction using preop measurement

optimal features (sorted by importance) :

1. ICL size 2. ICL power 3. LV 4. CLR 5. ACD 6. ATA

7. MSE 8.Age 9. Pupil size 10. WTW 11. CCT 12. ACW

Split the dataset to train and test data, if necessary.

For example, we can split data to 8:2 as a simple validation test

In our study, we already defined the training (B&VIIT Eye Center, n=1455) and test (Kitasato University, n=290) dataset, this code was not necessary to perform our analysis.

Optimal parameter search could be performed in this section

CYP450 80/20 splits

Monkeypox Skin Lesion Dataset

An updated version of the MSLD dataset, MSLD v2.0 has been released after being verified by an expert dermatologist!

Context

Content

Web Application

Citation

machine learning models on the WDBC dataset

JSON dataset för simulerad byggnadsvärmekontroll för system-av-system...

Coupled mass-spring-damper system for nonlinear system identification -...

Dataset and model weights for particle-based plasma simulation using a graph...

Train and Evaluation Code, Road Classification Models and Test set of the...

Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft