Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv), contains no missing values, and includes the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, is encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data was split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) according to an encoded split flag (1 = training, 2 = testing). All supporting files, including datasets, intermediate outputs (e.g., PNGs, variograms), proxy outputs, and an executable for the confidence analysis routines, are included in the repository; the source code is hosted in a separate GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv, as well as Analysis1.csv and Analysis2.csv for performing confidence analysis. To run the executables in place of the .m scripts in MATLAB, users must install MATLAB Runtime R2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.
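For example, the label mapping and the train/test flag can be applied in Python as follows; this is a minimal sketch, and the column names "Alteration" and "Split" are assumptions based on the description above, not verified field names:

import pandas as pd

# Decode the integer-encoded alteration types (mapping taken from the description above)
ALTERATION_NAMES = {1: "AAA", 2: "IAA", 3: "PHY", 4: "PRO", 5: "PTS", 6: "UAL"}

proxies = pd.read_csv("proxies_alldata.csv")
proxies["Alteration_name"] = proxies["Alteration"].map(ALTERATION_NAMES)

# Split flag: 1 = training rows, 2 = testing rows ("Split" is a hypothetical column name)
simu_train = proxies[proxies["Split"] == 1]
simu_test = proxies[proxies["Split"] == 2]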
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repo contains a literature collection of compositional data and experimentally determined carbide volume fractions (CVF) for as-cast high chromium cast iron (HCCI) alloys as well as a machine learning (ML) model to predict CVF based on the chemical composition.
The zip file "Dataset_HCCI_CVF.zip" contains the raw data compiled from literature, as well as the train and test splits that were used for training the ML model. The raw data compilation ("20240213_HCCI CVF Composition Database_zenodo.xlsx") lists the chemical compositions and experimentally determined CVF with corresponding references. Carbon-to-Chromium ratio has been added as an additional column. Moreover, CVF has been calculated according to existing literatures formulas (six in total). The deviation (in %) from experimental CVF for each calculation is also given.
A separate list of all references that have been included in the dataset is also provided as .bib and .ris files ("References for Excel Database.zip").
The zip file "ML_model_HCCI_CVF.zip" contains the final trained ML model (MATLAB file) and the corresponding MATLAB script that can be run in order to predict the CVF based on the chemical composition ("model_inference_CVF_HCCI.m"). The script accesses the trained ML model "GPR_final_all_data.mat" that must be stored in the same location as the MATLAB script. Input of the chemical composition can be done either directly in the MATLAB script or by loading an excel or csv spreadsheet. Further details about usage of the code are also mentioned in the MATLAB script.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains EEG spectrogram images intended for alcoholism classification. It consists of 7,200 images in total, split into training and testing sets. The images are derived from EEG signals captured from 12 different channels.
Folders:
Train Folder (5,750 images)
Test Folder (1,450 images)
Data Processing
1. Initial Data: The dataset started as CSV files containing raw EEG signal data.
2. Conversion to EDF: The CSV data was converted to EDF (European Data Format) for better handling of EEG data.
3. Processing in MATLAB EEGLAB: The EDF files were processed in MATLAB using the EEGLAB toolbox to generate accurate spectrograms.
4. Final Processing: The spectrograms were further refined and converted into images using Python, ready for model training.
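As an illustration of the final step, a minimal Python sketch of spectrogram-image generation; the sampling rate (256 Hz), the single-channel file name, and the plotting parameters are assumptions for the sketch, not details taken from this dataset:

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

fs = 256  # assumed EEG sampling rate in Hz (not stated in the dataset description)
eeg = np.loadtxt("channel_01.csv", delimiter=",")  # hypothetical single-channel CSV

# Compute a spectrogram and save it as an image, roughly mirroring step 4
f, t, Sxx = signal.spectrogram(eeg, fs=fs, nperseg=fs)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="gouraud")
plt.axis("off")
plt.savefig("spectrogram_channel_01.png", bbox_inches="tight", pad_inches=0)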
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled CIFAR-10 dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] A. Perzanowski and T. Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2
and is therefore significantly more challenging.
The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:
[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.
and also for this new rescaled version, using the reference [1] above.
The dataset is made available on request. If you are interested in trying out this dataset, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. So that all test images have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with the default anti-aliasing turned on, and with bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
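For illustration, a rough Python approximation of this construction; the original pipeline used Matlab's imresize(), so the function below, its name, and the use of Pillow and numpy are assumptions for the sketch, not the authors' code:

import numpy as np
from PIL import Image

def rescale_cifar_image(img, scale, out_size=64):
    # img: 32x32x3 uint8 CIFAR-10 image; scale in [0.5, 2], so the result fits in 64x64
    h, w = img.shape[:2]
    new_w, new_h = round(w * scale), round(h * scale)
    resized = Image.fromarray(img).resize((new_w, new_h), Image.BICUBIC)
    resized = np.clip(np.asarray(resized, dtype=np.float32), 0, 255)  # remove interpolation overshoot
    # mirror extension (symmetric padding) out to out_size x out_size
    pad_h, pad_w = out_size - new_h, out_size - new_w
    top, left = pad_h // 2, pad_w // 2
    padded = np.pad(resized, ((top, pad_h - top), (left, pad_w - left), (0, 0)), mode="symmetric")
    return padded.astype(np.uint8)

Note that Pillow's bicubic filter only approximates Matlab's antialiased imresize(); refer to [1] for the exact procedure.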
There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.
The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5
Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k an integer in the range [-4, 4] (the resulting factors are enumerated in the short sketch after the file list):
cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5
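As a quick check, the nine scaling factors 2^(k/4) can be enumerated in Python; they match the scte suffixes of the file names above:

scales = [2 ** (k / 4) for k in range(-4, 5)]
print(", ".join(f"{s:.3f}" for s in scales))
# 0.500, 0.595, 0.707, 0.841, 1.000, 1.189, 1.414, 1.682, 2.000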
These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
import h5py
import numpy as np

with h5py.File("cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5", "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
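From here, the permuted arrays can be wrapped into PyTorch datasets; a minimal sketch, assuming PyTorch is installed:

import torch
from torch.utils.data import TensorDataset

# Labels are converted to int64, as expected by PyTorch loss functions
train_ds = TensorDataset(torch.from_numpy(x_train), torch.from_numpy(y_train).long())
val_ds = TensorDataset(torch.from_numpy(x_val), torch.from_numpy(y_val).long())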
The test datasets can be loaded in Python as:
with h5py.File("cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5", "r") as f:  # or any of the nine test files
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', '/x_test');
y_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', '/y_test');
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
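If normalised inputs are required, the loaded arrays can simply be rescaled to [0, 1] after loading:

x_train /= 255.0
x_val /= 255.0
x_test /= 255.0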