100+ datasets found

H
Data from: Data augmentation for disruption prediction via robust surrogate...
dataverse.harvard.edu
osti.gov
Updated Aug 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert (2024). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/FMJCAD
Dataset updated
Aug 31, 2024
Dataset provided by
Harvard Dataverse
Authors
Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The goal of this work is to generate large statistically representative datasets to train machine learning models for disruption prediction provided by data from few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate if the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
Variable Message Signal annotated images for object detection
zenodo.org
zip
Updated Oct 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gonzalo de las Heras de Matías; Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Javier Sánchez-Soriano; Enrique Puertas; Enrique Puertas (2022). Variable Message Signal annotated images for object detection [Dataset]. http://doi.org/10.5281/zenodo.5904211
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5904211
Dataset updated
Oct 2, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gonzalo de las Heras de Matías; Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Javier Sánchez-Soriano; Enrique Puertas; Enrique Puertas
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
If you use this dataset, please cite this paper: Puertas, E.; De-Las-Heras, G.; Sánchez-Soriano, J.; Fernández-Andrés, J. Dataset: Variable Message Signal Annotated Images for Object Detection. Data 2022, 7, 41. https://doi.org/10.3390/data7040041

This dataset consists of Spanish road images taken from inside a vehicle, as well as annotations in XML files in PASCAL VOC format that indicate the location of Variable Message Signals within them. Also, a CSV file is attached with information regarding the geographic position, the folder where the image is located, and the text in Spanish. This can be used to train supervised learning computer vision algorithms, such as convolutional neural networks. Throughout this work, the process followed to obtain the dataset, image acquisition, and labeling, and its specifications are detailed. The dataset is constituted of 1216 instances, 888 positives, and 328 negatives, in 1152 jpg images with a resolution of 1280x720 pixels. These are divided into 576 real images and 576 images created from the data-augmentation technique. The purpose of this dataset is to help in road computer vision research since there is not one specifically for VMSs.

The folder structure of the dataset is as follows:

vms_dataset/

data.csv

real_images/

imgs/

annotations/

data-augmentation/

imgs/

annotations/

In which:

data.csv: Each row contains the following information separated by commas (,): image_name, x_min, y_min, x_max, y_max, class_name, lat, long, folder, text.

real_images: Images extracted directly from the videos.

data-augmentation: Images created using data-augmentation

imgs: Image files in .jpg format.

annotations: Annotation files in .xml format.
Z
Training dataset for "A deep learned nanowire segmentation model using...
data.niaid.nih.gov
zenodo.org
Updated Jul 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David, A. Santos (2024). Training dataset for "A deep learned nanowire segmentation model using synthetic data augmentation" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6469772
Explore at:
Dataset updated
Jul 16, 2024
Dataset provided by
Bai-Xiang, Xu
Lin, Binbin
Sarbajit, Banerjee
David, A. Santos
Nima, Emami
Yuting, Luo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This image dataset contains synthetic structure images used for training the deep-learning based nanowire segmentation model presented in our work "A deep learned nanowire segmentation model using synthetic data augmentation" to be published in npj Computational materials. Detailed information can be found in the corresponding article.
f
Convolutional and recurrent neural network for human activity recognition:...
plos.figshare.com
docx
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vincent Hernandez; Tomoya Suzuki; Gentiane Venture (2023). Convolutional and recurrent neural network for human activity recognition: Application on American sign language [Dataset]. http://doi.org/10.1371/journal.pone.0228869
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0228869
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Vincent Hernandez; Tomoya Suzuki; Gentiane Venture
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United States
Description
Human activity recognition is an important and difficult topic to study because of the important variability between tasks repeated several times by a subject and between subjects. This work is motivated by providing time-series signal classification and a robust validation and test approaches. This study proposes to classify 60 signs from the American Sign Language based on data provided by the LeapMotion sensor by using different conventional machine learning and deep learning models including a model called DeepConvLSTM that integrates convolutional and recurrent layers with Long-Short Term Memory cells. A kinematic model of the right and left forearm/hand/fingers/thumb is proposed as well as the use of a simple data augmentation technique to improve the generalization of neural networks. DeepConvLSTM and convolutional neural network demonstrated the highest accuracy compared to other models with 91.1 (3.8) and 89.3 (4.0) % respectively compared to the recurrent neural network or multi-layer perceptron. Integrating convolutional layers in a deep learning model seems to be an appropriate solution for sign language recognition with depth sensors data.
Z
Data from: MedMNIST-C: Comprehensive benchmark and improved classifier...
data.niaid.nih.gov
zenodo.org
Updated Jul 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Di Salvo, Francesco (2024). MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11471503
Explore at:
Dataset updated
Jul 31, 2024
Dataset provided by
Ledig, Christian
Doerrich, Sebastian
Di Salvo, Francesco
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract: The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection, covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at github.com/francescodisalvo05/medmnistc-api.

This work has been accepted at the Workshop on Advancing Data Solutions in Medical Imaging AI @ MICCAI 2024 [preprint].

Note: Due to space constraints, we have uploaded all datasets except TissueMNIST-C. However, it can be reproduced via our APIs.

Usage: We recommend using the demo code and tutorials available on our GitHub repository.

Citation: If you find this work useful, please consider citing us:

@article{disalvo2024medmnist, title={MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions}, author={Di Salvo, Francesco and Doerrich, Sebastian and Ledig, Christian}, journal={arXiv preprint arXiv:2406.17536}, year={2024} }

Disclaimer: This repository is inspired by MedMNIST APIs and the ImageNet-C repository. Thus, please also consider citing MedMNIST, the respective source datasets (described here), and ImageNet-C.
f
Per-class recall values for results in Table 5.
plos.figshare.com
xls
Updated Jun 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chinmayee Athalye; Rima Arnaout (2023). Per-class recall values for results in Table 5. [Dataset]. http://doi.org/10.1371/journal.pone.0282532.t006
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0282532.t006
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS ONE
Authors
Chinmayee Athalye; Rima Arnaout
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
While domain-specific data augmentation can be useful in training neural networks for medical imaging tasks, such techniques have not been widely used to date. Our objective was to test whether domain-specific data augmentation is useful for medical imaging using a well-benchmarked task: view classification on fetal ultrasound FETAL-125 and OB-125 datasets. We found that using a context-preserving cut-paste strategy, we could create valid training data as measured by performance of the resulting trained model on the benchmark test dataset. When used in an online fashion, models trained on this hybrid data performed similarly to those trained using traditional data augmentation (FETAL-125 F-score 85.33 ± 0.24 vs 86.89 ± 0.60, p-value 0.014; OB-125 F-score 74.60 ± 0.11 vs 72.43 ± 0.62, p-value 0.004). Furthermore, the ability to perform augmentations during training time, as well as the ability to apply chosen augmentations equally across data classes, are important considerations in designing a bespoke data augmentation. Finally, we provide open-source code to facilitate running bespoke data augmentations in an online fashion. Taken together, this work expands the ability to design and apply domain-guided data augmentations for medical imaging tasks.
i
Data from: Context-guided Ground Truth Sampling for Multi-Modality Data...
ieee-dataport.org
Updated Apr 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Heng Qi (2022). Context-guided Ground Truth Sampling for Multi-Modality Data Augmentation in Autonomous Driving [Dataset]. https://ieee-dataport.org/documents/context-guided-ground-truth-sampling-multi-modality-data-augmentation-autonomous-driving
Explore at:
Dataset updated
Apr 24, 2022
Authors
Heng Qi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
existing multimodal data augmentation is only a brief reference to single-modal work
Data from: Phenotype Driven Data Augmentation Methods for Transcriptomic...
zenodo.org
zip
Updated Jun 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nikita Janakarajan; Nikita Janakarajan; Mara Graziani; Mara Graziani; María Rodríguez Martínez; María Rodríguez Martínez (2025). Phenotype Driven Data Augmentation Methods for Transcriptomic Data [Dataset]. http://doi.org/10.5281/zenodo.14983178
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14983178
Dataset updated
Jun 11, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nikita Janakarajan; Nikita Janakarajan; Mara Graziani; Mara Graziani; María Rodríguez Martínez; María Rodríguez Martínez
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the data and associated results of all experiments conducted in our work "Phenotype Driven Data Augmentation Methods for Transcriptomic Data". In this work, we introduce two classes of phenotype driven data augmentation approaches – signature-dependent and signature-independent. The signature-dependent methods assume the existence of distinct gene signatures describing some phenotype and are simple, non-parametric, and novel data augmentation methods. The signature-independent methods are a modification of the established Gamma-Poisson and Poisson sampling methods for gene expression data. We benchmark our proposed methods against random oversampling, SMOTE, unmodified versions of Gamma-Poisson and Poisson sampling, and unaugmented data.

This repository contains data used for all our experiments. This includes the original data based off which augmentation was performed, the cross validation split indices as a json file, the training and validation data augmented by the various augmentation methods mentioned in our study, a test set (containing only real samples) and an external test set standardised accordingly with respect to each augmentation method and training data per CV split.

The compressed files 5x5stratified_{x}percent.zip contains data that were augmented on x% of the available real data. brca_public.zip contains data used for the breast cancer experiments. distribution_size_effect.zip contains data used for hyperparameter tuning the reference set size for the modified Poisson and Gamma-Poisson methods.

The compressed file results.zip contains all the results from all the experiments. This includes the parameter files used to train the various models, the metrics (balanced accuracy and auc-roc) computed including p-values, as well as the latent space of train, validation and test (for the (N)VAE) for all 25 (5x5) CV splits.

PLEASE NOTE: If any part of this repository is used in any form for your work, please attribute the following, in addition to attributing the original data source - TCGA, CPTAC, GSE20713 and METABRIC, accordingly:

@article{janakarajan2025phenotype,
title={Phenotype driven data augmentation methods for transcriptomic data},
author={Janakarajan, Nikita and Graziani, Mara and Rodr{\'\i}guez Mart{\'\i}nez, Mar{\'\i}a},
journal={Bioinformatics Advances},
volume={5},
number={1},
pages={vbaf124},
year={2025},
publisher={Oxford University Press}
}

Data from: SEMFIRE forest dataset for semantic segmentation and data...

zenodo.org

application/gzip, bin +2

Updated Jan 20, 2022

Facebook

Twitter

Click to copy link

Link copied

Cite

Dominik Bittner; Dominik Bittner; Maria Eduarda Andrada; Maria Eduarda Andrada; David Portugal; David Portugal; João Filipe Ferreira; João Filipe Ferreira (2022). SEMFIRE forest dataset for semantic segmentation and data augmentation [Dataset]. http://doi.org/10.5281/zenodo.5819064

Explore at:

zip, application/gzip, bin, txtAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.5819064

Dataset updated

Jan 20, 2022

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Dominik Bittner; Dominik Bittner; Maria Eduarda Andrada; Maria Eduarda Andrada; David Portugal; David Portugal; João Filipe Ferreira; João Filipe Ferreira

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

SEMFIRE Datasets (Forest environment dataset)

These datasets are used for semantic segmentation and data augmentation and contain various forestry scenes. They were collected as part of the research work conducted by the Institute of Systems and Robotics, University of Coimbra team within the scope of the Safety, Exploration and Maintenance of Forests with Ecological Robotics (SEMFIRE, ref. CENTRO-01-0247-FEDER-032691) research project coordinated by Ingeniarius Ltd.

The semantic segmentation algorithms attempt to identify various semantic classes (e.g. background, live flammable materials, trunks, canopies etc.) in the images of the datasets.

The datasets include diverse image types, e.g. original camera images and their labeled images. In total the SEMFIRE datasets include about 1700 image pairs. Each dataset includes corresponding .bag files.

To launch those .bag files on your ROS environment, use the instructions on the following Github repository

Description of each dataset:

2019_2020_quinta_do_bolao_coimbra: Robot moving on a path through a forest environment
2020_ctcv_parking_lot_coimbra: Robot moving in a circle in a parking lot for testings
2020_sete_fontes_forest: A set of forest images acquired by hand-held apparatus

Each dataset consists of following directories:

images directory: diverse image types, e.g. original camera images and their labeled images
rosbags directory: .bag files, which correspond to the image directory

Each images directory consists of following directories:

img: original camera images
lbl: single channel images (ground truth) with corresponding labels for each image in img
lbl_colored: camera images in lbl colorized according to different semantic classes (for more details see the datasets descriptions)
lbl_overlaid: camera images in img overlaid with corresponding labels (colored)

Each rosbags directory contains .bag files with the following topics:

2019_2020_quinta_do_bolao_coimbra_rosbags:
- /back_lslidar_packet
- /dalsa_camera_720p/compressed
- /flir_ax8/compressed
- /front_lslidar_packet
- /gps_fix
- /gps_time
- /gps_vel
- /imu/data
- /realsense/aligned_depth_to_color/image_raw
- /realsense/color/camera_info
- /realsense/color/image_raw/compressed
- /realsense/depth/camera_info
- /realsense/depth/image_rect_raw/compressed
- /realsense/extrinsics/depth_to_color
2020_ctcv_parking_lot_coimbra_rosbags:
- /dalsa_camera_720p/compressed
- /gps_fix
- /gps_ime
- /fused_point_cloud
- /imu/data
- /imu/mag
- /imu/rpy
2020_sete_fontes_forest_rosbags:
- /realsense/camera_info
- /realsense/depth_compressed/compressedDepth
- /realsense/nir/left/compressed
- /realsense/nir/right/compressed
- /realsense/rgb/compressed

All datasets include a detailed description as a text file. In addition, they include a rosbag_info.txt file with a description for each ROS inside the .bag files as well as a description for each ROS topic.

The following table shows the statistical description of typical portuguese woodland configurations with structured plantations of Pinus pinaster (Pp, pine trees) and Eucalyptus globulus (Eg, eucalyptus).

	"Low density" structured plantation	"High density" structured plantation
Tree density (assuming plantation in rows spaced 3m apart in all cases)	Eg: 900 trees/ha Pp: 450 trees/ha	Eg: 1400 trees/ha Pp: 1250 trees/ha
Average heights and corresponding ages of plantation trees	Eg: 12m (6 years old) Pp: 10m (15 years old)	Eg: 12m (6 years old) Pp: 10m (15 years old)
Maximum heights and corresponding fully-matured ages of plantation trees	Eg: 20m (11 years old) Pp: 30m (40 years old)	Eg: 20m (11 years old) Pp: 30m (40 years old)
Diameter at chest level (DCL – 1,3m) of plantation trees (average/maximum)	Eg: 15cm/25cm Pp: 20cm/50cm	Eg: 15cm/25cm Pp: 20cm/50cm
Natural density of herbaceous plants	30% of woodland area	30% of woodland area
Natural density of bush and shrubbery	30% of woodland area	30% of woodland area
Natural density of arboreal plants (not part of plantation)	5% of woodland area	5% of woodland area

f
Comparison of models trained with traditional and cut-paste data...
plos.figshare.com
xls
Updated Jun 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chinmayee Athalye; Rima Arnaout (2023). Comparison of models trained with traditional and cut-paste data augmentation when application of augmentation during training time is balanced. [Dataset]. http://doi.org/10.1371/journal.pone.0282532.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0282532.t005
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS ONE
Authors
Chinmayee Athalye; Rima Arnaout
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Comparison of models trained with traditional and cut-paste data augmentation when application of augmentation during training time is balanced.
Data archive for paper "Copula-based synthetic data augmentation for...
zenodo.org
zip
Updated Mar 15, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Meyer; David Meyer (2022). Data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators" [Dataset]. http://doi.org/10.5281/zenodo.5150327
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5150327
Dataset updated
Mar 15, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
David Meyer; David Meyer
Description
Overview

This is the data archive for paper "Copula-based synthetic data augmentation for machine-learning emulators". It contains the paper’s data archive with model outputs (see results folder) and the Singularity image for (optionally) re-running experiments.

For the Python tool used to generate synthetic data, please refer to Synthia.

Requirements

Singularity >= 3

Portable Batch System (PBS) job scheduler*

Today's high-performance computer (e.g. ~ 32 CPUs @ 2 500 MHz with 64 GB of RAM )

*Although PBS in not a strict requirement, it is required to run all helper scripts as included in this repository. Please note that depending on your specific system settings and resource availability, you may need to modify PBS parameters at the top of submit scripts stored in the hpc directory (e.g. #PBS -lwalltime=72:00:00).

Usage

To reproduce the results from the experiments described in the paper, first fit all copula models to the reduced NWP-SAF dataset with:

qsub hpc/fit.sh

then, to generate synthetic data, run all machine learning model configurations, and compute the relevant statistics use:

qsub hpc/stats.sh qsub hpc/ml_control.sh qsub hpc/ml_synth.sh

Finally, to plot all artifacts included in the paper use:

qsub hpc/plot.sh

Licence

Code released under MIT license. Data from the reduced NWP-SAF dataset released under CC BY 4.0.
Aptos and Messidor eye images
kaggle.com
Updated Jun 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anik Bhowmick ae20b102 (2024). Aptos and Messidor eye images [Dataset]. https://www.kaggle.com/datasets/anikbhowmickae20b102/binary-classification-data-aptos-and-messidor
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 29, 2024
Dataset provided by
Kaggle
Authors
Anik Bhowmick ae20b102
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Early detection of Diabetic Retinopathy is a key challenge to prevent a patient from potential vision loss. The task of DR detection often requires special expertise from ophthalmologists. In remote places of the world such facilities may not be available, so In an attempt to automate the detection of DR, machine learning and deep learning techniques can be adopted. Some of the recent papers have proven such success on various publicly available dataset.

Another challenge of deep learning techniques is the availability of rightly processed standardized data. Cleaning and preprocessing the data often takes much longer time than the model training. As a part of my research work, I had to preprocess the images taken from APTOS and Messidor before training the model. I applied circle-crop and Graham Ben's preprocessing technique and scaled all the images to 512X512 format. Also, I applied the data augmentation technique and increased the number of samples from 3662 data of APTOS to 18310, and 400 messidor samples to 3600 samples. I divided the images into two classes class 0 (NO DR) and class 1 (DR). The large number of data is essential for transfer learning. This process is very cumbersome and time-consuming. So I thought to upload the newly generated dataset in Kaggle so that some people might find it useful for their work. I hope this will help many people. Feel free to use the data.
Punjabi Shahmukhi Alphabet database (Nastaleeq)
kaggle.com
Updated Mar 17, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
rafique22 (2022). Punjabi Shahmukhi Alphabet database (Nastaleeq) [Dataset]. http://doi.org/10.34740/kaggle/dsv/3313160
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/3313160
Dataset updated
Mar 17, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
rafique22
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The greatest challenge of machine learning problems is to select suitable techniques and resources such as tools and datasets. Despite the existence of millions of speakers around the globe and the rich literary history of more than a thousand years, it is expensive to find the computational linguistic work related to Punjabi Shahmukhi script, a member of the Perso-Arabic context-specific script low-resource language family. The selection of the best algorithm for a machine learning problem heavily depends on the availability of a dataset for that specific task. We present a novel, custom-built, and first-of-its-kind dataset for Punjabi in Shahmukhi script, its design, development, and validation process using Artificial Neural Networks. The dataset uses up to 40 classes, in multiple fonts, including Nasta’leeq, Naskh, and Arabic Type, etc, many font sizes and has been presented in many sub sizes. The dataset has been designed with a special dataset construction process by which researchers can make changes in the dataset as per their requirements.* The dataset construction program can also perform data augmentation to generate millions of images for a machine learning algorithm with different parameters including font type, size orientation, and translation. Using this process, a dataset of any language can be constructed. The CNNs in different architectures have been implemented and validation accuracy of up to 99% has been achieved.
Z
ProxyFAUG: Proximity-based Fingerprint Augmentation (data)
data.niaid.nih.gov
Updated Jun 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kalousis, Alexandros (2022). ProxyFAUG: Proximity-based Fingerprint Augmentation (data) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4457390
Explore at:
Dataset updated
Jun 13, 2022
Dataset provided by
Kalousis, Alexandros
Anagnostopoulos, Grigorios
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The supplementary data of the paper "ProxyFAUG: Proximity-based Fingerprint Augmentation".

Open access Author’s accepted manuscript version: https://arxiv.org/abs/2102.02706v2

Published paper: https://ieeexplore.ieee.org/document/9662590

The train/validation/test sets used in the paper "ProxyFAUG: Proximity-based Fingerprint Augmentation", after having passed the preprocessing process described in the paper, are made available here. Moreover, the augmentations produced by the proposed ProxyFAUG method are also made available with the files (x_aug_train.csv, y_aug_train.csv). More specifically:

x_train_pre.csv : The features side (x) information of the preprocessed training set.

x_val_pre.csv : The features side (x) information of the preprocessed validation set.

x_test_pre.csv : The features side (x) information of the preprocessed test set.

x_aug_train.csv : The features side (x) information of the fingerprints generated by ProxyFAUG.

y_train.csv : The location ground truth information (y) of the training set.

y_val.csv : The location ground truth information (y) of the validation set.

y_test.csv : The location ground truth information (y) of the test set.

y_aug_train.csv : The location ground truth information (y) of the fingerprints generated by ProxyFAUG.

Note that in the paper, the original training set (x_train_pre.csv) is used as a baseline, and is compared against the scenario where the concatenation of the original and the generated training sets (concatenation of x_train_pre.csv and x_aug_train.csv) is used.

The full code implementation related to the paper is available here:

Code: https://zenodo.org/record/4457353

The original full dataset used in this study, is the public dataset sigfox_dataset_antwerp.csv which can be access here:

https://zenodo.org/record/3904158#.X4_h7y8RpQI

The above link is related to the publication "Sigfox and LoRaWAN Datasets for Fingerprint Localization in Large Urban and Rural Areas", in which the original full dataset was published. The publication is available here:

http://www.mdpi.com/2306-5729/3/2/13

The credit for the creation of the original full dataset goes to Aernouts, Michiel; Berkvens, Rafael; Van Vlaenderen, Koen; and Weyn, Maarten.

The train/validation/test split of the original dataset that is used in this paper, is taken from our previous work "A Reproducible Analysis of RSSI Fingerprinting for Outdoors Localization Using Sigfox: Preprocessing and Hyperparameter Tuning". Using the same train/validation/test split in different works strengthens the consistency of the comparison of results. All relevant material of that work is listed below:

Preprint: https://arxiv.org/abs/1908.06851

Paper: https://ieeexplore.ieee.org/document/8911792

Code: https://zenodo.org/record/3228752

Data: https://zenodo.org/record/3228744
n
Data from: Trust, AI, and Synthetic Biometrics
curate.nd.edu
pdf
Updated Nov 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Patrick G Tinsley (2024). Trust, AI, and Synthetic Biometrics [Dataset]. http://doi.org/10.7274/25604631.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.7274/25604631.v1
Dataset updated
Nov 11, 2024
Dataset provided by
University of Notre Dame
Authors
Patrick G Tinsley
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Artificial Intelligence-based image generation has recently seen remarkable advancements, largely driven by deep learning techniques, such as Generative Adversarial Networks (GANs). With the influx and development of generative models, so too have biometric re-identification models and presentation attack detection models seen a surge in discriminative performance. However, despite the impressive photo-realism of generated samples and the additive value to the data augmentation pipeline, the role and usage of machine learning models has received intense scrutiny and criticism, especially in the context of biometrics, often being labeled as untrustworthy. Problems that have garnered attention in modern machine learning include: humans' and machines' shared inability to verify the authenticity of (biometric) data, the inadvertent leaking of private biometric data through the image synthesis process, and racial bias in facial recognition algorithms. Given the arrival of these unwanted side effects, public trust has been shaken in the blind use and ubiquity of machine learning.

However, in tandem with the advancement of generative AI, there are research efforts to re-establish trust in generative and discriminative machine learning models. Explainability methods based on aggregate model salience maps can elucidate the inner workings of a detection model, establishing trust in a post hoc manner. The CYBORG training strategy, originally proposed by Boyd, attempts to actively build trust into discriminative models by incorporating human salience into the training process.

In doing so, CYBORG-trained machine learning models behave more similar to human annotators and generalize well to unseen types of synthetic data. Work in this dissertation also attempts to renew trust in generative models by training generative models on synthetic data in order to avoid identity leakage in models trained on authentic data. In this way, the privacy of individuals whose biometric data was seen during training is not compromised through the image synthesis procedure. Future development of privacy-aware image generation techniques will hopefully achieve the same degree of biometric utility in generative models with added guarantees of trustworthiness.
D
Data from: Using distant supervision to augment manually annotated data for...
lifesciences.datastations.nl
pdf, txt, zip
Updated Jul 9, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
P. Su; P. Su (2019). Using distant supervision to augment manually annotated data for relation extraction [Dataset]. http://doi.org/10.17026/DANS-XVU-RVK2
Explore at:
txt(3413), txt(8429933), txt(221183954), txt(221072128), txt(744646), zip(18029), txt(229932813), pdf(1333118), txt(194787816), txt(236785257), txt(201857325), txt(214114445), txt(214219684), txt(258892324)Available download formats
Unique identifier
https://doi.org/10.17026/DANS-XVU-RVK2
Dataset updated
Jul 9, 2019
Dataset provided by
DANS Data Station Life Sciences
Authors
P. Su; P. Su
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This is the data for the paper "Using distant supervision to augment manually annotated data for relation extraction"Significant progress has been made in applying deep learning on natural language processing tasks recently. However, deep learning models typically require a large amount of annotated training data while often only small labeled datasets are available for many natural language processing tasks in biomedical literature. Building large-size datasets for deep learning is expensive since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data using distant supervision. However, data obtained by distant supervision is often noisy, we first apply some heuristics to remove some of the incorrect annotations. Then using methods inspired from transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
f
Data_Sheet_1_Deep learning-based Alzheimer's disease detection:...
frontiersin.figshare.com
pdf
Updated Sep 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rosanna Turrisi; Alessandro Verri; Annalisa Barla (2024). Data_Sheet_1_Deep learning-based Alzheimer's disease detection: reproducibility and the effect of modeling choices.PDF [Dataset]. http://doi.org/10.3389/fncom.2024.1360095.s001
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fncom.2024.1360095.s001
Dataset updated
Sep 20, 2024
Dataset provided by
Frontiers
Authors
Rosanna Turrisi; Alessandro Verri; Annalisa Barla
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionMachine Learning (ML) has emerged as a promising approach in healthcare, outperforming traditional statistical techniques. However, to establish ML as a reliable tool in clinical practice, adherence to best practices in data handling, and modeling design and assessment is crucial. In this work, we summarize and strictly adhere to such practices to ensure reproducible and reliable ML. Specifically, we focus on Alzheimer's Disease (AD) detection, a challenging problem in healthcare. Additionally, we investigate the impact of modeling choices, including different data augmentation techniques and model complexity, on overall performance.MethodsWe utilize Magnetic Resonance Imaging (MRI) data from the ADNI corpus to address a binary classification problem using 3D Convolutional Neural Networks (CNNs). Data processing and modeling are specifically tailored to address data scarcity and minimize computational overhead. Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures with varying convolutional layers counts. The augmentation strategies involve affine transformations, such as zoom, shift, and rotation, applied either concurrently or separately.ResultsThe combined effect of data augmentation and model complexity results in up to 10% variation in prediction accuracy. Notably, when affine transformation are applied separately, the model achieves higher accuracy, regardless the chosen architecture. Across all strategies, the model accuracy exhibits a concave behavior as the number of convolutional layers increases, peaking at an intermediate value. The best model reaches excellent performance both on the internal and additional external testing set.DiscussionsOur work underscores the critical importance of adhering to rigorous experimental practices in the field of ML applied to healthcare. The results clearly demonstrate how data augmentation and model depth—often overlooked factors– can dramatically impact final performance if not thoroughly investigated. This highlights both the necessity of exploring neglected modeling aspects and the need to comprehensively report all modeling choices to ensure reproducibility and facilitate meaningful comparisons across studies.
Z
Augmented dataset of rumours and non-rumours for rumour detection
data.niaid.nih.gov
live.european-language-grid.eu
+1more
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sooji Han (2020). Augmented dataset of rumours and non-rumours for rumour detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3249976
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Jie Gao
Fabio Ciravegna
Sooji Han
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data set contains a collection of Twitter rumours and non-rumours during six real-world events: 1) 2013 Boston marathon bombings, 2) 2014 Ottawa shooting, 3) 2014 Sydney siege, 4) 2015 Charlie Hebdo Attack, 5) 2014 Ferguson unrest, and 6) 2015 Germanwings plane crash

The data set is an augmented data set of the PHEME dataset of rumours and non-rumours based on two data sets: the PHEME data 2, and the CrisisLexT26 data 3.

PHEME-Aug v2.0 (aug-rnr-data_filtered.tar.bz2 and aur-rnr-data_full.tar.bz2) contains augmented data for all six events.

aug-rnr-data_full.tar.bz2 contains source tweets and replies without temporal filtering. Please refer to [1] for details about temporal filtering. The statistics are as follows:

2013 Boston marathon bombings: 392 rumours and 784 non-rumours

2014 Ottawa shooting: 1,047 rumours and 2,072 non-rumours

2014 Sydney siege: 1,764 rumours and 3,530 non-rumours

2015 Charlie Hebdo Attack: 1,225 rumours and 2,450 non-rumours

2014 Ferguson unrest: 737 rumours and 1,476 non-rumours

2015 Germanwings plane crash: 502 rumours and 604 non-rumours

aug-rnr-data_filtered.tar.bz2 contains source tweets, replies, and retweets after temporal filtering and deduplication. Please refer to [1] for details. The statistics are as follows:

2013 Boston marathon bombings: 323 rumours and 645 non-rumours

2014 Ottawa shooting: 713 rumours and 1,420 non-rumours

2014 Sydney siege: 1,134 rumours and 2,262 non-rumours

2015 Charlie Hebdo Attack: 812 rumours and 1,673 non-rumours

2014 Ferguson unrest: 471 rumours and 949 non-rumours

2015 Germanwings plane crash: 375 rumours and 402 non-rumours

The data structure follows the format of the PHEME data [2]. Each event has a directory, with two subfolders, rumours and non-rumours. These two folders have folders named with a tweet ID. The tweet itself can be found on the 'source-tweet' directory of the tweet in question, and the directory 'reactions' has the set of tweets responding to that source tweet. Also each folder contains ‘aug_complete.csv’ and ‘reference.csv'.

'aug_complete.csv' file contains the metadata (tweet ID, tweet text, timestamp, and rumour label) of augmented tweets before deduplication and filtering tweets without context (i.e., replies).

'reference.csv' file contains manually annotated reference tweets [2, 3].

If you use our augmented data (PHEME-Aug v2.0), please also cite:

[1] Han S., Gao, J., Ciravegna, F. (2019). "Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection", The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August, 2019

==============================================================================================

[2] Kochkina, E., Liakata, M., & Zubiaga, A. (2018). All-in-one: Multi-task Learning for Rumour Verification. COLING.

[3] Olteanu, A., Vieweg, S., & Castillo, C. (2015, February). What to expect when the unexpected happens: Social media communications across crises. In Proceedings of the 18th ACM conference on computer supported cooperative work & social computing (pp. 994-1009). ACM
R
Solar flare forecasting based on magnetogram sequences learning with MViT...
redu.unicamp.br
data.niaid.nih.gov
+1more
Updated Jul 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Repositório de Dados de Pesquisa da Unicamp (2024). Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation [Dataset]. http://doi.org/10.25824/redu/IH0AH0
Explore at:
Unique identifier
https://doi.org/10.25824/redu/IH0AH0
Dataset updated
Jul 15, 2024
Dataset provided by
Repositório de Dados de Pesquisa da Unicamp
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset funded by
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Description
Source codes and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training Deep Learning models with GPU support and automatic back-propagation, to load the MViTv2 s models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit loop to train and the automation of some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples with 16 sequenced images in 3-channel resized to 224 × 224 pixels and normalized from 0 to 1. Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of the model stability. However, we adopt a hybrid split taking the first 50,000 to apply the 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. Thus, we can evaluate performance and stability by analyzing the mean and standard deviation of all trained models in the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data). We develop three distinct models to evaluate the impact of oversampling magnetogram sequences through the dataset. The first model, Solar Flare MViT (SF MViT), has trained only with the original data from our base dataset without using oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we only apply oversampling on training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both training and validation sets. We also trained a model oversampling the entire dataset. We called it the "SF_MViT_oTV Test" to verify how resampling or adopting a test set with unreal data may bias the results positively. GitHub version The .zip hosted here contains all files from the project, including the checkpoint and the output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly on https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and the output and checkpoint files. Most code files hosted here also contain comments on the Portuguese language, which are being updated to English in the GitHub version. Folders Structure In the Root directory of the project, we have two folders: magnetogram_jpg: holds the source images provided by Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in hmi.sharp_720s dataset), making new samples available every 12 minutes. However, the images from this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or most solar ARs. It is essential to notice that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA). Seq_Magnetogram: contains the references for source images with the corresponding labels in the next 24 h. and 48 h. in the respectively M24 and M48 sub-folders. M24/M48: both present the following sub-folders structure: Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test. There are also two files in root: inst_packages.sh: install the packages and dependencies to run the models. download_MViTS.py: download the pre-trained MViTv2_S from PyTorch and store it in the cache. M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folders or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders with their respective labels. They also hold "cria_seqs.py" which was responsible for creating the sequences and "test_pandas.py" to verify head info and check the number of samples categorized by the label of the text files. All the text files with the prefix "Seq16" and inside the Seqs16 folder were created by "criaseqs.py" code based on the correspondent "flare_Mclass" prefixed text files. Seqs16 folder holds reference text files, in which each file contains a sequence of images that was pointed to the magnetogram_jpg folders. All SF_MViT... folders hold the model training codes itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...) and checkpoint files (sample-FLARE...ckpt). Executed model training codes generate output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores logs of trained models. Naming pattern for the files: magnetogram_jpg: follows the format "hmi.sharp_720s...magnetogram.fits.jpg" and Seqs16: follows the format "hmi.sharp_720s...to.", where: hmi: is the instrument that captured the image sharp_720s: is the database source of SDO/HMI. : is the identification of SHARP region, and can contain one or more solar ARs classified by the (NOAA). : is the date-time the instrument captured the image in the format yyyymmdd_hhnnss_TAI (y:year, m:month, d:day, h:hours, n:minutes, s:seconds). : is the date-time when the sequence starts, and follow the same format of . : is the date-time when the sequence ends, and follow the same format of . Reference text files in M24 and M48 or inside SF_MViT... folders follows the format "flare_Mclass_.txt", where: : is Seq16 if refers to a sequence, or void if refers direct to images. : "24h" or "48h". : is "TrainVal" or "Test". The refers to the split of Train/Val. : void or "_over" after the extension (...txt_over): means temporary input reference that was over-sampled by a training model. All SF_MViT...folders: Model training codes: "SF_MViT_M+_", where: : void or "oT" (over Train) or "oTV" (over Train and Val) or "oTV_Test" (over Train, Val and Test); : "24h" or "48h"; : "oneSplit" for a specific split or "allSplits" if run all splits. : void is default to run 1 GPU or "2gpu" to run into 2 gpus systems; Job submission files: "jobMViT_", where: : point the queue in Lovelace environment hosted on CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace) Temporary inputs: "Seq16_flare_Mclass_.txt: : train or val; : void or "_over" after the extension (...txt_over): means temporary input reference that was over-sampled by a training model. Outputs: "saida_MViT_Adam_10-7", where: : k0 to k4, means the correlated split of the output, or void if the output is from all splits. Error files: "err_MViT_Adam_10-7", where: : k0 to k4, means the correlated split of the error log file, or void if the error file is from all splits. Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=-valid_loss=-Wloss_k=.ckpt", where: : epoch number of the checkpoint; : corresponding valid loss; : 0 to 4.
u
Data from: Classification of complex local environments in systems of...
deepblue.lib.umich.edu
Updated Feb 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lee, Shih Kuang; Tsai, Sun Ting; Glotzer, Sharon C. (2024). Classification of complex local environments in systems of particle shapes through shape-symmetry encoded data augmentation [Dataset]. http://doi.org/10.7302/w13t-2177
Explore at:
Unique identifier
https://doi.org/10.7302/w13t-2177
Dataset updated
Feb 23, 2024
Dataset provided by
Deep Blue Data
Authors
Lee, Shih Kuang; Tsai, Sun Ting; Glotzer, Sharon C.
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Time period covered
2022
Description
The trajectory data and codes were generated for our work "Classification of complex local environments in systems of particle shapes through shape-symmetry encoded data augmentation" (amidst peer review process). The data sets contain trajectory data in GSD file format for 7 test systems, including cubic structures, two-dimensional and three-dimensional patchy particle shape systems, hexagonal bipyramids with two aspect ratios, and truncated shapes with two degrees of truncation. Besides, the corresponding Python code and Jupyter notebook used to perform data augmentation, MLP classifier training, and MLP classifier testing are included.

Facebook

Twitter

Click to copy link

Link copied

Cite

Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert (2024). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD

Data from: Data augmentation for disruption prediction via robust surrogate models

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.7910/DVN/FMJCAD

Dataset updated

Aug 31, 2024

Dataset provided by

Harvard Dataverse

Authors

Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert

License

CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

The goal of this work is to generate large statistically representative datasets to train machine learning models for disruption prediction provided by data from few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate if the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.

Clear search

Close search

Google apps

Main menu

Data from: Data augmentation for disruption prediction via robust surrogate...

Variable Message Signal annotated images for object detection

Training dataset for "A deep learned nanowire segmentation model using...

Convolutional and recurrent neural network for human activity recognition:...

Data from: MedMNIST-C: Comprehensive benchmark and improved classifier...

Per-class recall values for results in Table 5.

Data from: Context-guided Ground Truth Sampling for Multi-Modality Data...

Data from: Phenotype Driven Data Augmentation Methods for Transcriptomic...

Data from: SEMFIRE forest dataset for semantic segmentation and data...

Comparison of models trained with traditional and cut-paste data...

Data archive for paper "Copula-based synthetic data augmentation for...

Aptos and Messidor eye images

Punjabi Shahmukhi Alphabet database (Nastaleeq)

ProxyFAUG: Proximity-based Fingerprint Augmentation (data)

Data from: Trust, AI, and Synthetic Biometrics

Data from: Using distant supervision to augment manually annotated data for...

Data_Sheet_1_Deep learning-based Alzheimer's disease detection:...

Augmented dataset of rumours and non-rumours for rumour detection

Solar flare forecasting based on magnetogram sequences learning with MViT...

Data from: Classification of complex local environments in systems of...

Data from: Data augmentation for disruption prediction via robust surrogate models