100+ datasets found

Data from: Data augmentation for disruption prediction via robust surrogate...
dataverse.harvard.edu
osti.gov
Updated Aug 31, 2024
Cite
Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert (2024). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD
Unique identifier
https://doi.org/10.7910/DVN/FMJCAD
Dataset updated
Aug 31, 2024
Dataset provided by
Harvard Dataverse
Authors
Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The goal of this work is to generate large statistically representative datasets to train machine learning models for disruption prediction provided by data from few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate if the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
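
The coloring step mentioned in this description can be written compactly. What follows is a generic sketch of a coloring transformation, not necessarily the exact construction used by the authors; the symbols z, L, and \Sigma are introduced here purely for illustration.

```latex
\begin{aligned}
\mathbb{E}[z] &= 0, \quad \operatorname{Cov}(z) = I
  && \text{(uncorrelated samples, one component per dimension)} \\
\Sigma &= L L^{\top}
  && \text{(target covariance, e.g. factored by Cholesky decomposition)} \\
x &= L z
  && \text{(colored samples)} \\
\operatorname{Cov}(x) &= L \operatorname{Cov}(z) L^{\top} = L I L^{\top} = \Sigma
\end{aligned}
```

The last line shows why the transformation works: multiplying uncorrelated samples by a factor of the target covariance yields samples with exactly that covariance.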
Variable Message Signal annotated images for object detection
zenodo.org
zip
Updated Oct 2, 2022
Cite
Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas (2022). Variable Message Signal annotated images for object detection [Dataset]. http://doi.org/10.5281/zenodo.5904211
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.5904211
Dataset updated
Oct 2, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Gonzalo de las Heras de Matías; Javier Sánchez-Soriano; Enrique Puertas
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
If you use this dataset, please cite this paper: Puertas, E.; De-Las-Heras, G.; Sánchez-Soriano, J.; Fernández-Andrés, J. Dataset: Variable Message Signal Annotated Images for Object Detection. Data 2022, 7, 41. https://doi.org/10.3390/data7040041

This dataset consists of Spanish road images taken from inside a vehicle, as well as annotations in XML files in PASCAL VOC format that indicate the location of Variable Message Signals within them. Also, a CSV file is attached with information regarding the geographic position, the folder where the image is located, and the text in Spanish. This can be used to train supervised learning computer vision algorithms, such as convolutional neural networks. Throughout this work, the process followed to obtain the dataset, image acquisition, and labeling, and its specifications are detailed. The dataset is constituted of 1216 instances, 888 positives, and 328 negatives, in 1152 jpg images with a resolution of 1280x720 pixels. These are divided into 576 real images and 576 images created from the data-augmentation technique. The purpose of this dataset is to help in road computer vision research since there is not one specifically for VMSs.

The folder structure of the dataset is as follows:

vms_dataset/
    data.csv
    real_images/
        imgs/
        annotations/
    data-augmentation/
        imgs/
        annotations/

In which:

data.csv: Each row contains the following information separated by commas (,): image_name, x_min, y_min, x_max, y_max, class_name, lat, long, folder, text.

real_images: Images extracted directly from the videos.

data-augmentation: Images created using data augmentation.

imgs: Image files in .jpg format.

annotations: Annotation files in .xml format.
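
The column layout of data.csv described above could be read roughly as follows. This is a minimal sketch under stated assumptions: it assumes the file has no header row, that the box corners are numeric in every row, and the sample row (file name, class name "vms", coordinates, sign text) is entirely hypothetical. Only the column names come from the dataset description.

```python
import csv
import io

# Column order as listed in the dataset description.
COLUMNS = ["image_name", "x_min", "y_min", "x_max", "y_max",
           "class_name", "lat", "long", "folder", "text"]


def read_annotations(handle):
    """Yield one dict per row, converting box corners and coordinates to numbers."""
    # Passing fieldnames means the first line is treated as data (assumption: no header).
    for row in csv.DictReader(handle, fieldnames=COLUMNS):
        for key in ("x_min", "y_min", "x_max", "y_max"):
            row[key] = int(float(row[key]))
        for key in ("lat", "long"):
            row[key] = float(row[key])
        yield row


def box_area(row):
    """Area of the bounding box in pixels."""
    return (row["x_max"] - row["x_min"]) * (row["y_max"] - row["y_min"])


# Hypothetical row, used only to exercise the parser.
SAMPLE = 'img_0001.jpg,100,50,300,150,vms,40.4168,-3.7038,real_images,"OBRAS, 5 KM"\n'

rows = list(read_annotations(io.StringIO(SAMPLE)))
print(rows[0]["image_name"], box_area(rows[0]))
```

To read the real file, the `io.StringIO(SAMPLE)` handle would be replaced by `open("vms_dataset/data.csv", newline="", encoding="utf-8")`; if the file turns out to contain a header row, the first yielded row would need to be skipped.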
Data from: How many specimens make a sufficient training set for automated...
search.dataone.org
data.niaid.nih.gov
Updated Jun 1, 2024
Cite
James M. Mulqueeney; Alex Searle-Barnes; Anieke Brombacher; Marisa Sweeney; Anjali Goswami; Thomas H. G. Ezard (2024). How many specimens make a sufficient training set for automated three dimensional feature extraction? [Dataset]. http://doi.org/10.5061/dryad.1rn8pk12f
Unique identifier
https://doi.org/10.5061/dryad.1rn8pk12f
Dataset updated
Jun 1, 2024
Dataset provided by
Dryad Digital Repository
Authors
James M. Mulqueeney; Alex Searle-Barnes; Anieke Brombacher; Marisa Sweeney; Anjali Goswami; Thomas H. G. Ezard
Description
Deep learning has emerged as a robust tool for automating feature extraction from 3D images, offering an efficient alternative to labour-intensive and potentially biased manual image segmentation methods. However, there has been limited exploration into the optimal training set sizes, including assessing whether artificial expansion by data augmentation can achieve consistent results in less time and how consistent these benefits are across different types of traits. In this study, we manually segmented 50 planktonic foraminifera specimens from the genus Menardella to determine the minimum number of training images required to produce accurate volumetric and shape data from internal and external structures. The results reveal unsurprisingly that deep learning models improve with a larger number of training images with eight specimens being required to achieve 95% accuracy. Furthermore, data augmentation can enhance network accuracy by up to 8.0%. Notably, predicting both volumetric and ...

Data collection

50 planktonic foraminifera, comprising 4 Menardella menardii, 17 Menardella limbata, 18 Menardella exilis, and 11 Menardella pertenuis specimens, were used in our analyses (electronic supplementary material, figures S1 and S2). The taxonomic classification of these species was established based on the analysis of morphological characteristics observed in their shells. In this context, all species are characterised by lenticular, low trochospiral tests with a prominent keel [13]. Discrimination among these species is achievable, as M. limbata can be distinguished from its ancestor, M. menardii, by having a greater number of chambers and a smaller umbilicus. Moreover, M. exilis and M. pertenuis can be discerned from M. limbata by their thinner, more polished tests and reduced trochospirality. Furthermore, M. pertenuis is identifiable by a thin plate extending over the umbilicus and possessing a greater number of chambers in the final whorl compared to M. exilis [13]. The s...

# Data from: How many specimens make a sufficient training set for automated three dimensional feature extraction?

https://doi.org/10.5061/dryad.1rn8pk12f

All computer code and final raw data used for this research work are stored in GitHub: https://github.com/JamesMulqueeney/Automated-3D-Feature-Extraction and have been archived within the Zenodo repository: https://doi.org/10.5281/zenodo.11109348.

This data is the additional primary data used in each analysis. These include: CT Image Files, Manual Segmentation Files (use for training or analysis), Inputs and Outputs for Shape Analysis and an example .h5 file which can be used to practice AI segmentation.

Description of the data and file structure

The primary data is arranged into the following:

Image_Files.zip: Foraminiferal CT data used in the analysis.

**I...
Data from: Exploring deep learning techniques for wild animal behaviour...
data.niaid.nih.gov
search.dataone.org
zip
Updated Feb 22, 2024
Cite
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa (2024). Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers [Dataset]. http://doi.org/10.5061/dryad.2ngf1vhwk
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2ngf1vhwk
Dataset updated
Feb 22, 2024
Dataset provided by
Nagoya University
Osaka University
Authors
Ryoma Otsuka; Naoya Yoshimura; Kei Tanigaki; Shiho Koyama; Yuichi Mizutani; Ken Yoda; Takuya Maekawa
License
https://spdx.org/licenses/CC0-1.0.html
Description
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. 
Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.

This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see the README for details of the datasets.
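
The random augmentation scheme described in the abstract (one technique picked at random for each sample during mini-batch training) could look roughly like the sketch below. It is an illustration only, not the authors' code: the function names, the noise levels, and the segment count are assumptions, and time-warping and rotation are omitted to keep the example short. A window is represented as a list of [x, y, z] acceleration samples.

```python
import random


def identity(window, rng):
    """'None' option: return an unchanged copy of the window."""
    return [list(sample) for sample in window]


def scale(window, rng, sigma=0.1):
    """Multiply the whole window by one random factor close to 1."""
    factor = rng.gauss(1.0, sigma)
    return [[value * factor for value in sample] for sample in window]


def jitter(window, rng, sigma=0.05):
    """Add independent Gaussian noise to every value."""
    return [[value + rng.gauss(0.0, sigma) for value in sample] for sample in window]


def permute(window, rng, n_segments=4):
    """Cut the window into segments along time and shuffle their order."""
    n = len(window)
    bounds = [round(i * n / n_segments) for i in range(n_segments + 1)]
    segments = [window[bounds[i]:bounds[i + 1]] for i in range(n_segments)]
    rng.shuffle(segments)
    return [list(sample) for segment in segments for sample in segment]


AUGMENTATIONS = [identity, scale, jitter, permute]


def random_augment(window, rng):
    """Apply one randomly chosen augmentation to a single window."""
    return rng.choice(AUGMENTATIONS)(window, rng)


rng = random.Random(0)
window = [[float(t), 0.0, 1.0] for t in range(8)]  # hypothetical 8-sample window
augmented_batch = [random_augment(window, rng) for _ in range(4)]
print([len(w) for w in augmented_batch])
```

Every augmentation keeps the window length and the number of axes unchanged, so augmented windows can be batched together with unmodified ones.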
Data from: Use data augmentation for a deep learning classification model...
explore.openaire.eu
Updated Jan 1, 2022
Cite
Hantian Dong; Biaokai Zhu; Xinri Zhang; Xiaomei Kong (2022). Use data augmentation for a deep learning classification model with chest X-ray clinical imaging featuring coal workers' pneumoconiosis [Dataset]. http://doi.org/10.6084/m9.figshare.c.6099645
Unique identifier
https://doi.org/10.6084/m9.figshare.c.6099645
Dataset updated
Jan 1, 2022
Authors
Hantian Dong; Biaokai Zhu; Xinri Zhang; Xiaomei Kong
Description
Abstract. Purpose: This paper aims to develop a successful deep learning model with data augmentation technique to discover the clinical uniqueness of chest X-ray imaging features of coal workers' pneumoconiosis (CWP). Patients and methods: We enrolled 149 CWP patients and 68 dust-exposure workers for a prospective cohort observational study between August 2021 and December 2021 at First Hospital of Shanxi Medical University. Two hundred seventeen chest X-ray images were collected for this study, obtaining reliable diagnostic results through the radiologists' team, and confirming clinical imaging features. We segmented regions of interest with diagnosis reports, then classified them into three categories. To identify these clinical features, we developed a deep learning model (ShuffleNet V2-ECA Net) with data augmentation through performances of different deep learning models by assessment with Receiver Operation Characteristics (ROC) curve and area under the curve (AUC), accuracy (ACC), and Loss curves. Results: We selected the ShuffleNet V2-ECA Net as the optimal model. The average AUC of this model was 0.98, and all classifications of clinical imaging features had an AUC above 0.95. Conclusion: We performed a study on a small dataset to classify the chest X-ray clinical imaging features of pneumoconiosis using a deep learning technique. A deep learning model of ShuffleNet V2 and ECA-Net was successfully constructed using data augmentation, which achieved an average accuracy of 98%. This method uncovered the uniqueness of the chest X-ray imaging features of CWP, thus supplying additional reference material for clinical application.
cars_wagonr_swift
kaggle.com
zip
Updated Sep 11, 2019
Cite
Ajay (2019). cars_wagonr_swift [Dataset]. https://www.kaggle.com/ajaykgp12/cars-wagonr-swift
Available download formats: zip (44486490 bytes)
Dataset updated
Sep 11, 2019
Authors
Ajay
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Data science beginners start with a curated set of data, but it is well known that in a real data science project most of the time is spent on collecting, cleaning and organizing data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up this challenge: collect images of two popular car models from a used car website, where users upload images of the car they want to sell, and then train a deep neural network to identify the model of a car from car images. In my search for images I found that approximately 10 percent of the car pictures did not represent the intended car correctly, and those pictures had to be deleted from the final data.

Content

There are 4000 images of two popular cars in India (Swift and Wagonr) of the make Maruti Suzuki, with 2000 pictures belonging to each model. The data is divided into a training set with 2400 images, a validation set with 800 images and a test set with 800 images. The data was randomized before splitting into training, validation and test sets.
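
The shuffle-then-split procedure described above can be sketched as follows. This is an illustration under assumptions, not the author's script: the file names are hypothetical and the seed is arbitrary; only the split sizes (2400/800/800 out of 4000) come from the description.

```python
import random


def split_dataset(items, n_train, n_val, n_test, seed=0):
    """Shuffle a copy of the items, then cut it into three disjoint parts."""
    if n_train + n_val + n_test != len(items):
        raise ValueError("split sizes must add up to the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # randomize before splitting
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names: 2000 images per car model.
images = ([f"swift_{i:04d}.jpg" for i in range(2000)]
          + [f"wagonr_{i:04d}.jpg" for i in range(2000)])

train, val, test = split_dataset(images, 2400, 800, 800)
print(len(train), len(val), len(test))
```

Shuffling before cutting means each part contains a mix of both car models, although the class balance within each part is only approximate.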

A starter kernel is provided for Keras with a CNN. I have also created a GitHub project documenting advanced techniques in PyTorch and Keras for image classification, such as data augmentation, dropout, batch normalization and transfer learning.

Inspiration

With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline model trained in Keras achieves 88% accuracy on the validation set; can we achieve even better performance, and by how much?

Is the data collected for the two car models representative of all possible cars from all over the country, or is there sample bias?

I would also like someone to extend the concept to build a use case in which, if a user uploads an incorrect car picture, the ML model automatically flags it, for example when a user uploads the wrong model or an image that is not a car.
Data Augmentation at the LHC through Analysis-specific Fast Simulation with...
zenodo.org
explore.openaire.eu
application/gzip
Updated Oct 14, 2020
Cite
Maurizio Pierini; Cheng Chen (2020). Data Augmentation at the LHC through Analysis-specific Fast Simulation with Deep Learning: W+jet large test dataset [Dataset]. http://doi.org/10.5281/zenodo.4080968
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.4080968
Dataset updated
Oct 14, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Maurizio Pierini; Cheng Chen
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
W+jet events at generator and reconstruction level, used to train analysis-specific generative models.

Events are represented as an array of relevant high-level features. Reco objects are matched to Gen objects and a minimal selection is applied to define the generator support in the N-dim space identified by the input features.

About 2M events, used for large-scale testing

Details in https://arxiv.org/abs/2010.01835
Training dataset for "A deep learned nanowire segmentation model using...
data.niaid.nih.gov
zenodo.org
Updated Jul 16, 2024
Cite
David, A. Santos (2024). Training dataset for "A deep learned nanowire segmentation model using synthetic data augmentation" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6469772
Dataset updated
Jul 16, 2024
Dataset provided by
Sarbajit, Banerjee
David, A. Santos
Yuting, Luo
Bai-Xiang, Xu
Lin, Binbin
Nima, Emami
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This image dataset contains synthetic structure images used for training the deep-learning based nanowire segmentation model presented in our work "A deep learned nanowire segmentation model using synthetic data augmentation" to be published in npj Computational materials. Detailed information can be found in the corresponding article.
Database of scalable training of neural network potentials for complex...
archive.materialscloud.org
bz2, text/markdown +1
Updated Apr 2, 2025
Cite
In Won Yeu; Annika Stuke; Alexander Urban; Nongnuch Artrith (2025). Database of scalable training of neural network potentials for complex interfaces through data augmentation [Dataset]. http://doi.org/10.24435/materialscloud:w6-9a
Available download formats: bz2, text/markdown, txt
Unique identifier
https://doi.org/10.24435/materialscloud:w6-9a
Dataset updated
Apr 2, 2025
Dataset provided by
Materials Cloud
Authors
In Won Yeu; Annika Stuke; Alexander Urban; Nongnuch Artrith
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This database contains the reference data used for direct force training of Artificial Neural Network (ANN) interatomic potentials using the atomic energy network (ænet) and ænet-PyTorch packages (https://github.com/atomisticnet/aenet-PyTorch). It also includes the GPR-augmented data used for indirect force training via Gaussian Process Regression (GPR) surrogate models using the ænet-GPR package (https://github.com/atomisticnet/aenet-gpr). Each data file contains atomic structures, energies, and atomic forces in XCrySDen Structure Format (XSF). The dataset includes all reference training/test data and corresponding GPR-augmented data used in the four benchmark examples presented in the reference paper, "Scalable Training of Neural Network Potentials for Complex Interfaces Through Data Augmentation". A hierarchy of the dataset is described in the README.txt file, and an overview of the dataset is also summarized in supplementary Table S1 of the reference paper.
Variable Misuse tool: Dataset for data augmentation (4)
zenodo.org
explore.openaire.eu
zip
Updated Mar 8, 2022
Cite
Cristian Robledo; Francesca Sallicati; Javier Gutiérrez (2022). Variable Misuse tool: Dataset for data augmentation (4) [Dataset]. http://doi.org/10.5281/zenodo.6090379
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.6090379
Dataset updated
Mar 8, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Cristian Robledo; Francesca Sallicati; Javier Gutiérrez
Description
Dataset used for data augmentation in the training phase of the Variable Misuse tool. It contains some source code files extracted from third-party repositories.
Data from: Equidistant and Uniform Data Augmentation for 3D Objects
ieee-dataport.org
Updated Jan 6, 2022
Cite
Alexander Morozov (2022). Equidistant and Uniform Data Augmentation for 3D Objects [Dataset]. https://ieee-dataport.org/documents/equidistant-and-uniform-data-augmentation-3d-objects
Dataset updated
Jan 6, 2022
Authors
Alexander Morozov
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Many methods exist to augment a 3D object.
Data and code for: Assessing the Reliability of Point Mutation as Data...
zenodo.org
zip
Updated Jan 4, 2024
Cite
Utku Ozbulak; Joris Vankerschaver (2024). Data and code for: Assessing the Reliability of Point Mutation as Data Augmentation for Deep Learning with Genomic Data [Dataset]. http://doi.org/10.5281/zenodo.10457988
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.10457988
Dataset updated
Jan 4, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Utku Ozbulak; Joris Vankerschaver
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Data and code for the paper "Assessing the Reliability of Point Mutation as Data Augmentation for Deep Learning with Genomic Data".
Enhanced Cardiovascular Disease Dataset with Data Augmentation
ieee-dataport.org
Updated Mar 3, 2025
Cite
Jose Luis Lopez-Saynes (2025). Enhanced Cardiovascular Disease Dataset with Data Augmentation [Dataset]. https://ieee-dataport.org/documents/enhanced-cardiovascular-disease-dataset-data-augmentation
Dataset updated
Mar 3, 2025
Authors
Jose Luis Lopez-Saynes
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
physical
Brain Tumor Paper Dataset and Code
zenodo.org
data.niaid.nih.gov
bin
Updated Feb 8, 2023
Cite
Yazan Al-Smadi (2023). Brain Tumor Paper Dataset and Code [Dataset]. http://doi.org/10.5281/zenodo.7619446
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.7619446
Dataset updated
Feb 8, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Yazan Al-Smadi
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Brain Tumor Detection Research Paper Code and Dataset

Paper title: Transforming brain tumor detection: the impact of YOLO models and MRI orientations.

Authored by: Yazan Al-Smadi, Ahmad Al-Qerem, et al. (2023)

This project contains a full version of the used brain tumor dataset and a full code version of the proposed research methodology.
augmentation data for DAISM
data.mendeley.com
explore.openaire.eu
Updated Jun 22, 2022
Cite
Yating Lin (2022). augmentation data for DAISM [Dataset]. http://doi.org/10.17632/ysjwjvpnh3.1
Unique identifier
https://doi.org/10.17632/ysjwjvpnh3.1
Dataset updated
Jun 22, 2022
Authors
Yating Lin
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The purified dataset for data augmentation for DAISM-DNNXMBD can be downloaded from this repository.

The pbmc8k dataset downloaded from 10X Genomics was processed and used for data augmentation to create training datasets for training DAISM-DNN models. pbmc8k.h5ad contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells), and pbmc8k_fine.h5ad contains 7 cell types (naive.B.cells, memory.B.cells, naive.CD4.T.cells, memory.CD4.T.cells, naive.CD8.T.cells, memory.CD8.T.cells, regulatory.T.cells, monocytes, macrophages, myeloid.dendritic.cells, NK.cells).

For RNA-seq dataset, it contains 5 cell types (B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells). Raw FASTQ reads were downloaded from the NCBI website, and transcription and gene-level expression quantification were performed using Salmon (version 0.11.3) with Gencode v29 after quality control of FASTQ reads using fastp. All tools were used with default parameters.
Table1_Enhancing biomechanical machine learning with limited data:...
frontiersin.figshare.com
pdf
Updated Feb 14, 2024
Cite
Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich (2024). Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence.pdf [Dataset]. http://doi.org/10.3389/fbioe.2024.1350135.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fbioe.2024.1350135.s001
Dataset updated
Feb 14, 2024
Dataset provided by
Frontiers
Authors
Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data. Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation. Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples. Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
Data from: Prediction of blood-brain barrier penetrating peptides based on...
figshare.com
application/x-rar
Updated Apr 5, 2024
Cite
Zhifeng Gu; Yuduo Hao; Tianyu Wang; Peiling Cai; Yang Zhang; Kejun Deng; Hao Lin; Hao Lv (2024). Prediction of blood-brain barrier penetrating peptides based on data augmentation with Augur [Dataset]. http://doi.org/10.6084/m9.figshare.25466461.v4
Available download formats: application/x-rar
Unique identifier
https://doi.org/10.6084/m9.figshare.25466461.v4
Dataset updated
Apr 5, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Zhifeng Gu; Yuduo Hao; Tianyu Wang; Peiling Cai; Yang Zhang; Kejun Deng; Hao Lin; Hao Lv
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The blood-brain barrier serves as a critical interface between the bloodstream and brain tissue, mainly composed of pericytes, neurons, endothelial cells, and tightly connected basal membranes. It plays a pivotal role in safeguarding the brain from harmful substances, thus protecting the integrity of the nervous system and preserving overall brain homeostasis. However, this remarkable selective transmission also poses a formidable challenge in the treatment of central nervous system diseases, hindering the delivery of large-molecule drugs into the brain. In response to this challenge, many researchers have devoted themselves to developing drug delivery systems capable of breaching the blood-brain barrier. Among these, blood-brain barrier penetrating peptides (B3PPs) have emerged as promising candidates. These peptides have the advantages of high biosafety, ease of synthesis, and exceptional penetration efficiency, making them an effective drug delivery solution. While previous studies have developed a few prediction models for B3PPs, their performance has often been hampered by the issue of limited positive data. In this study, we present Augur, a novel prediction model using borderline-SMOTE-based data augmentation and machine learning. We extract highly interpretable physicochemical properties of blood-brain barrier penetrating peptides while solving the issues of small sample size and imbalance of positive and negative samples. Experimental results demonstrate the superior prediction performance of Augur with an AUC value of 0.932 on the training set and 0.931 on the independent test set. This newly developed Augur model demonstrates superior performance in predicting blood-brain barrier penetrating peptides, offering valuable insights for drug development targeting neurological disorders. This breakthrough may enhance the efficiency of peptide-based drug discovery and pave the way for innovative treatment strategies for central nervous system diseases.
Datasets GO ID/attribute p-value q-value.
figshare.com
xls
Updated Jul 22, 2024
Cite
Sifan Feng; Zhenyou Wang; Yinghua Jin; Shengbin Xu (2024). Datasets GO ID/attribute p-value q-value. [Dataset]. http://doi.org/10.1371/journal.pone.0305857.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0305857.t004
Dataset updated
Jul 22, 2024
Dataset provided by
PLOS ONE
Authors
Sifan Feng; Zhenyou Wang; Yinghua Jin; Shengbin Xu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Traditional models for identifying differentially expressed genes (DEGs) have limitations on datasets with small sample sizes because they require distribution assumptions to be met; otherwise they produce high false positive and false negative rates due to sample variation. In contrast, tabular data models based on deep learning (DL) frameworks do not need to consider the data distribution type or sample variation. However, applying DL to RNA-Seq data is still a challenge due to the lack of proper labeling and the small sample size compared to the number of genes. Data augmentation (DA) extracts data features using different methods and procedures, which can significantly increase complementary pseudo-values from limited data without significant additional cost. Based on this, we combine DA with a DL framework-based tabular data model and propose TabDEG, a model that predicts DEGs and their up-regulation/down-regulation directions from gene expression data obtained from The Cancer Genome Atlas database. Compared to five counterpart methods, TabDEG has high sensitivity and low misclassification rates. Experiments show that TabDEG is robust and effective in enhancing data features to facilitate classification of high-dimensional datasets with small sample sizes, and validate that TabDEG-predicted DEGs map to important gene ontology terms and pathways associated with cancer.
Data from: Regularization for Unconditional Image Diffusion Models via...
ieee-dataport.org
Updated Jun 22, 2025
Cite
Kensuke NAKAMURA (2025). Regularization for Unconditional Image Diffusion Models via Shifted Data Augmentation [Dataset]. http://ieee-dataport.org/documents/regularization-unconditional-image-diffusion-models-shifted-data-augmentation
Explore at:
Dataset updated
Jun 22, 2025
Authors
Kensuke NAKAMURA
Description
it often causes leakage
Synthetic Data Generation Report
datainsightsmarket.com
doc, pdf, ppt
Updated Jun 16, 2025
+ more versions
Cite
Data Insights Market (2025). Synthetic Data Generation Report [Dataset]. https://www.datainsightsmarket.com/reports/synthetic-data-generation-1124388
Explore at:
Available download formats: doc, pdf, ppt
Dataset updated
Jun 16, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The synthetic data generation market is experiencing explosive growth, driven by the increasing need for high-quality data in various applications, including AI/ML model training, data privacy compliance, and software testing. The market, currently estimated at $2 billion in 2025, is projected to experience a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated $10 billion by 2033. This significant expansion is fueled by several key factors. Firstly, the rising adoption of artificial intelligence and machine learning across industries demands large, high-quality datasets, often unavailable due to privacy concerns or data scarcity. Synthetic data provides a solution by generating realistic, privacy-preserving datasets that mirror real-world data without compromising sensitive information. Secondly, stringent data privacy regulations like GDPR and CCPA are compelling organizations to explore alternative data solutions, making synthetic data a crucial tool for compliance. Finally, the advancements in generative AI models and algorithms are improving the quality and realism of synthetic data, expanding its applicability in various domains. Major players like Microsoft, Google, and AWS are actively investing in this space, driving further market expansion. The market segmentation reveals a diverse landscape with numerous specialized solutions. While large technology firms dominate the broader market, smaller, more agile companies are making significant inroads with specialized offerings focused on specific industry needs or data types. The geographical distribution is expected to be skewed towards North America and Europe initially, given the high concentration of technology companies and early adoption of advanced data technologies. However, growing awareness and increasing data needs in other regions are expected to drive substantial market growth in Asia-Pacific and other emerging markets in the coming years. 
The competitive landscape is characterized by a mix of established players and innovative startups, leading to continuous innovation and expansion of market applications. This dynamic environment indicates sustained growth in the foreseeable future, driven by an increasing recognition of synthetic data's potential to address critical data challenges across industries.


