100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1,000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
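
    The difference between the biased and unbiased protocols is easy to see in code. Below is a minimal sketch of nested CV with scikit-learn; the synthetic data and parameter grid are illustrative assumptions, not the study's setup:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.svm import SVC

    # Small, high-dimensional synthetic data: many features, few samples.
    X, y = make_classification(n_samples=60, n_features=500, n_informative=10, random_state=0)

    inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # tunes hyper-parameters
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

    # The outer folds never see the data used for tuning, so the estimate is
    # unbiased; any feature selection should likewise sit inside the inner loop
    # (e.g., within a Pipeline) rather than run on the pooled data.
    tuned_svc = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)
    scores = cross_val_score(tuned_svc, X, y, cv=outer_cv)
    print(f"Nested CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")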

  2. AIME_2024-train-test

    • huggingface.co
    Updated Nov 3, 2025
    Cite
    Jeong Seong Cheol (2025). AIME_2024-train-test [Dataset]. https://huggingface.co/datasets/lejelly/AIME_2024-train-test
    Explore at:
    Dataset updated
    Nov 3, 2025
    Authors
    Jeong Seong Cheol
    Description

    AIME 2024 (train/test 1:9)

    This dataset is derived from Maxwell-Jia/AIME_2024 by splitting the original single train split into train and test with a ratio of 1:9.

      Source

    Original dataset: Maxwell-Jia/AIME_2024
    Link: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024

      License

    Inherits the original dataset's license (MIT) unless otherwise noted in this repository.

      Splitting Details

    Method: datasets.Dataset.train_test_split
    Test size: 90.0%… See the full description on the dataset page: https://huggingface.co/datasets/lejelly/AIME_2024-train-test.
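
    A minimal sketch of reproducing this split with the datasets library (the seed value is an assumption; the author's exact setting is not stated):

    from datasets import load_dataset

    # Load the original single train split, then carve out 90% as test (train:test = 1:9).
    ds = load_dataset("Maxwell-Jia/AIME_2024", split="train")
    splits = ds.train_test_split(test_size=0.9, seed=42)  # seed is illustrative
    train_ds, test_ds = splits["train"], splits["test"]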

  3. Data from: Web Data Commons Training and Test Sets for Large-Scale Product...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    + more versions
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
    Explore at:
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of IDs for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.

  4. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

  5. landslide-train-test-valid-img-mask-2

    • kaggle.com
    zip
    Updated Nov 21, 2025
    Cite
    Equinn Yip (2025). landslide-train-test-valid-img-mask-2 [Dataset]. https://www.kaggle.com/datasets/equinnyip/landslide-train-test-valid-img-mask-2
    Explore at:
    zip (3058599924 bytes)
    Dataset updated
    Nov 21, 2025
    Authors
    Equinn Yip
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Landslide4Sense dataset has three splits, training/validation/test, consisting of 3799, 245, and 800 image patches, respectively. Each image patch is a composite of 14 bands that include:

    Multispectral data from Sentinel-2: B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12.

    Slope data from ALOS PALSAR: B13.

    Digital elevation model (DEM) from ALOS PALSAR: B14.

    All bands in the competition dataset are resized to a resolution of ~10 m per pixel. The image patches are 128 x 128 pixels and are labeled pixel-wise.
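
    As a sketch, the band grouping above maps onto array slices as follows, assuming channel-last patches (the on-disk file format is not specified here):

    import numpy as np

    patch = np.zeros((128, 128, 14))  # placeholder for one 14-band image patch
    sentinel2 = patch[..., :12]       # B1-B12: Sentinel-2 multispectral bands
    slope = patch[..., 12]            # B13: slope from ALOS PALSAR
    dem = patch[..., 13]              # B14: DEM from ALOS PALSAR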

  6. Prediction of Personality Traits using the Big 5 Framework

    • zenodo.org
    csv, text/x-python
    Updated Feb 2, 2023
    Cite
    Neelima Brahmbhatt; Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
    Explore at:
    text/x-python, csv
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Neelima Brahmbhatt; Neelima Brahmbhatt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The methodology is the core component of any research work: it documents the methods used to obtain the results. Here, the entire research implementation is done in Python. The work involves the following steps:

    1. Acquire Personality Dataset

    Kaggle hosts a collection of datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP. The dataset can be downloaded as a zip file from the link provided and consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is performed to check for inconsistent behaviors or trends.

    2. Data preprocessing

    After data acquisition, the next step is to clean and preprocess the data. The dataset has numerical features. The target value is a five-level personality consisting of serious, lively, responsible, dependable & extraverted. The preprocessed dataset is further split into training and testing datasets. This is achieved by passing the feature values, target values, and test size to the train-test split method of the scikit-learn package. After splitting the data, the training data is used to fit the Logistic Regression & SVM models, and the test data is then used to estimate the accuracy of the trained models.

    3. Feature Extraction

    The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

            EXT1 I am the life of the party.
            EXT2  I don't talk a lot.
            EXT3  I feel comfortable around people.
            EXT4  I am quiet around strangers.
            EST1  I get stressed out easily.
            EST2  I get irritated easily.
            EST3  I worry about things.
            EST4  I change my mood a lot.
            AGR1  I have a soft heart.
            AGR2  I am interested in people.
            AGR3  I insult people.
            AGR4  I am not really interested in others.
            CSN1  I am always prepared.
            CSN2  I leave my belongings around.
            CSN3  I follow a schedule.
            CSN4  I make a mess of things.
            OPN1  I have a rich vocabulary.
            OPN2  I have difficulty understanding abstract ideas.
            OPN3  I do not have a good imagination.
            OPN4  I use difficult words.

    4. Training the Model

    Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets: a training set (80%) and a testing set (20%). You train the model using the training set. In this model, we trained our dataset using linear_model.LogisticRegression() & svm.SVC() from the sklearn package.

    5. Personality Prediction Output

    After training, the Logistic Regression & SVM models are evaluated on the test data using cohen_kappa_score & accuracy_score.
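
    A minimal sketch of steps 2-5 with scikit-learn; the file and column names are hypothetical and the actual CSV layout may differ:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, cohen_kappa_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    df = pd.read_csv("train.csv")         # hypothetical path
    X = df.drop(columns=["personality"])  # hypothetical target column
    y = df["personality"]

    # 80/20 train/test split as described above.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    for model in (LogisticRegression(max_iter=1000), SVC()):
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(type(model).__name__,
              accuracy_score(y_test, pred),
              cohen_kappa_score(y_test, pred))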

  7. Data from: Fashion Mnist Dataset

    • universe.roboflow.com
    • opendatalab.com
    • +3more
    zip
    Updated Aug 10, 2022
    + more versions
    Cite
    Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
    Explore at:
    zip
    Dataset updated
    Aug 10, 2022
    Dataset authored and provided by
    Popular Benchmarks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Clothing
    Description

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Authors: Han Xiao, Kashif Rasul, Roland Vollgraf

    Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

    All images were sized 28x28 in the original dataset

    Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. (Source: https://github.com/zalandoresearch/fashion-mnist)

    Here's an example of how the data looks (each class takes three rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png (visualized Fashion-MNIST dataset)
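
    One common way to load the dataset with these exact splits (via Keras; this is not specific to the Roboflow version):

    import tensorflow as tf

    # Returns the standard 60,000/10,000 train/test split of 28x28 grayscale images.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
    print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)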

    Version 1 (original-images_Original-FashionMNIST-Splits):

    • Original images, with the original splits for MNIST: train (86% of images - 60,000 images) set and test (14% of images - 10,000 images) set only.
    • This version was not trained

    Version 3 (original-images_trainSetSplitBy80_20):

    • Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set
    • Train/test split guide: https://blog.roboflow.com/train-test-split/ (Train/Valid/Test split rebalancing: https://i.imgur.com/angfheJ.png)

    Citation:

    @online{xiao2017/online,
      author      = {Han Xiao and Kashif Rasul and Roland Vollgraf},
      title       = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
      date        = {2017-08-28},
      year        = {2017},
      eprintclass = {cs.LG},
      eprinttype  = {arXiv},
      eprint      = {cs.LG/1708.07747},
    }
    
  8. Fake News Detection

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    KranNaik777 (2025). Fake News Detection [Dataset]. https://www.kaggle.com/datasets/krannaik777/train-news
    Explore at:
    zip (38846301 bytes)
    Dataset updated
    Nov 4, 2025
    Authors
    KranNaik777
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    The fake news detection dataset used in this project contains labeled news articles categorized as either "fake" or "real." The articles were collected from credible real-world sources and fact-checking websites, ensuring diverse and high-quality data. The dataset includes textual features such as the news content, along with metadata like publication date, author, and source details. Articles vary in length, providing rich linguistic variety for model training. The dataset is balanced to minimize bias between fake and real news categories, supporting robust classification, and such corpora often contain thousands to hundreds of thousands of articles, enabling effective machine learning model development and evaluation. Some versions of the dataset may also include image URLs for multimodal analysis, extending detection capability beyond text alone. This comprehensive dataset plays a critical role in training and validating the fake news detection model used in this project.

    Here is a description for each column header of the fake news dataset:

    id: A unique identifier assigned to each news article in the dataset for easy reference and indexing.

    headline: The title or headline of the news article, summarizing the key news story in brief.

    written by: The author or journalist who wrote the news article; this may sometimes be missing or anonymized.

    news: The full text content of the news article, which is the main body used for analysis and classification.

    label: The classification label indicating the authenticity of the news article, typically a binary value such as "fake" or "real" (or 0 for real and 1 for fake), indicating whether the news is deceptive or truthful.

    This detailed column description provides clarity on the structure and contents of the dataset used for fake news detection modeling.
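
    A minimal sketch of inspecting these columns with pandas (the file name is hypothetical):

    import pandas as pd

    df = pd.read_csv("train.csv")  # hypothetical file name
    print(df[["id", "headline", "written by", "news", "label"]].head())
    print(df["label"].value_counts())  # check the fake/real balance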

  9. Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jun 6, 2021
    Cite
    Henderson, N. Ashley; Kauwe, K. Steven; Sparks, D. Taylor (2021). Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4903957
    Explore at:
    Dataset updated
    Jun 6, 2021
    Authors
    Henderson, N. Ashley; Kauwe, K. Steven; Sparks, D. Taylor
    Description

    This benchmark data comprises 50 different datasets for materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6,354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.

    For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in its study. Datasets with fewer than 100 values had train-test splits created using the Leave-One-Out cross-validation method.
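
    A sketch of the two split regimes described above using scikit-learn; the 100-sample threshold follows the description, and the fold count varies by source paper:

    from sklearn.model_selection import KFold, LeaveOneOut

    def make_splitter(n_samples, n_folds=5):
        # Larger datasets use 5- or 10-fold CV; under 100 samples, Leave-One-Out.
        if n_samples > 100:
            return KFold(n_splits=n_folds, shuffle=True, random_state=0)
        return LeaveOneOut()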

    For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0

  10. PGD MNIST

    • kaggle.com
    zip
    Updated Mar 2, 2024
    Cite
    Kishore Sudula (2024). PGD MNIST [Dataset]. https://www.kaggle.com/datasets/sudulakishore/pgd-mnist
    Explore at:
    zip (55759208 bytes)
    Dataset updated
    Mar 2, 2024
    Authors
    Kishore Sudula
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context

    PGD MNIST is a dataset of adversarial images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. PGD MNIST intends to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms on adversarial examples. It shares the same image size and structure of training and testing splits.

    The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others."

    Check your model's accuracy on this dataset and improve its adversarial robustness.

    Content

    Each image is 28 pixels in height and 28 pixels in width, 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. The training and test data sets have 785 columns. The first column contains the class label (see above); the remaining columns contain the pixel-values of the associated image.

    To locate a pixel in the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27. The pixel is then located on row i and column j of a 28 x 28 matrix. For example, x = 31 indicates the pixel in the fourth column from the left and the second row from the top.

    Columns description

    • Each row is a separate image.
    • Column 1 is the class label (integers 0-9).
    • The remaining 784 columns are pixel values.
    • Each value is the darkness of the pixel (an integer from 0 to 255).
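
    A short sketch of the index arithmetic and reshaping described above:

    import numpy as np

    # Recover row/column from a flat pixel index x = i * 28 + j.
    x = 31
    i, j = divmod(x, 28)  # i = 1 (second row), j = 3 (fourth column), 0-indexed

    # Reshape one 784-value row back into a 28x28 image.
    pixels = np.arange(784)  # placeholder for one row of pixel values
    image = pixels.reshape(28, 28)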

  11. Probing Datasets for Noisy Texts

    • federation.figshare.com
    • researchdata.edu.au
    Updated Mar 14, 2021
    Cite
    Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte (2021). Probing Datasets for Noisy Texts [Dataset]. http://doi.org/10.25955/604c5307db043
    Explore at:
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    Federation University Australia
    Authors
    Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    Probing tasks are popular among NLP researchers to assess the richness of the encoded representations of linguistic information. Each probing task is a classification problem, and the model's performance will vary depending on the richness of the linguistic properties crammed into the representation.

    This dataset contains five new probing datasets consisting of noisy texts (Tweets) which can serve as benchmark datasets for researchers to study the linguistic characteristics of unstructured and noisy texts.

    File Structure

    Format: a tab-separated text file

    Column 1: train/test/validation split (tr - train, te - test, va - validation)
    Column 2: class label (refer to the Content section for the class labels of each task file)
    Column 3: Tweet message (text)
    Column 4: a unique ID

    Content

    sent_len.tsv: In this classification task, the goal is to predict which of 8 bins (0-7) a sentence's length falls into; 0: (5-8), 1: (9-12), 2: (13-16), 3: (17-20), 4: (21-25), 5: (26-29), 6: (30-33), 7: (34-70). This task is called "SentLen" in the paper.

    word_content.tsv: We consider a 10-way classification task with 10 words as targets, considering the available manually annotated instances. The task is predicting which of the target words appears in the given sentence. We considered only words that appear in the BERT vocabulary as target words. We constructed the data by picking the first 10 lower-cased words occurring in the corpus vocabulary ordered by frequency and having a length of at least 4 characters (to remove noise). Each sentence contains a single target word, and the word occurs precisely once in the sentence. The task is referred to as "WC" in the paper.

    bigram_shift.tsv: The purpose of the Bigram Shift task is to test whether an encoder is sensitive to legal word orders. Two adjacent words in a Tweet are inverted, and the classification model performs a binary classification to identify inverted (I) and non-inverted/original (O) Tweets. The task is referred to as "BShift" in the paper.

    tree_depth.tsv: The Tree Depth task evaluates the encoded sentence's ability to capture hierarchical structure by asking the classification model to predict the depth of the longest path from the root to any leaf in the Tweet's parse tree. The task is referred to as "TreeDepth" in the paper.

    odd_man_out.tsv: The Tweets are modified by replacing a random noun or verb o with another noun or verb r. The task of the classifier is to identify whether the sentence was modified by this change. Class label O refers to unmodified sentences, while C refers to modified sentences. The task is called "SOMO" in the paper.
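
    A minimal loading sketch for one task file, following the column layout above (column names are illustrative; adjust header handling if the files include a header row):

    import pandas as pd

    cols = ["split", "label", "tweet", "id"]  # columns 1-4 as described above
    df = pd.read_csv("bigram_shift.tsv", sep="\t", header=None, names=cols)
    train = df[df["split"] == "tr"]
    test = df[df["split"] == "te"]
    valid = df[df["split"] == "va"]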

  12. Train, validation and test split size used in this experiment.

    • plos.figshare.com
    xls
    Updated Oct 10, 2024
    Cite
    Yuanqing Ding; Fanliang Bu; Hanming Zhai; Zhiwen Hou; Yifan Wang (2024). Train, validation and test split size used in this experiment. [Dataset]. http://doi.org/10.1371/journal.pone.0311720.t002
    Explore at:
    xls
    Dataset updated
    Oct 10, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yuanqing Ding; Fanliang Bu; Hanming Zhai; Zhiwen Hou; Yifan Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Train, validation and test split size used in this experiment.

  13. test-olmocr2

    • huggingface.co
    Updated Oct 23, 2025
    Cite
    Daniel van Strien (2025). test-olmocr2 [Dataset]. https://huggingface.co/datasets/davanstrien/test-olmocr2
    Explore at:
    Dataset updated
    Oct 23, 2025
    Authors
    Daniel van Strien
    Description

    Document OCR using olmOCR-2-7B-1025-FP8

    This dataset contains markdown-formatted OCR results from images in davanstrien/test-olmocr2 using olmOCR-2-7B.

      Processing Details

    Source Dataset: davanstrien/test-olmocr2
    Model: allenai/olmOCR-2-7B-1025-FP8
    Number of Samples: 100
    Processing Time: 0h 3m 32s
    Processing Date: 2025-10-23 17:00 UTC

      Configuration

    Image Column: image
    Output Column: markdown
    Dataset Split: train
    Batch Size: 512
    Max Model Length: 16,384… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/test-olmocr2.

  14. rolm-test

    • huggingface.co
    Updated Aug 4, 2025
    Cite
    Daniel van Strien (2025). rolm-test [Dataset]. https://huggingface.co/datasets/davanstrien/rolm-test
    Explore at:
    Dataset updated
    Aug 4, 2025
    Authors
    Daniel van Strien
    Description

    OCR Text Extraction using RolmOCR

    This dataset contains extracted text from images in davanstrien/playbills-pdf-images-text using RolmOCR.

      Processing Details

    Source Dataset: davanstrien/playbills-pdf-images-text
    Model: reducto/RolmOCR
    Number of Samples: 10
    Processing Time: 5.8 minutes
    Processing Date: 2025-08-04 17:08 UTC

      Configuration

    Image Column: image
    Output Column: rolmocr_text
    Dataset Split: train
    Batch Size: 16
    Max Model Length: 24,000 tokens
    Max… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/rolm-test.

  15. Glaucoma Dataset: EyePACS-AIROGS-light-V2

    • kaggle.com
    zip
    Updated Mar 9, 2024
    Cite
    Riley Kiefer (2024). Glaucoma Dataset: EyePACS-AIROGS-light-V2 [Dataset]. https://www.kaggle.com/datasets/deathtrooper/glaucoma-dataset-eyepacs-airogs-light-v2/code
    Explore at:
    zip (549533071 bytes)
    Dataset updated
    Mar 9, 2024
    Authors
    Riley Kiefer
    Description

    News: Now with a 10.0 Kaggle usability score: supplemental metadata.csv file added to dataset.

    Overview: This is an improved machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS [1] set. The dataset is split into training, validation, and test folders, which contain 4,000 (~84%), 385 (~8%), and 385 (~8%) fundus images per class, respectively. Each split has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG). This dataset is designed to make it easy to benchmark your glaucoma classification models in Kaggle. Please make a contribution in the code tab; I have created a template to make it even easier!
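
    A minimal loading sketch for this folder layout using torchvision; folder paths and transform choices are assumptions based on the description:

    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.Resize((512, 512)), transforms.ToTensor()])
    train_ds = datasets.ImageFolder("train", transform=tfm)  # RG/ and NRG/ subfolders
    val_ds = datasets.ImageFolder("validation", transform=tfm)
    test_ds = datasets.ImageFolder("test", transform=tfm)
    print(train_ds.classes)  # e.g., ['NRG', 'RG']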

    Please cite the dataset and at least the first of my related works if you found this dataset useful!

    • Riley Kiefer. "EyePACS-AIROGS-light-V2". Kaggle, 2024, doi: 10.34740/KAGGLE/DSV/7802508.
    • Riley Kiefer. "EyePACS-AIROGS-light-V1". Kaggle, 2023, doi: 10.34740/kaggle/ds/3222646.
    • Riley Kiefer. "Standardized Multi-Channel Dataset for Glaucoma, v19 (SMDG-19)". Kaggle, 2023, doi: 10.34740/kaggle/ds/2329670
    • Steen, J., Kiefer, R., Ardali, M., Abid, M. & Amjadian, E. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications. Invest. Ophthalmol. Vis. Sci. 64, 384–384 (2023).
    • Amjadian, E., Ardali, M. R., Kiefer, R., Abid, M. & Steen, J. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection. Invest. Ophthalmol. Vis. Sci. 64, 392–392 (2023).
    • R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429.
    • Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.
    • R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.
    • E. Amjadian, R. Kiefer, J. Steen, M. Abid, M. Ardali, "A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection". American Academy of Optometry. 2022.

    Improvements from v1:

    • Images are standardized according to the CROP methodology (remove black background before resizing), based on an ablation study of the image standardization methods applied to dataset v1 [3]. This method yields more of the actual fundus foreground in the resultant image.
    • Increased the image resize dimensions from 256x256 pixels to 512x512 pixels. Reason: provides greater model input flexibility, detail, and size, and better supports the ONH-cropping models.
    • Added 3000 images from the Rotterdam EyePACS AIROGS dev set. Reason: more data samples can improve model generalizability.
    • Readjusted the train/val/test split. Reason: the validation and test split sizes were different.
    • Improved sampling from the source dataset. Reason: v1 NRG samples were not randomly selected.

    Drawbacks of Rotterdam EyePACS AIROGS: One of the largest drawbacks of the original dataset is its accessibility. It requires a long download and a large amount of storage, spans several folders, and is not machine-learning-ready (it requires data processing and splitting). The dataset also contains raw fundus images in their original dimensions; these often include a large amount of black background, and the dimensions are too large for machine learning inputs. The proposed dataset addresses these concerns through image sampling and image standardization, to balance and reduce the dataset size respectively.

    Origin: The images in this dataset are sourced from the Rotterdam EyePACS AIROGS [1] dataset, which contains 113,893 color fundus images from 60,357 subjects and approximately 500 different sites with a heterogeneous ethnicity; this impressive dataset is over 60GB when compressed. The first lightweight version of the dataset is known as EyePACS-AIROGS-light (v1) [2].

    About Me: I have studied glaucoma-related research for my computer science master's thesis. Since my graduation, I have dedicated my time to keeping my research up-to-date and relevant for fellow glaucoma researchers. I hope that my research can provi...

  16. Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

    • narcis.nl
    • data.mendeley.com
    Updated Jan 11, 2021
    + more versions
    Cite
    Yoo, T (via Mendeley Data) (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
    Explore at:
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Yoo, T (via Mendeley Data)
    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD; Ik Hee Ryu, MD, MS; Tae Keun Yoo, MD; Jung Sub Kim, MD; In Sik Lee, MD, PhD; Jin Kook Kim, MD; Wakako Ando, CO; Nobuyuki Shoji, MD, PhD; Tomofusa Yamauchi, MD, PhD; Hitoshi Tabuchi, MD, PhD.

    We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

    This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Connect data in your Google Drive.
    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    # Change the path for the custom data.
    # In this case, we used ICL vault prediction using preoperative measurements.
    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    # Optimal features (sorted by importance):
    # 1. ICL size, 2. ICL power, 3. LV, 4. CLR, 5. ACD, 6. ATA,
    # 7. MSE, 8. Age, 9. Pupil size, 10. WTW, 11. CCT, 12. ACW
    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    # Split the dataset into train and test data, if necessary.
    # For example, split 8:2 as a simple validation test.
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

    # In our study, the training (B&VIIT Eye Center, n=1455) and test
    # (Kitasato University, n=290) datasets were already defined, so this
    # split was not necessary for our analysis.

    # Optimal parameter search could be performed in this section.
    parameters = {'bootstrap': True,
                  'min_samples_leaf': 3,
                  'n_estimators': 500,
                  'criterion': 'mae',
                  'min_samples_split': 10,
                  'max_features': 'sqrt',
                  'max_depth': 6,
                  'max_leaf_nodes': None}

    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_

  17. Autism Facial Emotion Recognition

    • kaggle.com
    zip
    Updated Jul 29, 2025
    Cite
    Md. Hasibur Rahman (2025). Autism Facial Emotion Recognition [Dataset]. https://www.kaggle.com/datasets/hasibur013/autism-facial-emotion-recognition
    Explore at:
    zip (208400945 bytes)
    Dataset updated
    Jul 29, 2025
    Authors
    Md. Hasibur Rahman
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Autism Facial Recognition Dataset (Augmented & Split)

    Overview

    This dataset is a significantly augmented and meticulously split version of an existing Autism Facial Recognition Dataset, designed for training and evaluating machine learning models to detect Autism Spectrum Disorder (ASD) from facial images. Through extensive data augmentation and a structured train/test split, this dataset aims to provide a robust and diverse foundation for developing highly accurate and generalizable facial recognition models in the context of ASD research.

    Context

    Autism Spectrum Disorder (ASD) is a neurodevelopmental condition that impacts social interaction, communication, and behavior. Early detection and intervention are crucial for improving outcomes for individuals with ASD. Recent research has explored the potential of computer vision and facial recognition techniques to aid in the preliminary screening or identification of ASD-related facial cues. This dataset is tailored to support such research by providing a rich collection of processed facial images.

    Data Augmentation

    The original dataset has been expanded through a comprehensive augmentation pipeline using the Albumentations library. This process generates 10 augmented versions for each original image, dramatically increasing the dataset size and variety. The augmentation techniques applied include:

    • Geometric Augmentations:
      • HorizontalFlip: Randomly flips images horizontally (p=0.5).
      • Rotate: Rotates images by a random angle within -10 to +10 degrees (p=0.7), filling new pixels with a constant border.
    • Color and Brightness Augmentations:
      • HueSaturationValue: Randomly shifts hue, saturation, and value (brightness) to simulate varying lighting conditions and camera settings (p=0.5).
      • RandomGamma: Adjusts image contrast by applying gamma correction (gamma limit 85-115, p=0.5).
    • Quality and Noise Augmentations:
      • GaussianBlur: Applies a Gaussian blur to introduce slight blurring effects (sigma limit 0.1-1.4, p=0.3).
      • GaussNoise: Adds random Gaussian noise to images, simulating sensor noise (p=0.3).
    • Resizing and Cropping:
      • RandomResizedCrop: Crops a random portion of the image and resizes it to a uniform size of (224, 224) pixels. This helps the model become more robust to variations in image scale and composition (scale 0.91-0.95, p=1.0).

    These augmentations ensure that the trained models are less prone to overfitting and can generalize better to unseen data, accounting for variations in real-world image capture.
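
    A sketch of such a pipeline with Albumentations; the parameter names and values below are reconstructed from the description and are assumptions, not the author's exact script:

    import albumentations as A

    transform = A.Compose([
        A.HorizontalFlip(p=0.5),
        A.Rotate(limit=10, border_mode=0, p=0.7),  # constant border fill
        A.HueSaturationValue(p=0.5),
        A.RandomGamma(gamma_limit=(85, 115), p=0.5),
        A.GaussianBlur(sigma_limit=(0.1, 1.4), p=0.3),
        A.GaussNoise(p=0.3),
        # Newer Albumentations versions use size=(224, 224) instead of height/width.
        A.RandomResizedCrop(height=224, width=224, scale=(0.91, 0.95), p=1.0),
    ])
    # Apply to a numpy image array: augmented = transform(image=image)["image"]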

    Dataset Structure

    The processed dataset is organized into a standard train/test split, making it immediately usable for machine learning workflows. The directory structure is as follows:

    Autism Facial Emotion Recognition Dataset/
    ├── train/
    │  ├── Autistic/
    │  │  ├── image_001_aug_1.jpg
    │  │  ├── ...
    │  └── Non_Autistic/
    │    ├── image_001_aug_1.jpg
    │    └── ...
    └── test/
      ├── Autistic/
      │  ├── image_xxx_aug_y.jpg
      │  ├── ...
      └── Non_Autistic/
        ├── image_xxx_aug_y.jpg
        └── ...
    
    • train/: Contains the training set, comprising approximately 80% of the augmented images from each class.
    • test/: Contains the testing set, comprising the remaining 20% of the augmented images from each class.

    Each class (Autistic and Non_Autistic) maintains its proportional representation within both the training and testing sets to ensure balanced evaluation.

    Image Format

    All images are processed and saved as JPEG files. They are resized to 224x224 pixels, a common input size for many pre-trained deep learning models (e.g., ResNet, VGG, MobileNet).

    Usage

    This dataset is ideal for:

    • Training deep learning models for facial recognition and classification (e.g., CNNs, Vision Transformers).
    • Transfer learning experiments using pre-trained models.
    • Research into facial biomarkers for ASD.
    • Developing explainable AI (XAI) techniques to understand model decisions related to ASD features.

    Acknowledgements

    The original dataset from which this augmented version was derived should be credited appropriately. Please refer to the source of the "Autism Facial Recognition Dataset" you used as INPUT_DIR.

    Citation

    If you use this dataset in your research, please cite this Kaggle dataset.

  18. Hard Hat Workers Dataset

    • universe.roboflow.com
    zip
    Updated Sep 30, 2022
    + more versions
    Cite
    Joseph Nelson (2022). Hard Hat Workers Dataset [Dataset]. https://universe.roboflow.com/joseph-nelson/hard-hat-workers/model/13
    Explore at:
    zip
    Dataset updated
    Sep 30, 2022
    Dataset authored and provided by
    Joseph Nelson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Workers Bounding Boxes
    Description

    Overview

    The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.

    The original dataset has a 75/25 train-test split.

    Example Image: https://i.imgur.com/7spoIJT.png" alt="Example Image">

    Use Cases

    One could use this dataset to, for example, build a classifier that distinguishes workers who are abiding by the safety code within a workplace from those who may not be. It is also a good general dataset for practice.

    Using this Dataset

    Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    Dataset Versions:

    Image Preprocessing | Image Augmentation | Modify Classes

    • v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
    • v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
    • v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
    • v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
    • v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
    • v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
    • v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
    • v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
    • v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
    • v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
    • v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head

    Choosing Between Computer Vision Model Sizes | Roboflow Train

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.


  19. S155

    • zenodo.org
    tar
    Updated Oct 6, 2020
    + more versions
    Cite
    Martin Zurowietz; Martin Zurowietz (2020). S155 [Dataset]. http://doi.org/10.5281/zenodo.3603803
    Explore at:
    tar
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Zurowietz; Martin Zurowietz
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A fully annotated subset of the SO242/2_155-1 image dataset. The annotations are given as train and test splits that can be used to evaluate machine learning methods. The following classes of fauna were used for annotation:

    • anemone
    • coral
    • crustacean
    • ipnops fish
    • litter
    • ophiuroid
    • other fauna
    • sea cucumber
    • sponge
    • stalked crinoid

    For a definition of the classes see [1].

    Related datasets:

    This dataset contains the following files:

    • annotations/test.csv: The BIIGLE CSV annotation report of the annotations of the test split of this dataset. These annotations are used to test the performance of the trained Mask R-CNN model.
    • annotations/train.csv: The BIIGLE CSV annotation report of the annotations of the train split of this dataset. These annotations are used to generate the annotation patches which are transformed with scale and style transfer to be used to train the Mask R-CNN model.
    • images/: Directory that contains all the original image files.
    • dataset.json: JSON file that contains information about the dataset.
      • name: The name of the dataset.
      • images_dir: Name of the directory that contains the original image files.
      • metadata_file: Path to the CSV file that contains image metadata.
      • test_annotations_file: Path to the CSV file that contains the test annotations.
      • train_annotations_file: Path to the CSV file that contains the train annotations.
      • annotation_patches_dir: Name of the directory that should contain the scale- and style-transferred annotation patches.
      • crop_dimension: Edge length of an annotation or style patch in pixels.
    • metadata.csv: A CSV file that contains metadata for each original image file. In this case the distance of the camera to the sea floor is given for each image.
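
    A minimal sketch of reading these files using the keys listed above:

    import json
    import pandas as pd

    with open("dataset.json") as f:
        cfg = json.load(f)

    metadata = pd.read_csv(cfg["metadata_file"])
    train_annotations = pd.read_csv(cfg["train_annotations_file"])
    test_annotations = pd.read_csv(cfg["test_annotations_file"])
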
  20. StreetSurfaceVis: a dataset of street-level imagery with annotations of road...

    • zenodo.org
    • data.niaid.nih.gov
    csv, pdf, zip
    Updated Jan 20, 2025
    + more versions
    Cite
    Alexandra Kapp; Alexandra Kapp; Edith Hoffmann; Edith Hoffmann; Esther Weigmann; Esther Weigmann; Helena Mihaljevic; Helena Mihaljevic (2025). StreetSurfaceVis: a dataset of street-level imagery with annotations of road surface type and quality [Dataset]. http://doi.org/10.5281/zenodo.11449977
    Explore at:
    zip, pdf, csv
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexandra Kapp; Alexandra Kapp; Edith Hoffmann; Edith Hoffmann; Esther Weigmann; Esther Weigmann; Helena Mihaljevic; Helena Mihaljevic
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    StreetSurfaceVis

    StreetSurfaceVis is an image dataset containing 9,122 street-level images from Germany with labels on road surface type and quality. The CSV file streetSurfaceVis_v1_0.csv contains all image metadata, and four folders contain the image files. All images are available in four sizes based on image width: 256px, 1024px, 2048px, and the original size.
    Folders containing the images are named according to the respective image size. Image files are named based on the mapillary_image_id.

    You can find the corresponding publication here: StreetSurfaceVis: a dataset of crowdsourced street-level imagery with semi-automated annotations of road surface type and quality

    Image metadata

    Each CSV record contains information about one street-level image with the following attributes:

    • mapillary_image_id: ID provided by Mapillary (see information below on Mapillary)
    • user_id: Mapillary user ID of contributor
    • user_name: Mapillary user name of contributor
    • captured_at: timestamp, capture time of image
    • longitude, latitude: location the image was taken at
    • train: Suggestion to split train and test data. `True` for train data and `False` for test data. Test data contains data from 5 cities which are excluded from the training data.
    • surface_type: Surface type of the road in the focal area (the center of the lower image half) of the image. Possible values: asphalt, concrete, paving_stones, sett, unpaved
    • surface_quality: Surface quality of the road in the focal area of the image. Possible values: (1) excellent, (2) good, (3) intermediate, (4) bad, (5) very bad (see the attached Labeling Guide document for details)

    Image source

    Images are obtained from Mapillary, a crowd-sourcing platform for street-level imagery. More metadata about each image can be obtained via the Mapillary API. User-generated images are shared by Mapillary under the CC-BY-SA License.

    For each image, the dataset contains the mapillary_image_id and user_name.
    You can access user information on the Mapillary website at https://www.mapillary.com/app/user/<user_name>
    and image information at https://www.mapillary.com/app/?focus=photo&pKey=<mapillary_image_id>

    If you use the provided images, please adhere to the terms of use of Mapillary.

    Instances per class

    Total number of images: 9,122

    Surface type  | excellent | good | intermediate | bad | very bad
    asphalt       |       971 | 1697 |          821 | 246 |        -
    concrete      |       314 |  350 |          250 |  58 |        -
    paving stones |       385 | 1063 |          519 |  70 |        -
    sett          |         - |  129 |          694 | 540 |        -
    unpaved       |         - |    - |          326 | 387 |      303

    For modeling, we recommend using a train-test split where the test data includes geospatially distinct areas, thereby ensuring the model's ability to generalize to unseen regions is tested. We propose five cities varying in population size and from different regions in Germany for testing - images are tagged accordingly.

    Number of test images (train-test split): 776
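
    A sketch of applying the suggested split via the train flag in the metadata CSV:

    import pandas as pd

    df = pd.read_csv("streetSurfaceVis_v1_0.csv")
    train_df = df[df["train"]]   # True: training images
    test_df = df[~df["train"]]   # False: the 776 test images from 5 held-out cities
    print(len(train_df), len(test_df))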

    Inter-rater reliability

    Three annotators labeled the dataset, such that each image was annotated by one person. Annotators were encouraged to consult each other for a second opinion when uncertain.
    1,800 images were annotated by all three annotators, resulting in a Krippendorff's alpha of 0.96 for surface type and 0.74 for surface quality.

    Recommended image preprocessing

    As the focal road located in the bottom center of the street-level image is labeled, it is recommended to crop images to their lower middle half prior to using them for classification tasks.

    This is exemplary code for the recommended image preprocessing in Python:

    from PIL import Image
    img = Image.open(image_path)
    width, height = img.size
    # Keep the lower middle half: bottom 50% of the height, middle 50% of the width.
    img_cropped = img.crop((0.25 * width, 0.5 * height, 0.75 * width, height))


    License

    CC-BY-SA

    Citation

    If you use this dataset, please cite as:

    Kapp, A., Hoffmann, E., Weigmann, E. et al. StreetSurfaceVis: a dataset of crowdsourced street-level imagery annotated by road surface type and quality. Sci Data 12, 92 (2025). https://doi.org/10.1038/s41597-024-04295-9

    @article{kapp_streetsurfacevis_2025,
    title = {{StreetSurfaceVis}: a dataset of crowdsourced street-level imagery annotated by road surface type and quality},
    volume = {12},
    issn = {2052-4463},
    url = {https://doi.org/10.1038/s41597-024-04295-9},
    doi = {10.1038/s41597-024-04295-9},
    pages = {92},
    number = {1},
    journaltitle = {Scientific Data},
    shortjournal = {Scientific Data},
    author = {Kapp, Alexandra and Hoffmann, Edith and Weigmann, Esther and Mihaljević, Helena},
    date = {2025-01-16},
    }


    This is part of the SurfaceAI project at the University of Applied Sciences, HTW Berlin.


    - Prof. Dr. Helena Mihaljević
    - Alexandra Kapp
    - Edith Hoffmann
    - Esther Weigmann

    Contact: surface-ai@htw-berlin.de

    https://surfaceai.github.io/surfaceai/

    Funding: SurfaceAI is an mFund project funded by the German Federal Ministry for Digital and Transport.
