Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advances in neuroimaging, genomics, motion tracking, eye tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies that applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by the use of validation methods that do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident at a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to this bias than parameter tuning. In addition, we explored the contributions to bias of data dimensionality, hyper-parameter space and number of CV folds, and compared the validation methods on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies depending on which validation method was used.
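The leakage the abstract describes can be made concrete with a short simulation. Below is a minimal sketch in Python with scikit-learn (not the authors' code): on pure-noise data, selecting features on the pooled data before K-fold CV inflates accuracy, while nested CV that keeps selection and tuning inside each training fold stays near chance.

```python
# A minimal sketch contrasting the two pitfalls the abstract describes:
# feature selection on pooled data before K-fold CV, versus nested CV
# with selection and tuning kept inside each training fold.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))   # 40 samples, 1000 features, pure noise
y = rng.integers(0, 2, size=40)   # random labels: true accuracy is 50%

# Biased: select features using ALL the data (labels leak into every fold).
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
biased = cross_val_score(SVC(), X_sel, y, cv=5).mean()

# Unbiased: selection and tuning happen inside each training fold (nested CV).
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)), ("clf", SVC())])
inner = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=3)
unbiased = cross_val_score(inner, X, y, cv=5).mean()

print(f"biased K-fold CV: {biased:.2f}, nested CV: {unbiased:.2f}")
# Expect the biased estimate to land well above chance, the nested one near 0.5.
```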
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Occluded Validation Set Cropped is a dataset for object detection tasks - it contains Sheep annotations for 243 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
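For the download step, a minimal sketch using the `roboflow` pip package is shown below; the API key, workspace slug, project slug, version number and export format are placeholders, not values taken from this listing.

```python
# Sketch of pulling a Roboflow-hosted dataset (pip install roboflow).
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")  # hypothetical key
project = rf.workspace("your-workspace").project("occluded-validation-set-cropped")
dataset = project.version(1).download("coco")  # COCO-format export
print(dataset.location)  # local folder with images and annotations
```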
The dataset used in the paper is the ImageNet validation set, a subset of the ImageNet dataset.
A dataset used to train and test the neural network classifiers.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Tomato Validation Set is a dataset for object detection tasks - it contains Tomato annotations for 4,265 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
This dataset was created by Anna.
Five models are trained using various input masking probabilities (IMP). Each resulting model is validated on the heavily masked validation dataset of 13,596 samples (5,668 positive) to evaluate its performance in the context of missing input data. AUC values for the optimal training IMP are shown, along with those achieved with no input masking (NIM). Bold font indicates the highest AUC in the table. Results for other IMP values are provided in the S1 File.
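As an illustration only (not the paper's code), the masking-and-scoring step described above could look like the following, assuming a scikit-learn-style classifier, array-valued features, and zero encoding a missing value:

```python
# Sketch: mask validation inputs at a given probability, then score AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def masked_auc(model, X_val, y_val, mask_prob, rng):
    """Drop each feature value independently with probability `mask_prob`,
    then compute AUC on the masked validation set."""
    keep = rng.random(X_val.shape) >= mask_prob   # True where a value survives
    X_masked = np.where(keep, X_val, 0.0)         # assumes 0 encodes "missing"
    scores = model.predict_proba(X_masked)[:, 1]  # positive-class probabilities
    return roc_auc_score(y_val, scores)
```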
The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center in Sioux Falls, SD developed a cloud validation dataset from 48 unique Landsat 8 Collection 1 images. These images were selected at random from the Landsat 8 archive from various locations around the world. While these validation images were subjectively created by a single analyst, they provide useful information for quantifying the accuracy of clouds flagged by various cloud masking algorithms. Each mask is provided in GeoTIFF format, along with all bands from the original Landsat 8 Collection 1 Level-1 data product (COG GeoTIFF) and its associated Level-1 metadata (MTL.txt file).
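As a hedged illustration of how such a mask might be used, the sketch below reads an analyst mask and an algorithm's mask with `rasterio` and computes per-pixel agreement; the file paths and the pixel value assumed for the cloud class are illustrative, not part of the USGS product specification.

```python
# Sketch: score a cloud-masking algorithm against a GeoTIFF validation mask.
import numpy as np
import rasterio

with rasterio.open("LC08_scene_validation_mask.tif") as src:  # hypothetical path
    truth = src.read(1)                                       # analyst-labeled mask

with rasterio.open("LC08_scene_algorithm_mask.tif") as src:   # hypothetical path
    pred = src.read(1)                                        # algorithm's cloud flags

CLOUD = 1                                 # assumed pixel value for "cloud"
agreement = np.mean((truth == CLOUD) == (pred == CLOUD))
print(f"per-pixel cloud agreement: {agreement:.3f}")
```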
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Validation Set is a dataset for object detection tasks - it contains Cars Motorcycles annotations for 219 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
The MS Training Set, MS Validation Set, and UW Validation/Test Set are used for training, validation, and testing the proposed methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine-learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000 to 70,000 pairs). Furthermore, for each training set there are sets of IDs available for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
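As an illustration of the stratified validation draw mentioned above (a sketch, not the WDC tooling; the file name and column names are assumptions):

```python
# Sketch: split labeled pairs so the match/no-match ratio is preserved.
import pandas as pd
from sklearn.model_selection import train_test_split

pairs = pd.read_json("computers_train_medium.json.gz", lines=True)  # hypothetical file
train_ids, val_ids = train_test_split(
    pairs["pair_id"],
    test_size=0.2,            # assumed validation fraction
    stratify=pairs["label"],  # keep "match"/"no match" proportions
    random_state=42,
)
```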
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Original dataset here - https://huggingface.co/datasets/KbsdJames/Omni-MATH
Processed by selecting only integer-answer (INT) solutions of difficulty 7+. These were then run through chain-of-thought (CoT) and tool-integrated reasoning (TIR) with Qwen2.5-Math-1.5B-Instruct, and further processed by filtering out any problems solved, or commonly solved, by this model.
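A minimal sketch of the first filtering step, using the Hugging Face `datasets` library; the split name and the `difficulty`/`answer` field names are assumptions about the Omni-MATH schema:

```python
# Sketch: keep only integer-answer Omni-MATH problems of difficulty 7+.
from datasets import load_dataset

ds = load_dataset("KbsdJames/Omni-MATH", split="test")  # assumed split name
hard_int = ds.filter(
    lambda ex: ex["difficulty"] >= 7
    and str(ex["answer"] or "").lstrip("-").isdigit()
)
print(len(hard_int))
```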
taesiri/IERv2-Validation-Set dataset hosted on Hugging Face and contributed by the HF Datasets community.
The dataset used in the paper is a validation set from one discharge, containing N-channel MUM system samples.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
val_half: contains 1/4 of the IDs, which have 50% of their pictures in this validation set and 50% in the training set (one way to construct these ID-based splits is sketched after this list)
val_all: contains 1/4 of the IDs, whose pictures are not included in the training set
train: training set
test: test set
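A minimal sketch of constructing these splits, assuming the data is a list of (picture_path, person_id) tuples; this is illustrative, not the dataset's release code:

```python
# Sketch: a quarter of the IDs share pictures between train and val_half,
# another quarter is held out entirely for val_all, the rest go to train.
import random
from collections import defaultdict

def make_splits(pictures, seed=0):
    """`pictures` is a list of (picture_path, person_id) tuples."""
    by_id = defaultdict(list)
    for path, pid in pictures:
        by_id[pid].append(path)

    ids = sorted(by_id)
    random.Random(seed).shuffle(ids)
    q = len(ids) // 4
    half_ids, all_ids, train_ids = ids[:q], ids[q:2 * q], ids[2 * q:]

    splits = {"train": [], "val_half": [], "val_all": []}
    for pid in half_ids:            # 50/50 picture split per ID
        pics = by_id[pid]
        mid = len(pics) // 2
        splits["val_half"] += pics[:mid]
        splits["train"] += pics[mid:]
    for pid in all_ids:             # IDs fully held out
        splits["val_all"] += by_id[pid]
    for pid in train_ids:           # remaining IDs go to training
        splits["train"] += by_id[pid]
    return splits
```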
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center in Sioux Falls, SD developed a cloud validation dataset from 48 unique Landsat 9 Collection 2 images. These images were selected at random from the Landsat 9 archive from various locations around the world. While these validation images were subjectively created by a single analyst, they provide useful information for quantifying the accuracy of clouds flagged by various cloud masking algorithms. Each mask is provided in GeoTIFF format, along with all bands from the original Landsat 9 Collection 2 Level-1 data product (COG GeoTIFF) and its associated Level-1 metadata (MTL.txt file).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
## Overview
Validation Set is a dataset for object detection tasks - it contains Cars Trucks Vans Pedestrians annotations for 1,500 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [MIT license](https://opensource.org/licenses/MIT).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Validation Data Set is a dataset for object detection tasks - it contains Microscopic Eggs annotations for 300 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Comparison of classification results of different models on the validation set.
For each validation set, the following metrics were calculated: RMSE, Pearson's correlation coefficient, the proportion of predictions exceeding the OOS estimates, and the average absolute error (the average of the absolute value of the difference between the predicted and actual raptor average probability of persistence).
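A minimal sketch (not the study's code) of computing these four metrics with NumPy and SciPy, given arrays of predicted and actual persistence probabilities plus the OOS estimates:

```python
# Sketch: the four validation metrics listed above.
import numpy as np
from scipy.stats import pearsonr

def validation_metrics(predicted, actual, oos_estimates):
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    rmse = np.sqrt(np.mean((predicted - actual) ** 2))
    r, _ = pearsonr(predicted, actual)                        # Pearson's correlation
    prop_exceeding = np.mean(predicted > np.asarray(oos_estimates))
    mean_abs_err = np.mean(np.abs(predicted - actual))        # average absolute error
    return {"rmse": rmse, "pearson_r": r,
            "prop_exceeding_oos": prop_exceeding, "mae": mean_abs_err}
```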