100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias of data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
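
    The leakage the authors describe (feature selection performed on pooled training and testing data) and the nested CV remedy can be illustrated with a short scikit-learn sketch. This is not the authors' code; the data shape, classifier and hyper-parameter grid are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small-sample, high-dimensional data with no real signal (labels are random),
# so an unbiased validation scheme should report roughly chance accuracy.
rng = np.random.RandomState(0)
X = rng.randn(40, 1000)
y = rng.randint(0, 2, size=40)

# Leaky protocol: select features on the pooled data, then run K-fold CV.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(SVC(), X_leaky, y, cv=StratifiedKFold(5)).mean()

# Nested CV: feature selection and hyper-parameter tuning happen inside each
# outer training fold only, via a Pipeline wrapped in GridSearchCV.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)), ("clf", SVC())])
inner = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=StratifiedKFold(3))
nested_acc = cross_val_score(inner, X, y, cv=StratifiedKFold(5)).mean()

print(f"leaky K-fold CV accuracy: {leaky_acc:.2f}")  # optimistically inflated
print(f"nested CV accuracy:       {nested_acc:.2f}")  # close to chance (0.5)
```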

  2. Occluded Validation Set Cropped Dataset

    • universe.roboflow.com
    zip
    Updated Feb 4, 2022
    + more versions
    Cite
    SAU (2022). Occluded Validation Set Cropped Dataset [Dataset]. https://universe.roboflow.com/sau-cixmv/occluded-validation-set-cropped
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 4, 2022
    Dataset authored and provided by
    SAU
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Sheep Bounding Boxes
    Description

    Occluded Validation Set Cropped

    ## Overview
    
    Occluded Validation Set Cropped is a dataset for object detection tasks - it contains Sheep annotations for 243 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
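
    As a concrete illustration of the "Getting Started" note above, Roboflow datasets can typically be pulled with the roboflow Python package. The workspace and project slugs below are taken from the citation URL; the version number, export format and API key are placeholders you would substitute.

```python
from roboflow import Roboflow

# Authenticate with your own Roboflow API key (placeholder below).
rf = Roboflow(api_key="YOUR_API_KEY")

# Workspace and project slugs are taken from the dataset URL above; the
# version number (1) and export format ("coco") are assumptions.
project = rf.workspace("sau-cixmv").project("occluded-validation-set-cropped")
dataset = project.version(1).download("coco")

print(dataset.location)  # local folder containing the images and annotations
```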
    
  3. ImageNet Validation Set - Dataset - LDM

    • service.tib.eu
    Updated Dec 3, 2024
    Cite
    (2024). ImageNet Validation Set - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/imagenet-validation-set
    Explore at:
    Dataset updated
    Dec 3, 2024
    Description

    The dataset used in the paper is the ImageNet validation set, a subset of the ImageNet dataset.

  4. Validation Set

    • resodate.org
    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    Davide Gerosa; Geraint Pratten; Alberto Vecchio (2024). Validation Set [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdmFsaWRhdGlvbi1zZXQ=
    Explore at:
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Davide Gerosa; Geraint Pratten; Alberto Vecchio
    Description

    A dataset used to train and test the neural network classifiers.

  5. Tomato Validation Set Dataset

    • universe.roboflow.com
    zip
    Updated May 5, 2024
    + more versions
    Cite
    YH (2024). Tomato Validation Set Dataset [Dataset]. https://universe.roboflow.com/yh-ci4ev/tomato-validation-set
    Explore at:
    Available download formats: zip
    Dataset updated
    May 5, 2024
    Dataset authored and provided by
    YH
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Tomato Bounding Boxes
    Description

    Tomato Validation Set

    ## Overview
    
    Tomato Validation Set is a dataset for object detection tasks - it contains Tomato annotations for 4,265 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. give us the data validation test set

    • kaggle.com
    zip
    Updated Apr 23, 2021
    Cite
    Anna (2021). give us the data validation test set [Dataset]. https://www.kaggle.com/annatmp/give-us-the-data-validation-test-set
    Explore at:
    Available download formats: zip (439562080 bytes)
    Dataset updated
    Apr 23, 2021
    Authors
    Anna
    Description

    Dataset

    This dataset was created by Anna

  7. Results on validation set data.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 30, 2021
    Cite
    Rutten, Matthieu; Smits, Henk; Kurstjens, Steef; Çallı, Erdi; Murphy, Keelin; van Ginneken, Bram; Samson, Tijs; Herpers, Robert (2021). Results on validation set data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000918798
    Explore at:
    Dataset updated
    Jul 30, 2021
    Authors
    Rutten, Matthieu; Smits, Henk; Kurstjens, Steef; Çallı, Erdi; Murphy, Keelin; van Ginneken, Bram; Samson, Tijs; Herpers, Robert
    Description

    Five models are trained using various input masking probabilities (IMP). Each resulting model is validated using the heavily masked validation dataset of 13596 samples (5668 positive) to evaluate its performance in the context of missing input data. AUC values for the optimal training IMP are shown, along with those achieved with no input masking (NIM). Bold font indicates the highest AUC in the table. Results for other IMP values are provided in the S1 File.
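
    A minimal sketch of the kind of evaluation described above (scoring a model on a masked validation set with ROC AUC), assuming a scikit-learn-style classifier and an interpretation of the input masking probability as independently zeroing input values; the stand-in data, model and masking function are assumptions, not the authors' setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-in data; the real validation set has 13596 samples (5668 positive).
X_train, y_train = rng.normal(size=(2000, 30)), rng.integers(0, 2, 2000)
X_val, y_val = rng.normal(size=(13596, 30)), rng.integers(0, 2, 13596)

def mask_inputs(X, imp, rng):
    """Zero out each input value independently with probability `imp`
    (one plausible reading of 'input masking probability')."""
    return np.where(rng.random(X.shape) < imp, 0.0, X)

# Train with one masking probability, validate on heavily masked data.
model = LogisticRegression(max_iter=1000).fit(mask_inputs(X_train, 0.3, rng), y_train)
scores = model.predict_proba(mask_inputs(X_val, 0.5, rng))[:, 1]
print("validation AUC:", roc_auc_score(y_val, scores))
```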

  8. Landsat 8 Collection 1 cloud truth mask validation set

    • catalog.data.gov
    • data.usgs.gov
    • +2 more
    Updated Nov 27, 2025
    Cite
    U.S. Geological Survey (2025). Landsat 8 Collection 1 cloud truth mask validation set [Dataset]. https://catalog.data.gov/dataset/landsat-8-collection-1-cloud-truth-mask-validation-set
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center in Sioux Falls, SD developed a cloud validation dataset from 48 unique Landsat 8 Collection 1 images. These images were selected at random from the Landsat 8 archive from various locations around the world. While these validation images were subjectively designed by a single analyst, they provide useful information for quantifying the accuracy of clouds flagged by various cloud masking algorithms. Each mask is provided in GeoTIFF format, and includes all bands from the original Landsat 8 Level-1 Collection 1 data product (COG GeoTIFF), and its associated Level-1 metadata (MTL.txt file).
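
    To show how such truth masks are typically consumed, here is a hedged sketch that loads a mask with rasterio and measures per-pixel agreement with an algorithm's cloud mask. The file names and the convention that nonzero pixels mean "cloud" are assumptions, not part of the USGS product description.

```python
import numpy as np
import rasterio

# Hypothetical file names; substitute the actual GeoTIFFs from the validation set.
truth_path = "LC08_truth_mask.tif"
algo_path = "LC08_algorithm_cloud_mask.tif"

with rasterio.open(truth_path) as src:
    truth = src.read(1)       # first band of the truth mask
with rasterio.open(algo_path) as src:
    predicted = src.read(1)   # cloud mask produced by the algorithm under test

# Assumes both rasters share the same grid and that nonzero pixels mark cloud.
truth_cloud = truth > 0
pred_cloud = predicted > 0
agreement = np.mean(truth_cloud == pred_cloud)
print(f"per-pixel agreement with the truth mask: {agreement:.3f}")
```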

  9. Validation Set Dataset

    • universe.roboflow.com
    zip
    Updated May 7, 2024
    Cite
    VA (2024). Validation Set Dataset [Dataset]. https://universe.roboflow.com/va-2pswp/validation-set-hmu1x/model/3
    Explore at:
    Available download formats: zip
    Dataset updated
    May 7, 2024
    Dataset authored and provided by
    VA
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cars Motorcycles Bounding Boxes
    Description

    Validation Set

    ## Overview
    
    Validation Set is a dataset for object detection tasks - it contains Cars Motorcycles annotations for 219 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  10. MS Training Set, MS Validation Set, and UW Validation/Test Set - Dataset - LDM

    • service.tib.eu
    Updated Dec 17, 2024
    Cite
    (2024). MS Training Set, MS Validation Set, and UW Validation/Test Set - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/ms-training-set--ms-validation-set--and-uw-validation-test-set
    Explore at:
    Dataset updated
    Dec 17, 2024
    Description

    The MS Training Set, MS Validation Set, and UW Validation/Test Set are used for training, validation, and testing the proposed methods.

  11. Data from: Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    + more versions
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
    Explore at:
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
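
    A short sketch of how the provided validation-ID lists might be used to carve a validation split out of a training set with pandas. The file names and column names ("pair_id", "label") are assumptions about the schema, not taken from the corpus documentation.

```python
import pandas as pd

# Hypothetical file names; substitute the actual WDC training-set and
# validation-ID files for the product category you are working with.
pairs = pd.read_json("computers_train_medium.json.gz", lines=True)
valid_ids = pd.read_csv("computers_valid_medium.csv")

# Assumed columns: 'pair_id' identifies a product pair, 'label' is 1 for
# "match" and 0 for "no match".
is_valid = pairs["pair_id"].isin(valid_ids["pair_id"])
train_df, valid_df = pairs[~is_valid], pairs[is_valid]

print(len(train_df), "training pairs,", len(valid_df), "validation pairs")
print("validation match rate:", valid_df["label"].mean())
```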

  12. AIMO2 - Omni-MATH based validation set

    • kaggle.com
    zip
    Updated Oct 29, 2024
    Cite
    Gabriel Mirea (2024). AIMO2 - Omni-MATH based validation set [Dataset]. https://www.kaggle.com/datasets/gabrielmirea/aimo2-omni-math-based-validation-set
    Explore at:
    Available download formats: zip (231246 bytes)
    Dataset updated
    Oct 29, 2024
    Authors
    Gabriel Mirea
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Original dataset here - https://huggingface.co/datasets/KbsdJames/Omni-MATH

    Processed by selecting only INT solutions of 7+ difficulty, then run through CoT and TIR with Qwen2.5-math-1.5B-instruct, and further processed by filtering out any problems solved, or commonly solved, by this model.
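
    A hedged sketch of the kind of filtering described, using the Hugging Face datasets library on the linked Omni-MATH dataset. The split name and the field names ("difficulty", "answer"), as well as reading "INT" as integer-valued answers, are assumptions about the upstream schema.

```python
from datasets import load_dataset

# Source dataset linked above; the split name is an assumption.
ds = load_dataset("KbsdJames/Omni-MATH", split="test")

def is_hard_integer_problem(example):
    # Assumed fields: 'difficulty' (numeric) and 'answer' (string).
    try:
        int(str(example["answer"]).strip())
    except ValueError:
        return False
    return float(example["difficulty"]) >= 7

subset = ds.filter(is_hard_integer_problem)
print(len(subset), "problems with integer answers and difficulty >= 7")
```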

  13. IERv2-Validation-Set

    • huggingface.co
    Updated Jan 17, 2025
    + more versions
    Cite
    taesiri (2025). IERv2-Validation-Set [Dataset]. https://huggingface.co/datasets/taesiri/IERv2-Validation-Set
    Explore at:
    Dataset updated
    Jan 17, 2025
    Authors
    taesiri
    Description

    taesiri/IERv2-Validation-Set dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. Validation set from one discharge

    • resodate.org
    • service.tib.eu
    Updated Jan 3, 2025
    Cite
    Jian. Liu; Hong. Qin; Ting. Lan (2025). Validation set from one discharge [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdmFsaWRhdGlvbi1zZXQtZnJvbS1vbmUtZGlzY2hhcmdl
    Explore at:
    Dataset updated
    Jan 3, 2025
    Dataset provided by
    Leibniz Data Manager
    Authors
    Jian. Liu; Hong. Qin; Ting. Lan
    Description

    The dataset used in the paper is a validation set from one discharge, containing N-channel MUM system samples.

  15. happywhale-tfrecords-25val

    • kaggle.com
    zip
    Updated Mar 20, 2022
    Cite
    Rickyinferno (2022). happywhale-tfrecords-25val [Dataset]. https://www.kaggle.com/datasets/runjiali/happywhale-tfrecords-25val/code
    Explore at:
    Available download formats: zip (61907695094 bytes)
    Dataset updated
    Mar 20, 2022
    Authors
    Rickyinferno
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    val_half: contains the 1/4 of IDs that have 50% of their pictures in this validation set and 50% in the training set

    val_all: contains the 1/4 of IDs whose pictures are not included in the training set

    train: training set

    test: test set
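
    A rough sketch of how a split like the val_half / val_all scheme above could be constructed from an image-to-individual table with pandas; the column names and the 1/4 proportions follow the description loosely and are otherwise assumptions.

```python
import numpy as np
import pandas as pd

def split_by_individual(df, seed=0):
    """Split images so that 1/4 of individuals contribute half of their images
    to validation (val_half) and another 1/4 are held out entirely (val_all);
    assumes columns 'image' and 'individual_id'."""
    rng = np.random.default_rng(seed)
    ids = df["individual_id"].unique()
    rng.shuffle(ids)
    q = len(ids) // 4
    half_ids, all_ids = set(ids[:q]), set(ids[q:2 * q])

    val_all = df[df["individual_id"].isin(all_ids)]
    half_part = df[df["individual_id"].isin(half_ids)]
    val_half = half_part.groupby("individual_id").sample(frac=0.5, random_state=seed)
    train = df.drop(index=val_all.index.union(val_half.index))
    return train, val_half, val_all
```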

  16. Landsat 9 Collection 2 cloud truth mask validation set

    • data.usgs.gov
    • catalog.data.gov
    Updated Nov 28, 2023
    + more versions
    Cite
    Pat Scaramuzza (2023). Landsat 9 Collection 2 cloud truth mask validation set [Dataset]. http://doi.org/10.5066/P9VRGJ1J
    Explore at:
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Pat Scaramuzza
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Nov 1, 2021 - Jun 1, 2023
    Description

    The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center in Sioux Falls, SD developed a cloud validation dataset from 48 unique Landsat 9 Collection 2 images. These images were selected at random from the Landsat 9 archive from various locations around the world. While these validation images were subjectively designed by a single analyst, they provide useful information for quantifying the accuracy of clouds flagged by various cloud masking algorithms. Each mask is provided in GeoTIFF format, and includes all bands from the original Landsat 9 Collection 2 Level-1 data product (COG GeoTIFF), and its associated Level-1 metadata (MTL.txt file).

  17. Validation Set Dataset

    • universe.roboflow.com
    zip
    Updated Apr 17, 2023
    Cite
    Ashwin Alinkil (2023). Validation Set Dataset [Dataset]. https://universe.roboflow.com/ashwin-alinkil-5bvgz/validation-set-2fjfc/dataset/2
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 17, 2023
    Dataset authored and provided by
    Ashwin Alinkil
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Cars Trucks Vans Pedestrians Bounding Boxes
    Description

    Validation Set

    ## Overview
    
    Validation Set is a dataset for object detection tasks - it contains Cars Trucks Vans Pedestrians annotations for 1,500 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [MIT license](https://opensource.org/licenses/MIT).
    
  18. Validation Data Set Dataset

    • universe.roboflow.com
    zip
    Updated Oct 13, 2022
    Cite
    University of Santo Tomas (2022). Validation Data Set Dataset [Dataset]. https://universe.roboflow.com/university-of-santo-tomas/validation-data-set/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 13, 2022
    Dataset authored and provided by
    University of Santo Tomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Microscopic Eggs Bounding Boxes
    Description

    Validation Data Set

    ## Overview
    
    Validation Data Set is a dataset for object detection tasks - it contains Microscopic Eggs annotations for 300 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  19. Comparison of classification results of different models on the validation set.

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Nov 11, 2022
    + more versions
    Cite
    Shao, Ran; Bi, Xiao-Jun; Chen, Zheng (2022). Comparison of classification results of different models on the validation set. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000425794
    Explore at:
    Dataset updated
    Nov 11, 2022
    Authors
    Shao, Ran; Bi, Xiao-Jun; Chen, Zheng
    Description

    Comparison of classification results of different models on the validation set.

  20. Validation metrics for 10 random cross-validation sets.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jan 3, 2023
    Cite
    Brown, Samantha; Kosciuch, Karl; Riser-Espinoza, Daniel; Hallingstad, Eric; Haddock, Jeanette; Rabie, Paul (2023). Validation metrics for 10 random cross-validation sets. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001101855
    Explore at:
    Dataset updated
    Jan 3, 2023
    Authors
    Brown, Samantha; Kosciuch, Karl; Riser-Espinoza, Daniel; Hallingstad, Eric; Haddock, Jeanette; Rabie, Paul
    Description

    For each validation set the following metrics were calculated: RMSE, Pearson’s correlation coefficient, proportion of predictions exceeding the OOS estimates, and average absolute error (average of the absolute value of the difference between predicted and actual raptor average probability of persistence).
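
    As a quick illustration, the metrics listed above can be computed for one validation set with NumPy and SciPy; the predicted and observed persistence probabilities below are made-up placeholder values, and "proportion exceeding the OOS estimates" is taken at face value from the description.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder predicted vs. out-of-sample (OOS) persistence probabilities.
predicted = np.array([0.62, 0.48, 0.71, 0.55, 0.80])
observed = np.array([0.58, 0.50, 0.65, 0.60, 0.74])

rmse = np.sqrt(np.mean((predicted - observed) ** 2))
corr, _ = pearsonr(predicted, observed)
prop_exceeding = np.mean(predicted > observed)
avg_abs_error = np.mean(np.abs(predicted - observed))

print(f"RMSE: {rmse:.3f}")
print(f"Pearson r: {corr:.3f}")
print(f"proportion exceeding OOS estimates: {prop_exceeding:.2f}")
print(f"average absolute error: {avg_abs_error:.3f}")
```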
