100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
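
    The nested CV idea in the description can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the function names, fold counts, and seed handling are assumptions made for illustration. The point it shows is that the inner folds, where feature selection and parameter tuning would happen, are drawn only from the outer training indices, so the outer test fold is never seen during model selection.

```python
import random


def k_fold_indices(indices, k, seed=0):
    """Shuffle `indices` and split them into k roughly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits is a list of (inner_train, inner_val) pairs built only
    from outer_train, so tuning never touches the outer test fold.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [
            idx for j, fold in enumerate(outer_folds) if j != i for idx in fold
        ]
        inner_folds = k_fold_indices(outer_train, inner_k, seed + 1 + i)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [
                idx for n, fold in enumerate(inner_folds) if n != m for idx in fold
            ]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# Hypothetical toy size: 20 samples, 5 outer folds, 3 inner folds.
splits = list(nested_cv_splits(n_samples=20, outer_k=5, inner_k=3))
```

    In a full pipeline, feature selection and hyper-parameter search would be fitted on each inner_train, scored on inner_val, and the chosen configuration would be evaluated once on outer_test.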

  2. Images used for training, validation, and testing.

    • kaggle.com
    Updated Mar 15, 2024
    Cite
    Chrysthian Chrisley (2024). Images used for training, validation, and testing. [Dataset]. https://www.kaggle.com/datasets/chrysthian/images-used-for-training-validation-and-testing
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 15, 2024
    Dataset provided by
    Kaggle
    Authors
    Chrysthian Chrisley
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Imports:

    # All Imports
    import os
    from matplotlib import pyplot as plt
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    import seaborn as sns
    import matplotlib.image as mpimg
    import cv2
    import numpy as np
    import pickle
    
    # TensorFlow and Keras: layers, models, optimizers, and losses
    import tensorflow as tf
    from tensorflow import keras
    from keras import Sequential
    from keras.layers import *
    
    # Optimizer
    from keras.optimizers import Adamax
    
    # PreTrained Model
    from keras.applications import *
    
    #Early Stopping
    from keras.callbacks import EarlyStopping
    import warnings 
    

    Warnings Suppression | Configuration

    # Warnings Remove 
    warnings.filterwarnings("ignore")
    
    # Define the base path for the training folder
    base_path = 'jaguar_cheetah/train'
    
    # Weights file
    weights_file = 'Model_train_weights.weights.h5'
    
    # Path to the saved or to save the model:
    model_file = 'Model-cheetah_jaguar_Treined.keras'
    
    # Model history
    history_path = 'training_history_cheetah_jaguar.pkl'
    
    # Initialize lists to store file paths and labels
    filepaths = []
    labels = []
    
    # Iterate over folders and files within the training directory
    for folder in ['Cheetah', 'Jaguar']:
      folder_path = os.path.join(base_path, folder)
      for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        filepaths.append(file_path)
        labels.append(folder)
    
    # Create the TRAINING dataframe
    file_path_series = pd.Series(filepaths , name= 'filepath')
    Label_path_series = pd.Series(labels , name = 'label')
    df_train = pd.concat([file_path_series ,Label_path_series ] , axis = 1)
    
    
    # Define the base path for the test folder
    directory = "jaguar_cheetah/test"
    
    filepath =[]
    label = []
    
    folds = os.listdir(directory)
    
    for fold in folds:
      f_path = os.path.join(directory , fold)
      
      imgs = os.listdir(f_path)
      
      for img in imgs:
        
        img_path = os.path.join(f_path , img)
        filepath.append(img_path)
        label.append(fold)
        
    # Create the TEST dataframe
    file_path_series = pd.Series(filepath , name= 'filepath')
    Label_path_series = pd.Series(label , name = 'label')
    df_test = pd.concat([file_path_series ,Label_path_series ] , axis = 1)
    
    # Display the first rows of the dataframe for verification
    #print(df_train)
    
    # Folders with Training and Test files
    data_dir = 'jaguar_cheetah/train'
    test_dir = 'jaguar_cheetah/test'
    
    # Image size 256x256
    IMAGE_SIZE = (256,256) 
    

    Train | Test

    #print('Training Images:')
    
    # Create the TRAINING dataset (90% of the images in data_dir)
    train_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir,
      validation_split=0.1,
      subset='training',
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    # Validation data (the remaining 10% of the images in data_dir)
    #print('Validation Images:')
    validation_ds = tf.keras.utils.image_dataset_from_directory(
      data_dir, 
      validation_split=0.1,
      subset='validation',
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    print('Testing Images:')
    test_ds = tf.keras.utils.image_dataset_from_directory(
      test_dir, 
      seed=123,
      image_size=IMAGE_SIZE,
      batch_size=32)
    
    # Extract labels
    train_labels = train_ds.class_names
    test_labels = test_ds.class_names
    validation_labels = validation_ds.class_names
    
    # Encode labels
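    # Define the class labels; they must match the folder names reported by
    # class_names, otherwise LabelEncoder.transform raises on unseen labels
    class_labels = ['Cheetah', 'Jaguar']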
    
    # Instantiate (encoder) LabelEncoder
    label_encoder = LabelEncoder()
    
    # Fit the label encoder on the class labels
    label_encoder.fit(class_labels)
    
    # Transform the labels for the training dataset
    train_labels_encoded = label_encoder.transform(train_labels)
    
    # Transform the labels for the validation dataset
    validation_labels_encoded = label_encoder.transform(validation_labels)
    
    # Transform the labels for the testing dataset
    test_labels_encoded = label_encoder.transform(test_labels)
    
    # Normalize the pixel values
    
    # Train files 
    train_ds = train_ds.map(lambda x, y: (x / 255.0, y))
    # Validate files
    validation_ds = validation_ds.map(lambda x, y: (x / 255.0, y))
    # Test files
    test_ds = test_ds.map(lambda x, y: (x / 255.0, y))
    
    #TRAINING VISUALIZATION
    #Count the occurrences of each category in the column
    count = df_train['label'].value_counts()
    
    # Create a figure with 2 subplots
    fig, axs = plt.subplots(1, 2, figsize=(12, 6), facecolor='white')
    
    # Plot a pie chart on the first subplot
    palette = sns.color_palette("viridis")
    sns.set_palette(palette)
    axs[0].pie(count, labels=count.index, autopct='%1.1f%%', startangle=140)
    axs[0].set_title('Distribution of Training Categories')
    
    # Plot a bar chart on the second subplot
    sns.barplot(x=count.index, y=count.values, ax=axs[1], palette="viridis")
    axs[1].set_title('Count of Training Categories')
    
    # Adjust the layout
    plt.tight_layout()
    
    # Visualize
    plt.show()
    
    # TEST VISUALIZATION
    count = df_test['label'].value_counts()
    
    # Create a figure with 2 subplots
    fig, axs = plt.subplots(1, 2, figsize=(12, 6), facec...
    
  3. ECCOE 2022 Surface Reflectance Validation Dataset

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 20, 2025
    Cite
    U.S. Geological Survey (2025). ECCOE 2022 Surface Reflectance Validation Dataset [Dataset]. https://catalog.data.gov/dataset/eccoe-2022-surface-reflectance-validation-dataset
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    Scientists and engineers from the U.S. Geological Survey (USGS) Earth Resources Observation and Science Center (EROS) Cal/Val Center of Excellence (ECCOE) collected in situ measurements using field spectrometers to support the validation of surface reflectance products derived from Earth observing remote sensing imagery. Data provided in this data release were collected during select Earth observing satellite overpasses and tests during the months of May through October 2022. Data were collected at three field sites: the ground viewing radiometer (GVR) site on the USGS EROS facility in Minnehaha County, South Dakota, a private land holding near the City of Arlington in Brookings County, South Dakota, and a private land holding in Sanborn County, South Dakota. Each field collection file includes the calculated surface reflectance of each wavelength collected using a dual field spectrometer methodology. The dual field spectrometer methodology allows for the calculated surface reflectance of each wavelength to be computed using one or both of the spectrometers. The use of the dual field spectrometer system reduces uncertainty in the field measurements by accounting for changes in solar irradiance. Both single and dual spectrometer calculated surface reflectance are included with this dataset. The differing methodologies of the calculated surface reflectance data are denoted as "Single Spectrometer" and "Dual Spectrometer". Field spectrometer data are provided as Comma Separated Values (CSV) files and GeoPackage files. The 09 May 2022 and the 16 June 2022 collection data are calculated using a single spectrometer only, due to a technical issue with a field spectrometer.

  4. LLM Science: Validation Data

    • kaggle.com
    zip
    Updated Oct 7, 2023
    Cite
    Yash Goel (2023). LLM Science: Validation Data [Dataset]. https://www.kaggle.com/datasets/goelyash/llm-science-validation-data
    Available download formats: zip (571549 bytes)
    Dataset updated
    Oct 7, 2023
    Authors
    Yash Goel
    Description

    Why this dataset? It shows a positive correlation between CV (cross-validation) and LB (leaderboard) scores.

    How is it made? It contains 300 question/answer items created by @yalickj plus 200 question/answer items provided by the competition.

    Context retrieval is done using @mbanaei's fantastic notebook.

  5. Data Validation Dataset

    • universe.roboflow.com
    zip
    Updated Oct 2, 2025
    Cite
    Izzatul (2025). Data Validation Dataset [Dataset]. https://universe.roboflow.com/izzatul/data-validation-nf4ld/dataset/1
    Available download formats: zip
    Dataset updated
    Oct 2, 2025
    Dataset authored and provided by
    Izzatul
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cats Bounding Boxes
    Description

    Data Validation

    ## Overview
    
    Data Validation is a dataset for object detection tasks - it contains Cats annotations for 1,542 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. Results on validation set data.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 30, 2021
    Cite
    Rutten, Matthieu; Smits, Henk; Kurstjens, Steef; Çallı, Erdi; Murphy, Keelin; van Ginneken, Bram; Samson, Tijs; Herpers, Robert (2021). Results on validation set data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000918798
    Dataset updated
    Jul 30, 2021
    Authors
    Rutten, Matthieu; Smits, Henk; Kurstjens, Steef; Çallı, Erdi; Murphy, Keelin; van Ginneken, Bram; Samson, Tijs; Herpers, Robert
    Description

    Five models are trained using various input masking probabilities (IMP). Each resulting model is validated using the heavily masked validation dataset of 13596 samples (5668 positive) to evaluate their performance in the context of missing input data. AUC values for the optimal training IMP are shown, along with those achieved with no input masking (NIM). Bold font indicates the highest AUC in the table. Results for other IMP values are provided in the S1 File.
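
    For readers less familiar with the AUC metric used in this description, the sketch below computes it in plain Python as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The function name and the toy labels and scores are hypothetical; only the sample counts (13596 total, 5668 positive) come from the description.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the dataset.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)

# Share of positive samples in the validation set described above.
positive_fraction = 5668 / 13596
```

    The pairwise loop is quadratic in the number of samples, so it suits small illustrations; a rank-based implementation would be preferable at the size of the validation set described here.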

  7. Validation Data Set Dataset

    • universe.roboflow.com
    zip
    Updated Oct 13, 2022
    Cite
    University of Santo Tomas (2022). Validation Data Set Dataset [Dataset]. https://universe.roboflow.com/university-of-santo-tomas/validation-data-set/dataset/1
    Available download formats: zip
    Dataset updated
    Oct 13, 2022
    Dataset authored and provided by
    University of Santo Tomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Microscopic Eggs Bounding Boxes
    Description

    Validation Data Set

    ## Overview
    
    Validation Data Set is a dataset for object detection tasks - it contains Microscopic Eggs annotations for 300 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  8. Address & ZIP Validation Dataset | Mobility Data | Geospatial Checks +...

    • datarade.ai
    .csv
    Updated May 17, 2024
    Cite
    GeoPostcodes (2024). Address & ZIP Validation Dataset | Mobility Data | Geospatial Checks + Coverage Flags (Global) [Dataset]. https://datarade.ai/data-products/geopostcodes-geospatial-data-zip-code-data-address-vali-geopostcodes
    Available download formats: .csv
    Dataset updated
    May 17, 2024
    Dataset authored and provided by
    GeoPostcodes
    Area covered
    Bolivia (Plurinational State of), Cabo Verde, Mongolia, Kazakhstan, Ireland, Sint Maarten (Dutch part), Colombia, Korea (Republic of), South Africa, French Guiana
    Description

    Our location data powers the most advanced address validation solutions for enterprise backend and frontend systems.

    A global, standardized, self-hosted location dataset containing all administrative divisions, cities, and zip codes for 247 countries.

    All geospatial data for address data validation is updated weekly to maintain the highest data quality, including challenging countries such as China, Brazil, Russia, and the United Kingdom.

    Use cases for the Address Validation at Zip Code Level Database (Geospatial data)

    • Address capture and address validation

    • Address autocomplete

    • Address verification

    • Reporting and Business Intelligence (BI)

    • Master Data Management

    • Logistics and Supply Chain Management

    • Sales and Marketing

    Product Features

    • Dedicated features to deliver best-in-class user experience

    • Multi-language support including address names in local and foreign languages

    • Comprehensive city definitions across countries

    Data export methodology

    Our location data packages are offered in variable formats, including .csv. All geospatial data for address validation are optimized for seamless integration with popular systems like Esri ArcGIS, Snowflake, QGIS, and more.

    Why do companies choose our location databases

    • Enterprise-grade service

    • Full control over security, speed, and latency

    • Reduce integration time and cost by 30%

    • Weekly updates for the highest quality

    • Seamlessly integrated into your software

    Note: Custom address validation packages are available. Please submit a request via the above contact button for more details.

  9. Procurement Validation Dataset

    • repository.clarin.lv
    Updated Aug 27, 2025
    Cite
    Daiga Deksne; Raivis Skadiņš; Andris Hohbergs; Rūdolfs Jaunzars; Andrejs Petrovs; Justīne Rūdule; Mārcis Pinnis (2025). Procurement Validation Dataset [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/135
    Dataset updated
    Aug 27, 2025
    Authors
    Daiga Deksne; Raivis Skadiņš; Andris Hohbergs; Rūdolfs Jaunzars; Andrejs Petrovs; Justīne Rūdule; Mārcis Pinnis
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Procurement Validation Dataset was created within the framework of the State Research Programme project "Analysis of the Applicability of Artificial Intelligence Methods in the Field of European Union Fund Projects".

    The dataset consists of 30 procurement documents evaluated by CFCA experts. The procurement checklists prepared by the experts have been transformed into machine-readable form. For each procurement, 168 questions are asked regarding its compliance with legislation, and each question has an answer provided.

    The dataset is divided into two subsets: a development dataset (10 procurements) and an evaluation dataset (20 procurements). The dataset consists of:
    1) questions based on the checklist S.7.1.-PL-21 (09.12.2019 edition);
    2) a labeled dataset corresponding to 30 procurements evaluated by CFCA.
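
    The sizes quoted above imply the following answer counts per subset. This is a small arithmetic sketch, assuming every one of the 168 checklist questions has exactly one answer per procurement, as the description states; the constant and function names are illustrative only.

```python
# Figures taken from the dataset description.
QUESTIONS_PER_PROCUREMENT = 168
DEV_PROCUREMENTS = 10
EVAL_PROCUREMENTS = 20


def answer_count(n_procurements, n_questions=QUESTIONS_PER_PROCUREMENT):
    """Number of question/answer records for a group of procurements."""
    return n_procurements * n_questions


dev_answers = answer_count(DEV_PROCUREMENTS)
eval_answers = answer_count(EVAL_PROCUREMENTS)
total_answers = dev_answers + eval_answers
```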

    The dataset is distributed under the CC-BY-NC-SA license: https://creativecommons.org/licenses/by-nc-sa/4.0/ When using this dataset, please cite as:

    Project "Analysis of the Applicability of Artificial Intelligence Methods in the Field of European Union Fund Projects" (VPP-CFLA-Mākslīgais intelekts-2024/1-0003). Procurement Validation Dataset. Licensed under CC BY-NC-SA 4.0.

  10. Light and GPP estimates for 173 U.S. rivers: 3. Model validation dataset

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 26, 2025
    Cite
    U.S. Geological Survey (2025). Light and GPP estimates for 173 U.S. rivers: 3. Model validation dataset [Dataset]. https://catalog.data.gov/dataset/light-and-gpp-estimates-for-173-u-s-rivers-3-model-validation-dataset
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    This dataset provides a common validation set for estimates of gross primary productivity (GPP). The data represent a subset of all predictions made in the model inputs and outputs that were converted to GPP based on a light use efficiency. The data were subset to only those days where all light estimates could be produced. This dataset is part of a larger data release of inputs and outputs from a model to predict light at the stream surface and within the water column for 173 streams and rivers across the continental United States. The complete release contains model input data, modeled estimates of light at the stream surface and within the water column, and modeled estimates of gross primary productivity.

  11. Agrisense Validation Dataset Dataset

    • universe.roboflow.com
    zip
    Updated Jun 11, 2024
    Cite
    Capstone Project (2024). Agrisense Validation Dataset Dataset [Dataset]. https://universe.roboflow.com/capstone-project-kvpdq/agrisense-validation-dataset
    Available download formats: zip
    Dataset updated
    Jun 11, 2024
    Dataset authored and provided by
    Capstone Project
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Rice Plants Bounding Boxes
    Description

    Agrisense Validation Dataset

    ## Overview
    
    Agrisense Validation Dataset is a dataset for object detection tasks - it contains Rice Plants annotations for 200 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  12. Drone Validation Dataset Dataset

    • universe.roboflow.com
    zip
    Updated Jan 30, 2024
    Cite
    Zee Zee (2024). Drone Validation Dataset Dataset [Dataset]. https://universe.roboflow.com/zee-zee-urhcf/drone-validation-dataset
    Available download formats: zip
    Dataset updated
    Jan 30, 2024
    Dataset authored and provided by
    Zee Zee
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Drone Bounding Boxes
    Description

    Drone Validation Dataset

    ## Overview
    
    Drone Validation Dataset is a dataset for object detection tasks - it contains Drone annotations for 560 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [MIT license](https://opensource.org/licenses/MIT).
    
  13. New For Validation Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    chungnam national university (2023). New For Validation Dataset [Dataset]. https://universe.roboflow.com/chungnam-national-university-tpjip/new-dataset-for-validation
    Available download formats: zip
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    chungnam national university
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Free Space Masks
    Description

    New Dataset For Validation

    ## Overview
    
    New Dataset For Validation is a dataset for semantic segmentation tasks - it contains Free Space annotations for 404 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  14. K Fold Cross Validation Dataset

    • universe.roboflow.com
    zip
    Updated Apr 25, 2023
    Cite
    University of Santo Tomas (2023). K Fold Cross Validation Dataset [Dataset]. https://universe.roboflow.com/university-of-santo-tomas-htnuv/k-fold-cross-validation/model/3
    Available download formats: zip
    Dataset updated
    Apr 25, 2023
    Dataset authored and provided by
    University of Santo Tomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    People Bounding Boxes
    Description

    K Fold Cross Validation

    ## Overview
    
    K Fold Cross Validation is a dataset for object detection tasks - it contains People annotations for 3,500 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  15. Part Validation Dataset

    • universe.roboflow.com
    zip
    Updated Aug 22, 2025
    Cite
    IFB (2025). Part Validation Dataset [Dataset]. https://universe.roboflow.com/ifb-ohzgt/part-validation/model/1
    Available download formats: zip
    Dataset updated
    Aug 22, 2025
    Dataset authored and provided by
    IFB
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Objects Bounding Boxes
    Description

    Part Validation

    ## Overview
    
    Part Validation is a dataset for object detection tasks - it contains Objects annotations for 2,152 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  16. Flir Validation Dataset

    • universe.roboflow.com
    zip
    Updated Feb 7, 2022
    Cite
    new-workspace-2uoxs (2022). Flir Validation Dataset [Dataset]. https://universe.roboflow.com/new-workspace-2uoxs/flir-validation-dataset
    Available download formats: zip
    Dataset updated
    Feb 7, 2022
    Dataset authored and provided by
    new-workspace-2uoxs
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Valid Bounding Boxes
    Description

    FLIR Validation Dataset

    ## Overview
    
    FLIR Validation Dataset is a dataset for object detection tasks - it contains Valid annotations for 1,366 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. Object Detection Validation Dataset

    • universe.roboflow.com
    zip
    Updated May 25, 2022
    Cite
    Object detection (2022). Object Detection Validation Dataset [Dataset]. https://universe.roboflow.com/object-detection-ireqt/object-detection-validation-dataset
    Available download formats: zip
    Dataset updated
    May 25, 2022
    Dataset authored and provided by
    Object detection
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Vehicles And Traffic Signs Bounding Boxes
    Description

    Object Detection Validation Dataset

    ## Overview
    
    Object Detection Validation Dataset is a dataset for object detection tasks - it contains Vehicles And Traffic Signs annotations for 249 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  18. DocTOR models and cross-validation dataset

    • data.niaid.nih.gov
    Updated Mar 23, 2022
    Cite
    Galletti Cristiano (2022). DocTOR models and cross-validation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_6337103
    Dataset updated
    Mar 23, 2022
    Dataset provided by
    Vic University
    Authors
    Galletti Cristiano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset necessary for DocTOR utility.

    DocTOR (Direct fOreCast Target On Reaction) is a utility written in Python 3.9 (using the conda framework) that allows the user to upload a list of UniProt IDs and adverse reactions (from the available models) in order to study the relationship between the two.

    As output, the program assigns a positive or negative class to each protein, assessing its possible involvement in the onset of the selected ADRs.

    DocTOR exploits the data coming from T-ARDIS [https://doi.org/10.1093/database/baab068] to train different Machine Learning approaches (SVM, RF, NN) using network topological measurements as features.

    The predictions from the individual trained models are combined in a meta-predictor that exploits three different voting systems.

    The results of the meta-predictor together with the ones from the single ML method will be available in the output log file (named "predictions_community" or "predictions_curated" based on the database type).
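
    The description does not name the three voting systems, so the sketch below is only an illustration of how per-model class predictions could be combined, not DocTOR's actual implementation. The three rules shown (majority, unanimous, any-positive) and all function names are assumptions chosen for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Positive (1) if more models vote positive than negative, else 0."""
    counts = Counter(predictions)
    return 1 if counts[1] > counts[0] else 0


def unanimous_vote(predictions):
    """Positive (1) only if every model votes positive."""
    return 1 if all(p == 1 for p in predictions) else 0


def any_positive_vote(predictions):
    """Positive (1) if at least one model votes positive."""
    return 1 if any(p == 1 for p in predictions) else 0


# Hypothetical votes from the three model families named in the description.
model_votes = {"SVM": 1, "RF": 0, "NN": 1}
votes = list(model_votes.values())

combined = {
    "majority": majority_vote(votes),
    "unanimous": unanimous_vote(votes),
    "any_positive": any_positive_vote(votes),
}
```

    Stricter rules such as the unanimous vote trade sensitivity for fewer false positives, which is one reason a meta-predictor might report several voting outcomes side by side.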

    The DocTOR utility is available at https://github.com/cristian931/DocTOR

  19. File Validation and Training Statistics

    • kaggle.com
    zip
    Updated Dec 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). File Validation and Training Statistics [Dataset]. https://www.kaggle.com/datasets/thedevastator/file-validation-and-training-statistics
    Available download formats: zip (16413235 bytes)
    Dataset updated
    Dec 1, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    File Validation and Training Statistics

    Validation, Training, and Testing Statistics for tasksource/leandojo Files

    By tasksource (from Hugging Face)

    About this dataset

    The tasksource/leandojo: File Validation, Training, and Testing Statistics dataset is a comprehensive collection of information regarding the validation, training, and testing processes of files in the tasksource/leandojo repository. This dataset is essential for gaining insights into the file management practices within this specific repository.

    The dataset consists of three distinct files: validation.csv, train.csv, and test.csv. Each file serves a unique purpose in providing statistics and information about the different stages involved in managing files within the repository.

    In validation.csv, you will find detailed information about the validation process undergone by each file. This includes data such as file paths within the repository (file_path), full names of each file (full_name), associated commit IDs (commit), traced tactics implemented (traced_tactics), URLs pointing to each file (url), and respective start and end dates for validation.

    train.csv provides statistics related to the training phase of files. It includes file paths within the repository (file_path), full names of individual files (full_name), associated commit IDs (commit), traced tactics used during training (traced_tactics), and URLs linking to each file (url).

    Lastly, test.csv contains statistics on the testing performed on files within the tasksource/leandojo repository. It includes file paths within the repository (file_path), full names of each tested file (full_name), commit IDs of the file versions being tested (commit), traced tactics used during testing (traced_tactics), and URLs pointing to the tested files (url).

    By exploring these three CSV files (validation.csv, train.csv, and test.csv), researchers can see how validation, training, and testing have been carried out within the tasksource/leandojo repository.

    How to use the dataset

    • Familiarize Yourself with the Dataset Structure:

      • The dataset consists of three separate files: validation.csv, train.csv, and test.csv.
      • Each file contains multiple columns providing different information about file validation, training, and testing.
    • Explore the Columns:

      • 'file_path': This column represents the path of the file within the repository.
      • 'full_name': This column displays the full name of each file.
      • 'commit': The commit ID associated with each file is provided in this column.
      • 'traced_tactics': The tactics traced in each file are listed in this column.
      • 'url': This column provides the URL of each file.
    • Understand Each File's Purpose:

      • validation.csv: information related to the validation process of files in the tasksource/leandojo repository.
      • train.csv: statistics and information regarding the training phase of files in the tasksource/leandojo repository.
      • test.csv: statistics and information about testing individual files within the tasksource/leandojo repository.

    • Generate Insights & Analyze Data:

      • Once you have a clear understanding of each column's purpose, you can start generating insights using statistical techniques or machine learning algorithms.
      • Explore patterns or trends by examining specific columns such as 'traced_tactics' or analyzing multiple columns together.
    • Combine Multiple Files (if necessary):

      • If required, you can merge or correlate data across the CSV files based on common fields such as 'file_path', 'full_name', or 'commit'.
    • Visualize the Data (Optional):

      • To enhance your analysis, consider creating visualizations such as plots, charts, or graphs. Visualization can offer a clear representation of patterns or relationships within the dataset.
    • Obtain Further Information:

      • If you need additional details about any specific file, use the provided 'url' column to access further information.

    Remember that this guide provides a general overview of how to utilize this dataset effectively. Feel ...
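
    Merging on common fields, as the guide above suggests, can be sketched with the Python standard library alone. The sample rows, the helper names index_by_key and inner_join, and the choice of ('file_path', 'commit') as the join key are assumptions for illustration; the real CSV files may contain further columns and would be read from disk rather than from in-memory strings.

```python
import csv
import io
from collections import defaultdict


def index_by_key(csv_text, key_fields):
    """Group CSV rows by the tuple of values found in key_fields."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        index[tuple(row[field] for field in key_fields)].append(row)
    return index


def inner_join(left_text, right_text, key_fields=("file_path", "commit")):
    """Return pairs of rows that share the same values in key_fields."""
    right_index = index_by_key(right_text, key_fields)
    joined = []
    for row in csv.DictReader(io.StringIO(left_text)):
        key = tuple(row[field] for field in key_fields)
        for match in right_index.get(key, []):
            joined.append({"left": row, "right": match})
    return joined


# Hypothetical sample rows standing in for train.csv and test.csv.
TRAIN = (
    "file_path,full_name,commit\n"
    "src/a.lean,foo,abc123\n"
    "src/b.lean,bar,abc123\n"
)
TEST = (
    "file_path,full_name,commit\n"
    "src/a.lean,baz,abc123\n"
    "src/c.lean,qux,def456\n"
)

pairs = inner_join(TRAIN, TEST)
for pair in pairs:
    print(pair["left"]["full_name"], "<->", pair["right"]["full_name"])
```

    Joining on both the path and the commit avoids pairing rows that refer to different versions of the same file.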

  20. ETHOS.ActivityAssure Dataset

    • zenodo.org
    application/gzip, pdf +1
    Updated Nov 27, 2024
    David Neuroth; Noah Pflugradt; Jann Michael Weinand; Detlef Stolten (2024). ETHOS.ActivityAssure Dataset [Dataset]. http://doi.org/10.5281/zenodo.11035251
    Available download formats: zip, application/gzip, pdf
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    David Neuroth; Noah Pflugradt; Jann Michael Weinand; Detlef Stolten
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 15, 2024
    Description

    ETHOS.ActivityAssure Dataset

    The ETHOS.ActivityAssure dataset is an aggregated activity dataset derived from the HETUS 2010 time use survey. It is intended to enable reusable and reproducible validation of various behavior models.

    The ETHOS.ActivityAssure software framework belongs to this dataset, and together they can be used to validate activity profiles, e.g. the results of an occupant behavior model. It provides modules for preprocessing and categorising activity profiles, and comparing them to the statistics in this dataset using indicators and plots. It also contains the code that was used to create this dataset out of the HETUS 2010 data, so that the generation of this dataset is fully reproducible.

    Activity Profile Categorization

    The HETUS dataset consists of many single-day activity profiles. These cannot be made publicly accessible due to data protection regulations. The idea of the ETHOS.ActivityAssure dataset is to aggregate these activity profiles using a meaningful classification, to provide behavior statistics for different types of activity profiles. For that, the attributes country, sex, work status, and day type are used.

    Activity Statistics

    Human behavior is complex, and in order to thoroughly validate it, multiple aspects have to be taken into account. Therefore, the ETHOS.ActivityAssure dataset contains distributions for the duration and frequency of each activity, as well as the temporal distribution throughout the day. For that purpose, a set of 15 common activity groups is defined. The mapping from the 108 activity codes used in HETUS 2010 is provided as part of the validation framework.
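
    As a rough illustration of what duration and frequency statistics mean for a single-day profile, the sketch below counts the episodes of each activity and records their lengths. It is not the ETHOS.ActivityAssure implementation: the function name episode_stats, the 10-minute slot length, and the activity labels are assumptions made for this example.

```python
from collections import defaultdict
from itertools import groupby


def episode_stats(profile, slot_minutes=10):
    """Compute per-activity episode counts and durations for one day.

    profile is a list with one activity label per time slot; consecutive
    identical labels form a single episode. The slot length is an assumption.
    """
    durations = defaultdict(list)
    for activity, run in groupby(profile):
        durations[activity].append(len(list(run)) * slot_minutes)
    return {
        activity: {"frequency": len(lengths), "durations_min": lengths}
        for activity, lengths in durations.items()
    }


# Hypothetical, heavily shortened profile (a real day has many more slots).
profile = ["sleep", "sleep", "sleep", "eat", "work", "work", "eat", "sleep"]
stats = episode_stats(profile)
for activity, values in stats.items():
    print(activity, values)
```

    Collecting these per-profile values across all profiles in a category (country, sex, work status, day type) would give the kind of distributions described above.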

    File Overview

    For convenience, the ETHOS.ActivityAssure dataset is provided both as .tar.gz and as .zip archive. Both files contain the same content, the full activity validation dataset.
    Additionally, the document ActivityAssure_data_set_description.pdf contains a more thorough description of the dataset, including its file structure, the content and meaning of its files, and examples on how to use it.
