100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
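
    The feature-selection leakage described above is easy to reproduce. The following is a minimal, hypothetical sketch (not the authors' simulation code) using scikit-learn on pure-noise data, where the honest accuracy estimate is chance level (~0.5):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    # Small-sample, high-dimensional data with random labels (no real signal).
    rng = np.random.RandomState(0)
    X = rng.randn(40, 1000)
    y = rng.randint(0, 2, 40)

    # Biased: feature selection sees the pooled data before CV is run.
    X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
    biased = cross_val_score(SVC(), X_sel, y, cv=5).mean()

    # Unbiased: selection is refit inside each training fold only.
    pipe = make_pipeline(SelectKBest(f_classif, k=10), SVC())
    unbiased = cross_val_score(pipe, X, y, cv=5).mean()

    print(f"pooled selection: {biased:.2f}, in-fold selection: {unbiased:.2f}")

    On pure noise, the pooled-selection estimate typically comes out far above chance, illustrating the optimistic bias the abstract describes.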

  2. happywhale-tfrecords-25val

    • kaggle.com
    zip
    Updated Mar 20, 2022
    Cite
    Rickyinferno (2022). happywhale-tfrecords-25val [Dataset]. https://www.kaggle.com/datasets/runjiali/happywhale-tfrecords-25val/code
    Explore at:
    Available download formats: zip (61907695094 bytes)
    Dataset updated
    Mar 20, 2022
    Authors
    Rickyinferno
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    val_half: contains the 1/4 of ids that have 50% of their pictures in this validation set and 50% in the training set

    val_all: contains the 1/4 of ids whose pictures are not included in the training set

    train: training set

    test: test set
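
    The id-level split above can be sketched as follows; this is a hypothetical reconstruction, and the metadata file and column names (train.csv, image, individual_id) are assumptions:

    import pandas as pd

    df = pd.read_csv("train.csv")  # columns: image, individual_id (assumed)
    ids = df["individual_id"].drop_duplicates().sample(frac=1.0, random_state=0)

    n = len(ids) // 4
    half_ids = set(ids.iloc[:n])      # 1/4 of ids: pictures split 50/50
    all_ids = set(ids.iloc[n:2 * n])  # another 1/4 of ids: all pictures go to val_all

    val_all = df[df["individual_id"].isin(all_ids)]
    half = df[df["individual_id"].isin(half_ids)]
    val_half = half.groupby("individual_id", group_keys=False).sample(frac=0.5, random_state=0)
    train = df.drop(index=val_all.index.union(val_half.index))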

  3. Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0

    • demo-b2find.dkrz.de
    Updated Nov 27, 2020
    + more versions
    Cite
    (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/8f288eb3-f541-5fca-a337-d519f903668f
    Explore at:
    Dataset updated
    Nov 27, 2020
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000–70,000 pairs). Furthermore, sets of ids for each training set are available for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
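
    A hedged sketch of drawing such a stratified validation split from one of the training sets; the file name and the "label" column are assumptions, not taken from the corpus documentation:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    pairs = pd.read_json("computers_train_medium.json.gz", lines=True)  # assumed file
    # A stratified random draw keeps the match/no-match ratio equal in both splits.
    train, val = train_test_split(
        pairs, test_size=0.2, stratify=pairs["label"], random_state=42
    )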

  4. Results on validation set data.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 30, 2021
    Cite
    Rutten, Matthieu; Smits, Henk; Kurstjens, Steef; Çallı, Erdi; Murphy, Keelin; van Ginneken, Bram; Samson, Tijs; Herpers, Robert (2021). Results on validation set data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000918798
    Explore at:
    Dataset updated
    Jul 30, 2021
    Authors
    Rutten, Matthieu; Smits, Henk; Kurstjens, Steef; Çallı, Erdi; Murphy, Keelin; van Ginneken, Bram; Samson, Tijs; Herpers, Robert
    Description

    Five models are trained using various input masking probabilities (IMP). Each resulting model is validated using the heavily masked validation dataset of 13,596 samples (5,668 positive) to evaluate its performance in the context of missing input data. AUC values for the optimal training IMP are shown, along with those achieved with no input masking (NIM). Bold font indicates the highest AUC in the table. Results for other IMP values are provided in the S1 File.
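
    Input masking of this kind can be pictured as independently dropping input values with a given probability. A minimal sketch, assuming simple element-wise zero-masking (the paper's exact scheme may differ):

    import torch

    def mask_inputs(x: torch.Tensor, imp: float) -> torch.Tensor:
        """Zero each input element independently with probability `imp`."""
        keep = torch.rand_like(x) >= imp
        return x * keep

    batch = torch.randn(8, 20)            # 8 samples, 20 features (shapes assumed)
    masked = mask_inputs(batch, imp=0.5)  # heavily masked input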

  5. Feature data of training set and verification set in utLIFE-PC article

    • scidb.cn
    Updated Oct 12, 2024
    Cite
    LOU; Xing Nianzeng (2024). Feature data of training set and verification set in utLIFE-PC article [Dataset]. http://doi.org/10.57760/sciencedb.14508
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 12, 2024
    Dataset provided by
    Science Data Bank
    Authors
    LOU; Xing Nianzeng
    Description

    File description:

    1. train.Mutation_Meth_CNV_data.xls: The feature matrix file used to train the model; includes sample name, point mutation data, methylation data and CNV data. The first column must be the sample name.
    2. train.sample_label.xls: Pathological information of the training set samples, where 1 represents prostate cancer and 0 represents non-prostate cancer.
    3. validation.Mutation_Meth_CNV_data.xls: The feature matrix file used in the validation set; includes sample name, point mutation data, methylation data and CNV data. The first column must be the sample name.
    4. validation.sample_label.xls: Pathological information of the validation set samples, where 1 represents prostate cancer and 0 represents non-prostate cancer.
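
    A minimal loading sketch for the four files above; reading the .xls files as tab-separated text is an assumption (adjust, e.g. to pd.read_excel, if they are true Excel workbooks):

    import pandas as pd

    # First column is the sample name, so it is used as the index.
    X_train = pd.read_csv("train.Mutation_Meth_CNV_data.xls", sep="\t", index_col=0)
    y_train = pd.read_csv("train.sample_label.xls", sep="\t", index_col=0)
    X_val = pd.read_csv("validation.Mutation_Meth_CNV_data.xls", sep="\t", index_col=0)
    y_val = pd.read_csv("validation.sample_label.xls", sep="\t", index_col=0)
    # Labels: 1 = prostate cancer, 0 = non-prostate cancer.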

  6. Rainbow training and validation data

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Nov 26, 2022
    Cite
    Kimberly Carlson (2022). Rainbow training and validation data [Dataset]. http://doi.org/10.7910/DVN/YTRMGN
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 26, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Kimberly Carlson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset includes the date and time, latitude (“lat”), longitude (“lon”), sun angle (“sun_angle”, in degrees [°]), rainbow presence (TRUE = rainbow, FALSE = no rainbow), cloud cover (“cloud_cover”, proportion), and liquid precipitation (“liquid_precip”, kg m⁻² s⁻¹) for each record used to train and/or validate the models.
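
    A hypothetical sketch pairing the listed predictors with rainbow presence; the file name and exact column names are assumptions, and no particular model from the study is implied:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("rainbow_data.csv")  # assumed file name
    X = df[["sun_angle", "cloud_cover", "liquid_precip"]]
    y = df["rainbow"]                     # TRUE = rainbow, FALSE = no rainbow
    model = LogisticRegression(max_iter=1000).fit(X, y)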

  7. Downsized camera trap images for automated classification

    • data.niaid.nih.gov
    Updated Dec 1, 2022
    Cite
    Norman, Danielle L; Wearne, Oliver R; Chapman, Philip M; Heon, Sui P; Ewers, Robert M (2022). Downsized camera trap images for automated classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6627706
    Explore at:
    Dataset updated
    Dec 1, 2022
    Dataset provided by
    Imperial College London
    Authors
    Norman, Danielle L; Wearne, Oliver R; Chapman, Philip M; Heon, Sui P; Ewers, Robert M
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.

    Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions.

    Funding: These data were collected as part of research funded by:

    NERC (NERC QMEE CDT Studentship, NE/P012345/1, http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FP012345%2F1&cookieConsent=A)

    This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but with the additional condition that you acknowledge the contribution of these funders in any outputs.

    XML metadata: GEMINI compliant metadata for this dataset is available here.

    Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip

    CT_image_data_info2.xlsx: This file contains dataset metadata and 1 data table:

    Dataset Images (described in worksheet Dataset_images)

    Description: This worksheet details the composition of each dataset used in the analyses.
    Number of fields: 69
    Number of data rows: 270,287
    Fields:

    filename: Root ID (Field type: id)
    camera_trap_site: Site ID for the camera trap location (Field type: location)
    taxon: Taxon recorded by camera trap (Field type: taxa)
    dist_level: Level of disturbance at site (Field type: ordered categorical)
    baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
    increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
    dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
    dist_combined_event_level_N (N = 1 to 5): Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level N' training or test set, or not included (NA) (Field type: categorical)
    dist_combined_event_level_pair_i_j (all pairs of levels 1-5, from pair_1_2 through pair_4_5): Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels i and j (pair)' training set, or not included (NA) (Field type: categorical)
    dist_combined_event_level_triple_i_j_k (all triples of levels 1-5, from triple_1_2_3 through triple_3_4_5): Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels i, j and k (triple)' training set, or not included (NA) (Field type: categorical)
    dist_combined_event_level_quad_i_j_k_l (all quads of levels 1-5, from quad_1_2_3_4 through quad_2_3_4_5): Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels i, j, k and l (quad)' training set, or not included (NA) (Field type: categorical)
    dist_combined_event_level_all_1_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3, 4 and 5 (all)' training set, or not included (NA) (Field type: categorical)
    dist_camera_level_individ_1: Label as to whether image is included in the 'disturbance level combination analysis split at camera level: disturbance

  8. Validation verification based on prediction performance of those molecules in the training set, test set and outlier set.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Dec 2, 2015
    Cite
    Max K. Leong; Hong-Bin Chen; Yu-Hsuan Shih (2015). Validation verification based on prediction performance of those molecules in the training set, test set and outlier set. [Dataset]. http://doi.org/10.1371/journal.pone.0033829.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 2, 2015
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Max K. Leong; Hong-Bin Chen; Yu-Hsuan Shih
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    †Not applicable.

  9. DataSheet_1_Construction and Validation of a Lung Cancer Risk Prediction Model for Non-Smokers in China.docx

    • datasetcatalog.nlm.nih.gov
    Updated Mar 3, 2022
    Cite
    Xu, Hui-Fang; Chen, Qiong; Lyu, Zhang-Yan; Kang, Rui-Hua; Zhang, Shao-Kai; Zhang, Jian-Gong; Zhang, Lu-Yao; Zheng, Li-Yang; Sun, Xi-Bin; Guo, Lan-Wei; Cao, Xiao-Qin; Liu, Shu-Zheng; Meng, Qing-Cheng; Liu, Yin (2022). DataSheet_1_Construction and Validation of a Lung Cancer Risk Prediction Model for Non-Smokers in China.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000407531
    Explore at:
    Dataset updated
    Mar 3, 2022
    Authors
    Xu, Hui-Fang; Chen, Qiong; Lyu, Zhang-Yan; Kang, Rui-Hua; Zhang, Shao-Kai; Zhang, Jian-Gong; Zhang, Lu-Yao; Zheng, Li-Yang; Sun, Xi-Bin; Guo, Lan-Wei; Cao, Xiao-Qin; Liu, Shu-Zheng; Meng, Qing-Cheng; Liu, Yin
    Area covered
    China
    Description

    Background: About 15% of lung cancers in men and 53% in women are not attributable to smoking worldwide. The aim was to develop and validate a simple and non-invasive model which could assess and stratify lung cancer risk in non-smokers in China.

    Methods: A large-sample-size, population-based study was conducted under the framework of the Cancer Screening Program in Urban China (CanSPUC). Data on lung cancer screening in Henan province, China, from October 2013 to October 2019 were used and randomly divided into the training and validation sets. Related risk factors were identified through multivariable Cox regression analysis, followed by establishment of a risk prediction nomogram. Discrimination [area under the curve (AUC)] and calibration of the risk prediction nomogram were assessed in the training set and then confirmed in the validation set.

    Results: A total of 214,764 eligible subjects were included, with a mean age of 55.19 years. Subjects were randomly divided into the training (107,382) and validation (107,382) sets. Older age, being male, a low education level, family history of lung cancer, history of tuberculosis, and absence of a history of hyperlipidemia were the independent risk factors for lung cancer. Using these six variables, we plotted 1-year, 3-year, and 5-year lung cancer risk prediction nomograms. The AUC was 0.753, 0.752, and 0.755 for the 1-, 3-, and 5-year lung cancer risk in the training set, respectively. In the validation set, the model showed moderate predictive discrimination, with AUCs of 0.668, 0.678, and 0.685 for the 1-, 3-, and 5-year lung cancer risk.

    Conclusions: We developed and validated a simple and non-invasive lung cancer risk model for non-smokers. This model can be applied to identify and triage non-smoking patients at high risk of developing lung cancer.

  10. ears-reverb-dataset-validation

    • huggingface.co
    Updated Sep 6, 2025
    + more versions
    Cite
    Amayas (2025). ears-reverb-dataset-validation [Dataset]. https://huggingface.co/datasets/Amayas/ears-reverb-dataset-validation
    Explore at:
    Dataset updated
    Sep 6, 2025
    Authors
    Amayas
    Description

    EARS-Reverb_v2 Dataset Card

      Overview
    

    EARS-Reverb_v2 is a large-scale dataset designed for speech enhancement and dereverberation research. It contains reverberant speech data generated as the output of the code from the ears_benchmark repository. The dataset is intended for training and validation purposes and does not include a test set.

      Dataset Structure
    

    validation/: Contains the validation data.
    validation.csv: Metadata for the validation set.

    There is no… See the full description on the dataset page: https://huggingface.co/datasets/Amayas/ears-reverb-dataset-validation.

  11. Validation Data for Google Landmark 2021

    • kaggle.com
    zip
    Updated Sep 24, 2021
    Cite
    takedarts (2021). Validation Data for Google Landmark 2021 [Dataset]. https://www.kaggle.com/datasets/takedarts/google-landmark-2021-validation
    Explore at:
    Available download formats: zip (12159866744 bytes)
    Dataset updated
    Sep 24, 2021
    Authors
    takedarts
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What is this dataset?

    This is a validation dataset for Google Landmark Recognition 2021 (GLRec2021). It might also be usable as validation data for Google Landmark Retrieval 2021.

    This dataset is imported from Google Landmarks Dataset v2 (GLDv2). The images are the test images in GLDv2, and the label file is a simplified version of recognition_solution_v2.1.csv. In order to use this dataset in GLRec2021, the label file is modified in the same manner as train.csv of GLRec2021, but labels of non-landmark images are marked as -1. In addition, records which are not related to any landmark in train.csv are removed.

    The details of the imported dataset (GLDv2) are described in the following paper: "Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval", T. Weyand*, A. Araujo*, B. Cao, J. Sim, Proc. CVPR'20.

    License (IMPORTANT)

    The license complies with the license of GLDv2. Check the GLDv2 repository.

    Model file

    This dataset contains the model files trained on the GLRec2021 training dataset. The model has a ResNet-34 as a backbone CNN and a head module for extracting image features. This model is included for use in the code of GLRec2021, but the model file can also be loaded directly as follows.

    import torch  # needed to deserialize the TorchScript model
    model = torch.jit.load(path_to_the_model_file)
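
    A hedged usage sketch follows; the 3x224x224 input shape is a plausible guess for a ResNet-34 backbone, not a confirmed specification, and the file path is hypothetical:

    import torch

    model = torch.jit.load("path/to/model.pt").eval()
    with torch.no_grad():
        features = model(torch.randn(1, 3, 224, 224))  # dummy image batch (shape assumed)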
    
  12. Food-5K image dataset

    • kaggle.com
    zip
    Updated Nov 30, 2019
    Cite
    Aleksandr Antonov (2019). Food-5K image dataset [Dataset]. https://www.kaggle.com/datasets/trolukovich/food5k-image-dataset/code
    Explore at:
    Available download formats: zip (446963301 bytes)
    Dataset updated
    Nov 30, 2019
    Authors
    Aleksandr Antonov
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    This dataset contains food and non-food images.

    It is divided into 3 sets: train, validation and evaluation.

    Each set contains 2 categories, food and non_food, each with 500 images.

    The dataset was taken from the official source; the only difference is that I divided the images by category in each set (train, validation and evaluation) to make the model training process more convenient.
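
    Given the folder layout described above, each split can be loaded with torchvision's ImageFolder; the exact directory names inside the archive are assumptions:

    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    train_ds = datasets.ImageFolder("food5k/training", transform=tfm)  # paths assumed
    val_ds = datasets.ImageFolder("food5k/validation", transform=tfm)
    eval_ds = datasets.ImageFolder("food5k/evaluation", transform=tfm)
    print(train_ds.class_to_idx)  # e.g. {'food': 0, 'non_food': 1}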

  13. Segment interpretation and validation data - Land Cover Mapping, North Slope of the Arctic National Wildlife Refuge, Alaska, 2019

    • catalog.data.gov
    Updated Nov 25, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Fish and Wildlife Service (2025). Segment interpretation and validation data - Land Cover Mapping, North Slope of the Arctic National Wildlife Refuge, Alaska, 2019 [Dataset]. https://catalog.data.gov/dataset/segment-interpretation-and-validation-data-land-cover-mapping-north-slope-of-the-arctic-na
    Explore at:
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    U.S. Fish and Wildlife Service
    Area covered
    Arctic, North Slope Borough, Arctic National Wildlife Refuge, Alaska
    Description

    The field data and WorldView imagery were leveraged to generate an extensive set of segments labeled with land cover class. These segment interpretations provided the training and validation data for the mapping. Analysts reviewed each aerial and ground plot from the 2019 field survey, examining the plot center and training polygon over the WorldView mosaic, and reviewing field photos, cover estimates, and notes. For each plot, one image segment was identified as the primary example of the vegetation type of the plot (unless there was no suitable example segment, as in cases when a ground plot was targeting a small but distinct vegetation patch that was not captured in the image segmentation). Usually, the primary segment included or was close to the nominal plot center, but this was not always the case, since the target area for the aerial plots could encompass several segments. After identifying a primary segment, the analyst also identified a set of 0–15 secondary segments that were good examples of the same vegetation type. This assessment was informed by field experience, review of field photos of the landscape setting, and photo-interpretation of the WorldView mosaic. An additional set of auxiliary segments was identified and assigned to a land cover class. The first set of auxiliary segments was assigned to non-vegetated classes such as lakes, ponds, ocean, barrens, and snowfields or aufeis. While a limited effort was expended to sample such classes during field work, we knew that these would be readily identifiable with high confidence from the WorldView imagery and so focused the field sampling on vegetated classes. Later, after reviewing preliminary models and receiving feedback from Janet Jorgenson (retired plant ecologist for the Arctic Refuge), we added additional auxiliary segments for vegetated classes based on expert photo interpretation. These were designed to provide the model with additional training data to define the breakpoints between similar classes. Land cover classes were assigned to all of the primary, secondary, and auxiliary segments. 20% of the segments were randomly selected to be withheld from model training. The final model was validated using the reserved validation segment interpretation points (20% of the full set); these segments were not used to develop the model. The map class was extracted from the final land cover map for each validation point. A confusion matrix, overall accuracy metrics, and per-class performance metrics were calculated from the validation data.
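
    The final accuracy assessment described above follows a standard pattern; a generic sketch (not the authors' code) with assumed file and column names:

    import pandas as pd
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    val = pd.read_csv("validation_segments.csv")  # the 20% withheld segments (file assumed)
    y_true = val["interpreted_class"]             # analyst-assigned land cover class
    y_pred = val["mapped_class"]                  # class extracted from the final map

    print(confusion_matrix(y_true, y_pred))
    print("overall accuracy:", accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))  # per-class performance metrics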

  14. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points. A loading sketch using these file names appears at the end of this entry.

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
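
    The loading sketch referenced above, using the file names listed in the structure section; the target column name ("label") is an assumption:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    train = pd.read_csv("train_data.csv")
    val = pd.read_csv("validation_data.csv")
    test = pd.read_csv("test_data.csv")

    clf = RandomForestClassifier(random_state=0)
    clf.fit(train.drop(columns="label"), train["label"])
    # Tune against the validation split, then report once on the held-out test set.
    print("val acc:", accuracy_score(val["label"], clf.predict(val.drop(columns="label"))))
    print("test acc:", accuracy_score(test["label"], clf.predict(test.drop(columns="label"))))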

  15. Model weights and training, validation, and test set images and masks for "Uncovering a million small dams in Brazil using deep learning"

    • zenodo.org
    Updated Feb 28, 2025
    Cite
    Kylen Solvik; Yaffa Truelove; Jennifer Balch; Michael Lathuilliere; Thiago Fontenelle; Andrea Castanho; Michael Coe; Christina Shintani; Carlos Souza Jr; Marcia Nunes Macedo (2025). Model weights and training, validation, and test set images and masks for "Uncovering a million small dams in Brazil using deep learning" [Dataset]. http://doi.org/10.5281/zenodo.14927197
    Explore at:
    Dataset updated
    Feb 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kylen Solvik; Yaffa Truelove; Jennifer Balch; Michael Lathuilliere; Thiago Fontenelle; Andrea Castanho; Michael Coe; Christina Shintani; Carlos Souza Jr; Marcia Nunes Macedo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated masks and Sentinel-1/-2 images split into training, validation, and test sets, used for training a convolutional neural network for small reservoir mapping.

    - manet_sentinel.ckpt: PyTorch model checkpoint file containing model weights.

    - annotations.zip: Contains binary reservoir masks (0 is non-reservoir, 1 is reservoir) split into training, validation, and test sets.

    - images.zip: Contains Sentinel-1/-2 images split into training, validation, and test sets with the following bands:

    1. Blue
    2. Green
    3. Red
    4. Near-infrared
    5. Sentinel-1 SAR VV
    6. Sentinel-1 SAR VH
    7. NDVI
    8. NDWI
    9. Gao's NDWI
    10. MNDWI
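
    A hedged sketch for reading one image/mask pair; the archive's internal paths are assumptions, and rasterio is one common choice for multi-band GeoTIFFs:

    import rasterio

    with rasterio.open("images/train/example.tif") as src:  # path assumed
        image = src.read()  # shape (10, height, width), bands ordered as listed above
    with rasterio.open("annotations/train/example.tif") as src:
        mask = src.read(1)  # single band: 0 = non-reservoir, 1 = reservoir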

  16. Data from: Development, validation and integration of in silico models to identify androgen active chemicals

    • catalog.data.gov
    Updated Sep 1, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Development, validation and integration of in silico models to identify androgen active chemicals [Dataset]. https://catalog.data.gov/dataset/development-validation-and-integration-of-in-silico-models-to-identify-androgen-active-che
    Explore at:
    Dataset updated
    Sep 1, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    A diverse data set of 1667 chemicals with AR experimental activity was provided by the U.S. EPA from the Toxicity Forecaster (ToxCast) program, which generates data using in vitro high-throughput screening (HTS) assays measuring activity of chemicals at multiple points along the androgen receptor (AR) activity pathway. The Endocrine Disruptor Knowledgebase (EDKB) androgen receptor (AR) binding data set (Fang et al., 2003) was downloaded from the FDA website and was produced expressly as a training set designed for developing predictive models. The data is based on a validated assay using recombinant AR. The dataset contains 146 AR binders and 56 non-AR binders. These training set chemicals were selected for both chemical structure diversity and range of activity, both of which are essential to develop robust QSAR and other models (Perkins, 2003). This dataset is associated with the following publication: Manganelli, S., A. Roncaglioni, K. Mansouri, R. Judson, E. Benfenati, A. Manganaro, and P. Ruiz. Development, validation and integration of in silico models to identify androgen active chemicals. CHEMOSPHERE. Elsevier Science Ltd, New York, NY, USA, 220: 204-215, (2019).

  17. DataSheet_1_Construction and Validation of a Lung Cancer Risk Prediction Model for Non-Smokers in China.docx

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 6, 2023
    Cite
    Lan-Wei Guo; Zhang-Yan Lyu; Qing-Cheng Meng; Li-Yang Zheng; Qiong Chen; Yin Liu; Hui-Fang Xu; Rui-Hua Kang; Lu-Yao Zhang; Xiao-Qin Cao; Shu-Zheng Liu; Xi-Bin Sun; Jian-Gong Zhang; Shao-Kai Zhang (2023). DataSheet_1_Construction and Validation of a Lung Cancer Risk Prediction Model for Non-Smokers in China.docx [Dataset]. http://doi.org/10.3389/fonc.2021.766939.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Lan-Wei Guo; Zhang-Yan Lyu; Qing-Cheng Meng; Li-Yang Zheng; Qiong Chen; Yin Liu; Hui-Fang Xu; Rui-Hua Kang; Lu-Yao Zhang; Xiao-Qin Cao; Shu-Zheng Liu; Xi-Bin Sun; Jian-Gong Zhang; Shao-Kai Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: About 15% of lung cancers in men and 53% in women are not attributable to smoking worldwide. The aim was to develop and validate a simple and non-invasive model which could assess and stratify lung cancer risk in non-smokers in China.

    Methods: A large-sample-size, population-based study was conducted under the framework of the Cancer Screening Program in Urban China (CanSPUC). Data on lung cancer screening in Henan province, China, from October 2013 to October 2019 were used and randomly divided into the training and validation sets. Related risk factors were identified through multivariable Cox regression analysis, followed by establishment of a risk prediction nomogram. Discrimination [area under the curve (AUC)] and calibration of the risk prediction nomogram were assessed in the training set and then confirmed in the validation set.

    Results: A total of 214,764 eligible subjects were included, with a mean age of 55.19 years. Subjects were randomly divided into the training (107,382) and validation (107,382) sets. Older age, being male, a low education level, family history of lung cancer, history of tuberculosis, and absence of a history of hyperlipidemia were the independent risk factors for lung cancer. Using these six variables, we plotted 1-year, 3-year, and 5-year lung cancer risk prediction nomograms. The AUC was 0.753, 0.752, and 0.755 for the 1-, 3-, and 5-year lung cancer risk in the training set, respectively. In the validation set, the model showed moderate predictive discrimination, with AUCs of 0.668, 0.678, and 0.685 for the 1-, 3-, and 5-year lung cancer risk.

    Conclusions: We developed and validated a simple and non-invasive lung cancer risk model for non-smokers. This model can be applied to identify and triage non-smoking patients at high risk of developing lung cancer.

  18. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
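
    The capped sampling and random split described above can be sketched as follows; the metadata file, column names, and split fractions are assumptions (this is not USGS code):

    import pandas as pd

    meta = pd.read_csv("nabat_files.csv")  # columns: file, species, grid_cell (assumed)
    # Cap recordings per species/grid-cell combination at 1,250.
    capped = meta.groupby(["species", "grid_cell"], group_keys=False).apply(
        lambda g: g.sample(n=min(len(g), 1250), random_state=0)
    )
    # Shuffle, then split into training, validation, and test (holdout) sets.
    shuffled = capped.sample(frac=1.0, random_state=0)
    n = len(shuffled)
    train = shuffled.iloc[: int(0.8 * n)]             # fractions assumed
    val = shuffled.iloc[int(0.8 * n) : int(0.9 * n)]
    test = shuffled.iloc[int(0.9 * n) :]              # excluded from this data release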

  19. Table_1_Development and Validation of Predictive Model—HASBLAD Score—For Major Adverse Cardiovascular Events During Perioperative Period of Non-cardiac Surgery: A Single Center Experience in China.docx

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated May 9, 2022
    Cite
    Xu, Weixian; Li, Yanguang; Shang, Zhi; Xu, Yuan; Gao, Wei; Zeng, Lin; Zu, Lingyun; Wu, Cencen; Fan, Yuanyuan; Xu, Mao; Cai, Jiageng; Zhao, Menglin; Cai, Hong (2022). Table_1_Development and Validation of Predictive Model—HASBLAD Score—For Major Adverse Cardiovascular Events During Perioperative Period of Non-cardiac Surgery: A Single Center Experience in China.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000355798
    Explore at:
    Dataset updated
    May 9, 2022
    Authors
    Xu, Weixian; Li, Yanguang; Shang, Zhi; Xu, Yuan; Gao, Wei; Zeng, Lin; Zu, Lingyun; Wu, Cencen; Fan, Yuanyuan; Xu, Mao; Cai, Jiageng; Zhao, Menglin; Cai, Hong
    Description

    Background: Major adverse cardiovascular events (MACEs) are a significant cause of morbidity and mortality during the perioperative period of non-cardiac surgery. The prevention of perioperative MACEs has always been one of the hotspots in this research field. Existing models have not been validated in a Chinese population and have become increasingly unable to meet current clinical needs.

    Objectives: To establish and validate several simple bedside tools for predicting MACEs during the perioperative period of non-cardiac surgery in Chinese hospitalized patients.

    Design: We used a nested case-control study to establish our prediction models. A nomogram along with a risk score were developed using logistic regression analysis. An internal cohort was used to evaluate the discrimination and calibration of these predictive models, including the revised cardiac risk index (RCRI) score recommended by current guidelines.

    Setting: Peking University Third Hospital between January 2010 and December 2020.

    Patients: Two hundred and fifty-three patients with MACEs and 1,012 patients without were included in the training set from January 2010 to December 2019, while 38,897 patients were included in the validation set from January 2020 to December 2020, of whom 112 patients had MACEs.

    Main Outcome Measures: The MACEs included the composite outcomes of cardiac death, non-fatal myocardial infarction, non-fatal congestive cardiac failure or hemodynamically significant ventricular arrhythmia, and Takotsubo cardiomyopathy.

    Results: Seven predictors, including Hemoglobin, CARDIAC diseases, Aspartate aminotransferase (AST), high Blood pressure, Leukocyte count, general Anesthesia, and Diabetes mellitus (HASBLAD), were selected in the final model. The nomogram and HASBLAD score both achieved satisfactory prediction performance in the training set (C statistic, 0.781 vs. 0.768) and the validation set (C statistic, 0.865 vs. 0.843). Good calibration was observed for the probability of MACEs in the training set and the validation set. Both predictive models showed excellent discrimination and performed better than the RCRI in the validation set (C statistic, 0.660, P < 0.05 vs. nomogram and HASBLAD score).

    Conclusion: The nomogram and HASBLAD score could be useful bedside tools for predicting perioperative MACEs of non-cardiac surgery in Chinese hospitalized patients.

  20. Data from: Rangeland Condition Monitoring Assessment and Projection (RCMAP) Independent Validation Data

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 27, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Rangeland Condition Monitoring Assessment and Projection (RCMAP) Independent Validation Data [Dataset]. https://catalog.data.gov/dataset/rangeland-condition-monitoring-assessment-and-projection-rcmap-independent-validation-data
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Rangeland ecosystems provide critical wildlife habitat (e.g., greater sage grouse, pronghorn, black-footed ferret), forage for livestock, carbon sequestration, provision of water resources, and recreational opportunities. At the same time, rangelands are vulnerable to climate change, fire, and anthropogenic disturbances. The arid-semiarid climate in most rangelands fluctuates widely, impacting livestock forage availability, wildlife habitat, and water resources. Many of these changes can be subtle or evolve over long time periods, responding to climate, anthropogenic, and disturbance driving forces. To understand vegetation change, scientists from the USGS and Bureau of Land Management (BLM) developed the Rangeland Condition Monitoring Assessment and Projection (RCMAP) project. RCMAP provides robust, long-term, and floristically detailed maps of vegetation cover at yearly time-steps, a critical reference for advancing science in the BLM and assessing Landscape Health standards. RCMAP quantifies the percent cover of ten rangeland components (annual herbaceous, bare ground, herbaceous, litter, non-sagebrush shrub, perennial herbaceous, sagebrush, shrub, and tree cover, plus shrub height) at yearly time-steps across the western U.S. using field training data, Landsat imagery, and machine learning. We utilize an ecologically comprehensive series of field-trained, high-resolution predictions of component cover and BLM Analysis Inventory and Monitoring (AIM) data to train machine learning models predicting component cover over the Landsat time-series. This dataset enables retrospective analysis of vegetation condition, impacts of weather variation and longer-term climatic change, and understanding of the effectiveness of vegetation treatments and altered management practices. RCMAP data can be used to answer critical questions regarding the influence of climate change and the suitability of management practices. Component products can be downloaded at https://www.mrlc.gov/data. Independent validation was our primary validation approach, consisting of field measurements of component cover at stratified-random locations. Independent validation point placement used a stratified random design, with two levels of stratified restrictions to simplify the logistics of field sampling (Rigge et al. 2020, Xian et al. 2015). The first level of stratification randomly selected 15 sites, each 8 km in diameter, across each mapping region. First-level sites excluded areas less than 30 km away from training sites and other validation sites. The second-level stratification randomly placed 6–10 points within each 8 km diameter validation site (total n = 2,014 points at n = 229 sites). Only sites on public land, between 100 and 1000 m from the nearest road, and in rangeland vegetation cover within each site were considered. The random points within a site were evenly allocated to three NDVI thresholds from a leaf-on Landsat image (low, medium, and high). Sites with relatively high spatial variance within a 90 m by 90 m patch (3 × 3 Landsat pixels) were excluded to minimize plot-pixel locational error. Using NDVI as a stratum ensured plot locations were distributed across the range of validation site productivity. At each validation point, we measured component cover using the line-point intercept method along two 30 m transects. Data were collected from the first-hit perspective.
