100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
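
    The feature-selection leakage described above is easy to reproduce. The following is a minimal, hypothetical sketch (not the authors' simulation code) using scikit-learn on pure-noise data, where the honest accuracy estimate is chance level (~0.5):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    # Small-sample, high-dimensional data with random labels (no real signal).
    rng = np.random.RandomState(0)
    X = rng.randn(40, 1000)
    y = rng.randint(0, 2, 40)

    # Biased: feature selection sees the pooled data before CV is run.
    X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
    biased = cross_val_score(SVC(), X_sel, y, cv=5).mean()

    # Unbiased: selection is refit inside each training fold only.
    pipe = make_pipeline(SelectKBest(f_classif, k=10), SVC())
    unbiased = cross_val_score(pipe, X, y, cv=5).mean()

    print(f"pooled selection: {biased:.2f}, in-fold selection: {unbiased:.2f}")

    On pure noise, the pooled-selection estimate typically comes out far above chance, illustrating the optimistic bias the abstract describes.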

  2. happywhale-tfrecords-25val

    • kaggle.com
    zip
    Updated Mar 20, 2022
    Cite
    Rickyinferno (2022). happywhale-tfrecords-25val [Dataset]. https://www.kaggle.com/datasets/runjiali/happywhale-tfrecords-25val/code
    Explore at:
    Available download formats: zip (61907695094 bytes)
    Dataset updated
    Mar 20, 2022
    Authors
    Rickyinferno
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    val_half: contains the 1/4 of ids that have 50% of their pictures in this validation set and 50% in the training set

    val_all: contains the 1/4 of ids whose pictures are not included in the training set

    train: training set

    test: test set
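
    The id-level split above can be sketched as follows; this is a hypothetical reconstruction, and the metadata file and column names (train.csv, image, individual_id) are assumptions:

    import pandas as pd

    df = pd.read_csv("train.csv")  # columns: image, individual_id (assumed)
    ids = df["individual_id"].drop_duplicates().sample(frac=1.0, random_state=0)

    n = len(ids) // 4
    half_ids = set(ids.iloc[:n])      # 1/4 of ids: pictures split 50/50
    all_ids = set(ids.iloc[n:2 * n])  # another 1/4 of ids: all pictures go to val_all

    val_all = df[df["individual_id"].isin(all_ids)]
    half = df[df["individual_id"].isin(half_ids)]
    val_half = half.groupby("individual_id", group_keys=False).sample(frac=0.5, random_state=0)
    train = df.drop(index=val_all.index.union(val_half.index))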

  3. Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0

    • demo-b2find.dkrz.de
    Updated Nov 27, 2020
    + more versions
    Cite
    (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/8f288eb3-f541-5fca-a337-d519f903668f
    Explore at:
    Dataset updated
    Nov 27, 2020
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000–70,000 pairs). Furthermore, sets of ids for each training set are available for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
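
    A hedged sketch of drawing such a stratified validation split from one of the training sets; the file name and the "label" column are assumptions, not taken from the corpus documentation:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    pairs = pd.read_json("computers_train_medium.json.gz", lines=True)  # assumed file
    # A stratified random draw keeps the match/no-match ratio equal in both splits.
    train, val = train_test_split(
        pairs, test_size=0.2, stratify=pairs["label"], random_state=42
    )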

  4. Results on validation set data.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 30, 2021
    Cite
    Rutten, Matthieu; Smits, Henk; Kurstjens, Steef; Çallı, Erdi; Murphy, Keelin; van Ginneken, Bram; Samson, Tijs; Herpers, Robert (2021). Results on validation set data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000918798
    Explore at:
    Dataset updated
    Jul 30, 2021
    Authors
    Rutten, Matthieu; Smits, Henk; Kurstjens, Steef; Çallı, Erdi; Murphy, Keelin; van Ginneken, Bram; Samson, Tijs; Herpers, Robert
    Description

    Five models are trained using various input masking probabilities (IMP). Each resulting model is validated using the heavily masked validation dataset of 13,596 samples (5,668 positive) to evaluate its performance in the context of missing input data. AUC values for the optimal training IMP are shown, along with those achieved with no input masking (NIM). Bold font indicates the highest AUC in the table. Results for other IMP values are provided in the S1 File.
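
    Input masking of this kind can be pictured as independently dropping input values with a given probability. A minimal sketch, assuming simple element-wise zero-masking (the paper's exact scheme may differ):

    import torch

    def mask_inputs(x: torch.Tensor, imp: float) -> torch.Tensor:
        """Zero each input element independently with probability `imp`."""
        keep = torch.rand_like(x) >= imp
        return x * keep

    batch = torch.randn(8, 20)            # 8 samples, 20 features (shapes assumed)
    masked = mask_inputs(batch, imp=0.5)  # heavily masked input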

  5. Feature data of training set and verification set in utLIFE-PC article

    • scidb.cn
    Updated Oct 12, 2024
    Cite
    LOU; Xing Nianzeng (2024). Feature data of training set and verification set in utLIFE-PC article [Dataset]. http://doi.org/10.57760/sciencedb.14508
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 12, 2024
    Dataset provided by
    Science Data Bank
    Authors
    LOU; Xing Nianzeng
    Description

    File description:

    1. train.Mutation_Meth_CNV_data.xls: The feature matrix file used to train the model; includes sample name, point mutation data, methylation data and CNV data. The first column must be the sample name.
    2. train.sample_label.xls: Pathological information of the training set samples, where 1 represents prostate cancer and 0 represents non-prostate cancer.
    3. validation.Mutation_Meth_CNV_data.xls: The feature matrix file used in the validation set; includes sample name, point mutation data, methylation data and CNV data. The first column must be the sample name.
    4. validation.sample_label.xls: Pathological information of the validation set samples, where 1 represents prostate cancer and 0 represents non-prostate cancer.
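
    A minimal loading sketch for the four files above; reading the .xls files as tab-separated text is an assumption (adjust, e.g. to pd.read_excel, if they are true Excel workbooks):

    import pandas as pd

    # First column is the sample name, so it is used as the index.
    X_train = pd.read_csv("train.Mutation_Meth_CNV_data.xls", sep="\t", index_col=0)
    y_train = pd.read_csv("train.sample_label.xls", sep="\t", index_col=0)
    X_val = pd.read_csv("validation.Mutation_Meth_CNV_data.xls", sep="\t", index_col=0)
    y_val = pd.read_csv("validation.sample_label.xls", sep="\t", index_col=0)
    # Labels: 1 = prostate cancer, 0 = non-prostate cancer.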

  6. Rainbow training and validation data

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Nov 26, 2022
    Cite
    Kimberly Carlson (2022). Rainbow training and validation data [Dataset]. http://doi.org/10.7910/DVN/YTRMGN
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 26, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Kimberly Carlson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset includes the date and time, latitude (“lat”), longitude (“lon”), sun angle (“sun_angle”, in degrees [°]), rainbow presence (TRUE = rainbow, FALSE = no rainbow), cloud cover (“cloud_cover”, proportion), and liquid precipitation (“liquid_precip”, kg m⁻² s⁻¹) for each record used to train and/or validate the models.
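
    A hypothetical sketch pairing the listed predictors with rainbow presence; the file name and exact column names are assumptions, and no particular model from the study is implied:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("rainbow_data.csv")  # assumed file name
    X = df[["sun_angle", "cloud_cover", "liquid_precip"]]
    y = df["rainbow"]                     # TRUE = rainbow, FALSE = no rainbow
    model = LogisticRegression(max_iter=1000).fit(X, y)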

  7. Downsized camera trap images for automated classification

    • data.niaid.nih.gov
    Updated Dec 1, 2022
    Cite
    Norman, Danielle L; Wearne, Oliver R; Chapman, Philip M; Heon, Sui P; Ewers, Robert M (2022). Downsized camera trap images for automated classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6627706
    Explore at:
    Dataset updated
    Dec 1, 2022
    Dataset provided by
    Imperial College London
    Authors
    Norman, Danielle L; Wearne, Oliver R; Chapman, Philip M; Heon, Sui P; Ewers, Robert M
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.

    Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions.

    Funding: These data were collected as part of research funded by:

    NERC (NERC QMEE CDT Studentship, NE/P012345/1, http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FP012345%2F1&cookieConsent=A)

    This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but with the additional condition that you acknowledge the contribution of these funders in any outputs.

    XML metadata: GEMINI compliant metadata for this dataset is available here.

    Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip

    CT_image_data_info2.xlsx: This file contains dataset metadata and 1 data table:

    Dataset Images (described in worksheet Dataset_images)

    Description: This worksheet details the composition of each dataset used in the analyses.
    Number of fields: 69
    Number of data rows: 270,287
    Fields:

    filename: Root ID (Field type: id)
    camera_trap_site: Site ID for the camera trap location (Field type: location)
    taxon: Taxon recorded by camera trap (Field type: taxa)
    dist_level: Level of disturbance at site (Field type: ordered categorical)
    baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
    increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
    dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
    dist_combined_event_level_N (N = 1 to 5): Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level N' training or test set, or not included (NA) (Field type: categorical)
    dist_combined_event_level_pair_i_j (all pairs of levels 1-5, from pair_1_2 through pair_4_5): Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels i and j (pair)' training set, or not included (NA) (Field type: categorical)
    dist_combined_event_level_triple_i_j_k (all triples of levels 1-5, from triple_1_2_3 through triple_3_4_5): Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels i, j and k (triple)' training set, or not included (NA) (Field type: categorical)
    dist_combined_event_level_quad_i_j_k_l (all quads of levels 1-5, from quad_1_2_3_4 through quad_2_3_4_5): Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels i, j, k and l (quad)' training set, or not included (NA) (Field type: categorical)
    dist_combined_event_level_all_1_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3, 4 and 5 (all)' training set, or not included (NA) (Field type: categorical)
    dist_camera_level_individ_1: Label as to whether image is included in the 'disturbance level combination analysis split at camera level: disturbance

  8. Validation verification based on prediction performance of those molecules in the training set, test set and outlier set.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Dec 2, 2015
    Cite
    Max K. Leong; Hong-Bin Chen; Yu-Hsuan Shih (2015). Validation verification based on prediction performance of those molecules in the training set, test set and outlier set. [Dataset]. http://doi.org/10.1371/journal.pone.0033829.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 2, 2015
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Max K. Leong; Hong-Bin Chen; Yu-Hsuan Shih
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    †Not applicable.

  9. DataSheet_1_Construction and Validation of a Lung Cancer Risk Prediction Model for Non-Smokers in China.docx

    • datasetcatalog.nlm.nih.gov
    Updated Mar 3, 2022
    Cite
    Xu, Hui-Fang; Chen, Qiong; Lyu, Zhang-Yan; Kang, Rui-Hua; Zhang, Shao-Kai; Zhang, Jian-Gong; Zhang, Lu-Yao; Zheng, Li-Yang; Sun, Xi-Bin; Guo, Lan-Wei; Cao, Xiao-Qin; Liu, Shu-Zheng; Meng, Qing-Cheng; Liu, Yin (2022). DataSheet_1_Construction and Validation of a Lung Cancer Risk Prediction Model for Non-Smokers in China.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000407531
    Explore at:
    Dataset updated
    Mar 3, 2022
    Authors
    Xu, Hui-Fang; Chen, Qiong; Lyu, Zhang-Yan; Kang, Rui-Hua; Zhang, Shao-Kai; Zhang, Jian-Gong; Zhang, Lu-Yao; Zheng, Li-Yang; Sun, Xi-Bin; Guo, Lan-Wei; Cao, Xiao-Qin; Liu, Shu-Zheng; Meng, Qing-Cheng; Liu, Yin
    Area covered
    China
    Description

    Background: About 15% of lung cancers in men and 53% in women are not attributable to smoking worldwide. The aim was to develop and validate a simple and non-invasive model which could assess and stratify lung cancer risk in non-smokers in China.

    Methods: A large-sample-size, population-based study was conducted under the framework of the Cancer Screening Program in Urban China (CanSPUC). Data on lung cancer screening in Henan province, China, from October 2013 to October 2019 were used and randomly divided into the training and validation sets. Related risk factors were identified through multivariable Cox regression analysis, followed by establishment of a risk prediction nomogram. Discrimination [area under the curve (AUC)] and calibration of the risk prediction nomogram were assessed in the training set and then confirmed in the validation set.

    Results: A total of 214,764 eligible subjects were included, with a mean age of 55.19 years. Subjects were randomly divided into the training (107,382) and validation (107,382) sets. Older age, being male, a low education level, family history of lung cancer, history of tuberculosis, and absence of a history of hyperlipidemia were the independent risk factors for lung cancer. Using these six variables, we plotted 1-year, 3-year, and 5-year lung cancer risk prediction nomograms. The AUC was 0.753, 0.752, and 0.755 for the 1-, 3-, and 5-year lung cancer risk in the training set, respectively. In the validation set, the model showed moderate predictive discrimination, with AUCs of 0.668, 0.678, and 0.685 for the 1-, 3-, and 5-year lung cancer risk.

    Conclusions: We developed and validated a simple and non-invasive lung cancer risk model for non-smokers. This model can be applied to identify and triage non-smoking patients at high risk of developing lung cancer.

  10. ears-reverb-dataset-validation

    • huggingface.co
    Updated Sep 6, 2025
    + more versions
    Cite
    Amayas (2025). ears-reverb-dataset-validation [Dataset]. https://huggingface.co/datasets/Amayas/ears-reverb-dataset-validation
    Explore at:
    Dataset updated
    Sep 6, 2025
    Authors
    Amayas
    Description

    EARS-Reverb_v2 Dataset Card

      Overview
    

    EARS-Reverb_v2 is a large-scale dataset designed for speech enhancement and dereverberation research. It contains reverberant speech data generated as the output of the code from the ears_benchmark repository. The dataset is intended for training and validation purposes and does not include a test set.

      Dataset Structure
    

    validation/: Contains the validation data.
    validation.csv: Metadata for the validation set.

    There is no… See the full description on the dataset page: https://huggingface.co/datasets/Amayas/ears-reverb-dataset-validation.

  11. Validation Data for Google Landmark 2021

    • kaggle.com
    zip
    Updated Sep 24, 2021
    Cite
    takedarts (2021). Validation Data for Google Landmark 2021 [Dataset]. https://www.kaggle.com/datasets/takedarts/google-landmark-2021-validation
    Explore at:
    Available download formats: zip (12159866744 bytes)
    Dataset updated
    Sep 24, 2021
    Authors
    takedarts
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What is this dataset?

    This is a validation dataset for Google Landmark Recognition 2021 (GLRec2021). It might also be usable as validation data for Google Landmark Retrieval 2021.

    This dataset is imported from Google Landmarks Dataset v2 (GLDv2). The images are the test images in GLDv2, and the label file is a simplified version of recognition_solution_v2.1.csv. In order to use this dataset in GLRec2021, the label file is modified in the same manner as train.csv of GLRec2021, but labels of non-landmark images are marked as -1. In addition, records which are not related to any landmark in train.csv are removed.

    The details of the imported dataset (GLDv2) are described in the following paper: "Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval", T. Weyand*, A. Araujo*, B. Cao, J. Sim, Proc. CVPR'20.

    License (IMPORTANT)

    The license complies with the license of GLDv2. Check the GLDv2 repository.

    Model file

    This dataset contains the model files trained on the GLRec2021 training dataset. The model has a ResNet-34 as a backbone CNN and a head module for extracting image features. This model is included for use in the code of GLRec2021, but the model file can also be loaded directly as follows.

    import torch  # needed to deserialize the TorchScript model
    model = torch.jit.load(path_to_the_model_file)
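
    A hedged usage sketch follows; the 3x224x224 input shape is a plausible guess for a ResNet-34 backbone, not a confirmed specification, and the file path is hypothetical:

    import torch

    model = torch.jit.load("path/to/model.pt").eval()
    with torch.no_grad():
        features = model(torch.randn(1, 3, 224, 224))  # dummy image batch (shape assumed)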
    
  12. Food-5K image dataset

    • kaggle.com
    zip
    Updated Nov 30, 2019
    Cite
    Aleksandr Antonov (2019). Food-5K image dataset [Dataset]. https://www.kaggle.com/datasets/trolukovich/food5k-image-dataset/code
    Explore at:
    Available download formats: zip (446963301 bytes)
    Dataset updated
    Nov 30, 2019
    Authors
    Aleksandr Antonov
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    This dataset contains food and non-food images.

    It is divided into 3 sets: train, validation and evaluation.

    Each set contains 2 categories, food and non_food, each with 500 images.

    The dataset was taken from the official source; the only difference is that I divided the images by category in each set (train, validation and evaluation) to make the model training process more convenient.
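
    Given the folder layout described above, each split can be loaded with torchvision's ImageFolder; the exact directory names inside the archive are assumptions:

    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    train_ds = datasets.ImageFolder("food5k/training", transform=tfm)  # paths assumed
    val_ds = datasets.ImageFolder("food5k/validation", transform=tfm)
    eval_ds = datasets.ImageFolder("food5k/evaluation", transform=tfm)
    print(train_ds.class_to_idx)  # e.g. {'food': 0, 'non_food': 1}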

  13. Segment interpretation and validation data - Land Cover Mapping, North Slope of the Arctic National Wildlife Refuge, Alaska, 2019

    • catalog.data.gov
    Updated Nov 25, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Fish and Wildlife Service (2025). Segment interpretation and validation data - Land Cover Mapping, North Slope of the Arctic National Wildlife Refuge, Alaska, 2019 [Dataset]. https://catalog.data.gov/dataset/segment-interpretation-and-validation-data-land-cover-mapping-north-slope-of-the-arctic-na
    Explore at:
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    U.S. Fish and Wildlife Service
    Area covered
    Arctic, North Slope Borough, Arctic National Wildlife Refuge, Alaska
    Description

    The field data and WorldView imagery were leveraged to generate an extensive set of segments labeled with land cover class. These segment interpretations provided the training and validation data for the mapping. Analysts reviewed each aerial and ground plot from the 2019 field survey, examining the plot center and training polygon over the WorldView mosaic, and reviewing field photos, cover estimates, and notes. For each plot, one image segment was identified as the primary example of the vegetation type of the plot (unless there was no suitable example segment, as in cases when a ground plot was targeting a small but distinct vegetation patch that was not captured in the image segmentation). Usually, the primary segment included or was close to the nominal plot center, but this was not always the case, since the target area for the aerial plots could encompass several segments. After identifying a primary segment, the analyst also identified a set of 0–15 secondary segments that were good examples of the same vegetation type. This assessment was informed by field experience, review of field photos of the landscape setting, and photo-interpretation of the WorldView mosaic. An additional set of auxiliary segments was identified and assigned to a land cover class. The first set of auxiliary segments was assigned to non-vegetated classes such as lakes, ponds, ocean, barrens, and snowfields or aufeis. While a limited effort was expended to sample such classes during field work, we knew that these would be readily identifiable with high confidence from the WorldView imagery and so focused the field sampling on vegetated classes. Later, after reviewing preliminary models and receiving feedback from Janet Jorgenson (retired plant ecologist for the Arctic Refuge), we added additional auxiliary segments for vegetated classes based on expert photo interpretation. These were designed to provide the model with additional training data to define the breakpoints between similar classes. Land cover classes were assigned to all of the primary, secondary, and auxiliary segments. 20% of the segments were randomly selected to be withheld from model training. The final model was validated using the reserved validation segment interpretation points (20% of the full set); these segments were not used to develop the model. The map class was extracted from the final land cover map for each validation point. A confusion matrix, overall accuracy metrics, and per-class performance metrics were calculated from the validation data.
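
    The final accuracy assessment described above follows a standard pattern; a generic sketch (not the authors' code) with assumed file and column names:

    import pandas as pd
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    val = pd.read_csv("validation_segments.csv")  # the 20% withheld segments (file assumed)
    y_true = val["interpreted_class"]             # analyst-assigned land cover class
    y_pred = val["mapped_class"]                  # class extracted from the final map

    print(confusion_matrix(y_true, y_pred))
    print("overall accuracy:", accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))  # per-class performance metrics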

  14. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points. A loading sketch using these file names appears at the end of this entry.

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
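
    The loading sketch referenced above, using the file names listed in the structure section; the target column name ("label") is an assumption:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    train = pd.read_csv("train_data.csv")
    val = pd.read_csv("validation_data.csv")
    test = pd.read_csv("test_data.csv")

    clf = RandomForestClassifier(random_state=0)
    clf.fit(train.drop(columns="label"), train["label"])
    # Tune against the validation split, then report once on the held-out test set.
    print("val acc:", accuracy_score(val["label"], clf.predict(val.drop(columns="label"))))
    print("test acc:", accuracy_score(test["label"], clf.predict(test.drop(columns="label"))))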

  15. Model weights and training, validation, and test set images and masks for "Uncovering a million small dams in Brazil using deep learning"

    • zenodo.org
    Updated Feb 28, 2025
    Cite
    Kylen Solvik; Yaffa Truelove; Jennifer Balch; Michael Lathuilliere; Thiago Fontenelle; Andrea Castanho; Michael Coe; Christina Shintani; Carlos Souza Jr; Marcia Nunes Macedo (2025). Model weights and training, validation, and test set images and masks for "Uncovering a million small dams in Brazil using deep learning" [Dataset]. http://doi.org/10.5281/zenodo.14927197
    Explore at:
    Dataset updated
    Feb 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kylen Solvik; Yaffa Truelove; Jennifer Balch; Michael Lathuilliere; Thiago Fontenelle; Andrea Castanho; Michael Coe; Christina Shintani; Carlos Souza Jr; Marcia Nunes Macedo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated masks and Sentinel-1/-2 images split into training, validation, and test sets, used for training a convolutional neural network for small reservoir mapping.

    - manet_sentinel.ckpt: PyTorch model checkpoint file containing model weights.

    - annotations.zip: Contains binary reservoir masks (0 is non-reservoir, 1 is reservoir) split into training, validation, and test sets.

    - images.zip: Contains Sentinel-1/-2 images split into training, validation, and test sets with the following bands:

    1. Blue
    2. Green
    3. Red
    4. Near-infrared
    5. Sentinel-1 SAR VV
    6. Sentinel-1 SAR VH
    7. NDVI
    8. NDWI
    9. Gao's NDWI
    10. MNDWI
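
    A hedged sketch for reading one image/mask pair; the archive's internal paths are assumptions, and rasterio is one common choice for multi-band GeoTIFFs:

    import rasterio

    with rasterio.open("images/train/example.tif") as src:  # path assumed
        image = src.read()  # shape (10, height, width), bands ordered as listed above
    with rasterio.open("annotations/train/example.tif") as src:
        mask = src.read(1)  # single band: 0 = non-reservoir, 1 = reservoir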

  16. Data from: Development, validation and integration of in silico models to identify androgen active chemicals

    • catalog.data.gov
    Updated Sep 1, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Development, validation and integration of in silico models to identify androgen active chemicals [Dataset]. https://catalog.data.gov/dataset/development-validation-and-integration-of-in-silico-models-to-identify-androgen-active-che
    Explore at:
    Dataset updated
    Sep 1, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    A diverse data set of 1667 chemicals with AR experimental activity was provided by the U.S. EPA from the Toxicity Forecaster (ToxCast) program, which generates data using in vitro high-throughput screening (HTS) assays measuring activity of chemicals at multiple points along the androgen receptor (AR) activity pathway. The Endocrine Disruptor Knowledgebase (EDKB) androgen receptor (AR) binding data set (Fang et al., 2003) was downloaded from the FDA website and was produced expressly as a training set designed for developing predictive models. The data is based on a validated assay using recombinant AR. The dataset contains 146 AR binders and 56 non-AR binders. These training set chemicals were selected for both chemical structure diversity and range of activity, both of which are essential to develop robust QSAR and other models (Perkins, 2003). This dataset is associated with the following publication: Manganelli, S., A. Roncaglioni, K. Mansouri, R. Judson, E. Benfenati, A. Manganaro, and P. Ruiz. Development, validation and integration of in silico models to identify androgen active chemicals. CHEMOSPHERE. Elsevier Science Ltd, New York, NY, USA, 220: 204-215, (2019).

  17. DataSheet_1_Construction and Validation of a Lung Cancer Risk Prediction Model for Non-Smokers in China.docx

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 6, 2023
    Cite
    Lan-Wei Guo; Zhang-Yan Lyu; Qing-Cheng Meng; Li-Yang Zheng; Qiong Chen; Yin Liu; Hui-Fang Xu; Rui-Hua Kang; Lu-Yao Zhang; Xiao-Qin Cao; Shu-Zheng Liu; Xi-Bin Sun; Jian-Gong Zhang; Shao-Kai Zhang (2023). DataSheet_1_Construction and Validation of a Lung Cancer Risk Prediction Model for Non-Smokers in China.docx [Dataset]. http://doi.org/10.3389/fonc.2021.766939.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Lan-Wei Guo; Zhang-Yan Lyu; Qing-Cheng Meng; Li-Yang Zheng; Qiong Chen; Yin Liu; Hui-Fang Xu; Rui-Hua Kang; Lu-Yao Zhang; Xiao-Qin Cao; Shu-Zheng Liu; Xi-Bin Sun; Jian-Gong Zhang; Shao-Kai Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: About 15% of lung cancers in men and 53% in women are not attributable to smoking worldwide. The aim was to develop and validate a simple and non-invasive model which could assess and stratify lung cancer risk in non-smokers in China.

    Methods: A large-sample-size, population-based study was conducted under the framework of the Cancer Screening Program in Urban China (CanSPUC). Data on lung cancer screening in Henan province, China, from October 2013 to October 2019 were used and randomly divided into the training and validation sets. Related risk factors were identified through multivariable Cox regression analysis, followed by establishment of a risk prediction nomogram. Discrimination [area under the curve (AUC)] and calibration of the risk prediction nomogram were assessed in the training set and then confirmed in the validation set.

    Results: A total of 214,764 eligible subjects were included, with a mean age of 55.19 years. Subjects were randomly divided into the training (107,382) and validation (107,382) sets. Older age, being male, a low education level, family history of lung cancer, history of tuberculosis, and absence of a history of hyperlipidemia were the independent risk factors for lung cancer. Using these six variables, we plotted 1-year, 3-year, and 5-year lung cancer risk prediction nomograms. The AUC was 0.753, 0.752, and 0.755 for the 1-, 3-, and 5-year lung cancer risk in the training set, respectively. In the validation set, the model showed moderate predictive discrimination, with AUCs of 0.668, 0.678, and 0.685 for the 1-, 3-, and 5-year lung cancer risk.

    Conclusions: We developed and validated a simple and non-invasive lung cancer risk model for non-smokers. This model can be applied to identify and triage non-smoking patients at high risk of developing lung cancer.

  18. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
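
    The capped sampling and random split described above can be sketched as follows; the metadata file, column names, and split fractions are assumptions (this is not USGS code):

    import pandas as pd

    meta = pd.read_csv("nabat_files.csv")  # columns: file, species, grid_cell (assumed)
    # Cap recordings per species/grid-cell combination at 1,250.
    capped = meta.groupby(["species", "grid_cell"], group_keys=False).apply(
        lambda g: g.sample(n=min(len(g), 1250), random_state=0)
    )
    # Shuffle, then split into training, validation, and test (holdout) sets.
    shuffled = capped.sample(frac=1.0, random_state=0)
    n = len(shuffled)
    train = shuffled.iloc[: int(0.8 * n)]             # fractions assumed
    val = shuffled.iloc[int(0.8 * n) : int(0.9 * n)]
    test = shuffled.iloc[int(0.9 * n) :]              # excluded from this data release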

  19. Table_1_Development and Validation of Predictive Model—HASBLAD Score—For Major Adverse Cardiovascular Events During Perioperative Period of Non-cardiac Surgery: A Single Center Experience in China.docx

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated May 9, 2022
    Cite
    Xu, Weixian; Li, Yanguang; Shang, Zhi; Xu, Yuan; Gao, Wei; Zeng, Lin; Zu, Lingyun; Wu, Cencen; Fan, Yuanyuan; Xu, Mao; Cai, Jiageng; Zhao, Menglin; Cai, Hong (2022). Table_1_Development and Validation of Predictive Model—HASBLAD Score—For Major Adverse Cardiovascular Events During Perioperative Period of Non-cardiac Surgery: A Single Center Experience in China.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000355798
    Explore at:
    Dataset updated
    May 9, 2022
    Authors
    Xu, Weixian; Li, Yanguang; Shang, Zhi; Xu, Yuan; Gao, Wei; Zeng, Lin; Zu, Lingyun; Wu, Cencen; Fan, Yuanyuan; Xu, Mao; Cai, Jiageng; Zhao, Menglin; Cai, Hong
    Description

    Background: Major adverse cardiovascular events (MACEs) are a significant cause of morbidity and mortality during the perioperative period of non-cardiac surgery. The prevention of perioperative MACEs has always been one of the hotspots in this research field. Existing models have not been validated in a Chinese population and have become increasingly unable to meet current clinical needs.

    Objectives: To establish and validate several simple bedside tools for predicting MACEs during the perioperative period of non-cardiac surgery in Chinese hospitalized patients.

    Design: We used a nested case-control study to establish our prediction models. A nomogram along with a risk score were developed using logistic regression analysis. An internal cohort was used to evaluate the discrimination and calibration of these predictive models, including the revised cardiac risk index (RCRI) score recommended by current guidelines.

    Setting: Peking University Third Hospital between January 2010 and December 2020.

    Patients: Two hundred and fifty-three patients with MACEs and 1,012 patients without were included in the training set from January 2010 to December 2019, while 38,897 patients were included in the validation set from January 2020 to December 2020, of whom 112 patients had MACEs.

    Main Outcome Measures: The MACEs included the composite outcomes of cardiac death, non-fatal myocardial infarction, non-fatal congestive cardiac failure or hemodynamically significant ventricular arrhythmia, and Takotsubo cardiomyopathy.

    Results: Seven predictors, including Hemoglobin, CARDIAC diseases, Aspartate aminotransferase (AST), high Blood pressure, Leukocyte count, general Anesthesia, and Diabetes mellitus (HASBLAD), were selected in the final model. The nomogram and HASBLAD score both achieved satisfactory prediction performance in the training set (C statistic, 0.781 vs. 0.768) and the validation set (C statistic, 0.865 vs. 0.843). Good calibration was observed for the probability of MACEs in the training set and the validation set. Both predictive models showed excellent discrimination and performed better than the RCRI in the validation set (C statistic, 0.660, P < 0.05 vs. nomogram and HASBLAD score).

    Conclusion: The nomogram and HASBLAD score could be useful bedside tools for predicting perioperative MACEs of non-cardiac surgery in Chinese hospitalized patients.

  20. Data from: Rangeland Condition Monitoring Assessment and Projection (RCMAP) Independent Validation Data

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 27, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Rangeland Condition Monitoring Assessment and Projection (RCMAP) Independent Validation Data [Dataset]. https://catalog.data.gov/dataset/rangeland-condition-monitoring-assessment-and-projection-rcmap-independent-validation-data
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Rangeland ecosystems provide critical wildlife habitat (e.g., greater sage grouse, pronghorn, black-footed ferret), forage for livestock, carbon sequestration, provision of water resources, and recreational opportunities. At the same time, rangelands are vulnerable to climate change, fire, and anthropogenic disturbances. The arid-semiarid climate in most rangelands fluctuates widely, impacting livestock forage availability, wildlife habitat, and water resources. Many of these changes can be subtle or evolve over long time periods, responding to climate, anthropogenic, and disturbance driving forces. To understand vegetation change, scientists from the USGS and Bureau of Land Management (BLM) developed the Rangeland Condition Monitoring Assessment and Projection (RCMAP) project. RCMAP provides robust, long-term, and floristically detailed maps of vegetation cover at yearly time-steps, a critical reference for advancing science in the BLM and assessing Landscape Health standards. RCMAP quantifies the percent cover of ten rangeland components (annual herbaceous, bare ground, herbaceous, litter, non-sagebrush shrub, perennial herbaceous, sagebrush, shrub, and tree cover, plus shrub height) at yearly time-steps across the western U.S. using field training data, Landsat imagery, and machine learning. We utilize an ecologically comprehensive series of field-trained, high-resolution predictions of component cover and BLM Analysis Inventory and Monitoring (AIM) data to train machine learning models predicting component cover over the Landsat time-series. This dataset enables retrospective analysis of vegetation condition, impacts of weather variation and longer-term climatic change, and understanding of the effectiveness of vegetation treatments and altered management practices. RCMAP data can be used to answer critical questions regarding the influence of climate change and the suitability of management practices. Component products can be downloaded at https://www.mrlc.gov/data. Independent validation was our primary validation approach, consisting of field measurements of component cover at stratified-random locations. Independent validation point placement used a stratified random design, with two levels of stratified restrictions to simplify the logistics of field sampling (Rigge et al. 2020, Xian et al. 2015). The first level of stratification randomly selected 15 sites, each 8 km in diameter, across each mapping region. First-level sites excluded areas less than 30 km away from training sites and other validation sites. The second-level stratification randomly placed 6–10 points within each 8 km diameter validation site (total n = 2,014 points at n = 229 sites). Only sites on public land, between 100 and 1000 m from the nearest road, and in rangeland vegetation cover within each site were considered. The random points within a site were evenly allocated to three NDVI thresholds from a leaf-on Landsat image (low, medium, and high). Sites with relatively high spatial variance within a 90 m by 90 m patch (3 × 3 Landsat pixels) were excluded to minimize plot-pixel locational error. Using NDVI as a stratum ensured plot locations were distributed across the range of validation site productivity. At each validation point, we measured component cover using the line-point intercept method along two 30 m transects. Data were collected from the first-hit perspective.
