100+ datasets found

Machine learning algorithm validation with a limited sample size
plos.figshare.com
text/x-python
Updated May 30, 2023
Cite
Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
Explore at:
Available download formats: text/x-python
Unique identifier
https://doi.org/10.1371/journal.pone.0224365
Dataset updated
May 30, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies that applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by validation methods that do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes to bias considerably more than parameter tuning. In addition, the contribution to bias of data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
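The feature-selection leakage described in this abstract can be illustrated with a minimal standard-library sketch. It is not the authors' simulation code: the data sizes, the class-mean-difference feature ranking, the nearest-centroid classifier, and every function name below are assumptions chosen for illustration. The data are pure noise, so the true accuracy is 0.5; an estimate well above that would reflect bias from selecting features on pooled training and testing data.

```python
import random
import statistics

def make_noise_data(n_samples, n_features, rng):
    """Pure-noise features with balanced, shuffled labels (true accuracy 0.5)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    rng.shuffle(y)
    return X, y

def top_features(X, y, rows, k):
    """Rank features by absolute class-mean difference, using only `rows`."""
    def score(j):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        return abs(statistics.fmean(a) - statistics.fmean(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def centroid_predict(X, y, train, test, feats):
    """Nearest-centroid classifier fitted on `train`, applied to `test`."""
    def centroid(label):
        rows = [i for i in train if y[i] == label]
        return [statistics.fmean([X[i][j] for i in rows]) for j in feats]
    c0, c1 = centroid(0), centroid(1)
    preds = []
    for i in test:
        d0 = sum((X[i][j] - m) ** 2 for j, m in zip(feats, c0))
        d1 = sum((X[i][j] - m) ** 2 for j, m in zip(feats, c1))
        preds.append(0 if d0 <= d1 else 1)
    return preds

def kfold(n, k, rng):
    """Yield (train, test) index lists for a shuffled k-fold partition."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]

def cv_accuracy(X, y, k_folds, n_keep, leaky, rng):
    """K-fold accuracy; `leaky=True` selects features once on ALL samples."""
    pooled = top_features(X, y, list(range(len(X))), n_keep) if leaky else None
    correct = 0
    for train, test in kfold(len(X), k_folds, rng):
        # Honest variant: features are re-selected from the training fold only.
        feats = pooled if leaky else top_features(X, y, train, n_keep)
        preds = centroid_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / len(X)

# Hypothetical sizes: 40 samples, 500 noise features, 10 features kept, 5 folds.
X, y = make_noise_data(40, 500, random.Random(0))
leaky_acc = cv_accuracy(X, y, 5, 10, True, random.Random(1))
honest_acc = cv_accuracy(X, y, 5, 10, False, random.Random(1))
print("pooled feature selection:", leaky_acc)
print("per-fold feature selection:", honest_acc)
```

Both runs use the same fold assignment, so the only difference between them is where feature selection happens. A full nested CV would add an inner loop for parameter tuning inside each training fold; that part is left out here to keep the sketch short.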
DRIVE Train/Validation Split Dataset
kaggle.com
Updated Feb 19, 2023
Cite
Sovit Ranjan Rath (2023). DRIVE Train/Validation Split Dataset [Dataset]. https://www.kaggle.com/datasets/sovitrath/drive-trainvalidation-split-dataset/code
Dataset updated
Feb 19, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sovit Ranjan Rath
Description
This dataset contains images and masks for Retinal Vessel Extraction (Segmentation). It contains a training and validation split to easily train semantic segmentation models.

The original dataset can be found here => https://www.kaggle.com/datasets/andrewmvd/drive-digital-retinal-images-for-vessel-extraction

This dataset also has an accompanying blog post => Retinal Vessel Segmentation using PyTorch Semantic Segmentation

Split sample numbers:
Training images and masks: 16
Validation images and masks: 4
Test images: 20
ASDiv-train-test
huggingface.co
Updated Nov 3, 2025
Cite
Jeong Seong Cheol (2025). ASDiv-train-test [Dataset]. https://huggingface.co/datasets/lejelly/ASDiv-train-test
Explore at:
Dataset updated
Nov 3, 2025
Authors
Jeong Seong Cheol
Description
ASDiv (train/test 1:9)

This dataset is derived from EleutherAI/asdiv by splitting the original validation split into train and test with a ratio of 1:9.

Source

Original dataset: EleutherAI/asdiv
Link: https://huggingface.co/datasets/EleutherAI/asdiv

License

Inherits the original dataset's license (CC-BY-NC-4.0) unless otherwise noted in this repository.

Splitting Details

Method: datasets.Dataset.train_test_split
Source split: validation
Test… See the full description on the dataset page: https://huggingface.co/datasets/lejelly/ASDiv-train-test.
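A 1:9 train/test split of the kind described here can be sketched with the standard library alone. This is only an illustration of the ratio: the dataset card names `datasets.Dataset.train_test_split` as the actual method, and the record count, seed, and function name below are assumptions rather than details from the repository.

```python
import random

def split_by_ratio(items, test_fraction, seed):
    """Shuffle a copy of `items` and cut it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the source validation split (1000 records).
records = list(range(1000))
train, test = split_by_ratio(records, test_fraction=0.9, seed=0)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when a small training portion is compared across experiments.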
Data from: Web Data Commons Training and Test Sets for Large-Scale Product...
linkagelibrary.icpsr.umich.edu
da-ra.de
Updated Nov 26, 2020
+ more versions
Cite
Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
Explore at:
Unique identifier
https://doi.org/10.3886/E127481V1
Dataset updated
Nov 26, 2020
Dataset provided by
University of Mannheim (Germany)
Authors
Ralph Peeters; Anna Primpeli; Christian Bizer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. To support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). In addition, sets of IDs are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
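A stratified random draw of validation IDs, as mentioned in this description, could look roughly like the sketch below. The pair IDs, the 20% fraction, the seed, and the function name are hypothetical; the description does not state how the published ID sets were generated.

```python
import random
from collections import defaultdict

def stratified_validation_ids(labels_by_id, fraction, seed):
    """Draw the same fraction of IDs from every label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pair_id, label in labels_by_id.items():
        by_label[label].append(pair_id)
    chosen = set()
    for label in sorted(by_label):
        ids = sorted(by_label[label])          # sort for reproducibility
        n_draw = round(len(ids) * fraction)
        chosen.update(rng.sample(ids, n_draw))
    return chosen

# Hypothetical training set: 50 "match" pairs and 150 "no match" pairs.
labels = {f"pair{i}": ("match" if i % 4 == 0 else "no match") for i in range(200)}
val_ids = stratified_validation_ids(labels, fraction=0.2, seed=0)
print(len(val_ids))
```

Drawing per label keeps the match/no-match ratio of the validation set equal to that of the training set, which a plain random draw only achieves on average.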
Data from: Time-Split Cross-Validation as a Method for Estimating the...
acs.figshare.com
txt
Updated Jun 2, 2023
Cite
Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.1021/ci400084k.s001
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Robert P. Sheridan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R²) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R² that is more like that of true prospective prediction than the R² from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
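Time-split selection can be sketched as follows: order the compounds by date and hold out the most recent ones as the test set. The record layout, the 25% test fraction, and the function name are assumptions for illustration, not the procedure or data from the paper.

```python
from datetime import date

def time_split(records, test_fraction):
    """Train on the oldest records, test on the most recent ones."""
    ordered = sorted(records, key=lambda r: r["date"])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]

# Hypothetical compounds, one registered per month over four years.
compounds = [
    {"id": i, "date": date(2010 + i // 12, 1 + i % 12, 1)}
    for i in range(48)
]
train, test = time_split(compounds, test_fraction=0.25)
print(train[-1]["date"], test[0]["date"])
```

Unlike a random split, no test compound predates any training compound, which mimics predicting compounds that did not yet exist when the model was built.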
FER_my_split
kaggle.com
zip
Updated Feb 3, 2021
Cite
Neoklis Masmanidis (2021). FER_my_split [Dataset]. https://www.kaggle.com/neoklismasmanidis/fer-my-split
Explore at:
Available download formats: zip (88489637 bytes)
Dataset updated
Feb 3, 2021
Authors
Neoklis Masmanidis
Description
Dataset

This dataset was created by Neoklis Masmanidis

Contents
1.125s Heart Sound Data TVT Split acc. 98.5%
kaggle.com
zip
Updated Jul 19, 2023
Cite
Raiyun Razeen Kabir (2023). 1.125s Heart Sound Data TVT Split acc. 98.5% [Dataset]. https://www.kaggle.com/datasets/razeen08/1125s-heart-sound-data-tvt-split-acc-985
Explore at:
Available download formats: zip (4217986 bytes)
Dataset updated
Jul 19, 2023
Authors
Raiyun Razeen Kabir
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Train-Validation-Test Split for 1.125sdataforheartsound dataset that achieved 98.5% test accuracy using ResNet34

Load using

    import pickle

    with open('/kaggle/input/1125s-heart-sound-data-tvt-split-acc-985/split_98_5.pkl', 'rb') as f:
        data = pickle.load(f)

    x_train = data['x_train']
    x_test = data['x_test']
    x_val = data['x_val']
    y_train = data['y_train']
    y_test = data['y_test']
    y_val = data['y_val']

Copy and edit the sample notebook tryPickle
Image-dataset-FER-Test,Train,Val
kaggle.com
zip
Updated Oct 8, 2024
Cite
dolly prajapati 182 (2024). Image-dataset-FER-Test,Train,Val [Dataset]. https://www.kaggle.com/datasets/dollyprajapati182/image-dataset-fer-testtrainval/code
Explore at:
Available download formats: zip (248085782 bytes)
Dataset updated
Oct 8, 2024
Authors
dolly prajapati 182
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
This dataset is a split version of the original Image-dataset found here. The dataset consists of 8 emotion classes: angry, contempt, disgust, fear, happiness, neutral, sadness, and surprise.

To facilitate model training and evaluation, I have organized the dataset into three subsets:

Train: Used for training machine learning models.
Test: Used to evaluate model performance after training.
Validation: Used during training to tune hyperparameters and prevent overfitting.

This split allows for more effective usage in tasks such as Facial Emotion Recognition (FER) and other emotion analysis projects.
Data from: Training dataset for NABat Machine Learning V1.0
catalog.data.gov
data.usgs.gov
+1 more
Updated Nov 26, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Nov 26, 2025
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). 
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
Summary of the training and testing data.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Jul 21, 2022
Cite
Meysman, Pieter; Laukens, Kris; Bui-Thi, Danh; Rivière, Emmanuel (2022). Summary of the training and testing data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000250189
Explore at:
Dataset updated
Jul 21, 2022
Authors
Meysman, Pieter; Laukens, Kris; Bui-Thi, Danh; Rivière, Emmanuel
Description
For the small datasets (human and C. elegans), we evaluate the models’ performance using k-fold cross-validation with k = 5. We split the other datasets into three sets: training, validation and testing.
Dataset
figshare.com
application/x-gzip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Moynuddin Ahmed Shibly (2023). Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13577873.v1
Explore at:
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.13577873.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Moynuddin Ahmed Shibly
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is an open-source, publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/ . We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset. We applied 10-fold cross-validation on the train set and created ten folds. We also created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as a feature extractor. On the features extracted by ResNet50 we applied PCA and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above. Those folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The files were compressed to speed up upload and download, and mostly for the sake of convenience. If anyone has any questions about how the datasets are organized, please feel free to ask me at shiblygnr@gmail.com . I will get back to you as soon as possible.
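The "ten bags" step relies on bootstrap sampling, that is, drawing with replacement until each bag is as large as the pool it was drawn from. A minimal sketch follows; the file names, pool size, seed, and function name are hypothetical and not taken from the archive.

```python
import random

def bootstrap_bags(items, n_bags, seed):
    """Return `n_bags` samples of len(items), each drawn with replacement."""
    rng = random.Random(seed)
    n = len(items)
    return [[items[rng.randrange(n)] for _ in range(n)] for _ in range(n_bags)]

# Hypothetical pool standing in for the combined train and validation images.
pool = [f"img_{i:03d}.png" for i in range(50)]
bags = bootstrap_bags(pool, n_bags=10, seed=0)
print(len(bags), len(bags[0]), len(set(bags[0])))
```

Because sampling is with replacement, each bag usually contains repeated files and omits others, which is what makes the bagged models differ from one another.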
Global Wheat Detection 2020 Train/Valid/Test Split
kaggle.com
zip
Updated Dec 26, 2022
Cite
Sovit Ranjan Rath (2022). Global Wheat Detection 2020 Train/Valid/Test Split [Dataset]. https://www.kaggle.com/datasets/sovitrath/global-wheat-detection-2021-trainvalidtest-split
Explore at:
Available download formats: zip (630636820 bytes)
Dataset updated
Dec 26, 2022
Authors
Sovit Ranjan Rath
Description
This is the Global Wheat Detection dataset with train, validation, and test split. The labels are in XML format. The training and validation sets were created randomly. The test folder only contains a few images as per the original dataset.

Acknowledgment: https://www.kaggle.com/competitions/global-wheat-detection
ccisd-teks-alignment-split
huggingface.co
Updated Nov 9, 2025
Cite
Ryan Robson (2025). ccisd-teks-alignment-split [Dataset]. https://huggingface.co/datasets/robworks-software/ccisd-teks-alignment-split
Explore at:
Dataset updated
Nov 9, 2025
Authors
Ryan Robson
Area covered
Clear Creek Independent School District
Description
CCISD TEKS Alignment Dataset (3-Way Split)

Dataset Description

This dataset contains the alignment between Clear Creek ISD (CCISD) curriculum and Texas Essential Knowledge and Skills (TEKS) standards. The dataset is split into training, validation, and test sets for machine learning applications.

Dataset Summary

Total Records: 428 TEKS standards
Train Split: 299 records (69.9%)
Validation Split: 64 records (15.0%)
Test Split: 65 records (15.2%)
Subject Areas:… See the full description on the dataset page: https://huggingface.co/datasets/robworks-software/ccisd-teks-alignment-split.
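The reported percentages can be checked from the record counts alone. The short sketch below uses only the numbers quoted in the summary; the variable names are illustrative.

```python
# Record counts quoted in the dataset summary.
splits = {"train": 299, "validation": 64, "test": 65}
total = sum(splits.values())

# Percentage of the total per split, rounded to one decimal place.
percentages = {name: round(100 * count / total, 1) for name, count in splits.items()}
print(total)        # 428
print(percentages)  # {'train': 69.9, 'validation': 15.0, 'test': 15.2}
```

The rounded shares add up to 100.1% rather than 100%; that is a rounding effect, and the counts themselves sum exactly to the 428 total records.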
FAIR Dataset for Disease Prediction in Healthcare Applications
test.researchdata.tuwien.ac.at
bin, csv, json, png
Updated Apr 14, 2025
Cite
Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Explore at:
Available download formats: csv, json, bin, png
Unique identifier
https://doi.org/10.70124/5n77a-dnf02
Dataset updated
Apr 14, 2025
Dataset provided by
TU Wien
Authors
Sufyan Yousaf
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

Context and Methodology

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

Technical Details

Structure of the Dataset:
The dataset consists of several files organized into folders by data type:

Training Data: Contains the training dataset used to train the machine learning model.

Validation Data: Used for hyperparameter tuning and model selection.

Test Data: Reserved for final model evaluation.

Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter, together with tools like:

Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

Further Details

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Datasets used in the study.
figshare.com
xls
Updated Dec 6, 2023
+ more versions
Cite
Erik Bergman; Luise Dürlich; Veronica Arthurson; Anders Sundström; Maria Larsson; Shamima Bhuiyan; Andreas Jakobsson; Gabriel Westman (2023). Datasets used in the study. [Dataset]. http://doi.org/10.1371/journal.pdig.0000409.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pdig.0000409.t001
Dataset updated
Dec 6, 2023
Dataset provided by
PLOS Digital Health
Authors
Erik Bergman; Luise Dürlich; Veronica Arthurson; Anders Sundström; Maria Larsson; Shamima Bhuiyan; Andreas Jakobsson; Gabriel Westman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Post-marketing reports of suspected adverse drug reactions are important for establishing the safety profile of a medicinal product. However, a high influx of reports poses a challenge for regulatory authorities as a delay in identification of previously unknown adverse drug reactions can potentially be harmful to patients. In this study, we use natural language processing (NLP) to predict whether a report is of serious nature based solely on the free-text fields and adverse event terms in the report, potentially allowing reports mislabelled at time of reporting to be detected and prioritized for assessment. We consider four different NLP models at various levels of complexity, bootstrap their train-validation data split to eliminate random effects in the performance estimates and conduct prospective testing to avoid the risk of data leakage. Using a Swedish BERT based language model, continued language pre-training and final classification training, we achieve close to human-level performance in this task. Model architectures based on less complex technical foundation such as bag-of-words approaches and LSTM neural networks trained with random initiation of weights appear to perform less well, likely due to the lack of robustness that a base of general language training provides.
Data Cleaning, Translation & Split of the Dataset for the Automatic...
data.niaid.nih.gov
zenodo.org
Updated Aug 8, 2022
Cite
Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
Explore at:
Dataset updated
Aug 8, 2022
Authors
Köhler, Juliane
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.
ChilliLeafDataset_TrainValTest
kaggle.com
zip
Updated Oct 1, 2025
Cite
Taibur Rahaman (2025). ChilliLeafDataset_TrainValTest [Dataset]. https://www.kaggle.com/datasets/taiburrahaman/chillileafdataset-trainvaltest
Explore at:
Available download formats: zip (43069840 bytes)
Dataset updated
Oct 1, 2025
Authors
Taibur Rahaman
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by Taibur Rahaman

Released under Apache 2.0

Contents
world-survey-train
huggingface.co
Updated Sep 27, 2025
Cite
Maksim Zubok (2025). world-survey-train [Dataset]. https://huggingface.co/datasets/antndlcrx/world-survey-train
Explore at:
Dataset updated
Sep 27, 2025
Authors
Maksim Zubok
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
WVS Snippet — Train Split Only

This dataset provides the training split of the curated and limited snippet of World Values Survey (WVS) data, prepared for experiments with large language models as in silico survey respondents.

Contents

Split: train only (validation and test splits will be released later). Columns:
respondent_id
feature questions (Q1…Q280, depending on cleaning)
target questions (prefixed t_, e.g. t_Q23)

Missing values

Missing:… See the full description on the dataset page: https://huggingface.co/datasets/antndlcrx/world-survey-train.
Model weights and training, validation, and test set images and masks for...
zenodo.org
Updated Feb 28, 2025
Cite
Kylen Solvik; Yaffa Truelove; Jennifer Balch; Michael Lathuilliere; Thiago Fontenelle; Andrea Castanho; Michael Coe; Christina Shintani; Carlos Souza Jr; Marcia Nunes Macedo (2025). Model weights and training, validation, and test set images and masks for "Uncovering a million small dams in Brazil using deep learning" [Dataset]. http://doi.org/10.5281/zenodo.14927197
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.14927197
Dataset updated
Feb 28, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Kylen Solvik; Yaffa Truelove; Jennifer Balch; Michael Lathuilliere; Thiago Fontenelle; Andrea Castanho; Michael Coe; Christina Shintani; Carlos Souza Jr; Marcia Nunes Macedo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Annotated masks and Sentinel-1/-2 images split into training, validation, and test sets. Used for training convolutional neural network for small reservoir mapping.

- manet_sentinel.ckpt: PyTorch model checkpoint file containing model weights.

- annotations.zip: Contains binary reservoir masks (0 is non-reservoir, 1 is reservoir) split into training, validation, and test sets.

- images.zip: Contains Sentinel-1/-2 images split into training, validation, and test sets with the following bands:

Blue

Green

Red

Near-infrared

Sentinel-1 SAR VV

Sentinel-1 SAR VH

NDVI

NDWI

Gao's NDWI

MNDWI
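The four index bands in this list are all normalized differences of two reflectance bands. The sketch below uses the standard published definitions (NDVI, McFeeters' NDWI, Gao's NDWI, and MNDWI); how the dataset authors actually computed their bands is not stated here. The short-wave infrared input `swir` is an assumption, since that band is not among the listed image bands, and the function names and sample values are illustrative.

```python
def normalized_difference(a, b):
    """(a - b) / (a + b), returning 0.0 where the denominator is zero."""
    total = a + b
    return 0.0 if total == 0 else (a - b) / total

def spectral_indices(green, red, nir, swir):
    """Standard index definitions from per-pixel reflectance values."""
    return {
        "NDVI": normalized_difference(nir, red),        # vegetation
        "NDWI": normalized_difference(green, nir),      # McFeeters, open water
        "Gao NDWI": normalized_difference(nir, swir),   # Gao, vegetation water
        "MNDWI": normalized_difference(green, swir),    # modified NDWI
    }

# Hypothetical reflectance values for a single water-like pixel.
pixel = spectral_indices(green=0.5, red=0.25, nir=0.25, swir=0.0)
print(pixel)
```

Each index lies between -1 and 1, which puts bands with very different raw ranges on a comparable scale before they are stacked as model inputs.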
wiki_paragraphs_english
huggingface.co
Updated Jan 1, 2023
+ more versions
Cite
Per Kummervold (2023). wiki_paragraphs_english [Dataset]. https://huggingface.co/datasets/pere/wiki_paragraphs_english
Dataset updated
Jan 1, 2023
Authors
Per Kummervold
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
WIKI Paragraphs English

A multi-split dataset for machine learning research and evaluation, containing text samples in JSON Lines format.

Features

Multiple splits for different use cases
Random shuffle with Fisher-Yates algorithm
Structured format with text and metadata
Size-varied validation/test sets (100 to 10k samples)

Splits Overview

Split Name Samples Typical Usage

train 1,000,000 Primary training data

validation 10,000 Standard validation… See the full description on the dataset page: https://huggingface.co/datasets/pere/wiki_paragraphs_english.
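The Fisher-Yates shuffle named in the feature list can be sketched in a few lines. This is a generic textbook version, not the code used to build the dataset; the function name, the seed, and the sample input are assumptions.

```python
import random

def fisher_yates_shuffle(items, rng):
    """Return a shuffled copy in which every permutation is equally likely."""
    out = list(items)
    # Walk from the last position down, swapping each element with a
    # uniformly chosen element at or before it.
    for i in range(len(out) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        out[i], out[j] = out[j], out[i]
    return out

paragraphs = list(range(20))  # hypothetical stand-in for text samples
shuffled = fisher_yates_shuffle(paragraphs, random.Random(0))
print(shuffled)
```

Choosing `j` from the range 0 to `i` inclusive is the detail that keeps the shuffle unbiased; drawing from the full range on every step would favour some orderings over others.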
