100+ datasets found

Machine learning algorithm validation with a limited sample size
plos.figshare.com
text/x-python
Updated May 30, 2023
Cite
Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
Explore at:
Available download formats: text/x-python
Unique identifier
https://doi.org/10.1371/journal.pone.0224365
Dataset updated
May 30, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work, however it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection if performed on pooled training and testing data is contributing to bias considerably more than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
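The contrast the abstract draws between plain K-fold CV and nested CV comes down to which samples the tuning and feature-selection steps are allowed to see. The following is a minimal sketch of the index layout only, not the authors' simulation code; the function names, fold counts, and seeds are assumptions chosen for illustration.

```python
import random


def k_fold_indices(indices, k, seed=0):
    """Shuffle `indices` and deal them into k roughly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only
    from outer_train, so feature selection and parameter tuning never
    touch the samples in outer_test.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [
            idx for j, fold in enumerate(outer_folds) if j != i for idx in fold
        ]
        inner_folds = k_fold_indices(outer_train, inner_k, seed + 1 + i)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [
                idx for n, fold in enumerate(inner_folds) if n != m for idx in fold
            ]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv_splits(20, outer_k=5, inner_k=3))
```

Pooling training and testing data for feature selection, which the abstract identifies as the larger source of bias, would correspond to running the selection step on all `n_samples` indices before calling `nested_cv_splits`.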
DRIVE Train/Validation Split Dataset
kaggle.com
Updated Feb 19, 2023
Cite
Sovit Ranjan Rath (2023). DRIVE Train/Validation Split Dataset [Dataset]. https://www.kaggle.com/datasets/sovitrath/drive-trainvalidation-split-dataset/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 19, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sovit Ranjan Rath
Description
This dataset contains images and masks for Retinal Vessel Extraction (Segmentation). It contains a training and validation split to easily train semantic segmentation models.

The original dataset can be found here => https://www.kaggle.com/datasets/andrewmvd/drive-digital-retinal-images-for-vessel-extraction

This dataset also has an accompanying blog post => Retinal Vessel Segmentation using PyTorch Semantic Segmentation

Split sample numbers:
Training images and masks: 16
Validation images and masks: 4
Test images: 20
ASDiv-train-test
huggingface.co
Updated Nov 3, 2025
Cite
Jeong Seong Cheol (2025). ASDiv-train-test [Dataset]. https://huggingface.co/datasets/lejelly/ASDiv-train-test
Explore at:
Dataset updated
Nov 3, 2025
Authors
Jeong Seong Cheol
Description
ASDiv (train/test 1:9)

This dataset is derived from EleutherAI/asdiv by splitting the original validation split into train and test with a ratio of 1:9.

Source

Original dataset: EleutherAI/asdiv
Link: https://huggingface.co/datasets/EleutherAI/asdiv

License

Inherits the original dataset's license (CC-BY-NC-4.0) unless otherwise noted in this repository.

Splitting Details

Method: datasets.Dataset.train_test_split
Source split: validation
Test… See the full description on the dataset page: https://huggingface.co/datasets/lejelly/ASDiv-train-test.
FER_my_split
kaggle.com
zip
Updated Feb 3, 2021
Cite
Neoklis Masmanidis (2021). FER_my_split [Dataset]. https://www.kaggle.com/neoklismasmanidis/fer-my-split
Explore at:
Available download formats: zip (88489637 bytes)
Dataset updated
Feb 3, 2021
Authors
Neoklis Masmanidis
Description
Dataset

This dataset was created by Neoklis Masmanidis

Contents
1.125s Heart Sound Data TVT Split acc. 98.5%
kaggle.com
zip
Updated Jul 19, 2023
Cite
Raiyun Razeen Kabir (2023). 1.125s Heart Sound Data TVT Split acc. 98.5% [Dataset]. https://www.kaggle.com/datasets/razeen08/1125s-heart-sound-data-tvt-split-acc-985
Explore at:
Available download formats: zip (4217986 bytes)
Dataset updated
Jul 19, 2023
Authors
Raiyun Razeen Kabir
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Train-Validation-Test Split for 1.125sdataforheartsound dataset that achieved 98.5% test accuracy using ResNet34

Load using

import pickle

with open('/kaggle/input/1125s-heart-sound-data-tvt-split-acc-985/split_98_5.pkl', 'rb') as f:
    data = pickle.load(f)

x_train = data['x_train']
x_test = data['x_test']
x_val = data['x_val']
y_train = data['y_train']
y_test = data['y_test']
y_val = data['y_val']

Copy and edit the sample notebook tryPickle
Data from: Web Data Commons Training and Test Sets for Large-Scale Product...
linkagelibrary.icpsr.umich.edu
da-ra.de
Updated Nov 26, 2020
Cite
Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
Explore at:
Unique identifier
https://doi.org/10.3886/E127481V1
Dataset updated
Nov 26, 2020
Dataset provided by
University of Mannheim (Germany)
Authors
Ralph Peeters; Anna Primpeli; Christian Bizer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of ids are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
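A stratified random draw of validation ids, as mentioned in the description, can be sketched as follows. This is an illustrative sketch only: the function name, the 20% validation fraction, the seed, and the toy pair ids are assumptions, and the description does not say how the published id sets were actually produced.

```python
import random
from collections import defaultdict


def stratified_validation_ids(pairs, val_fraction=0.2, seed=0):
    """Return the set of pair ids drawn for validation, stratified by label.

    `pairs` is an iterable of (pair_id, label) tuples. The same fraction
    is drawn from every label so the class balance is preserved.
    """
    by_label = defaultdict(list)
    for pair_id, label in pairs:
        by_label[label].append(pair_id)

    rng = random.Random(seed)
    val_ids = set()
    for label in sorted(by_label):
        ids = sorted(by_label[label])  # sort first so the draw is reproducible
        n_val = round(len(ids) * val_fraction)
        val_ids.update(rng.sample(ids, n_val))
    return val_ids


# Toy training set (hypothetical): 25 "match" pairs and 75 "no match" pairs.
pairs = [(i, "match" if i % 4 == 0 else "no match") for i in range(100)]
val_ids = stratified_validation_ids(pairs)
```

Pairs whose id is not in `val_ids` remain in the training set, so a single training file plus an id list is enough to reproduce the split.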
Dataset
figshare.com
application/x-gzip
Updated May 31, 2023
Cite
Moynuddin Ahmed Shibly (2023). Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13577873.v1
Explore at:
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.13577873.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Moynuddin Ahmed Shibly
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is an open-source, publicly available dataset, which can be found at https://shahariarrabby.github.io/ekush/. We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset. We applied 10-fold cross-validation on the train set and created ten folds. We also created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as a feature extractor. On the features extracted by ResNet50 we applied PCA and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above. Those folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The files were compressed to speed up upload and download and for convenience. If anyone has any questions about how the datasets are organized, please feel free to ask me at shiblygnr@gmail.com. I will get back to you as soon as possible.
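The two derived versions described above, ten cross-validation folds and ten bootstrap bags, differ in whether sampling is done without or with replacement. The sketch below illustrates that difference on a list of file names; the function names, seeds, and file names are hypothetical, and it is not the code used to build Fold.tar.gz or Bagging.tar.gz.

```python
import random


def make_folds(items, k=10, seed=0):
    """Partition items into k disjoint folds (sampling without replacement)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def make_bags(items, n_bags=10, seed=0):
    """Draw n_bags bootstrap samples, each the size of the pool, with replacement."""
    pool = list(items)
    rng = random.Random(seed)
    return [rng.choices(pool, k=len(pool)) for _ in range(n_bags)]


# Hypothetical image names standing in for the train set.
train_files = [f"img_{i:03d}.png" for i in range(100)]
folds = make_folds(train_files)
bags = make_bags(train_files)
```

Every file appears in exactly one fold, whereas a bag usually contains some files several times and omits others.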
Data from: Time-Split Cross-Validation as a Method for Estimating the...
acs.figshare.com
txt
Updated Jun 2, 2023
Cite
Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.1021/ci400084k.s001
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Robert P. Sheridan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
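The difference between time-split and random selection of the test set can be sketched in a few lines. This is a minimal illustration under stated assumptions: each record carries a sortable date, the 25% test fraction is arbitrary, and the function names and compound ids are hypothetical rather than taken from the paper.

```python
import random


def time_split(records, test_fraction=0.25):
    """Hold out the most recent records as the test set.

    `records` is a list of (compound_id, date) tuples, where date is any
    sortable value (a year is used in the example below).
    """
    ordered = sorted(records, key=lambda r: r[1])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


def random_split(records, test_fraction=0.25, seed=0):
    """Hold out a random subset as the test set, ignoring dates."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


records = [(f"cmpd{i}", 2000 + i) for i in range(8)]
train, test = time_split(records)
rand_train, rand_test = random_split(records)
```

With the time split, every test compound is newer than every training compound, which mimics prospective prediction; the random split mixes old and new compounds in both sets.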
NoteChat
huggingface.co
Updated Sep 12, 2024
Cite
Daniel Montecino (2024). NoteChat [Dataset]. https://huggingface.co/datasets/DanielMontecino/NoteChat
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 12, 2024
Authors
Daniel Montecino
Description
This dataset is just a split of the original akemiH/NoteChat.

70% for train
15% for validation
15% for test

Below is the code snippet used to split the dataset.

from datasets import DatasetDict
from datasets import load_dataset

DATASET_SRC_NAME = "akemiH/NoteChat"
DATASET_DST_NAME = "DanielMontecino/NoteChat"

dataset = load_dataset(DATASET_SRC_NAME, split="train")

# 70% train, 30% test + validation
train_testvalid = dataset.train_test_split(test_size=0.3, seed=2024)

Split the 30%… See the full description on the dataset page: https://huggingface.co/datasets/DanielMontecino/NoteChat.
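The snippet is truncated on this page, so the second stage is not shown. Given the stated 70/15/15 proportions, it presumably halves the held-out 30%. The sketch below illustrates that two-stage idea in plain Python; it is a stand-in rather than the `datasets` API, the helper names are hypothetical, and reusing the seed 2024 for the second stage is an assumption.

```python
import random


def train_test_split(rows, test_size, seed):
    """Plain-Python stand-in for a seeded random split; returns (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def split_70_15_15(rows, seed=2024):
    """Two-stage split: 70% train, then the remaining 30% halved."""
    train, rest = train_test_split(rows, test_size=0.3, seed=seed)
    valid, test = train_test_split(rest, test_size=0.5, seed=seed)
    return {"train": train, "validation": valid, "test": test}


splits = split_70_15_15(range(100))
```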
Data from: Training dataset for NABat Machine Learning V1.0
catalog.data.gov
data.usgs.gov
Updated Nov 26, 2025
Cite
U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Nov 26, 2025
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). 
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
Data Cleaning, Translation & Split of the Dataset for the Automatic...
data.niaid.nih.gov
zenodo.org
Updated Aug 8, 2022
Cite
Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
Explore at:
Dataset updated
Aug 8, 2022
Authors
Köhler, Juliane
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.
Image-dataset-FER-Test,Train,Val
kaggle.com
zip
Updated Oct 8, 2024
Cite
dolly prajapati 182 (2024). Image-dataset-FER-Test,Train,Val [Dataset]. https://www.kaggle.com/datasets/dollyprajapati182/image-dataset-fer-testtrainval/code
Explore at:
Available download formats: zip (248085782 bytes)
Dataset updated
Oct 8, 2024
Authors
dolly prajapati 182
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
This dataset is a split version of the original Image-dataset found here. The dataset consists of 8 emotion classes: angry, contempt, disgust, fear, happiness, neutral, sadness, and surprise.

To facilitate model training and evaluation, I have organized the dataset into three subsets:

Train: Used for training machine learning models.
Test: Used to evaluate model performance after training.
Validation: Used during training to tune hyperparameters and prevent overfitting.

This split allows for more effective usage in tasks such as Facial Emotion Recognition (FER) and other emotion analysis projects.
FAIR Dataset for Disease Prediction in Healthcare Applications
test.researchdata.tuwien.ac.at
bin, csv, json, png
Updated Apr 14, 2025
Cite
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Explore at:
Available download formats: csv, json, bin, png
Unique identifier
https://doi.org/10.70124/5n77a-dnf02
Dataset updated
Apr 14, 2025
Dataset provided by
TU Wien
Authors
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

Context and Methodology

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

Technical Details

Structure of the Dataset:
The dataset consists of several files organized into folders by data type:

Training Data: Contains the training dataset used to train the machine learning model.

Validation Data: Used for hyperparameter tuning and model selection.

Test Data: Reserved for final model evaluation.

Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:

Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

Further Details

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jun 6, 2021
Cite
Henderson, N. Ashley; Kauwe, K. Steven; Sparks, D. Taylor (2021). Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4903957
Explore at:
Dataset updated
Jun 6, 2021
Authors
Henderson, N. Ashley; Kauwe, K. Steven; Sparks, D. Taylor
Description
This benchmark data is comprised of 50 different datasets for materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.

For datasets with more than 100 values, train-val-test splits were created, either with a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in their studies. Datasets with less than 100 values had train-test splits created using the Leave-One-Out cross-validation method.
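The size-dependent choice between Leave-One-Out and k-fold cross-validation described above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, the folds are built without shuffling for brevity, and treating a dataset of exactly 100 values as k-fold is an assumption, since the description does not cover that case.

```python
def cv_test_folds(n_samples, k=5, loo_threshold=100):
    """Return a list of test-index lists, one per cross-validation round.

    Below the threshold every sample is its own test fold (Leave-One-Out);
    otherwise the indices are dealt into k folds.
    """
    if n_samples < loo_threshold:
        return [[i] for i in range(n_samples)]
    return [list(range(i, n_samples, k)) for i in range(k)]


small = cv_test_folds(12)           # smallest dataset size mentioned above
large = cv_test_folds(6354, k=10)   # largest dataset size mentioned above
```

In each round the training set is simply every index not in the current test fold.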

For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0
Probing Datasets for Noisy Texts
federation.figshare.com
researchdata.edu.au
Updated Mar 14, 2021
Cite
Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte (2021). Probing Datasets for Noisy Texts [Dataset]. http://doi.org/10.25955/604c5307db043
Explore at:
Unique identifier
https://doi.org/10.25955/604c5307db043
Dataset updated
Mar 14, 2021
Dataset provided by
Federation University Australia
Authors
Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Context

Probing tasks are popular among NLP researchers to assess the richness of the encoded representations of linguistic information. Each probing task is a classification problem, and the model’s performance shall vary depending on the richness of the linguistic properties crammed into the representation.

This dataset contains five new probing datasets consisting of noisy texts (Tweets), which can serve as a benchmark dataset for researchers to study the linguistic characteristics of unstructured and noisy texts.

File Structure

Format: A tab-separated text file

Column 1: train/test/validation split (tr-train, te-test, va-validation)

Column 2: class label (refer to the Content section for the class labels of each task file)

Column 3: Tweet message (text)

Column 4: a unique ID

Content

sent_len.tsv

In this classification task, the goal is to predict the sentence length in 8 possible bins (0-7) based on their lengths; 0: (5-8), 1: (9-12), 2: (13-16), 3: (17-20), 4: (21-25), 5: (26-29), 6: (30-33), 7: (34-70). This task is called “SentLen” in the paper.

word_content.tsv

We consider a 10-way classification task with 10 words as targets considering the available manually annotated instances. The task is predicting which of the target words appears in the given sentence. We have considered only the words that appear in the BERT vocabulary as target words. We constructed the data by picking the first 10 lower-cased words occurring in the corpus vocabulary ordered by frequency and having a length of at least 4 characters (to remove noise). Each sentence contains a single target word, and the word occurs precisely once in the sentence. The task is referred to as “WC” in the paper.

bigram_shift.tsv

The purpose of the Bigram Shift task is to test whether an encoder is sensitive to legal word orders. Two adjacent words in a Tweet are inverted, and the classification model performs a binary classification to identify inverted (I) and non-inverted/original (O) Tweets. The task is referred to as “BShift” in the paper.

tree_depth.tsv

The Tree Depth task evaluates the encoded sentence's ability to understand the hierarchical structure by allowing the classification model to predict the depth of the longest path from the root to any leaf in the Tweet's parser tree. The task is referred to as “TreeDepth” in the paper.

odd_man_out.tsv

The Tweets are modified by replacing a random noun or a verb o with another noun or verb r. The task of the classifier is to identify whether the sentence gets modified due to this change. Class label O refers to the unmodified sentences while C refers to modified sentences. The task is called “SOMO” in the paper.
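The four-column file layout and the SentLen bin boundaries given above can be sketched as a small parser. The bin edges and split codes are taken from the description; the function names, the sample row, and the assumption that a row holds exactly four tab-separated fields with the bare codes tr, te, and va are illustrative only.

```python
# Bin edges (inclusive) for the SentLen task, as listed in the description.
SENT_LEN_BINS = [
    (5, 8), (9, 12), (13, 16), (17, 20),
    (21, 25), (26, 29), (30, 33), (34, 70),
]

# Split codes used in column 1.
SPLIT_CODES = {"tr": "train", "te": "test", "va": "validation"}


def sent_len_bin(length):
    """Return the bin index (0-7) for a sentence length, or None if out of range."""
    for label, (low, high) in enumerate(SENT_LEN_BINS):
        if low <= length <= high:
            return label
    return None


def parse_row(line):
    """Parse one row: split code, class label, Tweet text, unique ID."""
    split_code, label, text, uid = line.rstrip("\n").split("\t")
    return {"split": SPLIT_CODES[split_code], "label": label, "text": text, "id": uid}


# Hypothetical row, not taken from the dataset.
row = parse_row("tr\t0\tjust landed in the city\t1001\n")
```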
ccisd-teks-alignment-split
huggingface.co
Updated Nov 9, 2025
Cite
Ryan Robson (2025). ccisd-teks-alignment-split [Dataset]. https://huggingface.co/datasets/robworks-software/ccisd-teks-alignment-split
Explore at:
Dataset updated
Nov 9, 2025
Authors
Ryan Robson
Area covered
Clear Creek Independent School District
Description
CCISD TEKS Alignment Dataset (3-Way Split)

Dataset Description

This dataset contains the alignment between Clear Creek ISD (CCISD) curriculum and Texas Essential Knowledge and Skills (TEKS) standards. The dataset is split into training, validation, and test sets for machine learning applications.

Dataset Summary

Total Records: 428 TEKS standards
Train Split: 299 records (69.9%)
Validation Split: 64 records (15.0%)
Test Split: 65 records (15.2%)
Subject Areas:… See the full description on the dataset page: https://huggingface.co/datasets/robworks-software/ccisd-teks-alignment-split.
world-survey-train
huggingface.co
Updated Sep 27, 2025
Cite
Maksim Zubok (2025). world-survey-train [Dataset]. https://huggingface.co/datasets/antndlcrx/world-survey-train
Explore at:
Dataset updated
Sep 27, 2025
Authors
Maksim Zubok
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
WVS Snippet — Train Split Only

This dataset provides the training split of the curated and limited snippet of World Values Survey (WVS) data, prepared for experiments with large language models as in silico survey respondents.

Contents

Split: train only (validation and test splits will be released later).
Columns:
respondent_id
feature questions (Q1…Q280, depending on cleaning)
target questions (prefixed t_, e.g. t_Q23)

Missing values

Missing:… See the full description on the dataset page: https://huggingface.co/datasets/antndlcrx/world-survey-train.
Global Wheat Detection 2020 Train/Valid/Test Split
kaggle.com
zip
Updated Dec 26, 2022
Cite
Sovit Ranjan Rath (2022). Global Wheat Detection 2020 Train/Valid/Test Split [Dataset]. https://www.kaggle.com/datasets/sovitrath/global-wheat-detection-2021-trainvalidtest-split
Explore at:
Available download formats: zip (630636820 bytes)
Dataset updated
Dec 26, 2022
Authors
Sovit Ranjan Rath
Description
This is the Global Wheat Detection dataset with train, validation, and test split. The labels are in XML format. The training and validation sets were created randomly. The test folder only contains a few images as per the original dataset.

Acknowledgment: https://www.kaggle.com/competitions/global-wheat-detection
iraq2025
huggingface.co
Updated Feb 12, 2025
Cite
Falahgs Saleh (2025). iraq2025 [Dataset]. https://huggingface.co/datasets/Falah/iraq2025
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 12, 2025
Authors
Falahgs Saleh
Description
iraq2025 Image Dataset

This dataset contains images with their corresponding captions.

Dataset Structure

Each record in the dataset contains:

file_name: Name of the image file
text: Caption describing the image
image: The actual image file
split: Dataset split (train/test/validation)

Example Record

{ "file_name": "Picture3.png", "text": "saad2", "image": { "path": "data/Picture3.png", "bytes": { "type": "Buffer", "data": [… See the full description on the dataset page: https://huggingface.co/datasets/Falah/iraq2025.
Jarvis Test Data Dataset
universe.roboflow.com
zip
Updated Nov 30, 2021
Cite
Jarvis Test Data (2021). Jarvis Test Data Dataset [Dataset]. https://universe.roboflow.com/jarvis-test-data/jarvis-test-data/dataset/1
Explore at:
Available download formats: zip
Dataset updated
Nov 30, 2021
Dataset authored and provided by
Jarvis Test Data
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Logos Bounding Boxes
Description
Bocconi University MSc. Data Science & Business Analytics 20600 Deep Learning for Computer Vision Team Jarvis

Object Detection Project

Test Data

This repository contains the test data used for the evaluation of the algorithms trained as part of the project. The data has been reannotated & resized to 640x640, but otherwise has not been touched. Especially, no augmentations or upsampling were done on this set. Instead, immediately after resizing and re-annotation, the train, validation & test set were split. Upsampling & augmentations were only done on the training set. Lastly, to further avoid leakage, duplicates were removed.
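The leakage-avoiding idea described above, splitting first and augmenting only the training set, can be sketched as follows. This is not the team's pipeline: the function names, split fractions, and the `augment` callable are hypothetical, duplicates are detected only when files are byte-identical, and the sketch removes them before splitting, whereas the description mentions duplicate removal as the last step.

```python
import hashlib
import random


def drop_duplicates(images):
    """Keep the first occurrence of each byte-identical image."""
    seen, unique = set(), []
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, data))
    return unique


def split_then_augment(images, augment, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Split de-duplicated images, then augment the training portion only."""
    unique = drop_duplicates(images)
    random.Random(seed).shuffle(unique)
    n_test = round(len(unique) * test_fraction)
    n_val = round(len(unique) * val_fraction)
    test = unique[:n_test]
    val = unique[n_test:n_test + n_val]
    train = unique[n_test + n_val:]
    # Augmented copies are derived from training images only, so no
    # transformed version of a validation or test image can leak into training.
    train = train + [augment(item) for item in train]
    return train, val, test


# Toy data (hypothetical): 20 distinct one-byte "images" plus one duplicate.
images = [(f"img{i}.png", bytes([i])) for i in range(20)] + [("dup.png", bytes([0]))]
train, val, test = split_then_augment(
    images, augment=lambda item: (item[0] + "_aug", item[1])
)
```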
