This is a cleaned version of the Quora dataset that's been configured with a train-test-val split.
Train: for training the model
Test: for running experiments and comparing different OSS and closed-source models
Val: only to be used at the end!
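If the cleaned dataset is published on the Hugging Face Hub, the three splits could be consumed along these lines; the repository ID below is a placeholder, not the dataset's actual ID:

```python
# Minimal sketch, assuming the dataset is hosted on the Hugging Face Hub
# under a hypothetical repository ID with train/test/validation splits.
from datasets import load_dataset

splits = load_dataset("username/quora-train-test-val")  # placeholder ID

train = splits["train"]       # for training the model
test = splits["test"]         # for experiments and model comparison
val = splits["validation"]    # hold out until the very end (key may be "val")
print(train[0])
```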
Colab notebook to reproduce: https://colab.research.google.com/drive/1dGjGiqwPV1M7JOLfcPEsSh3SC37urItS?usp=sharing
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was created by GIOPAIK
Released under CC BY-SA 4.0
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Context

The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).
The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:
A clean, pre-defined 80/20 train-test split.
Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow (see the sketch after the file listing below).
A flat directory structure (train/, test/) for simplified file access.
File Content

The dataset is organized into a single top-level folder and two CSV files:
train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.
test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.
Caltech-256_Train_Test/: The primary data folder.
train/: This directory contains 80% of the images from all 257 categories, intended for model training.
test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
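As a rough illustration of how the manifest files can drive a data loader, here is a minimal PyTorch sketch; the CSV column names come from the description above, while the transform choice and the string-label handling are assumptions to adapt to your setup:

```python
# Hedged sketch: a PyTorch Dataset that reads one of the manifest CSVs
# (columns: image_path, label) and yields (tensor, class_index) pairs.
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class ManifestDataset(Dataset):
    """Reads (image_path, label) rows from train.csv or test.csv."""
    def __init__(self, csv_path, transform=None):
        self.frame = pd.read_csv(csv_path)
        self.transform = transform
        # Map (possibly string) labels to integer class indices.
        self.classes = sorted(self.frame["label"].unique())
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        image = Image.open(row["image_path"]).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, self.class_to_idx[row["label"]]

transform = transforms.Compose([transforms.Resize((224, 224)),
                                transforms.ToTensor()])
train_ds = ManifestDataset("train.csv", transform=transform)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
```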
Data Split

The dataset has been partitioned into a standard 80% training / 20% testing split. This split is (or should be assumed to be) stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.
Acknowledgements & Original Source

This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.
Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data
Citation: Griffin, G., Holub, A.D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.
Bisher/northern-elevator-64-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by KevinKSU
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Train Test Split For Freiburg Dataset In YOLOv7 Format is a dataset for object detection tasks. It contains Groceries annotations for 8,879 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
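For example, the export could be pulled programmatically with the `roboflow` pip package; the API key, workspace, project ID, and version below are placeholders to replace with the values shown on the dataset's Roboflow page:

```python
# Hedged sketch of downloading the YOLOv7-format export via the Roboflow API.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")                    # placeholder key
project = rf.workspace("your-workspace").project("freiburg-groceries")  # hypothetical IDs
dataset = project.version(1).download("yolov7")          # YOLOv7-format export
print(dataset.location)  # local folder with images and YOLOv7 label files
```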
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
This dataset was created by Abdul Hasib Uddin
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1,000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
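As an illustration (not the paper's code), the two safeguards the abstract points to can be combined in scikit-learn: feature selection lives inside a Pipeline so it is fit on training folds only, and hyper-parameter tuning runs in an inner CV loop nested within an outer evaluation loop:

```python
# Illustrative nested-CV sketch on synthetic small-n, high-dimensional data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=1000, n_informative=10,
                           random_state=0)  # small n, high dimensionality

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),  # fit on training folds only
    ("clf", SVC()),
])
param_grid = {"clf__C": [0.1, 1, 10]}

inner = GridSearchCV(pipe, param_grid, cv=5)       # inner loop: tuning
outer_scores = cross_val_score(inner, X, y, cv=5)  # outer loop: evaluation
print(f"nested-CV accuracy: {outer_scores.mean():.2f} +/- {outer_scores.std():.2f}")
```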
bcsandlund/arc-agi-prompts-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
varun500/healthbench-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains the predicted prices of the asset test split over the next 16 years. The prices are initially calculated using a default 5 percent annual growth rate; after page load, a sliding-scale component lets the user adjust the growth rate to their own positive or negative projections. The maximum adjustable growth rate is 100 percent, and the minimum is -100 percent.
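The projection arithmetic described here appears to be plain compound growth; a minimal sketch, with a placeholder starting price:

```python
# Minimal sketch of the projection: compound growth at an adjustable annual
# rate over a 16-year horizon. The starting price is a placeholder.
def project_prices(start_price: float, annual_rate: float, years: int = 16):
    """Return year-by-year prices: price_t = price_0 * (1 + r)**t."""
    if not -1.0 <= annual_rate <= 1.0:  # the page clamps the rate to [-100%, 100%]
        raise ValueError("annual_rate must be between -1.0 and 1.0")
    return [start_price * (1 + annual_rate) ** t for t in range(years + 1)]

print(project_prices(100.0, 0.05)[:3])  # [100.0, 105.0, 110.25]
```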
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model comparison using multiple metrics before balancing by SMOTE (train-test split: 80%/20%).
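For context, the workflow this caption implies is usually implemented by splitting first and oversampling only the training portion, so the test set keeps its natural class imbalance; a hedged sketch with imbalanced-learn on synthetic data:

```python
# Sketch: 80%/20% split first, then SMOTE on the training data only.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # 80%/20% split

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_bal))
```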
This dataset was created by Hoàng Anh Nguyễn
Released under Other (specified in description)
This dataset was created by Dewizzz
Splits of train, test, and validation samples for the Urban dataset.
This dataset was created by Büşra Ertekin
The area under the precision-recall curve is shown for training and testing with the most important ROIs (AUC with feature selection) and with all ROIs (AUC without feature selection).
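For reference, a precision-recall AUC of this kind can be computed with scikit-learn; the labels and scores below are placeholders:

```python
# Sketch: area under the precision-recall curve for placeholder predictions.
from sklearn.metrics import auc, precision_recall_curve

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # placeholder labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]   # classifier scores

precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)  # area under the PR curve
print(f"PR-AUC: {pr_auc:.3f}")
```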
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by HoangAnhVu
Released under MIT
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted with that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R²) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R² that is more like that of true prospective prediction than the R² from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
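A sketch of the time-split idea (not the paper's code): sort compounds by date and hold out the most recent fraction, so the test set mimics compounds made after the model was built. The DataFrame and date column are hypothetical:

```python
# Sketch: time-split selection -- train on the oldest compounds, test on the newest.
import pandas as pd

def time_split(df: pd.DataFrame, date_col: str = "assay_date",
               test_frac: float = 0.2):
    """Hold out the most recent test_frac of compounds as the test set."""
    df_sorted = df.sort_values(date_col)
    cut = int(len(df_sorted) * (1 - test_frac))
    return df_sorted.iloc[:cut], df_sorted.iloc[cut:]

# train_df, test_df = time_split(compounds)  # compounds: DataFrame with an assay_date column
```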
Training and test data split for each class of each dataset.