100+ datasets found
  1. cleaned-quora-dataset-train-test-split

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    fivesixseven (2024). cleaned-quora-dataset-train-test-split [Dataset]. https://huggingface.co/datasets/567-labs/cleaned-quora-dataset-train-test-split
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    fivesixseven
    Description

    This is a cleaned version of the Quora dataset that's been configured with a train-test-val split.

    Train: for training the model.
    Test: for running experiments and comparing different open-source (OSS) and closed-source models.
    Val: only to be used at the end!

    Colab Notebook to reproduce : https://colab.research.google.com/drive/1dGjGiqwPV1M7JOLfcPEsSh3SC37urItS?usp=sharing
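A minimal sketch of how a train/test/val partition like the one described could be produced. The 80/10/10 ratios, the seed, and the function name `three_way_split` are assumptions for illustration; the linked Colab notebook remains the reference for how this dataset was actually split.

```python
import random

def three_way_split(items, test_frac=0.1, val_frac=0.1, seed=0):
    """Shuffle once, then carve out disjoint test and val subsets (hypothetical helper)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, test, val

# Toy example: 100 integer ids stand in for question pairs.
train, test, val = three_way_split(range(100))
```

Keeping the val subset untouched until the very end, as the description asks, means it should not be consulted while comparing models on the test subset.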

  2. CUB200-2011 with train/test split

    • kaggle.com
    Updated Nov 28, 2023
    Cite
    GIOPAIK (2023). CUB200-2011 with train/test split [Dataset]. https://www.kaggle.com/datasets/skyil7/cub200-2011-with-traintest-split
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    GIOPAIK
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by GIOPAIK

    Released under CC BY-SA 4.0

    Contents

  3. Caltech-256: Pre-Processed 80/20 Train-Test Split

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    KUSHAGRA MATHUR (2025). Caltech-256: Pre-Processed 80/20 Train-Test Split [Dataset]. https://www.kaggle.com/datasets/kushubhai/caltech-256-train-test
    Available download formats: zip (1138799273 bytes)
    Dataset updated
    Nov 12, 2025
    Authors
    KUSHAGRA MATHUR
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context: The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).

    The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:

    A clean, pre-defined 80/20 train-test split.

    Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.

    A flat directory structure (train/, test/) for simplified file access.

    File Content: The dataset is organized into a single top-level folder and two CSV files:

    train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.

    test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.

    Caltech-256_Train_Test/: The primary data folder.

    train/: This directory contains 80% of the images from all 257 categories, intended for model training.

    test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.

    Data Split: The dataset has been partitioned into a standard 80% training and 20% testing split. This split is (or should be assumed to be) stratified, meaning that each of the 257 categories is represented in roughly an 80/20 proportion across the two sets.
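A stratified 80/20 split of this kind can be sketched as follows. This is an illustration under stated assumptions, not the uploader's script: the function name `stratified_split`, the seed, and the toy paths and labels are hypothetical. Only the two-column row shape (image_path, label) comes from the description.

```python
import random
from collections import defaultdict

def stratified_split(rows, test_frac=0.2, seed=0):
    """Split (image_path, label) rows so each label keeps roughly the same train/test ratio."""
    by_label = defaultdict(list)
    for path, label in rows:
        by_label[label].append((path, label))
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Take the test share per category, not from the pooled rows.
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy manifest: hypothetical paths and labels, 10 images per category.
rows = [(f"{label}/{i:03d}.jpg", label)
        for label in ("bear", "kayak", "clutter")
        for i in range(10)]
train_rows, test_rows = stratified_split(rows)
```

Splitting within each category is what keeps small categories from ending up entirely in one set, which a plain shuffle of the pooled rows cannot guarantee.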

    Acknowledgements & Original Source: This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.

    Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data

    Citation: Griffin, G., Holub, A. D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.

  4. arc-agi-prompts-train-test-split

    • huggingface.co
    Updated Jun 1, 2025
    + more versions
    Cite
    Bryce Sandlund (2025). arc-agi-prompts-train-test-split [Dataset]. https://huggingface.co/datasets/bcsandlund/arc-agi-prompts-train-test-split
    Dataset updated
    Jun 1, 2025
    Authors
    Bryce Sandlund
    Description

    bcsandlund/arc-agi-prompts-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. deepstock-sp500-companies-info-stonkv2-test-train-split

    • huggingface.co
    Updated Nov 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lukas Abrie Nel (2025). deepstock-sp500-companies-info-stonkv2-test-train-split [Dataset]. https://huggingface.co/datasets/2084Collective/deepstock-sp500-companies-info-stonkv2-test-train-split
    Dataset updated
    Nov 4, 2025
    Authors
    Lukas Abrie Nel
    Description

    2084Collective/deepstock-sp500-companies-info-stonkv2-test-train-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Train Test Split For Freiburg In Yolov7 Format Dataset

    • universe.roboflow.com
    zip
    Updated Aug 4, 2023
    Cite
    Isaac H (2023). Train Test Split For Freiburg In Yolov7 Format Dataset [Dataset]. https://universe.roboflow.com/isaac-h/train-test-split-for-freiburg-dataset-in-yolov7-format/dataset/2
    Available download formats: zip
    Dataset updated
    Aug 4, 2023
    Dataset authored and provided by
    Isaac H
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Groceries Bounding Boxes
    Description

    Train Test Split For Freiburg Dataset In YOLOv7 Format

    ## Overview
    
    Train Test Split For Freiburg Dataset In YOLOv7 Format is a dataset for object detection tasks - it contains Groceries annotations for 8,879 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  7. Splits of train, test, and validation samples for Urban dataset.

    • datasetcatalog.nlm.nih.gov
    Updated Jul 14, 2022
    Cite
    Mantripragada, Kiran; Qureshi, Faisal Z.; Dao, Phuong D.; He, Yuhong (2022). Splits of train, test, and validation samples for Urban dataset. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000391591
    Dataset updated
    Jul 14, 2022
    Authors
    Mantripragada, Kiran; Qureshi, Faisal Z.; Dao, Phuong D.; He, Yuhong
    Description

    Splits of train, test, and validation samples for Urban dataset.

  8. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes to bias considerably more than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
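The point about feature selection can be illustrated with a small sketch in which features are ranked using the training rows only, so the held-out rows never influence which features are kept. Everything here is an assumption made for illustration and not the authors' code: the synthetic no-signal data, the mean-difference score, the nearest-centroid classifier, and all function names.

```python
import random

def mean_diff_scores(X, y):
    """Score each feature by the absolute difference between the two class means."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores

def select_top_k(X, y, k):
    """Return the indices of the k highest-scoring features."""
    scores = mean_diff_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, features):
    """Fit class centroids on the training rows, then score on the test rows."""
    def centroid(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return [sum(r[j] for r in rows) / len(rows) for j in features]
    c0, c1 = centroid(0), centroid(1)
    correct = 0
    for row, label in zip(X_test, y_test):
        d0 = sum((row[j] - c) ** 2 for j, c in zip(features, c0))
        d1 = sum((row[j] - c) ** 2 for j, c in zip(features, c1))
        correct += int((1 if d1 < d0 else 0) == label)
    return correct / len(y_test)

# Synthetic data: 40 samples, 200 pure-noise features, labels with no signal.
rng = random.Random(0)
n, p, k = 40, 200, 5
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

# Split first, then select features from the training rows only.
X_train, y_train = X[:30], y[:30]
X_test, y_test = X[30:], y[30:]
features = select_top_k(X_train, y_train, k)
acc = nearest_centroid_accuracy(X_train, y_train, X_test, y_test, features)
```

Calling `select_top_k(X, y, k)` on the pooled rows instead would let the test labels leak into the choice of features, which is the source of bias the abstract describes.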

  9. healthbench-train-test-split

    • huggingface.co
    Cite
    Varun Mathur, healthbench-train-test-split [Dataset]. https://huggingface.co/datasets/varun500/healthbench-train-test-split
    Authors
    Varun Mathur
    Description

    varun500/healthbench-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. split-train-test

    • kaggle.com
    zip
    Updated May 18, 2025
    Cite
    Hoàng Anh Nguyễn (2025). split-train-test [Dataset]. https://www.kaggle.com/datasets/hoanganhnguyen1005/split-train-test
    Available download formats: zip (699256626 bytes)
    Dataset updated
    May 18, 2025
    Authors
    Hoàng Anh Nguyễn
    Description

    Dataset

    This dataset was created by Hoàng Anh Nguyễn

    Released under Other (specified in description)

    Contents

  11. Facial Emotion Recognition Train-Test Split

    • kaggle.com
    zip
    Updated Oct 24, 2025
    Cite
    KevinKSU (2025). Facial Emotion Recognition Train-Test Split [Dataset]. https://www.kaggle.com/datasets/kevinksu/facial-emotion-train-test-split
    Available download formats: zip (208903118 bytes)
    Dataset updated
    Oct 24, 2025
    Authors
    KevinKSU
    License

    CC0: Public Domain https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by KevinKSU

    Released under CC0: Public Domain

    Contents

  12. Table of averaged results over 100 train-test splits with a ratio of 0.33.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 11, 2019
    Cite
    Leibnitz, Kenji; Rattay, Frank; Golaszewski, Stefan Martin; Wutzl, Betty; Murata, Masayuki; Kronbichler, Martin (2019). Table of averaged results over 100 train-test splits with a ratio of 0.33. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000130899
    Dataset updated
    Jul 11, 2019
    Authors
    Leibnitz, Kenji; Rattay, Frank; Golaszewski, Stefan Martin; Wutzl, Betty; Murata, Masayuki; Kronbichler, Martin
    Description

    The AUC of the precision and recall curve is shown (AUC with feature selection) for training and testing with the most important ROIs and with all ROIs (AUC without feature selection).

  13. Model comparison using multiple metrics before balancing by SMOTE(Train-Test...

    • plos.figshare.com
    xls
    Updated Jan 9, 2025
    Cite
    Daniel Niguse Mamo; Agmasie Damtew Walle; Eden Ketema Woldekidan; Jibril Bashir Adem; Yosef Haile Gebremariam; Meron Asmamaw Alemayehu; Ermias Bekele Enyew; Shimels Derso Kebede (2025). Model comparison using multiple metrics before balancing by SMOTE(Train-Test Split (80%-20%)). [Dataset]. http://doi.org/10.1371/journal.pdig.0000707.t003
    Available download formats: xls
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    PLOS Digital Health
    Authors
    Daniel Niguse Mamo; Agmasie Damtew Walle; Eden Ketema Woldekidan; Jibril Bashir Adem; Yosef Haile Gebremariam; Meron Asmamaw Alemayehu; Ermias Bekele Enyew; Shimels Derso Kebede
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Model comparison using multiple metrics before balancing by SMOTE(Train-Test Split (80%-20%)).

  14. Bark-101 Train-Test Split

    • kaggle.com
    zip
    Updated Jan 27, 2023
    Cite
    Abdul Hasib Uddin (2023). Bark-101 Train-Test Split [Dataset]. https://www.kaggle.com/datasets/abdulhasibuddin/bark101-traintest-split
    Available download formats: zip (386258861 bytes)
    Dataset updated
    Jan 27, 2023
    Authors
    Abdul Hasib Uddin
    Description

    Dataset

    This dataset was created by Abdul Hasib Uddin

    Contents

  15. Data from: Web Data Commons Training and Test Sets for Large-Scale Product...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    + more versions
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of ids are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.

  16. Mean performance for our 8 models across the 10 train-test splits.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Sep 25, 2019
    Cite
    Maenner, Matthew J.; Lee, Scott H.; Heilig, Charles M. (2019). Mean performance for our 8 models across the 10 train-test splits. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000184876
    Dataset updated
    Sep 25, 2019
    Authors
    Maenner, Matthew J.; Lee, Scott H.; Heilig, Charles M.
    Description

    Metrics include sensitivity (Sens), specificity (Spec), positive predictive value (PPV), negative predictive value (NPV), F1, and accuracy (Acc), all shown as percentages. The best scores for each metric are shown in bold, and the final column presents differences in accuracy between each of the models and the most accurate model, the NB-SVM. Simultaneous confidence intervals are multiplicity-adjusted to control FWER.
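For reference, the six listed metrics follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions and reports percentages, as the table does. The function name `binary_metrics` and the example counts are hypothetical, and the code assumes every denominator is nonzero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return Sens, Spec, PPV, NPV, F1 and Acc as percentages from confusion-matrix counts."""
    sens = tp / (tp + fn)                  # sensitivity (recall)
    spec = tn / (tn + fp)                  # specificity
    ppv = tp / (tp + fp)                   # positive predictive value (precision)
    npv = tn / (tn + fn)                   # negative predictive value
    f1 = 2 * ppv * sens / (ppv + sens)     # harmonic mean of PPV and sensitivity
    acc = (tp + tn) / (tp + fp + tn + fn)  # overall accuracy
    values = {"Sens": sens, "Spec": spec, "PPV": ppv, "NPV": npv, "F1": f1, "Acc": acc}
    return {name: 100 * value for name, value in values.items()}

# Hypothetical counts for a single train-test split.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Averaging each metric over the 10 splits would then give mean values of the kind the table reports.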

  17. major-test-train-Split

    • kaggle.com
    zip
    Updated Apr 10, 2022
    Cite
    Mohammad Abdul Basit (2022). major-test-train-Split [Dataset]. https://www.kaggle.com/datasets/mohammadabdulbasit/majortesttrainsplit
    Available download formats: zip (2120625 bytes)
    Dataset updated
    Apr 10, 2022
    Authors
    Mohammad Abdul Basit
    Description

    Dataset

    This dataset was created by Mohammad Abdul Basit

    Contents

  18. low_alt_satellite_image_dataset_500-train-test-split

    • huggingface.co
    Cite
    Sadhana Shashidhar, low_alt_satellite_image_dataset_500-train-test-split [Dataset]. https://huggingface.co/datasets/Sadhana-24/low_alt_satellite_image_dataset_500-train-test-split
    Authors
    Sadhana Shashidhar
    Description

    Sadhana-24/low_alt_satellite_image_dataset_500-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. Data split for each class of each dataset for training and test.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 6, 2024
    Cite
    Niranjan, Mahesan; Fan, Keqiang; Cai, Xiaohao; Liu, Jiahui (2024). Data split for each class of each dataset for training and test. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001424294
    Dataset updated
    Nov 6, 2024
    Authors
    Niranjan, Mahesan; Fan, Keqiang; Cai, Xiaohao; Liu, Jiahui
    Description

    Data split for each class of each dataset for training and test.

  20. Dataskripsi_split

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    Dewizzz (2023). Dataskripsi_split [Dataset]. https://www.kaggle.com/datasets/dewizzz/dataskripsi-split
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dewizzz
    Description

    Dataset

    This dataset was created by Dewizzz

    Contents
