This is a cleaned version of the Quora dataset that's been configured with a train-test-val split.
Train: for training the model
Test: for running experiments and comparing different OSS and closed-source models
Val: only to be used at the end!
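If the cleaned dataset is published on the Hugging Face Hub, the three splits could be consumed along these lines; the repository ID below is a placeholder, not the dataset's actual ID:

```python
# Minimal sketch, assuming the dataset is hosted on the Hugging Face Hub
# under a hypothetical repository ID with train/test/validation splits.
from datasets import load_dataset

splits = load_dataset("username/quora-train-test-val")  # placeholder ID

train = splits["train"]       # for training the model
test = splits["test"]         # for experiments and model comparison
val = splits["validation"]    # hold out until the very end (key may be "val")
print(train[0])
```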
Colab notebook to reproduce: https://colab.research.google.com/drive/1dGjGiqwPV1M7JOLfcPEsSh3SC37urItS?usp=sharing
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was created by GIOPAIK
Released under CC BY-SA 4.0
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Context

The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).
The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:
A clean, pre-defined 80/20 train-test split.
Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow (see the sketch after the file listing below).
A flat directory structure (train/, test/) for simplified file access.
File Content

The dataset is organized into a single top-level folder and two CSV files:
train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.
test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.
Caltech-256_Train_Test/: The primary data folder.
train/: This directory contains 80% of the images from all 257 categories, intended for model training.
test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
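As a rough illustration of how the manifest files can drive a data loader, here is a minimal PyTorch sketch; the CSV column names come from the description above, while the transform choice and the string-label handling are assumptions to adapt to your setup:

```python
# Hedged sketch: a PyTorch Dataset that reads one of the manifest CSVs
# (columns: image_path, label) and yields (tensor, class_index) pairs.
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class ManifestDataset(Dataset):
    """Reads (image_path, label) rows from train.csv or test.csv."""
    def __init__(self, csv_path, transform=None):
        self.frame = pd.read_csv(csv_path)
        self.transform = transform
        # Map (possibly string) labels to integer class indices.
        self.classes = sorted(self.frame["label"].unique())
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        image = Image.open(row["image_path"]).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, self.class_to_idx[row["label"]]

transform = transforms.Compose([transforms.Resize((224, 224)),
                                transforms.ToTensor()])
train_ds = ManifestDataset("train.csv", transform=transform)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
```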
Data Split

The dataset has been partitioned into a standard 80% training / 20% testing split. This split is (or should be assumed to be) stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.
Acknowledgements & Original Source

This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.
Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data
Citation: Griffin, G., Holub, A.D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.
Bisher/northern-elevator-64-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by KevinKSU
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Train Test Split For Freiburg Dataset In YOLOv7 Format is a dataset for object detection tasks. It contains Groceries annotations for 8,879 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
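For example, the export could be pulled programmatically with the `roboflow` pip package; the API key, workspace, project ID, and version below are placeholders to replace with the values shown on the dataset's Roboflow page:

```python
# Hedged sketch of downloading the YOLOv7-format export via the Roboflow API.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")                    # placeholder key
project = rf.workspace("your-workspace").project("freiburg-groceries")  # hypothetical IDs
dataset = project.version(1).download("yolov7")          # YOLOv7-format export
print(dataset.location)  # local folder with images and YOLOv7 label files
```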
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
This dataset was created by Abdul Hasib Uddin
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1,000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
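As an illustration (not the paper's code), the two safeguards the abstract points to can be combined in scikit-learn: feature selection lives inside a Pipeline so it is fit on training folds only, and hyper-parameter tuning runs in an inner CV loop nested within an outer evaluation loop:

```python
# Illustrative nested-CV sketch on synthetic small-n, high-dimensional data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=1000, n_informative=10,
                           random_state=0)  # small n, high dimensionality

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),  # fit on training folds only
    ("clf", SVC()),
])
param_grid = {"clf__C": [0.1, 1, 10]}

inner = GridSearchCV(pipe, param_grid, cv=5)       # inner loop: tuning
outer_scores = cross_val_score(inner, X, y, cv=5)  # outer loop: evaluation
print(f"nested-CV accuracy: {outer_scores.mean():.2f} +/- {outer_scores.std():.2f}")
```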
bcsandlund/arc-agi-prompts-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
varun500/healthbench-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains the predicted prices of the asset test split over the next 16 years. The prices are initially calculated using a default 5 percent annual growth rate; after page load, a sliding-scale component lets the user adjust the growth rate to their own positive or negative projections. The maximum adjustable growth rate is 100 percent, and the minimum is -100 percent.
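The projection arithmetic described here appears to be plain compound growth; a minimal sketch, with a placeholder starting price:

```python
# Minimal sketch of the projection: compound growth at an adjustable annual
# rate over a 16-year horizon. The starting price is a placeholder.
def project_prices(start_price: float, annual_rate: float, years: int = 16):
    """Return year-by-year prices: price_t = price_0 * (1 + r)**t."""
    if not -1.0 <= annual_rate <= 1.0:  # the page clamps the rate to [-100%, 100%]
        raise ValueError("annual_rate must be between -1.0 and 1.0")
    return [start_price * (1 + annual_rate) ** t for t in range(years + 1)]

print(project_prices(100.0, 0.05)[:3])  # [100.0, 105.0, 110.25]
```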
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model comparison using multiple metrics before balancing by SMOTE (train-test split: 80%/20%).
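For context, the workflow this caption implies is usually implemented by splitting first and oversampling only the training portion, so the test set keeps its natural class imbalance; a hedged sketch with imbalanced-learn on synthetic data:

```python
# Sketch: 80%/20% split first, then SMOTE on the training data only.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # 80%/20% split

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_bal))
```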
This dataset was created by Hoàng Anh Nguyễn
Released under Other (specified in description)
This dataset was created by Dewizzz
Splits of train, test, and validation samples for the Urban dataset.
This dataset was created by Büşra Ertekin
The area under the precision-recall curve is shown for training and testing with the most important ROIs (AUC with feature selection) and with all ROIs (AUC without feature selection).
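For reference, a precision-recall AUC of this kind can be computed with scikit-learn; the labels and scores below are placeholders:

```python
# Sketch: area under the precision-recall curve for placeholder predictions.
from sklearn.metrics import auc, precision_recall_curve

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # placeholder labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]   # classifier scores

precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)  # area under the PR curve
print(f"PR-AUC: {pr_auc:.3f}")
```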
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by HoangAnhVu
Released under MIT
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted with that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R²) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R² that is more like that of true prospective prediction than the R² from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
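A sketch of the time-split idea (not the paper's code): sort compounds by date and hold out the most recent fraction, so the test set mimics compounds made after the model was built. The DataFrame and date column are hypothetical:

```python
# Sketch: time-split selection -- train on the oldest compounds, test on the newest.
import pandas as pd

def time_split(df: pd.DataFrame, date_col: str = "assay_date",
               test_frac: float = 0.2):
    """Hold out the most recent test_frac of compounds as the test set."""
    df_sorted = df.sort_values(date_col)
    cut = int(len(df_sorted) * (1 - test_frac))
    return df_sorted.iloc[:cut], df_sorted.iloc[cut:]

# train_df, test_df = time_split(compounds)  # compounds: DataFrame with an assay_date column
```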
Training and test data split for each class of each dataset.