100+ datasets found

alpaca-train-validation-test-split
huggingface.co
Updated Aug 12, 2023
Cite
Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 12, 2023
Authors
Doula Isham Rashik Hasan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for Alpaca

I have just performed a train, test, and validation split on the original dataset. A repository to reproduce this will be shared here soon. I am including the original dataset card as follows.

Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
Training/Validation/Test set split
figshare.com
zip
Updated Mar 30, 2024
Cite
Tianfan Jin (2024). Training/Validation/Test set split [Dataset]. http://doi.org/10.6084/m9.figshare.25511056.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.25511056.v1
Dataset updated
Mar 30, 2024
Dataset provided by
figshare
Figshare: http://figshare.com/
Authors
Tianfan Jin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Includes the split of real and null reactions into training, validation, and test sets.
PubMed (60%/20%/20% random splits) Dataset
paperswithcode.com
Cite
PubMed (60%/20%/20% random splits) Dataset [Dataset]. https://paperswithcode.com/dataset/pubmed-60-20-20-random-splits
Description
Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.
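A 60%/20%/20% random split like the one described here can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure used to produce the benchmark splits; the function name `random_split`, the seed, and the item count of 1,000 are assumptions made for the example.

```python
import random


def random_split(n_items, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/val/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = round(fractions[0] * n_items)
    n_val = round(fractions[1] * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder absorbs any rounding
    return train, val, test


# Hypothetical graph with 1,000 nodes (not the real PubMed node count).
train, val, test = random_split(1000)
print(len(train), len(val), len(test))
```

Repeating the call with different seeds would give several independent random splits over the same nodes.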
Train Test and Validation Split
kaggle.com
Updated Apr 18, 2025
Cite
IMT2022053 (2025). Train Test and Validation Split [Dataset]. https://www.kaggle.com/datasets/pranavakulkarni/train-test-and-validation-split/discussion
Explore at:
Croissant
Dataset updated
Apr 18, 2025
Dataset provided by
Kaggle: http://kaggle.com/
Authors
IMT2022053
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by IMT2022053

Released under Apache 2.0

Film (60%/20%/20% random splits) Dataset
paperswithcode.com
library.toponeai.link
Cite
Film (60%/20%/20% random splits) Dataset [Dataset]. https://paperswithcode.com/dataset/film-60-20-20-random-splits
Description
Node classification on Film with 60%/20%/20% random splits for training/validation/test.
Data from: Time-Split Cross-Validation as a Method for Estimating the...
acs.figshare.com
figshare.com
txt
Updated Jun 2, 2023
Cite
Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.1021/ci400084k.s001
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Robert P. Sheridan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R²) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R² that is more like that of true prospective prediction than the R² from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
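The idea of time-split selection can be illustrated with a short sketch: order the compounds by date and hold out the most recent ones as the test set, so the model is always evaluated on compounds that came after its training data. This is not the paper's code; the field name `registered`, the 25% test fraction, and the sample records are assumptions made for the example.

```python
from datetime import date


def time_split(records, test_fraction=0.25):
    """Hold out the most recently registered records as the test set."""
    ordered = sorted(records, key=lambda r: r["registered"])
    n_test = round(test_fraction * len(ordered))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical compounds with hypothetical registration dates.
compounds = [
    {"id": "c1", "registered": date(2010, 1, 5)},
    {"id": "c2", "registered": date(2012, 6, 1)},
    {"id": "c3", "registered": date(2009, 3, 17)},
    {"id": "c4", "registered": date(2011, 11, 30)},
    {"id": "c5", "registered": date(2008, 7, 4)},
    {"id": "c6", "registered": date(2012, 2, 14)},
    {"id": "c7", "registered": date(2010, 9, 9)},
    {"id": "c8", "registered": date(2011, 4, 21)},
]

train, test = time_split(compounds)
print([c["id"] for c in test])
```

A random selection would instead shuffle the records before cutting, which lets compounds similar to the test compounds leak into the training set.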
Texas(60%/20%/20% random splits) Dataset
paperswithcode.com
Cite
Texas(60%/20%/20% random splits) Dataset [Dataset]. https://paperswithcode.com/dataset/texas-60-20-20-random-splits-1
Area covered
Texas
Description
Node classification on Texas with 60%/20%/20% random splits for training/validation/test.
Pothole Dataset
universe.roboflow.com
zip
Updated Nov 1, 2020
Cite
Brad Dwyer (2020). Pothole Dataset [Dataset]. https://universe.roboflow.com/brad-dwyer/pothole-voxrl/model/1
Explore at:
Available download formats: zip
Dataset updated
Nov 1, 2020
Dataset authored and provided by
Brad Dwyer
License
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Variables measured
Potholes Bounding Boxes
Description
Pothole Dataset

Example image: https://i.imgur.com/7Xz8d5M.gif

This is a collection of 665 images of roads with the potholes labeled. The dataset was created and shared by Atikur Rahman Chitholian as part of his undergraduate thesis and was originally shared on Kaggle.

Note: The original dataset did not contain a validation set; we have re-shuffled the images into a 70/20/10 train-valid-test split.

Usage

This dataset could be used for automatically finding and categorizing potholes in city streets so the worst ones can be fixed faster.

The dataset is provided in a wide variety of formats for various common machine learning models.
Data Cleaning, Translation & Split of the Dataset for the Automatic...
zenodo.org
data.niaid.nih.gov
bin, csv +1
Updated Apr 24, 2025
Cite
Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
Explore at:
Available download formats: text/x-python, csv, bin
Unique identifier
https://doi.org/10.5281/zenodo.6957842
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Juliane Köhler
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.
Putnam-AXIOM-for-zip-fit-splits
huggingface.co
Updated Jan 1, 2023
Cite
ZIP-FIT - Compression-Based Data Selection For Code (2023). Putnam-AXIOM-for-zip-fit-splits [Dataset]. https://huggingface.co/datasets/zipfit/Putnam-AXIOM-for-zip-fit-splits
Dataset updated
Jan 1, 2023
Dataset authored and provided by
ZIP-FIT - Compression-Based Data Selection For Code
Description
Putnam-AXIOM Splits for ZIP-FIT

This repository contains the train, validation, and test splits of the Putnam-AXIOM dataset specifically for use with the ZIP-FIT methodology research. The dataset is split as follows:

train: 150 examples
validation: 150 examples
test: 222 examples

These splits are derived from the original 522 Putnam problems found in the main Putnam-AXIOM repository.

Main Repository

The full dataset with original problems and variations is available… See the full description on the dataset page: https://huggingface.co/datasets/zipfit/Putnam-AXIOM-for-zip-fit-splits.
WDC LSPM Dataset
library.toponeai.link
paperswithcode.com
Updated Feb 8, 2025
Cite
(2025). WDC LSPM Dataset [Dataset]. https://library.toponeai.link/dataset/wdc-products
Dataset updated
Feb 8, 2025
Description
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories, computers, cameras, watches and shoes.

In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2.000-70.000 pairs). Furthermore there are sets of ids for each training set for a possible validation split (stratified random draw) available. The test set for each product category consists of 1.100 product pairs. The labels of the test sets were manually checked while those of the training sets were derived using shared product identifiers from the Web via weak supervision.

The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
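The "stratified random draw" mentioned above for the validation split can be sketched as follows: pair ids are grouped by label, and the same fraction is drawn from each group so that the validation set keeps the label proportions of the training set. This is only an illustration; the function name, the 20% fraction, the seed, and the toy pair ids are assumptions, not values from the WDC release.

```python
import random
from collections import defaultdict


def stratified_validation_ids(pairs, val_fraction=0.2, seed=0):
    """Draw the same fraction of pair ids from every label group."""
    by_label = defaultdict(list)
    for pair_id, label in pairs:
        by_label[label].append(pair_id)
    rng = random.Random(seed)
    val_ids = set()
    for ids in by_label.values():
        k = round(val_fraction * len(ids))
        val_ids.update(rng.sample(ids, k))  # sample without replacement
    return val_ids


# Hypothetical training set: 10 "match" pairs and 40 "no match" pairs.
pairs = [(f"p{i}", "match") for i in range(10)]
pairs += [(f"n{i}", "no match") for i in range(40)]

val_ids = stratified_validation_ids(pairs)
train_pairs = [(pid, label) for pid, label in pairs if pid not in val_ids]
print(len(val_ids), len(train_pairs))
```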
Training dataset for NABat Machine Learning V1.0
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class).
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
Dataset
figshare.com
application/x-gzip
Updated May 31, 2023
Cite
Moynuddin Ahmed Shibly (2023). Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13577873.v1
Explore at:
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.13577873.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Moynuddin Ahmed Shibly
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is an open-source, publicly available dataset that can be found at https://shahariarrabby.github.io/ekush/. We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset. We applied 10-fold cross-validation to the train set and created ten folds. We also created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as a feature extractor. We applied PCA to the features extracted by ResNet50 and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above; those folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The compression was performed to speed up upload and download, and for convenience. If anyone has any questions about how the datasets are organized, please feel free to ask me at shiblygnr@gmail.com. I will get back to you as soon as possible.
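The two resampling schemes named in this description, ten cross-validation folds and ten bootstrap bags, can be sketched with the standard library alone. This is a generic illustration rather than the author's script; the function names, the seed, and the stand-in list of 95 image ids are assumptions.

```python
import random


def k_folds(items, k=10, seed=0):
    """Shuffle the items and deal them into k disjoint folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def bootstrap_bags(items, n_bags=10, seed=0):
    """Build n_bags bags, each sampled with replacement at full size."""
    pool = list(items)
    rng = random.Random(seed)
    return [rng.choices(pool, k=len(pool)) for _ in range(n_bags)]


# Hypothetical image ids standing in for the train set.
image_ids = list(range(95))

folds = k_folds(image_ids)
bags = bootstrap_bags(image_ids)

# In cross-validation, each fold serves once as the held-out part.
held_out = folds[0]
fit_on = [x for fold in folds[1:] for x in fold]
print(len(folds), len(held_out), len(fit_on), len(bags))
```

Folds partition the data without overlap, whereas a bootstrap bag usually repeats some items and omits others.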
Web Data Commons Training and Test Sets for Large-Scale Product Matching -...
da-ra.de
Updated Oct 2019
Cite
Christian Bizer; Anna Primpeli; Ralph Peeters (2019). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.7801/351
Unique identifier
https://doi.org/10.7801/351
Dataset updated
Oct 2019
Dataset provided by
da|ra
Mannheim University Library
Authors
Christian Bizer; Anna Primpeli; Ralph Peeters
Description
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2.000-70.000 pairs). Furthermore there are sets of ids for each training set for a possible validation split (stratified random draw) available. The test set for each product category consists of 1.100 product pairs. The labels of the test sets were manually checked while those of the training sets were derived using shared product identifiers from the Web via weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
Data from: Temporal Validity Change Prediction - Dataset
zenodo.org
data.niaid.nih.gov
csv
Updated Apr 24, 2025
Cite
Georg Wenzel (2025). Temporal Validity Change Prediction - Dataset [Dataset]. http://doi.org/10.5281/zenodo.8340858
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.8340858
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Georg Wenzel
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data for temporal validity change prediction, an NLP task that will be defined in an upcoming publication. The dataset consists of five columns.

target - A Tweet ID. This column must be manually rehydrated via the Twitter API to obtain the tweet text.

follow_up - A synthetic follow-up tweet that semantically relates to the target tweet.

context_only_tv - The expected temporal validity duration of the target tweet, when read in isolation.

combined_tv - The expected temporal validity duration of the target tweet, when read together with the follow-up tweet.

change - The TVCP task label, i.e., whether the temporal validity duration of the target tweet is decreased, unchanged (neutral), or increased by the information in the follow-up tweet.

The duration labels (context_only_tv, combined_tv) are class indices of the following class distribution:
[no time-sensitive information, less than one minute, 1-5 minutes, 5-15 minutes, 15-45 minutes, 45 minutes - 2 hours, 2-6 hours, more than 6 hours, 1-3 days, 3-7 days, 1-4 weeks, more than one month]

Different dataset splits are provided.

"dataset.csv" contains the full dataset.

"train.csv", "val.csv", "test.csv" contain an 80-10-10 train-val-test split.

"train[0-4].csv" and "test[0-4].csv" respectively contain training and test data for one of 5 folds for 5-fold cross-validation. The train file contains 80% of the data, while the test file contains 20%. To replicate the original experiments, the train file should be sorted by the preprocessed target tweet text, then the first 12.5% of target tweets should be sampled to generate validation data, leading to a 70-10-20 train-val-test split.
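The arithmetic behind the 70-10-20 split is that 12.5% of the 80% train file is 10% of the full data (0.125 × 0.80 = 0.10), which leaves 70% for training. A minimal sketch of the described step follows; it simplifies by treating every row as a distinct target tweet, and the `text` field name and the toy rows are assumptions, since the real files hold tweet IDs that must be rehydrated first.

```python
def carve_validation(train_rows, val_fraction=0.125):
    """Sort rows by text and take the first val_fraction as validation."""
    ordered = sorted(train_rows, key=lambda row: row["text"])
    n_val = round(val_fraction * len(ordered))
    return ordered[n_val:], ordered[:n_val]


# Hypothetical fold from a dataset of 100 rows: 80 train rows, 20 test rows.
fold_train = [{"text": f"tweet {i:03d}"} for i in range(80)]
fold_test = [{"text": f"held-out {i:03d}"} for i in range(20)]

train, val = carve_validation(fold_train)
print(len(train), len(val), len(fold_test))
```

Because the cut follows a sort rather than a random draw, the same validation rows are selected on every run.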
20Newsgroup (10 tasks) Dataset
paperswithcode.com
Updated Feb 22, 2022
Cite
Zixuan Ke; Bing Liu; Nianzu Ma; Hu Xu; Lei Shu (2022). 20Newsgroup (10 tasks) Dataset [Dataset]. https://paperswithcode.com/dataset/20newsgroup-10-tasks
Dataset updated
Feb 22, 2022
Authors
Zixuan Ke; Bing Liu; Nianzu Ma; Hu Xu; Lei Shu
Description
This dataset has 20 classes and each class has about 1000 documents. The data split for train/validation/test is 1600/200/200. We created 10 tasks, 2 classes per task. Since this is topic-based text classification data, the classes are very different and have little shared knowledge. As mentioned above, this application (and dataset) is mainly used to show a CL model's ability to overcome forgetting. For detailed statistics, please see https://github.com/ZixuanKe/PyContinual
Paragraph Expanded Dataset
paperswithcode.com
Updated Feb 7, 2025
Cite
(2025). Paragraph Expanded Dataset [Dataset]. https://paperswithcode.com/dataset/paragraph-expanded
Dataset updated
Feb 7, 2025
Description
To take advantage of the ever-increasing amount of structural data now available, we also trained Paragraph on a larger dataset. This new dataset was extracted from the Structural Antibody Database (SAbDab, Schneider et al., 2022) on March 31, 2022 and includes 1086 complexes which we divide into train, validation and test sets using a 60-20-20 split. Full details of both datasets are given in the Supplementary Information.
ImageNet VIPriors subset Dataset
paperswithcode.com
Updated Mar 4, 2021
Cite
Robert-Jan Bruintjes; Attila Lengyel; Marcos Baptista Rios; Osman Semih Kayhan; Jan van Gemert (2021). ImageNet VIPriors subset Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet-vipriors-subset
Dataset updated
Mar 4, 2021
Authors
Robert-Jan Bruintjes; Attila Lengyel; Marcos Baptista Rios; Osman Semih Kayhan; Jan van Gemert
Description
The training and validation data are subsets of the training split of the ImageNet 2012 dataset. The test set is taken from the validation split of the ImageNet 2012 dataset. Each data set includes 50 images per class.
mnli_matched
huggingface.co
Updated Mar 26, 2025
Cite
Jan Westphal (2025). mnli_matched [Dataset]. https://huggingface.co/datasets/westphal-jan/mnli_matched
Explore at:
Croissant
Dataset updated
Mar 26, 2025
Authors
Jan Westphal
Description
Dataset Description

This dataset provides easier accessibility to the original MNLI dataset. We randomly choose 10% of the original validation_matched split and use it as the validation split. The remaining 90% are used for the test split. The train split remains unchanged.
FSOCO-split
kaggle.com
Updated Oct 14, 2024
Cite
Tommaso Fava (2024). FSOCO-split [Dataset]. https://www.kaggle.com/datasets/tommasofava/fsoco-split/code
Explore at:
Croissant
Dataset updated
Oct 14, 2024
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Tommaso Fava
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
FSOCO dataset split into train (80%), validation (10%), and test (10%) set. Ready for Ultralytics YOLO training.
