100+ datasets found

h
alpaca-train-validation-test-split
huggingface.co
Updated Aug 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 12, 2023
Authors
Doula Isham Rashik Hasan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for Alpaca

I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.

Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
DRIVE Train/Validation Split Dataset
kaggle.com
Updated Feb 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sovit Ranjan Rath (2023). DRIVE Train/Validation Split Dataset [Dataset]. https://www.kaggle.com/datasets/sovitrath/drive-trainvalidation-split-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 19, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sovit Ranjan Rath
Description
This dataset contains images and masks for Retinal Vessel Extraction (Segmentation). It contains a training and validation split to easily train semantic segmentation models.

The original dataset can be found here => https://www.kaggle.com/datasets/andrewmvd/drive-digital-retinal-images-for-vessel-extraction

This dataset also has an accompanying blog post => Retinal Vessel Segmentation using PyTorch Semantic Segmentation

Split sample numbers: Training images and masks: 16 Validation images and masks: 4 Test images: 20
Train Test and Validation Split
kaggle.com
Updated Apr 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
IMT2022053 (2025). Train Test and Validation Split [Dataset]. https://www.kaggle.com/datasets/pranavakulkarni/train-test-and-validation-split/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 18, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
IMT2022053
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by IMT2022053

Released under Apache 2.0

Contents
DUDE competition train - validation - test splits ground truth
zenodo.org
json
Updated Mar 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jordy Van Landeghem; Jordy Van Landeghem (2023). DUDE competition train - validation - test splits ground truth [Dataset]. http://doi.org/10.5281/zenodo.7680617
Explore at:
jsonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7680617
Dataset updated
Mar 23, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jordy Van Landeghem; Jordy Van Landeghem
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This JSON file contains the ground truth annotations for the train and validation set of the DUDE competition (https://rrc.cvc.uab.es/?ch=23&com=tasks) of ICDAR 2023 (https://icdar2023.org/).

V1.0.7 release: 41454 annotations for 4974 documents (train-validation-test)

DatasetDict({ train: Dataset({ features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'], num_rows: 23728 }) val: Dataset({ features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'], num_rows: 6315 }) test: Dataset({ features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'], num_rows: 11411 }) }) ++update on answer_type +++formatting change to answers_variants ++++stricter check on answer_variants & rename annotations file + blind test set (no ground truth answers provided)
Data from: Web Data Commons Training and Test Sets for Large-Scale Product...
linkagelibrary.icpsr.umich.edu
da-ra.de
Updated Nov 26, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
Explore at:
Unique identifier
https://doi.org/10.3886/E127481V1
Dataset updated
Nov 26, 2020
Dataset provided by
University of Mannheim (Germany)
Authors
Ralph Peeters; Anna Primpeli; Christian Bizer
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories, computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2.000-70.000 pairs). Furthermore there are sets of ids for each training set for a possible validation split (stratified random draw) available. The test set for each product category consists of 1.100 product pairs. The labels of the test sets were manually checked while those of the training sets were derived using shared product identifiers from the Web weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
Data from: Time-Split Cross-Validation as a Method for Estimating the...
figshare.com
acs.figshare.com
txt
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.1021/ci400084k.s001
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Robert P. Sheridan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
P
Film (60%/20%/20% random splits) Dataset
library.toponeai.link
Updated Feb 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Film (60%/20%/20% random splits) Dataset [Dataset]. https://library.toponeai.link/dataset/film-60-20-20-random-splits
Explore at:
Dataset updated
Feb 8, 2025
Description
Node classification on Film with 60%/20%/20% random splits for training/validation/test.
f
Cross-validation results of Support Vector Machine to predict pollen type,...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Jun 3, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Li, X.; Raine, J. I.; Prebble, J. G.; de Lange, P. J.; Newstrom-Lloyd, L. (2022). Cross-validation results of Support Vector Machine to predict pollen type, based on 99 iterations of random 80:20 splits of data into training:test sets. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000437870
Explore at:
Dataset updated
Jun 3, 2022
Authors
Li, X.; Raine, J. I.; Prebble, J. G.; de Lange, P. J.; Newstrom-Lloyd, L.
Description
Cross-validation results of Support Vector Machine to predict pollen type, based on 99 iterations of random 80:20 splits of data into training:test sets.
h
MNLP_M3_mcqa_dataset
huggingface.co
Updated Jun 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
andres nowak (2025). MNLP_M3_mcqa_dataset [Dataset]. https://huggingface.co/datasets/andresnowak/MNLP_M3_mcqa_dataset
Explore at:
Dataset updated
Jun 10, 2025
Authors
andres nowak
Description
This dataset contains the MCQA and instruction finetuning datasets (and the test and validation splits are only used for testing not for training):

The messages column is used by the instruction finetuning dataset The choices, question, context, and answer columns are used by the MCQA dataset

For the MCQA dataset (of only single answer) contains a mixture of the train, validation and test splits from this datasets as to have for training and testing:

mmlu auxiliary train we only use the… See the full description on the dataset page: https://huggingface.co/datasets/andresnowak/MNLP_M3_mcqa_dataset.
f
Splits of train, test, and validation samples for Urban dataset.
datasetcatalog.nlm.nih.gov
Updated Jul 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mantripragada, Kiran; Qureshi, Faisal Z.; Dao, Phuong D.; He, Yuhong (2022). Splits of train, test, and validation samples for Urban dataset. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000391591
Explore at:
Dataset updated
Jul 14, 2022
Authors
Mantripragada, Kiran; Qureshi, Faisal Z.; Dao, Phuong D.; He, Yuhong
Description
Splits of train, test, and validation samples for Urban dataset.
h
wiki_paragraphs_english
huggingface.co
Updated Jan 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Per Kummervold (2023). wiki_paragraphs_english [Dataset]. https://huggingface.co/datasets/pere/wiki_paragraphs_english
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 1, 2023
Authors
Per Kummervold
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
WIKI Paragraphs English

A multi-split dataset for machine learning research and evaluation, containing text samples in JSON Lines format.

Features

Multiple splits for different use cases Random shuffle with Fisher-Yates algorithm Structured format with text and metadata Size-varied validation/test sets (100 to 10k samples)

Splits Overview

Split Name Samples Typical Usage

train 1,000,000 Primary training data

validation 10,000 Standard validation… See the full description on the dataset page: https://huggingface.co/datasets/pere/wiki_paragraphs_english.
h
smaller_MultiLegalPile
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Montana, smaller_MultiLegalPile [Dataset]. https://huggingface.co/datasets/HannahMontana/smaller_MultiLegalPile
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Montana
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Summary

This is a smaller version of the joelniklaus/Multi_Legal_Pile dataset; but divided in train, test, and validation splits. It spans over four legal text types.

Supported Tasks and Leaderboards

The dataset supports the tasks of fill-mask.

Languages

Only english supported: en

Dataset Structure

type is one of the following:

caselaw contracts legislation other

Use the dataset like this: from datasets import load_dataset dataset =… See the full description on the dataset page: https://huggingface.co/datasets/HannahMontana/smaller_MultiLegalPile.
d
Training dataset for NABat Machine Learning V1.0
catalog.data.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
e
Web Data Commons Training and Test Sets for Large-Scale Product Matching -...
b2find.eudat.eu
Updated Nov 27, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/720b440c-eda0-5182-af9f-f868ed999bd7
Explore at:
Dataset updated
Nov 27, 2020
Description
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories, computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2.000-70.000 pairs). Furthermore there are sets of ids for each training set for a possible validation split (stratified random draw) available. The test set for each product category consists of 1.100 product pairs. The labels of the test sets were manually checked while those of the training sets were derived using shared product identifiers from the Web weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
Yeast genotype/phenotype data partitioned into train/validation/test splits...
zenodo.org
zip
Updated Apr 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2025). Yeast genotype/phenotype data partitioned into train/validation/test splits from Ba Nguyen et al, 2022 [Dataset]. http://doi.org/10.5281/zenodo.15313069
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15313069
Dataset updated
Apr 30, 2025
Dataset provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The raw data comes from Ba Nguyen et al, 2022, who hosted their data here. This dataset was used in an independent study in Rijal et al, 2025, who preprocessed the data using these notebook scripts. They did not release their processed data, so we reproduced their processing pipeline and have uploaded the data ourselves as part of this data resource.

This release accompanies this publication: https://doi.org/10.57844/arcadia-bmb9-fzxd
Data Cleaning, Translation & Split of the Dataset for the Automatic...
zenodo.org
data.niaid.nih.gov
bin, csv +1
Updated Apr 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Juliane Köhler; Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
Explore at:
text/x-python, csv, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6957842
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Juliane Köhler; Juliane Köhler
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.
Model weights and training, validation, and test set images and masks for...
zenodo.org
Updated Feb 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kylen Solvik; Kylen Solvik; Yaffa Truelove; Yaffa Truelove; JENNIFER BALCH; JENNIFER BALCH; Michael Lathuilliere; Michael Lathuilliere; Thiago Fontenelle; Andrea Castanho; Andrea Castanho; Michael Coe; Michael Coe; Christina Shintani; Christina Shintani; CARLOS Souza Jr; CARLOS Souza Jr; Marcia Nunes Macedo; Marcia Nunes Macedo; Thiago Fontenelle (2025). Model weights and training, validation, and test set images and masks for "Uncovering a million small dams in Brazil using deep learning" [Dataset]. http://doi.org/10.5281/zenodo.14927197
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.14927197
Dataset updated
Feb 28, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Kylen Solvik; Kylen Solvik; Yaffa Truelove; Yaffa Truelove; JENNIFER BALCH; JENNIFER BALCH; Michael Lathuilliere; Michael Lathuilliere; Thiago Fontenelle; Andrea Castanho; Andrea Castanho; Michael Coe; Michael Coe; Christina Shintani; Christina Shintani; CARLOS Souza Jr; CARLOS Souza Jr; Marcia Nunes Macedo; Marcia Nunes Macedo; Thiago Fontenelle
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Annotated masks and Sentinel-1/-2 images split into training, validation, and test sets. Used for training convolutional neural network for small reservoir mapping.

- manet_sentinel.ckpt: PyTorch model checkpoint file containing model weights.

- annotations.zip: Contains binary reservoir masks (0 is non-reservoir, 1 is reservoir) split into training, validation, and test sets.

- images.zip: Contains Sentinel-1/-2 images split into training, validation, and test sets with the following bands:

Blue

Green

Red

Near-infrared

Sentinel-1 SAR VV

Sentinel-1 SAR VH

NDVI

NDWI

Gao's NDWI

MNDWI
P
WDC LSPM Dataset
library.toponeai.link
Updated Feb 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). WDC LSPM Dataset [Dataset]. https://library.toponeai.link/dataset/wdc-products
Explore at:
Dataset updated
Feb 8, 2025
Description
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories, computers, cameras, watches and shoes.

In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2.000-70.000 pairs). Furthermore there are sets of ids for each training set for a possible validation split (stratified random draw) available. The test set for each product category consists of 1.100 product pairs. The labels of the test sets were manually checked while those of the training sets were derived using shared product identifiers from the Web via weak supervision.

The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
t
FAIR Dataset for Disease Prediction in Healthcare Applications
test.researchdata.tuwien.ac.at
bin, csv, json, png
Updated Apr 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Explore at:
csv, json, bin, pngAvailable download formats
Unique identifier
https://doi.org/10.70124/5n77a-dnf02
Dataset updated
Apr 14, 2025
Dataset provided by
TU Wien
Authors
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

Context and Methodology

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

Technical Details

Structure of the Dataset:
The dataset consists of several files organized into folders by data type:

Training Data: Contains the training dataset used to train the machine learning model.

Validation Data: Used for hyperparameter tuning and model selection.

Test Data: Reserved for final model evaluation.

Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:

Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

Further Details

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
o
Downsized camera trap images for automated classification
explore.openaire.eu
data.niaid.nih.gov
+1more
Updated Jan 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Danielle L Norman; Oliver R Wearne; Philip M Chapman; Sui P Heon; Robert M Ewers (2022). Downsized camera trap images for automated classification [Dataset]. http://doi.org/10.5281/zenodo.6627706
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6627706
Dataset updated
Jan 1, 2022
Authors
Danielle L Norman; Oliver R Wearne; Philip M Chapman; Sui P Heon; Robert M Ewers
Description
Description: Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactionsFunding: These data were collected as part of research funded by: NERC (NERC QMEE CDT Studentship, NE/P012345/1, http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FP012345%2F1&cookieConsent=A)This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.XML metadata: GEMINI compliant metadata for this dataset is available hereFiles: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zipCT_image_data_info2.xlsxThis file contains dataset metadata and 1 data tables:Dataset Images (described in worksheet Dataset_images)Description: This worksheet details the composition of each dataset used in the analysesNumber of fields: 69Number of data rows: 270287Fields: filename: Root ID (Field type: id)camera_trap_site: Site ID for the camera trap location (Field type: location)taxon: Taxon recorded by camera trap (Field type: taxa)dist_level: Level of disturbance at site (Field type: ordered categorical)baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)dist_combined_event_level_1: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 1' training or test set, or not included (NA) (Field type: categorical)dist_combined_event_level_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 2' training or test set, or not included (NA) (Field type: categorical)dist_combined_event_level_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 3' training or test set, or not included (NA) (Field type: categorical)dist_combined_event_level_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 4' training or test set, or not included (NA) (Field type: categorical)dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 5' training or test set, or not included (NA) (Field type: categorical)dist_combined_event_level_pair_1_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 2 (pair)' training set, or not included (NA) (Field type: categorical)dist_combined_event_level_pair_1_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 3 (pair)' training set, or not included (NA) (Field type: categorical)dist_combined_event_level_pair_1_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 4 (pair)' training set, or not included (NA) (Field type: categorical)dist_combined_event_level_pair_1_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 5 (pair)' training set, or not included (NA) (Field type: categorical)dist_combined_event_level_pair_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 3 (pair)' training set, or not included (NA) (Field type: categorical)dist_combined_event_level_pair_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 4 (pair)' training set, or not included (NA) (Field type: categorical)dist_combined_event_level_pair_2_5: Label as to whether image is included in the 'disturbance level combination ana...

Facebook

Twitter

Click to copy link

Link copied

Cite

Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split

alpaca-train-validation-test-split

Alpaca

disham993/alpaca-train-validation-test-split

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Aug 12, 2023

Authors

Doula Isham Rashik Hasan

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Dataset Card for Alpaca

I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.

  Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.

Clear search

Close search

Google apps

Main menu

alpaca-train-validation-test-split

DRIVE Train/Validation Split Dataset

Train Test and Validation Split

Dataset

Contents

DUDE competition train - validation - test splits ground truth

Data from: Web Data Commons Training and Test Sets for Large-Scale Product...

Data from: Time-Split Cross-Validation as a Method for Estimating the...

Film (60%/20%/20% random splits) Dataset

Cross-validation results of Support Vector Machine to predict pollen type,...

MNLP_M3_mcqa_dataset

Splits of train, test, and validation samples for Urban dataset.

wiki_paragraphs_english

smaller_MultiLegalPile

Training dataset for NABat Machine Learning V1.0

Web Data Commons Training and Test Sets for Large-Scale Product Matching -...

Yeast genotype/phenotype data partitioned into train/validation/test splits...

Data Cleaning, Translation & Split of the Dataset for the Automatic...

Model weights and training, validation, and test set images and masks for...

WDC LSPM Dataset

FAIR Dataset for Disease Prediction in Healthcare Applications

Dataset Description

Context and Methodology

Technical Details

Further Details

Downsized camera trap images for automated classification

alpaca-train-validation-test-split

Alpaca

disham993/alpaca-train-validation-test-split