100+ datasets found
  1. Training/Validation/Test set split

    • figshare.com
    zip
    Updated Mar 30, 2024
    Cite
    Tianfan Jin (2024). Training/Validation/Test set split [Dataset]. http://doi.org/10.6084/m9.figshare.25511056.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Mar 30, 2024
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Tianfan Jin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Includes the split of real and null reactions for training, validation and testing.

  2. MS Training Set, MS Validation Set, and UW Validation/Test Set - Dataset - LDM

    • service.tib.eu
    Updated Dec 17, 2024
    Cite
    (2024). MS Training Set, MS Validation Set, and UW Validation/Test Set - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/ms-training-set--ms-validation-set--and-uw-validation-test-set
    Explore at:
    Dataset updated
    Dec 17, 2024
    Description

    The MS Training Set, MS Validation Set, and UW Validation/Test Set are used for training, validation, and testing the proposed methods.

  3. alpaca-train-validation-test-split

    • huggingface.co
    Updated Aug 12, 2023
    Cite
    Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 12, 2023
    Authors
    Doula Isham Rashik Hasan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

    I have performed a train, test and validation split on the original dataset. A repository to reproduce this will be shared here soon. I am including the original dataset card as follows.

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make them follow instructions better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
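
    A minimal loading sketch (not part of the dataset card), assuming the repository exposes the three splits its name suggests; the split names may differ:

      from datasets import load_dataset

      # Load the pre-split dataset from the Hugging Face Hub.
      splits = load_dataset("disham993/alpaca-train-validation-test-split")

      # Print the number of examples in each available split.
      for name, ds in splits.items():
          print(name, len(ds))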

  4. 10-fold cross-validation results

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Thomas W. Kelsey; Phoebe Wright; Scott M. Nelson; Richard A. Anderson; W. Hamish B Wallace (2023). 10-fold cross-validation results. [Dataset]. http://doi.org/10.1371/journal.pone.0022024.t004
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Thomas W. Kelsey; Phoebe Wright; Scott M. Nelson; Richard A. Anderson; W. Hamish B Wallace
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The mean squared error (MSE) for the highest-ranked model is given for both the training set (90% of the dataset) and the test set (the remaining 10%) for each of the ten folds. Since the errors are similar, both for individual folds and on average, we consider the model to be validated. The fit statistics and peak ages are those of the highest-ranked model returned by TableCurve2D for each fold.
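
    As an illustration of the validation logic described above (not the authors' code), here is a 10-fold loop that compares training and test MSE per fold; similar errors across folds suggest a self-consistent model. The data and model are synthetic placeholders:

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import mean_squared_error
      from sklearn.model_selection import KFold

      # Synthetic placeholder data standing in for the real measurements.
      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 3))
      y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

      # For each fold: fit on 90% of the data, evaluate on the held-out 10%.
      kf = KFold(n_splits=10, shuffle=True, random_state=0)
      for i, (tr, te) in enumerate(kf.split(X), 1):
          model = LinearRegression().fit(X[tr], y[tr])
          mse_train = mean_squared_error(y[tr], model.predict(X[tr]))
          mse_test = mean_squared_error(y[te], model.predict(X[te]))
          print(f"fold {i}: train MSE={mse_train:.4f}, test MSE={mse_test:.4f}")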

  5. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release.

    These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format.

    From the available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included.

    Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
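
    A hedged sketch of the per-species/grid-cell capping and random splitting described above; the column names, synthetic metadata, and 80/10/10 split fractions are assumptions for illustration, not the USGS pipeline:

      import numpy as np
      import pandas as pd

      # Hypothetical file metadata; the real columns are assumptions.
      rng = np.random.default_rng(0)
      files = pd.DataFrame({
          "filename": [f"rec_{i}.wav" for i in range(5000)],
          "species": rng.choice(["LANO", "EPFU", "MYLU"], size=5000),
          "grid_cell": rng.integers(1, 20, size=5000),
      })

      # Randomly draw files per species/grid-cell combination, capped at 1,250.
      capped = (files.groupby(["species", "grid_cell"], group_keys=False)
                     .apply(lambda g: g.sample(n=min(len(g), 1250), random_state=0)))

      # Randomly split the pool into training, validation, and test (holdout) sets.
      shuffled = capped.sample(frac=1.0, random_state=0)
      n = len(shuffled)
      train = shuffled.iloc[:int(0.8 * n)]
      val = shuffled.iloc[int(0.8 * n):int(0.9 * n)]
      test = shuffled.iloc[int(0.9 * n):]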

  6. Synthetic PDF Testset for File Format Validation - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Apr 27, 2023
    Cite
    (2023). Synthetic PDF Testset for File Format Validation - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/113e334a-a103-5160-8bd4-53816f8d1018
    Explore at:
    Dataset updated
    Apr 27, 2023
    Description

    This data set presents a corpus of light-weight files designed to test the validation criteria of JHOVE's PDF module against "well-formedness". Test cases are based on structural requirements for PDF files as per the ISO 32000-1:2008 standard. The basis for all test files is a single-page, one-line document with no special features such as linearization. While such a light-weight document only allows checking against a fragment of the standard's requirements, the focus was put on basic structure violations at the header, trailer, document catalog, page tree node and cross-reference levels. The test set also checks for basic violations at the page node, page resource and stream object level. The accompanying spreadsheet briefly categorizes and describes the test set and includes the outcome of running it against JHOVE 1.16 (PDF-hul 1.8) as well as Adobe Acrobat Professional XI Pro (11.0.15). The spreadsheet also includes a code coverage statistic for the test set in relation to the JHOVE 1.16 PDF-hul 1.8 module. Further information can be found in the paper "A PDF Test-Set for Well-Formedness Validation in JHOVE - The Good, the Bad and the Ugly", published in the proceedings of the 14th International Conference on Digital Preservation (Kyoto, Japan, September 25-29, 2017). While the spreadsheet only contains the results of running the test set against JHOVE, it can be used as a ground truth for any file format validation process.

  7. Dataset

    • figshare.com
    application/x-gzip
    Updated May 31, 2023
    Cite
    Moynuddin Ahmed Shibly (2023). Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13577873.v1
    Explore at:
    application/x-gzip (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Moynuddin Ahmed Shibly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an open-source, publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/ . We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset: we applied 10-fold cross-validation on the train set to create ten folds, and we created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as a feature extractor: on the features extracted by ResNet50 we applied PCA and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above; the folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The compression was performed to speed up upload and download and mostly for the sake of convenience. If you have any questions about how the datasets are organized, feel free to ask me at shiblygnr@gmail.com; I will get back to you as soon as possible.
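
    For the tabular version, a minimal sketch of loading the PCA feature files named above with pandas (the label column, if any, is not specified in the description):

      import pandas as pd

      # File names as given in the description above.
      train = pd.read_csv("pca_features.csv")       # train set, 80 PCA features
      test = pd.read_csv("pca_test_features.csv")   # test set

      print(train.shape, test.shape)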

  8. Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test)

    • cancerimagingarchive.net
    csv, dicom, n/a +1
    Updated May 2, 2025
    Cite
    The Cancer Imaging Archive (2025). Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) [Dataset]. http://doi.org/10.7937/cf2p-aw56
    Explore at:
    sqlite and zip, dicom, csv, n/a (available download formats)
    Dataset updated
    May 2, 2025
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 2, 2025
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    Abstract

    These resources comprise a large and diverse collection of multi-site, multi-modality, and multi-cancer clinical DICOM images from 538 subjects infused with synthetic PHI/PII in areas encountered by TCIA curation teams. Also provided is a TCIA-curated version of the synthetic dataset, along with mapping files for mapping identifiers between the two.

    This new MIDI data resource includes DICOM datasets used in the Medical Image De-Identification Benchmark (MIDI-B) challenge at MICCAI 2024. They are accompanied by ground truth answer keys and a validation script for evaluating the effectiveness of medical image de-identification workflows. The validation script systematically assesses de-identified data against an answer key outlining appropriate actions and values for proper de-identification of medical images, promoting safer and more consistent medical image sharing.

    Introduction

    Medical imaging research increasingly relies on large-scale data sharing. However, reliable de-identification of DICOM images still presents significant challenges due to the wide variety of DICOM header elements and pixel data where identifiable information may be embedded. To address this, we have developed an openly accessible synthetic dataset containing artificially generated protected health information (PHI) and personally identifiable information (PII).

    These resources complement our earlier work (Pseudo-PHI-DICOM-data) hosted on The Cancer Imaging Archive. As an example of its use, we also provide a version curated by The Cancer Imaging Archive (TCIA) curation team. This resource builds upon best practices emphasized by the MIDI Task Group, which underscores the importance of transparency, documentation, and reproducibility in de-identification workflows, part of the themes at recent conferences (Synapse:syn53065760) and workshops (2024 MIDI-B Challenge Workshop).

    This framework enables objective benchmarking of de-identification performance, promotes transparency in compliance with regulatory standards, and supports the establishment of consistent best practices for sharing clinical imaging data. We encourage the research community to use these resources to enhance and standardize their medical image de-identification workflows.

    Methods

    Subject Inclusion and Exclusion Criteria

    The source data were selected from imaging already hosted in de-identified form on TCIA. Imaging containing faces was excluded, and no new human studies were performed for this project.

    Data Acquisition

    To build the synthetic dataset, image series were selected from TCIA’s curated datasets to represent a broad range of imaging modalities (CR, CT, DX, MG, MR, PT, SR, US), manufacturers (GE, Siemens, Varian, Confirma, Agfa, Eigen, Elekta, Hologic, KONICA MINOLTA, and others), scan parameters, and regions of the body. These were processed to inject the synthetic PHI/PII as described.

    Data Analysis

    Synthetic pools of PHI, such as subject and scanning institution information, were generated using the Python package Faker (https://pypi.org/project/Faker/8.10.3/). These were inserted into the DICOM metadata of selected imaging files using a system of inheritable rule-based templates outlining re-identification functions for data insertion, with logging for answer key creation. Text was also burned into the pixel data of a number of images. By systematically embedding realistic synthetic PHI into image headers and pixel data, accompanied by a detailed ground-truth answer key, our framework gives users transparency, documentation, and reproducibility in de-identification practices, aligned with the HIPAA Safe Harbor method, the DICOM PS3.15 Confidentiality Profiles, and TCIA best practices.

    Usage Notes

    This DICOM collection is split into two datasets, synthetic and curated. The synthetic dataset is the PHI/PII infused DICOM collection accompanied by a validation script and answer keys for testing, refining and benchmarking medical image de-identification pipelines. The curated dataset is a version of the synthetic dataset curated and de-identified by members of The Cancer Imaging Archive curation team. It can be used as a guide, an example of medical image curation best practices. For the purposes of the De-Identification challenge at MICCAI 2024, the synthetic and curated datasets each contain two subsets, a portion for Validation and the other for Testing.

    To link a curated dataset to the original synthetic dataset and answer keys, a mapping between the unique identifiers (UIDs) and patient IDs must be provided in CSV format to the evaluation software. We include the mapping files associated with the TCIA-curated set as an example. Lastly, for both the Validation and Testing datasets, an answer key in sqlite.db format is provided. These components are for use with the Python validation script linked below (4). Combining these components, a user developing or evaluating de-identification methods can ensure they meet a specification for successfully de-identifying medical image data.

  9. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    csv, json, bin, png (available download formats)
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv (see the loading sketch after this list). Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need a Python environment such as VS Code or Jupyter, with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
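
    A minimal loading sketch following the naming conventions above; the folder names are assumptions derived from the description:

      import pandas as pd

      # Folder names are assumptions based on the structure described above.
      train = pd.read_csv("Training Data/train_data.csv")
      val = pd.read_csv("Validation Data/validation_data.csv")
      test = pd.read_csv("Test Data/test_data.csv")

      # Train on `train`, tune hyperparameters on `val`, report on `test` once.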

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  10. 100 Sports Image Classification

    • kaggle.com
    Updated May 3, 2023
    Cite
    Gerry (2023). 100 Sports Image Classification [Dataset]. https://www.kaggle.com/datasets/gpiosenka/sports-classification/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gerry
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Please upvote if you find this dataset of use - thank you. This version is an update of the earlier version: a dataset-quality evaluation program run on the previous version found a considerable number of duplicate and near-duplicate images. Duplicate images can lead to falsely high validation and test set accuracy, so these images have been eliminated in this version of the dataset. Images were gathered from internet searches and scanned with a duplicate-image detector program I wrote; any duplicate images were removed to prevent bleed-through of images between the train, test and valid data sets. All images were then resized to 224 x 224 x 3 and converted to jpg format. A csv file is included that, for each image file, contains the relative path to the image file, the image file's class label, and the data set (train, test or valid) that the image file resides in. This is a clean dataset: if you build a good model you should achieve at least 95% accuracy on the test set, and with a very good model, for example using transfer learning, you should be able to achieve 98%+ test set accuracy.

    Content

    Collection of sports images covering 100 different sports. Images are 224x224x3 in jpg format. Data is separated into train, test and valid directories. Additionally, a csv file is included for those who wish to use it to create their own train, test and validation datasets.

    Inspiration

    Wanted to build a high-quality, clean data set that was easy to use and had no bad images or duplication between the train, test and validation data sets. Provides a good data set to test your models on. Designed for straightforward application of Keras preprocessing functions like ImageDataGenerator.flow_from_directory or, if you use the csv file, ImageDataGenerator.flow_from_dataframe. This dataset was carefully created so that the region of interest (ROI), in this case the sport, occupies approximately 50% of the pixels in the image. As a consequence, even models of moderate complexity should achieve training and validation accuracies in the high 90s.
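
    A sketch of the straightforward Keras usage the author has in mind; the directory names follow the train/test/valid layout described above:

      from tensorflow.keras.preprocessing.image import ImageDataGenerator

      gen = ImageDataGenerator(rescale=1.0 / 255)

      # One generator per directory; images are already 224 x 224 x 3.
      train_flow = gen.flow_from_directory("train", target_size=(224, 224), class_mode="categorical")
      valid_flow = gen.flow_from_directory("valid", target_size=(224, 224), class_mode="categorical")
      test_flow = gen.flow_from_directory("test", target_size=(224, 224), class_mode="categorical", shuffle=False)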

  11. Data from: Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction

    • acs.figshare.com
    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R²) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R² that is more like that of true prospective prediction than the R² from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
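
    An illustrative sketch (not the paper's code) contrasting the two selection schemes; the file and column names are assumptions:

      import pandas as pd

      # Hypothetical QSAR table with a registration date per compound.
      compounds = pd.read_csv("qsar_data.csv").sort_values("date")

      # Time-split: train on the earliest 80%, test on the most recent 20%.
      cut = int(0.8 * len(compounds))
      train_time, test_time = compounds.iloc[:cut], compounds.iloc[cut:]

      # Random split, for comparison (tends to give an optimistic R²).
      test_rand = compounds.sample(frac=0.2, random_state=0)
      train_rand = compounds.drop(test_rand.index)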

  12. Downsized camera trap images for automated classification

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    bin, zip
    Updated Dec 1, 2022
    + more versions
    Cite
    Danielle L Norman; Oliver R Wearne; Philip M Chapman; Sui P Heon; Robert M Ewers (2022). Downsized camera trap images for automated classification [Dataset]. http://doi.org/10.5281/zenodo.6627707
    Explore at:
    bin, zip (available download formats)
    Dataset updated
    Dec 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Danielle L Norman; Oliver R Wearne; Philip M Chapman; Sui P Heon; Robert M Ewers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description:

    Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", together with the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.

    Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions

    Funding: These data were collected as part of research funded by:

    This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.

    XML metadata: GEMINI compliant metadata for this dataset is available here

    Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip

    CT_image_data_info2.xlsx

    This file contains dataset metadata and 1 data table:

    1. Dataset Images (described in worksheet Dataset_images)

      Description: This worksheet details the composition of each dataset used in the analyses

      Number of fields: 69

      Number of data rows: 270287

      Fields:

      • filename: Root ID (Field type: id)
      • camera_trap_site: Site ID for the camera trap location (Field type: location)
      • taxon: Taxon recorded by camera trap (Field type: taxa)
      • dist_level: Level of disturbance at site (Field type: ordered categorical)
      • baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_1: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 1' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 2' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 3' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 4' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 5' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 2 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 4 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 3 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 4 (quad)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical)

  13. Train, validation, test data sets and confusion matrices underlying publication: "Automated cell counting for Trypan blue stained cell cultures using machine learning"

    • data.4tu.nl
    zip
    Updated Sep 7, 2023
    Cite
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen (2023). Train, validation, test data sets and confusion matrices underlying publication: "Automated cell counting for Trypan blue stained cell cultures using machine learning" [Dataset]. http://doi.org/10.4121/21695819.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Annotated test and train data sets; images and annotations are provided separately.

    Validation data set for Hi5, Sf9 and HEK cells.

    Confusion matrices for the determination of performance parameters.

  14. Data from: Analysis, Modeling, and Target-Specific Predictions of Linear Peptides Inhibiting Virus Entry

    • acs.figshare.com
    xlsx
    Updated Nov 24, 2023
    Cite
    Boris Vishnepolsky; Maya Grigolava; Andrei Gabrielian; Alex Rosenthal; Darrell Hurt; Michael Tartakovsky; Malak Pirtskhalava (2023). Analysis, Modeling, and Target-Specific Predictions of Linear Peptides Inhibiting Virus Entry [Dataset]. http://doi.org/10.1021/acsomega.3c07521.s001
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    ACS Publications
    Authors
    Boris Vishnepolsky; Maya Grigolava; Andrei Gabrielian; Alex Rosenthal; Darrell Hurt; Michael Tartakovsky; Malak Pirtskhalava
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Antiviral peptides (AVPs) are bioactive peptides that exhibit inhibitory activity against viruses through a range of mechanisms. Virus entry inhibitory peptides (VEIPs) make up a specific class of AVPs that can prevent enveloped viruses from entering cells. With the growing number of experimentally verified VEIPs, there is an opportunity to use machine learning to predict peptides that inhibit virus entry. In this paper, we have developed the first target-specific prediction model for the identification of new VEIPs using, along with the peptide sequence characteristics, the attributes of the envelope proteins of the target virus, which overcomes the problem of insufficient data for particular viral strains and improves the predictive ability. The model's performance was evaluated through 10 repeats of 10-fold cross-validation on the training data set, and the results indicate that it can predict VEIPs with 87.33% accuracy and a Matthews correlation coefficient (MCC) of 0.76. The model also performs well on an independent test set, with 90.91% accuracy and an MCC of 0.81. We have also developed an automatic computational tool that predicts VEIPs, which is freely available at https://dbaasp.org/tools?page=linear-amp-prediction.
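
    For reference, the two reported metrics can be computed with scikit-learn; the labels below are placeholders, not the paper's data:

      from sklearn.metrics import accuracy_score, matthews_corrcoef

      # Placeholder binary labels (1 = VEIP, 0 = non-VEIP).
      y_true = [1, 1, 0, 1, 0, 0, 1, 0]
      y_pred = [1, 1, 0, 0, 0, 0, 1, 1]

      print(accuracy_score(y_true, y_pred))      # fraction of correct predictions
      print(matthews_corrcoef(y_true, y_pred))   # balanced score in [-1, 1]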

  15. EN-MEDIQA

    • huggingface.co
    Updated Aug 2, 2024
    Cite
    jonathan kang (2024). EN-MEDIQA [Dataset]. https://huggingface.co/datasets/jonathankang/EN-MEDIQA
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2024
    Authors
    jonathan kang
    Description

    MTS-DIALOG fitted to look like the SAMSUM dataset

      Dataset Summary
    

    The dataset consists of X examples split into training, validation, and test sets. It includes dialogues and their corresponding summaries, with a focus on medical conversations.

      Dataset Structure
    

    train: Training set, containing X examples.
    validation: Validation set, containing X examples.
    test: Test set, containing X examples.

      Data Fields
    

    id: Unique identifier for each example. dialogue: The… See the full description on the dataset page: https://huggingface.co/datasets/jonathankang/EN-MEDIQA.

  16. DUDE competition train - validation - test splits ground truth

    • zenodo.org
    json
    Updated Mar 23, 2023
    Cite
    Jordy Van Landeghem; Jordy Van Landeghem (2023). DUDE competition train - validation - test splits ground truth [Dataset]. http://doi.org/10.5281/zenodo.7680617
    Explore at:
    json (available download formats)
    Dataset updated
    Mar 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jordy Van Landeghem; Jordy Van Landeghem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This JSON file contains the ground truth annotations for the train and validation set of the DUDE competition (https://rrc.cvc.uab.es/?ch=23&com=tasks) of ICDAR 2023 (https://icdar2023.org/).

    V1.0.7 release: 41454 annotations for 4974 documents (train-validation-test)

    DatasetDict({
      train: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 23728
      })
      val: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 6315
      })
      test: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 11411
      })
    })
    
    Changelog:
    ++ update on answer_type
    +++ formatting change to answers_variants
    ++++ stricter check on answers_variants & rename of annotations file
    + blind test set (no ground truth answers provided)
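
    A minimal sketch for reading the ground-truth JSON; the file name and the exact top-level layout beyond the fields listed above are assumptions:

      import json
      from collections import Counter

      # Hypothetical file name for the released annotations.
      with open("DUDE_gt_annotations.json") as f:
          annotations = json.load(f)

      # Count annotations per split, assuming a flat list of records
      # each carrying the 'data_split' field shown above.
      print(Counter(a["data_split"] for a in annotations))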
    

  17. WDC LSPM Dataset

    • library.toponeai.link
    Updated Feb 8, 2025
    Cite
    (2025). WDC LSPM Dataset [Dataset]. https://library.toponeai.link/dataset/wdc-products
    Explore at:
    Dataset updated
    Feb 8, 2025
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.

    To support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived from shared product identifiers on the Web via weak supervision.
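
    A hedged sketch of carving a validation split out of a training set using the provided ID sets; the file and column names here are assumptions, not the official WDC layout:

      import pandas as pd

      # Hypothetical file names for one category/size combination.
      pairs = pd.read_json("computers_train_medium.json.gz", lines=True)
      valid_ids = set(pd.read_csv("computers_valid_medium.csv")["pair_id"])

      # Pairs whose IDs were drawn for validation form the validation set.
      valid = pairs[pairs["pair_id"].isin(valid_ids)]
      train = pairs[~pairs["pair_id"].isin(valid_ids)]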

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.

  18. Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Nov 27, 2020
    + more versions
    Cite
    (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/720b440c-eda0-5182-af9f-f868ed999bd7
    Explore at:
    Dataset updated
    Nov 27, 2020
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.

    To support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived from shared product identifiers on the Web via weak supervision.

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.

  19. Extended dataset for the validation of the competent Computational Thinking test in grades 3-6

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 31, 2024
    Cite
    Bruno, Barbara (2024). Extended dataset for the validation of the competent Computational Thinking test in grades 3-6 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7983524
    Explore at:
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Bruno, Barbara
    El-Hamamsy, Laila
    Dehler Zufferey, Jessica
    Mondada, Francesco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Extended dataset for the validation of the competent Computational Thinking test in grades 3-6

    • If you publish material based on this dataset, please cite the following:

    • The Zenodo repository: Laila El-Hamamsy, Barbara Bruno, Jessica Dehler Zufferey, & Francesco Mondada (2023). Extended dataset for the validation of the competent Computational Thinking test in grades 3-6 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7983525
    
    
    • The article on the validation of the computational thinking test for grades 3-6: El-Hamamsy, L., Zapata-Cáceres, M., Martín-Barroso, E., Mondada, F., Dehler Zufferey, J., Bruno, B., & Román-González, M. (2023). The competent computational thinking test (cCTt): a valid, reliable and gender-fair test for longitudinal CT studies in grades 3-6. arXiv. https://doi.org/10.48550/arXiv.2305.19526
    

    • License: This work is licensed under a Creative Commons Attribution 4.0 International license (CC-BY-4.0)

    • Creators: El-Hamamsy, L., Bruno, B., Dehler Zufferey, J., and Mondada, F.

    • Date: May 30th 2023

    • Subject: Computational Thinking (CT), Assessment, Primary education, Psychometric validation

    • Dataset format: CSV. The dataset contains four files (one per grade, see detailed description below). Please note that the spreadsheets may contain missing values due to students not being present for a part of the data collection. To access the specific cCTt questions, please refer to the original publication [1] and the Zenodo repository [2], which provide the full set of questions and correct responses.

    • Dataset size: < 500 kB

    • Data collection period: January and November 2021

    • Abbreviations: CT: Computational Thinking; cCTt: competent CT test

    • Funding: This work was funded by the NCCR Robotics, a National Centre of Competence in Research, funded by the Swiss National Science Foundation (grant number 51NF40_185543)

    References

    [1] El-Hamamsy, L., Zapata-Cáceres, M., Barroso, E. M., Mondada, F., Zufferey, J. D., & Bruno, B. (2022). The Competent Computational Thinking Test: Development and Validation of an Unplugged Computational Thinking Test for Upper Primary School. Journal of Educational Computing Research, 60(7), 1818–1866. https://doi.org/10.1177/07356331221081753

    [2] El-Hamamsy, L., Zapata-Cáceres, M., Marcelino, P., Dehler Zufferey, J., Bruno, B., Martín Barroso, E., & ‪Román-González, M.‬ (2022). Dataset for the comparison of two Computational Thinking (CT) test for upper primary school (grades 3-4) : the Beginners' CT test (BCTt) and the competent CT test (cCTt) (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5885034 ‬‬‬‬‬‬‬‬‬‬‬

    [3] El-Hamamsy, L., Zapata-Cáceres, M., Martín-Barroso, E., Mondada, F., Dehler Zufferey, J., Bruno, B., & Román-González, M. (2023). The competent computational thinking test (cCTt): a valid, reliable and gender-fair test for longitudinal CT studies in grades 3-6. arXiv. https://doi.org/10.48550/arXiv.2305.19526

    [4] Brennan, K. and Resnick, M. (2012). New frameworks for studying and assessing the development of computational thinking. page 25

    [5] El-Hamamsy, L., Zapata-Cáceres, M., Marcelino, P., Bruno, B., Dehler Zufferey, J., Martín-Barroso, E., & Román-González, M. (2022). Comparing the psychometric properties of two primary school Computational Thinking (CT) assessments for grades 3 and 4: The Beginners’ CT test (BCTt) and the competent CT test (cCTt). Frontiers in Psychology, 13. https://www.frontiersin.org/articles/10.3389/fpsyg.2022.1082659

  20. MangoYOLO data set

    • researchdata.edu.au
    • acquire.cqu.edu.au
    Updated Apr 8, 2021
    Cite
    Z Wang; Kerry Walsh; C McCarthy; Anand Koirala (2021). MangoYOLO data set [Dataset]. https://researchdata.edu.au/mangoyolo-set
    Explore at:
    Dataset updated
    Apr 8, 2021
    Dataset provided by
    Central Queensland University
    Authors
    Z Wang; Kerry Walsh; C McCarthy; Anand Koirala
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets and directories are structured similarly to the PASCAL VOC dataset, avoiding the need to change scripts already available for detection frameworks that are ready to parse PASCAL VOC annotations into their own format.

    The sub-directory JPEGImages consists of 1730 images (612x512 pixels) used for training, testing and validation; each image has at least one annotated fruit. The sub-directory Annotations contains one annotation file per image (the record of bounding box coordinates for that image) in XML format, with the same name as the image. The sub-directory Main contains the text files listing the image names (without extension) used for training, validation and testing: the training set (train.txt) lists 1300 images, the validation set (val.txt) lists 130 images, and the test set (test.txt) lists 300 images.

    Each image has an XML annotation file (filename = image name) and each image set (training, validation and test) has an associated text file (train.txt, val.txt and test.txt) containing the list of image names to be used for training and testing. The XML annotation file contains the image attributes (name, width, height) and the object attributes (class name, object bounding box coordinates (xmin, ymin, xmax, ymax)). (xmin, ymin) and (xmax, ymax) are the pixel coordinates of the bounding box's top-left and bottom-right corners, respectively.
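
    A minimal standard-library sketch for parsing one such PASCAL VOC-style annotation file (the file name is a placeholder):

      import xml.etree.ElementTree as ET

      root = ET.parse("Annotations/example_image.xml").getroot()

      # Image attributes recorded in the annotation.
      width = int(root.find("size/width").text)
      height = int(root.find("size/height").text)

      # One <object> element per annotated fruit.
      for obj in root.findall("object"):
          name = obj.find("name").text
          box = obj.find("bndbox")
          xmin, ymin = int(box.find("xmin").text), int(box.find("ymin").text)
          xmax, ymax = int(box.find("xmax").text), int(box.find("ymax").text)
          print(name, (xmin, ymin), (xmax, ymax))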
