100+ datasets found

h
alpaca-train-validation-test-split
huggingface.co
Updated Aug 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 12, 2023
Authors
Doula Isham Rashik Hasan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for Alpaca

I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.

Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
CNN models and training, validation and test datasets for "PlotMI:...
zenodo.org
application/gzip
Updated Sep 15, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tuomo Hartonen; Tuomo Hartonen; Teemu Kivioja; Jussi Taipale; Teemu Kivioja; Jussi Taipale (2021). CNN models and training, validation and test datasets for "PlotMI: interpretation of pairwise interactions and positional preferences learned by a deep learning model from sequence data" [Dataset]. http://doi.org/10.5281/zenodo.5508698
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5508698
Dataset updated
Sep 15, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Tuomo Hartonen; Tuomo Hartonen; Teemu Kivioja; Jussi Taipale; Teemu Kivioja; Jussi Taipale
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Convolutional neural network (CNN) models and their respective training, validation and test datasets used in manuscript:

Tuomo Hartonen, Teemu Kivioja and Jussi Taipale, "PlotMI: interpretation of pairwise interactions and positional preferences learned by a deep learning model from sequence data"
Training/Validation/Test set split
figshare.com
zip
Updated Mar 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tianfan Jin (2024). Training/Validation/Test set split [Dataset]. http://doi.org/10.6084/m9.figshare.25511056.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25511056.v1
Dataset updated
Mar 30, 2024
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Tianfan Jin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Including the split of real and null reactions for training, validation and test
Challenge Round 0 (Dry Run) Test Dataset
catalog.data.gov
data.nist.gov
+1more
Updated Jul 29, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2022). Challenge Round 0 (Dry Run) Test Dataset [Dataset]. https://catalog.data.gov/dataset/challenge-round-0-dry-run-test-dataset-ff885
Explore at:
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This dataset was an initial test harness infrastructure test for the TrojAI program. It should not be used for research. Please use the more refined datasets generated for the other rounds. The data being generated and disseminated is training, validation, and test data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 200 trained, human level, image classification AI models using the following architectures (Inception-v3, DenseNet-121, and ResNet50). The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.
4
Train, validation, test data sets and confusion matrices underlying...
data.4tu.nl
zip
Updated Sep 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen (2023). Train, validation, test data sets and confusion matrices underlying publication: "Automated cell counting for Trypan blue stained cell cultures using machine learning" [Dataset]. http://doi.org/10.4121/21695819.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/21695819.v1
Dataset updated
Sep 7, 2023
Dataset provided by
4TU.ResearchData
Authors
Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Annotated test and train data sets. Both images and annotations are provided separately.

Validation data set for Hi5, Sf9 and HEK cells.

Confusion matrices for the determination of performance parameters
Dataset, splits, models, and scripts for the QM descriptors prediction
zenodo.org
explore.openaire.eu
application/gzip
Updated Apr 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shih-Cheng Li; Shih-Cheng Li; Haoyang Wu; Haoyang Wu; Angiras Menon; Angiras Menon; Kevin A. Spiekermann; Kevin A. Spiekermann; Yi-Pei Li; Yi-Pei Li; William H. Green; William H. Green (2024). Dataset, splits, models, and scripts for the QM descriptors prediction [Dataset]. http://doi.org/10.5281/zenodo.10668491
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10668491
Dataset updated
Apr 4, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Shih-Cheng Li; Shih-Cheng Li; Haoyang Wu; Haoyang Wu; Angiras Menon; Angiras Menon; Kevin A. Spiekermann; Kevin A. Spiekermann; Yi-Pei Li; Yi-Pei Li; William H. Green; William H. Green
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.

Below are descriptions of the available scripts:

atom_bond_descriptors.sh: Trains atom/bond targets.

atom_bond_descriptors_predict.sh: Predicts atom/bond targets from pre-trained model.

dipole_quadrupole_moments.sh: Trains dipole and quadrupole moments.

dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from pre-trained model.

energy_gaps_IP_EA.sh: Trains energy gaps, ionization potential (IP), and electron affinity (EA).

energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from pre-trained model.

get_constraints.py: Generates constraints file for testing dataset. This generated file needs to be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.

csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with Chemprop software.

Below is the procedure for running the ml-QM-GNN on your own dataset:

Use get_constraints.py to generate a constraint file required for predicting atom/bond QM descriptors with the trained ML models.

Execute atom_bond_descriptors_predict.sh to predict atom and bond properties. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.

Utilize csv2pkl.py to convert the data from predicted atom/bond descriptors .csv file into separate atom and bond feature files (which are saved as .pkl files here).

Run Chemprop to train your models using the additional predicted features supported here.
Train-validation-test database for LPM
figshare.com
zip
Updated Jul 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tianfan Jin (2024). Train-validation-test database for LPM [Dataset]. http://doi.org/10.6084/m9.figshare.26380666.v2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.26380666.v2
Dataset updated
Jul 26, 2024
Dataset provided by
Figsharehttp://figshare.com/
Authors
Tianfan Jin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the database for full model training and evaluation for LPM
Train Test and Validation Split
kaggle.com
Updated Apr 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
IMT2022053 (2025). Train Test and Validation Split [Dataset]. https://www.kaggle.com/datasets/pranavakulkarni/train-test-and-validation-split/suggestions?status=pending
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 18, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
IMT2022053
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by IMT2022053

Released under Apache 2.0

Contents
DUDE competition train - validation - test splits ground truth
zenodo.org
json
Updated Mar 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jordy Van Landeghem; Jordy Van Landeghem (2023). DUDE competition train - validation - test splits ground truth [Dataset]. http://doi.org/10.5281/zenodo.7763635
Explore at:
jsonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7763635
Dataset updated
Mar 23, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jordy Van Landeghem; Jordy Van Landeghem
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This JSON file contains the ground truth annotations for the train and validation set of the DUDE competition (https://rrc.cvc.uab.es/?ch=23&com=tasks) of ICDAR 2023 (https://icdar2023.org/).

V1.0.7 release: 41454 annotations for 4974 documents (train-validation-test)

DatasetDict({ train: Dataset({ features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'], num_rows: 23728 }) val: Dataset({ features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'], num_rows: 6315 }) test: Dataset({ features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'], num_rows: 11402 }) }) ++update on answer_type +++formatting change to answers_variants ++++stricter check on answer_variants & rename annotations file + blind test set (no ground truth answers provided) ++ removed duplicates from test set:

"92bd5c758bda9bdceb5f67c17009207b_ac6964cbdf483e765b6668e27b3d0bc4",

"6ee71a16d4e4d1dbd7c1f569a92d4e08_549f2a163f8ff3e9f0293cf59fdd98bc",

"e6f3855472231a7ca6aada2f8e85fe5a_827c03a72f2552c722f2c872fd7f74c3",

"e3eecd7cca5de11f1d17cd94ae6a8d77_6300df64e4cf6ba0600ac81278f68de2",

"107b4037df8127a92ee4b6ae9b5df8fb_d7a60e7a9fc0b27487ea39cd7f56f98e",

"300cc3900080064d308983f958141232_6a7cf1aad908d58a75ab8e02ddc856f4",

"fdd3308efacddb88d4aa6e2073f481d4_138cb868ecc804a63cc7a4502c0009b2",

"1f7de256ff1743d329a8402ba0d132e7_95b6e8758533a9817b9f20a958e7b776",

"4f399b8c526ffb6a2fd585a18d4ed5ec_51097231bc327c26c59a4fd8d3ff3069",
d
Training dataset for NABat Machine Learning V1.0
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
f
Training, validation and test datasets and model files for larger US Health...
ufs.figshare.com
txt
Updated Dec 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jan Marthinus Blomerus (2023). Training, validation and test datasets and model files for larger US Health Insurance dataset [Dataset]. http://doi.org/10.38140/ufs.24598881.v2
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.38140/ufs.24598881.v2
Dataset updated
Dec 12, 2023
Dataset provided by
University of the Free State
Authors
Jan Marthinus Blomerus
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Formats1.xlsx contains the descriptions of the columns of the following datasets: Training, validation and test datasets in combination are all the records.sens1.csv and and meansdX.csv are required for testing.
Dog vs Cat Images Data
kaggle.com
Updated Sep 1, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kunal Gupta (2020). Dog vs Cat Images Data [Dataset]. https://www.kaggle.com/datasets/kunalgupta2616/dog-vs-cat-images-data/versions/1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 1, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Kunal Gupta
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Brief

The data consist of 45.5K images of dogs and cats well distributed into 3 different sets. This is the first Dataset uploaded by me at Kaggle.

Download Api

> kaggle datasets download -d kunalgupta2616/dog-vs-cat-images-data

Content

Train Data - consist of 25000 images divided into equal half (12500 * 2) of cats and dogs images in separate directories.

Validation Data - consist of 8000 images divided into equal half (4000 * 2) of cats and dogs images in separate directories.

Test Data - consist of 12500 images of cats and dogs images.

sampleSubmission.csv - The submission file for the test data which can be found at link given under Acknowledgements.

Acknowledgements

This is an improvement to the dataset from the competition that can be found at this link. in terms of addition of more data and hierarchical distribution.
Z
Machine learning models, and training, validation and test datasets for:...
data.niaid.nih.gov
Updated Nov 12, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daub, Carsten O (2021). Machine learning models, and training, validation and test datasets for: "Sequence determinants of human gene regulatory elements" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5101419
Explore at:
Dataset updated
Nov 12, 2021
Dataset provided by
Cramer, Patrick
Sahu, Biswajyoti
Hartonen, Tuomo
Kivioja, Teemu
Wei, Bei
Lidschreiber, Michael
Zhu, Fangjie
Pihlajamaa, Päivi
Dave, Kashyap
Kaasinen, Eevi
Lidschreiber, Katja
Daub, Carsten O
Taipale, Jussi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This record contains the training, test and validation datasets used to train and evaluate the machine learning models in manuscript:

Sahu, Biswajyoti, et al. "Sequence determinants of human gene regulatory elements." (2021).

This record contains also the final hyperparameter-optimized models for each training dataset/task combination described in the manuscript. The README-files provided with the record describe the datasets and models in more detail. The datasets deposited here are derived from the original raw data (GEO accession: GSE180158) as described in the Methods of the manuscript.
Z
Data from: Self-Supervised Representation Learning on Neural Network Weights...
data.niaid.nih.gov
zenodo.org
Updated Nov 13, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kostadinov, Dimche (2021). Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction - Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5645137
Explore at:
Dataset updated
Nov 13, 2021
Dataset provided by
Schürholt, Kontantin
Kostadinov, Dimche
Borth, Damian
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets to NeurIPS 2021 accepted paper "Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction".

Datasets are pytorch files containing a dictionary with training, validation and test sets. Train, validation and test sets are custom dataset classes which inherit from the standard torch dataset class. Corresponding code an be found at https://github.com/HSG-AIML/NeurIPS_2021-Weight_Space_Learning.

Datasets 41, 42, 43 and 44 are our dataset format wrapped around the zoos from Unterthiner et al, 2020 (https://github.com/google-research/google-research/tree/master/dnn_predict_accuracy)

Abstract: Self-Supervised Learning (SSL) has been shown to learn useful and information-preserving representations. Neural Networks (NNs) are widely applied, yet their weight space is still not fully understood. Therefore, we propose to use SSL to learn neural representations of the weights of populations of NNs. To that end, we introduce domain specific data augmentations and an adapted attention architecture. Our empirical evaluation demonstrates that self-supervised representation learning in this domain is able to recover diverse NN model characteristics. Further, we show that the proposed learned representations outperform prior work for predicting hyper-parameters, test accuracy, and generalization gap as well as transfer to out-of-distribution settings.
f
Dataset
figshare.com
application/x-gzip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Moynuddin Ahmed Shibly (2023). Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13577873.v1
Explore at:
application/x-gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.13577873.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Moynuddin Ahmed Shibly
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is an open source - publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/ . We split the dataset into three sets - train, validation, and test. For our experiments, we created two other versions of the dataset. We have applied 10-fold cross validation on the train set and created ten folds. We also created ten bags of datasets using bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using pre-trained ResNet50 model as feature extractor. On the features extracted by ResNet50 we have applied PCA and created a tabilar dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above. Those folds are also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The compression has been performed for speeding up the upload and download purpose and mostly for the sake of convenience. If anyone has any question about how the datasets are organized please feel free to ask me at shiblygnr@gmail.com .I will get back to you in earliest time possible.
m
Ultimate_Analysis
data.mendeley.com
Updated Jan 28, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Akara Kijkarncharoensin (2022). Ultimate_Analysis [Dataset]. http://doi.org/10.17632/t8x96g88p3.2
Explore at:
Unique identifier
https://doi.org/10.17632/t8x96g88p3.2
Dataset updated
Jan 28, 2022
Authors
Akara Kijkarncharoensin
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This database studies the performance inconsistency on the biomass HHV ultimate analysis. The research null hypothesis is the consistency in the rank of a biomass HHV model. Fifteen biomass models are trained and tested in four datasets. In each dataset, the rank invariability of these 15 models indicates the performance consistency.

The database includes the datasets and source codes to analyze the performance consistency of the biomass HHV. These datasets are stored in tabular on an excel workbook. The source codes are the biomass HHV machine learning model through the MATLAB Objected Orient Program (OOP). These machine learning models consist of eight regressions, four supervised learnings, and three neural networks.

An excel workbook, "BiomassDataSetUltimate.xlsx," collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV data from 20 pieces of literature. The names of the worksheet column indicate the elements of the ultimate analysis on a % dry basis. The HHV column refers to the higher heating value in MJ/kg. The following worksheet, "Full Residuals," backups the model testing's residuals based on the 20-fold cross-validations. The article (Kijkarncharoensin & Innet, 2021) verifies the performance consistency through these residuals. The other worksheets present the literature datasets implemented to train and test the model performance in many pieces of literature.

A file named "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The list of the folders in this file is the class structure of the machine learning models. These classes extend the features of the original MATLAB's Statistics and Machine Learning Toolbox to support, e.g., the k-fold cross-validation. The MATLAB script, name "runStudyUltimate.m," is the article's main program to analyze the performance consistency of the biomass HHV model through the ultimate analysis. The script instantly loads the datasets from the excel workbook and automatically fits the biomass model through the OOP classes.

The first section of the MATLAB script generates the most accurate model by optimizing the model's higher parameters. It takes a few hours for the first run to train the machine learning model via the trial and error process. The trained models can be saved in MATLAB .mat file and loaded back to the MATLAB workspace. The remaining script, separated by the script section break, performs the residual analysis to inspect the performance consistency. Furthermore, the figure of the biomass data in the 3D scatter plot, and the box plots of the prediction residuals are exhibited. Finally, the interpretations of these results are examined in the author's article.

Reference : Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.
t
FAIR Dataset for Disease Prediction in Healthcare Applications
test.researchdata.tuwien.ac.at
bin, csv, json, png
Updated Apr 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Explore at:
csv, json, bin, pngAvailable download formats
Unique identifier
https://doi.org/10.70124/5n77a-dnf02
Dataset updated
Apr 14, 2025
Dataset provided by
TU Wien
Authors
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

Context and Methodology

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

Technical Details

Structure of the Dataset:
The dataset consists of several files organized into folders by data type:

Training Data: Contains the training dataset used to train the machine learning model.

Validation Data: Used for hyperparameter tuning and model selection.

Test Data: Reserved for final model evaluation.

Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:

Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

Further Details

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Complete Blood Count (CBC)
kaggle.com
Updated Aug 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Noukhez (2024). Complete Blood Count (CBC) [Dataset]. https://www.kaggle.com/datasets/mdnoukhej/complete-blood-count-cbc
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 1, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Muhammad Noukhez
Description
Dataset Description:

This dataset is a comprehensive collection of Complete Blood Count (CBC) images, meticulously organized to support machine learning and deep learning projects, especially in the domain of medical image analysis. The dataset's structure ensures a balanced and systematic approach to model development, validation, and testing.

Dataset Breakdown:

Training Images: 300

Validation Images: 60

Test Images: 60

Annotations: Detailed annotations included for all images

Overview:

The Complete Blood Count (CBC) is a crucial test used in medical diagnostics to evaluate the overall health and detect a variety of disorders, including anemia, infection, and many other diseases. This dataset provides a rich source of CBC images that can be used to train machine learning models to automate the analysis and interpretation of these tests.

Data Composition:

Training Set:

Contains 300 images

These images are used to train machine learning models, enabling them to learn and recognize patterns associated with various blood cell types and conditions.

Validation Set:

Contains 60 images

Used to tune the models and optimize their performance, ensuring that the models generalize well to new, unseen data.

Test Set:

Contains 60 images

Used to evaluate the final model performance, providing an unbiased assessment of how well the model performs on new data.

Annotations:

Each image in the dataset is accompanied by detailed annotations, which include information about the different types of blood cells present and any relevant diagnostic features. These annotations are essential for supervised learning, allowing models to learn from labeled examples and improve their accuracy and reliability.

Key Features:

High-Quality Images: All images are of high quality, making them suitable for a variety of machine learning tasks, including image classification, object detection, and segmentation.

Comprehensive Annotations: Each image is thoroughly annotated, providing valuable information that can be used to train and validate models.

Balanced Dataset: The dataset is carefully balanced with distinct sets for training, validation, and testing, ensuring that models trained on this data will be robust and generalizable.

Applications:

This dataset is ideal for researchers and practitioners in the fields of machine learning, deep learning, and medical image analysis. Potential applications include: - Automated CBC Analysis: Developing algorithms to automatically analyze CBC images and provide diagnostic insights. - Blood Cell Classification: Training models to accurately classify different types of blood cells, which is critical for diagnosing various blood disorders. - Educational Purposes: Using the dataset as a teaching tool to help students and new practitioners understand the complexities of CBC image analysis.

Usage Notes:

Data Augmentation: Users may consider applying data augmentation techniques to increase the diversity of the training data and improve model robustness.

Preprocessing: Proper preprocessing, such as normalization and noise reduction, can enhance model performance.

Evaluation Metrics: It is recommended to use standard evaluation metrics such as accuracy, precision, recall, and F1-score to assess model performance.

Conclusion:

This CBC dataset is a valuable resource for anyone looking to advance the field of automated medical diagnostics through machine learning and deep learning. With its high-quality images, detailed annotations, and balanced composition, it provides the necessary foundation for developing accurate and reliable models for CBC analysis.
Sample, test, and validation data for findmycells
biorxiv.org
zip
Updated Feb 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dennis Segebarth; Dennis Segebarth (2023). Sample, test, and validation data for findmycells [Dataset]. http://doi.org/10.5281/zenodo.7655292
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7655292
Dataset updated
Feb 20, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dennis Segebarth; Dennis Segebarth
License
Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Description
findmycells is an open source python package, developed to foster the use of deep-learning based python tools for bioimage analysis, specifically for researchers with limited python coding experience. It is developed and maintained in the following GitHub repository: https://github.com/Defense-Circuits-Lab/findmycells

Disclaimer: All data (including the model ensemble) uploaded here serve solely as a test dataset for findmycells and are not intended for any other purposes.

For instance, the group, subgroup, or subject IDs don´t refer to the actual experimental conditions. Likewise, also the included ROI-files were only created to allow the testing of findmycells and may not live up to scientific standards. Furthermore, the image data represents a subset of a dataset that is already published here:

Segebarth, Dennis et al. (2020), Data from: On the objectivity, reliability, and validity of deep learning enabled bioimage analyses, Dryad, Dataset, https://doi.org/10.5061/dryad.4b8gtht9d

The model ensemble (cfos_ensemble.zip) was trained using deepflash2 (v 0.1.7)

Griebel, M., Segebarth, D., Stein, N., Schukraft, N., Tovote, P., Blum, R., & Flath, C. M. (2021). Deep-learning in the bioimaging wild: Handling ambiguous data with deepflash2. arXiv preprint arXiv:2111.06693.

The training was performed on a subset of the "lab-wue1" training dataset, using only the 27 images with IDs 0000 - 0099 (cfos_training_images.zip) and the corresponding est. GT masks (cfos_training_masks.zip). The images used in "cfos_fmc_test_project.zip" for the actual testing of findmycells are the images with the IDs 0100, 0106, 0149, and 0152 of the aforementioned "lab-wue1" training dataset. They were randomly distributed to the made-up subject folders and renamed to "dentate_gyrus_01" or "dentate_gyrus_02".
t
Training and validation dataset 2 of milling processes for time series...
service.tib.eu
Updated Nov 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Training and validation dataset 2 of milling processes for time series prediction - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-1738
Explore at:
Dataset updated
Nov 28, 2024
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Abstract: Das Ziel des Datensatzes ist das Training und die Validierung von Modellen zur Vorhersage von Zeitreihen für Fräsprozesse. Dazu wurden an einer DMC 60H Prozesse mit einer Abtastrate von 500 Hz durch eine Siemens Industrial Edge aufgenommen. Die Maschine wurde steuerungstechnisch aufgerüstet. Es wurden Prozesse für das Modelltraining und die Validierung aufgenommen, welche sowohl für die Bearbeitung von Stahl sowie von Aluminium verwendet wurden. Es wurden mehrere Aufnahmen mit und ohne Werkstück (Aircut) erstellt, um möglichst viele Fälle abdecken zu können. Es handelt sich um die gleiche Versuchsreihe wie in "Training and validation dataset of milling processes for time series prediction" mit der DOI 10.5445/IR/1000157789 und hat zum Ziel, eine Untersuchung der Übertragbarkeit von Modellen zwischen verschiedenen Maschinen zu ermöglichen. Abstract: The aim of the dataset is to train and validate models for predicting time series for milling processes. For this purpose, processes were recorded at a sampling rate of 500 Hz by a Siemens Industrial Edge on a DMC 60H. The machine was upgraded in terms of control technology. Processes for model training and validation were recorded, suitable for both steel and aluminum machining. Several recordings were made with and without the workpiece (aircut) in order to cover as many cases as possible. This is the same series of experiments as in "Training and validation dataset of milling processes for time series prediction" with DOI 10.5445/IR/1000157789 and allows an investigation of the transferability of models between different machines. TechnicalRemarks: Documents: -Design of Experiments: Information on the paths such as the technological values of the experiments -Recording information: Information about the recordings with comments -Data: All recorded datasets. The first level contains the folders for training and validation both with and without the workpiece. In the next level, the individual test executions are located. The individual recordings are stored in the form of a JSON file. This consists of a header with all relevant information such as the signal sources. This is followed by the entries of the recorded time series. -NC-Code: NC programs executed on the machine Experimental data: -Machine: Retrofitted DMC 60H -Material: S235JR, 2007 T4 -Tools: -VHM-Fräser HPC, TiSi, ⌀ f8 DC: 5mm -VHM-Fräser HPC, TiSi, ⌀ f8 DC: 10mm -VHM-Fräser HPC, TiSi, ⌀ f8 DC: 20mm -Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 5mm -Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 10mm -Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 5mm -Workpiece blank dimensions: 150x75x50mm License: This work is licensed under a Creative Commons Attribution 4.0 International License. Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).

Facebook

Twitter

Click to copy link

Link copied

Cite

Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split

alpaca-train-validation-test-split

Alpaca

disham993/alpaca-train-validation-test-split

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Aug 12, 2023

Authors

Doula Isham Rashik Hasan

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Dataset Card for Alpaca

I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.

  Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.

Clear search

Close search

Google apps

Main menu

alpaca-train-validation-test-split

CNN models and training, validation and test datasets for "PlotMI:...

Training/Validation/Test set split

Challenge Round 0 (Dry Run) Test Dataset

Train, validation, test data sets and confusion matrices underlying...

Dataset, splits, models, and scripts for the QM descriptors prediction

Train-validation-test database for LPM

Train Test and Validation Split

Dataset

Contents

DUDE competition train - validation - test splits ground truth

Training dataset for NABat Machine Learning V1.0

Training, validation and test datasets and model files for larger US Health...

Dog vs Cat Images Data

Brief

Download Api

Content

Train Data - consist of 25000 images divided into equal half (12500 * 2) of cats and dogs images in separate directories.

Validation Data - consist of 8000 images divided into equal half (4000 * 2) of cats and dogs images in separate directories.

Test Data - consist of 12500 images of cats and dogs images.

sampleSubmission.csv - The submission file for the test data which can be found at link given under Acknowledgements.

Acknowledgements

Machine learning models, and training, validation and test datasets for:...

Data from: Self-Supervised Representation Learning on Neural Network Weights...

Dataset

Ultimate_Analysis

FAIR Dataset for Disease Prediction in Healthcare Applications

Dataset Description

Context and Methodology

Technical Details

Further Details

Complete Blood Count (CBC)

Dataset Breakdown:

Overview:

Data Composition:

Annotations:

Key Features:

Applications:

Usage Notes:

Conclusion:

Sample, test, and validation data for findmycells

Training and validation dataset 2 of milling processes for time series...

alpaca-train-validation-test-split

Alpaca

disham993/alpaca-train-validation-test-split