100+ datasets found
  1. alpaca-train-validation-test-split

    • huggingface.co
    Updated Aug 12, 2023
    Cite
    Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Doula Isham Rashik Hasan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

    I have just performed a train, test and validation split on the original dataset. A repository to reproduce this will be shared here soon. I am including the original dataset card as follows.

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction tuning for language models and make them follow instructions better. … See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
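A split like this can be reproduced with a single shuffled-index pass; a minimal sketch (the 80/10/10 ratios and the seed are illustrative assumptions, not the author's published split):

```python
import random

def train_val_test_split(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle indices once, then carve out test and validation slices."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(records) * test_frac)
    n_val = int(len(records) * val_frac)
    test = [records[i] for i in idx[:n_test]]
    val = [records[i] for i in idx[n_test:n_test + n_val]]
    train = [records[i] for i in idx[n_test + n_val:]]
    return train, val, test

# 52,000 stand-in records, matching the Alpaca dataset size
train, val, test = train_val_test_split(list(range(52000)))
```

Shuffling once and slicing guarantees the three sets are disjoint and cover every record.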

  2. CNN models and training, validation and test datasets for "PlotMI:...

    • zenodo.org
    application/gzip
    Updated Sep 15, 2021
    Cite
    Tuomo Hartonen; Teemu Kivioja; Jussi Taipale (2021). CNN models and training, validation and test datasets for "PlotMI: interpretation of pairwise interactions and positional preferences learned by a deep learning model from sequence data" [Dataset]. http://doi.org/10.5281/zenodo.5508698
    Available download formats: application/gzip
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tuomo Hartonen; Teemu Kivioja; Jussi Taipale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Convolutional neural network (CNN) models and their respective training, validation and test datasets used in manuscript:

    Tuomo Hartonen, Teemu Kivioja and Jussi Taipale, "PlotMI: interpretation of pairwise interactions and positional preferences learned by a deep learning model from sequence data"

  3. Training/Validation/Test set split

    • figshare.com
    zip
    Updated Mar 30, 2024
    Cite
    Tianfan Jin (2024). Training/Validation/Test set split [Dataset]. http://doi.org/10.6084/m9.figshare.25511056.v1
    Available download formats: zip
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tianfan Jin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Includes the split of real and null reactions for training, validation and test.

  4. Challenge Round 0 (Dry Run) Test Dataset

    • catalog.data.gov
    • data.nist.gov
    • +1 more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Challenge Round 0 (Dry Run) Test Dataset [Dataset]. https://catalog.data.gov/dataset/challenge-round-0-dry-run-test-dataset-ff885
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This dataset was an initial test harness infrastructure test for the TrojAI program. It should not be used for research. Please use the more refined datasets generated for the other rounds. The data being generated and disseminated is training, validation, and test data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 200 trained, human level, image classification AI models using the following architectures (Inception-v3, DenseNet-121, and ResNet50). The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  5. Train, validation, test data sets and confusion matrices underlying...

    • data.4tu.nl
    zip
    Updated Sep 7, 2023
    Cite
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen (2023). Train, validation, test data sets and confusion matrices underlying publication: "Automated cell counting for Trypan blue stained cell cultures using machine learning" [Dataset]. http://doi.org/10.4121/21695819.v1
    Available download formats: zip
    Dataset provided by
    4TU.ResearchData
    Authors
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Annotated test and train data sets. Both images and annotations are provided separately.


    Validation data set for Hi5, Sf9 and HEK cells.


    Confusion matrices for the determination of performance parameters

  6. Dataset, splits, models, and scripts for the QM descriptors prediction

    • zenodo.org
    • explore.openaire.eu
    application/gzip
    Updated Apr 4, 2024
    Cite
    Shih-Cheng Li; Haoyang Wu; Angiras Menon; Kevin A. Spiekermann; Yi-Pei Li; William H. Green (2024). Dataset, splits, models, and scripts for the QM descriptors prediction [Dataset]. http://doi.org/10.5281/zenodo.10668491
    Available download formats: application/gzip
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shih-Cheng Li; Haoyang Wu; Angiras Menon; Kevin A. Spiekermann; Yi-Pei Li; William H. Green
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.

    Below are descriptions of the available scripts:

    1. atom_bond_descriptors.sh: Trains atom/bond targets.
    2. atom_bond_descriptors_predict.sh: Predicts atom/bond targets from a pre-trained model.
    3. dipole_quadrupole_moments.sh: Trains dipole and quadrupole moments.
    4. dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from a pre-trained model.
    5. energy_gaps_IP_EA.sh: Trains energy gaps, ionization potential (IP), and electron affinity (EA).
    6. energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from a pre-trained model.
    7. get_constraints.py: Generates the constraints file for the testing dataset. This generated file must be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.
    8. csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with the Chemprop software.

    Below is the procedure for running the ml-QM-GNN on your own dataset:

    1. Use get_constraints.py to generate a constraint file required for predicting atom/bond QM descriptors with the trained ML models.
    2. Execute atom_bond_descriptors_predict.sh to predict atom and bond properties. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.
    3. Use csv2pkl.py to convert the predicted atom/bond descriptors from the .csv file into separate atom and bond feature files (saved as .pkl files).
    4. Run Chemprop to train your models using the additional predicted features supported here.
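The RBF expansion mentioned above maps each scalar QM feature onto a grid of Gaussian basis functions; a minimal numpy sketch (the center grid and gamma here are illustrative assumptions, not the values used in csv2pkl.py):

```python
import numpy as np

def rbf_expand(x, centers, gamma=10.0):
    """Expand scalar features x (shape [n]) into Gaussian RBF features [n, k]."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)        # column vector [n, 1]
    c = np.asarray(centers, dtype=float).reshape(1, -1)  # row of centers [1, k]
    return np.exp(-gamma * (x - c) ** 2)                 # broadcast to [n, k]

centers = np.linspace(0.0, 1.0, 16)  # illustrative grid of 16 centers
feats = rbf_expand([0.05, 0.5, 0.93], centers)
```

Each input value activates most strongly the basis function whose center is nearest, giving the downstream model a smooth, localized encoding of the descriptor.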
  7. Train-validation-test database for LPM

    • figshare.com
    zip
    Updated Jul 26, 2024
    Cite
    Tianfan Jin (2024). Train-validation-test database for LPM [Dataset]. http://doi.org/10.6084/m9.figshare.26380666.v2
    Available download formats: zip
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tianfan Jin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the database for full model training and evaluation for LPM.

  8. Train Test and Validation Split

    • kaggle.com
    Updated Apr 18, 2025
    Cite
    IMT2022053 (2025). Train Test and Validation Split [Dataset]. https://www.kaggle.com/datasets/pranavakulkarni/train-test-and-validation-split
    Available download formats: Croissant
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    IMT2022053
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by IMT2022053

    Released under Apache 2.0


  9. DUDE competition train - validation - test splits ground truth

    • zenodo.org
    json
    Updated Mar 23, 2023
    Cite
    Jordy Van Landeghem (2023). DUDE competition train - validation - test splits ground truth [Dataset]. http://doi.org/10.5281/zenodo.7763635
    Available download formats: json
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jordy Van Landeghem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This JSON file contains the ground truth annotations for the train and validation set of the DUDE competition (https://rrc.cvc.uab.es/?ch=23&com=tasks) of ICDAR 2023 (https://icdar2023.org/).

    V1.0.7 release: 41454 annotations for 4974 documents (train-validation-test)

    DatasetDict({
      train: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 23728
      })
      val: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 6315
      })
      test: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 11402
      })
    })
    
    ++update on answer_type
    +++formatting change to answers_variants
    ++++stricter check on answer_variants & rename annotations file
    
    + blind test set (no ground truth answers provided)
    ++ removed duplicates from test set: 
    

    "92bd5c758bda9bdceb5f67c17009207b_ac6964cbdf483e765b6668e27b3d0bc4",
    "6ee71a16d4e4d1dbd7c1f569a92d4e08_549f2a163f8ff3e9f0293cf59fdd98bc",
    "e6f3855472231a7ca6aada2f8e85fe5a_827c03a72f2552c722f2c872fd7f74c3",
    "e3eecd7cca5de11f1d17cd94ae6a8d77_6300df64e4cf6ba0600ac81278f68de2",
    "107b4037df8127a92ee4b6ae9b5df8fb_d7a60e7a9fc0b27487ea39cd7f56f98e",
    "300cc3900080064d308983f958141232_6a7cf1aad908d58a75ab8e02ddc856f4",
    "fdd3308efacddb88d4aa6e2073f481d4_138cb868ecc804a63cc7a4502c0009b2",
    "1f7de256ff1743d329a8402ba0d132e7_95b6e8758533a9817b9f20a958e7b776",
    "4f399b8c526ffb6a2fd585a18d4ed5ec_51097231bc327c26c59a4fd8d3ff3069",
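Filtering these duplicates out of a loaded annotation list is straightforward; a minimal sketch (the docId field name follows the features listed above; the sample records here are synthetic):

```python
# Ids taken from the release notes above (list truncated for illustration).
DUPLICATE_DOC_IDS = {
    "92bd5c758bda9bdceb5f67c17009207b_ac6964cbdf483e765b6668e27b3d0bc4",
    "6ee71a16d4e4d1dbd7c1f569a92d4e08_549f2a163f8ff3e9f0293cf59fdd98bc",
}

def drop_duplicates(annotations, blacklist=DUPLICATE_DOC_IDS):
    """Keep only annotation records whose docId is not blacklisted."""
    return [a for a in annotations if a["docId"] not in blacklist]

sample = [
    {"docId": "92bd5c758bda9bdceb5f67c17009207b_ac6964cbdf483e765b6668e27b3d0bc4"},
    {"docId": "some_other_document"},
]
kept = drop_duplicates(sample)
```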

  10. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). 
    Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
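The capped random selection per species/grid-cell combination described above can be sketched as follows (the group key and the 1250 cap follow the description; the file records are synthetic):

```python
import random
from collections import defaultdict

def sample_per_group(files, key, cap=1250, seed=0):
    """Randomly keep at most `cap` files per group (e.g. species/grid cell)."""
    groups = defaultdict(list)
    for f in files:
        groups[key(f)].append(f)
    rng = random.Random(seed)
    selected = []
    for members in groups.values():
        rng.shuffle(members)          # random selection within the group
        selected.extend(members[:cap])  # cap the group at `cap` files
    return selected

# 2000 synthetic recordings in a single species/grid-cell group
files = [{"species": "EPFU", "cell": 1, "path": f"f{i}.wav"} for i in range(2000)]
kept = sample_per_group(files, key=lambda f: (f["species"], f["cell"]))
```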

  11. Training, validation and test datasets and model files for larger US Health...

    • ufs.figshare.com
    txt
    Updated Dec 12, 2023
    Cite
    Jan Marthinus Blomerus (2023). Training, validation and test datasets and model files for larger US Health Insurance dataset [Dataset]. http://doi.org/10.38140/ufs.24598881.v2
    Available download formats: txt
    Dataset provided by
    University of the Free State
    Authors
    Jan Marthinus Blomerus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Formats1.xlsx contains the descriptions of the columns of the following datasets: the training, validation and test datasets, which in combination are all the records. sens1.csv and meansdX.csv are required for testing.

  12. Dog vs Cat Images Data

    • kaggle.com
    Updated Sep 1, 2020
    Cite
    Kunal Gupta (2020). Dog vs Cat Images Data [Dataset]. https://www.kaggle.com/datasets/kunalgupta2616/dog-vs-cat-images-data/versions/1
    Available download formats: Croissant
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kunal Gupta
    License

    GNU GPL v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Brief

    The data consists of 45.5K images of dogs and cats, well distributed into 3 different sets. This is the first dataset I have uploaded to Kaggle.

    Download Api

    > kaggle datasets download -d kunalgupta2616/dog-vs-cat-images-data

    Content

    Train Data - consists of 25000 images divided into equal halves (12500 * 2) of cat and dog images in separate directories.

    Validation Data - consists of 8000 images divided into equal halves (4000 * 2) of cat and dog images in separate directories.

    Test Data - consists of 12500 images of cats and dogs.

    sampleSubmission.csv - the submission file for the test data, which can be found at the link given under Acknowledgements.

    Acknowledgements

    This is an improvement on the competition dataset that can be found at this link, in terms of added data and hierarchical distribution.
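After downloading, the per-split counts above can be verified by walking the directory tree; a minimal pathlib sketch (the split directory names are assumptions about the archive layout):

```python
from pathlib import Path

def count_images(root, exts=(".jpg", ".jpeg", ".png")):
    """Count image files under a split directory, recursing into class folders."""
    root = Path(root)
    return sum(1 for p in root.rglob("*") if p.suffix.lower() in exts)

# Expected counts per the dataset description: 25000 train, 8000 validation,
# 12500 test (directory names below are assumed, adjust to the archive layout).
# for split in ("train", "validation", "test"):
#     print(split, count_images(Path("dog-vs-cat-images-data") / split))
```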

  13. Machine learning models, and training, validation and test datasets for:...

    • data.niaid.nih.gov
    Updated Nov 12, 2021
    Cite
    Daub, Carsten O (2021). Machine learning models, and training, validation and test datasets for: "Sequence determinants of human gene regulatory elements" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5101419
    Dataset provided by
    Cramer, Patrick
    Sahu, Biswajyoti
    Hartonen, Tuomo
    Kivioja, Teemu
    Wei, Bei
    Lidschreiber, Michael
    Zhu, Fangjie
    Pihlajamaa, Päivi
    Dave, Kashyap
    Kaasinen, Eevi
    Lidschreiber, Katja
    Daub, Carsten O
    Taipale, Jussi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record contains the training, test and validation datasets used to train and evaluate the machine learning models in manuscript:

    Sahu, Biswajyoti, et al. "Sequence determinants of human gene regulatory elements." (2021).

    This record also contains the final hyperparameter-optimized models for each training dataset/task combination described in the manuscript. The README files provided with the record describe the datasets and models in more detail. The datasets deposited here are derived from the original raw data (GEO accession: GSE180158) as described in the Methods of the manuscript.

  14. Data from: Self-Supervised Representation Learning on Neural Network Weights...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 13, 2021
    Cite
    Kostadinov, Dimche (2021). Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction - Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5645137
    Dataset provided by
    Schürholt, Kontantin
    Kostadinov, Dimche
    Borth, Damian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets to NeurIPS 2021 accepted paper "Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction".

    Datasets are pytorch files containing a dictionary with training, validation and test sets. Train, validation and test sets are custom dataset classes which inherit from the standard torch dataset class. Corresponding code can be found at https://github.com/HSG-AIML/NeurIPS_2021-Weight_Space_Learning.

    Datasets 41, 42, 43 and 44 are our dataset format wrapped around the zoos from Unterthiner et al., 2020 (https://github.com/google-research/google-research/tree/master/dnn_predict_accuracy).

    Abstract: Self-Supervised Learning (SSL) has been shown to learn useful and information-preserving representations. Neural Networks (NNs) are widely applied, yet their weight space is still not fully understood. Therefore, we propose to use SSL to learn neural representations of the weights of populations of NNs. To that end, we introduce domain specific data augmentations and an adapted attention architecture. Our empirical evaluation demonstrates that self-supervised representation learning in this domain is able to recover diverse NN model characteristics. Further, we show that the proposed learned representations outperform prior work for predicting hyper-parameters, test accuracy, and generalization gap as well as transfer to out-of-distribution settings.

  15. Dataset

    • figshare.com
    application/x-gzip
    Updated May 31, 2023
    Cite
    Moynuddin Ahmed Shibly (2023). Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13577873.v1
    Available download formats: application/x-gzip
    Dataset provided by
    figshare
    Authors
    Moynuddin Ahmed Shibly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an open source, publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/ . We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset. We applied 10-fold cross validation on the train set and created ten folds. We also created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as feature extractor. On the features extracted by ResNet50 we applied PCA and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above; those folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The compression was performed to speed up uploads and downloads, and mostly for the sake of convenience. If anyone has any question about how the datasets are organized, please feel free to ask me at shiblygnr@gmail.com. I will get back to you as soon as possible.
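The PCA step on the extracted ResNet50 features can be sketched with a plain numpy SVD; a minimal sketch (the feature dimensionality 2048 matches ResNet50's pooled output, and the random matrix stands in for the real extracted features):

```python
import numpy as np

def pca_reduce(features, n_components=80):
    """Project feature rows onto their top principal components."""
    X = features - features.mean(axis=0)           # center each column
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T                 # [n_samples, n_components]

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 2048))               # stand-in for ResNet50 features
reduced = pca_reduce(feats)                        # tabular data with 80 features
```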

  16. Ultimate_Analysis

    • data.mendeley.com
    Updated Jan 28, 2022
    Cite
    Akara Kijkarncharoensin (2022). Ultimate_Analysis [Dataset]. http://doi.org/10.17632/t8x96g88p3.2
    Authors
    Akara Kijkarncharoensin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This database studies performance inconsistency in biomass HHV models based on ultimate analysis. The research null hypothesis is consistency in the rank of a biomass HHV model. Fifteen biomass models are trained and tested on four datasets. In each dataset, the rank invariability of these 15 models indicates performance consistency.

    The database includes the datasets and source code to analyze the performance consistency of the biomass HHV models. The datasets are stored in tabular form in an Excel workbook. The source code implements the biomass HHV machine learning models through MATLAB Object-Oriented Programming (OOP). These machine learning models consist of eight regressions, four supervised learning methods, and three neural networks.

    An Excel workbook, "BiomassDataSetUltimate.xlsx," collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV data points from 20 pieces of literature. The worksheet column names indicate the elements of the ultimate analysis on a % dry basis. The HHV column refers to the higher heating value in MJ/kg. The following worksheet, "Full Residuals," backs up the residuals of model testing based on the 20-fold cross-validations. The article (Kijkarncharoensin & Innet, 2021) verifies the performance consistency through these residuals. The other worksheets present the literature datasets implemented to train and test the model performance in many pieces of literature.

    A file named "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The list of folders in this file is the class structure of the machine learning models. These classes extend the features of MATLAB's Statistics and Machine Learning Toolbox to support, e.g., k-fold cross-validation. The MATLAB script named "runStudyUltimate.m" is the article's main program to analyze the performance consistency of the biomass HHV model through the ultimate analysis. The script loads the datasets from the Excel workbook and automatically fits the biomass models through the OOP classes.

    The first section of the MATLAB script generates the most accurate model by optimizing the model's hyperparameters. The first run takes a few hours to train the machine learning models via trial and error. The trained models can be saved in a MATLAB .mat file and loaded back into the MATLAB workspace. The remaining script, separated by a script section break, performs the residual analysis to inspect the performance consistency. Furthermore, a 3D scatter plot of the biomass data and box plots of the prediction residuals are exhibited. Finally, the interpretations of these results are examined in the author's article.

    Reference : Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.

  17. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Available download formats: csv, json, bin, png
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need a Python environment such as VS Code or Jupyter Notebook, with tools such as:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
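Given the naming convention above, the three splits can be loaded with pandas. A minimal sketch, assuming the folder layout and the hypothetical column names shown (the demo writes temporary stand-in files so it is self-contained):

```python
# Sketch of loading the split files by their naming convention
# (train_data.csv, validation_data.csv, test_data.csv).
# Column names ("feature_1", "label") are hypothetical.
import tempfile
from pathlib import Path

import pandas as pd

def load_split(folder: Path, name: str) -> pd.DataFrame:
    """Load one split ('train', 'validation', or 'test') as a DataFrame."""
    return pd.read_csv(folder / f"{name}_data.csv")

# Demo with temporary files standing in for the dataset folders.
folder = Path(tempfile.mkdtemp())
for name, rows in [("train", 3), ("validation", 1), ("test", 1)]:
    pd.DataFrame({"feature_1": range(rows), "label": [0] * rows}).to_csv(
        folder / f"{name}_data.csv", index=False
    )

train_df = load_split(folder, "train")
print(train_df.shape)  # (3, 2)
```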

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  18. Complete Blood Count (CBC)

    • kaggle.com
    Updated Aug 1, 2024
    Muhammad Noukhez (2024). Complete Blood Count (CBC) [Dataset]. https://www.kaggle.com/datasets/mdnoukhej/complete-blood-count-cbc
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Muhammad Noukhez
    Description

    Dataset Description:

    This dataset is a comprehensive collection of Complete Blood Count (CBC) images, meticulously organized to support machine learning and deep learning projects, especially in the domain of medical image analysis. The dataset's structure ensures a balanced and systematic approach to model development, validation, and testing.

    Dataset Breakdown:

    • Training Images: 300
    • Validation Images: 60
    • Test Images: 60
    • Annotations: Detailed annotations included for all images

    Overview:

    The Complete Blood Count (CBC) is a crucial test used in medical diagnostics to evaluate the overall health and detect a variety of disorders, including anemia, infection, and many other diseases. This dataset provides a rich source of CBC images that can be used to train machine learning models to automate the analysis and interpretation of these tests.

    Data Composition:

    1. Training Set:

      • Contains 300 images
      • These images are used to train machine learning models, enabling them to learn and recognize patterns associated with various blood cell types and conditions.
    2. Validation Set:

      • Contains 60 images
      • Used to tune the models and optimize their performance, ensuring that the models generalize well to new, unseen data.
    3. Test Set:

      • Contains 60 images
      • Used to evaluate the final model performance, providing an unbiased assessment of how well the model performs on new data.
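A 300/60/60 split like the one above is commonly organized as one folder per split. A minimal sketch that builds and verifies such a layout with placeholder files; the folder and file names are assumptions, not the dataset's own:

```python
# Sketch of a per-split folder layout for the 300/60/60 image split,
# verified by counting files. Names are hypothetical.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
counts = {"train": 300, "validation": 60, "test": 60}

# Create placeholder image files in per-split folders.
for split, n in counts.items():
    d = root / split
    d.mkdir()
    for i in range(n):
        (d / f"image_{i:04d}.jpg").touch()

# Verify each split holds the expected number of images.
found = {split: len(list((root / split).glob("*.jpg"))) for split in counts}
print(found)  # {'train': 300, 'validation': 60, 'test': 60}
```

A sanity check like this catches images that were misplaced between splits before training starts.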

    Annotations:

    Each image in the dataset is accompanied by detailed annotations, which include information about the different types of blood cells present and any relevant diagnostic features. These annotations are essential for supervised learning, allowing models to learn from labeled examples and improve their accuracy and reliability.

    Key Features:

    • High-Quality Images: All images are of high quality, making them suitable for a variety of machine learning tasks, including image classification, object detection, and segmentation.
    • Comprehensive Annotations: Each image is thoroughly annotated, providing valuable information that can be used to train and validate models.
    • Balanced Dataset: The dataset is carefully balanced with distinct sets for training, validation, and testing, ensuring that models trained on this data will be robust and generalizable.

    Applications:

    This dataset is ideal for researchers and practitioners in the fields of machine learning, deep learning, and medical image analysis. Potential applications include:

    • Automated CBC Analysis: Developing algorithms to automatically analyze CBC images and provide diagnostic insights.
    • Blood Cell Classification: Training models to accurately classify different types of blood cells, which is critical for diagnosing various blood disorders.
    • Educational Purposes: Using the dataset as a teaching tool to help students and new practitioners understand the complexities of CBC image analysis.

    Usage Notes:

    • Data Augmentation: Users may consider applying data augmentation techniques to increase the diversity of the training data and improve model robustness.
    • Preprocessing: Proper preprocessing, such as normalization and noise reduction, can enhance model performance.
    • Evaluation Metrics: It is recommended to use standard evaluation metrics such as accuracy, precision, recall, and F1-score to assess model performance.
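The recommended metrics are all available in scikit-learn. A short sketch with made-up predictions to illustrate the calls:

```python
# Sketch of the recommended evaluation metrics using scikit-learn.
# The labels and predictions below are made up for illustration.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.75
print(f"precision: {precision_score(y_true, y_pred):.2f}")  # 0.75
print(f"recall:    {recall_score(y_true, y_pred):.2f}")     # 0.75
print(f"f1:        {f1_score(y_true, y_pred):.2f}")         # 0.75
```

For imbalanced classes (common in medical data), precision, recall, and F1 are more informative than accuracy alone.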

    Conclusion:

    This CBC dataset is a valuable resource for anyone looking to advance the field of automated medical diagnostics through machine learning and deep learning. With its high-quality images, detailed annotations, and balanced composition, it provides the necessary foundation for developing accurate and reliable models for CBC analysis.

  19. Sample, test, and validation data for findmycells

    • biorxiv.org
    zip
    Updated Feb 20, 2023
    Dennis Segebarth (2023). Sample, test, and validation data for findmycells [Dataset]. http://doi.org/10.5281/zenodo.7655292
    Explore at:
    zipAvailable download formats
    Dataset updated
    Feb 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dennis Segebarth
    License

    Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    findmycells is an open-source Python package, developed to foster the use of deep-learning-based Python tools for bioimage analysis, specifically for researchers with limited Python coding experience. It is developed and maintained in the following GitHub repository: https://github.com/Defense-Circuits-Lab/findmycells

    Disclaimer: All data (including the model ensemble) uploaded here serve solely as a test dataset for findmycells and are not intended for any other purposes.

    For instance, the group, subgroup, or subject IDs don't refer to the actual experimental conditions. Likewise, the included ROI files were created only to allow testing of findmycells and may not live up to scientific standards. Furthermore, the image data represents a subset of a dataset that is already published here:

    Segebarth, Dennis et al. (2020), Data from: On the objectivity, reliability, and validity of deep learning enabled bioimage analyses, Dryad, Dataset, https://doi.org/10.5061/dryad.4b8gtht9d

    The model ensemble (cfos_ensemble.zip) was trained using deepflash2 (v 0.1.7)

    Griebel, M., Segebarth, D., Stein, N., Schukraft, N., Tovote, P., Blum, R., & Flath, C. M. (2021). Deep-learning in the bioimaging wild: Handling ambiguous data with deepflash2. arXiv preprint arXiv:2111.06693.

    The training was performed on a subset of the "lab-wue1" training dataset, using only the 27 images with IDs 0000 - 0099 (cfos_training_images.zip) and the corresponding est. GT masks (cfos_training_masks.zip). The images used in "cfos_fmc_test_project.zip" for the actual testing of findmycells are the images with the IDs 0100, 0106, 0149, and 0152 of the aforementioned "lab-wue1" training dataset. They were randomly distributed to the made-up subject folders and renamed to "dentate_gyrus_01" or "dentate_gyrus_02".

  20. Training and validation dataset 2 of milling processes for time series prediction

    • service.tib.eu
    Updated Nov 28, 2024
    (2024). Training and validation dataset 2 of milling processes for time series prediction - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-1738
    Explore at:
    Dataset updated
    Nov 28, 2024
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Abstract: The aim of the dataset is to train and validate models for predicting time series for milling processes. For this purpose, processes were recorded at a sampling rate of 500 Hz by a Siemens Industrial Edge on a DMC 60H. The machine was upgraded in terms of control technology. Processes for model training and validation were recorded, suitable for both steel and aluminum machining. Several recordings were made with and without the workpiece (aircut) in order to cover as many cases as possible. This is the same series of experiments as in "Training and validation dataset of milling processes for time series prediction" (DOI 10.5445/IR/1000157789) and allows an investigation of the transferability of models between different machines.

    Technical remarks:

    Documents:
    • Design of Experiments: information on the paths and the technological values of the experiments
    • Recording information: information about the recordings, with comments
    • Data: all recorded datasets. The first level contains the folders for training and validation, both with and without the workpiece. The next level contains the individual test executions. Each recording is stored as a JSON file, consisting of a header with all relevant information (such as the signal sources), followed by the entries of the recorded time series.
    • NC code: NC programs executed on the machine

    Experimental data:
    • Machine: retrofitted DMC 60H
    • Material: S235JR, 2007 T4
    • Tools:
      • VHM-Fräser HPC, TiSi, ⌀ f8 DC: 5 mm
      • VHM-Fräser HPC, TiSi, ⌀ f8 DC: 10 mm
      • VHM-Fräser HPC, TiSi, ⌀ f8 DC: 20 mm
      • Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 5 mm
      • Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 10 mm
      • Schaftfräser HSS-Co8, TiAlN, ⌀ k10 DC: 5 mm
    • Workpiece blank dimensions: 150 × 75 × 50 mm

    License: This work is licensed under a Creative Commons Attribution 4.0 International License. Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
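The description says each recording is a JSON file with a header (including the signal sources) followed by the entries of the recorded time series. A hedged sketch of reading such a file; the exact field names below are assumptions, not taken from the dataset:

```python
# Hedged sketch of parsing a recording in the layout described above:
# a header with metadata (e.g. signal sources), then time-series entries.
# Field names ("header", "entries", "signals", ...) are assumptions.
import json

recording = {
    "header": {
        "machine": "DMC 60H",
        "sampling_rate_hz": 500,
        "signals": ["spindle_speed", "feed_rate"],
    },
    "entries": [
        {"t": 0.000, "spindle_speed": 3000.0, "feed_rate": 0.12},
        {"t": 0.002, "spindle_speed": 3001.5, "feed_rate": 0.12},
    ],
}

raw = json.dumps(recording)  # stand-in for reading the JSON file from disk
data = json.loads(raw)

# Derive the sample spacing from the header and extract one signal.
dt = 1.0 / data["header"]["sampling_rate_hz"]
series = [e["spindle_speed"] for e in data["entries"]]
print(data["header"]["signals"], dt, series)
```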
