100+ datasets found
  1. alpaca-train-validation-test-split

    • huggingface.co
    Updated Aug 12, 2023
    Cite
    Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 12, 2023
    Authors
    Doula Isham Rashik Hasan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

    I have performed a train/test/validation split on the original dataset. A repository to reproduce this will be shared here soon. I am including the original dataset card below.

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make them follow instructions better. See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
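
    Since the splits are hosted on the Hugging Face Hub, a minimal loading sketch with the datasets library might look like this (the split names "train", "validation" and "test" are assumed from the repository name, and the fields are assumed to follow the usual Alpaca schema):

      from datasets import load_dataset

      # Load every split of the repository in one call; the split names
      # "train", "validation" and "test" are assumed, not documented here.
      ds = load_dataset("disham993/alpaca-train-validation-test-split")

      print(ds)              # overview of the available splits and their sizes
      print(ds["train"][0])  # one example (instruction/input/output in Alpaca-style data)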

  2. Challenge Round 0 (Dry Run) Test Dataset

    • catalog.data.gov
    • data.nist.gov
    ‱ +2 more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Challenge Round 0 (Dry Run) Test Dataset [Dataset]. https://catalog.data.gov/dataset/challenge-round-0-dry-run-test-dataset-ff885
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This dataset was an initial test-harness infrastructure test for the TrojAI program. It should not be used for research; please use the more refined datasets generated for the other rounds. The data being generated and disseminated is the training, validation, and test data used to construct trojan detection software solutions. This data, generated at NIST, consists of human-level AI models trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger that induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 200 trained, human-level, image-classification AI models using the following architectures: Inception-v3, DenseNet-121, and ResNet50. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger that causes misclassification of the images when the trigger is present.

  3. Training, validation and test datasets and model files for larger US Health...

    • ufs.figshare.com
    txt
    Updated Dec 12, 2023
    Cite
    Jan Marthinus Blomerus (2023). Training, validation and test datasets and model files for larger US Health Insurance dataset [Dataset]. http://doi.org/10.38140/ufs.24598881.v2
    Explore at:
    txt (available download format)
    Dataset updated
    Dec 12, 2023
    Dataset provided by
    University of the Free State
    Authors
    Jan Marthinus Blomerus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Formats1.xlsx contains the descriptions of the columns of the following datasets. The training, validation and test datasets in combination make up all the records. sens1.csv and meansdX.csv are required for testing.

  4. Train, validation, test data sets and confusion matrices underlying...

    • data.4tu.nl
    zip
    Updated Sep 7, 2023
    Cite
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen (2023). Train, validation, test data sets and confusion matrices underlying publication: "Automated cell counting for Trypan blue stained cell cultures using machine learning" [Dataset]. http://doi.org/10.4121/21695819.v1
    Explore at:
    zip (available download format)
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Annotated test and train data sets. Both images and annotations are provided separately.

    Validation data set for Hi5, Sf9 and HEK cells.

    Confusion matrices for the determination of performance parameters.

  5. Training and Validation Datasets for Neural Network to Fill in Missing Data...

    • catalog.data.gov
    Updated Jul 9, 2025
    Cite
    National Institute of Standards and Technology (2025). Training and Validation Datasets for Neural Network to Fill in Missing Data in EBSD Maps [Dataset]. https://catalog.data.gov/dataset/training-and-validation-datasets-for-neural-network-to-fill-in-missing-data-in-ebsd-maps
    Explore at:
    Dataset updated
    Jul 9, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This dataset consists of the synthetic electron backscatter diffraction (EBSD) maps generated for the paper "Hybrid Algorithm for Filling in Missing Data in Electron Backscatter Diffraction Maps" by Emmanuel Atindama, Conor Miller-Lynch, Huston Wilhite, Cody Mattice, GĂŒnay Doğan, and Prashant Athavale. The EBSD maps were used to train, test, and validate a neural network algorithm that fills in missing data points in a given EBSD map. The dataset includes 8000 maps for training, 1000 maps for testing, and 2000 maps for validation. It also includes a noise-added version of each clean map.

  6. Dataset, splits, models, and scripts for the QM descriptors prediction

    • zenodo.org
    • explore.openaire.eu
    application/gzip
    Updated Apr 4, 2024
    Cite
    Shih-Cheng Li; Shih-Cheng Li; Haoyang Wu; Haoyang Wu; Angiras Menon; Angiras Menon; Kevin A. Spiekermann; Kevin A. Spiekermann; Yi-Pei Li; Yi-Pei Li; William H. Green; William H. Green (2024). Dataset, splits, models, and scripts for the QM descriptors prediction [Dataset]. http://doi.org/10.5281/zenodo.10668491
    Explore at:
    application/gzip (available download format)
    Dataset updated
    Apr 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shih-Cheng Li; Shih-Cheng Li; Haoyang Wu; Haoyang Wu; Angiras Menon; Angiras Menon; Kevin A. Spiekermann; Kevin A. Spiekermann; Yi-Pei Li; Yi-Pei Li; William H. Green; William H. Green
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.

    Below are descriptions of the available scripts:

    1. atom_bond_descriptors.sh: Trains models for atom/bond targets.
    2. atom_bond_descriptors_predict.sh: Predicts atom/bond targets from a pre-trained model.
    3. dipole_quadrupole_moments.sh: Trains models for dipole and quadrupole moments.
    4. dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from a pre-trained model.
    5. energy_gaps_IP_EA.sh: Trains models for energy gaps, ionization potential (IP), and electron affinity (EA).
    6. energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from a pre-trained model.
    7. get_constraints.py: Generates the constraints file for the testing dataset. This generated file must be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.
    8. csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with the Chemprop software.

    Below is the procedure for running the ml-QM-GNN on your own dataset:

    1. Use get_constraints.py to generate a constraint file required for predicting atom/bond QM descriptors with the trained ML models.
    2. Execute atom_bond_descriptors_predict.sh to predict atom and bond properties. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.
    3. Use csv2pkl.py to convert the predicted atom/bond descriptor .csv file into separate atom and bond feature files (saved as .pkl files).
    4. Run Chemprop to train your models using the additional predicted features supported here.
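
    As a rough sketch, the four steps above could be driven from Python with subprocess (this assumes the scripts from scripts.tar.gz are unpacked into the working directory; check each script for its actual arguments):

      import subprocess

      # 1. Generate the constraints file required by the trained models.
      subprocess.run(["python", "get_constraints.py"], check=True)

      # 2. Predict atom/bond and molecular QM descriptors.
      for script in [
          "atom_bond_descriptors_predict.sh",
          "dipole_quadrupole_moments_predict.sh",
          "energy_gaps_IP_EA_predict.sh",
      ]:
          subprocess.run(["bash", script], check=True)

      # 3. Convert the predicted descriptors to .pkl atom/bond feature files.
      subprocess.run(["python", "csv2pkl.py"], check=True)

      # 4. Train Chemprop with the additional predicted features (see the
      #    Chemprop documentation for the training command).
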
  7. CNN models and training, validation and test datasets for "PlotMI:...

    • zenodo.org
    application/gzip
    Updated Sep 15, 2021
    Cite
    Tuomo Hartonen; Tuomo Hartonen; Teemu Kivioja; Jussi Taipale; Teemu Kivioja; Jussi Taipale (2021). CNN models and training, validation and test datasets for "PlotMI: interpretation of pairwise interactions and positional preferences learned by a deep learning model from sequence data" [Dataset]. http://doi.org/10.5281/zenodo.5508698
    Explore at:
    application/gzip (available download format)
    Dataset updated
    Sep 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tuomo Hartonen; Tuomo Hartonen; Teemu Kivioja; Jussi Taipale; Teemu Kivioja; Jussi Taipale
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Convolutional neural network (CNN) models and their respective training, validation and test datasets used in manuscript:

    Tuomo Hartonen, Teemu Kivioja and Jussi Taipale, "PlotMI: interpretation of pairwise interactions and positional preferences learned by a deep learning model from sequence data"

  8. Train-validation-test database for LPM

    • figshare.com
    zip
    Updated Jul 26, 2024
    Cite
    Tianfan Jin (2024). Train-validation-test database for LPM [Dataset]. http://doi.org/10.6084/m9.figshare.26380666.v2
    Explore at:
    zip (available download format)
    Dataset updated
    Jul 26, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tianfan Jin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the database for full model training and evaluation for LPM.

  9. Training and validation dataset of milling processes for time series...

    • service.tib.eu
    Updated Nov 28, 2024
    + more versions
    Cite
    (2024). Training and validation dataset of milling processes for time series prediction - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-1462
    Explore at:
    Dataset updated
    Nov 28, 2024
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Abstract: The aim of the dataset is the training and validation of models for the prediction of time series for milling processes. For this purpose, processes were recorded at a sampling rate of 500 Hz on a DMG CMX 600 V by a Siemens Industrial Edge. One process was recorded for model training and one for validation, each used for machining both steel and aluminum. Several recordings were made with and without the workpiece (aircut) in order to cover as many cases as possible.

    Documents:
    - Design of Experiments: information on the paths and the technological values of the experiments
    - Recording information: information about the recordings, with comments
    - Data: all recorded datasets. The first level contains the folders for training and validation, each with and without the workpiece. The next level contains the individual test executions. Each recording is stored as a JSON file consisting of a header with all relevant information (such as the signal sources), followed by the entries of the recorded time series.
    - NC-Code: NC programs executed on the machine
    - Workpiece: pictures of the raw parts and the machined workpieces; the pictures show the unfinished part on the left, the training part in the middle, and a part with two validation runs on the right

    Experimental data:
    - Machine: DMG CMX 600 V
    - Material: S235JR, 2007 T4
    - Tools:
      - VHM milling cutter HPC, TiSi, ⌀ f8, DC: 5 mm
      - VHM milling cutter HPC, TiSi, ⌀ f8, DC: 10 mm
      - VHM milling cutter HPC, TiSi, ⌀ f8, DC: 20 mm
      - HSS-Co8 end mill, TiAlN, ⌀ k10, DC: 5 mm
      - HSS-Co8 end mill, TiAlN, ⌀ k10, DC: 10 mm
      - HSS-Co8 end mill, TiAlN, ⌀ k10, DC: 5 mm
    - Workpiece blank dimensions: 150 x 75 x 50 mm

    License: This work is licensed under a Creative Commons Attribution 4.0 International License. Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
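
    Since each recording is stored as a JSON file with a header followed by the time-series entries, a first-pass inspection in Python might look like this (the file path and key layout are assumptions; the exact schema is described in the recording information document):

      import json

      # Illustrative path: recordings sit below the training/validation folders.
      with open("recording.json") as f:
          rec = json.load(f)

      # Inspect the structure before parsing: a header (signal sources etc.)
      # is followed by the entries of the recorded time series.
      if isinstance(rec, dict):
          print(list(rec.keys()))
      else:
          print(type(rec), len(rec))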

  10. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    csv, json, bin, png (available download formats)
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter, together with tools such as:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
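
      A minimal sketch of loading the three splits with pandas, assuming the folder and file names described above:

        import pandas as pd

        # Folder names are assumed from the description; adjust to the archive layout.
        train = pd.read_csv("Training Data/train_data.csv")
        val = pd.read_csv("Validation Data/validation_data.csv")
        test = pd.read_csv("Test Data/test_data.csv")

        # Rows are individual data points; columns are features (plus the label).
        print(train.shape, val.shape, test.shape)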

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  11. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

  12. Machine learning models, and training, validation and test datasets for:...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 12, 2021
    Cite
    Lidschreiber, Michael (2021). Machine learning models, and training, validation and test datasets for: "Sequence determinants of human gene regulatory elements" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5101419
    Explore at:
    Dataset updated
    Nov 12, 2021
    Dataset provided by
    Lidschreiber, Michael
    Zhu, Fangjie
    Taipale, Jussi
    Kivioja, Teemu
    Wei, Bei
    Cramer, Patrick
    Sahu, Biswajyoti
    Hartonen, Tuomo
    Daub, Carsten O
    Lidschreiber, Katja
    Kaasinen, Eevi
    Pihlajamaa, PĂ€ivi
    Dave, Kashyap
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record contains the training, test and validation datasets used to train and evaluate the machine learning models in the manuscript:

    Sahu, Biswajyoti, et al. "Sequence determinants of human gene regulatory elements." (2021).

    This record also contains the final hyperparameter-optimized models for each training dataset/task combination described in the manuscript. The README files provided with the record describe the datasets and models in more detail. The datasets deposited here are derived from the original raw data (GEO accession: GSE180158) as described in the Methods of the manuscript.

  13. 100 Sports Image Classification

    • kaggle.com
    Updated May 3, 2023
    Cite
    Gerry (2023). 100 Sports Image Classification [Dataset]. https://www.kaggle.com/datasets/gpiosenka/sports-classification/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gerry
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Please upvote if you find this dataset of use - thank you! This version is an update of the earlier version: I ran a dataset quality evaluation program on the previous version, which found a considerable number of duplicate and near-duplicate images. Duplicate images can lead to falsely higher values of validation and test set accuracy, so I have eliminated these images in this version of the dataset. Images were gathered from internet searches. The images were scanned with a duplicate image detector program I wrote, and any duplicate images were removed to prevent bleed-through of images between the train, test and valid data sets. All images were then resized to 224 x 224 x 3 and converted to jpg format. A csv file is included that, for each image file, contains the relative path to the image file, the image file's class label and the dataset (train, test or valid) that the image file resides in. This is a clean dataset: if you build a good model you should achieve at least 95% accuracy on the test set, and with a very good model (for example using transfer learning) you should be able to achieve 98%+ test set accuracy.

    Content

    Collection of sports images covering 100 different sports. Images are 224 x 224 x 3 in jpg format. Data is separated into train, test and valid directories. Additionally, a csv file is included for those who wish to use it to create their own train, test and validation datasets.

    Inspiration

    I wanted to build a high-quality, clean data set that was easy to use and had no bad images or duplication between the train, test and validation data sets, providing a good data set to test your models on. It is designed for straightforward application of Keras preprocessing functions like ImageDataGenerator.flow_from_directory or, if you use the csv file, ImageDataGenerator.flow_from_dataframe (a minimal loading sketch follows). This dataset was carefully created so that the region of interest (ROI), in this case the sport, occupies approximately 50% of the pixels in the image. As a consequence, even models of moderate complexity should achieve training and validation accuracies in the high 90s.
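
    A minimal sketch of the directory-based loading path (the directory names train/, valid/ and test/ are assumed from the description):

      from tensorflow.keras.preprocessing.image import ImageDataGenerator

      datagen = ImageDataGenerator(rescale=1.0 / 255)

      # Directory names are assumed; adjust to where the dataset is unpacked.
      train_gen = datagen.flow_from_directory(
          "train", target_size=(224, 224), batch_size=32, class_mode="categorical")
      valid_gen = datagen.flow_from_directory(
          "valid", target_size=(224, 224), batch_size=32, class_mode="categorical")
      test_gen = datagen.flow_from_directory(
          "test", target_size=(224, 224), batch_size=32,
          class_mode="categorical", shuffle=False)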

  14. Data from: Spam Mails Dataset

    • dbrepo.datalab.tuwien.ac.at
    • test.dbrepo.tuwien.ac.at
    Updated Apr 19, 2025
    + more versions
    Cite
    Bernal, Nicolas (2025). Spam Mails Dataset [Dataset]. http://doi.org/10.82556/bexb-5283
    Explore at:
    Dataset updated
    Apr 19, 2025
    Authors
    Bernal, Nicolas
    Time period covered
    2025
    Description

    Preprocessed data derived from the "spam-mails" dataset, containing email messages labeled as spam or ham. Each record includes a unique identifier from the original dataset and an experiment_id indicating its assignment to a specific data split (training, validation, or test) used in this experiment. The email content has been lemmatized and cleaned to remove noise such as punctuation, special characters, and stopwords, ensuring consistent input for embedding and model training. Original data source: https://www.kaggle.com/datasets/venky73/spam-mails-dataset
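
    A sketch of the kind of cleaning and lemmatization described, using NLTK (the author's exact pipeline is not specified, so treat this as an approximation):

      import re
      import nltk
      from nltk.corpus import stopwords
      from nltk.stem import WordNetLemmatizer

      nltk.download("stopwords")
      nltk.download("wordnet")

      lemmatizer = WordNetLemmatizer()
      stop_words = set(stopwords.words("english"))

      def preprocess(text):
          # Drop punctuation/special characters, lowercase, remove stopwords,
          # and lemmatize the remaining tokens.
          text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
          tokens = [lemmatizer.lemmatize(t) for t in text.split()
                    if t not in stop_words]
          return " ".join(tokens)

      print(preprocess("Congratulations!! You have won a FREE prize, claim NOW."))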

  15. Training/Validation/Test set split

    • figshare.com
    zip
    Updated Mar 30, 2024
    Cite
    Tianfan Jin (2024). Training/Validation/Test set split [Dataset]. http://doi.org/10.6084/m9.figshare.25511056.v1
    Explore at:
    zip (available download format)
    Dataset updated
    Mar 30, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tianfan Jin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Includes the split of real and null reactions into training, validation and test sets.

  16. pH reactor dataset

    • data.niaid.nih.gov
    Updated Oct 5, 2021
    Cite
    Fabio Bonassi (2021). pH reactor dataset [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_3956066
    Explore at:
    Dataset updated
    Oct 5, 2021
    Dataset provided by
    Enrico Terzi
    Riccardo Scattolini
    Fabio Bonassi
    Marcello Farina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of noisy input and output measurements collected from a simulator of a pH reactor. The system is single-input, single-output.

    The dataset consists of the following files:

    'PH_U_Train.csv' and 'PH_Y_Train.csv': training dataset. The training dataset consists of 15 experiments, each one with 2000 (u, y) samples. The i-th column corresponds to the input (or output) variable of the i-th experiment.

    'PH_U_Val.csv' and 'PH_Y_Val.csv': validation dataset. The validation dataset consists of 4 experiments, each one with 2000 (u, y) samples. The i-th column corresponds to the input (or output) variable of the i-th experiment.

    'PH_U_Test.csv' and 'PH_Y_Test.csv': test dataset. The test dataset consists of 1 experiment of 2000 (u, y) samples.

    'PH_U.csv' and 'PH_Y.csv': concatenation of the training, validation, and test datasets.
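
    A minimal sketch of reading one experiment from the training files with pandas, following the column-per-experiment layout described above (pass header=None to read_csv if the files turn out to have no header row):

      import pandas as pd

      # Each column is one experiment; the training files hold 15 columns
      # of 2000 samples each.
      u_train = pd.read_csv("PH_U_Train.csv")
      y_train = pd.read_csv("PH_Y_Train.csv")

      # Input and output sequences of the first experiment.
      u0, y0 = u_train.iloc[:, 0], y_train.iloc[:, 0]
      print(len(u0), len(y0))  # expected: 2000 2000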

    Moreover, the parameters of an Incrementally Input-to-State Stable LSTM neural network trained to learn the dynamical system are reported in the file 'LSTM_parameters.mat'.

    If you use this data please cite the related paper

    Terzi, E., Bonassi, F., Farina, M., & Scattolini, R. (2021). Learning model predictive control with long short-term memory networks. International Journal of Robust and Nonlinear Control.

  17. Training and validation dataset 2 of milling processes for time series...

    • service.tib.eu
    Updated Nov 28, 2024
    Cite
    (2024). Training and validation dataset 2 of milling processes for time series prediction - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-35097-1738
    Explore at:
    Dataset updated
    Nov 28, 2024
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Abstract: The aim of the dataset is to train and validate models for predicting time series for milling processes. For this purpose, processes were recorded at a sampling rate of 500 Hz by a Siemens Industrial Edge on a DMC 60H whose control technology had been retrofitted. Processes for model training and validation were recorded, each used for machining both steel and aluminum. Several recordings were made with and without the workpiece (aircut) in order to cover as many cases as possible. This is the same series of experiments as in "Training and validation dataset of milling processes for time series prediction" (DOI 10.5445/IR/1000157789) and is intended to enable an investigation of the transferability of models between different machines.

    Documents:
    - Design of Experiments: information on the paths and the technological values of the experiments
    - Recording information: information about the recordings, with comments
    - Data: all recorded datasets. The first level contains the folders for training and validation, each with and without the workpiece. The next level contains the individual test executions. Each recording is stored as a JSON file consisting of a header with all relevant information (such as the signal sources), followed by the entries of the recorded time series.
    - NC-Code: NC programs executed on the machine

    Experimental data:
    - Machine: retrofitted DMC 60H
    - Material: S235JR, 2007 T4
    - Tools:
      - VHM milling cutter HPC, TiSi, ⌀ f8, DC: 5 mm
      - VHM milling cutter HPC, TiSi, ⌀ f8, DC: 10 mm
      - VHM milling cutter HPC, TiSi, ⌀ f8, DC: 20 mm
      - HSS-Co8 end mill, TiAlN, ⌀ k10, DC: 5 mm
      - HSS-Co8 end mill, TiAlN, ⌀ k10, DC: 10 mm
      - HSS-Co8 end mill, TiAlN, ⌀ k10, DC: 5 mm
    - Workpiece blank dimensions: 150 x 75 x 50 mm

    License: This work is licensed under a Creative Commons Attribution 4.0 International License. Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).

  18. Train Test and Validation Split

    • kaggle.com
    Updated Apr 18, 2025
    Cite
    IMT2022053 (2025). Train Test and Validation Split [Dataset]. https://www.kaggle.com/datasets/pranavakulkarni/train-test-and-validation-split/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 18, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    IMT2022053
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by IMT2022053

    Released under Apache 2.0


  19. Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Jun 5, 2021
    Cite
    N. Ashley Henderson; K. Steven Kauwe; D. Taylor Sparks (2021). Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. http://doi.org/10.5281/zenodo.4903957
    Explore at:
    Dataset updated
    Jun 5, 2021
    Authors
    N. Ashley Henderson; K. Steven Kauwe; D. Taylor Sparks
    Description

    This benchmark data comprises 50 different datasets for materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits. For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in their study. Datasets with fewer than 100 values had train-test splits created using the Leave-One-Out cross-validation method. For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0
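
    As an illustration of the two splitting strategies described (not the authors' exact code), scikit-learn provides both:

      import numpy as np
      from sklearn.model_selection import KFold, LeaveOneOut

      X_large = np.random.rand(120, 5)  # placeholder for a dataset with >100 values

      # More than 100 values: 5-fold (or 10-fold) cross-validation splits.
      for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_large):
          pass  # fit/evaluate on X_large[train_idx] / X_large[test_idx]

      X_small = np.random.rand(50, 5)  # placeholder for a dataset with <100 values

      # Fewer than 100 values: Leave-One-Out cross-validation splits.
      for train_idx, test_idx in LeaveOneOut().split(X_small):
          pass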

  20. ARKOMA: The Dataset to Build Neural Networks-Based Inverse Kinematics for...

    • data.mendeley.com
    • ieee-dataport.org
    Updated Aug 23, 2023
    + more versions
    Cite
    Arif Nugroho (2023). ARKOMA: The Dataset to Build Neural Networks-Based Inverse Kinematics for NAO Robot Arms [Dataset]. http://doi.org/10.17632/brg4dz8nbb.1
    Explore at:
    Dataset updated
    Aug 23, 2023
    Authors
    Arif Nugroho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset published in this data repository can be used to build neural-network-based inverse kinematics for NAO robot arms. The dataset is named ARKOMA, an acronym for ARif eKO MAuridhi, the creators of this dataset. It contains input-output data pairs: the input data is the end-effector position and orientation in three-dimensional Cartesian space, and the output data is a set of joint angular positions in radians. For further applications, the dataset was split into a training dataset, a validation dataset, and a testing dataset. The training dataset is used to train neural networks; the validation dataset is used to validate the networks' performance during the training process; and the testing dataset is employed after training to test the performance of the trained networks. From a set of 10000 data points, 60% were allocated to the training dataset, 20% to the validation dataset, and the remaining 20% to the testing dataset. Note that this dataset is compatible with NAO H25 v3.3 or later.
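
    The 60/20/20 allocation described above corresponds to a two-stage split like the following sketch (placeholder arrays stand in for the actual ARKOMA files, whose layout is not reproduced here):

      import numpy as np
      from sklearn.model_selection import train_test_split

      # Placeholders: X = end-effector position/orientation, y = joint angles (radians).
      X, y = np.random.rand(10000, 6), np.random.rand(10000, 5)

      # First split off 60% for training, then halve the remainder into
      # 20% validation and 20% testing.
      X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
      X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

      print(len(X_train), len(X_val), len(X_test))  # 6000 2000 2000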
