Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
I have performed a train, test, and validation split on the original dataset. A repository to reproduce this split will be shared here soon. The original dataset card is included below.
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make them follow instructions better. See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
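As a quick orientation, the three splits can be loaded with the Hugging Face datasets library; a minimal sketch, assuming the splits are named train, validation, and test as the repository name suggests:

# Minimal sketch: load the split Alpaca dataset from the Hugging Face Hub.
# Split names ("train", "validation", "test") are assumed from the repository name.
from datasets import load_dataset

ds = load_dataset("disham993/alpaca-train-validation-test-split")
print(ds)              # available splits and their sizes
print(ds["train"][0])  # one instruction/demonstration record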
This dataset was an initial test harness infrastructure test for the TrojAI program. It should not be used for research; please use the more refined datasets generated for the other rounds. The data being generated and disseminated is the training, validation, and test data used to construct trojan detection software solutions. This data, generated at NIST, consists of human-level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger that induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 200 trained, human-level image classification AI models using the following architectures: Inception-v3, DenseNet-121, and ResNet50. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger that causes misclassification of the images when the trigger is present.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Formats1.xlsx contains the descriptions of the columns of the following datasets. The training, validation, and test datasets in combination comprise all the records. sens1.csv and meansdX.csv are required for testing.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Annotated test and train data sets. Both images and annotations are provided separately.
Validation data set for Hi5, Sf9 and HEK cells.
Confusion matrices for the determination of performance parameters
This dataset consists of the synthetic electron backscatter diffraction (EBSD) maps generated for the paper titled "Hybrid Algorithm for Filling in Missing Data in Electron Backscatter Diffraction Maps" by Emmanuel Atindama, Conor Miller-Lynch, Huston Wilhite, Cody Mattice, Günay Doğan, and Prashant Athavale. The EBSD maps were used to train, test, and validate a neural network algorithm that fills in missing data points in a given EBSD map. The dataset includes 8,000 maps for training, 1,000 maps for testing, and 2,000 maps for validation. It also includes noise-added versions of the maps, i.e., one additional noisy map per clean map.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.
Below are descriptions of the available scripts:
atom_bond_descriptors.sh: Trains atom/bond targets.
atom_bond_descriptors_predict.sh: Predicts atom/bond targets from a pre-trained model.
dipole_quadrupole_moments.sh: Trains dipole and quadrupole moments.
dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from a pre-trained model.
energy_gaps_IP_EA.sh: Trains energy gaps, ionization potential (IP), and electron affinity (EA).
energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from a pre-trained model.
get_constraints.py: Generates the constraints file for the testing dataset. This generated file needs to be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.
csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with the Chemprop software.
Below is the procedure for running the ml-QM-GNN on your own dataset:
1. Run get_constraints.py to generate the constraint file required for predicting atom/bond QM descriptors with the trained ML models.
2. Run atom_bond_descriptors_predict.sh to predict atom and bond properties.
3. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.
4. Run csv2pkl.py to convert the predicted atom/bond descriptors from the .csv file into separate atom and bond feature files (saved as .pkl files).
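A minimal Python sketch of this four-step procedure is shown below; the exact arguments and input/output paths each script expects are defined in scripts.tar.gz and the repository, so the bare calls here are placeholders that only illustrate the order of execution:

# Minimal sketch of the procedure above; the actual arguments each script
# expects are defined in scripts.tar.gz, so these calls are placeholders.
import subprocess

steps = [
    ["python", "get_constraints.py"],                   # 1. constraints file for your test data
    ["bash", "atom_bond_descriptors_predict.sh"],       # 2. atom/bond QM descriptors
    ["bash", "dipole_quadrupole_moments_predict.sh"],   # 3. molecular QM descriptors ...
    ["bash", "energy_gaps_IP_EA_predict.sh"],           #    ... energy gaps, IP, EA
    ["python", "csv2pkl.py"],                           # 4. RBF-expanded .pkl feature files
]
for cmd in steps:
    subprocess.run(cmd, check=True)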
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Convolutional neural network (CNN) models and their respective training, validation, and test datasets used in the manuscript:
Tuomo Hartonen, Teemu Kivioja and Jussi Taipale, "PlotMI: interpretation of pairwise interactions and positional preferences learned by a deep learning model from sequence data"
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the database for full model training and evaluation for LPM.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Abstract: The aim of the data set is the training as well as the validation of models for the prediction of time series for milling processes. For this purpose, processes with a sampling rate of 500 Hz were recorded on a DMG CMX 600 V by a Siemens Industrial Edge. A process for model training and a process for validation were recorded, which were used for both steel and aluminum machining. Several recordings were made with and without the workpiece (aircut) in order to cover as many cases as possible.
Technical remarks:
Documents:
-Design of Experiments: information on the paths as well as the technological values of the experiments
-Recording information: information about the recordings, with comments
-Data: all recorded datasets. The first level contains the folders for training and validation, both with and without the workpiece. The next level contains the individual test executions. Each recording is stored as a JSON file, which consists of a header with all relevant information (such as the signal sources), followed by the entries of the recorded time series.
-NC-Code: NC programs executed on the machine
-Workpiece: pictures of the raw parts as well as the machined workpieces. The pictures show the unfinished part on the left, the training part in the middle, and a part with two validation runs on the right.
Experimental data:
-Machine: DMG CMX 600 V
-Material: S235JR, 2007 T4
-Tools:
-VHM-Fräser HPC, TiSi, Ø f8, DC: 5 mm
-VHM-Fräser HPC, TiSi, Ø f8, DC: 10 mm
-VHM-Fräser HPC, TiSi, Ø f8, DC: 20 mm
-Schaftfräser HSS-Co8, TiAlN, Ø k10, DC: 5 mm
-Schaftfräser HSS-Co8, TiAlN, Ø k10, DC: 10 mm
-Schaftfräser HSS-Co8, TiAlN, Ø k10, DC: 5 mm
-Workpiece blank dimensions: 150 x 75 x 50 mm
License: This work is licensed under a Creative Commons Attribution 4.0 International License. Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
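Since each recording is a JSON file with a header followed by the time-series entries, a minimal sketch for inspecting one recording looks like this (the file name is a placeholder and the key names are not assumed, only printed):

# Minimal sketch: inspect one recorded milling process stored as a JSON file.
# "example_recording.json" is a placeholder; the header/time-series key names
# are defined by the dataset itself and are only printed here, not assumed.
import json

with open("example_recording.json", "r", encoding="utf-8") as f:
    recording = json.load(f)

if isinstance(recording, dict):
    print(list(recording.keys()))   # header fields, e.g. the signal sources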
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used for training, validating, and testing the model.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
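A minimal sketch of this kind of preprocessing with pandas and scikit-learn follows; the file name, the label column, the split ratios, and the imputation/normalization choices are illustrative assumptions, not the exact pipeline used:

# Minimal sketch of cleaning, normalization and a three-way split.
# "raw_data.csv" and the "label" column are hypothetical.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("raw_data.csv")
X, y = df.drop(columns=["label"]), df["label"]

# 70/15/15 split into training / validation / test (ratios are illustrative)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit imputation and normalization on the training split only, then apply everywhere.
imputer = SimpleImputer(strategy="median").fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))
X_train, X_val, X_test = (scaler.transform(imputer.transform(s)) for s in (X_train, X_val, X_test))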
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter with Python and libraries such as pandas, numpy, scikit-learn, and matplotlib.
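For example, the three files named above can be loaded and a simple baseline evaluated on the held-out test set; a minimal sketch, assuming a numeric feature table with a label column called "target" (the actual column name is given in the files themselves):

# Minimal sketch: load the named split files and score a baseline classifier.
# The label column name ("target") is an assumption; check the csv header.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train_data.csv")
val = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")

clf = LogisticRegression(max_iter=1000)
clf.fit(train.drop(columns=["target"]), train["target"])
print("validation accuracy:", clf.score(val.drop(columns=["target"]), val["target"]))
print("test accuracy      :", clf.score(test.drop(columns=["target"]), test["target"]))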
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (i.e., those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
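Because the species label is carried by the folder name, an index of the release can be built with a few lines; a minimal sketch, where "nabat_audio" is a placeholder for the local path of the downloaded data:

# Minimal sketch: build a table of audio files and their four-letter species
# codes, taking the code from each file's parent folder name.
from pathlib import Path
import pandas as pd

root = Path("nabat_audio")   # placeholder for the local copy of this release
records = [{"species_code": wav.parent.name, "file": str(wav)} for wav in root.rglob("*.wav")]
labels = pd.DataFrame(records)
print(labels["species_code"].value_counts())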
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains the training, test, and validation datasets used to train and evaluate the machine learning models in the manuscript:
Sahu, Biswajyoti, et al. "Sequence determinants of human gene regulatory elements." (2021).
This record also contains the final hyperparameter-optimized models for each training dataset/task combination described in the manuscript. The README files provided with the record describe the datasets and models in more detail. The datasets deposited here are derived from the original raw data (GEO accession: GSE180158) as described in the Methods section of the manuscript.
CC0 1.0 Universal (CC0 1.0)https://creativecommons.org/publicdomain/zero/1.0/
Please upvote if you find this dataset of use. Thank you. This version is an update of the earlier version. I ran a data set quality evaluation program on the previous version, which found a considerable number of duplicate and near-duplicate images. Duplicate images can lead to falsely higher values of validation and test set accuracy, so I have eliminated these images in this version of the dataset. Images were gathered from internet searches. The images were scanned with a duplicate image detector program I wrote, and any duplicate images were removed to prevent bleed-through of images between the train, test, and valid data sets. All images were then resized to 224 x 224 x 3 and converted to jpg format. A csv file is included that, for each image file, contains the relative path to the image file, the image file class label, and the dataset (train, test, or valid) that the image file resides in. This is a clean dataset. If you build a good model, you should achieve at least 95% accuracy on the test set. If you build a very good model, for example using transfer learning, you should be able to achieve 98%+ test set accuracy.
Collection of sports images covering 100 different sports. Images are 224 x 224 x 3 in jpg format. Data is separated into train, test, and valid directories. Additionally, a csv file is included for those who wish to use it to create their own train, test, and validation datasets.
I wanted to build a high-quality, clean data set that was easy to use and had no bad images or duplication between the train, test, and validation data sets; it provides a good data set to test your models on. It is designed for straightforward application of Keras preprocessing functions like ImageDataGenerator.flow_from_directory or, if you use the csv file, ImageDataGenerator.flow_from_dataframe. This dataset was carefully created so that the region of interest (ROI), in this case the sport, occupies approximately 50% of the pixels in the image. As a consequence, even models of moderate complexity should achieve training and validation accuracies in the high 90s.
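A minimal sketch of the csv-based route with flow_from_dataframe follows; the csv file name and its column names are assumptions, so check the actual header before use:

# Minimal sketch of ImageDataGenerator.flow_from_dataframe using the provided
# csv file. "sports.csv" and the column names ("filepaths", "labels",
# "data set") are assumptions.
import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

df = pd.read_csv("sports.csv")
train_df = df[df["data set"] == "train"]

gen = ImageDataGenerator(rescale=1.0 / 255)
train_flow = gen.flow_from_dataframe(
    train_df,
    x_col="filepaths",        # relative path to each image
    y_col="labels",           # sport class label
    target_size=(224, 224),
    class_mode="categorical",
    batch_size=32,
)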
Preprocessed data derived from the "spam-mails" dataset, containing email messages labeled as spam or ham. Each record includes a unique identifier from the original dataset and an experiment_id indicating its assignment to a specific data split (training, validation, or test) used in this experiment. The email content has been lemmatized and cleaned to remove noise such as punctuation, special characters, and stopwords, ensuring consistent input for embedding and model training. Original data source: https://www.kaggle.com/datasets/venky73/spam-mails-dataset
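A minimal sketch of the kind of cleaning described above, using NLTK as an assumed toolchain (the original preprocessing tools are not specified):

# Minimal sketch of lemmatization, punctuation/special-character removal and
# stopword filtering with NLTK (an assumption; not the original toolchain).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_email(text: str) -> str:
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())  # drop punctuation and special characters
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in stop_words]
    return " ".join(tokens)

print(clean_email("Congratulations!! You have WON a $1000 prize, click here."))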
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Includes the split of real and null reactions for training, validation, and testing.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of noisy input and output measurements collected from a simulator of a pH reactor. The system is single-input, single-output.
The dataset consists of the following files:
'PH_U_Train.csv' and 'PH_Y_Train.csv': training dataset. The training dataset consists of 15 experiments, each one with 2000 (u, y) samples. The i-th column corresponds to the input (or output) variable of the i-th experiment.
'PH_U_Val.csv' and 'PH_Y_Val.csv': validation dataset. The validation dataset consists of 4 experiments, each one with 2000 (u, y) samples. The i-th column corresponds to the input (or output) variable of the i-th experiment.
'PH_U_Test.csv' and 'PH_Y_Test.csv': test dataset. The test dataset consists of 1 experiment of 2000 (u, y) samples.
'PH_U.csv' and 'PH_Y.csv': concatenation of the training, validation, and test datasets.
Moreover, the parameters of an Incrementally Input-to-State Stable LSTM neural network trained to learn the dynamical system are reported in the file 'LSTM_parameters.mat'.
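A minimal sketch for loading the training experiments and the LSTM parameters (whether the csv files contain a header row is not specified, so header=None is an assumption):

# Minimal sketch: load the training experiments (one column per experiment,
# 2000 samples each) and the trained LSTM parameters. header=None is assumed.
import pandas as pd
from scipy.io import loadmat

u_train = pd.read_csv("PH_U_Train.csv", header=None)   # inputs, 15 experiments
y_train = pd.read_csv("PH_Y_Train.csv", header=None)   # outputs, 15 experiments
print(u_train.shape, y_train.shape)                     # expected: (2000, 15) each

params = loadmat("LSTM_parameters.mat")                 # trained LSTM weights
print([k for k in params if not k.startswith("__")])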
If you use this data, please cite the related paper:
Terzi, E., Bonassi, F., Farina, M., & Scattolini, R. (2021). Learning model predictive control with long short-term memory networks. International Journal of Robust and Nonlinear Control.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Abstract: The aim of the dataset is to train and validate models for predicting time series for milling processes. For this purpose, processes were recorded at a sampling rate of 500 Hz by a Siemens Industrial Edge on a DMC 60H. The machine was upgraded in terms of control technology. Processes for model training and validation were recorded, suitable for both steel and aluminum machining. Several recordings were made with and without the workpiece (aircut) in order to cover as many cases as possible. This is the same series of experiments as in "Training and validation dataset of milling processes for time series prediction" with DOI 10.5445/IR/1000157789 and allows an investigation of the transferability of models between different machines.
Technical remarks:
Documents:
-Design of Experiments: information on the paths as well as the technological values of the experiments
-Recording information: information about the recordings, with comments
-Data: all recorded datasets. The first level contains the folders for training and validation, both with and without the workpiece. The next level contains the individual test executions. Each recording is stored as a JSON file, which consists of a header with all relevant information (such as the signal sources), followed by the entries of the recorded time series.
-NC-Code: NC programs executed on the machine
Experimental data:
-Machine: Retrofitted DMC 60H
-Material: S235JR, 2007 T4
-Tools:
-VHM-Fräser HPC, TiSi, Ø f8, DC: 5 mm
-VHM-Fräser HPC, TiSi, Ø f8, DC: 10 mm
-VHM-Fräser HPC, TiSi, Ø f8, DC: 20 mm
-Schaftfräser HSS-Co8, TiAlN, Ø k10, DC: 5 mm
-Schaftfräser HSS-Co8, TiAlN, Ø k10, DC: 10 mm
-Schaftfräser HSS-Co8, TiAlN, Ø k10, DC: 5 mm
-Workpiece blank dimensions: 150 x 75 x 50 mm
License: This work is licensed under a Creative Commons Attribution 4.0 International License. Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by IMT2022053
Released under Apache 2.0
This benchmark data comprises 50 different datasets for materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits. For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or a 10-fold cross-validation method, depending on what each respective paper did in its study. Datasets with fewer than 100 values had train-test splits created using the leave-one-out cross-validation method. For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0
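For illustration, the two splitting strategies correspond to scikit-learn's KFold and LeaveOneOut splitters; this is a sketch of the idea, not the code used to build the published index files:

# Minimal sketch of the two splitting strategies described above; the index
# files in the GitHub repository define the actual published splits.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(120).reshape(-1, 1)   # stand-in for a dataset with more than 100 samples
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5, shuffle=True, random_state=0).split(X)):
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")

X_small = np.arange(12).reshape(-1, 1)   # stand-in for a dataset with fewer than 100 samples
print("leave-one-out splits:", len(list(LeaveOneOut().split(X_small))))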
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset that we published in this data repository can be used to build neural-network-based inverse kinematics for NAO robot arms. This dataset is named ARKOMA, an acronym formed from the names of its creators (ARif, eKO, MAuridhi). The dataset contains input-output data pairs: the input data is the end-effector position and orientation in three-dimensional Cartesian space, and the output data is a set of joint angular positions, given in radians. For further applications, this dataset was split into a training dataset, a validation dataset, and a testing dataset. The training dataset is used to train neural networks. The validation dataset is used to validate the networks' performance during the training process. The testing dataset is employed after the training process to test the performance of the trained neural networks. From a set of 10,000 samples, 60% of the data was allocated to the training dataset, 20% to the validation dataset, and the remaining 20% to the testing dataset. Note that this dataset is compatible with NAO H25 v3.3 or later.
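A minimal sketch of reproducing a 60/20/20 split of 10,000 input-output pairs follows; the arrays are placeholders for the pose inputs and joint-angle outputs, not the actual ARKOMA files:

# Minimal sketch of a 60/20/20 split of 10,000 samples. x stands for
# end-effector poses and y for joint angles; both are random placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.rand(10000, 6)   # placeholder: position (3) + orientation (3)
y = np.random.rand(10000, 5)   # placeholder: joint angular positions in radians

x_train, x_rest, y_train, y_rest = train_test_split(x, y, test_size=0.4, random_state=0)
x_val, x_test, y_val, y_test = train_test_split(x_rest, y_rest, test_size=0.5, random_state=0)
print(len(x_train), len(x_val), len(x_test))   # 6000 2000 2000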