Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction Data of Base Models from AutoGluon on 71 classification datasets from the AutoML Benchmark for Balanced Accuracy and ROC AUC.
The files of this figshare item include data that was collected for the paper: CMA-ES for Post Hoc Ensembling in AutoML: A Great Success and Salvageable Failure, Lennart Purucker, Joeran Beel, Second International Conference on Automated Machine Learning, 2023.
The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.
In detail, the data contains the predictions of base models on validation and test as produced by running AutoGluon for 4 hours. Such prediction data is included for each model produced by AutoGluon on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.
The data was collected by code created for the paper and is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23609226.
Its usage is intended for but not limited to using assembled to evaluate post hoc ensembling methods for AutoML.
Details The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.
The link resolves to a directory containing the following:
example_metatasks: contains an example metatask for test purposes before committing to downloading all files.
metatasks_roc_auc.zip: The Metatasks obtained by running AutoGluon for ROC AUC.
metatasks_bacc.zip: The Metatasks obtained by running AutoGluon for Balanced Accuracy.
The size after unzipping is:
metatasks_roc_auc.zip: ~85GB metatasks_bacc.zip: ~100GB
The metatask .zip files contain 2 files per metatask. One .json file with metadata information and a .hdf file containing the prediction data. The details on how this should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metataks, we advise looking at the file content and metadata individually or parsing them by using Metatasks first.
This dataset was an initial test harness infrastructure test for the TrojAI program. It should not be used for research. Please use the more refined datasets generated for the other rounds. The data being generated and disseminated is training, validation, and test data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 200 trained, human level, image classification AI models using the following architectures (Inception-v3, DenseNet-121, and ResNet50). The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training, test data and model parameters. The last 3 columns show the MinORG, LT and HT parameters used to create the pathogenicity families and build the model for each of the 10 models. Zthr is a threshold value, calculated for each model at the cross validation phase, which is used, given the final prediction score, to decide if the input organisms will be predicted as pathogenic or non-pathogenic. The parameters for each model are chosen after 5-fold cross-validation tests.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Annotated test and train data sets. Both images and annotations are provided separately.
Validation data set for Hi5, Sf9 and HEK cells.
Confusion matrices for the determination of performance parameters
Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
findmycells is an open source python package, developed to foster the use of deep-learning based python tools for bioimage analysis, specifically for researchers with limited python coding experience. It is developed and maintained in the following GitHub repository: https://github.com/Defense-Circuits-Lab/findmycells
Disclaimer: All data (including the model ensemble) uploaded here serve solely as a test dataset for findmycells and are not intended for any other purposes.
For instance, the group, subgroup, or subject IDs don´t refer to the actual experimental conditions. Likewise, also the included ROI-files were only created to allow the testing of findmycells and may not live up to scientific standards. Furthermore, the image data represents a subset of a dataset that is already published here:
Segebarth, Dennis et al. (2020), Data from: On the objectivity, reliability, and validity of deep learning enabled bioimage analyses, Dryad, Dataset, https://doi.org/10.5061/dryad.4b8gtht9d
The model ensemble (cfos_ensemble.zip) was trained using deepflash2 (v 0.1.7)
Griebel, M., Segebarth, D., Stein, N., Schukraft, N., Tovote, P., Blum, R., & Flath, C. M. (2021). Deep-learning in the bioimaging wild: Handling ambiguous data with deepflash2. arXiv preprint arXiv:2111.06693.
The training was performed on a subset of the "lab-wue1" training dataset, using only the 27 images with IDs 0000 - 0099 (cfos_training_images.zip) and the corresponding est. GT masks (cfos_training_masks.zip). The images used in "cfos_fmc_test_project.zip" for the actual testing of findmycells are the images with the IDs 0100, 0106, 0149, and 0152 of the aforementioned "lab-wue1" training dataset. They were randomly distributed to the made-up subject folders and renamed to "dentate_gyrus_01" or "dentate_gyrus_02".
This is a program that takes in a description of a cryptographic algorithm implementation's capabilities, and generates test vectors to ensure the implementation conforms to the standard. After generating the test vectors, the program also validates the correctness of the responses from the user.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The repo includes the training, test and validation data used in the paper "On the accuracy of posterior recovery with neural network emulators". Note that due to the convention employed by the emulator framework in the paper the test data is the data used for early stopping and the validation data is used to measure the accuracy of the emulator after training. This is the opposite convention to most machine learning literature.
The corresponding code used in the paper is found at: https://github.com/htjb/validating_posteriors.
`_data.txt` corresponds to the ARES parameters used to generate the signals in `_labels.txt`.
These are CARLA Simulation Datasets of the project "Out-Of-Domain Data Detection using Uncertainty Quantification in End-to-End Driving Algorithms". The simulations are generated in CARLA Town 02 for different sun angles (in degrees). You will find image frames, command labels, and steering control values in the respective 'xxxx_files_data' folder. You will find videos of each simulation run in the 'xxxx_files_visualizations' folder.
The 8 simulation runs for Training Data, are with the Sun Angles : 90, 80, 70, 60, 50, 40, 30, 20
The 8 simulation runs for Training Data were seeded at 0000, 1000, 2000, 3000, 4000, 5000, 6000, 7000 respectively
The 4 simulation runs for Validation Data, are with the Sun Angles : 87, 67, 47, 23
The 4 simulation runs for Validation Data were seeded at 0000, 2000, 4000, 7000 respectively
The 29 simulation runs for Testing Data, are with the Sun Angles : 85, 75, 65, 55, 45, 35, 25, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 09, 08, 07, 06, 05, 04, 03, 02, 01, 00, -1, -10
The 29 simulation runs for Testing Data were all seeded at 5000 respectively
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
The data consist of 45.5K images of dogs and cats well distributed into 3 different sets. This is the first Dataset uploaded by me at Kaggle.
> kaggle datasets download -d kunalgupta2616/dog-vs-cat-images-data
This is an improvement to the dataset from the competition that can be found at this link. in terms of addition of more data and hierarchical distribution.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training, Validation and Test Data for model presented in paper 'A Little Data Goes A Long Way: Automating Seismic Phase Arrival Picking at Nabro Volcano with Transfer Learning', submitted to Journal of Geophysical Research: Solid Earth.
Files:
- train_events_2498.h5 = training set of seismic waveforms (events with P-/S-wave labelled arrivals only, i.e., no noise waveforms)
- train_events_2498.pkl = event training set metadata (UTC P-/S-wave phase arrival times)
- train_noise_2498.h5 = training set of seismic waveforms (noise sections only, i.e., no event waveforms)
- train_noise_2498.pkl = noise training set metadata (UTC time for training noise waveforms)
- val_events.h5 = validation set of seismic waveforms (events with P-/S-wave labelled arrivals only, i.e., no noise waveforms)
- val_events.pkl = event validation set metadata (UTC P-/S-wave phase arrival times)
- val_noise.h5 = validation set of seismic waveforms (noise sections only, i.e., no event waveforms)
- val_noise.pkl = noise validation set metadata (UTC time for validation noise waveforms)
- test.h5 = test set of seismic waveforms (events and noise)
- test_events.pkl = event test set metadata (UTC P-/S-wave phase arrival times for test event waveforms)
- test_noise.pkl = noise test set metadata (UTC time for test noise waveforms)
- nabro_2011-247.mseed = 24 hours seismic data from Nabro Urgency Array (2011-09-04), saved in mseed format (e.g., can be read with obspy)
- nabro_2011-269.mseed = 24 hours seismic data from Nabro Urgency Array (2011-09-26), saved in mseed format (e.g., can be read with obspy)
Further details and code for reading and using these files can be found at the GitHub repo for this paper: https://github.com/sachalapins/U-GPD
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction Data of Base Models from Auto-Sklearn 1 on 71 classification datasets from the AutoML Benchmark for Balanced Accuracy and ROC AUC.
The files of this figshare item include data that was collected for the paper:
Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoML, Lennart Purucker, Lennart Schneider, Marie Anastacio, Joeran Beel, Bernd Bischl, Holger Hoos, Second International Conference on Automated Machine Learning, 2023.
The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.
In detail, the data contains the predictions of base models on validation and test as produced by running Auto-Sklearn 1 for 4 hours. Such prediction data is included for each model produced by Auto-Sklearn 1 on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.
The data was collected by code created for the paper and is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23613624.
Its usage is intended for but not limited to using assembled to evaluate post hoc ensembling methods for AutoML.
Details The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.
The link resolves to a directory containing the following:
example_metatasks: contains an example metatask for test purposes before committing to downloading all files.
metatasks_roc_auc.zip: The Metatasks obtained by running Auto-Sklearn 1 for ROC AUC.
metatasks_bacc.zip: The Metatasks obtained by running Auto-Sklearn 1 for Balanced Accuracy.
The size after unzipping the entire file is:
metatasks_roc_auc.zip: ~450GB metatasks_bacc.zip: ~330GB
We suggest extracting only files that are of interest from the .zip archive, as these can be much smaller in size and might suffice for experiments.
The metatask .zip files contain 2 subdirectories for Metatasks produced based on TopN or SiloTopN pruning (see paper for details). In each of these subdirectories, 2 files per metatask exist. One .json file with metadata information and a .hdf or .csv file containing the prediction data. The details on how this should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metataks, we advise looking at the file content and metadata individually or parsing them by using Metatasks first.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv
, validation_data.csv
, and test_data.csv
. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:
Python (with libraries such as pandas
, numpy
, scikit-learn
, matplotlib
, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Structure tensor validationGeneral informationThis item contains test data to validate the structure tensor algorithms and a supplemental paper describing how the data was generated and used.ContentsThe test_data.zip archive contains 101 slices of a cylinder (701x701 pixels) with two artificially created fibre orientations. The outer fibres are oriented longitudinally, and the inner fibres are oriented circumferentially, similar to the ones found in the rat uterus.The SupplementaryMaterials_rat_uterus_texture_validation.pdf file is a short supplemental paper describing the generation of the test data and the results after being processed with the structure tensor code.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets were used to validate and test the data pipeline deployment following the RADON approach. The dataset has a CSV file that contains around 32000 Twitter tweets. 100 CSV files have been created from the single CSV file and each CSV file containing 320 tweets. Those 100 CSV files are used to validate and test (performance/load testing) the data pipeline components.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.
Below are descriptions of the available scripts:
atom_bond_descriptors.sh
: Trains atom/bond targets.atom_bond_descriptors_predict.sh
: Predicts atom/bond targets from pre-trained model.dipole_quadrupole_moments.sh
: Trains dipole and quadrupole moments.dipole_quadrupole_moments_predict.sh
: Predicts dipole and quadrupole moments from pre-trained model.energy_gaps_IP_EA.sh
: Trains energy gaps, ionization potential (IP), and electron affinity (EA).energy_gaps_IP_EA_predict.sh
: Predicts energy gaps, IP, and EA from pre-trained model.get_constraints.py
: Generates constraints file for testing dataset. This generated file needs to be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.csv2pkl.py
: Converts QM atom and bond features to .pkl files using RBF expansion for use with Chemprop software.Below is the procedure for running the ml-QM-GNN on your own dataset:
get_constraints.py
to generate a constraint file required for predicting atom/bond QM descriptors with the trained ML models.atom_bond_descriptors_predict.sh
to predict atom and bond properties. Run dipole_quadrupole_moments_predict.sh
and energy_gaps_IP_EA_predict.sh
to calculate molecular QM descriptors.csv2pkl.py
to convert the data from predicted atom/bond descriptors .csv file into separate atom and bond feature files (which are saved as .pkl files here).Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPSS Data for 37 Intensivists and 74 Novices.Their scores on the 25 original questions.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The case and data files of ANSYS® FLUENT that included the simulations results of Blaisonneau's fracture test
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Convolutional neural network (CNN) models and their respective training, validation and test datasets used in manuscript:
Tuomo Hartonen, Teemu Kivioja and Jussi Taipale, "PlotMI: visualization of pairwise interactions and positional preferences learned by a deep learning model from sequence data", bioRxiv 2021.03.14.435285; doi: https://doi.org/10.1101/2021.03.14.435285
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The Skin Cancer Classification Dataset is designed to support the development and evaluation of machine learning models for classifying skin cancer images into 8 distinct classes. This dataset provides a robust foundation for training, validating, and testing image classification models, particularly for deep learning frameworks.
train_x.pkl
, train_y.pkl
val_x.pkl
, val_y.pkl
test_x.pkl
, test_y.pkl
Split | Features File | Labels File | Description |
---|---|---|---|
Training | train_x.pkl | train_y.pkl | Contains input features and labels for training |
Validation | val_x.pkl | val_y.pkl | Data used for model evaluation during training |
Testing | test_x.pkl | test_y.pkl | Data for final performance testing |
(224, 224, 3)
(Height, Width, Channels)This dataset is ideal for: - Building deep learning models for multi-class image classification. - Experimenting with transfer learning and ensemble methods. - Developing tools for skin cancer detection in clinical applications.
The dataset is saved as pickle files for efficient storage and loading. Use the following Python code to load the data:
import pickle
# Example: Loading training data
with open('train_x.pkl', 'rb') as f:
train_x = pickle.load(f)
with open('train_y.pkl', 'rb') as f:
train_y = pickle.load(f)
print("Training data loaded successfully!")
The dataset is compatible with popular deep learning frameworks like TensorFlow
and PyTorch
. Ensure to preprocess the data as per your model's requirements.
This dataset was prepared with the goal of aiding researchers and developers in advancing skin cancer detection technologies. Special thanks to all contributors and sources for the dataset's creation.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction Data of Base Models from AutoGluon on 71 classification datasets from the AutoML Benchmark for Balanced Accuracy and ROC AUC.
The files of this figshare item include data that was collected for the paper: CMA-ES for Post Hoc Ensembling in AutoML: A Great Success and Salvageable Failure, Lennart Purucker, Joeran Beel, Second International Conference on Automated Machine Learning, 2023.
The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.
In detail, the data contains the predictions of base models on validation and test as produced by running AutoGluon for 4 hours. Such prediction data is included for each model produced by AutoGluon on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.
The data was collected by code created for the paper and is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23609226.
Its usage is intended for but not limited to using assembled to evaluate post hoc ensembling methods for AutoML.
Details The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.
The link resolves to a directory containing the following:
example_metatasks: contains an example metatask for test purposes before committing to downloading all files.
metatasks_roc_auc.zip: The Metatasks obtained by running AutoGluon for ROC AUC.
metatasks_bacc.zip: The Metatasks obtained by running AutoGluon for Balanced Accuracy.
The size after unzipping is:
metatasks_roc_auc.zip: ~85GB metatasks_bacc.zip: ~100GB
The metatask .zip files contain 2 files per metatask. One .json file with metadata information and a .hdf file containing the prediction data. The details on how this should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metataks, we advise looking at the file content and metadata individually or parsing them by using Metatasks first.