This dataset was created by Hakan İrek
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.
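As a rough illustration of the radial basis function (RBF) expansion mentioned above, the sketch below expands a single scalar descriptor into a vector of Gaussian basis values. The function name, the grid range, the number of centers, and the width parameter gamma are all assumptions chosen for illustration; the settings actually used by csv2pkl.py may differ.

```python
import math


def rbf_expand(value, low, high, num_centers, gamma):
    """Expand one scalar into num_centers Gaussian basis values.

    Centers are evenly spaced on [low, high]; gamma controls the width.
    All of these parameters are illustrative assumptions.
    """
    if num_centers < 2:
        raise ValueError("num_centers must be at least 2")
    step = (high - low) / (num_centers - 1)
    centers = [low + i * step for i in range(num_centers)]
    return [math.exp(-gamma * (value - c) ** 2) for c in centers]


# Hypothetical example: a partial charge of 0.25 on a grid from -1 to 1.
features = rbf_expand(0.25, low=-1.0, high=1.0, num_centers=9, gamma=10.0)
print([round(f, 3) for f in features])
```

The expansion turns one continuous value into a smooth, localized feature vector whose largest entry sits at the center closest to the input.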
Below are descriptions of the available scripts:
atom_bond_descriptors.sh: Trains atom/bond targets.
atom_bond_descriptors_predict.sh: Predicts atom/bond targets from a pre-trained model.
dipole_quadrupole_moments.sh: Trains dipole and quadrupole moments.
dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from a pre-trained model.
energy_gaps_IP_EA.sh: Trains energy gaps, ionization potential (IP), and electron affinity (EA).
energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from a pre-trained model.
get_constraints.py: Generates the constraints file for a testing dataset. This generated file must be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.
csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with the Chemprop software.
Below is the procedure for running ml-QM-GNN on your own dataset:
1. Run get_constraints.py to generate the constraint file required for predicting atom/bond QM descriptors with the trained ML models.
2. Run atom_bond_descriptors_predict.sh to predict atom and bond properties.
3. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.
4. Run csv2pkl.py to convert the predicted atom/bond descriptors from the .csv file into separate atom and bond feature files (saved here as .pkl files).
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Amrr Sheta
Released under Apache 2.0
This dataset was created by XUE LINGJUN
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.
ger_train.csv – The German training set as CSV file.
ger_validation.csv – The German validation set as CSV file.
en_test.csv – The English test set as CSV file.
en_train.csv – The English training set as CSV file.
en_validation.csv – The English validation set as CSV file.
splitting.py – The python code for splitting a dataset into train, test and validation set.
DataSetTrans_de.csv – The final German dataset as a CSV file.
DataSetTrans_en.csv – The final English dataset as a CSV file.
translation.py – The python code for translating the cleaned dataset.
Applying machine learning techniques in general, and deep learning techniques in particular, requires an amount of data that is often not available in some technical domains. The manual inspection of machine tool components and the manual end-of-line check of products are labour-intensive tasks in industrial applications that companies often want to automate. Automating these classification processes, and developing reliable and robust machine-learning-based classification and wear prognostics models, requires real-world datasets on which to train and test models.
The dataset contains 1104 three-channel images with 394 image annotations for the surface damage type “pitting”. The annotations, made with the annotation tool labelme, are available in JSON format and are hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into three folders: data with all images as JPEG, label with all annotations, and saved_model with a baseline model. The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits the images into the same train and test split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types.
One of the two BSD types is represented with 69 images and 55 different image sizes. All images of this BSD type come in either a clean or a soiled condition. The other BSD type is shown in 325 images with two image sizes. Because all images of this type were taken continuously over time, the degree of soiling evolves across them. As mentioned above, the dataset also contains 27 pitting development sequences with 69 images each.
Instructions for the dataset split
The authors of this dataset provide three types of dataset split. To obtain a split, run the Python script split_dataset.py.
Script inputs:
split-type (mandatory)
output directory (mandatory)
Split types:
train_test_split: splits the dataset into train and test data (80%/20%)
wear_dev_split: splits the dataset into 27 wear developments
type_split: splits the dataset into the different BSD types
Example:
C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
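Because the annotations are stored in labelme's JSON format, converting them to box-based formats such as VOC or COCO starts with reducing each annotated shape to a bounding box. The sketch below shows that step only. The inlined annotation, its file name, and its coordinates are invented for illustration and are not taken from the dataset.

```python
import json


def labelme_to_boxes(annotation):
    """Return (label, xmin, ymin, xmax, ymax) for each shape in a labelme dict."""
    boxes = []
    for shape in annotation.get("shapes", []):
        xs = [point[0] for point in shape["points"]]
        ys = [point[1] for point in shape["points"]]
        boxes.append((shape["label"], min(xs), min(ys), max(xs), max(ys)))
    return boxes


# Hypothetical annotation, inlined so the example runs without the dataset.
example = json.loads("""
{"imagePath": "example.jpg", "imageWidth": 640, "imageHeight": 480,
 "shapes": [{"label": "pitting", "shape_type": "polygon",
             "points": [[100, 120], [160, 110], [170, 180], [95, 175]]}]}
""")

boxes = labelme_to_boxes(example)
print(boxes)
```

For a real file, the dictionary would be loaded with json.load from one of the files in the label folder instead of the inlined string.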
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head" for when an individual may be present without a hard hat.
The original dataset has a 75/25 train-test split.
Example image: https://i.imgur.com/7spoIJT.png
One could use this dataset to, for example, build a classifier that distinguishes workers who are abiding by the safety code within a workplace from those who may not be. It is also a good general dataset for practice.
Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or with additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
* v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
* v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
* v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
* v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
* v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
* v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
* v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
* v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
* v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
* v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
* v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head
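Several of the versions listed above drop a class or relabel one class as another. The sketch below shows what such a class modification could look like on a plain list of annotations. It is not Roboflow's implementation: the function name, the annotation layout, and the box coordinates are assumptions made for illustration.

```python
def modify_classes(annotations, drop=(), remap=None):
    """Return a new annotation list with some classes dropped or relabeled.

    annotations: list of dicts with a "label" key (hypothetical layout).
    drop: labels to remove entirely.
    remap: dict mapping old labels to new labels.
    """
    remap = remap or {}
    kept = []
    for ann in annotations:
        if ann["label"] in drop:
            continue
        # Copy the annotation so the input list is left unchanged.
        kept.append({**ann, "label": remap.get(ann["label"], ann["label"])})
    return kept


# Invented annotations for one image.
annotations = [
    {"label": "helmet", "box": (10, 10, 50, 50)},
    {"label": "head", "box": (60, 12, 95, 48)},
    {"label": "person", "box": (5, 5, 100, 200)},
]

# Mirrors the v14 description: drop person, relabel helmet as head.
head_only = modify_classes(annotations, drop={"person"}, remap={"helmet": "head"})
print([ann["label"] for ann in head_only])
```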
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Example image: https://i.imgur.com/7Xz8d5M.gif
This is a collection of 665 images of roads with the potholes labeled. The dataset was created and shared by Atikur Rahman Chitholian as part of his undergraduate thesis and was originally shared on Kaggle.
Note: The original dataset did not contain a validation set; we have re-shuffled the images into a 70/20/10 train-valid-test split.
This dataset could be used for automatically finding and categorizing potholes in city streets so the worst ones can be fixed faster.
The dataset is provided in a wide variety of formats for various common machine learning models.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was created by pascalammeter
Released under CC BY-NC-SA 4.0
This benchmark comprises 50 different datasets for materials properties obtained from 16 previous publications. It contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.
For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or a 10-fold cross-validation method, depending on what the respective paper used in its study. Datasets with fewer than 100 values had train-test splits created using the Leave-One-Out cross-validation method.
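To make the two splitting schemes concrete, the sketch below generates index splits for k-fold and Leave-One-Out cross-validation using only the standard library. It is a generic illustration, not the code used to build the benchmark: the sample counts and the seed are assumptions, and the additional validation subset described above is not carved out here.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def leave_one_out_indices(n_samples):
    """Yield (train, test) index lists where each sample is held out once."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


# Hypothetical sizes: a 120-sample dataset and a 12-sample dataset.
five_fold = list(k_fold_indices(n_samples=120, k=5))
loo = list(leave_one_out_indices(n_samples=12))
print(len(five_fold), len(loo))
```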
For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0
Adapting Large Language Models to Domains via Continual Pre-Training
This repo contains the ConvFinQA dataset used in our ICLR 2024 paper Adapting Large Language Models via Reading Comprehension. We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a… See the full description on the dataset page: https://huggingface.co/datasets/AdaptLLM/ConvFinQA.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
Here's an example of how the data looks (each class takes three-rows):
https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png
Original split: train (86% of images - 60,000 images) set and test (14% of images - 10,000 images) set only.
Alternative split: train set split to provide 80% of its images to the training set and 20% of its images to the validation set.
@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
This dataset was created by Arya B
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.
This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.
These are the preprocessing steps that were performed:
This is the label mapping:
Category | label |
day bed | 0 |
dishrag | 1 |
plate | 2 |
running shoe | 3 |
soap dispenser | 4 |
street sign | 5 |
table lamp | 6 |
tile roof | 7 |
toilet seat | 8 |
washing machine | 9 |
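For use in code, the label mapping above can be written as a dictionary together with its inverse, which is handy for turning predicted label ids back into category names. This is a minimal sketch; the variable and function names are illustrative and the prediction list is invented.

```python
# Category-to-label mapping, copied from the table above.
CATEGORY_TO_LABEL = {
    "day bed": 0,
    "dishrag": 1,
    "plate": 2,
    "running shoe": 3,
    "soap dispenser": 4,
    "street sign": 5,
    "table lamp": 6,
    "tile roof": 7,
    "toilet seat": 8,
    "washing machine": 9,
}

# Inverse mapping, from label id back to category name.
LABEL_TO_CATEGORY = {label: name for name, label in CATEGORY_TO_LABEL.items()}


def decode(predictions):
    """Translate a list of predicted label ids into category names."""
    return [LABEL_TO_CATEGORY[label] for label in predictions]


# Invented predictions, for illustration only.
decoded = decode([3, 9, 0])
print(decoded)
```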
Check out this notebook to see how the subset was created.
The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.
Splits of aggregated data into testing and training subsets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDC-Scissor tool for Cost-effective Simulation-based Test Selection in Self-driving Cars Software
This dataset provides test cases for self-driving cars with the BeamNG simulator. Check out the repository and demo video to get started.
GitHub: github.com/ChristianBirchler/sdc-scissor
This project extends the tool competition platform from the Cyber-Physical Systems Testing Competition, which was part of the SBST Workshop in 2021.
Usage
Demo
Installation
The tool can either be run with Docker or locally using Poetry.
When running the simulations a working installation of BeamNG.research is required. Additionally, this simulation cannot be run in a Docker container but must run locally.
To install the application use one of the following approaches:
Docker: docker build --tag sdc-scissor .
Poetry: poetry install
Using the Tool
The tool can be used with the following two commands:
Docker: docker run --volume "$(pwd)/results:/out" --rm sdc-scissor [COMMAND] [OPTIONS] (this will write all files written to /out to the local folder results)
Poetry: poetry run python sdc-scissor.py [COMMAND] [OPTIONS]
There are multiple commands to use. To keep the documentation simple, only the commands and their options are described.
generate-tests --out-path /path/to/store/tests
label-tests --road-scenarios /path/to/tests --result-folder /path/to/store/labeled/tests
evaluate-models --dataset /path/to/train/set --save
split-train-test-data --scenarios /path/to/scenarios --train-dir /path/for/train/data --test-dir /path/for/test/data --train-ratio 0.8
predict-tests --scenarios /path/to/scenarios --classifier /path/to/model.joblib
evaluate --scenarios /path/to/test/scenarios --classifier /path/to/model.joblib
The possible parameters are always documented with --help.
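When the commands above are driven from a script, it can help to assemble the argument lists programmatically before running them. The sketch below only builds and prints the Poetry command lines and executes nothing. The order of the steps and the directory names (tests, labeled, train, test) are assumptions for illustration, not something prescribed by the tool.

```python
import shlex

# Prefix taken from the Poetry invocation shown above.
BASE = ["poetry", "run", "python", "sdc-scissor.py"]


def build_command(command, **options):
    """Build an argv list; keyword names become --dashed-flags."""
    argv = list(BASE) + [command]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)  # boolean switch such as --save
        else:
            argv.extend([flag, str(value)])
    return argv


# Hypothetical directory names and step order.
pipeline = [
    build_command("generate-tests", out_path="tests"),
    build_command("label-tests", road_scenarios="tests", result_folder="labeled"),
    build_command("split-train-test-data", scenarios="labeled",
                  train_dir="train", test_dir="test", train_ratio=0.8),
    build_command("evaluate-models", dataset="train", save=True),
]

for argv in pipeline:
    print(shlex.join(argv))
```

Each list could then be passed to subprocess.run once a working installation is available.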
Linting
The tool is checked with the linters flake8 and pylint. These are automatically enabled in Visual Studio Code and can be run manually with the following commands:
poetry run flake8 .
poetry run pylint **/*.py
License
The software we developed is distributed under the GNU GPL license. See the LICENSE.md file.
Contacts
Christian Birchler - Zurich University of Applied Science (ZHAW), Switzerland - birc@zhaw.ch
Nicolas Ganz - Zurich University of Applied Science (ZHAW), Switzerland - gann@zhaw.ch
Sajad Khatiri - Zurich University of Applied Science (ZHAW), Switzerland - mazr@zhaw.ch
Dr. Alessio Gambi - Passau University, Germany - alessio.gambi@uni-passau.de
Dr. Sebastiano Panichella - Zurich University of Applied Science (ZHAW), Switzerland - panc@zhaw.ch
References
If you use this tool in your research, please cite the following papers:
@INPROCEEDINGS{Birchler2022,
author={Birchler, Christian and Ganz, Nicolas and Khatiri, Sajad and Gambi, Alessio and Panichella, Sebastiano},
booktitle={2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
title={Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor},
year={2022},
}
This dataset was created by De Chef
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The runtime benchmarks were obtained by running each algorithm on the seed and full multi-MSAs Pfam-A.seed and Pfam-A.full on 2 cores with 8 GB RAM for the seed alignments and on 3 cores with 12 GB RAM for the full alignments. We did not compute the maximum runtime of the Blue algorithm; the algorithm failed to terminate within 6 days for 34 families.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.
* More info on CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html
* TensorFlow listing of the dataset: https://www.tensorflow.org/datasets/catalog/cifar100
* GitHub repo for converting CIFAR-100 tarball files to png format: https://github.com/knjcode/cifar2png
The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images [in the original dataset].
The CIFAR-100 dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). However, this project does not contain the superclasses.
* Superclasses version: https://universe.roboflow.com/popular-benchmarks/cifar100-with-superclasses/
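The relationship between "fine" and "coarse" labels can be pictured as a lookup from class name to superclass name. The sketch below includes only two of the 20 superclasses, with names written as they appear on the CIFAR page linked above. It is an illustrative fragment, not a complete mapping, and the variable names are assumptions.

```python
# Two of the 20 CIFAR-100 superclasses, each grouping five fine classes.
SUPERCLASSES = {
    "aquatic mammals": ["beaver", "dolphin", "otter", "seal", "whale"],
    "fish": ["aquarium fish", "flatfish", "ray", "shark", "trout"],
}

# Invert the grouping: fine label -> coarse label.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in SUPERCLASSES.items()
    for fine in fines
}

print(FINE_TO_COARSE["otter"])

# Size check from the description: 100 classes x 600 images, split 500/100.
total_images = 100 * 600
train_images = 100 * 500
test_images = 100 * 100
print(total_images, train_images, test_images)
```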
More background on the dataset:
CIFAR-100 dataset classes and superclasses: https://i.imgur.com/5w8A0Vm.png
Original split: train (83.33% of images - 50,000 images) set and test (16.67% of images - 10,000 images) set only.
Alternative split: train set split to provide 80% of its images to the training set (approximately 40,000 images) and 20% of its images to the validation set (approximately 10,000 images).
@TECHREPORT{Krizhevsky09learningmultiple,
author = {Alex Krizhevsky},
title = {Learning multiple layers of features from tiny images},
institution = {},
year = {2009}
}
Node classification on Film with 60%/20%/20% random splits for training/validation/test.
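A 60%/20%/20% random split of graph nodes can be sketched with the standard library alone, as below. The node count and seed are placeholders chosen for illustration and are not properties of the Film dataset. The function name is likewise hypothetical.

```python
import random


def random_node_split(num_nodes, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly assign node ids to train/valid/test by the given fractions."""
    order = list(range(num_nodes))
    random.Random(seed).shuffle(order)
    n_train = int(round(fractions[0] * num_nodes))
    n_valid = int(round(fractions[1] * num_nodes))
    return {
        "train": sorted(order[:n_train]),
        "valid": sorted(order[n_train:n_train + n_valid]),
        # Whatever remains goes to the test set.
        "test": sorted(order[n_train + n_valid:]),
    }


# Placeholder graph size.
split = random_node_split(num_nodes=1000)
print({name: len(nodes) for name, nodes in split.items()})
```

Changing the seed yields a different random split with the same proportions, which is how repeated random splits are usually produced.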