Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
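For illustration, here is a minimal sketch (not from the paper) of how random-split and time-split validation R2 values can be computed for a QSAR-style regression model; the DataFrame columns ('date', 'y', descriptor columns) and the random-forest learner are assumptions.

```python
# Minimal sketch (not from the paper): compare random-split vs. time-split validation
# R2 for a QSAR-style regression model. The DataFrame columns ('date', 'y', descriptor
# columns) and the random-forest learner are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def random_split_r2(df: pd.DataFrame, descriptor_cols: list) -> float:
    X, y = df[descriptor_cols], df["y"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


def time_split_r2(df: pd.DataFrame, descriptor_cols: list) -> float:
    # Hold out the most recently registered 20% of compounds, mimicking prospective
    # prediction on compounds made after the model was built.
    df = df.sort_values("date")
    cut = int(0.8 * len(df))
    train, test = df.iloc[:cut], df.iloc[cut:]
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(train[descriptor_cols], train["y"])
    return r2_score(test["y"], model.predict(test[descriptor_cols]))
```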
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.
Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions
Funding: These data were collected as part of research funded by:
This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.
XML metadata: GEMINI compliant metadata for this dataset is available here
Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip
CT_image_data_info2.xlsx
This file contains dataset metadata and 1 data table:
Dataset Images (described in worksheet Dataset_images)
Description: This worksheet details the composition of each dataset used in the analyses
Number of fields: 69
Number of data rows: 270287
Fields:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ecoacoustic data files collected with automated recording units
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Personal Protective Equipment Dataset (PPED)
This dataset serves as a benchmark for PPE detection in chemical plants. We provide the dataset and experimental results.
We produced a data set based on the actual needs and relevant regulations in chemical plants. The standard GB 39800.1-2020 formulated by the Ministry of Emergency Management of the People’s Republic of China defines the protective requirements for plants and chemical laboratories. The complete dataset is contained in the folder PPED/data.
1.1. Image collection
We took more than 3300 pictures, varying the following characteristics: environment, distance, lighting conditions, angle, and the number of people photographed.
Backgrounds: There are 4 backgrounds, including office, near machines, factory and regular outdoor scenes.
Scale: By taking pictures from different distances, the captured PPEs are classified into small, medium, and large scales.
Light: Good lighting conditions and poor lighting conditions were studied.
Diversity: Some images contain a single person, and some contain multiple people.
Angle: The pictures we took can be divided into front and side.
A total of more than 3300 photos were taken in the raw data under all conditions. All images are located in the folder “PPED/data/JPEGImages”.
1.2. Label
We use LabelImg as the labeling tool, which uses the PASCAL-VOC labelling format. YOLO uses the txt format; trans_voc2yolo.py can be used to convert the XML files in PASCAL-VOC format to txt files. Annotations are stored in the folder PPED/data/Annotations.
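For reference, the following is a minimal sketch of what such a PASCAL-VOC-to-YOLO conversion does. It is an illustration in the spirit of trans_voc2yolo.py, not the script itself; the class list and paths are placeholders.

```python
# Minimal sketch of a PASCAL-VOC (XML) to YOLO (txt) conversion, in the spirit of
# trans_voc2yolo.py. Class names and paths are placeholders.
import xml.etree.ElementTree as ET
from pathlib import Path

CLASSES = ["helmet", "goggles", "gloves"]  # placeholder class list


def voc_to_yolo(xml_path, out_dir):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls = obj.find("name").text
        if cls not in CLASSES:
            continue
        b = obj.find("bndbox")
        xmin, ymin = float(b.find("xmin").text), float(b.find("ymin").text)
        xmax, ymax = float(b.find("xmax").text), float(b.find("ymax").text)
        # YOLO format: class_id x_center y_center width height, all normalized to [0, 1]
        xc, yc = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        lines.append(f"{CLASSES.index(cls)} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    out = Path(out_dir) / (Path(xml_path).stem + ".txt")
    out.write_text("\n".join(lines))
```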
1.3. Dataset Features
The pictures are made by us according to the different conditions mentioned above. The file PPED/data/feature.csv is a CSV file which notes the features of all the images. It records every feature of each picture, including lighting conditions, angle, background, number of people, and scale.
1.4. Dataset Division
The data set is divided into training and test sets at a ratio of 9:1.
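A minimal sketch of a reproducible 9:1 split over the image files follows; it is illustrative only, and the dataset's own division should be used where provided.

```python
# Minimal sketch of a reproducible 9:1 train/test split over the image files.
# Illustrative only; the ratio matches the dataset description above.
import random
from pathlib import Path

images = sorted(Path("PPED/data/JPEGImages").glob("*.jpg"))
random.Random(0).shuffle(images)          # fixed seed for reproducibility
cut = int(0.9 * len(images))
train_files, test_files = images[:cut], images[cut:]
print(len(train_files), "training images,", len(test_files), "test images")
```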
We provide baseline results with five models, namely Faster R-CNN (R), Faster R-CNN (M), SSD, YOLOv3-spp, and YOLOv5. All code and results are given in folder PPED/experiment.
2.1. Environment and Configuration:
Intel Core i7-8700 CPU
NVIDIA GTX1060 GPU
16 GB of RAM
Python: 3.8.10
pytorch: 1.9.0
pycocotools: pycocotools-win
Windows 10
2.2. Applied Models
The source codes and results of the applied models are given in folder PPED/experiment with sub-folders corresponding to the model names.
2.2.1. Faster R-CNN
Faster R-CNN
backbone: resnet50+fpn
We downloaded the pre-training weights from https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth.
We modified the dataset path, training classes and training parameters including batch size.
We run train_res50_fpn.py to start training.
Then, the weights are trained on the training set.
Finally, we validate the results on the test set.
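The steps above correspond to the standard torchvision fine-tuning recipe. The following is a hedged sketch of that setup, not the actual train_res50_fpn.py; the number of classes is a placeholder, and the `pretrained` flag matches the torchvision release paired with PyTorch 1.9.

```python
# Hedged sketch of the torchvision Faster R-CNN (ResNet-50 + FPN) fine-tuning setup
# described above (not the actual train_res50_fpn.py). num_classes is a placeholder;
# the `pretrained` flag matches the torchvision release paired with PyTorch 1.9.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 1 + 4  # background + placeholder number of PPE classes

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False)
# Load the COCO pre-trained weights downloaded from the URL given above.
state_dict = torch.load("fasterrcnn_resnet50_fpn_coco-258fb6c6.pth", map_location="cpu")
model.load_state_dict(state_dict)
# Replace the box predictor so the detection head matches the dataset's classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)
```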
backbone: mobilenetv2
The same training method as for resnet50+fpn was used, but the results were not as good as with resnet50+fpn, so this backbone was discarded.
The Faster R-CNN source code used in our experiment is given in folder PPED/experiment/Faster R-CNN. The weights of the fully-trained Faster R-CNN (R) and Faster R-CNN (M) models are stored in the files PPED/experiment/trained_models/resNetFpn-model-19.pth and mobile-model.pth, respectively. The performance measurements of Faster R-CNN (R) and Faster R-CNN (M) are stored in the folders PPED/experiment/results/Faster RCNN(R) and Faster RCNN(M).
2.2.2. SSD
backbone: resnet50
We downloaded pre-training weights from https://download.pytorch.org/models/resnet50-19c8e357.pth.
The same training method as Faster R-CNN is applied.
The SSD source code used in our experiment is given in folder PPED/experiment/ssd. The weights of the fully-trained SSD model are stored in file PPED/experiment/trained_models/SSD_19.pth. The performance measurements of SSD are stored in folder PPED/experiment/results/SSD.
2.2.3. YOLOv3-spp
backbone: DarkNet53
We modified the type information of the XML file to match our application.
We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.
The weights used are: yolov3-spp-ultralytics-608.pt.
The YOLOv3-spp source code used in our experiment is given in folder PPED/experiment/YOLOv3-spp. The weights of the fully-trained YOLOv3-spp model are stored in file PPED/experiment/trained_models/YOLOvspp-19.pt. The performance measurements of YOLOv3-spp are stored in folder PPED/experiment/results/YOLOv3-spp.
2.2.4. YOLOv5
backbone: CSP_DarkNet
We modified the type information of the XML file to match our application.
We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.
The weights used are: yolov5s.
The YOLOv5 source code used in our experiment is given in folder PPED/experiment/yolov5. The weights of the fully-trained YOLOv5 model are stored in file PPED/experiment/trained_models/YOLOv5.pt. The performance measurements of YOLOv5 are stored in folder PPED/experiment/results/YOLOv5.
2.3. Evaluation
The computed evaluation metrics as well as the code needed to compute them from our dataset are provided in the folder PPED/experiment/eval.
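For orientation, a COCO-style evaluation with pycocotools typically looks like the following sketch; file names are placeholders, and the metrics shipped with the dataset are the authoritative results.

```python
# Sketch of a COCO-style mAP evaluation with pycocotools. File names are placeholders;
# the metrics shipped with the dataset (PPED/experiment/results) are the authoritative ones.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("ground_truth_coco.json")      # ground-truth annotations (placeholder)
coco_dt = coco_gt.loadRes("detections.json")  # model detections (placeholder)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR at the standard COCO IoU thresholds
```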
Faster R-CNN (R and M)
official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/faster_rcnn.py
SSD
official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/ssd.py
YOLOv3-spp
YOLOv5
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:
mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).
aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.
data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to the reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].
data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.
data_h2o.tar.gz contains:
full_db.extxyz: The full dataset of 1.5k structures.
iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation set for the active learning cycle.
The subfolders in the folders random, uncertain, and atomic contain the training and validation sets for the random and uncertainty-based (local or atomic) active learning loops.
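The .extxyz files listed above can be inspected with ASE, for example (a minimal sketch, assuming ASE is installed):

```python
# Minimal sketch: load one of the extended-XYZ files with ASE and inspect it.
from ase.io import read

frames = read("full_db.extxyz", index=":")   # all structures in the file
print(len(frames), "structures")
print(frames[0].get_chemical_formula(), frames[0].get_positions().shape)
```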
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Notice: We have currently a paper under double-blind review that introduces this dataset. Therefore, we have anonymized the dataset authorship. Once the review process has concluded, we will update the authorship information of this dataset.
Chinese Chemical Safety Signs (CCSS)
This dataset is compiled as a benchmark for recognizing chemical safety signs from images. We provide both the dataset and the experimental results at doi:10.5281/zenodo.5482334.
The complete dataset is contained in the folder ccss/data in archive css_data.zip. The images include signs based on the Chinese standard "Safety Signs and their Application Guidelines" (GB 2894-2008) for safety signs in chemical environments. This standard, in turn, refers to the standards ISO 7010 (Graphical symbols – Safety Colours and Safety Signs – Safety signs used in workplaces and public areas), GB/T 10001 (Public Information Graphic Symbols for Signs), and GB 13495 (Fire Safety Signs).
1.1. Image Collection
We collected photos of commonly used chemical safety signs in chemical laboratories and chemical teaching buildings. For a discussion of the standards on which we base our collection, refer to the book "Talking about Hazardous Chemicals and Safety Signs" for common signs, and to the safety signs guidelines (GB 2894-2008).
The shooting was mainly carried out in 6 locations, namely on the road, in a parking lot, at construction walls, in a chemical laboratory, outside near big machines, and inside the factory and corridors.
Shooting scale: Images in which the signs appear in small, medium and large scales were taken for each location by shooting photos from different distances.
Shooting light: good lighting conditions and poor lighting conditions were investigated.
Some images contain multiple targets, while others contain only a single sign.
Under all conditions, a total of 4650 photos were taken in the original data. These were expanded to 27,900 photos via data augmentation. All images are located in folder ccss/data/JPEGImages.
The file ccss/data/features/enhanced_data_to_original_data.csv provides a mapping between the enhanced image name and the corresponding original image.
1.2. Annotation and Labelling
The labelling tool is Labelimg, which uses the PASCAL-VOC labelling format. The annotation is stored in the folder ccss/data/Annotations.
Faster R-CNN and SSD are two algorithms that use this format. When training YOLOv5, you can run trans_voc2yolo.py to convert the XML file in PASCAL-VOC format to a txt file.
We provide further meta-information about the dataset in form of a CSV file features.csv which notes, for each image, which other features it has (lighting conditions, scale, multiplicity, etc.).
1.3. Dataset Features
As stated above, the images have been shot under different conditions. We provide all the feature information in folder ccss/data/features. For each feature, there is a separate list of file names in that folder. The file ccss/data/features/features_on_original_data.csv is a CSV file which notes all the features of each original image.
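A minimal sketch of using this feature table to select image subsets is shown below; the column names and values are assumptions, so inspect the CSV header first.

```python
# Minimal sketch: select image subsets from the per-image feature table.
# Column names such as 'lighting' and values such as 'poor' are assumptions.
import pandas as pd

features = pd.read_csv("ccss/data/features/features_on_original_data.csv")
print(features.columns.tolist())                       # inspect the actual column names
poor_light = features[features["lighting"] == "poor"]  # hypothetical column/value
print(len(poor_light), "images shot under poor lighting")
```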
1.4. Dataset Division
The data set has a fixed division into training and test sets at a ratio of 7:3. You can find the corresponding image names in the files ccss/data/training_data_file_names.txt and ccss/data/test_data_file_names.txt.
We provide baseline results with three models, namely Faster R-CNN, SSD, and YOLOv5. All code and results are given in folder ccss/experiment in archive ccss_experiment.
2.2. Environment and Configuration
Single Intel Core i7-8700 CPU
NVIDIA GTX1060 GPU
16 GB of RAM
Python: 3.8.10
pytorch: 1.9.0
pycocotools: pycocotools-win
Visual Studio 2017
Windows 10
2.3. Applied Models
The source codes and results of the applied models are given in folder ccss/experiment with sub-folders corresponding to the model names.
2.3.1. Faster R-CNN
backbone: resnet50+fpn.
We downloaded the pre-training weights from
We modified the type information of the JSON file to match our application.
We run train_res50_fpn.py to start training.
Finally, the weights are trained on the training set.
backbone: mobilenetv2
The same training method as for resnet50+fpn was used, but the results were not as good as with resnet50+fpn, so this backbone was discarded.
The Faster R-CNN source code used in our experiment is given in folder ccss/experiment/sources/faster_rcnn. The weights of the fully-trained Faster R-CNN model are stored in file ccss/experiment/trained_models/faster_rcnn.pth. The performance measurements of Faster R-CNN are stored in folder ccss/experiment/performance_indicators/faster_rcnn.
2.3.2. SSD
backbone: resnet50
We downloaded pre-training weights from
The same training method as for Faster R-CNN is applied.
The SSD source code used in our experiment is given in folder ccss/experiment/sources/ssd. The weights of the fully-trained SSD model are stored in file ccss/experiment/trained_models/ssd.pth. The performance measurements of SSD are stored in folder ccss/experiment/performance_indicators/ssd.
2.3.4. YOLOv5
backbone: CSP_DarkNet
We modified the type information of the YML file to match our application.
We run trans_voc2yolo.py to convert the XML files in VOC format to txt files.
The weights used are: yolov5s.
The YOLOv5 source code used in our experiment is given in folder ccss/experiment/sources/yolov5. The weights of the fully-trained YOLOv5 model are stored in file ccss/experiment/trained_models/yolov5.pt. The performance measurements of YOLOv5 are stored in folder ccss/experiment/performance_indicators/yolov5.
2.4. Evaluation
The computed evaluation metrics as well as the code needed to compute them from our dataset are provided in the folder ccss/experiment/performance_indicators. They are provided over the complete test set as well as separately for the image features (over the test set).
Faster R-CNN
official code:
SSD
official code:
YOLOv5
We are particularly thankful to the author of the GitHub repository WZMIAOMIAO/deep-learning-for-image-processing (with whom we are not affiliated). Their instructive videos and codes were most helpful during our work. In particular, we based our own experimental codes on their work (and obtained permission to include it in this archive).
While our dataset and results are published under the Creative Commons Attribution 4.0 License, this does not hold for the included code sources. These sources are under the particular license of the repository where they have been obtained from (see Section 3 above).
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We present the TüEyeQ data set - to the best of our knowledge - the most comprehensive data set generated on a culture fair intelligence test (CFT 20-R), i.e., an IQ Test, consisting of 41 single tasks, taken by 315 individuals aged between 18 and 30 years. In addition to socio-demographic and educational information, the data set also includes the eye movements of the individuals while taking the IQ test. Along with distributional information we also highlight the potential for predictive analysis on the TüEyeQ data set and report the most important covariates for predicting the performance of a subject on a given task along with their influence on the prediction.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Many ecological studies rely on count data and involve manual counting of objects of interest, which is time-consuming and especially disadvantageous when time in the field or lab is limited. However, an increasing number of works uses digital imagery, which opens opportunities to automatise counting tasks. In this study, we use machine learning to automate counting objects of interest without the need to label individual objects. By leveraging already existing image-level annotations, this approach can also give value to historical data that were collected and annotated over longer time series (typical for many ecological studies), without the aim of deep learning applications. We demonstrate deep learning regression on two fundamentally different counting tasks: (i) daily growth rings from microscopic images of fish otolith (i.e., hearing stone) and (ii) hauled out seals from highly variable aerial imagery. In the otolith images, our deep learning-based regressor yields an RMSE of 3.40 day-rings and an R^2 of 0.92. Initial performance in the seal images is lower (RMSE of 23.46 seals and R^2 of 0.72), which can be attributed to a lack of images with a high number of seals in the initial training set, compared to the test set. We then show how to improve performance substantially (RMSE of 19.03 seals and R^2 of 0.77) by carefully selecting and relabelling just 100 additional training images based on initial model prediction discrepancy. The regression-based approach used here returns accurate counts (R^2 of 0.92 and 0.77 for the rings and seals, respectively), directly usable in ecological research.
The dataset consists of 4,738 pairs of images of 232 different scenes including reference pairs. All images were captured both in the camera raw and JPEG formats, hence generating two datasets: RealBlur-R from the raw images, and RealBlur-J from the JPEG images. Each training set consists of 3,758 image pairs, while each test set consists of 980 image pairs.
The deblurring result is first aligned to its ground truth sharp image using a homography estimated by the enhanced correlation coefficients method, and PSNR or SSIM is computed in sRGB color space.
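A hedged sketch of that evaluation protocol using OpenCV's ECC alignment and scikit-image metrics follows; it is an illustration, not the RealBlur authors' evaluation script.

```python
# Hedged sketch of the evaluation protocol: align a deblurred result to its sharp
# ground truth with an ECC-estimated homography, then compute PSNR/SSIM in sRGB.
# This is an illustration, not the RealBlur authors' evaluation script.
import cv2
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def evaluate_pair(deblurred_path: str, sharp_path: str):
    deblurred = cv2.imread(deblurred_path)   # 8-bit sRGB image (BGR channel order)
    sharp = cv2.imread(sharp_path)
    g_sharp = cv2.cvtColor(sharp, cv2.COLOR_BGR2GRAY).astype(np.float32)
    g_deblur = cv2.cvtColor(deblurred, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # Estimate a homography with the enhanced correlation coefficient (ECC) method.
    warp = np.eye(3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-6)
    _, warp = cv2.findTransformECC(g_sharp, g_deblur, warp,
                                   cv2.MOTION_HOMOGRAPHY, criteria, None, 5)

    # Warp the deblurred image into the frame of the sharp ground truth.
    aligned = cv2.warpPerspective(deblurred, warp, (sharp.shape[1], sharp.shape[0]),
                                  flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)

    psnr = peak_signal_noise_ratio(sharp, aligned, data_range=255)
    ssim = structural_similarity(sharp, aligned, channel_axis=2, data_range=255)
    return psnr, ssim
```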
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Chinese Chemical Safety Signs (CCSS)
This dataset is compiled as a benchmark for recognizing chemical safety signs from images. We provide both the dataset and the experimental results.
1. The Dataset
The complete dataset is contained in the folder ccss/data. The images include signs based on the Chinese standard "Safety Signs and their Application Guidelines" (GB 2894-2008) for safety signs in chemical environments. This standard, in turn, refers to the standards ISO 7010 (Graphical symbols – Safety Colours and Safety Signs – Safety signs used in workplaces and public areas), GB/T 10001 (Public Information Graphic Symbols for Signs), and GB 13495 (Fire Safety Signs).
1.1. Image Collection
We collected photos of commonly used chemical safety signs in chemical laboratories and chemistry teaching buildings. For a discussion of the standards on which we base our collection, refer to the book "Talking about Hazardous Chemicals and Safety Signs" for common signs, and to the safety signs guidelines (GB 2894-2008).
Under all conditions, a total of 4650 photos were taken in the original data. These were expanded to 27,900 photos via data augmentation. All images are located in folder ccss/data/JPEGImages.
The file ccss/data/features/enhanced_data_to_original_data.csv provides a mapping between the enhanced image name and the corresponding original image.
1.2. Annotation and Labelling
We use LabelImg as the labeling tool, which, in turn, uses the PASCAL-VOC labelling format. The annotations are stored in the folder ccss/data/Annotations.
Faster R-CNN and SSD are two algorithms that use this format. When training YOLOv5, you can run trans_voc2yolo.py to convert the XML files in PASCAL-VOC format to txt files.
We provide further meta-information about the dataset in the form of a CSV file features.csv which notes, for each image, which other features it has (lighting conditions, scale, multiplicity, etc.). We apply the COCO standard for deciding whether a target is small, medium, or large in size.
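The COCO size convention referenced here thresholds the object area at 32^2 and 96^2 pixels; a minimal sketch, with the area approximated by the bounding-box area:

```python
# COCO size convention: 'small' if the object area is below 32*32 pixels,
# 'medium' below 96*96 pixels, and 'large' otherwise (area approximated here
# by the bounding-box area).
def coco_size_category(bbox_width: float, bbox_height: float) -> str:
    area = bbox_width * bbox_height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```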
1.3. Dataset Features
As stated above, the images have been shot under different conditions. We provide all the feature information in folder ccss/data/features. For each feature, there is a separate list of file names in that folder. The file ccss/data/features/features_on_original_data.csv is a CSV file which notes all the features of each original image.
1.4. Dataset Division
The data set has a fixed division into training and test sets at a ratio of 7:3. You can find the corresponding image names in the files ccss/data/training_data_file_names.txt and ccss/data/test_data_file_names.txt.
2. Baseline Experiments
We provide baseline results with five models, namely Faster R-CNN (R), Faster R-CNN (M), SSD, YOLOv3-spp, and YOLOv5. All code and results are given in folder ccss/experiment.
2.2. Environment and Configuration:
2.3. Applied Models
The source codes and results of the applied models are given in folder ccss/experiment with sub-folders corresponding to the model names.
2.3.1. Faster R-CNN
We run train_res50_fpn.py to start training. The Faster R-CNN (R) source code used in our experiment is given in folder ccss/experiment/sources/faster_rcnn (R). The weights of the fully-trained Faster R-CNN (R) model are stored in file ccss/experiment/trained_models/faster_rcnn (R).pth. The performance measurements of Faster R-CNN (R) are stored in folder ccss/experiment/performance_indicators/faster_rcnn (R).
We run train_mobilenetv2.py to start training. The Faster R-CNN (M) source code used in our experiment is given in folder ccss/experiment/sources/faster_rcnn (M). The weights of the fully-trained Faster R-CNN (M) model are stored in file ccss/experiment/trained_models/faster_rcnn (M).pth. The performance measurements of Faster R-CNN (M) are stored in folder ccss/experiment/performance_indicators/faster_rcnn (M).
2.3.2. SSD
The SSD source code used in our experiment is given in folder ccss/experiment/sources/ssd. The weights of the fully-trained SSD model are stored in file ccss/experiment/trained_models/ssd.pth. The performance measurements of SSD are stored in folder ccss/experiment/performance_indicators/ssd.
2.3.3. YOLOv3-spp
We run trans_voc2yolo.py to convert the XML files in VOC format to txt files. The YOLOv3-spp source code used in our experiment is given in folder ccss/experiment/sources/yolov3-spp. The weights of the fully-trained YOLOv3-spp model are stored in file ccss/experiment/trained_models/yolov3-spp.pt. The performance measurements of YOLOv3-spp are stored in folder ccss/experiment/performance_indicators/yolov3-spp.
2.3.4. YOLOv5
We run trans_voc2yolo.py to convert the XML files in VOC format to txt files. The YOLOv5 source code used in our experiment is given in folder ccss/experiment/sources/yolov5. The weights of the fully-trained YOLOv5 model are stored in file ccss/experiment/trained_models/yolov5.pt. The performance measurements of YOLOv5 are stored in folder ccss/experiment/performance_indicators/yolov5.
2.4. Evaluation
The computed evaluation metrics as well as the code needed to compute them from our dataset are provided in the folder ccss/experiment/performance_indicators. They are provided over the complete test set as well as separately for the image features (over the test set).
3. Code Sources
We are particularly thankful to the author of the GitHub repository WZMIAOMIAO/deep-learning-for-image-processing (with whom we are not affiliated). Their instructive videos and codes were most helpful during our work. In particular, we based our own experimental codes on their work (and obtained permission to include it in this archive).
NaiveBayes_R.xlsx: This Excel file includes information as to how probabilities of observed features are calculated given recidivism (P(x_ij|R)) in the training data. Each cell is embedded with an Excel function to render appropriate figures.
P(Xi|R): This tab contains probabilities of feature attributes among recidivated offenders.
NIJ_Recoded: This tab contains re-coded NIJ recidivism challenge data following our coding schema described in Table 1.
Recidivated_Train: This tab contains re-coded features of recidivated offenders.
Tabs from [Gender] through [Condition_Other]: Each tab contains probabilities of feature attributes given recidivism. We use these conditional probabilities to replace the raw values of each feature in the P(Xi|R) tab.
NaiveBayes_NR.xlsx: This Excel file includes information as to how probabilities of observed features are calculated given non-recidivism (P(x_ij|N)) in the training data. Each cell is embedded with an Excel function to render appropriate figures.
P(Xi|N): This tab contains probabilities of feature attributes among non-recidivated offenders.
NIJ_Recoded: This tab contains re-coded NIJ recidivism challenge data following our coding schema described in Table 1.
NonRecidivated_Train: This tab contains re-coded features of non-recidivated offenders.
Tabs from [Gender] through [Condition_Other]: Each tab contains probabilities of feature attributes given non-recidivism. We use these conditional probabilities to replace the raw values of each feature in the P(Xi|N) tab.
Training_LnTransformed.xlsx: Figures in each cell are log-transformed ratios of the probabilities in NaiveBayes_R.xlsx (P(Xi|R)) to the probabilities in NaiveBayes_NR.xlsx (P(Xi|N)).
TestData.xlsx: This Excel file includes the following tabs based on the test data: P(Xi|R), P(Xi|N), NIJ_Recoded, and Test_LnTransformed (log-transformed P(Xi|R)/P(Xi|N)).
Training_LnTransformed.dta: We transform Training_LnTransformed.xlsx to a Stata data set. We use the Stat/Transfer 13 software package to transfer the file format.
StataLog.smcl: This file includes the results of the logistic regression analysis. Both the estimated intercept and the coefficient estimates in this Stata log correspond to the raw weights and standardized weights in Figure 1.
Brier Score_Re-Check.xlsx: This Excel file recalculates the Brier scores of the Relaxed Naïve Bayes Classifier in Table 3, showing evidence that the results displayed in Table 3 are correct.
*****Full List***** NaiveBayes_R.xlsx, NaiveBayes_NR.xlsx, Training_LnTransformed.xlsx, TestData.xlsx, Training_LnTransformed.dta, StataLog.smcl, Brier Score_Re-Check.xlsx
Data for Weka (Training Set): Bayes_2022_NoID
Data for Weka (Test Set): BayesTest_2022_NoID
Weka output for machine learning models (Conventional naïve Bayes, AdaBoost, Multilayer Perceptron, Logistic Regression, and Random Forest)
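The spreadsheets above implement a relaxed naive Bayes scheme: per-feature conditional probabilities become log ratios ln(P(x|R)/P(x|N)), which are fed to a logistic regression and scored with the Brier score. A minimal sketch of that computation follows (not the authors' code; the label column and feature columns are placeholders for the recoded NIJ variables).

```python
# Minimal sketch of the spreadsheet logic described above (not the authors' code):
# per-feature conditional probabilities -> log ratios ln(P(x|R)/P(x|N)) -> logistic
# regression -> Brier score. The label column 'recidivated' and `feature_cols` are
# placeholders for the recoded NIJ variables.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss


def fit_log_ratio_tables(train: pd.DataFrame, feature_cols, label_col="recidivated"):
    """Estimate ln(P(x|R)/P(x|N)) for every attribute of every feature on the training data."""
    tables = {}
    for col in feature_cols:
        p_r = train.loc[train[label_col] == 1, col].value_counts(normalize=True)
        p_n = train.loc[train[label_col] == 0, col].value_counts(normalize=True)
        tables[col] = np.log(p_r / p_n)
    return tables


def to_log_ratio_features(df: pd.DataFrame, tables) -> pd.DataFrame:
    """Replace each raw feature value by its training-set log ratio (unseen values -> 0)."""
    out = pd.DataFrame({col: df[col].map(t) for col, t in tables.items()}, index=df.index)
    return out.fillna(0.0)


# Usage sketch, assuming recoded training/test data frames `train` and `test`:
# tables = fit_log_ratio_tables(train, feature_cols)
# clf = LogisticRegression().fit(to_log_ratio_features(train, tables), train["recidivated"])
# probs = clf.predict_proba(to_log_ratio_features(test, tables))[:, 1]
# print("Brier score:", brier_score_loss(test["recidivated"], probs))
```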
QSAR Model Reporting Formats. Examples of R code: feature selection and regression analysis. Figure S1: Data distribution of logBCF, BP, MP and logVP. Figures S2–S5: Relationship between model complexity and prediction errors as well as the plots of estimated values versus experimental data for logBCF, BP, MP, and logVP, respectively. Figure S6: Plots of leverage versus standardized residuals for logBCF, BP, MP, and logVP models. Table S1: Chemical product classes for training and test sets. Tables S2–S5: Regression statistics for logBCF, BP, MP, and logVP, respectively. Table S6: Applicability domains for logBCF, BP, MP, and logVP. Tables S7–S12: Chemicals with large prediction residuals for the six properties (PDF) Chemical names, CAS registry number and SMILES as well as experimentally measured and estimated property values of the training and test sets (XLSX). This dataset is associated with the following publication: Zang, Q., K. Mansouri, A. Williams, R. Judson, D. Allen, W.M. Casey, and N.C. Kleinstreuer. (Journal of Chemical Information and Modeling) In Silico Prediction of Physicochemical Properties of Environmental Chemicals Using Molecular Fingerprints and Machine Learning. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 57(1): 36-49, (2017).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data abstract:
The YogDATA dataset contains images from an industrial laboratory production line used for yogurt quality control. The case study for the recognition of yogurt cups requires training of Mask R-CNN and YOLO v5.0 models with a set of corresponding images. Thus, it is important to collect the corresponding images to train and evaluate the class of interest. Specifically, the YogDATA dataset includes the same labeled data for the Mask R-CNN (COCO format) and YOLO models. For the YOLO architecture, the training and validation datasets include sets of images in jpg format and their annotations in txt file format. For the Mask R-CNN architecture, the annotations of the same sets of images are included in json file format (80% of the images and annotations of each subset are in the training set and 20% are in the test set).
Paper abstract:
The explosion of the digitisation of the traditional industrial processes and procedures is consolidating a positive impact on modern society by offering a critical contribution to its economic development. In particular, the dairy sector consists of various processes, which are very demanding and thorough. It is crucial to leverage modern automation tools and through-engineering solutions to increase their efficiency and continuously meet challenging standards. Towards this end, in this work, an intelligent algorithm based on machine vision and artificial intelligence, which identifies dairy products within production lines, is presented. Furthermore, in order to train and validate the model, the YogDATA dataset was created that includes yogurt cups within a production line. Specifically, we evaluate two deep learning models (Mask R-CNN and YOLO v5.0) to recognise and detect each yogurt cup in a production line, in order to automate the packaging processes of the products. According to our results, the performance precision of the two models is similar, estimated at 99%.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Machine learning is an effective tool for predicting reaction rate constants for many organic compounds with the hydroxyl radical (HO•). Previously reported models have achieved relatively good performance, but due to scarce data (
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
X-ray diffraction data set for the training of noise filtering algorithms. The data set contains groups of low- and high-counting-statistics pairs. The sampling times are mostly 1 second for the low-counting and 20 seconds for the high-counting data. Three files in HDF5 format are provided, corresponding to a training, validation and test data set. Each data group contains sequences of 41 consecutive frames, corresponding to a scan along the reciprocal h-direction. In addition to the raw data, sampling times and monitor values are included. The test data set additionally contains denoised low-count frames obtained from a pre-trained neural network.
Additionally, files containing the trained model weights are included for two different architectures described in the main article (10.1038/s42256-024-00790-1).
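The HDF5 files can be inspected with h5py; the file, group, and dataset names in the sketch below are assumptions, so list the keys to discover the actual layout.

```python
# Minimal sketch: inspect one of the HDF5 files with h5py. The file, group, and
# dataset names below are assumptions; list the keys to discover the actual layout.
import h5py

with h5py.File("training.h5", "r") as f:       # placeholder file name
    f.visit(print)                             # print every group/dataset path
    # Once a path is known, a dataset can be read into memory, e.g.:
    # frames = f["some_group/low_count"][...]  # hypothetical path
```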
The data has been recorded on a La1.88Sr0.12CuO4 single crystal at the beamline P21.1 at the PETRA III storage ring at DESY in Hamburg, Germany. The scattering intensities were recorded using a Dectris Pilatus 100K CdTe detector. The diffractometer was operated with 100 keV photons and the sample was cooled to T ~ 30 K. The data contains different signals such as weak 2D charge density wave order, fundamental Bragg peaks, powder lines, spurions and dead pixels.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The evaluation of possible interactions between chemical compounds and antitarget proteins is an important task of the research and development process. Here, we describe the development and validation of QSAR models for the prediction of antitarget end-points, created on the basis of multilevel and quantitative neighborhoods of atom descriptors and self-consistent regression. Data on 4000 chemical compounds interacting with 18 antitarget proteins (13 receptors, 2 enzymes, and 3 transporters) were used to model 32 sets of end-points (IC50, Ki, and Kact). Each set was randomly divided into training and test sets in a ratio of 80% to 20%, respectively. The test sets were used for external validation of QSAR models created on the basis of the training sets. The coverage of prediction for all test sets exceeded 95%, and for half of the test sets, it was 100%. The accuracy of prediction for 29 of the end-points, based on the external test sets, was typically in the range of R2test = 0.6–0.9; three test sets had lower R2test values, specifically 0.55–0.6. The proposed approach showed a reasonable accuracy of prediction for 91% of the antitarget end-points and high coverage for all external test sets. On the basis of the created models, we have developed a freely available online service for in silico prediction of 32 antitarget end-points: http://www.pharmaexpert.ru/GUSAR/antitargets.html.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Unidentified peaks remain a major problem in untargeted metabolomics by LC-MS/MS. Confidence in peak annotations increases by combining MS/MS matching and retention time. We here show how retention times can be predicted from molecular structures. Two large, publicly available data sets were used for model training in machine learning: the Fiehn hydrophilic interaction liquid chromatography data set (HILIC) of 981 primary metabolites and biogenic amines, and the RIKEN plant specialized metabolome annotation (PlaSMA) database of 852 secondary metabolites that uses reversed-phase liquid chromatography (RPLC). Five different machine learning algorithms have been integrated into the Retip R package: the random forest, Bayesian-regularized neural network, XGBoost, light gradient-boosting machine (LightGBM), and Keras algorithms for building the retention time prediction models. A complete workflow for retention time prediction was developed in R. It can be freely downloaded from the GitHub repository (https://www.retip.app). Keras outperformed other machine learning algorithms in the test set with minimum overfitting, verified by small error differences between training, test, and validation sets. Keras yielded a mean absolute error of 0.78 min for HILIC and 0.57 min for RPLC. Retip is integrated into the mass spectrometry software tools MS-DIAL and MS-FINDER, allowing a complete compound annotation workflow. In a test application on mouse blood plasma samples, we found a 68% reduction in the number of candidate structures when searching all isomers in MS-FINDER compound identification software. Retention time prediction increases the identification rate in liquid chromatography and subsequently leads to an improved biological interpretation of metabolomics data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Image repository of the paper "3D Mask R-CNN Benchmarks in Controlled Environment and Morphogenesis Study". This repository contains the test set of the Phallusia mammillata dataset used in the paper to train and validate the 3D Mask R-CNN. The test set is composed of the PM1 embryo images and ground truth. In detail:
the 3D input image of the Phallusia mammillata PM1 embryo (Inputs.zip)
the ground truth ASTEC instance segmentation of these input images (ASTEC_Ground_truth.zip)
the predictions inferred by the trained 3D Mask R-CNN (3D_Mask_R_CNN.zip)
the predictions inferred by the state-of-the-art network for cell instance segmentation, Cellpose, with its pre-trained weights cyto3 (Cellpose_cyto3.zip)
the predictions inferred by Cellpose after being retrained over a sample of the Phallusia mammillata dataset (Cellpose_retrained.zip)
the resulting weights of the 3D Mask R-CNN trained over the Phallusia mammillata dataset.
These data can be used to reproduce the 3D Mask R-CNN inference over the Phallusia mammillata test set and to evaluate the predictions against the ASTEC ground truth. The 3D Mask R-CNN code is available here: https://github.com/gdavid57/3d-mask-r-cnn. See the "morphogenesis" branch for the Phallusia mammillata dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In this study, new machine-learning-based models have been developed for the prediction of carbon dioxide (CO2) solubility in different Ionic Liquids (ILs). An extensive data set comprising 16,480 experimental data points of CO2 solubility in 296 ILs, consisting of 103 different cation and 78 different anion structures, was utilized for this purpose. Quantitative Structure–Property Relationship (QSPR) models were developed using linear and nonlinear methods based on this large data set. To consider the effect of cation and anion structures on the CO2 solubility, basic descriptors, including zero-dimensional, one-dimensional, and fingerprint descriptors (a category of two-dimensional descriptors), were calculated. Subsequently, the most relevant variables were identified through the StepWise Regression (SWR), resulting in the selection of 18 categories of cationic and anionic descriptors, in addition to temperature and pressure, as inputs for nonlinear Machine Learning (ML) models such as MultiLayer Perceptron (MLP), Radial Basis Function (RBF), Random Forest (RF), and Least-Squares Boosting (LSBoost). Internal and external validation of the models indicated that the LSBoost model displayed the highest accuracy in predicting CO2 solubility and demonstrated superior capability in modeling complex data. R2 and MSE values for this model were 0.9962 and 0.0070 for the training set and 0.9243 and 0.1277 for the test set, respectively. Furthermore, comparisons between the LSBoost model and the available models in the literature demonstrated that the LSBoost model surpasses the other models in performance, proving to be reliable for predicting CO2 solubility in new ILs, thereby aiding in the design and selection of ILs for CO2 capture.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This repository contains the simulation data and pre-trained Graph Neural Network (GNN) models produced in [1].
Two *.zip files are provided:
Dataset subfolders are named according to dataset/{'train' or 'test'}/{number of sheets}/{boundary condition}/. Each subfolder contains multiple simulations and a single info.yml file with relevant information regarding the overall setup. For each i-th simulation the following files are provided:
Model sub-folders are named according to :
For each model we provide:
The source code used to produce the data, train, and test the models can be found at: https://github.com/diogodcarvalho/gns-sheet-model
[1] D. D. Carvalho, D. R. Ferreira, L. O. Silva, "Learning the dynamics of a one-dimensional plasma model with graph neural networks", Mach. Learn.: Sci. Technol. 5 025048 (2024)
[2] J. Dawson, "One‐Dimensional Plasma Model", The Physics of Fluids 5.4 (1962): 445-459.