42 datasets found
  1. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
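
    As an illustration of the two selection schemes being compared, the sketch below contrasts a time-split with a random split on a hypothetical compound table (a DataFrame with date, activity, and descriptor columns; pandas and scikit-learn assumed, not the software used in the paper):

    ```python
    # Sketch: time-split vs. random cross-validation for a QSAR-style table.
    # Assumes a hypothetical DataFrame with a registration 'date' column, an
    # 'activity' column, and numeric descriptor columns -- not the paper's data.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    def fit_and_score(train, test, features):
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(train[features], train["activity"])
        return r2_score(test["activity"], model.predict(test[features]))

    def compare_splits(df, features, test_fraction=0.25):
        # Time-split: the newest compounds form the test set.
        df_sorted = df.sort_values("date")
        cut = int(len(df_sorted) * (1 - test_fraction))
        r2_time = fit_and_score(df_sorted.iloc[:cut], df_sorted.iloc[cut:], features)

        # Random split: compounds are held out irrespective of date.
        train, test = train_test_split(df, test_size=test_fraction, random_state=0)
        r2_random = fit_and_score(train, test, features)
        return r2_time, r2_random
    ```

    On data with temporal drift, the time-split R2 is typically lower than the random-split R2, which is the gap the abstract describes.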

  2. Downsized camera trap images for automated classification

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Dec 1, 2022
    Cite
    Danielle L Norman; Oliver R Wearne; Philip M Chapman; Sui P Heon; Robert M Ewers (2022). Downsized camera trap images for automated classification [Dataset]. http://doi.org/10.5281/zenodo.6627707
    Dataset updated
    Dec 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Danielle L Norman; Oliver R Wearne; Philip M Chapman; Sui P Heon; Robert M Ewers
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description:

    Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.

    Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions

    Funding: These data were collected as part of research funded by:

    This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.

    XML metadata: GEMINI compliant metadata for this dataset is available here

    Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip

    CT_image_data_info2.xlsx

    This file contains dataset metadata and 1 data table:

    1. Dataset Images (described in worksheet Dataset_images)

      Description: This worksheet details the composition of each dataset used in the analyses (a minimal loading sketch follows the field list below)

      Number of fields: 69

      Number of data rows: 270287

      Fields:

      • filename: Root ID (Field type: id)
      • camera_trap_site: Site ID for the camera trap location (Field type: location)
      • taxon: Taxon recorded by camera trap (Field type: taxa)
      • dist_level: Level of disturbance at site (Field type: ordered categorical)
      • baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_1: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 1' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 2' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 3' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 4' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 5' training or test set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 2 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_1_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_pair_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 4 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 3 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_1_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_triple_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 4 (quad)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
      • dist_combined_event_level_quad_1_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 4 and 5 (quad)' training set, or not included (NA) (Field type:
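
    As a minimal sketch of how this worksheet might be read, the snippet below loads the Dataset_images sheet with pandas and selects the baseline split; the exact split label strings ('train', 'val', 'test') are assumptions, since the record only describes them informally:

    ```python
    # Sketch: read the Dataset_images worksheet and pull out the baseline split.
    # Assumes pandas with openpyxl installed; column names follow the field list above,
    # but the split label strings ("train", "val", "test") are assumptions.
    import pandas as pd

    meta = pd.read_excel("CT_image_data_info2.xlsx", sheet_name="Dataset_images")

    baseline_train = meta.loc[meta["baseline"] == "train", "filename"]
    baseline_val = meta.loc[meta["baseline"] == "val", "filename"]
    baseline_test = meta.loc[meta["baseline"] == "test", "filename"]

    print(len(baseline_train), len(baseline_val), len(baseline_test))
    ```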

  3. hardRain: an R package for quick, automated rainfall detection in...

    • data.mendeley.com
    Updated Aug 27, 2019
    + more versions
    Cite
    Oliver Metcalf (2019). hardRain: an R package for quick, automated rainfall detection in ecoacoustic datasets using a threshold-based approach-training and test data sets [Dataset]. http://doi.org/10.17632/mpjrkg34jw.8
    Dataset updated
    Aug 27, 2019
    Authors
    Oliver Metcalf
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ecoacoustic data files collected with automated recording units

  4. Personal Protective Equipment Dataset (PPED)

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 17, 2022
    Cite
    Anonymous (2022). Personal Protective Equipment Dataset (PPED) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6551757
    Dataset updated
    May 17, 2022
    Dataset authored and provided by
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Personal Protective Equipment Dataset (PPED)

    This dataset serves as a benchmark for PPE detection in chemical plants. We provide the dataset and experimental results.

    1. The dataset

    We produced a data set based on the actual needs and relevant regulations in chemical plants. The standard GB 39800.1-2020 formulated by the Ministry of Emergency Management of the People’s Republic of China defines the protective requirements for plants and chemical laboratories. The complete dataset is contained in the folder PPED/data.

    1.1. Image collection

    We took more than 3300 pictures, varying the following characteristics: environment, distance, lighting conditions, angle, and the number of people photographed.

    Backgrounds: There are 4 backgrounds, including office, near machines, factory and regular outdoor scenes.

    Scale: By taking pictures from different distances, the captured PPEs are classified in small, medium and large scales.

    Light: Good lighting conditions and poor lighting conditions were studied.

    Diversity: Some images contain a single person, and some contain multiple people.

    Angle: The pictures we took can be divided into front and side.

    In total, more than 3300 raw photos were taken under all of these conditions. All images are located in the folder “PPED/data/JPEGImages”.

    1.2. Label

    We use LabelImg as the labeling tool, with the PASCAL-VOC annotation format. YOLO uses the txt format; trans_voc2yolo.py can be used to convert the XML files in PASCAL-VOC format to txt files. Annotations are stored in the folder PPED/data/Annotations.
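
    An illustrative conversion of this kind is sketched below; it is not the trans_voc2yolo.py shipped with the dataset, and the class list is a placeholder:

    ```python
    # Sketch: convert one PASCAL-VOC XML annotation into YOLO-format txt lines.
    # Illustrative only -- not the dataset's trans_voc2yolo.py; CLASSES is a placeholder.
    import xml.etree.ElementTree as ET
    from pathlib import Path

    CLASSES = ["helmet", "goggles"]  # placeholder class list

    def voc_to_yolo(xml_path, out_dir):
        root = ET.parse(xml_path).getroot()
        w = float(root.find("size/width").text)
        h = float(root.find("size/height").text)
        lines = []
        for obj in root.iter("object"):
            cls_id = CLASSES.index(obj.find("name").text)
            box = obj.find("bndbox")
            xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
            xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
            # YOLO format: class x_center y_center width height, normalized to [0, 1].
            cx, cy = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
            bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
            lines.append(f"{cls_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
        out = Path(out_dir) / (Path(xml_path).stem + ".txt")
        out.write_text("\n".join(lines))
    ```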

    1.3. Dataset Features

    The pictures were taken by us under the different conditions mentioned above. The file PPED/data/feature.csv is a CSV file which lists all of the images and records the features of each picture, including lighting conditions, angle, background, number of people, and scale.

    1.4. Dataset Division

    The dataset is divided into training and test sets at a ratio of 9:1.

    2. Baseline Experiments

    We provide baseline results with five models, namely Faster R-CNN (R), Faster R-CNN (M), SSD, YOLOv3-spp, and YOLOv5. All code and results are given in folder PPED/experiment.

    2.1. Environment and Configuration:

    Intel Core i7-8700 CPU

    NVIDIA GTX1060 GPU

    16 GB of RAM

    Python: 3.8.10

    pytorch: 1.9.0

    pycocotools: pycocotools-win

    Windows 10

    2.2. Applied Models

    The source codes and results of the applied models are given in folder PPED/experiment with sub-folders corresponding to the model names.

    2.2.1. Faster R-CNN

    Faster R-CNN

    backbone: resnet50+fpn

    We downloaded the pre-training weights from https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth.

    We modified the dataset path, training classes and training parameters including batch size.

    We run train_res50_fpn.py to start training.

    Then, the weights are trained on the training set.

    Finally, we validate the results on the test set.
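
    The general fine-tuning procedure described above (load COCO-pretrained weights, adapt the box predictor to the dataset's classes, train on the training set) can be sketched with torchvision as follows; this is not the authors' train_res50_fpn.py, and the class count is a placeholder:

    ```python
    # Sketch of Faster R-CNN (resnet50+fpn) fine-tuning with torchvision.
    # Not the authors' train_res50_fpn.py; NUM_CLASSES is a placeholder (classes + background).
    import torch
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    NUM_CLASSES = 11  # placeholder: dataset classes + 1 for background

    # torchvision >= 0.13 shown here; older versions use pretrained=True instead of weights=.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)

    # Training step, assuming a dataloader that yields (images, targets) in the
    # torchvision detection format:
    # for images, targets in dataloader:
    #     losses = model(images, targets)
    #     loss = sum(losses.values())
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    ```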

    backbone: mobilenetv2

    The same training method as for resnet50+fpn was applied, but the results were not as good, so this backbone was discarded.

    The Faster R-CNN source code used in our experiment is given in folder PPED/experiment/Faster R-CNN. The weights of the fully-trained Faster R-CNN (R) and Faster R-CNN (M) models are stored in the files PPED/experiment/trained_models/resNetFpn-model-19.pth and mobile-model.pth. The performance measurements of Faster R-CNN (R) and Faster R-CNN (M) are stored in the folders PPED/experiment/results/Faster RCNN(R) and Faster RCNN(M).

    2.2.2. SSD

    backbone: resnet50

    We downloaded pre-training weights from https://download.pytorch.org/models/resnet50-19c8e357.pth.

    The same training method as Faster R-CNN is applied.

    The SSD source code used in our experiment is given in folder PPED/experiment/ssd. The weights of the fully-trained SSD model are stored in file PPED/experiment/trained_models/SSD_19.pth. The performance measurements of SSD are stored in folder PPED/experiment/results/SSD.

    2.2.3. YOLOv3-spp

    backbone: DarkNet53

    We modified the type information of the XML file to match our application.

    We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.

    The weights used are: yolov3-spp-ultralytics-608.pt.

    The YOLOv3-spp source code used in our experiment is given in folder PPED/experiment/YOLOv3-spp. The weights of the fully-trained YOLOv3-spp model are stored in file PPED/experiment/trained_models/YOLOvspp-19.pt. The performance measurements of YOLOv3-spp are stored in folder PPED/experiment/results/YOLOv3-spp.

    2.2.4. YOLOv5

    backbone: CSP_DarkNet

    We modified the type information of the XML file to match our application.

    We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.

    The weights used are: yolov5s.

    The YOLOv5 source code used in our experiment is given in folder PPED/experiment/yolov5. The weights of the fully-trained YOLOv5 model are stored in file PPED/experiment/trained_models/YOLOv5.pt. The performance measurements of YOLOv5 are stored in folder PPED/experiment/results/YOLOv5.

    2.3. Evaluation

    The computed evaluation metrics as well as the code needed to compute them from our dataset are provided in the folder PPED/experiment/eval.

    3. Code Sources

    Faster R-CNN (R and M)

    https://github.com/WZMIAOMIAO/deep-learning-for-image-processing/tree/master/pytorch_object_detection/faster_rcnn

    official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/faster_rcnn.py

    SSD

    https://github.com/WZMIAOMIAO/deep-learning-for-image-processing/tree/master/pytorch_object_detection/ssd

    official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/ssd.py

    YOLOv3-spp

    https://github.com/WZMIAOMIAO/deep-learning-for-image-processing/tree/master/pytorch_object_detection/yolov3-spp

    YOLOv5

    https://github.com/ultralytics/yolov5

  5. Data for the manuscript "Spatially resolved uncertainties for machine...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 1, 2024
    Cite
    Madsen, Georg K. H. (2024). Data for the manuscript "Spatially resolved uncertainties for machine learning potentials" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11086346
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Wanzenböck, Ralf
    Madsen, Georg K. H.
    Heid, Esther
    Schörghuber, Johannes
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:

    mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).

    aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.

    data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to the reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].

    data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.

    data_h2o.tar.gz contains:

    full_db.extxyz: The full dataset of 1.5k structures.

    iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation set for the active learning cycle.

    The subfolders in the folders random, uncertain, and atomic contain the training and validation sets for the random and uncertainty-based (local or atomic) active learning loops.
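
    A minimal sketch for inspecting these structure files, assuming the ASE package (a choice of tooling not specified by the authors):

    ```python
    # Sketch: load an extxyz file from data_h2o.tar.gz with ASE (assumed tooling).
    from ase.io import read

    frames = read("full_db.extxyz", index=":")  # list of Atoms objects
    print(len(frames), "structures;", len(frames[0]), "atoms in the first one")
    ```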

  6. Chinese Chemical Safety Signs (CCSS)

    • data.niaid.nih.gov
    Updated Mar 21, 2023
    Cite
    Anonymous (2023). Chinese Chemical Safety Signs (CCSS) [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5482333
    Dataset updated
    Mar 21, 2023
    Dataset authored and provided by
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Notice: We have currently a paper under double-blind review that introduces this dataset. Therefore, we have anonymized the dataset authorship. Once the review process has concluded, we will update the authorship information of this dataset.

    Chinese Chemical Safety Signs (CCSS)

    This dataset is compiled as a benchmark for recognizing chemical safety signs from images. We provide both the dataset and the experimental results at doi:10.5281/zenodo.5482334.

    1. The Dataset

    The complete dataset is contained in the folder ccss/data in archive css_data.zip. The images include signs based on the Chinese standard "Safety Signs and their Application Guidelines" (GB 2894-2008) for safety signs in chemical environments. This standard, in turn, refers to the standards ISO 7010 (Graphical symbols – Safety Colours and Safety Signs – Safety signs used in workplaces and public areas), GB/T 10001 (Public Information Graphic Symbols for Signs), and GB 13495 (Fire Safety Signs)

    1.1. Image Collection

    We collect photos of commonly used chemical safety signs in chemical laboratories and chemical teaching buildings. For a discussion of the standards on which we base our collection, refer to the book "Talking about Hazardous Chemicals and Safety Signs" for common signs, and to the safety signs guidelines (GB 2894-2008).

    The shooting was mainly carried out in 6 locations, namely on the road, in a parking lot, construction walls, in a chemical laboratory, outside near big machines, and inside the factory and corridor.

    Shooting scale: Images in which the signs appear in small, medium and large scales were taken for each location by shooting photos from different distances.

    Shooting light: good lighting conditions and poor lighting conditions were investigated.

    Part of the images contain multiple targets and the other part contains only single signs.

    Under all conditions, a total of 4650 photos were taken as the original data. These were expanded to 27,900 photos via data enhancement. All images are located in folder ccss/data/JPEGImages.

    The file ccss/data/features/enhanced_data_to_original_data.csv provides a mapping between the enhanced image name and the corresponding original image.

    1.2. Annotation and Labelling

    The labelling tool is Labelimg, which uses the PASCAL-VOC labelling format. The annotation is stored in the folder ccss/data/Annotations.

    Faster R-CNN and SSD are two algorithms that use this format. When training YOLOv5, you can run trans_voc2yolo.py to convert the XML file in PASCAL-VOC format to a txt file.

    We provide further meta-information about the dataset in the form of a CSV file features.csv which notes, for each image, which other features it has (lighting conditions, scale, multiplicity, etc.).

    1.3. Dataset Features

    As stated above, the images have been shot under different conditions. We provide all the feature information in folder ccss/data/features. For each feature, there is a separate list of file names in that folder. The file ccss/data/features/features_on_original_data.csv is a CSV file which notes all the features of each original image.

    1.4. Dataset Division

    The dataset is divided into a fixed 7:3 split of training and test sets. You can find the corresponding image names in the files ccss/data/training_data_file_names.txt and ccss/data/test_data_file_names.txt.
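
    A minimal sketch for reproducing this fixed split from the provided file-name lists (assuming each list holds one image name per line):

    ```python
    # Sketch: collect the fixed 7:3 training/test image lists shipped with the dataset.
    # Assumes the .txt lists contain one image file name per line (not verified here).
    from pathlib import Path

    data_root = Path("ccss/data")

    def read_names(list_file):
        names = [line.strip() for line in list_file.read_text().splitlines() if line.strip()]
        return [data_root / "JPEGImages" / name for name in names]

    train_images = read_names(data_root / "training_data_file_names.txt")
    test_images = read_names(data_root / "test_data_file_names.txt")
    print(len(train_images), "training images,", len(test_images), "test images")
    ```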

    2. Baseline Experiments

    We provide baseline results with the three models Faster R-CNN, SSD, and YOLOv5. All code and results are given in folder ccss/experiment in archive ccss_experiment.

    2.2. Environment and Configuration

    Single Intel Core i7-8700 CPU

    NVIDIA GTX1060 GPU

    16 GB of RAM

    Python: 3.8.10

    pytorch: 1.9.0

    pycocotools: pycocotools-win

    Visual Studio 2017

    Windows 10

    2.3. Applied Models

    The source codes and results of the applied models are given in folder ccss/experiment with sub-folders corresponding to the model names.

    2.3.1. Faster R-CNN

    backbone: resnet50+fpn.

    we downloaded the pre-training weights from

    we modify the type information of the JSON file to match our application.

    run train_res50_fpn.py to start training.

    finally, the weights are trained on the training set.

    backbone: mobilenetv2

    The same training method as for resnet50+fpn was applied, but the results were not as good, so this backbone was discarded.

    The Faster R-CNN source code used in our experiment is given in folder ccss/experiment/sources/faster_rcnn. The weights of the fully-trained Faster R-CNN model are stored in file ccss/experiment/trained_models/faster_rcnn.pth. The performance measurements of Faster R-CNN are stored in folder ccss/experiment/performance_indicators/faster_rcnn.

    2.3.2. SSD

    backbone: resnet50

    we downloaded pre-training weights from

    the same training method as Faster R-CNN is applied.

    The SSD source code used in our experiment is given in folder ccss/experiment/sources/ssd. The weights of the fully-trained SSD model are stored in file ccss/experiment/trained_models/ssd.pth. The performance measurements of SSD are stored in folder ccss/experiment/performance_indicators/ssd.

    2.3.4. YOLOv5

    backbone: CSP_DarkNet

    we modified the type information of the YML file to match our application

    run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.

    the weights used are: yolov5s.

    The YOLOv5 source code used in our experiment is given in folder ccss/experiment/sources/yolov5. The weights of the fully-trained YOLOv5 model are stored in file ccss/experiment/trained_models/yolov5.pt. The performance measurements of YOLOv5 are stored in folder ccss/experiment/performance_indicators/yolov5.

    2.4. Evaluation

    The computed evaluation metrics as well as the code needed to compute them from our dataset are provided in the folder ccss/experiment/performance_indicators. They are provided over the complete test set as well as separately for the image features (over the test set).

    3. Code Sources

    Faster R-CNN

    official code:

    SSD

    official code:

    YOLOv5

    We are particularly thankful to the author of the GitHub repository WZMIAOMIAO/deep-learning-for-image-processing (with whom we are not affiliated). Their instructive videos and codes were most helpful during our work. In particular, we based our own experimental codes on his work (and obtained permission to include it in this archive).

    4. Licensing

    While our dataset and results are published under the Creative Commons Attribution 4.0 License, this does not hold for the included code sources. These sources are under the particular license of the repository where they have been obtained from (see Section 3 above).

  7. Data from: TüEyeQ: A rich IQ test performance data set with eye movement,...

    • dataverse.harvard.edu
    Updated Mar 5, 2021
    Cite
    Enkelejda Kasneci; Gjergji Kasneci; Tobias Appel; Johannes Haug; Franz Wortha; Maike Tibus; Ulrich Trautwein; Peter Gerjets (2021). TüEyeQ: A rich IQ test performance data set with eye movement, educational and socio-demographic information [Dataset]. http://doi.org/10.7910/DVN/JGOCKI
    Dataset updated
    Mar 5, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Enkelejda Kasneci; Gjergji Kasneci; Tobias Appel; Johannes Haug; Franz Wortha; Maike Tibus; Ulrich Trautwein; Peter Gerjets
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We present the TüEyeQ data set - to the best of our knowledge - the most comprehensive data set generated on a culture fair intelligence test (CFT 20-R), i.e., an IQ Test, consisting of 41 single tasks, taken by 315 individuals aged between 18 and 30 years. In addition to socio-demographic and educational information, the data set also includes the eye movements of the individuals while taking the IQ test. Along with distributional information we also highlight the potential for predictive analysis on the TüEyeQ data set and report the most important covariates for predicting the performance of a subject on a given task along with their influence on the prediction.

  8. Counting using deep learning regression gives value to ecological surveys.

    • dataverse2-dmz.nioz.nl
    • dataverse.nioz.nl
    • +1more
    Updated May 13, 2022
    Cite
    NIOZ (2022). Counting using deep learning regression gives value to ecological surveys. [Dataset]. http://doi.org/10.33591/nioz/7b.b.0c
    Available download formats: bin(140592439), bin(51244052), csv(3808), csv(67458), application/x-ipynb+json(4860742), bin(140592445), csv(16711), csv(3320), csv(219477), csv(14487), bin(51244047), txt(2343), bin(140592444)
    Dataset updated
    May 13, 2022
    Dataset provided by
    Royal Netherlands Institute for Sea Research
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Many ecological studies rely on count data and involve manual counting of objects of interest, which is time-consuming and especially disadvantageous when time in the field or lab is limited. However, an increasing number of works uses digital imagery, which opens opportunities to automatise counting tasks. In this study, we use machine learning to automate counting objects of interest without the need to label individual objects. By leveraging already existing image-level annotations, this approach can also give value to historical data that were collected and annotated over longer time series (typical for many ecological studies), without the aim of deep learning applications. We demonstrate deep learning regression on two fundamentally different counting tasks: (i) daily growth rings from microscopic images of fish otolith (i.e., hearing stone) and (ii) hauled out seals from highly variable aerial imagery. In the otolith images, our deep learning-based regressor yields an RMSE of 3.40 day-rings and an R^2 of 0.92. Initial performance in the seal images is lower (RMSE of 23.46 seals and R^2 of 0.72), which can be attributed to a lack of images with a high number of seals in the initial training set, compared to the test set. We then show how to improve performance substantially (RMSE of 19.03 seals and R^2 of 0.77) by carefully selecting and relabelling just 100 additional training images based on initial model prediction discrepancy. The regression-based approach used here returns accurate counts (R^2 of 0.92 and 0.77 for the rings and seals, respectively), directly usable in ecological research.
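
    A counting-by-regression model of this kind can be sketched in PyTorch as below; this is an illustrative ResNet-18 regressor trained with a mean-squared-error loss, not the architecture or training setup used in the study:

    ```python
    # Sketch: CNN regression for counting -- a single scalar output trained with MSE.
    # Illustrative only; not the network or training setup used in the study.
    import torch
    import torch.nn as nn
    import torchvision

    model = torchvision.models.resnet18(weights=None)  # torchvision >= 0.13 API
    model.fc = nn.Linear(model.fc.in_features, 1)      # regression head: predicted count

    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def training_step(images, counts):
        # images: (N, 3, H, W); counts: (N,) image-level labels, no per-object annotation needed
        optimizer.zero_grad()
        loss = criterion(model(images).squeeze(1), counts.float())
        loss.backward()
        optimizer.step()
        return loss.item()
    ```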

  9. Real Blur Dataset Dataset

    • paperswithcode.com
    Updated Nov 17, 2021
    + more versions
    Cite
    Jaesung Rim; Haeyun Lee; Jucheol Won; Sunghyun Cho (2021). Real Blur Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/real-blur-dataset
    Dataset updated
    Nov 17, 2021
    Authors
    Jaesung Rim; Haeyun Lee; Jucheol Won; Sunghyun Cho
    Description

    The dataset consists of 4,738 pairs of images of 232 different scenes including reference pairs. All images were captured both in the camera raw and JPEG formats, hence generating two datasets: RealBlur-R from the raw images, and RealBlur-J from the JPEG images. Each training set consists of 3,758 image pairs, while each test set consists of 980 image pairs.

    The deblurring result is first aligned to its ground truth sharp image using a homography estimated by the enhanced correlation coefficients method, and PSNR or SSIM is computed in sRGB color space.
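
    The evaluation protocol can be sketched with OpenCV and scikit-image as follows; this is an illustrative re-implementation, not the benchmark's official evaluation code (recent OpenCV and scikit-image >= 0.19 assumed):

    ```python
    # Sketch: align a deblurred result to its sharp ground truth via an ECC-estimated
    # homography, then compute PSNR/SSIM. Illustrative; not the official evaluation code.
    import cv2
    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def evaluate(deblurred_path, sharp_path):
        deblurred = cv2.imread(deblurred_path)
        sharp = cv2.imread(sharp_path)

        # Estimate a homography with the enhanced correlation coefficient (ECC) method.
        warp = np.eye(3, dtype=np.float32)
        criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-5)
        _, warp = cv2.findTransformECC(
            cv2.cvtColor(sharp, cv2.COLOR_BGR2GRAY).astype(np.float32),
            cv2.cvtColor(deblurred, cv2.COLOR_BGR2GRAY).astype(np.float32),
            warp, cv2.MOTION_HOMOGRAPHY, criteria)
        aligned = cv2.warpPerspective(deblurred, warp, (sharp.shape[1], sharp.shape[0]),
                                      flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)

        psnr = peak_signal_noise_ratio(sharp, aligned)
        ssim = structural_similarity(sharp, aligned, channel_axis=2)
        return psnr, ssim
    ```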

  10. Chinese Chemical Safety Signs (CCSS)

    • zenodo.org
    bin, html, xz
    Updated Mar 21, 2023
    Cite
    Anonymous; Anonymous (2023). Chinese Chemical Safety Signs (CCSS) [Dataset]. http://doi.org/10.5281/zenodo.5938816
    Dataset updated
    Mar 21, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Chinese Chemical Safety Signs (CCSS)

    This dataset is compiled as a benchmark for recognizing chemical safety signs from images. We provide both the dataset and the experimental results.

    1. The Dataset

    The complete dataset is contained in the folder ccss/data. The images include signs based on the Chinese standard "Safety Signs and their Application Guidelines" (GB 2894-2008) for safety signs in chemical environments. This standard, in turn, refers to the standards ISO 7010 (Graphical symbols – Safety Colours and Safety Signs – Safety signs used in workplaces and public areas), GB/T 10001 (Public Information Graphic Symbols for Signs), and GB 13495 (Fire Safety Signs)

    1.1. Image Collection

    We collect photos of commonly used chemical safety signs in chemical laboratories and chemistry teaching buildings. For a discussion of the standards on which we base our collection, refer to the book "Talking about Hazardous Chemicals and Safety Signs" for common signs, and to the safety signs guidelines (GB 2894-2008).

    • The shooting was mainly carried out in 6 locations, namely on the road, in a parking lot, construction walls, in a chemical laboratory, outside near big machines, and inside the factory and corridor.
    • Shooting scale: Images in which the signs appear in small, medium and large scales were taken for each location by shooting photos from different distances.
    • Shooting light: good lighting conditions and poor lighting conditions were investigated.
    • Part of the images contain multiple targets and the other part contains only single signs.

    Under all conditions, a total of 4650 photos were taken as the original data. These were expanded to 27,900 photos via data enhancement. All images are located in folder ccss/data/JPEGImages.

    The file ccss/data/features/enhanced_data_to_original_data.csv provides a mapping between the enhanced image name and the corresponding original image.

    1.2. Annotation and Labelling

    We use LabelImg as the labeling tool, which, in turn, uses the PASCAL-VOC labelling format. The annotation is stored in the folder ccss/data/Annotations.

    Faster R-CNN and SSD are two algorithms that use this format. When training YOLOv5, you can run trans_voc2yolo.py to convert the XML file in PASCAL-VOC format to a txt file.

    We provide further meta-information about the dataset in the form of a CSV file features.csv which notes, for each image, which other features it has (lighting conditions, scale, multiplicity, etc.). We apply the COCO standard for deciding whether a target is small, medium, or large in size.
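
    That size rule can be sketched as below, assuming the usual COCO area thresholds of 32² and 96² pixels (the record itself does not restate the thresholds):

    ```python
    # Sketch: COCO-style size buckets for a bounding box, assuming the usual
    # COCO thresholds of 32^2 and 96^2 pixels.
    def coco_size_class(box_width, box_height):
        area = box_width * box_height
        if area < 32 ** 2:
            return "small"
        if area < 96 ** 2:
            return "medium"
        return "large"
    ```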

    1.3. Dataset Features

    As stated above, the images have been shot under different conditions. We provide all the feature information in folder ccss/data/features. For each feature, there is a separate list of file names in that folder. The file ccss/data/features/features_on_original_data.csv is a CSV file which notes all the features of each original image.

    1.4. Dataset Division

    The dataset is divided into a fixed 7:3 split of training and test sets. You can find the corresponding image names in the files ccss/data/training_data_file_names.txt and ccss/data/test_data_file_names.txt.

    2. Baseline Experiments

    We provide baseline results with five models, namely Faster R-CNN (R), Faster R-CNN (M), SSD, YOLOv3-spp, and YOLOv5. All code and results are given in folder ccss/experiment.

    2.2. Environment and Configuration:

    • Single Intel Core i7-8700 CPU
    • NVIDIA GTX1060 GPU
    • 16 GB of RAM
    • Python: 3.8.10
    • pytorch: 1.9.0
    • pycocotools: pycocotools-win
    • Visual Studio 2017
    • Windows 10

    2.3. Applied Models

    The source codes and results of the applied models are given in folder ccss/experiment with sub-folders corresponding to the model names.

    2.3.1. Faster R-CNN

    • Faster R-CNN (R) has the backbone resnet50+fpn. The Faster R-CNN (R) source code used in our experiment is given in folder ccss/experiment/sources/faster_rcnn (R). The weights of the fully-trained Faster R-CNN (R) model are stored in file ccss/experiment/trained_models/faster_rcnn (R).pth. The performance measurements of Faster R-CNN (R) are stored in folder ccss/experiment/performance_indicators/faster_rcnn (R).
    • Faster R-CNN (M) has the backbone mobilenetv2.
      • backbone: MobileNetV2.
      • we modify the type information of the JSON file to match our application.
      • run train_mobilenetv2.py to start training.
      • finally, the weights are trained on the training set.
      The Faster R-CNN (M) source code used in our experiment is given in folder ccss/experiment/sources/faster_rcnn (M). The weights of the fully-trained Faster R-CNN (M) model are stored in file ccss/experiment/trained_models/faster_rcnn (M).pth. The performance measurements of Faster R-CNN (M) are stored in folder ccss/experiment/performance_indicators/faster_rcnn (M).

    2.3.2. SSD

    The SSD source code used in our experiment is given in folder ccss/experiment/sources/ssd. The weights of the fully-trained SSD model are stored in file ccss/experiment/trained_models/ssd.pth. The performance measurements of SSD are stored in folder ccss/experiment/performance_indicators/ssd.

    2.3.3. YOLOv3-spp

    • backbone: DarkNet53
    • we modified the type information of the XML file to match our application
    • run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.
    • the weights used are: yolov3-spp-ultralytics-608.pt.

    The YOLOv3-spp source code used in our experiment is given in folder ccss/experiment/sources/yolov3-spp. The weights of the fully-trained YOLOv3-spp model are stored in file ccss/experiment/trained_models/yolov3-spp.pt. The performance measurements of YOLOv3-spp are stored in folder ccss/experiment/performance_indicators/yolov3-spp.

    2.3.4. YOLOv5

    • backbone: CSP_DarkNet
    • we modified the type information of the XML file to match our application
    • run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.
    • the weights used are: yolov5s.

    The YOLOv5 source code used in our experiment is given in folder ccss/experiment/sources/yolov5. The weights of the fully-trained YOLOv5 model are stored in file ccss/experiment/trained_models/yolov5.pt. The performance measurements of YOLOv5 are stored in folder ccss/experiment/performance_indicators/yolov5.

    2.4. Evaluation

    The computed evaluation metrics as well as the code needed to compute them from our dataset are provided in the folder ccss/experiment/performance_indicators. They are provided over the complete test set as well as separately for the image features (over the test set).

    3. Code Sources

    1. Faster R-CNN (R and M)
    2. SSD
    3. YOLOv3-spp
    4. YOLOv5

    We are particularly thankful to the author of the GitHub repository WZMIAOMIAO/deep-learning-for-image-processing (with whom we are not affiliated). Their instructive videos and codes were most helpful during our work. In particular, we based our own experimental codes on their work (and obtained permission to include it in this archive).

  11. Relaxed Naïve Bayes Data

    • search.dataone.org
    Updated Nov 8, 2023
    Cite
    Relaxed Naïve Bayes Team (2023). Relaxed Naïve Bayes Data [Dataset]. http://doi.org/10.7910/DVN/7KNKLL
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Relaxed Naïve Bayes Team
    Description

    NaiveBayes_R.xlsx: This Excel file includes information as to how probabilities of observed features are calculated given recidivism (P(x_ij | R)) in the training data. Each cell is embedded with an Excel function to render appropriate figures.
    • P(Xi|R): This tab contains probabilities of feature attributes among recidivated offenders.
    • NIJ_Recoded: This tab contains re-coded NIJ recidivism challenge data following our coding schema described in Table 1.
    • Recidivated_Train: This tab contains re-coded features of recidivated offenders.
    • Tabs from [Gender] through [Condition_Other]: Each tab contains probabilities of feature attributes given recidivism. We use these conditional probabilities to replace the raw values of each feature in the P(Xi|R) tab.

    NaiveBayes_NR.xlsx: This Excel file includes information as to how probabilities of observed features are calculated given non-recidivism (P(x_ij | N)) in the training data. Each cell is embedded with an Excel function to render appropriate figures.
    • P(Xi|N): This tab contains probabilities of feature attributes among non-recidivated offenders.
    • NIJ_Recoded: This tab contains re-coded NIJ recidivism challenge data following our coding schema described in Table 1.
    • NonRecidivated_Train: This tab contains re-coded features of non-recidivated offenders.
    • Tabs from [Gender] through [Condition_Other]: Each tab contains probabilities of feature attributes given non-recidivism. We use these conditional probabilities to replace the raw values of each feature in the P(Xi|N) tab.

    Training_LnTransformed.xlsx: Figures in each cell are log-transformed ratios of the probabilities in NaiveBayes_R.xlsx (P(Xi|R)) to the probabilities in NaiveBayes_NR.xlsx (P(Xi|N)).

    TestData.xlsx: This Excel file includes the following tabs based on the test data: P(Xi|R), P(Xi|N), NIJ_Recoded, and Test_LnTransformed (log-transformed P(Xi|R)/P(Xi|N)).

    Training_LnTransformed.dta: We transform Training_LnTransformed.xlsx to a Stata data set. We use the Stat/Transfer 13 software package to transfer the file format.

    StataLog.smcl: This file includes the results of the logistic regression analysis. Both the estimated intercept and the coefficient estimates in this Stata log correspond to the raw weights and standardized weights in Figure 1.

    Brier Score_Re-Check.xlsx: This Excel file recalculates the Brier scores of the Relaxed Naïve Bayes Classifier in Table 3, showing evidence that the results displayed in Table 3 are correct.

    Full list of files: NaiveBayes_R.xlsx, NaiveBayes_NR.xlsx, Training_LnTransformed.xlsx, TestData.xlsx, Training_LnTransformed.dta, StataLog.smcl, Brier Score_Re-Check.xlsx, Data for Weka (Training Set): Bayes_2022_NoID, Data for Weka (Test Set): BayesTest_2022_NoID, and Weka output for machine learning models (Conventional naïve Bayes, AdaBoost, Multilayer Perceptron, Logistic Regression, and Random Forest).
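
    The core computation described above (replacing each feature value by the log-transformed ratio of its conditional probabilities, then fitting a logistic regression on those features) can be sketched as follows; this is an illustrative Python re-implementation of the workflow, not the released Excel/Stata files, and the column names are placeholders:

    ```python
    # Sketch: relaxed naive Bayes features as ln(P(x_i|R)/P(x_i|N)) per feature, which
    # are then fed to a logistic regression (e.g. sklearn.linear_model.LogisticRegression).
    # Illustrative re-implementation of the described workflow; column names are placeholders.
    import numpy as np
    import pandas as pd

    def log_ratio_features(X, y, smoothing=1.0):
        """Replace each categorical value by ln(P(value | recidivism) / P(value | non-recidivism))."""
        out = pd.DataFrame(index=X.index)
        for col in X.columns:
            cats = X[col].astype("category").cat.categories
            n_r = X.loc[y == 1, col].value_counts().reindex(cats, fill_value=0)
            n_n = X.loc[y == 0, col].value_counts().reindex(cats, fill_value=0)
            p_r = (n_r + smoothing) / (n_r.sum() + smoothing * len(cats))  # P(x_i | R)
            p_n = (n_n + smoothing) / (n_n.sum() + smoothing * len(cats))  # P(x_i | N)
            out[col] = X[col].map(np.log(p_r / p_n))
        return out

    # Usage (placeholder column name 'recidivism'):
    # X_ln = log_ratio_features(train.drop(columns=["recidivism"]), train["recidivism"])
    # clf = LogisticRegression().fit(X_ln, train["recidivism"])
    ```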

  12. In Silico Prediction of Physicochemical Properties of Environmental...

    • gimi9.com
    Updated Jan 10, 2017
    + more versions
    Cite
    (2017). In Silico Prediction of Physicochemical Properties of Environmental Chemicals Using Molecular Fingerprints and Machine Learning | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_in-silico-prediction-of-physicochemical-properties-of-environmental-chemicals-using-molecu/
    Dataset updated
    Jan 10, 2017
    Description

    QSAR Model Reporting Formats. Examples of R code: feature selection and regression analysis.
    • Figure S1: Data distribution of logBCF, BP, MP and logVP.
    • Figures S2–S5: Relationship between model complexity and prediction errors as well as the plots of estimated values versus experimental data for logBCF, BP, MP, and logVP, respectively.
    • Figure S6: Plots of leverage versus standardized residuals for the logBCF, BP, MP, and logVP models.
    • Table S1: Chemical product classes for the training and test sets.
    • Tables S2–S5: Regression statistics for logBCF, BP, MP, and logVP, respectively.
    • Table S6: Applicability domains for logBCF, BP, MP, and logVP.
    • Tables S7–S12: Chemicals with large prediction residuals for the six properties (PDF).
    • Chemical names, CAS registry numbers and SMILES, as well as experimentally measured and estimated property values of the training and test sets (XLSX).

    This dataset is associated with the following publication: Zang, Q., K. Mansouri, A. Williams, R. Judson, D. Allen, W.M. Casey, and N.C. Kleinstreuer. In Silico Prediction of Physicochemical Properties of Environmental Chemicals Using Molecular Fingerprints and Machine Learning. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, USA, 57(1): 36-49, (2017).

  13. YOGData: Labelled data (YOLO and Mask R-CNN) for yogurt cup identification...

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jun 29, 2022
    Cite
    Symeon Symeonidis; Vasiliki Balaska; Dimitrios Tsilis; Fotis K. Konstantinidis (2022). YOGData: Labelled data (YOLO and Mask R-CNN) for yogurt cup identification within production lines [Dataset]. http://doi.org/10.5281/zenodo.6773531
    Dataset updated
    Jun 29, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Symeon Symeonidis; Vasiliki Balaska; Dimitrios Tsilis; Fotis K. Konstantinidis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data abstract:
    The YogDATA dataset contains images from an industrial laboratory production line operated for yogurt quality control. The case study of recognizing yogurt cups requires training Mask R-CNN and YOLO v5.0 models with a set of corresponding images, so it is important to collect those images to train and evaluate the classes. Specifically, the YogDATA dataset includes the same labelled data for the Mask R-CNN (COCO format) and YOLO models. For the YOLO architecture, the training and validation datasets include sets of images in jpg format and their annotations in txt file format. For the Mask R-CNN architecture, the annotations of the same sets of images are included in json file format (80% of the images and annotations of each subset are in the training set and 20% of the images of each subset are in the test set).

    Paper abstract:
    The explosion of the digitisation of the traditional industrial processes and procedures is consolidating a positive impact on modern society by offering a critical contribution to its economic development. In particular, the dairy sector consists of various processes, which are very demanding and thorough. It is crucial to leverage modern automation tools and through-engineering solutions to increase their efficiency and continuously meet challenging standards. Towards this end, in this work, an intelligent algorithm based on machine vision and artificial intelligence, which identifies dairy products within production lines, is presented. Furthermore, in order to train and validate the model, the YogDATA dataset was created that includes yogurt cups within a production line. Specifically, we evaluate two deep learning models (Mask R-CNN and YOLO v5.0) to recognise and detect each yogurt cup in a production line, in order to automate the packaging processes of the products. According to our results, the performance precision of the two models is similar, estimating it at 99%.

  14. Data from: Combining Group Contribution Method and Semisupervised Learning...

    • acs.figshare.com
    xlsx
    Updated Dec 26, 2024
    Cite
    Zhao Liu; Lanyu Shang; Kuan Huang; Zhenrui Yue; Alan Y. Han; Dong Wang; Huichun Zhang (2024). Combining Group Contribution Method and Semisupervised Learning to Build Machine Learning Models for Predicting Hydroxyl Radical Rate Constants of Water Contaminants [Dataset]. http://doi.org/10.1021/acs.est.4c11950.s002
    Dataset updated
    Dec 26, 2024
    Dataset provided by
    ACS Publications
    Authors
    Zhao Liu; Lanyu Shang; Kuan Huang; Zhenrui Yue; Alan Y. Han; Dong Wang; Huichun Zhang
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Machine learning is an effective tool for predicting reaction rate constants for many organic compounds with the hydroxyl radical (HO•). Previously reported models have achieved relatively good performance, but due to scarce data (

  15. Data from: X-ray diffraction dataset for experimental noise filtering

    • zenodo.org
    bin
    Updated Mar 14, 2024
    + more versions
    Cite
    J. Oppliger; M. M. Denner; J. Küspert; R. Frison; Q. Wang; A. Morawietz; O. Ivashko; A.-C. Dippel; M. v. Zimmermann; N. B. Christensen; T. Kurosawa; N. Momono; M. Oda; F. D. Natterer; M. H. Fischer; T. Neupert; J. Chang (2024). X-ray diffraction dataset for experimental noise filtering [Dataset]. http://doi.org/10.5281/zenodo.10816877
    Dataset updated
    Mar 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    J. Oppliger; M. M. Denner; J. Küspert; R. Frison; Q. Wang; A. Morawietz; O. Ivashko; A.-C. Dippel; M. v. Zimmermann; N. B. Christensen; T. Kurosawa; N. Momono; M. Oda; F. D. Natterer; M. H. Fischer; T. Neupert; J. Chang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    X-ray diffraction data set for the training of noise filtering algorithms. The data set contains groups of low- and high-counting statistics pairs. The sampling times are mostly 1 (20) seconds for low (high) counting data. Three files in HDF5 format are provided, corresponding to a training, validation and test data set. Each data group contains sequences of 41 consecutive frames, corresponding to a scan along the reciprocal h-direction. Next to the raw data, sampling times and monitor values are included. The test data set additionally contains denoised low-count frames obtained from a pre-trained neural network.
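
    A minimal sketch for inspecting one of the HDF5 files with h5py; the file name used here is a placeholder, and no group or dataset names are assumed, since the record does not list them:

    ```python
    # Sketch: walk an HDF5 file from this dataset and print its group/dataset layout.
    # h5py assumed; the file name is a placeholder, and key names are discovered, not hard-coded.
    import h5py

    with h5py.File("training.h5", "r") as f:
        def show(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(name, obj.shape, obj.dtype)
        f.visititems(show)  # lists every dataset, e.g. the 41-frame scan sequences
    ```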

    Additionally, files containing the trained model weights are included for two different architectures described in the main article (10.1038/s42256-024-00790-1).

    The data has been recorded on a La1.88Sr0.12CuO4 single crystal at the beamline P21.1 at the PETRA III storage ring at DESY in Hamburg, Germany. The scattering intensities were recorded using a Dectris Pilatus 100K CdTe detector. The diffractometer was operated with 100 keV photons and the sample was cooled to T ~ 30 K. The data contains different signals such as weak 2D charge density wave order, fundamental Bragg peaks, powder lines, spurions and dead pixels.

  16. Data from: Quantitative Prediction of Antitarget Interaction Profiles for...

    • figshare.com
    application/cdfv2
    Updated May 30, 2023
    Cite
    Alexey V. Zakharov; Alexey A. Lagunin; Dmitry A. Filimonov; Vladimir V. Poroikov (2023). Quantitative Prediction of Antitarget Interaction Profiles for Chemical Compounds [Dataset]. http://doi.org/10.1021/tx300247r.s001
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Alexey V. Zakharov; Alexey A. Lagunin; Dmitry A. Filimonov; Vladimir V. Poroikov
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The evaluation of possible interactions between chemical compounds and antitarget proteins is an important task of the research and development process. Here, we describe the development and validation of QSAR models for the prediction of antitarget end-points, created on the basis of multilevel and quantitative neighborhoods of atom descriptors and self-consistent regression. Data on 4000 chemical compounds interacting with 18 antitarget proteins (13 receptors, 2 enzymes, and 3 transporters) were used to model 32 sets of end-points (IC50, Ki, and Kact). Each set was randomly divided into training and test sets in a ratio of 80% to 20%, respectively. The test sets were used for external validation of QSAR models created on the basis of the training sets. The coverage of prediction for all test sets exceeded 95%, and for half of the test sets, it was 100%. The accuracy of prediction for 29 of the end-points, based on the external test sets, was typically in the range of R2test = 0.6–0.9; three test sets had lower R2test values, specifically 0.55–0.6. The proposed approach showed a reasonable accuracy of prediction for 91% of the antitarget end-points and high coverage for all external test sets. On the basis of the created models, we have developed a freely available online service for in silico prediction of 32 antitarget end-points: http://www.pharmaexpert.ru/GUSAR/antitargets.html.
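
    The validation protocol (a random 80/20 split per end-point, with R² computed on the held-out external test set) can be sketched with scikit-learn as below; the ridge regressor is a stand-in for the self-consistent regression actually used:

    ```python
    # Sketch: 80/20 random split with external-test R^2, as in the described protocol.
    # The regressor is a stand-in; the study used self-consistent regression on its own descriptors.
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score

    def external_r2(X, y, seed=0):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        model = Ridge().fit(X_tr, y_tr)
        return r2_score(y_te, model.predict(X_te))
    ```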

  17. Data from: Retip: Retention Time Prediction for Compound Annotation in...

    • figshare.com
    • acs.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Paolo Bonini; Tobias Kind; Hiroshi Tsugawa; Dinesh Kumar Barupal; Oliver Fiehn (2023). Retip: Retention Time Prediction for Compound Annotation in Untargeted Metabolomics [Dataset]. http://doi.org/10.1021/acs.analchem.9b05765.s001
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Paolo Bonini; Tobias Kind; Hiroshi Tsugawa; Dinesh Kumar Barupal; Oliver Fiehn
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Unidentified peaks remain a major problem in untargeted metabolomics by LC-MS/MS. Confidence in peak annotations increases by combining MS/MS matching and retention time. Here we show how retention times can be predicted from molecular structures. Two large, publicly available data sets were used for model training in machine learning: the Fiehn hydrophilic interaction liquid chromatography data set (HILIC) of 981 primary metabolites and biogenic amines, and the RIKEN plant specialized metabolome annotation (PlaSMA) database of 852 secondary metabolites that uses reversed-phase liquid chromatography (RPLC). Five machine learning algorithms have been integrated into the Retip R package for building retention time prediction models: random forest, Bayesian-regularized neural network, XGBoost, light gradient-boosting machine (LightGBM), and Keras. A complete workflow for retention time prediction was developed in R. It can be freely downloaded from the GitHub repository (https://www.retip.app). Keras outperformed the other machine learning algorithms on the test set with minimal overfitting, as verified by small error differences between training, test, and validation sets. Keras yielded a mean absolute error of 0.78 min for HILIC and 0.57 min for RPLC. Retip is integrated into the mass spectrometry software tools MS-DIAL and MS-FINDER, allowing a complete compound annotation workflow. In a test application on mouse blood plasma samples, we found a 68% reduction in the number of candidate structures when searching all isomers in the MS-FINDER compound identification software. Retention time prediction increases the identification rate in liquid chromatography and subsequently leads to an improved biological interpretation of metabolomics data.
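
    The Retip workflow itself is an R package, but the train/test evaluation it describes can be sketched generically. The example below is a hedged illustration in Python: random arrays stand in for molecular descriptors and measured retention times, and a gradient-boosted regressor stands in for one of the five Retip learners; only the MAE-based evaluation mirrors the description above.

      import numpy as np
      from sklearn.ensemble import GradientBoostingRegressor
      from sklearn.metrics import mean_absolute_error
      from sklearn.model_selection import train_test_split

      # Placeholder data: rows = compounds, columns = molecular descriptors,
      # y = measured retention times in minutes (e.g. a HILIC or RPLC library).
      rng = np.random.default_rng(42)
      X = rng.normal(size=(900, 100))
      y = np.abs(rng.normal(loc=6.0, scale=2.0, size=900))

      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, random_state=42
      )

      # Stand-in for one of the five Retip learners (RF, BRNN, XGBoost, LightGBM, Keras).
      model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

      # Retip reports the mean absolute error (in minutes) on the held-out compounds.
      print("MAE [min]:", mean_absolute_error(y_test, model.predict(X_test)))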

  18. 3D Mask R-CNN data

    • figshare.com
    zip
    Updated Dec 12, 2024
    Cite
    Gabriel David; emmanuel Faure (2024). 3D Mask R-CNN data [Dataset]. http://doi.org/10.6084/m9.figshare.26973085.v3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 12, 2024
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Gabriel David; emmanuel Faure
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Image repository for the paper "3D Mask R-CNN Benchmarks in Controlled Environment and Morphogenesis Study". This repository contains the test set of the Phallusia mammillata dataset used in the paper to train and validate the 3D Mask R-CNN. The test set is composed of the PM1 embryo images and ground truth. In detail:

    • the 3D input images of the Phallusia mammillata PM1 embryo (Inputs.zip)
    • the ground-truth ASTEC instance segmentation of these input images (ASTEC_Ground_truth.zip)
    • the predictions inferred by the trained 3D Mask R-CNN (3D_Mask_R_CNN.zip)
    • the predictions inferred by Cellpose, a state-of-the-art network for cell instance segmentation, with its pre-trained weights cyto3 (Cellpose_cyto3.zip)
    • the predictions inferred by Cellpose after retraining on a sample of the Phallusia mammillata dataset (Cellpose_retrained.zip)
    • the resulting weights of the 3D Mask R-CNN trained on the Phallusia mammillata dataset

    These data can be used to reproduce the 3D Mask R-CNN inference over the Phallusia mammillata test set and to evaluate the predictions against the ASTEC ground truth. The 3D Mask R-CNN code is available at https://github.com/gdavid57/3d-mask-r-cnn; see the "morphogenesis" branch for the Phallusia mammillata dataset.
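
    Evaluating the predictions against the ASTEC ground truth amounts to comparing two 3D label volumes. The function below is a generic sketch of a per-cell best-overlap IoU score, assuming both volumes have already been loaded as integer label arrays of identical shape (loading from the zip archives is not shown); it is not the evaluation code shipped with the repository.

      import numpy as np

      def iou_per_instance(gt, pred):
          """Best-overlap IoU for every ground-truth cell label (0 = background)."""
          scores = {}
          for label in np.unique(gt):
              if label == 0:
                  continue
              gt_mask = gt == label
              # Predicted label that overlaps this ground-truth cell the most.
              cand, counts = np.unique(pred[gt_mask], return_counts=True)
              keep = cand != 0
              cand, counts = cand[keep], counts[keep]
              if cand.size == 0:
                  scores[int(label)] = 0.0      # cell missed entirely
                  continue
              pred_mask = pred == cand[np.argmax(counts)]
              inter = np.logical_and(gt_mask, pred_mask).sum()
              union = np.logical_or(gt_mask, pred_mask).sum()
              scores[int(label)] = float(inter) / float(union)
          return scores

      # Usage (gt_volume and pred_volume are 3D integer label arrays):
      # scores = iou_per_instance(gt_volume, pred_volume)
      # print("mean IoU:", np.mean(list(scores.values())))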

  19. Data from: Predicting CO2 Solubility in Diverse Ionic Liquids: A Data-Driven...

    • acs.figshare.com
    xlsx
    Updated Jun 3, 2025
    Cite
    Zahra Bastami; Mohammad Amin Sobati; Mahdieh Amereh (2025). Predicting CO2 Solubility in Diverse Ionic Liquids: A Data-Driven Approach Using Machine Learning Algorithms [Dataset]. http://doi.org/10.1021/acs.energyfuels.5c01345.s002
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 3, 2025
    Dataset provided by
    ACS Publications
    Authors
    Zahra Bastami; Mohammad Amin Sobati; Mahdieh Amereh
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In this study, new machine-learning-based models have been developed for the prediction of carbon dioxide (CO2) solubility in different Ionic Liquids (ILs). An extensive data set comprising 16,480 experimental data points of CO2 solubility in 296 ILs, consisting of 103 different cation and 78 different anion structures, was utilized for this purpose. Quantitative Structure–Property Relationship (QSPR) models were developed using linear and nonlinear methods based on this large data set. To consider the effect of cation and anion structures on the CO2 solubility, basic descriptors, including zero-dimensional, one-dimensional, and fingerprint descriptors (a category of two-dimensional descriptors), were calculated. Subsequently, the most relevant variables were identified through the StepWise Regression (SWR), resulting in the selection of 18 categories of cationic and anionic descriptors, in addition to temperature and pressure, as inputs for nonlinear Machine Learning (ML) models such as MultiLayer Perceptron (MLP), Radial Basis Function (RBF), Random Forest (RF), and Least-Squares Boosting (LSBoost). Internal and external validation of the models indicated that the LSBoost model displayed the highest accuracy in predicting CO2 solubility and demonstrated superior capability in modeling complex data. R2 and MSE values for this model were 0.9962 and 0.0070 for the training set and 0.9243 and 0.1277 for the test set, respectively. Furthermore, comparisons between the LSBoost model and the available models in the literature demonstrated that the LSBoost model surpasses the other models in performance, proving to be reliable for predicting CO2 solubility in new ILs, thereby aiding in the design and selection of ILs for CO2 capture.
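
    A minimal sketch of the boosted-tree regression described above is given below, with scikit-learn's GradientBoostingRegressor standing in for MATLAB's LSBoost. The feature layout (18 descriptor columns plus temperature and pressure) follows the description, but the arrays are random placeholders rather than the actual 16,480-point data set.

      import numpy as np
      from sklearn.ensemble import GradientBoostingRegressor
      from sklearn.metrics import mean_squared_error, r2_score
      from sklearn.model_selection import train_test_split

      # Placeholder features: 18 cation/anion descriptor columns + T [K] + P [bar].
      rng = np.random.default_rng(7)
      descriptors = rng.normal(size=(2000, 18))
      T = rng.uniform(280.0, 400.0, size=(2000, 1))
      P = rng.uniform(1.0, 100.0, size=(2000, 1))
      X = np.hstack([descriptors, T, P])
      y = rng.uniform(0.0, 0.8, size=2000)   # dummy CO2 solubility values

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

      # Least-squares boosting stand-in (default loss is squared error).
      model = GradientBoostingRegressor(random_state=7).fit(X_tr, y_tr)

      for name, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
          pred = model.predict(Xs)
          print(name, "R2 =", round(r2_score(ys, pred), 4),
                "MSE =", round(mean_squared_error(ys, pred), 4))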

  20. Dataset and Model Weights for Plasma Sheet Model Graph Network Simulator

    • zenodo.org
    • paperswithcode.com
    zip
    Updated May 30, 2024
    Cite
    Diogo D. Carvalho; Diogo D. Carvalho; Diogo R. Ferreira; Diogo R. Ferreira; Luís O. Silva; Luís O. Silva (2024). Dataset and Model Weights for Plasma Sheet Model Graph Network Simulator [Dataset]. http://doi.org/10.5281/zenodo.10440186
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 30, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Diogo D. Carvalho; Diogo D. Carvalho; Diogo R. Ferreira; Diogo R. Ferreira; Luís O. Silva; Luís O. Silva
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This repository contains the simulation data and pre-trained Graph Neural Network (GNN) models produced in [1].

    Two *.zip files are provided:

    • data.zip - contains the datasets of train/test simulations produced using the Sheet Model algorithm [1, 2]
    • models.zip - contains the GNN model weights (*.pkl) + relevant training information and model parameters (*.yml and *.txt)

    Dataset subfolders are named according to dataset/{'train' or 'test'}/{number of sheets}/{boundary condition}/. Each subfolder contains multiple simulations and a single info.yml file with relevant information regarding the overall setup. For each simulation i, the following files are provided:

    • x_{i}.npy - array with sheet trajectories (#time-steps, #sheets)
    • v_{i}.npy - array with sheet velocities (#time-steps, #sheets)
    • x_eq_{i}.npy - array with sheet equilibrium positions (#time-steps, #sheets)

    Model subfolders are named as follows:

    • models/{time step}/{seed} - default architecture (preferred)
    • models/{time step}/{'collisions', 'nosent' or 'equivariant'}/{seed} - alternative (lower-performing) architectures described in the paper's appendices.

    For each model we provide:

    • params_best.pkl - model weights that performed the best during training on the validation set
    • params_final.pkl - model weights at the end of training
    • model_cfg.yml - GNN architecture metadata
    • train_cfg.yml - training configuration metadata
    • train_data.yml - training dataset metadata
    • loss.txt - training and validation loss per epoch
    • loss_i.txt - training loss per gradient update step
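
    A minimal loading sketch for the files listed above is shown below. The concrete path components ('train', '32', 'periodic', i = 0, time step '0.1', seed '0') are placeholders that only follow the naming scheme; substitute the folder names actually present in data.zip and models.zip.

      import pickle
      from pathlib import Path

      import numpy as np

      # Hypothetical example paths following the naming scheme above.
      sim_dir = Path("dataset/train/32/periodic")
      i = 0

      x = np.load(sim_dir / f"x_{i}.npy")        # sheet trajectories, (#time-steps, #sheets)
      v = np.load(sim_dir / f"v_{i}.npy")        # sheet velocities,   (#time-steps, #sheets)
      x_eq = np.load(sim_dir / f"x_eq_{i}.npy")  # equilibrium positions, same shape
      print(x.shape, v.shape, x_eq.shape)

      # Best-on-validation weights of the default GNN architecture.
      with open("models/0.1/0/params_best.pkl", "rb") as f:
          params = pickle.load(f)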

    Source Code

    The source code used to produce the data, train, and test the models can be found at: https://github.com/diogodcarvalho/gns-sheet-model

    References

    [1] D. D. Carvalho, D. R. Ferreira, L. O. Silva, "Learning the dynamics of a one-dimensional plasma model with graph neural networks", Mach. Learn.: Sci. Technol. 5 025048 (2024)

    [2] J. Dawson, "One‐Dimensional Plasma Model", The Physics of Fluids 5.4 (1962): 445-459.

