9 datasets found
  1. MISATO - Machine learning dataset for structure-based drug discovery

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 25, 2023
    Cite
    Filipe Menezes (2023). MISATO - Machine learning dataset for structure-based drug discovery [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7711952
    Explore at:
    Dataset updated
    May 25, 2023
    Dataset provided by
    Marie Piraud
    Grzegorz M. Popowicz
    Till Siebenmorgen
    Fabian J. Theis
    Sabrina Benassou
    Filipe Menezes
    Erinc Merdivan
    Stefan Kesselheim
    Michael Sattler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Developments in Artificial Intelligence (AI) have had an enormous impact on scientific research in recent years. Yet, relatively few robust methods have been reported in the field of structure-based drug discovery. To train AI models to abstract from structural data, highly curated and precise biomolecule-ligand interaction datasets are urgently needed. We present MISATO, a curated dataset of almost 20,000 experimental structures of protein-ligand complexes, associated molecular dynamics traces, and electronic properties. Semi-empirical quantum mechanics was used to systematically refine protonation states of proteins and small molecule ligands. Molecular dynamics traces for protein-ligand complexes were obtained in explicit water. The dataset is made readily available to the scientific community via simple Python data loaders. AI baseline models are provided for dynamical and electronic properties. This highly curated dataset is expected to enable the next generation of AI models for structure-based drug discovery. Our vision is to make MISATO the first step of a vibrant community project for the development of powerful AI-based drug discovery tools.

  2. Data from: BioEncoder: a metric learning toolkit for comparative organismal...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 26, 2024
    Cite
    Lürig, Moritz David (2024). Data from: BioEncoder: a metric learning toolkit for comparative organismal biology [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10909613
    Explore at:
    Dataset updated
    Jul 26, 2024
    Dataset provided by
    Lürig, Moritz David
    Porto, Arthur
    Di Martino, Emanuela
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BioEncoder: a metric learning toolkit for comparative organismal biology

    Abstract - In the realm of biological image analysis, deep learning (DL) has become a core toolkit, e.g., for segmentation and classification. However, conventional DL methods are challenged by large biodiversity datasets characterized by unbalanced classes and hard-to-distinguish phenotypic differences between them. Here we present BioEncoder, a user-friendly toolkit for metric learning, which overcomes these challenges by focussing on learning relationships between individual data points rather than on the separability of classes. BioEncoder is released as a Python package, created for ease of use and flexibility across diverse datasets. It features taxon-agnostic data loaders, custom augmentation options, and simple hyperparameter adjustments through text-based configuration files. The toolkit's significance lies in its potential to unlock new research avenues in biological image analysis while democratizing access to advanced deep metric learning techniques. BioEncoder focuses on the urgent need for toolkits bridging the gap between complex DL pipelines and practical applications in biological research.

    Dataset - This data repository includes two things: a snapshot of the BioEncoder package (BioEncoder-main.zip, version 1.0.0, downloaded from https://github.com/agporto/BioEncoder on 2024-07-19 at 17:20), and the damselfly dataset used for the case study presented in the paper (bioencoder_data.zip). The dataset archive also encompasses the configuration files and the final model checkpoints from the case study, as well as a script to reproduce the results and figures presented in the paper.

    How to use - Get started by consulting the GitHub repository for information on how to install BioEncoder, then download the data archive and run the script. Some parts of the script can be executed using the model checkpoints; for other parts, the training routine needs to be run.

  3. [Dataset] Towards Robotic Mapping of a Honeybee Comb

    • data.europa.eu
    • zenodo.org
    unknown
    Cite
    Zenodo, [Dataset] Towards Robotic Mapping of a Honeybee Comb [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-15042164?locale=bg
    Explore at:
    Available download formats: unknown (4855)
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "Towards Robotic Mapping of a Honeybee Comb" Dataset

    This dataset supports the analyses and experiments of the paper: J. Janota et al., "Towards Robotic Mapping of a Honeybee Comb," 2024 International Conference on Manipulation, Automation and Robotics at Small Scales (MARSS), Delft, Netherlands, 2024, doi: 10.1109/MARSS61851.2024.10612712.

    Link to Paper | Link to Code Repository

    Cell Detection

    The celldet_2023 dataset contains a total of 260 images of the honeycomb (at resolution 67 µm per pixel), with masks from the ViT-H Segment Anything Model (SAM) and annotations for these masks. The structure of the dataset is as follows:

    celldet_2023
    ├── {image_name}.png
    ├── ...
    ├── masksH (folder with masks for each image)
    ├────{image_name}.json
    ├────...
    ├── annotations
    ├────annotated_masksH (folder with annotations for training images)
    ├──────{image_name in training part}.csv
    ├──────...
    ├────annotated_masksH_val (folder with annotations for validation images)
    ├──────{image_name in validation part}.csv
    ├──────...
    ├────annotated_masksH_test (folder with annotations for test images)
    ├──────{image_name in test part}.csv
    ├──────...

    Masks

    For each image there is a .json file that contains all the masks produced by SAM for that image. The masks are in COCO Run-Length Encoding (RLE) format.

    Annotations

    The annotation files are split into folders based on whether they were used for training, validation or testing. For each image (and thus also for each .json file with masks), there is a .csv file with two columns:

    Column id | Description
    0         | order id of the mask in the corresponding .json file
    1         | mask label: 1 if fully visible cell, 2 if partially occluded cell, 0 otherwise

    Loading the Dataset

    For an example of loading the data, see the data loader in the paper repository:

    python cell_datasetV2.py --img_dir --mask_dir
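As a rough illustration of the two-column annotation layout described above, the following sketch tallies mask labels from the text of one annotation file. It assumes the .csv has no header row and is comma-delimited, neither of which the description states; `count_labels` and the sample rows are hypothetical and are not part of the paper repository.

```python
import csv
import io
from collections import Counter

# Label meanings as given in the dataset description.
LABEL_NAMES = {0: "other", 1: "fully visible cell", 2: "partially occluded cell"}

def count_labels(csv_text):
    """Count how many masks carry each label in one annotation file."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        label = int(row[1])  # column 1 holds the mask label
        counts[LABEL_NAMES.get(label, "unknown")] += 1
    return counts

# Made-up rows: mask order id, label.
sample = "0,1\n1,2\n2,0\n3,1\n"
print(count_labels(sample))
```

Column 0 (the order id) is what would link each label back to a mask in the corresponding .json file.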

  4. gnaf-loader

    • data.gov.au
    Updated Aug 21, 2016
    + more versions
    Cite
    (2016). gnaf-loader [Dataset]. https://data.gov.au/dataset/7863c61a-a46f-4441-a1c6-97c516012aac
    Explore at:
    Dataset updated
    Aug 21, 2016
    Description

    A Python script for quickly loading the complete G-NAF and PSMA Administrative Boundaries into Postgres, simplified and ready to use as reference data for address validation, geocoding, analysis and visualisation. It also customises G-NAF and the Admin Bdys to remove some of the known, minor limitations of the data.

  5. Data from: HEAPO – An Open Dataset for Heat Pump Optimization with Smart...

    • earth.org.uk
    Updated Mar 21, 2025
    + more versions
    Cite
    Brudermueller, Tobias; Fleisch, Elgar; Vayá, Marina González; Staake, Thorsten (2025). HEAPO – An Open Dataset for Heat Pump Optimization with Smart Electricity Meter Data and On-Site Inspection Protocols [Dataset]. http://doi.org/10.48550/ARXIV.2503.16993
    Explore at:
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    arXiv
    Authors
    Brudermueller, Tobias; Fleisch, Elgar; Vayá, Marina González; Staake, Thorsten
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Heat pumps are essential for decarbonizing residential heating but consume substantial electrical energy, impacting operational costs and grid demand. Many systems run inefficiently due to planning flaws, operational faults, or misconfigurations. While optimizing performance requires skilled professionals, labor shortages hinder large-scale interventions. However, digital tools and improved data availability create new service opportunities for energy efficiency, predictive maintenance, and demand-side management. To support research and practical solutions, we present an open-source dataset of electricity consumption from 1,408 households with heat pumps and smart electricity meters in the canton of Zurich, Switzerland, recorded at 15-minute and daily resolutions between 2018-11-03 and 2024-03-21. The dataset includes household metadata, weather data from 8 stations, and ground truth data from 410 field visit protocols collected by energy consultants during system optimizations. Additionally, the dataset includes a Python-based data loader to facilitate seamless data processing and exploration.
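To make the relationship between the two resolutions concrete, here is a minimal sketch that sums 15-minute readings into daily totals. It does not use the dataset's own Python data loader; the `(timestamp, kWh)` tuple layout, the function name `daily_totals`, and the values are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def daily_totals(readings):
    """Sum (timestamp, kWh) readings per calendar day."""
    totals = defaultdict(float)
    for ts, kwh in readings:
        totals[ts.date()] += kwh
    return dict(totals)

# One synthetic day: 96 intervals of 15 minutes, 0.25 kWh each (made-up values).
start = datetime(2018, 11, 3)
readings = [(start + timedelta(minutes=15 * i), 0.25) for i in range(96)]
print(daily_totals(readings))
```

A full day holds 96 quarter-hour intervals, so gaps in the smart-meter series would show up as days with fewer contributing readings.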

  6. A subsection of England and Wales EPC households, joined with PPD data, used...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 15, 2022
    Cite
    Phillips, Tom (2022). A subsection of England and Wales EPC households, joined with PPD data, used for simulation modelling [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7322966
    Explore at:
    Dataset updated
    Nov 15, 2022
    Dataset provided by
    Chan, Stephanie
    Lopez-Garcia, Daniel
    Jenkinson, Ryan
    Phillips, Tom
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Area covered
    England, Wales
    Description

    If you want to give feedback on this dataset, or wish to request it in another form (e.g. CSV), please fill out this survey here. We are a not-for-profit research organisation keen to see how others use our open models and tools, so all feedback is appreciated! It's a short form that takes 5 minutes to complete.

    Important Note: Before downloading this dataset, please read the License and Software Attribution section at the bottom.

    This dataset aligns with the work published in Centre for Net Zero's report "Hitting the Target". In this work, we simulate a range of interventions to model the situations in which we believe the UK will meet its 600,000 heat pump installation per year target by 2028. For full modelling assumptions and findings, read our report on our website.

    The code for running our simulation is open source here.

    This dataset contains over 9 million households that have been address matched between Energy Performance Certificates (EPC) data and Price Paid Data (PPD). The code for our address matching is here. Since these datasets are Open Government License (OGL), this dataset is too. We basically model specific columns from various datasets, as set out in our methodology section in our report, to simplify and clean up this dataset for academic use. License information is also available in the appendix of our report above.

    The EPC data loaders can be found here (the data is here) and the rest of the schemas and data download locations can be found here.

    Note that this dataset is not regularly maintained or updated. It is correct as of January 2022. The data was curated and tested using dbt via this Github repository and would be simple to rerun on the latest data.

    The schema / data dictionary for this data can be found here.

    Our recommended way of loading this data is in Python. After downloading all "parts" of the dataset to a folder, you can run:

    import pandas as pd

    # Reads every parquet part in the folder into a single DataFrame.
    data = pd.read_parquet("path/to/data/folder/")

    Licenses and software attribution:

    For EPC, PPD and UK House Price Index data:

    For the EPC data, we are permitted to republish this provided we state that all researchers who download this dataset must follow these copyright restrictions. We do not explicitly release any Royal Mail address data; instead, we use these fields to generate a pseudonymised "address_cluster_id" which reflects a unique combination of the address lines and postcodes, as well as other metadata. Under ICO and GDPR guidelines, this still counts as personal data, but we have taken measures to pseudonymise as much as possible to fulfil our obligations as a data processor. You must read this carefully before downloading the data, and ensure that you are using it for the research purposes as determined by this copyright notice.

    Contains HM Land Registry data © Crown copyright and database right 2021. This data is licensed under the Open Government Licence v3.0.

    Contains OS data © Crown copyright and database right 2022.

    Contains Office for National Statistics data licensed under the Open Government Licence v.3.0.

    The OGL v3.0 license states that we are free to:

    copy, publish, distribute and transmit the Information;

    adapt the Information;

    exploit the Information commercially and non-commercially for example, by combining it with other Information, or by including it in your own product or application.

    However we must (where we do any of the above):

    acknowledge the source of the Information in your product or application by including or linking to any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence;

    You can see more information here.

    For XOServe Off Gas Postcodes:

    This dataset has been released openly for all uses here.

    For the address matching:

    GNU Parallel: O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014

  7. HER2 Breast Cancer Digital Image Dataset (ADEL Dataset).

    • zenodo.org
    zip
    Updated Jul 13, 2025
    Cite
    Gauhar Dunenova; Natalya Glushkova; Aidos Sarsembayev; Alexandr Ivankov; Elvira Satbayeva; Zhanna Kalmatayeva; Dilyara Kaidarova (2025). HER2 Breast Cancer Digital Image Dataset (ADEL Dataset). [Dataset]. http://doi.org/10.5281/zenodo.15872690
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gauhar Dunenova; Natalya Glushkova; Aidos Sarsembayev; Alexandr Ivankov; Elvira Satbayeva; Zhanna Kalmatayeva; Dilyara Kaidarova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 13, 2025
    Description

    HER2 Breast Cancer Digital Image Dataset (ADEL Dataset).

    We have developed the first Kazakhstani dataset of digital images for HER2 breast cancer analysis. The dataset consists of images sourced from the pathological archives of the Department of Pathology at the Almaty Oncology Center and the Kazakh Institute of Oncology. Each image is labeled with HER2 expression levels manually assessed by experienced pathologists, with in situ hybridization (ISH) performed in equivocal cases to establish ground truth.

    The dataset contains 418 images in PNG format. The annotations can be found in the file her2_dataset/labels.csv.

    HER2 IHC High-Resolution Dataset (Version 0.2)

    This is **Version 0.2** of the HER2 Immunohistochemistry (IHC) dataset. It includes **high-resolution `.tar` archives** containing processed image tiles extracted from whole-slide images (WSIs). This dataset is hosted on the [Hugging Face Hub].

    Digital images were acquired via a fully automated digital system (KFB PRO 120 scanner) at INVIVO LLP with 40x magnification and one focusing layer, ranging in size from 50 MB to 2 GB, depending on the size of the tissue sample fixed on the original slide. The dataset consists of 418 images, which were preprocessed using a conversion script that transformed SVS files into sub-images with a 1:1 aspect ratio in JPEG format. A non-overlapping sliding window approach was applied to generate these sub-images, optimized for machine learning applications.
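The non-overlapping sliding window mentioned above can be sketched as follows. This is not the authors' conversion script: the function name `tile_boxes`, the tile size, and the choice to drop partial tiles at the right and bottom edges are all assumptions for illustration.

```python
def tile_boxes(width, height, tile):
    """Return (left, top, right, bottom) boxes for full square tiles, row by row.

    The stride equals the tile size, so neighbouring tiles never overlap.
    Partial tiles at the right and bottom edges are dropped (an assumption).
    """
    boxes = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            boxes.append((left, top, left + tile, top + tile))
    return boxes

# Hypothetical 1000 x 600 pixel region cut into 256 x 256 tiles.
boxes = tile_boxes(width=1000, height=600, tile=256)
print(len(boxes), boxes[0], boxes[-1])
```

Each box has a 1:1 aspect ratio by construction, matching the square sub-images the description refers to.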

    The compressed .png version of the dataset may serve as a visual reference to the characteristics of the original images.

    ---

    📥 Download Instructions

    🔸 Option 1: Using Python (via `datasets` library)

    python
    from datasets import load_dataset
    
    dataset = load_dataset("aidosSarsembayev/adel_dataset_1")

    This will provide access to metadata or a data loader if defined. For raw files (e.g., `.tar`), use git-lfs:

    🔸 Option 2: Using `git` + `git-lfs` (recommended for large files)

    bash
    git lfs install
    git clone https://huggingface.co/datasets/aidosSarsembayev/adel_dataset_1
    

    This will download all parts of the dataset, including large `.tar` files.

    ---

    📜 Dataset Contents

    The dataset consists of multiple `.tar.gz` archive files:

    - `HER2_001_009.tar.gz`
    - `HER2_010_019.tar.gz`
    - ...
    - `HER2_420_429.tar.gz`

    Each archive contains high-resolution tiles from several HER2 slides.

    A JSON manifest (`manifest.json`) is provided, mapping each archive to the slide IDs it contains.
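A manifest of this kind could be queried roughly as shown below. The actual layout of `manifest.json` is not documented here, so the sketch assumes a JSON object that maps each archive name to a list of slide IDs; the slide ID strings and the helper `archive_for_slide` are hypothetical.

```python
import json

def archive_for_slide(manifest, slide_id):
    """Return the archive whose slide list contains slide_id, or None."""
    for archive, slide_ids in manifest.items():
        if slide_id in slide_ids:
            return archive
    return None

# Assumed manifest shape with made-up slide IDs.
manifest_text = """
{
  "HER2_001_009.tar.gz": ["HER2_001", "HER2_002"],
  "HER2_010_019.tar.gz": ["HER2_010", "HER2_011"]
}
"""
manifest = json.loads(manifest_text)
print(archive_for_slide(manifest, "HER2_010"))
```

Looking up the archive first avoids downloading every `.tar.gz` file when only a few slides are needed.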

    ---

    🧪 Usage

    This dataset is intended for research on:

    - HER2 status classification
    - Digital pathology and WSI analysis
    - IHC image processing

    ---

    🔧 Processing Scripts

    To reproduce or analyze the dataset, use the scripts provided in the following repository:

    🔗 [GitHub – HER2 Data Processing]()

    ---

    📝 Citation and License

    Please refer to the associated Zenodo record or publication for citation and licensing terms. Creative Commons licenses may apply (e.g., CC-BY 4.0).

    ---

    Maintained by: [@asarsembayev](https://huggingface.co/aidosSarsembayev)

  8. OGBN-Proteins (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBN-Proteins (Processed for PyG) [Dataset]. https://www.kaggle.com/dataup1/ogbn-proteins
    Explore at:
    Available download formats: zip (677947148 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    OGBN-Proteins

    Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-proteins

    Usage in Python

    import os.path as osp
    import datatable as dt  # provides dt.fread, used below to read the split CSVs
    import pandas as pd
    import torch
    import torch_geometric.transforms as T
    from ogb.io.read_graph_raw import read_nodesplitidx_split_hetero
    from ogb.nodeproppred import PygNodePropPredDataset

    class PygOgbnProteins(PygNodePropPredDataset):
      def __init__(self, meta_csv = None):
        # Point the OGB loader at the pre-processed files in the Kaggle input directory.
        root, name, transform = '/kaggle/input', 'ogbn-proteins', T.ToSparseTensor()
        if meta_csv is None:
          meta_csv = osp.join(root, name, 'ogbn-master.csv')
        master = pd.read_csv(meta_csv, index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name = name, root = root, transform = transform, meta_dict = meta_dict)

      def get_idx_split(self, split_type = None):
        if split_type is None:
          split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        # Prefer a cached split dictionary if one exists.
        if osp.isfile(osp.join(path, 'split_dict.pt')):
          return torch.load(osp.join(path, 'split_dict.pt'))
        if self.is_hetero:
          train_idx_dict, valid_idx_dict, test_idx_dict = read_nodesplitidx_split_hetero(path)
          for nodetype in train_idx_dict.keys():
            train_idx_dict[nodetype] = torch.from_numpy(train_idx_dict[nodetype]).to(torch.long)
            valid_idx_dict[nodetype] = torch.from_numpy(valid_idx_dict[nodetype]).to(torch.long)
            test_idx_dict[nodetype] = torch.from_numpy(test_idx_dict[nodetype]).to(torch.long)
          # Return only after every node type has been converted.
          return {'train': train_idx_dict, 'valid': valid_idx_dict, 'test': test_idx_dict}
        else:
          train_idx = dt.fread(osp.join(path, 'train.csv'), header = None).to_numpy().T[0]
          train_idx = torch.from_numpy(train_idx).to(torch.long)
          valid_idx = dt.fread(osp.join(path, 'valid.csv'), header = None).to_numpy().T[0]
          valid_idx = torch.from_numpy(valid_idx).to(torch.long)
          test_idx = dt.fread(osp.join(path, 'test.csv'), header = None).to_numpy().T[0]
          test_idx = torch.from_numpy(test_idx).to(torch.long)
          return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}

    dataset = PygOgbnProteins()
    split_idx = dataset.get_idx_split()
    train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
    graph = dataset[0] # PyG Graph object

    Description

    Graph: The ogbn-proteins dataset is an undirected, weighted, and typed (according to species) graph. Nodes represent proteins, and edges indicate different types of biologically meaningful associations between proteins, e.g., physical interactions, co-expression or homology [1,2]. All edges come with 8-dimensional features, where each dimension represents the strength of a single association type and takes values between 0 and 1 (the larger the value is, the stronger the association is). The proteins come from 8 species.

    Prediction task: The task is to predict the presence of protein functions in a multi-label binary classification setup, where there are 112 kinds of labels to predict in total. The performance is measured by the average of ROC-AUC scores across the 112 tasks.
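The metric described above, a per-task ROC-AUC averaged over all tasks, can be illustrated with a small pure-Python sketch. It is not the OGB Evaluator and should not replace it for reported results; the function names and the two toy tasks are made up for illustration.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_roc_auc(labels_per_task, scores_per_task):
    """Average the per-task ROC-AUC scores (one label/score list per task)."""
    aucs = [roc_auc(t, s) for t, s in zip(labels_per_task, scores_per_task)]
    return sum(aucs) / len(aucs)

# Two toy tasks with four nodes each (ogbn-proteins has 112 tasks).
labels = [[1, 0, 1, 0], [0, 0, 1, 1]]
scores = [[0.9, 0.1, 0.8, 0.3], [0.2, 0.6, 0.7, 0.4]]
print(mean_roc_auc(labels, scores))  # 0.875
```

The first toy task is ranked perfectly (1.0) and the second gets three of four positive-negative pairs right (0.75), which gives the 0.875 average.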

    Dataset splitting: The authors split the protein nodes into training/validation/test sets according to the species which the proteins come from. This enables the evaluation of the generalization performance of the model across different species.

    Note: For undirected graphs, the loaded graphs will have the doubled number of edges because the bidirectional edges will be added automatically.
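The doubling noted above follows from storing each undirected edge once per direction. The sketch below shows the idea on a toy edge list; it is an illustration only and does not reproduce how OGB or PyG build their edge index.

```python
def to_bidirectional(edges):
    """Append the reverse (v, u) of every undirected edge (u, v)."""
    return edges + [(v, u) for u, v in edges]

# Toy triangle graph with three undirected edges.
undirected = [(0, 1), (1, 2), (0, 2)]
directed = to_bidirectional(undirected)
print(len(undirected), len(directed))  # 3 6
```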

    Summary

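    | Package    | #Nodes  | #Edges     | Split Type | Task Type                         | Metric  |
    |------------|---------|------------|------------|-----------------------------------|---------|
    | ogb>=1.1.1 | 132,534 | 39,561,252 | Species    | Multi-label binary classification | ROC-AUC |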

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1):D607–D613, 2019.
    [2] Gene Ontology Consortium. The gene ontology resource: 20 years and still going strong. Nucleic Acids Research, 47(D1):D330–D338, 2018.
    [3] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.

  9. CSEM-MISD - CSEM's Multi-Illumination Surface Defect Detection Dataset

    • zenodo.org
    application/gzip
    Updated Dec 7, 2022
    Cite
    David Honzátko; Engin Türetken; Siavash A. Bigdeli; Pascal Fua; L. Andrea Dunbar (2022). CSEM-MISD - CSEM's Multi-Illumination Surface Defect Detection Dataset [Dataset]. http://doi.org/10.5281/zenodo.5513769
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Dec 7, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Honzátko; Engin Türetken; Siavash A. Bigdeli; Pascal Fua; L. Andrea Dunbar
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    In automated surface visual inspection, it is often necessary to capture the inspected part under many different illumination conditions to capture all the defects. To address this issue, at CSEM we have acquired a real-world multi-illumination defect segmentation dataset, called CSEM-MISD and we release it for research purposes to benefit the community.

    The dataset consists of three different types of metallic parts -- washers, screws, and gears (temporarily only the first one is available). Each part was captured in a half-spherical light-dome system that filtered out all ambient light and successively illuminated the part from 108 distinct illumination angles. Every 12 illumination angles share the same elevation level, and the relative azimuthal difference between adjacent illumination angles on the same level is 30 degrees. For more details, please read Sections 3 and 4 of our paper.

    The washers dataset features 70 defective parts. Some defects, such as notches and holes, are visible in most images (illuminations) with intensity and texture variations among them, while others, such as scratches, are only visible in a few.

    We split the datasets into train and test sets. The train set contains 32 samples and the test set 38 samples. Each sample comprises 108 images (each captured under a different illumination angle), an automatically extracted foreground segmentation mask, and a hand-labeled defect segmentation mask.

    This dataset is challenging mainly because:

    • each raw sample consists of 108 gray-scale images of resolution 512×512 and therefore takes 27MB of space;
    • the metallic surfaces produce many specular reflections that sometimes saturate the camera sensors;
    • the annotations are not very precise because the exact extent of defect contours is always subjective;
    • the defects are very sparse also in the spatial dimensions: they cover only about 1.4% of the total image area in washers; this creates an unbalanced dataset with a highly skewed class representation.
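The 27MB figure in the first point can be checked with a short calculation. It assumes uncompressed storage at one byte per pixel, which fits 8-bit greyscale images; the PNG files on disk will differ in size because of compression.

```python
images_per_sample = 108
width = height = 512
bytes_per_pixel = 1  # 8-bit greyscale, uncompressed (assumption)

raw_bytes = images_per_sample * width * height * bytes_per_pixel
raw_mib = raw_bytes / 2**20  # bytes to MiB

print(raw_bytes, raw_mib)  # 28311552 27.0
```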

    The dataset is organized as follows:

    • each sample resides in the Test, Train, or Unannotated directory;
    • each sample has its own directory which contains the individual images, the foreground, and defect segmentation masks;
    • each image is stored in 8-bit greyscale png format and has a resolution of 512 x 512 pixels;
    • Image file names are formatted using three string fields separated with the underscore character: prefix_sampleNr_illuminationNr.png, where the prefix is e.g. washer, the sampleNr might be a three-digit number 001, and the illuminationNr is formed of 3 digits, first corresponding to the elevation index (1 - highest angle, 9 - lowest angle), and the additional two corresponding to the azimuth index (01-12).
    • Each dataset contains light_vectors.csv, which contains the illumination angles (in lexicographic order of the illuminationNr), and light_currents.csv that contains the numbers roughly corresponding to the light intensity.
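The file-name convention above can be decoded with a few lines of Python. This sketch is separate from the data loaders in the project's repository; the example file name is hypothetical, and mapping azimuth index 1 to 0 degrees is an assumption, since only the 30-degree spacing is stated.

```python
def parse_image_name(file_name):
    """Split 'prefix_sampleNr_illuminationNr.png' into its fields."""
    stem = file_name.rsplit(".", 1)[0]
    # Split from the right so a prefix containing underscores stays intact.
    prefix, sample_nr, illumination_nr = stem.rsplit("_", 2)
    elevation = int(illumination_nr[0])  # 1 = highest angle, 9 = lowest angle
    azimuth = int(illumination_nr[1:])   # 01-12
    return {
        "prefix": prefix,
        "sample": sample_nr,
        "elevation_index": elevation,
        "azimuth_index": azimuth,
        # Assumption: index 1 sits at 0 degrees; adjacent indices are 30 degrees apart.
        "azimuth_deg": (azimuth - 1) * 30,
    }

info = parse_image_name("washer_001_305.png")
print(info)
```

With 9 elevation levels and 12 azimuth positions per level, the scheme covers all 9 × 12 = 108 illumination angles.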

    We provide data loaders implemented in Python at the project's repository.

    If you find our dataset useful, please cite our paper:

    Honzátko, D., Türetken, E., Bigdeli, S. A., Dunbar, L. A., & Fua, P. (2021). Defect segmentation for multi-illumination quality control systems. Machine Vision and Applications.
