12 datasets found
  1. OGBG-Code (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBG-Code (Processed for PyG) [Dataset]. https://www.kaggle.com/datasets/dataup1/ogbg-code/code
    Explore at:
    Available download formats: zip (1314604183 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    Description

    OGBG-Code

    Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code

    Usage in Python

    from torch_geometric.data import DataLoader  # in PyG >= 2.0: from torch_geometric.loader import DataLoader
    from ogb.graphproppred import PygGraphPropPredDataset

    # Load the pre-processed copy shipped with this Kaggle dataset.
    dataset = PygGraphPropPredDataset(name = 'ogbg-code', root = '/kaggle/input')

    batch_size = 32
    split_idx = dataset.get_idx_split()
    train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
    valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
    test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
    

    Description

    Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories among the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes AST edges, AST nodes, and the tokenized method name. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.

    Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures code semantics [1]. Following [2,3], the dataset authors use the F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
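
    To make the metric concrete, here is a hedged sketch of sub-token precision, recall, and F1 for a single predicted method name (the official OGB Evaluator for ogbg-code may differ in details such as how scores are averaged over the dataset):

    # Hedged sketch of per-example sub-token F1; not the official ogbg-code Evaluator.
    def subtoken_f1(pred_tokens, true_tokens):
        pred, true = set(pred_tokens), set(true_tokens)
        if not pred or not true:
            return 0.0
        tp = len(pred & true)                              # sub-tokens predicted correctly
        precision, recall = tp / len(pred), tp / len(true)
        return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

    subtoken_f1(['get', 'file', 'name'], ['get', 'filename'])  # 0.4: only 'get' matches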

    Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and avoids a model that trivially memorizes the idiosyncrasies of training projects (such as the naming conventions and the coding style of a specific developer) to achieve a high test score.

    Summary

    Package    | #Graphs | #Nodes per Graph | #Edges per Graph | Split Type | Task Type            | Metric
    ogb>=1.2.0 | 452,741 | 125.2            | 124.2            | Project    | Sub-token prediction | F1 score

    License: MIT License

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
    [2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
    [3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
    [4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
    [5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. For any questions, problems, or issues, please contact the original authors via their website or GitHub repository.

  2. Data from: HEAPO – An Open Dataset for Heat Pump Optimization with Smart Electricity Meter Data and On-Site Inspection Protocols

    • nde-dev.biothings.io
    • earth.org.uk
    • +1more
    Updated Mar 24, 2025
    Cite
    González Vayá, Marina (2025). HEAPO – An Open Dataset for Heat Pump Optimization with Smart Electricity Meter Data and On-Site Inspection Protocols [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_15056918
    Explore at:
    Dataset updated
    Mar 24, 2025
    Dataset provided by
    Brudermüller, Tobias
    González Vayá, Marina
    Staake, Thorsten
    Fleisch, Elgar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Heat pumps are essential for decarbonizing residential heating but consume substantial electrical energy, impacting operational costs and grid demand. Many systems run inefficiently due to planning flaws, operational faults, or misconfigurations. While optimizing performance requires skilled professionals, labor shortages hinder large-scale interventions. However, digital tools and improved data availability create new service opportunities for energy efficiency, predictive maintenance, and demand-side management. To support research and practical solutions, we present an open-source dataset of electricity consumption from 1,408 households with heat pumps and smart electricity meters in the canton of Zurich, Switzerland, recorded at 15-minute and daily resolutions between 2018-11-03 and 2024-03-21. The dataset includes household metadata, weather data from 8 stations, and ground truth data from 410 field visit protocols collected by energy consultants during system optimizations. Additionally, the dataset includes a Python-based data loader to facilitate seamless data processing and exploration.

    Data Description Paper

    To use the dataset, please refer to the description provided in the current preprint. Note that this manuscript on arXiv is a preprint and is currently under peer review. The dataset and dataloader are available in their initial version, but future updates may occur. If you use the dataset in its current form, please cite the following arXiv paper: https://arxiv.org/abs/2503.16993

    Code Availability

    A Python-based dataloader and data usage instructions can be found on GitHub: https://github.com/tbrumue/heapo
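
    That repository hosts the reference loader. Purely as an illustration of handling smart-meter readings at the two stated resolutions, here is a minimal pandas sketch; the file name 'meter.csv' and the columns 'timestamp' and 'energy_kwh' are hypothetical and not the HEAPO schema:

    # Illustrative only: aggregate hypothetical 15-minute readings to daily totals.
    # Use the official HEAPO dataloader (linked above) for the real file layout.
    import pandas as pd

    df = pd.read_csv("meter.csv", parse_dates=["timestamp"]).set_index("timestamp")
    daily = df["energy_kwh"].resample("D").sum()   # 15-minute resolution -> daily totals
    print(daily.head())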

  3. [Dataset] Towards Robotic Mapping of a Honeybee Comb

    • data.europa.eu
    unknown
    Cite
    Zenodo, [Dataset] Towards Robotic Mapping of a Honeybee Comb [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-15042164?locale=hu
    Explore at:
    Available download formats: unknown (4855)
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "Towards Robotic Mapping of a Honeybee Comb" Dataset This dataset supports the analyses and experiments of the paper: J. Janota et al., "Towards Robotic Mapping of a Honeybee Comb," 2024 International Conference on Manipulation, Automation and Robotics at Small Scales (MARSS), Delft, Netherlands, 2024, doi: 10.1109/MARSS61851.2024.10612712. Link to Paper | Link to Code Repository Cell Detection The celldet_2023 dataset contains a total of 260 images of the honeycomb (at resolution 67 µm per pixel), with masks from the ViT-H Segment Anything Model (SAM) and annotations for these masks. The structure of the dataset is following:celldet_2023├── {image_name}.png├── ...├── masksH (folder with masks for each image)├────{image_name}.json├────...├── annotations├────annotated_masksH (folder with annotations for training images)├──────{image_name in training part}.csv├──────...├────annotated_masksH_val (folder with annotations for validation images)├──────{image_name in validation part}.csv}├──────...├────annotated_masksH_test (folder with annotations for test images)├──────{image_name in test part}.csv}├──────... Masks For each image there is a .json file that contains all the masks produced by the SAM for the particular image, the masks are in COCO Run-Length Encoding (RLE) format. Annotations The annotation files are split into folders based on whether they were used for training, validation or testing. For each image (and thus also for each .json file with masks), there is a .csv file with two columns: Column id Description 0 order id of the mask in the corresponding .json file 1 mask label: 1 if fully visible cell, 2 if partially occluded cell, 0 otherwise Loading the Dataset For an example of loading the data, see the data loader in the paper repository: python cell_datasetV2.py --img_dir --mask_dir

  4. MISATO - Machine learning dataset for structure-based drug discovery

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated May 25, 2023
    Cite
    Till Siebenmorgen; Filipe Menezes; Sabrina Benassou; Erinc Merdivan; Stefan Kesselheim; Marie Piraud; Fabian J. Theis; Michael Sattler; Grzegorz M. Popowicz (2023). MISATO - Machine learning dataset for structure-based drug discovery [Dataset]. http://doi.org/10.5281/zenodo.7711953
    Explore at:
    Available download formats: application/gzip, txt, bin
    Dataset updated
    May 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Till Siebenmorgen; Filipe Menezes; Sabrina Benassou; Erinc Merdivan; Stefan Kesselheim; Marie Piraud; Fabian J. Theis; Michael Sattler; Grzegorz M. Popowicz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Developments in Artificial Intelligence (AI) have had an enormous impact on scientific research in recent years. Yet, relatively few robust methods have been reported in the field of structure-based drug discovery. To train AI models to abstract from structural data, highly curated and precise biomolecule-ligand interaction datasets are urgently needed. We present MISATO, a curated dataset of almost 20,000 experimental structures of protein-ligand complexes, associated molecular dynamics traces, and electronic properties. Semi-empirical quantum mechanics was used to systematically refine protonation states of proteins and small-molecule ligands. Molecular dynamics traces for protein-ligand complexes were obtained in explicit water. The dataset is made readily available to the scientific community via simple Python data loaders. AI baseline models are provided for dynamical and electronic properties. This highly curated dataset is expected to enable the next generation of AI models for structure-based drug discovery. Our vision is to make MISATO the first step of a vibrant community project for the development of powerful AI-based drug discovery tools.

  5. Data from: BioEncoder: a metric learning toolkit for comparative organismal biology

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 26, 2024
    Cite
    Lürig, Moritz David; Di Martino, Emanuela; Porto, Arthur (2024). Data from: BioEncoder: a metric learning toolkit for comparative organismal biology [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10909613
    Explore at:
    Dataset updated
    Jul 26, 2024
    Dataset provided by
    University of Oslo
    Florida Museum of Natural History
    Authors
    Lürig, Moritz David; Di Martino, Emanuela; Porto, Arthur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BioEncoder: a metric learning toolkit for comparative organismal biology

    Abstract - In the realm of biological image analysis, deep learning (DL) has become a core toolkit, e.g., for segmentation and classification. However, conventional DL methods are challenged by large biodiversity datasets characterized by unbalanced classes and hard-to-distinguish phenotypic differences between them. Here we present BioEncoder, a user-friendly toolkit for metric learning, which overcomes these challenges by focussing on learning relationships between individual data points rather than on the separability of classes. BioEncoder is released as a Python package, created for ease of use and flexibility across diverse datasets. It features taxon-agnostic data loaders, custom augmentation options, and simple hyperparameter adjustments through text-based configuration files. The toolkit's significance lies in its potential to unlock new research avenues in biological image analysis while democratizing access to advanced deep metric learning techniques. BioEncoder focuses on the urgent need for toolkits bridging the gap between complex DL pipelines and practical applications in biological research.

    Dataset - This data repository includes two things: a snapshot of the BioEncoder package (BioEncoder-main.zip, version 1.0.0, downloaded from https://github.com/agporto/BioEncoder on 2024-07-19 at 17:20), and the damselfly dataset used for the case study presented in the paper (bioencoder_data.zip). The dataset archive also encompasses the configuration files and the final model checkpoints from the case study, as well as a script to reproduce the results and figures presented in the paper.

    How to use - Get started by consulting the GitHub repository for information on how to install BioEncoder, then download the data archive and run the script. Some parts of the script can be executed using the model checkpoints; for other parts the training routine needs to be run.

  6. Lots of code

    • kaggle.com
    zip
    Updated Dec 20, 2017
    Cite
    Vladislav Zavadskyy (2017). Lots of code [Dataset]. https://www.kaggle.com/zavadskyy/lots-of-code
    Explore at:
    Available download formats: zip (15060061555 bytes)
    Dataset updated
    Dec 20, 2017
    Authors
    Vladislav Zavadskyy
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Content

    Plain text, pulled from GitHub, sorted and concatenated into one file per language. Among those languages are:

    • abap: 0.138GB
    • actionscript: 0.684GB
    • ada: 0.002GB
    • assembly: 2.134GB
    • c: 4.452GB
    • clojure: 0.136GB
    • cobol: 0.483GB
    • code: 7.725GB
    • cpp: 3.248GB
    • crystal: 0.069GB
    • csharp: 1.205GB
    • css: 0.881GB
    • cuda: 0.275GB
    • d: 0.990GB
    • dart: 0.655GB
    • delphi: 0.514GB
    • erlang: 0.343GB
    • fortran: 1.127GB
    • go: 4.471GB
    • haskell: 0.447GB
    • html: 2.158GB
    • java: 1.049GB
    • js: 4.863GB
    • julia: 0.144GB
    • lua: 0.301GB
    • matlab: 0.257GB
    • perl: 0.585GB
    • php: 1.300GB
    • prolog: 0.146GB
    • python: 0.911GB
    • r: 0.214GB
    • ruby: 0.625GB
    • rust: 0.434GB
    • sas: 0.272GB
    • scala: 0.458GB
    • shell: 0.175GB
    • tex: 0.554GB
    • vbnet: 0.389GB
    • xml: 5.160GB
    • coffeescript: 0.106GB
    • lisp: 0.699GB

    Useful things

    A data loader written in Python and a simple classifying LSTM (TensorFlow) will be made available here once they are uploaded.
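
    In the meantime, a hedged sketch of a character-level batch loader over one of the per-language text files (the file name python.txt is an assumption; for the multi-gigabyte files you would stream rather than read everything into memory):

    # Hedged sketch: yield (input, target) character batches for next-character prediction.
    import numpy as np

    def char_batches(path, seq_len=128, batch_size=64):
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            text = f.read()                        # fine for small files; stream for huge ones
        vocab = sorted(set(text))
        stoi = {c: i for i, c in enumerate(vocab)}
        data = np.array([stoi[c] for c in text], dtype=np.int64)

        per_batch = seq_len * batch_size
        for start in range(0, len(data) - per_batch - 1, per_batch):
            chunk = data[start:start + per_batch + 1]
            yield (chunk[:-1].reshape(batch_size, seq_len),   # inputs
                   chunk[1:].reshape(batch_size, seq_len))    # targets shifted by one char

    # Example usage (hypothetical file name):
    # for x, y in char_batches("python.txt"):
    #     ...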

    Acknowledgements

    I would like to thank all contributors to any repository on GitHub, as it is hard to thank only the contributors of the repositories present in this dataset. But I'll try anyway: if you've contributed to one or more repositories in this list, thank you.

    I'm sorry if I forgot to mention you even though your code is in the dataset; it's hard to keep a list of this size in memory. If you feel your repository should be listed, feel free to write to me.

    Also, I'd like to thank those guys for providing this cheesy hi-res stock image of code.

  7. CSEM-MISD - CSEM's Multi-Illumination Surface Defect Detection Dataset

    • data.niaid.nih.gov
    Updated Dec 8, 2022
    Cite
    Honzátko, David; Türetken, Engin; Bigdeli, Siavash A.; Fua, Pascal; Dunbar, L. Andrea (2022). CSEM-MISD - CSEM's Multi-Illumination Surface Defect Detection Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5513768
    Explore at:
    Dataset updated
    Dec 8, 2022
    Dataset provided by
    Swiss Center for Electronics and Microtechnology (https://www.csem.ch/)
    École polytechnique fédérale de Lausanne (EPFL)
    Authors
    Honzátko, David; Türetken, Engin; Bigdeli, Siavash A.; Fua, Pascal; Dunbar, L. Andrea
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    In automated surface visual inspection, it is often necessary to capture the inspected part under many different illumination conditions in order to reveal all the defects. To address this issue, at CSEM we have acquired a real-world multi-illumination defect segmentation dataset, called CSEM-MISD, and we release it for research purposes to benefit the community.

    The dataset consists of three different types of metallic parts: washers, screws, and gears. Parts were captured in a half-spherical light-dome system that filtered out all ambient light and successively illuminated each part from 108 distinct illumination angles. Each set of 12 illumination angles shares the same elevation level, and the relative azimuthal difference between adjacent illumination angles on the same level is 30 degrees. For more details, please read Sections 3 and 4 of our paper.

    The washers dataset features 70 defective parts. The gears and screws datasets feature 35 defective, 35 intact and several hundred unannotated parts. Some defects, such as notches and holes, are visible in most images (illuminations) with intensity and texture variations among them, while others, such as scratches, are only visible in a few.

    We split the datasets into train and test sets. The train sets contain 32 samples, and the test set 38 samples. Each sample comprises 108 images (each captured under a different illumination angle), an automatically extracted foreground segmentation mask, and a hand-labeled defect segmentation mask.

    This dataset is challenging mainly because:

    each raw sample consists of 108 gray-scale images of resolution 512×512 and therefore takes 27MB of space;

    the metallic surfaces produce many specular reflections that sometimes saturate the camera sensors;

    the annotations are not very precise because the exact extent of defect contours is always subjective;

    the defects are very sparse also in the spatial dimensions: they cover only about 0.2% of the total image area in gears, 0.8% in screws, and 1.4% in washers; this creates an unbalanced dataset with a highly skewed class representation.

    The dataset is organized as follows:

    each sample resides in the Test, Train, or Unannotated directory;

    each sample has its own directory which contains the individual images, the foreground, and defect segmentation masks;

    each image is stored in 8-bit greyscale png format and has a resolution of 512 x 512 pixels;

    Image file names are formatted using three string fields separated with the underscore character: prefix_sampleNr_illuminationNr.png, where the prefix is e.g. washer, the sampleNr might be a three-digit number 001, and the illuminationNr is formed of 3 digits, first corresponding to the elevation index (1 - highest angle, 9 - lowest angle), and the additional two corresponding to the azimuth index (01-12).

    Each dataset contains light_vectors.csv, which lists the illumination angles (in lexicographic order of the illuminationNr), and light_intensities.csv, which lists the corresponding light intensities on a scale from 0 to 127. Please be aware that the azimuth angles were not calibrated and might be misaligned by a few degrees.

    We provide data loaders implemented in Python in the project's repository.
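
    Those loaders are the reference implementation; purely as an orientation to the file layout described above, a minimal hedged sketch (mask file naming is not specified here, so masks in the same directory are skipped by a simple filter):

    # Hedged sketch: stack one sample's 108 single-illumination images into an array.
    # File names follow {prefix}_{sampleNr}_{illuminationNr}.png as described above.
    import glob
    import os
    import numpy as np
    from PIL import Image

    def load_sample(sample_dir):
        images, angles = [], []
        for path in sorted(glob.glob(os.path.join(sample_dir, "*.png"))):
            parts = os.path.splitext(os.path.basename(path))[0].split("_")
            if len(parts) != 3 or not parts[2].isdigit():
                continue                           # skip foreground/defect masks etc.
            elevation = int(parts[2][0])           # 1 = highest angle, 9 = lowest angle
            azimuth = int(parts[2][1:])            # 01-12, 30 degrees apart in azimuth
            images.append(np.asarray(Image.open(path).convert("L")))
            angles.append((elevation, azimuth))
        return np.stack(images), angles            # (108, 512, 512) greyscale stack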

    If you find our dataset useful, please cite our paper:

    Honzátko, D., Türetken, E., Bigdeli, S. A., Dunbar, L. A., & Fua, P. (2021). Defect segmentation for multi-illumination quality control systems. Machine vision and Applications.

  8. OGBN-MAG (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBN-MAG (Processed for PyG) [Dataset]. https://www.kaggle.com/dataup1/ogbn-mag
    Explore at:
    Available download formats: zip (852576506 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    OGBN-MAG

    Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag

    Usage in Python

    Warning: Currently not usable.

    import torch_geometric
    from ogb.nodeproppred import PygNodePropPredDataset
    
    dataset = PygNodePropPredDataset('ogbn-mag', root = '/kaggle/input')
    split_idx = dataset.get_idx_split()
    train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
    graph = dataset[0] # PyG Graph object
    

    Description

    Graph: The ogbn-mag dataset is a heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG) [1]. It contains four types of entities—papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes)—as well as four types of directed relations connecting two types of entities—an author is “affiliated with” an institution, an author “writes” a paper, a paper “cites” a paper, and a paper “has a topic of” a field of study. Similar to ogbn-arxiv, each paper is associated with a 128-dimensional word2vec feature vector, and all the other types of entities are not associated with input node features.
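
    Since the Kaggle copy above is flagged as currently not usable, the following hedged sketch assumes the graph has been loaded through the standard OGB download instead; the attribute names follow the dict-keyed format that OGB's PyG loader uses for heterogeneous datasets:

    # Hedged sketch: inspect the heterogeneous ogbn-mag graph (standard OGB download).
    from ogb.nodeproppred import PygNodePropPredDataset

    dataset = PygNodePropPredDataset('ogbn-mag')
    graph = dataset[0]

    print(graph.num_nodes_dict)          # node counts per entity type ('paper', 'author', ...)
    print(graph.x_dict['paper'].shape)   # only papers carry the 128-dim word2vec features
    print(graph.y_dict['paper'].shape)   # venue labels live on the paper nodes

    # Edges are keyed by (source type, relation, destination type) triples.
    for (src, rel, dst), edge_index in graph.edge_index_dict.items():
        print(src, rel, dst, edge_index.shape)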

    Prediction task: Given the heterogeneous ogbn-mag data, the task is to predict the venue (conference or journal) of each paper, given its content, references, authors, and authors’ affiliations. This is of practical interest as some manuscripts’ venue information is unknown or missing in MAG, due to the noisy nature of Web data. In total, there are 349 different venues in ogbn-mag, making the task a 349-class classification problem.

    Dataset splitting: The authors of this dataset follow the same time-based strategy as ogbn-arxiv and ogbn-papers100M to split the paper nodes in the heterogeneous graph, i.e., training models to predict venue labels of all papers published before 2018, validating and testing the models on papers published in 2018 and since 2019, respectively.

    Summary

    Package    | #Nodes    | #Edges     | Split Type | Task Type                  | Metric
    ogb>=1.2.1 | 1,939,743 | 21,111,007 | Time       | Multi-class classification | Accuracy

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [2] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
    [2] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. For any questions, problems, or issues, please contact the original authors via their website or GitHub repository.

  9. A subsection of England and Wales EPC households, joined with PPD data, used for simulation modelling

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Nov 15, 2022
    Cite
    Jenkinson, Ryan; Chan, Stephanie; Phillips, Tom; Lopez-Garcia, Daniel (2022). A subsection of England and Wales EPC households, joined with PPD data, used for simulation modelling [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7322966
    Explore at:
    Dataset updated
    Nov 15, 2022
    Dataset provided by
    Centre for Net Zero
    Authors
    Jenkinson, Ryan; Chan, Stephanie; Phillips, Tom; Lopez-Garcia, Daniel
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Area covered
    England
    Description

    If you want to give feedback on this dataset, or wish to request it in another form (e.g csv), please fill out this survey here. We are a not-for-profit research organisation keen to see how others use our open models and tools, so all feedback is appreciated! It's a short form that takes 5 minutes to complete.

    Important Note: Before downloading this dataset, please read the License and Software Attribution section at the bottom.

    This dataset aligns with the work published in Centre for Net Zero's report "Hitting the Target". In this work, we simulate a range of interventions to model the situations in which we believe the UK will meet its 600,000 heat pump installation per year target by 2028. For full modelling assumptions and findings, read our report on our website.

    The code for running our simulation is open source here.

    This dataset contains over 9 million households that have been address matched between Energy Performance Certificates (EPC) data and Price Paid Data (PPD). The code for our address matching is here. Since these datasets are Open Government License (OGL), this dataset is too. We basically model specific columns from various datasets, as set out in our methodology section in our report, to simplify and clean up this dataset for academic use. License information is also available in the appendix of our report above.

    The EPC data loaders can be found here (the data is here) and the rest of the schemas and data download locations can be found here.

    Note that this dataset is not regularly maintained or updated. It is correct as of January 2022. The data was curated and tested using dbt via this Github repository and would be simple to rerun on the latest data.

    The schema / data dictionary for this data can be found here.

    Our recommended way of loading this data is in Python. After downloading all "parts" of the dataset to a folder, you can run:

    
    
    import pandas as pd

    # Read every parquet "part" in the folder into a single DataFrame.
    data = pd.read_parquet("path/to/data/folder/")

    Licenses and software attribution:

    For EPC, PPD and UK House Price Index data:

    For the EPC data, we are permitted to republish this providing we mention that all researchers who download this dataset follow these copyright restrictions. We do not explicitly release any Royal Mail address data, instead we use these fields to generate a pseudonymised "address_cluster_id" which reflects a unique combination of the address lines and postcodes, as well as other metadata. When viewing ICO and GDPR guidelines, this still counts as personal data, but we have gone to measures to pseudonymise as much as possible to fulfil our obligations as a data processor. You must read this carefully before downloading the data, and ensure that you are using it for the research purposes as determined by this copyright notice.

    Contains HM Land Registry data © Crown copyright and database right 2021. This data is licensed under the Open Government Licence v3.0.

    Contains OS data © Crown copyright and database right 2022.

    Contains Office for National Statistics data licensed under the Open Government Licence v.3.0.

    The OGL v3.0 license states that we are free to:

    copy, publish, distribute and transmit the Information;

    adapt the Information;

    exploit the Information commercially and non-commercially for example, by combining it with other Information, or by including it in your own product or application.

    However we must (where we do any of the above):

    acknowledge the source of the Information in your product or application by including or linking to any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence;

    You can see more information here.

    For XOServe Off Gas Postcodes:

    This dataset has been released openly for all uses here.

    For the address matching:

    GNU Parallel: O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014

  10. OGBN-Products (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    + more versions
    Cite
    Redao da Taupl (2021). OGBN-Products (Processed for PyG) [Dataset]. https://www.kaggle.com/datasets/dataup1/ogbn-products/code
    Explore at:
    Available download formats: zip (3699538358 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    Description

    OGBN-Products

    Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-products

    Usage in Python

    import os.path as osp
    import pandas as pd
    import datatable as dt
    import torch
    import torch_geometric as pyg
    from ogb.nodeproppred import PygNodePropPredDataset
    # Helper used only for heterogeneous splits (ogbn-products itself is homogeneous).
    from ogb.io.read_graph_raw import read_nodesplitidx_split_hetero
    
    class PygOgbnProducts(PygNodePropPredDataset):
      def __init__(self, meta_csv = None):
        root, name, transform = '/kaggle/input', 'ogbn-products', None
        if meta_csv is None:
          meta_csv = osp.join(root, name, 'ogbn-master.csv')
        master = pd.read_csv(meta_csv, index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name = name, root = root, transform = transform, meta_dict = meta_dict)
      def get_idx_split(self, split_type = None):
        if split_type is None:
          split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        if osp.isfile(osp.join(path, 'split_dict.pt')):
          return torch.load(osp.join(path, 'split_dict.pt'))
        if self.is_hetero:
          train_idx_dict, valid_idx_dict, test_idx_dict = read_nodesplitidx_split_hetero(path)
          for nodetype in train_idx_dict.keys():
            train_idx_dict[nodetype] = torch.from_numpy(train_idx_dict[nodetype]).to(torch.long)
            valid_idx_dict[nodetype] = torch.from_numpy(valid_idx_dict[nodetype]).to(torch.long)
            test_idx_dict[nodetype] = torch.from_numpy(test_idx_dict[nodetype]).to(torch.long)
          return {'train': train_idx_dict, 'valid': valid_idx_dict, 'test': test_idx_dict}
        else:
          train_idx = dt.fread(osp.join(path, 'train.csv'), header = False).to_numpy().T[0]
          train_idx = torch.from_numpy(train_idx).to(torch.long)
          valid_idx = dt.fread(osp.join(path, 'valid.csv'), header = False).to_numpy().T[0]
          valid_idx = torch.from_numpy(valid_idx).to(torch.long)
          test_idx = dt.fread(osp.join(path, 'test.csv'), header = False).to_numpy().T[0]
          test_idx = torch.from_numpy(test_idx).to(torch.long)
          return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}
    
    dataset = PygOgbnProducts()
    split_idx = dataset.get_idx_split()
    train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
    graph = dataset[0] # PyG Graph object
    

    Description

    Graph: The ogbn-products dataset is an undirected and unweighted graph, representing an Amazon product co-purchasing network [1]. Nodes represent products sold on Amazon, and edges between two products indicate that the products are purchased together. The authors follow [2] to process node features and target categories. Specifically, node features are generated by extracting bag-of-words features from the product descriptions, followed by a Principal Component Analysis to reduce the dimension to 100.

    Prediction task: The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.

    Dataset splitting: The authors consider a more challenging and realistic dataset splitting that differs from the one used in [2]. Instead of randomly assigning 90% of the nodes for training and 10% of the nodes for testing (without a validation set), they use the sales ranking (popularity) to split the nodes into training/validation/test sets. Specifically, the authors sort the products according to their sales ranking and use the top 8% for training, the next top 2% for validation, and the rest for testing. This is a more challenging splitting procedure that closely matches the real-world application where labels are first assigned to important nodes in the network and ML models are subsequently used to make predictions on less important ones.

    Note 1: A very small number of self-connecting edges are repeated (see here); you may remove them if necessary.

    Note 2: For undirected graphs, the loaded graphs will have the doubled number of edges because the bidirectional edges will be added automatically.
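
    As a hedged convenience sketch (assuming a recent PyG), the repeated self-connecting edges mentioned in Note 1 can be cleaned up with standard utilities after loading the graph as in the usage snippet above:

    # Hedged sketch: remove self-loops / merge duplicate edges with PyG utilities.
    from torch_geometric.utils import coalesce, remove_self_loops

    graph = dataset[0]                                    # loaded as shown above
    edge_index, _ = remove_self_loops(graph.edge_index)   # drop self-loops entirely, or
    graph.edge_index = coalesce(edge_index)               # merge any remaining duplicates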

    Summary

    Package    | #Nodes    | #Edges     | Split Type | Task Type                  | Metric
    ogb>=1.1.1 | 2,449,029 | 61,859,140 | Sales rank | Multi-class classification | Accuracy

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] http://manikvarma.org/downloads/XC/XMLRepository.html [2] Wei-Lin Chiang, ...

  11. HER2 Breast Cancer Digital Image Dataset (ADEL Dataset).

    • zenodo.org
    zip
    Updated Jul 13, 2025
    + more versions
    Cite
    Gauhar Dunenova; Natalya Glushkova; Aidos Sarsembayev; Alexandr Ivankov; Elvira Satbayeva; Zhanna Kalmatayeva; Dilyara Kaidarova (2025). HER2 Breast Cancer Digital Image Dataset (ADEL Dataset). [Dataset]. http://doi.org/10.5281/zenodo.15872690
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gauhar Dunenova; Natalya Glushkova; Aidos Sarsembayev; Alexandr Ivankov; Elvira Satbayeva; Zhanna Kalmatayeva; Dilyara Kaidarova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 13, 2025
    Description

    HER2 Breast Cancer Digital Image Dataset (ADEL Dataset).

    We have developed the first Kazakhstani dataset of digital images for HER2 breast cancer analysis. The dataset consists of images sourced from the pathological archives of the Department of Pathology at the Almaty Oncology Center and the Kazakh Institute of Oncology. Each image is labeled with HER2 expression levels manually assessed by experienced pathologists, with in situ hybridization (ISH) performed in equivocal cases to establish ground truth.

    The dataset contains 418 images in PNG format. The annotations can be found in the file her2_dataset/labels.csv.

    HER2 IHC High-Resolution Dataset (Version 0.2)

    This is **Version 0.2** of the HER2 Immunohistochemistry (IHC) dataset. It includes **high-resolution `.tar` archives** containing processed image tiles extracted from whole-slide images (WSIs). This dataset is hosted on the [Hugging Face Hub].

    Digital images were acquired via a fully automated digital system (KFB PRO 120 scanner) at INVIVO LLP with 40x magnification and one focusing layer, ranging in size from 50 MB to 2 GB, depending on the size of the tissue sample fixed on the original slide. The dataset consists of 418 images, which were preprocessed using a conversion script that transformed SVS files into sub-images with a 1:1 aspect ratio in JPEG format. A non-overlapping sliding window approach was applied to generate these sub-images, optimized for machine learning applications.
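
    The conversion script itself is referenced under "Processing Scripts" below; purely as a hedged illustration of the non-overlapping tiling step, here is a sketch using OpenSlide (tile size and output naming are assumptions, not the authors' settings):

    # Hedged sketch: non-overlapping tiling of a whole-slide image into JPEG sub-images.
    import openslide                      # pip install openslide-python
    from pathlib import Path

    def tile_slide(svs_path, out_dir, tile=1024):
        slide = openslide.OpenSlide(svs_path)
        width, height = slide.dimensions              # full-resolution (level 0) size
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for y in range(0, height - tile + 1, tile):   # non-overlapping sliding window
            for x in range(0, width - tile + 1, tile):
                region = slide.read_region((x, y), 0, (tile, tile)).convert("RGB")
                region.save(out / f"{Path(svs_path).stem}_{x}_{y}.jpg", quality=90)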

    The compressed .png version of the dataset may serve as a visual reference to the characteristics of the original images.

    ---

    📥 Download Instructions

    🔸 Option 1: Using Python (via `datasets` library)

    from datasets import load_dataset
    
    dataset = load_dataset("aidosSarsembayev/adel_dataset_1")

    This will provide access to metadata or a data loader if defined. For raw files (e.g., `.tar`), use git-lfs:

    🔸 Option 2: Using `git` + `git-lfs` (recommended for large files)

    git lfs install
    git clone https://huggingface.co/datasets/aidosSarsembayev/adel_dataset_1
    

    This will download all parts of the dataset, including large `.tar` files.

    ---

    📜 Dataset Contents

    The dataset consists of multiple `.tar.gz` archive files:

    - `HER2_001_009.tar.gz`
    - `HER2_010_019.tar.gz`
    - ...
    - `HER2_420_429.tar.gz`

    Each archive contains high-resolution tiles from several HER2 slides.

    A JSON manifest (`manifest.json`) is provided, mapping each archive to the slide IDs it contains.
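
    A hedged sketch of using that manifest to pull the archive for one slide (the manifest is assumed to map archive names to lists of slide IDs; check the file itself for the exact schema):

    # Hedged sketch: find and extract the .tar.gz archive containing a given slide ID.
    import json
    import tarfile

    def extract_slide(slide_id, manifest_path="manifest.json", out_dir="tiles"):
        with open(manifest_path) as f:
            manifest = json.load(f)                  # assumed: {archive name: [slide IDs]}
        for archive, slide_ids in manifest.items():
            if slide_id in slide_ids:
                with tarfile.open(archive, "r:gz") as tar:
                    tar.extractall(out_dir)
                return archive
        raise KeyError(f"{slide_id} not found in {manifest_path}")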

    ---

    🧪 Usage

    This dataset is intended for research on:

    - HER2 status classification
    - Digital pathology and WSI analysis
    - IHC image processing

    ---

    🔧 Processing Scripts

    To reproduce or analyze the dataset, use the scripts provided in the following repository:

    🔗 [GitHub – HER2 Data Processing]()

    ---

    📝 Citation and License

    Please refer to the associated Zenodo record or publication for citation and licensing terms. Creative Commons licenses may apply (e.g., CC-BY 4.0).

    ---

    Maintained by: [@asarsembayev](https://huggingface.co/aidosSarsembayev)

  12. OGBN-ArXiv (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBN-ArXiv (Processed for PyG) [Dataset]. https://www.kaggle.com/dataup1/ogbn-arxiv
    Explore at:
    Available download formats: zip (169289809 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    OGBN-ArXiv

    Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv

    Usage in Python

    import os.path as osp
    import pandas as pd
    import datatable as dt
    import torch
    import torch_geometric.transforms as T
    from ogb.nodeproppred import PygNodePropPredDataset
    
    class PygOgbnArxiv(PygNodePropPredDataset):
      def __init__(self):
        root, name, transform = '/kaggle/input', 'ogbn-arxiv', T.ToSparseTensor()
        master = pd.read_csv(osp.join(root, name, 'ogbn-master.csv'), index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name = name, root = root, transform = transform, meta_dict = meta_dict)
      def get_idx_split(self):
        split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        train_idx = dt.fread(osp.join(path, 'train.csv'), header = False).to_numpy().T[0]
        train_idx = torch.from_numpy(train_idx).to(torch.long)
        valid_idx = dt.fread(osp.join(path, 'valid.csv'), header = False).to_numpy().T[0]
        valid_idx = torch.from_numpy(valid_idx).to(torch.long)
        test_idx = dt.fread(osp.join(path, 'test.csv'), header = False).to_numpy().T[0]
        test_idx = torch.from_numpy(test_idx).to(torch.long)
        return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}
    
    dataset = PygOgbnArxiv()
    split_idx = dataset.get_idx_split()
    train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
    graph = dataset[0] # PyG Graph object
    

    Description

    Graph: The ogbn-arxiv dataset is a directed graph, representing the citation network between all Computer Science (CS) arXiv papers indexed by MAG [1]. Each node is an arXiv paper and each directed edge indicates that one paper cites another one. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are computed by running the skip-gram model [2] over the MAG corpus. The authors also provide the mapping from MAG paper IDs into the raw texts of titles and abstracts here. In addition, all papers are also associated with the year that the corresponding paper was published.

    Prediction task: The task is to predict the 40 subject areas of arXiv CS papers, e.g., cs.AI, cs.LG, and cs.OS, which are manually determined (i.e., labeled) by the paper’s authors and arXiv moderators. With the volume of scientific publications doubling every 12 years over the past century, it is practically important to automatically classify each publication’s areas and topics. Formally, the task is to predict the primary categories of the arXiv papers, which is formulated as a 40-class classification problem.

    Dataset splitting: The authors consider a realistic data split based on the publication dates of the papers. The general setting is that the ML models are trained on existing papers and then used to predict the subject areas of newly-published papers, which supports the direct application of them into real-world scenarios, such as helping the arXiv moderators. Specifically, the authors propose to train on papers published until 2017, validate on those published in 2018, and test on those published since 2019.

    Summary

    Package    | #Nodes  | #Edges    | Split Type | Task Type                  | Metric
    ogb>=1.1.1 | 169,343 | 1,166,243 | Time       | Multi-class classification | Accuracy

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
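
    A hedged sketch of that unified evaluation step for ogbn-arxiv (it assumes y_pred, a tensor of predicted class indices with shape (num_nodes, 1), has already been computed by some model; graph and test_idx come from the usage snippet above):

    # Hedged sketch: evaluate test accuracy with the OGB Evaluator.
    from ogb.nodeproppred import Evaluator

    evaluator = Evaluator(name = 'ogbn-arxiv')
    result = evaluator.eval({
        'y_true': graph.y[test_idx],      # ground-truth labels, shape (num_test_nodes, 1)
        'y_pred': y_pred[test_idx],       # hypothetical model predictions, same shape
    })
    print(result['acc'])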

    References

    [1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
    [2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.
    [3] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. For any questions, problems, or issues, please contact the original authors via their website or GitHub repository.
